| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
| |
In the change set 6c045a999800c62368470938307951bb669f5afc the error
text for the old flag `--private-users-chown` was repurposed for the
new flag `--private-users-ownership=chown`, and while doing so the word
`may` was dropped, leading to a grammatically incorrect error text.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds two new values to --private-users-ownership=: "map" and
"auto".
"map" exposes the kernel 5.12 idmap feature pretty much 1:1. It fails if
the kernel or used file system doesn't support ID mapping.
"auto" is a bit smarter: if we can make ID mapping work, we'll use it,
otherwise fall back to classic chown()ing. We'll also use chown()ing
if we detect that an image is already ID shifted, both to increase
compatibility with the status quo ante, and to simplify our codepaths,
since the mappings become a lot simpler if we only have to map from zero
to something else, instead of from anything to anything else.
The short -U switch, and --private-users=pick will now imply
--private-users-ownership=auto instead of
--private-users-ownership=chown, since the new logic should be the much
better choice.
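The selection logic described above can be sketched roughly as follows. This is a minimal illustration, not the nspawn code: the enum, the function name and the two boolean inputs are hypothetical, and only the mode names ("map", "auto", "chown") come from the commit message.

```c
#include <stdbool.h>

/* Hypothetical names; only the mode strings come from the commit message. */
typedef enum {
    OWNERSHIP_OFF,
    OWNERSHIP_CHOWN,
    OWNERSHIP_MAP,
    OWNERSHIP_AUTO,
} OwnershipMode;

/* Resolve the requested mode to the one actually used.
 * Returns 0 on success, -1 if "map" was requested explicitly but
 * ID mapping is not available. */
static int resolve_ownership(OwnershipMode requested,
                             bool idmap_supported,
                             bool image_already_shifted,
                             OwnershipMode *ret) {
    switch (requested) {
    case OWNERSHIP_AUTO:
        /* Prefer ID mapping, but keep chown()ing for images that
         * are already ID shifted or when mapping does not work. */
        *ret = (idmap_supported && !image_already_shifted)
                ? OWNERSHIP_MAP : OWNERSHIP_CHOWN;
        return 0;
    case OWNERSHIP_MAP:
        if (!idmap_supported)
            return -1; /* "map" fails hard instead of falling back */
        *ret = OWNERSHIP_MAP;
        return 0;
    default:
        *ret = requested; /* "off" and "chown" are used as given */
        return 0;
    }
}
```

The point of the sketch is the asymmetry: "map" reports an error where "auto" quietly picks chown()ing.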
|
| |
|
| |
|
|
|
|
|
| |
Let's add a helper that ensures the UID shift/range parameters actually
fit together.
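A check of this kind might look like the sketch below. The function name and the exact rules are assumptions made for illustration (a non-empty range, a valid shift, and no mapped UID reaching the invalid UID); the commit message does not spell them out.

```c
#include <stdbool.h>
#include <stdint.h>

/* Assumption for this sketch: (uint32_t) -1 is the invalid UID. */
#define UID_INVALID_SKETCH UINT32_MAX

/* Hypothetical helper: true if [shift, shift + range) is a usable UID range. */
static bool shift_range_valid_sketch(uint32_t shift, uint32_t range) {
    if (shift == UID_INVALID_SKETCH)
        return false;
    if (range == 0)
        return false; /* an empty range maps nothing */
    /* The highest mapped UID is shift + range - 1. It must stay below
     * the invalid UID, and the sum must not wrap around. */
    if (range > UID_INVALID_SKETCH - shift)
        return false;
    return true;
}
```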
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This replaces --private-users-chown by an enum value
--private-users-ownership=off|chown. Changes otherwise very little.
This is mostly preparation for a follow-up commit adding a new "map"
mode, using kernel 5.12 UID mapping mounts.
Note that this does alter codeflow a bit: the new enum already knows
three different values instead of the old true/false pair. Besides "off"
and "chown" it knows -EINVAL, i.e. whenever the value wasn't set
explicitly. This value is changed to "off" or "chown" before use, thus
retaining compat with the status quo before, except it won't override
explicit configuration anymore. Thus, if you explicitly request
--private-users=pick you can now combine it with an explicit
--private-users-ownership=off if you like, which will give you a
container that runs under its own UID set, but the files will be owned
by the original image. Doesn't make much sense besides maybe debugging, but
if requested explicitly I think it's OK to implement.
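The "unset" state and its late resolution could be sketched like this. All identifiers are hypothetical; the sketch only assumes what the text says, namely that -EINVAL marks "not set explicitly" and that an explicit value is never overridden.

```c
#include <errno.h>
#include <stdbool.h>

/* Hypothetical enum: two real values plus -EINVAL for "not configured". */
typedef enum {
    USER_OWNERSHIP_OFF,
    USER_OWNERSHIP_CHOWN,
    USER_OWNERSHIP_UNSET = -EINVAL,
} UserOwnershipSketch;

/* Turn the configured value into the one actually used. pick_mode stands
 * for --private-users=pick, which used to imply chown()ing. */
static UserOwnershipSketch finalize_ownership(UserOwnershipSketch configured,
                                              bool pick_mode) {
    if (configured != USER_OWNERSHIP_UNSET)
        return configured; /* explicit configuration always wins */
    return pick_mode ? USER_OWNERSHIP_CHOWN : USER_OWNERSHIP_OFF;
}
```

With this shape, the combination of pick mode and an explicit "off" simply passes through unchanged.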
|
|
|
|
|
|
|
|
|
|
| |
userns identity 1:1 mapping is a pretty useful concept since it isolates
capability sets between containers and hosts, even if it doesn't map
any uid ranges. Let's support it with an explicit concept.
(Note that this is identical to --private-users=0:65536 (which in turn
is identical to --private-users=0), but I think it makes sense to emphasize
this concept as a high-level one that makes sense to support.)
|
|\
| |
| | |
optionally, grow file systems to partition size when mounting them via GPT auto-discovery
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
tools that deal with OS images
Let's enable this in all tools that intend to write to the OS images.
It's not conditionalized for now, as there already is conditionalization
in the existence or absence of the flag in the GPT partition table (and
it's opt-in), hence it should be OK to just enable this by default for
now if the flag is set.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
string via a format string
It's not going to be efficient if called in inner loops, but it's oh so
handy, and we have some code that does this:
asprintf(&p, "%s…", b, …);
free(b);
b = TAKE_PTR(p);
which can now be replaced by the quicker and easier to read:
strextendf(&b, "…", …);
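For readers who want to see how such a helper can work, here is a minimal sketch built on vsnprintf() and realloc(). It is not the systemd implementation; the name strextendf_sketch and the plain 0/-1 return convention are assumptions for this example.

```c
#include <stdarg.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Append formatted text to the heap string *x (which may be NULL).
 * Returns 0 on success, -1 on failure; *x stays valid on failure. */
static int strextendf_sketch(char **x, const char *format, ...) {
    size_t old_len = *x ? strlen(*x) : 0;
    va_list ap;
    char *p;
    int n;

    /* First pass: measure how many bytes the formatted text needs. */
    va_start(ap, format);
    n = vsnprintf(NULL, 0, format, ap);
    va_end(ap);
    if (n < 0)
        return -1;

    p = realloc(*x, old_len + (size_t) n + 1);
    if (!p)
        return -1;

    /* Second pass: format directly behind the existing contents. */
    va_start(ap, format);
    vsnprintf(p + old_len, (size_t) n + 1, format, ap);
    va_end(ap);

    *x = p;
    return 0;
}
```

Formatting twice is what makes this unattractive for inner loops, which matches the caveat in the message.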
|
|/
|
|
|
| |
The actual section names are quite different from what the comment so
far suggested. Fix that.
|
|
|
|
|
|
|
|
|
| |
This tries to shorten the race of device reuse a bit more: let's ignore
udev database entries that are older than the time where we started to
use a loopback device.
This doesn't fix the whole loopback device raciness mess, but it makes
the race window a bit shorter.
|
|
|
|
|
|
|
|
|
|
|
| |
Let's drop all monitor uevents that were enqueued before we actually
started setting up the device.
This doesn't fix the race, but it makes the race window smaller: since
we cannot determine the uevent seqnum and the loopback attachment
atomically, there's a tiny window where uevents might be generated by
the device which we mistake for being associated with our use of the
loopback device.
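The filtering idea can be illustrated with a small sketch. It assumes that a sequence number was sampled just before the device setup started and that every queued event carries its own sequence number; the function and parameter names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Compact seqnums[] in place, keeping only events that are newer than the
 * sequence number sampled before setup. Returns the number of kept events. */
static size_t drop_stale_uevents(uint64_t *seqnums, size_t n,
                                 uint64_t seqnum_before_setup) {
    size_t kept = 0;

    for (size_t i = 0; i < n; i++)
        if (seqnums[i] > seqnum_before_setup)
            seqnums[kept++] = seqnums[i];

    return kept;
}
```

Events generated between sampling the sequence number and attaching the device still slip through, which is the remaining window the message mentions.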
|
|\
| |
| | |
let's read LoadCredentials=/SetCredentials= style cred in sysusers/firstboot and when asking for passwords
|
| | |
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Previously, the flag did two things at once: enable support for using
generic partitions as root fs if there were only one/allow use of
partition-table-less images as root fs. And secondly, insist that there
was a rootfs, and fail if not. Let's split these two in two separate
options so that they can be used independently of each other.
There are cases where one wants to use one without the other (i.e. when
inspecting things with systemd-dissect tool it should be OK to do so
even if image has no root fs), and it's cleaner anyway.
|
|/
|
|
|
|
|
|
| |
Let's make use of the new dissection in all tools where this makes
sense, which are all tools that dissect images, except for those which
inherently operate on state/configuration and thus where an image
without state nor configuration is useless (e.g.
systemd-tmpfiles/systemd-firstboot/… --image= switch).
|
| |
|
| |
|
|
|
|
|
|
| |
Prompted by https://bugzilla.redhat.com/show_bug.cgi?id=1930875 in which
I had previously used json_dispatch_unsigned and passed a return variable of
type unsigned when json_dispatch_unsigned writes a uintmax_t.
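The class of bug described here can be avoided by receiving the value in a variable of the wide type and narrowing it explicitly. The sketch below uses a stand-in dispatcher rather than the real json_dispatch_unsigned(), whose signature is not reproduced here; all names are hypothetical.

```c
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Stand-in for a dispatcher that always stores a uintmax_t via userdata. */
static int dispatch_uintmax_sketch(uintmax_t value, void *userdata) {
    uintmax_t *ret = userdata;

    *ret = value; /* writes sizeof(uintmax_t) bytes, whatever the caller passed */
    return 0;
}

/* Safe wrapper: receive into a uintmax_t, then range-check and narrow. */
static int read_unsigned_sketch(uintmax_t value, unsigned *ret) {
    uintmax_t tmp;
    int r;

    r = dispatch_uintmax_sketch(value, &tmp);
    if (r < 0)
        return r;
    if (tmp > UINT_MAX)
        return -ERANGE;

    *ret = (unsigned) tmp;
    return 0;
}
```

Passing the address of an unsigned directly would let the dispatcher write past the variable on platforms where uintmax_t is wider.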
|
|
|
|
|
|
|
|
|
|
| |
Clean up ignore_signals() + default_signals() + sigaction_many() a bit:
make it unnecessary to explicitly terminate the signal list with -1.
Merge all three calls into a single function that is just called with
slightly different parameters. And eliminate an unnecessary extra
iteration in its inner for() loop.
No change in behaviour.
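One common way to make the explicit -1 terminator unnecessary is to have a macro append it. The following is a sketch of that pattern under assumed names, not a copy of the actual helpers.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <stdarg.h>
#include <stddef.h>

/* Apply one struct sigaction to every signal in a list terminated by a
 * negative value. Returns 0 on success, -1 if any sigaction() call failed. */
static int sigaction_many_sketch(const struct sigaction *sa, ...) {
    va_list ap;
    int sig, r = 0;

    va_start(ap, sa);
    while ((sig = va_arg(ap, int)) >= 0) {
        if (sig == 0)
            continue; /* signal 0 is not a real signal, skip it */
        if (sigaction(sig, sa, NULL) < 0)
            r = -1;
    }
    va_end(ap);

    return r;
}

/* The macro appends the terminator, so callers only list the signals. */
#define ignore_signals_sketch(...)                                  \
    sigaction_many_sketch(&(const struct sigaction) {               \
                              .sa_handler = SIG_IGN,                \
                              .sa_flags = SA_RESTART,               \
                          }, __VA_ARGS__, -1)
```

A caller would write ignore_signals_sketch(SIGUSR1, SIGUSR2); a variant for restoring defaults would differ only in passing SIG_DFL.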
|
| |
|
|\
| |
| | |
Support ipv6 for masquerade and dnat in nspawn and networkd
|
| |
| |
| |
| | |
Extend nspawn so it can keep track of one ipv4 and one ipv6 address.
|
|\ \
| | |
| | | |
Envvar assignment cleanup
|
| | | |
|
|/ /
| |
| |
| |
| | |
This fits better in shared/, and the new parse-argument.c file is a good home
for it.
|
| | |
|
| |
| |
| |
| | |
Now that we know we have something useful, no need to make an answer up.
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
As suggested in https://github.com/systemd/systemd/pull/11484#issuecomment-775288617.
This does not touch anything exposed in src/systemd. Changing the defines there
would be a compatibility break.
Note that tests are broken after this commit. They will be fixed in the next one.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The old name originates when this was used to discover "machine" images,
as managed by machined/machinectl. But nowadays this is also used by
portable services and system extensions, hence let's use a more generic
name for this API. Taking inspiration from "dissect-image.[ch]", let's call
this "discover-image.[ch]".
This is pure renaming, no other changes.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
I think this formatting was originally used because it simplified
adding new options to the help messages. However, these days, most
tools' help messages end with "\nSee the %s for details.\n" so
the final line almost never has to be edited which eliminates the
benefit of the custom formatting used for printf() help messages.
Let's make things more consistent and use the same formatting for
printf() help messages that we use everywhere else.
Prompted by https://github.com/systemd/systemd/pull/18355#discussion_r567241580
|
| |
| |
| |
| |
| |
| | |
Even though many of those scripts are very simple, it is easier to include
the header than to try to say whether each of those files is trivial enough
not to require one.
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Apparently SELinux inserts control data into AF_UNIX datagrams where we
don't expect it, thus miscalculating the control data. This looks like
something to fix in SELinux, but we still should handle this gracefully
and just drop the offending datagram and continue.
recvmsg_safe() actually already drops the datagram, it's just a matter
of actually ignoring EXFULL (which it generates if control data is too
large) in the right places.
This does this wherever an AF_UNIX/SOCK_DGRAM socket is used with
recvmsg_safe() that is not just internal communication.
Fixes: #17795
Follow-up for: 3691bcf3c5eebdcca5b4f1c51c745441c57a6cd1
|
|
|
|
|
| |
systemd-sysext supports --root= for everything but the image discovery.
Fix that.
|
| |
|
| |
|
|
|
|
| |
Then, we can shorten many test definitions.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
This is in line with the OCI runtime spec:
On POSIX platforms, path is either an absolute path or a relative path
to the bundle. For example, with a bundle at /to/bundle and a root
filesystem at /to/bundle/rootfs, the path value can be either
/to/bundle/rootfs or rootfs. The value SHOULD be the conventional
rootfs.
(https://github.com/opencontainers/runtime-spec/blob/master/config.md)
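The rule quoted from the spec amounts to a simple path resolution step, sketched below. The function name is hypothetical and the sketch does no normalization; it only distinguishes absolute paths from paths relative to the bundle.

```c
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Resolve the root path of a bundle: absolute paths are used as they are,
 * relative ones are interpreted relative to the bundle directory.
 * Returns a newly allocated string, or NULL if allocation fails. */
static char *resolve_root_path_sketch(const char *bundle, const char *path) {
    size_t n;
    char *p;

    if (path[0] == '/') {
        n = strlen(path) + 1;
        p = malloc(n);
        if (p)
            memcpy(p, path, n);
        return p;
    }

    n = strlen(bundle) + 1 + strlen(path) + 1;
    p = malloc(n);
    if (p)
        snprintf(p, n, "%s/%s", bundle, path);
    return p;
}
```

With the bundle at /to/bundle, both "rootfs" and "/to/bundle/rootfs" resolve to the same directory, as in the spec's example.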
|
| |
|
|\
| |
| | |
add networkd/nspawn nftables backend
|
| | |
Idea is to use a static ruleset, added when the first attempt to
add a masquerade or dnat rule is made.
The alternative would be to add the ruleset when the init function is called.
The disadvantage is that this enables connection tracking and NAT in the kernel
(as the ruleset needs this to work), which comes with some overhead that might
not be needed (no nspawn usage and no IPMasquerade option set).
There is no additional dependency on the 'nft' userspace binary or other libraries.
sd-netlink's nfnetlink backend is used to modify the nftables ruleset.
The commit message/comments still use nft syntax since that is what
users will see when they use the nft tool to list the ruleset.
The added initial skeleton (added on first fw_add_masquerade/local_dnat
call) looks like this:
    table ip io.systemd.nat {
        set masq_saddr {
            type ipv4_addr
            flags interval
            elements = { 192.168.59.160/28 }
        }
        map map_port_ipport {
            type inet_proto . inet_service : ipv4_addr . inet_service
            elements = { tcp . 2222 : 192.168.59.169 . 22 }
        }
        chain prerouting {
            type nat hook prerouting priority dstnat + 1; policy accept;
            fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
        }
        chain output {
            type nat hook output priority -99; policy accept;
            ip daddr != 127.0.0.0/8 oif "lo" dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
        }
        chain postrouting {
            type nat hook postrouting priority srcnat + 1; policy accept;
            ip saddr @masq_saddr masquerade
        }
    }
Next calls to fw_add_masquerade/add_local_dnat will then only add/delete the
element/mapping to masq_saddr and map_port_ipport, i.e. the ruleset doesn't
change -- only the set/map content does.
Running test-firewall-util with this backend gives the following output
on a parallel 'nft monitor':
$ nft monitor
add table ip io.systemd.nat
add chain ip io.systemd.nat prerouting { type nat hook prerouting priority dstnat + 1; policy accept; }
add chain ip io.systemd.nat output { type nat hook output priority -99; policy accept; }
add chain ip io.systemd.nat postrouting { type nat hook postrouting priority srcnat + 1; policy accept; }
add set ip io.systemd.nat masq_saddr { type ipv4_addr; flags interval; }
add map ip io.systemd.nat map_port_ipport { type inet_proto . inet_service : ipv4_addr . inet_service; }
add rule ip io.systemd.nat prerouting fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
add rule ip io.systemd.nat output ip daddr != 127.0.0.0/8 fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
add rule ip io.systemd.nat postrouting ip saddr @masq_saddr masquerade
add element ip io.systemd.nat masq_saddr { 10.1.2.3 }
add element ip io.systemd.nat masq_saddr { 10.0.2.0/28 }
delete element ip io.systemd.nat masq_saddr { 10.0.2.0/28 }
delete element ip io.systemd.nat masq_saddr { 10.1.2.3 }
add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
CTRL-C
Things not implemented/supported:
1. Change monitoring. The kernel allows userspace to learn about changes
made by other clients (using nfnetlink notifications). It would be
possible to detect when e.g. someone removes the systemd nat table.
This would need more work. It's also not clear how to react to
external changes -- it doesn't seem like a good idea to just auto-undo
everything.
2. 'set masq_saddr' doesn't handle overlaps.
Example:
fw_add_masquerade(true, AF_INET, "10.0.0.0" , 16);
fw_add_masquerade(true, AF_INET, "10.0.0.0" , 8); /* fails */
With the iptables backend the second call works, as it adds an
independent iptables rule.
With the nftables backend, the range 10.0.0.0-10.255.255.255 clashes with
the existing range of 10.0.0.0-10.0.255.255 so 2nd add gets rejected by the
kernel.
This will generate an error message from networkd ("Could not enable IP
masquerading: File exists").
To resolve this it would be necessary to keep track of the added elements
and perform range merging when overlaps are detected.
However, the add requests are done using the configured network on a
device, so no overlaps should occur in normal setups.
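Detecting such an overlap before handing the element to the kernel is straightforward for IPv4 prefixes. The sketch below is illustrative only and not part of the backend described here; it assumes addresses in host byte order, prefix lengths of at most 32, and uses hypothetical function names.

```c
#include <stdbool.h>
#include <stdint.h>

/* Netmask for a prefix length in the range 0..32. */
static uint32_t prefix_mask_sketch(unsigned prefixlen) {
    return prefixlen == 0 ? 0 : UINT32_MAX << (32 - prefixlen);
}

/* Two prefixes overlap exactly when they agree on the bits covered by the
 * shorter of the two prefix lengths. */
static bool ipv4_prefixes_overlap_sketch(uint32_t a, unsigned a_len,
                                         uint32_t b, unsigned b_len) {
    unsigned shorter = a_len < b_len ? a_len : b_len;
    uint32_t mask = prefix_mask_sketch(shorter);

    return (a & mask) == (b & mask);
}
```

For the example above, 10.0.0.0/16 and 10.0.0.0/8 agree on their first 8 bits, so the check reports an overlap.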
IPv6 support is added in an extra changeset.
Fixes: #13307
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
For the planned nft backend we have three choices:
- open/close a new nfnetlink socket for every operation
- keep a nfnetlink socket open internally
- expose an opaque fw_ctx and stash all internal data here.
Originally I opted for the 2nd option, but during review it was
suggested to avoid static storage duration because of perceived
problems with threaded applications.
This adds fw_ctx and new/free functions, then converts the existing api
and nspawn and networkd to use it.
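The opaque-context pattern chosen here generally looks like the sketch below. The type and function names carry a _sketch suffix because they are illustrative and not the actual firewall-util API; the single field is an assumption as well.

```c
#include <errno.h>
#include <stdlib.h>

/* Callers only ever see this forward declaration (in a header). */
typedef struct fw_ctx_sketch fw_ctx_sketch;

/* The definition stays private to the implementation file. */
struct fw_ctx_sketch {
    int nfnl_fd; /* hypothetical: the socket kept open between calls */
};

static int fw_ctx_sketch_new(fw_ctx_sketch **ret) {
    fw_ctx_sketch *ctx;

    ctx = calloc(1, sizeof *ctx);
    if (!ctx)
        return -ENOMEM;

    ctx->nfnl_fd = -1; /* nothing opened yet */
    *ret = ctx;
    return 0;
}

/* Returns NULL so that callers can write ctx = fw_ctx_sketch_free(ctx). */
static fw_ctx_sketch *fw_ctx_sketch_free(fw_ctx_sketch *ctx) {
    if (!ctx)
        return NULL;

    /* A real implementation would close the socket here. */
    free(ctx);
    return NULL;
}
```

Because all state hangs off the context, nothing needs static storage duration, which addresses the concern raised in review.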
|
| |
| |
| |
| |
| |
| | |
Next patch will need to pass two pointers to the callback instead
of just the addr mask. Caller will pass a compound structure, so
make this 'void *userdata' to de-clutter the next patch.
|
|/
|
|
|
|
|
|
|
|
| |
No functional change, just moving a bunch of things around. Before
we needed a rather complicated setup to test hostname_setup(), because
the code was in src/core/. When things are moved to src/shared/
we can just test it as any function.
The test is still "unsafe" because hostname_setup() may modify the
hostname.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Let's clean up hostname_is_valid() a bit: let's turn the second boolean
argument into a more explanatory flags field, and add a flag that
accepts the special name ".host" as valid. This is useful for the
container logic, where the special hostname ".host" refers to the "root
container", i.e. the host system itself, and can be specified at various
places.
Let's also get rid of machine_name_is_valid(). It was just an alias,
which is confusing and even more so now that we have the flags param.
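The shape of a flags-based validity check can be sketched as follows. The validation rules are deliberately simplified and the identifiers are hypothetical; the sketch only mirrors the two behaviours the text names, an optional trailing dot and the special name ".host".

```c
#include <ctype.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical flag names for this sketch. */
enum {
    SKETCH_HOSTNAME_TRAILING_DOT = 1 << 0, /* accept "foo.bar." */
    SKETCH_HOSTNAME_DOT_HOST     = 1 << 1, /* accept the special ".host" */
};

static bool hostname_is_valid_sketch(const char *s, unsigned flags) {
    bool dot = true; /* starts true so that a leading dot is rejected */
    size_t n = 0;

    if (!s || !*s)
        return false;
    if ((flags & SKETCH_HOSTNAME_DOT_HOST) && strcmp(s, ".host") == 0)
        return true;

    for (const char *p = s; *p; p++, n++) {
        if (*p == '.') {
            if (dot)
                return false; /* leading or doubled dot */
            dot = true;
        } else if (isalnum((unsigned char) *p) || *p == '-' || *p == '_')
            dot = false;
        else
            return false;
    }

    if (dot && !(flags & SKETCH_HOSTNAME_TRAILING_DOT))
        return false;

    return n <= 64; /* simplified length limit for this sketch */
}
```

Compared with a second boolean argument, the call site now states which relaxation it wants.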
|
|
|
|
|
|
|
|
|
| |
bpffs fully respects mount namespaces since kernel version 4.7
References:
- https://github.com/torvalds/linux/commit/e27f4a942a0ee4b84567a3c6cfa84f273e55cbb7
- https://github.com/torvalds/linux/commit/612bacad78ba6d0a91166fc4487af114bac172a8
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The old code was only able to pass the value 0 for the inheritable
and ambient capability set when a non-root user was specified.
However, sometimes it is useful to run a program in its own container
with a user specification and some capabilities set. This is needed
when the capabilities cannot be provided by file capabilities (because
the file system is mounted with MS_NOSUID for additional security).
This commit introduces the option --ambient-capability and the config
file option AmbientCapability=. Both are used in a similar way to the
existing Capability= setting. It changes the inheritable and ambient
set (which is 0 by default). The code also checks that the settings
for the bounding set (as defined by Capability= and DropCapability=)
and the setting for the ambient set (as defined by AmbientCapability=)
are compatible. Otherwise, the operation would fail anyway.
Due to the current use of -1 to indicate no support for the ambient
capability set, the special value "all" cannot be supported.
Also, the setting of ambient capability is restricted to running a
single program in the container payload.
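The compatibility check between the two sets boils down to a subset test on capability bitmasks. The sketch below assumes 64-bit masks indexed by capability number and uses hypothetical names; it does not reproduce the nspawn code.

```c
#include <stdbool.h>
#include <stdint.h>

/* Bit for capability number n, e.g. 10 for CAP_NET_BIND_SERVICE on Linux. */
#define CAP_BIT_SKETCH(n) (UINT64_C(1) << (n))

/* A capability can only be raised in the ambient set if the bounding set
 * retains it, so the ambient mask has to be a subset of the bounding mask. */
static bool ambient_fits_bounding_sketch(uint64_t bounding, uint64_t ambient) {
    return (ambient & ~bounding) == 0;
}
```

The sketch also shows why a mask with all bits set is awkward as an "all" value: it is the same bit pattern as (uint64_t) -1, which is already used as a sentinel.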
|