| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
| |
Prompted by https://bugzilla.redhat.com/show_bug.cgi?id=1930875 in which
I had previously used json_dispatch_unsigned and passed a return variable of
type unsigned when json_dispatch_unsigned writes a uintmax_t.
|
|
|
|
|
|
|
|
|
|
| |
Clean up ignore_signals() + default_signals() + sigaction_many() a bit:
make it unnecessary to explicitly terminate the signal list with -1.
Merge all three calls into a single function that is just called with
slightly different parameters. And eliminate an unnecessary extra
iteration in its inner for() loop.
No change in behaviour.
|
| |
|
|\
| |
| | |
Support ipv6 for masquerade and dnat in nspawn and networkd
|
| |
| |
| |
| | |
Extend nspawn so it can keep track of one ipv4 and one ipv6 address.
|
|\ \
| | |
| | | |
Envvar assignment cleanup
|
| | | |
|
|/ /
| |
| |
| |
| | |
This fits better in shared/, and the new parse-argument.c file is a good home
for it.
|
| | |
|
| |
| |
| |
| | |
Now that we know we have something useful, no need to make an answer up.
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
As suggested in https://github.com/systemd/systemd/pull/11484#issuecomment-775288617.
This does not touch anything exposed in src/systemd. Changing the defines there
would be a compatibility break.
Note that tests are broken after this commit. They will be fixed in the next one.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The old name originates when this was used to discover "machine" images,
as managed by machined/machinectl. But nowadays this is also used by
portable services and system extensions, hence let's use a more generic
name for this API. Taking inspiration from "dissect-image.[ch]", let's call
this "discover-image.[ch]".
This is pure renaming, no other changes.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
I think this formatting was originally used because it simplified
adding new options to the help messages. However, these days, most
tools their help message end with "\nSee the %s for details.\n" so
the final line almost never has to be edited which eliminates the
benefit of the custom formatting used for printf() help messages.
Let's make things more consistent and use the same formatting for
printf() help messages that we use everywhere else.
Prompted by https://github.com/systemd/systemd/pull/18355#discussion_r567241580
|
| |
| |
| |
| |
| |
| | |
Even though many of those scripts are very simple, it is easier to include
the header than to try to say whether each of those files is trivial enough
not to require one.
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Apparently SELinux inserts control data into AF_UNIX datagrams where we
don't expect it, thus miscalculating the control data. This looks like
something to fix in SELinux, but we still should handle this gracefully
and just drop the offending datagram and continue.
recvmsg_safe() actually already drops the datagram, it's just a matter
of actually ignoring EXFULL (which it generates if control data is too
large) in the right places.
This does this wherever an AF_UNIX/SOCK_DGRAM socket is used with
recvmsg_safe() that is not just internal communication.
Fixes: #17795
Follow-up for: 3691bcf3c5eebdcca5b4f1c51c745441c57a6cd1
|
|
|
|
|
| |
systemd-sysext supports --root= for everything but the image discovery.
Fix that.
|
| |
|
| |
|
|
|
|
| |
Then, we can shorten many test definitions.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
This is inline with the OCI runtime spec:
On POSIX platforms, path is either an absolute path or a relative path
to the bundle. For example, with a bundle at /to/bundle and a root
filesystem at /to/bundle/rootfs, the path value can be either
/to/bundle/rootfs or rootfs. The value SHOULD be the conventional
rootfs.
(https://github.com/opencontainers/runtime-spec/blob/master/config.md)
|
| |
|
|\
| |
| | |
add networkd/nspawn nftables backend
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Idea is to use a static ruleset, added when the first attempt to
add a masquerade or dnat rule is made.
The alternative would be to add the ruleset when the init function is called.
The disadvantage is that this enables connection tracking and NAT in the kernel
(as the ruleset needs this to work), which comes with some overhead that might
not be needed (no nspawn usage and no IPMasquerade option set).
There is no additional dependency on the 'nft' userspace binary or other libraries.
sd-netlinks nfnetlink backend is used to modify the nftables ruleset.
The commit message/comments still use nft syntax since that is what
users will see when they use the nft tool to list the ruleset.
The added initial skeleton (added on first fw_add_masquerade/local_dnat
call) looks like this:
table ip io.systemd.nat {
set masq_saddr {
type ipv4_addr
flags interval
elements = { 192.168.59.160/28 }
}
map map_port_ipport {
type inet_proto . inet_service : ipv4_addr . inet_service
elements = { tcp . 2222 : 192.168.59.169 . 22 }
}
chain prerouting {
type nat hook prerouting priority dstnat + 1; policy accept;
fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
}
chain output {
type nat hook output priority -99; policy accept;
ip daddr != 127.0.0.0/8 oif "lo" dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
}
chain postrouting {
type nat hook postrouting priority srcnat + 1; policy accept;
ip saddr @masq_saddr masquerade
}
}
Next calls to fw_add_masquerade/add_local_dnat will then only add/delete the
element/mapping to masq_saddr and map_port_ipport, i.e. the ruleset doesn't
change -- only the set/map content does.
Running test-firewall-util with this backend gives following output
on a parallel 'nft monitor':
$ nft monitor
add table ip io.systemd.nat
add chain ip io.systemd.nat prerouting { type nat hook prerouting priority dstnat + 1; policy accept; }
add chain ip io.systemd.nat output { type nat hook output priority -99; policy accept; }
add chain ip io.systemd.nat postrouting { type nat hook postrouting priority srcnat + 1; policy accept; }
add set ip io.systemd.nat masq_saddr { type ipv4_addr; flags interval; }
add map ip io.systemd.nat map_port_ipport { type inet_proto . inet_service : ipv4_addr . inet_service; }
add rule ip io.systemd.nat prerouting fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
add rule ip io.systemd.nat output ip daddr != 127.0.0.0/8 fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
add rule ip io.systemd.nat postrouting ip saddr @masq_saddr masquerade
add element ip io.systemd.nat masq_saddr { 10.1.2.3 }
add element ip io.systemd.nat masq_saddr { 10.0.2.0/28 }
delete element ip io.systemd.nat masq_saddr { 10.0.2.0/28 }
delete element ip io.systemd.nat masq_saddr { 10.1.2.3 }
add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
CTRL-C
Things not implemented/supported:
1. Change monitoring. The kernel allows userspace to learn about changes
made by other clients (using nfnetlink notifications). It would be
possible to detect when e.g. someone removes the systemd nat table.
This would need more work. Its also not clear on how to react to
external changes -- it doesn't seem like a good idea to just auto-undo
everthing.
2. 'set masq_saddr' doesn't handle overlaps.
Example:
fw_add_masquerade(true, AF_INET, "10.0.0.0" , 16);
fw_add_masquerade(true, AF_INET, "10.0.0.0" , 8); /* fails */
With the iptables backend the second call works, as it adds an
independent iptables rule.
With the nftables backend, the range 10.0.0.0-10.255.255.255 clashes with
the existing range of 10.0.0.0-10.0.255.255 so 2nd add gets rejected by the
kernel.
This will generate an error message from networkd ("Could not enable IP
masquerading: File exists").
To resolve this it would be needed to either keep track of the added elements
and perform range merging when overlaps are detected.
However, the add erquests are done using the configured network on a
device, so no overlaps should occur in normal setups.
IPv6 support is added in a extra changeset.
Fixes: #13307
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
for planned nft backend we have three choices:
- open/close a new nfnetlink socket for every operation
- keep a nfnetlink socket open internally
- expose a opaque fw_ctx and stash all internal data here.
Originally I opted for the 2nd option, but during review it was
suggested to avoid static storage duration because of perceived
problems with threaded applications.
This adds fw_ctx and new/free functions, then converts the existing api
and nspawn and networkd to use it.
|
| |
| |
| |
| |
| |
| | |
Next patch will need to pass two pointers to the callback instead
of just the addr mask. Caller will pass a compound structure, so
make this 'void *userdata' to de-clutter the next patch.
|
|/
|
|
|
|
|
|
|
|
| |
No functional change, just moving a bunch of things around. Before
we needed a rather complicated setup to test hostname_setup(), because
the code was in src/core/. When things are moved to src/shared/
we can just test it as any function.
The test is still "unsafe" because hostname_setup() may modify the
hostname.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Let's clean up hostname_is_valid() a bit: let's turn the second boolean
argument into a more explanatory flags field, and add a flag that
accepts the special name ".host" as valid. This is useful for the
container logic, where the special hostname ".host" refers to the "root
container", i.e. the host system itself, and can be specified at various
places.
let's also get rid of machine_name_is_valid(). It was just an alias,
which is confusing and even more so now that we have the flags param.
|
|
|
|
|
|
|
|
|
| |
bpffs fully respects mount namespaces since kernel version 4.7
References:
- https://github.com/torvalds/linux/commit/e27f4a942a0ee4b84567a3c6cfa84f273e55cbb7
- https://github.com/torvalds/linux/commit/612bacad78ba6d0a91166fc4487af114bac172a8
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The old code was only able to pass the value 0 for the inheritable
and ambient capability set when a non-root user was specified.
However, sometimes it is useful to run a program in its own container
with a user specification and some capabilities set. This is needed
when the capabilities cannot be provided by file capabilities (because
the file system is mounted with MS_NOSUID for additional security).
This commit introduces the option --ambient-capability and the config
file option AmbientCapability=. Both are used in a similar way to the
existing Capability= setting. It changes the inheritable and ambient
set (which is 0 by default). The code also checks that the settings
for the bounding set (as defined by Capability= and DropCapability=)
and the setting for the ambient set (as defined by AmbientCapability=)
are compatible. Otherwise, the operation would fail in any way.
Due to the current use of -1 to indicate no support for ambient
capability set the special value "all" cannot be supported.
Also, the setting of ambient capability is restricted to running a
single program in the container payload.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
All users pass a NULL/0 for those, things haven't changed since 2015
when this was added originally, so remove the arguments.
THe paramters are re-added as local function variables, initalised
to NULL or 0. A followup patch can then manually remove all
if (NULL) rather than leaving dead-branch optimization to compiler.
Reason for not doing it here is to ease patch review.
Not requiring support for this will ease initial nftables backend
implementation.
In case a use-case comues up later this feature can be re-added.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
Fixes: #17504
(While we are it, also move $SYSTEMD_SECCOMP_LOG= env var description
into the right document section)
Also suggested in: https://github.com/systemd/systemd/issues/17245#issuecomment-704773603
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
name
This beefs up the READ_FULL_FILE_CONNECT_SOCKET logic of
read_full_file_full() a bit: when used a sender socket name may be
specified. If specified as NULL behaviour is as before: the client
socket name is picked by the kernel. But if specified as non-NULL the
client can pick a socket name to use when connecting. This is useful to
communicate a minimal amount of metainformation from client to server,
outside of the transport payload.
Specifically, these beefs up the service credential logic to pass an
abstract AF_UNIX socket name as client socket name when connecting via
READ_FULL_FILE_CONNECT_SOCKET, that includes the requesting unit name
and the eventual credential name. This allows servers implementing the
trivial credential socket logic to distinguish clients: via a simple
getpeername() it can be determined which unit is requesting a
credential, and which credential specifically.
Example: with this patch in place, in a unit file "waldo.service" a
configuration line like the following:
LoadCredential=foo:/run/quux/creds.sock
will result in a connection to the AF_UNIX socket /run/quux/creds.sock,
originating from an abstract namespace AF_UNIX socket:
@$RANDOM/unit/waldo.service/foo
(The $RANDOM is replaced by some randomized string. This is included in
the socket name order to avoid namespace squatting issues: the abstract
socket namespace is open to unprivileged users after all, and care needs
to be taken not to use guessable names)
The services listening on the /run/quux/creds.sock socket may thus
easily retrieve the name of the unit the credential is requested for
plus the credential name, via a simpler getpeername(), discarding the
random preifx and the /unit/ string.
This logic uses "/" as separator between the fields, since both unit
names and credential names appear in the file system, and thus are
designed to use "/" as outer separators. Given that it's a good safe
choice to use as separators here, too avoid any conflicts.
This is a minimal patch only: the new logic is used only for the unit
file credential logic. For other places where we use
READ_FULL_FILE_CONNECT_SOCKET it is probably a good idea to use this
scheme too, but this should be done carefully in later patches, since
the socket names become API that way, and we should determine the right
amount of info to pass over.
|
|
|
|
|
|
| |
When nspawn starts an image, this image could be in any state, including
an aborted first boot. For this case, it needs to correctly handle the
situation like there was no machine-id at all.
|
| |
|
|
|
|
|
|
|
|
| |
We should chown what we allocate ourselves, i.e. any pty we allocate
ourselves. But for stuff we propagate, let's avoid that: we shouldn't
make more changes than necessary.
Fixes: #17229
|
|
|
|
|
|
|
|
|
|
|
| |
When invoked as non-root, we would suggest re-running as root without any
further hint. But this immediately spawns a machine from the local directory,
which can be rather surprising. So let's give a better hint.
(In general, I don't think commandline programs should do "significant" things
when invoked without any arguments. In this regard it would be better if
systemd-nspawn would not spawn a machine from the current directory if called
with no arguments and at least "-D ." would be required.)
|
|
|
|
|
|
| |
Let's make umount_verbose() more like mount_verbose_xyz(), i.e. take log
level and flags param. In particular the latter matters, since we
typically don't actually want to follow symlinks when unmounting.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
By default we'll run a container in --console=interactive and
--console=read-only mode depending if we are invoked on a tty or not so
that the container always gets a /dev/console allocated, i.e is always
suitable to run a full init system /as those typically expect a
/dev/console to exist).
With the new --console=autopipe mode we do something similar, but
slightly different: when not invoked on a tty we'll use --console=pipe.
This means, if you invoke some tool in a container with this you'll get
full inetractivity if you invoke it on a tty but things will also be
very nicely pipeable. OTOH you cannot invoke a full init system like
this, because you might or might not become a /dev/console this way...
Prompted-by: #17070
(I named this "autopipe" rather than "auto" or so, since the default
mode probably should be named "auto" one day if we add a name for it,
and this is so similar to "auto" except that it uses pipes in the
non-tty case).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of first becoming a controlling process of the payload pty
as side effect of opening it (without O_NOCTTY), and then possibly
dropping it again, let's do it cleanly an reverse the logic: let's open
the pty without becoming its controller first. Only after everything
went the way we wanted it to go become the controller explicitly.
This has the benefit that the PID 1 stub process we run (as effect of
--as-pid2) doesn't have to lose the tty explicitly, but can just
continue running with things. And we explicitly make the tty controlling
right before invoking actual payload.
In order to make sure everything works as expected validate that the
stub PID 1 in the container really has no conrolling tty by issuing the
TIOCNOTTY tty and expecting ENOTTY, and log about it.
This shouldn't change behaviour much, it just makes thins a bit cleaner,
in particular as we'll not trigger SIGHUP on ourselves (since we are
controller and session leader) due to TIOCNOTTY which we then have to
explicitly ignore.
|
| |
|
|
|
|
|
|
|
| |
If people do this then things are weird, and they should probably use
--console=interactive (i.e. the default) instead.
Prompted-by: #17070
|
|
|
|
|
| |
Let's verify that everything works the way we expect it to work, hence
check setsid() return code.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Just some refactoring: let's place the various verity related parameters
in a common structure, and pass that around instead of the individual
parameters.
Also, let's load the PKCS#7 signature data when finding metadata
right-away, instead of delaying this until we need it. In all cases we
call this there's not much time difference between the metdata finding
and the loading, hence this simplifies things and makes sure root hash
data and its signature is now always acquired together.
|
| |
|