path: root/src/nspawn
Commit message, Author, Age, Files, Lines
* json: rename json_dispatch_{integer,unsigned} -> json_dispatch_{intmax,uintmax}Anita Zhang2021-02-261-2/+2
| | | | | | Prompted by https://bugzilla.redhat.com/show_bug.cgi?id=1930875 in which I had previously used json_dispatch_unsigned and passed a return variable of type unsigned when json_dispatch_unsigned writes a uintmax_t.
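The type mismatch described above can be illustrated with a small sketch. The names `dispatch_uintmax()` and `read_unsigned()` are hypothetical stand-ins, not the real JSON dispatch API; the only point is that a callee which always stores a `uintmax_t` through an untyped pointer must be handed a `uintmax_t`, with any narrowing done explicitly afterwards.

```c
#include <assert.h>
#include <errno.h>
#include <limits.h>
#include <stdint.h>

/* Hypothetical stand-in for a dispatcher that stores its result through an
 * untyped pointer and always writes a full uintmax_t. Passing the address of
 * a plain 'unsigned' here would write past the end of that variable. */
static int dispatch_uintmax(uintmax_t parsed, void *userdata) {
        uintmax_t *target = userdata;

        assert(target);
        *target = parsed;   /* writes sizeof(uintmax_t) bytes */
        return 0;
}

/* Safe pattern: receive into a uintmax_t, range-check, then narrow. */
static int read_unsigned(uintmax_t parsed, unsigned *ret) {
        uintmax_t tmp = 0;
        int r;

        assert(ret);

        r = dispatch_uintmax(parsed, &tmp);
        if (r < 0)
                return r;
        if (tmp > UINT_MAX)
                return -ERANGE;   /* value does not fit, leave *ret untouched */

        *ret = (unsigned) tmp;
        return 0;
}
```

Naming the dispatcher after the type it writes, as the rename does, makes this kind of mismatch easier to spot at the call site.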
* signal-util: make -1 termination of ignore_signals() argument list unnecessaryLennart Poettering2021-02-251-2/+2
| | | | | | | | | | Clean up ignore_signals() + default_signals() + sigaction_many() a bit: make it unnecessary to explicitly terminate the signal list with -1. Merge all three calls into a single function that is just called with slightly different parameters. And eliminate an unnecessary extra iteration in its inner for() loop. No change in behaviour.
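One way to drop an explicit -1 terminator is to let a macro count the arguments. The sketch below is an illustration of that idea only, not the code from this commit; `set_handler_many()`, `SIGNAL_COUNT()`, `IGNORE_SIGNALS()` and `DEFAULT_SIGNALS()` are hypothetical names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <signal.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: install the same handler for n signals.
 * Returns 0 on success, -1 if any sigaction() call failed. */
static int set_handler_many(void (*handler)(int), const int *signals, size_t n) {
        struct sigaction sa;
        int r = 0;

        assert(signals || n == 0);

        memset(&sa, 0, sizeof(sa));
        sa.sa_handler = handler;
        sigemptyset(&sa.sa_mask);

        for (size_t i = 0; i < n; i++)
                if (sigaction(signals[i], &sa, NULL) < 0)
                        r = -1;

        return r;
}

/* The compound literal lets the compiler count the arguments, so callers
 * do not have to terminate the list themselves. */
#define SIGNAL_COUNT(...) (sizeof((const int[]) { __VA_ARGS__ }) / sizeof(int))
#define IGNORE_SIGNALS(...) \
        set_handler_many(SIG_IGN, (const int[]) { __VA_ARGS__ }, SIGNAL_COUNT(__VA_ARGS__))
#define DEFAULT_SIGNALS(...) \
        set_handler_many(SIG_DFL, (const int[]) { __VA_ARGS__ }, SIGNAL_COUNT(__VA_ARGS__))
```

With this shape, "ignore" and "reset to default" are the same function called with different parameters, which matches the merge described in the commit message.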
* tree-wide: use in_addr_is_set() or friendsYu Watanabe2021-02-171-2/+2
|
* Merge pull request #18007 from fw-strlen/ipv6_masq_and_dnatLennart Poettering2021-02-163-14/+20
|\ | | | | Support ipv6 for masquerade and dnat in nspawn and networkd
| * nspawn: expose container ipv6 address tooFlorian Westphal2021-01-193-14/+20
| | | | | | | | Extend nspawn so it can keep track of one ipv4 and one ipv6 address.
* | Merge pull request #18601 from keszybz/env-assign-cleanupLennart Poettering2021-02-161-7/+3
|\ \ | | | | | | Envvar assignment cleanup
| * | basic/env-util: add variant of strv_env_replace() that does strdup internallyZbigniew Jędrzejewski-Szmek2021-02-151-7/+3
| | |
* | | Move and rename parse_path_argument() functionZbigniew Jędrzejewski-Szmek2021-02-151-6/+7
|/ / | | | | | | | | This fits better in shared/, and the new parse-argument.c file is a good home for it.
* | tree-wide: use free_and_strdup_warn()Yu Watanabe2021-02-111-4/+1
| |
* | tree-wide: propagate error code from _from_string() functionsZbigniew Jędrzejewski-Szmek2021-02-102-7/+5
| | | | | | | | Now that we know we have something useful, no need to make an answer up.
* | tree-wide: use -EINVAL for enum invalid valuesZbigniew Jędrzejewski-Szmek2021-02-102-7/+7
| | | | | | | | | | | | | | | | | | As suggested in https://github.com/systemd/systemd/pull/11484#issuecomment-775288617. This does not touch anything exposed in src/systemd. Changing the defines there would be a compatibility break. Note that tests are broken after this commit. They will be fixed in the next one.
* | shared: rename machine-image.[ch] → discover-image.[ch]Lennart Poettering2021-02-031-1/+1
| | | | | | | | | | | | | | | | | | | | The old name originates when this was used to discover "machine" images, as managed by machined/machinectl. But nowadays this is also used by portable services and system extensions, hence let's use a more generic name for this API. Taking inspiration from "dissect-image.[ch]", let's call this "discover-image.[ch]". This is pure renaming, no other changes.
* | tree-wide: Drop custom formatting for print() help messagesDaan De Meyer2021-01-311-6/+7
| | I think this formatting was originally used because it simplified adding new options to the help messages. However, these days, most tools' help messages end with "\nSee the %s for details.\n" so the final line almost never has to be edited, which eliminates the benefit of the custom formatting used for printf() help messages. Let's make things more consistent and use the same formatting for printf() help messages that we use everywhere else. Prompted by https://github.com/systemd/systemd/pull/18355#discussion_r567241580
* | tree-wide: add spdx header on all scripts and helpersZbigniew Jędrzejewski-Szmek2021-01-281-0/+1
| | | | | | | | | | | | Even though many of those scripts are very simple, it is easier to include the header than to try to say whether each of those files is trivial enough not to require one.
* | tree-wide: ignore messages with too long control dataLennart Poettering2021-01-201-0/+4
|/ | | | | | | | | | | | | | | | | Apparently SELinux inserts control data into AF_UNIX datagrams where we don't expect it, thus miscalculating the control data. This looks like something to fix in SELinux, but we still should handle this gracefully and just drop the offending datagram and continue. recvmsg_safe() actually already drops the datagram, it's just a matter of actually ignoring EXFULL (which it generates if control data is too large) in the right places. This does this wherever an AF_UNIX/SOCK_DGRAM socket is used with recvmsg_safe() that is not just internal communication. Fixes: #17795 Follow-up for: 3691bcf3c5eebdcca5b4f1c51c745441c57a6cd1
* machine-image: properly support searching for images below some --root= pathLennart Poettering2021-01-191-1/+1
| | | | | systemd-sysext supports --root= for everything but the image discovery. Fix that.
* meson: move test or fuzzer definitions to relevant meson.build in subdirectoriesYu Watanabe2021-01-181-0/+14
|
* fuzzers: move several fuzzersYu Watanabe2021-01-184-0/+58
|
* meson: make the second and third elements of tests or fuzzers optionalYu Watanabe2021-01-181-1/+1
| | | | Then, we can shorten many test definitions.
* nspawn: minor modernizationZbigniew Jędrzejewski-Szmek2021-01-151-28/+9
|
* nspawn: make rootfs relative to oci bundle pathArian van Putten2021-01-121-1/+17
| This is in line with the OCI runtime spec: On POSIX platforms, path is either an absolute path or a relative path to the bundle. For example, with a bundle at /to/bundle and a root filesystem at /to/bundle/rootfs, the path value can be either /to/bundle/rootfs or rootfs. The value SHOULD be the conventional rootfs. (https://github.com/opencontainers/runtime-spec/blob/master/config.md)
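The rule quoted from the spec can be sketched as a small path-resolution helper. `resolve_rootfs()` is a hypothetical name and this is not the code from the commit; it assumes the bundle path is absolute and has no trailing slash.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: resolve the OCI "root.path" value against the bundle
 * directory. Absolute paths are used as-is, relative ones are joined to the
 * bundle path. Returns a newly allocated string, or NULL on allocation
 * failure. */
static char *resolve_rootfs(const char *bundle, const char *path) {
        size_t size;
        char *joined;

        assert(bundle);
        assert(path);

        if (path[0] == '/')
                return strdup(path);

        size = strlen(bundle) + 1 + strlen(path) + 1;   /* "/" plus NUL */
        joined = malloc(size);
        if (!joined)
                return NULL;

        snprintf(joined, size, "%s/%s", bundle, path);
        return joined;
}
```

Both spellings from the spec example then resolve to the same directory: "rootfs" and "/to/bundle/rootfs" with a bundle at /to/bundle.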
* nspawn: sort headersYu Watanabe2020-12-181-2/+1
|
* Merge pull request #17026 from fw-strlen/nft_16Lennart Poettering2020-12-163-26/+39
|\ | | | | add networkd/nspawn nftables backend
| * firewall-util: add nftables backendFlorian Westphal2020-12-161-5/+0
| | Idea is to use a static ruleset, added when the first attempt to add a masquerade or dnat rule is made.
| |
| | The alternative would be to add the ruleset when the init function is called. The disadvantage is that this enables connection tracking and NAT in the kernel (as the ruleset needs this to work), which comes with some overhead that might not be needed (no nspawn usage and no IPMasquerade option set).
| |
| | There is no additional dependency on the 'nft' userspace binary or other libraries. sd-netlink's nfnetlink backend is used to modify the nftables ruleset.
| |
| | The commit message/comments still use nft syntax since that is what users will see when they use the nft tool to list the ruleset.
| |
| | The added initial skeleton (added on first fw_add_masquerade/local_dnat call) looks like this:
| |
| |   table ip io.systemd.nat {
| |           set masq_saddr {
| |                   type ipv4_addr
| |                   flags interval
| |                   elements = { 192.168.59.160/28 }
| |           }
| |           map map_port_ipport {
| |                   type inet_proto . inet_service : ipv4_addr . inet_service
| |                   elements = { tcp . 2222 : 192.168.59.169 . 22 }
| |           }
| |           chain prerouting {
| |                   type nat hook prerouting priority dstnat + 1; policy accept;
| |                   fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
| |           }
| |           chain output {
| |                   type nat hook output priority -99; policy accept;
| |                   ip daddr != 127.0.0.0/8 oif "lo" dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
| |           }
| |           chain postrouting {
| |                   type nat hook postrouting priority srcnat + 1; policy accept;
| |                   ip saddr @masq_saddr masquerade
| |           }
| |   }
| |
| | Next calls to fw_add_masquerade/add_local_dnat will then only add/delete the element/mapping to masq_saddr and map_port_ipport, i.e. the ruleset doesn't change -- only the set/map content does.
| |
| | Running test-firewall-util with this backend gives following output on a parallel 'nft monitor':
| |
| |   $ nft monitor
| |   add table ip io.systemd.nat
| |   add chain ip io.systemd.nat prerouting { type nat hook prerouting priority dstnat + 1; policy accept; }
| |   add chain ip io.systemd.nat output { type nat hook output priority -99; policy accept; }
| |   add chain ip io.systemd.nat postrouting { type nat hook postrouting priority srcnat + 1; policy accept; }
| |   add set ip io.systemd.nat masq_saddr { type ipv4_addr; flags interval; }
| |   add map ip io.systemd.nat map_port_ipport { type inet_proto . inet_service : ipv4_addr . inet_service; }
| |   add rule ip io.systemd.nat prerouting fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
| |   add rule ip io.systemd.nat output ip daddr != 127.0.0.0/8 fib daddr type local dnat ip addr . port to meta l4proto . th dport map @map_port_ipport
| |   add rule ip io.systemd.nat postrouting ip saddr @masq_saddr masquerade
| |   add element ip io.systemd.nat masq_saddr { 10.1.2.3 }
| |   add element ip io.systemd.nat masq_saddr { 10.0.2.0/28 }
| |   delete element ip io.systemd.nat masq_saddr { 10.0.2.0/28 }
| |   delete element ip io.systemd.nat masq_saddr { 10.1.2.3 }
| |   add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
| |   delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.4 . 815 }
| |   add element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
| |   delete element ip io.systemd.nat map_port_ipport { tcp . 4711 : 1.2.3.5 . 815 }
| |   CTRL-C
| |
| | Things not implemented/supported:
| |
| | 1. Change monitoring. The kernel allows userspace to learn about changes made by other clients (using nfnetlink notifications). It would be possible to detect when e.g. someone removes the systemd nat table. This would need more work. It's also not clear how to react to external changes -- it doesn't seem like a good idea to just auto-undo everything.
| |
| | 2. 'set masq_saddr' doesn't handle overlaps. Example:
| |
| |      fw_add_masquerade(true, AF_INET, "10.0.0.0", 16);
| |      fw_add_masquerade(true, AF_INET, "10.0.0.0", 8); /* fails */
| |
| |    With the iptables backend the second call works, as it adds an independent iptables rule. With the nftables backend, the range 10.0.0.0-10.255.255.255 clashes with the existing range of 10.0.0.0-10.0.255.255 so the 2nd add gets rejected by the kernel. This will generate an error message from networkd ("Could not enable IP masquerading: File exists"). To resolve this it would be necessary to keep track of the added elements and perform range merging when overlaps are detected. However, the add requests are done using the configured network on a device, so no overlaps should occur in normal setups.
| |
| | IPv6 support is added in an extra changeset.
| |
| | Fixes: #13307
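The overlap limitation in item 2 comes down to a simple property of CIDR prefixes: two prefixes overlap exactly when they agree on the bits covered by the shorter of the two. The sketch below shows that check for IPv4 addresses given in host byte order. `prefix_mask()` and `prefixes_overlap()` are hypothetical names, and this is not something the commit implements (it explicitly leaves overlap handling out).

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Netmask for an IPv4 prefix length, host byte order. */
static uint32_t prefix_mask(unsigned prefixlen) {
        assert(prefixlen <= 32);
        /* A shift by 32 would be undefined, so handle /0 separately. */
        return prefixlen == 0 ? 0 : (uint32_t) (UINT32_MAX << (32 - prefixlen));
}

/* Two prefixes overlap iff they are equal under the shorter prefix's mask. */
static bool prefixes_overlap(uint32_t a, unsigned a_len, uint32_t b, unsigned b_len) {
        uint32_t mask = prefix_mask(a_len < b_len ? a_len : b_len);

        return (a & mask) == (b & mask);
}
```

For the example in the commit message, 10.0.0.0/16 and 10.0.0.0/8 overlap, which is why the second add is rejected by the interval set, while 10.1.2.3 and 10.0.2.0/28 from the monitor output do not.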
| * firewall-util: introduce context structureFlorian Westphal2020-12-163-17/+35
| | For the planned nft backend we have three choices:
| |
| | - open/close a new nfnetlink socket for every operation
| | - keep a nfnetlink socket open internally
| | - expose an opaque fw_ctx and stash all internal data here.
| |
| | Originally I opted for the 2nd option, but during review it was suggested to avoid static storage duration because of perceived problems with threaded applications.
| |
| | This adds fw_ctx and new/free functions, then converts the existing api and nspawn and networkd to use it.
| * nspawn: pass userdata pointer, not inet_addr unionFlorian Westphal2020-12-162-4/+4
| | | | | | | | | | | | Next patch will need to pass two pointers to the callback instead of just the addr mask. Caller will pass a compound structure, so make this 'void *userdata' to de-clutter the next patch.
* | Move hostname setup logic to new shared/hostname-setup.[ch]Zbigniew Jędrzejewski-Szmek2020-12-161-0/+1
|/ | | | | | | | | | No functional change, just moving a bunch of things around. Before we needed a rather complicated setup to test hostname_setup(), because the code was in src/core/. When things are moved to src/shared/ we can just test it as any function. The test is still "unsafe" because hostname_setup() may modify the hostname.
* hostname-util: flagsify hostname_is_valid(), drop machine_name_is_valid()Lennart Poettering2020-12-153-5/+5
| Let's clean up hostname_is_valid() a bit: let's turn the second boolean argument into a more explanatory flags field, and add a flag that accepts the special name ".host" as valid. This is useful for the container logic, where the special hostname ".host" refers to the "root container", i.e. the host system itself, and can be specified at various places. Let's also get rid of machine_name_is_valid(). It was just an alias, which is confusing and even more so now that we have the flags param.
* nspawn: remove outdated comment regarding bpffsIlya Dmitrichenko2020-12-141-1/+1
| | | | | | | | | bpffs fully respects mount namespaces since kernel version 4.7 References: - https://github.com/torvalds/linux/commit/e27f4a942a0ee4b84567a3c6cfa84f273e55cbb7 - https://github.com/torvalds/linux/commit/612bacad78ba6d0a91166fc4487af114bac172a8
* systemd-nspawn: Allow setting ambient capability setTorsten Hilbrich2020-12-073-6/+41
| The old code was only able to pass the value 0 for the inheritable and ambient capability set when a non-root user was specified. However, sometimes it is useful to run a program in its own container with a user specification and some capabilities set. This is needed when the capabilities cannot be provided by file capabilities (because the file system is mounted with MS_NOSUID for additional security). This commit introduces the option --ambient-capability and the config file option AmbientCapability=. Both are used in a similar way to the existing Capability= setting. It changes the inheritable and ambient set (which is 0 by default). The code also checks that the settings for the bounding set (as defined by Capability= and DropCapability=) and the setting for the ambient set (as defined by AmbientCapability=) are compatible. Otherwise, the operation would fail anyway. Due to the current use of -1 to indicate no support for the ambient capability set, the special value "all" cannot be supported. Also, the setting of ambient capabilities is restricted to running a single program in the container payload.
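The compatibility check mentioned above can be pictured as a bitmask test. The sketch assumes that "compatible" means every ambient capability must also be in the retained bounding set; that reading, and the names `bounding_set()` and `check_ambient()`, are assumptions for illustration rather than the code from the commit.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>

/* One bit per capability number, as in a 64-bit capability mask. */
#define CAP_BIT(nr) (UINT64_C(1) << (nr))

/* Example capability numbers, matching linux/capability.h. */
#define EXAMPLE_CAP_NET_BIND_SERVICE 10
#define EXAMPLE_CAP_SYS_ADMIN        21

/* Hypothetical: bounding set after applying Capability=-style additions and
 * DropCapability=-style removals to a default set. */
static uint64_t bounding_set(uint64_t defaults, uint64_t plus, uint64_t minus) {
        return (defaults | plus) & ~minus;
}

/* Assumed rule: an ambient capability that is not in the bounding set cannot
 * take effect, so reject the configuration up front. */
static int check_ambient(uint64_t bounding, uint64_t ambient) {
        if (ambient & ~bounding)
                return -EINVAL;
        return 0;
}
```

Rejecting the mismatch during option parsing gives a clear error before the container is set up, instead of a failure later when the capability sets are applied.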
* fw_add_local_dnat: remove unused function argumentsFlorian Westphal2020-12-031-6/+0
| All users pass a NULL/0 for those, things haven't changed since 2015 when this was added originally, so remove the arguments. The parameters are re-added as local function variables, initialised to NULL or 0. A followup patch can then manually remove all if (NULL) rather than leaving dead-branch optimization to the compiler. Reason for not doing it here is to ease patch review. Not requiring support for this will ease the initial nftables backend implementation. In case a use-case comes up later this feature can be re-added.
* fileio: teach read_full_file_full() to read from offset/with maximum sizeLennart Poettering2020-12-011-1/+4
|
* license: LGPL-2.1+ -> LGPL-2.1-or-laterYu Watanabe2020-11-0929-29/+29
|
* seccomp: allow turning off of seccomp filtering via env varLennart Poettering2020-11-051-1/+1
| Fixes: #17504 (While we are at it, also move $SYSTEMD_SECCOMP_LOG= env var description into the right document section) Also suggested in: https://github.com/systemd/systemd/issues/17245#issuecomment-704773603
* fileio: beef up READ_FULL_FILE_CONNECT_SOCKET to allow setting sender socket nameLennart Poettering2020-11-031-1/+1
| This beefs up the READ_FULL_FILE_CONNECT_SOCKET logic of read_full_file_full() a bit: when used a sender socket name may be specified. If specified as NULL behaviour is as before: the client socket name is picked by the kernel. But if specified as non-NULL the client can pick a socket name to use when connecting. This is useful to communicate a minimal amount of metainformation from client to server, outside of the transport payload.
|
| Specifically, this beefs up the service credential logic to pass an abstract AF_UNIX socket name as client socket name when connecting via READ_FULL_FILE_CONNECT_SOCKET, that includes the requesting unit name and the eventual credential name. This allows servers implementing the trivial credential socket logic to distinguish clients: via a simple getpeername() it can be determined which unit is requesting a credential, and which credential specifically.
|
| Example: with this patch in place, in a unit file "waldo.service" a configuration line like the following:
|
|     LoadCredential=foo:/run/quux/creds.sock
|
| will result in a connection to the AF_UNIX socket /run/quux/creds.sock, originating from an abstract namespace AF_UNIX socket:
|
|     @$RANDOM/unit/waldo.service/foo
|
| (The $RANDOM is replaced by some randomized string. This is included in the socket name in order to avoid namespace squatting issues: the abstract socket namespace is open to unprivileged users after all, and care needs to be taken not to use guessable names)
|
| The services listening on the /run/quux/creds.sock socket may thus easily retrieve the name of the unit the credential is requested for plus the credential name, via a simple getpeername(), discarding the random prefix and the /unit/ string.
|
| This logic uses "/" as separator between the fields, since both unit names and credential names appear in the file system, and thus are designed to use "/" as outer separators. Given that, it's a good safe choice to use as separator here too, to avoid any conflicts.
|
| This is a minimal patch only: the new logic is used only for the unit file credential logic. For other places where we use READ_FULL_FILE_CONNECT_SOCKET it is probably a good idea to use this scheme too, but this should be done carefully in later patches, since the socket names become API that way, and we should determine the right amount of info to pass over.
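On the server side, the peer name described above could be split roughly as follows. This is a minimal sketch: `parse_credential_peer()` is a hypothetical helper, and it expects the abstract socket name as a plain string with the leading NUL byte (shown as "@" above) already stripped, e.g. as copied out of the address returned by getpeername().

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical parser for "<random>/unit/<unit name>/<credential name>".
 * Copies the unit and credential names into the caller's buffers.
 * Returns 0 on success, -EINVAL on malformed input, -ENOBUFS if a buffer
 * is too small. */
static int parse_credential_peer(const char *name,
                                 char *unit, size_t unit_size,
                                 char *cred, size_t cred_size) {
        const char *p, *slash;
        size_t unit_len, cred_len;

        assert(name);
        assert(unit);
        assert(cred);

        /* Discard the random prefix: everything up to the first "/". */
        p = strchr(name, '/');
        if (!p || strncmp(p, "/unit/", 6) != 0)
                return -EINVAL;
        p += 6;

        /* The unit name runs up to the next "/", the rest is the credential. */
        slash = strchr(p, '/');
        if (!slash || slash == p || slash[1] == '\0')
                return -EINVAL;

        unit_len = (size_t) (slash - p);
        cred_len = strlen(slash + 1);
        if (unit_len >= unit_size || cred_len >= cred_size)
                return -ENOBUFS;

        memcpy(unit, p, unit_len);
        unit[unit_len] = '\0';
        memcpy(cred, slash + 1, cred_len + 1);   /* includes the NUL */
        return 0;
}
```

For the waldo.service example this yields the unit name "waldo.service" and the credential name "foo". A real server would also want to decide how far to trust the name, since the abstract socket namespace is open to unprivileged users.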
* nspawn: robustly deal with "uninitialized" machine-idHarald Seiler2020-10-191-1/+1
| | | | | | When nspawn starts an image, this image could be in any state, including an aborted first boot. For this case, it needs to correctly handle the situation like there was no machine-id at all.
* tree-wide: assorted coccinelle fixesFrantisek Sumsal2020-10-092-11/+10
|
* nspawn: don't chown() stdin/stdout passed in when --console=pipe is usedLennart Poettering2020-10-023-14/+17
| | | | | | | | We should chown what we allocate ourselves, i.e. any pty we allocate ourselves. But for stuff we propagate, let's avoid that: we shouldn't make more changes than necessary. Fixes: #17229
* nspawn: give better message when invoked as non-root without argumentsZbigniew Jędrzejewski-Szmek2020-09-241-2/+5
| | | | | | | | | | | When invoked as non-root, we would suggest re-running as root without any further hint. But this immediately spawns a machine from the local directory, which can be rather surprising. So let's give a better hint. (In general, I don't think commandline programs should do "significant" things when invoked without any arguments. In this regard it would be better if systemd-nspawn would not spawn a machine from the current directory if called with no arguments and at least "-D ." would be required.)
* mount-util: rework umount_verbose() to take log level and flags argLennart Poettering2020-09-232-6/+7
| | | | | | Let's make umount_verbose() more like mount_verbose_xyz(), i.e. take log level and flags param. In particular the latter matters, since we typically don't actually want to follow symlinks when unmounting.
* mount-util: switch most mount_verbose() code over to not follow symlinksLennart Poettering2020-09-234-72/+77
|
* tree-wide: use ERRNO_IS_PRIVILEGE() wherever appropriateLennart Poettering2020-09-221-1/+1
|
* dissect-image: process /usr/ GPT partition typeLennart Poettering2020-09-191-1/+1
|
* nspawn: add --console=autopipe modeLennart Poettering2020-09-171-3/+9
| By default we'll run a container in --console=interactive or --console=read-only mode, depending on whether we are invoked on a tty or not, so that the container always gets a /dev/console allocated, i.e. is always suitable to run a full init system (as those typically expect a /dev/console to exist). With the new --console=autopipe mode we do something similar, but slightly different: when not invoked on a tty we'll use --console=pipe. This means, if you invoke some tool in a container with this you'll get full interactivity if you invoke it on a tty but things will also be very nicely pipeable. OTOH you cannot invoke a full init system like this, because you might or might not get a /dev/console this way... Prompted-by: #17070 (I named this "autopipe" rather than "auto" or so, since the default mode probably should be named "auto" one day if we add a name for it, and this is so similar to "auto" except that it uses pipes in the non-tty case).
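The mode selection described above can be summarised as a small decision function. The enum and function names below are hypothetical and only mirror the behaviour stated in the commit message; how "invoked on a tty" is detected is left to the caller (for instance via isatty(), which is an assumption here).

```c
#include <assert.h>
#include <stdbool.h>

/* What the user asked for (hypothetical names). */
typedef enum {
        REQUEST_DEFAULT,    /* no --console= given */
        REQUEST_AUTOPIPE,   /* --console=autopipe */
} ConsoleRequest;

/* What the container actually gets. */
typedef enum {
        CONSOLE_INTERACTIVE,
        CONSOLE_READ_ONLY,
        CONSOLE_PIPE,
} ConsoleMode;

static ConsoleMode resolve_console_mode(ConsoleRequest request, bool on_tty) {
        /* Both modes are interactive when invoked on a tty. */
        if (on_tty)
                return CONSOLE_INTERACTIVE;

        switch (request) {
        case REQUEST_AUTOPIPE:
                /* Pass stdin/stdout through; no /dev/console is allocated. */
                return CONSOLE_PIPE;
        case REQUEST_DEFAULT:
        default:
                /* Still allocate a /dev/console so an init system can run. */
                return CONSOLE_READ_ONLY;
        }
}
```

The only difference between the two requests is the non-tty branch, which is why autopipe suits piping tool output but not booting a full init system.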
* nspawn: don't become TTY controller just to undo it later againLennart Poettering2020-09-172-9/+19
| Instead of first becoming a controlling process of the payload pty as side effect of opening it (without O_NOCTTY), and then possibly dropping it again, let's do it cleanly and reverse the logic: let's open the pty without becoming its controller first. Only after everything went the way we wanted it to go become the controller explicitly. This has the benefit that the PID 1 stub process we run (as effect of --as-pid2) doesn't have to lose the tty explicitly, but can just continue running with things. And we explicitly make the tty controlling right before invoking the actual payload. In order to make sure everything works as expected, validate that the stub PID 1 in the container really has no controlling tty by issuing the TIOCNOTTY ioctl and expecting ENOTTY, and log about it. This shouldn't change behaviour much, it just makes things a bit cleaner, in particular as we'll not trigger SIGHUP on ourselves (since we are controller and session leader) due to TIOCNOTTY which we then have to explicitly ignore.
* nspawn: fix fd leak on failure pathLennart Poettering2020-09-171-1/+2
|
* nspawn: print log notice when we are invoked from a tty but in "pipe" modeLennart Poettering2020-09-171-2/+8
| | | | | | | If people do this then things are weird, and they should probably use --console=interactive (i.e. the default) instead. Prompted-by: #17070
* nspawn: check return of setsid()Lennart Poettering2020-09-171-1/+4
| | | | | Let's verify that everything works the way we expect it to work, hence check setsid() return code.
* dissect: wrap verity settings in new VeritySettings structureLennart Poettering2020-09-171-46/+41
| Just some refactoring: let's place the various verity related parameters in a common structure, and pass that around instead of the individual parameters. Also, let's load the PKCS#7 signature data when finding metadata right away, instead of delaying this until we need it. In all cases we call this there's not much time difference between the metadata finding and the loading, hence this simplifies things and makes sure root hash data and its signature is now always acquired together.
* nspawn: downgrade log level if the error will be ignoredYu Watanabe2020-09-101-54/+37
|