path: root/src/core/namespace.c
Commit message [Author, Age, Files, Lines]
* namespace: allow overriding /run with a TemporaryFileSystem= [Topi Miettinen, 2021-12-11, files: 1, lines: -1/+1]
|   Lower priority of RUN, so that TMPFS and especially the mount flags given with `TemporaryFileSystem=` are used. This allows making `/run` private with drop-ins such as:
|
|   ```
|   [Service]
|   BindReadOnlyPaths=/run/systemd:/run/systemd:norbind
|   TemporaryFileSystem=/run:nodev,noexec,nosuid,rw,size=32k,nr_inodes=10,mode=0755
|   ```
* namespace: allow ProcSubset=pid with some ProtectKernel options [Topi Miettinen, 2021-11-27, files: 1, lines: -8/+34]
|   In case `/proc` is successfully mounted with the pid tree subset only due to `ProcSubset=pid`, the protective mounts for `ProtectKernelTunables=yes` and `ProtectKernelLogs=yes` to non-pid `/proc` paths fail because the paths don't exist. But the pid-only option may have failed gracefully (for example because of an ancient kernel), so let's try the mounts, but it's not fatal if they don't succeed.
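The "try the mount, but don't treat failure as fatal" idea above can be sketched roughly as follows. This is only an illustration, not the code in namespace.c: `sketch_filter_mount_result()` and its `best_effort` parameter are hypothetical names, and it assumes a missing target path is reported as a negative errno-style `-ENOENT`.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>

/* Hypothetical helper: r is a negative errno-style result (0 on success)
 * from an attempted protective mount. For best-effort entries a missing
 * target path is swallowed; every other error is passed through. */
static int sketch_filter_mount_result(int r, bool best_effort) {
        assert(r <= 0);

        if (r == -ENOENT && best_effort)
                return 0;        /* path hidden by the pid-only /proc: skip it */

        return r;                /* success, or an error the caller must handle */
}
```

With this shape, a caller would mark the ProtectKernelTunables=/ProtectKernelLogs= entries under `/proc` as best-effort only when ProcSubset=pid is in effect, and leave all other entries strict.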
* extension-release.d/: add a new field SYSEXT_SCOPE= for clarifying what a system extension is for [Lennart Poettering, 2021-11-23, files: 1, lines: -1/+1]
|   This should make things a bit more robust since it ensures system extensions can only be applied to the right environments. Right now three different "scopes" are defined:
|
|   1. "system" (for regular OS systems, after the initrd transition)
|   2. "initrd" (for sysext images that apply to the initrd environment)
|   3. "portable" (for sysext images that apply to portable images)
|
|   If not specified we imply a default of "system portable", i.e. any image where the field is not specified is implicitly OK for application to OS images and for portable services – but not for initrds.
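The scope rule described above, including the implied default of "system portable", can be sketched like this. It is a minimal illustration under stated assumptions: `sketch_sysext_scope_matches()` is a hypothetical name, the field is assumed to be a whitespace-separated word list, and `field == NULL` stands for "SYSEXT_SCOPE= not specified".

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical check: does the SYSEXT_SCOPE= value permit the scope we
 * are about to apply the extension to? */
static bool sketch_sysext_scope_matches(const char *field, const char *wanted) {
        const char *p = field ? field : "system portable"; /* implied default */
        size_t n;

        assert(wanted);
        n = strlen(wanted);

        for (;;) {
                size_t l;

                p += strspn(p, " \t");          /* skip separators */
                if (*p == '\0')
                        return false;           /* no word matched */

                l = strcspn(p, " \t");          /* length of the current word */
                if (l == n && strncmp(p, wanted, n) == 0)
                        return true;

                p += l;
        }
}
```

An image without the field is thus accepted for "system" and "portable" but rejected for "initrd", which matches the default stated in the commit message.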
* tree-wide: port various places over to open_mkdir_at() [Lennart Poettering, 2021-11-17, files: 1, lines: -7/+10]
|
* shared: clean up mkdir.h/label.h situation [Lennart Poettering, 2021-11-16, files: 1, lines: -1/+1]
|   Previously the mkdir_label() family of calls was implemented in src/shared/mkdir-label.c but its functions were partly declared in src/shared/label.h and partly in src/basic/mkdir.h (!!). That's weird (and wrong). Let's clean this up, and add a proper mkdir-label.h matching the .c file.
* namespace: make tmp dir handling code independent of umask too [Lennart Poettering, 2021-11-12, files: 1, lines: -5/+7]
|   Let's make all code in namespace.c robust towards a weird umask. This doesn't matter too much given that the parent dirs we deal with here almost certainly exist already, but let's clean this up anyway and make it fully clean.
* namespace: make whole namespace_setup() work regardless of configured umask [Lennart Poettering, 2021-11-12, files: 1, lines: -3/+4]
|   Let's reset the umask during the whole namespace_setup() logic, so that all our mkdir() + mknod() are not subjected to whatever umask might currently be set.
|
|   This mostly moves the umask save/restore logic out of mount_private_dev() and into the stack frame of namespace_setup() that is further out.
|
|   Fixes #19899
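The save/reset/restore pattern this commit moves outward can be sketched as below. This is not the namespace.c code: `sketch_mkdir_ignoring_umask()` and `sketch_mode_under_umask()` are hypothetical helpers, and the demo assumes a writable `/tmp`.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/stat.h>
#include <sys/types.h>
#include <unistd.h>

/* Create a directory with exactly the requested mode, whatever the
 * caller's umask is: clear the umask, do the work, put it back. */
static int sketch_mkdir_ignoring_umask(const char *path, mode_t mode) {
        mode_t saved;
        int r;

        assert(path);

        saved = umask(0000);     /* nothing is masked out while we create the node */
        r = mkdir(path, mode);
        umask(saved);            /* umask() always succeeds */

        return r;
}

/* Demo: create a directory while a hostile umask is set and return the
 * permission bits it ended up with, or -1 on failure. */
static int sketch_mode_under_umask(mode_t hostile, mode_t mode) {
        char base[] = "/tmp/umask-sketch-XXXXXX", sub[64];
        struct stat st;
        mode_t saved;
        int result = -1;

        if (!mkdtemp(base))      /* done before the hostile umask is applied */
                return -1;
        snprintf(sub, sizeof(sub), "%s/d", base);

        saved = umask(hostile);
        if (sketch_mkdir_ignoring_umask(sub, mode) == 0 && stat(sub, &st) == 0)
                result = (int) (st.st_mode & 0777);
        umask(saved);

        rmdir(sub);
        rmdir(base);
        return result;
}
```

Doing the reset once in the outermost function, as the commit describes, means every nested mkdir()/mknod() benefits without each helper repeating the save/restore pair.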
* namespace: rebreak a few comments [Lennart Poettering, 2021-11-12, files: 1, lines: -16/+14]
|
* core: make DynamicUser=1 and StateDirectory= work with TemporaryFileSystem=/var/lib [Luca Boccassi, 2021-10-27, files: 1, lines: -1/+35]
|   The /var/lib/private/foo -> /var/lib/foo symlink for StateDirectory and DynamicUser is set up on the host filesystem, before the mount namespacing is brought up. If an empty /var/lib is used, to ensure the service does not see other services' data, the symlink is then not available despite /var/lib/private being set up as expected.
|
|   Make a list of symlinks that need to be set up, and create them after all the namespaced filesystems have been created, but before any eventual read-only switch is flipped.
* basic: split out chase_symlinks() from fs-util.[ch] → chase-symlinks.[ch] [Lennart Poettering, 2021-10-05, files: 1, lines: -1/+1]
|
* dissect-image: load embedded verity signature info from image [Lennart Poettering, 2021-09-28, files: 1, lines: -0/+7]
|   This adds support for actually using embedded signature data from partitions.
* tree-wide: mark set-but-not-used variables as unused to make LLVM happy [Frantisek Sumsal, 2021-09-15, files: 1, lines: -1/+1]
|   LLVM 13 introduced `-Wunused-but-set-variable` diagnostic flag, which trips over some intentionally set-but-not-used variables or variables attached to cleanup handlers with side effects (`_cleanup_umask_`, `_cleanup_(notify_on_cleanup)`, `_cleanup_(restore_sigsetp)`, etc.):
|
|   ```
|   ../src/basic/process-util.c:1257:46: error: variable 'saved_ssp' set but not used [-Werror,-Wunused-but-set-variable]
|           _cleanup_(restore_sigsetp) sigset_t *saved_ssp = NULL;
|                                                ^
|   1 error generated.
|   ```
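A minimal sketch of the pattern that triggers the warning, and of the kind of annotation that silences it, assuming a GCC- or Clang-compatible compiler. The `_unused_` and `_cleanup_()` macros are redefined locally so the example is self-contained; `sketch_restore_umask()` and `sketch_umask_inside_scope()` are hypothetical names.

```c
#include <assert.h>
#include <sys/stat.h>
#include <sys/types.h>

/* Local stand-ins for the attribute macros named in the commit message. */
#define _unused_ __attribute__((__unused__))
#define _cleanup_(f) __attribute__((__cleanup__(f)))

/* Cleanup handler with a side effect: restores the saved umask. */
static void sketch_restore_umask(mode_t *saved) {
        umask(*saved);
}

/* 'saved' is set but never read by name; it exists only so the cleanup
 * handler runs at scope exit. _unused_ tells the compiler that this is
 * intentional. */
static mode_t sketch_umask_inside_scope(void) {
        _unused_ _cleanup_(sketch_restore_umask) mode_t saved = umask(0000);

        /* Report the umask in effect inside the scope: umask() returns
         * the previous value, and we set it to the same 0000 again. */
        return umask(0000);
}
```

The variable cannot simply be removed, since the cleanup attribute needs an object to attach to; marking it unused keeps the side effect and drops the diagnostic.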
* Merge pull request #20257 from bluca/seqno [Luca Boccassi, 2021-08-31, files: 1, lines: -0/+1]
|\
| |   Use new diskseq block device property
| * dissect: use DISKSEQ when waiting for block devices [Luca Boccassi, 2021-07-28, files: 1, lines: -0/+1]
| |   DISKSEQ is a reliable way to find out if we missed a uevent or not, as it's monotonically increasing. If we parse an event with a smaller or no sequence number, we know we need to wait longer. If we parse an event with a greater sequence number, we know we missed it and the device was reused.
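The three-way decision described above can be written down as a small comparison. This is a sketch, not the dissect code: the enum, `sketch_diskseq_check()` and the `has_seq` parameter are hypothetical, and treating an equal sequence number as "this is our device" is an assumption implied by, but not spelled out in, the commit message.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

typedef enum SketchDiskseqVerdict {
        SKETCH_DISKSEQ_WAIT,     /* event predates our attachment: keep waiting */
        SKETCH_DISKSEQ_MATCH,    /* event belongs to our use of the device */
        SKETCH_DISKSEQ_MISSED,   /* device was already reused: we missed our event */
} SketchDiskseqVerdict;

/* expected: sequence number recorded when we attached the device.
 * has_seq/seen: whether the parsed uevent carried a number, and its value. */
static SketchDiskseqVerdict sketch_diskseq_check(uint64_t expected, bool has_seq, uint64_t seen) {
        if (!has_seq || seen < expected)
                return SKETCH_DISKSEQ_WAIT;

        if (seen > expected)
                return SKETCH_DISKSEQ_MISSED;

        return SKETCH_DISKSEQ_MATCH;
}
```

Because the counter only ever increases, the comparison needs no timestamps or heuristics, which is what makes it more reliable than filtering uevents by age.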
* | tree-wide: port everything over to new sd-id128 compound literal bliss [Lennart Poettering, 2021-08-20, files: 1, lines: -2/+1]
| |
* | Drop the text argument from assert_not_reached() [Zbigniew Jędrzejewski-Szmek, 2021-08-03, files: 1, lines: -3/+3]
|/
|   In general we almost never hit those asserts in production code, so users see them very rarely, if ever. But either way, we just need something that users can pass to the developers.
|
|   We have quite a few of those asserts, and some have fairly nice messages, but many are like "WTF?" or "???" or "unexpected something". The error that is printed includes the file location, and function name. In almost all functions there's at most one assert, so the function name alone is enough to identify the failure for a developer. So we don't get much extra from the message, and we might just as well drop them.
|
|   Dropping them makes our code a tiny bit smaller, and most importantly, improves development experience by making it easy to insert such an assert in the code without thinking how to phrase the argument.
* Revert "core: do not set noexec on sysfs/procfs" [Lennart Poettering, 2021-07-01, files: 1, lines: -1/+1]
|   This reverts commit b33cd6b3eec52fc50c6c34d6f07a41cc6254c27f.
* core/namespace: drop unnecessary initializations [Yu Watanabe, 2021-06-26, files: 1, lines: -6/+6]
|
* Merge pull request #20023 from yuwata/re-enable-nosuid-mount-flag [Zbigniew Jędrzejewski-Szmek, 2021-06-25, files: 1, lines: -0/+32]
|\
| |   core: reenable nosuid mount flag when NoNewPrivileges=yes
| * Revert "Revert "Mount all fs nosuid when NoNewPrivileges=yes"" [Yu Watanabe, 2021-06-25, files: 1, lines: -0/+32]
| |   This reverts commit 1753d3021564671fba3d3196a84da657d15fb632.
| |
| |   Let's re-enable that feature now. As reported when the original commit was merged, this causes some trouble on SELinux-enabled systems. So, in the subsequent commit, the feature will be disabled when SELinux is enabled. But, anyway, this commit just re-enables that feature unconditionally.
* | ExtensionImages: log explicit error when extension-release metadata does not match [Luca Boccassi, 2021-06-25, files: 1, lines: -0/+9]
|/
|   When an ExtensionImages= extension-release metadata does not match, the log messages (unless debug level is set) are pretty much incomprehensible:
|
|       systemd[463]: run-u11.service: Failed to set up mount namespacing: /run/systemd/unit-extensions/0: Stale file handle
|       systemd[463]: run-u11.service: Failed at step NAMESPACE spawning /usr/bin/echo: Stale file handle
|
|   Add an explicit log message if we get ESTALE from the dissect code, to make it clear what's happening without needing to enable debugging:
|
|       systemd[463]: Failed to mount image /tmp/app3.raw, extension-release metadata does not match the lower layer's: ID=debian VERSION_ID=11 SYSEXT_LEVEL=11
* Revert "Mount all fs nosuid when NoNewPrivileges=yes" [Topi Miettinen, 2021-06-14, files: 1, lines: -32/+0]
|   This reverts commit d8e3c31bd8e307c8defc759424298175aa0f7001.
|
|   A poorly documented fact is that SELinux unfortunately uses the nosuid mount flag to specify that a fundamental feature of SELinux, domain transitions, must not be allowed either. While this could be mitigated case by case by changing the SELinux policy to use `nosuid_transition`, such mitigations would probably have to be added everywhere if systemd applied nosuid mount flags automatically whenever `NoNewPrivileges=yes` is implied. This isn't very desirable from the SELinux policy point of view, since untrusted mounts in a service's mount namespace could then also start triggering domain transitions.
|
|   Alternatively there could be directives to override this behavior globally or for each service (for example, new directives `SUIDPaths=`/`NoSUIDPaths=` or more generic mount flag applicators), but since there's little value in the commit by itself (setting NNP already disables most setuid functionality), it's simpler to revert the commit. Such new directives could be used to implement the original goal.
* Mount all fs nosuid when NoNewPrivileges=yes [Topi Miettinen, 2021-05-26, files: 1, lines: -0/+32]
|   When `NoNewPrivileges=yes`, the service shouldn't have a need for any setuid/setgid programs, so in case there will be a new mount namespace anyway, mount the file systems with MS_NOSUID.
* dissect-image: add support for optionally mounting images with idmapping on [Lennart Poettering, 2021-05-07, files: 1, lines: -1/+1]
|
* tree-wide: enable automatic growing of file systems in images in various tools that deal with OS images [Lennart Poettering, 2021-04-23, files: 1, lines: -1/+2]
|   Let's enable this in all tools that intend to write to the OS images. It's not conditionalized for now, as there already is conditionalization in the existence or absence of the flag in the GPT partition table (and it's opt-in), hence it should be OK to just enable this by default for now if the flag is set.
* dissect: ignore udev database entries from before the loopback attachment [Lennart Poettering, 2021-04-20, files: 1, lines: -0/+1]
|   This tries to shorten the race of device reuse a bit more: let's ignore udev database entries that are older than the time where we started to use a loopback device.
|
|   This doesn't fix the whole loopback device raciness mess, but it makes the race window a bit shorter.
* dissect: ignore old uevents when waiting for loopback partition scan [Lennart Poettering, 2021-04-20, files: 1, lines: -0/+1]
|   Let's drop all monitor uevents that were enqueued before we actually started setting up the device.
|
|   This doesn't fix the race, but it makes the race window smaller: since we cannot determine the uevent seqnum and the loopback attachment atomically, there's a tiny window where uevents might be generated by the device which we mistake for being associated with our use of the loopback device.
* dissect: split read-only flag into two [Lennart Poettering, 2021-04-19, files: 1, lines: -1/+1]
|   Let's have one flag to request that when dissecting an image the loopback device is made read-only, and another one to request that when it is mounted to make it read-only. Previously both concepts were always done read-only together.
|
|   (Of course, making the loopback device read-only but mounting it read-write doesn't make too much sense, but the kernel should catch that for us, no need to make restrictions from our side there)
|
|   Use-case for this: in systemd-repart we'd like to operate on images for adding partitions. Thus we'd like to have the loopback device writable, but if we read repart.d/ snippets from it, we want to do that read-only.
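A small sketch of what splitting one flag into two independent bits buys. The flag names and helper functions below are hypothetical and chosen for illustration only; they are not claimed to match the identifiers or bit values in dissect-image.h.

```c
#include <assert.h>
#include <stdbool.h>

typedef enum SketchDissectFlags {
        SKETCH_DEVICE_READ_ONLY = 1 << 0, /* attach the loopback device read-only */
        SKETCH_MOUNT_READ_ONLY  = 1 << 1, /* mount the file systems read-only */
        /* the old combined behaviour is simply both bits together */
        SKETCH_READ_ONLY        = SKETCH_DEVICE_READ_ONLY | SKETCH_MOUNT_READ_ONLY,
} SketchDissectFlags;

static_assert((SKETCH_DEVICE_READ_ONLY & SKETCH_MOUNT_READ_ONLY) == 0,
              "the two bits must be independent");

static bool sketch_device_writable(SketchDissectFlags f) {
        return !(f & SKETCH_DEVICE_READ_ONLY);
}

static bool sketch_mount_writable(SketchDissectFlags f) {
        return !(f & SKETCH_MOUNT_READ_ONLY);
}
```

The systemd-repart use case from the commit message then corresponds to passing only the mount bit: the device stays writable for adding partitions while the snippets are read through a read-only mount.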
* tree-wide: avoid uninitialized warning on _cleanup_ variables [Luca Boccassi, 2021-04-14, files: 1, lines: -2/+2]
|   With some versions of the compiler, the _cleanup_ attr makes it think the variable might be freed/closed when uninitialized, even though it cannot happen. The added cost is small enough to be worth the benefit, and optimized builds will help reduce it even further.
* Merge pull request #18958 from poettering/dissect-no-root [Zbigniew Jędrzejewski-Szmek, 2021-03-31, files: 1, lines: -3/+7]
|\
| |   dissect-image: support images without rootfs but with /usr partition + support simple partition versioning via strverscmp() on part label
| * execute: drop DissectImageFlags parameter from namespace_setup() [Lennart Poettering, 2021-03-16, files: 1, lines: -3/+7]
| |   The function already has a ridiculous amount of parameters, let's drop one that is either not used at all or has a constant value and let's pick it internally.
| * dissect-image: split DISSECT_IMAGE_REQUIRE_ROOT in two [Lennart Poettering, 2021-03-16, files: 1, lines: -1/+1]
| |   Previously, the flag did two things at once: enable support for using generic partitions as root fs if there were only one/allow use of partition-table-less images as root fs. And secondly, insist that there was a rootfs, and fail if not.
| |
| |   Let's split these two in two separate options so that they can be used independently of each other. There are cases where one wants to use one without the other (i.e. when inspecting things with systemd-dissect tool it should be OK to do so even if image has no root fs), and it's cleaner anyway.
* | tree-wide: coccinelle fixes [Frantisek Sumsal, 2021-03-18, files: 1, lines: -4/+2]
|/
|   Another batch of fixes (mostly) generated by Coccinelle.
* Remount /dev/mqueue in unshared mount namespace for PrivateIPC [Xℹ Ruoyao, 2021-03-03, files: 1, lines: -1/+33]
|
* Refactor network namespace specific functions in generic helpers [Xℹ Ruoyao, 2021-03-03, files: 1, lines: -35/+41]
|
* tree-wide: fix typo [Yu Watanabe, 2021-03-02, files: 1, lines: -1/+1]
|
* core: do not set noexec on sysfs/procfs [Luca Boccassi, 2021-02-26, files: 1, lines: -1/+1]
|   It causes a regression in certain running environments (networkd under LXC), so avoid enabling for now.
|
|   Fixes #18795
|   Suggested-by: Topi Miettinen <toiwoton@gmail.com>
* Merge pull request #18797 from keszybz/trivial-cleanups [Luca Boccassi, 2021-02-25, files: 1, lines: -9/+6]
|\
| |   Trivial cleanups
| * core/namespace: inline more iterator variable declarations [Zbigniew Jędrzejewski-Szmek, 2021-02-25, files: 1, lines: -9/+6]
| |
* | namespace: return correct error code [Lennart Poettering, 2021-02-25, files: 1, lines: -1/+4]
|/
* Add ExtensionImages directive to form overlays [Luca Boccassi, 2021-02-23, files: 1, lines: -9/+187]
|   Add support for overlaying images for services on top of their root fs, using a read-only overlay.
* core/namespace: refactor applying mounts in a separate function [Luca Boccassi, 2021-02-23, files: 1, lines: -93/+111]
|   The setup_namespace code to apply mounts is a big if block that keeps growing, so refactor it into a separate function.
* namespace: store and use original MountEntry paths when prefixing [Luca Boccassi, 2021-02-16, files: 1, lines: -5/+29]
|   Some paths (eg: mount_tmpfs) simply assumed that prefixing always happens and it always stores the original path in path_const, and the prefixed path in path_malloc. But if a MountEntry is set up in a helper function and thus uses only _malloc struct members, this assumption doesn't hold and there's a crash.
|
|   Refactor so that prefixing is done with a helper which stores the original path in a separate struct member, and accessing it also uses a helper which does the right thing.
* MountImages: actually support optional paths [Luca Boccassi, 2021-02-16, files: 1, lines: -0/+2]
|   ENOENT did not cause an image mount to be skipped, fix it
* New directives NoExecPaths= ExecPaths= [Topi Miettinen, 2021-01-29, files: 1, lines: -3/+91]
|   Implement directives `NoExecPaths=` and `ExecPaths=` to control `MS_NOEXEC` mount flag for the file system tree. This can be used to implement file system W^X policies, and for example with allow-listing mode (NoExecPaths=/) a compromised service would not be able to execute a shell, if that was not explicitly allowed.
|
|   Example:
|
|       [Service]
|       NoExecPaths=/
|       ExecPaths=/usr/bin/daemon /usr/lib64 /usr/lib
|
|   Closes: #17942.
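To make the allow-listing example concrete, here is a rough model of how such a policy could be evaluated for a single path. It is a sketch under a loudly stated assumption: that the most specific (longest) listed prefix decides. The real implementation works by applying mount flags to bind mounts rather than by checking paths, and `sketch_prefix_len()` and `sketch_path_is_noexec()` are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Returns strlen(prefix) if prefix covers path on a path-component
 * boundary ("/usr/lib" covers "/usr/lib/x" but not "/usr/libexec"),
 * otherwise 0. */
static size_t sketch_prefix_len(const char *path, const char *prefix) {
        size_t l = strlen(prefix);

        if (l == 0 || strncmp(path, prefix, l) != 0)
                return 0;
        if (prefix[l - 1] == '/' || path[l] == '\0' || path[l] == '/')
                return l;
        return 0;
}

/* Both lists are NULL-terminated arrays of absolute paths. Assumption:
 * the longest matching prefix wins; with no match, or on a tie,
 * execution stays allowed. */
static bool sketch_path_is_noexec(const char *path,
                                  const char *const *noexec_paths,
                                  const char *const *exec_paths) {
        size_t best_noexec = 0, best_exec = 0;

        assert(path);

        for (; noexec_paths && *noexec_paths; noexec_paths++) {
                size_t l = sketch_prefix_len(path, *noexec_paths);
                if (l > best_noexec)
                        best_noexec = l;
        }
        for (; exec_paths && *exec_paths; exec_paths++) {
                size_t l = sketch_prefix_len(path, *exec_paths);
                if (l > best_exec)
                        best_exec = l;
        }

        return best_noexec > best_exec;
}
```

Under this model, with the lists from the example, `/usr/bin/daemon` and anything below `/usr/lib64` or `/usr/lib` remain executable, while a shell at `/usr/bin/sh` falls under `NoExecPaths=/`.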
* treewide: tighten variable scope in loops (#18372) [Susant Sahani, 2021-01-27, files: 1, lines: -18/+8]
|   Also use _cleanup_free_ in one more place.
* dissect: split verity_dissect_and_mount helper out for reuse [Luca Boccassi, 2021-01-21, files: 1, lines: -64/+2]
|
* core: make NotifyAccess= in combination with RootDirectory=/RootImage= work [Lennart Poettering, 2021-01-20, files: 1, lines: -4/+16]
|   Previously if people enabled RootDirectory=/RootImage= and NotifyAccess= together, things wouldn't work, they'd have to explicitly add BindReadOnlyPaths=/run/systemd/notify too. Let's make this implicit. Since both options are opt-in, if people use them together it would be pointless not also defining the BindReadOnlyPaths= entry, in which case we can just do it automatically.
|
|   See: #18051
* core: add DBUS method to bind mount new nodes without service restart [Luca Boccassi, 2021-01-18, files: 1, lines: -3/+34]
|   Allow setting up new bind mounts for a service at runtime (via either DBUS or a new 'systemctl bind' verb) with a new helper that forks into the unit's mount namespace. Add a new integration test to cover this.
|
|   Useful for zero-downtime addition to services that are running inside mount namespaces, especially when using RootImage/RootDirectory.
|
|   If a service runs with a read-only root, a tmpfs is added on /run to ensure we can create the airlock directory for incoming mounts under /run/host/incoming.
* MountAPIVFS: always mount a tmpfs on /run [Luca Boccassi, 2021-01-18, files: 1, lines: -1/+20]
|   We need a writable /run for most operations, but in case a read-only RootImage (or similar) is used, by default there's no additional tmpfs mount on /run. Change this behaviour and document it.