| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
| |
Lower priority of RUN, so that TMPFS and especially the mount flags given with
`TemporaryFileSystem=` are used.
This allows making `/run` private with drop-ins such as:
```
[Service]
BindReadOnlyPaths=/run/systemd:/run/systemd:norbind
TemporaryFileSystem=/run:nodev,noexec,nosuid,rw,size=32k,nr_inodes=10,mode=0755
```
|
|
|
|
|
|
|
|
|
| |
In case `/proc` is successfully mounted with pid tree subset only due to
`ProcSubset=pid`, the protective mounts for `ProtectKernelTunables=yes` and
`ProtectKernelLogs=yes` to non-pid `/proc` paths are failing because the paths
don't exist. But the pid only option may have failed gracefully (for example
because of ancient kernel), so let's try the mounts but it's not fatal if they
don't succeed.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
system extension is for
This should make things a bit more robust since it ensures system
extension can only applied to the right environments. Right now three
different "scopes" are defined:
1. "system" (for regular OS systems, after the initrd transition)
2. "initrd" (for sysext images that apply to the initrd environment)
3. "portable" (for sysext images that apply to portable images)
If not specified we imply a default of "system portable", i.e. any image
where the field is not specified is implicitly OK for application to OS
images and for portable services – but not for initrds.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
Previously the mkdir_label() family of calls was implemented in
src/shared/mkdir-label.c but its functions partly declared ins
src/shared/label.h and partly in src/basic/mkdir.h (!!). That's weird
(and wrong).
Let's clean this up, and add a proper mkdir-label.h matching the .c
file.
|
|
|
|
|
|
|
| |
Let's make all code in namespace.c robust towards weird umask. This
doesn't matter too much given that the parent dirs we deal here almost
certainly exist anyway, but let's clean this up anyway and make it fully
clean.
|
|
|
|
|
|
|
|
|
|
|
|
| |
Let's reset the umask during the whole namespace_setup() logic, so that
all our mkdir() + mknod() are not subjected to whatever umask might
currently be set.
This mostly moves the umask save/restore logic out of
mount_private_dev() and into the stack frame of namespace_setup() that
is further out.
Fixes #19899
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
TemporaryFileSystem=/var/lib
The /var/lib/private/foo -> /var/lib/foo symlink for StateDirectory and
DynamicUser is set up on the host filesystem, before the mount namespacing
is brought up. If an empty /var/lib is used, to ensure the service does not
see other services data, the symlink is then not available despite
/var/lib/private being set up as expected.
Make a list of symlinks that need to be set up, and create them after all
the namespaced filesystems have been created, but before any eventual
read-only switch is flipped.
|
| |
|
|
|
|
|
| |
This adds support for actually using embedded signature data from
partitions.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
LLVM 13 introduced `-Wunused-but-set-variable` diagnostic flag, which
trips over some intentionally set-but-not-used variables or variables
attached to cleanup handlers with side effects (`_cleanup_umask_`,
`_cleanup_(notify_on_cleanup)`, `_cleanup_(restore_sigsetp)`, etc.):
```
../src/basic/process-util.c:1257:46: error: variable 'saved_ssp' set but not used [-Werror,-Wunused-but-set-variable]
_cleanup_(restore_sigsetp) sigset_t *saved_ssp = NULL;
^
1 error generated.
```
|
|\
| |
| | |
Use new diskseq block device property
|
| |
| |
| |
| |
| |
| |
| |
| | |
DISKSEQ is a reliable way to find out if we missed a uevent or not, as
it's monotonically increasing. If we parse an event with a smaller or
no sequence number, we know we need to wait longer. If we parse an
event with a greater sequence number, we know we missed it and the
device was reused.
|
| | |
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In general we almost never hit those asserts in production code, so users see
them very rarely, if ever. But either way, we just need something that users
can pass to the developers.
We have quite a few of those asserts, and some have fairly nice messages, but
many are like "WTF?" or "???" or "unexpected something". The error that is
printed includes the file location, and function name. In almost all functions
there's at most one assert, so the function name alone is enough to identify
the failure for a developer. So we don't get much extra from the message, and
we might just as well drop them.
Dropping them makes our code a tiny bit smaller, and most importantly, improves
development experience by making it easy to insert such an assert in the code
without thinking how to phrase the argument.
|
|
|
|
| |
This reverts commit b33cd6b3eec52fc50c6c34d6f07a41cc6254c27f.
|
| |
|
|\
| |
| | |
core: reenable nosuid mount flag when NoNewPrivileges=yes
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts commit 1753d3021564671fba3d3196a84da657d15fb632.
Let's re-enable that feature now. As reported when the original commit
was merged, this causes some trouble on SELinux enabled systems. So,
in the subsequent commit, the feature will be disabled when SELinux is enabled.
But, anyway, this commit just re-enable that feature unconditionally.
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
match
When an ExtensionImages= extension-release metadata does not match, the
log messages (unless debug level is set) are pretty much incomprehensible:
systemd[463]: run-u11.service: Failed to set up mount namespacing: /run/systemd/unit-extensions/0: Stale file handle
systemd[463]: run-u11.service: Failed at step NAMESPACE spawning /usr/bin/echo: Stale file handle
Add an explicit log message if we get ESTALE from the dissect code, to
make it clear what's happening without needing to enable debugging:
systemd[463]: Failed to mount image /tmp/app3.raw, extension-release metadata does not match the lower layer's: ID=debian VERSION_ID=11 SYSEXT_LEVEL=11
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit d8e3c31bd8e307c8defc759424298175aa0f7001.
A poorly documented fact is that SELinux unfortunately uses nosuid mount flag
to specify that also a fundamental feature of SELinux, domain transitions, must
not be allowed either. While this could be mitigated case by case by changing
the SELinux policy to use `nosuid_transition`, such mitigations would probably
have to be added everywhere if systemd used automatic nosuid mount flags when
`NoNewPrivileges=yes` would be implied. This isn't very desirable from SELinux
policy point of view since also untrusted mounts in service's mount namespaces
could start triggering domain transitions.
Alternatively there could be directives to override this behavior globally or
for each service (for example, new directives `SUIDPaths=`/`NoSUIDPaths=` or
more generic mount flag applicators), but since there's little value of the
commit by itself (setting NNP already disables most setuid functionality), it's
simpler to revert the commit. Such new directives could be used to implement
the original goal.
|
|
|
|
|
|
| |
When `NoNewPrivileges=yes`, the service shouldn't have a need for any
setuid/setgid programs, so in case there will be a new mount namespace anyway,
mount the file systems with MS_NOSUID.
|
| |
|
|
|
|
|
|
|
|
|
|
| |
tools that deal with OS images
Let's enable this in all tools that intend to write to the OS images.
It's not conditionalized for now, as there already is conditionalization
in the existance or absence of the flag in the GPT partition table (and
it's opt-in), hence it should be OK to just enable this by default for
now if the flag is set.
|
|
|
|
|
|
|
|
|
| |
This tries to shorten the race of device reuse a bit more: let's ignore
udev database entries that are older than the time where we started to
use a loopback device.
This doesn't fix the whole loopback device raciness mess, but it makes
the race window a bit shorter.
|
|
|
|
|
|
|
|
|
|
|
| |
Let's drop all monitor uevent that were enqueued before we actually
started setting up the device.
This doesn't fix the race, but it makes the race window smaller: since
we cannot determine the uevent seqnum and the loopback attachment
atomically, there's a tiny window where uevents might be generated by
the device which we mistake for being associated with out use of the
loopback device.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Let's have one flag to request that when dissecting an image the
loopback device is made read-only, and another one to request that when
it is mounted to make it read-only. Previously both concepts were always
done read-only together.
(Of course, making the loopback device read-only but mounting it
read-write doesn't make too much sense, but the kernel should catch that
for us, no need to make restrictions from our side there)
Use-case for this: in systemd-repart we'd like to operate on images for
adding partitions. Thus we'd like to have the loopback device writable,
but if we read repart.d/ snippets from it, we want to do that read-only.
|
|
|
|
|
|
|
| |
With some versions of the compiler, the _cleanup_ attr makes it think
the variable might be freed/closed when uninitialized, even though it
cannot happen. The added cost is small enough to be worth the benefit,
and optimized builds will help reduce it even further.
|
|\
| |
| | |
dissect-image: support images without rootfs but with /usr partition + support simple partition versioning via strverscmp() on part label
|
| |
| |
| |
| |
| |
| | |
The function already has a ridiculous amount of paramaters, let's drop
one that is either not used at all or has a constant value and let's
pick it internally.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Previously, the flag did two things at once: enable support for using
generic partitions as root fs if there were only one/allow use of
partition-table-less images as root fs. And secondly, insist that there
was a rootfs, and fail if not. Let's split these two in two separate
options so that they can be used independently of each other.
There are cases where one wants to use one without the other (i.e. when
inspecting things with systemd-dissect tool it should be OK to do so
even if image has no root fs), and it's cleaner anyway.
|
|/
|
|
| |
Another batch of fixes (mostly) generated by Coccinelle.
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
|
| |
It causes a regression in certain running environments (networkd under LXC),
so avoid enabling for now.
Fixes #18795
Suggested-by: Topi Miettinen <toiwoton@gmail.com>
|
|\
| |
| | |
Trivial cleanups
|
| | |
|
|/ |
|
|
|
|
|
| |
Add support for overlaying images for services on top of their
root fs, using a read-only overlay.
|
|
|
|
|
| |
The setup_namespace code to apply mounts is a big if block that
keeps growing, so refactor it in a separate function.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some paths (eg: mount_tmpfs) simply assumed that prefixing always
happens and it always stores the original path in path_const, and
the prefixed path in path_malloc.
But if a MountEntry is set up in a helper function and thus uses
only _malloc struct members, this assumption doesn't hold and there's
a crash.
Refactor so that prefixing is done with a helper which stores the
original path in a separate struct member, and accessing it also
uses a helper which does the right thing.
|
|
|
|
| |
ENOENT did not cause an image mount to be skipped, fix it
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implement directives `NoExecPaths=` and `ExecPaths=` to control `MS_NOEXEC`
mount flag for the file system tree. This can be used to implement file system
W^X policies, and for example with allow-listing mode (NoExecPaths=/) a
compromised service would not be able to execute a shell, if that was not
explicitly allowed.
Example:
[Service]
NoExecPaths=/
ExecPaths=/usr/bin/daemon /usr/lib64 /usr/lib
Closes: #17942.
|
|
|
|
| |
Also use _cleanup_free_ in one more place.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously if people enabled RootDirectory=/RootImage= and NotifyAccess=
together, things wouldn't work, they'd have to explicitly add
BindReadOnlyPaths=/run/systemd/notify too.
Let's make this implicit. Since both options are opt-in, if people use
them together it would be pointless not also defining the
BindReadOnlyPaths= entry, in which case we can just do it automatically.
See: #18051
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Allow to setup new bind mounts for a service at runtime (via either
DBUS or a new 'systemctl bind' verb) with a new helper that forks into
the unit's mount namespace.
Add a new integration test to cover this.
Useful for zero-downtime addition to services that are running inside
mount namespaces, especially when using RootImage/RootDirectory.
If a service runs with a read-only root, a tmpfs is added on /run
to ensure we can create the airlock directory for incoming mounts
under /run/host/incoming.
|
|
|
|
|
|
| |
We need a writable /run for most operations, but in case a read-only
RootImage (or similar) is used, by default there's no additional
tmpfs mount on /run. Change this behaviour and document it.
|