author | Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> | 2024-02-23 09:48:47 +0100 |
---|---|---|
committer | Zbigniew Jędrzejewski-Szmek <zbyszek@in.waw.pl> | 2024-02-23 09:48:47 +0100 |
commit | 8e3fee33afed8cb6a0945288f4773363a4d68912 (patch) | |
tree | 2dad2a0965eb44f68fef9ad08dc79d1cd25494ae /docs/_interfaces | |
parent | Merge pull request #31445 from keszybz/slow-tests (diff) | |
download | systemd-8e3fee33afed8cb6a0945288f4773363a4d68912.tar.xz systemd-8e3fee33afed8cb6a0945288f4773363a4d68912.zip |
Revert "docs: use collections to structure the data"
This reverts commit 5e8ff010a1436d33bbf3c108335af6e0b4ff7a2a.
This broke all the URLs, we can't have that. (And actually, we probably don't
_want_ to make the change either. It's nicer to have all the pages in one
directory, so one doesn't have to figure out to which collection the page
belongs.)
Diffstat (limited to 'docs/_interfaces')
-rw-r--r-- | docs/_interfaces/BLOCK_DEVICE_LOCKING.md | 243 | ||||
-rw-r--r-- | docs/_interfaces/CGROUP_DELEGATION.md | 502 | ||||
-rw-r--r-- | docs/_interfaces/CONTAINER_INTERFACE.md | 421 | ||||
-rw-r--r-- | docs/_interfaces/ELF_PACKAGE_METADATA.md | 105 | ||||
-rw-r--r-- | docs/_interfaces/ENVIRONMENT.md | 644 | ||||
-rw-r--r-- | docs/_interfaces/FILE_DESCRIPTOR_STORE.md | 213 | ||||
-rw-r--r-- | docs/_interfaces/INITRD_INTERFACE.md | 70 | ||||
-rw-r--r-- | docs/_interfaces/JOURNAL_EXPORT_FORMATS.md | 158 | ||||
-rw-r--r-- | docs/_interfaces/JOURNAL_FILE_FORMAT.md | 755 | ||||
-rw-r--r-- | docs/_interfaces/JOURNAL_NATIVE_PROTOCOL.md | 191 | ||||
-rw-r--r-- | docs/_interfaces/MEMORY_PRESSURE.md | 238 | ||||
-rw-r--r-- | docs/_interfaces/PASSWORD_AGENTS.md | 41 | ||||
-rw-r--r-- | docs/_interfaces/PORTABILITY_AND_STABILITY.md | 171 | ||||
-rw-r--r-- | docs/_interfaces/ROOT_STORAGE_DAEMONS.md | 194 | ||||
-rw-r--r-- | docs/_interfaces/TEMPORARY_DIRECTORIES.md | 220 | ||||
-rw-r--r-- | docs/_interfaces/TRANSIENT-SETTINGS.md | 511 |
16 files changed, 0 insertions, 4677 deletions
diff --git a/docs/_interfaces/BLOCK_DEVICE_LOCKING.md b/docs/_interfaces/BLOCK_DEVICE_LOCKING.md deleted file mode 100644 index a6e3374bc7..0000000000 --- a/docs/_interfaces/BLOCK_DEVICE_LOCKING.md +++ /dev/null @@ -1,243 +0,0 @@ ---- -title: Locking Block Device Access -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# Locking Block Device Access - -*TL;DR: Use BSD file locks -[(`flock(2)`)](https://man7.org/linux/man-pages/man2/flock.2.html) on block -device nodes to synchronize access for partitioning and file system formatting -tools.* - -`systemd-udevd` probes all block devices showing up for file system superblock -and partition table information (utilizing `libblkid`). If another program -concurrently modifies a superblock or partition table this probing might be -affected, which is bad in itself, but also might in turn result in undesired -effects in programs subscribing to `udev` events. - -Applications manipulating a block device can temporarily stop `systemd-udevd` -from processing rules on it — and thus bar it from probing the device — by -taking a BSD file lock on the block device node. Specifically, whenever -`systemd-udevd` starts processing a block device it takes a `LOCK_SH|LOCK_NB` -lock using [`flock(2)`](https://man7.org/linux/man-pages/man2/flock.2.html) on -the main block device (i.e. never on any partition block device, but on the -device the partition belongs to). If this lock cannot be taken (i.e. `flock()` -returns `EAGAIN`), it refrains from processing the device. If it manages to take -the lock it is kept for the entire time the device is processed. - -Note that `systemd-udevd` also watches all block device nodes it manages for -`inotify()` `IN_CLOSE_WRITE` events: whenever such an event is seen, this is -used as trigger to re-run the rule-set for the device. 
- -These two concepts allow tools such as disk partitioners or file system -formatting tools to safely and easily take exclusive ownership of a block -device while operating: before starting work on the block device, they should -take an `LOCK_EX` lock on it. This has two effects: first of all, in case -`systemd-udevd` is still processing the device the tool will wait for it to -finish. Second, after the lock is taken, it can be sure that `systemd-udevd` -will refrain from processing the block device, and thus all other client -applications subscribed to it won't get device notifications from potentially -half-written data either. After the operation is complete the -partitioner/formatter can simply close the device node. This has two effects: -it implicitly releases the lock, so that `systemd-udevd` can process events on -the device node again. Secondly, it results an `IN_CLOSE_WRITE` event, which -causes `systemd-udevd` to immediately re-process the device — seeing all -changes the tool made — and notify subscribed clients about it. - -Ideally, `systemd-udevd` would explicitly watch block devices for `LOCK_EX` -locks being released. Such monitoring is not supported on Linux however, which -is why it watches for `IN_CLOSE_WRITE` instead, i.e. for `close()` calls to -writable file descriptors referring to the block device. In almost all cases, -the difference between these two events does not matter much, as any locks -taken are implicitly released by `close()`. However, it should be noted that if -an application unlocks a device after completing its work without closing it, -i.e. while keeping the file descriptor open for further, longer time, then -`systemd-udevd` will not notice this and not retrigger and thus reprobe the -device. - -Besides synchronizing block device access between `systemd-udevd` and such -tools this scheme may also be used to synchronize access between those tools -themselves. However, do note that `flock()` locks are advisory only. 
This means -if one tool honours this scheme and another tool does not, they will of course -not be synchronized properly, and might interfere with each other's work. - -Note that the file locks follow the usual access semantics of BSD locks: since -`systemd-udevd` never writes to such block devices it only takes a `LOCK_SH` -*shared* lock. A program intending to make changes to the block device should -take a `LOCK_EX` *exclusive* lock instead. For further details, see the -`flock(2)` man page. - -And please keep in mind: BSD file locks (`flock()`) and POSIX file locks -(`lockf()`, `F_SETLK`, …) are different concepts, and in their effect -orthogonal. The scheme discussed above uses the former and not the latter, -because these types of locks more closely match the required semantics. - -If multiple devices are to be locked at the same time (for example in order to -format a RAID file system), the devices should be locked in the order of -the device nodes' major numbers (primary ordering key, ascending) and minor -numbers (secondary ordering key, ditto), in order to avoid ABBA locking issues -between subsystems. - -Note that the locks should only be taken while the device is repartitioned, -file systems formatted or `dd`'ed in, and similar cases that -apply/remove/change superblocks/partition information. They should not be held -during normal operation, i.e. while file systems on it are mounted for -application use. - -The [`udevadm -lock`](https://www.freedesktop.org/software/systemd/man/udevadm.html) command -is provided to lock block devices following this scheme from the command line, -for the use in scripts and similar. (Note though that it's typically preferable -to use native support for block device locking in tools where that's -available.) 
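The ordering rule for locking several devices at once (major number ascending, then minor number ascending) can be sketched as a plain sort over device numbers. This is only an illustration under assumptions: `compare_devnum()` and `sort_for_locking()` are hypothetical helper names, not part of `libsystemd`, and the caller is assumed to have collected the `dev_t` values itself, for example from `st_rdev` after `fstat()` on each opened whole-disk node.

```c
#include <assert.h>
#include <stdlib.h>
#include <sys/sysmacros.h>
#include <sys/types.h>

/* Hypothetical helper (not systemd API): order two device numbers by
 * major number first, then by minor number, both ascending. */
int compare_devnum(const void *a, const void *b) {
        dev_t x = *(const dev_t *) a, y = *(const dev_t *) b;

        if (major(x) != major(y))
                return major(x) < major(y) ? -1 : 1;
        if (minor(x) != minor(y))
                return minor(x) < minor(y) ? -1 : 1;
        return 0;
}

/* Hypothetical helper (not systemd API): sort an array of device numbers
 * into the order in which their nodes should be locked. */
void sort_for_locking(dev_t *devs, size_t n) {
        assert(devs != NULL || n == 0);

        if (n > 1)
                qsort(devs, n, sizeof(dev_t), compare_devnum);
}
```

A tool would then take its `LOCK_EX` locks by walking the sorted array front to back, so that two tools locking the same set of devices always acquire them in the same sequence.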
- -Summarizing: it is recommended to take `LOCK_EX` BSD file locks when -manipulating block devices in all tools that change file system block devices -(`mkfs`, `fsck`, …) or partition tables (`fdisk`, `parted`, …), right after -opening the node. - -# Example of Locking The Whole Disk - -The following is an example to leverage `libsystemd` infrastructure to get the whole disk of a block device and take a BSD lock on it. - -## Compile and Execute -**Note that this example requires `libsystemd` version 251 or newer.** - -Place the code in a source file, e.g. `take_BSD_lock.c` and run the following commands: -``` -$ gcc -o take_BSD_lock -lsystemd take_BSD_lock.c - -$ ./take_BSD_lock /dev/sda1 -Successfully took a BSD lock: /dev/sda - -$ flock -x /dev/sda ./take_BSD_lock /dev/sda1 -Failed to take a BSD lock on /dev/sda: Resource temporarily unavailable -``` - -## Code -```c -/* SPDX-License-Identifier: MIT-0 */ - -#include <stdio.h> -#include <stdlib.h> -#include <string.h> -#include <sys/file.h> -#include <systemd/sd-device.h> -#include <unistd.h> - -static inline void closep(int *fd) { - if (*fd >= 0) - close(*fd); -} - -/** - * lock_whole_disk_from_devname - * @devname: devname of a block device, e.g., /dev/sda or /dev/sda1 - * @open_flags: the flags to open the device, e.g., O_RDONLY|O_CLOEXEC|O_NONBLOCK|O_NOCTTY - * @flock_operation: the operation to call flock, e.g., LOCK_EX|LOCK_NB - * - * given the devname of a block device, take a BSD lock of the whole disk - * - * Returns: negative errno value on error, or non-negative fd if the lock was taken successfully. 
- **/ -int lock_whole_disk_from_devname(const char *devname, int open_flags, int flock_operation) { - __attribute__((cleanup(sd_device_unrefp))) sd_device *dev = NULL; - sd_device *whole_dev; - const char *whole_disk_devname, *subsystem, *devtype; - int r; - - // create a sd_device instance from devname - r = sd_device_new_from_devname(&dev, devname); - if (r < 0) { - errno = -r; - fprintf(stderr, "Failed to create sd_device: %m\n"); - return r; - } - - // if the subsystem of dev is block, but its devtype is not disk, find its parent - r = sd_device_get_subsystem(dev, &subsystem); - if (r < 0) { - errno = -r; - fprintf(stderr, "Failed to get the subsystem: %m\n"); - return r; - } - if (strcmp(subsystem, "block") != 0) { - fprintf(stderr, "%s is not a block device, refusing.\n", devname); - return -EINVAL; - } - - r = sd_device_get_devtype(dev, &devtype); - if (r < 0) { - errno = -r; - fprintf(stderr, "Failed to get the devtype: %m\n"); - return r; - } - if (strcmp(devtype, "disk") == 0) - whole_dev = dev; - else { - r = sd_device_get_parent_with_subsystem_devtype(dev, "block", "disk", &whole_dev); - if (r < 0) { - errno = -r; - fprintf(stderr, "Failed to get the parent device: %m\n"); - return r; - } - } - - // open the whole disk device node - __attribute__((cleanup(closep))) int fd = sd_device_open(whole_dev, open_flags); - if (fd < 0) { - errno = -fd; - fprintf(stderr, "Failed to open the device: %m\n"); - return fd; - } - - // get the whole disk devname - r = sd_device_get_devname(whole_dev, &whole_disk_devname); - if (r < 0) { - errno = -r; - fprintf(stderr, "Failed to get the whole disk name: %m\n"); - return r; - } - - // take a BSD lock of the whole disk device node - if (flock(fd, flock_operation) < 0) { - r = -errno; - fprintf(stderr, "Failed to take a BSD lock on %s: %m\n", whole_disk_devname); - return r; - } - - printf("Successfully took a BSD lock: %s\n", whole_disk_devname); - - // take the fd to avoid automatic cleanup - int ret_fd = fd; - fd = 
-EBADF; - return ret_fd; -} - -int main(int argc, char **argv) { - if (argc != 2) { - fprintf(stderr, "Invalid number of parameters.\n"); - return EXIT_FAILURE; - } - - // try to take an exclusive and nonblocking BSD lock - __attribute__((cleanup(closep))) int fd = - lock_whole_disk_from_devname( - argv[1], - O_RDONLY|O_CLOEXEC|O_NONBLOCK|O_NOCTTY, - LOCK_EX|LOCK_NB); - - if (fd < 0) - return EXIT_FAILURE; - - /** - * The device is now locked until the return below. - * Now you can safely manipulate the block device. - **/ - - return EXIT_SUCCESS; -} -``` diff --git a/docs/_interfaces/CGROUP_DELEGATION.md b/docs/_interfaces/CGROUP_DELEGATION.md deleted file mode 100644 index 4210a75767..0000000000 --- a/docs/_interfaces/CGROUP_DELEGATION.md +++ /dev/null @@ -1,502 +0,0 @@ ---- -title: Control Group APIs and Delegation -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# Control Group APIs and Delegation - -*Intended audience: hackers working on userspace subsystems that require direct -cgroup access, such as container managers and similar.* - -So you are wondering about resource management with systemd, you know Linux -control groups (cgroups) a bit and are trying to integrate your software with -what systemd has to offer there. Here's a bit of documentation about the -concepts and interfaces involved with this. - -What's described here has been part of systemd and documented since v205 -times. However, it has been updated and improved substantially, even -though the concepts stayed mostly the same. This is an attempt to provide more -comprehensive up-to-date information about all this, particular in light of the -poor implementations of the components interfacing with systemd of current -container managers. - -Before you read on, please make sure you read the low-level kernel -documentation about the -[unified cgroup hierarchy](https://docs.kernel.org/admin-guide/cgroup-v2.html). 
-This document then adds in the higher-level view from systemd. - -This document augments the existing documentation we already have: - -* [The New Control Group Interfaces](https://www.freedesktop.org/wiki/Software/systemd/ControlGroupInterface) -* [Writing VM and Container Managers](https://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers) - -These wiki documents are not as up to date as they should be, currently, but -the basic concepts still fully apply. You should read them too, if you do something -with cgroups and systemd, in particular as they shine more light on the various -D-Bus APIs provided. (That said, sooner or later we should probably fold that -wiki documentation into this very document, too.) - -## Two Key Design Rules - -Much of the philosophy behind these concepts is based on a couple of basic -design ideas of cgroup v2 (which we however try to adapt as far as we can to -cgroup v1 too). Specifically two cgroup v2 rules are the most relevant: - -1. The **no-processes-in-inner-nodes** rule: this means that it's not permitted -to have processes directly attached to a cgroup that also has child cgroups and -vice versa. A cgroup is either an inner node or a leaf node of the tree, and if -it's an inner node it may not contain processes directly, and if it's a leaf -node then it may not have child cgroups. (Note that there are some minor -exceptions to this rule, though. E.g. the root cgroup is special and allows -both processes and children — which is used in particular to maintain kernel -threads.) - -2. The **single-writer** rule: this means that each cgroup only has a single -writer, i.e. a single process managing it. It's OK if different cgroups have -different processes managing them. However, only a single process should own a -specific cgroup, and when it does that ownership is exclusive, and nothing else -should manipulate it at the same time. 
This rule ensures that various pieces of -software don't step on each other's toes constantly. - -These two rules have various effects. For example, one corollary of this is: if -your container manager creates and manages cgroups in the system's root cgroup -you violate rule #2, as the root cgroup is managed by systemd and hence off -limits to everybody else. - -Note that rule #1 is generally enforced by the kernel if cgroup v2 is used: as -soon as you add a process to a cgroup it is ensured the rule is not -violated. On cgroup v1 this rule didn't exist, and hence isn't enforced, even -though it's a good thing to follow it then too. Rule #2 is not enforced on -either cgroup v1 nor cgroup v2 (this is UNIX after all, in the general case -root can do anything, modulo SELinux and friends), but if you ignore it you'll -be in constant pain as various pieces of software will fight over cgroup -ownership. - -Note that cgroup v1 is currently the most deployed implementation, even though -it's semantically broken in many ways, and in many cases doesn't actually do -what people think it does. cgroup v2 is where things are going, and most new -kernel features in this area are only added to cgroup v2, and not cgroup v1 -anymore. For example, cgroup v2 provides proper cgroup-empty notifications, has -support for all kinds of per-cgroup BPF magic, supports secure delegation of -cgroup trees to less privileged processes and so on, which all are not -available on cgroup v1. - -## Three Different Tree Setups 🌳 - -systemd supports three different modes how cgroups are set up. Specifically: - -1. **Unified** — this is the simplest mode, and exposes a pure cgroup v2 -logic. In this mode `/sys/fs/cgroup` is the only mounted cgroup API file system -and all available controllers are exclusively exposed through it. - -2. **Legacy** — this is the traditional cgroup v1 mode. 
In this mode the -various controllers each get their own cgroup file system mounted to -`/sys/fs/cgroup/<controller>/`. On top of that systemd manages its own cgroup -hierarchy for managing purposes as `/sys/fs/cgroup/systemd/`. - -3. **Hybrid** — this is a hybrid between the unified and legacy mode. It's set -up mostly like legacy, except that there's also an additional hierarchy -`/sys/fs/cgroup/unified/` that contains the cgroup v2 hierarchy. (Note that in -this mode the unified hierarchy won't have controllers attached, the -controllers are all mounted as separate hierarchies as in legacy mode, -i.e. `/sys/fs/cgroup/unified/` is purely and exclusively about core cgroup v2 -functionality and not about resource management.) In this mode compatibility -with cgroup v1 is retained while some cgroup v2 features are available -too. This mode is a stopgap. Don't bother with this too much unless you have -too much free time. - -To say this clearly, legacy and hybrid modes have no future. If you develop -software today and don't focus on the unified mode, then you are writing -software for yesterday, not tomorrow. They are primarily supported for -compatibility reasons and will not receive new features. Sorry. - -Superficially, in legacy and hybrid modes it might appear that the parallel -cgroup hierarchies for each controller are orthogonal from each other. In -systemd they are not: the hierarchies of all controllers are always kept in -sync (at least mostly: sub-trees might be suppressed in certain hierarchies if -no controller usage is required for them). The fact that systemd keeps these -hierarchies in sync means that the legacy and hybrid hierarchies are -conceptually very close to the unified hierarchy. In particular this allows us -to talk of one specific cgroup and actually mean the same cgroup in all -available controller hierarchies. E.g. 
if we talk about the cgroup `/foo/bar/` -then we actually mean `/sys/fs/cgroup/cpu/foo/bar/` as well as -`/sys/fs/cgroup/memory/foo/bar/`, `/sys/fs/cgroup/pids/foo/bar/`, and so on. -Note that in cgroup v2 the controller hierarchies aren't orthogonal, hence -thinking about them as orthogonal won't help you in the long run anyway. - -If you wonder how to detect which of these three modes is currently used, use -`statfs()` on `/sys/fs/cgroup/`. If it reports `CGROUP2_SUPER_MAGIC` in its -`.f_type` field, then you are in unified mode. If it reports `TMPFS_MAGIC` then -you are either in legacy or hybrid mode. To distinguish these two cases, run -`statfs()` again on `/sys/fs/cgroup/unified/`. If that succeeds and reports -`CGROUP2_SUPER_MAGIC` you are in hybrid mode, otherwise not. -From a shell, you can check the `Type` in `stat -f /sys/fs/cgroup` and -`stat -f /sys/fs/cgroup/unified`. - -## systemd's Unit Types - -The low-level kernel cgroups feature is exposed in systemd in three different -"unit" types. Specifically: - -1. 💼 The `.service` unit type. This unit type is for units encapsulating - processes systemd itself starts. Units of these types have cgroups that are - the leaves of the cgroup tree the systemd instance manages (though possibly - they might contain a sub-tree of their own managed by something else, made - possible by the concept of delegation, see below). Service units are usually - instantiated based on a unit file on disk that describes the command line to - invoke and other properties of the service. However, service units may also - be declared and started programmatically at runtime through a D-Bus API - (which is called *transient* services). - -2. 👓 The `.scope` unit type. This is very similar to `.service`. The main - difference: the processes the units of this type encapsulate are forked off - by some unrelated manager process, and that manager asked systemd to expose - them as a unit. 
Unlike services, scopes can only be declared and started - programmatically, i.e. are always transient. That's because they encapsulate - processes forked off by something else, i.e. existing runtime objects, and - hence cannot really be defined fully in 'offline' concepts such as unit - files. - -3. 🔪 The `.slice` unit type. Units of this type do not directly contain any - processes. Units of this type are the inner nodes of part of the cgroup tree - the systemd instance manages. Much like services, slices can be defined - either on disk with unit files or programmatically as transient units. - -Slices expose the trunk and branches of a tree, and scopes and services are -attached to those branches as leaves. The idea is that scopes and services can -be moved around though, i.e. assigned to a different slice if needed. - -The naming of slice units directly maps to the cgroup tree path. This is not -the case for service and scope units however. A slice named `foo-bar-baz.slice` -maps to a cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/`. A service -`quux.service` which is attached to the slice `foo-bar-baz.slice` maps to the -cgroup `/foo.slice/foo-bar.slice/foo-bar-baz.slice/quux.service/`. - -By default systemd sets up four slice units: - -1. `-.slice` is the root slice. i.e. the parent of everything else. On the host - system it maps directly to the top-level directory of cgroup v2. - -2. `system.slice` is where system services are by default placed, unless - configured otherwise. - -3. `user.slice` is where user sessions are placed. Each user gets a slice of - its own below that. - -4. `machines.slice` is where VMs and containers are supposed to be - placed. `systemd-nspawn` makes use of this by default, and you're very welcome - to place your containers and VMs there too if you hack on managers for those. - -Users may define any amount of additional slices they like though, the four -above are just the defaults. 
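The mapping from a slice unit name to its cgroup path, as described for `foo-bar-baz.slice`, can be sketched in a few lines of C. This is a minimal illustration, not how systemd implements it: `slice_to_cgroup_path()` is a hypothetical helper name, it assumes an already valid, unescaped slice name, and it performs no validation beyond checking the `.slice` suffix.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper (not systemd API): expand a slice unit name such as
 * "foo-bar-baz.slice" into "/foo.slice/foo-bar.slice/foo-bar-baz.slice".
 * Returns 0 on success, -1 if the name lacks the ".slice" suffix or the
 * buffer is too small. */
int slice_to_cgroup_path(const char *slice, char *buf, size_t size) {
        static const char suffix[] = ".slice";
        size_t len, prefix_len, used = 0;

        assert(slice && buf);

        len = strlen(slice);
        if (len <= strlen(suffix) || strcmp(slice + len - strlen(suffix), suffix) != 0)
                return -1;
        prefix_len = len - strlen(suffix);

        /* The root slice "-.slice" maps to the top of the tree. */
        if (strcmp(slice, "-.slice") == 0) {
                int k = snprintf(buf, size, "/");
                return (k < 0 || (size_t) k >= size) ? -1 : 0;
        }

        /* Emit one path component for each dash-terminated prefix,
         * and a final one for the full name. */
        for (size_t i = 1; i <= prefix_len; i++) {
                if (i < prefix_len && slice[i] != '-')
                        continue;

                int k = snprintf(buf + used, size - used, "/%.*s.slice", (int) i, slice);
                if (k < 0 || (size_t) k >= size - used)
                        return -1;
                used += (size_t) k;
        }

        return 0;
}
```

Appending `/quux.service` to the result for `foo-bar-baz.slice` gives the cgroup path of a service attached to that slice, matching the example in the text.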
- -## Delegation - -Container managers and suchlike often want to control cgroups directly using -the raw kernel APIs. That's entirely fine and supported, as long as proper -*delegation* is followed. Delegation is a concept we inherited from cgroup v2, -but we expose it on cgroup v1 too. Delegation means that some parts of the -cgroup tree may be managed by different managers than others. As long as it is -clear which manager manages which part of the tree each one can do within its -sub-graph of the tree whatever it wants. - -Only sub-trees can be delegated (though whoever decides to request a sub-tree -can delegate sub-sub-trees further to somebody else if they like). Delegation -takes place at a specific cgroup: in systemd there's a `Delegate=` property you -can set for a service or scope unit. If you do, it's the cut-off point for -systemd's cgroup management: the unit itself is managed by systemd, i.e. all -its attributes are managed exclusively by systemd, however your program may -create/remove sub-cgroups inside it freely, and those then become exclusive -property of your program, systemd won't touch them — all attributes of *those* -sub-cgroups can be manipulated freely and exclusively by your program. - -By turning on the `Delegate=` property for a scope or service you get a few -guarantees: - -1. systemd won't fiddle with your sub-tree of the cgroup tree anymore. It won't - change attributes of any cgroups below it, nor will it create or remove any - cgroups thereunder, nor migrate processes across the boundaries of that - sub-tree as it deems useful anymore. - -2. If your service makes use of the `User=` functionality, then the sub-tree - will be `chown()`ed to the indicated user so that it can correctly create - cgroups below it. Note however that systemd will do that only in the unified - hierarchy (in unified and hybrid mode) as well as on systemd's own private - hierarchy (in legacy and hybrid mode). 
It won't pass ownership of the legacy - controller hierarchies. Delegation to less privileged processes is not safe - in cgroup v1 (as a limitation of the kernel), hence systemd won't facilitate - access to it. - -3. Any BPF IP filter programs systemd installs will be installed with - `BPF_F_ALLOW_MULTI` so that your program can install additional ones. - -In unit files the `Delegate=` property is superficially exposed as -boolean. However, since v236 it optionally takes a list of controller names -instead. If so, delegation is requested for listed controllers -specifically. Note that this only encodes a request. Depending on various -parameters it might happen that your service actually will get fewer -controllers delegated (for example, because the controller is not available on -the current kernel or was turned off) or more. If no list is specified -(i.e. the property simply set to `yes`) then all available controllers are -delegated. - -Let's stress one thing: delegation is available on scope and service units -only. It's expressly not available on slice units. Why? Because slice units are -our *inner* nodes of the cgroup trees and we freely attach services and scopes -to them. If we'd allow delegation on slice units then this would mean that -both systemd and your own manager would create/delete cgroups below the slice -unit and that conflicts with the single-writer rule. - -So, if you want to do your own raw cgroups kernel level access, then allocate a -scope unit, or a service unit (or just use the service unit you already have -for your service code), and turn on delegation for it. - -The service manager sets the `user.delegate` extended attribute (readable via -`getxattr(2)` and related calls) to the character `1` on cgroup directories -where delegation is enabled (and removes it on those cgroups where it is -not). This may be used by service programs to determine whether a cgroup tree -was delegated to them. 
Note that this is only supported on kernels 5.6 and -newer in combination with systemd 251 and newer. - -(OK, here's one caveat: if you turn on delegation for a service, and that -service has `ExecStartPost=`, `ExecReload=`, `ExecStop=` or `ExecStopPost=` -set, then these commands will be executed within the `.control/` sub-cgroup of -your service's cgroup. This is necessary because by turning on delegation we -have to assume that the cgroup delegated to your service is now an *inner* -cgroup, which means that it may not directly contain any processes. Hence, if -your service has any of these four settings set, you must be prepared that a -`.control/` subcgroup might appear, managed by the service manager. This also -means that your service code should have moved itself further down the cgroup -tree by the time it notifies the service manager about start-up readiness, so -that the service's main cgroup is definitely an inner node by the time the -service manager might start `ExecStartPost=`. Starting with systemd 254 you may -also use `DelegateSubgroup=` to let the service manager put your initial -service process into a subgroup right away.) - -(Also note, if you intend to use "threaded" cgroups — as added in Linux 4.14 —, -then you should do that *two* levels down from the main service cgroup you -turned delegation on for. Why that? You need one level so that systemd can -properly create the `.control` subgroup, as described above. But that one -cannot be threaded, since that would mean `.control` has to be threaded too — -this is a requirement of threaded cgroups: either a cgroup and all its siblings -are threaded or none –, but systemd expects it to be a regular cgroup. Thus you -have to nest a second cgroup beneath it which then can be threaded.) - -## Three Scenarios - -Let's say you write a container manager, and you wonder what to do regarding -cgroups for it, as you want your manager to be able to run on systemd systems. 
- -You basically have three options: - -1. 😊 The *integration-is-good* option. For this, you register each container - you have either as a systemd service (i.e. let systemd invoke the executor - binary for you) or a systemd scope (i.e. your manager executes the binary - directly, but then tells systemd about it. In this mode the administrator - can use the usual systemd resource management and reporting commands - individually on those containers. By turning on `Delegate=` for these scopes - or services you make it possible to run cgroup-enabled programs in your - containers, for example a nested systemd instance. This option has two - sub-options: - - a. You transiently register the service or scope by directly contacting - systemd via D-Bus. In this case systemd will just manage the unit for you - and nothing else. - - b. Instead you register the service or scope through `systemd-machined` - (also via D-Bus). This mini-daemon is basically just a proxy for the same - operations as in a. The main benefit of this: this way you let the system - know that what you are registering is a container, and this opens up - certain additional integration points. For example, `journalctl -M` can - then be used to directly look into any container's journal logs (should - the container run systemd inside), or `systemctl -M` can be used to - directly invoke systemd operations inside the containers. Moreover tools - like "ps" can then show you to which container a process belongs (`ps -eo - pid,comm,machine`), and even gnome-system-monitor supports it. - -2. 🙁 The *i-like-islands* option. If all you care about is your own cgroup tree, - and you want to have to do as little as possible with systemd and no - interest in integration with the rest of the system, then this is a valid - option. For this all you have to do is turn on `Delegate=` for your main - manager daemon. Then figure out the cgroup systemd placed your daemon in: - you can now freely create sub-cgroups beneath it. 
Don't forget the - *no-processes-in-inner-nodes* rule however: you have to move your main - daemon process out of that cgroup (and into a sub-cgroup) before you can - start further processes in any of your sub-cgroups. - -3. 🙁 The *i-like-continents* option. In this option you'd leave your manager - daemon where it is, and would not turn on delegation on its unit. However, - as you start your first managed process (a container, for example) you would - register a new scope unit with systemd, and that scope unit would have - `Delegate=` turned on, and it would contain the PID of this process; all - your managed processes subsequently created should also be moved into this - scope. From systemd's PoV there'd be two units: your manager service and the - big scope that contains all your managed processes in one. - -BTW: if for whatever reason you say "I hate D-Bus, I'll never call any D-Bus -API, kthxbye", then options #1 and #3 are not available, as they generally -involve talking to systemd from your program code, via D-Bus. You still have -option #2 in that case however, as you can simply set `Delegate=` in your -service's unit file and you are done and have your own sub-tree. In fact, #2 is -the one option that allows you to completely ignore systemd's existence: you -can entirely generically follow the single rule that you just use the cgroup -you are started in, and everything below it, whatever that might be. That said, -maybe if you dislike D-Bus and systemd that much, the better approach might be -to work on that, and widen your horizon a bit. You are welcome. - -## Controller Support - -systemd supports a number of controllers (but not all). Specifically, supported -are: - -* on cgroup v1: `cpu`, `cpuacct`, `blkio`, `memory`, `devices`, `pids` -* on cgroup v2: `cpu`, `io`, `memory`, `pids` - -It is our intention to natively support all cgroup v2 controllers as they are -added to the kernel. 
However, regarding cgroup v1: at this point we will not -add support for any other controllers anymore. This means systemd currently -does not and will never manage the following controllers on cgroup v1: -`freezer`, `cpuset`, `net_cls`, `perf_event`, `net_prio`, `hugetlb`. Why not? -Depending on the case, either their API semantics or implementations aren't -really usable, or it's very clear they have no future on cgroup v2, and we -won't add new code for stuff that clearly has no future. - -Effectively this means that all those mentioned cgroup v1 controllers are up -for grabs: systemd won't manage them, and hence won't delegate them to your -code (however, systemd will still mount their hierarchies, simply because it -mounts all controller hierarchies it finds available in the kernel). If you -decide to use them, then that's fine, but systemd won't help you with it (but -also not interfere with it). To be nice to other tenants it might be wise to -replicate the cgroup hierarchies of the other controllers in them too however, -but of course that's between you and those other tenants, and systemd won't -care. Replicating the cgroup hierarchies in those unsupported controllers would -mean replicating the full cgroup paths in them, and hence the prefixing -`.slice` components too, otherwise the hierarchies will start being orthogonal -after all, and that's not really desirable. One more thing: systemd will clean -up after you in the hierarchies it manages: if your daemon goes down, its -cgroups will be removed too. You basically get the guarantee that you start -with a pristine cgroup sub-tree for your service or scope whenever it is -started. This is not the case however in the hierarchies systemd doesn't -manage. This means that your programs should be ready to deal with left-over -cgroups in them — from previous runs, and be extra careful with them as they -might still carry settings that might not be valid anymore. 
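
The warning about left-over cgroups in hierarchies that systemd does not manage can be sketched as follows. This is a minimal illustration, not anything systemd ships: the function names are made up for this example, and the sub-tree path passed in (something like your replicated cgroup path below a `freezer` or `net_cls` mount) is an assumption about your setup.

```python
import os


def leftover_cgroups(subtree):
    """Return every cgroup directory below `subtree`, deepest first.

    Deepest-first order matters because a cgroup can only be removed
    with rmdir() once it has no child cgroups left.
    """
    found = []
    # topdown=False makes os.walk() report children before their parents.
    for dirpath, _dirnames, _filenames in os.walk(subtree, topdown=False):
        if dirpath != subtree:
            found.append(dirpath)
    return found


def remove_leftovers(subtree):
    """Try to remove all left-over cgroups; return the ones that could not be removed."""
    failed = []
    for path in leftover_cgroups(subtree):
        try:
            os.rmdir(path)
        except OSError:
            # For example, the cgroup still contains processes.
            failed.append(path)
    return failed
```

Running something like `remove_leftovers()` once at start-up, before creating any new cgroups, gets you close to the pristine sub-tree that systemd guarantees only for the hierarchies it manages. Anything it reports as failed still deserves a careful look before reuse.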
- -Note a particular asymmetry here: if your systemd version doesn't support a -specific controller on cgroup v1 you can still make use of it for delegation, -by directly fiddling with its hierarchy and replicating the cgroup tree there -as necessary (as suggested above). However, on cgroup v2 this is different: -separately mounted hierarchies are not available, and delegation has always to -happen through systemd itself. This means: when you update your kernel and it -adds a new, so far unseen controller, and you want to use it for delegation, -then you also need to update systemd to a version that groks it. - -## systemd as Container Payload - -systemd can happily run as a container payload's PID 1. Note that systemd -unconditionally needs write access to the cgroup tree however, hence you need -to delegate a sub-tree to it. Note that there's nothing too special you have to -do beyond that: just invoke systemd as PID 1 inside the root of the delegated -cgroup sub-tree, and it will figure out the rest: it will determine the cgroup -it is running in and take possession of it. It won't interfere with any cgroup -outside of the sub-tree it was invoked in. Use of `CLONE_NEWCGROUP` is hence -optional (but of course wise). - -Note one particular asymmetry here though: systemd will try to take possession -of the root cgroup you pass to it *in* *full*, i.e. it will not only -create/remove child cgroups below it, it will also attempt to manage the -attributes of it. OTOH as mentioned above, when delegating a cgroup tree to -somebody else it only passes the rights to create/remove sub-cgroups, but will -insist on managing the delegated cgroup tree's top-level attributes. Or in -other words: systemd is *greedy* when accepting delegated cgroup trees and also -*greedy* when delegating them to others: it insists on managing attributes on -the specific cgroup in both cases. 
A container manager that is itself a payload -of a host systemd which wants to run a systemd as its own container payload -instead hence needs to insert an extra level in the hierarchy in between, so -that the systemd on the host and the one in the container won't fight for the -attributes. That said, you likely should do that anyway, due to the -no-processes-in-inner-cgroups rule, see below. - -When systemd runs as container payload it will make use of all hierarchies it -has write access to. For legacy mode you need to make at least -`/sys/fs/cgroup/systemd/` available, all other hierarchies are optional. For -hybrid mode you need to add `/sys/fs/cgroup/unified/`. Finally, for fully -unified you (of course, I guess) need to provide only `/sys/fs/cgroup/` itself. - -## Some Dos - -1. ⚡ If you go for implementation option 1a or 1b (as in the list above), then - each of your containers will have its own systemd-managed unit and hence - cgroup with possibly further sub-cgroups below. Typically the first process - running in that unit will be some kind of executor program, which will in - turn fork off the payload processes of the container. In this case don't - forget that there are two levels of delegation involved: first, systemd - delegates a group sub-tree to your executor. And then your executor should - delegate a sub-tree further down to the container payload. Oh, and because - of the no-process-in-inner-nodes rule, your executor needs to migrate itself - to a sub-cgroup of the cgroup it got delegated, too. Most likely you hence - want a two-pronged approach: below the cgroup you got started in, you want - one cgroup maybe called `supervisor/` where your manager runs in and then - for each container a sibling cgroup of that maybe called `payload-xyz/`. - -2. ⚡ Don't forget that the cgroups you create have to have names that are - suitable as UNIX file names, and that they live in the same namespace as the - various kernel attribute files. 
Hence, when you want to allow the user - arbitrary naming, you might need to escape some of the names (for example, - you really don't want to create a cgroup named `tasks`, just because the - user created a container by that name, because `tasks` after all is a magic - attribute in cgroup v1, and your `mkdir()` will hence fail with `EEXIST`). In - systemd we do escaping by prefixing names that might collide with a kernel - attribute name with an underscore. You might want to do the same, but how - you do it is really up to you. Just do it, and be careful. - -## Some Don'ts - -1. 🚫 Never create your own cgroups below arbitrary cgroups systemd manages, i.e. - cgroups you haven't set `Delegate=` in. Specifically: 🔥 don't create your - own cgroups below the root cgroup 🔥. That's owned by systemd, and you will - step on systemd's toes if you ignore that, and systemd will step on - yours. Get your own delegated sub-tree, you may create as many cgroups there - as you like. Seriously, if you create cgroups directly in the cgroup root, - then all you do is ask for trouble. - -2. 🚫 Don't attempt to set `Delegate=` in slice units, and in particular not in - `-.slice`. It's not supported, and will generate an error. - -3. 🚫 Never *write* to any of the attributes of a cgroup systemd created for - you. It's systemd's private property. You are welcome to manipulate the - attributes of cgroups you created in your own delegated sub-tree, but the - cgroup tree of systemd itself is off limits for you. It's fine to *read* - from any attribute you like however. That's totally OK and welcome. - -4. 🚫 If you don't use `CLONE_NEWCGROUP` when delegating a sub-tree to a - container payload running systemd, don't get the idea that you can bind - mount only a sub-tree of the host's cgroup tree into the container.
Part of - the cgroup API is that `/proc/$PID/cgroup` reports the cgroup path of every - process, and hence any path below `/sys/fs/cgroup/` needs to match what - `/proc/$PID/cgroup` of the payload processes reports. What you can do safely - however, is mount the upper parts of the cgroup tree read-only (or even - replace the middle bits with an intermediary `tmpfs` — but be careful not to - break the `statfs()` detection logic discussed above), as long as the path - to the delegated sub-tree remains accessible as-is. - -5. ⚡ Currently, the algorithm for mapping between slice/scope/service unit - naming and their cgroup paths is not considered public API of systemd, and - may change in future versions. This means: it's best to avoid implementing a - local logic of translating cgroup paths to slice/scope/service names in your - program, or vice versa — it's likely going to break sooner or later. Use the - appropriate D-Bus API calls for that instead, so that systemd translates - this for you. (Specifically: each Unit object has a `ControlGroup` property - to get the cgroup for a unit. The method `GetUnitByControlGroup()` may be - used to get the unit for a cgroup.) - -6. ⚡ Think twice before delegating cgroup v1 controllers to less privileged - containers. It's not safe, you basically allow your containers to freeze the - system with that and worse. Delegation is a strongpoint of cgroup v2 though, - and there it's safe to treat delegation boundaries as privilege boundaries. - -And that's it for now. If you have further questions, refer to the systemd -mailing list. 
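
The name-escaping advice from the "Some Dos" list above (prefix anything that might collide with a kernel attribute file with an underscore) could look roughly like this. Treat it as a sketch under assumptions: the reserved names and prefixes below are a hand-picked illustration rather than an authoritative list, and the function names are invented for this example. Names that already start with an underscore are escaped too, purely so that unescaping stays unambiguous.

```python
# Attribute files found in cgroup directories. This selection is an
# illustrative assumption, not a complete registry of kernel attributes.
RESERVED_NAMES = ("tasks", "notify_on_release", "release_agent")
RESERVED_PREFIXES = (
    "cgroup.", "cpu.", "cpuacct.", "blkio.", "io.",
    "memory.", "devices.", "pids.",
)


def escape_cgroup_name(name):
    """Return a directory name that cannot collide with an attribute file."""
    if not name or "/" in name or name in (".", ".."):
        raise ValueError("not usable as a single path component: %r" % name)
    needs_escape = (
        name.startswith(("_", "."))          # keep unescaping unambiguous
        or name in RESERVED_NAMES
        or name.startswith(RESERVED_PREFIXES)
    )
    return "_" + name if needs_escape else name


def unescape_cgroup_name(name):
    """Invert escape_cgroup_name()."""
    return name[1:] if name.startswith("_") else name
```

With this, a container the user called `tasks` ends up in a cgroup directory named `_tasks`, while an ordinary name such as `payload-xyz` is left alone.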
- -— Berlin, 2018-04-20 diff --git a/docs/_interfaces/CONTAINER_INTERFACE.md b/docs/_interfaces/CONTAINER_INTERFACE.md deleted file mode 100644 index dcecdecc3e..0000000000 --- a/docs/_interfaces/CONTAINER_INTERFACE.md +++ /dev/null @@ -1,421 +0,0 @@ ---- -title: Container Interface -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# The Container Interface - -Also consult [Writing Virtual Machine or Container -Managers](https://www.freedesktop.org/wiki/Software/systemd/writing-vm-managers). - -systemd has a number of interfaces for interacting with container managers, -when systemd is used inside of an OS container. If you work on a container -manager, please consider supporting the following interfaces. - -## Execution Environment - -1. If the container manager wants to control the hostname for a container - running systemd it may just set it before invoking systemd, and systemd will - leave it unmodified when there is no hostname configured in `/etc/hostname` - (that file overrides whatever is pre-initialized by the container manager). - -2. Make sure to pre-mount `/proc/`, `/sys/`, and `/sys/fs/selinux/` before - invoking systemd, and mount `/sys/`, `/sys/fs/selinux/` and `/proc/sys/` - read-only (the latter via e.g. a read-only bind mount on itself) in order - to prevent the container from altering the host kernel's configuration - settings. (As a special exception, if your container has network namespaces - enabled, feel free to make `/proc/sys/net/` writable. If it also has user, ipc, - uts and pid namespaces enabled, the entire `/proc/sys` can be left writable). - systemd and various other subsystems (such as the SELinux userspace) have - been modified to behave accordingly when these file systems are read-only. - (It's OK to mount `/sys/` as `tmpfs` btw, and only mount a subset of its - sub-trees from the real `sysfs` to hide `/sys/firmware/`, `/sys/kernel/` and - so on. 
If you do that, make sure to mark `/sys/` read-only, as that - condition is what systemd looks for, and is what is considered to be the API - in this context.) - -3. Pre-mount `/dev/` as (container private) `tmpfs` for the container and bind - mount some suitable TTY to `/dev/console`. If this is a pty, make sure to - not close the controlling pty during systemd's lifetime. PID 1 will close - ttys, to avoid being killed by SAK. It only opens ttys for the time it - actually needs to print something. Also, make sure to create device nodes - for `/dev/null`, `/dev/zero`, `/dev/full`, `/dev/random`, `/dev/urandom`, - `/dev/tty`, `/dev/ptmx` in `/dev/`. It is not necessary to create `/dev/fd` - or `/dev/stdout`, as systemd will do that on its own. Make sure to set up a - `BPF_PROG_TYPE_CGROUP_DEVICE` BPF program — on cgroupv2 — or the `devices` - cgroup controller — on cgroupv1 — so that no other devices but these may be - created in the container. Note that many systemd services use - `PrivateDevices=`, which means that systemd will set up a private `/dev/` - for them for which it needs to be able to create these device nodes. - Dropping `CAP_MKNOD` for containers is hence generally not advisable, but - see below. - -4. `systemd-udevd` is not available in containers (and refuses to start), and - hence device dependencies are unavailable. The `systemd-udevd` unit files - will check for `/sys/` being read-only, as an indication whether device - management can work. Therefore make sure to mount `/sys/` read-only in the - container (see above). Various clients of `systemd-udevd` also check the - read-only state of `/sys/`, including PID 1 itself and `systemd-networkd`. - -5. If systemd detects it is run in a container it will spawn a single shell on - `/dev/console`, and not care about VTs or multiple gettys on VTs. (But see - `$container_ttys` below.) - -6. 
Either pre-mount all cgroup hierarchies in full into the container, or leave - that to systemd which will do so if they are missing. Note that it is - explicitly *not* OK to just mount a sub-hierarchy into the container as that - is incompatible with `/proc/$PID/cgroup` (which lists full paths). Also the - root-level cgroup directories tend to be quite different from inner - directories, and that distinction matters. It is OK however, to mount the - "upper" parts of the hierarchies read-only, and only allow write-access to - the cgroup sub-tree the container runs in. It's also a good idea to mount - all controller hierarchies with the exception of `name=systemd` fully read-only - (this only applies to cgroupv1, of course), to protect the controllers from - alteration from inside the containers. Or to turn this around: only the - cgroup sub-tree of the container itself (on cgroupv2 in the unified - hierarchy, and on cgroupv1 in the `name=systemd` hierarchy) may be writable - to the container. - -7. Create the control group root of your container by either running your - container as a service (in case you have one container manager instance per - container instance) or creating one scope unit for each container instance - via systemd's transient unit API (in case you have one container manager - that manages all instances). Either way, make sure to set `Delegate=yes` in - it. This ensures that the unit you created will be part of all cgroup - controllers (or at least the ones systemd understands). The latter may also - be done via `systemd-machined`'s `CreateMachine()` API. Make sure to use the - cgroup path systemd put your process in for all operations of the container. - Do not add new cgroup directories to the top of the tree. This will not only - confuse systemd and the admin, but also prevent your implementation from - being "stackable". - -## Environment Variables - -1.
To allow systemd (and other programs) to identify that it is executed within - a container, please set the `$container` environment variable for PID 1 in - the container to a short lowercase string identifying your - implementation. With this in place the `ConditionVirtualization=` setting in - unit files will work properly. Example: `container=lxc-libvirt` - -2. systemd has special support for allowing container managers to initialize - the UUID for `/etc/machine-id` to some manager supplied value. This is only - enabled if `/etc/machine-id` is empty (i.e. not yet set) at boot time of the - container. The container manager should set `$container_uuid` as environment - variable for the container's PID 1 to the container UUID. (This is similar - to the effect of `qemu`'s `-uuid` switch). Note that you should pass only a - UUID here that is actually unique (i.e. only one running container should - have a specific UUID), and gets changed when a container gets duplicated. - Also note that systemd will try to persistently store the UUID in - `/etc/machine-id` (if writable) when this option is used, hence you should - always pass the same UUID here. Keeping the externally used UUID for a - container and the internal one in sync is hopefully useful to minimize - surprise for the administrator. - -3. systemd can automatically spawn login gettys on additional ptys. A container - manager can set the `$container_ttys` environment variable for the - container's PID 1 to tell it on which ptys to spawn gettys. The variable - should take a space separated list of pty names, without the leading `/dev/` - prefix, but with the `pts/` prefix included. Note that despite the - variable's name you may only specify ptys, and not other types of ttys. Also - you need to specify the pty itself, a symlink will not suffice. This is - implemented in - [systemd-getty-generator(8)](https://www.freedesktop.org/software/systemd/man/systemd-getty-generator.html). 
- Note that this variable should not include the pty that `/dev/console` maps - to if it maps to one (see below). Example: if the container receives - `container_ttys=pts/7 pts/8 pts/14` it will spawn three additional login - gettys on ptys 7, 8, and 14. - -4. To allow applications to detect the OS version and other metadata of the host - running the container manager, if this is considered desirable, please parse - the host's `/etc/os-release` and set a `$container_host_<key>=<VALUE>` - environment variable for the ID fields described by the [os-release - interface](https://www.freedesktop.org/software/systemd/man/os-release.html), e.g.: - `$container_host_id=debian` - `$container_host_build_id=2020-06-15` - `$container_host_variant_id=server` - `$container_host_version_id=10` - -5. systemd supports passing immutable binary data blobs with limited size and - restricted access to services via the `ImportCredential=`, `LoadCredential=` - and `SetCredential=` settings. The same protocol may be used to pass credentials - from the container manager to systemd itself. The credential data should be - placed in some location (ideally a read-only and non-swappable file system, - like `ramfs`), and the absolute path to this directory exported in the - `$CREDENTIALS_DIRECTORY` environment variable. If the container manager - does this, the credentials passed to the service manager can be propagated - to services via `LoadCredential=` or `ImportCredential=` (see ...). The - container manager can choose any path, but `/run/host/credentials` is - recommended. - -## Advanced Integration - -1. Consider syncing `/etc/localtime` from the host file system into the - container. Make it a relative symlink to the container's zoneinfo dir, as - usual. Tools rely on being able to determine the timezone setting from the - symlink value, and making it relative looks nice even if people list the - container's `/etc/` from the host. - -2.
Make the container journal available in the host, by automatically - symlinking the container journal directory into the host journal directory. - More precisely, link `/var/log/journal/<container-machine-id>` of the - container into the same dir of the host. Administrators can then - automatically browse all container journals (correctly interleaved) by - issuing `journalctl -m`. The container machine ID can be determined from - `/etc/machine-id` in the container. - -3. If the container manager wants to cleanly shut down the container, it might - be a good idea to send `SIGRTMIN+3` to its init process. systemd will then - do a clean shutdown. Note however, that since only systemd understands - `SIGRTMIN+3` like this, this might confuse other init systems. - -4. To support [Socket Activated - Containers](https://0pointer.de/blog/projects/socket-activated-containers.html) - the container manager should be capable of being run as a systemd - service. It will then receive the sockets starting with FD 3, the number of - passed FDs in `$LISTEN_FDS` and its PID as `$LISTEN_PID`. It should take - these and pass them on to the container's init process, also setting - `$LISTEN_FDS` and `$LISTEN_PID` (basically, it can just leave the FDs and - `$LISTEN_FDS` untouched, but it needs to adjust `$LISTEN_PID` to the - container init process). That's all that's necessary to make socket - activation work. The protocol to hand sockets from systemd to services is - hence the same as from the container manager to the container systemd. For - further details see the explanations of - [sd_listen_fds(3)](https://0pointer.de/public/systemd-man/sd_listen_fds.html) - and the [blog story for service - developers](https://0pointer.de/blog/projects/socket-activation.html). - -5. Container managers should stay away from the cgroup hierarchy outside of the - unit they created for their container. That's private property of systemd, - and no other code should modify it. - -6.
systemd running inside the container can report when boot-up is complete - using the usual `sd_notify()` protocol that is also used when a service - wants to tell the service manager about readiness. A container manager can - set the `$NOTIFY_SOCKET` environment variable to a suitable socket path to - make use of this functionality. (Also see information about - `/run/host/notify` below.) - -## Networking - -1. Inside of a container, if a `veth` link is named `host0`, `systemd-networkd` - running inside of the container will by default run DHCPv4, DHCPv6, and - IPv4LL clients on it. It is thus recommended that container managers that - add a `veth` link to a container name it `host0`, to get an automatically - configured network, with no manual setup. - -2. Outside of a container, if a `veth` link is prefixed "ve-", `systemd-networkd` - will by default run DHCPv4 and DHCPv6 servers on it, as well as IPv4LL. It - is thus recommended that container managers that add a `veth` link to a - container name the external side `ve-` + the container name. - -3. It is recommended to configure stable MAC addresses for container `veth` - devices, for example, hashed out of the container names. That way it is more - likely that DHCP and IPv4LL will acquire stable addresses. - -## The `/run/host/` Hierarchy - -Container managers may place certain resources the manager wants to provide to -the container payload below the `/run/host/` hierarchy. This hierarchy should -be mostly immutable (possibly some subdirs might be writable, but the top-level -hierarchy — and probably most subdirs should be read-only to the -container). Note that this hierarchy is used by various container managers, and -care should be taken to avoid naming conflicts. `systemd` (and in particular -`systemd-nspawn`) use the hierarchy for the following resources: - -1. 
The `/run/host/incoming/` directory mount point is configured for `MS_SLAVE` - mount propagation with the host, and is used as intermediary location for - mounts to establish in the container, for the implementation of `machinectl - bind`. Container payload should usually not directly interact with this - directory: it's used by code outside the container to insert mounts inside - it only, and is mostly an internal vehicle to achieve this. Other container - managers that want to implement similar functionality might consider using - the same directory. - -2. The `/run/host/inaccessible/` directory may be set up by the container - manager to include six file nodes: `reg`, `dir`, `fifo`, `sock`, `chr`, - `blk`. These nodes correspond with the six types of file nodes Linux knows - (with the exception of symlinks). Each node should be of the specific type - and have an all zero access mode, i.e. be inaccessible. The two device node - types should have major and minor of zero (which are unallocated devices on - Linux). These nodes are used as mount source for implementing the - `InaccessiblePaths=` setting of unit files, i.e. file nodes to mask this way - are overmounted with these "inaccessible" inodes, guaranteeing that the file - node type does not change this way but the nodes still become - inaccessible. Note that systemd when run as PID 1 in the container payload - will create these nodes on its own if not passed in by the container - manager. However, in that case it likely lacks the privileges to create the - character and block device nodes (there are fallbacks for this case). - -3. The `/run/host/notify` path is a good choice to place the `sd_notify()` - socket in, that may be used for the container's PID 1 to report to the - container manager when boot-up is complete.
The path used for this doesn't - matter much as it is communicated via the `$NOTIFY_SOCKET` environment - variable, following the usual protocol for this, however it's a suitable and - recommended place for this socket in case ready notification is desired. - -4. The `/run/host/os-release` file contains the `/etc/os-release` file of the - host, i.e. may be used by the container payload to gather limited - information about the host environment, on top of what `uname -a` reports. - -5. The `/run/host/container-manager` file may be used to pass the same - information as the `$container` environment variable (see above), i.e. a - short string identifying the container manager implementation. This file - should be newline terminated. Passing this information via this file has the - benefit that payload code can easily access it, even when running - unprivileged without access to the container PID 1's environment block. - -6. The `/run/host/container-uuid` file may be used to pass the same information - as the `$container_uuid` environment variable (see above). This file should - be newline terminated. - -7. The `/run/host/credentials/` directory is a good place to pass credentials - into the container, using the `$CREDENTIALS_DIRECTORY` protocol, see above. - -8. The `/run/host/unix-export/` directory shall be writable from the container - payload, and is where container payload can bind `AF_UNIX` sockets that - shall be *exported* to the host, so that the host can connect to them. The - container manager should bind mount this directory on the host side - (read-only ideally), so that the host can connect to contained sockets. This - is most prominently used by `systemd-ssh-generator` when run in such a - container to automatically bind an SSH socket into that directory, which - then can be used to connect to the container. - -9.
The `/run/host/unix-export/ssh` `AF_UNIX` socket will be automatically bound - by `systemd-ssh-generator` in the container if possible, and can be used to - connect to the container. - -10. The `/run/host/userdb/` directory may be used to drop in additional JSON - user records that `nss-systemd` inside the container shall include in the - system's user database. This is useful to make host users and their home - directories automatically accessible to containers in transitive - fashion. See `nss-systemd(8)` for details. - -11. The `/run/host/home/` directory may be used to bind mount host home - directories of users that shall be made available in the container. This - may be used in combination with `/run/host/userdb/` above: one defines the - user record, the other contains the user's home directory. - -## What You Shouldn't Do - -1. Do not drop `CAP_MKNOD` from the container. `PrivateDevices=` is a commonly - used service setting that provides a service with its own, private, minimal - version of `/dev/`. To set this up systemd in the container needs this - capability. If you take away the capability, then all services that set this - flag will cease to work. Use `BPF_PROG_TYPE_CGROUP_DEVICE` BPF programs (on - cgroupv2) or the `devices` controller (on cgroupv1) to restrict what - device nodes the container can create instead of taking away the capability - wholesale. (Also see the section about fully unprivileged containers below.) - -2. Do not drop `CAP_SYS_ADMIN` from the container. A number of the most - commonly used file system namespacing related settings, such as - `PrivateDevices=`, `ProtectHome=`, `ProtectSystem=`, `MountFlags=`, - `PrivateTmp=`, `ReadWriteDirectories=`, `ReadOnlyDirectories=`, and - `InaccessibleDirectories=` need to be able to open new - mount namespaces and then mount certain file systems into them. You break all - services that make use of these options if you drop the capability.
Also - note that logind mounts `XDG_RUNTIME_DIR` as `tmpfs` for all logged in users - and that won't work either if you take away the capability. (Also see - section about fully unprivileged containers below.) - -3. Do not cross-link `/dev/kmsg` with `/dev/console`. They are different things, - you cannot link them to each other. - -4. Do not pretend that the real VTs are available in the container. The VT - subsystem consists of all the devices `/dev/tty[0-9]*`, `/dev/vcs*`, - `/dev/vcsa*` plus their `sysfs` counterparts. They speak specific `ioctl()`s - and understand specific escape sequences, that other ptys don't understand. - Hence, it is explicitly not OK to mount a pty to `/dev/tty1`, `/dev/tty2`, - `/dev/tty3`. This is explicitly not supported. - -5. Don't pretend that passing arbitrary devices to containers could really work - well. For example, do not pass device nodes for block devices to the - container. Device access (with the exception of network devices) is not - virtualized on Linux. Enumeration and probing of meta information from - `/sys/` and elsewhere is not possible to do correctly in a container. Simply - adding a specific device node to a container's `/dev/` is *not* *enough* to - do the job, as `systemd-udevd` and suchlike are not available at all, and no - devices will appear available or enumerable, inside the container. - -6. Don't mount only a sub-tree of the `cgroupfs` into the container. This will not - work as `/proc/$PID/cgroup` lists full paths and cannot be matched up with - the actual `cgroupfs` tree visible, then. (You may "prune" some branches - though, see above.) - -7. Do not make `/sys/` writable in the container. 
If you do, - `systemd-udevd.service` is started to manage your devices inside the - container, but that will cause conflicts and errors given that the Linux - device model is not virtualized for containers on Linux and thus the - containers and the host would try to manage the same devices, fighting for - ownership. Multiple other subsystems of systemd similarly test for `/sys/` - being writable to decide whether to use `systemd-udevd` or assume that - device management is properly available on the instance. Among them are - `systemd-networkd` and `systemd-logind`. The conditionalization on the - read-only state of `/sys/` enables a nice automatism: as soon as `/sys/` and - the Linux device model are changed to be virtualized properly the container - payload can make use of that, simply by marking `/sys/` writable. (Note that - as a special exception, the devices in `/sys/class/net/` are virtualized - already, if network namespacing is used. Thus it is OK to mount the relevant - sub-directories of `/sys/` writable, but make sure to leave the root of - `/sys/` read-only.) - -8. Do not pass the `CAP_AUDIT_CONTROL`, `CAP_AUDIT_READ`, `CAP_AUDIT_WRITE` - capabilities to the container, in particular not to those making use of user - namespaces. The kernel's audit subsystem is still not virtualized for - containers, and passing these capabilities is hence pointless, given that any - actual attempt to make use of the audit subsystem will fail. Note that - systemd's audit support is partially conditioned on these capabilities, thus - by dropping them you ensure that you get an entirely clean boot, as systemd - will make no attempt to use it. If you pass the capabilities to the payload - systemd will assume that audit is available and works, and some components - will subsequently fail in various ways.
Note that once the kernel learns - native support for container-virtualized audit, adding the capability to the - container description will automatically make the container payload use it. - -## Fully Unprivileged Container Payload - -First things first, to make this clear: Linux containers are not a security -technology right now. There are more holes in the model than in Swiss cheese. - -For example: if you do not use user namespacing, and share root and other users -between container and host, the `struct user` structures will be shared between -host and container, and hence `RLIMIT_NPROC` and so on of the container users -affect the host and other containers, and vice versa. This is a major security -hole, and actually is a real-life problem: since Avahi sets `RLIMIT_NPROC` of -its user to 2 (to effectively disallow `fork()`ing) you cannot run more than -one Avahi instance on the entire system... - -People have been asking to be able to run systemd without `CAP_SYS_ADMIN` and -`CAP_MKNOD` in the container. This is now supported to some level in -systemd, but we recommend against it (see above). If `CAP_SYS_ADMIN` and -`CAP_MKNOD` are missing from the container systemd will now gracefully turn -off `PrivateTmp=`, `PrivateNetwork=`, `ProtectHome=`, `ProtectSystem=` and -others, because those capabilities are required to implement these options. The -services using these settings (which include many of systemd's own) will hence -run in a different, less secure environment when the capabilities are missing -than with them around. - -With user namespacing in place things get much better. With user namespaces the -`struct user` issue described above goes away, and containers can keep -`CAP_SYS_ADMIN` safely for the user namespace, as capabilities are virtualized -and having capabilities inside a container doesn't mean one also has them -outside.
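
If you want to check from inside a container which of the two capabilities discussed above are actually present, one way is to decode the `CapEff:` line of `/proc/self/status`, which holds the effective capability set as a hexadecimal bit mask. The following is only a sketch: the helper names are made up for this example, and the bit numbers are the ones `<linux/capability.h>` assigns to `CAP_SYS_ADMIN` and to `CAP_MKNOD` (the capability that governs `mknod()`).

```python
# Capability bit numbers as defined in <linux/capability.h>.
CAP_SYS_ADMIN = 21
CAP_MKNOD = 27


def effective_caps(status_text):
    """Extract the effective capability mask from /proc/<pid>/status text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return int(line.split(":", 1)[1].strip(), 16)
    raise ValueError("no CapEff line found")


def missing_sandbox_caps(status_text):
    """List which of CAP_SYS_ADMIN and CAP_MKNOD are absent from the mask."""
    mask = effective_caps(status_text)
    wanted = {"CAP_SYS_ADMIN": CAP_SYS_ADMIN, "CAP_MKNOD": CAP_MKNOD}
    return [name for name, bit in wanted.items() if not mask & (1 << bit)]
```

Typical usage would be `missing_sandbox_caps(open("/proc/self/status").read())`. An empty result means both capabilities are in the effective set of the calling process. It says nothing about whether a device cgroup policy or a user namespace further restricts what they allow.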
- -## Final Words - -If you write software that wants to detect whether it is run in a container, -please check `/proc/1/environ` and look for the `container=` environment -variable. Do not assume the environment variable is inherited down the process -tree. It generally is not. Hence check the environment block of PID 1, not your -own. Note though that this file is only accessible to root. systemd hence early -on also copies the value into `/run/systemd/container`, which is readable for -everybody. However, that's a systemd-specific interface and other init systems -are unlikely to do the same. - -Note that it is our intention to make systemd systems work flawlessly and -out-of-the-box in containers. In fact, we are interested to ensure that the same -OS image can be booted on a bare system, in a VM and in a container, and behave -correctly each time. If you notice that some component in systemd does not work -in a container as it should, even though the container manager implements -everything documented above, please contact us. diff --git a/docs/_interfaces/ELF_PACKAGE_METADATA.md b/docs/_interfaces/ELF_PACKAGE_METADATA.md deleted file mode 100644 index 6cb3f785b4..0000000000 --- a/docs/_interfaces/ELF_PACKAGE_METADATA.md +++ /dev/null @@ -1,105 +0,0 @@ ---- -title: Package Metadata for ELF Files -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# Package Metadata for Core Files - -*Intended audience: hackers working on userspace subsystems that create ELF binaries -or parse ELF core files.* - -## Motivation - -ELF binaries get stamped with a unique, build-time generated hex string identifier called -`build-id`, [which gets embedded as an ELF note called `.note.gnu.build-id`](https://fedoraproject.org/wiki/Releases/FeatureBuildId). -In most cases, this allows to associate a stripped binary with its debugging information. 
-It is used, for example, to dynamically fetch DWARF symbols from a debuginfo server, or -to query the local package manager and find out the package metadata or, again, the DWARF -symbols or program sources. - -However, this usage of the `build-id` requires either local metadata, usually set up by -the package manager, or access to a remote server over the network. Both of those might -be unavailable or forbidden. - -Thus it becomes desirable to add additional metadata to a binary at build time, so that -`systemd-coredump` and other services analyzing core files are able to extract said -metadata simply from the core file itself, without external dependencies. - -## Implementation - -This document will attempt to define a common metadata format specification, so that -multiple implementers might use it when building packages, or core file analyzers, and -so on. - -The metadata will be embedded in a single, new, 4-bytes-aligned, allocated, 0-padded, -read-only ELF header section, in a name-value JSON object format. Implementers working on parsing -core files should not assume a specific list of names, but parse anything that is included -in the section, and should look for the note using the `note type`. Implementers working on -build tools should strive to use the same names, for consistency. The most common will be -listed here. When corresponding to the content of os-release, the values should match, again for consistency. - -If available, the metadata should also include the debuginfod server URL that can provide -the original executable, debuginfo and sources, to further facilitate debugging. 
- -* Section header - -``` -SECTION: `.note.package` -note type: `0xcafe1a7e` -Owner: `FDO` (FreeDesktop.org) -Value: a single JSON object encoded as a zero-terminated UTF-8 string -``` - -* JSON payload - -```json -{ - "type":"rpm", # this provides a namespace for the package+package-version fields - "os":"fedora", - "osVersion":"33", - "name":"coreutils", - "version":"4711.0815.fc13", - "architecture":"arm32", - "osCpe": "cpe:/o:fedoraproject:fedora:33", # A CPE name for the operating system, `CPE_NAME` from os-release is a good default - "debugInfoUrl": "https://debuginfod.fedoraproject.org/" -} -``` - -The format is a single JSON object, encoded as a zero-terminated `UTF-8` string. -Each name in the object shall be unique as per recommendations of -[RFC8259](https://datatracker.ietf.org/doc/html/rfc8259#section-4). Strings shall -not contain any control character, nor use `\uXXXX` escaping. - -When it comes to JSON numbers, this specification assumes that JSON parsers -processing this information are capable of reproducing the full signed 53-bit -integer range (i.e. -2⁵³+1…+2⁵³-1) as well as the full 64-bit IEEE floating -point number range losslessly (with the exception of NaN/-inf/+inf, since JSON -cannot encode those), as per recommendations of -[RFC8259](https://datatracker.ietf.org/doc/html/rfc8259#page-8). Fields in -these JSON objects are thus permitted to encode numeric values from these -ranges as JSON numbers, and should not use numeric values not covered by these -types and ranges. - -Reference implementations of [packaging tools for .deb and .rpm](https://github.com/systemd/package-notes) -are available, and provide macros/helpers to include the note in binaries built -by the package build system. They make use of the new `--package-metadata` flag that -is available in the bfd and gold linkers (binutils 2.39), in mold (1.3.0) -and in lld (15.0). This linker flag takes a JSON payload as its parameter. 
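The note layout described above can be sketched in a few lines of Python. This is only an illustration, not the reference implementation: it assumes the standard ELF note record layout (three 32-bit words for name size, descriptor size and type, followed by the 4-byte-padded owner name and descriptor) and little-endian byte order by default. The function names `parse_package_note` and `build_package_note` are hypothetical, and the builder exists only so that the example is self-contained.

```python
import json
import struct

NOTE_TYPE_PACKAGE = 0xCAFE1A7E  # note type given in the section header description
NOTE_OWNER = b"FDO"             # owner name given in the section header description


def _align4(n: int) -> int:
    """Round n up to the next multiple of 4 (ELF note padding)."""
    return (n + 3) & ~3


def parse_package_note(section: bytes, byteorder: str = "<") -> dict:
    """Decode the JSON payload from the raw bytes of a `.note.package` section.

    Assumes the section holds exactly one note record.
    """
    namesz, descsz, note_type = struct.unpack_from(byteorder + "III", section, 0)
    name_off = 12
    desc_off = name_off + _align4(namesz)
    owner = section[name_off:name_off + namesz].rstrip(b"\0")
    if owner != NOTE_OWNER or note_type != NOTE_TYPE_PACKAGE:
        raise ValueError("not a package metadata note")
    desc = section[desc_off:desc_off + descsz]
    # The payload is a zero-terminated UTF-8 string; keep every key found,
    # since consumers should not assume a fixed list of names.
    return json.loads(desc.rstrip(b"\0").decode("utf-8"))


def build_package_note(metadata: dict, byteorder: str = "<") -> bytes:
    """Hypothetical helper for this demo: serialize metadata into one note record."""
    name = NOTE_OWNER + b"\0"
    # ensure_ascii=False avoids \uXXXX escapes, which the format disallows.
    desc = json.dumps(metadata, ensure_ascii=False).encode("utf-8") + b"\0"
    header = struct.pack(byteorder + "III", len(name), len(desc), NOTE_TYPE_PACKAGE)
    return (header
            + name.ljust(_align4(len(name)), b"\0")
            + desc.ljust(_align4(len(desc)), b"\0"))


if __name__ == "__main__":
    note = build_package_note({"type": "rpm", "name": "coreutils", "os": "fedora"})
    print(parse_package_note(note))
```

A real consumer would first locate the section (or the note inside a core file) with an ELF library. The sketch covers only the step from raw note bytes to a dictionary, and it keeps unknown keys so that vendor-specific fields survive.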
- -## Well-known keys - -The metadata format is intentionally left open, so that vendors can add their own information. -A set of well-known keys is defined here, and hopefully shared among all vendors. - -| Key name | Key description | Example value | -|--------------|--------------------------------------------------------------------------|---------------------------------------| -| type | The packaging type | rpm | -| os | The OS name, typically corresponding to ID in os-release | fedora | -| osVersion | The OS version, typically corresponding to VERSION_ID in os-release | 33 | -| name | The source package name | coreutils | -| version | The source package version | 4711.0815.fc13 | -| architecture | The binary package architecture | arm32 | -| osCpe | A CPE name for the OS, typically corresponding to CPE_NAME in os-release | cpe:/o:fedoraproject:fedora:33 | -| debugInfoUrl | The debuginfod server URL, if available | https://debuginfod.fedoraproject.org/ | diff --git a/docs/_interfaces/ENVIRONMENT.md b/docs/_interfaces/ENVIRONMENT.md deleted file mode 100644 index eab1ce23e4..0000000000 --- a/docs/_interfaces/ENVIRONMENT.md +++ /dev/null @@ -1,644 +0,0 @@ ---- -title: Known Environment Variables -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# Known Environment Variables - -A number of systemd components take additional runtime parameters via -environment variables. Many of these environment variables are not supported at -the same level as command line switches and other interfaces are: we don't -document them in the man pages and we make no stability guarantees for -them. While they are generally unlikely to be dropped any time soon, we -do not want to guarantee that they stay around for good either. - -Below is a (non-exhaustive) list of the environment variables understood by -the various tools. Note that this list only covers environment variables not -documented in the proper man pages. 
- -All tools: - -* `$SYSTEMD_OFFLINE=[0|1]` — if set to `1`, then `systemctl` will refrain from - talking to PID 1; this has the same effect as the historical detection of - `chroot()`. Setting this variable to `0` instead has a similar effect as - `$SYSTEMD_IGNORE_CHROOT=1`; i.e. tools will try to communicate with PID 1 - even if a `chroot()` environment is detected. You almost certainly want to - set this to `1` if you maintain a package build system or similar and are - trying to use a modern container system and not plain `chroot()`. - -* `$SYSTEMD_IGNORE_CHROOT=1` — if set, don't check whether being invoked in a - `chroot()` environment. This is particularly relevant for systemctl, as it - will not alter its behaviour for `chroot()` environments if set. Normally it - refrains from talking to PID 1 in such a case; turning most operations such - as `start` into no-ops. If that's what's explicitly desired, you might - consider setting `$SYSTEMD_OFFLINE=1`. - -* `$SYSTEMD_FIRST_BOOT=0|1` — if set, assume "first boot" condition to be false - or true, instead of checking the flag file created by PID 1. - -* `$SD_EVENT_PROFILE_DELAYS=1` — if set, the sd-event event loop implementation - will print latency information at runtime. - -* `$SYSTEMD_PROC_CMDLINE` — if set, the contents are used as the kernel command - line instead of the actual one in `/proc/cmdline`. This is useful for - debugging, in order to test generators and other code against specific kernel - command lines. - -* `$SYSTEMD_OS_RELEASE` — if set, use this path instead of `/etc/os-release` or - `/usr/lib/os-release`. When operating under some root (e.g. `systemctl - --root=…`), the path is prefixed with the root. Only useful for debugging. - -* `$SYSTEMD_FSTAB` — if set, use this path instead of `/etc/fstab`. Only useful - for debugging. - -* `$SYSTEMD_SYSROOT_FSTAB` — if set, use this path instead of - `/sysroot/etc/fstab`. Only useful for debugging `systemd-fstab-generator`. 
- -* `$SYSTEMD_SYSFS_CHECK` — takes a boolean. If set, overrides sysfs container - detection that ignores `/dev/` entries in fstab. Only useful for debugging - `systemd-fstab-generator`. - -* `$SYSTEMD_CRYPTTAB` — if set, use this path instead of `/etc/crypttab`. Only - useful for debugging. Currently only supported by - `systemd-cryptsetup-generator`. - -* `$SYSTEMD_INTEGRITYTAB` — if set, use this path instead of - `/etc/integritytab`. Only useful for debugging. Currently only supported by - `systemd-integritysetup-generator`. - -* `$SYSTEMD_VERITYTAB` — if set, use this path instead of - `/etc/veritytab`. Only useful for debugging. Currently only supported by - `systemd-veritysetup-generator`. - -* `$SYSTEMD_EFI_OPTIONS` — if set, used instead of the string in the - `SystemdOptions` EFI variable. Analogous to `$SYSTEMD_PROC_CMDLINE`. - -* `$SYSTEMD_DEFAULT_HOSTNAME` — override the compiled-in fallback hostname - (relevant in particular for the system manager and `systemd-hostnamed`). - Must be a valid hostname (either a single label or a FQDN). - -* `$SYSTEMD_IN_INITRD` — takes a boolean. If set, overrides initrd detection. - This is useful for debugging and testing initrd-only programs in the main - system. - -* `$SYSTEMD_BUS_TIMEOUT=SECS` — specifies the maximum time to wait for method call - completion. If no time unit is specified, assumes seconds. The usual other units - are understood, too (us, ms, s, min, h, d, w, month, y). If it is not set or set - to 0, then the built-in default is used. - -* `$SYSTEMD_MEMPOOL=0` — if set, the internal memory caching logic employed by - hash tables is turned off, and libc `malloc()` is used for all allocations. - -* `$SYSTEMD_UTF8=` — takes a boolean value, and overrides whether to generate - non-ASCII special glyphs at various places (i.e. "→" instead of - "->"). 
Usually this is determined automatically, based on `$LC_CTYPE`, but in - scenarios where locale definitions are not installed it might make sense to - override this check explicitly. - -* `$SYSTEMD_EMOJI=0` — if set, tools such as `systemd-analyze security` will - not output graphical smiley emojis, but ASCII alternatives instead. Note that - this only controls use of Unicode emoji glyphs, and has no effect on other - Unicode glyphs. - -* `$RUNTIME_DIRECTORY` — various tools use this variable to locate the - appropriate path under `/run/`. This variable is also set by the manager when - `RuntimeDirectory=` is used, see systemd.exec(5). - -* `$SYSTEMD_CRYPT_PREFIX` — if set configures the hash method prefix to use for - UNIX `crypt()` when generating passwords. By default the system's "preferred - method" is used, but this can be overridden with this environment variable. - Takes a prefix such as `$6$` or `$y$`. (Note that this is only honoured on - systems built with libxcrypt and is ignored on systems using glibc's - original, internal `crypt()` implementation.) - -* `$SYSTEMD_SECCOMP=0` — if set, seccomp filters will not be enforced, even if - support for it is compiled in and available in the kernel. - -* `$SYSTEMD_LOG_SECCOMP=1` — if set, system calls blocked by seccomp filtering, - for example in `systemd-nspawn`, will be logged to the audit log, if the - kernel supports this. - -* `$SYSTEMD_ENABLE_LOG_CONTEXT` — if set, extra fields will always be logged to - the journal instead of only when logging in debug mode. - -* `$SYSTEMD_NETLINK_DEFAULT_TIMEOUT` — specifies the default timeout of waiting - replies for netlink messages from the kernel. Defaults to 25 seconds. - -* `$SYSTEMD_VERITY_SHARING=0` — if set, sharing dm-verity devices by - using a stable `<ROOTHASH>-verity` device mapper name will be disabled. 
- -* `$SYSTEMD_OPENSSL_KEY_LOADER` — when using OpenSSL to load a key via an engine - or a provider, can be used to force the usage of one or the other interface. - Set to 'engine' to force the usage of the old engine API, and to 'provider' - to force the usage of the new provider API. If unset, the provider will be tried - first and the engine as a fallback if that fails. Providers are the new OpenSSL - 3 API, but there are very few if any in a production-ready state, so engines - are still needed. - -`systemctl`: - -* `$SYSTEMCTL_FORCE_BUS=1` — if set, do not connect to PID 1's private D-Bus - listener, and instead always connect through the dbus-daemon D-Bus broker. - -* `$SYSTEMCTL_INSTALL_CLIENT_SIDE=1` — if set, enable or disable unit files on - the client side, instead of asking PID 1 to do this. - -* `$SYSTEMCTL_SKIP_SYSV=1` — if set, do not call SysV compatibility hooks. - -* `$SYSTEMCTL_SKIP_AUTO_KEXEC=1` — if set, do not automatically kexec instead of - reboot when a new kernel has been loaded. - -* `$SYSTEMCTL_SKIP_AUTO_SOFT_REBOOT=1` — if set, do not automatically soft-reboot - instead of reboot when a new root file system has been loaded in - `/run/nextroot/`. - -`systemd-nspawn`: - -* `$SYSTEMD_NSPAWN_UNIFIED_HIERARCHY=1` — if set, force `systemd-nspawn` into - unified cgroup hierarchy mode. - -* `$SYSTEMD_NSPAWN_API_VFS_WRITABLE=1` — if set, make `/sys/`, `/proc/sys/`, - and friends writable in the container. If set to "network", leave only - `/proc/sys/net/` writable. - -* `$SYSTEMD_NSPAWN_CONTAINER_SERVICE=…` — override the "service" name nspawn - uses to register with machined. If unset defaults to "nspawn", but with this - variable may be set to any other value. - -* `$SYSTEMD_NSPAWN_USE_CGNS=0` — if set, do not use cgroup namespacing, even if - it is available. - -* `$SYSTEMD_NSPAWN_LOCK=0` — if set, do not lock container images when running. 
- -* `$SYSTEMD_NSPAWN_TMPFS_TMP=0` — if set, do not overmount `/tmp/` in the - container with a tmpfs, but leave the directory from the image in place. - -* `$SYSTEMD_NSPAWN_CHECK_OS_RELEASE=0` — if set, do not fail when trying to - boot an OS tree without an os-release file (useful when trying to boot a - container with empty `/etc/` and bind-mounted `/usr/`) - -* `$SYSTEMD_SUPPRESS_SYNC=1` — if set, all disk synchronization syscalls are - blocked to the container payload (e.g. `sync()`, `fsync()`, `syncfs()`, …) - and the `O_SYNC`/`O_DSYNC` flags are made unavailable to `open()` and - friends. This is equivalent to passing `--suppress-sync=yes` on the - `systemd-nspawn` command line. - -* `$SYSTEMD_NSPAWN_NETWORK_MAC=...` — if set, allows users to set a specific MAC - address for a container, ensuring that it uses the provided value instead of - generating a random one. It is effective when used with `--network-veth`. The - expected format is six groups of two hexadecimal digits separated by colons, - e.g. `SYSTEMD_NSPAWN_NETWORK_MAC=12:34:56:78:90:AB` - -`systemd-logind`: - -* `$SYSTEMD_BYPASS_HIBERNATION_MEMORY_CHECK=1` — if set, report that - hibernation is available even if the swap devices do not provide enough room - for it. - -* `$SYSTEMD_REBOOT_TO_FIRMWARE_SETUP` — if set, overrides `systemd-logind`'s - built-in EFI logic of requesting a reboot into the firmware. Takes a boolean. - If set to false, the functionality is turned off entirely. If set to true, - instead of requesting a reboot into the firmware setup UI through EFI a file, - `/run/systemd/reboot-to-firmware-setup` is created whenever this is - requested. This file may be checked for by services run during system - shutdown in order to request the appropriate operation from the firmware in - an alternative fashion. - -* `$SYSTEMD_REBOOT_TO_BOOT_LOADER_MENU` — similar to the above, allows - overriding of `systemd-logind`'s built-in EFI logic of requesting a reboot - into the boot loader menu. 
Takes a boolean. If set to false, the - functionality is turned off entirely. If set to true, instead of requesting a - reboot into the boot loader menu through EFI, the file - `/run/systemd/reboot-to-boot-loader-menu` is created whenever this is - requested. The file contains the requested boot loader menu timeout in µs, - formatted in ASCII decimals, or zero in case no timeout is requested. This - file may be checked for by services run during system shutdown in order to - request the appropriate operation from the boot loader in an alternative - fashion. - -* `$SYSTEMD_REBOOT_TO_BOOT_LOADER_ENTRY` — similar to the above, allows - overriding of `systemd-logind`'s built-in EFI logic of requesting a reboot - into a specific boot loader entry. Takes a boolean. If set to false, the - functionality is turned off entirely. If set to true, instead of requesting a - reboot into a specific boot loader entry through EFI, the file - `/run/systemd/reboot-to-boot-loader-entry` is created whenever this is - requested. The file contains the requested boot loader entry identifier. This - file may be checked for by services run during system shutdown in order to - request the appropriate operation from the boot loader in an alternative - fashion. Note that by default only boot loader entries which follow the - [Boot Loader Specification](https://uapi-group.org/specifications/specs/boot_loader_specification) - and are placed in the ESP or the Extended Boot Loader partition may be - selected this way. However, if a directory `/run/boot-loader-entries/` - exists, the entries are loaded from there instead. The directory should - contain the usual directory hierarchy mandated by the Boot Loader - Specification, i.e. the entry drop-ins should be placed in - `/run/boot-loader-entries/loader/entries/*.conf`, and the files referenced by - the drop-ins (including the kernels and initrds) somewhere else below - `/run/boot-loader-entries/`. 
Note that all these files may be (and are - supposed to be) symlinks. `systemd-logind` will load these files on-demand, - these files can hence be updated (ideally atomically) whenever the boot - loader configuration changes. A foreign boot loader installer script should - hence synthesize drop-in snippets and symlinks for all boot entries at boot - or whenever they change if it wants to integrate with `systemd-logind`'s - APIs. - -`systemd-udevd` and sd-device library: - -* `$NET_NAMING_SCHEME=` — if set, takes a network naming scheme (i.e. one of - "v238", "v239", "v240"…, or the special value "latest") as parameter. If - specified udev's `net_id` builtin will follow the specified naming scheme - when determining stable network interface names. This may be used to revert - to naming schemes of older udev versions, in order to provide more stable - naming across updates. This environment variable takes precedence over the - kernel command line option `net.naming-scheme=`, except if the value is - prefixed with `:` in which case the kernel command line option takes - precedence, if it is specified as well. - -* `$SYSTEMD_DEVICE_VERIFY_SYSFS` — if set to "0", disables verification that - devices sysfs path are actually backed by sysfs. Relaxing this verification - is useful for testing purposes. - -* `$SYSTEMD_UDEV_EXTRA_TIMEOUT_SEC=` — Specifies an extra timespan that the - udev manager process waits for a worker process kills slow programs specified - by IMPORT{program}=, PROGRAM=, or RUN=, and finalizes the processing event. - If the worker process cannot finalize the event within the specified timespan, - the worker process is killed by the manager process. Defaults to 10 seconds, - maximum allowed is 5 hours. - -`udevadm` and `systemd-hwdb`: - -* `SYSTEMD_HWDB_UPDATE_BYPASS=` — If set to "1", execution of hwdb updates is skipped - when `udevadm hwdb --update` or `systemd-hwdb update` are invoked. 
This can - be useful if either of these tools is invoked unconditionally as a child - process by another tool, such as package managers running either of these - tools in a postinstall script. - -`nss-systemd`: - -* `$SYSTEMD_NSS_BYPASS_SYNTHETIC=1` — if set, `nss-systemd` won't synthesize - user/group records for the `root` and `nobody` users if they are missing from - `/etc/passwd`. - -* `$SYSTEMD_NSS_DYNAMIC_BYPASS=1` — if set, `nss-systemd` won't return - user/group records for dynamically registered service users (i.e. users - registered through `DynamicUser=1`). - -`systemd-timedated`: - -* `$SYSTEMD_TIMEDATED_NTP_SERVICES=…` — colon-separated list of unit names of - NTP client services. If set, `timedatectl set-ntp on` enables and starts the - first existing unit listed in the environment variable, and - `timedatectl set-ntp off` disables and stops all listed units. - -`systemd-sulogin-shell`: - -* `$SYSTEMD_SULOGIN_FORCE=1` — This skips asking for the root password if the - root password is not available (such as when the root account is locked). - See `sulogin(8)` for more details. - -`bootctl` and other tools that access the EFI System Partition (ESP): - -* `$SYSTEMD_RELAX_ESP_CHECKS=1` — if set, the ESP validation checks are - relaxed. Specifically, validation checks that ensure the specified ESP path - is a FAT file system are turned off, as are checks that the path is located - on a GPT partition with the correct type UUID. - -* `$SYSTEMD_ESP_PATH=…` — override the path to the EFI System Partition. This - may be used to override ESP path auto detection, and redirect any accesses to - the ESP to the specified directory. Note that unlike with `bootctl`'s - `--path=` switch only very superficial validation of the specified path is - done when this environment variable is used. - -* `$KERNEL_INSTALL_CONF_ROOT=…` — override the built-in default configuration - directory `/etc/kernel/` to read files like `entry-token` and `install.conf` from. 
- -`systemd` itself: - -* `$SYSTEMD_ACTIVATION_UNIT` — set for all NSS and PAM module invocations that - are done by the service manager on behalf of a specific unit, in child - processes that are later (after execve()) going to become unit - processes. Contains the full unit name (e.g. "foobar.service"). NSS and PAM - modules can use this information to determine in which context and on whose - behalf they are being called, which may be useful to avoid deadlocks, for - example to bypass IPC calls to the very service that is about to be - started. Note that NSS and PAM modules should be careful to only rely on this - data when invoked privileged, or possibly only when getppid() returns 1, as - setting environment variables is of course possible in any even unprivileged - contexts. - -* `$SYSTEMD_ACTIVATION_SCOPE` — closely related to `$SYSTEMD_ACTIVATION_UNIT`, - it is either set to `system` or `user` depending on whether the NSS/PAM - module is called by systemd in `--system` or `--user` mode. - -* `$SYSTEMD_SUPPORT_DEVICE`, `$SYSTEMD_SUPPORT_MOUNT`, `$SYSTEMD_SUPPORT_SWAP` - - can be set to `0` to mark respective unit type as unsupported. Generally, - having less units saves system resources so these options might be useful - for cases where we don't need to track given unit type, e.g. `--user` manager - often doesn't need to deal with device or swap units because they are - handled by the `--system` manager (PID 1). Note that setting certain unit - type as unsupported may not prevent loading some units of that type if they - are referenced by other units of another supported type. - -* `$SYSTEMD_DEFAULT_MOUNT_RATE_LIMIT_BURST` — can be set to override the mount - units burst rate limit for parsing `/proc/self/mountinfo`. On a system with - few resources but many mounts the rate limit may be hit, which will cause the - processing of mount units to stall. The burst limit may be adjusted when the - default is not appropriate for a given system. 
Defaults to `5`, accepts - positive integers. - -`systemd-remount-fs`: - -* `$SYSTEMD_REMOUNT_ROOT_RW=1` — if set and no entry for the root directory - exists in `/etc/fstab` (this file always takes precedence), then the root - directory is remounted writable. This is primarily used by - `systemd-gpt-auto-generator` to ensure the root partition is mounted writable - in accordance with the GPT partition flags. - -`systemd-firstboot` and `localectl`: - -* `$SYSTEMD_LIST_NON_UTF8_LOCALES=1` — if set, non-UTF-8 locales are listed among - the installed ones. By default non-UTF-8 locales are suppressed from the - selection, since we are living in the 21st century. - -`systemd-resolved`: - -* `$SYSTEMD_RESOLVED_SYNTHESIZE_HOSTNAME` — if set to "0", `systemd-resolved` - won't synthesize the system hostname for either regular or reverse lookups. - -`systemd-sysext`: - -* `$SYSTEMD_SYSEXT_HIERARCHIES` — this variable may be used to override which - hierarchies are managed by `systemd-sysext`. By default only `/usr/` and - `/opt/` are managed, and directories may be added to or removed from that list by - setting this environment variable to a colon-separated list of absolute - paths. Only "real" file systems and directories that only contain "real" file - systems as submounts should be used. Do not specify API file systems such as - `/proc/` or `/sys/` here, or hierarchies that have them as submounts. In - particular, do not specify the root directory `/` here. Similarly, - `$SYSTEMD_CONFEXT_HIERARCHIES` works for confext images and supports the - systemd-confext multi-call functionality of sysext. - -`systemd-tmpfiles`: - -* `$SYSTEMD_TMPFILES_FORCE_SUBVOL` — if unset, `v`/`q`/`Q` lines will create - subvolumes only if the OS itself is installed into a subvolume. If set to `1` - (or another value interpreted as true), these lines will always create - subvolumes if the backing filesystem supports them. If set to `0`, these - lines will always create directories. 
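The three-way behaviour of `$SYSTEMD_TMPFILES_FORCE_SUBVOL` (unset, true, false) can be read as a small decision function. The Python sketch below is an illustration of the rule as described here, not the actual `systemd-tmpfiles` code. The names `parse_boolean` and `should_create_subvolume` are hypothetical, the accepted boolean spellings are assumed to be the ones systemd commonly accepts, and treating an unparsable value like an unset one is an assumption as well.

```python
# Assumed boolean spellings (case-insensitive); labeled as an assumption above.
_TRUE = {"1", "yes", "y", "true", "t", "on"}
_FALSE = {"0", "no", "n", "false", "f", "off"}


def parse_boolean(value: str) -> bool:
    """Parse a boolean string, raising ValueError for unknown spellings."""
    v = value.lower()
    if v in _TRUE:
        return True
    if v in _FALSE:
        return False
    raise ValueError(f"not a boolean: {value!r}")


def should_create_subvolume(env: dict, os_on_subvolume: bool,
                            fs_supports_subvolumes: bool) -> bool:
    """Decide whether a v/q/Q line creates a subvolume or a plain directory."""
    raw = env.get("SYSTEMD_TMPFILES_FORCE_SUBVOL")
    forced = None
    if raw is not None:
        try:
            forced = parse_boolean(raw)
        except ValueError:
            forced = None  # assumption: fall back to the "unset" behaviour
    # Unset: follow the OS installation. Set: follow the explicit request.
    want = os_on_subvolume if forced is None else forced
    # A subvolume is only possible if the backing file system supports them.
    return want and fs_supports_subvolumes


if __name__ == "__main__":
    print(should_create_subvolume({}, os_on_subvolume=False,
                                  fs_supports_subvolumes=True))
    print(should_create_subvolume({"SYSTEMD_TMPFILES_FORCE_SUBVOL": "1"},
                                  os_on_subvolume=False,
                                  fs_supports_subvolumes=True))
```

Passing the environment as a plain dictionary instead of reading `os.environ` directly keeps the function easy to exercise with different settings.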
- -`systemd-sysusers` - -* `$SOURCE_DATE_EPOCH` — if unset, the field of the date of last password change - in `/etc/shadow` will be the number of days from Jan 1, 1970 00:00 UTC until - today. If `$SOURCE_DATE_EPOCH` is set to a valid UNIX epoch value in seconds, - then the field will be the number of days until that time instead. This is to - support creating bit-by-bit reproducible system images by choosing a - reproducible value for the field of the date of last password change in - `/etc/shadow`. See: https://reproducible-builds.org/specs/source-date-epoch/ - -`systemd-sysv-generator`: - -* `$SYSTEMD_SYSVINIT_PATH` — Controls where `systemd-sysv-generator` looks for - SysV init scripts. - -* `$SYSTEMD_SYSVRCND_PATH` — Controls where `systemd-sysv-generator` looks for - SysV init script runlevel link farms. - -systemd tests: - -* `$SYSTEMD_TEST_DATA` — override the location of test data. This is useful if - a test executable is moved to an arbitrary location. - -* `$SYSTEMD_TEST_NSS_BUFSIZE` — size of scratch buffers for "reentrant" - functions exported by the nss modules. - -* `$TESTFUNCS` – takes a colon separated list of test functions to invoke, - causes all non-matching test functions to be skipped. Only applies to tests - using our regular test boilerplate. - -fuzzers: - -* `$SYSTEMD_FUZZ_OUTPUT` — A boolean that specifies whether to write output to - stdout. Setting to true is useful in manual invocations, since all output is - suppressed by default. - -* `$SYSTEMD_FUZZ_RUNS` — The number of times execution should be repeated in - manual invocations. - -Note that it may be also useful to set `$SYSTEMD_LOG_LEVEL`, since all logging -is suppressed by default. - -`systemd-importd`: - -* `$SYSTEMD_IMPORT_BTRFS_SUBVOL` — takes a boolean, which controls whether to - prefer creating btrfs subvolumes over plain directories for machine - images. Has no effect on non-btrfs file systems where subvolumes are not - available anyway. If not set, defaults to true. 
- -* `$SYSTEMD_IMPORT_BTRFS_QUOTA` — takes a boolean, which controls whether to set - up quota automatically for created btrfs subvolumes for machine images. If - not set, defaults to true. Has no effect if machines are placed in regular - directories, because btrfs subvolumes are not supported or disabled. If - enabled, the quota group of the subvolume is automatically added to a - combined quota group for all such machine subvolumes. - -* `$SYSTEMD_IMPORT_SYNC` — takes a boolean, which controls whether to - synchronize images to disk after installing them, before completing the - operation. If not set, defaults to true. If disabled installation of images - will be quicker, but not as safe. - -`systemd-dissect`, `systemd-nspawn` and all other tools that may operate on -disk images with `--image=` or similar: - -* `$SYSTEMD_DISSECT_VERITY_SIDECAR` — takes a boolean, which controls whether to - load "sidecar" Verity metadata files. If enabled (which is the default), - whenever a disk image is used, a set of files with the `.roothash`, - `.usrhash`, `.roothash.p7s`, `.usrhash.p7s`, `.verity` suffixes are searched - adjacent to disk image file, containing the Verity root hashes, their - signatures or the Verity data itself. If disabled this automatic discovery of - Verity metadata files is turned off. - -* `$SYSTEMD_DISSECT_VERITY_EMBEDDED` — takes a boolean, which controls whether - to load the embedded Verity signature data. If enabled (which is the - default), Verity root hash information and a suitable signature is - automatically acquired from a signature partition, following the - [Discoverable Partitions Specification](https://uapi-group.org/specifications/specs/discoverable_partitions_specification). - If disabled any such partition is ignored. Note that this only disables - discovery of the root hash and its signature, the Verity data partition - itself is still searched in the GPT image. 
- -* `$SYSTEMD_DISSECT_VERITY_SIGNATURE` — takes a boolean, which controls whether - to validate the signature of the Verity root hash if available. If enabled - (which is the default), the signature of suitable disk images is validated - against any of the certificates in `/etc/verity.d/*.crt` (and similar - directories in `/usr/lib/`, `/run`, …) or passed to the kernel for validation - against its built-in certificates. - -* `$SYSTEMD_DISSECT_VERITY_TIMEOUT_SEC=sec` — takes a timespan, which controls - the timeout waiting for the image to be configured. Defaults to 100 msec. - -* `$SYSTEMD_DISSECT_FILE_SYSTEMS=` — takes a colon-separated list of file - systems that may be mounted for automatically dissected disk images. If not - specified defaults to something like: `ext4:btrfs:xfs:vfat:erofs:squashfs` - -* `$SYSTEMD_LOOP_DIRECT_IO` – takes a boolean, which controls whether to enable - `LO_FLAGS_DIRECT_IO` (i.e. direct IO + asynchronous IO) on loopback block - devices when opening them. Defaults to on, set this to "0" to disable this - feature. - -`systemd-cryptsetup`: - -* `$SYSTEMD_CRYPTSETUP_USE_TOKEN_MODULE` – takes a boolean, which controls - whether to use the libcryptsetup "token" plugin module logic even when - activating via FIDO2, PKCS#11, TPM2, i.e. mechanisms natively supported by - `systemd-cryptsetup`. Defaults to enabled. - -* `$SYSTEMD_CRYPTSETUP_TOKEN_PATH` – takes a path to a directory in the file - system. If specified overrides where libcryptsetup will look for token - modules (.so). This is useful for debugging token modules: set this - environment variable to the build directory and you are set. This variable - is only supported when systemd is compiled in developer mode. - -Various tools that read passwords from the TTY, such as `systemd-cryptenroll` -and `homectl`: - -* `$PASSWORD` — takes a string: the literal password to use. If this - environment variable is set it is used as password instead of prompting the - user interactively. 
This exists primarily for debugging and testing - purposes. Do not use this for production code paths, since environment - variables are typically inherited down the process tree without restrictions - and should thus not be used for secrets. - -* `$NEWPASSWORD` — similar to `$PASSWORD` above, but is used when both a - current and a future password are required, for example if the password is to - be changed. In that case `$PASSWORD` shall carry the current (i.e. old) - password and `$NEWPASSWORD` the new. - -`systemd-homed`: - -* `$SYSTEMD_HOME_ROOT` – defines an absolute path where to look for home - directories/images. When unspecified defaults to `/home/`. This is useful for - debugging purposes in order to run a secondary `systemd-homed` instance that - operates on a different directory where home directories/images are placed. - -* `$SYSTEMD_HOME_RECORD_DIR` – defines an absolute path where to look for - fixated home records kept on the host. When unspecified defaults to - `/var/lib/systemd/home/`. Similar to `$SYSTEMD_HOME_ROOT` this is useful for - debugging purposes, in order to run a secondary `systemd-homed` instance that - operates on a record database entirely separate from the host's. - -* `$SYSTEMD_HOME_DEBUG_SUFFIX` – takes a short string that is suffixed to - `systemd-homed`'s D-Bus and Varlink service names/sockets. This is also - understood by `homectl`. This too is useful for running an additional copy of - `systemd-homed` that doesn't interfere with the host's main one. - -* `$SYSTEMD_HOMEWORK_PATH` – configures the path to the `systemd-homework` - binary to invoke. If not specified defaults to - `/usr/lib/systemd/systemd-homework`. 
- - Combining these four environment variables is pretty useful when - debugging/developing `systemd-homed`: -```sh -SYSTEMD_HOME_DEBUG_SUFFIX=foo \ - SYSTEMD_HOMEWORK_PATH=/home/lennart/projects/systemd/build/systemd-homework \ - SYSTEMD_HOME_ROOT=/home.foo/ \ - SYSTEMD_HOME_RECORD_DIR=/var/lib/systemd/home.foo/ \ - /home/lennart/projects/systemd/build/systemd-homed -``` - -* `$SYSTEMD_HOME_MOUNT_OPTIONS_BTRFS`, `$SYSTEMD_HOME_MOUNT_OPTIONS_EXT4`, - `$SYSTEMD_HOME_MOUNT_OPTIONS_XFS` – configure the default mount options to - use for LUKS home directories, overriding the built-in default mount - options. There's one variable for each of the supported file systems for the - LUKS home directory backend. - -* `$SYSTEMD_HOME_MKFS_OPTIONS_BTRFS`, `$SYSTEMD_HOME_MKFS_OPTIONS_EXT4`, - `$SYSTEMD_HOME_MKFS_OPTIONS_XFS` – configure additional arguments to use for - `mkfs` when formatting LUKS home directories. There's one variable for each - of the supported file systems for the LUKS home directory backend. - -`kernel-install`: - -* `$KERNEL_INSTALL_BYPASS` – If set to "1", execution of kernel-install is skipped - when kernel-install is invoked. This can be useful if kernel-install is invoked - unconditionally as a child process by another tool, such as package managers - running kernel-install in a postinstall script. - -`systemd-journald`, `journalctl`: - -* `$SYSTEMD_JOURNAL_COMPACT` – Takes a boolean. If enabled, journal files are written - in a more compact format that reduces the amount of disk space required by the - journal. Note that journal files in compact mode are limited to 4G to allow use of - 32-bit offsets. Enabled by default. - -* `$SYSTEMD_JOURNAL_COMPRESS` – Takes a boolean, or one of the compression - algorithms "XZ", "LZ4", and "ZSTD". If enabled, the default compression - algorithm set at compile time will be used when opening a new journal file. - If disabled, the journal file compression will be disabled. 
Note that the - compression mode of existing journal files is not changed. To make the - specified algorithm take effect immediately, you need to explicitly run - `journalctl --rotate`. - -* `$SYSTEMD_CATALOG` – path to the compiled catalog database file to use for - `journalctl -x`, `journalctl --update-catalog`, `journalctl --list-catalog` - and related calls. - -* `$SYSTEMD_CATALOG_SOURCES` – path to the catalog database input source - directory to use for `journalctl --update-catalog`. - -`systemd-pcrextend`, `systemd-cryptsetup`: - -* `$SYSTEMD_FORCE_MEASURE=1` — If set, force measuring of resources (which are - marked for measurement) even if not booted on a kernel equipped with - systemd-stub. Normally, requested measurement of resources is conditionalized - on kernels that have booted with `systemd-stub`. With this environment - variable the test for that may be bypassed, for testing purposes. - -`systemd-repart`: - -* `$SYSTEMD_REPART_MKFS_OPTIONS_<FSTYPE>` – configure additional arguments to use for - `mkfs` when formatting partition file systems. There's one variable for each - of the supported file systems. - -* `$SYSTEMD_REPART_OVERRIDE_FSTYPE` – if set the value will override the file - system type specified in Format= lines in partition definition files. - -`systemd-nspawn`, `systemd-networkd`: - -* `$SYSTEMD_FIREWALL_BACKEND` – takes a string, either `iptables` or - `nftables`. Selects the firewall backend to use. If not specified tries to - use `nftables` and falls back to `iptables` if that's not available. - -`systemd-storagetm`: - -* `$SYSTEMD_NVME_MODEL`, `$SYSTEMD_NVME_FIRMWARE`, `$SYSTEMD_NVME_SERIAL`, - `$SYSTEMD_NVME_UUID` – these take a model string, firmware version string, - serial number string, and UUID formatted as string. If specified these - override the defaults exposed on the NVME subsystem and namespace, which are - derived from the underlying block device and system identity. 
Do not set the - latter two via the environment variable unless `systemd-storagetm` is invoked - to expose a single device only, since those identifiers better should be kept - unique. - -`systemd-pcrlock`, `systemd-pcrextend`: - -* `$SYSTEMD_MEASURE_LOG_USERSPACE` – the path to the `tpm2-measure.log` file - (containing userspace measurement data) to read. This allows overriding the - default of `/run/log/systemd/tpm2-measure.log`. - -* `$SYSTEMD_MEASURE_LOG_FIRMWARE` – the path to the `binary_bios_measurements` - file (containing firmware measurement data) to read. This allows overriding - the default of `/sys/kernel/security/tpm0/binary_bios_measurements`. - -Tools using the Varlink protocol (such as `varlinkctl`) or sd-bus (such as -`busctl`): - -* `$SYSTEMD_SSH` – the ssh binary to invoke when the `ssh:` transport is - used. May be a filename (which is searched for in `$PATH`) or absolute path. - -* `$SYSTEMD_VARLINK_LISTEN` – interpreted by some tools that provide a Varlink - service. Takes a file system path: if specified the tool will listen on an - `AF_UNIX` stream socket on the specified path in addition to whatever else it - would listen on. diff --git a/docs/_interfaces/FILE_DESCRIPTOR_STORE.md b/docs/_interfaces/FILE_DESCRIPTOR_STORE.md deleted file mode 100644 index 206dda7038..0000000000 --- a/docs/_interfaces/FILE_DESCRIPTOR_STORE.md +++ /dev/null @@ -1,213 +0,0 @@ ---- -title: File Descriptor Store -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# The File Descriptor Store - -*TL;DR: The systemd service manager may optionally maintain a set of file -descriptors for each service. Those file descriptors are under control of the -service. 
Storing file descriptors in the manager makes it easier to restart -services without dropping connections or losing state.* - -Since its inception `systemd` has supported the *socket* *activation* -mechanism: the service manager creates and listens on some sockets (and similar -UNIX file descriptors) on behalf of a service, and then passes them to the -service during activation of the service via UNIX file descriptor (short: *fd*) -passing over `execve()`. This is primarily exposed in the -[.socket](https://www.freedesktop.org/software/systemd/man/systemd.socket.html) -unit type. - -The *file* *descriptor* *store* (short: *fdstore*) extends this concept, and -allows services to *upload* during runtime additional fds to the service -manager that it shall keep on its behalf. File descriptors are passed back to -the service on subsequent activations, the same way as any socket activation -fds are passed. - -If a service fd is passed to the fdstore logic of the service manager it only -maintains a duplicate of it (in the sense of UNIX -[`dup(2)`](https://man7.org/linux/man-pages/man2/dup.2.html)), the fd remains -also in possession of the service itself, and it may (and is expected to) -invoke any operations on it that it likes. - -The primary use-case of this logic is to permit services to restart seamlessly -(for example to update them to a newer version), without losing execution -context, dropping pinned resources, terminating established connections or even -just momentarily losing connectivity. In fact, as the file descriptors can be -uploaded freely at any time during the service runtime, this can even be used -to implement services that robustly handle abnormal termination and can recover -from that without losing pinned resources. - -Note that Linux supports the -[`memfd`](https://man7.org/linux/man-pages/man2/memfd_create.2.html) concept -that allows associating a memory-backed fd with arbitrary data. 
This may -conveniently be used to serialize service state into and then place in the -fdstore, in order to implement service restarts with full service state being -passed over. - -## Basic Mechanism - -The fdstore is enabled per-service via the -[`FileDescriptorStoreMax=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#FileDescriptorStoreMax=) -service setting. It defaults to zero (which means the fdstore logic is turned -off), but can take an unsigned integer value that controls how many fds to -permit the service to upload to the service manager to keep simultaneously. - -If set to values > 0, the fdstore is enabled. When invoked the service may now -(asynchronously) upload file descriptors to the fdstore via the -[`sd_pid_notify_with_fds()`](https://www.freedesktop.org/software/systemd/man/sd_pid_notify_with_fds.html) -API call (or an equivalent re-implementation). When uploading the fds it is -necessary to set the `FDSTORE=1` field in the message, to indicate what the fd -is intended for. It's recommended to also set the `FDNAME=…` field to any -string of choice, which may be used to identify the fd later. - -Whenever the service is restarted the fds in its fdstore will be passed to the -new instance following the same protocol as for socket activation fds, i.e. the -`$LISTEN_FDS`, `$LISTEN_PID`, `$LISTEN_FDNAMES` environment variables will be -set (the latter will be populated from the `FDNAME=…` field mentioned -above). See -[`sd_listen_fds()`](https://www.freedesktop.org/software/systemd/man/sd_listen_fds.html) -for details on receiving such fds in a service. (Note that the name set in -`FDNAME=…` does not need to be unique, which is useful when operating with -multiple fully equivalent sockets or similar, for example for a service that -both operates on IPv4 and IPv6 and treats both more or less the same.) - -And that's already the gist of it. 
- -## Seamless Service Restarts - -A system service that provides a client-facing interface that shall be able to -seamlessly restart can make use of this in a scheme like the following: -whenever a new connection comes in it uploads its fd immediately into its -fdstore. At appropriate times it also serializes its state into a memfd it -uploads to the service manager — either whenever the state changed -sufficiently, or simply right before it terminates. (The latter of course means -that state only survives on *clean* restarts and abnormal termination implies the -state is lost completely — while the former would mean there's a good chance the -next restart after an abnormal termination could continue where it left off -with only some context lost.) - -Using the fdstore for such seamless service restarts is generally recommended -over implementations that attempt to leave a process from the old service -instance around until after the new instance already started, so that the old -then communicates with the new service instance, and passes the fds over -directly. Typically service restarts are a mechanism for implementing *code* -updates, hence leaving two versions of the service running at the same time is -generally problematic. It also collides with the systemd service manager's -general principle of guaranteeing a pristine execution environment, a pristine -security context, and a pristine resource management context for freshly -started services, without uncontrolled "leftovers" from previous runs. For -example: leaving processes from previous runs generally negatively affects -lifecycle management (i.e. 
`KillMode=none` must be set), which disables large -parts of the service manager's state tracking, resource management (as resource -counters cannot start at zero during service activation anymore, since the old -processes remaining skew them), security policies (as processes with possibly -out-of-date security policies – SELinux, AppArmor, any LSM, seccomp, BPF — in -effect remain), and similar. - -## File Descriptor Store Lifecycle - -By default any file descriptor stored in the fdstore for which a `POLLHUP` or -`POLLERR` is seen is automatically closed and removed from the fdstore. This -behavior can be turned off, by setting the `FDPOLL=0` field when uploading the -fd via `sd_pid_notify_with_fds()`. - -The fdstore is automatically closed whenever the service is fully deactivated -and no jobs are queued for it anymore. This means that a restart job for a -service will leave the fdstore intact, but a separate stop and start job for -it — executed synchronously one after the other — will likely not. - -This behavior can be modified via the -[`FileDescriptorStorePreserve=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#FileDescriptorStorePreserve=) -setting in service unit files. If set to `yes` the fdstore will be kept as long -as the service definition is loaded into memory by the service manager, i.e. as -long as at least one other loaded unit has a reference to it. - -The `systemctl clean --what=fdstore …` command may be used to explicitly clear -the fdstore of a service. This is only allowed when the service is fully -deactivated, and is hence primarily useful in case -`FileDescriptorStorePreserve=yes` is set (because the fdstore is otherwise -fully closed anyway in this state). - -Individual file descriptors may be removed from the fdstore via the -`sd_notify()` mechanism, by sending an `FDSTOREREMOVE=1` message, accompanied -by an `FDNAME=…` string identifying the fds to remove. 
(The name does not have -to be unique, as mentioned, in which case *all* matching fds are -closed). Generally it's a good idea to send such messages to the service -manager during initialization of the service whenever an unrecognized fd is -received, to make the service robust for code updates: if an old version -uploaded an fd that the new version doesn't recognize anymore it's a good idea to -close it both in the service and in the fdstore. - -Note that storing a duplicate of an fd in the fdstore means the resource pinned -by the fd remains pinned even if the service closes its duplicate of the -fd. This in particular means that peers on a connection socket uploaded this -way will not receive an automatic `POLLHUP` event anymore if the service code -issues `close()` on the socket. It must accompany it with an `FDSTOREREMOVE=1` -notification to the service manager, so that the fd is comprehensively closed. - -## Access Control - -Access to the fds in the file descriptor store is generally restricted to the -service code itself. Pushing fds into or removing fds from the fdstore is -subject to the access control restrictions of any other `sd_notify()` message, -which is controlled via -[`NotifyAccess=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#NotifyAccess=). - -By default only the main service process hence can push/remove fds, but by -setting `NotifyAccess=all` this may be relaxed to allow arbitrary service -child processes to do the same. - -## Soft Reboot - -The fdstore is particularly interesting in [soft -reboot](https://www.freedesktop.org/software/systemd/man/systemd-soft-reboot.service.html) -scenarios, as per `systemctl soft-reboot` (which restarts userspace like in a -real reboot, but leaves the kernel running). File descriptor stores that remain -loaded at the very end of the system cycle — just before the soft-reboot – are -passed over to the next system cycle, and propagated to services they originate -from there. 
This enables updating the full userspace of a system during -runtime, fully replacing all processes without losing pinned resources, -interrupting connectivity or established connections and similar. - -This mechanism can be enabled either by making sure the service survives until -the very end (i.e. by setting `DefaultDependencies=no` so that it keeps running -for the whole system lifetime without being regularly deactivated at shutdown) -or by setting `FileDescriptorStorePreserve=yes` (and referencing the unit -continuously). - -For further details see [Resource -Pass-Through](https://www.freedesktop.org/software/systemd/man/systemd-soft-reboot.service.html#Resource%20Pass-Through). - -## Initrd Transitions - -The fdstore may also be used to pass file descriptors for resources from the -initrd context to the main system. Restarting all processes after the -transition is important as code running in the initrd should generally not -continue to run after the switch to the host file system, since that pins -backing files from the initrd, and the initrd might contain different versions -of programs than the host. - -Any service that still runs during the initrd→host transition will have its -fdstore passed over the transition, where it will be passed back to any queued -services of the same name. - -The soft reboot cycle transition and the initrd→host transition are -semantically very similar, hence similar rules apply, and in both cases it is -recommended to use the fdstore if pinned resources shall be passed over. - -## Debugging - -The -[`systemd-analyze`](https://www.freedesktop.org/software/systemd/man/systemd-analyze.html#systemd-analyze%20fdstore%20%5BUNIT...%5D) -tool may be used to list the current contents of the fdstore of any running -service. - -The -[`systemd-run`](https://www.freedesktop.org/software/systemd/man/systemd-run.html) -tool may be used to quickly start a testing binary or similar as a service. 
Use -`-p FileDescriptorStoreMax=4711` to enable the fdstore from `systemd-run`'s -command line. By using the `-t` switch you can even interactively communicate -with processes spawned that way, via the TTY. diff --git a/docs/_interfaces/INITRD_INTERFACE.md b/docs/_interfaces/INITRD_INTERFACE.md deleted file mode 100644 index 0461ae2607..0000000000 --- a/docs/_interfaces/INITRD_INTERFACE.md +++ /dev/null @@ -1,70 +0,0 @@ ---- -title: Initrd Interface -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - - -# The initrd Interface of systemd - -The Linux initrd mechanism (short for "initial RAM disk", also known as -"initramfs") refers to a small file system archive that is unpacked by the -kernel and contains the first userspace code that runs. It typically finds and -transitions into the actual root file system to use. systemd supports both -initrd and initrd-less boots. If an initrd is used, it is a good idea to pass a -few bits of runtime information from the initrd to systemd in order to avoid -duplicate work and to provide performance data to the administrator. In this -page we attempt to roughly describe the interfaces that exist between the -initrd and systemd. These interfaces are currently used by -[mkosi](https://github.com/systemd/mkosi)-generated initrds, dracut and the -Arch Linux initrds. - -* The initrd should mount `/run/` as a tmpfs and pass it pre-mounted when - jumping into the main system when executing systemd. The mount options should - be `mode=0755,nodev,nosuid,strictatime`. - -* It's highly recommended that the initrd also mounts `/usr/` (if split off) as - appropriate and passes it pre-mounted to the main system, to avoid the - problems described in [Booting without /usr is - Broken](https://www.freedesktop.org/wiki/Software/systemd/separate-usr-is-broken). - -* If the executable `/run/initramfs/shutdown` exists systemd will use it to - jump back into the initrd on shutdown. 
`/run/initramfs/` should be a usable - initrd environment to which systemd will pivot back and the `shutdown` - executable in it should be able to detach all complex storage that for - example was needed to mount the root file system. It's the job of the initrd - to set up this directory and executable in the right way so that this works - correctly. The shutdown binary is invoked with the shutdown verb as `argv[1]`, - optionally followed (in `argv[2]`, `argv[3]`, …) by systemd's original command - line options, for example `--log-level=` and similar. - -* Storage daemons run from the initrd should follow the guide on - [systemd and Storage Daemons for the Root File System](ROOT_STORAGE_DAEMONS) - to survive properly from the boot initrd all the way to the point where - systemd jumps back into the initrd for shutdown. - -One last clarification: we use the term _initrd_ very generically here -describing any kind of early boot file system, regardless of whether that might be -implemented as an actual ramdisk, ramfs or tmpfs. We recommend using _initrd_ -in this sense as a term that is unrelated to the actual backing technologies -used. - -## Using systemd inside an initrd - -It is also possible and recommended to implement the initrd itself based on -systemd. Here are a few terse notes: - -* Provide `/etc/initrd-release` in the initrd image. The idea is that it - follows the same format as the usual `/etc/os-release` but describes the - initrd implementation rather than the OS. systemd uses the existence of this - file as a flag whether to run in initrd mode, or not. - -* When run in initrd mode, systemd and its components will read a couple of - additional command line arguments, which are generally prefixed with `rd.` - -* To transition into the main system image invoke `systemctl switch-root`. - -* The switch-root operation will result in a killing spree of all running - processes. 
Some processes might need to be excluded from that, see the guide - on [systemd and Storage Daemons for the Root File System](ROOT_STORAGE_DAEMONS). diff --git a/docs/_interfaces/JOURNAL_EXPORT_FORMATS.md b/docs/_interfaces/JOURNAL_EXPORT_FORMATS.md deleted file mode 100644 index e1eb0d36d1..0000000000 --- a/docs/_interfaces/JOURNAL_EXPORT_FORMATS.md +++ /dev/null @@ -1,158 +0,0 @@ ---- -title: Journal Export Formats -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# Journal Export Formats - -## Journal Export Format - -_Note that this document describes the binary serialization format of journals only, as used for transfer across the network. -For interfacing with web technologies there's the Journal JSON Format, described below. -The binary format on disk is documented as the [Journal File Format](JOURNAL_FILE_FORMAT)._ - -_Before reading on, please make sure you are aware of the [basic properties of journal entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html), in particular realize that they may include binary non-text data (though usually don't), and the same field might have multiple values assigned within the same entry (though usually hasn't)._ - -When exporting journal data for other uses or transferring it via the network/local IPC the _journal export format_ is used. It's a simple serialization of journal entries, that is easy to read without any special tools, but still binary safe where necessary. The format is like this: - -* Two journal entries that follow each other are separated by a double newline. -* Journal fields consisting only of valid non-control UTF-8 codepoints are serialized as they are (i.e. the field name, followed by '=', followed by field data), followed by a newline as separator to the next field. Note that fields containing newlines cannot be formatted like this. 
Non-control UTF-8 codepoints are the codepoints with value at or above 32 (' '), or equal to 9 (TAB). -* Other journal fields are serialized in a special binary safe way: field name, followed by newline, followed by a binary 64-bit little endian size value, followed by the binary field data, followed by a newline as separator to the next field. -* Entry metadata that is not actually a field is serialized like it was a field, but beginning with two underscores. More specifically, `__CURSOR=`, `__REALTIME_TIMESTAMP=`, `__MONOTONIC_TIMESTAMP=`, `__SEQNUM=`, `__SEQNUM_ID` are introduced this way. Note that these meta-fields are only generated when actual journal files are serialized. They are omitted for entries that do not originate from a journal file (for example because they are transferred for the first time to be stored in one). Or in other words: if you are generating this format you shouldn't care about these special double-underscore fields. But you might find them usable when you deserialize the format generated by us. Additional fields prefixed with two underscores might be added later on, your parser should skip over the fields it does not know. -* The order in which fields appear in an entry is undefined and might be different for each entry that is serialized. -And that's already it. - -This format can be generated via `journalctl -o export`. 
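The serialization rules listed above translate into a fairly small reader. The following Python function is a minimal sketch, assuming a binary stream positioned at the start of an entry; the name `parse_export()` is hypothetical and not part of any systemd API. It keeps every value as `bytes` and stores a list per field name, because a field may carry binary data and may appear more than once in an entry.

```python
import io
import struct


def parse_export(stream):
    """Yield one dict per entry, mapping field name to a list of bytes values."""
    entry = {}
    while True:
        line = stream.readline()
        if not line:
            break  # end of stream
        if line == b"\n":
            # An empty line separates two entries.
            if entry:
                yield entry
                entry = {}
            continue
        if line.endswith(b"\n"):
            line = line[:-1]
        name, sep, value = line.partition(b"=")
        if not sep:
            # Binary-safe form: field name, newline, 64-bit little-endian
            # size, raw data, then a newline as separator.
            (size,) = struct.unpack("<Q", stream.read(8))
            value = stream.read(size)
            stream.read(1)  # consume the trailing newline
        entry.setdefault(name.decode("ascii"), []).append(value)
    if entry:
        yield entry
```

For example, `parse_export(io.BytesIO(data))` can be fed the output captured from `journalctl -o export`. Double-underscore fields such as `__CURSOR` come out like any other field, so a caller can skip the ones it does not know.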
- -Here's an example for two serialized entries which consist only of text data: - -``` -__CURSOR=s=739ad463348b4ceca5a9e69c95a3c93f;i=4ece7;b=6c7c6013a26343b29e964691ff25d04c;m=4fc72436e;t=4c508a72423d9;x=d3e5610681098c10;p=system.journal -__REALTIME_TIMESTAMP=1342540861416409 -__MONOTONIC_TIMESTAMP=21415215982 -_BOOT_ID=6c7c6013a26343b29e964691ff25d04c -_TRANSPORT=syslog -PRIORITY=4 -SYSLOG_FACILITY=3 -SYSLOG_IDENTIFIER=gdm-password] -SYSLOG_PID=587 -MESSAGE=AccountsService-DEBUG(+): ActUserManager: ignoring unspecified session '8' since it's not graphical: Success -_PID=587 -_UID=0 -_GID=500 -_COMM=gdm-session-wor -_EXE=/usr/libexec/gdm-session-worker -_CMDLINE=gdm-session-worker [pam/gdm-password] -_AUDIT_SESSION=2 -_AUDIT_LOGINUID=500 -_SYSTEMD_CGROUP=/user/lennart/2 -_SYSTEMD_SESSION=2 -_SELINUX_CONTEXT=system_u:system_r:xdm_t:s0-s0:c0.c1023 -_SOURCE_REALTIME_TIMESTAMP=1342540861413961 -_MACHINE_ID=a91663387a90b89f185d4e860000001a -_HOSTNAME=epsilon - -__CURSOR=s=739ad463348b4ceca5a9e69c95a3c93f;i=4ece8;b=6c7c6013a26343b29e964691ff25d04c;m=4fc72572f;t=4c508a7243799;x=68597058a89b7246;p=system.journal -__REALTIME_TIMESTAMP=1342540861421465 -__MONOTONIC_TIMESTAMP=21415221039 -_BOOT_ID=6c7c6013a26343b29e964691ff25d04c -_TRANSPORT=syslog -PRIORITY=6 -SYSLOG_FACILITY=9 -SYSLOG_IDENTIFIER=/USR/SBIN/CROND -SYSLOG_PID=8278 -MESSAGE=(root) CMD (run-parts /etc/cron.hourly) -_PID=8278 -_UID=0 -_GID=0 -_COMM=run-parts -_EXE=/usr/bin/bash -_CMDLINE=/bin/bash /bin/run-parts /etc/cron.hourly -_AUDIT_SESSION=8 -_AUDIT_LOGINUID=0 -_SYSTEMD_CGROUP=/user/root/8 -_SYSTEMD_SESSION=8 -_SELINUX_CONTEXT=system_u:system_r:crond_t:s0-s0:c0.c1023 -_SOURCE_REALTIME_TIMESTAMP=1342540861416351 -_MACHINE_ID=a91663387a90b89f185d4e860000001a -_HOSTNAME=epsilon - -``` - -A message with a binary field produced by -```bash -python3 -c 'from systemd import journal; journal.send("foo\nbar")' -journalctl -n1 -o export -``` - -``` 
-__CURSOR=s=bcce4fb8ffcb40e9a6e05eee8b7831bf;i=5ef603;b=ec25d6795f0645619ddac9afdef453ee;m=545242e7049;t=50f1202 -__REALTIME_TIMESTAMP=1423944916375353 -__MONOTONIC_TIMESTAMP=5794517905481 -_BOOT_ID=ec25d6795f0645619ddac9afdef453ee -_TRANSPORT=journal -_UID=1001 -_GID=1001 -_CAP_EFFECTIVE=0 -_SYSTEMD_OWNER_UID=1001 -_SYSTEMD_SLICE=user-1001.slice -_MACHINE_ID=5833158886a8445e801d437313d25eff -_HOSTNAME=bupkis -_AUDIT_LOGINUID=1001 -_SELINUX_CONTEXT=unconfined_u:unconfined_r:unconfined_t:s0-s0:c0.c1023 -CODE_LINE=1 -CODE_FUNC=<module> -SYSLOG_IDENTIFIER=python3 -_COMM=python3 -_EXE=/usr/bin/python3.4 -_AUDIT_SESSION=35898 -_SYSTEMD_CGROUP=/user.slice/user-1001.slice/session-35898.scope -_SYSTEMD_SESSION=35898 -_SYSTEMD_UNIT=session-35898.scope -MESSAGE -^G^@^@^@^@^@^@^@foo -bar -CODE_FILE=<string> -_PID=16853 -_CMDLINE=python3 -c from systemd import journal; journal.send("foo\nbar") -_SOURCE_REALTIME_TIMESTAMP=1423944916372858 -``` - -## Journal JSON Format - -_Note that this section describes the JSON serialization format of the journal only, as used for interfacing with web technologies. -For binary transfer of journal data across the network there's the Journal Export Format described above. -The binary format on disk is documented as [Journal File Format](JOURNAL_FILE_FORMAT)._ - -_Before reading on, please make sure you are aware of the [basic properties of journal entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html), in particular realize that they may include binary non-text data (though usually don't), and the same field might have multiple values assigned within the same entry (though usually hasn't)._ - -In most cases the Journal JSON serialization is the obvious mapping of the entry field names (as JSON strings) to the entry field values (also as JSON strings) encapsulated in one JSON object. 
However, there are a few special cases to handle: - -* A field that contains non-printable or non-UTF-8 data is serialized as a number array instead. This is necessary to handle binary data in a safe way without losing data, since JSON cannot embed binary data natively. Each byte of the binary field will be mapped to its numeric value in the range 0…255. -* The JSON serializer can optionally skip huge (as in larger than a specific threshold) data fields from the JSON object. If that is enabled and a data field is too large, the field name is still included in the JSON object but assigned _null_. -* Within the same entry, Journal fields may have multiple values assigned. This is not allowed in JSON. The serializer will hence create a single JSON field only for these cases, and assign it an array of values (which then can be strings, _null_ or number arrays, see above). -* If the JSON data originates from a journal file it may include the special addressing fields `__CURSOR`, `__REALTIME_TIMESTAMP`, `__MONOTONIC_TIMESTAMP`, `__SEQNUM`, `__SEQNUM_ID`, which contain the cursor string of this entry as string, the realtime/monotonic timestamps of this entry as formatted numeric string of usec since the respective epoch, and the sequence number and associated sequence number ID, both formatted as strings. - -Here's an example, illustrating all cases mentioned above. 
Consider this entry: - -``` -MESSAGE=Hello World -_UDEV_DEVNODE=/dev/waldo -_UDEV_DEVLINK=/dev/alias1 -_UDEV_DEVLINK=/dev/alias2 -BINARY=this is a binary value \a -LARGE=this is a super large value (let's pretend at least, for the sake of this example) -``` - -This translates into the following JSON Object: -```json -{ - "MESSAGE" : "Hello World", - "_UDEV_DEVNODE" : "/dev/waldo", - "_UDEV_DEVLINK" : [ "/dev/alias1", "/dev/alias2" ], - "BINARY" : [ 116, 104, 105, 115, 32, 105, 115, 32, 97, 32, 98, 105, 110, 97, 114, 121, 32, 118, 97, 108, 117, 101, 32, 7 ], - "LARGE" : null -} -``` diff --git a/docs/_interfaces/JOURNAL_FILE_FORMAT.md b/docs/_interfaces/JOURNAL_FILE_FORMAT.md deleted file mode 100644 index e0737c5933..0000000000 --- a/docs/_interfaces/JOURNAL_FILE_FORMAT.md +++ /dev/null @@ -1,755 +0,0 @@ ---- -title: Journal File Format -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# Journal File Format - -_Note that this document describes the binary on-disk format of journals only. -For interfacing with web technologies there's the [Journal JSON Format](JOURNAL_EXPORT_FORMATS.md#journal-json-format). -For transfer of journal data across the network there's the [Journal Export Format](JOURNAL_EXPORT_FORMATS.md#journal-export-format)._ - -The systemd journal stores log data in a binary format with several features: - -* Fully indexed by all fields -* Can store binary data, up to 2^64-1 in size -* Seekable -* Primarily append-based, hence robust to corruption -* Support for in-line compression -* Support for in-line Forward Secure Sealing - -This document explains the basic structure of the file format on disk. We are -making this available primarily to allow review and provide documentation. 
Note -that the actual implementation in the [systemd -codebase](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/) is the -only ultimately authoritative description of the format, so if this document -and the code disagree, the code is right. That said we'll of course try hard to -keep this document up-to-date and accurate. - -Instead of implementing your own reader or writer for journal files we ask you -to use the [Journal's native C -API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) to access -these files. It provides you with full access to the files, and will not -withhold any data. If you find a limitation, please ping us and we might add -some additional interfaces for you. - -If you need access to the raw journal data in serialized stream form without C -API our recommendation is to make use of the [Journal Export -Format](https://systemd.io/JOURNAL_EXPORT_FORMATS#journal-export-format), which you can -get via `journalctl -o export` or via `systemd-journal-gatewayd`. The export -format is much simpler to parse, but complete and accurate. Due to its -stream-based nature it is not indexed. - -_Or, to put this in other words: this low-level document is probably not what -you want to use as base of your project. You want our [C -API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) instead! -And if you really don't want the C API, then you want the -[Journal Export Format or Journal JSON Format](JOURNAL_EXPORT_FORMATS) -instead! This document is primarily for your entertainment and education. -Thank you!_ - -This document assumes you have a basic understanding of the journal concepts, -the properties of a journal entry and so on. If not, please go and read up, -then come back! 
This is a good opportunity to read about the [basic properties -of journal -entries](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html), -in particular realize that they may include binary non-text data (though -usually don't), and the same field might have multiple values assigned within -the same entry. - -This document describes the current format of systemd 246. The documented -format is compatible with the format used in the first versions of the journal, -but received various compatible and incompatible additions since. - -If you are wondering why the journal file format has been created in the first -place instead of adopting an existing database implementation, please have a -look [at this -thread](https://lists.freedesktop.org/archives/systemd-devel/2012-October/007054.html). - - -## Basics - -* All offsets, sizes, time values, hashes (and most other numeric values) are 32-bit/64-bit unsigned integers in LE format. -* Offsets are always relative to the beginning of the file. -* The 64-bit hash function siphash24 is used for newer journal files. For older files [Jenkins lookup3](https://en.wikipedia.org/wiki/Jenkins_hash_function) is used, more specifically `jenkins_hashlittle2()` with the first 32-bit integer it returns as higher 32-bit part of the 64-bit value, and the second one uses as lower 32-bit part. -* All structures are aligned to 64-bit boundaries and padded to multiples of 64-bit -* The format is designed to be read and written via memory mapping using multiple mapped windows. -* All time values are stored in usec since the respective epoch. -* Wall clock time values are relative to the Unix time epoch, i.e. January 1st, 1970. (`CLOCK_REALTIME`) -* Monotonic time values are always stored jointly with the kernel boot ID value (i.e. `/proc/sys/kernel/random/boot_id`) they belong to. They tend to be relative to the start of the boot, but aren't for containers. 
(`CLOCK_MONOTONIC`) -* Randomized, unique 128-bit IDs are used in various locations. These are generally UUID v4 compatible, but this is not a requirement. - -## General Rules - -If any kind of corruption is noticed by a writer it should immediately rotate -the file and start a new one. No further writes should be attempted to the -original file, but it should be left around so that as little data as possible -is lost. - -If any kind of corruption is noticed by a reader it should try hard to handle -this gracefully, such as skipping over the corrupted data, but allowing access -to as much data around it as possible. - -A reader should verify all offsets and other data as it reads it. This includes -checking for alignment and range of offsets in the file, especially before -trying to read it via a memory map. - -A reader must interleave rotated and corrupted files as good as possible and -present them as single stream to the user. - -All fields marked as "reserved" must be initialized with 0 when writing and be -ignored on reading. They are currently not used but might be used later on. - - -## Structure - -The file format's data structures are declared in -[journal-def.h](https://github.com/systemd/systemd/blob/main/src/libsystemd/sd-journal/journal-def.h). - -The file format begins with a header structure. After the header structure -object structures follow. Objects are appended to the end as time -progresses. Most data stored in these objects is not altered anymore after -having been written once, with the exception of records necessary for -indexing. When new data is appended to a file the writer first writes all new -objects to the end of the file, and then links them up at front after that's -done. 
Currently, seven different object types are known: - -```c -enum { - OBJECT_UNUSED, - OBJECT_DATA, - OBJECT_FIELD, - OBJECT_ENTRY, - OBJECT_DATA_HASH_TABLE, - OBJECT_FIELD_HASH_TABLE, - OBJECT_ENTRY_ARRAY, - OBJECT_TAG, - _OBJECT_TYPE_MAX -}; -``` - -* A **DATA** object, which encapsulates the contents of one field of an entry, i.e. a string such as `_SYSTEMD_UNIT=avahi-daemon.service`, or `MESSAGE=Foobar made a booboo.` but possibly including large or binary data, and always prefixed by the field name and "=". -* A **FIELD** object, which encapsulates a field name, i.e. a string such as `_SYSTEMD_UNIT` or `MESSAGE`, without any `=` or even value. -* An **ENTRY** object, which binds several **DATA** objects together into a log entry. -* A **DATA_HASH_TABLE** object, which encapsulates a hash table for finding existing **DATA** objects. -* A **FIELD_HASH_TABLE** object, which encapsulates a hash table for finding existing **FIELD** objects. -* An **ENTRY_ARRAY** object, which encapsulates a sorted array of offsets to entries, used for seeking by binary search. -* A **TAG** object, consisting of an FSS sealing tag for all data from the beginning of the file or the last tag written (whichever is later). 
- -## Header - -The Header struct defines, well, you guessed it, the file header: - -```c -_packed_ struct Header { - uint8_t signature[8]; /* "LPKSHHRH" */ - le32_t compatible_flags; - le32_t incompatible_flags; - uint8_t state; - uint8_t reserved[7]; - sd_id128_t file_id; - sd_id128_t machine_id; - sd_id128_t tail_entry_boot_id; - sd_id128_t seqnum_id; - le64_t header_size; - le64_t arena_size; - le64_t data_hash_table_offset; - le64_t data_hash_table_size; - le64_t field_hash_table_offset; - le64_t field_hash_table_size; - le64_t tail_object_offset; - le64_t n_objects; - le64_t n_entries; - le64_t tail_entry_seqnum; - le64_t head_entry_seqnum; - le64_t entry_array_offset; - le64_t head_entry_realtime; - le64_t tail_entry_realtime; - le64_t tail_entry_monotonic; - /* Added in 187 */ - le64_t n_data; - le64_t n_fields; - /* Added in 189 */ - le64_t n_tags; - le64_t n_entry_arrays; - /* Added in 246 */ - le64_t data_hash_chain_depth; - le64_t field_hash_chain_depth; - /* Added in 252 */ - le32_t tail_entry_array_offset; - le32_t tail_entry_array_n_entries; - /* Added in 254 */ - le64_t tail_entry_offset; -}; -``` - -The first 8 bytes of Journal files must contain the ASCII characters `LPKSHHRH`. - -If a writer finds that the **machine_id** of a file to write to does not match -the machine it is running on it should immediately rotate the file and start a -new one. - -When journal file is first created the **file_id** is randomly and uniquely -initialized. - -When a writer creates a file it shall initialize the **tail_entry_boot_id** to -the current boot ID of the system. When appending an entry it shall update the -field to the boot ID of that entry, so that it is guaranteed that the -**tail_entry_monotonic** field refers to a timestamp of the monotonic clock -associated with the boot with the ID indicated by the **tail_entry_boot_id** -field. 
(Compatibility note: in older versions of the journal, the field was -also supposed to be updated whenever the file was opened for any form of -writing, including when opened to mark it as archived. This behaviour has been -deemed problematic since without an associated boot ID the -**tail_entry_monotonic** field is useless. To indicate whether the boot ID is -updated only on append the JOURNAL_COMPATIBLE_TAIL_ENTRY_BOOT_ID is set. If it -is not set, the **tail_entry_monotonic** field is not usable). - -The currently used part of the file is the **header_size** plus the -**arena_size** field of the header. If a writer needs to write to a file where -the actual file size on disk is smaller than the reported value it shall -immediately rotate the file and start a new one. If a writer is asked to write -to a file with a header that is shorter than its own definition of the struct -Header, it shall immediately rotate the file and start a new one. - -The **n_objects** field contains a counter for objects currently available in -this file. As objects are appended to the end of the file this counter is -increased. - -The first object in the file starts immediately after the header. The last -object in the file is at the offset **tail_object_offset**, which may be 0 if -no object is in the file yet. - -The **n_entries**, **n_data**, **n_fields**, **n_tags**, **n_entry_arrays** are -counters of the objects of the specific types. - -**tail_entry_seqnum** and **head_entry_seqnum** contain the sequential number -(see below) of the last or first entry in the file, respectively, or 0 if no -entry has been written yet. - -**tail_entry_realtime** and **head_entry_realtime** contain the wallclock -timestamp of the last or first entry in the file, respectively, or 0 if no -entry has been written yet. 
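As a rough illustration of the header rules above, the following C sketch checks the signature and verifies that the used part of the file (**header_size** plus **arena_size**) fits into the size on disk. The function and constant names are hypothetical and not part of any systemd API; the byte offsets are simply computed from the packed `struct Header` shown above, and the minimum-size check is deliberately conservative.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Byte offsets derived from the packed struct Header in this document:
 * signature (8) + two le32 flags (8) + state and reserved (8) + four
 * 128-bit IDs (64) = 88. These names are illustrative only. */
enum {
    HDR_OFF_HEADER_SIZE = 88,
    HDR_OFF_ARENA_SIZE  = 96,
    HDR_MIN_SIZE        = 104, /* enough bytes to read both fields */
};

/* Decode a little-endian 64-bit integer independently of host byte order. */
static uint64_t read_le64(const uint8_t *p) {
    uint64_t v = 0;
    for (int i = 7; i >= 0; i--)
        v = (v << 8) | p[i];
    return v;
}

/* Returns true if buf starts with a plausible journal header and the used
 * part of the file (header_size + arena_size) fits into file_size. */
static bool header_looks_sane(const uint8_t *buf, size_t len, uint64_t file_size) {
    if (len < HDR_MIN_SIZE)
        return false;
    if (memcmp(buf, "LPKSHHRH", 8) != 0)
        return false;

    uint64_t header_size = read_le64(buf + HDR_OFF_HEADER_SIZE);
    uint64_t arena_size = read_le64(buf + HDR_OFF_ARENA_SIZE);

    if (header_size < HDR_MIN_SIZE)
        return false;
    if (arena_size > UINT64_MAX - header_size) /* guard against overflow */
        return false;
    return header_size + arena_size <= file_size;
}
```

A writer that gets `false` here would rotate the file rather than append to it. A reader might instead try to salvage whatever is still accessible.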
- -**tail_entry_monotonic** is the monotonic timestamp of the last entry in the -file, referring to monotonic time of the boot identified by -**tail_entry_boot_id**, but only if the -JOURNAL_COMPATIBLE_TAIL_ENTRY_BOOT_ID feature flag is set, see above. If it -is not set, this field might refer to a different boot than the one in the -**tail_entry_boot_id** field, for example when the file was ultimately -archived. - -**data_hash_chain_depth** is a counter of the deepest chain in the data hash -table, minus one. This is updated whenever a chain is found that is longer than -the previous deepest chain found. Note that the counter is updated during hash -table lookups, as the chains are traversed. This counter is used to determine -when it is a good time to rotate the journal file, because hash collisions -became too frequent. - -Similarly, **field_hash_chain_depth** is a counter of the deepest chain in the -field hash table, minus one. - -**tail_entry_array_offset** and **tail_entry_array_n_entries** allow immediate -access to the last entry array in the global entry array chain. - -**tail_entry_offset** allows immediate access to the last entry in the journal -file. - -## Extensibility - -The format is supposed to be extensible in order to enable future additions of -features. Readers should simply skip objects of unknown types as they read -them. If a compatible feature extension is made a new bit is registered in the -header's **compatible_flags** field. If a feature extension is used that makes -the format incompatible a new bit is registered in the header's -**incompatible_flags** field. Readers should check these two bit fields: if -they find a flag they don't understand in **compatible_flags** they should continue -to read the file, but if they find one in **incompatible_flags** they should -fail, asking for an update of the software. Writers should refuse writing if -there's an unknown bit flag in either of these fields. 
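The reader and writer rules for the two flag fields can be sketched in a few lines of C. This is only an illustration: the function names are made up here, and the `known_*` masks stand for whatever set of flags a given implementation understands.

```c
#include <stdbool.h>
#include <stdint.h>

/* A reader may proceed as long as it understands every incompatible flag.
 * Unknown compatible flags are tolerated when reading. */
static bool reader_may_open(uint32_t incompatible_flags,
                            uint32_t known_incompatible) {
    return (incompatible_flags & ~known_incompatible) == 0;
}

/* A writer must understand every flag in both fields. */
static bool writer_may_open(uint32_t compatible_flags,
                            uint32_t incompatible_flags,
                            uint32_t known_compatible,
                            uint32_t known_incompatible) {
    return (compatible_flags & ~known_compatible) == 0 &&
           (incompatible_flags & ~known_incompatible) == 0;
}
```

Both checks mask out the known bits and test whether anything remains, so a newly registered flag is rejected automatically by older software without any table of flag names.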
- -The file header may be extended as new features are added. The size of the file -header is stored in the header. All header fields up to **n_data** are known to -unconditionally exist in all revisions of the file format; all fields starting -with **n_data** need to be explicitly checked for via a size check, since they -were additions after the initial release. - -Currently only seven extensions flagged in the flags fields are known: - -```c -enum { - HEADER_INCOMPATIBLE_COMPRESSED_XZ = 1 << 0, - HEADER_INCOMPATIBLE_COMPRESSED_LZ4 = 1 << 1, - HEADER_INCOMPATIBLE_KEYED_HASH = 1 << 2, - HEADER_INCOMPATIBLE_COMPRESSED_ZSTD = 1 << 3, - HEADER_INCOMPATIBLE_COMPACT = 1 << 4, -}; - -enum { - HEADER_COMPATIBLE_SEALED = 1 << 0, - HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID = 1 << 1, -}; -``` - -HEADER_INCOMPATIBLE_COMPRESSED_XZ indicates that the file includes DATA objects -that are compressed using XZ. Similarly, HEADER_INCOMPATIBLE_COMPRESSED_LZ4 -indicates that the file includes DATA objects that are compressed with the LZ4 -algorithm. And HEADER_INCOMPATIBLE_COMPRESSED_ZSTD indicates that there are -objects compressed with ZSTD. - -HEADER_INCOMPATIBLE_KEYED_HASH indicates that instead of the unkeyed Jenkins -hash function the keyed siphash24 hash function is used for the two hash -tables, see below. - -HEADER_INCOMPATIBLE_COMPACT indicates that the journal file uses the new binary -format that uses less space on disk compared to the original format. - -HEADER_COMPATIBLE_SEALED indicates that the file includes TAG objects required -for Forward Secure Sealing. - -HEADER_COMPATIBLE_TAIL_ENTRY_BOOT_ID indicates whether the -**tail_entry_boot_id** field is strictly updated on initial creation of the -file and whenever an entry is appended (in which case the flag is set), or also -when the file is archived (in which case it is unset). 
New files should always -set this flag (and thus not update the **tail_entry_boot_id** except when -creating the file and when appending an entry to it). - -## Dirty Detection - -```c -enum { - STATE_OFFLINE = 0, - STATE_ONLINE = 1, - STATE_ARCHIVED = 2, - _STATE_MAX -}; -``` - -If a file is opened for writing the **state** field should be set to -STATE_ONLINE. If a file is closed after writing the **state** field should be -set to STATE_OFFLINE. After a file has been rotated it should be set to -STATE_ARCHIVED. If a writer is asked to write to a file that is not in -STATE_OFFLINE it should immediately rotate the file and start a new one, -without changing the file. - -Before and after the state field is changed, `fdatasync()` should be executed on -the file to ensure the dirty state hits disk. - - -## Sequence Numbers - -All entries carry sequence numbers that are monotonically counted up for each -entry (starting at 1) and are unique among all files which carry the same -**seqnum_id** field. This field is randomly generated when the journal daemon -creates its first file. All files generated by the same journal daemon instance -should hence carry the same seqnum_id. This should guarantee a monotonic stream -of sequential numbers for easy interleaving even if entries are distributed -among several files, such as the system journal and many per-user journals. - - -## Concurrency - -The file format is designed to be usable in a simultaneous -single-writer/multiple-reader scenario. The synchronization model is very weak -in order to facilitate storage on the most basic of file systems (well, the -most basic ones that provide us with `mmap()` that is), and allow good -performance. No file locking is used. The only time where disk synchronization -via `fdatasync()` should be enforced is before and after changing the **state** -field in the file header (see above). 
It is recommended to execute a memory -barrier after appending and initializing new objects at the end of the file, -and before linking them up in the earlier objects. - -This weak synchronization model means that it is crucial that readers verify -the structural integrity of the file as they read it and handle invalid -structure gracefully. (Checking what you read is a pretty good idea out of -security considerations anyway.) This specifically includes checking offset -values, and that they point to valid objects, with valid sizes and of the type -and hash value expected. All code must be written with the fact in mind that a -file with inconsistent structure might just be inconsistent temporarily, and -might become consistent later on. Payload OTOH requires less scrutiny, as it -should only be linked up (and hence visible to readers) after it was -successfully written to memory (though not necessarily to disk). On non-local -file systems it is a good idea to verify the payload hashes when reading, in -order to avoid annoyances with `mmap()` inconsistencies. - -Clients intending to show a live view of the journal should use `inotify()` for -this to watch for files changes. Since file writes done via `mmap()` do not -result in `inotify()` writers shall truncate the file to its current size after -writing one or more entries, which results in inotify events being -generated. Note that this is not used as a transaction scheme (it doesn't -protect anything), but merely for triggering wakeups. - -Note that inotify will not work on network file systems if reader and writer -reside on different hosts. Readers which detect they are run on journal files -on a non-local file system should hence not rely on inotify for live views but -fall back to simple time based polling of the files (maybe recheck every 2s). 
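Because readers are expected to verify every offset before following it, a small helper along the following lines may be useful. It is a sketch under stated assumptions, not the systemd implementation: the name is hypothetical, and the 16-byte constant is the size of the common object header in the layout this document gives (type, flags, six reserved bytes and a 64-bit size).

```c
#include <stdbool.h>
#include <stdint.h>

#define OBJECT_HEADER_SIZE 16u /* type + flags + reserved[6] + le64 size */

/* Returns true if an object header could start at this offset:
 * 64-bit aligned, located after the file header, and fully inside the
 * used part of the file (header_size + arena_size). */
static bool object_offset_is_plausible(uint64_t offset,
                                       uint64_t header_size,
                                       uint64_t arena_size) {
    if (arena_size > UINT64_MAX - header_size)
        return false; /* header fields themselves are bogus */
    uint64_t end = header_size + arena_size;

    if (offset & 7)
        return false; /* all structures are aligned to 64-bit boundaries */
    if (offset < header_size)
        return false; /* objects never overlap the file header */
    if (offset > end || end - offset < OBJECT_HEADER_SIZE)
        return false; /* header would run past the used area */
    return true;
}
```

A check like this only establishes that reading the object header is safe. The object's own size, type and hash still need to be validated once the header has been read.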
- - -## Objects - -All objects carry a common header: - -```c -enum { - OBJECT_COMPRESSED_XZ = 1 << 0, - OBJECT_COMPRESSED_LZ4 = 1 << 1, - OBJECT_COMPRESSED_ZSTD = 1 << 2, -}; - -_packed_ struct ObjectHeader { - uint8_t type; - uint8_t flags; - uint8_t reserved[6]; - le64_t size; - uint8_t payload[]; -}; -``` - -The **type** field is one of the object types listed above. The **flags** field -currently knows three flags: OBJECT_COMPRESSED_XZ, OBJECT_COMPRESSED_LZ4 and -OBJECT_COMPRESSED_ZSTD. It is only valid for DATA objects and indicates that -the data payload is compressed with XZ/LZ4/ZSTD. If one of the -OBJECT_COMPRESSED_* flags is set for an object then the matching -HEADER_INCOMPATIBLE_COMPRESSED_XZ/HEADER_INCOMPATIBLE_COMPRESSED_LZ4/HEADER_INCOMPATIBLE_COMPRESSED_ZSTD -flag must be set for the file as well. At most one of these three bits may be -set. The **size** field encodes the size of the object including all its -headers and payload. - - -## Data Objects - -```c -_packed_ struct DataObject { - ObjectHeader object; - le64_t hash; - le64_t next_hash_offset; - le64_t next_field_offset; - le64_t entry_offset; /* the first array entry we store inline */ - le64_t entry_array_offset; - le64_t n_entries; - union { \ - struct { \ - uint8_t payload[] ; \ - } regular; \ - struct { \ - le32_t tail_entry_array_offset; \ - le32_t tail_entry_array_n_entries; \ - uint8_t payload[]; \ - } compact; \ - }; \ -}; -``` - -Data objects carry actual field data in the **payload[]** array, including a -field name, a `=` and the field data. Example: -`_SYSTEMD_UNIT=foobar.service`. The **hash** field is a hash value of the -payload. If the `HEADER_INCOMPATIBLE_KEYED_HASH` flag is set in the file header -this is the siphash24 hash value of the payload, keyed by the file ID as stored -in the **file_id** field of the file header. If the flag is not set it is the -non-keyed Jenkins hash of the payload instead. 
The keyed hash is preferred as -it makes the format more robust against attackers that want to trigger hash -collisions in the hash table. - -**next_hash_offset** is used to link up DATA objects in the DATA_HASH_TABLE if -a hash collision happens (in a singly linked list, with an offset of 0 -indicating the end). **next_field_offset** is used to link up data objects with -the same field name from the FIELD object of the field used. - -**entry_offset** is an offset to the first ENTRY object referring to this DATA -object. **entry_array_offset** is an offset to an ENTRY_ARRAY object with -offsets to other entries referencing this DATA object. Storing the offset to -the first ENTRY object in-line is an optimization given that many DATA objects -will be referenced from a single entry only (for example, `MESSAGE=` frequently -includes a practically unique string). **n_entries** is a counter of the total -number of ENTRY objects that reference this object, i.e. the sum of the items in all -ENTRY_ARRAY objects chained up from this object, plus 1. - -The **payload[]** field contains the field name and data unencoded, unless -OBJECT_COMPRESSED_XZ/OBJECT_COMPRESSED_LZ4/OBJECT_COMPRESSED_ZSTD is set in the -`ObjectHeader`, in which case the payload is compressed with the indicated -compression algorithm. - -If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, two extra fields are stored to -allow immediate access to the tail entry array in the DATA object's entry array -chain. - -## Field Objects - -```c -_packed_ struct FieldObject { - ObjectHeader object; - le64_t hash; - le64_t next_hash_offset; - le64_t head_data_offset; - uint8_t payload[]; -}; -``` - -Field objects are used to enumerate all possible values a certain field name -can take in the entire journal file. - -The **payload[]** array contains the actual field name, without '=' or any -field value. Example: `_SYSTEMD_UNIT`. The **hash** field is a hash value of -the payload. 
As for the DATA objects, this too is either the `.file_id` keyed -siphash24 hash of the payload, or the non-keyed Jenkins hash. - -**next_hash_offset** is used to link up FIELD objects in the FIELD_HASH_TABLE -if a hash collision happens (in singly linked list, offset 0 indicating the -end). **head_data_offset** points to the first DATA object that shares this -field name. It is the head of a singly linked list using DATA's -**next_field_offset** offset. - - -## Entry Objects - -``` -_packed_ struct EntryObject { - ObjectHeader object; - le64_t seqnum; - le64_t realtime; - le64_t monotonic; - sd_id128_t boot_id; - le64_t xor_hash; - union { \ - struct { \ - le64_t object_offset; \ - le64_t hash; \ - } regular[]; \ - struct { \ - le32_t object_offset; \ - } compact[]; \ - } items; \ -}; -``` - -An ENTRY object binds several DATA objects together into one log entry, and -includes other metadata such as various timestamps. - -The **seqnum** field contains the sequence number of the entry, **realtime** -the realtime timestamp, and **monotonic** the monotonic timestamp for the boot -identified by **boot_id**. - -The **xor_hash** field contains a binary XOR of the hashes of the payload of -all DATA objects referenced by this ENTRY. This value is usable to check the -contents of the entry, being independent of the order of the DATA objects in -the array. Note that even for files that have the -`HEADER_INCOMPATIBLE_KEYED_HASH` flag set (and thus siphash24 the otherwise -used hash function) the hash function used for this field, as singular -exception, is the Jenkins lookup3 hash function. The XOR hash value is used to -quickly compare the contents of two entries, and to define a well-defined order -between two entries that otherwise have the same sequence numbers and -timestamps. - -The **items[]** array contains references to all DATA objects of this entry, -plus their respective hashes (which are calculated the same way as in the DATA -objects, i.e. 
keyed by the file ID). - -If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, DATA object offsets are stored -as 32-bit integers instead of 64-bit and the unused hash field per data object is -not stored anymore. - -In the file ENTRY objects are written ordered monotonically by sequence -number. For continuous parts of the file written during the same boot -(i.e. with the same boot_id) the monotonic timestamp is monotonic too. Modulo -wallclock time jumps (due to incorrect clocks being corrected) the realtime -timestamps are monotonic too. - - -## Hash Table Objects - -```c -_packed_ struct HashItem { - le64_t head_hash_offset; - le64_t tail_hash_offset; -}; - -_packed_ struct HashTableObject { - ObjectHeader object; - HashItem items[]; -}; -``` - -The structure of both DATA_HASH_TABLE and FIELD_HASH_TABLE objects are -identical. They implement a simple hash table, with each cell containing -offsets to the head and tail of the singly linked list of the DATA and FIELD -objects, respectively. DATA's and FIELD's next_hash_offset field are used to -chain up the objects. Empty cells have both offsets set to 0. - -Each file contains exactly one DATA_HASH_TABLE and one FIELD_HASH_TABLE -objects. Their payload is directly referred to by the file header in the -**data_hash_table_offset**, **data_hash_table_size**, -**field_hash_table_offset**, **field_hash_table_size** fields. These offsets do -_not_ point to the object headers but directly to the payloads. When a new -journal file is created the two hash table objects need to be created right -away as first two objects in the stream. - -If the hash table fill level is increasing over a certain fill level (Learning -from Java's Hashtable for example: > 75%), the writer should rotate the file -and create a new one. - -The DATA_HASH_TABLE should be sized taking into account to the maximum size the -file is expected to grow, as configured by the administrator or disk space -considerations. 
The FIELD_HASH_TABLE should be sized to a fixed size; the -number of fields should be pretty static as it depends only on developers' -creativity rather than runtime parameters. - - -## Entry Array Objects - - -```c -_packed_ struct EntryArrayObject { - ObjectHeader object; - le64_t next_entry_array_offset; - union { - le64_t regular[]; - le32_t compact[]; - } items; -}; -``` - -Entry Arrays are used to store a sorted array of offsets to entries. Entry -arrays are strictly sorted by offsets on disk, and hence by their timestamps -and sequence numbers (with some restrictions, see above). - -If the `HEADER_INCOMPATIBLE_COMPACT` flag is set, offsets are stored as 32-bit -integers instead of 64-bit. - -Entry Arrays are chained up. If one entry array is full another one is -allocated and the **next_entry_array_offset** field of the old one pointed to -it. An Entry Array with **next_entry_array_offset** set to 0 is the last in the -list. To optimize allocation and seeking, as entry arrays are appended to a -chain of entry arrays they should increase in size (double). - -Due to being monotonically ordered entry arrays may be searched with a binary -search (bisection). - -One chain of entry arrays links up all entries written to the journal. The -first entry array is referenced in the **entry_array_offset** field of the -header. - -Each DATA object also references an entry array chain listing all entries -referencing a specific DATA object. Since many DATA objects are only referenced -by a single ENTRY the first offset of the list is stored inside the DATA object -itself, an ENTRY_ARRAY object is only needed if it is referenced by more than -one ENTRY. - - -## Tag Object - -```c -#define TAG_LENGTH (256/8) - -_packed_ struct TagObject { - ObjectHeader object; - le64_t seqnum; - le64_t epoch; - uint8_t tag[TAG_LENGTH]; /* SHA-256 HMAC */ -}; -``` - -Tag objects are used to seal off the journal for alteration. In regular -intervals a tag object is appended to the file. 
The tag object consists of a -SHA-256 HMAC tag that is calculated from the objects stored in the file since -the last tag was written, or from the beginning if no tag was written yet. The -key for the HMAC is calculated via the externally maintained FSPRG logic for -the epoch that is written into **epoch**. The sequence number **seqnum** is -increased with each tag. When calculating the HMAC of objects header fields -that are volatile are excluded (skipped). More specifically all fields that -might validly be altered to maintain a consistent file structure (such as -offsets to objects added later for the purpose of linked lists and suchlike) -after an object has been written are not protected by the tag. This means a -verifier has to independently check these fields for consistency of -structure. For the fields excluded from the HMAC please consult the source code -directly. A verifier should read the file from the beginning to the end, always -calculating the HMAC for the objects it reads. Each time a tag object is -encountered the HMAC should be verified and restarted. The tag object sequence -numbers need to increase strictly monotonically. Tag objects themselves are -partially protected by the HMAC (i.e. seqnum and epoch is included, the tag -itself not). - - -## Algorithms - -### Reading - -Given an offset to an entry all data fields are easily found by following the -offsets in the data item array of the entry. - -Listing entries without filter is done by traversing the list of entry arrays -starting with the headers' **entry_array_offset** field. - -Seeking to an entry by timestamp or sequence number (without any matches) is -done via binary search in the entry arrays starting with the header's -**entry_array_offset** field. Since these arrays double in size as more are -added the time cost of seeking is O(log(n)*log(n)) if n is the number of -entries in the file. 
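The seek described above can be sketched on a simplified in-memory model. The assumptions are significant: the struct and function names are invented for this example, each array item is treated as the sort key itself (for instance a sequence number) rather than an offset to an ENTRY object that would have to be read and validated, and a `NULL` pointer stands in for a **next_entry_array_offset** of 0.

```c
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for an ENTRY_ARRAY object held in memory. */
struct entry_array {
    const uint64_t *items;          /* sorted in ascending order */
    size_t n_items;
    const struct entry_array *next; /* NULL marks the end of the chain */
};

/* Returns a pointer to the first item >= needle, or NULL if every item in
 * the chain is smaller. */
static const uint64_t *chain_lower_bound(const struct entry_array *a,
                                         uint64_t needle) {
    for (; a != NULL; a = a->next) {
        /* Skip arrays whose last (largest) item is still too small. */
        if (a->n_items == 0 || a->items[a->n_items - 1] < needle)
            continue;

        /* The answer lies in this array: plain bisection. */
        size_t lo = 0, hi = a->n_items;
        while (lo < hi) {
            size_t mid = lo + (hi - lo) / 2;
            if (a->items[mid] < needle)
                lo = mid + 1;
            else
                hi = mid;
        }
        return &a->items[lo];
    }
    return NULL;
}
```

Since later arrays in a chain are larger than earlier ones, the outer loop visits only a small number of arrays before the bisection starts. On disk, each comparison additionally involves following an offset to the entry it refers to.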
- -When seeking or listing with one field match applied the DATA object of the -match is first identified, and then its data entry array chain traversed. The -time cost is the same as for seeks/listings with no match. - -If multiple matches are applied, multiple chains of entry arrays should be -traversed in parallel. Since they all are strictly monotonically ordered by -offset of the entries, advancing in one can be directly applied to the others, -until an entry matching all matches is found. In the worst case seeking like -this is O(n) where n is the number of matching entries of the "loosest" match, -but in the common case should be much more efficient at least for the -well-known fields, where the set of possible field values tend to be closely -related. Checking whether an entry matches a number of matches is efficient -since the item array of the entry contains hashes of all data fields -referenced, and the number of data fields of an entry is generally small (< -30). - -When interleaving multiple journal files seeking tends to be a frequently used -operation, but in this case can be effectively suppressed by caching results -from previous entries. - -When listing all possible values a certain field can take it is sufficient to -look up the FIELD object and follow the chain of links to all DATA it includes. - -### Writing - -When an entry is appended to the journal, for each of its data fields the data -hash table should be checked. If the data field does not yet exist in the file, -it should be appended and added to the data hash table. When a data field's data -object is added, the field hash table should be checked for the field name of -the data field, and a field object be added if necessary. After all data fields -(and recursively all field names) of the new entry are appended and linked up -in the hashtables, the entry object should be appended and linked up too. 
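The "check the data hash table, append only if missing" step of the writing procedure might look roughly like this on a toy in-memory model. Everything here is an assumption made for illustration: slot indices play the role of file offsets (with 0 meaning "none", as in the on-disk format), payloads are plain C strings although real payloads may contain arbitrary binary data, and FNV-1a is used purely as a stand-in for the siphash24 or Jenkins hash the format actually uses.

```c
#include <stdint.h>
#include <string.h>

#define N_BUCKETS 8
#define MAX_DATA 32

struct data_obj {
    const char *payload; /* "FIELD=value", borrowed from the caller */
    uint64_t hash;
    uint32_t next_hash;  /* next object in the same bucket, 0 = end */
};

struct toy_journal {
    struct data_obj data[MAX_DATA]; /* slot 0 is unused so 0 can mean "none" */
    uint32_t n_data;                /* number of slots in use */
    uint32_t head[N_BUCKETS];       /* like HashItem.head_hash_offset */
    uint32_t tail[N_BUCKETS];       /* like HashItem.tail_hash_offset */
};

/* FNV-1a: a placeholder only, NOT the hash function used by the journal. */
static uint64_t toy_hash(const char *s) {
    uint64_t h = 14695981039346656037ULL;
    for (; *s; s++) {
        h ^= (uint8_t) *s;
        h *= 1099511628211ULL;
    }
    return h;
}

/* Returns the slot of the DATA object for payload, appending a new one if it
 * does not exist yet. Returns 0 if the toy table is full. */
static uint32_t data_find_or_append(struct toy_journal *j, const char *payload) {
    uint64_t h = toy_hash(payload);
    uint32_t b = (uint32_t) (h % N_BUCKETS);

    /* Walk the collision chain of this bucket. */
    for (uint32_t i = j->head[b]; i != 0; i = j->data[i].next_hash)
        if (j->data[i].hash == h && strcmp(j->data[i].payload, payload) == 0)
            return i;

    if (j->n_data + 1 >= MAX_DATA)
        return 0;

    /* Append first, then link up at the tail of the chain. */
    uint32_t i = ++j->n_data;
    j->data[i] = (struct data_obj) { payload, h, 0 };
    if (j->tail[b] != 0)
        j->data[j->tail[b]].next_hash = i;
    else
        j->head[b] = i;
    j->tail[b] = i;
    return i;
}
```

The order inside the append path mirrors the rule that new objects are written completely before anything already visible to readers is made to point at them.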
- -At regular intervals a tag object should be written if sealing is enabled (see -above). Before the file is closed a tag should be written too, to seal it off. - -Before writing an object, time and disk space limits should be checked and -rotation triggered if necessary. - - -## Optimizing Disk IO - -_A few general ideas to keep in mind:_ - -The hash tables for looking up fields and data should be quickly in the memory -cache and not hurt performance. All entries and entry arrays are ordered -strictly by time on disk, and hence should expose an OK access pattern on -rotating media, when read sequentially (which should be the most common case, -given the nature of log data). - -The disk access patterns of the binary search for entries needed for seeking -are problematic on rotating disks. This should not be a major issue though, -since seeking should not be a frequent operation. - -When reading, collecting data fields for presenting entries to the user is -problematic on rotating disks. In order to optimize these patterns the item -array of entry objects should be sorted by disk offset before -writing. Effectively, frequently used data objects should be in the memory -cache quickly. Non-frequently used data objects are likely to be located -between the previous and current entry when reading and hence should expose an -OK access pattern. Problematic are data objects that are neither frequently nor -infrequently referenced, which will cost seek time. - -And that's all there is to it. - -Thanks for your interest! 
diff --git a/docs/_interfaces/JOURNAL_NATIVE_PROTOCOL.md b/docs/_interfaces/JOURNAL_NATIVE_PROTOCOL.md deleted file mode 100644 index ce00d7e1ae..0000000000 --- a/docs/_interfaces/JOURNAL_NATIVE_PROTOCOL.md +++ /dev/null @@ -1,191 +0,0 @@ ---- -title: Native Journal Protocol -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# Native Journal Protocol - -`systemd-journald.service` accepts log data via various protocols: - -* Classic RFC3164 BSD syslog via the `/dev/log` socket -* STDOUT/STDERR of programs via `StandardOutput=journal` + `StandardError=journal` in service files (both of which are default settings) -* Kernel log messages via the `/dev/kmsg` device node -* Audit records via the kernel's audit subsystem -* Structured log messages via `journald`'s native protocol - -The latter is what this document is about: if you are developing a program and -want to pass structured log data to `journald`, it's the Journal's native -protocol that you want to use. The systemd project provides the -[`sd_journal_print(3)`](https://www.freedesktop.org/software/systemd/man/sd_journal_print.html) -API that implements the client side of this protocol. This document explains -what this interface does behind the scenes, in case you'd like to implement a -client for it yourself, without linking to `libsystemd` — for example because -you work in a programming language other than C or otherwise want to avoid the -dependency. - -## Basics - -The native protocol of `journald` is spoken on the -`/run/systemd/journal/socket` `AF_UNIX`/`SOCK_DGRAM` socket on which -`systemd-journald.service` listens. Each datagram sent to this socket -encapsulates one journal entry that shall be written. Since datagrams are -subject to a size limit and we want to allow large journal entries, datagrams -sent over this socket may come in one of two formats: - -* A datagram with the literal journal entry data as payload, without - any file descriptors attached. 
- -* A datagram with an empty payload, but with a single - [`memfd`](https://man7.org/linux/man-pages/man2/memfd_create.2.html) - file descriptor that contains the literal journal entry data. - -Other combinations are not permitted, i.e. datagrams with both payload and file -descriptors, or datagrams with neither, or more than one file descriptor. Such -datagrams are ignored. The `memfd` file descriptor should be fully sealed. The -binary format in the datagram payload and in the `memfd` memory is -identical. Typically a client would attempt to first send the data as datagram -payload, but if this fails with an `EMSGSIZE` error it would immediately retry -via the `memfd` logic. - -A client probably should bump up the `SO_SNDBUF` socket option of its `AF_UNIX` -socket towards `journald` in order to delay blocking I/O as much as possible. - -## Data Format - -Each datagram should consist of a number of environment-like key/value -assignments. Unlike environment variable assignments the value may contain NUL -bytes however, as well as any other binary data. Keys may not include the `=` -or newline characters (or any other control characters or non-ASCII characters) -and may not be empty. - -Serialization into the datagram payload or `memfd` is straightforward: each -key/value pair is serialized via one of two methods: - -* The first method inserts a `=` character between key and value, and suffixes -the result with `\n` (i.e. the newline character, ASCII code 10). Example: a -key `FOO` with a value `BAR` is serialized `F`, `O`, `O`, `=`, `B`, `A`, `R`, -`\n`. - -* The second method should be used if the value of a field contains a `\n` -byte. In this case, the key name is serialized as is, followed by a `\n` -character, followed by a (non-aligned) little-endian unsigned 64-bit integer -encoding the size of the value, followed by the literal value data, followed by -`\n`. 
Example: a key `FOO` with a value `BAR` may be serialized using this -second method as: `F`, `O`, `O`, `\n`, `\003`, `\000`, `\000`, `\000`, `\000`, -`\000`, `\000`, `\000`, `B`, `A`, `R`, `\n`. - -If the value of a key/value pair contains a newline character (`\n`), it *must* -be serialized using the second method. If it does not, either method is -permitted. However, it is generally recommended to use the first method if -possible for all key/value pairs where applicable since the generated datagrams -are easily recognized and understood by the human eye this way, without any -manual binary decoding — which improves the debugging experience a lot, in -particular with tools such as `strace` that can show datagram content as text -dump. After all, log messages are highly relevant for debugging programs, hence -optimizing log traffic for readability without special tools is generally -desirable. - -Note that keys that begin with `_` have special semantics in `journald`: they -are *trusted* and implicitly appended by `journald` on the receiving -side. Clients should not send them — if they do anyway, they will be ignored. - -The most important key/value pair to send is `MESSAGE=`, as that contains the -actual log message text. Other relevant keys a client should send in most cases -are `PRIORITY=`, `CODE_FILE=`, `CODE_LINE=`, `CODE_FUNC=`, `ERRNO=`. It's -recommended to generate these fields implicitly on the client side. For further -information see the [relevant documentation of these -fields](https://www.freedesktop.org/software/systemd/man/systemd.journal-fields.html). - -The order in which the fields are serialized within one datagram is undefined -and may be freely chosen by the client. The server side might or might not -retain or reorder it when writing it to the Journal. - -Some programs might generate multi-line log messages (e.g. a stack unwinder -generating log output about a stack trace, with one line for each stack -frame). 
It's highly recommended to send these as a single datagram, using a -single `MESSAGE=` field with embedded newline characters between the lines (the -second serialization method described above must hence be used for this -field). If possible do not split up individual events into multiple Journal -events that might then be processed and written into the Journal as separate -entries. The Journal toolchain is capable of handling multi-line log entries -just fine, and it's generally preferred to have a single set of metadata fields -associated with each multi-line message. - -Note that the same keys may be used multiple times within the same datagram, -with different values. The Journal supports this and will write such entries to -disk without complaining. This is useful for associating a single log entry -with multiple suitable objects of the same type at once. This should only be -used for specific Journal fields however, where this is expected. Do not use -this for Journal fields where this is not expected and where code reasonably -assumes per-event uniqueness of the keys. In most cases code that consumes and -displays log entries is likely to ignore such non-unique fields or only -consider the first of the specified values. Specifically, if a Journal entry -contains multiple `MESSAGE=` fields, likely only the first one is -displayed. Note that a well-written logging client library thus will not use a -plain dictionary for accepting structured log metadata, but rather a data -structure that allows non-unique keys, for example an array, or a dictionary -that optionally maps to a set of values instead of a single value. - -## Example Datagram - -Here's an encoded message, with various common fields, all encoded according to -the first serialization method, with the exception of one, where the value -contains a newline character, and thus the second method must be used. 
- -``` -PRIORITY=3\n -SYSLOG_FACILITY=3\n -CODE_FILE=src/foobar.c\n -CODE_LINE=77\n -BINARY_BLOB\n -\004\000\000\000\000\000\000\000xx\nx\n -CODE_FUNC=some_func\n -SYSLOG_IDENTIFIER=footool\n -MESSAGE=Something happened.\n -``` - -(Lines are broken here after each `\n` to make things more readable. C-style -backslash escaping is used.) - -## Automatic Protocol Upgrading - -It might be wise to automatically upgrade to logging via the Journal's native -protocol in clients that previously used the BSD syslog protocol. Behaviour in -this case should be pretty obvious: try connecting a socket to -`/run/systemd/journal/socket` first (on success use the native Journal -protocol), and if that fails fall back to `/dev/log` (and use the BSD syslog -protocol). - -Programs normally logging to STDERR might also choose to upgrade to native -Journal logging in case they are invoked via systemd's service logic, where -STDOUT and STDERR are going to the Journal anyway. By preferring the native -protocol over STDERR-based logging, structured metadata can be passed along, -including priority information and more — which is not available on STDERR -based logging. If a program wants to detect automatically whether its STDERR is -connected to the Journal's stream transport, look for the `$JOURNAL_STREAM` -environment variable. The systemd service logic sets this variable to a -colon-separated pair of device and inode number (formatted in decimal ASCII) of -the STDERR file descriptor. If the `.st_dev` and `.st_ino` fields of the -`struct stat` data returned by `fstat(STDERR_FILENO, …)` match these values a -program can be sure its STDERR is connected to the Journal, and may then opt to -upgrade to the native Journal protocol via an `AF_UNIX` socket of its own, and -cease to use STDERR. - -Why bother with this environment variable check? 
A service program invoked by -systemd might employ shell-style I/O redirection on invoked subprograms, and -those should likely not upgrade to the native Journal protocol, but instead -continue to use the redirected file descriptors passed to them. Thus, by -comparing the device and inode number of the actual STDERR file descriptor with -the one the service manager passed, one can make sure that no I/O redirection -took place for the current program. - -## Alternative Implementations - -If you are looking for alternative implementations of this protocol (besides -systemd's own in `sd_journal_print()`), consider -[GLib's](https://gitlab.gnome.org/GNOME/glib/-/blob/main/glib/gmessages.c) or -[`dbus-broker`'s](https://github.com/bus1/dbus-broker/blob/main/src/util/log.c). - -And that's already all there is to it. diff --git a/docs/_interfaces/MEMORY_PRESSURE.md b/docs/_interfaces/MEMORY_PRESSURE.md deleted file mode 100644 index 69c23eccb2..0000000000 --- a/docs/_interfaces/MEMORY_PRESSURE.md +++ /dev/null @@ -1,238 +0,0 @@ ---- -title: Memory Pressure Handling -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# Memory Pressure Handling in systemd - -When the system is under memory pressure (i.e. some component of the OS -requires memory allocation but there is only very little or none available), -it can attempt various things to make more memory available again ("reclaim"): - -* The kernel can flush out memory pages backed by files on disk, under the - knowledge that it can reread them from disk when needed again. Candidate - pages are the many memory mapped executable files and shared libraries on - disk, among others. - -* The kernel can flush out memory pages not backed by files on disk - ("anonymous" memory, i.e. memory allocated via `malloc()` and similar calls, - or `tmpfs` file system contents) if there's swap to write it to. 
- -* Userspace can proactively release memory it allocated but doesn't immediately - require back to the kernel. This includes allocation caches, and other forms - of caches that are not required for normal operation to continue. - -The latter is what we want to focus on in this document: how to ensure -userspace processes can detect mounting memory pressure early and release memory -back to the kernel as it happens, relieving the memory pressure before it -becomes too critical. - -The effects of memory pressure during runtime generally are growing latencies -during operation: when a program requires memory but the system is busy writing -out memory to (relatively slow) disks in order to make some available, this -generally surfaces in scheduling latencies, and applications and services will -slow down until memory pressure is relieved. Hence, to ensure stable service -latencies it is essential to release unneeded memory back to the kernel early -on. - -On Linux the [Pressure Stall Information -(PSI)](https://docs.kernel.org/accounting/psi.html) Linux kernel interface is -the primary way to determine whether the system or a part of it is under memory -pressure. PSI makes available to userspace a `poll()`-able file descriptor that -gets notifications whenever memory pressure latencies for the system or a -control group grow beyond some level. - -`systemd` itself makes use of PSI, and helps applications to do so too. -Specifically: - -* Most of systemd's long running components watch for PSI memory pressure - events, and release allocation caches and other resources once seen. - -* systemd's service manager provides a protocol for asking services to monitor - PSI events and configure the appropriate pressure thresholds. - -* systemd's `sd-event` event loop API provides a high-level call - `sd_event_add_memory_pressure()` enabling programs using it to efficiently - hook into the PSI memory pressure protocol provided by the service manager, - with very few lines of code. 
- -## Memory Pressure Service Protocol - -If memory pressure handling for a specific service is enabled via -`MemoryPressureWatch=` the memory pressure service protocol is used to tell the -service code about this. Specifically two environment variables are set by the -service manager, and typically consumed by the service: - -* The `$MEMORY_PRESSURE_WATCH` environment variable will contain an absolute - path in the file system to the file to watch for memory pressure events. This - will usually point to a PSI file such as the `memory.pressure` file of the - service's cgroup. In order to make debugging easier, and allow later - extension it is recommended for applications to also allow this path to refer - to an `AF_UNIX` stream socket in the file system or a FIFO inode in the file - system. Regardless which of the three types of inodes this absolute path - refers to, all three are `poll()`-able for memory pressure events. The - variable can also be set to the literal string `/dev/null`. If so the service - code should take this as indication that memory pressure monitoring is not - desired and should be turned off. - -* The `$MEMORY_PRESSURE_WRITE` environment variable is optional. If set by the - service manager it contains Base64 encoded data (that may contain arbitrary - binary values, including NUL bytes) that should be written into the path - provided via `$MEMORY_PRESSURE_WATCH` right after opening it. Typically, if - talking directly to a PSI kernel file this will contain information about the - threshold settings configurable in the service manager. - -When a service initializes it hence should look for -`$MEMORY_PRESSURE_WATCH`. If set, it should try to open the specified path. If -it detects the path to refer to a regular file it should assume it refers to a -PSI kernel file. 
If so, it should write the data from `$MEMORY_PRESSURE_WRITE` -into the file descriptor (after Base64-decoding it, and only if the variable is -set) and then watch for `POLLPRI` events on it. If it detects the path refers -to a FIFO inode, it should open it, write the `$MEMORY_PRESSURE_WRITE` data -into it (as above) and then watch for `POLLIN` events on it. Whenever `POLLIN` -is seen it should read and discard any data queued in the FIFO. If the path -refers to an `AF_UNIX` socket in the file system, the application should -`connect()` a stream socket to it, write `$MEMORY_PRESSURE_WRITE` into it (as -above) and watch for `POLLIN`, discarding any data it might receive. - -To summarize: - -* If `$MEMORY_PRESSURE_WATCH` points to a regular file: open and watch for - `POLLPRI`, never read from the file descriptor. - -* If `$MEMORY_PRESSURE_WATCH` points to a FIFO: open and watch for `POLLIN`, - read/discard any incoming data. - -* If `$MEMORY_PRESSURE_WATCH` points to an `AF_UNIX` socket: connect and watch - for `POLLIN`, read/discard any incoming data. - -* If `$MEMORY_PRESSURE_WATCH` contains the literal string `/dev/null`, turn off - memory pressure handling. - -(And in each case, immediately after opening/connecting to the path, write the -decoded `$MEMORY_PRESSURE_WRITE` data into it.) - -Whenever a `POLLPRI`/`POLLIN` event is seen the service is under memory -pressure. It should use this as a hint to release suitable redundant resources, -for example: - -* glibc's memory allocation cache, via - [`malloc_trim()`](https://man7.org/linux/man-pages/man3/malloc_trim.3.html). Similarly, - allocation caches implemented in the service itself. - -* Any other local caches, such as DNS caches, or web caches (in particular if - the service is a web browser). - -* Terminate any idle worker threads or processes. - -* Run a garbage collection (GC) cycle, if the runtime environment supports it. - -* Terminate the process if it is idle and can be automatically started when - needed next. 
- -Which actions precisely to take depends on the service in question. Note that -the notifications are delivered when memory allocation latency already degraded -beyond some point. Hence when discussing which resources to keep and which to -discard, keep in mind that latencies incurred -recovering discarded resources at a later point are typically acceptable, given that -latencies *already* are affected negatively. - -In case the path supplied via `$MEMORY_PRESSURE_WATCH` points to a PSI kernel -API file, or to an `AF_UNIX` socket, opening it multiple times is safe and reliable, -and should deliver notifications to each of the opened file descriptors. This -is specifically useful for services that consist of multiple processes, and -where each of them shall be able to release resources on memory pressure. - -The `POLLPRI`/`POLLIN` conditions will be triggered every time memory pressure -is detected, but not continuously. It is thus safe to keep `poll()`-ing on the -same file descriptor continuously, and executing resource release operations -whenever the file descriptor triggers without having to expect overloading the -process. - -(Currently, the protocol defined here only allows configuration of a single -"degree" of memory pressure, there's no distinction made on how strong the -pressure is. In future, if it becomes apparent that there's clear need to -extend this we might eventually add different degrees, most likely by adding -additional environment variables such as `$MEMORY_PRESSURE_WRITE_LOW` and -`$MEMORY_PRESSURE_WRITE_HIGH` or similar, which may contain different settings -for lower or higher memory pressure thresholds.) 
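
The service side of the protocol described in this section can be sketched in plain C without `libsystemd`. This is a minimal illustration, not systemd's own implementation: the function names `base64_decode()` and `memory_pressure_open()` are hypothetical, error handling is reduced to the essentials, and opening FIFOs with `O_RDWR` relies on Linux behaviour.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <poll.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <sys/un.h>
#include <unistd.h>

/* Decodes Base64 text into out, which must hold at least strlen(in) bytes.
 * Returns the number of decoded bytes, or -1 on an invalid character. */
static ssize_t base64_decode(const char *in, uint8_t *out) {
    static const char tbl[] =
        "ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789+/";
    uint32_t acc = 0;
    int bits = 0;
    ssize_t n = 0;

    for (; *in != '\0' && *in != '='; in++) {
        const char *p = strchr(tbl, *in);
        if (!p)
            return -1;
        acc = (acc << 6) | (uint32_t) (p - tbl);
        bits += 6;
        if (bits >= 8) {
            bits -= 8;
            out[n++] = (uint8_t) (acc >> bits);
        }
    }
    return n;
}

/* Opens the path from $MEMORY_PRESSURE_WATCH and writes the decoded
 * $MEMORY_PRESSURE_WRITE data into it. Returns a file descriptor and stores
 * the poll() events to wait for in *ret_events. Returns -ENOENT if
 * monitoring is not requested, or another negative errno-style value. */
int memory_pressure_open(short *ret_events) {
    const char *path = getenv("MEMORY_PRESSURE_WATCH");
    const char *data = getenv("MEMORY_PRESSURE_WRITE");
    struct stat st;
    int fd;

    assert(ret_events);
    if (!path || strcmp(path, "/dev/null") == 0)
        return -ENOENT; /* unset, or explicitly turned off */
    if (stat(path, &st) < 0)
        return -errno;

    if (S_ISSOCK(st.st_mode)) {
        struct sockaddr_un sa = { .sun_family = AF_UNIX };
        if (strlen(path) >= sizeof(sa.sun_path))
            return -ENAMETOOLONG;
        strcpy(sa.sun_path, path);
        fd = socket(AF_UNIX, SOCK_STREAM | SOCK_CLOEXEC, 0);
        if (fd < 0)
            return -errno;
        if (connect(fd, (struct sockaddr *) &sa, sizeof(sa)) < 0) {
            int e = errno;
            close(fd);
            return -e;
        }
        *ret_events = POLLIN;
    } else if (S_ISREG(st.st_mode) || S_ISFIFO(st.st_mode)) {
        fd = open(path, O_RDWR | O_CLOEXEC | O_NONBLOCK);
        if (fd < 0)
            return -errno;
        /* PSI file: POLLPRI, never read. FIFO: POLLIN, read and discard. */
        *ret_events = S_ISREG(st.st_mode) ? POLLPRI : POLLIN;
    } else
        return -EBADF;

    if (data) {
        uint8_t *buf = malloc(strlen(data) + 1);
        ssize_t n = buf ? base64_decode(data, buf) : -1;
        if (n < 0 || write(fd, buf, (size_t) n) != n) {
            free(buf);
            close(fd);
            return -EIO;
        }
        free(buf);
    }
    return fd;
}
```

A caller would then `poll()` on the returned descriptor for the returned events. Whenever the descriptor triggers, it would read and discard queued data in the `POLLIN` case and release caches, for example via `malloc_trim(0)`.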
- -## Service Manager Settings - -The service manager provides two per-service settings that control the memory -pressure handling: - -* The - [`MemoryPressureWatch=`](https://www.freedesktop.org/software/systemd/man/systemd.resource-control.html#MemoryPressureWatch=) - setting controls whether to enable the memory pressure protocol for the - service in question. - -* The `MemoryPressureThresholdSec=` setting allows to configure the threshold - when to signal memory pressure to the services. It takes a time value - (usually in the millisecond range) that defines a threshold per 1s time - window: if memory allocation latencies grow beyond this threshold - notifications are generated towards the service, requesting it to release - resources. - -The `/etc/systemd/system.conf` file provides two settings that may be used to -select the default values for the above settings. If the threshold isn't -configured via the per-service nor system-wide option, it defaults to 100ms. - -When memory pressure monitoring is enabled for a service via -`MemoryPressureWatch=` this primarily does three things: - -* It enables cgroup memory accounting for the service (this is a requirement - for per-cgroup PSI) - -* It sets the aforementioned two environment variables for processes invoked - for the service, based on the control group of the service and provided - settings. - -* The `memory.pressure` PSI control group file associated with the service's - cgroup is delegated to the service (i.e. permissions are relaxed so that - unprivileged service payload code can open the file for writing). 
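
As an illustration of the two per-service settings above, a drop-in for a hypothetical `example.service` might look like the following. The file path, the unit name and the `150ms` value are assumptions chosen for the example, not recommendations.

```ini
# /etc/systemd/system/example.service.d/memory-pressure.conf (hypothetical path)
[Service]
MemoryPressureWatch=on
MemoryPressureThresholdSec=150ms
```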
- -## Memory Pressure Events in `sd-event` - -The -[`sd-event`](https://www.freedesktop.org/software/systemd/man/sd-event.html) -event loop library provides two API calls that encapsulate the -functionality described above: - -* The - [`sd_event_add_memory_pressure()`](https://www.freedesktop.org/software/systemd/man/sd_event_add_memory_pressure.html) - call implements the service-side of the memory pressure protocol and - integrates it with an `sd-event` event loop. It reads the two environment - variables, connects/opens the specified file, writes the specified data to it, - then watches it for events. - -* The `sd_event_trim_memory()` call may be called to trim the calling - processes' memory. It's a wrapper around glibc's `malloc_trim()`, but first - releases allocation caches maintained by libsystemd internally. This function - serves as the default when a NULL callback is supplied to - `sd_event_add_memory_pressure()`. - -When implementing a service using `sd-event`, for automatic memory pressure -handling, it's typically sufficient to add a line such as: - -```c -(void) sd_event_add_memory_pressure(event, NULL, NULL, NULL); -``` - -– right after allocating the event loop object `event`. - -## Other APIs - -Other programming environments might have native APIs to watch memory -pressure/low memory events. Most notable is probably GLib's -[GMemoryMonitor](https://developer-old.gnome.org/gio/stable/GMemoryMonitor.html). It -currently uses the per-system Linux PSI interface as the backend, but operates -differently than the above: memory pressure events are picked up by a system -service, which then propagates this through D-Bus to the applications. This is -typically less than ideal, since this means each notification event has to -traverse three processes before being handled. This traversal creates -additional latencies at a time where the system is already experiencing adverse -latencies. 
Moreover, it focusses on system-wide PSI events, even though -service-local ones are generally the better approach. diff --git a/docs/_interfaces/PASSWORD_AGENTS.md b/docs/_interfaces/PASSWORD_AGENTS.md deleted file mode 100644 index 29bd949077..0000000000 --- a/docs/_interfaces/PASSWORD_AGENTS.md +++ /dev/null @@ -1,41 +0,0 @@ ---- -title: Password Agents -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# Password Agents - -systemd 12 and newer support lightweight password agents which can be used to query the user for system-level passwords or passphrases. These are passphrases that are not related to a specific user, but to some kind of hardware or service. Right now this is used exclusively for encrypted hard-disk passphrases but later on this is likely to be used to query passphrases of SSL certificates at Apache startup time as well. The basic idea is that a system component requesting a password entry can simply drop a simple .ini-style file into `/run/systemd/ask-password` which multiple different agents may watch via `inotify()`, and query the user as necessary. The answer is then sent back to the querier via an `AF_UNIX`/`SOCK_DGRAM` socket. Multiple agents might be running at the same time in which case they all should query the user and the agent which answers first wins. Right now systemd ships with the following passphrase agents: - -* A Plymouth agent used for querying passwords during boot-up -* A console agent used in similar situations if Plymouth is not available -* A GNOME agent which can be run as part of the normal user session which pops up a notification message and icon which when clicked receives the passphrase from the user. This is useful and necessary in case an encrypted system hard-disk is plugged in when the machine is already up. -* A [`wall(1)`](https://man7.org/linux/man-pages/man1/wall.1.html) agent which sends wall messages as soon as a password shall be entered. 
-* A simple tty agent which is built into "`systemctl start`" (and similar commands) and asks passwords to the user during manual startup of a service -* A simple tty agent which can be run manually to respond to all queued passwords - -It is easy to write additional agents. The basic algorithm to follow looks like this: - -* Create an inotify watch on /run/systemd/ask-password, watch for `IN_CLOSE_WRITE|IN_MOVED_TO` -* Ignore all events on files in that directory that do not start with "`ask.`" -* As soon as a file named "`ask.xxxx`" shows up, read it. It's a simple `.ini` file that may be parsed with the usual parsers. The `xxxx` suffix is randomized. -* Make sure to ignore unknown `.ini` file keys in those files, so that we can easily extend the format later on. -* You'll find the question to ask the user in the `Message=` field in the `[Ask]` section. It is a single-line string in UTF-8, which might be internationalized (by the party that originally asks the question, not by the agent). -* You'll find an icon name (following the XDG icon naming spec) to show next to the message in the `Icon=` field in the `[Ask]` section -* You'll find the PID of the client asking the question in the `PID=` field in the `[Ask]` section (Before asking your question use `kill(PID, 0)` and ignore the file if this returns `ESRCH`; there's no need to show the data of this field but if you want to you may) -* `Echo=` specifies whether the input should be obscured. If this field is missing or is `Echo=0`, the input should not be shown. -* The socket to send the response to is configured via `Socket=` in the `[Ask]` section. It is a `AF_UNIX`/`SOCK_DGRAM` socket in the file system. -* Ignore files where the time specified in the `NotAfter=` field in the `[Ask]` section is in the past. The time is specified in usecs, and refers to the `CLOCK_MONOTONIC` clock. If `NotAfter=` is `0`, no such check should take place. 
-* Make sure to hide a password query dialog as soon as a) the `ask.xxxx` file is deleted, watch this with inotify. b) the `NotAfter=` time elapses, if it is set `!= 0`. -* Access to the socket is restricted to privileged users. To acquire the necessary privileges to send the answer back, consider using PolicyKit. In fact, the GNOME agent we ship does that, and you may simply piggyback on that, by executing "`/usr/bin/pkexec /lib/systemd/systemd-reply-password 1 /path/to/socket`" or "`/usr/bin/pkexec /lib/systemd/systemd-reply-password 0 /path/to/socket`" and writing the password to its standard input. Use '`1`' as argument if a password was entered by the user, or '`0`' if the user canceled the request. -* If you do not want to use PK ensure to acquire the necessary privileges in some other way and send a single datagram to the socket consisting of the password string either prefixed with "`+`" or with "`-`" depending on whether the password entry was successful or not. You may but don't have to include a final `NUL` byte in your message. - -Again, it is essential that you stop showing the password box/notification/status icon if the `ask.xxx` file is removed or when `NotAfter=` elapses (if it is set `!= 0`)! - -It may happen that multiple password entries are pending at the same time. Your agent needs to be able to deal with that. Depending on your environment you may either choose to show all outstanding passwords at the same time or instead only one and as soon as the user has replied to that one go on to the next one. - -You may test this all with manually invoking the "`systemd-ask-password`" tool on the command line. Pass `--no-tty` to ensure the password is asked via the agent system. Note that only privileged users may use this tool (after all this is intended purely for system-level passwords). - -If you write a system level agent a smart way to activate it is using systemd `.path` units. 
This will ensure that systemd will watch the `/run/systemd/ask-password` directory and spawn the agent as soon as that directory becomes non-empty. In fact, the console, wall and Plymouth agents are started like this. If systemd is used to maintain user sessions as well you can use a similar scheme to automatically spawn your user password agent as well. (As of this moment we have not switched any DE over to use systemd for session management, however.) diff --git a/docs/_interfaces/PORTABILITY_AND_STABILITY.md b/docs/_interfaces/PORTABILITY_AND_STABILITY.md deleted file mode 100644 index abdc3dc658..0000000000 --- a/docs/_interfaces/PORTABILITY_AND_STABILITY.md +++ /dev/null @@ -1,171 +0,0 @@ ---- -title: Interface Portability and Stability -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# Interface Portability and Stability Promise - -systemd provides various interfaces developers and programs might rely on. Starting with version 26 (the first version released with Fedora 15) we promise to keep a number of them stable and compatible for the future. - -The stable interfaces are: - -* **The unit configuration file format**. Unit files written now will stay compatible with future versions of systemd. Extensions to the file format will happen in a way that existing files remain compatible. - -* **The command line interface** of `systemd`, `systemctl`, `loginctl`, `journalctl`, and all other command line utilities installed in `$PATH` and documented in a man page. We will make sure that scripts invoking these commands will continue to work with future versions of systemd. Note however that the output generated by these commands is generally not included in the promise, unless it is documented in the man page. Example: the output of `systemctl status` is not stable, but that of `systemctl show` is, because the former is intended to be human readable and the latter computer readable, and this is documented in the man page. 
- -* **The protocol spoken on the socket referred to by `$NOTIFY_SOCKET`**, as documented in [sd_notify(3)](https://www.freedesktop.org/software/systemd/man/sd_notify.html). - -* Some of the **"special" unit names** and their semantics. To be precise the ones that are necessary for normal services, and not those required only for early boot and late shutdown, with very few exceptions. To list them here: `basic.target`, `shutdown.target`, `sockets.target`, `network.target`, `getty.target`, `graphical.target`, `multi-user.target`, `rescue.target`, `emergency.target`, `poweroff.target`, `reboot.target`, `halt.target`, `runlevel[1-5].target`. - -* **The D-Bus interfaces of the main service daemon and other daemons**. We try to always preserve backwards compatibility, and intentional breakage is never introduced. Nevertheless, when we find bugs that mean that the existing interface was not useful, or when the implementation did something different than stated by the documentation and the implemented behaviour is not useful, we will fix the implementation and thus introduce a change in behaviour. But the API (parameter counts and types) is never changed, and existing attributes and methods will not be removed. - -* For a more comprehensive and authoritative list, consult the chart below. - -The following interfaces will not necessarily be kept stable for now, but we will eventually make a stability promise for these interfaces too. In the meantime we will however try to keep breakage of these interfaces at a minimum: - -* **The set of states of the various state machines used in systemd**, e.g. the high-level unit states inactive, active, deactivating, and so on, as well (and in particular) the low-level per-unit states. - -* **All "special" units that aren't listed above**. - -The following interfaces are considered private to systemd, and are not and will not be covered by any stability promise: - -* **Undocumented switches** to `systemd`, `systemctl` and otherwise. 
- -* **The internal protocols** used on the various sockets such as the sockets `/run/systemd/shutdown`, `/run/systemd/private`. - -One of the main goals of systemd is to unify basic Linux configurations and service behaviors across all distributions. The systemd project does not contain any distribution-specific parts. Distributions are expected to convert their individual configurations to the systemd format over time, or they will need to carry and maintain patches in their package if they still decide to stay different. - -What does this mean for you? When developing with systemd, don't use any of the latter interfaces, or we will tell your mom, and she won't love you anymore. You are welcome to use the other interfaces listed here, but if you use any of the second kind (i.e. those where we don't yet make a stability promise), then make sure to subscribe to our mailing list, where we will announce API changes, and be prepared to update your program eventually. - -Note that this is a promise, not an eternal guarantee. These are our intentions, but if in the future there are very good reasons to change or get rid of an interface we have listed above as stable, then we might take the liberty to do so, despite this promise. However, if we do this, then we'll do our best to provide a smooth and reasonably long transition phase. - - -## Interface Portability And Stability Chart - -systemd provides a number of APIs to applications. Below you'll find a table detailing which APIs are considered stable and how portable they are. - -This list is intended to be useful for distribution and OS developers who are interested in maintaining a certain level of compatibility with the new interfaces systemd introduced, without relying on systemd itself. - -In general it is our intention to cooperate through interfaces and not code with other distributions and OSes.
That means that the interfaces where this applies are best reimplemented in a compatible fashion on those other operating systems. To make this easy we provide detailed interface documentation where necessary. That said, it's all Open Source, hence you have the option to a) fork our code and maintain portable versions of the parts you are interested in independently for your OS, or b) build systemd for your distro, but leave out all components except the ones you are interested in and run them without the core of systemd involved. We will try not to make this any more difficult than necessary. Patches to allow systemd code to be more portable will be accepted on a case-by-case basis (essentially, patches to follow well-established standards instead of e.g. glibc or Linux extensions have a very high chance of being accepted, while patches which make the code ugly or exist solely to work around bugs in other projects have a low chance of being accepted). - -Many of these interfaces are already being used by applications and 3rd party code. If you are interested in compatibility with these applications, please consider supporting these interfaces in your distribution, where possible. - - -## General Portability of systemd and its Components - -**Portability to OSes:** systemd is not portable to non-Linux systems. It makes use of a large number of Linux-specific interfaces, including many that are used by its very core. We do not consider it feasible to port systemd to other Unixes (let alone non-Unix operating systems) and will not accept patches for systemd core implementing any such portability (but hey, it's git, so it's as easy as it can get to maintain your own fork...). APIs that are supposed to be used as library code are exempted from this: it is important to us that these compile nicely on non-Linux and even non-Unix platforms, even if they might just become NOPs. 
- -**Portability to Architectures:** It is important to us that systemd is portable to little endian as well as big endian systems. We will make sure to provide portability with all important architectures and hardware Linux runs on and are happy to accept patches for this. - -**Portability to Distributions:** It is important to us that systemd is portable to all Linux distributions. However, the goal is to unify many of the needless differences between the distributions, and hence we will not accept patches for certain distribution-specific work-arounds. Compatibility with the distribution's legacy should be maintained in the distribution's packaging, and not in the systemd source tree. - -**Compatibility with Specific Versions of Other Packages:** We generally avoid adding compatibility kludges to systemd that work around bugs in certain versions of other software systemd interfaces with. We strongly encourage fixing bugs where they are, and if that's not systemd we would rather not try to fix it there. (Very few exceptions to this rule are possible, and you need an exceptionally strong case for one.) - - -## General Portability of systemd's APIs - -systemd's APIs are available everywhere where systemd is available. Some of the APIs we have defined are supposed to be generic enough to be implementable independently of systemd, thus allowing compatibility with systems systemd itself is not compatible with, i.e. other OSes, and distributions that are unwilling to fully adopt systemd. - -A number of systemd's APIs expose Linux or systemd-specific features that cannot sensibly be implemented elsewhere. Please consult the table below for information about which ones these are. - -Note that not all of these interfaces are our invention (though most are); we just adopted them in systemd to make them more prominently implemented. For example, we adopted many Debian facilities in systemd to push them into the other distributions as well. 
- - ---- - - -And now, here's the list of (hopefully) all APIs that we have introduced with systemd: - -| API | Type | Covered by Interface Stability Promise | Fully documented | Known External Consumers | Reimplementable Independently | Known Other Implementations | systemd Implementation portable to other OSes or non-systemd distributions | -| --- | ---- | ----------------------------------------------------------------------------------------- | ---------------- | ------------------------ | ----------------------------- | --------------------------- | -------------------------------------------------------------------------- | -| [hostnamed](https://www.freedesktop.org/software/systemd/man/org.freedesktop.hostname1.html) | D-Bus | yes | yes | GNOME | yes | [Ubuntu](https://launchpad.net/ubuntu/+source/ubuntu-system-service), [Gentoo](http://www.gentoo.org/proj/en/desktop/gnome/openrc-settingsd.xml), [BSD](http://uglyman.kremlin.cc/gitweb/gitweb.cgi?p=systembsd.git;a=summary) | partially | -| [localed](https://www.freedesktop.org/software/systemd/man/org.freedesktop.locale1.html) | D-Bus | yes | yes | GNOME | yes | [Ubuntu](https://launchpad.net/ubuntu/+source/ubuntu-system-service), [Gentoo](http://www.gentoo.org/proj/en/desktop/gnome/openrc-settingsd.xml), [BSD](http://uglyman.kremlin.cc/gitweb/gitweb.cgi?p=systembsd.git;a=summary) | partially | -| [timedated](https://www.freedesktop.org/software/systemd/man/org.freedesktop.timedate1.html) | D-Bus | yes | yes | GNOME | yes | [Gentoo](http://www.gentoo.org/proj/en/desktop/gnome/openrc-settingsd.xml), [BSD](http://uglyman.kremlin.cc/gitweb/gitweb.cgi?p=systembsd.git;a=summary) | partially | -| [initrd interface](INITRD_INTERFACE) | Environment, flag files | yes | yes | mkosi, dracut, ArchLinux | yes | ArchLinux | no | -| [Container interface](CONTAINER_INTERFACE) | Environment, Mounts | yes | yes | libvirt/LXC | yes | - | no | -| [Boot Loader interface](BOOT_LOADER_INTERFACE) | EFI variables | yes | yes | 
gummiboot | yes | - | no | -| [Service bus API](https://www.freedesktop.org/software/systemd/man/org.freedesktop.systemd1.html) | D-Bus | yes | yes | system-config-services | no | - | no | -| [logind](https://www.freedesktop.org/software/systemd/man/org.freedesktop.login1.html) | D-Bus | yes | yes | GNOME | no | - | no | -| [sd-bus.h API](https://www.freedesktop.org/software/systemd/man/sd-bus.html) | C Library | yes | yes | - | maybe | - | maybe | -| [sd-daemon.h API](https://www.freedesktop.org/software/systemd/man/sd-daemon.html) | C Library or Drop-in | yes | yes | numerous | yes | - | yes | -| [sd-device.h API](https://www.freedesktop.org/software/systemd/man/sd-device.html) | C Library | yes | no | numerous | yes | - | yes | -| [sd-event.h API](https://www.freedesktop.org/software/systemd/man/sd-event.html) | C Library | yes | yes | - | maybe | - | maybe | -| [sd-gpt.h API](https://www.freedesktop.org/software/systemd/man/sd-gpt.html) | Header Library | yes | no | - | yes | - | yes | -| [sd-hwdb.h API](https://www.freedesktop.org/software/systemd/man/sd-hwdb.html) | C Library | yes | yes | - | maybe | - | yes | -| [sd-id128.h API](https://www.freedesktop.org/software/systemd/man/sd-id128.html) | C Library | yes | yes | - | yes | - | yes | -| [sd-journal.h API](https://www.freedesktop.org/software/systemd/man/sd-journal.html) | C Library | yes | yes | - | maybe | - | no | -| [sd-login.h API](https://www.freedesktop.org/software/systemd/man/sd-login.html) | C Library | yes | yes | GNOME, polkit, ... 
| no | - | no | -| [sd-messages.h API](https://www.freedesktop.org/software/systemd/man/sd-messages.html) | Header Library | yes | yes | - | yes | python-systemd | yes | -| [sd-path.h API](https://www.freedesktop.org/software/systemd/man/sd-path.html) | C Library | yes | no | - | maybe | - | maybe | -| [$XDG_RUNTIME_DIR](https://specifications.freedesktop.org/basedir-spec/basedir-spec-latest.html) | Environment | yes | yes | glib, GNOME | yes | - | no | -| [$LISTEN_FDS $LISTEN_PID FD Passing](https://www.freedesktop.org/software/systemd/man/sd_listen_fds.html) | Environment | yes | yes | numerous (via sd-daemon.h) | yes | - | no | -| [$NOTIFY_SOCKET Daemon Notifications](https://www.freedesktop.org/software/systemd/man/sd_notify.html) | Environment | yes | yes | a few, including udev | yes | - | no | -| [argv[0][0]='@' Logic](ROOT_STORAGE_DAEMONS) | `/proc` marking | yes | yes | mdadm | yes | - | no | -| [Unit file format](https://www.freedesktop.org/software/systemd/man/systemd.unit.html) | File format | yes | yes | numerous | no | - | no | -| [Network](https://www.freedesktop.org/software/systemd/man/systemd.network.html) & [Netdev file format](https://www.freedesktop.org/software/systemd/man/systemd.netdev.html) | File format | yes | yes | no | no | - | no | -| [Link file format](https://www.freedesktop.org/software/systemd/man/systemd.link.html) | File format | yes | yes | no | no | - | no | -| [Journal File Format](JOURNAL_FILE_FORMAT) | File format | yes | yes | - | maybe | - | no | -| [Journal Export Format](JOURNAL_EXPORT_FORMATS.md#journal-export-format) | File format | yes | yes | - | yes | - | no | -| [Journal JSON Format](JOURNAL_EXPORT_FORMATS.md#journal-json-format) | File format | yes | yes | - | yes | - | no | -| [Cooperation in cgroup tree](https://www.freedesktop.org/wiki/Software/systemd/PaxControlGroups) | Treaty | yes | yes | libvirt | yes | libvirt | no | -| [Password Agents](PASSWORD_AGENTS) | Socket+Files | yes | yes | - | yes | - | no | -| 
[udev multi-seat properties](https://www.freedesktop.org/software/systemd/man/sd-login.html) | udev Property | yes | yes | X11, gdm | no | - | no | -| udev session switch ACL properties | udev Property | no | no | - | no | - | no | -| [CLI of systemctl,...](https://www.freedesktop.org/software/systemd/man/systemctl.html) | CLI | yes | yes | numerous | no | - | no | -| [tmpfiles.d](https://www.freedesktop.org/software/systemd/man/tmpfiles.d.html) | File format | yes | yes | numerous | yes | ArchLinux | partially | -| [sysusers.d](https://www.freedesktop.org/software/systemd/man/sysusers.d.html) | File format | yes | yes | unknown | yes | | partially | -| [/etc/machine-id](https://www.freedesktop.org/software/systemd/man/machine-id.html) | File format | yes | yes | D-Bus | yes | - | no | -| [binfmt.d](https://www.freedesktop.org/software/systemd/man/binfmt.d.html) | File format | yes | yes | numerous | yes | - | partially | -| [/etc/hostname](https://www.freedesktop.org/software/systemd/man/hostname.html) | File format | yes | yes | numerous (it's a Debian thing) | yes | Debian, ArchLinux | no | -| [/etc/locale.conf](https://www.freedesktop.org/software/systemd/man/locale.conf.html) | File format | yes | yes | - | yes | ArchLinux | partially | -| [/etc/machine-info](https://www.freedesktop.org/software/systemd/man/machine-info.html) | File format | yes | yes | - | yes | - | partially | -| [modules-load.d](https://www.freedesktop.org/software/systemd/man/modules-load.d.html) | File format | yes | yes | numerous | yes | - | partially | -| [/usr/lib/os-release](https://www.freedesktop.org/software/systemd/man/os-release.html) | File format | yes | yes | some | yes | Fedora, OpenSUSE, ArchLinux, Angstrom, Frugalware, others... 
| no | -| [sysctl.d](https://www.freedesktop.org/software/systemd/man/sysctl.d.html) | File format | yes | yes | some (it's a Debian thing) | yes | procps/Debian, ArchLinux | partially | -| [/etc/timezone](https://www.freedesktop.org/software/systemd/man/timezone.html) | File format | yes | yes | numerous (it's a Debian thing) | yes | Debian | partially | -| [/etc/vconsole.conf](https://www.freedesktop.org/software/systemd/man/vconsole.conf.html) | File format | yes | yes | - | yes | ArchLinux | partially | -| `/run` | File hierarchy change | yes | yes | numerous | yes | OpenSUSE, Debian, ArchLinux | no | -| [Generators](https://www.freedesktop.org/software/systemd/man/systemd.generator.html) | Subprocess | yes | yes | - | no | - | no | -| [System Updates](https://www.freedesktop.org/software/systemd/man/systemd.offline-updates.html) | System Mode | yes | yes | - | no | - | no | -| [Presets](https://www.freedesktop.org/software/systemd/man/systemd.preset.html) | File format | yes | yes | - | no | - | no | -| Udev rules | File format | yes | yes | numerous | no | no | partially | - - -### Explanations - -Items for which "systemd implementation portable to other OSes" is "partially" means that it is possible to run the respective tools that are included in the systemd tarball outside of systemd. Note however that this is not officially supported, so you are more or less on your own if you do this. If you are opting for this solution simply build systemd as you normally would but drop all files except those which you are interested in. - -Of course, it is our intention to eventually document all interfaces we defined. If we haven't documented them for now, this is usually because we want the flexibility to still change things, or don't want 3rd party applications to make use of these interfaces already. That said, our sources are quite readable and open source, so feel free to spelunk around in the sources if you want to know more. 
- -If you decide to reimplement one of the APIs for which "Reimplementable independently" is "no", then we won't stop you, but you are on your own. - -This is not an attempt to comprehensively list all users of these APIs. We are just listing the most obvious/prominent ones which come to our mind. - -Of course, one last thing I can't make myself not ask you before we finish here, and before you start reimplementing these APIs in your distribution: are you sure it's time well spent if you work on reimplementing all this code instead of just spending it on adopting systemd on your distro as well? - -## Independent Operation of systemd Programs - -Some programs in the systemd suite are intended to operate independently of the -running init process (or even without an init process, for example when -creating system installation chroots). They can be safely called on systems with -a different init process or for example in package installation scriptlets. - -The following programs currently and in the future will support operation -without communicating with the `systemd` process: -`systemd-escape`, -`systemd-id128`, -`systemd-path`, -`systemd-tmpfiles`, -`systemd-sysctl`, -`systemd-sysusers`. - -Many other programs support operation without the system manager except when -the specific functionality requires such communication. For example, -`journalctl` operates almost independently, but will query the boot id when -`--boot` option is used; it also requires `systemd-journald` (and thus -`systemd`) to be running for options like `--flush` and `--sync`. -`systemd-journal-remote`, `systemd-journal-upload`, `systemd-journal-gatewayd`, -`coredumpctl`, `busctl`, `systemctl --root` also fall into this category of -mostly-independent programs. 
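As an illustration of what "Reimplementable Independently: yes" means in the chart above, the `$NOTIFY_SOCKET` protocol can be spoken without linking against any systemd library. The following is a minimal sketch, not systemd's own implementation: the function name `notify()` and its return convention are hypothetical, and it assumes the variable holds either an absolute path or an abstract socket name prefixed with `@`, with messages such as `READY=1` sent as a single datagram (see sd_notify(3) for the authoritative description).

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/un.h>
#include <unistd.h>

/* Hypothetical helper: sends one datagram (e.g. "READY=1") to the socket
 * named by $NOTIFY_SOCKET. Returns 1 if the message was sent, 0 if the
 * variable is unset or empty, and -1 with errno set on error. */
static int notify(const char *message) {
        struct sockaddr_un sa = { .sun_family = AF_UNIX };
        const char *path = getenv("NOTIFY_SOCKET");
        size_t len, msg_len;
        ssize_t sent;
        int fd, saved;

        assert(message);

        if (!path || !*path)
                return 0; /* not running under a manager that wants notifications */

        /* Assumption: only absolute paths and '@'-prefixed abstract names are handled. */
        len = strlen(path);
        if ((path[0] != '/' && path[0] != '@') || len >= sizeof sa.sun_path) {
                errno = EINVAL;
                return -1;
        }

        memcpy(sa.sun_path, path, len);
        if (path[0] == '@')
                sa.sun_path[0] = '\0'; /* abstract namespace: leading NUL byte */

        fd = socket(AF_UNIX, SOCK_DGRAM | SOCK_CLOEXEC, 0);
        if (fd < 0)
                return -1;

        msg_len = strlen(message);
        sent = sendto(fd, message, msg_len, 0, (struct sockaddr *) &sa,
                      (socklen_t) (offsetof(struct sockaddr_un, sun_path) + len));

        saved = errno;
        close(fd);
        errno = saved;

        return sent == (ssize_t) msg_len ? 1 : -1;
}
```

A daemon would typically call `notify("READY=1")` once its initialization is complete. Returning 0 when the variable is unset keeps the same binary usable on systems where no service manager listens for such messages.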
diff --git a/docs/_interfaces/ROOT_STORAGE_DAEMONS.md b/docs/_interfaces/ROOT_STORAGE_DAEMONS.md deleted file mode 100644 index 69812c9055..0000000000 --- a/docs/_interfaces/ROOT_STORAGE_DAEMONS.md +++ /dev/null @@ -1,194 +0,0 @@ ---- -title: Storage Daemons for the Root File System -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# systemd and Storage Daemons for the Root File System - -a.k.a. _Pax Cellae pro Radix Arbor_ - -(or something like that, my Latin is a bit rusty) - -A number of complex storage technologies on Linux (e.g. RAID, volume -management, networked storage) require user space services to run while the -storage is active and mountable. This requirement becomes tricky as soon as the -root file system of the Linux operating system is stored on such storage -technology. Previously no clear path to make this work was available. This text -tries to clear up the resulting confusion, and to spell out what is now -supported and what is not. - -## A Bit of Background - -When complex storage technologies are used as backing for the root file system -this needs to be set up by the initrd, i.e. on Fedora by Dracut. In newer -systemd versions tear-down of the root file system backing is also done by the -initrd: after terminating all remaining running processes and unmounting all -file systems it can (which means excluding the root file system) systemd will -jump back into the initrd code allowing it to unmount the final file systems -(and their storage backing) that could not be unmounted as long as the OS was -still running from the main root file system. The job of the initrd is to -detach/unmount the root file system, i.e. inverting the exact commands it used -to set it up in the first place. This is not only cleaner, but also allows -for the first time arbitrarily complex stacks of storage technology. 
- -Previous attempts to handle root file system setups with complex storage as -backing usually tried to maintain the root storage with program code stored on -the root storage itself, thus creating a number of dependency loops. Safely -detaching such a root file system becomes messy, since the program code on the -storage needs to stay around longer than the storage, which is technically -contradictory. - -## What's new? - -As a result, we hereby clarify that we do not support storage technology setups -where the storage daemons are being run from the storage they maintain -themselves. In other words: a storage daemon backing the root file system cannot -be stored on the root file system itself. - -What we do support instead is that these storage daemons are started from the -initrd, stay running all the time during normal operation and are terminated -only after we have returned control to the initrd, and then by the initrd. As such, -storage daemons involved with maintaining the root file system storage -conceptually are more like kernel threads than like normal system services: -from the perspective of the init system (i.e. systemd), these services have been -started before systemd was initialized and stay around until after systemd is -already gone. These daemons can only be updated by updating the initrd and -rebooting; a takeover from initrd-supplied services to replacements from the -root file system is not supported. - -## What does this mean? - -Near the end of system shutdown, systemd executes a small tool called -systemd-shutdown, replacing its own process. This tool (which runs as PID 1, as -it entirely replaces the systemd init process) then iterates through the -mounted file systems and running processes (as well as a couple of other -resources) and tries to unmount/read-only mount/detach/kill them. It continues -to do this in a tight loop as long as this results in any effect.
From this -killing spree a couple of processes are automatically excluded: PID 1 itself of -course, as well as all kernel threads. After the killing/unmounting spree -control is passed back to the initrd, whose job is then to unmount/detach -whatever might be remaining. - -The same killing spree logic (but not the unmount/detach/read-only logic) is -applied during the transition from the initrd to the main system (i.e. the -"`switch_root`" operation), so that no processes from the initrd survive to the -main system. - -To implement the supported logic proposed above (i.e. where storage daemons -needed for the root file system which are started by the initrd stay around -during normal operation and are only killed after control is passed back to the -initrd), we need to exclude these daemons from the shutdown/switch_root killing -spree. To accomplish this, the following logic is available starting with -systemd 38: - -Processes (run by the root user) whose first character of the zeroth command -line argument is `@` are excluded from the killing spree, much the same way as -kernel threads are excluded too. Thus, a daemon which wants to take advantage -of this logic needs to place the following at the top of its `main()` function: - -```c -... -argv[0][0] = '@'; -... -``` - -And that's already it. Note that this functionality is only to be used by -programs running from the initrd, and **not** for programs running from the -root file system itself. Programs which use this functionality and are running -from the root file system are considered buggy since they effectively prohibit -clean unmounting/detaching of the root file system and its backing storage. - -_Again: if your code is being run from the root file system, then this logic -suggested above is **NOT** for you. Sorry. 
Talk to us, we can probably help you -to find a different solution to your problem._ - -The recommended way to distinguish between run-from-initrd and run-from-rootfs -for a daemon is to check for `/etc/initrd-release` (which exists on all modern -initrd implementations, see the [initrd Interface](INITRD_INTERFACE) for -details) which when exists results in `argv[0][0]` being set to `@`, and -otherwise doesn't. Something like this: - -```c -#include <unistd.h> - -int main(int argc, char *argv[]) { - ... - if (access("/etc/initrd-release", F_OK) >= 0) - argv[0][0] = '@'; - ... - } -``` - -Why `@`? Why `argv[0][0]`? First of all, a technique like this is not without -precedent: traditionally Unix login shells set `argv[0][0]` to `-` to clarify -they are login shells. This logic is also very easy to implement. We have been -looking for other ways to mark processes for exclusion from the killing spree, -but could not find any that was equally simple to implement and quick to read -when traversing through `/proc/`. Also, as a side effect replacing the first -character of `argv[0]` with `@` also visually invalidates the path normally -stored in `argv[0]` (which usually starts with `/`) thus helping the -administrator to understand that your daemon is actually not originating from -the actual root file system, but from a path in a completely different -namespace (i.e. the initrd namespace). Other than that we just think that `@` -is a cool character which looks pretty in the ps output... 😎 - -Note that your code should only modify `argv[0][0]` and leave the comm name -(i.e. `/proc/self/comm`) of your process untouched. - -Since systemd v255, alternatively the `SurviveFinalKillSignal=yes` unit option -can be set, and provides the equivalent functionality to modifying `argv[0][0]`. - -## To which technologies does this apply? - -These recommendations apply to those storage daemons which need to stay around -until after the storage they maintain is unmounted. 
If your storage daemon is -fine with being shut down before its storage device is unmounted, you may ignore -the recommendations above. - -This all applies to storage technology only, not to daemons with any other -(non-storage related) purposes. - -## What else to keep in mind? - -If your daemon implements the logic pointed out above, it should work nicely -from initrd environments. In many cases it might be necessary to additionally -support storage daemons to be started from within the actual OS, for example -when complex storage setups are used for auxiliary file systems, i.e. not the -root file system, or created by the administrator during runtime. Here are a -few additional notes for supporting these setups: - -* If your storage daemon is run from the main OS (i.e. not the initrd) it will - also be terminated when the OS shuts down (i.e. before we pass control back - to the initrd). Your daemon needs to handle this properly. - -* It is not acceptable to spawn off background processes transparently from - user commands or udev rules. Whenever a process is forked off on Unix it - inherits a multitude of process attributes (ranging from the obvious to the - not-so-obvious such as security contexts or audit trails) from its parent - process. It is practically impossible to fully detach a service from the - process context of the spawning process. In particular, systemd tracks which - processes belong to a service or login sessions very closely, and by spawning - off your storage daemon from udev or an administrator command you thus make - it part of its service/login. Effectively this means that whenever udev is - shut down, your storage daemon is killed too, resp. whenever the login - session goes away your storage might be terminated as well. (Also note that - recent udev versions will automatically kill all long running background - processes forked off udev rules now.) 
So, in summary: double-forking off - processes from user commands or udev rules is **NOT** OK! - -* To automatically spawn storage daemons from udev rules or administrator - commands, the recommended technology is socket-based activation as - implemented by systemd. Transparently for your client code, connecting to the - socket of your storage daemon will result in the storage daemon being started. For - that it is simply necessary to inform systemd about the socket you'd like it - to listen on on behalf of your daemon and minimally modify the daemon to - receive the listening socket for its services from systemd instead of - creating it on its own. Such modifications can be minimal, and are easily - written in a way that does not negatively impact usability on non-systemd - systems. For more information on making use of socket activation in your - program consult this blog story: [Socket - Activation](https://0pointer.de/blog/projects/socket-activation.html) - -* Consider having a look at the [initrd Interface of systemd](INITRD_INTERFACE). diff --git a/docs/_interfaces/TEMPORARY_DIRECTORIES.md b/docs/_interfaces/TEMPORARY_DIRECTORIES.md deleted file mode 100644 index bc9cb7bc45..0000000000 --- a/docs/_interfaces/TEMPORARY_DIRECTORIES.md +++ /dev/null @@ -1,220 +0,0 @@ ---- -title: Using /tmp/ and /var/tmp/ Safely -category: Interfaces -layout: default -SPDX-License-Identifier: LGPL-2.1-or-later ---- - -# Using `/tmp/` and `/var/tmp/` Safely - -`/tmp/` and `/var/tmp/` are two world-writable directories Linux systems -provide for temporary files. The former is typically on `tmpfs` and thus -backed by RAM/swap, and flushed out on each reboot. The latter is typically a -proper, persistent file system, and thus backed by physical storage. This -means: - -1. `/tmp/` should be used for smaller, size-bounded files only; `/var/tmp/` - should be used for everything else. - -2. Data that shall survive a boot cycle shouldn't be placed in `/tmp/`. 
- -If the `$TMPDIR` environment variable is set, use that path, and use neither -`/tmp/` nor `/var/tmp/` directly. - -See -[file-hierarchy(7)](https://www.freedesktop.org/software/systemd/man/file-hierarchy.html) -for details about these two (and most other) directories of a Linux system. - -## Common Namespace - -Note that `/tmp/` and `/var/tmp/` each define a common namespace shared by all -local software. This means guessable file or directory names below either -directory directly translate into a 🚨 Denial-of-Service (DoS) 🚨 vulnerability -or worse: if some software creates a file or directory `/tmp/foo` then any -other software that wants to create the same file or directory `/tmp/foo` -either will fail (as the file already exists) or might be tricked into using -untrusted files. Hence: do not use guessable names in `/tmp/` or `/var/tmp/`: -if you do, you open yourself up to a local DoS exploit or worse. (You can get -away with using guessable names, if you pre-create subdirectories below `/tmp/` -for them, like X11 does with `/tmp/.X11-unix/` through `tmpfiles.d/` -drop-ins. However this is not recommended, as it is fully safe only if these -directories are pre-created during early boot, and thus problematic if package -installation during runtime is permitted.) - -To protect yourself against these kinds of attacks Linux provides a couple of -APIs that help you avoid guessable names. Specifically: - -1. Use [`mkstemp()`](https://man7.org/linux/man-pages/man3/mkstemp.3.html) - (POSIX), `mkostemp()` (glibc), - [`mkdtemp()`](https://man7.org/linux/man-pages/man3/mkdtemp.3.html) (POSIX), - [`tmpfile()`](https://man7.org/linux/man-pages/man3/tmpfile.3.html) (C89) - -2. Use [`open()`](https://man7.org/linux/man-pages/man2/open.2.html) with - `O_TMPFILE` (Linux) - -3. 
[`memfd_create()`](https://man7.org/linux/man-pages/man2/memfd_create.2.html) - (Linux; this doesn't bother with `/tmp/` or `/var/tmp/` at all, but uses the - same RAM/swap backing as `tmpfs` uses, hence is very similar to `/tmp/` - semantics.) - -For system services systemd provides the `PrivateTmp=` boolean setting. If -turned on for a service (👍 which is highly recommended), `/tmp/` and -`/var/tmp/` are replaced by private sub-directories, implemented through Linux -file system namespacing and bind mounts. This means from the service's point of -view `/tmp/` and `/var/tmp/` look and behave like they normally do, but in -reality they are private sub-directories of the host's real `/tmp/` and -`/var/tmp/`, and thus not system-wide locations anymore, but service-specific -ones. This reduces the surface for local DoS attacks substantially. While it is -recommended to turn this option on, it's highly recommended for applications -not to rely on this solely to avoid DoS vulnerabilities, because this option is -not available in environments where file system namespaces are prohibited, for -example in certain container environments. This option is hence an extra line -of defense, but should not be used as an excuse to rely on guessable names in -`/tmp/` and `/var/tmp/`. When this option is used, the per-service temporary -directories are removed whenever the service shuts down, hence the lifecycle of -temporary files stored in it is substantially different from the case where -this option is not used. Also note that some applications use `/tmp/` and -`/var/tmp/` for sharing files and directories. If this option is turned on this -is not possible anymore as after all each service gets its own instances of -both directories. - -## Automatic Clean-Up - -By default, `systemd-tmpfiles` will apply a concept of ⚠️ "ageing" to all files -and directories stored in `/tmp/` and `/var/tmp/`. 
This means that files that -have neither been changed nor read within a specific time frame are -automatically removed in regular intervals. (This concept is not new to -`systemd-tmpfiles`, it's inherited from previous subsystems such as -`tmpwatch`.) By default files in `/tmp/` are cleaned up after 10 days, and -those in `/var/tmp` after 30 days. - -This automatic clean-up is important to ensure disk usage of these temporary -directories doesn't grow without bounds, even when programs abort unexpectedly -or otherwise don't clean up the temporary files/directories they create. On the -other hand it creates problems for long-running software that does not expect -temporary files it operates on to be suddenly removed. There are a couple of -strategies to avoid these issues: - -1. Make sure to always keep a file descriptor to the temporary files you - operate on open, and only access the files through them. This way it doesn't - matter whether the files have been unlinked from the file system: as long as - you have the file descriptor open you can still access the file for both - reading and writing. When operating this way it is recommended to delete the - files right after creating them to ensure that on unexpected program - termination the files or directories are implicitly released by the kernel. - -2. 🥇 Use `memfd_create()` or `O_TMPFILE`. This is an extension of the - suggestion above: files created this way are never linked under a filename - in the file system. This means they are not subject to ageing (as they come - unlinked out of the box), and there's no time window where a directory entry - for the file exists in the file system, and thus behaviour is fully robust - towards unexpected program termination as there are never files on disk that - need to be explicitly deleted. - -3. 🥇 Take an exclusive or shared BSD file lock ([`flock()`]( - https://man7.org/linux/man-pages/man2/flock.2.html)) on files and directories - you don't want to be removed. 
This is particularly interesting when operating
-   on more than a single file, or on file nodes that are not plain regular files,
-   for example when extracting a tarball to a temporary directory. The ageing
-   algorithm will skip all directories (and everything below them) and files that
-   are locked through a BSD file lock. As BSD file locks are automatically released
-   when the file descriptor they are taken on is closed, and all file
-   descriptors opened by a process are implicitly closed when it exits, this is
-   a robust mechanism that ensures all temporary files are subject to ageing
-   when the program that owns them dies, but not while it is still running. Use
-   this when decompressing tarballs that contain files with old
-   modification/access times, as extracted files are otherwise immediately
-   candidates for deletion by the ageing algorithm. The
-   [`flock`](https://man7.org/linux/man-pages/man1/flock.1.html) tool of the
-   `util-linux` package makes this concept available to shell scripts.
-
-4. Keep the access time of all temporary files created current. At regular
-   intervals, use `utimensat()` or a related call to update the access time
-   ("atime") of all files that shall be kept around. Since the ageing algorithm
-   looks at the access time of files when deciding whether to delete them, it's
-   sufficient to update their access times at sufficiently frequent intervals to
-   ensure the files are not deleted. Since most applications (and tools such as
-   `ls`) primarily care about the modification time (rather than the access time)
-   using the access time for this purpose should be acceptable.
-
-5. Set the "sticky" bit on regular files. The ageing logic skips deletion of
-   all regular files that have the sticky bit (`chmod +t`) set. This is
-   honoured for regular files only however, and has no effect on directories as
-   the sticky bit has a different meaning for them.
-
-6.
Don't use `/tmp/` or `/var/tmp/`, but use your own sub-directory under
-   `/run/` or `$XDG_RUNTIME_DIR` (the former if privileged, the latter if
-   unprivileged), or `/var/lib/` and `~/.config/` (similar, but with
-   persistency and suitable for larger data). The two temporary directories
-   `/tmp/` and `/var/tmp/` come with the implicit clean-up semantics described
-   above. When this is not desired, it's possible to create private per-package
-   runtime or state directories, and place all temporary files there. However,
-   do note that this means opting out of any kind of automatic clean-up, and it
-   is hence particularly essential that the program cleans up generated files
-   in these directories when they are no longer needed, in particular when the
-   program dies unexpectedly. Note: this strategy is only really suitable for
-   packages that operate in a "system wide singleton" fashion with "long"
-   persistence of their data or state, i.e. as opposed to programs that run in
-   multiple parallel or short-living instances. This is because a private
-   directory under `/run/` (and the other mentioned directories) is itself a
-   system- and package-specific singleton with greater longevity.
-
-7. Exclude your temporary files from clean-ups via a `tmpfiles.d/` drop-in
-   (which includes drop-ins in the runtime-only directory
-   `/run/tmpfiles.d/`). The `x`/`X` line types may be used to exclude files
-   matching the specified globbing patterns from the ageing logic. If this is
-   used, automatic clean-up is not done for matching files and directories, and
-   much like with the previous option it's hence essential that the program
-   generating these temporary files carefully removes the temporary files it
-   creates again, and in particular so if it dies unexpectedly.
-
-🥇 The semantics of options 2 (in case you only deal with temporary files, not
-directories) and 3 (in case you deal with both) in the list above are in most
-cases the most preferable.
It is thus recommended to stick to these two
-options.
-
-While the ageing logic is very useful as a safety concept to ensure unused
-files and directories are eventually removed, a well-written program avoids even
-creating files that need such a clean-up. In particular:
-
-1. Use `memfd_create()` or `O_TMPFILE` when creating temporary files.
-
-2. `unlink()` temporary files right after creating them. This is very similar
-   to `O_TMPFILE` behaviour: consider deleting temporary files right after
-   creating them, while keeping open a file descriptor to them. Unlike
-   `O_TMPFILE` this method also works on older Linux systems and other OSes
-   that do not implement `O_TMPFILE`.
-
-## Disk Quota
-
-Generally, files allocated from `/tmp/` and `/var/tmp/` are allocated from a
-pool shared by all local users. Moreover the space available in `/tmp/` is
-generally more restricted than `/var/tmp/`. This means that, in particular in
-`/tmp/`, space should be considered scarce, and programs need to be prepared
-for no space being available. Essential programs might hence require fallback
-logic that uses a different location for storing temporary files. Non-essential
-programs at least need to be prepared for `ENOSPC` errors and generate useful,
-actionable error messages.
-
-Some setups employ per-user quota on `/var/tmp/` and possibly `/tmp/`, to make
-`ENOSPC` situations less likely, and harder for unprivileged users to trigger.
-However, in the general case no such per-user quota is implemented,
-in particular not when `tmpfs` is used as backing file system, because,
-even today, `tmpfs` still provides no native quota support in the kernel.
-
-## Early Boot Considerations
-
-Both `/tmp/` and `/var/tmp/` are not necessarily available during early boot,
-or, if they are available early, are not writable. This means software that
-is intended to run during early boot (i.e.
before `basic.target`, or more
-specifically `local-fs.target`, is up) should not attempt to make use of
-either. Interfaces such as `memfd_create()` or files below a package-specific
-directory in `/run/` are much better options in this case. (Note that some
-packages instead use `/dev/shm/` for temporary files during early boot; this is
-not advisable however, as it offers no benefits over a private directory in
-`/run/` as both are backed by the same concept: `tmpfs`. The directory
-`/dev/shm/` exists to back POSIX shared memory (see
-[`shm_open()`](https://man7.org/linux/man-pages/man3/shm_open.3.html) and
-related calls), and not as a place for temporary files. `/dev/shm/` is
-problematic as it is world-writable and there's no automatic clean-up logic in
-place.)
diff --git a/docs/_interfaces/TRANSIENT-SETTINGS.md b/docs/_interfaces/TRANSIENT-SETTINGS.md
deleted file mode 100644
index 15f1cbc47c..0000000000
--- a/docs/_interfaces/TRANSIENT-SETTINGS.md
+++ /dev/null
@@ -1,511 +0,0 @@
----
-title: What Settings Are Currently Available For Transient Units?
-category: Interfaces
-layout: default
-SPDX-License-Identifier: LGPL-2.1-or-later
----
-
-# What Settings Are Currently Available For Transient Units?
-
-Our intention is to make all settings that are available as unit file settings
-also available for transient units, through the D-Bus API. At the moment,
-device, swap, and target units are not supported at all as transient units, but
-others are pretty well supported.
-
-The lists below contain all settings currently available in unit files. The
-ones currently available in transient units are prefixed with `✓`.
-
-## Generic Unit Settings
-
-Most generic unit settings are available for transient units.
-
-```
-✓ Description=
-✓ Documentation=
-✓ SourcePath=
-✓ Requires=
-✓ Requisite=
-✓ Wants=
-✓ BindsTo=
-✓ Conflicts=
-✓ Before=
-✓ After=
-✓ OnFailure=
-✓ PropagatesReloadTo=
-✓ ReloadPropagatedFrom=
-✓ PartOf=
-✓ Upholds=
-✓ JoinsNamespaceOf=
-✓ RequiresMountsFor=
-✓ StopWhenUnneeded=
-✓ RefuseManualStart=
-✓ RefuseManualStop=
-✓ AllowIsolate=
-✓ DefaultDependencies=
-✓ OnFailureJobMode=
-✓ IgnoreOnIsolate=
-✓ JobTimeoutSec=
-✓ JobRunningTimeoutSec=
-✓ JobTimeoutAction=
-✓ JobTimeoutRebootArgument=
-✓ StartLimitIntervalSec=
-✓ StartLimitBurst=
-✓ StartLimitAction=
-✓ FailureAction=
-✓ SuccessAction=
-✓ FailureActionExitStatus=
-✓ SuccessActionExitStatus=
-✓ RebootArgument=
-✓ ConditionPathExists=
-✓ ConditionPathExistsGlob=
-✓ ConditionPathIsDirectory=
-✓ ConditionPathIsSymbolicLink=
-✓ ConditionPathIsMountPoint=
-✓ ConditionPathIsReadWrite=
-✓ ConditionDirectoryNotEmpty=
-✓ ConditionFileNotEmpty=
-✓ ConditionFileIsExecutable=
-✓ ConditionNeedsUpdate=
-✓ ConditionFirstBoot=
-✓ ConditionKernelCommandLine=
-✓ ConditionKernelVersion=
-✓ ConditionArchitecture=
-✓ ConditionFirmware=
-✓ ConditionVirtualization=
-✓ ConditionSecurity=
-✓ ConditionCapability=
-✓ ConditionHost=
-✓ ConditionACPower=
-✓ ConditionUser=
-✓ ConditionGroup=
-✓ ConditionControlGroupController=
-✓ AssertPathExists=
-✓ AssertPathExistsGlob=
-✓ AssertPathIsDirectory=
-✓ AssertPathIsSymbolicLink=
-✓ AssertPathIsMountPoint=
-✓ AssertPathIsReadWrite=
-✓ AssertDirectoryNotEmpty=
-✓ AssertFileNotEmpty=
-✓ AssertFileIsExecutable=
-✓ AssertNeedsUpdate=
-✓ AssertFirstBoot=
-✓ AssertKernelCommandLine=
-✓ AssertKernelVersion=
-✓ AssertArchitecture=
-✓ AssertVirtualization=
-✓ AssertSecurity=
-✓ AssertCapability=
-✓ AssertHost=
-✓ AssertACPower=
-✓ AssertUser=
-✓ AssertGroup=
-✓ AssertControlGroupController=
-✓ CollectMode=
-```
-
-## Execution-Related Settings
-
-All execution-related settings are available for transient units.
-
-```
-✓ WorkingDirectory=
-✓ RootDirectory=
-✓ RootImage=
-✓ User=
-✓ Group=
-✓ SupplementaryGroups=
-✓ Nice=
-✓ OOMScoreAdjust=
-✓ CoredumpFilter=
-✓ IOSchedulingClass=
-✓ IOSchedulingPriority=
-✓ CPUSchedulingPolicy=
-✓ CPUSchedulingPriority=
-✓ CPUSchedulingResetOnFork=
-✓ CPUAffinity=
-✓ UMask=
-✓ Environment=
-✓ EnvironmentFile=
-✓ PassEnvironment=
-✓ UnsetEnvironment=
-✓ DynamicUser=
-✓ RemoveIPC=
-✓ StandardInput=
-✓ StandardOutput=
-✓ StandardError=
-✓ StandardInputText=
-✓ StandardInputData=
-✓ TTYPath=
-✓ TTYReset=
-✓ TTYVHangup=
-✓ TTYVTDisallocate=
-✓ TTYRows=
-✓ TTYColumns=
-✓ SyslogIdentifier=
-✓ SyslogFacility=
-✓ SyslogLevel=
-✓ SyslogLevelPrefix=
-✓ LogLevelMax=
-✓ LogExtraFields=
-✓ LogFilterPatterns=
-✓ LogRateLimitIntervalSec=
-✓ LogRateLimitBurst=
-✓ SecureBits=
-✓ CapabilityBoundingSet=
-✓ AmbientCapabilities=
-✓ TimerSlackNSec=
-✓ NoNewPrivileges=
-✓ KeyringMode=
-✓ ProtectProc=
-✓ ProcSubset=
-✓ SystemCallFilter=
-✓ SystemCallArchitectures=
-✓ SystemCallErrorNumber=
-✓ SystemCallLog=
-✓ MemoryDenyWriteExecute=
-✓ RestrictNamespaces=
-✓ RestrictRealtime=
-✓ RestrictSUIDSGID=
-✓ RestrictAddressFamilies=
-✓ RootHash=
-✓ RootHashSignature=
-✓ RootVerity=
-✓ LockPersonality=
-✓ LimitCPU=
-✓ LimitFSIZE=
-✓ LimitDATA=
-✓ LimitSTACK=
-✓ LimitCORE=
-✓ LimitRSS=
-✓ LimitNOFILE=
-✓ LimitAS=
-✓ LimitNPROC=
-✓ LimitMEMLOCK=
-✓ LimitLOCKS=
-✓ LimitSIGPENDING=
-✓ LimitMSGQUEUE=
-✓ LimitNICE=
-✓ LimitRTPRIO=
-✓ LimitRTTIME=
-✓ ReadWritePaths=
-✓ ReadOnlyPaths=
-✓ InaccessiblePaths=
-✓ BindPaths=
-✓ BindReadOnlyPaths=
-✓ TemporaryFileSystem=
-✓ PrivateTmp=
-✓ PrivateDevices=
-✓ PrivateMounts=
-✓ ProtectKernelTunables=
-✓ ProtectKernelModules=
-✓ ProtectKernelLogs=
-✓ ProtectControlGroups=
-✓ PrivateNetwork=
-✓ PrivateUsers=
-✓ ProtectSystem=
-✓ ProtectHome=
-✓ ProtectClock=
-✓ MountFlags=
-✓ MountAPIVFS=
-✓ Personality=
-✓ RuntimeDirectoryPreserve=
-✓ RuntimeDirectoryMode=
-✓ RuntimeDirectory=
-✓ StateDirectoryMode=
-✓ StateDirectory=
-✓
CacheDirectoryMode=
-✓ CacheDirectory=
-✓ LogsDirectoryMode=
-✓ LogsDirectory=
-✓ ConfigurationDirectoryMode=
-✓ ConfigurationDirectory=
-✓ PAMName=
-✓ IgnoreSIGPIPE=
-✓ UtmpIdentifier=
-✓ UtmpMode=
-✓ SELinuxContext=
-✓ SmackProcessLabel=
-✓ AppArmorProfile=
-✓ Slice=
-```
-
-## Resource Control Settings
-
-All cgroup/resource control settings are available for transient units:
-
-```
-✓ CPUAccounting=
-✓ CPUWeight=
-✓ StartupCPUWeight=
-✓ CPUShares=
-✓ StartupCPUShares=
-✓ CPUQuota=
-✓ CPUQuotaPeriodSec=
-✓ AllowedCPUs=
-✓ StartupAllowedCPUs=
-✓ AllowedMemoryNodes=
-✓ StartupAllowedMemoryNodes=
-✓ MemoryAccounting=
-✓ DefaultMemoryMin=
-✓ MemoryMin=
-✓ DefaultMemoryLow=
-✓ MemoryLow=
-✓ MemoryHigh=
-✓ MemoryMax=
-✓ MemorySwapMax=
-✓ MemoryLimit=
-✓ DeviceAllow=
-✓ DevicePolicy=
-✓ IOAccounting=
-✓ IOWeight=
-✓ StartupIOWeight=
-✓ IODeviceWeight=
-✓ IOReadBandwidthMax=
-✓ IOWriteBandwidthMax=
-✓ IOReadIOPSMax=
-✓ IOWriteIOPSMax=
-✓ BlockIOAccounting=
-✓ BlockIOWeight=
-✓ StartupBlockIOWeight=
-✓ BlockIODeviceWeight=
-✓ BlockIOReadBandwidth=
-✓ BlockIOWriteBandwidth=
-✓ TasksAccounting=
-✓ TasksMax=
-✓ Delegate=
-✓ DisableControllers=
-✓ IPAccounting=
-✓ IPAddressAllow=
-✓ IPAddressDeny=
-✓ ManagedOOMSwap=
-✓ ManagedOOMMemoryPressure=
-✓ ManagedOOMMemoryPressureLimit=
-✓ ManagedOOMPreference=
-✓ CoredumpReceive=
-```
-
-## Process Killing Settings
-
-All process killing settings are available for transient units:
-
-```
-✓ SendSIGKILL=
-✓ SendSIGHUP=
-✓ KillMode=
-✓ KillSignal=
-✓ RestartKillSignal=
-✓ FinalKillSignal=
-✓ WatchdogSignal=
-```
-
-## Service Unit Settings
-
-Most service unit settings are available for transient units.
-
-```
-✓ BusName=
-✓ ExecCondition=
-✓ ExecReload=
-✓ ExecStart=
-✓ ExecStartPost=
-✓ ExecStartPre=
-✓ ExecStop=
-✓ ExecStopPost=
-✓ ExitType=
-✓ FileDescriptorStoreMax=
-✓ GuessMainPID=
-✓ NonBlocking=
-✓ NotifyAccess=
-✓ OOMPolicy=
-✓ PIDFile=
-✓ RemainAfterExit=
-✓ Restart=
-✓ RestartForceExitStatus=
-✓ RestartPreventExitStatus=
-✓ RestartSec=
-✓ RootDirectoryStartOnly=
-✓ RuntimeMaxSec=
-✓ RuntimeRandomizedExtraSec=
-  Sockets=
-✓ SuccessExitStatus=
-✓ TimeoutAbortSec=
-✓ TimeoutSec=
-✓ TimeoutStartFailureMode=
-✓ TimeoutStartSec=
-✓ TimeoutStopFailureMode=
-✓ TimeoutStopSec=
-✓ Type=
-✓ USBFunctionDescriptors=
-✓ USBFunctionStrings=
-✓ WatchdogSec=
-```
-
-## Mount Unit Settings
-
-All mount unit settings are available to transient units:
-
-```
-✓ What=
-✓ Where=
-✓ Options=
-✓ Type=
-✓ TimeoutSec=
-✓ DirectoryMode=
-✓ SloppyOptions=
-✓ LazyUnmount=
-✓ ForceUnmount=
-✓ ReadWriteOnly=
-```
-
-## Automount Unit Settings
-
-All automount unit settings are available to transient units:
-
-```
-✓ Where=
-✓ DirectoryMode=
-✓ TimeoutIdleSec=
-```
-
-## Timer Unit Settings
-
-Most timer unit settings are available to transient units.
-
-```
-✓ OnActiveSec=
-✓ OnBootSec=
-✓ OnCalendar=
-✓ OnClockChange=
-✓ OnStartupSec=
-✓ OnTimezoneChange=
-✓ OnUnitActiveSec=
-✓ OnUnitInactiveSec=
-✓ Persistent=
-✓ WakeSystem=
-✓ RemainAfterElapse=
-✓ AccuracySec=
-✓ RandomizedDelaySec=
-✓ FixedRandomDelay=
-  Unit=
-```
-
-## Slice Unit Settings
-
-Slice units are fully supported as transient units, but they have no settings
-of their own beyond the generic unit and resource control settings.
-
-## Scope Unit Settings
-
-Scope units are fully supported as transient units (in fact they only exist as
-such).
-
-```
-✓ RuntimeMaxSec=
-✓ RuntimeRandomizedExtraSec=
-✓ TimeoutStopSec=
-```
-
-## Socket Unit Settings
-
-Most socket unit settings are available to transient units.
-
-```
-✓ ListenStream=
-✓ ListenDatagram=
-✓ ListenSequentialPacket=
-✓ ListenFIFO=
-✓ ListenNetlink=
-✓ ListenSpecial=
-✓ ListenMessageQueue=
-✓ ListenUSBFunction=
-✓ SocketProtocol=
-✓ BindIPv6Only=
-✓ Backlog=
-✓ BindToDevice=
-✓ ExecStartPre=
-✓ ExecStartPost=
-✓ ExecStopPre=
-✓ ExecStopPost=
-✓ TimeoutSec=
-✓ SocketUser=
-✓ SocketGroup=
-✓ SocketMode=
-✓ DirectoryMode=
-✓ Accept=
-✓ FlushPending=
-✓ Writable=
-✓ MaxConnections=
-✓ MaxConnectionsPerSource=
-✓ KeepAlive=
-✓ KeepAliveTimeSec=
-✓ KeepAliveIntervalSec=
-✓ KeepAliveProbes=
-✓ DeferAcceptSec=
-✓ NoDelay=
-✓ Priority=
-✓ ReceiveBuffer=
-✓ SendBuffer=
-✓ IPTOS=
-✓ IPTTL=
-✓ Mark=
-✓ PipeSize=
-✓ FreeBind=
-✓ Transparent=
-✓ Broadcast=
-✓ PassCredentials=
-✓ PassSecurity=
-✓ PassPacketInfo=
-✓ TCPCongestion=
-✓ ReusePort=
-✓ MessageQueueMaxMessages=
-✓ MessageQueueMessageSize=
-✓ RemoveOnStop=
-✓ Symlinks=
-✓ FileDescriptorName=
-  Service=
-✓ TriggerLimitIntervalSec=
-✓ TriggerLimitBurst=
-✓ SmackLabel=
-✓ SmackLabelIPIn=
-✓ SmackLabelIPOut=
-✓ SELinuxContextFromNet=
-```
-
-## Swap Unit Settings
-
-Swap units are currently not available at all as transient units:
-
-```
-  What=
-  Priority=
-  Options=
-  TimeoutSec=
-```
-
-## Path Unit Settings
-
-Most path unit settings are available to transient units.
-
-```
-✓ PathExists=
-✓ PathExistsGlob=
-✓ PathChanged=
-✓ PathModified=
-✓ DirectoryNotEmpty=
-  Unit=
-✓ MakeDirectory=
-✓ DirectoryMode=
-```
-
-## Install Section
-
-The `[Install]` section is currently not available at all for transient units, and it probably doesn't even make sense.
-
-```
-  Alias=
-  WantedBy=
-  RequiredBy=
-  Also=
-  DefaultInstance=
-```
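As an illustration of how the `✓` settings listed in the removed `TRANSIENT-SETTINGS.md` are typically passed to a transient unit, here is a minimal sketch that only assembles a `systemd-run` command line and does not execute it. It assumes the `systemd-run` options `--unit=`, `--user` and `--property=`; the helper name `systemd_run_argv` and the unit name `demo-sleep` are hypothetical and not part of the deleted document.

```python
import shlex


def systemd_run_argv(command, unit=None, properties=None, user=False):
    """Build (but do not execute) a systemd-run command line.

    Hypothetical helper for illustration: each entry in `properties`
    becomes one --property=NAME=VALUE argument.
    """
    argv = ["systemd-run"]
    if user:
        argv.append("--user")
    if unit is not None:
        argv.append(f"--unit={unit}")
    for name, value in (properties or {}).items():
        argv.append(f"--property={name}={value}")
    # "--" ends option parsing so the command's own flags are left alone.
    argv.append("--")
    argv.extend(command)
    return argv


# All three properties below are marked as available in the lists above.
argv = systemd_run_argv(
    ["/usr/bin/sleep", "30"],
    unit="demo-sleep",
    properties={"MemoryMax": "64M", "PrivateTmp": "yes", "RuntimeMaxSec": "60"},
)
print(shlex.join(argv))
```

Building the argument list separately from running it keeps the sketch testable on machines without systemd; passing the list to `subprocess.run()` would be the obvious next step on a systemd host.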