author Lennart Poettering <lennart@poettering.net> 2023-09-18 13:33:06 +0200
committer Lennart Poettering <lennart@poettering.net> 2023-09-18 14:47:07 +0200
commit 0959847af5a4ee5282eb95d500899e00e33677d6
tree 34f32a5e17b0916a5180aaa7f1b77b18099c70fc /docs/FILE_DESCRIPTOR_STORE.md
parent meson: restore tools/meson-vcs-tag.sh
doc: add a markdown doc giving an overview over the fdstore
And link it up everywhere.
Diffstat (limited to 'docs/FILE_DESCRIPTOR_STORE.md')
-rw-r--r-- docs/FILE_DESCRIPTOR_STORE.md 193
1 files changed, 193 insertions, 0 deletions
diff --git a/docs/FILE_DESCRIPTOR_STORE.md b/docs/FILE_DESCRIPTOR_STORE.md
new file mode 100644
index 0000000000..bc4f3c82f4
--- /dev/null
+++ b/docs/FILE_DESCRIPTOR_STORE.md
@@ -0,0 +1,193 @@
+---
+title: The File Descriptor Store
+category: Interfaces
+layout: default
+SPDX-License-Identifier: LGPL-2.1-or-later
+---
+
+# The File Descriptor Store
+
+*TL;DR: The systemd service manager may optionally maintain a set of file
+descriptors for each service, which are under the control of the service and
+make it easier to implement service restarts without losing connectivity or
+context.*
+
+Since its inception `systemd` has supported the *socket* *activation*
+mechanism: the service manager creates and listens on some sockets (and similar
+UNIX file descriptors) on behalf of a service, and then passes them to the
+service during activation of the service via UNIX file descriptor (short: *fd*)
+passing over `execve()`. This is primarily exposed in the
+[.socket](https://www.freedesktop.org/software/systemd/man/systemd.socket.html)
+unit type.
+
+The *file* *descriptor* *store* (short: *fdstore*) extends this concept, and
+allows services to *upload* during runtime additional fds to the service
+manager that it shall keep on its behalf. File descriptors are passed back to
+the service on subsequent activations, the same way as any socket activation
+fds are passed.
+
+When a service passes an fd to the fdstore logic of the service manager, the
+manager only keeps a duplicate of it (in the sense of UNIX
+[`dup(2)`](https://man7.org/linux/man-pages/man2/dup.2.html)). The fd also
+remains in the possession of the service itself, which may (and is expected to)
+invoke any operations on it that it likes.
+
+The primary use case of this logic is to permit services to restart seamlessly
+(for example to update them to a newer version), without losing execution
+context, dropping pinned resources, terminating established connections or even
+just momentarily losing connectivity. In fact, as the file descriptors can be
+uploaded freely at any time during the service runtime, this can even be used to
+implement services that robustly handle abnormal termination and can recover
+from that without losing pinned resources.
+
+Note that Linux supports the
+[`memfd`](https://man7.org/linux/man-pages/man2/memfd_create.2.html) concept
+that allows associating a memory-backed fd with arbitrary data. A service may
+conveniently serialize its state into such a memfd and place it in the
+fdstore, in order to implement service restarts with full service state being
+passed over.
+
+# Basic Mechanism
+
+The fdstore is enabled per-service via the
+[`FileDescriptorStoreMax=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#FileDescriptorStoreMax=)
+service setting. It defaults to zero (which means the fdstore logic is turned
+off), but can take an unsigned integer value that controls how many fds to
+permit the service to upload to the service manager to keep simultaneously.
+
+If set to values > 0, the fdstore is enabled. When invoked the service may now
+(asynchronously) upload file descriptors to the fdstore via the
+[`sd_pid_notify_with_fds()`](https://www.freedesktop.org/software/systemd/man/sd_pid_notify_with_fds.html)
+API call (or an equivalent reimplementation). When uploading the fds it is
+necessary to set the `FDSTORE=1` field in the message, to indicate what the fd
+is intended for. It's recommended to also set the `FDNAME=…` field to any
+string of choice, which may be used to identify the fd later.
+
+Whenever the service is restarted the fds in its fdstore will be passed to the
+new instance following the same protocol as for socket activation fds, i.e. the
+`$LISTEN_FDS`, `$LISTEN_PID`, `$LISTEN_FDNAMES` environment variables will be
+set (the last of which will be populated from the `FDNAME=…` field mentioned
+above). See
+[`sd_listen_fds()`](https://www.freedesktop.org/software/systemd/man/sd_listen_fds.html)
+for details on receiving such fds in a service. (Note that the name set in
+`FDNAME=…` does not need to be unique, which is useful when operating with
+multiple fully equivalent sockets or similar, for example for a service that
+operates on both IPv4 and IPv6 and treats both more or less the same.)
+
+And that's already the gist of it.
+
+# Seamless Service Restarts
+
+A system service that provides a client-facing interface that shall be able to
+seamlessly restart can make use of this in a scheme like the following:
+whenever a new connection comes in it uploads its fd immediately into its
+fdstore. At appropriate times it also serializes its state into a memfd it
+uploads to the service manager — either whenever the state changed
+sufficiently, or simply right before it terminates. (The latter of course means
+that state only survives on *clean* restarts and abnormal termination implies the
+state is lost completely — while the former would mean there's a good chance the
+next restart after an abnormal termination could continue where it left off
+with only some context lost.)
+
+Using the fdstore for such seamless service restarts is generally recommended
+over implementations that attempt to leave a process from the old service
+instance around until after the new instance already started, so that the old
+then communicates with the new service instance, and passes the fds over
+directly. Typically service restarts are a mechanism for implementing *code*
+updates, hence leaving two versions of the service running at the same time is
+generally problematic. It also collides with the systemd service manager's
+general principle of guaranteeing a pristine execution environment, a pristine
+security context, and a pristine resource management context for freshly
+started services, without uncontrolled "left-overs" from previous runs. For
+example: leaving processes from previous runs generally negatively affects
+lifecycle management (i.e. `KillMode=none` must be set, which disables large
+parts of the service manager's state tracking), resource management (as
+resource counters cannot start at zero during service activation anymore, since
+the old processes remaining skew them), security policies (as processes
+carrying possibly out-of-date SELinux, AppArmor, other LSM, seccomp or BPF
+policies effectively remain), and similar.
+
+# File Descriptor Store Lifecycle
+
+By default any file descriptor stored in the fdstore for which a `POLLHUP` or
+`POLLERR` is seen is automatically closed and removed from the fdstore. This
+behaviour can be turned off, by setting the `FDPOLL=0` field when uploading the
+fd via `sd_pid_notify_with_fds()`.
+
+The fdstore is automatically closed whenever the service is fully deactivated
+and no jobs are queued for it anymore. This means that a restart job for a
+service will leave the fdstore intact, but a separate stop and start job for
+it — executed synchronously one after the other — will likely not.
+
+This behaviour can be modified via the
+[`FileDescriptorStorePreserve=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#FileDescriptorStorePreserve=)
+setting in service unit files. If set to `yes` the fdstore will be kept as long
+as the service definition is loaded into memory by the service manager, i.e. as
+long as at least one other loaded unit has a reference to it.
+
+The `systemctl clean --what=fdstore …` command may be used to explicitly clear
+the fdstore of a service. This is only allowed when the service is fully
+deactivated, and is hence primarily useful in case
+`FileDescriptorStorePreserve=yes` is set (because the fdstore is otherwise
+fully closed anyway in this state).
+
+Individual file descriptors may be removed from the fdstore via the
+`sd_notify()` mechanism, by sending an `FDSTOREREMOVE=1` message, accompanied
+by an `FDNAME=…` string identifying the fds to remove. (The name does not have
+to be unique, as mentioned, in which case *all* matching fds are
+closed). Generally it's a good idea to send such messages to the service
+manager during initialization of the service whenever an unrecognized fd is
+received, to make the service robust for code updates: if an old version
+uploaded an fd that the new version doesn't recognize anymore it's a good idea
+to close it both in the service and in the fdstore.
+
+Note that storing a duplicate of an fd in the fdstore means the fd remains
+pinned even if the service closes it. This in particular means that peers on a
+connection socket uploaded this way will not receive an automatic `POLLHUP`
+event anymore if the service code issues `close()` on the socket. The service
+must accompany the `close()` with an `FDSTOREREMOVE=1` notification to the
+service manager, so that the fd is comprehensively closed.
+
+# Access Control
+
+Access to the fds in the file descriptor store is generally restricted to the
+service code itself. Pushing fds into or removing fds from the fdstore is
+subject to the access control restrictions of any other `sd_notify()` message,
+which is controlled via
+[`NotifyAccess=`](https://www.freedesktop.org/software/systemd/man/systemd.service.html#NotifyAccess=).
+
+Hence, by default only the main service process can push or remove fds, but by
+setting `NotifyAccess=all` this may be relaxed to allow arbitrary service
+child processes to do the same.
+
+# Soft Reboot
+
+The fdstore is particularly interesting in [soft
+reboot](https://www.freedesktop.org/software/systemd/man/systemd-soft-reboot.service.html)
+scenarios, as per `systemctl soft-reboot` (which restarts userspace like in a
+real reboot, but leaves the kernel running). File descriptor stores that remain
+loaded at the very end of the system cycle, just before the soft reboot, are
+passed over to the next system cycle, and propagated there to the services they
+originate from. This enables updating the full userspace of a system during
+runtime, fully replacing all processes without losing pinned resources,
+interrupting connectivity or established connections and similar.
+
+This mechanism can be enabled either by making sure the service survives until
+the very end (i.e. by setting `DefaultDependencies=no` so that it keeps running
+for the whole system lifetime without being regularly deactivated at shutdown)
+or by setting `FileDescriptorStorePreserve=yes` (and referencing the unit
+continuously).
+
+# Debugging
+
+The
+[`systemd-analyze`](https://www.freedesktop.org/software/systemd/man/systemd-analyze.html#systemd-analyze%20fdstore%20%5BUNIT...%5D)
+tool may be used to list the current contents of the fdstore of any running
+service.
+
+The
+[`systemd-run`](https://www.freedesktop.org/software/systemd/man/systemd-run.html)
+tool may be used to quickly start a testing binary or similar as a service. Use
+`-p FileDescriptorStoreMax=4711` to enable the fdstore from `systemd-run`'s
+command line. By using the `-t` switch you can even communicate interactively
+with processes spawned that way, via the TTY.
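
The doc added by this patch says fds may be uploaded with `sd_pid_notify_with_fds()` "or an equivalent reimplementation". As a rough illustration of what such a reimplementation might look like (this is not part of the commit), here is a minimal Python sketch of the client side. It assumes the usual `sd_notify()` wire format: newline-separated `KEY=VALUE` fields sent as one datagram to the socket named in `$NOTIFY_SOCKET`, with fds attached as `SCM_RIGHTS` ancillary data. All function names below are hypothetical helpers, not systemd or Python APIs; `socket.send_fds()` needs Python 3.9+ and `os.memfd_create()` is Linux-only.

```python
import os
import socket

SD_LISTEN_FDS_START = 3  # first passed fd, as in socket activation


def notify(message, fds=()):
    """Send one sd_notify()-style datagram; returns False if no manager listens."""
    path = os.environ.get("NOTIFY_SOCKET")
    if not path:
        return False
    # A leading "@" denotes a Linux abstract namespace socket.
    addr = b"\0" + path[1:].encode() if path.startswith("@") else path
    data = message.encode()
    with socket.socket(socket.AF_UNIX, socket.SOCK_DGRAM) as sock:
        sock.connect(addr)
        if fds:
            # Attaches the fds as SCM_RIGHTS ancillary data.
            socket.send_fds(sock, [data], list(fds))
        else:
            sock.send(data)
    return True


def fdstore_push(fd, name):
    """Ask the service manager to keep a duplicate of fd under the given name."""
    return notify(f"FDSTORE=1\nFDNAME={name}", [fd])


def fdstore_remove(name):
    """Ask the service manager to drop all stored fds with the given name."""
    return notify(f"FDSTOREREMOVE=1\nFDNAME={name}")


def listen_fds_with_names():
    """Return (fd, name) pairs handed over on activation, or [] if none."""
    if os.environ.get("LISTEN_PID") != str(os.getpid()):
        return []  # the variables were meant for a different process
    count = int(os.environ.get("LISTEN_FDS", "0"))
    raw = os.environ.get("LISTEN_FDNAMES", "")
    names = raw.split(":") if raw else []
    return [
        (SD_LISTEN_FDS_START + i, names[i] if i < len(names) else "unknown")
        for i in range(count)
    ]


def state_to_memfd(state, name="state"):
    """Serialize bytes into a memfd, rewound so the next instance can read it."""
    fd = os.memfd_create(name)
    os.write(fd, state)
    os.lseek(fd, 0, os.SEEK_SET)
    return fd
```

A service following the scheme from "Seamless Service Restarts" might call `fdstore_push(conn.fileno(), "conn")` for each accepted connection and `fdstore_push(state_to_memfd(blob), "state")` before exiting. On startup it would walk `listen_fds_with_names()` and call `fdstore_remove(name)` for any name it does not recognize, closing its own copy of that fd as well. The unit still needs `FileDescriptorStoreMax=` set to a value above zero for any of this to have an effect.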