author | Luca Boccassi <bluca@debian.org> | 2023-11-07 22:34:49 +0100 |
---|---|---|
committer | GitHub <noreply@github.com> | 2023-11-07 22:34:49 +0100 |
commit | 00666ec71f03e50dbec399013d5cc4ffadf4eec0 (patch) | |
tree | b5b234bfdfcdaedd96d1d9a15849251bda67441f /man | |
parent | Merge pull request #29914 from yuwata/network-generator (diff) | |
parent | test-execute: add no_new_privs tests for SystemCallFilter (diff) | |
Merge pull request #6763 from kinvolk/iaguis/no-new-privs
core: allow using seccomp without no_new_privs when unprivileged
Diffstat (limited to 'man')
-rw-r--r-- | man/systemd.exec.xml | 113 |
1 files changed, 33 insertions, 80 deletions
diff --git a/man/systemd.exec.xml b/man/systemd.exec.xml index 8de96e1be8..d81154a339 100644 --- a/man/systemd.exec.xml +++ b/man/systemd.exec.xml @@ -823,21 +823,10 @@ CapabilityBoundingSet=~CAP_B CAP_C</programlisting> <listitem><para>Takes a boolean argument. If true, ensures that the service process and all its children can never gain new privileges through <function>execve()</function> (e.g. via setuid or setgid bits, or filesystem capabilities). This is the simplest and most effective way to ensure that - a process and its children can never elevate privileges again. Defaults to false, but certain - settings override this and ignore the value of this setting. This is the case when - <varname>DynamicUser=</varname>, <varname>LockPersonality=</varname>, - <varname>MemoryDenyWriteExecute=</varname>, <varname>PrivateDevices=</varname>, - <varname>ProtectClock=</varname>, <varname>ProtectHostname=</varname>, - <varname>ProtectKernelLogs=</varname>, <varname>ProtectKernelModules=</varname>, - <varname>ProtectKernelTunables=</varname>, <varname>RestrictAddressFamilies=</varname>, - <varname>RestrictNamespaces=</varname>, <varname>RestrictRealtime=</varname>, - <varname>RestrictSUIDSGID=</varname>, <varname>SystemCallArchitectures=</varname>, - <varname>SystemCallFilter=</varname>, or <varname>SystemCallLog=</varname> are specified. Note that - even if this setting is overridden by them, <command>systemctl show</command> shows the original - value of this setting. In case the service will be run in a new mount namespace anyway and SELinux is - disabled, all file systems are mounted with <constant>MS_NOSUID</constant> flag. Also see - the kernel document - <ulink url="https://docs.kernel.org/userspace-api/no_new_privs.html">No New Privileges Flag</ulink>. + a process and its children can never elevate privileges again. Defaults to false. 
In case the service + will be run in a new mount namespace anyway and SELinux is disabled, all file systems are mounted with + <constant>MS_NOSUID</constant> flag. Also see <ulink + url="https://docs.kernel.org/userspace-api/no_new_privs.html">No New Privileges Flag</ulink>. </para> <para>Note that this setting only has an effect on the unit's processes themselves (or any processes @@ -1779,9 +1768,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting> <citerefentry><refentrytitle>mmap</refentrytitle><manvolnum>2</manvolnum></citerefentry> of <filename>/dev/zero</filename> instead of using <constant>MAP_ANON</constant>. For this setting the same restrictions regarding mount propagation and privileges apply as for - <varname>ReadOnlyPaths=</varname> and related calls, see above. If turned on and if running in user - mode, or in system mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting - <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied.</para> + <varname>ReadOnlyPaths=</varname> and related calls, see above.</para> <para>Note that the implementation of this setting might be impossible (for example if mount namespaces are not available), and the unit should be written in a way that does not solely rely on @@ -1973,10 +1960,6 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting> the system into the service, it is hence not suitable for services that need to take notice of system hostname changes dynamically.</para> - <para>If this setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant> - capability (e.g. 
services for which <varname>User=</varname> is set), - <varname>NoNewPrivileges=yes</varname> is implied.</para> - <xi:include href="system-or-user-ns.xml" xpointer="singular"/> <xi:include href="version-info.xml" xpointer="v242"/></listitem> @@ -1994,9 +1977,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting> Effectively, <filename>/dev/rtc0</filename>, <filename>/dev/rtc1</filename>, etc. are made read-only to the service. See <citerefentry><refentrytitle>systemd.resource-control</refentrytitle><manvolnum>5</manvolnum></citerefentry> - for the details about <varname>DeviceAllow=</varname>. If this setting is on, but the unit doesn't - have the <constant>CAP_SYS_ADMIN</constant> capability (e.g. services for which - <varname>User=</varname> is set), <varname>NoNewPrivileges=yes</varname> is implied.</para> + for the details about <varname>DeviceAllow=</varname>.</para> <para>It is recommended to turn this on for most services that do not need modify the clock or check its state.</para> @@ -2018,13 +1999,10 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting> <citerefentry><refentrytitle>sysctl.d</refentrytitle><manvolnum>5</manvolnum></citerefentry> mechanism. Few services need to write to these at runtime; it is hence recommended to turn this on for most services. For this setting the same restrictions regarding mount propagation and privileges apply as for - <varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off. If this - setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant> capability - (e.g. services for which <varname>User=</varname> is set), - <varname>NoNewPrivileges=yes</varname> is implied. Note that this option does not prevent - indirect changes to kernel tunables effected by IPC calls to other processes. However, - <varname>InaccessiblePaths=</varname> may be used to make relevant IPC file system objects - inaccessible. 
If <varname>ProtectKernelTunables=</varname> is set, + <varname>ReadOnlyPaths=</varname> and related calls, see above. Defaults to off. + Note that this option does not prevent indirect changes to kernel tunables effected by IPC calls to + other processes. However, <varname>InaccessiblePaths=</varname> may be used to make relevant IPC file system + objects inaccessible. If <varname>ProtectKernelTunables=</varname> is set, <varname>MountAPIVFS=yes</varname> is implied.</para> <xi:include href="system-or-user-ns.xml" xpointer="singular"/> @@ -2046,9 +2024,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting> both privileged and unprivileged. To disable module auto-load feature please see <citerefentry><refentrytitle>sysctl.d</refentrytitle><manvolnum>5</manvolnum></citerefentry> <constant>kernel.modules_disabled</constant> mechanism and - <filename>/proc/sys/kernel/modules_disabled</filename> documentation. If this setting is on, - but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant> capability (e.g. services for - which <varname>User=</varname> is set), <varname>NoNewPrivileges=yes</varname> is implied.</para> + <filename>/proc/sys/kernel/modules_disabled</filename> documentation.</para> <xi:include href="system-or-user-ns.xml" xpointer="singular"/> @@ -2067,9 +2043,7 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting> <citerefentry project='man-pages'><refentrytitle>syslog</refentrytitle><manvolnum>3</manvolnum></citerefentry> for userspace logging). The kernel exposes its log buffer to userspace via <filename>/dev/kmsg</filename> and <filename>/proc/kmsg</filename>. If enabled, these are made inaccessible to all the processes in the unit. - If this setting is on, but the unit doesn't have the <constant>CAP_SYS_ADMIN</constant> - capability (e.g. 
services for which <varname>User=</varname> is set), - <varname>NoNewPrivileges=yes</varname> is implied.</para> + </para> <xi:include href="system-or-user-ns.xml" xpointer="singular"/> @@ -2113,12 +2087,9 @@ BindReadOnlyPaths=/var/lib/systemd</programlisting> including x86-64). Note that on systems supporting multiple ABIs (such as x86/x86-64) it is recommended to turn off alternative ABIs for services, so that they cannot be used to circumvent the restrictions of this option. Specifically, it is recommended to combine this option with - <varname>SystemCallArchitectures=native</varname> or similar. If running in user mode, or in system - mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting - <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied. By default, no - restrictions apply, all address families are accessible to processes. If assigned the empty string, - any previous address family restriction changes are undone. This setting does not affect commands - prefixed with <literal>+</literal>.</para> + <varname>SystemCallArchitectures=native</varname> or similar. By default, no restrictions apply, all + address families are accessible to processes. If assigned the empty string, any previous address family + restriction changes are undone. This setting does not affect commands prefixed with <literal>+</literal>.</para> <para>Use this option to limit exposure of processes to remote access, in particular via exotic and sensitive network protocols, such as <constant>AF_PACKET</constant>. Note that in most cases, the local @@ -2251,9 +2222,7 @@ RestrictFileSystems=ext4</programlisting> creation and switching of the specified types of namespaces (or all of them, if true) access to the <function>setns()</function> system call with a zero flags parameter is prohibited. 
This setting is only supported on x86, x86-64, mips, mips-le, mips64, mips64-le, mips64-n32, mips64-le-n32, ppc64, ppc64-le, s390 - and s390x, and enforces no restrictions on other architectures. If running in user mode, or in system mode, but - without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>), - <varname>NoNewPrivileges=yes</varname> is implied.</para> + and s390x, and enforces no restrictions on other architectures.</para> <para>Example: if a unit has the following, <programlisting>RestrictNamespaces=cgroup ipc @@ -2274,9 +2243,7 @@ RestrictNamespaces=~cgroup net</programlisting> project='man-pages'><refentrytitle>personality</refentrytitle><manvolnum>2</manvolnum></citerefentry> system call so that the kernel execution domain may not be changed from the default or the personality selected with <varname>Personality=</varname> directive. This may be useful to improve security, because odd personality - emulations may be poorly tested and source of vulnerabilities. If running in user mode, or in system mode, but - without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>), - <varname>NoNewPrivileges=yes</varname> is implied.</para> + emulations may be poorly tested and source of vulnerabilities.</para> <xi:include href="version-info.xml" xpointer="v235"/></listitem> </varlistentry> @@ -2308,9 +2275,7 @@ RestrictNamespaces=~cgroup net</programlisting> available on x86. Note that on systems supporting multiple ABIs (such as x86/x86-64) it is recommended to turn off alternative ABIs for services, so that they cannot be used to circumvent the restrictions of this option. Specifically, it is recommended to combine this option with - <varname>SystemCallArchitectures=native</varname> or similar. If running in user mode, or in system - mode, but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. 
setting - <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied.</para> + <varname>SystemCallArchitectures=native</varname> or similar.</para> <xi:include href="version-info.xml" xpointer="v231"/></listitem> </varlistentry> @@ -2322,9 +2287,7 @@ RestrictNamespaces=~cgroup net</programlisting> the unit are refused. This restricts access to realtime task scheduling policies such as <constant>SCHED_FIFO</constant>, <constant>SCHED_RR</constant> or <constant>SCHED_DEADLINE</constant>. See <citerefentry project='man-pages'><refentrytitle>sched</refentrytitle><manvolnum>7</manvolnum></citerefentry> - for details about these scheduling policies. If running in user mode, or in system mode, but without the - <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>), - <varname>NoNewPrivileges=yes</varname> is implied. Realtime scheduling policies may be used to monopolize CPU + for details about these scheduling policies. Realtime scheduling policies may be used to monopolize CPU time for longer periods of time, and may hence be used to lock up or otherwise trigger Denial-of-Service situations on the system. It is hence recommended to restrict access to realtime scheduling to the few programs that actually require them. Defaults to off.</para> @@ -2338,10 +2301,8 @@ RestrictNamespaces=~cgroup net</programlisting> <listitem><para>Takes a boolean argument. If set, any attempts to set the set-user-ID (SUID) or set-group-ID (SGID) bits on files or directories will be denied (for details on these bits see <citerefentry - project='man-pages'><refentrytitle>inode</refentrytitle><manvolnum>7</manvolnum></citerefentry>). If - running in user mode, or in system mode, but without the <constant>CAP_SYS_ADMIN</constant> - capability (e.g. setting <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is - implied. 
As the SUID/SGID bits are mechanisms to elevate privileges, and allow users to acquire the + project='man-pages'><refentrytitle>inode</refentrytitle><manvolnum>7</manvolnum></citerefentry>). + As the SUID/SGID bits are mechanisms to elevate privileges, and allow users to acquire the identity of other users, it is recommended to restrict creation of SUID/SGID files to the few programs that actually require them. Note that this restricts marking of any type of file system object with these bits, including both regular files and directories (where the SGID is a different @@ -2457,15 +2418,12 @@ RestrictNamespaces=~cgroup net</programlisting> full list). This value will be returned when a deny-listed system call is triggered, instead of terminating the processes immediately. Special setting <literal>kill</literal> can be used to explicitly specify killing. This value takes precedence over the one given in - <varname>SystemCallErrorNumber=</varname>, see below. If running in user mode, or in system mode, - but without the <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting - <varname>User=</varname>), <varname>NoNewPrivileges=yes</varname> is implied. This feature - makes use of the Secure Computing Mode 2 interfaces of the kernel ('seccomp filtering') and is useful - for enforcing a minimal sandboxing environment. Note that the <function>execve()</function>, - <function>exit()</function>, <function>exit_group()</function>, <function>getrlimit()</function>, - <function>rt_sigreturn()</function>, <function>sigreturn()</function> system calls and the system calls - for querying time and sleeping are implicitly allow-listed and do not need to be listed - explicitly. This option may be specified more than once, in which case the filter masks are + <varname>SystemCallErrorNumber=</varname>, see below. This feature makes use of the Secure Computing Mode 2 + interfaces of the kernel ('seccomp filtering') and is useful for enforcing a minimal sandboxing environment. 
+ Note that the <function>execve()</function>, <function>exit()</function>, <function>exit_group()</function>, + <function>getrlimit()</function>, <function>rt_sigreturn()</function>, <function>sigreturn()</function> + system calls and the system calls for querying time and sleeping are implicitly allow-listed and do not + need to be listed explicitly. This option may be specified more than once, in which case the filter masks are merged. If the empty string is assigned, the filter is reset, all prior assignments will have no effect. This does not affect commands prefixed with <literal>+</literal>.</para> @@ -2692,10 +2650,7 @@ SystemCallErrorNumber=EPERM</programlisting> as well as <constant>x32</constant>, <constant>mips64-n32</constant>, <constant>mips64-le-n32</constant>, and the special identifier <constant>native</constant>. The special identifier <constant>native</constant> implicitly maps to the native architecture of the system (or more precisely: to the architecture the system - manager is compiled for). If running in user mode, or in system mode, but without the - <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>), - <varname>NoNewPrivileges=yes</varname> is implied. By default, this option is set to the empty list, i.e. no - filtering is applied.</para> + manager is compiled for). By default, this option is set to the empty list, i.e. no filtering is applied.</para> <para>If this setting is used, processes of this unit will only be permitted to call native system calls, and system calls of the specified architectures. For the purposes of this option, the x32 architecture is treated @@ -2723,13 +2678,11 @@ SystemCallErrorNumber=EPERM</programlisting> <listitem><para>Takes a space-separated list of system call names. If this setting is used, all system calls executed by the unit processes for the listed ones will be logged. 
If the first character of the list is <literal>~</literal>, the effect is inverted: all system calls except the - listed system calls will be logged. If running in user mode, or in system mode, but without the - <constant>CAP_SYS_ADMIN</constant> capability (e.g. setting <varname>User=</varname>), - <varname>NoNewPrivileges=yes</varname> is implied. This feature makes use of the Secure Computing - Mode 2 interfaces of the kernel ('seccomp filtering') and is useful for auditing or setting up a - minimal sandboxing environment. This option may be specified more than once, in which case the filter - masks are merged. If the empty string is assigned, the filter is reset, all prior assignments will - have no effect. This does not affect commands prefixed with <literal>+</literal>.</para> + listed system calls will be logged. This feature makes use of the Secure Computing Mode 2 interfaces + of the kernel ('seccomp filtering') and is useful for auditing or setting up a minimal sandboxing + environment. This option may be specified more than once, in which case the filter masks are merged. + If the empty string is assigned, the filter is reset, all prior assignments will have no effect. + This does not affect commands prefixed with <literal>+</literal>.</para> <xi:include href="version-info.xml" xpointer="v247"/></listitem> </varlistentry> |
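As a reading aid for the documentation change above: the diff removes the statements that the seccomp-based sandboxing settings imply `NoNewPrivileges=yes`. Below is a minimal sketch of a unit fragment that sets the flag explicitly instead of relying on it being implied. The unit name, user name and binary path are hypothetical. This diff alone does not establish whether a given systemd version still implies the flag.

```ini
# example.service (hypothetical unit, for illustration only)
[Service]
# Hypothetical binary and user; substitute real values.
ExecStart=/usr/bin/example-daemon
User=example

# Set explicitly rather than assuming that a seccomp-based
# setting below turns it on implicitly.
NoNewPrivileges=yes

# Seccomp-based restrictions mentioned in the diff.
SystemCallFilter=@system-service
SystemCallArchitectures=native
```

According to the text changed by the diff, `systemctl show` reports the configured value of `NoNewPrivileges=`, so setting it explicitly also makes the intent visible there.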