| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
| |
We have to skip the W^X protections as we need executable
memory on PARISC for now. Kernel work is in progress (started
w/ 5.18).
Closes: https://github.com/systemd/systemd/issues/23180
|
|
|
|
|
|
| |
@known is generated from syscall-list.txt, which is generated from kernel
headers. So, some syscalls in @obsolete may not be listed in
syscall-list.txt.
|
| |
|
|
|
|
|
|
|
|
| |
RestrictNamespaces should block clone3() like flatpak:
https://github.com/flatpak/flatpak/commit/a10f52a7565c549612c92b8e736a6698a53db330
clone3() passes arguments in a structure referenced by a pointer, so we can't
filter on the flags as with clone(). Let's disallow the whole function call.
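
The difference between clone() and clone3() can be sketched as a small Python model. This is only an illustration, not systemd's code: the `decide` helper and its "allow"/"deny" return values are hypothetical, the flag constants are copied from the Linux UAPI headers, and the argument slot that holds the clone() flags is assumed to be the first one (it differs on some architectures).

```python
# Namespace flag values as defined in <linux/sched.h>.
CLONE_NEWNS = 0x00020000
CLONE_NEWUSER = 0x10000000
CLONE_NEWNET = 0x40000000


def decide(syscall, args, blocked_flags):
    """Hypothetical filter decision for a single syscall invocation.

    A seccomp filter only sees the raw register values in `args`;
    it cannot follow pointers into the caller's memory.
    """
    if syscall == "clone":
        # The flags are passed by value, so they can be masked and compared.
        # Assumption: the flags live in argument slot 0.
        return "deny" if args[0] & blocked_flags else "allow"
    if syscall == "clone3":
        # args[0] is merely a pointer to struct clone_args. The flags
        # inside it are invisible to the filter, so refuse the whole call.
        return "deny"
    return "allow"
```

With this model, `decide("clone", [CLONE_NEWNET], CLONE_NEWNET)` is refused because the flag is visible in the argument, while any clone3() call is refused regardless of what the structure contains.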
|
|
|
|
|
| |
In case anyone else starts wondering whether it should be listed
as I did…
|
| |
|
|
|
|
| |
This also avoids multiple evaluations in STRV_FOREACH_BACKWARDS()
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
It was reported as used by the linker:
> [It is] called in the setup of ld-linux-x86-64.so.2 from _dl_sysdep_start.
> My local call stack (with LTO):
>
> #0 init_cpu_features.constprop.0 (/usr/lib64/ld-linux-x86-64.so.2)
> #1 _dl_sysdep_start (/usr/lib64/ld-linux-x86-64.so.2)
> #2 _dl_start (/usr/lib64/ld-linux-x86-64.so.2)
> #3 _start (/usr/lib64/ld-linux-x86-64.so.2)
>
> Looking through the source, I think it's this (links for glibc 2.34):
> - First dl_platform_init calls _dl_x86_init_cpu_features, a wrapper for init_cpu_features.
> - Then init_cpu_features calls get_cet_status.
> - At last, get_cet_status invokes arch_prctl.
Fixes #22033.
|
|
|
|
| |
Fixes #21969.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With glibc-2.34.9000-17.fc36.x86_64, dynamically linked programs newly fail in early
init with a restrictive syscall filter that does not include @system-service.
I think this is caused by 2dd87703d4386f2776c5b5f375a494c91d7f9fe4:
Author: Florian Weimer <fweimer@redhat.com>
Date: Mon May 10 10:31:41 2021 +0200
nptl: Move changing of stack permissions into ld.so
All the stack lists are now in _rtld_global, so it is possible
to change stack permissions directly from there, instead of
calling into libpthread to do the change.
It seems that this call will now be very widely used, so let's just move it to
default to avoid too many failures.
|
|
|
|
|
|
|
|
|
|
|
| |
NOPs via seccomp
This is supposed to be used by package/image builders such as mkosi to
speed up building, since it allows us to suppress sync() inside a
container.
This does what Debian's eatmydata tool does, but for a container, and
via seccomp (instead of LD_PRELOAD).
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The commit 6597686865ff ("seccomp: don't install filters for archs that
can't use syscalls") introduced a regression where filters may not be
installed for the "native" architecture. This means that setting
SystemCallArchitectures=native for a unit effectively disables the
SystemCallFilter= and SystemCallLog= options.
Conceptually, we have two filter stages:
1. architecture used for syscall (SystemCallArchitectures=)
2. syscall + architecture combination (SystemCallFilter=)
The above commit tried to optimize the filter generation by skipping the
second level filtering when it is not required.
However, systemd will never fully block the "native" architecture using
the first level filter. This makes the code a lot simpler, as systemd
can execve() the target binary using its own architecture. And, it
should be perfectly fine as the "native" architecture will always be the
one with the most restrictive seccomp filtering.
Said differently, the bug arises because (on x86_64):
1. x86_64 is permitted by libseccomp already
2. native != x86_64
3. the loop wants to block x86_64 because the permitted set only
contains "native" (i.e. "native" != "x86_64")
4. x86_64 is marked as blocked in seccomp_local_archs
Thereby we have an inconsistency, where it is marked as blocked in the
seccomp_local_archs array but it is allowed by libseccomp. i.e. we will
skip generating filter stage 2 without having stage 1 in place.
The fix is simple: we just skip the native architecture when looping over
seccomp_local_archs. This way the inconsistency cannot happen.
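
The inconsistency and the fix can be modelled in a few lines of Python. This is a simplified sketch, not the actual C code: the function names, the `skip_native` switch and the string architecture names are assumptions made for illustration, and `BLOCKED` stands in for the SECCOMP_LOCAL_ARCH_BLOCKED marker.

```python
BLOCKED = "blocked"  # stand-in for SECCOMP_LOCAL_ARCH_BLOCKED


def restrict_archs(local_archs, permitted, native, skip_native):
    """Return the local arch list after applying the permitted set (stage 1)."""
    result = []
    for arch in local_archs:
        if skip_native and arch == native:
            # The fix: the native architecture is never blocked by stage 1,
            # so it must never be marked as blocked either.
            result.append(arch)
            continue
        result.append(arch if arch in permitted else BLOCKED)
    return result


def archs_with_stage2_filter(local_archs):
    """Stage 2 filters are only generated for entries not marked blocked."""
    return [arch for arch in local_archs if arch != BLOCKED]
```

With `local_archs = ["x86", "x86-64", "x32"]`, `permitted = {"native"}` and `native = "x86-64"`, the model without the skip marks every entry as blocked, so no stage 2 filter is generated even though x86-64 remains usable. With the skip, x86-64 stays in the list and gets its stage 2 filter.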
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
See: https://github.com/systemd/systemd/pull/20191#issuecomment-881982739
In general, we shouldn't blanket move syscalls like this into @default,
given that glibc actually does have fallbacks, afaics. However, as
long as the syscalls are "read-only" and thus benign, I figure it's a
safe thing to do. But we should probably stick to a "if in doubt, don't"
rule, and put these syscalls in @system-service as default, but not into
@default.
I think in the real world @system-service is the sensible group people
should use, and not @default actually.
|
|
|
|
|
|
|
|
| |
It's included in @default now, since
14f4b1b568907350d023d1429c1aa4aaa8925f22, and since @system-service
pulls that in we can drop it from @system-service.
Follow-up for #20191
|
|
|
| |
glibc master uses getrandom in malloc since https://sourceware.org/git/?p=glibc.git;a=commit;h=fc859c304898a5ec72e0ba5269ed136ed0ea10e1 . getrandom should be in the default set, so that non-trivial programs do not all fall back to a PRNG.
|
|
|
|
|
|
|
|
|
|
|
|
| |
In the light of https://lwn.net/Articles/859679/ let's drop
quotactl_path() again from the filter set list, as it got backed out
again in 5.13-rc3.
It's likely going to be replaced by quotactl_fd() eventually, but that
hasn't made its way into the tree yet, hence let's not replace the entry
for now.
This partially reverts 34254e599a28529bdb89f91571adeaf7c76d9f43.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, if the hashmap was an allow-list and a new deny-listed syscall
was added, seccomp_parse_syscall_filter() simply dropped the new syscall
from the hashmap, even if an error number was specified.
This makes the 'allow-list' hashmap store two types of entries:
- allow-listed syscalls, which are stored with negative value (-1).
- deny-listed syscalls, which are stored with specified errno.
Fixes #18916.
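
The two entry types can be illustrated with a plain Python dict. This is a rough sketch of the idea rather than the C implementation: the `add_to_allow_list` helper and its parameters are hypothetical, and syscalls are keyed by name here for readability.

```python
import errno


def add_to_allow_list(filter_map, name, deny=False, errno_num=None):
    """Add one entry to an allow-list style map (syscall name -> value).

    -1 marks an allow-listed syscall. A non-negative value is the errno
    to return for a syscall that is carved out of the allow-list.
    """
    if not deny:
        filter_map[name] = -1
    elif errno_num is not None:
        # Behaviour described above: keep the entry and remember the errno.
        filter_map[name] = errno_num
    else:
        # No errno given: the syscall is simply not part of the allow-list.
        filter_map.pop(name, None)
    return filter_map
```

For example, allowing `read` and then denying `open` with `errno.EACCES` leaves both entries in the map, one with -1 and one with the errno value.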
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
| |
We reject all openat2() calls because it is currently not possible to
inspect its flags parameter via seccomp.
Fallback code is more likely to look for ENOSYS than EPERM.
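
Why ENOSYS is the friendlier error can be seen from a typical fallback pattern, sketched here in Python. The `open_with_fallback` helper and its two callables are hypothetical and only model the control flow; real callers would wrap the actual syscalls.

```python
import errno


def open_with_fallback(try_openat2, openat):
    """Try the newer call first and fall back only if it is unavailable."""
    try:
        return try_openat2()
    except OSError as e:
        if e.errno != errno.ENOSYS:
            # Any other error (for example EPERM) is treated as a real
            # failure and is not retried with the older call.
            raise
        return openat()
```

A caller written like this keeps working when the filter answers openat2() with ENOSYS, but would fail outright if the filter answered with EPERM.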
|
|
|
|
|
| |
This makes parse-util.c independent of seccomp-util.c, which is located
in src/shared.
|
|
|
|
|
|
|
|
|
| |
When seccomp_restrict_archs is called, architectures that are blocked
are replaced by the SECCOMP_LOCAL_ARCH_BLOCKED marker so that they are
not disabled again and filters are not installed for them.
This can make some services that use SystemCallArchitectures= and
SystemCallFilter= start faster.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts the gist of da1921a5c396547261c8c7fcd94173346eb3b718 and
0d9fca76bb69e162265b2d25cb79f1890c0da31b (for ppc).
Quoting #17559:
> libseccomp 2.5 added socket syscall multiplexing on ppc64(el):
> https://github.com/seccomp/libseccomp/pull/229
>
> Like with i386, s390 and s390x this breaks socket argument filtering, so
> RestrictAddressFamilies doesn't work.
>
> This causes the unit test to fail:
> /* test_restrict_address_families */
> Operating on architecture: ppc
> Failed to install socket family rules for architecture ppc, skipping: Operation canceled
> Operating on architecture: ppc64
> Failed to add socket() rule for architecture ppc64, skipping: Invalid argument
> Operating on architecture: ppc64-le
> Failed to add socket() rule for architecture ppc64-le, skipping: Invalid argument
> Assertion 'fd < 0' failed at src/test/test-seccomp.c:424, function test_restrict_address_families(). Aborting.
>
> The socket filters can't be added so `socket(AF_UNIX, SOCK_DGRAM, 0);` still
> works, triggering the assertion.
Fixes #17559.
|
|
|
|
| |
Follow-up for 5abede3247591248718026cb8be6cd231de7728b.
|
|
|
|
|
|
|
|
|
|
|
| |
These three syscalls are internally used by libc's memory allocation
logic, i.e. ultimately back malloc(). Allocating a bit of memory is so
basic, it should just be in the default set.
This fixes a couple of issues with asan/msan and the seccomp tests: when
asan/msan is used some additional, large memory allocations take place
in the background, and unless mmap/mmap2/brk are allowlisted these will
fail, aborting the test prematurely.
|
| |
|
|
|
|
|
|
|
|
|
| |
Fixes: #17504
(While we are at it, also move $SYSTEMD_SECCOMP_LOG= env var description
into the right document section)
Also suggested in: https://github.com/systemd/systemd/issues/17245#issuecomment-704773603
|
|
|
|
|
|
| |
Quoting the manual page of stime(2): "Starting with glibc 2.31, this function
is no longer available to newly linked applications and is no longer declared
in <time.h>."
|
| |
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
This is like membarrier() I guess and basically just exposes CPU
functionality via kernel syscall on some archs. Let's whitelist it for
everyone.
Fixes: #17197
|
|
|
|
|
|
|
|
|
|
|
|
| |
With new directive SystemCallLog= it's possible to list system calls to be
logged. This can be used for auditing or temporarily when constructing system
call filters.
---
v5: drop intermediary, update HASHMAP_FOREACH_KEY() use
v4: skip useless debug messages, actually parse directive
v3: don't declare unused variables with old libseccomp
v2: fix build without seccomp or old libseccomp
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Define explicit action "kill" for SystemCallErrorNumber=.
In addition to errno code, allow specifying "kill" as action for
SystemCallFilter=.
---
v7: seccomp_parse_errno_or_action() returns -EINVAL if !HAVE_SECCOMP
v6: use streq_ptr(), let errno_to_name() handle bad values, kill processes,
init syscall_errno
v5: actually use seccomp_errno_or_action_to_string(), don't fail bus unit
parsing without seccomp
v4: fix build without seccomp
v3: drop log action
v2: action -> number
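
A value that is either "kill" or an errno could be parsed roughly as follows. This Python sketch is not the C function named in the changelog above: the `KILL` sentinel, the accepted spellings and the 0..4095 range check are assumptions made for illustration.

```python
import errno

KILL = "kill"  # hypothetical sentinel for the kill action


def parse_errno_or_action(value):
    """Parse "kill", an errno name such as "EPERM", or a decimal number."""
    if value == "kill":
        return KILL
    if value.isdigit():
        num = int(value)
    elif value.startswith("E"):
        # Look the symbolic name up in Python's errno table.
        num = getattr(errno, value, None)
    else:
        num = None
    # Assumption: valid errno values lie in the range 0..4095.
    if not isinstance(num, int) or not 0 <= num <= 4095:
        raise ValueError(f"invalid errno or action: {value!r}")
    return num
```

Returning a distinct sentinel for "kill" keeps it from colliding with any numeric errno, so the caller can choose between an errno action and a kill action with a single comparison.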
|
| |
|
| |
|
|\
| |
| | |
Return ENOSYS in nspawn for "unknown" syscalls
|
| |
| |
| |
| |
| | |
While at it, start removing the "seccomp_" prefix from our
own functions. It is used by libseccomp.
|
| | |
|
| | |
|
| | |
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This patch adds seccomp support to the riscv64 architecture. seccomp
support is available in the riscv64 kernel since version 5.5, and it
has just been added to the libseccomp library.
riscv64 uses generic syscalls like aarch64, so I used that architecture
as a reference to find which code has to be modified.
With this patch, the testsuite passes successfully, including the
test-seccomp test. The system boots and works fine with kernel 5.4 (i.e.
without seccomp support) and kernel 5.5 (i.e. with seccomp support). I
have also verified that the "SystemCallFilter=~socket" option prevents a
service from using the ping utility when running on kernel 5.5.
|
|
|
|
| |
(cherry picked from commit 27605d6a836d85563faf41db9f7a72883d44c0ff)
|
|
|
|
|
|
|
|
|
|
|
|
| |
It is possible that we will be running with an upgraded libseccomp, in which
case libseccomp might know the syscall name, even if the number is not known at
the time when systemd is being compiled. The guard only serves to break such
upgrades, by requiring that we also recompile systemd.
For s390-specific syscalls, use a define to exclude them, so that we don't
try to filter them on other arches.
(cherry picked from commit 6cf852e79eb0eced2f77653941f9c75c3bd79386)
|
|
|
|
|
|
| |
cf https://repo.or.cz/glibc.git/commit/3d3ab573a5f3071992cbc4f57d50d1d29d55bde2
This causes breakage on Fedora Rawhide: https://bugzilla.redhat.com/show_bug.cgi?id=1869030
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
https://tools.ietf.org/html/draft-knodel-terminology-02
https://lwn.net/Articles/823224/
This gets rid of most but not all occasions of these loaded terms:
1. scsi_id and friends are something that is supposed to be removed from
our tree (see #7594)
2. The test suite defines an API used by the ubuntu CI. We can remove
this too later, but this needs to be done in sync with the ubuntu CI.
3. In some cases the terms are part of APIs we call or where we expose
concepts the kernel names the way it names them. (In particular all
remaining uses of the word "slave" in our codebase are like this,
it's used by the POSIX PTY layer, by the network subsystem, the mount
API and the block device subsystem). Getting rid of the term in these
contexts would mean doing some major fixes of the kernel ABI first.
Regarding the replacements: when whitelist/blacklist is used as a noun we
replace it with allow list/deny list, and when used as a verb with
allow-list/deny-list.
|
|
|
|
|
|
|
|
|
| |
Patch contains a coccinelle script, but it only works in some cases. Many
parts were converted by hand.
Note: I did not fix errors in return value handling. This will be done separately
to keep the patch comprehensible. No functional change is intended in this
patch.
|
| |
|