| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
If all processes we are supposed to add are gone by the time we are
ready to do so, let's fail.
THis is heavily based on Cunlong Li's work, who thankfully tracked this
down.
Replaces: #20577
|
|
|
|
|
|
|
|
| |
otherwise we might return an invalid value, since `usec_t` is 64-bit,
whereas `int` might not be.
Follow-up to: 5918a93
Fixes: #20872
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In general we almost never hit those asserts in production code, so users see
them very rarely, if ever. But either way, we just need something that users
can pass to the developers.
We have quite a few of those asserts, and some have fairly nice messages, but
many are like "WTF?" or "???" or "unexpected something". The error that is
printed includes the file location, and function name. In almost all functions
there's at most one assert, so the function name alone is enough to identify
the failure for a developer. So we don't get much extra from the message, and
we might just as well drop them.
Dropping them makes our code a tiny bit smaller, and most importantly, improves
development experience by making it easy to insert such an assert in the code
without thinking how to phrase the argument.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This mirrors the change done for systemd-resolved in
97935302283729c9206b84f5e00b1aff0f78ad19. Quoting that patch:
> We generally operate on the assumption that a source is "gone" as soon as we
> unref it. This is generally true because we have the only reference. But if
> something else holds the reference, our unref doesn't really stop the source
> and it could fire again.
In particular, we take temporary references from sd-event code, and when called
from an sd-event callback, we could temporarily see this elevated reference
count. This patch doesn't seem to change anything, but I think it's nicer to do
the same change as in other places and not rely on _unref() immediately
disabling the source.
|
|
|
|
|
|
|
|
|
|
|
| |
the cgroup scope
Commit 428a9f6f1d0396b9eacde2b38d667cbe3f15eb55 freed u->pids which is
problematic since the references to this unit in m->watch_pids were no more
removed when the unit was freed.
This patch makes sure to clean all this refs up before freeing u->pids by
calling unit_unwatch_all_pids().
|
| |
|
| |
|
|
|
|
|
|
| |
Otherwise if a daemon-reload happens somewhere between the enqueue of the job
start for the scope unit and scope_start() then u->pids might be lost and none
of the processes specified by "PIDs=" will be moved into the scope cgroup.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
process exits
If processes remain in the unit's cgroup after the final SIGKILL is
sent and the unit has exceeded stop timeout, don't release the unit's
cgroup information. Pid1 will have failed to `rmdir` the cgroup path due
to processes remaining in the cgroup and releasing would leave the cgroup
path on the file system with no tracking for pid1 to clean it up.
Instead, keep the information around until the last process exits and pid1
sends the cgroup empty notification. The service/scope can then prune
the cgroup if the unit is inactive/failed.
|
|
|
|
|
| |
This adds the hook ups so it can be read with the usual systemd
utilities. Used in later commits by sytemd-oomd.
|
|
|
|
|
|
|
|
| |
In all the other cases, I think the code was clearer with the static table.
Here, not so much. And because of the existing dump code, the vtables cannot
be made static and need to remain exported. I still think it's worth to do the
change to have the cmdline introspection, but I'm disappointed with how this
came out.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With cgroup v2 the cgroup freezer is implemented as a cgroup
attribute called cgroup.freeze. cgroup can be frozen by writing "1"
to the file and kernel will send us a notification through
"cgroup.events" after the operation is finished and processes in the
cgroup entered quiescent state, i.e. they are not scheduled to
run. Writing "0" to the attribute file does the inverse and process
execution is resumed.
This commit exposes above low-level functionality through systemd's DBus
API. Each unit type must provide specialized implementation for these
methods, otherwise, we return an error. So far only service, scope, and
slice unit types provide the support. It is possible to check if a
given unit has the support using CanFreeze() DBus property.
Note that DBus API has a synchronous behavior and we dispatch the reply
to freeze/thaw requests only after the kernel has notified us that
requested operation was completed.
|
|
|
|
|
|
|
|
|
| |
Similar, refuse triggering deps on units that cannot trigger.
And rework how we ignore After= dependencies on device units, to work
the same way.
See: #14142
|
|\
| |
| | |
Add `RuntimeMaxSec=` support to scope units (time-limited login sessions)
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Just as `RuntimeMaxSec=` is supported for service units, add support for
it to scope units. This will gracefully kill a scope after the timeout
expires from the moment the scope enters the running state.
This could be used for time-limited login sessions, for example.
Signed-off-by: Philip Withnall <withnall@endlessm.com>
Fixes: #12035
|
| |
| |
| |
| |
| |
| |
| | |
Factor it out into a helper function which is a bit easier to expand in
future. This introduces no functional changes.
Signed-off-by: Philip Withnall <withnall@endlessm.com>
|
| |
| |
| |
| |
| |
| |
| | |
No functional change, just adjusting code to follow the same pattern
everywhere. In particular, never call _verify() on an already loaded unit,
but return early from the caller instead. This makes the code a bit easier
to follow.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
unit_load_fragment_and_dropin() and unit_load_fragment_and_dropin_optional()
are really the same, with one minor difference in behaviour. Let's drop
the second function.
"_optional" in the name suggests that it's the "dropin" part that is optional.
(Which it is, but in this case, we mean the fragment to be optional.)
I think the new version with a flag is easier to understand.
|
|/
|
|
|
|
| |
This is the most basic consumer of the new systemd-vs-kernel checker,
both acting as a reasonable standalone exerciser of the code, and also
as a way for easy inspection of deviations from systemd internal state.
|
|
|
|
|
|
|
|
| |
It's a simple wrapper for resetting both IP and CPU accounting in one
go.
This will become particularly useful when we also needs this to reset IO
accounting (to be added in a later commit).
|
|
|
|
| |
No functional changes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This allows clients to follow our internal state changes safely.
Previously, quick state changes (for example, when we restart a unit due
to Restart= after it quickly transitioned through DEAD/FAILED states)
would be coalesced into one bus signal event, with this change there's
the guarantee that all state changes after the unit was announced ones
are reflected on th bus.
Note we only do this kind of guaranteed flushing only for unit state
changes, not for other unit property changes, where clients still have
to expect coalescing. This is because the unit state is a very
important, high-level concept.
Fixes: #10185
|
|
|
|
|
| |
It's inline so that the compiler can easily optimize away the call to get
status string.
|
|
|
|
|
|
| |
We already are doing it on failure, let's do it on success, too.
Fixes: #10265
|
|
|
|
|
| |
Let's make this recognizable, and carry result information in a
structure fashion.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Let's be more careful with what we serialize: let's ensure we never
serialize strings that are longer than LONG_LINE_MAX, so that we know we
can read them back with read_line(…, LONG_LINE_MAX, …) safely.
In order to implement this all serialization functions are move to
serialize.[ch], and internally will do line size checks. We'd rather
skip a serialization line (with a loud warning) than write an overly
long line out. Of course, this is just a second level protection, after
all the data we serialize shouldn't be this long in the first place.
While we are at it also clean up logging: while serializing make sure to
always log about errors immediately. Also, (void)ify all calls we don't
expect errors in (or catch errors as part of the general
fflush_and_check() at the end.
|
| |
|
|
|
|
|
|
|
|
|
|
|
| |
These lines are generally out-of-date, incomplete and unnecessary. With
SPDX and git repository much more accurate and fine grained information
about licensing and authorship is available, hence let's drop the
per-file copyright notice. Of course, removing copyright lines of others
is problematic, hence this commit only removes my own lines and leaves
all others untouched. It might be nicer if sooner or later those could
go away too, making git the only and accurate source of authorship
information.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This part of the copyright blurb stems from the GPL use recommendations:
https://www.gnu.org/licenses/gpl-howto.en.html
The concept appears to originate in times where version control was per
file, instead of per tree, and was a way to glue the files together.
Ultimately, we nowadays don't live in that world anymore, and this
information is entirely useless anyway, as people are very welcome to
copy these files into any projects they like, and they shouldn't have to
change bits that are part of our copyright header for that.
hence, let's just get rid of this old cruft, and shorten our codebase a
bit.
|
|
|
|
|
| |
This changes a number of EINVAL cases to ENOEXEC, so that we enter
"bad-setting" state if they fail.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously the enumerate() callback defined for each unit type would do
two things:
1. It would create perpetual units (i.e. -.slice, system.slice, -.mount and
init.scope)
2. It would enumerate units from /proc/self/mountinfo, /proc/swaps and
the udev database
With this change these two parts are split into two seperate methods:
enumerate() now only does #2, while enumerate_perpetual() is responsible
for #1. Why make this change? Well, perpetual units should have a
slightly different effect that those found through enumeration: as
perpetual units should be up unconditionally, perpetually and thus never
change state, they should also not pull in deps by their state changing,
not even when the state is first set to active. Thus, their state is
generally initialized through the per-device coldplug() method in
similar fashion to the deserialized state from a previous run would be
put into place. OTOH units found through regular enumeration should
result in state changes (and thus pull in deps due to state changes),
hence their state should be put in effect in the catchup() method
instead. Hence, given this difference, let's also separate the
functions, so that the rule is:
1. What is created in enumerate_perpetual() should be started in
coldplug()
2. What is created in enumerate() should be started in catchup().
|
|
|
|
|
|
| |
Scope units don't have a main or control process we can watch, hence
let's explicitly watch the PIDs contained in them early on, just to make
things more robust and have at least something to watch.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reworks how systemd tracks processes on cgroupv1 systems where
cgroup notification is not reliable. Previously, whenever we had reason
to believe that new processes showed up or got removed we'd scan the
cgroup of the scope or service unit for new processes, and would tidy up
the list of PIDs previously watched. This scanning is relatively slow,
and does not scale well. With this change behaviour is changed: instead
of scanning for new/removed processes right away we do this work in a
per-unit deferred event loop job. This event source is scheduled at a
very low priority, so that it is executed when we have time but does not
starve other event sources. This has two benefits: this expensive work is
coalesced, if events happen in quick succession, and we won't delay
SIGCHLD handling for too long.
This patch basically replaces all direct invocation of
unit_watch_all_pids() in scope.c and service.c with invocations of the
new unit_enqueue_rewatch_pids() call which just enqueues a request of
watching/tidying up the PID sets (with one exception: in
scope_enter_signal() and service_enter_signal() we'll still do
unit_watch_all_pids() synchronously first, since we really want to know
all processes we are about to kill so that we can track them properly.
Moreover, all direct invocations of unit_tidy_watch_pids() and
unit_synthesize_cgroup_empty_event() are removed too, when the
unit_enqueue_rewatch_pids() call is invoked, as the queued job will run
those operations too.
All of this is done on cgroupsv1 systems only, and is disabled on
cgroupsv2 systems as cgroup-empty notifications are reliable there, and
we do not need SIGCHLD events to track processes there.
Fixes: #9138
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds a flags parameter to unit_notify() which can be used to pass
additional notification information to the function. We the make the old
reload_failure boolean parameter one of these flags, and then add a new
flag that let's unit_notify() if we are configured to restart the
service.
Note that this adjusts behaviour of systemd to match what the docs say.
Fixes: #8398
|
|
|
|
|
|
|
|
| |
Scope units are populated from PIDs specified by the bus client. We do
that when a scope is started. We really shouldn't allow scopes to be
started multiple times, as the PIDs then might be heavily out of date.
Moreover, clients should have the guarantee that any scope they allocate
has a clear runtime cycle which is not repetitive.
|
|
|
|
|
|
|
|
|
|
| |
Files which are installed as-is (any .service and other unit files, .conf
files, .policy files, etc), are left as is. My assumption is that SPDX
identifiers are not yet that well known, so it's better to retain the
extended header to avoid any doubt.
I also kept any copyright lines. We can probably remove them, but it'd nice to
obtain explicit acks from all involved authors before doing that.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
units
This adds a new bus call to service and scope units called
AttachProcesses() that moves arbitrary processes into the cgroup of the
unit. The primary user for this new API is systemd itself: the systemd
--user instance uses this call of the systemd --system instance to
migrate processes if itself gets the request to migrate processes and
the kernel refuses this due to access restrictions.
The primary use-case of this is to make "systemd-run --scope --user …"
invoked from user session scopes work correctly on pure cgroupsv2
environments. There, the kernel refuses to migrate processes between two
unprivileged-owned cgroups unless the requestor as well as the ownership
of the closest parent cgroup all match. This however is not the case
between the session-XYZ.scope unit of a login session and the
user@ABC.service of the systemd --user instance.
The new logic always tries to move the processes on its own, but if
that doesn't work when being the user manager, then the system manager
is asked to do it instead.
The new operation is relatively restrictive: it will only allow to move
the processes like this if the caller is root, or the UID of the target
unit, caller and process all match. Note that this means that
unprivileged users cannot attach processes to scope units, as those do
not have "owning" users (i.e. they have now User= field).
Fixes: #3388
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
scope and service units only
Currently we allowed delegation for alluntis with cgroup backing
except for slices. Let's make this a bit more strict for now, and only
allow this in service and scope units.
Let's also add a generic accessor unit_cgroup_delegate() for checking
whether a unit has delegation turned on that checks the new bool first.
Also, when doing transient units, let's explcitly refuse turning on
delegation for unit types that don#t support it. This is mostly
cosmetical as we wouldn't act on the delegation request anyway, but
certainly helpful for debugging.
|
|\
| |
| | |
core: a couple of tidyups to synthesized units
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
`IgnoreOnIsolate=yes` is the default for slices and scopes. So it's not
essential to set it on root.slice or init.scope.
We don't need to worry about a bad unit file configuration. Any attempt
to stop these unit should fail, since we mark them as `perpetual`.
Also since init.scope cannot be stopped, there is no point setting
`KillSignal=SIGRTMIN+14`. According to both documentation and testing,
KillSignal= does not affect the behaviour of `systemctl kill`.
|
|/
|
|
|
|
|
|
|
| |
watching any unit PIDs
This code is very similar in scope and service units, let's unify it in
one function. This changes little for service units, but for scope units
makes sure we go through the cgroup queue, which is something we should
do anyway.
|
|
|
|
|
|
|
|
| |
Let's move the cgroup empty check for all unit types into the generic
unit_check_gc() call, out of the per-unit-type _check_gc() type. This
not only allows us to share some code, but also hooks up mount and
socket units with this kind of check, for free, as it was missing there
previously.
|
|
|
|
|
|
|
|
|
| |
This watches controllers on the bus, and unsets them automatically when
they disappear.
Note that this is primarily a cosmetical fix. Since unique bus names are not
recycled, there's strictly no need to forget about them, but it's a lot
nicer to do so.
|
|
|
|
|
| |
We forgot to serialize it previously, hence daemon reload flushed it
out, since we also didn't write it to any unit file...
|
|
|
|
|
| |
This follows what the kernel is doing, c.f.
https://git.kernel.org/pub/scm/linux/kernel/git/torvalds/linux.git/commit/?id=5fd54ace4721fc5ce2bb5aef6318fcf17f421460.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PID 1 to journald
And let's make use of it to implement two new unit settings with it:
1. LogLevelMax= is a new per-unit setting that may be used to configure
log priority filtering: set it to LogLevelMax=notice and only
messages of level "notice" and lower (i.e. more important) will be
processed, all others are dropped.
2. LogExtraFields= is a new per-unit setting for configuring per-unit
journal fields, that are implicitly included in every log record
generated by the unit's processes. It takes field/value pairs in the
form of FOO=BAR.
Also, related to this, one exisiting unit setting is ported to this new
facility:
3. The invocation ID is now pulled from /run/systemd/units/ instead of
cgroupfs xattrs. This substantially relaxes requirements of systemd
on the kernel version and the privileges it runs with (specifically,
cgroupfs xattrs are not available in containers, since they are
stored in kernel memory, and hence are unsafe to permit to lesser
privileged code).
/run/systemd/units/ is a new directory, which contains a number of files
and symlinks encoding the above information. PID 1 creates and manages
these files, and journald reads them from there.
Note that this is supposed to be a direct path between PID 1 and the
journal only, due to the special runtime environment the journal runs
in. Normally, today we shouldn't introduce new interfaces that (mis-)use
a file system as IPC framework, and instead just an IPC system, but this
is very hard to do between the journal and PID 1, as long as the IPC
system is a subject PID 1 manages, and itself a client to the journal.
This patch cleans up a couple of types used in journal code:
specifically we switch to size_t for a couple of memory-sizing values,
as size_t is the right choice for everything that is memory.
Fixes: #4089
Fixes: #3041
Fixes: #4441
|