| Commit message | Author | Age | Files | Lines |
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of assuming that more-recently modified directories have higher mtime,
just look for any mtime changes, up or down. Since we don't want to remember
individual mtimes, hash them to obtain a single value.
This should help us behave properly in the case when the time jumps backwards
during boot: various files might have mtimes that are in the future, but we
won't care. This fixes the following scenario:
We have /etc/systemd/system with mtime T1. T1 is initially far in the past.
We have /run/systemd/generator with mtime T2.
The time is adjusted backwards, so T2 will always be in the future for a while.
Now the user writes new files to /etc/systemd/system, and T1 is updated to T1'.
Nevertheless, T1 < T1' << T2.
We would falsely consider our cache to be up-to-date.
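As an illustration, a minimal sketch of the mtime-hashing idea in C (hypothetical names, not the actual systemd code, which uses its own siphash helpers):

    #include <stddef.h>
    #include <stdint.h>
    #include <stdbool.h>
    #include <sys/stat.h>

    /* Fold the mtimes of all search directories into one value. Any change,
     * up or down, changes the hash, so backwards time jumps are handled. */
    static uint64_t hash_dir_mtimes(char * const *dirs, size_t n) {
            uint64_t h = 0xcbf29ce484222325ULL;  /* FNV-1a offset basis */
            for (size_t i = 0; i < n; i++) {
                    struct stat st;
                    if (stat(dirs[i], &st) < 0)
                            continue;            /* missing dirs contribute nothing */
                    h ^= (uint64_t) st.st_mtim.tv_sec ^ (uint64_t) st.st_mtim.tv_nsec;
                    h *= 0x100000001b3ULL;       /* FNV-1a prime */
            }
            return h;
    }

    static bool cache_out_of_date(uint64_t cached_hash, char * const *dirs, size_t n) {
            return hash_dir_mtimes(dirs, n) != cached_hash;
    }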
|
|
|
|
|
|
| |
The name is misleading, since we aren't really loading the unit from cache — if
this function returns true, we'll try to load the unit from disk, updating the
cache in the process.
|
|
|
|
| |
Fixes: #15778 #16060
|
|
|
|
|
|
|
|
|
| |
When a command asks to load a unit directly, the unit is in state
UNIT_NOT_FOUND, and the cache is outdated, we refresh the cache and
attempt to load again.
Use the same logic when building up a transaction and a dependency in
UNIT_NOT_FOUND state is encountered.
Update the unit test to exercise this code path.
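A minimal sketch of that retry logic, with hypothetical helper names standing in for the actual systemd functions:

    enum { UNIT_LOADED, UNIT_NOT_FOUND };

    struct unit { int load_state; };

    extern struct unit *try_load_from_disk(const char *name); /* hypothetical */
    extern int unit_cache_out_of_date(void);                  /* hypothetical */
    extern void refresh_unit_cache(void);                     /* hypothetical */

    struct unit *load_unit_with_retry(const char *name) {
            struct unit *u = try_load_from_disk(name);
            if (u && u->load_state != UNIT_NOT_FOUND)
                    return u;
            if (!unit_cache_out_of_date())
                    return u;                 /* genuinely missing, give up */
            refresh_unit_cache();             /* rebuild the name -> fragment map */
            return try_load_from_disk(name);  /* second and last attempt */
    }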
|
|
|
|
|
|
| |
manager_override_{show_status,watchdog}
No functional change.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The fact that m->show_status was serialized/deserialized made any further
customisation of this setting via system.conf impossible. IOW, the value was
basically always locked unless it was changed via signals.
This patch reworks the handling of m->show_status but also makes sure that if a
new value is set via the signal API, this value is kept and preserved
across PID1 reexecing or reloading.
Note: this effectively means that once the value is set via the signal
interface, it can be changed again only through the signal API.
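A minimal sketch of the idea, with hypothetical field names (the actual code differs): keep the config value and the runtime override separately, serialize only the override, and let the override win whenever it is set:

    typedef enum ShowStatus {
            SHOW_STATUS_NO,
            SHOW_STATUS_AUTO,
            SHOW_STATUS_YES,
            _SHOW_STATUS_INVALID = -1,         /* override unset */
    } ShowStatus;

    struct manager {
            ShowStatus show_status;            /* from system.conf, rebuilt on reload */
            ShowStatus show_status_overridden; /* from the signal API, serialized
                                                * across reload/reexec */
    };

    static ShowStatus manager_get_show_status(const struct manager *m) {
            return m->show_status_overridden != _SHOW_STATUS_INVALID ?
                    m->show_status_overridden : m->show_status;
    }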
|
|
|
|
| |
No functional change.
|
|
|
|
| |
No functional change.
|
|
|
|
| |
No functional change.
|
|
|
|
| |
No functional change.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Most of the complexity of this patch is due to the fact that some manager
settings (basically the watchdog properties) can be set at runtime, and in
this case the runtime values must be retained over daemon-reload or
daemon-reexec.
For consistency's sake, all watchdog properties now behave the same way, that
is:
- Values defined by config files can be overridden by writing the new value
  through their respective D-Bus properties. In this case, these values are
  preserved over reload/reexec until the special value '0' or USEC_INFINITY
  is written, which restores the last values loaded from the config files. If
  the restored value is '0' or USEC_INFINITY, the watchdogs are disabled and
  the corresponding device is closed.
- Reading the properties from a user instance returns USEC_INFINITY, as these
  properties are only meaningful for PID1.
- Writing to one of the watchdog properties of a user instance is a NOP.
Fixes: #15453
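A minimal sketch of that override rule (hypothetical names; the real code lives in PID1's watchdog handling):

    #include <stdint.h>

    typedef uint64_t usec_t;
    #define USEC_INFINITY ((usec_t) UINT64_MAX)

    struct watchdog_setting {
            usec_t from_config;  /* last value loaded from config files */
            usec_t override;     /* runtime value written via D-Bus */
            int override_set;
    };

    static void watchdog_set_from_bus(struct watchdog_setting *w, usec_t v) {
            if (v == 0 || v == USEC_INFINITY)
                    w->override_set = 0;  /* drop the override, fall back to config */
            else {
                    w->override = v;
                    w->override_set = 1;  /* retained over reload/reexec */
            }
    }

    static usec_t watchdog_effective(const struct watchdog_setting *w) {
            return w->override_set ? w->override : w->from_config;
    }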
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We would flip to status=temporary mode on the first error, and then switch back
to status=auto after the initial transaction was done. This isn't very useful,
because usually all the messages are about successfully started units and not
related to the original failure. In fact, all those messages most likely cause
the information about the primary error to scroll off screen. And if the user
requested quiet boot, there's no reason to think that they care about those
success messages.
Also, when logging about dependency cycles, treat this similarly to a unit
error and show the message even if the status is "soft disabled" (before, we
wouldn't show it in that case).
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Previously, when first connecting to the bus we'd issue a ListNames() bus
call to the driver to figure out which bus names are currently active. This
information was then used to initialize the initial state for services that
use BusName=.
This change removes the whole code for this and replaces it with
something vastly simpler.
First of all, the ListNames() call was issued synchronously, which meant
that if dbus was for some reason synchronously calling into PID1 we'd
deadlock. As it turns out, there's now a good chance it does:
the nss-systemd userdb hookup means that any user dbus-daemon resolves
might result in a varlink call into PID 1, and dbus resolves quite a lot
of users while parsing its policy. My original goal was to fix this
deadlock.
But as it turns out we don't need the ListNames() call at all anymore,
since #12957 has been merged. That PR was supposed to fix a race where
asynchronous installation of bus matches would cause us to miss the
initial owner of a bus name when a service is first started. It fixed it
(correctly) by enquiring with GetNameOwner() who currently owns the
name, right after installing the match. But this means that whenever we start
watching a bus name we issue a GetNameOwner() for it anyway, and that means
that when first connecting to the bus we don't need to issue ListNames()
anymore, since that just tells us the same info: which names are currently
owned.
Hence, let's drop ListNames() and instead make better use of the
GetNameOwner() result: if it fails, the name is not owned.
Also, while we are at it, let's simplify the unit's owner_name_changed()
callback: let's drop the "old_owner" argument. We never used it
besides logging, and it's hard to synthesize from just the return of a
GetNameOwner(), hence don't bother.
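For illustration, a minimal sketch of asking the driver asynchronously who owns a name, using the public sd-bus API (a sketch, not the actual PID1 code):

    #include <systemd/sd-bus.h>

    static int get_name_owner_handler(sd_bus_message *reply, void *userdata,
                                      sd_bus_error *ret_error) {
            const char *owner;
            if (sd_bus_message_is_method_error(reply, NULL))
                    return 0;  /* call failed: the name is currently not owned */
            if (sd_bus_message_read(reply, "s", &owner) < 0)
                    return 0;
            /* ... record 'owner' as the current owner of the watched name ... */
            return 0;
    }

    static int watch_name(sd_bus *bus, const char *name) {
            /* Asynchronous, so PID 1 never blocks on dbus-daemon, which may
             * itself be blocked on a synchronous call back into PID 1. */
            return sd_bus_call_method_async(
                            bus, NULL,
                            "org.freedesktop.DBus", "/org/freedesktop/DBus",
                            "org.freedesktop.DBus", "GetNameOwner",
                            get_name_owner_handler, NULL,
                            "s", name);
    }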
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
TasksMax= and DefaultTasksMax= can be specified as percentages. We don't
actually document what the percentage is relative to, but the implementation
uses the smallest of /proc/sys/kernel/pid_max, /proc/sys/kernel/threads-max,
and /sys/fs/cgroup/pids.max (when present). When the value is a percentage,
we immediately convert it to an absolute value. If the limit later changes
(which can happen e.g. when systemd-sysctl runs), the absolute value becomes
outdated.
So let's store either the percentage or the absolute value, whichever was
specified, and only convert to an absolute value when the value is used. For
example, when starting a unit, the absolute value will be calculated when the
cgroup for the unit is created.
Fixes #13419.
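A minimal sketch of the deferred resolution (hypothetical names; the actual systemd type stores a value/scale pair much like this):

    #include <stdint.h>

    struct tasks_max {
            uint64_t value;  /* absolute value, or numerator when scale != 0 */
            uint64_t scale;  /* 0 means 'value' is absolute; "50%" -> value=50, scale=100 */
    };

    /* 'system_limit' is the smallest of pid_max, threads-max and pids.max,
     * read at the time of use (e.g. when the unit's cgroup is created). */
    static uint64_t tasks_max_resolve(const struct tasks_max *t, uint64_t system_limit) {
            if (t->scale == 0)
                    return t->value;
            return system_limit * t->value / t->scale; /* overflow ignored in this sketch */
    }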
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We have the problem that many early boot or late shutdown issues are harder
to solve than they could be because we have no logs. When journald is not
running, messages are redirected to /dev/kmsg. It is also the time when many
things happen in rapid succession, so we tend to hit the kernel printk
ratelimit fairly reliably. The end result is that we get no logs from the time
when they would be most useful. Thus, let's disable the kernel's ratelimit.
Once the system is up and running, the ratelimit is not a problem. During
normal runtime things log to journald, not to /dev/kmsg, so the ratelimit has
no effect anyway. Hence, there doesn't seem to be much point in trying to
restore the ratelimit after boot is finished and journald is up and running.
See the kernel's commit 750afe7babd117daabebf4855da18e4418ea845e for a
description of the kernel interface. Our setting has lower precedence than
explicit configuration on the kernel command line.
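Mechanically, disabling the ratelimit means writing "on" to the sysctl that the kernel commit above introduced; a sketch:

    #include <fcntl.h>
    #include <unistd.h>

    static int disable_kmsg_ratelimit(void) {
            /* If printk.devkmsg= was given on the kernel command line, the
             * kernel locks this sysctl and the write fails, which is how the
             * command line retains precedence. */
            int fd = open("/proc/sys/kernel/printk_devkmsg", O_WRONLY | O_CLOEXEC);
            if (fd < 0)
                    return -1;
            ssize_t n = write(fd, "on\n", 3);  /* "on" = unratelimited writes to /dev/kmsg */
            close(fd);
            return n == 3 ? 0 : -1;
    }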
|
|\
| |
| | |
Rework unit loading to take into account all aliases
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
v2:
- do not watch mtime of transient and generated dirs
We'd reload the map after every transient unit we created, which we don't
need to do, since we create those units ourselves and know their fragment
path.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reworks how we load units from disk. Instead of chasing symlinks every
time we are asked to load a unit by name, we slurp all symlinks from disk
and build two hashmaps:
1. from unit name to either alias target, or fragment on disk
(if an alias, we put just the target name in the hashmap, if a fragment
we put an absolute path, so we can distinguish both).
2. from a unit name to all aliases
Reading all this data can be pretty costly (40 ms on my machine), so we keep
it around for reuse.
The advantage is that we can reliably know what all the aliases of a given unit
are. This means we can reliably load dropins under all names. This fixes #11972.
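A minimal sketch of a lookup through the first map (hypothetical helper names; the real code uses systemd's Hashmap type and tracks alias cycles):

    #include <stddef.h>

    /* Map 1 lookup: returns an alias target (plain unit name) or a
     * fragment (absolute path), or NULL if the name is unknown. */
    extern const char *names_map_get(const char *unit_name); /* hypothetical */

    static const char *resolve_fragment(const char *name) {
            const char *v = name;
            for (;;) {                        /* cycle detection omitted here */
                    const char *next = names_map_get(v);
                    if (!next)
                            return NULL;      /* unknown unit */
                    if (next[0] == '/')
                            return next;      /* absolute path: the fragment */
                    v = next;                 /* plain name: follow the alias */
            }
    }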
|
| |
| |
| |
| |
| |
| |
| | |
This option is only used on reboot, not for other types of shutdown, so
the name is misleading.
Keep the old name working for backward compatibility, but remove it
from the documentation.
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Rather than always enabling the shutdown watchdog on kexec, which might be
dangerous in case the kernel driver and/or the hardware implementation
does not reset the watchdog on kexec, add a new timer, disabled by default,
to let users optionally enable the shutdown watchdog on kexec separately
from the runtime and reboot ones. Advise in the documentation to
also use the runtime watchdog in conjunction with it.
Fixes: a637d0f9ecbe ("core: set shutdown watchdog on kexec too")
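The resulting knobs might be used like this in /etc/systemd/system.conf (option names as in current systemd; values are illustrative):

    [Manager]
    RuntimeWatchdogSec=30s
    RebootWatchdogSec=10min
    # Off by default; only enable if the watchdog driver and hardware are
    # known to keep ticking across kexec:
    KExecWatchdogSec=1min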
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Jobs are added to the run queue in random order. This happens because most
jobs are added by iterating over the transaction or dependency hash maps.
As a result, jobs that can be executed at the same time are started in a
different order each time.
On small embedded devices this can cause a measurable jitter for the point
in time when a job starts (~100ms jitter for 10 units that are started in
random order).
This results in similar jitter for the boot time, which is undesirable in
general and makes optimizing the boot time a lot harder.
Also, jobs that should have a higher priority because the unit has a higher
CPU weight might get executed later than others.
Fix this by turning the job run_queue into a Prioq and sort by the
following criteria (use the next if the values are equal):
- CPU weight
- nice level
- unit type
- unit name
The last one is just there for deterministic sorting to avoid any jitter.
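A minimal sketch of such a comparison function (hypothetical field names; the real code reads these values off the job's unit):

    #include <string.h>
    #include <stdint.h>

    struct job {
            uint64_t cpu_weight;   /* higher weight starts first */
            int nice;              /* lower nice starts first */
            int unit_type;         /* stable tie-breaker */
            const char *unit_name; /* final, fully deterministic tie-breaker */
    };

    static int job_compare(const struct job *a, const struct job *b) {
            if (a->cpu_weight != b->cpu_weight)
                    return a->cpu_weight > b->cpu_weight ? -1 : 1;
            if (a->nice != b->nice)
                    return a->nice < b->nice ? -1 : 1;
            if (a->unit_type != b->unit_type)
                    return a->unit_type < b->unit_type ? -1 : 1;
            return strcmp(a->unit_name, b->unit_name);
    }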
|
|
|
|
|
|
|
| |
No functional change, just docs and configuration and parsing.
v2:
- change ShortIdentifiers=yes|no to StatusUnitFormat=name|description.
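For illustration, the option as it ended up in system.conf (value illustrative):

    [Manager]
    StatusUnitFormat=description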
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When shooting down a service with SIGABRT the user might want to have a
much longer stop timeout than on regular stops/shutdowns. Especially in
the face of short stop timeouts the time might not be sufficient to
write huge core dumps before the service is killed.
This commit adds a dedicated (Default)TimeoutAbortSec= timer that is
used when stopping a service via SIGABRT. In all other cases the
existing TimeoutStopSec= is used. The timer value is unset by default,
i.e. the special handling is skipped and TimeoutStopSec= applies to state
'stop-watchdog', keeping the old behaviour.
If the service is in state 'stop-watchdog' and the service should be
stopped explicitly we still go to 'stop-sigterm' and re-apply the usual
TimeoutStopSec= timeout.
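For example, a service that needs time to write a large core dump on a watchdog abort might use something like:

    [Service]
    WatchdogSec=10s
    TimeoutStopSec=15s
    # Allow up to 5 minutes for the core dump when shot down via SIGABRT:
    TimeoutAbortSec=5min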
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds a new per-service OOMPolicy= (along with a global
DefaultOOMPolicy=) that controls what to do if a process of the service
is killed by the kernel's OOM killer. It has three different values:
"continue" (old behaviour), "stop" (terminate the service), "kill" (let
the kernel kill all the service's processes).
On top of that, track OOM killer events per unit: generate a per-unit
structured, recognizable log message when we see an OOM killer event,
and put the service in a failure state if an OOM killer event was seen
and the selected policy was not "continue". A new "result" is defined
for this case: "oom-kill".
All of this relies on new cgroupv2 kernel functionality: the
"memory.events" notification interface and the "memory.oom.group"
attribute (which makes the kernel kill all cgroup processes
automatically).
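For example, to let the kernel take down the whole service when any of its processes is OOM-killed, and mark the service failed with result "oom-kill":

    [Service]
    OOMPolicy=kill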
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Let's rename the .cgroup_inotify_wd field of the Unit object to
.cgroup_control_inotify_wd. Let's similarly rename the hashmap
.cgroup_inotify_wd_unit of the Manager object to
.cgroup_control_inotify_wd_unit.
Why? As preparation for a later commit that allows us to watch the
"memory.events" cgroup attribute file in addition to the "cgroup.events"
file we already watch with the fields above. In that later commit we'll
add new fields "cgroup_memory_inotify_wd" to Unit and
"cgroup_memory_inotify_wd_unit" to Manager, which are used to watch this
other events file.
No change in behaviour. Just some renaming.
|
|\
| |
| | |
core: on switching root do not emit device state change based on enumeration results
|
| |
| |
| |
| |
| |
| |
| |
| | |
When the system manager is started for the first time, or after switching
root, udev's device tag data does not exist yet.
So, let's not honor the enumeration results.
Fixes #11997.
|
| | |
|
|/
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Some PIDs can remain in the watched list even though their processes have
exited a long time ago. This can easily happen if the main process of a forking
service manages to spawn a child before the control process exits, for example.
However, when a PID is about to be mapped to a unit by calling unit_watch_pid(),
the caller usually knows whether the PID should belong to this unit exclusively:
if we just forked() off a child, then we can be sure that its PID is otherwise
unused. In this case we take this opportunity to remove any stalled PIDs from
the watched process list.
If we learnt about a PID in any other way (for example via PID file, via
searching, MAINPID= and so on), then we can't assume anything.
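A minimal sketch of the "exclusive" hint (the real entry point is unit_watch_pid() in systemd core; the helpers here are hypothetical):

    #include <stdbool.h>
    #include <sys/types.h>

    struct unit;
    extern void manager_unwatch_pid_everywhere(pid_t pid);   /* hypothetical */
    extern int watch_pid_in_unit(struct unit *u, pid_t pid); /* hypothetical */

    int unit_watch_pid(struct unit *u, pid_t pid, bool exclusive) {
            /* If we just forked off this PID ourselves, it cannot legitimately
             * be watched by any other unit: evict stale entries first. */
            if (exclusive)
                    manager_unwatch_pid_everywhere(pid);
            return watch_pid_in_unit(u, pid);
    }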
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit 89f9752ea08f516b5d77f8e577bb772073c70c01.
This patch causes various problems during boot, where a "mount storm" occurs
naturally. The current approach is flaky, and it seems very risky to push a
feature like this, which impacts boot, right before a release. So let's revert
for now, and consider a more robust solution later.
Fixes #11209.
> https://github.com/systemd/systemd/pull/11196#issuecomment-448523186:
"Reverting 89f9752ea08f516b5d77f8e577bb772073c70c01 and fcfb1f775ed0e9d282607bb118ba788b98952855 fixes this test."
|
|
|
|
| |
This reverts commit fcfb1f775ed0e9d282607bb118ba788b98952855.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The starting of mount units requires that changes to
/proc/self/mountinfo be processed before the SIGCHLD from the
completion of /sbin/mount is processed, as described by the comment
/* Note that due to the io event priority logic, we can be sure the new mountinfo is loaded
* before we process the SIGCHLD for the mount command. */
The recently-added mount-storm protection can defeat this as it
will sometimes deliberately delay processing of /proc/self/mountinfo.
So we need to disable mount-storm protection when a mount unit is starting.
We do this by keeping a counter of the number of pending
mounts, and disabling the protection when this is non-zero.
Thanks to @asavah for finding and reporting this problem.
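A minimal sketch of the counter-based gate (hypothetical names; the real counter lives in systemd's Manager object):

    extern int storm_protection_wants_delay(void); /* hypothetical */

    static unsigned n_pending_mounts; /* incremented when a mount unit starts,
                                       * decremented once mountinfo is processed */

    static int may_postpone_mountinfo(void) {
            /* Never delay mountinfo processing while a mount unit is starting:
             * its SIGCHLD must not overtake the mountinfo event. */
            if (n_pending_mounts > 0)
                    return 0;
            return storm_protection_wants_delay();
    }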
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we create 2000 mounts (on a 1-CPU qemu VM) with
mkdir -p /MNT/{1..2000}
time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done
it takes around 20 seconds to complete. Much of this time is taken up
by systemd repeatedly processing /proc/self/mountinfo.
If I disable the processing, the time drops to about 4 seconds.
I have reports that on a larger system with multiple active user sessions, each
with its own systemd, the impact can be higher.
One particular use-case where a large number of mounts can be expected in quick
succession is when the "clearcase" SCM starts up.
This patch modifies the handling of events from /proc/self/mountinfo so
that systemd backs off when a storm is detected. Specifically, the time taken
to process mountinfo is measured, and processing will not be repeated until 10
times that duration has passed. This ensures systemd won't use more than 10% of
real time processing mountinfo.
With this patch, my test above takes about 5 seconds.
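A minimal sketch of the back-off (hypothetical names; the real code hooks into PID1's event loop):

    #include <stdint.h>
    #include <time.h>

    static uint64_t now_usec(void) {
            struct timespec ts;
            clock_gettime(CLOCK_MONOTONIC, &ts);
            return (uint64_t) ts.tv_sec * 1000000 + (uint64_t) ts.tv_nsec / 1000;
    }

    static uint64_t next_allowed; /* earliest time we reprocess mountinfo */

    static int may_process_mountinfo(void) {
            return now_usec() >= next_allowed;
    }

    static void process_mountinfo_timed(void (*process)(void)) {
            uint64_t start = now_usec();
            process();
            /* Wait 10x the time processing took, i.e. spend at most ~10% of
             * real time on mountinfo. */
            next_allowed = now_usec() + 10 * (now_usec() - start);
    }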
|
|
|
|
|
|
|
|
|
|
|
| |
Memory management is borked for this, and moreover this is unnecessary
since f0831ed2a03, i.e. since coldplug() and catchup() are two different
concepts: the former restores the state from before a reload, the latter
then adjusts it again to the actual status in effect after the reload.
Fixes: #10716
Mostly reverts: #8803
|
|
|
|
|
|
|
|
|
|
|
| |
Otherwise we keep collecting stuff from env generators, and we really
shouldn't.
This was working properly on reexec but not on reload: for reexec we
would always start fresh, but on reload we would reuse the Manager object
and hence its default environment set.
Fixes: #10671
|
|
|
|
|
|
|
|
|
| |
We don't dispatch the queue recursively anymore, hence let's simplify
things a bit.
As pointed out by @fbuihuu:
https://github.com/systemd/systemd/pull/10763#discussion_r233209550
|
|
|
|
|
|
|
| |
This field is only used for pending Reload() replies, hence let's rename
it to be more descriptive and precise.
No change in behaviour.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This splits the "environment" field of Manager into two:
transient_environment and client_environment. The former is generated
from the configuration file, the kernel command line, and environment
generators. The latter is the one the user can control with "systemctl
set-environment" and similar.
Both sets are merged transparently whenever needed. Separating the two
sets has the benefit that we can safely flush out the former while
keeping the latter during daemon reload cycles, so that env var settings
from env generators or configuration files do not accumulate, but
dynamic API changes are kept around.
Note that this change is not entirely transparent to users: if the user
first uses "set-environment" to override a transient variable, and then
uses "unset-environment" to unset it again, things will now revert to the
original transient variable, while previously the variable was fully
removed. This change in behaviour should not matter too much, though, I
figure.
Fixes: #9972
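A minimal sketch of the split and merge (all names hypothetical; the real code uses systemd's strv helpers):

    struct manager_env {
            char **transient_environment; /* config file, kernel cmdline, env generators */
            char **client_environment;    /* systemctl set-environment and friends */
    };

    /* Hypothetical: merges two NULL-terminated arrays of KEY=VALUE
     * assignments; entries from 'higher' win on conflicting keys. */
    extern char **env_merge(char **lower, char **higher);

    static char **manager_effective_environment(struct manager_env *m) {
            /* On daemon-reload only transient_environment is rebuilt;
             * client_environment is carried over unchanged. */
            return env_merge(m->transient_environment, m->client_environment);
    }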
|
| |
|
|
|
|
|
|
| |
Let's make them typesafe, and let's add a nice macro helper for checking
if we are in a test run, which should make testing for this much easier
to read for most cases.
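A minimal sketch of the result (the macro name matches systemd's MANAGER_IS_TEST_RUN(); the flag values here are illustrative):

    typedef enum ManagerTestRunFlags {
            MANAGER_TEST_RUN_MINIMAL        = 1 << 1, /* bit 0 unused at the
                                                       * time; see the later
                                                       * commit below */
            MANAGER_TEST_RUN_ENV_GENERATORS = 1 << 2,
    } ManagerTestRunFlags;

    struct manager { ManagerTestRunFlags test_run_flags; };

    /* Reads naturally at call sites: if (MANAGER_IS_TEST_RUN(m)) ... */
    #define MANAGER_IS_TEST_RUN(m) ((m)->test_run_flags != 0)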
|
|
|
|
|
|
|
| |
No reason to avoid bit 0.
Also, fix some tests that pass "true" as flags value, which is just
wrong.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
"ExitCode" is a bit of a misnomer in two ways: it suggests this was
about the "exit code" concept that exit()/waitid() deal with, but really
isn't. Moreover, it's not even just about exiting either, but more
often about reloading/reexecing or rebooting. Let's hence pick a new
name for this that is a bit more correct.
I initially thought about naming this the "state", but that'd be a
misnomer too, as the value really encodes a "goal" more than a current
state. Also we already have the externally visible ManagerState.
No actual changes in behaviour, just the rename.
|
| |
|
| |
|
|\
| |
| | |
rework StopWhenUnneeded=1 logic
|