path: root/src/core/manager.h
Commit message [Author, Age, Files, Lines]
* license: LGPL-2.1+ -> LGPL-2.1-or-later [Yu Watanabe, 2020-11-09, 1 file, -1/+1]
|
* core: add varlink call to get cgroup paths of units using ManagedOOM*= [Anita Zhang, 2020-10-08, 1 file, -0/+2]
|
* Rework how we cache mtime to figure out if units changed [Zbigniew Jędrzejewski-Szmek, 2020-08-31, 1 file, -1/+1]
| Instead of assuming that more-recently modified directories have higher mtime, just look for any mtime changes, up or down. Since we don't want to remember individual mtimes, hash them to obtain a single value.
|
| This should help us behave properly in the case when the time jumps backwards during boot: various files might have mtimes that are in the future, but we won't care. This fixes the following scenario:
|
| We have /etc/systemd/system with T1. T1 is initially far in the past. We have /run/systemd/generator with time T2. The time is adjusted backwards, so T2 will always be in the future for a while. Now the user writes new files to /etc/systemd/system, and T1 is updated to T1'. Nevertheless, T1 < T1' << T2. We would consider our cache to be up-to-date, falsely.
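The idea of hashing all directory mtimes into one value can be sketched as follows. This is a minimal illustration, not the systemd implementation: the function names `mtime_hash` and `cache_is_fresh` are hypothetical, and SHA-256 is an assumed stand-in because the commit message does not name the hash function.

```python
import hashlib
import os


def mtime_hash(dirs):
    """Fold the mtimes of all existing directories into a single digest."""
    h = hashlib.sha256()  # assumption: any stable hash would do here
    for d in dirs:
        try:
            st = os.stat(d)
        except FileNotFoundError:
            continue  # a missing directory simply contributes nothing
        h.update(d.encode())
        h.update(st.st_mtime_ns.to_bytes(16, "little", signed=True))
    return h.digest()


def cache_is_fresh(cached_hash, dirs):
    """The cache is valid only if no mtime changed, in either direction."""
    return cached_hash == mtime_hash(dirs)
```

Because only equality is tested, a directory whose mtime moves backwards invalidates the cache just as reliably as one whose mtime moves forwards, which is the property the scenario with T1, T1' and T2 needs.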
* core: rename manager_unit_file_maybe_loadable_from_cache() [Zbigniew Jędrzejewski-Szmek, 2020-08-31, 1 file, -1/+1]
| The name is misleading, since we aren't really loading the unit from cache; if this function returns true, we'll try to load the unit from disk, updating the cache in the process.
* core: add credentials logic [Lennart Poettering, 2020-08-25, 1 file, -0/+1]
| Fixes: #15778 #16060
* core: refresh unit cache when building a transaction if UNIT_NOT_FOUND [Luca Boccassi, 2020-07-07, 1 file, -0/+1]
| When a command asks to load a unit directly and it is in state UNIT_NOT_FOUND, and the cache is outdated, we refresh it and attempt to load again. Use the same logic when building up a transaction and a dependency in UNIT_NOT_FOUND state is encountered.
|
| Update the unit test to exercise this code path.
* pid1: rename manager_set_{show_status,watchdog}_overridden() into manager_override_{show_status,watchdog} [Franck Bui, 2020-06-11, 1 file, -2/+2]
| No functional change.
* pid1: rework handling of m->show_status [Franck Bui, 2020-06-09, 1 file, -0/+4]
| The fact that m->show_status was serialized/deserialized made any further customisation of this setting via system.conf impossible. IOW the value was basically always locked unless it was changed via signals.
|
| This patch reworks the handling of m->show_status but also makes sure that if a new value was set via the signal API then this value is kept and preserved across PID1 reexecuting or reloading.
|
| Note: this effectively means that once the value is set via the signal interface, it can be changed again only through the signal API.
* pid1: make manager_deserialize_{uid,gid}_refs() static [Franck Bui, 2020-05-19, 1 file, -3/+0]
| No functional change.
* pid1: make manager_serialize_{uid,gid}_refs() static [Franck Bui, 2020-05-19, 1 file, -3/+0]
| No functional change.
* pid1: make manager_vacuum_{uid,gid}_refs() static [Franck Bui, 2020-05-19, 1 file, -3/+0]
| No functional change.
* pid1: make manager_flip_auto_status() static [Franck Bui, 2020-05-19, 1 file, -1/+0]
| No functional change.
* pid1: update manager settings on reload too [Franck Bui, 2020-05-19, 1 file, -3/+13]
| Most complexity of this patch is due to the fact that some manager settings (basically the watchdog properties) can be set at runtime and in this case the runtime values must be retained over daemon-reload or daemon-reexec.
|
| For consistency's sake, all watchdog properties now behave the same way, that is:
|
| - Values defined by config files can be overridden by writing the new value through their respective D-Bus properties. In this case, these values are preserved over reload/reexec until the special value '0' or USEC_INFINITY is written, which will then restore the last values loaded from the config files. If the restored value is '0' or 'USEC_INFINITY', the watchdogs will be disabled and the corresponding device will be closed.
|
| - Reading the properties from a user instance will return the USEC_INFINITY value as these properties are only meaningful for PID1.
|
| - Writing to one of the watchdog properties of a user instance will be a NOP.
|
| Fixes: #15453
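The override rules listed above can be modelled roughly as below. This is a sketch under assumptions: the class `WatchdogSetting` and its method names are hypothetical, and the numeric value chosen for `USEC_INFINITY` (the maximum unsigned 64-bit integer) is assumed rather than taken from the commit.

```python
USEC_INFINITY = 2**64 - 1  # assumed sentinel value, for illustration only


class WatchdogSetting:
    """One watchdog property with a config value and an optional runtime override."""

    def __init__(self, configured):
        self.configured = configured  # last value loaded from config files
        self.override = None          # value written at runtime, if any

    def write(self, value):
        # Writing 0 or USEC_INFINITY drops the override and restores the config value.
        if value in (0, USEC_INFINITY):
            self.override = None
        else:
            self.override = value

    def load_config(self, value):
        # A reload/reexec re-reads the config but leaves the override untouched.
        self.configured = value

    def effective(self):
        return self.override if self.override is not None else self.configured

    def enabled(self):
        # A restored value of 0 or USEC_INFINITY means the watchdog is disabled.
        return self.effective() not in (0, USEC_INFINITY)
```

Keeping the two values in separate fields is what lets a reload refresh the configured value without losing what was written at runtime.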
* pid1: when showing error status, do not switch to status=temporary [Zbigniew Jędrzejewski-Szmek, 2020-03-01, 1 file, -0/+1]
| We would flip to status=temporary mode on the first error, and then switch back to status=auto after the initial transaction was done. This isn't very useful, because usually all the messages about successfully started units are not related to the original failure. In fact, all those messages most likely cause the information about the prime error to scroll off screen. And if the user requested quiet boot, there's no reason to think that they care about those success messages.
|
| Also, when logging about dependency cycles, treat this similarly to a unit error and show the message even if the status is "soft disabled" (before we wouldn't show it in that case).
* pid1: when printing status message status, give reason [Zbigniew Jędrzejewski-Szmek, 2020-03-01, 1 file, -2/+2]
|
* core: add user/group resolution varlink interface to PID 1 [Lennart Poettering, 2020-01-15, 1 file, -0/+3]
|
* core: drop initial ListNames() bus call from PID 1 [Lennart Poettering, 2020-01-06, 1 file, -2/+0]
| Previously, when first connecting to the bus we'd issue a ListNames() bus call to the driver to figure out which bus names are currently active. This information was then used to initialize the initial state for services that use BusName=.
|
| This change removes the whole code for this and replaces it with something vastly simpler.
|
| First of all, the ListNames() call was issued synchronously, which meant if dbus was for some reason synchronously calling into PID1 we'd deadlock. As it turns out there's now a good chance it does: the nss-systemd userdb hookup means that any user dbus-daemon resolves might result in a varlink call into PID 1, and dbus resolves quite a lot of users while parsing its policy. My original goal was to fix this deadlock.
|
| But as it turns out we don't need the ListNames() call at all anymore, since #12957 has been merged. That PR was supposed to fix a race where asynchronous installation of bus matches would cause us to miss the initial owner of a bus name when a service is first started. It fixed it (correctly) by enquiring with GetOwnerName() who currently owns the name, right after installing the match. But this means whenever we start watching a bus name we issue a GetOwnerName() for it anyway, and that means also when first connecting to the bus we don't need to issue ListNames() anymore since that just tells us the same info: which names are currently owned.
|
| Hence, let's drop ListNames() and instead make better use of the GetOwnerName() result: if it failed the name is not owned.
|
| Also, while we are at it, let's simplify the unit's owner_name_changed() callback: let's drop the "old_owner" argument. We never used that besides logging, and it's hard to synthesize from just the return of a GetOwnerName(), hence don't bother.
* core: make TasksMax a partially dynamic property [Zbigniew Jędrzejewski-Szmek, 2019-11-14, 1 file, -1/+2]
| TasksMax= and DefaultTasksMax= can be specified as percentages. We don't actually document what the percentage is relative to, but the implementation uses the smallest of /proc/sys/kernel/pid_max, /proc/sys/kernel/threads-max, and /sys/fs/cgroup/pids.max (when present). When the value is a percentage, we immediately convert it to an absolute value. If the limit later changes (which can happen e.g. when systemd-sysctl runs), the absolute value becomes outdated.
|
| So let's store either the percentage or absolute value, whatever was specified, and only convert to an absolute value when the value is used. For example, when starting a unit, the absolute value will be calculated when the cgroup for the unit is created.
|
| Fixes #13419.
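Storing the value as specified and resolving it only on use might look like the sketch below. The `TasksMax` class and its `resolve` method are hypothetical names; the caller is assumed to pass in whichever of the three kernel limits it could read at that moment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TasksMax:
    """Either an absolute task count or a percentage, kept exactly as configured."""

    value: int
    is_percent: bool = False

    def resolve(self, limits):
        """Convert to an absolute count using the limits that apply right now.

        `limits` holds the currently readable limits (pid_max, threads-max and,
        when present, pids.max); the smallest one is the base for percentages.
        """
        if not self.is_percent:
            return self.value
        return min(limits) * self.value // 100
```

For example, `TasksMax(15, is_percent=True)` resolves against whatever the smallest limit is when the unit's cgroup is created, so a later change made by systemd-sysctl is picked up on the next resolution instead of being frozen in at parse time.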
* pid1: disable printk ratelimit in early boot [Zbigniew Jędrzejewski-Szmek, 2019-09-20, 1 file, -0/+1]
| We have the problem that many early boot or late shutdown issues are harder to solve than they could be because we have no logs. When journald is not running, messages are redirected to /dev/kmsg. It is also the time when many things happen in rapid succession, so we tend to hit the kernel printk ratelimit fairly reliably. The end result is that we get no logs from the time where they would be most useful. Thus let's disable the kernel's ratelimit.
|
| Once the system is up and running, the ratelimit is not a problem. But during normal runtime, things also log to journald, and not to /dev/kmsg, so the ratelimit is not useful. Hence, there doesn't seem to be much point in trying to restore the ratelimit after boot is finished and journald is up and running.
|
| See kernel's commit 750afe7babd117daabebf4855da18e4418ea845e for the description of the kernel interface. Our setting has lower precedence than explicit configuration on the kernel command line.
* Merge pull request #13119 from keszybz/unit-loading-2 [Lennart Poettering, 2019-07-30, 1 file, -0/+3]
|\
| | Rework unit loading to take into account all aliases
| * pid1: drop unit caches only based on mtime [Zbigniew Jędrzejewski-Szmek, 2019-07-30, 1 file, -0/+1]
| | v2:
| | - do not watch mtime of transient and generated dirs
| |
| |   We'd reload the map after every transient unit we created, which we don't need to do, since we create those units ourselves and know their fragment path.
| * pid1: use a cache for all unit aliases [Zbigniew Jędrzejewski-Szmek, 2019-07-30, 1 file, -0/+2]
| | This reworks how we load units from disk. Instead of chasing symlinks every time we are asked to load a unit by name, we slurp all symlinks from disk and build two hashmaps:
| | 1. from unit name to either alias target, or fragment on disk (if an alias, we put just the target name in the hashmap, if a fragment we put an absolute path, so we can distinguish both).
| | 2. from a unit name to all aliases
| |
| | Reading all this data can be pretty costly (40 ms) on my machine, so we keep it around for reuse.
| |
| | The advantage is that we can reliably know what all the aliases of a given unit are. This means we can reliably load dropins under all names. This fixes #11972.
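The two hashmaps described above can be sketched as follows. Everything here is illustrative: `build_unit_maps` and `resolve` are hypothetical names, the input is a pre-collected dict rather than a real directory scan, and "first entry wins" is an assumed stand-in for the real search-path priority.

```python
import os.path


def resolve(unit_ids, name):
    """Follow alias entries until reaching a name that maps to a fragment path."""
    seen = set()
    while name in unit_ids and not os.path.isabs(unit_ids[name]):
        if name in seen:
            raise ValueError(f"alias loop at {name}")
        seen.add(name)
        name = unit_ids[name]
    return name


def build_unit_maps(files):
    """`files` maps an absolute unit file path to its symlink target (None for a regular file)."""
    # Map 1: unit name -> target name (alias) or absolute path (fragment).
    unit_ids = {}
    for path, target in files.items():
        name = os.path.basename(path)
        if name in unit_ids:
            continue  # assumption: earlier entries have higher priority
        unit_ids[name] = path if target is None else os.path.basename(target)

    # Map 2: final unit name -> every name that resolves to it.
    unit_names = {}
    for name in unit_ids:
        unit_names.setdefault(resolve(unit_ids, name), set()).add(name)
    return unit_ids, unit_names
```

Since an alias entry holds a bare name and a fragment entry holds an absolute path, a single `os.path.isabs` check is enough to tell the two apart, which mirrors the distinction the commit message describes.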
* | core: rename ShutdownWatchdogSec to RebootWatchdogSec [Luca Boccassi, 2019-07-23, 1 file, -1/+1]
| | This option is only used on reboot, not on other types of shutdown modes, so the name is misleading. Keep the old name working for backward compatibility, but remove it from the documentation.
* | core: add KExecWatchdogSec option [Luca Boccassi, 2019-07-23, 1 file, -0/+1]
| | Rather than always enabling the shutdown WD on kexec, which might be dangerous in case the kernel driver and/or the hardware implementation does not reset the wd on kexec, add a new timer, disabled by default, to let users optionally enable the shutdown WD on kexec separately from the runtime and reboot ones. Advise in the documentation to also use the runtime WD in conjunction with it.
| |
| | Fixes: a637d0f9ecbe ("core: set shutdown watchdog on kexec too")
* | job: make the run queue order deterministic [Michael Olbrich, 2019-07-18, 1 file, -1/+2]
|/
| Jobs are added to the run queue in random order. This happens because most jobs are added by iterating over the transaction or dependency hash maps.
|
| As a result, jobs that can be executed at the same time are started in a different order each time. On small embedded devices this can cause a measurable jitter for the point in time when a job starts (~100ms jitter for 10 units that are started in random order). This results in a similar jitter for the boot time. This is undesirable in general and makes optimizing the boot time a lot harder.
|
| Also, jobs that should have a higher priority because the unit has a higher CPU weight might get executed later than others.
|
| Fix this by turning the job run_queue into a Prioq and sort by the following criteria (use the next if the values are equal):
| - CPU weight
| - nice level
| - unit type
| - unit name
|
| The last one is just there for deterministic sorting to avoid any jitter.
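The ordering criteria above can be illustrated with a small priority queue. This is a sketch with several assumptions: `run_queue_key` and `drain` are hypothetical names, jobs are plain dicts, lower nice values are assumed to run first, and unit types are compared as strings rather than by whatever internal ordering the real code uses.

```python
import heapq


def run_queue_key(job):
    """Sort key: higher CPU weight first, then nice level, unit type, unit name."""
    return (-job["cpu_weight"], job["nice"], job["unit_type"], job["unit_name"])


def drain(jobs):
    """Push all jobs into a heap and pop them in priority order."""
    heap = [(run_queue_key(job), job["unit_name"]) for job in jobs]
    heapq.heapify(heap)
    order = []
    while heap:
        order.append(heapq.heappop(heap)[1])
    return order
```

Because the unit name is the final tie-breaker and names are unique, no two jobs ever compare equal, so the dispatch order no longer depends on the order in which jobs were inserted.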
* Add config and kernel commandline option to use short identifiers [Zbigniew Jędrzejewski-Szmek, 2019-07-10, 1 file, -0/+1]
| No functional change, just docs and configuration and parsing.
|
| v2:
| - change ShortIdentifiers=yes|no to StatusUnitFormat=name|description.
* core: add assertion in two inline functions [Yu Watanabe, 2019-04-14, 1 file, -0/+1]
|
* service: handle abort stops with dedicated timeout [Jan Klötzke, 2019-04-12, 1 file, -0/+6]
| When shooting down a service with SIGABRT the user might want to have a much longer stop timeout than on regular stops/shutdowns. Especially in the face of short stop timeouts the time might not be sufficient to write huge core dumps before the service is killed.
|
| This commit adds a dedicated (Default)TimeoutAbortSec= timer that is used when stopping a service via SIGABRT. In all other cases the existing TimeoutStopSec= is used. The timer value is unset by default to skip the special handling and use TimeoutStopSec= for state 'stop-watchdog' to keep the old behaviour.
|
| If the service is in state 'stop-watchdog' and the service should be stopped explicitly we still go to 'stop-sigterm' and re-apply the usual TimeoutStopSec= timeout.
* core: implement OOMPolicy= and watch cgroups for OOM killings [Lennart Poettering, 2019-04-09, 1 file, -0/+21]
| This adds a new per-service OOMPolicy= (along with a global DefaultOOMPolicy=) that controls what to do if a process of the service is killed by the kernel's OOM killer. It has three different values: "continue" (old behaviour), "stop" (terminate the service), "kill" (let the kernel kill all the service's processes).
|
| On top of that, track OOM killer events per unit: generate a per-unit structured, recognizable log message when we see an OOM killer event, and put the service in a failure state if an OOM killer event was seen and the selected policy was not "continue". A new "result" is defined for this case: "oom-kill".
|
| All of this relies on new cgroupv2 kernel functionality: the "memory.events" notification interface and the "memory.oom.group" attribute (which makes the kernel kill all cgroup processes automatically).
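The three policy values and the resulting service state can be summarised in a small decision function. This is only a reading of the commit message, not the actual service code: `OOMPolicy`, `handle_oom_event` and `wants_oom_group` are hypothetical names.

```python
from enum import Enum


class OOMPolicy(Enum):
    CONTINUE = "continue"  # old behaviour: keep the service running
    STOP = "stop"          # the manager terminates the service
    KILL = "kill"          # the kernel kills all processes of the cgroup


def handle_oom_event(policy):
    """Return (result, manager_stops_service) for an observed OOM kill."""
    if policy is OOMPolicy.CONTINUE:
        return (None, False)
    # Any policy other than "continue" puts the service into the "oom-kill" result.
    return ("oom-kill", policy is OOMPolicy.STOP)


def wants_oom_group(policy):
    """Only "kill" relies on the kernel's memory.oom.group attribute."""
    return policy is OOMPolicy.KILL
```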
* core: rename cgroup_inotify_wd → cgroup_control_inotify_wd [Lennart Poettering, 2019-04-09, 1 file, -1/+1]
| Let's rename the .cgroup_inotify_wd field of the Unit object to .cgroup_control_inotify_wd. Let's similarly rename the hashmap .cgroup_inotify_wd_unit of the Manager object to .cgroup_control_inotify_wd_unit.
|
| Why? As preparation for a later commit that allows us to watch the "memory.events" cgroup attribute file in addition to the "cgroup.events" file we already watch with the fields above. In that later commit we'll add new fields "cgroup_memory_inotify_wd" to Unit and "cgroup_memory_inotify_wd_unit" to Manager, that are used to watch these other events file.
|
| No change in behaviour. Just some renaming.
* Merge pull request #12013 from yuwata/fix-switchroot-11997 [Zbigniew Jędrzejewski-Szmek, 2019-04-02, 1 file, -0/+2]
|\
| | core: on switching root do not emit device state change based on enumeration results
| * core: add Manager::honor_device_enumeration flag [Yu Watanabe, 2019-03-15, 1 file, -0/+2]
| | When the system manager is started for the first time or after switching root, udev's device tag data does not exist yet. So, let's not honor the enumeration results.
| |
| | Fixes #11997.
* | core: add new API for enqueuing a job with returning the transaction data [Lennart Poettering, 2019-03-27, 1 file, -3/+3]
| |
* | core: reduce the number of stalled PIDs from the watched processes list when possible [Franck Bui, 2019-03-20, 1 file, -0/+2]
|/
| Some PIDs can remain in the watched list even though their processes exited a long time ago. It can easily happen if the main process of a forking service manages to spawn a child before the control process exits for example.
|
| However when a pid is about to be mapped to a unit by calling unit_watch_pid(), the caller usually knows if the pid should belong to this unit exclusively: if we just forked() off a child, then we can be sure that its PID is otherwise unused. In this case we take this opportunity to remove any stalled PIDs from the watched process list.
|
| If we learnt about a PID in any other form (for example via PID file, via searching, MAINPID= and so on), then we can't assume anything.
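The exclusive-watch idea can be sketched with a simple PID-to-units map. The `PidWatcher` class and its `exclusive` parameter are hypothetical; the commit only names `unit_watch_pid()` and does not show its signature.

```python
class PidWatcher:
    """Tracks which units are interested in which PIDs."""

    def __init__(self):
        self.watchers = {}  # pid -> set of unit names

    def watch_pid(self, unit, pid, exclusive):
        if exclusive:
            # We just forked this PID ourselves, so any older entry must be stale.
            self.watchers.pop(pid, None)
        # A PID learnt via PID file, MAINPID= etc. may be shared, so keep old entries.
        self.watchers.setdefault(pid, set()).add(unit)
```

The cleanup is opportunistic: stale entries are removed only at the moment the caller can prove the PID was freshly allocated, and nothing is assumed otherwise.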
* Revert "core/mount: minimize impact on mount storm." [Zbigniew Jędrzejewski-Szmek, 2018-12-19, 1 file, -3/+0]
| This reverts commit 89f9752ea08f516b5d77f8e577bb772073c70c01.
|
| This patch causes various problems during boot, where a "mount storm" occurs naturally. Current approach is flaky, and it seems very risky to push a feature like this which impacts boot right before a release. So let's revert for now, and consider a more robust solution later.
|
| Fixes #11209.
|
| > https://github.com/systemd/systemd/pull/11196#issuecomment-448523186: "Reverting 89f9752ea08f516b5d77f8e577bb772073c70c01 and fcfb1f775ed0e9d282607bb118ba788b98952855 fixes this test."
* Revert "mount: disable mount-storm protection while mount unit is starting." [Zbigniew Jędrzejewski-Szmek, 2018-12-19, 1 file, -1/+0]
| This reverts commit fcfb1f775ed0e9d282607bb118ba788b98952855.
* mount: disable mount-storm protection while mount unit is starting. [NeilBrown, 2018-12-19, 1 file, -0/+1]
| The starting of mount units requires that changes to /proc/self/mountinfo be processed before the SIGCHLD from the completion of /sbin/mount is processed, as described by the comment
|
|   /* Note that due to the io event priority logic, we can be sure the new mountinfo is loaded
|    * before we process the SIGCHLD for the mount command. */
|
| The recently-added mount-storm protection can defeat this as it will sometimes deliberately delay processing of /proc/self/mountinfo.
|
| So we need to disable mount-storm protection when a mount unit is starting. We do this by keeping a counter of the number of pending mounts, and disabling the protection when this is non-zero.
|
| Thanks to @asavah for finding and reporting this problem.
* core/mount: minimize impact on mount storm. [NeilBrown, 2018-12-16, 1 file, -0/+3]
| If we create 2000 mounts (on a 1-CPU qemu VM) with
|
|   mkdir -p /MNT/{1..2000}
|   time for i in {1..2000}; do mount --bind /etc /MNT/$i ; done
|
| it takes around 20 seconds to complete. Much of this time is taken up by systemd repeatedly processing /proc/self/mountinfo. If I disable the processing, the time drops to about 4 seconds.
|
| I have reports that on a larger system with multiple active user sessions, each with its own systemd, the impact can be higher.
|
| One particular use-case where a large number of mounts can be expected in quick succession is when the "clearcase" SCM starts up.
|
| This patch modifies the handling of events from /proc/self/mountinfo so that systemd backs off when a storm is detected. Specifically the time to process mountinfo is measured, and the process will not be repeated until 10 times that duration has passed. This ensures systemd won't use more than 10% of real time processing mountinfo.
|
| With this patch, my test above takes about 5 seconds.
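The back-off rule in this commit message (which the log shows was later reverted) can be sketched as a small throttle. `MountinfoThrottle` and its methods are hypothetical names, and the injectable `clock` parameter exists only to make the sketch testable.

```python
import time


class MountinfoThrottle:
    """After each processing run, wait 10 times its duration before the next one."""

    FACTOR = 10

    def __init__(self, clock=time.monotonic):
        self.clock = clock
        self.not_before = 0.0  # earliest time the next run is allowed

    def may_process(self):
        return self.clock() >= self.not_before

    def process(self, handler):
        start = self.clock()
        handler()  # e.g. re-read and diff /proc/self/mountinfo
        end = self.clock()
        self.not_before = end + self.FACTOR * (end - start)
```

A run of duration d is followed by an idle period of at least 10·d, so processing takes at most d out of every 11·d of wall-clock time, which stays under the 10% bound stated above.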
* core: don't track jobs-finishing-during-reload explicitly [Lennart Poettering, 2018-12-12, 1 file, -3/+0]
| Memory management is borked for this, and moreover this is unnecessary since f0831ed2a03, i.e. since coldplug() and catchup() are two different concepts: the former restoring the state from before a reload, the latter then adjusting it again to the actual status in effect after the reload.
|
| Fixes: #10716
| Mostly reverts: #8803
* main: when reloading PID 1 let's reset the default environment [Lennart Poettering, 2018-11-19, 1 file, -0/+1]
| Otherwise we keep collecting stuff from env generators, and we really shouldn't.
|
| This was working properly on reexec but not on reload, as for reexec we would always start fresh, but for reload would reuse the Manager object and hence its default environment set.
|
| Fixes: #10671
* core: drop dbus queue recursion check [Lennart Poettering, 2018-11-14, 1 file, -1/+0]
| We don't dispatch the queue recursively anymore, hence let's simplify things a bit.
|
| As pointed out by @fbuihuu:
|
| https://github.com/systemd/systemd/pull/10763#discussion_r233209550
* core: rename queued_message → pending_reload_message [Lennart Poettering, 2018-11-13, 1 file, -1/+1]
| This field is only used for pending Reload() replies, hence let's rename it to be more descriptive and precise.
|
| No change in behaviour.
* core: split environment block maintained by PID 1's Manager object in two [Lennart Poettering, 2018-10-31, 1 file, -2/+6]
| This splits the "environment" field of Manager into two: transient_environment and client_environment. The former is generated from configuration files, the kernel cmdline and environment generators. The latter is the one the user can control with "systemctl set-environment" and similar.
|
| Both sets are merged transparently whenever needed. Separating the two sets has the benefit that we can safely flush out the former while keeping the latter during daemon reload cycles, so that env var settings from env generators or configuration files do not accumulate, but dynamic API changes are kept around.
|
| Note that this change is not entirely transparent to users: if the user first uses "set-environment" to override a transient variable, and then uses "unset-environment" to unset it again, things will now revert to the original transient variable, while previously the variable was fully removed. This change in behaviour should not matter too much though I figure.
|
| Fixes: #9972
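The split into two environment sets, and the user-visible change it causes, can be illustrated as below. `ManagerEnv` and its methods are hypothetical; only the two field names transient_environment and client_environment come from the commit message, and the rule that client entries win on merge is inferred from the override example in the text.

```python
class ManagerEnv:
    """Two-layer environment: regenerated transient values plus client overrides."""

    def __init__(self):
        self.transient_environment = {}  # config files, kernel cmdline, generators
        self.client_environment = {}     # "systemctl set-environment" and similar

    def set_environment(self, key, value):
        self.client_environment[key] = value

    def unset_environment(self, key):
        self.client_environment.pop(key, None)

    def reload(self, fresh_transient):
        # Flush the transient layer on reload; client settings survive untouched.
        self.transient_environment = dict(fresh_transient)

    def effective(self):
        # Client entries take precedence over transient ones when merging.
        return {**self.transient_environment, **self.client_environment}
```

With this layout, unsetting a client override for a key that also exists in the transient layer makes the transient value visible again, which is the behavioural change the commit message points out.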
* core: replace udev_monitor by sd_device_monitor [Yu Watanabe, 2018-10-16, 1 file, -3/+2]
|
* core: clean up test run flags [Lennart Poettering, 2018-10-09, 1 file, -5/+8]
| Let's make them typesafe, and let's add a nice macro helper for checking if we are in a test run, which should make testing for this much easier to read for most cases.
* manager: rework test flags set [Lennart Poettering, 2018-10-09, 1 file, -4/+4]
| No reason to avoid bit 0.
|
| Also, fix some tests that pass "true" as flags value, which is just wrong.
* core: rename ManagerExitCode → ManagerObjective [Lennart Poettering, 2018-10-09, 1 file, -8/+8]
| "ExitCode" is a bit of a misnomer in two ways: it suggests this was about the "exit code" concept that exit()/waitid() deal with, but really isn't. Moreover, it's not even just about exiting either, but more often about reloading/reexecing or rebooting. Let's hence pick a new name for this that is a bit more correct.
|
| I initially thought about naming this the "state", but that'd be a misnomer too, as the value really encodes a "goal" more than a current state. Also we already have the externally visible ManagerState.
|
| No actual changes in behaviour, just the rename.
* manager: add explanatory comment regarding ManagerState [Lennart Poettering, 2018-10-09, 1 file, -0/+2]
|
* core: replace udev_device by sd_device [Yu Watanabe, 2018-08-22, 1 file, -2/+1]
|
* Merge pull request #9853 from poettering/uneeded-queue [Zbigniew Jędrzejewski-Szmek, 2018-08-21, 1 file, -0/+3]
|\
| | rework StopWhenUnneeded=1 logic