| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
I started looking into https://github.com/uapi-group/specifications/issues/35.
BLS says:
> Otherwise [no existing XBOOTLDR partition], if on GPT and an ESP is found and
> it is large enough (let’s say at least 1G) it should be used as $BOOT and
> used as primary location to place boot loader menu resources in.
> It is recommended to mount $BOOT to /boot/, and the ESP to /efi/.
DPS says:
> The ESP used for the current boot is automatically mounted to /efi/ (or
> /boot/ as fallback), unless a different partition is mounted there (possibly
> via /etc/fstab, or because the Extended Boot Loader Partition — see below —
> exists) or the directory is non-empty on the root disk.
I don't think we want to mount the same partition in two places.
If the same partition is not mounted in two places, then the two specs are
contradictory.
The code in gpt-auto-generator implemented the logic from the DPS. It is
modified to implement the logic from BLS.
Effectively:
- if both /boot and /efi are available:
- if both XBOOTLDR and ESP exist:
ESP on /efi, XBOOTLDR on /boot
- if only ESP exists:
ESP on /boot
- if only XBOOTLDR exists:
XBOOTLDR on /boot
- if only /boot is available:
- if XBOOTLDR exists:
XBOOTLDR on /boot
- if only ESP exists:
ESP on /boot
- if only /efi is available:
- if ESP exists:
ESP on /efi
"Available" means that it the mount point is not mounted over and does not
contain files. If the directory doesn't exist, it is also "available" and will
be created later when the mount or automount unit is started.
Thus, the generator attempts to match the partitions and mount points to the
extent possible. In all cases, /boot is the primary place to install kernels.
ESP can be found on /boot or /efi, depending on the situation.
If this patch is merged, I'll submit fixes for BLS and DPS to describe the
same logic.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
| |
E.g. in logs on jammy-ppc64el in https://github.com/systemd/systemd/pull/27294:
Apr 16 17:42:50 H systemd-gpt-auto-generator[300]: Failed to dissect partition table of block device /dev/sda: No message of desired type
Apr 16 17:42:50 H (sd-execu[295]: /usr/lib/systemd/system-generators/systemd-gpt-auto-generator failed with exit status 1.
ee0e6e476e61d4baa2a18e241d212753e75003bf made this particular condition not an
error. But for other errnos we want to print a better message too.
dissect_loop_device_and_warn() already does this, but it always prints the
error at error level. We want to suppress some of the errors, so let's make the
print helper public and do the error suppression in the caller.
|
|
|
|
| |
Follow-up for 598fd4da1cf9665834110583fd9133073cc12481.
|
|
|
|
|
|
|
|
|
| |
Addresses
https://github.com/systemd/systemd/pull/25608/commits/84be0c710d9d562f6d2cf986cc2a8ff4c98a138b#r1060130312,
https://github.com/systemd/systemd/pull/25608/commits/84be0c710d9d562f6d2cf986cc2a8ff4c98a138b#r1067927293, and
https://github.com/systemd/systemd/pull/25608/commits/84be0c710d9d562f6d2cf986cc2a8ff4c98a138b#r1067926416.
Follow-up for 84be0c710d9d562f6d2cf986cc2a8ff4c98a138b.
|
| |
|
| |
|
|
|
|
|
|
|
|
| |
This way we'll have the same mount options in place if we boot via the
gpt generator, or if we mount a DDI locally.
Note that this will also enable MS_NOSYMFOLLOW on ESP and XBOOTLDR now,
if booted via gpt-auto-generator.
|
|
|
|
|
|
|
|
| |
We prefer /efi as a mount point for the ESP, and use /boot as a fallback
if /efi doesn't exist. However, when root=tmpfs, neither /efi nor /boot
exist. gpt-auto should mount to /efi in this case, but it mounted to
/boot instead. This is because gpt-auto didn't check for the existence
of /boot. Here, we correct this
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We already have this nice code in system that determines the block
device backing the root file system, but it's only used internally in
systemd-gpt-generator. Let's make this more accessible and expose it
directly in bootctl.
It doesn't fit immediately into the topic of bootctl, but I think it's
close enough and behaves very similar to the existing "bootctl
--print-boot-path" and "--print-esp-path" tools.
If --print-root-device (or -R) is specified once, will show the block device
backing the root fs, and if specified twice (probably easier: -RR) it
will show the whole block device that block device belongs to in case it
is a partition block device.
Suggested use:
# cfdisk `bootctl -RR`
To get access to the partition table, behind the OS install, for
whatever it might be.
|
|
|
|
|
|
|
|
|
|
|
|
| |
When compiled without ENABLE_EFI, efi_stub_measured() was not defined, so
compilation would fail. But it's not enough to add a stub that returns
-EOPNOTSUPP. We call this function in various places and usually print the error
at warning or error level, so we'd print a confusing message. We also can't add
a stub that always returns 0, because then we'd print a message like "Kernel
stub did not measure", which would be confusing too. Adding special handling for
-EOPNOTSUPP in every caller is also unattractive. So instead efi_stub_measured()
is reworked to log the warning or error internally, and such logging is removed
from the callers, and a stub is added that logs a custom message.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
measurements
Let's introduce a common implementation of a function that checks
whether we are booted on a kernel with systemd-stub that has TPM PCR
measurements enabled. Do our own userspace measurements only if we
detect that.
PCRs are scarce and most likely there are projects which already make
use of them in other ways. Hence, instead of blindly stepping into their
territory let's conditionalize things so that people have to explicitly
buy into our PCR assignments before we start measuring things into them.
Specifically bind everything to an UKI that reported measurements.
This was previously already implemented in systemd-pcrphase, but with
this change we expand this to all tools that process PCR measurement
settings.
The env var to override the check is renamed to SYSTEMD_FORCE_MEASURE,
to make it more generic (since we'll use it at multiple places now).
This is not a compat break, since the original env var for that was not
included in any stable release yet.
|
|
|
|
|
|
| |
If we use gpt-auto-generator, automatically measure root fs and /var.
Otherwise, add x-systemd.measure option to request this.
|
|
|
|
|
| |
let's enable PCR 15 measurements automatically if gpt-auto discovery is
used and systemd-stub is also used.
|
|
|
|
|
|
|
|
|
| |
When these partitions are probed by gpt-auto,
they will always be hardened with such options.
See also: https://github.com/systemd/systemd/issues/25776#issuecomment-1364115711
Closes #25776
|
|
|
|
| |
Fixes: #20331
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When we dissect images automatically, let's be a bit more conservative
with the file system types we are willing to mount: only mount common
file systems automatically.
Explicit mounts requested by admins should always be OK, but when we do
automatic mounts, let's not permit barely maintained, possibly legacy
file systems.
The list for now covers the four common writable and two common
read-only file systems. Sooner or later we might want to add more to the
list.
Also, it might make sense to eventually make this configurable via the
image dissection policy logic.
|
|
|
|
|
|
|
|
| |
Even if root= is not specified on the kernel cmdline, we should honour
the other rootXYZ= options.
Fixes: #8411
See: #17034
|
|
|
|
|
| |
"auto"/"noauto" only make sense in the fstab. Putting them in Options= in the
generated unit has no effect and is confusing.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
DISSECT_IMAGE_OPEN_PARTITION_DEVICES
Curently, these two flags were implied by dissect_loop_device(), but
that's not right, because this means systemd-gpt-auto-generator will
dissect the root block device with these flags set and that's not
desirable: the generator should not cause the partition devices to be
created (we don't intend to use them right-away after all, but expect
udev to find/probe them first, and then mount them though .mount units).
And there's no point in opening the partition devices, since we do not
intend to mount them via fds either.
Hence, rework this: instead of implying the flags, specify them
explicitly.
While we are at it, let's also rename the flags to make them more
descriptive:
DISSECT_IMAGE_MANAGE_PARTITION_DEVICES becomes
DISSECT_IMAGE_ADD_PARTITION_DEVICES, since that's really all this does:
add the partition devices via BLKPG.
DISSECT_IMAGE_OPEN_PARTITION_DEVICES becomes
DISSECT_IMAGE_PIN_PARTITION_DEVICES, since we not only open the devices,
but keep the devices open continously (i.e. we "pin" them).
Also, drop the DISSECT_IMAGE_BLOCK_DEVICE combination flag, since it is
misleading, i.e. it suggests it was appropriate to specify on all
dissected blocking devices, but that's precisely not the case, see the
systemd-gpt-auto-generator case. My guess is that the confusion around
this was actually the cause for this bug we are addressing here.
Fixes: #25528
|
|
|
|
|
|
|
|
|
| |
I changed imports of util.h to initrd-util.h, or added an import of
initrd-util.h, to keep compilation working. It turns out that many files didn't
import util.h directly.
When viewing the patch, don't be confused by git rename detection logic:
a new .c file is added and two functions moved into it.
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
add_partition_xyz()
The function for handling regular mounts based on DissectedPartition
objects is called add_partition_mount(), so let's follow this scheme for
all other functions that handle them, too. This nicely separates out the
low-level functions (which get split up args) from the high-level
functions (which get a DissectedPartition object): the latter are called
add_partition_xyz() the former just add_xyz().
This makes naming a bit more systematic. No change in behaviour.
|
| |
|
|
|
|
| |
Fixes #24978
|
|
|
|
| |
No functional changes, just preparation for later commits.
|
| |
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The loading of an extension image from a symlink "NAME.raw" to
"NAME-VERSION.raw" failed because the release file name check worked
with the backing file of the loop device which already resolves the
symlink and thus the found name "NAME-VERSION" mismatched "NAME".
Pass the original filename and use it instead of the backing file
when available. This fixes the loading of "NAME.raw" extensions which
are a symlink to "NAME-VERSION.raw" as, e.g., may be the case when
systemd-sysupdate manages multiple versions.
Fixes https://github.com/systemd/systemd/issues/24293
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts a major chunk of 75d7e04eb4662a814c26010d447eed8a862f5ec1
Now that the loopback device code already destroys the partitions we
don't have to do this here anymore.
I am sure the right place to delete the partitions is in the loopback
code, since we really only should do that for loopback devices, see
bug #24431, and not on "real" block devices.
I am also not convinced dropping partitions the dissection logic doesn't
care about is a good idea, after all. The dissection stuff should
probably not consider itself the "owner" of the block devices it
analyzes, but take a more passive role: figure out what is what, but not
modify it.
Fixes: #24431
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
prefix_roota() is something we should stop using. It is bad for three
reasons:
1. As it names suggests it's supposed to be used when working relative
to some root directory, but given it doesn't follow symlinks (and
instead just stupidly joins paths) it is not a good choice for that.
2. More often than not it is currently used with inputs under control of
the user, and that is icky given it typically allocates memory on the
stack.
3. It's a redundant interface, where chase_symlinks() and path_join()
already exist as better, safer interfaces.
Hence, let's start moving things from prefix_roota() to path_join() for
the cases where that's appropriate.
|
| |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When closing a loop device, the kernel will asynchronously remove
the probed partitions. This can lead to race conditions where we
try to reuse a partition device that still needs to be removed by
the kernel. To avoid such issues, let's explicitly try to remove
any partitions using BLKPG_DEL_PARTITION when we're done with an
image.
To make sure we don't try to remove partitions when we want them
to remain (e.g. systemd-dissect --mount), we add
dissected_image_relinquish() in a similar vein to loop_device_relinquish()
and decrypted_image_relinquish().
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This is a follow-up for f470cb6d13558fc06131dc677d54a089a0b07359 which in
turn is a follow-up for a068aceafbffcba85398cce636c25d659265087a.
The latter started to honour hidden files when deciding whether a
directory is empty. The former reverted to the old behaviour to fix
issue #23220.
It introduced a bug though: when a directory contains a larger number of
hidden entries the getdents64() buffer will not suffice to read them,
since we just allocate three entries for it (which is definitely enough
if we just ignore the . + .. entries, but not ig we ignore more).
I think it's a bit confusing that dir_is_empty() can return true even if
rmdir() on the dir would return ENOTEMPTY. Hence, let's rework the
function to make it optional whether hidden files are ignored or not.
After all, I looking at the users of this function I am pretty sure in
more cases we want to honour hidden files.
|
|
|
|
| |
And port some parts over.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
No actual code changes, just splitting out of some dev_t handling
related calls from stat-util.[ch], they are quite a number already, and
deserve their own module now I think.
Also, try to settle on the name "devnum" as the name for the concept,
instead of "devno" or "dev" or "devid". "devnum" is the name exported in
udev APIs, hence probably best to stick to that. (this just renames a
few symbols to "devum", local variables are left untouched, to make the
patch not too invasive)
No actual code changes.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
So here's something we should always keep in mind:
systemd-udevd actually does *two* things with BSD file locks on block
devices:
1. While it probes a device it takes a LOCK_SH lock. Thus everyone else
taking a LOCK_EX lock will temporarily block udev from probing
devices, which is good when making changes to it.
2. Whenever a device is closed after write (detected via inotify), udevd
will issue BLKRRPART (requesting the kernel to reread the partition
table). It does this while holding a LOCK_EX lock on the block
device. Thus anyone else taking LOCK_SH or LOCK_EX will temporarily
block udevd from issuing that ioctl. And that's quite relevant, since
the kernel will temporarily flush out all partitions while re-reading
the partition table and then create them anew. Thus it is smart to
take LOCK_SH when dissecting a block device to ensure that no
BLKRRPART is issued in the background, until we mounted the devices.
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This revisits the mess around waiting for partition block devices in
the image dissection code. It implements a nice little trick:
Instead of waiting for the kernel to probe the partition table for us
and generate the block devices from it, we'll just do that ourselves.
How can we do it? Via the BLKPG_ADD_PARTITION ioctl, that the kernel has
supported for a while. This ioctl allows creating partition block
devices off "whole" block devices from userspace, without the partitions
necessarily being present in the partition table at all.
So, whenever we want a partition to be there, we'll just issue
BLKPG_ADD_PARTITION. This can either work, in which case we know the
partition is there, and can use it. Yay. Or it can fail with EBUSY,
which the kernel returns if a partition by the selected partition index
already exists (or if an existing partition overlaps with the new one).
But if that's the case, then that's also OK, because the partition will
already exist.
So, regardless if we win or the kernel wins, for us the outcome is the
same: the partition block device will exist after invoking the ioctl.
Yay.
Net effect: we are not dependent on asynchronous uevent messages to wait
for the devices. Instead we synchronously get what we need. This makes
us independent of the (apparently less than reliable) netlink transport,
and should almost always be quicker.
Hopefully addresses #17469 even on older kernels.
Fixes: #17469
|
|
|
|
|
|
|
|
|
|
|
|
| |
get_block_device_harder() returns == 0 if the fs is valid, but it is not
backed by a single devno. (As opposed to returning > 0 if the devno is
valid). Let's catch this case and log a clear message, and don't bother
open the device in that case.
This is mostly cosmetical, as either way, systemd-gpt-auto-generator
doesn't work in scenarios like that.
Prompted-by: #22504
|
|\
| |
| | |
Use new diskseq block device property
|
| |
| |
| |
| |
| |
| |
| |
| | |
DISKSEQ is a reliable way to find out if we missed a uevent or not, as
it's monotonically increasing. If we parse an event with a smaller or
no sequence number, we know we need to wait longer. If we parse an
event with a greater sequence number, we know we missed it and the
device was reused.
|
|/
|
|
|
|
|
|
|
|
|
|
| |
Previously volatile-root was only checked if "/" wasn't backed by a
block device, but the block device isn't necessarily original root block
device (ex: if the rootfs is copied to a ext4 fs backed by zram in the
initramfs), so we always want volatile-root checked.
So shuffle the code around so volatile-root is checked first and
fallback to the automatic logic.
Fix #20557
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If the auto-discovered swap partition is LUKS encrypted, decrypt it
automatically.
This aligns with the Discoverable Partitions Specification, though I've
also updated it to explicitly mention that LUKS is now supported here.
Since systemd retries any key already in the kernel keyring, if the swap
partition has the same passphrase as the root partition, the user won't
be prompted a second time for a second passphrase.
See https://github.com/systemd/systemd/issues/20019
|
| |
|
|
|
|
| |
partition flag is set
|
|
|
|
|
|
|
|
|
| |
This tries to shorten the race of device reuse a bit more: let's ignore
udev database entries that are older than the time where we started to
use a loopback device.
This doesn't fix the whole loopback device raciness mess, but it makes
the race window a bit shorter.
|
|
|
|
|
|
|
|
|
|
|
| |
Let's drop all monitor uevent that were enqueued before we actually
started setting up the device.
This doesn't fix the race, but it makes the race window smaller: since
we cannot determine the uevent seqnum and the loopback attachment
atomically, there's a tiny window where uevents might be generated by
the device which we mistake for being associated with out use of the
loopback device.
|
|
|
|
| |
--Dlibcryptsetup=false
|
|
|
|
|
|
|
|
| |
Let's make use of the new dissection in all tools where this makes
sense, which are all tools that dissect images, except for those which
inherently operate on state/configuraiton and thus where an image
without state nor configuration is useless (e.g.
systemd-tmpfiles/systemd-firstboot/… --image= switch).
|