| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
| |
Rename ret to ret2 compile and then err to ret. Also, new ret2 is found
to be localized within the 'if (trans)' statement, so move its
declaration there.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
| |
In quick_update_accounting() err is used as 2nd return value, which could
be achieved just with ret.
Signed-off-by: Anand Jain <anand.jain@oracle.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A comment from Filipe on one of my previous cleanups brought my
attention to a new helper we have for getting the root id of a root,
which makes it easier to read in the code.
The changes where made with the following Coccinelle semantic patch:
// <smpl>
@@
expression E,E1;
@@
(
E->root_key.objectid = E1
|
- E->root_key.objectid
+ btrfs_root_id(E)
)
// </smpl>
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
[ minor style fixups ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This check "if (inherit->num_qgroups > PAGE_SIZE)" is confusing and
unnecessary.
The problem with the check is that static checkers flag it as a
potential mixup of between units of bytes vs number of elements.
Fortunately, the check can safely be deleted because the next check is
correct and applies an even stricter limit:
if (size != struct_size(inherit, qgroups, inherit->num_qgroups))
return -EINVAL;
The "inherit" struct ends in a variable array of __u64 and
"inherit->num_qgroups" is the number of elements in the array. At the
start of the function we check that:
if (size < sizeof(*inherit) || size > PAGE_SIZE)
return -EINVAL;
Thus, since we verify that the whole struct fits within one page, that
means that the number of elements in the inherit->qgroups[] array must
be less than PAGE_SIZE.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[BUG]
After kernel commit 86211eea8ae1 ("btrfs: qgroup: validate
btrfs_qgroup_inherit parameter"), user space tool snapper will fail to
create snapshot using its timeline feature.
[CAUSE]
It turns out that, if using timeline snapper would unconditionally pass
btrfs_qgroup_inherit parameter (assigning the new snapshot to qgroup 1/0)
for snapshot creation.
In that case, since qgroup is disabled there would be no qgroup 1/0, and
btrfs_qgroup_check_inherit() would return -ENOENT and fail the whole
snapshot creation.
[FIX]
Just skip the check if qgroup is not enabled.
This is to keep the older behavior for user space tools, as if the
kernel behavior changed for user space, it is a regression of kernel.
Thankfully snapper is also fixing the behavior by detecting if qgroup is
running in the first place, so the effect should not be that huge.
Link: https://github.com/openSUSE/snapper/issues/894
Fixes: 86211eea8ae1 ("btrfs: qgroup: validate btrfs_qgroup_inherit parameter")
CC: stable@vger.kernel.org # 6.8+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
One of my CI runs popped the following lockdep splat
======================================================
WARNING: possible circular locking dependency detected
6.9.0-rc4+ #1 Not tainted
------------------------------------------------------
btrfs/471533 is trying to acquire lock:
ffff92ba46980850 (&fs_info->cleaner_mutex){+.+.}-{3:3}, at: btrfs_quota_disable+0x54/0x4c0
but task is already holding lock:
ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600
which lock already depends on the new lock.
the existing dependency chain (in reverse order) is:
-> #2 (&fs_info->subvol_sem){++++}-{3:3}:
down_read+0x42/0x170
btrfs_rename+0x607/0xb00
btrfs_rename2+0x2e/0x70
vfs_rename+0xaf8/0xfc0
do_renameat2+0x586/0x600
__x64_sys_rename+0x43/0x50
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #1 (&sb->s_type->i_mutex_key#16){++++}-{3:3}:
down_write+0x3f/0xc0
btrfs_inode_lock+0x40/0x70
prealloc_file_extent_cluster+0x1b0/0x370
relocate_file_extent_cluster+0xb2/0x720
relocate_data_extent+0x107/0x160
relocate_block_group+0x442/0x550
btrfs_relocate_block_group+0x2cb/0x4b0
btrfs_relocate_chunk+0x50/0x1b0
btrfs_balance+0x92f/0x13d0
btrfs_ioctl+0x1abf/0x2600
__x64_sys_ioctl+0x97/0xd0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
-> #0 (&fs_info->cleaner_mutex){+.+.}-{3:3}:
__lock_acquire+0x13e7/0x2180
lock_acquire+0xcb/0x2e0
__mutex_lock+0xbe/0xc00
btrfs_quota_disable+0x54/0x4c0
btrfs_ioctl+0x206b/0x2600
__x64_sys_ioctl+0x97/0xd0
do_syscall_64+0x95/0x180
entry_SYSCALL_64_after_hwframe+0x76/0x7e
other info that might help us debug this:
Chain exists of:
&fs_info->cleaner_mutex --> &sb->s_type->i_mutex_key#16 --> &fs_info->subvol_sem
Possible unsafe locking scenario:
CPU0 CPU1
---- ----
lock(&fs_info->subvol_sem);
lock(&sb->s_type->i_mutex_key#16);
lock(&fs_info->subvol_sem);
lock(&fs_info->cleaner_mutex);
*** DEADLOCK ***
2 locks held by btrfs/471533:
#0: ffff92ba4319e420 (sb_writers#14){.+.+}-{0:0}, at: btrfs_ioctl+0x3b5/0x2600
#1: ffff92ba46980bd0 (&fs_info->subvol_sem){++++}-{3:3}, at: btrfs_ioctl+0x1c8f/0x2600
stack backtrace:
CPU: 1 PID: 471533 Comm: btrfs Kdump: loaded Not tainted 6.9.0-rc4+ #1
Call Trace:
<TASK>
dump_stack_lvl+0x77/0xb0
check_noncircular+0x148/0x160
? lock_acquire+0xcb/0x2e0
__lock_acquire+0x13e7/0x2180
lock_acquire+0xcb/0x2e0
? btrfs_quota_disable+0x54/0x4c0
? lock_is_held_type+0x9a/0x110
__mutex_lock+0xbe/0xc00
? btrfs_quota_disable+0x54/0x4c0
? srso_return_thunk+0x5/0x5f
? lock_acquire+0xcb/0x2e0
? btrfs_quota_disable+0x54/0x4c0
? btrfs_quota_disable+0x54/0x4c0
btrfs_quota_disable+0x54/0x4c0
btrfs_ioctl+0x206b/0x2600
? srso_return_thunk+0x5/0x5f
? __do_sys_statfs+0x61/0x70
__x64_sys_ioctl+0x97/0xd0
do_syscall_64+0x95/0x180
? srso_return_thunk+0x5/0x5f
? reacquire_held_locks+0xd1/0x1f0
? do_user_addr_fault+0x307/0x8a0
? srso_return_thunk+0x5/0x5f
? lock_acquire+0xcb/0x2e0
? srso_return_thunk+0x5/0x5f
? srso_return_thunk+0x5/0x5f
? find_held_lock+0x2b/0x80
? srso_return_thunk+0x5/0x5f
? lock_release+0xca/0x2a0
? srso_return_thunk+0x5/0x5f
? do_user_addr_fault+0x35c/0x8a0
? srso_return_thunk+0x5/0x5f
? trace_hardirqs_off+0x4b/0xc0
? srso_return_thunk+0x5/0x5f
? lockdep_hardirqs_on_prepare+0xde/0x190
? srso_return_thunk+0x5/0x5f
This happens because when we call rename we already have the inode mutex
held, and then we acquire the subvol_sem if we are a subvolume. This
makes the dependency
inode lock -> subvol sem
When we're running data relocation we will preallocate space for the
data relocation inode, and we always run the relocation under the
->cleaner_mutex. This now creates the dependency of
cleaner_mutex -> inode lock (from the prealloc) -> subvol_sem
Qgroup delete is doing this in the opposite order, it is acquiring the
subvol_sem and then it is acquiring the cleaner_mutex, which results in
this lockdep splat. This deadlock can't happen in reality, because we
won't ever rename the data reloc inode, nor is the data reloc inode a
subvolume.
However this is fairly easy to fix, simply take the cleaner mutex in the
case where we are disabling qgroups before we take the subvol_sem. This
resolves the lockdep splat.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We use add_root_meta_rsv and sub_root_meta_rsv to track prealloc and
pertrans reservations for subvolumes when quotas are enabled. The
convert function does not properly increment pertrans after decrementing
prealloc, so the count is not accurate.
Note: we check that the fs is not read-only to mirror the logic in
qgroup_convert_meta, which checks that before adding to the pertrans rsv.
Fixes: 8287475a2055 ("btrfs: qgroup: Use root::qgroup_meta_rsv_* to record qgroup meta reserved space")
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
same parent
Currently "btrfs subvolume snapshot -i <qgroupid>" would always mark the
qgroup inconsistent.
This can be annoying if the fs has a lot of snapshots, and needs qgroup
to get the accounting for the amount of bytes it can free for each
snapshot.
Although we have the new simple quote as a solution, there is also a
case where we can skip the full scan, if all the following conditions
are met:
- The source subvolume belongs to a higher level parent qgroup
- The parent qgroup already owns all its bytes exclusively
- The new snapshot is also added to the same parent qgroup
In that case, we only need to add nodesize to the parent qgroup and
avoid a full rescan.
This patch would add the extra quick accounting update for such inherit.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[BUG]
Currently btrfs can create subvolume with an invalid qgroup inherit
without triggering any error:
# mkfs.btrfs -O quota -f $dev
# mount $dev $mnt
# btrfs subvolume create -i 2/0 $mnt/subv1
# btrfs qgroup show -prce --sync $mnt
Qgroupid Referenced Exclusive Path
-------- ---------- --------- ----
0/5 16.00KiB 16.00KiB <toplevel>
0/256 16.00KiB 16.00KiB subv1
[CAUSE]
We only do a very basic size check for btrfs_qgroup_inherit structure,
but never really verify if the values are correct.
Thus in btrfs_qgroup_inherit() function, we have to skip non-existing
qgroups, and never return any error.
[FIX]
Fix the behavior and introduce extra checks:
- Introduce early check for btrfs_qgroup_inherit structure
Not only the size, but also all the qgroup ids would be verified.
And the timing is very early, so we can return error early.
This early check is very important for snapshot creation, as snapshot
is delayed to transaction commit.
- Drop support for btrfs_qgroup_inherit::num_ref_copies and
num_excl_copies
Those two members are used to specify to copy refr/excl numbers from
other qgroups.
This would definitely mark qgroup inconsistent, and btrfs-progs has
dropped the support for them for a long time.
It's time to drop the support for kernel.
- Verify the supported btrfs_qgroup_inherit::flags
Just in case we want to add extra flags for btrfs_qgroup_inherit.
Now above subvolume creation would fail with -ENOENT other than silently
ignore the non-existing qgroup.
CC: stable@vger.kernel.org # 6.7+
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[BUG]
If qgroup is marked inconsistent (e.g. caused by operations needing full
subtree rescan, like creating a snapshot and assign to a higher level
qgroup), btrfs would immediately start leaking its data reserved space.
The following script can easily reproduce it:
mkfs.btrfs -O quota -f $dev
mount $dev $mnt
btrfs subvolume create $mnt/subv1
btrfs qgroup create 1/0 $mnt
# This snapshot creation would mark qgroup inconsistent,
# as the ownership involves different higher level qgroup, thus
# we have to rescan both source and snapshot, which can be very
# time consuming, thus here btrfs just choose to mark qgroup
# inconsistent, and let users to determine when to do the rescan.
btrfs subv snapshot -i 1/0 $mnt/subv1 $mnt/snap1
# Now this write would lead to qgroup rsv leak.
xfs_io -f -c "pwrite 0 64k" $mnt/file1
# And at unmount time, btrfs would report 64K DATA rsv space leaked.
umount $mnt
And we would have the following dmesg output for the unmount:
BTRFS info (device dm-1): last unmount of filesystem 14a3d84e-f47b-4f72-b053-a8a36eef74d3
BTRFS warning (device dm-1): qgroup 0/5 has unreleased space, type 0 rsv 65536
[CAUSE]
Since commit e15e9f43c7ca ("btrfs: introduce
BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"),
we introduce a mode for btrfs qgroup to skip the timing consuming
backref walk, if the qgroup is already inconsistent.
But this skip also covered the data reserved freeing, thus the qgroup
reserved space for each newly created data extent would not be freed,
thus cause the leakage.
[FIX]
Make the data extent reserved space freeing mandatory.
The qgroup reserved space handling is way cheaper compared to the
backref walking part, and we always have the super sensitive leak
detector, thus it's definitely worth to always free the qgroup
reserved data space.
Reported-by: Fabian Vogt <fvogt@suse.com>
Fixes: e15e9f43c7ca ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting")
CC: stable@vger.kernel.org # 6.1+
Link: https://bugzilla.suse.com/show_bug.cgi?id=1216196
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Other errors in flush_reservations() are handled and also in the caller.
Ignoring commit might make some sense as it's called right after join so
it's to poke the whole commit machinery to free space.
However for consistency return the error. The caller
btrfs_quota_disable() would try to start the transaction which would
in turn fail too so there's no effective change.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
btrfs_qgroup_account_extent()
The BUG_ON is deep in the qgroup code where we can expect that it
exists. A NULL pointer would cause a crash.
It was added long ago in 550d7a2ed5db35 ("btrfs: qgroup: Add new qgroup
calculation function btrfs_qgroup_account_extents()."). It maybe made
sense back then as the quota enable/disable state machine was not that
robust as it is nowadays, so we can just delete it.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
| |
The only caller do_walk_down() of btrfs_qgroup_trace_subtree() validates
the value of level and uses it several times before it's passed as an
argument. Same for root_eb that's called 'next' in the caller.
Change both BUG_ONs to assertions as this is to assure proper interface
use rather than real errors.
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If a subvolume still exists, forbid deleting its qgroup 0/subvolid.
This behavior generally leads to incorrect behavior in squotas and
doesn't have a legitimate purpose.
Fixes: cecbb533b5fc ("btrfs: record simple quota deltas in delayed refs")
CC: stable@vger.kernel.org # 5.4+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
| |
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A reservation goes through a 3 step lifetime:
- generated during delalloc
- released/counted by ordered_extent allocation
- freed by running delayed ref
That third step depends on must_insert_reserved on the head ref, so the
head ref with that field set owns the reservation. Once you prepare to
run the head ref, must_insert_reserved is unset, which means that
running the ref must free the reservation, whether or not it succeeds,
or else the reservation is leaked. That results in either a risk of
spurious ENOSPC if the fs stays writeable or a warning on unmount if it
is readonly.
The existing squota code was aware of these invariants, but missed a few
cases. Improve it by adding a helper function to use in the cleanup
paths and call it from the existing early returns in running delayed
refs. This also simplifies btrfs_record_squota_delta and struct
btrfs_quota_delta.
This fixes (or at least improves the reliability of) generic/475 with
"mkfs -O squota". On my machine, that test failed ~4/10 times without
this patch and passed 100/100 times with it.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we abort a transaction, we never run the code that frees the pertrans
qgroup reservation. This results in warnings on unmount as that
reservation has been leaked. The leak isn't a huge issue since the fs is
read-only, but it's better to clean it up when we know we can/should. Do
it during the cleanup_transaction step of aborting.
CC: stable@vger.kernel.org # 5.15+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The reserved data counter and input parameter is a u64, but we
inadvertently accumulate it in an int. Overflowing that int results in
freeing the wrong amount of data and breaking reserve accounting.
Unfortunately, this overflow rot spreads from there, as the qgroup
release/free functions rely on returning an int to take advantage of
negative values for error codes.
Therefore, the full fix is to return the "released" or "freed" amount by
a u64 argument and to return 0 or negative error code via the return
value.
Most of the call sites simply ignore the return value, though some
of them handle the error and count the returned bytes. Change all of
them accordingly.
CC: stable@vger.kernel.org # 6.1+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When using simple quotas we are not supposed to allocate qgroup records
when adding delayed references. However we allocate them if either mode
of quotas is enabled (the new simple one or the old one), but then we
never free them because running the accounting, which frees the records,
is only run when using the old quotas (at btrfs_qgroup_account_extents()),
resulting in a memory leak of the records allocated when adding delayed
references.
Fix this by allocating the records only if the old quotas mode is enabled.
Also fix btrfs_qgroup_trace_extent_nolock() to return 1 if the old quotas
mode is not enabled - meaning the caller has to free the record.
Fixes: 182940f4f4db ("btrfs: qgroup: add new quota mode for simple quotas")
Reported-by: syzbot+d3ddc6dcc6386dea398b@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/00000000000004769106097f9a34@google.com/
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When doing qgroup accounting for an extent, we take the spinlock
fs_info->qgroup_lock and then add qgroups to the local list (iterator)
named "qgroups". These qgroups are found in the fs_info->qgroup_tree
rbtree. After we're done, we unlock fs_info->qgroup_lock and then call
qgroup_iterator_nested_clean(), which will iterate over all the qgroups
added to the local list "qgroups" and then delete them from the list.
Deleting a qgroup from the list can however result in a use-after-free
if a qgroup remove operation happens after we unlock fs_info->qgroup_lock
and before or while we are at qgroup_iterator_nested_clean().
Fix this by calling qgroup_iterator_nested_clean() while still holding
the lock fs_info->qgroup_lock - we don't need it under the 'out' label
since before taking the lock the "qgroups" list is always empty. This
guarantees safety because btrfs_remove_qgroup() takes that lock before
removing a qgroup from the rbtree fs_info->qgroup_tree.
This was reported by syzbot with the following stack traces:
BUG: KASAN: slab-use-after-free in __list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49
Read of size 8 at addr ffff888027e420b0 by task kworker/u4:3/48
CPU: 1 PID: 48 Comm: kworker/u4:3 Not tainted 6.6.0-syzkaller-10396-g4652b8e4f3ff #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
Workqueue: btrfs-qgroup-rescan btrfs_work_helper
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106
print_address_description mm/kasan/report.c:364 [inline]
print_report+0x163/0x540 mm/kasan/report.c:475
kasan_report+0x175/0x1b0 mm/kasan/report.c:588
__list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49
__list_del_entry_valid include/linux/list.h:124 [inline]
__list_del_entry include/linux/list.h:215 [inline]
list_del_init include/linux/list.h:287 [inline]
qgroup_iterator_nested_clean fs/btrfs/qgroup.c:2623 [inline]
btrfs_qgroup_account_extent+0x18b/0x1150 fs/btrfs/qgroup.c:2883
qgroup_rescan_leaf fs/btrfs/qgroup.c:3543 [inline]
btrfs_qgroup_rescan_worker+0x1078/0x1c60 fs/btrfs/qgroup.c:3604
btrfs_work_helper+0x37c/0xbd0 fs/btrfs/async-thread.c:315
process_one_work kernel/workqueue.c:2630 [inline]
process_scheduled_works+0x90f/0x1400 kernel/workqueue.c:2703
worker_thread+0xa5f/0xff0 kernel/workqueue.c:2784
kthread+0x2d3/0x370 kernel/kthread.c:388
ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
</TASK>
Allocated by task 6355:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
____kasan_kmalloc mm/kasan/common.c:374 [inline]
__kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:383
kmalloc include/linux/slab.h:600 [inline]
kzalloc include/linux/slab.h:721 [inline]
btrfs_quota_enable+0xee9/0x2060 fs/btrfs/qgroup.c:1209
btrfs_ioctl_quota_ctl+0x143/0x190 fs/btrfs/ioctl.c:3705
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Freed by task 6355:
kasan_save_stack mm/kasan/common.c:45 [inline]
kasan_set_track+0x4f/0x70 mm/kasan/common.c:52
kasan_save_free_info+0x28/0x40 mm/kasan/generic.c:522
____kasan_slab_free+0xd6/0x120 mm/kasan/common.c:236
kasan_slab_free include/linux/kasan.h:164 [inline]
slab_free_hook mm/slub.c:1800 [inline]
slab_free_freelist_hook mm/slub.c:1826 [inline]
slab_free mm/slub.c:3809 [inline]
__kmem_cache_free+0x263/0x3a0 mm/slub.c:3822
btrfs_remove_qgroup+0x764/0x8c0 fs/btrfs/qgroup.c:1787
btrfs_ioctl_qgroup_create+0x185/0x1e0 fs/btrfs/ioctl.c:3811
vfs_ioctl fs/ioctl.c:51 [inline]
__do_sys_ioctl fs/ioctl.c:871 [inline]
__se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Last potentially related work creation:
kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45
__kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2667 [inline]
call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781
kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823
kthread+0x2d3/0x370 kernel/kthread.c:388
ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
Second to last potentially related work creation:
kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45
__kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492
__call_rcu_common kernel/rcu/tree.c:2667 [inline]
call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781
kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823
kthread+0x2d3/0x370 kernel/kthread.c:388
ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147
ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242
The buggy address belongs to the object at ffff888027e42000
which belongs to the cache kmalloc-512 of size 512
The buggy address is located 176 bytes inside of
freed 512-byte region [ffff888027e42000, ffff888027e42200)
The buggy address belongs to the physical page:
page:ffffea00009f9000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x27e40
head:ffffea00009f9000 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0
flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff)
page_type: 0xffffffff()
raw: 00fff00000000840 ffff888012c41c80 ffffea0000a5ba00 dead000000000002
raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000
page dumped because: kasan: bad access detected
page_owner tracks the page as allocated
page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 4514, tgid 4514 (udevadm), ts 24598439480, free_ts 23755696267
set_page_owner include/linux/page_owner.h:31 [inline]
post_alloc_hook+0x1e6/0x210 mm/page_alloc.c:1536
prep_new_page mm/page_alloc.c:1543 [inline]
get_page_from_freelist+0x31db/0x3360 mm/page_alloc.c:3170
__alloc_pages+0x255/0x670 mm/page_alloc.c:4426
alloc_slab_page+0x6a/0x160 mm/slub.c:1870
allocate_slab mm/slub.c:2017 [inline]
new_slab+0x84/0x2f0 mm/slub.c:2070
___slab_alloc+0xc85/0x1310 mm/slub.c:3223
__slab_alloc mm/slub.c:3322 [inline]
__slab_alloc_node mm/slub.c:3375 [inline]
slab_alloc_node mm/slub.c:3468 [inline]
__kmem_cache_alloc_node+0x19d/0x270 mm/slub.c:3517
kmalloc_trace+0x2a/0xe0 mm/slab_common.c:1098
kmalloc include/linux/slab.h:600 [inline]
kzalloc include/linux/slab.h:721 [inline]
kernfs_fop_open+0x3e7/0xcc0 fs/kernfs/file.c:670
do_dentry_open+0x8fd/0x1590 fs/open.c:948
do_open fs/namei.c:3622 [inline]
path_openat+0x2845/0x3280 fs/namei.c:3779
do_filp_open+0x234/0x490 fs/namei.c:3809
do_sys_openat2+0x13e/0x1d0 fs/open.c:1440
do_sys_open fs/open.c:1455 [inline]
__do_sys_openat fs/open.c:1471 [inline]
__se_sys_openat fs/open.c:1466 [inline]
__x64_sys_openat+0x247/0x290 fs/open.c:1466
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
page last free stack trace:
reset_page_owner include/linux/page_owner.h:24 [inline]
free_pages_prepare mm/page_alloc.c:1136 [inline]
free_unref_page_prepare+0x8c3/0x9f0 mm/page_alloc.c:2312
free_unref_page+0x37/0x3f0 mm/page_alloc.c:2405
discard_slab mm/slub.c:2116 [inline]
__unfreeze_partials+0x1dc/0x220 mm/slub.c:2655
put_cpu_partial+0x17b/0x250 mm/slub.c:2731
__slab_free+0x2b6/0x390 mm/slub.c:3679
qlink_free mm/kasan/quarantine.c:166 [inline]
qlist_free_all+0x75/0xe0 mm/kasan/quarantine.c:185
kasan_quarantine_reduce+0x14b/0x160 mm/kasan/quarantine.c:292
__kasan_slab_alloc+0x23/0x70 mm/kasan/common.c:305
kasan_slab_alloc include/linux/kasan.h:188 [inline]
slab_post_alloc_hook+0x67/0x3d0 mm/slab.h:762
slab_alloc_node mm/slub.c:3478 [inline]
slab_alloc mm/slub.c:3486 [inline]
__kmem_cache_alloc_lru mm/slub.c:3493 [inline]
kmem_cache_alloc+0x104/0x2c0 mm/slub.c:3502
getname_flags+0xbc/0x4f0 fs/namei.c:140
do_sys_openat2+0xd2/0x1d0 fs/open.c:1434
do_sys_open fs/open.c:1455 [inline]
__do_sys_openat fs/open.c:1471 [inline]
__se_sys_openat fs/open.c:1466 [inline]
__x64_sys_openat+0x247/0x290 fs/open.c:1466
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Memory state around the buggy address:
ffff888027e41f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
ffff888027e42000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
>ffff888027e42080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
^
ffff888027e42100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
ffff888027e42180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Reported-by: syzbot+e0b615318f8fcfc01ceb@syzkaller.appspotmail.com
Fixes: dce28769a33a ("btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt()")
CC: stable@vger.kernel.org # 6.6
Link: https://lore.kernel.org/linux-btrfs/00000000000091a5b2060936bf6d@google.com/
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In open_ctree, we set BTRFS_FS_QUOTA_ENABLED as soon as we see a
quota_root, as opposed to after we are done setting up the qgroup
structures. In the quota_enable path, we wait until after the structures
are set up. Likewise, in disable, we clear the bit before tearing down
the structures. I feel that this organization is less surprising for the
open_ctree path.
I don't believe this fixes any actual bug, but avoids potential
confusion when using btrfs_qgroup_mode in an intermediate state where we
are enabled but haven't yet setup the qgroup status flags. It also
avoids any risk of calling a qgroup function and attempting to use the
qgroup rbtrees before they exist/are setup.
This all occurs before we do rw setup, so I believe it should be mostly
a no-op.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Simple quotas count extents only from the moment the feature is enabled.
Therefore, if we do something like:
1. create subvol S
2. write F in S
3. enable quotas
4. remove F
5. write G in S
then after 3. and 4. we would expect the simple quota usage of S to be 0
(putting aside some metadata extents that might be written) and after
5., it should be the size of G plus metadata. Therefore, we need to be
able to determine whether a particular quota delta we are processing
predates simple quota enablement.
To do this, store the transaction id when quotas were enabled. In
fs_info for immediate use and in the quota status item to make it
recoverable on mount. When we see a delta, check if the generation of
the extent item is less than that of quota enablement. If so, we should
ignore the delta from this extent.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Consider the following sequence:
- enable quotas
- create subvol S id 256 at dir outer/
- create a qgroup 1/100
- add 0/256 (S's auto qgroup) to 1/100
- create subvol T id 257 at dir outer/inner/
With full qgroups, there is no relationship between 0/257 and either of
0/256 or 1/100. There is an inherit feature that the creator of inner/
can use to specify it ought to be in 1/100.
Simple quotas are targeted at container isolation, where such automatic
inheritance for not necessarily trusted/controlled nested subvol
creation would be quite helpful. Therefore, add a new default behavior
for simple quotas: when you create a nested subvol, automatically
inherit as parents any parents of the qgroup of the subvol the new inode
is going in.
In our example, 257/0 would also be under 1/100, allowing easy control
of a total quota over an arbitrary hierarchy of subvolumes.
I think this _might_ be a generally useful behavior, so it could be
interesting to put it behind a new inheritance flag that simple quotas
always use while traditional quotas let the user specify, but this is a
minimally intrusive change to start.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
| |
Rather than re-computing shared/exclusive ownership based on backrefs
and walking roots for implicit backrefs, simple quotas does an increment
when creating an extent and a decrement when deleting it. Add the API
for the extent item code to use to track those events.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Pull creating the qgroup earlier in the snapshot. This allows simple
quotas qgroups to see all the metadata writes related to the snapshot
being created and to be born with the root node accounted.
Note this has an impact on transaction commit where the qgroup creation
can do a lot of work, allocate memory and take locks. The change is done
for correctness, potential performance issues will be fixed in the
future.
Signed-off-by: Boris Burkov <boris@bur.io>
[ add note ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following sequence:
enable simple quotas
do some writes
reserve space
create ordered_extent
release rsv (store rsv_bytes in OE, mark QGROUP_RESERVED bits)
disable quotas
enable simple quotas
set qgroup rsv to 0 on all subvolumes
ordered_extent finishes
create delayed ref with rsv_bytes from before
run delayed ref
record_simple_quota_delta
free rsv_bytes (0 -> -rsv_delta)
results in us reliably underflowing the subvolume's qgroup rsv counter,
because disabling/re-enabling quotas toggles reservation counters down
to 0, but does not remove other file system state which represents
successful acquisition of qgroup rsv space. Specifically metadata rsv
counters on the root object and rsv_bytes on ordered_extent objects that
have released their reservation as well as the corresponding
QGROUP_RESERVED extent bits.
Normal qgroups gets away with this, I believe because it forces more
work to happen on transaction commit, but I am not certain it is totally
safe from the ordered_extent/leaked extent bit variant. Simple quotas
hits this reliably.
The intent of the fix is to make disable take the time to clear that
external to qgroups state as well: after flipping off the quota bit on
fs_info, flush delalloc and ordered extents, clearing the extent bits
along the way. This makes it so there are no ordered extents or meta
prealloc hanging around from the first enablement period during the second.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
| |
Add a new quota mode called "simple quotas". It can be enabled by the
existing quota enable ioctl via a new command, and sets an incompat
bit, as the implementation of simple quotas will make backwards
incompatible changes to the disk format of the extent tree.
Signed-off-by: Boris Burkov <boris@bur.io>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
| |
In preparation for introducing simple quotas, change from a binary
setting for quotas to an enum based mode. Initially, the possible modes
are disabled/full. Full quotas is normal btrfs qgroups.
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Boris Burkov <boris@bur.io>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There are two callbacks defined in btrfs_work but only two actually make
use of them, otherwise there are NULLs. We can get rid of the freeing
callback making it a special case of the normal work. This reduces the
size of btrfs_work by 8 bytes, final layout:
struct btrfs_work {
btrfs_func_t func; /* 0 8 */
btrfs_ordered_func_t ordered_func; /* 8 8 */
struct work_struct normal_work; /* 16 32 */
struct list_head ordered_list; /* 48 16 */
/* --- cacheline 1 boundary (64 bytes) --- */
struct btrfs_workqueue * wq; /* 64 8 */
long unsigned int flags; /* 72 8 */
/* size: 80, cachelines: 2, members: 6 */
/* last cacheline: 16 bytes */
};
This in turn reduces size of other structures (on a release config):
- async_chunk 160 -> 152
- async_submit_bio 152 -> 144
- btrfs_async_delayed_work 104 -> 96
- btrfs_caching_control 176 -> 168
- btrfs_delalloc_work 144 -> 136
- btrfs_fs_info 3608 -> 3600
- btrfs_ordered_extent 440 -> 424
- btrfs_writepage_fixup 104 -> 96
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(),
we check if its generation matches the running transaction and if not we
just print a warning. Such mismatch is an indicator that something really
went wrong and only printing a warning message (and stack trace) is not
enough to prevent a corruption. Allowing a transaction to commit with such
an extent buffer will trigger an error if we ever try to read it from disk
due to a generation mismatch with its parent generation.
So abort the current transaction with -EUCLEAN if we notice a generation
mismatch. For this we need to pass a transaction handle to
btrfs_mark_buffer_dirty() which is always available except in test code,
in which case we can pass NULL since it operates on dummy extent buffers
and all test roots have a single node/leaf (root node at level 0).
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
| |
We keep the comments next to the implementation, there were some left
to move.
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
These functions are defined in the qgroup.c file, but not called
anymore since commit "btrfs: qgroup: use qgroup_iterator_nested to in
qgroup_update_refcnt()" so we can delete them.
fs/btrfs/qgroup.c:149:19: warning: unused function 'qgroup_to_aux'.
fs/btrfs/qgroup.c:154:36: warning: unused function 'unode_aux_to_qgroup'.
Reported-by: Abaci Robot <abaci@linux.alibaba.com>
Closes: https://bugzilla.openanolis.cn/show_bug.cgi?id=6566
Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
Signed-off-by: Jiapeng Chong <jiapeng.chong@linux.alibaba.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently we go GFP_ATOMIC allocation for qgroup relation add, this
includes the following 3 call sites:
- btrfs_read_qgroup_config()
This is not really needed, as at that time we're still in single
thread mode, and no spin lock is held.
- btrfs_add_qgroup_relation()
This one is holding a spinlock, but we're ensured to add at most one
relation, thus we can easily do a preallocation and use the
preallocated memory to avoid GFP_ATOMIC.
- btrfs_qgroup_inherit()
This is a little more tricky, as we may have as many relationships as
inherit::num_qgroups.
Thus we have to properly allocate an array then preallocate all the
memory.
This patch would remove the GFP_ATOMIC allocation for above involved
call sites, by doing preallocation before holding the spinlock, and let
__add_relation_rb() to handle the freeing of the structure.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Qgroup is the heaviest user of GFP_ATOMIC, but one call site does not
really need GFP_ATOMIC, that is add_qgroup_rb().
That function only searches the rbtree to find if we already have such
entry. If not, then it would try to allocate memory for it.
This means we can afford to pre-allocate such structure unconditionally,
then free the memory if it's not needed.
Considering this function is not a hot path, only utilized by the
following functions:
- btrfs_qgroup_inherit()
For "btrfs subvolume snapshot -i" option.
- btrfs_read_qgroup_config()
At mount time, and we're ensured there would be no existing rb tree
entry for each qgroup.
- btrfs_create_qgroup()
Thus we're completely safe to pre-allocate the extra memory for btrfs_qgroup
structure, and reduce unnecessary GFP_ATOMIC usage.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The ulist @qgroups is utilized to record all involved qgroups from both
old and new roots inside btrfs_qgroup_account_extent().
Due to the fact that qgroup_update_refcnt() itself is already utilizing
qgroup_iterator, here we have to introduce another list_head,
btrfs_qgroup::nested_iterator, allowing nested iteration.
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
qgroup_update_refcnt()
For function qgroup_update_refcnt(), we use @tmp list to iterate all the
involved qgroups of a subvolume.
It's a perfect match for qgroup_iterator facility, as that @tmp ulist
has a very limited lifespan (just inside the while() loop).
By migrating to qgroup_iterator, we can get rid of the GFP_ATOMIC memory
allocation and no error handling is needed.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
With the new qgroup_iterator_add() and qgroup_iterator_clean(), we can
get rid of the ulist and its GFP_ATOMIC memory allocation.
Furthermore we can merge the code handling the initial and parent
qgroups into one loop, and drop the @tmp ulist parameter for involved
call sites.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
| |
With the new qgroup_iterator_add() and qgroup_iterator_clean(), we can
get rid of the ulist and its GFP_ATOMIC memory allocation.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
| |
With the new qgroup_iterator_add() and qgroup_iterator_clean(), we can
get rid of the ulist and its GFP_ATOMIC memory allocation.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Qgroup heavily relies on ulist to go through all the involved
qgroups, but since we're using ulist inside fs_info->qgroup_lock
spinlock, this means we're doing a lot of GFP_ATOMIC allocations.
This patch reduces the GFP_ATOMIC usage for qgroup_reserve() by
eliminating the memory allocation completely.
This is done by moving the needed memory to btrfs_qgroup::iterator
list_head, so that we can put all the involved qgroup into a on-stack
list, thus eliminating the need to allocate memory while holding
spinlock.
The only cost is the slightly higher memory usage, but considering the
reduce GFP_ATOMIC during a hot path, it should still be acceptable.
Function qgroup_reserve() is the perfect start point for this
conversion.
Reviewed-by: Boris Burkov <boris@bur.io>
Signed-off-by: Qu Wenruo <wqu@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When flushing qgroups, we try to join a running transaction, with
btrfs_join_transaction(), and then commit the transaction. However using
btrfs_join_transaction() will result in creating a new transaction in case
there isn't any running or if there's an existing one already committing.
This is pointless as we only need to attach to an existing one that is
not committing and in case there's an existing one committing, wait for
its commit to complete. Creating and committing an empty transaction is
wasteful, pointless IO and unnecessary rotation of the backup roots.
So use btrfs_attach_transaction_barrier() instead, to avoid creating and
committing empty transactions.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When starting a qgroup rescan, we try to join a running transaction, with
btrfs_join_transaction(), and then commit the transaction. However using
btrfs_join_transaction() will result in creating a new transaction in case
there isn't any running or if there's an existing one already committing.
This is pointless as we only need to attach to an existing one that is
not committing and in case there's an existing one committing, wait for
its commit to complete. Creating and committing an empty transaction is
wasteful, pointless IO and unnecessary rotation of the backup roots.
So use btrfs_attach_transaction_barrier() instead, to avoid creating and
committing empty transactions.
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we have a transaction abort with qgroups enabled we get a warning
triggered when doing the final put on the transaction, like this:
[552.6789] ------------[ cut here ]------------
[552.6815] WARNING: CPU: 4 PID: 81745 at fs/btrfs/transaction.c:144 btrfs_put_transaction+0x123/0x130 [btrfs]
[552.6817] Modules linked in: btrfs blake2b_generic xor (...)
[552.6819] CPU: 4 PID: 81745 Comm: btrfs-transacti Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[552.6819] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.2-0-gea1b7a073390-prebuilt.qemu.org 04/01/2014
[552.6819] RIP: 0010:btrfs_put_transaction+0x123/0x130 [btrfs]
[552.6821] Code: bd a0 01 00 (...)
[552.6821] RSP: 0018:ffffa168c0527e28 EFLAGS: 00010286
[552.6821] RAX: ffff936042caed00 RBX: ffff93604a3eb448 RCX: 0000000000000000
[552.6821] RDX: ffff93606421b028 RSI: ffffffff92ff0878 RDI: ffff93606421b010
[552.6821] RBP: ffff93606421b000 R08: 0000000000000000 R09: ffffa168c0d07c20
[552.6821] R10: 0000000000000000 R11: ffff93608dc52950 R12: ffffa168c0527e70
[552.6821] R13: ffff93606421b000 R14: ffff93604a3eb420 R15: ffff93606421b028
[552.6821] FS: 0000000000000000(0000) GS:ffff93675fb00000(0000) knlGS:0000000000000000
[552.6821] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[552.6821] CR2: 0000558ad262b000 CR3: 000000014feda005 CR4: 0000000000370ee0
[552.6822] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[552.6822] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[552.6822] Call Trace:
[552.6822] <TASK>
[552.6822] ? __warn+0x80/0x130
[552.6822] ? btrfs_put_transaction+0x123/0x130 [btrfs]
[552.6824] ? report_bug+0x1f4/0x200
[552.6824] ? handle_bug+0x42/0x70
[552.6824] ? exc_invalid_op+0x14/0x70
[552.6824] ? asm_exc_invalid_op+0x16/0x20
[552.6824] ? btrfs_put_transaction+0x123/0x130 [btrfs]
[552.6826] btrfs_cleanup_transaction+0xe7/0x5e0 [btrfs]
[552.6828] ? _raw_spin_unlock_irqrestore+0x23/0x40
[552.6828] ? try_to_wake_up+0x94/0x5e0
[552.6828] ? __pfx_process_timeout+0x10/0x10
[552.6828] transaction_kthread+0x103/0x1d0 [btrfs]
[552.6830] ? __pfx_transaction_kthread+0x10/0x10 [btrfs]
[552.6832] kthread+0xee/0x120
[552.6832] ? __pfx_kthread+0x10/0x10
[552.6832] ret_from_fork+0x29/0x50
[552.6832] </TASK>
[552.6832] ---[ end trace 0000000000000000 ]---
This corresponds to this line of code:
void btrfs_put_transaction(struct btrfs_transaction *transaction)
{
(...)
WARN_ON(!RB_EMPTY_ROOT(
&transaction->delayed_refs.dirty_extent_root));
(...)
}
The warning happens because btrfs_qgroup_destroy_extent_records(), called
in the transaction abort path, we free all entries from the rbtree
"dirty_extent_root" with rbtree_postorder_for_each_entry_safe(), but we
don't actually empty the rbtree - it's still pointing to nodes that were
freed.
So set the rbtree's root node to NULL to avoid this warning (assign
RB_ROOT).
Fixes: 81f7eb00ff5b ("btrfs: destroy qgroup extent records on transaction abort")
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we disable quotas while we have a relocation of a metadata block group
that has extents belonging to the quota root, we can cause the relocation
to fail with -ENOENT. This is because relocation builds backref nodes for
extents of the quota root and later needs to walk the backrefs and access
the quota root - however if in between a task disables quotas, it results
in deleting the quota root from the root tree (with btrfs_del_root(),
called from btrfs_quota_disable().
This can be sporadically triggered by test case btrfs/255 from fstests:
$ ./check btrfs/255
FSTYP -- btrfs
PLATFORM -- Linux/x86_64 debian0 6.4.0-rc6-btrfs-next-134+ #1 SMP PREEMPT_DYNAMIC Thu Jun 15 11:59:28 WEST 2023
MKFS_OPTIONS -- /dev/sdc
MOUNT_OPTIONS -- /dev/sdc /home/fdmanana/btrfs-tests/scratch_1
btrfs/255 6s ... _check_dmesg: something found in dmesg (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.dmesg)
- output mismatch (see /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad)
--- tests/btrfs/255.out 2023-03-02 21:47:53.876609426 +0000
+++ /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad 2023-06-16 10:20:39.267563212 +0100
@@ -1,2 +1,4 @@
QA output created by 255
+ERROR: error during balancing '/home/fdmanana/btrfs-tests/scratch_1': No such file or directory
+There may be more info in syslog - try dmesg | tail
Silence is golden
...
(Run 'diff -u /home/fdmanana/git/hub/xfstests/tests/btrfs/255.out /home/fdmanana/git/hub/xfstests/results//btrfs/255.out.bad' to see the entire diff)
Ran: btrfs/255
Failures: btrfs/255
Failed 1 of 1 tests
To fix this make the quota disable operation take the cleaner mutex, as
relocation of a block group also takes this mutex. This is also what we
do when deleting a subvolume/snapshot, we take the cleaner mutex in the
cleaner kthread (at cleaner_kthread()) and then we call btrfs_del_root()
at btrfs_drop_snapshot() while under the protection of the cleaner mutex.
Fixes: bed92eae26cc ("Btrfs: qgroup implementation and prototypes")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When disabling quotas we are deleting the quota root from the list
fs_info->dirty_cowonly_roots without taking the lock that protects it,
which is struct btrfs_fs_info::trans_lock. This unsynchronized list
manipulation may cause chaos if there's another concurrent manipulation
of this list, such as when adding a root to it with
ctree.c:add_root_to_dirty_list().
This can result in all sorts of weird failures caused by a race, such as
the following crash:
[337571.278245] general protection fault, probably for non-canonical address 0xdead000000000108: 0000 [#1] PREEMPT SMP PTI
[337571.278933] CPU: 1 PID: 115447 Comm: btrfs Tainted: G W 6.4.0-rc6-btrfs-next-134+ #1
[337571.279153] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.14.0-0-g155821a1990b-prebuilt.qemu.org 04/01/2014
[337571.279572] RIP: 0010:commit_cowonly_roots+0x11f/0x250 [btrfs]
[337571.279928] Code: 85 38 06 00 (...)
[337571.280363] RSP: 0018:ffff9f63446efba0 EFLAGS: 00010206
[337571.280582] RAX: ffff942d98ec2638 RBX: ffff9430b82b4c30 RCX: 0000000449e1c000
[337571.280798] RDX: dead000000000100 RSI: ffff9430021e4900 RDI: 0000000000036070
[337571.281015] RBP: ffff942d98ec2000 R08: ffff942d98ec2000 R09: 000000000000015b
[337571.281254] R10: 0000000000000009 R11: 0000000000000001 R12: ffff942fe8fbf600
[337571.281476] R13: ffff942dabe23040 R14: ffff942dabe20800 R15: ffff942d92cf3b48
[337571.281723] FS: 00007f478adb7340(0000) GS:ffff94349fa40000(0000) knlGS:0000000000000000
[337571.281950] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[337571.282184] CR2: 00007f478ab9a3d5 CR3: 000000001e02c001 CR4: 0000000000370ee0
[337571.282416] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[337571.282647] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[337571.282874] Call Trace:
[337571.283101] <TASK>
[337571.283327] ? __die_body+0x1b/0x60
[337571.283570] ? die_addr+0x39/0x60
[337571.283796] ? exc_general_protection+0x22e/0x430
[337571.284022] ? asm_exc_general_protection+0x22/0x30
[337571.284251] ? commit_cowonly_roots+0x11f/0x250 [btrfs]
[337571.284531] btrfs_commit_transaction+0x42e/0xf90 [btrfs]
[337571.284803] ? _raw_spin_unlock+0x15/0x30
[337571.285031] ? release_extent_buffer+0x103/0x130 [btrfs]
[337571.285305] reset_balance_state+0x152/0x1b0 [btrfs]
[337571.285578] btrfs_balance+0xa50/0x11e0 [btrfs]
[337571.285864] ? __kmem_cache_alloc_node+0x14a/0x410
[337571.286086] btrfs_ioctl+0x249a/0x3320 [btrfs]
[337571.286358] ? mod_objcg_state+0xd2/0x360
[337571.286577] ? refill_obj_stock+0xb0/0x160
[337571.286798] ? seq_release+0x25/0x30
[337571.287016] ? __rseq_handle_notify_resume+0x3ba/0x4b0
[337571.287235] ? percpu_counter_add_batch+0x2e/0xa0
[337571.287455] ? __x64_sys_ioctl+0x88/0xc0
[337571.287675] __x64_sys_ioctl+0x88/0xc0
[337571.287901] do_syscall_64+0x38/0x90
[337571.288126] entry_SYSCALL_64_after_hwframe+0x72/0xdc
[337571.288352] RIP: 0033:0x7f478aaffe9b
So fix this by locking struct btrfs_fs_info::trans_lock before deleting
the quota root from that list.
Fixes: bed92eae26cc ("Btrfs: qgroup implementation and prototypes")
CC: stable@vger.kernel.org # 4.14+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The quota assign ioctl can currently run in parallel with a quota disable
ioctl call. The assign ioctl uses the quota root, while the disable ioctl
frees that root, and therefore we can have a use-after-free triggered in
the assign ioctl, leading to a trace like the following when KASAN is
enabled:
[672.723][T736] BUG: KASAN: slab-use-after-free in btrfs_search_slot+0x2962/0x2db0
[672.723][T736] Read of size 8 at addr ffff888022ec0208 by task btrfs_search_sl/27736
[672.724][T736]
[672.725][T736] CPU: 1 PID: 27736 Comm: btrfs_search_sl Not tainted 6.3.0-rc3 #37
[672.723][T736] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014
[672.727][T736] Call Trace:
[672.728][T736] <TASK>
[672.728][T736] dump_stack_lvl+0xd9/0x150
[672.725][T736] print_report+0xc1/0x5e0
[672.720][T736] ? __virt_addr_valid+0x61/0x2e0
[672.727][T736] ? __phys_addr+0xc9/0x150
[672.725][T736] ? btrfs_search_slot+0x2962/0x2db0
[672.722][T736] kasan_report+0xc0/0xf0
[672.729][T736] ? btrfs_search_slot+0x2962/0x2db0
[672.724][T736] btrfs_search_slot+0x2962/0x2db0
[672.723][T736] ? fs_reclaim_acquire+0xba/0x160
[672.722][T736] ? split_leaf+0x13d0/0x13d0
[672.726][T736] ? rcu_is_watching+0x12/0xb0
[672.723][T736] ? kmem_cache_alloc+0x338/0x3c0
[672.722][T736] update_qgroup_status_item+0xf7/0x320
[672.724][T736] ? add_qgroup_rb+0x3d0/0x3d0
[672.739][T736] ? do_raw_spin_lock+0x12d/0x2b0
[672.730][T736] ? spin_bug+0x1d0/0x1d0
[672.737][T736] btrfs_run_qgroups+0x5de/0x840
[672.730][T736] ? btrfs_qgroup_rescan_worker+0xa70/0xa70
[672.738][T736] ? __del_qgroup_relation+0x4ba/0xe00
[672.738][T736] btrfs_ioctl+0x3d58/0x5d80
[672.735][T736] ? tomoyo_path_number_perm+0x16a/0x550
[672.737][T736] ? tomoyo_execute_permission+0x4a0/0x4a0
[672.731][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50
[672.737][T736] ? __sanitizer_cov_trace_switch+0x54/0x90
[672.734][T736] ? do_vfs_ioctl+0x132/0x1660
[672.730][T736] ? vfs_fileattr_set+0xc40/0xc40
[672.730][T736] ? _raw_spin_unlock_irq+0x2e/0x50
[672.732][T736] ? sigprocmask+0xf2/0x340
[672.737][T736] ? __fget_files+0x26a/0x480
[672.732][T736] ? bpf_lsm_file_ioctl+0x9/0x10
[672.738][T736] ? btrfs_ioctl_get_supported_features+0x50/0x50
[672.736][T736] __x64_sys_ioctl+0x198/0x210
[672.736][T736] do_syscall_64+0x39/0xb0
[672.731][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.739][T736] RIP: 0033:0x4556ad
[672.742][T736] </TASK>
[672.743][T736]
[672.748][T736] Allocated by task 27677:
[672.743][T736] kasan_save_stack+0x22/0x40
[672.741][T736] kasan_set_track+0x25/0x30
[672.741][T736] __kasan_kmalloc+0xa4/0xb0
[672.749][T736] btrfs_alloc_root+0x48/0x90
[672.746][T736] btrfs_create_tree+0x146/0xa20
[672.744][T736] btrfs_quota_enable+0x461/0x1d20
[672.743][T736] btrfs_ioctl+0x4a1c/0x5d80
[672.747][T736] __x64_sys_ioctl+0x198/0x210
[672.749][T736] do_syscall_64+0x39/0xb0
[672.744][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.756][T736]
[672.757][T736] Freed by task 27677:
[672.759][T736] kasan_save_stack+0x22/0x40
[672.759][T736] kasan_set_track+0x25/0x30
[672.756][T736] kasan_save_free_info+0x2e/0x50
[672.751][T736] ____kasan_slab_free+0x162/0x1c0
[672.758][T736] slab_free_freelist_hook+0x89/0x1c0
[672.752][T736] __kmem_cache_free+0xaf/0x2e0
[672.752][T736] btrfs_put_root+0x1ff/0x2b0
[672.759][T736] btrfs_quota_disable+0x80a/0xbc0
[672.752][T736] btrfs_ioctl+0x3e5f/0x5d80
[672.756][T736] __x64_sys_ioctl+0x198/0x210
[672.753][T736] do_syscall_64+0x39/0xb0
[672.765][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.769][T736]
[672.768][T736] The buggy address belongs to the object at ffff888022ec0000
[672.768][T736] which belongs to the cache kmalloc-4k of size 4096
[672.769][T736] The buggy address is located 520 bytes inside of
[672.769][T736] freed 4096-byte region [ffff888022ec0000, ffff888022ec1000)
[672.760][T736]
[672.764][T736] The buggy address belongs to the physical page:
[672.761][T736] page:ffffea00008bb000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x22ec0
[672.766][T736] head:ffffea00008bb000 order:3 entire_mapcount:0 nr_pages_mapped:0 pincount:0
[672.779][T736] flags: 0xfff00000010200(slab|head|node=0|zone=1|lastcpupid=0x7ff)
[672.770][T736] raw: 00fff00000010200 ffff888012842140 ffffea000054ba00 dead000000000002
[672.770][T736] raw: 0000000000000000 0000000000040004 00000001ffffffff 0000000000000000
[672.771][T736] page dumped because: kasan: bad access detected
[672.778][T736] page_owner tracks the page as allocated
[672.777][T736] page last allocated via order 3, migratetype Unmovable, gfp_mask 0xd2040(__GFP_IO|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 88
[672.779][T736] get_page_from_freelist+0x119c/0x2d50
[672.779][T736] __alloc_pages+0x1cb/0x4a0
[672.776][T736] alloc_pages+0x1aa/0x270
[672.773][T736] allocate_slab+0x260/0x390
[672.771][T736] ___slab_alloc+0xa9a/0x13e0
[672.778][T736] __slab_alloc.constprop.0+0x56/0xb0
[672.771][T736] __kmem_cache_alloc_node+0x136/0x320
[672.789][T736] __kmalloc+0x4e/0x1a0
[672.783][T736] tomoyo_realpath_from_path+0xc3/0x600
[672.781][T736] tomoyo_path_perm+0x22f/0x420
[672.782][T736] tomoyo_path_unlink+0x92/0xd0
[672.780][T736] security_path_unlink+0xdb/0x150
[672.788][T736] do_unlinkat+0x377/0x680
[672.788][T736] __x64_sys_unlink+0xca/0x110
[672.789][T736] do_syscall_64+0x39/0xb0
[672.783][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.784][T736] page last free stack trace:
[672.787][T736] free_pcp_prepare+0x4e5/0x920
[672.787][T736] free_unref_page+0x1d/0x4e0
[672.784][T736] __unfreeze_partials+0x17c/0x1a0
[672.797][T736] qlist_free_all+0x6a/0x180
[672.796][T736] kasan_quarantine_reduce+0x189/0x1d0
[672.797][T736] __kasan_slab_alloc+0x64/0x90
[672.793][T736] kmem_cache_alloc+0x17c/0x3c0
[672.799][T736] getname_flags.part.0+0x50/0x4e0
[672.799][T736] getname_flags+0x9e/0xe0
[672.792][T736] vfs_fstatat+0x77/0xb0
[672.791][T736] __do_sys_newlstat+0x84/0x100
[672.798][T736] do_syscall_64+0x39/0xb0
[672.796][T736] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[672.790][T736]
[672.791][T736] Memory state around the buggy address:
[672.799][T736] ffff888022ec0100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.805][T736] ffff888022ec0180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.802][T736] >ffff888022ec0200: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.809][T736] ^
[672.809][T736] ffff888022ec0280: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[672.809][T736] ffff888022ec0300: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
Fix this by having the qgroup assign ioctl take the qgroup ioctl mutex
before calling btrfs_run_qgroups(), which is what all qgroup ioctls should
call.
Reported-by: butt3rflyh4ck <butterflyhuangxx@gmail.com>
Link: https://lore.kernel.org/linux-btrfs/CAFcO6XN3VD8ogmHwqRk4kbiwtpUSNySu2VAxN8waEPciCHJvMA@mail.gmail.com/
CC: stable@vger.kernel.org # 5.10+
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
| |
btrfs_clean_tree_block is a misnomer, it's just
clear_extent_buffer_dirty with some extra accounting around it. Rename
this to btrfs_clear_buffer_dirty to make it more clear it belongs with
it's setter, btrfs_mark_buffer_dirty.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We check the header generation in the extent buffer against the current
running transaction id to see if it's safe to clear DIRTY on this
buffer. Generally speaking if we're clearing the buffer dirty we're
holding the transaction open, but in the case of cleaning up an aborted
transaction we don't, so we have extra checks in that path to check the
transid. To allow for a future cleanup go ahead and pass in the trans
handle so we don't have to rely on ->running_transaction being set.
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
If we have one task trying to start the quota rescan worker while another
one is trying to disable quotas, we can end up hitting a race that results
in the quota rescan worker doing a NULL pointer dereference. The steps for
this are the following:
1) Quotas are enabled;
2) Task A calls the quota rescan ioctl and enters btrfs_qgroup_rescan().
It calls qgroup_rescan_init() which returns 0 (success) and then joins a
transaction and commits it;
3) Task B calls the quota disable ioctl and enters btrfs_quota_disable().
It clears the bit BTRFS_FS_QUOTA_ENABLED from fs_info->flags and calls
btrfs_qgroup_wait_for_completion(), which returns immediately since the
rescan worker is not yet running.
Then it starts a transaction and locks fs_info->qgroup_ioctl_lock;
4) Task A queues the rescan worker, by calling btrfs_queue_work();
5) The rescan worker starts, and calls rescan_should_stop() at the start
of its while loop, which results in 0 iterations of the loop, since
the flag BTRFS_FS_QUOTA_ENABLED was cleared from fs_info->flags by
task B at step 3);
6) Task B sets fs_info->quota_root to NULL;
7) The rescan worker tries to start a transaction and uses
fs_info->quota_root as the root argument for btrfs_start_transaction().
This results in a NULL pointer dereference down the call chain of
btrfs_start_transaction(). The stack trace is something like the one
reported in Link tag below:
general protection fault, probably for non-canonical address 0xdffffc0000000041: 0000 [#1] PREEMPT SMP KASAN
KASAN: null-ptr-deref in range [0x0000000000000208-0x000000000000020f]
CPU: 1 PID: 34 Comm: kworker/u4:2 Not tainted 6.1.0-syzkaller-13872-gb6bb9676f216 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/26/2022
Workqueue: btrfs-qgroup-rescan btrfs_work_helper
RIP: 0010:start_transaction+0x48/0x10f0 fs/btrfs/transaction.c:564
Code: 48 89 fb 48 (...)
RSP: 0018:ffffc90000ab7ab0 EFLAGS: 00010206
RAX: 0000000000000041 RBX: 0000000000000208 RCX: ffff88801779ba80
RDX: 0000000000000000 RSI: 0000000000000001 RDI: 0000000000000000
RBP: dffffc0000000000 R08: 0000000000000001 R09: fffff52000156f5d
R10: fffff52000156f5d R11: 1ffff92000156f5c R12: 0000000000000000
R13: 0000000000000001 R14: 0000000000000001 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff8880b9900000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f2bea75b718 CR3: 000000001d0cc000 CR4: 00000000003506e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
btrfs_qgroup_rescan_worker+0x3bb/0x6a0 fs/btrfs/qgroup.c:3402
btrfs_work_helper+0x312/0x850 fs/btrfs/async-thread.c:280
process_one_work+0x877/0xdb0 kernel/workqueue.c:2289
worker_thread+0xb14/0x1330 kernel/workqueue.c:2436
kthread+0x266/0x300 kernel/kthread.c:376
ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:308
</TASK>
Modules linked in:
So fix this by having the rescan worker function not attempt to start a
transaction if it didn't do any rescan work.
Reported-by: syzbot+96977faa68092ad382c4@syzkaller.appspotmail.com
Link: https://lore.kernel.org/linux-btrfs/000000000000e5454b05f065a803@google.com/
Fixes: e804861bd4e6 ("btrfs: fix deadlock between quota disable and qgroup rescan worker")
CC: stable@vger.kernel.org # 5.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
[BUG]
There are some reports from the mailing list that since v6.1 kernel, the
WARN_ON() inside btrfs_qgroup_account_extent() gets triggered during
rescan:
WARNING: CPU: 3 PID: 6424 at fs/btrfs/qgroup.c:2756 btrfs_qgroup_account_extents+0x1ae/0x260 [btrfs]
CPU: 3 PID: 6424 Comm: snapperd Tainted: P OE 6.1.2-1-default #1 openSUSE Tumbleweed 05c7a1b1b61d5627475528f71f50444637b5aad7
RIP: 0010:btrfs_qgroup_account_extents+0x1ae/0x260 [btrfs]
Call Trace:
<TASK>
btrfs_commit_transaction+0x30c/0xb40 [btrfs c39c9c546c241c593f03bd6d5f39ea1b676250f6]
? start_transaction+0xc3/0x5b0 [btrfs c39c9c546c241c593f03bd6d5f39ea1b676250f6]
btrfs_qgroup_rescan+0x42/0xc0 [btrfs c39c9c546c241c593f03bd6d5f39ea1b676250f6]
btrfs_ioctl+0x1ab9/0x25c0 [btrfs c39c9c546c241c593f03bd6d5f39ea1b676250f6]
? __rseq_handle_notify_resume+0xa9/0x4a0
? mntput_no_expire+0x4a/0x240
? __seccomp_filter+0x319/0x4d0
__x64_sys_ioctl+0x90/0xd0
do_syscall_64+0x5b/0x80
? syscall_exit_to_user_mode+0x17/0x40
? do_syscall_64+0x67/0x80
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0033:0x7fd9b790d9bf
</TASK>
[CAUSE]
Since commit e15e9f43c7ca ("btrfs: introduce
BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting"), if
our qgroup is already in inconsistent state, we will no longer do the
time-consuming backref walk.
This can leave some qgroup records without a valid old_roots ulist.
Normally this is fine, as btrfs_qgroup_account_extents() would also skip
those records if we have NO_ACCOUNTING flag set.
But there is a small window, if we have NO_ACCOUNTING flag set, and
inserted some qgroup_record without a old_roots ulist, but then the user
triggered a qgroup rescan.
During btrfs_qgroup_rescan(), we firstly clear NO_ACCOUNTING flag, then
commit current transaction.
And since we have a qgroup_record with old_roots = NULL, we trigger the
WARN_ON() during btrfs_qgroup_account_extents().
[FIX]
Unfortunately due to the introduction of NO_ACCOUNTING flag, the
assumption that every qgroup_record would have its old_roots populated
is no longer correct.
Fix the false alerts and drop the WARN_ON().
Reported-by: Lukas Straub <lukasstraub2@web.de>
Reported-by: HanatoK <summersnow9403@gmail.com>
Fixes: e15e9f43c7ca ("btrfs: introduce BTRFS_QGROUP_RUNTIME_FLAG_NO_ACCOUNTING to skip qgroup accounting")
CC: stable@vger.kernel.org # 6.1
Link: https://lore.kernel.org/linux-btrfs/2403c697-ddaf-58ad-3829-0335fc89df09@gmail.com/
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|