summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/extent-tree.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* btrfs: do not start relocation until in progress drops are doneJosef Bacik2022-03-021-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We hit a bug with a recovering relocation on mount for one of our file systems in production. I reproduced this locally by injecting errors into snapshot delete with balance running at the same time. This presented as an error while looking up an extent item WARNING: CPU: 5 PID: 1501 at fs/btrfs/extent-tree.c:866 lookup_inline_extent_backref+0x647/0x680 CPU: 5 PID: 1501 Comm: btrfs-balance Not tainted 5.16.0-rc8+ #8 RIP: 0010:lookup_inline_extent_backref+0x647/0x680 RSP: 0018:ffffae0a023ab960 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 000000000000000c RDI: 0000000000000000 RBP: ffff943fd2a39b60 R08: 0000000000000000 R09: 0000000000000001 R10: 0001434088152de0 R11: 0000000000000000 R12: 0000000001d05000 R13: ffff943fd2a39b60 R14: ffff943fdb96f2a0 R15: ffff9442fc923000 FS: 0000000000000000(0000) GS:ffff944e9eb40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f1157b1fca8 CR3: 000000010f092000 CR4: 0000000000350ee0 Call Trace: <TASK> insert_inline_extent_backref+0x46/0xd0 __btrfs_inc_extent_ref.isra.0+0x5f/0x200 ? btrfs_merge_delayed_refs+0x164/0x190 __btrfs_run_delayed_refs+0x561/0xfa0 ? btrfs_search_slot+0x7b4/0xb30 ? btrfs_update_root+0x1a9/0x2c0 btrfs_run_delayed_refs+0x73/0x1f0 ? btrfs_update_root+0x1a9/0x2c0 btrfs_commit_transaction+0x50/0xa50 ? btrfs_update_reloc_root+0x122/0x220 prepare_to_merge+0x29f/0x320 relocate_block_group+0x2b8/0x550 btrfs_relocate_block_group+0x1a6/0x350 btrfs_relocate_chunk+0x27/0xe0 btrfs_balance+0x777/0xe60 balance_kthread+0x35/0x50 ? btrfs_balance+0xe60/0xe60 kthread+0x16b/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x22/0x30 </TASK> Normally snapshot deletion and relocation are excluded from running at the same time by the fs_info->cleaner_mutex. However if we had a pending balance waiting to get the ->cleaner_mutex, and a snapshot deletion was running, and then the box crashed, we would come up in a state where we have a half deleted snapshot. Again, in the normal case the snapshot deletion needs to complete before relocation can start, but in this case relocation could very well start before the snapshot deletion completes, as we simply add the root to the dead roots list and wait for the next time the cleaner runs to clean up the snapshot. Fix this by setting a bit on the fs_info if we have any DEAD_ROOT's that had a pending drop_progress key. If they do then we know we were in the middle of the drop operation and set a flag on the fs_info. Then balance can wait until this flag is cleared to start up again. If there are DEAD_ROOT's that don't have a drop_progress set then we're safe to start balance right away as we'll be properly protected by the cleaner_mutex. CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: fix chunk allocation condition for zoned allocatorNaohiro Aota2022-01-071-12/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The ZNS specification defines a limit on the number of "active" zones. That limit impose us to limit the number of block groups which can be used for an allocation at the same time. Not to exceed the limit, we reuse the existing active block groups as much as possible when we can't activate any other zones without sacrificing an already activated block group in commit a85f05e59bc1 ("btrfs: zoned: avoid chunk allocation if active block group has enough space"). However, the check is wrong in two ways. First, it checks the condition for every raid index (ffe_ctl->index). Even if it reaches the condition and "ffe_ctl->max_extent_size >= ffe_ctl->min_alloc_size" is met, there can be other block groups having enough space to hold ffe_ctl->num_bytes. (Actually, this won't happen in the current zoned code as it only supports SINGLE profile. But, it can happen once it enables other RAID types.) Second, it checks the active zone availability depending on the raid index. The raid index is just an index for space_info->block_groups, so it has nothing to do with chunk allocation. These mistakes are causing a faulty allocation in a certain situation. Consider we are running zoned btrfs on a device whose max_active_zone == 0 (no limit). And, suppose no block group have a room to fit ffe_ctl->num_bytes but some room to meet ffe_ctl->min_alloc_size (i.e. max_extent_size > num_bytes >= min_alloc_size). In this situation, the following occur: - With SINGLE raid_index, it reaches the chunk allocation checking code - The check returns true because we can activate a new zone (no limit) - But, before allocating the chunk, it iterates to the next raid index (RAID5) - Since there are no RAID5 block groups on zoned mode, it again reaches the check code - The check returns false because of btrfs_can_activate_zone()'s "if (raid_index != BTRFS_RAID_SINGLE)" part - That results in returning -ENOSPC without allocating a new chunk As a result, we end up hitting -ENOSPC too early. Move the check to the right place in the can_allocate_chunk() hook, and do the active zone check depending on the allocation flag, not on the raid index. CC: stable@vger.kernel.org # 5.16 Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: add extent allocator hook to decide to allocate chunk or notNaohiro Aota2022-01-071-0/+17
| | | | | | | | | | | | | | | Introduce a new hook for an extent allocator policy. With the new hook, a policy can decide to allocate a new block group or not. If not, it will return -ENOSPC, so btrfs_reserve_extent() will cut the allocation size in half and retry the allocation if min_alloc_size is large enough. The hook has a place holder and will be replaced with the real implementation in the next patch. CC: stable@vger.kernel.org # 5.16 Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: unset dedicated block group on allocation failureNaohiro Aota2022-01-071-4/+16
| | | | | | | | | | | | | | | | | | | Allocating an extent from a block group can fail for various reasons. When an allocation from a dedicated block group (for tree-log or relocation data) fails, we need to unregister it as a dedicated one so that we can allocate a new block group for the dedicated one. However, we are returning early when the block group in case it is read-only, fully used, or not be able to activate the zone. As a result, we keep the non-usable block group as a dedicated one, leading to further allocation failure. With many block groups, the allocator will iterate hopeless loop to find a free extent, results in a hung task. Fix the issue by delaying the return and doing the proper cleanups. CC: stable@vger.kernel.org # 5.16 Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: track the csum, extent, and free space trees in a rb treeJosef Bacik2022-01-031-0/+1
| | | | | | | | | | | | | | In the future we are going to have multiple copies of these trees. To facilitate this we need a way to lookup the different roots we are looking for. Handle this by adding a global root rb tree that is indexed on the root->root_key. Then instead of loading the roots at mount time with individually targeted keys, simply search the tree_root for anything with the specific objectid we want. This will make it straightforward to support both old style and new style file systems. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: stop accessing ->csum_root directlyJosef Bacik2022-01-031-3/+8
| | | | | | | | | | We are going to have multiple csum roots in the future, so convert all users of ->csum_root to btrfs_csum_root() and rename ->csum_root to ->_csum_root so we can easily find remaining users in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: stop accessing ->extent_root directlyJosef Bacik2022-01-031-15/+25
| | | | | | | | | | | | When we start having multiple extent roots we'll need to use a helper to get to the correct extent_root. Rename fs_info->extent_root to _extent_root and convert all of the users of the extent root to using the btrfs_extent_root() helper. This will allow us to easily clean up the remaining direct accesses in the future. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: use chunk_root in find_free_extent_update_loopJosef Bacik2022-01-031-1/+1
| | | | | | | | | | | | We're only using this to start the transaction with to possibly allocate a chunk. It doesn't really matter which root to use, but with extent tree v2 we'll need a bytenr to look up a extent root which makes the usage of the extent_root awkward here. Simply change it to the chunk_root. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: make remove_extent_backref pass the rootJosef Bacik2022-01-031-8/+10
| | | | | | | | | | With extent tree v2 we'll have a different extent root based on where the bytenr is located, so adjust the remove_extent_backref() helper and it's helpers to pass the extent_root around. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: drop the _nr from the item helpersJosef Bacik2022-01-031-7/+7
| | | | | | | | | | | Now that all call sites are using the slot number to modify item values, rename the SETGET helpers to raw_item_*(), and then rework the _nr() helpers to be the btrfs_item_*() btrfs_set_item_*() helpers, and then rename all of the callers to the new helpers. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix invalid delayed ref after subvolume creation failureFilipe Manana2021-12-151-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When creating a subvolume, at ioctl.c:create_subvol(), if we fail to insert the new root's root item into the root tree, we are freeing the metadata extent we reserved for the new root to prevent a metadata extent leak, as we don't abort the transaction at that point (since there is nothing at that point that is irreversible). However we allocated the metadata extent for the new root which we are creating for the new subvolume, so its delayed reference refers to the ID of this new root. But when we free the metadata extent we pass the root of the subvolume where the new subvolume is located to btrfs_free_tree_block() - this is incorrect because this will generate a delayed reference that refers to the ID of the parent subvolume's root, and not to ID of the new root. This results in a failure when running delayed references that leads to a transaction abort and a trace like the following: [3868.738042] RIP: 0010:__btrfs_free_extent+0x709/0x950 [btrfs] [3868.739857] Code: 68 0f 85 e6 fb ff (...) [3868.742963] RSP: 0018:ffffb0e9045cf910 EFLAGS: 00010246 [3868.743908] RAX: 00000000fffffffe RBX: 00000000fffffffe RCX: 0000000000000002 [3868.745312] RDX: 00000000fffffffe RSI: 0000000000000002 RDI: ffff90b0cd793b88 [3868.746643] RBP: 000000000e5d8000 R08: 0000000000000000 R09: ffff90b0cd793b88 [3868.747979] R10: 0000000000000002 R11: 00014ded97944d68 R12: 0000000000000000 [3868.749373] R13: ffff90b09afe4a28 R14: 0000000000000000 R15: ffff90b0cd793b88 [3868.750725] FS: 00007f281c4a8b80(0000) GS:ffff90b3ada00000(0000) knlGS:0000000000000000 [3868.752275] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [3868.753515] CR2: 00007f281c6a5000 CR3: 0000000108a42006 CR4: 0000000000370ee0 [3868.754869] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [3868.756228] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [3868.757803] Call Trace: [3868.758281] <TASK> [3868.758655] ? btrfs_merge_delayed_refs+0x178/0x1c0 [btrfs] [3868.759827] __btrfs_run_delayed_refs+0x2b1/0x1250 [btrfs] [3868.761047] btrfs_run_delayed_refs+0x86/0x210 [btrfs] [3868.762069] ? lock_acquired+0x19f/0x420 [3868.762829] btrfs_commit_transaction+0x69/0xb20 [btrfs] [3868.763860] ? _raw_spin_unlock+0x29/0x40 [3868.764614] ? btrfs_block_rsv_release+0x1c2/0x1e0 [btrfs] [3868.765870] create_subvol+0x1d8/0x9a0 [btrfs] [3868.766766] btrfs_mksubvol+0x447/0x4c0 [btrfs] [3868.767669] ? preempt_count_add+0x49/0xa0 [3868.768444] __btrfs_ioctl_snap_create+0x123/0x190 [btrfs] [3868.769639] ? _copy_from_user+0x66/0xa0 [3868.770391] btrfs_ioctl_snap_create_v2+0xbb/0x140 [btrfs] [3868.771495] btrfs_ioctl+0xd1e/0x35c0 [btrfs] [3868.772364] ? __slab_free+0x10a/0x360 [3868.773198] ? rcu_read_lock_sched_held+0x12/0x60 [3868.774121] ? lock_release+0x223/0x4a0 [3868.774863] ? lock_acquired+0x19f/0x420 [3868.775634] ? rcu_read_lock_sched_held+0x12/0x60 [3868.776530] ? trace_hardirqs_on+0x1b/0xe0 [3868.777373] ? _raw_spin_unlock_irqrestore+0x3e/0x60 [3868.778280] ? kmem_cache_free+0x321/0x3c0 [3868.779011] ? __x64_sys_ioctl+0x83/0xb0 [3868.779718] __x64_sys_ioctl+0x83/0xb0 [3868.780387] do_syscall_64+0x3b/0xc0 [3868.781059] entry_SYSCALL_64_after_hwframe+0x44/0xae [3868.781953] RIP: 0033:0x7f281c59e957 [3868.782585] Code: 3c 1c 48 f7 d8 4c (...) [3868.785867] RSP: 002b:00007ffe1f83e2b8 EFLAGS: 00000202 ORIG_RAX: 0000000000000010 [3868.787198] RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f281c59e957 [3868.788450] RDX: 00007ffe1f83e2c0 RSI: 0000000050009418 RDI: 0000000000000003 [3868.789748] RBP: 00007ffe1f83f300 R08: 0000000000000000 R09: 00007ffe1f83fe36 [3868.791214] R10: 0000000000000000 R11: 0000000000000202 R12: 0000000000000003 [3868.792468] R13: 0000000000000003 R14: 00007ffe1f83e2c0 R15: 00000000000003cc [3868.793765] </TASK> [3868.794037] irq event stamp: 0 [3868.794548] hardirqs last enabled at (0): [<0000000000000000>] 0x0 [3868.795670] hardirqs last disabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.797086] softirqs last enabled at (0): [<ffffffff98294214>] copy_process+0x934/0x2040 [3868.798309] softirqs last disabled at (0): [<0000000000000000>] 0x0 [3868.799284] ---[ end trace be24c7002fe27747 ]--- [3868.799928] BTRFS info (device dm-0): leaf 241188864 gen 1268 total ptrs 214 free space 469 owner 2 [3868.801133] BTRFS info (device dm-0): refs 2 lock_owner 225627 current 225627 [3868.802056] item 0 key (237436928 169 0) itemoff 16250 itemsize 33 [3868.802863] extent refs 1 gen 1265 flags 2 [3868.803447] ref#0: tree block backref root 1610 (...) [3869.064354] item 114 key (241008640 169 0) itemoff 12488 itemsize 33 [3869.065421] extent refs 1 gen 1268 flags 2 [3869.066115] ref#0: tree block backref root 1689 (...) [3869.403834] BTRFS error (device dm-0): unable to find ref byte nr 241008640 parent 0 root 1622 owner 0 offset 0 [3869.405641] BTRFS: error (device dm-0) in __btrfs_free_extent:3076: errno=-2 No such entry [3869.407138] BTRFS: error (device dm-0) in btrfs_run_delayed_refs:2159: errno=-2 No such entry Fix this by passing the new subvolume's root ID to btrfs_free_tree_block(). This requires changing the root argument of btrfs_free_tree_block() from struct btrfs_root * to a u64, since at this point during the subvolume creation we have not yet created the struct btrfs_root for the new subvolume, and btrfs_free_tree_block() only needs a root ID and nothing else from a struct btrfs_root. This was triggered by test case generic/475 from fstests. Fixes: 67addf29004c5b ("btrfs: fix metadata extent leak after failure to create subvolume") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fail if fstrim_range->start == U64_MAXJosef Bacik2021-12-081-0/+3
| | | | | | | | | | | | We've always been failing generic/260 because it's testing things we actually don't care about and thus won't fail for. However we probably should fail for fstrim_range->start == U64_MAX since we clearly can't trim anything past that. This in combination with an update to generic/260 will allow us to pass this test properly. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: reduce btrfs_update_block_group alloc argument to boolAnand Jain2021-10-261-3/+3
| | | | | | | | | | | btrfs_update_block_group() accounts for the number of bytes allocated or freed. Argument @alloc specifies whether the call is for alloc or free. Convert the argument @alloc type from int to bool. Reviewed-by: Su Yue <l@damenly.su> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: pull up qgroup checks from delayed-ref core to init timeNikolay Borisov2021-10-261-5/+0
| | | | | | | | | | | | Instead of checking whether qgroup processing for a dealyed ref has to happen in the core of delayed ref, simply pull the check at init time of respective delayed ref structures. This eliminates the final use of real_root in delayed-ref core paving the way to making this member optional. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: add additional parameters to btrfs_init_tree_ref/btrfs_init_data_refNikolay Borisov2021-10-261-6/+11
| | | | | | | | | | | | In order to make 'real_root' used only in ref-verify it's required to have the necessary context to perform the same checks that this member is used for. So add 'mod_root' which will contain the root on behalf of which a delayed ref was created and a 'skip_group' parameter which will contain callsite-specific override of skip_qgroup. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rename root fields in delayed refs structsNikolay Borisov2021-10-261-5/+5
| | | | | | | | | | | | | Both data and metadata delayed ref structures have fields named root/ref_root respectively. Those are somewhat cryptic and don't really convey the real meaning. In fact those roots are really the original owners of the respective block (i.e in case of a snapshot a data delayed ref will contain the original root that owns the given block). Rename those fields accordingly and adjust comments. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: assert that extent buffers are write locked instead of only lockedFilipe Manana2021-10-261-2/+2
| | | | | | | | | | | | | | | | | | | | | | | We currently use lockdep_assert_held() at btrfs_assert_tree_locked(), and that checks that we hold a lock either in read mode or write mode. However in all contexts we use btrfs_assert_tree_locked(), we actually want to check if we are holding a write lock on the extent buffer's rw semaphore - it would be a bug if in any of those contexts we were holding a read lock instead. So change btrfs_assert_tree_locked() to use lockdep_assert_held_write() instead and, to make it more explicit, rename btrfs_assert_tree_locked() to btrfs_assert_tree_write_locked(), so that it's clear we want to check we are holding a write lock. For now there are no contexts where we want to assert that we must have a read lock, but in case that is needed in the future, we can add a new helper function that just calls out lockdep_assert_held_read(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rename btrfs_bio to btrfs_io_contextQu Wenruo2021-10-261-10/+9
| | | | | | | | | | | | | | | | | | | | The structure btrfs_bio is used by two different sites: - bio->bi_private for mirror based profiles For those profiles (SINGLE/DUP/RAID1*/RAID10), this structures records how many mirrors are still pending, and save the original endio function of the bio. - RAID56 code In that case, RAID56 only utilize the stripes info, and no long uses that to trace the pending mirrors. So btrfs_bio is not always bind to a bio, and contains more info for IO context, thus renaming it will make the naming less confusing. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: let the for_treelog test in the allocator stand outJohannes Thumshirn2021-10-261-3/+4
| | | | | | | | | | | | | | | The statement which decides if an extent allocation on a zoned device is for the dedicated tree-log block group or not and if we can use the block group we picked for this allocation is not easy to read but an important part of the allocator. Rewrite into an if condition instead of a plain boolean test to make it stand out more, like the version which tests for the dedicated data-relocation block group. Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: add a dedicated data relocation block groupJohannes Thumshirn2021-10-261-2/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Relocation in a zoned filesystem can fail with a transaction abort with error -22 (EINVAL). This happens because the relocation code assumes that the extents we relocated the data to have the same size the source extents had and ensures this by preallocating the extents. But in a zoned filesystem we currently can't preallocate the extents as this would break the sequential write required rule. Therefore it can happen that the writeback process kicks in while we're still adding pages to a delalloc range and starts writing out dirty pages. This then creates destination extents that are smaller than the source extents, triggering the following safety check in get_new_location(): 1034 if (num_bytes != btrfs_file_extent_disk_num_bytes(leaf, fi)) { 1035 ret = -EINVAL; 1036 goto out; 1037 } Temporarily create a dedicated block group for the relocation process, so no non-relocation data writes can interfere with the relocation writes. This is needed that we can switch the relocation process on a zoned filesystem from the REQ_OP_ZONE_APPEND writing we use for data to a scheme like in a non-zoned filesystem using REQ_OP_WRITE and preallocation. Fixes: 32430c614844 ("btrfs: zoned: enable relocation on a zoned filesystem") Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: introduce btrfs_is_data_reloc_rootJohannes Thumshirn2021-10-261-1/+1
| | | | | | | | | | | | | There are several places in our codebase where we check if a root is the root of the data reloc tree and subsequent patches will introduce more. Factor out the check into a small helper function instead of open coding it multiple times. Reviewed-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: avoid chunk allocation if active block group has enough spaceNaohiro Aota2021-10-261-7/+21
| | | | | | | | | | | | | | | | | | The current extent allocator tries to allocate a new block group when the existing block groups do not have enough space. On a ZNS device, a new block group means a new active zone. If the number of active zones has already reached the max_active_zones, activating a new zone needs to finish an existing zone, leading to wasting the free space there. So, instead, it should reuse the existing active block groups as much as possible when we can't activate any other zones without sacrificing an already activated block group. While at it, I converted find_free_extent_update_loop() to check the found_extent() case early and made the other conditions simpler. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: move ffe_ctl one level upNaohiro Aota2021-10-261-75/+87
| | | | | | | | | | We are passing too many variables as it is from btrfs_reserve_extent() to find_free_extent(). The next commit will add min_alloc_size to ffe_ctl, and that means another pass-through argument. Take this opportunity to move ffe_ctl one level up and drop the redundant arguments. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: activate block group on allocationNaohiro Aota2021-10-261-0/+12
| | | | | | | | | Activate a block group when trying to allocate an extent from it. We check read-only case and no space left case before trying to activate a block group not to consume the number of active zones uselessly. Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: calculate free space from zone capacityNaohiro Aota2021-10-261-1/+2
| | | | | | | | | | | Now that we introduced capacity in a block group, we need to calculate free space using the capacity instead of the length. Thus, bytes we account capacity - alloc_pointer as free, and account bytes [capacity, length] as zone unusable. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: unlock newly allocated extent buffer after errorQu Wenruo2021-10-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] There is a bug report that injected ENOMEM error could leave a tree block locked while we return to user-space: BTRFS info (device loop0): enabling ssd optimizations FAULT_INJECTION: forcing a failure. name failslab, interval 1, probability 0, space 0, times 0 CPU: 0 PID: 7579 Comm: syz-executor Not tainted 5.15.0-rc1 #16 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.0-59-gc9ba5276e321-prebuilt.qemu.org 04/01/2014 Call Trace: __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x8d/0xcf lib/dump_stack.c:106 fail_dump lib/fault-inject.c:52 [inline] should_fail+0x13c/0x160 lib/fault-inject.c:146 should_failslab+0x5/0x10 mm/slab_common.c:1328 slab_pre_alloc_hook.constprop.99+0x4e/0xc0 mm/slab.h:494 slab_alloc_node mm/slub.c:3120 [inline] slab_alloc mm/slub.c:3214 [inline] kmem_cache_alloc+0x44/0x280 mm/slub.c:3219 btrfs_alloc_delayed_extent_op fs/btrfs/delayed-ref.h:299 [inline] btrfs_alloc_tree_block+0x38c/0x670 fs/btrfs/extent-tree.c:4833 __btrfs_cow_block+0x16f/0x7d0 fs/btrfs/ctree.c:415 btrfs_cow_block+0x12a/0x300 fs/btrfs/ctree.c:570 btrfs_search_slot+0x6b0/0xee0 fs/btrfs/ctree.c:1768 btrfs_insert_empty_items+0x80/0xf0 fs/btrfs/ctree.c:3905 btrfs_new_inode+0x311/0xa60 fs/btrfs/inode.c:6530 btrfs_create+0x12b/0x270 fs/btrfs/inode.c:6783 lookup_open+0x660/0x780 fs/namei.c:3282 open_last_lookups fs/namei.c:3352 [inline] path_openat+0x465/0xe20 fs/namei.c:3557 do_filp_open+0xe3/0x170 fs/namei.c:3588 do_sys_openat2+0x357/0x4a0 fs/open.c:1200 do_sys_open+0x87/0xd0 fs/open.c:1216 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x34/0xb0 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x46ae99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f46711b9c48 EFLAGS: 00000246 ORIG_RAX: 0000000000000055 RAX: ffffffffffffffda RBX: 000000000078c0a0 RCX: 000000000046ae99 RDX: 0000000000000000 RSI: 00000000000000a1 RDI: 0000000020005800 RBP: 00007f46711b9c80 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000017 R13: 0000000000000000 R14: 000000000078c0a0 R15: 00007ffc129da6e0 ================================================ WARNING: lock held when returning to user space! 5.15.0-rc1 #16 Not tainted ------------------------------------------------ syz-executor/7579 is leaving the kernel with locks still held! 1 lock held by syz-executor/7579: #0: ffff888104b73da8 (btrfs-tree-01/1){+.+.}-{3:3}, at: __btrfs_tree_lock+0x2e/0x1a0 fs/btrfs/locking.c:112 [CAUSE] In btrfs_alloc_tree_block(), after btrfs_init_new_buffer(), the new extent buffer @buf is locked, but if later operations like adding delayed tree ref fail, we just free @buf without unlocking it, resulting above warning. [FIX] Unlock @buf in out_free_buf: label. Reported-by: Hao Sun <sunhao.th@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CACkBjsZ9O6Zr0KK1yGn=1rQi6Crh1yeCRdTSBxx9R99L4xdn-Q@mail.gmail.com/ CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: pass NULL as trans to btrfs_search_slot if we only want to searchMarcos Paulo de Souza2021-08-231-1/+1
| | | | | | | | | | | | | Using a transaction in btrfs_search_slot is only useful when we are searching to add or modify the tree. When the function is used for searching, insert length and mod arguments are 0, there is no need to use a transaction. No functional changes, changing for consistency. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Marcos Paulo de Souza <mpdesouza@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: cleanup fs_devices pointer usage in btrfs_trim_fsAnand Jain2021-08-231-5/+5
| | | | | | | | | | Drop variable 'devices' (used only once) and add new variable for the fs_devices, so it is used at two locations within btrfs_trim_fs() function and also helps to access fs_devices->devices. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: check for missing device in btrfs_trim_fsAnand Jain2021-07-221-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A fstrim on a degraded raid1 can trigger the following null pointer dereference: BTRFS info (device loop0): allowing degraded mounts BTRFS info (device loop0): disk space caching is enabled BTRFS info (device loop0): has skinny extents BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing BTRFS warning (device loop0): devid 2 uuid 97ac16f7-e14d-4db1-95bc-3d489b424adb is missing BTRFS info (device loop0): enabling ssd optimizations BUG: kernel NULL pointer dereference, address: 0000000000000620 PGD 0 P4D 0 Oops: 0000 [#1] SMP NOPTI CPU: 0 PID: 4574 Comm: fstrim Not tainted 5.13.0-rc7+ #31 Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006 RIP: 0010:btrfs_trim_fs+0x199/0x4a0 [btrfs] RSP: 0018:ffff959541797d28 EFLAGS: 00010293 RAX: 0000000000000000 RBX: ffff946f84eca508 RCX: a7a67937adff8608 RDX: ffff946e8122d000 RSI: 0000000000000000 RDI: ffffffffc02fdbf0 RBP: ffff946ea4615000 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: ffff946e8122d960 R12: 0000000000000000 R13: ffff959541797db8 R14: ffff946e8122d000 R15: ffff959541797db8 FS: 00007f55917a5080(0000) GS:ffff946f9bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000620 CR3: 000000002d2c8001 CR4: 00000000000706f0 Call Trace: btrfs_ioctl_fitrim+0x167/0x260 [btrfs] btrfs_ioctl+0x1c00/0x2fe0 [btrfs] ? selinux_file_ioctl+0x140/0x240 ? syscall_trace_enter.constprop.0+0x188/0x240 ? __x64_sys_ioctl+0x83/0xb0 __x64_sys_ioctl+0x83/0xb0 Reproducer: $ mkfs.btrfs -fq -d raid1 -m raid1 /dev/loop0 /dev/loop1 $ mount /dev/loop0 /btrfs $ umount /btrfs $ btrfs dev scan --forget $ mount -o degraded /dev/loop0 /btrfs $ fstrim /btrfs The reason is we call btrfs_trim_free_extents() for the missing device, which uses device->bdev (NULL for missing device) to find if the device supports discard. Fix is to check if the device is missing before calling btrfs_trim_free_extents(). CC: stable@vger.kernel.org # 5.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rip out btrfs_space_info::total_bytes_pinnedJosef Bacik2021-06-221-15/+0
| | | | | | | | | | | We used this in may_commit_transaction() in order to determine if we needed to commit the transaction. However we no longer have that logic and thus have no use of this counter anymore, so delete it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix typos in commentsDavid Sterba2021-06-221-1/+1
| | | | | | Fix typos that have snuck in since the last round. Found by codespell. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: always abort the transaction if we abort a trans handleJosef Bacik2021-06-211-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While stress testing our error handling I noticed that sometimes we would still commit the transaction even though we had aborted the transaction. Currently we track if a trans handle has dirtied any metadata, and if it hasn't we mark the filesystem as having an error (so no new transactions can be started), but we will allow the current transaction to complete as we do not mark the transaction itself as having been aborted. This sounds good in theory, but we were not properly tracking IO errors in btrfs_finish_ordered_io, and thus committing the transaction with bogus free space data. This isn't necessarily a problem per-se with the free space cache, as the other guards in place would have kept us from accepting the free space cache as valid, but highlights a real world case where we had a bug and could have corrupted the filesystem because of it. This "skip abort on empty trans handle" is nice in theory, but assumes we have perfect error handling everywhere, which we clearly do not. Also we do not allow further transactions to be started, so all this does is save the last transaction that was happening, which doesn't necessarily gain us anything other than the potential for real corruption. Remove this particular bit of code, if we decide we need to abort the transaction then abort the current one and keep us from doing real harm to the file system, regardless of whether this specific trans handle dirtied anything or not. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: return errors from btrfs_del_csums in cleanup_ref_headJosef Bacik2021-05-271-1/+1
| | | | | | | | | | | | | We are unconditionally returning 0 in cleanup_ref_head, despite the fact that btrfs_del_csums could fail. We need to return the error so the transaction gets aborted properly, fix this by returning ret from btrfs_del_csums in cleanup_ref_head. Reviewed-by: Qu Wenruo <wqu@suse.com> CC: stable@vger.kernel.org # 4.19+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix unmountable seed device after fstrimAnand Jain2021-05-041-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following test case reproduces an issue of wrongly freeing in-use blocks on the readonly seed device when fstrim is called on the rw sprout device. As shown below. Create a seed device and add a sprout device to it: $ mkfs.btrfs -fq -dsingle -msingle /dev/loop0 $ btrfstune -S 1 /dev/loop0 $ mount /dev/loop0 /btrfs $ btrfs dev add -f /dev/loop1 /btrfs BTRFS info (device loop0): relocating block group 290455552 flags system BTRFS info (device loop0): relocating block group 1048576 flags system BTRFS info (device loop0): disk added /dev/loop1 $ umount /btrfs Mount the sprout device and run fstrim: $ mount /dev/loop1 /btrfs $ fstrim /btrfs $ umount /btrfs Now try to mount the seed device, and it fails: $ mount /dev/loop0 /btrfs mount: /btrfs: wrong fs type, bad option, bad superblock on /dev/loop0, missing codepage or helper program, or other error. Block 5292032 is missing on the readonly seed device: $ dmesg -kt | tail <snip> BTRFS error (device loop0): bad tree block start, want 5292032 have 0 BTRFS warning (device loop0): couldn't read-tree root BTRFS error (device loop0): open_ctree failed From the dump-tree of the seed device (taken before the fstrim). Block 5292032 belonged to the block group starting at 5242880: $ btrfs inspect dump-tree -e /dev/loop0 | grep -A1 BLOCK_GROUP <snip> item 3 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16169 itemsize 24 block group used 114688 chunk_objectid 256 flags METADATA <snip> From the dump-tree of the sprout device (taken before the fstrim). fstrim used block-group 5242880 to find the related free space to free: $ btrfs inspect dump-tree -e /dev/loop1 | grep -A1 BLOCK_GROUP <snip> item 1 key (5242880 BLOCK_GROUP_ITEM 8388608) itemoff 16226 itemsize 24 block group used 32768 chunk_objectid 256 flags METADATA <snip> BPF kernel tracing the fstrim command finds the missing block 5292032 within the range of the discarded blocks as below: kprobe:btrfs_discard_extent { printf("freeing start %llu end %llu num_bytes %llu:\n", arg1, arg1+arg2, arg2); } freeing start 5259264 end 5406720 num_bytes 147456 <snip> Fix this by avoiding the discard command to the readonly seed device. Reported-by: Chris Murphy <lists@colorremedies.com> CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: use the new bit BTRFS_FS_TREE_MOD_LOG_USERS at btrfs_free_tree_block()Filipe Manana2021-04-191-5/+3
| | | | | | | | | | | | Instead of exposing implementation details of the tree mod log to check if there are active tree mod log users at btrfs_free_tree_block(), use the new bit BTRFS_FS_TREE_MOD_LOG_USERS for fs_info->flags instead. This way extent-tree.c does not need to known about any of the internals of the tree mod log and avoids taking a lock unnecessarily as well. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: unexport btrfs_extent_readonly() and make it staticAnand Jain2021-04-191-13/+0
| | | | | | | | | btrfs_extent_readonly() is used by can_nocow_extent() in inode.c. So move it from extent-tree.c to inode.c and declare it as static. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: always pin deleted leaves when there are active tree mod log usersFilipe Manana2021-03-161-1/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When freeing a tree block we may end up adding its extent back to the free space cache/tree, as long as there are no more references for it, it was created in the current transaction and writeback for it never happened. This is generally fine, however when we have tree mod log operations it can result in inconsistent versions of a btree after unwinding extent buffers with the recorded tree mod log operations. This is because: * We only log operations for nodes (adding and removing key/pointers), for leaves we don't do anything; * This means that we can log a MOD_LOG_KEY_REMOVE_WHILE_FREEING operation for a node that points to a leaf that was deleted; * Before we apply the logged operation to unwind a node, we can have that leaf's extent allocated again, either as a node or as a leaf, and possibly for another btree. This is possible if the leaf was created in the current transaction and writeback for it never started, in which case btrfs_free_tree_block() returns its extent back to the free space cache/tree; * Then, before applying the tree mod log operation, some task allocates the metadata extent just freed before, and uses it either as a leaf or as a node for some btree (can be the same or another one, it does not matter); * After applying the MOD_LOG_KEY_REMOVE_WHILE_FREEING operation we now get the target node with an item pointing to the metadata extent that now has content different from what it had before the leaf was deleted. It might now belong to a different btree and be a node and not a leaf anymore. As a consequence, the results of searches after the unwinding can be unpredictable and produce unexpected results. So make sure we pin extent buffers corresponding to leaves when there are tree mod log users. CC: stable@vger.kernel.org # 4.14+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: extend zoned allocator to use dedicated tree-log block groupNaohiro Aota2021-02-091-4/+71
| | | | | | | | | | | | | | | | | | | | | | | | This is the 1/3 patch to enable tree log on zoned filesystems. The tree-log feature does not work on a zoned filesystem as is. Blocks for a tree-log tree are allocated mixed with other metadata blocks and btrfs writes and syncs the tree-log blocks to devices at the time of fsync(), which has a different timing than a global transaction commit. As a result, both writing tree-log blocks and writing other metadata blocks become non-sequential writes that zoned filesystems must avoid. Introduce a dedicated block group for tree-log blocks, so that tree-log blocks and other metadata blocks can be separate write streams. As a result, each write stream can now be written to devices separately. "fs_info->treelog_bg" tracks the dedicated block group and assigns "treelog_bg" on-demand on tree-log block allocation time. This commit extends the zoned block allocator to use the block group. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: implement cloning for zoned device-replaceNaohiro Aota2021-02-091-15/+42
| | | | | | | | | | | | | | | | | This is 2/4 patch to implement device replace for zoned filesystems. In zoned mode, a block group must be either copied (from the source device to the target device) or cloned (to both devices). Implement the cloning part. If a block group targeted by an IO is marked to copy, we should not clone the IO to the destination device, because the block group is eventually copied by the replace process. This commit also handles cloning of device reset. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: reset zones of unused block groupsNaohiro Aota2021-02-091-5/+12
| | | | | | | | | | | | | | We must reset the zones of a deleted unused block group to rewind the zones' write pointers to the zones' start. To do this, we can use the DISCARD_SYNC code to do the reset when the filesystem is running on zoned devices. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: redirty released extent buffersNaohiro Aota2021-02-091-1/+11
| | | | | | | | | | | | | | | | | | | | | | | Tree manipulating operations like merging nodes often release once-allocated tree nodes. Such nodes are cleaned so that pages in the node are not uselessly written out. On zoned volumes, however, such optimization blocks the following IOs as the cancellation of the write out of the freed blocks breaks the sequential write sequence expected by the device. Introduce a list of clean and unwritten extent buffers that have been released in a transaction. Redirty the buffers so that btree_write_cache_pages() can send proper bios to the devices. Besides it clears the entire content of the extent buffer not to confuse raw block scanners e.g. 'btrfs check'. By clearing the content, csum_dirty_buffer() complains about bytenr mismatch, so avoid the checking and checksum using newly introduced buffer flag EXTENT_BUFFER_NO_CHECK. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: implement sequential extent allocationNaohiro Aota2021-02-091-6/+84
| | | | | | | | | | | | | | | | | | | Implement a sequential extent allocator for zoned filesystems. This allocator only needs to check if there is enough space in the block group after the allocation pointer to satisfy the extent allocation request. Therefore the allocator never manages bitmaps or clusters. Also, add assertions to the corresponding functions. As zone append writing is used, it would be unnecessary to track the allocation offset, as the allocator only needs to check available space. But by tracking and returning the offset as an allocated region, we can skip modification of ordered extents and checksum information when there is no IO reordering. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: zoned: track unusable bytes for zonesNaohiro Aota2021-02-091-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | In a zoned filesystem a once written then freed region is not usable until the underlying zone has been reset. So we need to distinguish such unusable space from usable free space. Therefore we need to introduce the "zone_unusable" field to the block group structure, and "bytes_zone_unusable" to the space_info structure to track the unusable space. Pinned bytes are always reclaimed to the unusable space. But, when an allocated region is returned before using e.g., the block group becomes read-only between allocation time and reservation time, we can safely return the region to the block group. For the situation, this commit introduces "btrfs_add_free_space_unused". This behaves the same as btrfs_add_free_space() on regular filesystem. On zoned filesystems, it rewinds the allocation offset. Because the read-only bytes tracks free but unusable bytes when the block group is read-only, we need to migrate the zone_unusable bytes to read-only bytes when a block group is marked read-only. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove bogus BUG_ON in alloc_reserved_tree_blockJosef Bacik2021-02-081-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The fix 361048f586f5 ("Btrfs: fix full backref problem when inserting shared block reference") added a delayed ref flushing at subvolume creation time in order to avoid hitting this particular BUG_ON(). Before this fix, we were tripping the BUG_ON() by 1. Modify snapshot A, which creates blocks with a normal reference for snapshot A, as A is the owner of these blocks. We now have delayed refs for these blocks. 2. Create a snapshot of A named B, which pushes references for the children blocks of the root node for the new root B, thus creating more delayed refs for newly allocated blocks. 3. A is modified, and because the metadata blocks can now be shared, it must push FULL_BACKREF references to the children of any block that A COWs down it's path to its target key. 4. Delayed refs are run. Because these are newly allocated blocks, we have ->must_insert_reserved reserved set on the delayed ref head, we call into alloc_reserved_tree_block() to add the extent item, and then add our ref. At the time of this fix, we were ordering FULL_BACKREF delayed ref operations first, so we'd go to add this reference and then BUG_ON() because we didn't have the FULL_BACKREF flag set. The patch fixed this problem by making sure we ran the delayed refs before we had the chance to modify A. This meant that any *new* blocks would have had their extent items created _before_ we would ever actually COW down and generate FULL_BACKREF entries. Thus the problem went away. However this BUG_ON() is actually completely bogus. The existence of a full backref doesn't necessarily mean that FULL_BACKREF must be set on that block, it must only be set on the actual parent itself. Consider the example provided above. If we COW down one path from A, any nodes are going to have a FULL_BACKREF ref pushed down to _all_ of their children, but not all of the children are going to have FULL_BACKREF set. It is completely valid to have an extent item with normal and full backrefs without FULL_BACKREF actually set on the block itself. As a final note, I have been testing with the patch (applied after this one) btrfs: stop running all delayed refs during snapshot which removed this flushing. My test was a torture test which did a lot of operations while snapshotting and deleting snapshots as well as relocation, and I never tripped this BUG_ON(). This is actually because at the time of 361048f586f5, we ordered SHARED keys _before_ normal references, and thus they would get run first. However currently they are ordered _after_ normal references, so we'd do the initial creation without having a shared reference, and thus not hit this BUG_ON(), which explains why I didn't start hitting this problem during my testing with my other patch applied. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: delayed refs pre-flushing should only run the heads we haveJosef Bacik2021-02-081-1/+1
| | | | | | | | | | | | | | | | | | | | | Previously our delayed ref running used the total number of items as the items to run. However we changed that to number of heads to run with the delayed_refs_rsv, as generally we want to run all of the operations for one bytenr. But with btrfs_run_delayed_refs(trans, 0) we set our count to 2x the number of items that we have. This is generally fine, but if we have some operation generation loads of delayed refs while we're doing this pre-flushing in the transaction commit, we'll just spin forever doing delayed refs. Fix this to simply pick the number of delayed refs we currently have, that way we do not end up doing a lot of extra work that's being generated in other threads. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: account for new extents being deleted in total_bytes_pinnedJosef Bacik2021-02-081-14/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | My recent patch set "A variety of lock contention fixes", found here https://lore.kernel.org/linux-btrfs/cover.1608319304.git.josef@toxicpanda.com/ (Tracked in https://github.com/btrfs/linux/issues/86) that reduce lock contention on the extent root by running delayed refs less often resulted in a regression in generic/371. This test fallocate()'s the fs until it's full, deletes all the files, and then tries to fallocate() until full again. Before these patches we would run all of the delayed refs during flushing, and then would commit the transaction because we had plenty of pinned space to recover in order to allocate. However my patches made it so we weren't running the delayed refs as aggressively, which meant that we appeared to have less pinned space when we were deciding to commit the transaction. We use the space_info->total_bytes_pinned to approximate how much space we have pinned. It's approximate because if we remove a reference to an extent we may free it, but there may be more references to it than we know of at that point, but we account it as pinned at the creation time, and then it's properly accounted when the delayed ref runs. The way we account for pinned space is if the delayed_ref_head->total_ref_mod is < 0, because that is clearly a freeing option. However there is another case, and that is where ->total_ref_mod == 0 && ->must_insert_reserved == 1. When we allocate a new extent, we have ->total_ref_mod == 1 and we have ->must_insert_reserved == 1. This is used to indicate that it is a brand new extent and will need to have its extent entry added before we modify any references on the delayed ref head. But if we subsequently remove that extent reference, our ->total_ref_mod will be 0, and that space will be pinned and freed. Accounting for this case properly allows for generic/371 to pass with my delayed refs patches applied. It's important to note that this problem exists without the referenced patches, it just was uncovered by them. CC: stable@vger.kernel.org # 5.10 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: handle space_info::total_bytes_pinned inside the delayed ref itselfJosef Bacik2021-02-081-85/+12
| | | | | | | | | | | | | | | | | | | | | | Currently we pass things around to figure out if we maybe freeing data based on the state of the delayed refs head. This makes the accounting sort of confusing and hard to follow, as it's distinctly separate from the delayed ref heads stuff, but also depends on it entirely. Fix this by explicitly adjusting the space_info->total_bytes_pinned in the delayed refs code. We now have two places where we modify this counter, once where we create the delayed and destroy the delayed refs, and once when we pin and unpin the extents. This means there is a slight overlap between delayed refs and the pin/unpin mechanisms, but this is simply used by the ENOSPC infrastructure to determine if we need to commit the transaction, so there's no adverse affect from this, we might simply commit thinking it will give us enough space when it might not. CC: stable@vger.kernel.org # 5.10 Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix log replay failure due to race with space cache rebuildFilipe Manana2021-01-251-43/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After a sudden power failure we may end up with a space cache on disk that is not valid and needs to be rebuilt from scratch. If that happens, during log replay when we attempt to pin an extent buffer from a log tree, at btrfs_pin_extent_for_log_replay(), we do not wait for the space cache to be rebuilt through the call to: btrfs_cache_block_group(cache, 1); That is because that only waits for the task (work queue job) that loads the space cache to change the cache state from BTRFS_CACHE_FAST to any other value. That is ok when the space cache on disk exists and is valid, but when the cache is not valid and needs to be rebuilt, it ends up returning as soon as the cache state changes to BTRFS_CACHE_STARTED (done at caching_thread()). So this means that we can end up trying to unpin a range which is not yet marked as free in the block group. This results in the call to btrfs_remove_free_space() to return -EINVAL to btrfs_pin_extent_for_log_replay(), which in turn makes the log replay fail as well as mounting the filesystem. More specifically the -EINVAL comes from free_space_cache.c:remove_from_bitmap(), because the requested range is not marked as free space (ones in the bitmap), we have the following condition triggered: static noinline int remove_from_bitmap(struct btrfs_free_space_ctl *ctl, (...) if (ret < 0 || search_start != *offset) return -EINVAL; (...) It's the "search_start != *offset" that results in the condition being evaluated to true. When this happens we got the following in dmesg/syslog: [72383.415114] BTRFS: device fsid 32b95b69-0ea9-496a-9f02-3f5a56dc9322 devid 1 transid 1432 /dev/sdb scanned by mount (3816007) [72383.417837] BTRFS info (device sdb): disk space caching is enabled [72383.418536] BTRFS info (device sdb): has skinny extents [72383.423846] BTRFS info (device sdb): start tree-log replay [72383.426416] BTRFS warning (device sdb): block group 30408704 has wrong amount of free space [72383.427686] BTRFS warning (device sdb): failed to load free space cache for block group 30408704, rebuilding it now [72383.454291] BTRFS: error (device sdb) in btrfs_recover_log_trees:6203: errno=-22 unknown (Failed to pin buffers while recovering log root tree.) [72383.456725] BTRFS: error (device sdb) in btrfs_replay_log:2253: errno=-22 unknown (Failed to recover log tree) [72383.460241] BTRFS error (device sdb): open_ctree failed We also mark the range for the extent buffer in the excluded extents io tree. That is fine when the space cache is valid on disk and we can load it, in which case it causes no problems. However, for the case where we need to rebuild the space cache, because it is either invalid or it is missing, having the extent buffer range marked in the excluded extents io tree leads to a -EINVAL failure from the call to btrfs_remove_free_space(), resulting in the log replay and mount to fail. This is because by having the range marked in the excluded extents io tree, the caching thread ends up never adding the range of the extent buffer as free space in the block group since the calls to add_new_free_space(), called from load_extent_tree_free(), filter out any ranges that are marked as excluded extents. So fix this by making sure that during log replay we wait for the caching task to finish completely when we need to rebuild a space cache, and also drop the need to mark the extent buffer range in the excluded extents io tree, as well as clearing ranges from that tree at btrfs_finish_extent_commit(). This started to happen with some frequency on large filesystems having block groups with a lot of fragmentation since the recent commit e747853cae3ae3 ("btrfs: load free space cache asynchronously"), but in fact the issue has been there for years, it was just much less likely to happen. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: don't get an EINTR during drop_snapshot for relocJosef Bacik2021-01-181-1/+9
| | | | | | | | | | | | | | This was partially fixed by f3e3d9cc3525 ("btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree"), however it missed a spot when we restart a trans handle because we need to end the transaction. The fix is the same, simply use btrfs_join_transaction() instead of btrfs_start_transaction() when deleting reloc roots. Fixes: f3e3d9cc3525 ("btrfs: avoid possible signal interruption of btrfs_drop_snapshot() on relocation tree") CC: stable@vger.kernel.org # 5.4+ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: correctly calculate item size used when item key collision happensethanwu2020-12-181-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Item key collision is allowed for some item types, like dir item and inode refs, but the overall item size is limited by the nodesize. item size(ins_len) passed from btrfs_insert_empty_items to btrfs_search_slot already contains size of btrfs_item. When btrfs_search_slot reaches leaf, we'll see if we need to split leaf. The check incorrectly reports that split leaf is required, because it treats the space required by the newly inserted item as btrfs_item + item data. But in item key collision case, only item data is actually needed, the newly inserted item could merge into the existing one. No new btrfs_item will be inserted. And split_leaf return EOVERFLOW from following code: if (extend && data_size + btrfs_item_size_nr(l, slot) + sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(fs_info)) return -EOVERFLOW; In most cases, when callers receive EOVERFLOW, they either return this error or handle in different ways. For example, in normal dir item creation the userspace will get errno EOVERFLOW; in inode ref case INODE_EXTREF is used instead. However, this is not the case for rename. To avoid the unrecoverable situation in rename, btrfs_check_dir_item_collision is called in early phase of rename. In this function, when item key collision is detected leaf space is checked: data_size = sizeof(*di) + name_len; if (data_size + btrfs_item_size_nr(leaf, slot) + sizeof(struct btrfs_item) > BTRFS_LEAF_DATA_SIZE(root->fs_info)) the sizeof(struct btrfs_item) + btrfs_item_size_nr(leaf, slot) here refers to existing item size, the condition here correctly calculates the needed size for collision case rather than the wrong case above. The consequence of inconsistent condition check between btrfs_check_dir_item_collision and btrfs_search_slot when item key collision happens is that we might pass check here but fail later at btrfs_search_slot. Rename fails and volume is forced readonly [436149.586170] ------------[ cut here ]------------ [436149.586173] BTRFS: Transaction aborted (error -75) [436149.586196] WARNING: CPU: 0 PID: 16733 at fs/btrfs/inode.c:9870 btrfs_rename2+0x1938/0x1b70 [btrfs] [436149.586227] CPU: 0 PID: 16733 Comm: python Tainted: G D 4.18.0-rc5+ #1 [436149.586228] Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 04/05/2016 [436149.586238] RIP: 0010:btrfs_rename2+0x1938/0x1b70 [btrfs] [436149.586254] RSP: 0018:ffffa327043a7ce0 EFLAGS: 00010286 [436149.586255] RAX: 0000000000000000 RBX: ffff8d8a17d13340 RCX: 0000000000000006 [436149.586256] RDX: 0000000000000007 RSI: 0000000000000096 RDI: ffff8d8a7fc164b0 [436149.586257] RBP: ffffa327043a7da0 R08: 0000000000000560 R09: 7265282064657472 [436149.586258] R10: 0000000000000000 R11: 6361736e61725420 R12: ffff8d8a0d4c8b08 [436149.586258] R13: ffff8d8a17d13340 R14: ffff8d8a33e0a540 R15: 00000000000001fe [436149.586260] FS: 00007fa313933740(0000) GS:ffff8d8a7fc00000(0000) knlGS:0000000000000000 [436149.586261] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [436149.586262] CR2: 000055d8d9c9a720 CR3: 000000007aae0003 CR4: 00000000003606f0 [436149.586295] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [436149.586296] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [436149.586296] Call Trace: [436149.586311] vfs_rename+0x383/0x920 [436149.586313] ? vfs_rename+0x383/0x920 [436149.586315] do_renameat2+0x4ca/0x590 [436149.586317] __x64_sys_rename+0x20/0x30 [436149.586324] do_syscall_64+0x5a/0x120 [436149.586330] entry_SYSCALL_64_after_hwframe+0x44/0xa9 [436149.586332] RIP: 0033:0x7fa3133b1d37 [436149.586348] RSP: 002b:00007fffd3e43908 EFLAGS: 00000246 ORIG_RAX: 0000000000000052 [436149.586349] RAX: ffffffffffffffda RBX: 00007fa3133b1d30 RCX: 00007fa3133b1d37 [436149.586350] RDX: 000055d8da06b5e0 RSI: 000055d8da225d60 RDI: 000055d8da2c4da0 [436149.586351] RBP: 000055d8da2252f0 R08: 00007fa313782000 R09: 00000000000177e0 [436149.586351] R10: 000055d8da010680 R11: 0000000000000246 R12: 00007fa313840b00 Thanks to Hans van Kranenburg for information about crc32 hash collision tools, I was able to reproduce the dir item collision with following python script. https://github.com/wutzuchieh/misc_tools/blob/master/crc32_forge.py Run it under a btrfs volume will trigger the abort transaction. It simply creates files and rename them to forged names that leads to hash collision. There are two ways to fix this. One is to simply revert the patch 878f2d2cb355 ("Btrfs: fix max dir item size calculation") to make the condition consistent although that patch is correct about the size. The other way is to handle the leaf space check correctly when collision happens. I prefer the second one since it correct leaf space check in collision case. This fix will not account sizeof(struct btrfs_item) when the item already exists. There are two places where ins_len doesn't contain sizeof(struct btrfs_item), however. 1. extent-tree.c: lookup_inline_extent_backref 2. file-item.c: btrfs_csum_file_blocks to make the logic of btrfs_search_slot more clear, we add a flag search_for_extension in btrfs_path. This flag indicates that ins_len passed to btrfs_search_slot doesn't contain sizeof(struct btrfs_item). When key exists, btrfs_search_slot will use the actual size needed to calculate the required leaf space. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: ethanwu <ethanwu@synology.com> Signed-off-by: David Sterba <dsterba@suse.com>