summaryrefslogtreecommitdiffstats
path: root/fs/btrfs (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'for-6.7-rc1-tag' of ↵Linus Torvalds2023-11-1311-33/+53
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: - fix potential overflow in returned value from SEARCH_TREE_V2 ioctl on 32bit architecture - zoned mode fixes: - drop unnecessary write pointer check for RAID0/RAID1/RAID10 profiles, now it works because of raid-stripe-tree - wait for finishing the zone when direct IO needs a new allocation - simple quota fixes: - pass correct owning root pointer when cleaning up an aborted transaction - fix leaking some structures when processing delayed refs - change key type number of BTRFS_EXTENT_OWNER_REF_KEY, reorder it before inline refs that are supposed to be sorted, keeping the original number would complicate a lot of things; this change needs an updated version of btrfs-progs to work and filesystems need to be recreated - fix error pointer dereference after failure to allocate fs devices - fix race between accounting qgroup extents and removing a qgroup * tag 'for-6.7-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: make OWNER_REF_KEY type value smallest among inline refs btrfs: fix qgroup record leaks when using simple quotas btrfs: fix race between accounting qgroup extents and removing a qgroup btrfs: fix error pointer dereference after failure to allocate fs devices btrfs: make found_logical_ret parameter mandatory for function queue_scrub_stripe() btrfs: get correct owning_root when dropping snapshot btrfs: zoned: wait for data BG to be finished on direct IO allocation btrfs: zoned: drop no longer valid write pointer check btrfs: directly return 0 on no error code in btrfs_insert_raid_extent() btrfs: use u64 for buffer sizes in the tree search ioctls
| * btrfs: fix qgroup record leaks when using simple quotasFilipe Manana2023-11-092-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using simple quotas we are not supposed to allocate qgroup records when adding delayed references. However we allocate them if either mode of quotas is enabled (the new simple one or the old one), but then we never free them because running the accounting, which frees the records, is only run when using the old quotas (at btrfs_qgroup_account_extents()), resulting in a memory leak of the records allocated when adding delayed references. Fix this by allocating the records only if the old quotas mode is enabled. Also fix btrfs_qgroup_trace_extent_nolock() to return 1 if the old quotas mode is not enabled - meaning the caller has to free the record. Fixes: 182940f4f4db ("btrfs: qgroup: add new quota mode for simple quotas") Reported-by: syzbot+d3ddc6dcc6386dea398b@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/00000000000004769106097f9a34@google.com/ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: fix race between accounting qgroup extents and removing a qgroupFilipe Manana2023-11-091-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When doing qgroup accounting for an extent, we take the spinlock fs_info->qgroup_lock and then add qgroups to the local list (iterator) named "qgroups". These qgroups are found in the fs_info->qgroup_tree rbtree. After we're done, we unlock fs_info->qgroup_lock and then call qgroup_iterator_nested_clean(), which will iterate over all the qgroups added to the local list "qgroups" and then delete them from the list. Deleting a qgroup from the list can however result in a use-after-free if a qgroup remove operation happens after we unlock fs_info->qgroup_lock and before or while we are at qgroup_iterator_nested_clean(). Fix this by calling qgroup_iterator_nested_clean() while still holding the lock fs_info->qgroup_lock - we don't need it under the 'out' label since before taking the lock the "qgroups" list is always empty. This guarantees safety because btrfs_remove_qgroup() takes that lock before removing a qgroup from the rbtree fs_info->qgroup_tree. This was reported by syzbot with the following stack traces: BUG: KASAN: slab-use-after-free in __list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49 Read of size 8 at addr ffff888027e420b0 by task kworker/u4:3/48 CPU: 1 PID: 48 Comm: kworker/u4:3 Not tainted 6.6.0-syzkaller-10396-g4652b8e4f3ff #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023 Workqueue: btrfs-qgroup-rescan btrfs_work_helper Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x1e7/0x2d0 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:364 [inline] print_report+0x163/0x540 mm/kasan/report.c:475 kasan_report+0x175/0x1b0 mm/kasan/report.c:588 __list_del_entry_valid_or_report+0x2f/0x130 lib/list_debug.c:49 __list_del_entry_valid include/linux/list.h:124 [inline] __list_del_entry include/linux/list.h:215 [inline] list_del_init include/linux/list.h:287 [inline] qgroup_iterator_nested_clean fs/btrfs/qgroup.c:2623 [inline] btrfs_qgroup_account_extent+0x18b/0x1150 fs/btrfs/qgroup.c:2883 qgroup_rescan_leaf fs/btrfs/qgroup.c:3543 [inline] btrfs_qgroup_rescan_worker+0x1078/0x1c60 fs/btrfs/qgroup.c:3604 btrfs_work_helper+0x37c/0xbd0 fs/btrfs/async-thread.c:315 process_one_work kernel/workqueue.c:2630 [inline] process_scheduled_works+0x90f/0x1400 kernel/workqueue.c:2703 worker_thread+0xa5f/0xff0 kernel/workqueue.c:2784 kthread+0x2d3/0x370 kernel/kthread.c:388 ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242 </TASK> Allocated by task 6355: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4f/0x70 mm/kasan/common.c:52 ____kasan_kmalloc mm/kasan/common.c:374 [inline] __kasan_kmalloc+0x98/0xb0 mm/kasan/common.c:383 kmalloc include/linux/slab.h:600 [inline] kzalloc include/linux/slab.h:721 [inline] btrfs_quota_enable+0xee9/0x2060 fs/btrfs/qgroup.c:1209 btrfs_ioctl_quota_ctl+0x143/0x190 fs/btrfs/ioctl.c:3705 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b Freed by task 6355: kasan_save_stack mm/kasan/common.c:45 [inline] kasan_set_track+0x4f/0x70 mm/kasan/common.c:52 kasan_save_free_info+0x28/0x40 mm/kasan/generic.c:522 ____kasan_slab_free+0xd6/0x120 mm/kasan/common.c:236 kasan_slab_free include/linux/kasan.h:164 [inline] slab_free_hook mm/slub.c:1800 [inline] slab_free_freelist_hook mm/slub.c:1826 [inline] slab_free mm/slub.c:3809 [inline] __kmem_cache_free+0x263/0x3a0 mm/slub.c:3822 btrfs_remove_qgroup+0x764/0x8c0 fs/btrfs/qgroup.c:1787 btrfs_ioctl_qgroup_create+0x185/0x1e0 fs/btrfs/ioctl.c:3811 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl+0xf8/0x170 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b Last potentially related work creation: kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45 __kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2667 [inline] call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781 kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823 kthread+0x2d3/0x370 kernel/kthread.c:388 ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242 Second to last potentially related work creation: kasan_save_stack+0x3f/0x60 mm/kasan/common.c:45 __kasan_record_aux_stack+0xad/0xc0 mm/kasan/generic.c:492 __call_rcu_common kernel/rcu/tree.c:2667 [inline] call_rcu+0x167/0xa70 kernel/rcu/tree.c:2781 kthread_worker_fn+0x4ba/0xa90 kernel/kthread.c:823 kthread+0x2d3/0x370 kernel/kthread.c:388 ret_from_fork+0x48/0x80 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x11/0x20 arch/x86/entry/entry_64.S:242 The buggy address belongs to the object at ffff888027e42000 which belongs to the cache kmalloc-512 of size 512 The buggy address is located 176 bytes inside of freed 512-byte region [ffff888027e42000, ffff888027e42200) The buggy address belongs to the physical page: page:ffffea00009f9000 refcount:1 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x27e40 head:ffffea00009f9000 order:2 entire_mapcount:0 nr_pages_mapped:0 pincount:0 flags: 0xfff00000000840(slab|head|node=0|zone=1|lastcpupid=0x7ff) page_type: 0xffffffff() raw: 00fff00000000840 ffff888012c41c80 ffffea0000a5ba00 dead000000000002 raw: 0000000000000000 0000000080100010 00000001ffffffff 0000000000000000 page dumped because: kasan: bad access detected page_owner tracks the page as allocated page last allocated via order 2, migratetype Unmovable, gfp_mask 0xd20c0(__GFP_IO|__GFP_FS|__GFP_NOWARN|__GFP_NORETRY|__GFP_COMP|__GFP_NOMEMALLOC), pid 4514, tgid 4514 (udevadm), ts 24598439480, free_ts 23755696267 set_page_owner include/linux/page_owner.h:31 [inline] post_alloc_hook+0x1e6/0x210 mm/page_alloc.c:1536 prep_new_page mm/page_alloc.c:1543 [inline] get_page_from_freelist+0x31db/0x3360 mm/page_alloc.c:3170 __alloc_pages+0x255/0x670 mm/page_alloc.c:4426 alloc_slab_page+0x6a/0x160 mm/slub.c:1870 allocate_slab mm/slub.c:2017 [inline] new_slab+0x84/0x2f0 mm/slub.c:2070 ___slab_alloc+0xc85/0x1310 mm/slub.c:3223 __slab_alloc mm/slub.c:3322 [inline] __slab_alloc_node mm/slub.c:3375 [inline] slab_alloc_node mm/slub.c:3468 [inline] __kmem_cache_alloc_node+0x19d/0x270 mm/slub.c:3517 kmalloc_trace+0x2a/0xe0 mm/slab_common.c:1098 kmalloc include/linux/slab.h:600 [inline] kzalloc include/linux/slab.h:721 [inline] kernfs_fop_open+0x3e7/0xcc0 fs/kernfs/file.c:670 do_dentry_open+0x8fd/0x1590 fs/open.c:948 do_open fs/namei.c:3622 [inline] path_openat+0x2845/0x3280 fs/namei.c:3779 do_filp_open+0x234/0x490 fs/namei.c:3809 do_sys_openat2+0x13e/0x1d0 fs/open.c:1440 do_sys_open fs/open.c:1455 [inline] __do_sys_openat fs/open.c:1471 [inline] __se_sys_openat fs/open.c:1466 [inline] __x64_sys_openat+0x247/0x290 fs/open.c:1466 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b page last free stack trace: reset_page_owner include/linux/page_owner.h:24 [inline] free_pages_prepare mm/page_alloc.c:1136 [inline] free_unref_page_prepare+0x8c3/0x9f0 mm/page_alloc.c:2312 free_unref_page+0x37/0x3f0 mm/page_alloc.c:2405 discard_slab mm/slub.c:2116 [inline] __unfreeze_partials+0x1dc/0x220 mm/slub.c:2655 put_cpu_partial+0x17b/0x250 mm/slub.c:2731 __slab_free+0x2b6/0x390 mm/slub.c:3679 qlink_free mm/kasan/quarantine.c:166 [inline] qlist_free_all+0x75/0xe0 mm/kasan/quarantine.c:185 kasan_quarantine_reduce+0x14b/0x160 mm/kasan/quarantine.c:292 __kasan_slab_alloc+0x23/0x70 mm/kasan/common.c:305 kasan_slab_alloc include/linux/kasan.h:188 [inline] slab_post_alloc_hook+0x67/0x3d0 mm/slab.h:762 slab_alloc_node mm/slub.c:3478 [inline] slab_alloc mm/slub.c:3486 [inline] __kmem_cache_alloc_lru mm/slub.c:3493 [inline] kmem_cache_alloc+0x104/0x2c0 mm/slub.c:3502 getname_flags+0xbc/0x4f0 fs/namei.c:140 do_sys_openat2+0xd2/0x1d0 fs/open.c:1434 do_sys_open fs/open.c:1455 [inline] __do_sys_openat fs/open.c:1471 [inline] __se_sys_openat fs/open.c:1466 [inline] __x64_sys_openat+0x247/0x290 fs/open.c:1466 do_syscall_x64 arch/x86/entry/common.c:51 [inline] do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82 entry_SYSCALL_64_after_hwframe+0x63/0x6b Memory state around the buggy address: ffff888027e41f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc ffff888027e42000: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb >ffff888027e42080: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ^ ffff888027e42100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb ffff888027e42180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb Reported-by: syzbot+e0b615318f8fcfc01ceb@syzkaller.appspotmail.com Fixes: dce28769a33a ("btrfs: qgroup: use qgroup_iterator_nested to in qgroup_update_refcnt()") CC: stable@vger.kernel.org # 6.6 Link: https://lore.kernel.org/linux-btrfs/00000000000091a5b2060936bf6d@google.com/ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: fix error pointer dereference after failure to allocate fs devicesFilipe Manana2023-11-031-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At device_list_add() we allocate a btrfs_fs_devices structure and then before checking if the allocation failed (pointer is ERR_PTR(-ENOMEM)), we dereference the error pointer in a memcpy() argument if the feature BTRFS_FEATURE_INCOMPAT_METADATA_UUID is enabled. Fix this by checking for an allocation error before trying the memcpy(). Fixes: f7361d8c3fc3 ("btrfs: sipmlify uuid parameters of alloc_fs_devices()") Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: make found_logical_ret parameter mandatory for function ↵Qu Wenruo2023-11-031-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | queue_scrub_stripe() [BUG] There is a compilation warning reported on commit ae76d8e3e135 ("btrfs: scrub: fix grouping of read IO"), where gcc (14.0.0 20231022 experimental) is reporting the following uninitialized variable: fs/btrfs/scrub.c: In function ‘scrub_simple_mirror.isra’: fs/btrfs/scrub.c:2075:29: error: ‘found_logical’ may be used uninitialized [-Werror=maybe-uninitialized[https://gcc.gnu.org/onlinedocs/gcc/Warning-Options.html#index-Wmaybe-uninitialized]] 2075 | cur_logical = found_logical + BTRFS_STRIPE_LEN; fs/btrfs/scrub.c:2040:21: note: ‘found_logical’ was declared here 2040 | u64 found_logical; | ^~~~~~~~~~~~~ [CAUSE] This is a false alert, as @found_logical is passed as parameter @found_logical_ret of function queue_scrub_stripe(). As long as queue_scrub_stripe() returned 0, we would update @found_logical_ret. And if queue_scrub_stripe() returned >0 or <0, the caller would not utilized @found_logical, thus there should be nothing wrong. Although the triggering gcc is still experimental, it looks like the extra check on "if (found_logical_ret)" can sometimes confuse the compiler. Meanwhile the only caller of queue_scrub_stripe() is always passing a valid pointer, there is no need for such check at all. [FIX] Although the report itself is a false alert, we can still make it more explicit by: - Replace the check for @found_logical_ret with ASSERT() - Initialize @found_logical to U64_MAX - Add one extra ASSERT() to make sure @found_logical got updated Link: https://lore.kernel.org/linux-btrfs/87fs1x1p93.fsf@gentoo.org/ Fixes: ae76d8e3e135 ("btrfs: scrub: fix grouping of read IO") Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: get correct owning_root when dropping snapshotJosef Bacik2023-11-033-10/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave reported a bug where we were aborting the transaction while trying to cleanup the squota reservation for an extent. This turned out to be because we're doing btrfs_header_owner(next) in do_walk_down when we decide to free the block. However in this code block we haven't explicitly read next, so it could be stale. We would then get whatever garbage happened to be in the pages at this point. The commit that introduced that is "btrfs: track owning root in btrfs_ref". Fix this by saving the owner_root when we do the btrfs_lookup_extent_info(). We always do this in do_walk_down, it is how we make the decision of whether or not to delete the block. This is cheap because we've already done the extent item lookup at this point, so it's straightforward to just grab the owner root as well. Then we can use this when deleting the metadata block without needing to force a read of the extent buffer to find the owner. This fixes the problem that Dave reported. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: zoned: wait for data BG to be finished on direct IO allocationNaohiro Aota2023-11-031-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running the fio command below on a ZNS device results in "Resource temporarily unavailable" error. $ sudo fio --name=w --directory=/mnt --filesize=1GB --bs=16MB --numjobs=16 \ --rw=write --ioengine=libaio --iodepth=128 --direct=1 fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=117440512, buflen=16777216 fio: io_u error on file /mnt/w.2.0: Resource temporarily unavailable: write offset=134217728, buflen=16777216 ... This happens because -EAGAIN error returned from btrfs_reserve_extent() called from btrfs_new_extent_direct() is spilling over to the userland. btrfs_reserve_extent() returns -EAGAIN when there is no active zone available. Then, the caller should wait for some other on-going IO to finish a zone and retry the allocation. This logic is already implemented for buffered write in cow_file_range(), but it is missing for the direct IO counterpart. Implement the same logic for it. Reported-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Fixes: 2ce543f47843 ("btrfs: zoned: wait until zone is finished when allocation didn't progress") CC: stable@vger.kernel.org # 6.1+ Tested-by: Shinichiro Kawasaki <shinichiro.kawasaki@wdc.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: zoned: drop no longer valid write pointer checkNaohiro Aota2023-11-031-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a check of the write pointer vs the zone size to reject an invalid write pointer. However, as of now, we have RAID0/RAID10 on the zoned mode, we can have a block group whose size is larger than the zone size. As an equivalent check against the block group's zone_capacity is already there, we can just drop this invalid check. Fixes: 568220fa9657 ("btrfs: zoned: support RAID0/1/10 on top of raid stripe tree") Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: directly return 0 on no error code in btrfs_insert_raid_extent()Dan Carpenter2023-11-031-1/+1
| | | | | | | | | | | | | | | | | | | | It's more obvious to return a literal zero instead of "return ret;". Plus Smatch complains that ret could be uninitialized if the ordered_extent->bioc_list list is empty and this silences that warning. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: use u64 for buffer sizes in the tree search ioctlsFilipe Manana2023-11-031-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the tree search v2 ioctl we use the type size_t, which is an unsigned long, to track the buffer size in the local variable 'buf_size'. An unsigned long is 32 bits wide on a 32 bits architecture. The buffer size defined in struct btrfs_ioctl_search_args_v2 is a u64, so when we later try to copy the local variable 'buf_size' to the argument struct, when the search returns -EOVERFLOW, we copy only 32 bits which will be a problem on big endian systems. Fix this by using a u64 type for the buffer sizes, not only at btrfs_ioctl_tree_search_v2(), but also everywhere down the call chain so that we can use the u64 at btrfs_ioctl_tree_search_v2(). Fixes: cc68a8a5a433 ("btrfs: new ioctl TREE_SEARCH_V2") Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/linux-btrfs/ce6f4bd6-9453-4ffe-ba00-cee35495e10f@moroto.mountain/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge tag 'mm-stable-2023-11-01-14-33' of ↵Linus Torvalds2023-11-031-1/+1
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Many singleton patches against the MM code. The patch series which are included in this merge do the following: - Kemeng Shi has contributed some compation maintenance work in the series 'Fixes and cleanups to compaction' - Joel Fernandes has a patchset ('Optimize mremap during mutual alignment within PMD') which fixes an obscure issue with mremap()'s pagetable handling during a subsequent exec(), based upon an implementation which Linus suggested - More DAMON/DAMOS maintenance and feature work from SeongJae Park i the following patch series: mm/damon: misc fixups for documents, comments and its tracepoint mm/damon: add a tracepoint for damos apply target regions mm/damon: provide pseudo-moving sum based access rate mm/damon: implement DAMOS apply intervals mm/damon/core-test: Fix memory leaks in core-test mm/damon/sysfs-schemes: Do DAMOS tried regions update for only one apply interval - In the series 'Do not try to access unaccepted memory' Adrian Hunter provides some fixups for the recently-added 'unaccepted memory' feature. To increase the feature's checking coverage. 'Plug a few gaps where RAM is exposed without checking if it is unaccepted memory' - In the series 'cleanups for lockless slab shrink' Qi Zheng has done some maintenance work which is preparation for the lockless slab shrinking code - Qi Zheng has redone the earlier (and reverted) attempt to make slab shrinking lockless in the series 'use refcount+RCU method to implement lockless slab shrink' - David Hildenbrand contributes some maintenance work for the rmap code in the series 'Anon rmap cleanups' - Kefeng Wang does more folio conversions and some maintenance work in the migration code. Series 'mm: migrate: more folio conversion and unification' - Matthew Wilcox has fixed an issue in the buffer_head code which was causing long stalls under some heavy memory/IO loads. Some cleanups were added on the way. Series 'Add and use bdev_getblk()' - In the series 'Use nth_page() in place of direct struct page manipulation' Zi Yan has fixed a potential issue with the direct manipulation of hugetlb page frames - In the series 'mm: hugetlb: Skip initialization of gigantic tail struct pages if freed by HVO' has improved our handling of gigantic pages in the hugetlb vmmemmep optimizaton code. This provides significant boot time improvements when significant amounts of gigantic pages are in use - Matthew Wilcox has sent the series 'Small hugetlb cleanups' - code rationalization and folio conversions in the hugetlb code - Yin Fengwei has improved mlock()'s handling of large folios in the series 'support large folio for mlock' - In the series 'Expose swapcache stat for memcg v1' Liu Shixin has added statistics for memcg v1 users which are available (and useful) under memcg v2 - Florent Revest has enhanced the MDWE (Memory-Deny-Write-Executable) prctl so that userspace may direct the kernel to not automatically propagate the denial to child processes. The series is named 'MDWE without inheritance' - Kefeng Wang has provided the series 'mm: convert numa balancing functions to use a folio' which does what it says - In the series 'mm/ksm: add fork-exec support for prctl' Stefan Roesch makes is possible for a process to propagate KSM treatment across exec() - Huang Ying has enhanced memory tiering's calculation of memory distances. This is used to permit the dax/kmem driver to use 'high bandwidth memory' in addition to Optane Data Center Persistent Memory Modules (DCPMM). The series is named 'memory tiering: calculate abstract distance based on ACPI HMAT' - In the series 'Smart scanning mode for KSM' Stefan Roesch has optimized KSM by teaching it to retain and use some historical information from previous scans - Yosry Ahmed has fixed some inconsistencies in memcg statistics in the series 'mm: memcg: fix tracking of pending stats updates values' - In the series 'Implement IOCTL to get and optionally clear info about PTEs' Peter Xu has added an ioctl to /proc/<pid>/pagemap which permits us to atomically read-then-clear page softdirty state. This is mainly used by CRIU - Hugh Dickins contributed the series 'shmem,tmpfs: general maintenance', a bunch of relatively minor maintenance tweaks to this code - Matthew Wilcox has increased the use of the VMA lock over file-backed page faults in the series 'Handle more faults under the VMA lock'. Some rationalizations of the fault path became possible as a result - In the series 'mm/rmap: convert page_move_anon_rmap() to folio_move_anon_rmap()' David Hildenbrand has implemented some cleanups and folio conversions - In the series 'various improvements to the GUP interface' Lorenzo Stoakes has simplified and improved the GUP interface with an eye to providing groundwork for future improvements - Andrey Konovalov has sent along the series 'kasan: assorted fixes and improvements' which does those things - Some page allocator maintenance work from Kemeng Shi in the series 'Two minor cleanups to break_down_buddy_pages' - In thes series 'New selftest for mm' Breno Leitao has developed another MM self test which tickles a race we had between madvise() and page faults - In the series 'Add folio_end_read' Matthew Wilcox provides cleanups and an optimization to the core pagecache code - Nhat Pham has added memcg accounting for hugetlb memory in the series 'hugetlb memcg accounting' - Cleanups and rationalizations to the pagemap code from Lorenzo Stoakes, in the series 'Abstract vma_merge() and split_vma()' - Audra Mitchell has fixed issues in the procfs page_owner code's new timestamping feature which was causing some misbehaviours. In the series 'Fix page_owner's use of free timestamps' - Lorenzo Stoakes has fixed the handling of new mappings of sealed files in the series 'permit write-sealed memfd read-only shared mappings' - Mike Kravetz has optimized the hugetlb vmemmap optimization in the series 'Batch hugetlb vmemmap modification operations' - Some buffer_head folio conversions and cleanups from Matthew Wilcox in the series 'Finish the create_empty_buffers() transition' - As a page allocator performance optimization Huang Ying has added automatic tuning to the allocator's per-cpu-pages feature, in the series 'mm: PCP high auto-tuning' - Roman Gushchin has contributed the patchset 'mm: improve performance of accounted kernel memory allocations' which improves their performance by ~30% as measured by a micro-benchmark - folio conversions from Kefeng Wang in the series 'mm: convert page cpupid functions to folios' - Some kmemleak fixups in Liu Shixin's series 'Some bugfix about kmemleak' - Qi Zheng has improved our handling of memoryless nodes by keeping them off the allocation fallback list. This is done in the series 'handle memoryless nodes more appropriately' - khugepaged conversions from Vishal Moola in the series 'Some khugepaged folio conversions'" [ bcachefs conflicts with the dynamically allocated shrinkers have been resolved as per Stephen Rothwell in https://lore.kernel.org/all/20230913093553.4290421e@canb.auug.org.au/ with help from Qi Zheng. The clone3 test filtering conflict was half-arsed by yours truly ] * tag 'mm-stable-2023-11-01-14-33' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (406 commits) mm/damon/sysfs: update monitoring target regions for online input commit mm/damon/sysfs: remove requested targets when online-commit inputs selftests: add a sanity check for zswap Documentation: maple_tree: fix word spelling error mm/vmalloc: fix the unchecked dereference warning in vread_iter() zswap: export compression failure stats Documentation: ubsan: drop "the" from article title mempolicy: migration attempt to match interleave nodes mempolicy: mmap_lock is not needed while migrating folios mempolicy: alloc_pages_mpol() for NUMA policy without vma mm: add page_rmappable_folio() wrapper mempolicy: remove confusing MPOL_MF_LAZY dead code mempolicy: mpol_shared_policy_init() without pseudo-vma mempolicy trivia: use pgoff_t in shared mempolicy tree mempolicy trivia: slightly more consistent naming mempolicy trivia: delete those ancient pr_debug()s mempolicy: fix migrate_pages(2) syscall return nr_failed kernfs: drop shared NUMA mempolicy hooks hugetlbfs: drop shared NUMA mempolicy pretence mm/damon/sysfs-test: add a unit test for damon_sysfs_set_targets() ...
| * | fs: super: dynamically allocate the s_shrinkQi Zheng2023-10-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for implementing lockless slab shrink, use new APIs to dynamically allocate the s_shrink, so that it can be freed asynchronously via RCU. Then it doesn't need to wait for RCU read-side critical section when releasing the struct super_block. Link: https://lkml.kernel.org/r/20230911094444.68966-39-zhengqi.arch@bytedance.com Signed-off-by: Qi Zheng <zhengqi.arch@bytedance.com> Reviewed-by: Muchun Song <songmuchun@bytedance.com> Acked-by: David Sterba <dsterba@suse.com> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <josef@toxicpanda.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Christian Brauner <brauner@kernel.org> Cc: Abhinav Kumar <quic_abhinavk@quicinc.com> Cc: Alasdair Kergon <agk@redhat.com> Cc: Alyssa Rosenzweig <alyssa.rosenzweig@collabora.com> Cc: Andreas Dilger <adilger.kernel@dilger.ca> Cc: Andreas Gruenbacher <agruenba@redhat.com> Cc: Anna Schumaker <anna@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Bob Peterson <rpeterso@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Carlos Llamas <cmllamas@google.com> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Chao Yu <chao@kernel.org> Cc: Christian Koenig <christian.koenig@amd.com> Cc: Chuck Lever <cel@kernel.org> Cc: Coly Li <colyli@suse.de> Cc: Dai Ngo <Dai.Ngo@oracle.com> Cc: Daniel Vetter <daniel@ffwll.ch> Cc: Daniel Vetter <daniel.vetter@ffwll.ch> Cc: "Darrick J. Wong" <djwong@kernel.org> Cc: Dave Chinner <david@fromorbit.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Airlie <airlied@gmail.com> Cc: David Hildenbrand <david@redhat.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Cc: Gao Xiang <hsiangkao@linux.alibaba.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Huang Rui <ray.huang@amd.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jaegeuk Kim <jaegeuk@kernel.org> Cc: Jani Nikula <jani.nikula@linux.intel.com> Cc: Jan Kara <jack@suse.cz> Cc: Jason Wang <jasowang@redhat.com> Cc: Jeff Layton <jlayton@kernel.org> Cc: Jeffle Xu <jefflexu@linux.alibaba.com> Cc: Joel Fernandes (Google) <joel@joelfernandes.org> Cc: Joonas Lahtinen <joonas.lahtinen@linux.intel.com> Cc: Juergen Gross <jgross@suse.com> Cc: Kent Overstreet <kent.overstreet@gmail.com> Cc: Kirill Tkhai <tkhai@ya.ru> Cc: Marijn Suijten <marijn.suijten@somainline.org> Cc: "Michael S. Tsirkin" <mst@redhat.com> Cc: Mike Snitzer <snitzer@kernel.org> Cc: Minchan Kim <minchan@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nadav Amit <namit@vmware.com> Cc: Neil Brown <neilb@suse.de> Cc: Oleksandr Tyshchenko <oleksandr_tyshchenko@epam.com> Cc: Olga Kornievskaia <kolga@netapp.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Richard Weinberger <richard@nod.at> Cc: Rob Clark <robdclark@gmail.com> Cc: Rob Herring <robh@kernel.org> Cc: Rodrigo Vivi <rodrigo.vivi@intel.com> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Sean Paul <sean@poorly.run> Cc: Sergey Senozhatsky <senozhatsky@chromium.org> Cc: Song Liu <song@kernel.org> Cc: Stefano Stabellini <sstabellini@kernel.org> Cc: Steven Price <steven.price@arm.com> Cc: "Theodore Ts'o" <tytso@mit.edu> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Tomeu Vizoso <tomeu.vizoso@collabora.com> Cc: Tom Talpey <tom@talpey.com> Cc: Trond Myklebust <trond.myklebust@hammerspace.com> Cc: Tvrtko Ursulin <tvrtko.ursulin@linux.intel.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Xuan Zhuo <xuanzhuo@linux.alibaba.com> Cc: Yue Hu <huyue2@coolpad.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | Merge tag 'for-6.7-tag' of ↵Linus Torvalds2023-10-3080-5285/+3893
|\ \ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "New features: - raid-stripe-tree New tree for logical file extent mapping where the physical mapping may not match on multiple devices. This is now used in zoned mode to implement RAID0/RAID1* profiles, but can be used in non-zoned mode as well. The support for RAID56 is in development and will eventually fix the problems with the current implementation. This is a backward incompatible feature and has to be enabled at mkfs time. - simple quota accounting (squota) A simplified mode of qgroup that accounts all space on the initial extent owners (a subvolume), the snapshots are then cheap to create and delete. The deletion of snapshots in fully accounting qgroups is a known CPU/IO performance bottleneck. The squota is not suitable for the general use case but works well for containers where the original subvolume exists for the whole time. This is a backward incompatible feature as it needs extending some structures, but can be enabled on an existing filesystem. - temporary filesystem fsid (temp_fsid) The fsid identifies a filesystem and is hard coded in the structures, which disallows mounting the same fsid found on different devices. For a single device filesystem this is not strictly necessary, a new temporary fsid can be generated on mount e.g. after a device is cloned. This will be used by Steam Deck for root partition A/B testing, or can be used for VM root images. Other user visible changes: - filesystems with partially finished metadata_uuid conversion cannot be mounted anymore and the uuid fixup has to be done by btrfs-progs (btrfstune). Performance improvements: - reduce reservations for checksum deletions (with enabled free space tree by factor of 4), on a sample workload on file with many extents the deletion time decreased by 12% - make extent state merges more efficient during insertions, reduce rb-tree iterations (run time of critical functions reduced by 5%) Core changes: - the integrity check functionality has been removed, this was a debugging feature and removal does not affect other integrity checks like checksums or tree-checker - space reservation changes: - more efficient delayed ref reservations, this avoids building up too much work or overusing or exhausting the global block reserve in some situations - move delayed refs reservation to the transaction start time, this prevents some ENOSPC corner cases related to exhaustion of global reserve - improvements in reducing excessive reservations for block group items - adjust overcommit logic in near full situations, account for one more chunk to eventually allocate metadata chunk, this is mostly relevant for small filesystems (<10GiB) - single device filesystems are scanned but not registered (except seed devices), this allows temp_fsid to work - qgroup iterations do not need GFP_ATOMIC allocations anymore - cleanups, refactoring, reduced data structure size, function parameter simplifications, error handling fixes" * tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits) btrfs: open code timespec64 in struct btrfs_inode btrfs: remove redundant log root tree index assignment during log sync btrfs: remove redundant initialization of variable dirty in btrfs_update_time() btrfs: sysfs: show temp_fsid feature btrfs: disable the device add feature for temp-fsid btrfs: disable the seed feature for temp-fsid btrfs: update comment for temp-fsid, fsid, and metadata_uuid btrfs: remove pointless empty log context list check when syncing log btrfs: update comment for struct btrfs_inode::lock btrfs: remove pointless barrier from btrfs_sync_file() btrfs: add and use helpers for reading and writing last_trans_committed btrfs: add and use helpers for reading and writing fs_info->generation btrfs: add and use helpers for reading and writing log_transid btrfs: add and use helpers for reading and writing last_log_commit btrfs: support cloned-device mount capability btrfs: add helper function find_fsid_by_disk btrfs: stop reserving excessive space for block group item insertions btrfs: stop reserving excessive space for block group item updates btrfs: reorder btrfs_inode to fill gaps btrfs: open code btrfs_ordered_inode_tree in btrfs_inode ...
| * | btrfs: open code timespec64 in struct btrfs_inodeDavid Sterba2023-10-123-23/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The type of timespec64::tv_nsec is 'unsigned long', while we have only u32 for on-disk and in-memory. This wastes a few bytes in btrfs_inode. Add separate members for sec and nsec with the corresponding type width. This creates a 4 byte hole in btrfs_inode which can be utilized in the future. Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: remove redundant log root tree index assignment during log syncFilipe Manana2023-10-121-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During log syncing, when we start updating the log root tree we compute an index value, stored in variable 'index2', once we lock the log root tree's mutex. This value depends on the log root's log_transid. And shortly after we compute again the same value for 'index2' - the value is exactly the same since we haven't released the mutex and therefore the log_transid of the log root is the same as before. This second 'index2' computation became pointless after commit a93e01682e28 ("btrfs: remove no longer needed use of log_writers for the log root tree"). So remove it. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: remove redundant initialization of variable dirty in btrfs_update_time()Colin Ian King2023-10-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The variable dirty is initialized with a value that is never read, it is being re-assigned later on. Remove the redundant initialization. Cleans up clang scan build warning: fs/btrfs/inode.c:5965:7: warning: Value stored to 'dirty' during its initialization is never read [deadcode.DeadStores] Signed-off-by: Colin Ian King <colin.i.king@gmail.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: sysfs: show temp_fsid featureAnand Jain2023-10-121-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds sysfs objects to indicate temp_fsid feature support and its status. /sys/fs/btrfs/features/temp_fsid /sys/fs/btrfs/<UUID>/temp_fsid For example: Consider two cloned and mounted devices. $ blkid /dev/sdc[1-2] /dev/sdc1: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" .. /dev/sdc2: UUID="509ad44b-ad2a-4a8a-bc8d-fe69db7220d5" .. One gets actual fsid, and the other gets the temp_fsid when mounted. $ btrfs filesystem show -m Label: none uuid: 509ad44b-ad2a-4a8a-bc8d-fe69db7220d5 Total devices 1 FS bytes used 54.14MiB devid 1 size 300.00MiB used 144.00MiB path /dev/sdc1 Label: none uuid: 33bad74e-c91b-43a5-aef8-b3cab97ae63a Total devices 1 FS bytes used 54.14MiB devid 1 size 300.00MiB used 144.00MiB path /dev/sdc2 Their sysfs as below. $ cat /sys/fs/btrfs/features/temp_fsid 0 $ cat /sys/fs/btrfs/509ad44b-ad2a-4a8a-bc8d-fe69db7220d5/temp_fsid 0 $ cat /sys/fs/btrfs/33bad74e-c91b-43a5-aef8-b3cab97ae63a/temp_fsid 1 Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: disable the device add feature for temp-fsidAnand Jain2023-10-121-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The device addition operation will transform the cloned temp-fsid mounted device into a multi-device filesystem. Therefore, it is marked as unsupported. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: disable the seed feature for temp-fsidAnand Jain2023-10-121-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | A seed device is an integral component of the sprout device, which functions as a multi-device filesystem. Therefore, temp-fsid feature is not supported. Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: update comment for temp-fsid, fsid, and metadata_uuidAnand Jain2023-10-121-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | Update the comment to explain the relationship between temp_fsid, fsid, and metadata_uuid. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: remove pointless empty log context list check when syncing logFilipe Manana2023-10-121-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When syncing the log, if we get an error when updating the log root, we check first if the log root tree context is in a log context list, and if so it deletes from the log root tree context from the list. This check however is pointless because at this moment the context is always in a list, he have just added it to a context list. The check became pointless after commit a93e01682e28 ("btrfs: remove no longer needed use of log_writers for the log root tree"). So remove this now pointless empty list check. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: update comment for struct btrfs_inode::lockFilipe Manana2023-10-121-14/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Update the comment for the lock named "lock" in struct btrfs_inode because it does not mention that the fields "delalloc_bytes", "defrag_bytes", "csum_bytes", "outstanding_extents" and "disk_i_size" are also protected by that lock. Also add a comment on top of each field protected by this lock to mention that the lock protects them. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: remove pointless barrier from btrfs_sync_file()Filipe Manana2023-10-121-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The memory barrier (smp_mb()) at btrfs_sync_file() is completely redundant now that fs_info->last_trans_committed is read using READ_ONCE(), with the helper btrfs_get_last_trans_committed(), and written using WRITE_ONCE() with the helper btrfs_set_last_trans_committed(). This barrier was introduced in 2011, by commit a4abeea41adf ("Btrfs: kill trans_mutex"), but even back then it was not correct since the writer side (in btrfs_commit_transaction()), did not issue a pairing memory barrier after it updated fs_info->last_trans_committed. So remove this barrier. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: add and use helpers for reading and writing last_trans_committedFilipe Manana2023-10-128-14/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the last_trans_committed field of struct btrfs_fs_info is modified and read without any locking or other protection. For example early in the fsync path, skip_inode_logging() is called which reads fs_info->last_trans_committed, but at the same time we can have a transaction commit completing and updating that field. In the case of an fsync this is harmless and any data race should be rare and at most cause an unnecessary logging of an inode. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_trans_committed field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: add and use helpers for reading and writing fs_info->generationFilipe Manana2023-10-126-6/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the generation field of struct btrfs_fs_info is always modified while holding fs_info->trans_lock locked. Most readers will access this field without taking that lock but while holding a transaction handle, which is safe to do due to the transaction life cycle. However there are other readers that are neither holding the lock nor holding a transaction handle open: 1) When reading an inode from disk, at btrfs_read_locked_inode(); 2) When reading the generation to expose it to sysfs, at btrfs_generation_show(); 3) Early in the fsync path, at skip_inode_logging(); 4) When creating a hole at btrfs_cont_expand(), during write paths, truncate and reflinking; 5) In the fs_info ioctl (btrfs_ioctl_fs_info()); 6) While mounting the filesystem, in the open_ctree() path. In these cases it's safe to directly read fs_info->generation as no one can concurrently start a transaction and update fs_info->generation. In case of the fsync path, races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. In the other cases it's not so clear if functional problems may arise, though in case 1 rare things like a load/store tearing [1] may cause the BTRFS_INODE_NEEDS_FULL_SYNC flag not being set on an inode and therefore result in incorrect logging later on in case a fsync call is made. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the generation field of struct btrfs_fs_info using READ_ONCE() and WRITE_ONCE(), and use these helpers where needed. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: add and use helpers for reading and writing log_transidFilipe Manana2023-10-124-4/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the log_transid field of a root is always modified while holding the root's log_mutex locked. Most readers of a root's log_transid are also holding the root's log_mutex locked, however there is one exception which is btrfs_set_inode_last_trans() where we don't take the lock to avoid blocking several operations if log syncing is happening in parallel. Any races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the log_transid field of a root using READ_ONCE() and WRITE_ONCE(), and use these helpers where needed. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: add and use helpers for reading and writing last_log_commitFilipe Manana2023-10-124-6/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the last_log_commit of a root can be accessed concurrently without any lock protection. Readers can be calling btrfs_inode_in_log() early in a fsync call, which reads a root's last_log_commit, while a writer can change the last_log_commit while a log tree if being synced, at btrfs_sync_log(). Any races here should be harmless, and in the worst case they may cause a fsync to log an inode when it's not really needed, so nothing bad from a functional perspective. To avoid data race warnings from tools like KCSAN and other issues such as load and store tearing (amongst others, see [1]), create helpers to access the last_log_commit field of a root using READ_ONCE() and WRITE_ONCE(), and use these helpers everywhere. [1] https://lwn.net/Articles/793253/ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: support cloned-device mount capabilityAnand Jain2023-10-123-4/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Guilherme's previous work [1] aimed at the mounting of cloned devices using a superblock flag SINGLE_DEV during mkfs. [1] https://lore.kernel.org/linux-btrfs/20230831001544.3379273-1-gpiccoli@igalia.com/ Building upon this work, here is in memory only approach. As it mounts we determine if the same fsid is already mounted if then we generate a random temp fsid which shall be used the mount, in memory only not written to the disk. We distinguish devices by devt. Example: $ fallocate -l 300m ./disk1.img $ mkfs.btrfs -f ./disk1.img $ cp ./disk1.img ./disk2.img $ cp ./disk1.img ./disk3.img $ mount -o loop ./disk1.img /btrfs $ mount -o ./disk2.img /btrfs1 $ mount -o ./disk3.img /btrfs2 $ btrfs fi show -m Label: none uuid: 4a212b48-1bec-46a5-938a-783c8c1f0b02 Total devices 1 FS bytes used 144.00KiB devid 1 size 300.00MiB used 88.00MiB path /dev/loop0 Label: none uuid: adabf2fe-5515-4ad0-95b4-7b1609218c16 Total devices 1 FS bytes used 144.00KiB devid 1 size 300.00MiB used 88.00MiB path /dev/loop1 Label: none uuid: 1d77d0df-7d92-439e-adbd-20b9b86fdedb Total devices 1 FS bytes used 144.00KiB devid 1 size 300.00MiB used 88.00MiB path /dev/loop2 Co-developed-by: Guilherme G. Piccoli <gpiccoli@igalia.com> Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: add helper function find_fsid_by_diskAnand Jain2023-10-121-4/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for adding support to mount multiple single-disk btrfs filesystems with the same FSID, wrap find_fsid() into find_fsid_by_disk(). Signed-off-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: stop reserving excessive space for block group item insertionsFilipe Manana2023-10-124-4/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Space for block group item insertions, necessary after allocating a new block group, is reserved in the delayed refs block reserve. Currently we do this by incrementing the transaction handle's delayed_ref_updates counter and then calling btrfs_update_delayed_refs_rsv(), which will increase the size of the delayed refs block reserve by an amount that corresponds to the same amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes(). That is an excessive amount because it corresponds to the amount of space needed to insert one item in a btree (btrfs_calc_insert_metadata_size()) times 2 when the free space tree feature is enabled. All we need is an amount as given by btrfs_calc_insert_metadata_size(), since we only need to insert a block group item in the extent tree (or block group tree if this feature is enabled). By using btrfs_calc_insert_metadata_size() we will need to reserve 2 times less space when using the free space tree, putting less pressure on space reservation. So use helpers to reserve and release space for block group item insertions that use btrfs_calc_insert_metadata_size() for calculation of the space. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: stop reserving excessive space for block group item updatesFilipe Manana2023-10-124-6/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Space for block group item updates, necessary after allocating or deallocating an extent from a block group, is reserved in the delayed refs block reserve. Currently we do this by incrementing the transaction handle's delayed_ref_updates counter and then calling btrfs_update_delayed_refs_rsv(), which will increase the size of the delayed refs block reserve by an amount that corresponds to the same amount we use for delayed refs, given by btrfs_calc_delayed_ref_bytes(). That is an excessive amount because it corresponds to the amount of space needed to insert one item in a btree (btrfs_calc_insert_metadata_size()) times 2 when the free space tree feature is enabled. All we need is an amount as given by btrfs_calc_metadata_size(), since we only need to update an existing block group item in the extent tree (or block group tree if this feature is enabled). By using btrfs_calc_metadata_size() we will need to reserve 4 times less space when using the free space tree and 2 times less space when not using it, putting less pressure on space reservation. So use helpers to reserve and release space for block group item updates that use btrfs_calc_metadata_size() for calculation of the space. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: reorder btrfs_inode to fill gapsDavid Sterba2023-10-121-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | Previous commit created a hole in struct btrfs_inode, we can move outstanding_extents there. This reduces size by 8 bytes from 1120 to 1112 on a release config. Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: open code btrfs_ordered_inode_tree in btrfs_inodeDavid Sterba2023-10-125-89/+64
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The structure btrfs_ordered_inode_tree is used only in one place, in btrfs_inode. The structure itself has a 4 byte hole which is wasted space. Move the btrfs_ordered_inode_tree members to btrfs_inode with a common prefix 'ordered_tree_' where the hole can be utilized and shrink inode size. Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: adjust overcommit logic when very close to fullJosef Bacik2023-10-121-0/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A user reported some unpleasant behavior with very small file systems. The reproducer is this $ mkfs.btrfs -f -m single -b 8g /dev/vdb $ mount /dev/vdb /mnt/test $ dd if=/dev/zero of=/mnt/test/testfile bs=512M count=20 This will result in usage that looks like this Overall: Device size: 8.00GiB Device allocated: 8.00GiB Device unallocated: 1.00MiB Device missing: 0.00B Device slack: 2.00GiB Used: 5.47GiB Free (estimated): 2.52GiB (min: 2.52GiB) Free (statfs, df): 0.00B Data ratio: 1.00 Metadata ratio: 1.00 Global reserve: 5.50MiB (used: 0.00B) Multiple profiles: no Data,single: Size:7.99GiB, Used:5.46GiB (68.41%) /dev/vdb 7.99GiB Metadata,single: Size:8.00MiB, Used:5.77MiB (72.07%) /dev/vdb 8.00MiB System,single: Size:4.00MiB, Used:16.00KiB (0.39%) /dev/vdb 4.00MiB Unallocated: /dev/vdb 1.00MiB As you can see we've gotten ourselves quite full with metadata, with all of the disk being allocated for data. On smaller file systems there's not a lot of time before we get full, so our overcommit behavior bites us here. Generally speaking data reservations result in chunk allocations as we assume reservation == actual use for data. This means at any point we could end up with a chunk allocation for data, and if we're very close to full we could do this before we have a chance to figure out that we need another metadata chunk. Address this by adjusting the overcommit logic. Simply put we need to take away 1 chunk from the available chunk space in case of a data reservation. This will allow us to stop overcommitting before we potentially lose this space to a data allocation. With this fix in place we properly allocate a metadata chunk before we're completely full, allowing for enough slack space in metadata. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: increase ->free_chunk_space in btrfs_grow_deviceJosef Bacik2023-10-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | My overcommit patch exposed a bug with btrfs/177 [1]. The problem here is that when we grow the device we're not adding to ->free_chunk_space, so subsequent allocations can cause ->free_chunk_space to wrap, which causes problems in can_overcommit because we add this to ->total_bytes, which causes the counter to wrap and gives us an unexpected ENOSPC. Fix this by properly updating ->free_chunk_space with the new available space in btrfs_grow_device. [1] First version of the fix: https://lore.kernel.org/linux-btrfs/b97e47ce0ce1d41d221878de7d6090b90aa7a597.1695065233.git.josef@toxicpanda.com/ Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: fix ->free_chunk_space math in btrfs_shrink_deviceJosef Bacik2023-10-121-3/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are two bugs in how we adjust ->free_chunk_space in btrfs_shrink_device. First we're removing the entire diff between new_size and old_size from ->free_chunk_space. This only works if we're reducing the free area, which we could potentially not be. So adjust the math to only subtract the diff in the free space from ->free_chunk_space. Additionally in the error case we're unconditionally adding the diff back into ->free_chunk_space, which we need to only do if this device is writeable. Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: make sure we cache next state in find_first_extent_bit()Filipe Manana2023-10-121-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, at find_first_extent_bit(), when we are given a cached extent state that happens to have its end offset match the desired range start, we find the next extent state using that cached state, with next_state() calls, and then return it. We then try to cache that next state by calling cache_state_if_flags(), but that will not cache the state because we haven't reset *cached_state to NULL, so we end up with the cached_state unchanged, and if the caller is iterating over extent states in the io tree, its next call to find_first_extent_bit() will not use the current cached state as its end offset does not match the minimum start range offset, therefore the cached state is reset and we have to search the rbtree to find the next suitable extent state record. So fix this by resetting the cached state to NULL (and dropping our ref on it) when we have a suitable cached state and we found a next state by using next_state() starting from the cached state. This makes use cases of calling find_first_extent_bit() to go over all ranges in the io tree to do a single rbtree full search, only on the first call, and the next calls will just do next_state() (rb_next() wrapper) calls, which is more efficient. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: use extent_io_tree_release() to empty dirty log pagesFilipe Manana2023-10-121-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When freeing a log tree, during a transaction commit, we clear its dirty log pages io tree by calling clear_extent_bits() using a range from 0 to (u64)-1. This will iterate the io tree's rbtree and call rb_erase() on each node before freeing it, which will often trigger rebalance operations on the rbtree. A better alternative it to use extent_io_tree_release(), which will not do deletions and trigger rebalances. So use extent_io_tree_release() instead of clear_extent_bits(). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: make tree iteration in extent_io_tree_release() more efficientFilipe Manana2023-10-121-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently extent_io_tree_release() is a loop that keeps getting the first node in the io tree, using rb_first() which is a loop that gets to the leftmost node of the rbtree, and then for each node it calls rb_erase(), which often requires rebalancing the rbtree. We can make this more efficient by using rbtree_postorder_for_each_entry_safe() to free each node without having to delete it from the rbtree and without looping to get the first node. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: collapse wait_on_state() to its caller wait_extent_bit()Filipe Manana2023-10-121-15/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The wait_on_state() function is very short and has a single caller, which is wait_extent_bit(), so remove the function and put its code into the caller. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: remove redundant memory barrier from extent_io_tree_release()Filipe Manana2023-10-121-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The memory barrier at extent_io_tree_release() is redundant. Holding spin_lock here is not enough to drop the barrier completely. We only change the waitqueue of an extent state record while holding the tree lock - see wait_on_state(). The update to waitqueue state will not become stale because there will be an spin_unlock/spin_lock sequence between the change and waiting, this implies a full memory barrier. So remove the explicit smp_mb() barrier. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reword reasoning ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: make wait_extent_bit() staticFilipe Manana2023-10-122-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function wait_extent_bit() is not used outside extent-io-tree.c so make it static. Furthermore the function doesn't have the 'btrfs_' prefix. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: update stale comment at extent_io_tree_release()Filipe Manana2023-10-121-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's this comment at extent_io_tree_release() that mentions io btrees, but this function is no longer used only for io btrees. Originally it was added as a static function named clear_btree_io_tree() at transaction.c, in commit 663dfbb07774 ("Btrfs: deal with convert_extent_bit errors to avoid fs corruption"), as it was used only for cleaning one of the io trees that track dirty extent buffers, the dirty_log_pages io tree of a a root and the dirty_pages io tree of a transaction. Later it was renamed and exported and now it's used to cleanup other io trees such as the allocation state io tree of a device or the csums range io tree of a log root. So remove that comment and replace it with one at the top of the function that is more complete, mentioning what the function does and that it's expected to be called only when a task is sure no one else will need to use the tree anymore, as well as there should be no locked ranges in the tree and therefore no waiters on its extent state records. Also add an assertion to check that there are no locked extent state records in the tree. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: make extent state merges more efficient during insertionsFilipe Manana2023-10-121-42/+89
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When inserting a new extent state record into an io tree that happens to be mergeable, we currently do the following: 1) Insert the extent state record in the io tree's rbtree. This requires going down the tree to find where to insert it, and during the insertion we often need to balance the rbtree; 2) We then check if the previous node is mergeable, so we call rb_prev() to find it, which requires some looping to find the previous node; 3) If the previous node is mergeable, we adjust our node to include the range of the previous node and then delete the previous node from the rbtree, which again may need to balance the rbtree; 4) Then we check if the next node is mergeable with the node we inserted, so we call rb_next(), which requires some looping too. If the next node is indeed mergeable, we expand the range of our node to include the next node's range and then delete the next node from the rbtree, which again may need to balance the tree. So these are quite of lot of iterations and looping over the rbtree, and some of the operations may need to rebalance the rb tree. This can be made a bit more efficient by: 1) When iterating the rbtree, once we find a node that is mergeable with the node we want to insert, we can just adjust that node's range with the range of the node to insert - this avoids continuing iterating over the tree and deleting a node from the rbtree; 2) If we expand the range of a mergeable node, then we find the next or the previous node, depending on other we merged a range to the right or to the left of the node we are currently at during the iteration. This merging is as before, we find the next or previous node with rb_next() or rb_prev() and if that other node is mergeable with the current one, we adjust the range of the current node and remove the other node from the rbtree; 3) Whenever we need to insert the new extent state record it's because we don't have any extent state record in the rbtree which can be merged, so we can remove the call to merge_state() after the insertion, saving rb_next() and rb_prev() calls, which require some looping. So update the insertion function insert_state() to have this behaviour. Running dbench for 120 seconds and capturing the execution times of set_extent_bit() at pin_down_extent(), resulted in the following data (time values are in nanoseconds): Before this change: Count: 2278299 Range: 0.000 - 4003728.000; Mean: 713.436; Median: 612.000; Stddev: 3606.952 Percentiles: 90th: 1187.000; 95th: 1350.000; 99th: 1724.000 0.000 - 7.534: 5 | 7.534 - 35.418: 36 | 35.418 - 154.403: 273 | 154.403 - 662.138: 1244016 ##################################################### 662.138 - 2828.745: 1031335 ############################################ 2828.745 - 12074.102: 1395 | 12074.102 - 51525.930: 806 | 51525.930 - 219874.955: 162 | 219874.955 - 938254.688: 22 | 938254.688 - 4003728.000: 3 | After this change: Count: 2275862 Range: 0.000 - 1605175.000; Mean: 678.903; Median: 590.000; Stddev: 2149.785 Percentiles: 90th: 1105.000; 95th: 1245.000; 99th: 1590.000 0.000 - 10.219: 10 | 10.219 - 40.957: 36 | 40.957 - 155.907: 262 | 155.907 - 585.789: 1127214 #################################################### 585.789 - 2193.431: 1145134 ##################################################### 2193.431 - 8205.578: 1648 | 8205.578 - 30689.378: 1039 | 30689.378 - 114772.699: 362 | 114772.699 - 429221.537: 52 | 429221.537 - 1605175.000: 10 | Maximum duration (range), average duration, percentiles and standard deviation are all better. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: change test_range_bit to scan the whole rangeDavid Sterba2023-10-125-22/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | The semantics of test_range_bit() with filled == 0 is now in it's own helper so test_range_bit will check the whole range unconditionally. The detection logic is flipped and assumes success by default and catches exceptions. Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: add specific helper for range bit test existsDavid Sterba2023-10-125-10/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The existing helper test_range_bit works in two ways, checks if the whole range contains all the bits, or stop on the first occurrence. By adding a specific helper for the latter case, the inner loop can be simplified and contains fewer conditionals, making it a bit faster. There's no caller that uses the cached state pointer so this reduces the argument count further. Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: move btrfs_realloc_node() from ctree.c into defrag.cFilipe Manana2023-10-123-116/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_realloc_node() is only used by the defrag code. Nowadays we have a defrag.c file, so move it, and its helper close_blocks(), into defrag.c. During the move also do a few minor cosmetic changes: 1) Change the return value of close_blocks() from int to bool; 2) Use SZ_32K instead of 32768 at close_blocks(); 3) Make some variables const in btrfs_realloc_node(), 'blocksize' and 'end_slot'; 4) Get rid of 'parent_nritems' variable, in both places where it was used it could be replaced by calling btrfs_header_nritems(parent); 5) Change the type of a couple variables from int to bool; 6) Rename variable 'err' to 'ret', as that's the most common name we use to track the return value of a function; 7) Move some variables from the top scope to the scope of the for loop where they are used. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: export comp_keys() from ctree.c as btrfs_comp_keys()Filipe Manana2023-10-122-37/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Export comp_keys() out of ctree.c, as btrfs_comp_keys(), so that in a later patch we can move out defrag specific code from ctree.c into defrag.c. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: rename and export __btrfs_cow_block()Filipe Manana2023-10-122-15/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename and export __btrfs_cow_block() as btrfs_force_cow_block(). This is to allow to move defrag specific code out of ctree.c and into defrag.c in one of the next patches. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | btrfs: use round_down() to align block offset at btrfs_cow_block()Filipe Manana2023-10-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At btrfs_cow_block() we can use round_down() to align the extent buffer's logical offset to the start offset of a metadata block group, instead of the less easy to read set of bitwise operations (two plus one subtraction). So replace the bitwise operations with a round_down() call. Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>