summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/relocation.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* btrfs: check folio mapping after unlock in relocate_one_folio()Boris Burkov2024-12-231-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we call btrfs_read_folio() to bring a folio uptodate, we unlock the folio. The result of that is that a different thread can modify the mapping (like remove it with invalidate) before we call folio_lock(). This results in an invalid page and we need to try again. In particular, if we are relocating concurrently with aborting a transaction, this can result in a crash like the following: BUG: kernel NULL pointer dereference, address: 0000000000000000 PGD 0 P4D 0 Oops: 0000 [#1] SMP CPU: 76 PID: 1411631 Comm: kworker/u322:5 Workqueue: events_unbound btrfs_reclaim_bgs_work RIP: 0010:set_page_extent_mapped+0x20/0xb0 RSP: 0018:ffffc900516a7be8 EFLAGS: 00010246 RAX: ffffea009e851d08 RBX: ffffea009e0b1880 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffc900516a7b90 RDI: ffffea009e0b1880 RBP: 0000000003573000 R08: 0000000000000001 R09: ffff88c07fd2f3f0 R10: 0000000000000000 R11: 0000194754b575be R12: 0000000003572000 R13: 0000000003572fff R14: 0000000000100cca R15: 0000000005582fff FS: 0000000000000000(0000) GS:ffff88c07fd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000000000000 CR3: 000000407d00f002 CR4: 00000000007706f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __die+0x78/0xc0 ? page_fault_oops+0x2a8/0x3a0 ? __switch_to+0x133/0x530 ? wq_worker_running+0xa/0x40 ? exc_page_fault+0x63/0x130 ? asm_exc_page_fault+0x22/0x30 ? set_page_extent_mapped+0x20/0xb0 relocate_file_extent_cluster+0x1a7/0x940 relocate_data_extent+0xaf/0x120 relocate_block_group+0x20f/0x480 btrfs_relocate_block_group+0x152/0x320 btrfs_relocate_chunk+0x3d/0x120 btrfs_reclaim_bgs_work+0x2ae/0x4e0 process_scheduled_works+0x184/0x370 worker_thread+0xc6/0x3e0 ? blk_add_timer+0xb0/0xb0 kthread+0xae/0xe0 ? flush_tlb_kernel_range+0x90/0x90 ret_from_fork+0x2f/0x40 ? flush_tlb_kernel_range+0x90/0x90 ret_from_fork_asm+0x11/0x20 </TASK> This occurs because cleanup_one_transaction() calls destroy_delalloc_inodes() which calls invalidate_inode_pages2() which takes the folio_lock before setting mapping to NULL. We fail to check this, and subsequently call set_extent_mapping(), which assumes that mapping != NULL (in fact it asserts that in debug mode) Note that the "fixes" patch here is not the one that introduced the race (the very first iteration of this code from 2009) but a more recent change that made this particular crash happen in practice. Fixes: e7f1326cc24e ("btrfs: set page extent mapped after read_folio in relocate_one_page") CC: stable@vger.kernel.org # 6.1+ Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: drop unused transaction parameter from btrfs_qgroup_add_swapped_blocks()David Sterba2024-11-111-1/+1
| | | | | | | | The caller replace_path() runs under transaction but we don't need it in btrfs_qgroup_add_swapped_blocks(). Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: fix a NULL pointer dereference when failed to start a new trasacntionQu Wenruo2024-10-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [BUG] Syzbot reported a NULL pointer dereference with the following crash: FAULT_INJECTION: forcing a failure. start_transaction+0x830/0x1670 fs/btrfs/transaction.c:676 prepare_to_relocate+0x31f/0x4c0 fs/btrfs/relocation.c:3642 relocate_block_group+0x169/0xd20 fs/btrfs/relocation.c:3678 ... BTRFS info (device loop0): balance: ended with status: -12 Oops: general protection fault, probably for non-canonical address 0xdffffc00000000cc: 0000 [#1] PREEMPT SMP KASAN NOPTI KASAN: null-ptr-deref in range [0x0000000000000660-0x0000000000000667] RIP: 0010:btrfs_update_reloc_root+0x362/0xa80 fs/btrfs/relocation.c:926 Call Trace: <TASK> commit_fs_roots+0x2ee/0x720 fs/btrfs/transaction.c:1496 btrfs_commit_transaction+0xfaf/0x3740 fs/btrfs/transaction.c:2430 del_balance_item fs/btrfs/volumes.c:3678 [inline] reset_balance_state+0x25e/0x3c0 fs/btrfs/volumes.c:3742 btrfs_balance+0xead/0x10c0 fs/btrfs/volumes.c:4574 btrfs_ioctl_balance+0x493/0x7c0 fs/btrfs/ioctl.c:3673 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:907 [inline] __se_sys_ioctl+0xf9/0x170 fs/ioctl.c:893 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0xf3/0x230 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x77/0x7f [CAUSE] The allocation failure happens at the start_transaction() inside prepare_to_relocate(), and during the error handling we call unset_reloc_control(), which makes fs_info->balance_ctl to be NULL. Then we continue the error path cleanup in btrfs_balance() by calling reset_balance_state() which will call del_balance_item() to fully delete the balance item in the root tree. However during the small window between set_reloc_contrl() and unset_reloc_control(), we can have a subvolume tree update and created a reloc_root for that subvolume. Then we go into the final btrfs_commit_transaction() of del_balance_item(), and into btrfs_update_reloc_root() inside commit_fs_roots(). That function checks if fs_info->reloc_ctl is in the merge_reloc_tree stage, but since fs_info->reloc_ctl is NULL, it results a NULL pointer dereference. [FIX] Just add extra check on fs_info->reloc_ctl inside btrfs_update_reloc_root(), before checking fs_info->reloc_ctl->merge_reloc_tree. That DEAD_RELOC_TREE handling is to prevent further modification to the reloc tree during merge stage, but since there is no reloc_ctl at all, we do not need to bother that. Reported-by: syzbot+283673dbc38527ef9f3d@syzkaller.appspotmail.com Link: https://lore.kernel.org/linux-btrfs/66f6bfa7.050a0220.38ace9.0019.GAE@google.com/ CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: drop the backref cache during relocation if we commitJosef Bacik2024-10-011-72/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the inception of relocation we have maintained the backref cache across transaction commits, updating the backref cache with the new bytenr whenever we COWed blocks that were in the cache, and then updating their bytenr once we detected a transaction id change. This works as long as we're only ever modifying blocks, not changing the structure of the tree. However relocation does in fact change the structure of the tree. For example, if we are relocating a data extent, we will look up all the leaves that point to this data extent. We will then call do_relocation() on each of these leaves, which will COW down to the leaf and then update the file extent location. But, a key feature of do_relocation() is the pending list. This is all the pending nodes that we modified when we updated the file extent item. We will then process all of these blocks via finish_pending_nodes, which calls do_relocation() on all of the nodes that led up to that leaf. The purpose of this is to make sure we don't break sharing unless we absolutely have to. Consider the case that we have 3 snapshots that all point to this leaf through the same nodes, the initial COW would have created a whole new path. If we did this for all 3 snapshots we would end up with 3x the number of nodes we had originally. To avoid this we will cycle through each of the snapshots that point to each of these nodes and update their pointers to point at the new nodes. Once we update the pointer to the new node we will drop the node we removed the link for and all of its children via btrfs_drop_subtree(). This is essentially just btrfs_drop_snapshot(), but for an arbitrary point in the snapshot. The problem with this is that we will never reflect this in the backref cache. If we do this btrfs_drop_snapshot() for a node that is in the backref tree, we will leave the node in the backref tree. This becomes a problem when we change the transid, as now the backref cache has entire subtrees that no longer exist, but exist as if they still are pointed to by the same roots. In the best case scenario you end up with "adding refs to an existing tree ref" errors from insert_inline_extent_backref(), where we attempt to link in nodes on roots that are no longer valid. Worst case you will double free some random block and re-use it when there's still references to the block. This is extremely subtle, and the consequences are quite bad. There isn't a way to make sure our backref cache is consistent between transid's. In order to fix this we need to simply evict the entire backref cache anytime we cross transid's. This reduces performance in that we have to rebuild this backref cache every time we change transid's, but fixes the bug. This has existed since relocation was added, and is a pretty critical bug. There's a lot more cleanup that can be done now that this functionality is going away, but this patch is as small as possible in order to fix the problem and make it easy for us to backport it to all the kernels it needs to be backported to. Followup series will dismantle more of this code and simplify relocation drastically to remove this functionality. We have a reproducer that reproduced the corruption within a few minutes of running. With this patch it survives several iterations/hours of running the reproducer. Fixes: 3fd0a5585eb9 ("Btrfs: Metadata ENOSPC handling for balance") CC: stable@vger.kernel.org Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: don't readahead the relocation inode on RSTJohannes Thumshirn2024-09-101-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On relocation we're doing readahead on the relocation inode, but if the filesystem is backed by a RAID stripe tree we can get ENOENT (e.g. due to preallocated extents not being mapped in the RST) from the lookup. But readahead doesn't handle the error and submits invalid reads to the device, causing an assertion in the scatter-gather list code: BTRFS info (device nvme1n1): balance: start -d -m -s BTRFS info (device nvme1n1): relocating block group 6480920576 flags data|raid0 BTRFS error (device nvme1n1): cannot find raid-stripe for logical [6481928192, 6481969152] devid 2, profile raid0 ------------[ cut here ]------------ kernel BUG at include/linux/scatterlist.h:115! Oops: invalid opcode: 0000 [#1] PREEMPT SMP PTI CPU: 0 PID: 1012 Comm: btrfs Not tainted 6.10.0-rc7+ #567 RIP: 0010:__blk_rq_map_sg+0x339/0x4a0 RSP: 0018:ffffc90001a43820 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802 RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000 RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8 R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000 FS: 00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000000002cd11000 CR3: 00000001109ea001 CR4: 0000000000370eb0 Call Trace: <TASK> ? __die_body.cold+0x14/0x25 ? die+0x2e/0x50 ? do_trap+0xca/0x110 ? do_error_trap+0x65/0x80 ? __blk_rq_map_sg+0x339/0x4a0 ? exc_invalid_op+0x50/0x70 ? __blk_rq_map_sg+0x339/0x4a0 ? asm_exc_invalid_op+0x1a/0x20 ? __blk_rq_map_sg+0x339/0x4a0 nvme_prep_rq.part.0+0x9d/0x770 nvme_queue_rq+0x7d/0x1e0 __blk_mq_issue_directly+0x2a/0x90 ? blk_mq_get_budget_and_tag+0x61/0x90 blk_mq_try_issue_list_directly+0x56/0xf0 blk_mq_flush_plug_list.part.0+0x52b/0x5d0 __blk_flush_plug+0xc6/0x110 blk_finish_plug+0x28/0x40 read_pages+0x160/0x1c0 page_cache_ra_unbounded+0x109/0x180 relocate_file_extent_cluster+0x611/0x6a0 ? btrfs_search_slot+0xba4/0xd20 ? balance_dirty_pages_ratelimited_flags+0x26/0xb00 relocate_data_extent.constprop.0+0x134/0x160 relocate_block_group+0x3f2/0x500 btrfs_relocate_block_group+0x250/0x430 btrfs_relocate_chunk+0x3f/0x130 btrfs_balance+0x71b/0xef0 ? kmalloc_trace_noprof+0x13b/0x280 btrfs_ioctl+0x2c2e/0x3030 ? kvfree_call_rcu+0x1e6/0x340 ? list_lru_add_obj+0x66/0x80 ? mntput_no_expire+0x3a/0x220 __x64_sys_ioctl+0x96/0xc0 do_syscall_64+0x54/0x110 entry_SYSCALL_64_after_hwframe+0x76/0x7e RIP: 0033:0x7fcc04514f9b Code: Unable to access opcode bytes at 0x7fcc04514f71. RSP: 002b:00007ffeba923370 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000003 RCX: 00007fcc04514f9b RDX: 00007ffeba923460 RSI: 00000000c4009420 RDI: 0000000000000003 RBP: 0000000000000000 R08: 0000000000000013 R09: 0000000000000001 R10: 00007fcc043fbba8 R11: 0000000000000246 R12: 00007ffeba924fc5 R13: 00007ffeba923460 R14: 0000000000000002 R15: 00000000004d4bb0 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:__blk_rq_map_sg+0x339/0x4a0 RSP: 0018:ffffc90001a43820 EFLAGS: 00010202 RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffffea00045d4802 RDX: 0000000117520000 RSI: 0000000000000000 RDI: ffff8881027d1000 RBP: 0000000000003000 R08: ffffea00045d4902 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000001000 R12: ffff8881003d10b8 R13: ffffc90001a438f0 R14: 0000000000000000 R15: 0000000000003000 FS: 00007fcc048a6900(0000) GS:ffff88813bc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007fcc04514f71 CR3: 00000001109ea001 CR4: 0000000000370eb0 Kernel panic - not syncing: Fatal exception Kernel Offset: disabled ---[ end Kernel panic - not syncing: Fatal exception ]--- So in case of a relocation on a RAID stripe-tree based file system, skip the readahead. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Merge tag 'mm-stable-2024-07-21-14-50' of ↵Linus Torvalds2024-07-221-2/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: - In the series "mm: Avoid possible overflows in dirty throttling" Jan Kara addresses a couple of issues in the writeback throttling code. These fixes are also targetted at -stable kernels. - Ryusuke Konishi's series "nilfs2: fix potential issues related to reserved inodes" does that. This should actually be in the mm-nonmm-stable tree, along with the many other nilfs2 patches. My bad. - More folio conversions from Kefeng Wang in the series "mm: convert to folio_alloc_mpol()" - Kemeng Shi has sent some cleanups to the writeback code in the series "Add helper functions to remove repeated code and improve readability of cgroup writeback" - Kairui Song has made the swap code a little smaller and a little faster in the series "mm/swap: clean up and optimize swap cache index". - In the series "mm/memory: cleanly support zeropage in vm_insert_page*(), vm_map_pages*() and vmf_insert_mixed()" David Hildenbrand has reworked the rather sketchy handling of the use of the zeropage in MAP_SHARED mappings. I don't see any runtime effects here - more a cleanup/understandability/maintainablity thing. - Dev Jain has improved selftests/mm/va_high_addr_switch.c's handling of higher addresses, for aarch64. The (poorly named) series is "Restructure va_high_addr_switch". - The core TLB handling code gets some cleanups and possible slight optimizations in Bang Li's series "Add update_mmu_tlb_range() to simplify code". - Jane Chu has improved the handling of our fake-an-unrecoverable-memory-error testing feature MADV_HWPOISON in the series "Enhance soft hwpoison handling and injection". - Jeff Johnson has sent a billion patches everywhere to add MODULE_DESCRIPTION() to everything. Some landed in this pull. - In the series "mm: cleanup MIGRATE_SYNC_NO_COPY mode", Kefeng Wang has simplified migration's use of hardware-offload memory copying. - Yosry Ahmed performs more folio API conversions in his series "mm: zswap: trivial folio conversions". - In the series "large folios swap-in: handle refault cases first", Chuanhua Han inches us forward in the handling of large pages in the swap code. This is a cleanup and optimization, working toward the end objective of full support of large folio swapin/out. - In the series "mm,swap: cleanup VMA based swap readahead window calculation", Huang Ying has contributed some cleanups and a possible fixlet to his VMA based swap readahead code. - In the series "add mTHP support for anonymous shmem" Baolin Wang has taught anonymous shmem mappings to use multisize THP. By default this is a no-op - users must opt in vis sysfs controls. Dramatic improvements in pagefault latency are realized. - David Hildenbrand has some cleanups to our remaining use of page_mapcount() in the series "fs/proc: move page_mapcount() to fs/proc/internal.h". - David also has some highmem accounting cleanups in the series "mm/highmem: don't track highmem pages manually". - Build-time fixes and cleanups from John Hubbard in the series "cleanups, fixes, and progress towards avoiding "make headers"". - Cleanups and consolidation of the core pagemap handling from Barry Song in the series "mm: introduce pmd|pte_needs_soft_dirty_wp helpers and utilize them". - Lance Yang's series "Reclaim lazyfree THP without splitting" has reduced the latency of the reclaim of pmd-mapped THPs under fairly common circumstances. A 10x speedup is seen in a microbenchmark. It does this by punting to aother CPU but I guess that's a win unless all CPUs are pegged. - hugetlb_cgroup cleanups from Xiu Jianfeng in the series "mm/hugetlb_cgroup: rework on cftypes". - Miaohe Lin's series "Some cleanups for memory-failure" does just that thing. - Someone other than SeongJae has developed a DAMON feature in Honggyu Kim's series "DAMON based tiered memory management for CXL memory". This adds DAMON features which may be used to help determine the efficiency of our placement of CXL/PCIe attached DRAM. - DAMON user API centralization and simplificatio work in SeongJae Park's series "mm/damon: introduce DAMON parameters online commit function". - In the series "mm: page_type, zsmalloc and page_mapcount_reset()" David Hildenbrand does some maintenance work on zsmalloc - partially modernizing its use of pageframe fields. - Kefeng Wang provides more folio conversions in the series "mm: remove page_maybe_dma_pinned() and page_mkclean()". - More cleanup from David Hildenbrand, this time in the series "mm/memory_hotplug: use PageOffline() instead of PageReserved() for !ZONE_DEVICE". It "enlightens memory hotplug more about PageOffline() pages" and permits the removal of some virtio-mem hacks. - Barry Song's series "mm: clarify folio_add_new_anon_rmap() and __folio_add_anon_rmap()" is a cleanup to the anon folio handling in preparation for mTHP (multisize THP) swapin. - Kefeng Wang's series "mm: improve clear and copy user folio" implements more folio conversions, this time in the area of large folio userspace copying. - The series "Docs/mm/damon/maintaier-profile: document a mailing tool and community meetup series" tells people how to get better involved with other DAMON developers. From SeongJae Park. - A large series ("kmsan: Enable on s390") from Ilya Leoshkevich does that. - David Hildenbrand sends along more cleanups, this time against the migration code. The series is "mm/migrate: move NUMA hinting fault folio isolation + checks under PTL". - Jan Kara has found quite a lot of strangenesses and minor errors in the readahead code. He addresses this in the series "mm: Fix various readahead quirks". - SeongJae Park's series "selftests/damon: test DAMOS tried regions and {min,max}_nr_regions" adds features and addresses errors in DAMON's self testing code. - Gavin Shan has found a userspace-triggerable WARN in the pagecache code. The series "mm/filemap: Limit page cache size to that supported by xarray" addresses this. The series is marked cc:stable. - Chengming Zhou's series "mm/ksm: cmp_and_merge_page() optimizations and cleanup" cleans up and slightly optimizes KSM. - Roman Gushchin has separated the memcg-v1 and memcg-v2 code - lots of code motion. The series (which also makes the memcg-v1 code Kconfigurable) are "mm: memcg: separate legacy cgroup v1 code and put under config option" and "mm: memcg: put cgroup v1-specific memcg data under CONFIG_MEMCG_V1" - Dan Schatzberg's series "Add swappiness argument to memory.reclaim" adds an additional feature to this cgroup-v2 control file. - The series "Userspace controls soft-offline pages" from Jiaqi Yan permits userspace to stop the kernel's automatic treatment of excessive correctable memory errors. In order to permit userspace to monitor and handle this situation. - Kefeng Wang's series "mm: migrate: support poison recover from migrate folio" teaches the kernel to appropriately handle migration from poisoned source folios rather than simply panicing. - SeongJae Park's series "Docs/damon: minor fixups and improvements" does those things. - In the series "mm/zsmalloc: change back to per-size_class lock" Chengming Zhou improves zsmalloc's scalability and memory utilization. - Vivek Kasireddy's series "mm/gup: Introduce memfd_pin_folios() for pinning memfd folios" makes the GUP code use FOLL_PIN rather than bare refcount increments. So these paes can first be moved aside if they reside in the movable zone or a CMA block. - Andrii Nakryiko has added a binary ioctl()-based API to /proc/pid/maps for much faster reading of vma information. The series is "query VMAs from /proc/<pid>/maps". - In the series "mm: introduce per-order mTHP split counters" Lance Yang improves the kernel's presentation of developer information related to multisize THP splitting. - Michael Ellerman has developed the series "Reimplement huge pages without hugepd on powerpc (8xx, e500, book3s/64)". This permits userspace to use all available huge page sizes. - In the series "revert unconditional slab and page allocator fault injection calls" Vlastimil Babka removes a performance-affecting and not very useful feature from slab fault injection. * tag 'mm-stable-2024-07-21-14-50' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (411 commits) mm/mglru: fix ineffective protection calculation mm/zswap: fix a white space issue mm/hugetlb: fix kernel NULL pointer dereference when migrating hugetlb folio mm/hugetlb: fix possible recursive locking detected warning mm/gup: clear the LRU flag of a page before adding to LRU batch mm/numa_balancing: teach mpol_to_str about the balancing mode mm: memcg1: convert charge move flags to unsigned long long alloc_tag: fix page_ext_get/page_ext_put sequence during page splitting lib: reuse page_ext_data() to obtain codetag_ref lib: add missing newline character in the warning message mm/mglru: fix overshooting shrinker memory mm/mglru: fix div-by-zero in vmpressure_calc_level() mm/kmemleak: replace strncpy() with strscpy() mm, page_alloc: put should_fail_alloc_page() back behing CONFIG_FAIL_PAGE_ALLOC mm, slab: put should_failslab() back behind CONFIG_SHOULD_FAILSLAB mm: ignore data-race in __swap_writepage hugetlbfs: ensure generic_hugetlb_get_unmapped_area() returns higher address than mmap_min_addr mm: shmem: rename mTHP shmem counters mm: swap_state: use folio_alloc_mpol() in __read_swap_cache_async() mm/migrate: putback split folios when numa hint migration fails ...
| * readahead: drop index argument of page_cache_async_readahead()Jan Kara2024-07-041-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | The index argument of page_cache_async_readahead() is just folio->index so there's no point in passing is separately. Drop it. Link: https://lkml.kernel.org/r/20240625101909.12234-5-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Tested-by: Zhang Peng <zhangpengpeng0808@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | btrfs: fix data race when accessing the last_trans field of a rootFilipe Manana2024-07-111-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KCSAN complains about a data race when accessing the last_trans field of a root: [ 199.553628] BUG: KCSAN: data-race in btrfs_record_root_in_trans [btrfs] / record_root_in_trans [btrfs] [ 199.555186] read to 0x000000008801e308 of 8 bytes by task 2812 on cpu 1: [ 199.555210] btrfs_record_root_in_trans+0x9a/0x128 [btrfs] [ 199.555999] start_transaction+0x154/0xcd8 [btrfs] [ 199.556780] btrfs_join_transaction+0x44/0x60 [btrfs] [ 199.557559] btrfs_dirty_inode+0x9c/0x140 [btrfs] [ 199.558339] btrfs_update_time+0x8c/0xb0 [btrfs] [ 199.559123] touch_atime+0x16c/0x1e0 [ 199.559151] pipe_read+0x6a8/0x7d0 [ 199.559179] vfs_read+0x466/0x498 [ 199.559204] ksys_read+0x108/0x150 [ 199.559230] __s390x_sys_read+0x68/0x88 [ 199.559257] do_syscall+0x1c6/0x210 [ 199.559286] __do_syscall+0xc8/0xf0 [ 199.559318] system_call+0x70/0x98 [ 199.559431] write to 0x000000008801e308 of 8 bytes by task 2808 on cpu 0: [ 199.559464] record_root_in_trans+0x196/0x228 [btrfs] [ 199.560236] btrfs_record_root_in_trans+0xfe/0x128 [btrfs] [ 199.561097] start_transaction+0x154/0xcd8 [btrfs] [ 199.561927] btrfs_join_transaction+0x44/0x60 [btrfs] [ 199.562700] btrfs_dirty_inode+0x9c/0x140 [btrfs] [ 199.563493] btrfs_update_time+0x8c/0xb0 [btrfs] [ 199.564277] file_update_time+0xb8/0xf0 [ 199.564301] pipe_write+0x8ac/0xab8 [ 199.564326] vfs_write+0x33c/0x588 [ 199.564349] ksys_write+0x108/0x150 [ 199.564372] __s390x_sys_write+0x68/0x88 [ 199.564397] do_syscall+0x1c6/0x210 [ 199.564424] __do_syscall+0xc8/0xf0 [ 199.564452] system_call+0x70/0x98 This is because we update and read last_trans concurrently without any type of synchronization. This should be generally harmless and in the worst case it can make us do extra locking (btrfs_record_root_in_trans()) trigger some warnings at ctree.c or do extra work during relocation - this would probably only happen in case of load or store tearing. So fix this by always reading and updating the field using READ_ONCE() and WRITE_ONCE(), this silences KCSAN and prevents load and store tearing. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: switch btrfs_ordered_extent::inode to struct btrfs_inodeDavid Sterba2024-07-111-1/+1
| | | | | | | | | | | | | | | | The structure is internal so we should use struct btrfs_inode for that, allowing to remove some use of BTRFS_I. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: remove super block argument from btrfs_iget()Filipe Manana2024-07-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's pointless to pass a super block argument to btrfs_iget() because we always pass a root and from it we can get the super block through: root->fs_info->sb So remove the super block argument. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: pass reloc_control to setup_relocation_extent_mapping()Johannes Thumshirn2024-07-111-8/+10
| | | | | | | | | | | | | | | | | | | | | | All parameters passed into setup_relocation_extent_mapping() can be derived from 'struct reloc_control', so only pass in a 'struct reloc_control'. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: pass a struct reloc_control to prealloc_file_extent_cluster()Johannes Thumshirn2024-07-111-4/+4
| | | | | | | | | | | | | | | | | | | | Pass a 'struct reloc_control' to prealloc_file_extent_cluster() instead of passing its members 'data_inode' and 'cluster' on their own. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: don't pass fs_info to describe_relocation()Johannes Thumshirn2024-07-111-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | In describe_relocation() the fs_info is only needed for printing information via btrfs_info() and can easily be accessed via the passed in 'struct btrfs_block_group'. So we can safely remove the fs_info parameter. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: pass a reloc_control to relocate_one_folio()Johannes Thumshirn2024-07-111-3/+5
| | | | | | | | | | | | | | | | | | | | Pass a struct reloc_control to relocate_one_folio, instead of passing it's members data_inode and cluster as separate arguments to the function. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: pass a reloc_control to relocate_file_extent_cluster()Johannes Thumshirn2024-07-111-7/+7
| | | | | | | | | | | | | | | | | | | | Instead of passing in a reloc_control's data_inode and file_extent_cluster members, pass in the whole reloc_control structure. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: pass reloc_control to relocate_data_extent()Johannes Thumshirn2024-07-111-5/+5
| | | | | | | | | | | | | | | | | | | | Pass a 'struct reloc_control' to relocate_data_extent() instead of it's data_inode and file_extent_cluster separately. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: rename err to ret in btrfs_recover_relocation()Anand Jain2024-07-111-28/+28
| | | | | | | | | | | | | | | | | | Fix coding style: rename the return variable to 'ret' in the function btrfs_recover_relocation instead of 'err'. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: rename ret to ret2 in btrfs_recover_relocation()Anand Jain2024-07-111-4/+4
| | | | | | | | | | | | | | | | | | A preparatory patch to rename 'err' to 'ret', but ret is already used as an intermediary return value, so first rename 'ret' to 'ret2'. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: rename ret to err in btrfs_recover_relocation()Anand Jain2024-07-111-17/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the function btrfs_recover_relocation(), currently the variable 'err' carries the return value and 'ret' holds the intermediary return value. However, in some lines, we don't need this two-step approach; we can directly use 'err'. So, optimize them, which requires reinitializing 'err' to zero at two locations. This is a preparatory patch to fix the code style by renaming 'err' to 'ret'. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: remove extent_map::block_start memberQu Wenruo2024-07-111-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The member extent_map::block_start can be calculated from extent_map::disk_bytenr + extent_map::offset for regular extents. And otherwise just extent_map::disk_bytenr. And this is already validated by the validate_extent_map(). Now we can remove the member. However there is a special case in btrfs_create_dio_extent() where we for NOCOW/PREALLOC ordered extents cannot directly use the resulting btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them yet. So for that call site, we pass file_extent->disk_bytenr + file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for offset. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: remove extent_map::block_len memberQu Wenruo2024-07-111-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The extent_map::block_len is either extent_map::len (non-compressed extent) or extent_map::disk_num_bytes (compressed extent). Since we already have sanity checks to do the cross-checks between the new and old members, we can drop the old extent_map::block_len now. For most call sites, they can manually select extent_map::len or extent_map::disk_num_bytes, since most if not all of them have checked if the extent is compressed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: remove extent_map::orig_start memberQu Wenruo2024-07-111-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since we have extent_map::offset, the old extent_map::orig_start is just extent_map::start - extent_map::offset for non-hole/inline extents. And since the new extent_map::offset is already verified by validate_extent_map() while the old orig_start is not, let's just remove the old member from all call sites. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: introduce extra sanity checks for extent mapsQu Wenruo2024-07-111-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since extent_map structure has the all the needed members to represent a file extent directly, we can apply all the file extent sanity checks to an extent map. The new sanity checks will cross check both the old members (block_start/block_len/orig_start) and the new members (disk_bytenr/disk_num_bytes/offset). There is a special case for offset/orig_start/start cross check, we only do such sanity check for compressed extent, as only compressed read/encoded write really utilize orig_start. This can be proved by the cleanup patch of orig_start. The checks happens at the following times: - add_extent_mapping() This is for newly added extent map - replace_extent_mapping() This is for btrfs_drop_extent_map_range() and split_extent_map() - try_merge_map() For a lot of call sites we have to properly populate all the members to pass the sanity check, meanwhile the following code needs extra modification: - setup_file_extents() from inode-tests The file extents layout of setup_file_extents() is already too invalid that tree-checker would reject most of them in real world. However there is just a special unaligned regular extent which has mismatched disk_num_bytes (4096) and ram_bytes (4096 - 1). So instead of dropping the whole test case, here we just unify disk_num_bytes and ram_bytes to 4096 - 1. - test_case_7() from extent-map-tests An extent is inserted with 16K length, but on-disk extent size is only 4K. This means it must be a compressed extent, so set the compressed flag for it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: simplify range parameters of btrfs_wait_ordered_roots()David Sterba2024-07-111-3/+1
| | | | | | | | | | | | | | The range is specified only in two ways, we can simplify the case for the whole filesystem range as a NULL block group parameter. Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: pass a btrfs_inode to btrfs_wait_ordered_range()Filipe Manana2024-07-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Instead of passing a (VFS) inode pointer argument, pass a btrfs_inode instead, as this is generally what we do for internal APIs, making it more consistent with most of the code base. This will later allow to help to remove a lot of BTRFS_I() calls in btrfs_sync_file(). Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | btrfs: unify index_cnt and csum_bytes from struct btrfs_inodeFilipe Manana2024-07-111-6/+6
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The index_cnt field of struct btrfs_inode is used only for two purposes: 1) To store the index for the next entry added to a directory; 2) For the data relocation inode to track the logical start address of the block group currently being relocated. For the relocation case we use index_cnt because it's not used for anything else in the relocation use case - we could have used other fields that are not used by relocation such as defrag_bytes, last_unlink_trans or last_reflink_trans for example (among others). Since the csum_bytes field is not used for directories, do the following changes: 1) Put index_cnt and csum_bytes in a union, and index_cnt is only initialized when the inode is a directory. The csum_bytes is only accessed in IO paths for regular files, so we're fine here; 2) Use the defrag_bytes field for relocation, since the data relocation inode is never used for defrag purposes. And to make the naming better, alias it to reloc_block_group_start by using a union. This reduces the size of struct btrfs_inode by 8 bytes in a release kernel, from 1056 bytes down to 1048 bytes. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: handle errors in btrfs_reloc_clone_csums properlyJosef Bacik2024-05-071-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | In the cow path we will clone the reloc csums for relocated data extents, and if there's an error we already have an ordered extent and rely on the ordered extent finishing to clean everything up. There's a problem however, we don't mark the ordered extent with an error, we pretend like everything was just fine. If we were at the end of our range we won't actually bubble up this error anywhere, and we could end up inserting an extent that doesn't have csums where it should have them. Fix this by adding a helper to mark the ordered extent with an error, and then use this when we fail to lookup the csums in btrfs_reloc_clone_csums. Use this helper in the other place where we use the same pattern while we're here. This will prevent us from erroneously inserting the extent that doesn't have the required checksums. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: drop unused argument of calcu_metadata_size()Naohiro Aota2024-05-071-6/+5
| | | | | | | | | | | | | calcu_metadata_size() has a "reserve" argument, but the only caller always set it to "1". The other usage (reserve = 0) is dropped by a commit 0647bf564f1e ("Btrfs: improve forever loop when doing balance relocation"), which is more than 10 years ago. Drop the argument and simplify the code. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: reuse ret instead of err in relocate_tree_blocks()Anand Jain2024-05-071-11/+8
| | | | | | | | | Coding style fixes the function relocate_tree_blocks(). After the fix, ret is the return value variable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rename err and ret to ret in build_backref_tree()Anand Jain2024-05-071-11/+7
| | | | | | | | | Code style fix in the function build_backref_tree(). Drop the ret initialization 0, as we don't need it. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: change root->root_key.objectid to btrfs_root_id()Josef Bacik2024-05-071-33/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes where made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 | - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rename ->len to ->num_bytes in btrfs_refJosef Bacik2024-05-071-7/+7
| | | | | | | | | | We consistently use ->num_bytes everywhere through the delayed ref code, except in btrfs_ref. Rename btrfs_ref to match all the other code. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: move ref_root into btrfs_refJosef Bacik2024-05-071-13/+13
| | | | | | | | | | | | We have this in both btrfs_tree_ref and btrfs_data_ref, which is just wasting space and making the code more complicated. Move this into btrfs_ref proper and update all the call sites to do the assignment in btrfs_ref. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: do not use a function to initialize btrfs_refJosef Bacik2024-05-071-19/+39
| | | | | | | | | | | | | | | | | | | btrfs_ref currently has ->owning_root, and ->ref_root is shared between the tree ref and data ref, so in order to move that into btrfs_ref proper I would need to add another root parameter to the initialization function. This function has too many arguments, and adding another root will make it easy to make mistakes about which root goes where. Drop the generic ref init function and statically initialize the btrfs_ref in every usage. This makes the code easier to read because we can see what elements we're assigning, and will make the upcoming change moving the ref_root into the btrfs_ref more clear and less error prone than adding a new element to the initialization function. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: export find_next_inode() as btrfs_find_first_inode()Filipe Manana2024-05-071-80/+25
| | | | | | | | | | | | | Export the relocation private helper find_next_inode() to inode.c, as this same logic is also used at btrfs_prune_dentries() and will be used by an upcoming change that adds an extent map shrinker. The next patch will change btrfs_prune_dentries() to use this helper. Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: make NOCOW checks for existence of checksums in a range more efficientFilipe Manana2024-05-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Before deciding if we can do a NOCOW write into a range, one of the things we have to do is check if there are checksum items for that range. We do that through the btrfs_lookup_csums_list() function, which searches for checksums and adds them to a list supplied by the caller. But all we need is to check if there is any checksum, we don't need to look for all of them and collect them into a list, which requires more search time in the checksums tree, allocating memory for checksums items to add to the list, copy checksums from a leaf into those list items, then free that memory, etc. This is all unnecessary overhead, wasting mostly CPU time, and perhaps some occasional IO if we need to read from disk any extent buffers. So change btrfs_lookup_csums_list() to allow to return immediately in case it finds any checksum, without the need to add it to a list and read it from a leaf. This is accomplished by allowing a NULL list parameter and making the function return 1 if it found any checksum, 0 if it didn't found any, and a negative value in case of an error. The following test with fio was used to measure performance: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bssplit=4k/20:8k/20:16k/20:32k/20:64k/20 direct=1 numjobs=16 fallocate=posix time_based runtime=300 [file1] size=8G ioengine=io_uring iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test was run on a release kernel (Debian's default kernel config). The results before this patch: WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec The results after this patch: WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove search_commit parameter from btrfs_lookup_csums_list()Filipe Manana2024-05-071-1/+1
| | | | | | | | | | | All the callers of btrfs_lookup_csums_list() pass a value of 0 as the "search_commit" parameter. So remove it and make the function behave as to always search from the regular root. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: stop locking the source extent range during reflinkFilipe Manana2024-05-071-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nowadays before starting a reflink operation we do this: 1) Take the VFS lock of the inodes in exclusive mode (a rw semaphore); 2) Take the mmap lock of the inodes (struct btrfs_inode::i_mmap_lock); 3) Flush all delalloc in the source and target ranges; 4) Wait for all ordered extents in the source and target ranges to complete; 5) Lock the source and destination ranges in the inodes' io trees. In step 5 we lock the source range because: 1) We needed to serialize against mmap writes, but that is not needed anymore because nowadays we do that through the inode's i_mmap_lock (step 2). This happens since commit 8c99516a8cdd ("btrfs: exclude mmaps while doing remap"); 2) To serialize against a concurrent relocation and avoid generating a delayed ref for an extent that was just dropped by relocation, see commit d8b552424210 ("Btrfs: fix race between reflink/dedupe and relocation"). Locking the source range however blocks any concurrent reads for that range and makes test case generic/733 fail. So instead of locking the source range during reflinks, make relocation read lock the inode's i_mmap_lock, so that it serializes with a concurrent reflink while still able to run concurrently with mmap writes and allow concurrent reads too. Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: convert relocate_one_page() to folios and renameGoldwyn Rodrigues2024-05-071-46/+47
| | | | | | | | | | | | | Convert page references to folios and call the respective folio functions. Since find_or_create_page() takes a mask argument, call __filemap_get_folio() instead of filemap_grab_folio(). The patch assumes folio size is PAGE_SIZE, add a warning in case it's a higher order that will be implemented in the future. Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: page to folio conversion: prealloc_file_extent_cluster()Goldwyn Rodrigues2024-05-071-6/+6
| | | | | | | | | Convert usage of page to folio in prealloc_file_extent_cluster() Reviewed-by: Boris Burkov <boris@bur.io> Signed-off-by: Goldwyn Rodrigues <rgoldwyn@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: rename err to ret in create_reloc_inode()Anand Jain2024-05-071-9/+9
| | | | | | | | Unify naming of return value to the preferred way. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: open code btrfs_backref_iter_free()David Sterba2024-03-041-1/+2
| | | | | | | The helper is trivial and used only once, open code it. It's safe to remove the 'if', the pointer is validated in build_backref_tree(). Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: add helper to get fs_info from struct inode pointerDavid Sterba2024-03-041-1/+1
| | | | | | | | | | | | Add a convenience helper to get a fs_info from a VFS inode pointer instead of open coding the chain or using btrfs_sb() that in some cases does one more pointer hop. This is implemented as a macro (still with type checking) so we don't need full definitions of struct btrfs_inode, btrfs_root or btrfs_fs_info. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: migrate subpage code to folio interfacesQu Wenruo2023-12-151-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Although subpage itself is conflicting with higher folio, since subpage (sectorsize < PAGE_SIZE and nodesize < PAGE_SIZE) means we will never need higher order folio, there is a hidden pitfall: - btrfs_page_*() helpers Those helpers are an abstraction to handle both subpage and non-subpage cases, which means we're going to pass pages pointers to those helpers. And since those helpers are shared between data and metadata paths, it's unavoidable to let them to handle folios, including higher order folios). Meanwhile for true subpage case, we should only have a single page backed folios anyway, thus add a new ASSERT() for btrfs_subpage_assert() to ensure that. Also since those helpers are shared between both data and metadata, add some extra ASSERT()s for data path to make sure we only get single page backed folio for now. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: use the flags of an extent map to identify the compression typeFilipe Manana2023-12-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Merge tag 'for-6.7-tag' of ↵Linus Torvalds2023-10-301-81/+127
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "New features: - raid-stripe-tree New tree for logical file extent mapping where the physical mapping may not match on multiple devices. This is now used in zoned mode to implement RAID0/RAID1* profiles, but can be used in non-zoned mode as well. The support for RAID56 is in development and will eventually fix the problems with the current implementation. This is a backward incompatible feature and has to be enabled at mkfs time. - simple quota accounting (squota) A simplified mode of qgroup that accounts all space on the initial extent owners (a subvolume), the snapshots are then cheap to create and delete. The deletion of snapshots in fully accounting qgroups is a known CPU/IO performance bottleneck. The squota is not suitable for the general use case but works well for containers where the original subvolume exists for the whole time. This is a backward incompatible feature as it needs extending some structures, but can be enabled on an existing filesystem. - temporary filesystem fsid (temp_fsid) The fsid identifies a filesystem and is hard coded in the structures, which disallows mounting the same fsid found on different devices. For a single device filesystem this is not strictly necessary, a new temporary fsid can be generated on mount e.g. after a device is cloned. This will be used by Steam Deck for root partition A/B testing, or can be used for VM root images. Other user visible changes: - filesystems with partially finished metadata_uuid conversion cannot be mounted anymore and the uuid fixup has to be done by btrfs-progs (btrfstune). Performance improvements: - reduce reservations for checksum deletions (with enabled free space tree by factor of 4), on a sample workload on file with many extents the deletion time decreased by 12% - make extent state merges more efficient during insertions, reduce rb-tree iterations (run time of critical functions reduced by 5%) Core changes: - the integrity check functionality has been removed, this was a debugging feature and removal does not affect other integrity checks like checksums or tree-checker - space reservation changes: - more efficient delayed ref reservations, this avoids building up too much work or overusing or exhausting the global block reserve in some situations - move delayed refs reservation to the transaction start time, this prevents some ENOSPC corner cases related to exhaustion of global reserve - improvements in reducing excessive reservations for block group items - adjust overcommit logic in near full situations, account for one more chunk to eventually allocate metadata chunk, this is mostly relevant for small filesystems (<10GiB) - single device filesystems are scanned but not registered (except seed devices), this allows temp_fsid to work - qgroup iterations do not need GFP_ATOMIC allocations anymore - cleanups, refactoring, reduced data structure size, function parameter simplifications, error handling fixes" * tag 'for-6.7-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (156 commits) btrfs: open code timespec64 in struct btrfs_inode btrfs: remove redundant log root tree index assignment during log sync btrfs: remove redundant initialization of variable dirty in btrfs_update_time() btrfs: sysfs: show temp_fsid feature btrfs: disable the device add feature for temp-fsid btrfs: disable the seed feature for temp-fsid btrfs: update comment for temp-fsid, fsid, and metadata_uuid btrfs: remove pointless empty log context list check when syncing log btrfs: update comment for struct btrfs_inode::lock btrfs: remove pointless barrier from btrfs_sync_file() btrfs: add and use helpers for reading and writing last_trans_committed btrfs: add and use helpers for reading and writing fs_info->generation btrfs: add and use helpers for reading and writing log_transid btrfs: add and use helpers for reading and writing last_log_commit btrfs: support cloned-device mount capability btrfs: add helper function find_fsid_by_disk btrfs: stop reserving excessive space for block group item insertions btrfs: stop reserving excessive space for block group item updates btrfs: reorder btrfs_inode to fill gaps btrfs: open code btrfs_ordered_inode_tree in btrfs_inode ...
| * btrfs: change test_range_bit to scan the whole rangeDavid Sterba2023-10-121-1/+1
| | | | | | | | | | | | | | | | | | The semantics of test_range_bit() with filled == 0 is now in it's own helper so test_range_bit will check the whole range unconditionally. The detection logic is flipped and assumes success by default and catches exceptions. Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: relocation: constify parameters where possibleDavid Sterba2023-10-121-28/+28
| | | | | | | | | | | | | | | | | | Lots of the functions in relocation.c don't change pointer parameters but lack the annotations. Add them and reformat according to current coding style if needed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: relocation: return bool from btrfs_should_ignore_reloc_rootDavid Sterba2023-10-121-10/+9
| | | | | | | | | | | | | | | | | | btrfs_should_ignore_reloc_root() is a predicate so it should return bool. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: switch btrfs_backref_cache::is_reloc to boolDavid Sterba2023-10-121-1/+1
| | | | | | | | | | | | | | | | | | The btrfs_backref_cache::is_reloc is an indicator variable and should use a bool type. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>