summaryrefslogtreecommitdiffstats
path: root/fs/xfs (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2023-09-2327-240/+441
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Chandan Babu: - Fix an integer overflow bug when processing an fsmap call - Fix crash due to CPU hot remove event racing with filesystem mount operation - During read-only mount, XFS does not allow the contents of the log to be recovered when there are one or more unrecognized rcompat features in the primary superblock, since the log might have intent items which the kernel does not know how to process - During recovery of log intent items, XFS now reserves log space sufficient for one cycle of a permanent transaction to execute. Otherwise, this could lead to livelocks due to non-availability of log space - On an fs which has an ondisk unlinked inode list, trying to delete a file or allocating an O_TMPFILE file can cause the fs to the shutdown if the first inode in the ondisk inode list is not present in the inode cache. The bug is solved by explicitly loading the first inode in the ondisk unlinked inode list into the inode cache if it is not already cached A similar problem arises when the uncached inode is present in the middle of the ondisk unlinked inode list. This second bug is triggered when executing operations like quotacheck and bulkstat. In this case, XFS now reads in the entire ondisk unlinked inode list - Enable LARP mode only on recent v5 filesystems - Fix a out of bounds memory access in scrub - Fix a performance bug when locating the tail of the log during mounting a filesystem * tag 'xfs-6.6-fixes-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: use roundup_pow_of_two instead of ffs during xlog_find_tail xfs: only call xchk_stats_merge after validating scrub inputs xfs: require a relatively recent V5 filesystem for LARP mode xfs: make inode unlinked bucket recovery work with quotacheck xfs: load uncached unlinked inodes into memory on demand xfs: reserve less log space when recovering log intent items xfs: fix log recovery when unknown rocompat bits are set xfs: reload entire unlinked bucket lists xfs: allow inode inactivation during a ro mount log recovery xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list xfs: remove CPU hotplug infrastructure xfs: remove the all-mounts list xfs: use per-mount cpumask to track nonempty percpu inodegc lists xfs: fix an agbno overflow in __xfs_getfsmap_datadev xfs: fix per-cpu CIL structure aggregation racing with dying cpus xfs: fix select in config XFS_ONLINE_SCRUB_STATS
| * xfs: use roundup_pow_of_two instead of ffs during xlog_find_tailWang Jianchao2023-09-131-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In our production environment, we find that mounting a 500M /boot which is umount cleanly needs ~6s. One cause is that ffs() is used by xlog_write_log_records() to decide the buffer size. It can cause a lot of small IO easily when xlog_clear_stale_blocks() needs to wrap around the end of log area and log head block is not power of two. Things are similar in xlog_find_verify_cycle(). The code is able to handed bigger buffer very well, we can use roundup_pow_of_two() to replace ffs() directly to avoid small and sychronous IOs. Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Wang Jianchao <wangjc136@midea.com> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
| * Merge tag 'fix-scrub-6.6_2023-09-12' of ↵Chandan Babu R2023-09-132-3/+6
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA xfs: fix out of bounds memory access in scrub This is a quick fix for a few internal syzbot reports concerning an invalid memory access in the scrub code. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'fix-scrub-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: only call xchk_stats_merge after validating scrub inputs
| | * xfs: only call xchk_stats_merge after validating scrub inputsDarrick J. Wong2023-09-122-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Harshit Mogalapalli slogged through several reports from our internal syzbot instance and observed that they all had a common stack trace: BUG: KASAN: user-memory-access in instrument_atomic_read_write include/linux/instrumented.h:96 [inline] BUG: KASAN: user-memory-access in atomic_try_cmpxchg_acquire include/linux/atomic/atomic-instrumented.h:1294 [inline] BUG: KASAN: user-memory-access in queued_spin_lock include/asm-generic/qspinlock.h:111 [inline] BUG: KASAN: user-memory-access in do_raw_spin_lock include/linux/spinlock.h:187 [inline] BUG: KASAN: user-memory-access in __raw_spin_lock include/linux/spinlock_api_smp.h:134 [inline] BUG: KASAN: user-memory-access in _raw_spin_lock+0x76/0xe0 kernel/locking/spinlock.c:154 Write of size 4 at addr 0000001dd87ee280 by task syz-executor365/1543 CPU: 2 PID: 1543 Comm: syz-executor365 Not tainted 6.5.0-syzk #1 Hardware name: Red Hat KVM, BIOS 1.13.0-2.module+el8.3.0+7860+a7792d29 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0x83/0xb0 lib/dump_stack.c:106 print_report+0x3f8/0x620 mm/kasan/report.c:478 kasan_report+0xb0/0xe0 mm/kasan/report.c:588 check_region_inline mm/kasan/generic.c:181 [inline] kasan_check_range+0x139/0x1e0 mm/kasan/generic.c:187 instrument_atomic_read_write include/linux/instrumented.h:96 [inline] atomic_try_cmpxchg_acquire include/linux/atomic/atomic-instrumented.h:1294 [inline] queued_spin_lock include/asm-generic/qspinlock.h:111 [inline] do_raw_spin_lock include/linux/spinlock.h:187 [inline] __raw_spin_lock include/linux/spinlock_api_smp.h:134 [inline] _raw_spin_lock+0x76/0xe0 kernel/locking/spinlock.c:154 spin_lock include/linux/spinlock.h:351 [inline] xchk_stats_merge_one.isra.1+0x39/0x650 fs/xfs/scrub/stats.c:191 xchk_stats_merge+0x5f/0xe0 fs/xfs/scrub/stats.c:225 xfs_scrub_metadata+0x252/0x14e0 fs/xfs/scrub/scrub.c:599 xfs_ioc_scrub_metadata+0xc8/0x160 fs/xfs/xfs_ioctl.c:1646 xfs_file_ioctl+0x3fd/0x1870 fs/xfs/xfs_ioctl.c:1955 vfs_ioctl fs/ioctl.c:51 [inline] __do_sys_ioctl fs/ioctl.c:871 [inline] __se_sys_ioctl fs/ioctl.c:857 [inline] __x64_sys_ioctl+0x199/0x220 fs/ioctl.c:857 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x3e/0x90 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x6e/0xd8 RIP: 0033:0x7ff155af753d Code: 00 c3 66 2e 0f 1f 84 00 00 00 00 00 90 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d 1b 79 2c 00 f7 d8 64 89 01 48 RSP: 002b:00007ffc006e2568 EFLAGS: 00000246 ORIG_RAX: 0000000000000010 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007ff155af753d RDX: 00000000200000c0 RSI: 00000000c040583c RDI: 0000000000000003 RBP: 00000000ffffffff R08: 00000000004010c0 R09: 00000000004010c0 R10: 00000000004010c0 R11: 0000000000000246 R12: 0000000000400cb0 R13: 00007ffc006e2670 R14: 0000000000000000 R15: 0000000000000000 </TASK> The root cause here is that xchk_stats_merge_one walks off the end of the xchk_scrub_stats.cs_stats array because it has been fed a garbage value in sm->sm_type. That occurs because I put the xchk_stats_merge in the wrong place -- it should have been after the last xchk_teardown call on our way out of xfs_scrub_metadata because we only call the teardown function if we called the setup function, and we don't call the setup functions if the inputs are obviously garbage. Thanks to Harshit for triaging the bug reports and bringing this to my attention. Fixes: d7a74cad8f45 ("xfs: track usage statistics of online fsck") Reported-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'fix-larp-requirements-6.6_2023-09-12' of ↵Chandan Babu R2023-09-131-0/+11
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA xfs: disallow LARP on old fses Before enabling logged xattrs, make sure the filesystem is new enough that it actually supports log incompat features. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'fix-larp-requirements-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: require a relatively recent V5 filesystem for LARP mode
| | * xfs: require a relatively recent V5 filesystem for LARP modeDarrick J. Wong2023-09-121-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While reviewing the FIEXCHANGE code in XFS, I realized that the function that enables logged xattrs doesn't actually check that the superblock has a LOG_INCOMPAT feature bit field. Add a check to refuse the operation if we don't have a V5 filesystem... ...but on second though, let's require either reflink or rmap so that we only have to deal with LARP mode on relatively /modern/ kernel. 4.14 is about as far back as I feel like going. Seeing as LARP is a debugging-only option anyway, this isn't likely to affect any real users. Fixes: d9c61ccb3b09 ("xfs: move xfs_attr_use_log_assist out of xfs_log.c") Really-Fixes: f3f36c893f26 ("xfs: Add xfs_attr_set_deferred and xfs_attr_remove_deferred") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Bill O'Donnell <bodonnel@redhat.com>
| * | Merge tag 'fix-iunlink-list-6.6_2023-09-12' of ↵Chandan Babu R2023-09-139-9/+195
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA xfs: reload entire iunlink lists This is the second part of correcting XFS to reload the incore unlinked inode list from the ondisk contents. Whereas part one tackled failures from regular filesystem calls, this part takes on the problem of needing to reload the entire incore unlinked inode list on account of somebody loading an inode that's in the /middle/ of an unlinked list. This happens during quotacheck, bulkstat, or even opening a file by handle. In this case we don't know the length of the list that we're reloading, so we don't want to create a new unbounded memory load while holding resources locked. Instead, we'll target UNTRUSTED iget calls to reload the entire bucket. Note that this changes the definition of the incore unlinked inode list slightly -- i_prev_unlinked == 0 now means "not on the incore list". Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'fix-iunlink-list-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: make inode unlinked bucket recovery work with quotacheck xfs: reload entire unlinked bucket lists xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked list
| | * xfs: make inode unlinked bucket recovery work with quotacheckDarrick J. Wong2023-09-125-6/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Teach quotacheck to reload the unlinked inode lists when walking the inode table. This requires extra state handling, since it's possible that a reloaded inode will get inactivated before quotacheck tries to scan it; in this case, we need to ensure that the reloaded inode does not have dquots attached when it is freed. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| | * xfs: reload entire unlinked bucket listsDarrick J. Wong2023-09-125-0/+144
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous patch to reload unrecovered unlinked inodes when adding a newly created inode to the unlinked list is missing a key piece of functionality. It doesn't handle the case that someone calls xfs_iget on an inode that is not the last item in the incore list. For example, if at mount time the ondisk iunlink bucket looks like this: AGI -> 7 -> 22 -> 3 -> NULL None of these three inodes are cached in memory. Now let's say that someone tries to open inode 3 by handle. We need to walk the list to make sure that inodes 7 and 22 get loaded cold, and that the i_prev_unlinked of inode 3 gets set to 22. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| | * xfs: use i_prev_unlinked to distinguish inodes that are not on the unlinked listDarrick J. Wong2023-09-123-3/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alter the definition of i_prev_unlinked slightly to make it more obvious when an inode with 0 link count is not part of the iunlink bucket lists rooted in the AGI. This distinction is necessary because it is not sufficient to check inode.i_nlink to decide if an inode is on the unlinked list. Updates to i_nlink can happen while holding only ILOCK_EXCL, but updates to an inode's position in the AGI unlinked list (which happen after the nlink update) requires both ILOCK_EXCL and the AGI buffer lock. The next few patches will make it possible to reload an entire unlinked bucket list when we're walking the inode table or performing handle operations and need more than the ability to iget the last inode in the chain. The upcoming directory repair code also needs to be able to make this distinction to decide if a zero link count directory should be moved to the orphanage or allowed to inactivate. An upcoming enhancement to the online AGI fsck code will need this distinction to check and rebuild the AGI unlinked buckets. Signed-off-by: Darrick J. Wong <djwong@kernel.org>
| * | Merge tag 'fix-iunlink-6.6_2023-09-12' of ↵Chandan Babu R2023-09-132-5/+100
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA xfs: reload the last iunlink item It turns out that there are some serious bugs in how xfs handles the unlinked inode lists. Way back before 4.14, there was a bug where a ro mount of a dirty filesystem would recover the log bug neglect to purge the unlinked list. This leads to clean unmounted filesystems with unlinked inodes. Starting around 5.15, we also converted the codebase to maintain a doubly-linked incore unlinked list. However, we never provided the ability to load the incore list from disk. If someone tries to allocate an O_TMPFILE file on a clean fs with a pre-existing unlinked list or even deletes a file, the code will fail and the fs shuts down. This first part of the correction effort adds the ability to load the first inode in the bucket when unlinking a file; and to load the next inode in the list when inactivating (freeing) an inode. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'fix-iunlink-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: load uncached unlinked inodes into memory on demand
| | * xfs: load uncached unlinked inodes into memory on demandDarrick J. Wong2023-09-122-5/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | shrikanth hegde reports that filesystems fail shortly after mount with the following failure: WARNING: CPU: 56 PID: 12450 at fs/xfs/xfs_inode.c:1839 xfs_iunlink_lookup+0x58/0x80 [xfs] This of course is the WARN_ON_ONCE in xfs_iunlink_lookup: ip = radix_tree_lookup(&pag->pag_ici_root, agino); if (WARN_ON_ONCE(!ip || !ip->i_ino)) { ... } From diagnostic data collected by the bug reporters, it would appear that we cleanly mounted a filesystem that contained unlinked inodes. Unlinked inodes are only processed as a final step of log recovery, which means that clean mounts do not process the unlinked list at all. Prior to the introduction of the incore unlinked lists, this wasn't a problem because the unlink code would (very expensively) traverse the entire ondisk metadata iunlink chain to keep things up to date. However, the incore unlinked list code complains when it realizes that it is out of sync with the ondisk metadata and shuts down the fs, which is bad. Ritesh proposed to solve this problem by unconditionally parsing the unlinked lists at mount time, but this imposes a mount time cost for every filesystem to catch something that should be very infrequent. Instead, let's target the places where we can encounter a next_unlinked pointer that refers to an inode that is not in cache, and load it into cache. Note: This patch does not address the problem of iget loading an inode from the middle of the iunlink list and needing to set i_prev_unlinked correctly. Reported-by: shrikanth hegde <sshegde@linux.vnet.ibm.com> Triaged-by: Ritesh Harjani <ritesh.list@gmail.com> Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'fix-efi-recovery-6.6_2023-09-12' of ↵Chandan Babu R2023-09-136-9/+40
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA xfs: fix EFI recovery livelocks This series fixes a customer-reported transaction reservation bug introduced ten years ago that could result in livelocks during log recovery. Log intent item recovery single-steps each step of a deferred op chain, which means that each step only needs to allocate one transaction's worth of space in the log, not an entire chain all at once. This single-stepping is critical to unpinning the log tail since there's nobody else to do it for us. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'fix-efi-recovery-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: reserve less log space when recovering log intent items
| | * xfs: reserve less log space when recovering log intent itemsDarrick J. Wong2023-09-126-9/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Wengang Wang reports that a customer's system was running a number of truncate operations on a filesystem with a very small log. Contention on the reserve heads lead to other threads stalling on smaller updates (e.g. mtime updates) long enough to result in the node being rebooted on account of the lack of responsivenes. The node failed to recover because log recovery of an EFI became stuck waiting for a grant of reserve space. From Wengang's report: "For the file deletion, log bytes are reserved basing on xfs_mount->tr_itruncate which is: tr_logres = 175488, tr_logcount = 2, tr_logflags = XFS_TRANS_PERM_LOG_RES, "You see it's a permanent log reservation with two log operations (two transactions in rolling mode). After calculation (xlog_calc_unit_res() adds space for various log headers), the final log space needed per transaction changes from 175488 to 180208 bytes. So the total log space needed is 360416 bytes (180208 * 2). [That quantity] of log space (360416 bytes) needs to be reserved for both run time inode removing (xfs_inactive_truncate()) and EFI recover (xfs_efi_item_recover())." In other words, runtime pre-reserves 360K of space in anticipation of running a chain of two transactions in which each transaction gets a 180K reservation. Now that we've allocated the transaction, we delete the bmap mapping, log an EFI to free the space, and roll the transaction as part of finishing the deferops chain. Rolling creates a new xfs_trans which shares its ticket with the old transaction. Next, xfs_trans_roll calls __xfs_trans_commit with regrant == true, which calls xlog_cil_commit with the same regrant parameter. xlog_cil_commit calls xfs_log_ticket_regrant, which decrements t_cnt and subtracts t_curr_res from the reservation and write heads. If the filesystem is fresh and the first transaction only used (say) 20K, then t_curr_res will be 160K, and we give that much reservation back to the reservation head. Or if the file is really fragmented and the first transaction actually uses 170K, then t_curr_res will be 10K, and that's what we give back to the reservation. Having done that, we're now headed into the second transaction with an EFI and 180K of reservation. Other threads apparently consumed all the reservation for smaller transactions, such as timestamp updates. Now let's say the first transaction gets written to disk and we crash without ever completing the second transaction. Now we remount the fs, log recovery finds the unfinished EFI, and calls xfs_efi_recover to finish the EFI. However, xfs_efi_recover starts a new tr_itruncate tranasction, which asks for 360K log reservation. This is a lot more than the 180K that we had reserved at the time of the crash. If the first EFI to be recovered is also pinning the tail of the log, we will be unable to free any space in the log, and recovery livelocks. Wengang confirmed this: "Now we have the second transaction which has 180208 log bytes reserved too. The second transaction is supposed to process intents including extent freeing. With my hacking patch, I blocked the extent freeing 5 hours. So in that 5 hours, 180208 (NOT 360416) log bytes are reserved. "With my test case, other transactions (update timestamps) then happen. As my hacking patch pins the journal tail, those timestamp-updating transactions finally use up (almost) all the left available log space (in memory in on disk). And finally the on disk (and in memory) available log space goes down near to 180208 bytes. Those 180208 bytes are reserved by [the] second (extent-free) transaction [in the chain]." Wengang and I noticed that EFI recovery starts a transaction, completes one step of the chain, and commits the transaction without completing any other steps of the chain. Those subsequent steps are completed by xlog_finish_defer_ops, which allocates yet another transaction to finish the rest of the chain. That transaction gets the same tr_logres as the head transaction, but with tr_logcount = 1 to force regranting with every roll to avoid livelocks. In other words, we already figured this out in commit 929b92f64048d ("xfs: xfs_defer_capture should absorb remaining transaction reservation"), but should have applied that logic to each intent item's recovery function. For Wengang's case, the xfs_trans_alloc call in the EFI recovery function should only be asking for a single transaction's worth of log reservation -- 180K, not 360K. Quoting Wengang again: "With log recovery, during EFI recovery, we use tr_itruncate again to reserve two transactions that needs 360416 log bytes. Reserving 360416 bytes fails [stalls] because we now only have about 180208 available. "Actually during the EFI recover, we only need one transaction to free the extents just like the 2nd transaction at RUNTIME. So it only needs to reserve 180208 rather than 360416 bytes. We have (a bit) more than 180208 available log bytes on disk, so [if we decrease the reservation to 180K] the reservation goes and the recovery [finishes]. That is to say: we can fix the log recover part to fix the issue. We can introduce a new xfs_trans_res xfs_mount->tr_ext_free { tr_logres = 175488, tr_logcount = 0, tr_logflags = 0, } "and use tr_ext_free instead of tr_itruncate in EFI recover." However, I don't think it quite makes sense to create an entirely new transaction reservation type to handle single-stepping during log recovery. Instead, we should copy the transaction reservation information in the xfs_mount, change tr_logcount to 1, and pass that into xfs_trans_alloc. We know this won't risk changing the min log size computation since we always ask for a fraction of the reservation for all known transaction types. This looks like it's been lurking in the codebase since commit 3d3c8b5222b92, which changed the xfs_trans_reserve call in xlog_recover_process_efi to use the tr_logcount in tr_itruncate. That changed the EFI recovery transaction from making a non-XFS_TRANS_PERM_LOG_RES request for one transaction's worth of log space to a XFS_TRANS_PERM_LOG_RES request for two transactions worth. Fixes: 3d3c8b5222b92 ("xfs: refactor xfs_trans_reserve() interface") Complements: 929b92f64048d ("xfs: xfs_defer_capture should absorb remaining transaction reservation") Suggested-by: Wengang Wang <wen.gang.wang@oracle.com> Cc: Srikanth C S <srikanth.c.s@oracle.com> [djwong: apply the same transformation to all log intent recovery] Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'fix-ro-mounts-6.6_2023-09-12' of ↵Chandan Babu R2023-09-133-22/+12
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA xfs: fix ro mounting with unknown rocompat features Dave pointed out some failures in xfs/270 when he upgraded Debian unstable and util-linux started using the new mount apis. Upon further inquiry I noticed that XFS is quite a hot mess when it encounters a filesystem with unrecognized rocompat bits set in the superblock. Whereas we used to allow readonly mounts under these conditions, a change to the sb write verifier several years ago resulted in the filesystem going down immediately because the post-mount log cleaning writes the superblock, which trips the sb write verifier on the unrecognized rocompat bit. I made the observation that the ROCOMPAT features RMAPBT and REFLINK both protect new log intent item types, which means that we actually cannot support recovering the log if we don't recognize all the rocompat bits. Therefore -- fix inode inactivation to work when we're recovering the log, disallow recovery when there's unrecognized rocompat bits, and don't clean the log if doing so would trip the rocompat checks. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'fix-ro-mounts-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: fix log recovery when unknown rocompat bits are set xfs: allow inode inactivation during a ro mount log recovery
| | * xfs: fix log recovery when unknown rocompat bits are setDarrick J. Wong2023-09-122-18/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Log recovery has always run on read only mounts, even where the primary superblock advertises unknown rocompat bits. Due to a misunderstanding between Eric and Darrick back in 2018, we accidentally changed the superblock write verifier to shutdown the fs over that exact scenario. As a result, the log cleaning that occurs at the end of the mounting process fails if there are unknown rocompat bits set. As we now allow writing of the superblock if there are unknown rocompat bits set on a RO mount, we no longer want to turn off RO state to allow log recovery to succeed on a RO mount. Hence we also remove all the (now unnecessary) RO state toggling from the log recovery path. Fixes: 9e037cb7972f ("xfs: check for unknown v5 feature bits in superblock write verifier" Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: allow inode inactivation during a ro mount log recoveryDarrick J. Wong2023-09-121-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the next patch, we're going to prohibit log recovery if the primary superblock contains an unrecognized rocompat feature bit even on readonly mounts. This requires removing all the code in the log mounting process that temporarily disables the readonly state. Unfortunately, inode inactivation disables itself on readonly mounts. Clearing the iunlinked lists after log recovery needs inactivation to run to free the unreferenced inodes, which (AFAICT) is the only reason why log mounting plays games with the readonly state in the first place. Therefore, change the inactivation predicates to allow inactivation during log recovery of a readonly mount. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'fix-percpu-lists-6.6_2023-09-12' of ↵Chandan Babu R2023-09-136-182/+56
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA xfs: fix cpu hotplug mess Ritesh and Eric separately reported crashes in XFS's hook function for CPU hot remove if the remove event races with a filesystem being mounted. I also noticed via generic/650 that once in a while the log will shut down over an apparent overrun of a transaction reservation; this turned out to be due to CIL percpu list aggregation failing to pick up the percpu list items from a dying CPU. Either way, the solution here is to eliminate the need for a CPU dying hook by using a private cpumask to track which CPUs have added to their percpu lists directly, and iterating with that mask. This fixes the log problems and (I think) solves a theoretical UAF bug in the inodegc code too. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'fix-percpu-lists-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: remove CPU hotplug infrastructure xfs: remove the all-mounts list xfs: use per-mount cpumask to track nonempty percpu inodegc lists xfs: fix per-cpu CIL structure aggregation racing with dying cpus
| | * xfs: remove CPU hotplug infrastructureDarrick J. Wong2023-09-111-41/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | There are no users of the cpu hotplug hooks in xfs now, so remove it. This reverts f1653c2e2831e ("xfs: introduce CPU hotplug infrastructure"). Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: remove the all-mounts listDarrick J. Wong2023-09-112-40/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | Revert commit 0ed17f01c8540 ("xfs: introduce all-mounts list for cpu hotplug notifications") because the cpu hotplug hooks are now pointless, so we don't need this list anymore. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: use per-mount cpumask to track nonempty percpu inodegc listsDarrick J. Wong2023-09-114-56/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Directly track which CPUs have contributed to the inodegc percpu lists instead of trusting the cpu online mask. This eliminates a theoretical problem where the inodegc flush functions might fail to flush a CPU's inodes if that CPU happened to be dying at exactly the same time. Most likely nobody's noticed this because the CPU dead hook moves the percpu inodegc list to another CPU and schedules that worker immediately. But it's quite possible that this is a subtle race leading to UAF if the inodegc flush were part of an unmount. Further benefits: This reduces the overhead of the inodegc flush code slightly by allowing us to ignore CPUs that have empty lists. Better yet, it reduces our dependence on the cpu online masks, which have been the cause of confusion and drama lately. Fixes: ab23a7768739 ("xfs: per-cpu deferred inode inactivation queues") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| | * xfs: fix per-cpu CIL structure aggregation racing with dying cpusDarrick J. Wong2023-09-113-45/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 7c8ade2121200 ("xfs: implement percpu cil space used calculation"), the XFS committed (log) item list code was converted to use per-cpu lists and space tracking to reduce cpu contention when multiple threads are modifying different parts of the filesystem and hence end up contending on the log structures during transaction commit. Each CPU tracks its own commit items and space usage, and these do not have to be merged into the main CIL until either someone wants to push the CIL items, or we run over a soft threshold and switch to slower (but more accurate) accounting with atomics. Unfortunately, the for_each_cpu iteration suffers from the same race with cpu dying problem that was identified in commit 8b57b11cca88f ("pcpcntrs: fix dying cpu summation race") -- CPUs are removed from cpu_online_mask before the CPUHP_XFS_DEAD callback gets called. As a result, both CIL percpu structure aggregation functions fail to collect the items and accounted space usage at the correct point in time. If we're lucky, the items that are collected from the online cpus exceed the space given to those cpus, and the log immediately shuts down in xlog_cil_insert_items due to the (apparent) log reservation overrun. This happens periodically with generic/650, which exercises cpu hotplug vs. the filesystem code: smpboot: CPU 3 is now offline XFS (sda3): ctx ticket reservation ran out. Need to up reservation XFS (sda3): ticket reservation summary: XFS (sda3): unit res = 9268 bytes XFS (sda3): current res = -40 bytes XFS (sda3): original count = 1 XFS (sda3): remaining count = 1 XFS (sda3): Filesystem has been shut down due to log error (0x2). Applying the same sort of fix from 8b57b11cca88f to the CIL code seems to make the generic/650 problem go away, but I've been told that tglx was not happy when he saw: "...the only thing we actually need to care about is that percpu_counter_sum() iterates dying CPUs. That's trivial to do, and when there are no CPUs dying, it has no addition overhead except for a cpumask_or() operation." The CPU hotplug code is rather complex and difficult to understand and I don't want to try to understand the cpu hotplug locking well enough to use cpu_dying mask. Furthermore, there's a performance improvement that could be had here. Attach a private cpu mask to the CIL structure so that we can track exactly which cpus have accessed the percpu data at all. It doesn't matter if the cpu has since gone offline; log item aggregation will still find the items. Better yet, we skip cpus that have not recently logged anything. Worse yet, Ritesh Harjani and Eric Sandeen both reported today that CPU hot remove racing with an xfs mount can crash if the cpu_dead notifier tries to access the log but the mount hasn't yet set up the log. Link: https://lore.kernel.org/linux-xfs/ZOLzgBOuyWHapOyZ@dread.disaster.area/T/ Link: https://lore.kernel.org/lkml/877cuj1mt1.ffs@tglx/ Link: https://lore.kernel.org/lkml/20230414162755.281993820@linutronix.de/ Link: https://lore.kernel.org/linux-xfs/ZOVkjxWZq0YmjrJu@dread.disaster.area/T/ Cc: tglx@linutronix.de Cc: peterz@infradead.org Reported-by: ritesh.list@gmail.com Reported-by: sandeen@sandeen.net Fixes: af1c2146a50b ("xfs: introduce per-cpu CIL tracking structure") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | Merge tag 'fix-fsmap-6.6_2023-09-12' of ↵Chandan Babu R2023-09-131-7/+18
| |\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux into xfs-6.6-fixesA xfs: fix fsmap cursor handling This patchset addresses an integer overflow bug that Dave Chinner found in how fsmap handles figuring out where in the record set we left off when userspace calls back after the first call filled up all the designated record space. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org> * tag 'fix-fsmap-6.6_2023-09-12' of https://git.kernel.org/pub/scm/linux/kernel/git/djwong/xfs-linux: xfs: fix an agbno overflow in __xfs_getfsmap_datadev
| | * xfs: fix an agbno overflow in __xfs_getfsmap_datadevDarrick J. Wong2023-09-111-7/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave Chinner reported that xfs/273 fails if the AG size happens to be an exact power of two. I traced this to an agbno integer overflow when the current GETFSMAP call is a continuation of a previous GETFSMAP call, and the last record returned was non-shareable space at the end of an AG. __xfs_getfsmap_datadev sets up a data device query by converting the incoming fmr_physical into an xfs_fsblock_t and cracking it into an agno and agbno pair. In the (failing) case of where fmr_blockcount of the low key is nonzero and the record was for a non-shareable extent, it will add fmr_blockcount to start_fsb and info->low.rm_startblock. If the low key was actually the last record for that AG, then this addition causes info->low.rm_startblock to point beyond EOAG. When the rmapbt range query starts, it'll return an empty set, and fsmap moves on to the next AG. Or so I thought. Remember how we added to start_fsb? If agsize < 1<<agblklog, start_fsb points to the same AG as the original fmr_physical from the low key. We run the rmapbt query, which returns nothing, so getfsmap zeroes info->low and moves on to the next AG. If agsize == 1<<agblklog, start_fsb now points to the next AG. We run the rmapbt query on the next AG with the excessively large rm_startblock. If this next AG is actually the last AG, we'll set info->high to EOFS (which is now has a lower rm_startblock than info->low), and the ranged btree query code will return -EINVAL. If it's not the last AG, we ignore all records for the intermediate AGs. Oops. Fix this by decoding start_fsb into agno and agbno only after making adjustments to start_fsb. This means that info->low.rm_startblock will always be set to a valid agbno, and we always start the rmapbt iteration in the correct AG. While we're at it, fix the predicate for determining if an fsmap record represents non-shareable space to include file data on pre-reflink filesystems. Reported-by: Dave Chinner <david@fromorbit.com> Fixes: 63ef7a35912dd ("xfs: fix interval filtering in multi-step fsmap queries") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * | xfs: fix select in config XFS_ONLINE_SCRUB_STATSLukas Bulwahn2023-09-111-1/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | Commit d7a74cad8f45 ("xfs: track usage statistics of online fsck") introduces config XFS_ONLINE_SCRUB_STATS, which selects the non-existing config FS_DEBUG. It is probably intended to select the existing config XFS_DEBUG. Fix the select in config XFS_ONLINE_SCRUB_STATS. Fixes: d7a74cad8f45 ("xfs: track usage statistics of online fsck") Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com> Reviewed-by: "Darrick J. Wong" <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandanbabu@kernel.org>
* / Revert "xfs: switch to multigrain timestamps"Christian Brauner2023-09-203-7/+7
|/ | | | | | | | | | | | | | This reverts commit e44df2664746aed8b6dd5245eb711a0ce33c5cf5. Users reported regressions due to enabling multi-grained timestamps unconditionally. As no clear consensus on a solution has come up and the discussion has gone back to the drawing board revert the infrastructure changes for. If it isn't code that's here to stay, make it go away. Message-ID: <20230920-keine-eile-c9755b5825db@brauner> Acked-by: Jan Kara <jack@suse.cz> Acked-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
* Merge tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2023-08-3042-647/+3910
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs updates from Chandan Babu: - Chandan Babu will be taking over as the XFS release manager. He has reviewed all the patches that are in this branch, though I'm signing the branch one last time since I'm still technically maintainer. :P - Create a maintainer entry profile for XFS in which we lay out the various roles that I have played for many years. Aside from release manager, the remaining roles are as yet unfilled. - Start merging online repair -- we now have in-memory pageable memory for staging btrees, a bunch of pending fixes, and we've started the process of refactoring the scrub support code to support more of repair. In particular, reaping of old blocks from damaged structures. - Scrub the realtime summary file. - Fix a bug where scrub's quota iteration only ever returned the root dquot. Oooops. - Fix some typos. [ Pull request from Chandan Babu, but signed tag and description from Darrick Wong, thus the first person singular above is Darrick, not Chandan ] * tag 'xfs-6.6-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: (37 commits) fs/xfs: Fix typos in comments xfs: fix dqiterate thinko xfs: don't check reflink iflag state when checking cow fork xfs: simplify returns in xchk_bmap xfs: rewrite xchk_inode_is_allocated to work properly xfs: hide xfs_inode_is_allocated in scrub common code xfs: fix agf_fllast when repairing an empty AGFL xfs: allow userspace to rebuild metadata structures xfs: clear pagf_agflreset when repairing the AGFL xfs: allow the user to cancel repairs before we start writing xfs: don't complain about unfixed metadata when repairs were injected xfs: implement online scrubbing of rtsummary info xfs: always rescan allegedly healthy per-ag metadata after repair xfs: move the realtime summary file scrubber to a separate source file xfs: wrap ilock/iunlock operations on sc->ip xfs: get our own reference to inodes that we want to scrub xfs: track usage statistics of online fsck xfs: improve xfarray quicksort pivot xfs: create scaffolding for creating debugfs entries xfs: cache pages used for xfarray quicksort convergence ...
| * fs/xfs: Fix typos in commentsZizhen Pang2023-08-181-1/+1
| | | | | | | | | | | | | | | | | | Delete duplicate word "the" [chandan: Fix mangled patch] Signed-off-by: Zizhen Pang <pangzizhen001@208suo.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
| * xfs: fix dqiterate thinkoDarrick J. Wong2023-08-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | For some unknown reason, when I converted the incore dquot objects to store the dquot id in host endian order, I removed the increment here. This causes the scan to stop after retrieving the root dquot, which severely limits the usefulness of the quota scrubber. Fix the lost increment, though it won't fix the problem that the quota iterator code filters out zeroed dquot records. Fixes: c51df7334167e ("xfs: stop using q_core.d_id in the quota code") Signed-off-by: "Darrick J. Wong" <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Chandan Babu R <chandan.babu@oracle.com>
| * xfs: don't check reflink iflag state when checking cow forkDarrick J. Wong2023-08-101-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Any inode on a reflink filesystem can have a cow fork, even if the inode does not have the reflink iflag set. This happens either because the inode once had the iflag set but does not now, because we don't free the incore cow fork until the icache deletes the inode; or because we're running in alwayscow mode. Either way, we can collapse both of the xfs_is_reflink_inode calls into one, and change it to xfs_has_reflink, now that the bmap checker will return ENOENT if there is no pointer to the incore fork. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: simplify returns in xchk_bmapDarrick J. Wong2023-08-101-13/+13
| | | | | | | | | | | | | | | | | | | | Remove the pointless goto and return code in xchk_bmap, since it only serves to obscure what's going on in the function. Instead, return whichever error code is appropriate there. For nonexistent forks, this should have been ENOENT. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: rewrite xchk_inode_is_allocated to work properlyDarrick J. Wong2023-08-104-24/+162
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Back in the mists of time[1], I proposed this function to assist the inode btree scrubbers in checking the inode btree contents against the allocation state of the inode records. The original version performed a direct lookup in the inode cache and returned the allocation status if the cached inode hadn't been reused and wasn't in an intermediate state. Brian thought it would be better to use the usual iget/irele mechanisms, so that was changed for the final version. Unfortunately, this hasn't aged well -- the IGET_INCORE flag only has one user and clutters up the regular iget path, which makes it hard to reason about how it actually works. Worse yet, the inode inactivation series silently broke it because iget won't return inodes that are anywhere in the inactivation machinery, even though the caller is already required to prevent inode allocation and freeing. Inodes in the inactivation machinery are still allocated, but the current code's interactions with the iget code prevent us from being able to say that. Now that I understand the inode lifecycle better than I did in early 2017, I now realize that as long as the cached inode hasn't been reused and isn't actively being reclaimed, it's safe to access the i_mode field (with the AGI, rcu, and i_flags locks held), and we don't need to worry about the inode being freed out from under us. Therefore, port the original version to modern code structure, which fixes the brokennes w.r.t. inactivation. In the next patch we'll remove IGET_INCORE since it's no longer necessary. [1] https://lore.kernel.org/linux-xfs/149643868294.23065.8094890990886436794.stgit@birch.djwong.org/ Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: hide xfs_inode_is_allocated in scrub common codeDarrick J. Wong2023-08-105-44/+40
| | | | | | | | | | | | | | | | | | | | This function is only used by online fsck, so let's move it there. In the next patch, we'll fix it to work properly and to require that the caller hold the AGI buffer locked. No major changes aside from adjusting the signature a bit. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: fix agf_fllast when repairing an empty AGFLDarrick J. Wong2023-08-101-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs/139 with parent pointers enabled occasionally pops up a corruption message when online fsck force-rebuild repairs an AGFL: XFS (sde): Metadata corruption detected at xfs_agf_verify+0x11e/0x220 [xfs], xfs_agf block 0x9e0001 XFS (sde): Unmount and run xfs_repair XFS (sde): First 128 bytes of corrupted metadata buffer: 00000000: 58 41 47 46 00 00 00 01 00 00 00 4f 00 00 40 00 XAGF.......O..@. 00000010: 00 00 00 01 00 00 00 02 00 00 00 05 00 00 00 01 ................ 00000020: 00 00 00 01 00 00 00 01 00 00 00 00 ff ff ff ff ................ 00000030: 00 00 00 00 00 00 00 05 00 00 00 05 00 00 00 00 ................ 00000040: 91 2e 6f b1 ed 61 4b 4d 8c 9b 6e 87 08 bb f6 36 ..o..aKM..n....6 00000050: 00 00 00 01 00 00 00 01 00 00 00 06 00 00 00 01 ................ 00000060: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ 00000070: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ The root cause of this failure is that prior to the repair, there were zero blocks in the AGFL. This scenario is set up by the test case, since it formats with 64MB AGs and tries to ENOSPC the whole filesystem. In this case of flcount==0, we reset fllast to -1U, which then trips the write verifier's check that fllast is less than xfs_agfl_size(). Correct this code to set fllast to the last possible slot in the AGFL when flcount is zero, which mirrors the behavior of xfs_repair phase5 when it has to create a totally empty AGFL. Fixes: 0e93d3f43ec7 ("xfs: repair the AGFL") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: clear pagf_agflreset when repairing the AGFLDarrick J. Wong2023-08-101-1/+4
| | | | | | | | | | | | | | | | Clear the pagf_agflreset flag when we're repairing the AGFL because we fix all the same padding problems that xfs_agfl_reset does. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: allow userspace to rebuild metadata structuresDarrick J. Wong2023-08-103-3/+17
| | | | | | | | | | | | | | | | | | | | Add a new (superuser-only) flag to the online metadata repair ioctl to force it to rebuild structures, even if they're not broken. We will use this to move metadata structures out of the way during a free space defragmentation operation. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: don't complain about unfixed metadata when repairs were injectedDarrick J. Wong2023-08-102-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While debugging other parts of online repair, I noticed that if someone injects FORCE_SCRUB_REPAIR, starts an IFLAG_REPAIR scrub on a piece of metadata, and the metadata repair fails, we'll log a message about uncorrected errors in the filesystem. This isn't strictly true if the scrub function didn't set OFLAG_CORRUPT and we're only doing the repair because the error injection knob is set. Repair functions are allowed to abort the entire operation at any point before committing new metadata, in which case the piece of metadata is in the same state as it was before. Therefore, the log message should be gated on the results of the scrub. Refactor the predicate and rearrange the code flow to make this happen. Note: If the repair function errors out after it commits the new metadata, the transaction cancellation will shut down the filesystem, which is an obvious sign of corrupt metadata. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: allow the user to cancel repairs before we start writingDarrick J. Wong2023-08-101-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All online repair functions have the same structure: walk filesystem metadata structures gathering enough data to rebuild the structure, stage a new copy, and then commit the new copy. The gathering steps do not write anything to disk, so they are peppered with xchk_should_terminate calls to avoid softlockup warnings and to provide an opportunity to abort the repair (by killing xfs_scrub). However, it's not clear in the code base when is the last chance to abort cleanly without having to undo a bunch of structure. Therefore, add one more call to xchk_should_terminate (along with a comment) providing the sysadmin with the ability to abort before it's too late and to make it clear in the source code when it's no longer convenient or safe to abort a repair. As there are only four repair functions right now, this patch exists more to establish a precedent for subsequent additions than to deliver practical functionality. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: always rescan allegedly healthy per-ag metadata after repairDarrick J. Wong2023-08-101-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | After an online repair function runs for a per-AG metadata structure, sc->sick_mask is supposed to reflect the per-AG metadata that the repair function fixed. Our next move is to re-check the metadata to assess the completeness of our repair, so we don't want the rebuilt structure to be excluded from the rescan just because the health system previously logged a problem with the data structure. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: implement online scrubbing of rtsummary infoDarrick J. Wong2023-08-107-28/+298
| | | | | | | | | | | | | | | | | | Finish the realtime summary scrubber by adding the functions we need to compute a fresh copy of the rtsummary info and comparing it to the copy on disk. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: move the realtime summary file scrubber to a separate source fileDarrick J. Wong2023-08-103-38/+60
| | | | | | | | | | | | | | | | Move the realtime summary file checking code to a separate file in preparation to actually implement it. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: wrap ilock/iunlock operations on sc->ipDarrick J. Wong2023-08-108-29/+53
| | | | | | | | | | | | | | | | | | | | Scrub tracks the resources that it's holding onto in the xfs_scrub structure. This includes the inode being checked (if applicable) and the inode lock state of that inode. Replace the open-coded structure manipulation with a trivial helper to eliminate sources of error. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: get our own reference to inodes that we want to scrubDarrick J. Wong2023-08-106-13/+36
| | | | | | | | | | | | | | | | When we want to scrub a file, get our own reference to the inode unconditionally. This will make disposal rules simpler in the long run. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: track usage statistics of online fsckDarrick J. Wong2023-08-1010-9/+535
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Track the usage, outcomes, and run times of the online fsck code, and report these values via debugfs. The columns in the file are: * scrubber name * number of scrub invocations * clean objects found * corruptions found * optimizations found * cross referencing failures * inconsistencies found during cross referencing * incomplete scrubs * warnings * number of time scrub had to retry * cumulative amount of time spent scrubbing (microseconds) * number of repair inovcations * successfully repaired objects * cumuluative amount of time spent repairing (microseconds) Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: create scaffolding for creating debugfs entriesDarrick J. Wong2023-08-104-2/+34
| | | | | | | | | | | | | | | | | | Set up debugfs directories for xfs as a whole, and a subdirectory for each mounted filesystem. This will enable the creation of debugfs files in the next patch. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: improve xfarray quicksort pivotDarrick J. Wong2023-08-102-69/+148
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that we have the means to do insertion sorts of small in-memory subsets of an xfarray, use it to improve the quicksort pivot algorithm by reading 7 records into memory and finding the median of that. This should prevent bad partitioning when a[lo] and a[hi] end up next to each other in the final sort, which can happen when sorting for cntbt repair when the free space is extremely fragmented (e.g. generic/176). This doesn't speed up the average quicksort run by much, but it will (hopefully) avoid the quadratic time collapse for which quicksort is famous. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: cache pages used for xfarray quicksort convergenceDarrick J. Wong2023-08-102-10/+86
| | | | | | | | | | | | | | | | | | | | | | | | | | | | After quicksort picks a pivot item for a particular subsort, it walks the records in that subset from the outside in, rearranging them so that every record less than the pivot comes before it, and every record greater than the pivot comes after it. This scan has a lot of locality, so we can speed it up quite a bit by grabbing the xfile backing page and holding onto it as long as we possibly can. Doing so reduces the runtime by another 5% on the author's computer. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: speed up xfarray sort by sorting xfile page contents directlyDarrick J. Wong2023-08-103-0/+121
| | | | | | | | | | | | | | | | | | | | | | | | If all the records in an xfarray subset live within the same memory page, we can short-circuit even more quicksort recursion by mapping that page into the local CPU and using the kernel's heapsort function to sort the subset. On the author's computer, this reduces the runtime by another 15% on a 500,000 element array. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: teach xfile to pass back direct-map pages to callerDarrick J. Wong2023-08-103-0/+120
| | | | | | | | | | | | | | | | | | | | Certain xfile array operations (such as sorting) can be sped up quite a bit by allowing xfile users to grab a page to bulk-read the records contained within it. Create helper methods to facilitate this. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>
| * xfs: convert xfarray insertion sort to heapsort using scratchpad memoryDarrick J. Wong2023-08-103-120/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the previous patch, we created a very basic quicksort implementation for xfile arrays. While the use of an alternate sorting algorithm to avoid quicksort recursion on very small subsets reduces the runtime modestly, we could do better than a load and store-heavy insertion sort, particularly since each load and store requires a page mapping lookup in the xfile. For a small increase in kernel memory requirements, we could instead bulk load the xfarray records into memory, use the kernel's existing heapsort implementation to sort the records, and bulk store the memory buffer back into the xfile. On the author's computer, this reduces the runtime by about 5% on a 500,000 element array. Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Dave Chinner <dchinner@redhat.com>