summaryrefslogtreecommitdiffstats
path: root/fs/bcachefs (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefsLinus Torvalds2024-04-1124-238/+359
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more bcachefs fixes from Kent Overstreet: "Notable user impacting bugs - On multi device filesystems, recovery was looping in btree_trans_too_many_iters(). This checks if a transaction has touched too many btree paths (because of iteration over many keys), and isuses a restart to drop unneeded paths. But it's now possible for some paths to exceed the previous limit without iteration in the interior btree update path, since the transaction commit will do alloc updates for every old and new btree node, and during journal replay we don't use the btree write buffer for locking reasons and thus those updates use btree paths when they wouldn't normally. - Fix a corner case in rebalance when moving extents on a durability=0 device. This wouldn't be hit when a device was formatted with durability=0 since in that case we'll only use it as a write through cache (only cached extents will live on it), but durability can now be changed on an existing device. - bch2_get_acl() could rarely forget to handle a transaction restart; this manifested as the occasional missing acl that came back after dropping caches. - Fix a major performance regression on high iops multithreaded write workloads (only since 6.9-rc1); a previous fix for a deadlock in the interior btree update path to check the journal watermark introduced a dependency on the state of btree write buffer flushing that we didn't want. - Assorted other repair paths and recovery fixes" * tag 'bcachefs-2024-04-10' of https://evilpiepirate.org/git/bcachefs: (25 commits) bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter() bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail() bcachefs: Fix a race in btree_update_nodes_written() bcachefs: btree_node_scan: Respect member.data_allowed bcachefs: Don't scan for btree nodes when we can reconstruct bcachefs: Fix check_topology() when using node scan bcachefs: fix eytzinger0_find_gt() bcachefs: fix bch2_get_acl() transaction restart handling bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu list bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take() bcachefs: Rename struct field swap to prevent macro naming collision MAINTAINERS: Add entry for bcachefs documentation Documentation: filesystems: Add bcachefs toctree bcachefs: JOURNAL_SPACE_LOW bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINE bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystems bcachefs: fix rand_delete unit test bcachefs: fix ! vs ~ typo in __clear_bit_le64() bcachefs: Fix rebalance from durability=0 device bcachefs: Print shutdown journal sequence number ...
| * bcachefs: Fix __bch2_btree_and_journal_iter_init_node_iter()Kent Overstreet2024-04-111-5/+7
| | | | | | | | | | | | | | We weren't respecting trans->journal_replay_not_finished - we shouldn't be searching the journal keys unless we have a ref on them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Kill read lock dropping in bch2_btree_node_lock_write_nofail()Kent Overstreet2024-04-111-27/+1
| | | | | | | | | | | | | | | | | | | | | | dropping read locks in bch2_btree_node_lock_write_nofail() dates from before we had the cycle detector; we can now tell the cycle detector directly when taking a lock may not fail because we can't handle transaction restarts. This is needed for adding should_be_locked asserts. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Fix a race in btree_update_nodes_written()Kent Overstreet2024-04-111-3/+7
| | | | | | | | | | | | | | | | | | | | One btree update might have terminated in a node update, and then while it is in flight another btree update might free that original node. This race has to be handled in btree_update_nodes_written() - we were missing a READ_ONCE(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: btree_node_scan: Respect member.data_allowedKent Overstreet2024-04-101-0/+3
| | | | | | | | | | | | If a device wasn't used for btree nodes, no need to scan for them. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Don't scan for btree nodes when we can reconstructKent Overstreet2024-04-094-18/+29
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Fix check_topology() when using node scanKent Overstreet2024-04-091-1/+1
| | | | | | | | | | | | | | shoot down journal keys _before_ populating journal keys with pointers to scanned nodes Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: fix eytzinger0_find_gt()Kent Overstreet2024-04-091-6/+20
| | | | | | | | | | | | | | | | | | - fix return types: promoting from unsigned to ssize_t does not do what we want here, and was pointless since the rest of the eytzinger code is u32 - nr, not size Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: fix bch2_get_acl() transaction restart handlingKent Overstreet2024-04-071-16/+14
| | | | | | | | | | | | | | bch2_acl_from_disk() uses allocate_dropping_locks, and can thus return a transaction restart - this wasn't handled. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: fix the count of nr_freed_pcpu after changing bc->freed_nonpcpu listHongbo Li2024-04-071-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When allocating bkey_cached from bc->freed_pcpu list, it missed decreasing the count of nr_freed_pcpu which would cause the mismatch between the value of nr_freed_pcpu and the list items. This problem also exists in moving new bkey_cached to bc->freed_pcpu list. If these happened, the bug info may appear in bch2_fs_btree_key_cache_exit by the follow code: BUG_ON(list_count_nodes(&bc->freed_pcpu) != bc->nr_freed_pcpu); BUG_ON(list_count_nodes(&bc->freed_nonpcpu) != bc->nr_freed_nonpcpu); Fixes: c65c13f0eac6 ("bcachefs: Run btree key cache shrinker less aggressively") Signed-off-by: Hongbo Li <lihongbo22@huawei.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Fix gap buffer bug in bch2_journal_key_insert_take()Kent Overstreet2024-04-071-10/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | Multiple bug fixes for journal iters: - When the journal keys gap buffer is resized, we have to adjust the iterators for moving the gap to the end - We don't want to rewind iterators to point to the key we just inserted if it's not for the correct btree/level Also, add some new assertions. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Rename struct field swap to prevent macro naming collisionThorsten Blum2024-04-061-4/+4
| | | | | | | | | | | | | | | | The struct field swap can collide with the swap() macro defined in linux/minmax.h. Rename the struct field to prevent such collisions. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: JOURNAL_SPACE_LOWKent Overstreet2024-04-065-12/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "bcachefs; Fix deadlock in bch2_btree_update_start()" was a significant performance regression (nearly 50%) on multithreaded random writes with fio. The reason is that the journal watermark checks multiple things, including the state of the btree write buffer, and on multithreaded update heavy workloads we're bottleneked on write buffer flushing - we don't want kicknig off btree updates to depend on the state of the write buffer. This isn't strictly correct; the interior btree update path does do write buffer updates, but it's a tiny fraction of total accounting updates and we're more concerned with space in the journal itself. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Disable errors=panic for BCH_IOCTL_FSCK_OFFLINEKent Overstreet2024-04-061-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BCH_IOCTL_FSCK_OFFLINE allows the userspace fsck tool to use the kernel implementation of fsck - primarily when the kernel version is a better version match. It should look and act exactly like the normal userspace fsck that the user expected to be invoking, so errors should never result in a kernel panic. We may want to consider further restricting errors=panic - it's only intended for debugging in controlled test environments, it should have no purpose it normal usage. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Fix BCH_IOCTL_FSCK_OFFLINE for encrypted filesystemsKent Overstreet2024-04-061-44/+50
| | | | | | | | | | | | | | | | | | | | To open an encrypted filesystem, we use request_key() to get the encryption key from the user's keyring - but request_key() needs to happen in the context of the process that invoked the ioctl. This easily fixed by using bch2_fs_open() in nostart mode. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: fix rand_delete unit testKent Overstreet2024-04-051-1/+1
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: fix ! vs ~ typo in __clear_bit_le64()Dan Carpenter2024-04-051-1/+1
| | | | | | | | | | | | | | | | | | The ! was obviously intended to be ~. As it is, this function does the equivalent to: "addr[bit / 64] = 0;". Fixes: 27fcec6c27ca ("bcachefs: Clear recovery_passes_required as they complete without errors") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Fix rebalance from durability=0 deviceKent Overstreet2024-04-051-4/+13
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Print shutdown journal sequence numberKent Overstreet2024-04-041-0/+5
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Further improve btree_update_to_text()Kent Overstreet2024-04-042-55/+48
| | | | | | | | | | | | Print start and end level of the btree update; also a bit of cleanup. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Move btree_updates to debugfsKent Overstreet2024-04-042-9/+42
| | | | | | | | | | | | | | sysfs is limited to PAGE_SIZE, and when we're debugging strange deadlocks/priority inversions we need to see the full list. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Bump limit in btree_trans_too_many_iters()Kent Overstreet2024-04-042-1/+15
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: Make snapshot_is_ancestor() safeKent Overstreet2024-04-041-6/+13
| | | | | | | | | | | | | | Snapshot table accesses generally need to be checking for invalid snapshot ID now, fix one that was missed. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
| * bcachefs: create debugfs dir for each btreeThomas Bertschinger2024-04-041-15/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This creates a subdirectory for each individual btree under the btrees/ debugfs directory. Directory structure, before: /sys/kernel/debug/bcachefs/$FS_ID/btrees/ ├── alloc ├── alloc-bfloat-failed ├── alloc-formats ├── backpointers ├── backpointers-bfloat-failed ├── backpointers-formats ... Directory structure, after: /sys/kernel/debug/bcachefs/$FS_ID/btrees/ ├── alloc │   ├── bfloat-failed │   ├── formats │   └── keys ├── backpointers │   ├── bfloat-failed │   ├── formats │   └── keys ... Signed-off-by: Thomas Bertschinger <tahbertschinger@gmail.com> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | Merge tag 'vfs-6.9-rc3.fixes' of ↵Linus Torvalds2024-04-051-1/+1
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs Pull vfs fixes from Christian Brauner: "This contains a few small fixes. This comes with some delay because I wanted to wait on people running their reproducers and the Easter Holidays meant that those replies came in a little later than usual: - Fix handling of preventing writes to mounted block devices. Since last kernel we allow to prevent writing to mounted block devices provided CONFIG_BLK_DEV_WRITE_MOUNTED isn't set and the block device is opened with restricted writes. When we switched to opening block devices as files we altered the mechanism by which we recognize when a block device has been opened with write restrictions. The detection logic assumed that only read-write mounted filesystems would apply write restrictions to their block devices from other openers. That of course is not true since it also makes sense to apply write restrictions for filesystems that are read-only. Fix the detection logic using an FMODE_* bit. We still have a few left since we freed up a couple a while ago. I also picked up a patch to free up four additional FMODE_* bits scheduled for the next merge window. - Fix counting the number of writers to a block device. This just changes the logic to be consistent. - Fix a bug in aio causing a NULL pointer derefernce after we implemented batched processing in aio. - Finally, add the changes we discussed that allows to yield block devices early even though file closing itself is deferred. This also allows us to remove two holder operations to get and release the holder to align lifetime of file and holder of the block device" * tag 'vfs-6.9-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: aio: Fix null ptr deref in aio_complete() wakeup fs,block: yield devices early block: count BLK_OPEN_RESTRICT_WRITES openers block: handle BLK_OPEN_RESTRICT_WRITES correctly
| * fs,block: yield devices earlyChristian Brauner2024-03-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently a device is only really released once the umount returns to userspace due to how file closing works. That ultimately could cause an old umount assumption to be violated that concurrent umount and mount don't fail. So an exclusively held device with a temporary holder should be yielded before the filesystem is gone. Add a helper that allows callers to do that. This also allows us to remove the two holder ops that Linus wasn't excited about. Link: https://lore.kernel.org/r/20240326-vfs-bdev-end_holder-v1-1-20af85202918@kernel.org Fixes: f3a608827d1f ("bdev: open block device as files") # mainline only Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Suggested-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
* | bcachefs: reconstruct_inode()Kent Overstreet2024-04-031-2/+50
| | | | | | | | | | | | | | If an inode is missing, but corresponding extents and dirent still exist, it's well worth recreating it - this does so. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Subvolume reconstructionKent Overstreet2024-04-031-19/+148
| | | | | | | | | | | | We can now recreate missing subvolumes from dirents and/or inodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Check for extents that point to same spaceKent Overstreet2024-04-032-8/+168
| | | | | | | | | | | | | | | | | | In backpointer repair, if we get a missing backpointer - but there's already a backpointer that points to an existing extent - we've got multiple extents that point to the same space and need to decide which to keep. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Reconstruct missing snapshot nodesKent Overstreet2024-04-036-6/+199
| | | | | | | | | | | | | | | | When the snapshots btree is going, we'll have to delete huge amounts of data - unless we can reconstruct it by looking at the keys that refer to it. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Flag btrees with missing dataKent Overstreet2024-04-036-5/+44
| | | | | | | | | | | | | | We need this to know when we should attempt to reconstruct the snapshots btree Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Topology repair now uses nodes found by scanning to fill holesKent Overstreet2024-04-032-107/+199
| | | | | | | | | | | | | | | | | | | | | | | | | | With the new btree node scan code, we can now recover from corrupt btree roots - simply create a new fake root at depth 1, and then insert all the leaves we found. If the root wasn't corrupt but there's corruption elsewhere in the btree, we can fill in holes as needed with the newest version of a given node(s) from the scan; we also check if a given btree node is older than what we found from the scan. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Repair pass for scanning for btree nodesKent Overstreet2024-04-0312-51/+605
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If a btree root or interior btree node goes bad, we're going to lose a lot of data, unless we can recover the nodes that it pointed to by scanning. Fortunately btree node headers are fully self describing, and additionally the magic number is xored with the filesytem UUID, so we can do so safely. This implements the scanning - next patch will rework topology repair to make use of the found nodes. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Don't skip fake btree roots in fsckKent Overstreet2024-04-031-3/+0
| | | | | | | | | | | | | | When a btree root is unreadable, we might still have keys fro the journal to walk and mark. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: bch2_btree_root_alloc() -> bch2_btree_root_alloc_fake()Kent Overstreet2024-04-033-7/+7
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Etyzinger cleanupsKent Overstreet2024-04-037-182/+285
| | | | | | | | | | | | | | | | Pull out eytzinger.c and kill eytzinger_cmp_fn. We now provide eytzinger0_sort and eytzinger0_sort_r, which use the standard cmp_func_t and cmp_r_func_t callbacks. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: bch2_shoot_down_journal_keys()Kent Overstreet2024-04-033-10/+35
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Clear recovery_passes_required as they complete without errorsKent Overstreet2024-04-033-12/+43
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: ratelimit informational fsck errorsKent Overstreet2024-04-031-4/+4
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Check for bad needs_discard before doing discardKent Overstreet2024-04-031-21/+26
| | | | | | | | | | | | | | | | | | | | | | In the discard worker, we were failing to validate the bucket state - meaning a corrupt needs_discard btree could cause us to discard a bucket that we shouldn't. If check_alloc_info hasn't run yet we just want to bail out, otherwise it's a filesystem inconsistent error. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Improve bch2_btree_update_to_text()Kent Overstreet2024-04-022-22/+43
| | | | | | | | | | | | | | Print out the mode as a string, and also print out the btree and watermark. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | mean_and_variance: Drop always failing testsGuenter Roeck2024-04-021-27/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mean_and_variance_test_2 and mean_and_variance_test_4 always fail. The input parameters to those tests are identical to the input parameters to tests 1 and 3, yet the expected result for tests 2 and 4 is different for the mean and stddev tests. That will always fail. Expected mean_and_variance_get_mean(mv) == mean[i], but mean_and_variance_get_mean(mv) == 22 (0x16) mean[i] == 10 (0xa) Drop the bad tests. Fixes: 65bc41090720 ("mean and variance: More tests") Closes: https://lore.kernel.org/lkml/065b94eb-6a24-4248-b7d7-d3212efb4787@roeck-us.net/ Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: fix nocow lock deadlockKent Overstreet2024-04-021-2/+1
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: BCH_WATERMARK_interior_updatesKent Overstreet2024-04-026-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | This adds a new watermark, higher priority than BCH_WATERMARK_reclaim, for interior btree updates. We've seen a deadlock where journal replay triggers a ton of btree node merges, and these use up all available open buckets and then interior updates get stuck. One cause of this is that we're currently lacking btree node merging on write buffer btrees - that needs to be fixed as well. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Fix btree node reserveKent Overstreet2024-04-021-1/+1
| | | | | | | | | | | | Sign error when checking the watermark - oops. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: On emergency shutdown, print out current journal sequence numberKent Overstreet2024-04-011-1/+3
| | | | | | | | Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Fix overlapping extent repairKent Overstreet2024-04-011-4/+6
| | | | | | | | | | | | | | overlapping extent repair was colliding with extent past end of inode checks - don't update "extent ends at" until we know we have an extent. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Fix remove_dirent()Kent Overstreet2024-04-011-3/+4
| | | | | | | | | | | | We were missing an iter_traverse(). Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Logged op errors should be ignoredKent Overstreet2024-04-011-4/+3
| | | | | | | | | | | | | | If something is wrong with a logged op, we just want to delete it - there's nothing to repair. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>
* | bcachefs: Improve -o norecovery; opts.recovery_pass_limitKent Overstreet2024-04-016-14/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds opts.recovery_pass_limit, and redoes -o norecovery to make use of it; this fixes some issues with -o norecovery so it can be safely used for data recovery. Norecovery means "don't do journal replay"; it's an important data recovery tool when we're getting stuck in journal replay. When using it this way we need to make sure we don't free journal keys after startup, so we continue to overlay them: thus it needs to imply retain_recovery_info, as well as nochanges. recovery_pass_limit is an explicit option for telling recovery to exit after a specific recovery pass; this is a much cleaner way of implementing -o norecovery, as well as being a useful debug feature in its own right. Signed-off-by: Kent Overstreet <kent.overstreet@linux.dev>