linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	btrfs: introduce new "rescue=ignoremetacsums" mount option	Qu Wenruo	2024-07-11	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce "rescue=ignoremetacsums" to ignore metadata csums, all the other metadata sanity checks are still kept as is. This new mount option is mostly to allow the kernel to mount an interrupted checksum conversion (at the metadata csum overwrite stage). And since the main part of metadata sanity checks is inside tree-checker, we shouldn't lose much safety, and the new mount option is rescue mount option it requires full read-only mount. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: ignore incorrect btrfs_file_extent_item::ram_bytes	Qu Wenruo	2024-07-11	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	[HICCUP] Kernels can create file extent items with incorrect ram_bytes like this: item 6 key (257 EXTENT_DATA 0) itemoff 15816 itemsize 53 generation 7 type 1 (regular) extent data disk byte 13631488 nr 32768 extent data offset 0 nr 4096 ram 4096 extent compression 0 (none) Thankfully kernel can handle them properly, as in that case ram_bytes is not utilized at all. [ENHANCEMENT] Since the hiccup is not going to cause any data-loss and is only a minor violation of on-disk format, here we only need to ignore the incorrect ram_bytes value, and use the correct one from btrfs_file_extent_item::disk_num_bytes. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: cleanup the bytenr usage inside btrfs_extent_item_to_extent_map()	Qu Wenruo	2024-07-11	1	-5/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	[HICCUP] Before commit 85de2be7129c ("btrfs: remove extent_map::block_start member"), we utilized @bytenr variable inside btrfs_extent_item_to_extent_map() to calculate block_start. But that commit removed block_start completely, we have no need to advance @bytenr at all. [ENHANCEMENT] - Rename @bytenr as @disk_bytenr - Only declare @disk_bytenr inside the if branch - Make @disk_bytenr const and remove the modification on it Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: remove extent_map::block_start member	Qu Wenruo	2024-07-11	1	-4/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The member extent_map::block_start can be calculated from extent_map::disk_bytenr + extent_map::offset for regular extents. And otherwise just extent_map::disk_bytenr. And this is already validated by the validate_extent_map(). Now we can remove the member. However there is a special case in btrfs_create_dio_extent() where we for NOCOW/PREALLOC ordered extents cannot directly use the resulting btrfs_file_extent, as btrfs_split_ordered_extent() cannot handle them yet. So for that call site, we pass file_extent->disk_bytenr + file_extent->num_bytes as disk_bytenr for the ordered extent, and 0 for offset. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: remove extent_map::block_len member	Qu Wenruo	2024-07-11	1	-7/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The extent_map::block_len is either extent_map::len (non-compressed extent) or extent_map::disk_num_bytes (compressed extent). Since we already have sanity checks to do the cross-checks between the new and old members, we can drop the old extent_map::block_len now. For most call sites, they can manually select extent_map::len or extent_map::disk_num_bytes, since most if not all of them have checked if the extent is compressed. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: remove extent_map::orig_start member	Qu Wenruo	2024-07-11	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since we have extent_map::offset, the old extent_map::orig_start is just extent_map::start - extent_map::offset for non-hole/inline extents. And since the new extent_map::offset is already verified by validate_extent_map() while the old orig_start is not, let's just remove the old member from all call sites. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: introduce new members for extent_map	Qu Wenruo	2024-07-11	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Introduce two new members for extent_map: - disk_bytenr - offset Both are matching the members with the same name inside btrfs_file_extent_items. For now this patch only touches those members when: - Reading btrfs_file_extent_items from disk - Inserting new holes - Merging two extent maps With the new disk_bytenr and disk_num_bytes, doing merging would be a little more complex, as we have 3 different cases: * Both extent maps are referring to the same data extents \|<----- data extent A ----->\| \|<- em 1 ->\|<- em 2 ->\| * Both extent maps are referring to different data extents \|<-- data extent A -->\|<-- data extent B -->\| \|<- em 1 ->\|<- em 2 ->\| * One of the extent maps is referring to a merged and larger data extent that covers both extent maps This is not really valid case other than some selftests. So this test case would be removed. A new helper merge_ondisk_extents() is introduced to handle the above valid cases. To properly assign values for those new members, a new btrfs_file_extent parameter is introduced to all the involved call sites. - For NOCOW writes the btrfs_file_extent would be exposed from can_nocow_file_extent(). - For other writes, the members can be easily calculated As most of them have 0 offset and utilizing the whole on-disk data extent. The exception is encoded write, but thankfully that interface provided offset directly and all other needed info. For now, both the old members (block_start/block_len/orig_start) are co-existing with the new members (disk_bytenr/offset), meanwhile all the critical code is still using the old members only. The cleanup will happen later after all the old and new members are properly validated. There would be some re-ordering for the assignment of the extent_map members, now we follow the new ordering: - start and len Or file_pos and num_bytes for other structures. - disk_bytenr and disk_num_bytes - offset and ram_bytes - compression So expect some seemingly unrelated line movement. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: rename extent_map::orig_block_len to disk_num_bytes	Qu Wenruo	2024-07-11	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \|	This would make it very obvious that the member just matches btrfs_file_extent_item::disk_num_bytes. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: don't allocate file extent tree for non regular files	Filipe Manana	2024-07-11	1	-6/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When not using the NO_HOLES feature we always allocate an io tree for an inode's file_extent_tree. This is wasteful because that io tree is only used for regular files, so we allocate more memory than needed for inodes that represent directories or symlinks for example, or for inodes that correspond to free space inodes. So improve on this by allocating the io tree only for inodes of regular files that are not free space inodes. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: simplify the inline extent map creation	Qu Wenruo	2024-05-07	1	-10/+10
\| \| \| \| \| \| \| \| \| \| \|	With the tree-checker ensuring all inline file extents starts at file offset 0 and has a length no larger than sectorsize, we can simplify the calculation to assigned those fixes values directly. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: change root->root_key.objectid to btrfs_root_id()	Josef Bacik	2024-05-07	1	-6/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	A comment from Filipe on one of my previous cleanups brought my attention to a new helper we have for getting the root id of a root, which makes it easier to read in the code. The changes where made with the following Coccinelle semantic patch: // <smpl> @@ expression E,E1; @@ ( E->root_key.objectid = E1 \| - E->root_key.objectid + btrfs_root_id(E) ) // </smpl> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ minor style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: make NOCOW checks for existence of checksums in a range more efficient	Filipe Manana	2024-05-07	1	-7/+18
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Before deciding if we can do a NOCOW write into a range, one of the things we have to do is check if there are checksum items for that range. We do that through the btrfs_lookup_csums_list() function, which searches for checksums and adds them to a list supplied by the caller. But all we need is to check if there is any checksum, we don't need to look for all of them and collect them into a list, which requires more search time in the checksums tree, allocating memory for checksums items to add to the list, copy checksums from a leaf into those list items, then free that memory, etc. This is all unnecessary overhead, wasting mostly CPU time, and perhaps some occasional IO if we need to read from disk any extent buffers. So change btrfs_lookup_csums_list() to allow to return immediately in case it finds any checksum, without the need to add it to a list and read it from a leaf. This is accomplished by allowing a NULL list parameter and making the function return 1 if it found any checksum, 0 if it didn't found any, and a negative value in case of an error. The following test with fio was used to measure performance: $ cat test.sh #!/bin/bash DEV=/dev/nullb0 MNT=/mnt/nullb0 cat <<EOF > /tmp/fio-job.ini [global] name=fio-rand-write filename=$MNT/fio-rand-write rw=randwrite bssplit=4k/20:8k/20:16k/20:32k/20:64k/20 direct=1 numjobs=16 fallocate=posix time_based runtime=300 [file1] size=8G ioengine=io_uring iodepth=16 EOF umount $MNT &> /dev/null mkfs.btrfs -f $DEV mount -o ssd $DEV $MNT fio /tmp/fio-job.ini umount $MNT The test was run on a release kernel (Debian's default kernel config). The results before this patch: WRITE: bw=139MiB/s (146MB/s), 8204KiB/s-9504KiB/s (8401kB/s-9732kB/s), io=17.0GiB (18.3GB), run=125317-125344msec The results after this patch: WRITE: bw=153MiB/s (160MB/s), 9241KiB/s-10.0MiB/s (9463kB/s-10.5MB/s), io=17.0GiB (18.3GB), run=114054-114071msec Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: simplify error path for btrfs_lookup_csums_list()	Filipe Manana	2024-05-07	1	-8/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the error path we have this while loop that keeps iterating over the csums of the list and then delete them from the list and free them, testing for an error (ret < 0) and list emptyness as the conditions of the while loop. Simplify this by using list_for_each_entry_safe() so there's no need to delete elements from the list and need to test the error condition on each iteration. Also rename the 'fail' label to 'out' since the label is not exclusive to a failure path, as we also end up there when the function succeeds, and it's also a more common label name. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: remove use of a temporary list at btrfs_lookup_csums_list()	Filipe Manana	2024-05-07	1	-5/+3
\| \| \| \| \| \| \| \| \| \| \|	There's no need to use a temporary list to add the checksums, we can just add them to input list and then on error delete and free any checksums that were added. So simplify and remove the temporary list. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: remove search_commit parameter from btrfs_lookup_csums_list()	Filipe Manana	2024-05-07	1	-9/+1
\| \| \| \| \| \| \| \| \| \| \|	All the callers of btrfs_lookup_csums_list() pass a value of 0 as the "search_commit" parameter. So remove it and make the function behave as to always search from the regular root. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: add function comment to btrfs_lookup_csums_list()	Filipe Manana	2024-05-07	1	-0/+13
\| \| \| \| \| \| \| \| \| \| \| \|	Add a function comment to btrfs_lookup_csums_list() to document it. With another upcoming change its parameter list and return value will be less obvious. So add the documentation now so that it can be updated where needed later. Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: unify handling of return values of btrfs_insert_empty_items()	David Sterba	2024-03-04	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \|	The error values returned by btrfs_insert_empty_items() are following the common patter of 0/-errno, but some callers check for a value > 0, which can't happen. Document that and update calls to not expect positive values. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: remove unused included headers	David Sterba	2024-03-04	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \|	With help of neovim, LSP and clangd we can identify header files that are not actually needed to be included in the .c files. This is focused only on removal (with minor fixups), further cleanups are possible but will require doing the header files properly with forward declarations, minimized includes and include-what-you-use care. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: use the flags of an extent map to identify the compression type	Filipe Manana	2023-12-15	1	-6/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, in struct extent_map, we use an unsigned int (32 bits) to identify the compression type of an extent and an unsigned long (64 bits on a 64 bits platform, 32 bits otherwise) for flags. We are only using 6 different flags, so an unsigned long is excessive and we can use flags to identify the compression type instead of using a dedicated 32 bits field. We can easily have tens or hundreds of thousands (or more) of extent maps on busy and large filesystems, specially with compression enabled or many or large files with tons of small extents. So it's convenient to have the extent_map structure as small as possible in order to use less memory. So remove the compression type field from struct extent_map, use flags to identify the compression type and shorten the flags field from an unsigned long to a u32. This saves 8 bytes (on 64 bits platforms) and reduces the size of the structure from 136 bytes down to 128 bytes, using now only two cache lines, and increases the number of extent maps we can have per 4K page from 30 to 32. By using a u32 for the flags instead of an unsigned long, we no longer use test_bit(), set_bit() and clear_bit(), but that level of atomicity is not needed as most flags are never cleared once set (before adding an extent map to the tree), and the ones that can be cleared or set after an extent map is added to the tree, are always performed while holding the write lock on the extent map tree, while the reader holds a lock on the tree or tests for a flag that never changes once the extent map is in the tree (such as compression flags). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: allocate btrfs_inode::file_extent_tree only without NO_HOLES	David Sterba	2023-12-15	1	-3/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The file_extent_tree was added in 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree") so we have an explicit mapping of the file extents to know where it is safe to update i_size. When the feature NO_HOLES is enabled, and it's been a mkfs default since 5.15, the tree is not necessary. To save some space in the inode, allocate the tree only when necessary. This reduces size by 16 bytes from 1096 to 1080 on a x86_64 release config. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: abort transaction on generation mismatch when marking eb as dirty	Filipe Manana	2023-10-12	1	-8/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When marking an extent buffer as dirty, at btrfs_mark_buffer_dirty(), we check if its generation matches the running transaction and if not we just print a warning. Such mismatch is an indicator that something really went wrong and only printing a warning message (and stack trace) is not enough to prevent a corruption. Allowing a transaction to commit with such an extent buffer will trigger an error if we ever try to read it from disk due to a generation mismatch with its parent generation. So abort the current transaction with -EUCLEAN if we notice a generation mismatch. For this we need to pass a transaction handle to btrfs_mark_buffer_dirty() which is always available except in test code, in which case we can pass NULL since it operates on dummy extent buffers and all test roots have a single node/leaf (root node at level 0). Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: scrub: avoid unnecessary csum tree search preparing stripes	Qu Wenruo	2023-08-21	1	-12/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	One of the bottleneck of the new scrub code is the extra csum tree search. The old code would only do the csum tree search for each scrub bio, which can be as large as 512KiB, thus they can afford to allocate a new path each time. But the new scrub code is doing csum tree search for each stripe, which is only 64KiB, this means we'd better re-use the same csum path during each search. This patch would introduce a per-sctx path for csum tree search, as we don't need to re-allocate the path every time we need to do a csum tree search. With this change we can further improve the queue depth and improve the scrub read performance: Before (with regression and cached extent tree path): Device r/s rkB/s rrqm/s %rrqm r_await rareq-sz aqu-sz %util nvme0n1p3 15875.00 1013328.00 12.00 0.08 0.08 63.83 1.35 100.00 After (with both cached extent/csum tree path): nvme0n1p3 17759.00 1133280.00 10.00 0.06 0.08 63.81 1.50 100.00 Fixes: e02ee89baa66 ("btrfs: scrub: switch scrub_simple_mirror() to scrub_stripe infrastructure") CC: stable@vger.kernel.org # 6.4+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: use bbio->ordered in btrfs_csum_one_bio	Christoph Hellwig	2023-06-19	1	-20/+1
\| \| \| \| \| \| \| \| \| \| \|	Use the ordered_extent pointer in the btrfs_bio instead of looking it up manually. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: add an ordered_extent pointer to struct btrfs_bio	Christoph Hellwig	2023-06-19	1	-8/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a pointer to the ordered_extent to the existing union in struct btrfs_bio, so all code dealing with data write bios can just use a pointer dereference to retrieve the ordered_extent instead of doing multiple rbtree lookups per I/O. The reference to this ordered_extent is dropped at end I/O time, which implies that an extra one must be acquired when the bio is split. This also requires moving the btrfs_extract_ordered_extent call into btrfs_split_bio so that the invariant of always having a valid ordered_extent reference for the btrfs_bio is kept. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: limit write bios to a single ordered extent	Christoph Hellwig	2023-06-19	1	-38/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently buffered writeback bios are allowed to span multiple ordered_extents, although that basically never actually happens since commit 4a445b7b6178 ("btrfs: don't merge pages into bio if their page offset is not contiguous"). Supporting bios than span ordered_extents complicates the file checksumming code, and prevents us from adding an ordered_extent pointer to the btrfs_bio structure. Use the existing code to limit a bio to single ordered_extent for zoned device writes for all writes. This allows to remove the REQ_BTRFS_ONE_ORDERED flags, and the handling of multiple ordered_extents in btrfs_csum_one_bio. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: optimize the logical to physical mapping for zoned writes	Christoph Hellwig	2023-06-19	1	-0/+30
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The current code to store the final logical to physical mapping for a zone append write in the extent tree is rather inefficient. It first has to split the ordered extent so that there is one ordered extent per bio, so that it can look up the ordered extent on I/O completion in btrfs_record_physical_zoned and store the physical LBA returned by the block driver in the ordered extent. btrfs_rewrite_logical_zoned then has to do a lookup in the chunk tree to see what physical address the logical address for this bio / ordered extent is mapped to, and then rewrite it in the extent tree. To optimize this process, we can store the physical address assigned in the chunk tree to the original logical address and a pointer to btrfs_ordered_sum structure the in the btrfs_bio structure, and then use this information to rewrite the logical address in the btrfs_ordered_sum structure directly at I/O completion time in btrfs_record_physical_zoned. btrfs_rewrite_logical_zoned then simply updates the logical address in the extent tree and the ordered_extent itself. The code in btrfs_rewrite_logical_zoned now runs for all data I/O completions in zoned file systems, which is fine as there is no remapping to do for non-append writes to conventional zones or for relocation, and the overhead for quickly breaking out of the loop is very low. Because zoned file systems now need the ordered_sums structure to record the actual write location returned by zone append, allocate dummy structures without the csum array for them when the I/O doesn't use checksums, and free them when completing the ordered_extent. Note that the btrfs_bio doesn't grow as the new field are places into a union that is so far not used for data writes and has plenty of space left in it. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: rename the bytenr field in struct btrfs_ordered_sum to logical	Christoph Hellwig	2023-06-19	1	-4/+4
\| \| \| \| \| \| \| \| \| \|	btrfs_ordered_sum::bytendr stores a logical address. Make that clear by renaming it to ->logical. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: mark the len field in struct btrfs_ordered_sum as unsigned	Christoph Hellwig	2023-06-19	1	-1/+1
\| \| \| \| \| \| \| \| \|	len can't ever be negative, so mark it as an u32 instead of int. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: drop gfp from parameter extent state helpers	David Sterba	2023-06-19	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that all extent state bit helpers effectively take the GFP_NOFS mask (and GFP_NOWAIT is encoded in the bits) we can remove the parameter. This reduces stack consumption in many functions and simplifies a lot of code. Net effect on module on a release build: text data bss dec hex filename 1250432 20985 16088 1287505 13a551 pre/btrfs.ko 1247074 20985 16088 1284147 139833 post/btrfs.ko DELTA: -3358 Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: open code set_extent_bits	David Sterba	2023-06-19	1	-5/+5
\| \| \| \| \| \| \| \| \|	This helper calls set_extent_bit with two more parameters set to default values, but otherwise it's purpose is not clear. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: use SECTOR_SHIFT to convert LBA to physical offset	Anand Jain	2023-06-19	1	-2/+2
\| \| \| \| \| \| \| \| \|	Using SECTOR_SHIFT to convert LBA to physical address makes it more readable. Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: handle memory allocation failure in btrfs_csum_one_bio	Johannes Thumshirn	2023-05-17	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since f8a53bb58ec7 ("btrfs: handle checksum generation in the storage layer") the failures of btrfs_csum_one_bio() are handled via bio_end_io(). This means, we can return BLK_STS_RESOURCE from btrfs_csum_one_bio() in case the allocation of the ordered sums fails. This also fixes a syzkaller report, where injecting a failure into the kvzalloc() call results in a BUG_ON(). Reported-by: syzbot+d8941552e21eac774778@syzkaller.appspotmail.com Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: fix encoded write i_size corruption with no-holes	Boris Burkov	2023-05-02	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have observed a btrfs filesystem corruption on workloads using no-holes and encoded writes via send stream v2. The symptom is that a file appears to be truncated to the end of its last aligned extent, even though the final unaligned extent and even the file extent and otherwise correctly updated inode item have been written. So if we were writing out a 1MiB+X file via 8 128K extents and one extent of length X, i_size would be set to 1MiB, but the ninth extent, nbyte, etc. would all appear correct otherwise. The source of the race is a narrow (one line of code) window in which a no-holes fs has read in an updated i_size, but has not yet set a shared disk_i_size variable to write. Therefore, if two ordered extents run in parallel (par for the course for receive workloads), the following sequence can play out: (following "threads" a bit loosely, since there are callbacks involved for endio but extra threads aren't needed to cause the issue) ENC-WR1 (second to last) ENC-WR2 (last) ------- ------- btrfs_do_encoded_write set i_size = 1M submit bio B1 ending at 1M endio B1 btrfs_inode_safe_disk_i_size_write local i_size = 1M falls off a cliff for some reason btrfs_do_encoded_write set i_size = 1M+X submit bio B2 ending at 1M+X endio B2 btrfs_inode_safe_disk_i_size_write local i_size = 1M+X disk_i_size = 1M+X disk_i_size = 1M btrfs_delayed_update_inode btrfs_delayed_update_inode And the delayed inode ends up filled with nbytes=1M+X and isize=1M, and writes respect i_size and present a corrupted file missing its last extents. Fix this by holding the inode lock in the no-holes case so that a thread can't sneak in a write to disk_i_size that gets overwritten with an out of date i_size. Fixes: 41a2ee75aab0 ("btrfs: introduce per-inode file extent tree") CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: scrub: introduce helper to find and fill sector info for a scrub_stripe	Qu Wenruo	2023-04-17	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The new helper will search the extent tree to find the first extent of a logical range, then fill the sectors array by two loops: - Loop 1 to fill common bits and metadata generation - Loop 2 to fill csum data (only for data bgs) This loop will use the new btrfs_lookup_csums_bitmap() to fill the full csum buffer, and set scrub_sector_verification::csum. With all the needed info filled by this function, later we only need to submit and verify the stripe. Here we temporarily export the helper to avoid warning on unused static function. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: cleanup the main loop in btrfs_lookup_bio_sums	Christoph Hellwig	2023-04-17	1	-24/+9
\| \| \| \| \| \| \| \| \| \| \|	Introduce a bio_offset variable for the current offset into the bio instead of recalculating it over and over. Remove the now only used once search_len and sector_offset variables, and reduce the scope for count and cur_disk_bytenr. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: remove search_file_offset_in_bio	Christoph Hellwig	2023-04-17	1	-49/+3
\| \| \| \| \| \| \| \| \| \| \| \| \|	There is no need to search for a file offset in a bio, it is now always provided in bbio->file_offset (set at bio allocation time since 0d495430db8d ("btrfs: set bbio->file_offset in alloc_new_bio")). Just use that with the offset into the bio. Reviewed-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: switch search_file_offset_in_bio to return bool	Anand Jain	2023-04-17	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	Function search_file_offset_in_bio() finds the file offset in the file_offset_ret, and we use the return value to indicate if it is successful, so use bool. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: avoid reusing return variable in nested block in btrfs_lookup_bio_sums	Anand Jain	2023-04-17	1	-5/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	The function btrfs_lookup_bio_sums() and a nested if statement declare ret respectively as blk_status_t and int. There is no need to store the return value of search_file_offset_in_bio() to ret as this is a one-time call. Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Anand Jain <anand.jain@oracle.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: simplify the btrfs_csum_one_bio calling convention	Christoph Hellwig	2023-02-15	1	-15/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	To prepare for further bio submission changes btrfs_csum_one_bio should be able to take all it's arguments from the btrfs_bio structure. It can always use the bbio->inode already, and once the compression code is updated to set ->file_offset that one can be used unconditionally as well instead of looking at the page mapping now that btrfs doesn't allow ordered extents to span discontiguous data ranges. The only slightly tricky bit is the one_ordered flag set by the compressed writes. Replace that one with the driver private bio flag, which gets cleared before the bio is handed off to the block layer so that we don't get in the way of driver use. Note: this leaves an argument and a flag to btrfs_wq_submit_bio unused. But that whole mechanism will be removed in its current form in the next patch. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: simplify parameters of btrfs_lookup_bio_sums	Christoph Hellwig	2023-02-15	1	-32/+20
\| \| \| \| \| \| \| \| \| \| \| \| \|	The csums argument is always NULL now, so remove it and always allocate the csums array in the btrfs_bio. Also pass the btrfs_bio instead of inode + bio to document that this function requires a btrfs_bio and not just any bio. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: split the bio submission path into a separate file	Christoph Hellwig	2022-12-05	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The code used by btrfs_submit_bio only interacts with the rest of volumes.c through __btrfs_map_block (which itself is a more generic version of two exported helpers) and does not really have anything to do with volumes.c. Create a new bio.c file and a bio.h header going along with it for the btrfs_bio-based storage layer, which will grow even more going forward. Also update the file with my copyright notice given that a large part of the moved code was written or rewritten by me. Reviewed-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: introduce a bitmap based csum range search function	Qu Wenruo	2022-12-05	1	-3/+124
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Although we have an existing function, btrfs_lookup_csums_range(), to find all data checksums for a range, it's based on a btrfs_ordered_sum list. For the incoming RAID56 data checksum verification at RMW time, we don't want to waste time by allocating temporary memory. So this patch will introduce a new helper, btrfs_lookup_csums_bitmap(). It will use bitmap based result, which will be a perfect fit for later RAID56 usage. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: refactor checksum calculations in btrfs_lookup_csums_range()	Qu Wenruo	2022-12-05	1	-21/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The refactoring involves the following parts: - Introduce bytes_to_csum_size() and csum_size_to_bytes() helpers As we have quite some open-coded calculations, some of them are even split into two assignments just to fit 80 chars limit. - Remove the @csum_size parameter from max_ordered_sum_bytes() Csum size can be fetched from @fs_info. And we will use the csum_size_to_bytes() helper anyway. - Add a comment explaining how we handle the first search result - Use newly introduced helpers to cleanup btrfs_lookup_csums_range() - Move variables declaration to the minimal scope - Never mix number of sectors with bytes There are several locations doing things like: size = min_t(size_t, csum_end - start, max_ordered_sum_bytes(fs_info)); ... size >>= fs_info->sectorsize_bits Or offset = (start - key.offset) >> fs_info->sectorsize_bits; offset *= csum_size; Make sure these variables can only represent BYTES inside the function, by using the above bytes_to_csum_size() helpers. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: remove new_inline argument from btrfs_extent_item_to_extent_map()	Qu Wenruo	2022-12-05	1	-4/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The argument @new_inline changes the following members of extent_map: - em->compress_type - EXTENT_FLAG_COMPRESSED of em->flags However neither members makes a difference for inline extents: - Inline extent read never use above em members As inside btrfs_get_extent() we directly use the file extent item to do the read. - Inline extents are never to be split Thus code really needs em->compress_type or that flag will never be executed on inlined extents. (btrfs_drop_extent_cache() would be one example) - Fiemap no longer relies on extent maps Recent fiemap optimization makes fiemap to search subvolume tree directly, without using any extent map at all. Thus those members make no difference for inline extents any more. Furthermore such exception without much explanation is really a source of confusion. Thus this patch will completely remove the argument, and always set the involved members, unifying the behavior. Signed-off-by: Qu Wenruo <wqu@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: move super_block specific helpers into super.h	Josef Bacik	2022-12-05	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	This will make syncing fs.h to user space a little easier if we can pull the super block specific helpers out of fs.h and put them in super.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: move file-item prototypes into their own header	Josef Bacik	2022-12-05	1	-0/+1
\| \| \| \| \| \| \| \| \|	Move these prototypes out of ctree.h and into file-item.h. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: update function comments	David Sterba	2022-12-05	1	-25/+25
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Update, reformat or reword function comments. This also removes the kdoc marker so we don't get reports when the function name is missing. Changes made: - remove kdoc markers - reformat the brief description to be a proper sentence - reword to imperative voice - align parameter list - fix typos Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: move accessor helpers into accessors.h	Josef Bacik	2022-12-05	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	This is a large patch, but because they're all macros it's impossible to split up. Simply copy all of the item accessors in ctree.h and paste them in accessors.h, and then update any files to include the header so everything compiles. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> [ reformat comments, style fixups ] Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: move the printk helpers out of ctree.h	Josef Bacik	2022-12-05	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	We have a bunch of printk helpers that are in ctree.h. These have nothing to do with ctree.c, so move them into their own header. Subsequent patches will cleanup the printk helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
*	btrfs: move fs wide helpers out of ctree.h	Josef Bacik	2022-12-05	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have several fs wide related helpers in ctree.h. The bulk of these are the incompat flag test helpers, but there are things such as btrfs_fs_closing() and the read only helpers that also aren't directly related to the ctree code. Move these into a fs.h header, which will serve as the location for file system wide related helpers. Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>