summaryrefslogtreecommitdiffstats
path: root/fs/btrfs/tree-log.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Btrfs: fix infinite loop during fsync after rename operationsFilipe Manana2020-01-231-0/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recently fsstress (from fstests) sporadically started to trigger an infinite loop during fsync operations. This turned out to be because support for the rename exchange and whiteout operations was added to fsstress in fstests. These operations, unlike any others in fsstress, cause file names to be reused, whence triggering this issue. However it's not necessary to use rename exchange and rename whiteout operations trigger this issue, simple rename operations and file creations are enough to trigger the issue. The issue boils down to when we are logging inodes that conflict (that had the name of any inode we need to log during the fsync operation), we keep logging them even if they were already logged before, and after that we check if there's any other inode that conflicts with them and then add it again to the list of inodes to log. Skipping already logged inodes fixes the issue. Consider the following example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/testdir # inode 257 $ touch /mnt/testdir/zz # inode 258 $ ln /mnt/testdir/zz /mnt/testdir/zz_link $ touch /mnt/testdir/a # inode 259 $ sync # The following 3 renames achieve the same result as a rename exchange # operation (<rename_exchange> /mnt/testdir/zz_link to /mnt/testdir/a). $ mv /mnt/testdir/a /mnt/testdir/a/tmp $ mv /mnt/testdir/zz_link /mnt/testdir/a $ mv /mnt/testdir/a/tmp /mnt/testdir/zz_link # The following rename and file creation give the same result as a # rename whiteout operation (<rename_whiteout> zz to a2). $ mv /mnt/testdir/zz /mnt/testdir/a2 $ touch /mnt/testdir/zz # inode 260 $ xfs_io -c fsync /mnt/testdir/zz --> results in the infinite loop The following steps happen: 1) When logging inode 260, we find that its reference named "zz" was used by inode 258 in the previous transaction (through the commit root), so inode 258 is added to the list of conflicting indoes that need to be logged; 2) After logging inode 258, we find that its reference named "a" was used by inode 259 in the previous transaction, and therefore we add inode 259 to the list of conflicting inodes to be logged; 3) After logging inode 259, we find that its reference named "zz_link" was used by inode 258 in the previous transaction - we add inode 258 to the list of conflicting inodes to log, again - we had already logged it before at step 3. After logging it again, we find again that inode 259 conflicts with him, and we add again 259 to the list, etc - we end up repeating all the previous steps. So fix this by skipping logging of conflicting inodes that were already logged. Fixes: 6b5fc433a7ad67 ("Btrfs: fix fsync after succession of renames of different files") CC: stable@vger.kernel.org # 5.1+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove redundant WARN_ON in walk_down_log_treeNikolay Borisov2020-01-201-9/+0
| | | | | | | | | | | | | | | | | | | | | level <0 and level >= BTRFS_MAX_LEVEL are already performed upon extent buffer read by tree checker in btrfs_check_node. go. As far as 'level <= 0' we are guaranteed that level is '> 0' because the value of level _before_ reading 'next' is larger than 1 (otherwise we wouldn't have executed that code at all) this in turn guarantees that 'level' after btrfs_read_buffer is 'level - 1' since we verify this invariant in: btrfs_read_buffer btree_read_extent_buffer_pages btrfs_verify_level_key This guarantees that level can never be '<= 0' so the warn on is never triggered. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Remove WARN_ON in walk_log_treeNikolay Borisov2020-01-201-2/+0
| | | | | | | | | | | | | | | | | The log_root passed to walk_log_tree is guaranteed to have its root_key.objectid always be BTRFS_TREE_LOG_OBJECTID. This is by merit that all log roots of an ordinary root are allocated in alloc_log_tree which hard-codes objectid to be BTRFS_TREE_LOG_OBJECTID. In case walk_log_tree is called for a log tree found by btrfs_read_fs_root in btrfs_recover_log_trees, that function already ensures found_key.objectid is BTRFS_TREE_LOG_OBJECTID. No functional changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Rename __btrfs_free_reserved_extent to btrfs_pin_reserved_extentNikolay Borisov2020-01-201-7/+5
| | | | | | | | | | __btrfs_free_reserved_extent now performs the actions of btrfs_free_and_pin_reserved_extent. But this name is a bit of a misnomer, since the extent is not really freed but just pinned. Reflect this in the new name. No semantics changes. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix missing hole after hole punching and fsync when using NO_HOLESFilipe Manana2020-01-201-288/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When using the NO_HOLES feature, if we punch a hole into a file and then fsync it, there are cases where a subsequent fsync will miss the fact that a hole was punched, resulting in the holes not existing after replaying the log tree. Essentially these cases all imply that, tree-log.c:copy_items(), is not invoked for the leafs that delimit holes, because nothing changed those leafs in the current transaction. And it's precisely copy_items() where we currenly detect and log holes, which works as long as the holes are between file extent items in the input leaf or between the beginning of input leaf and the previous leaf or between the last item in the leaf and the next leaf. First example where we miss a hole: *) The extent items of the inode span multiple leafs; *) The punched hole covers a range that affects only the extent items of the first leaf; *) The fsync operation is done in full mode (BTRFS_INODE_NEEDS_FULL_SYNC is set in the inode's runtime flags). That results in the hole not existing after replaying the log tree. For example, if the fs/subvolume tree has the following layout for a particular inode: Leaf N, generation 10: [ ... INODE_ITEM INODE_REF EXTENT_ITEM (0 64K) EXTENT_ITEM (64K 128K) ] Leaf N + 1, generation 10: [ EXTENT_ITEM (128K 64K) ... ] If at transaction 11 we punch a hole coverting the range [0, 128K[, we end up dropping the two extent items from leaf N, but we don't touch the other leaf, so we end up in the following state: Leaf N, generation 11: [ ... INODE_ITEM INODE_REF ] Leaf N + 1, generation 10: [ EXTENT_ITEM (128K 64K) ... ] A full fsync after punching the hole will only process leaf N because it was modified in the current transaction, but not leaf N + 1, since it was not modified in the current transaction (generation 10 and not 11). As a result the fsync will not log any holes, because it didn't process any leaf with extent items. Second example where we will miss a hole: *) An inode as its items spanning 5 (or more) leafs; *) A hole is punched and it covers only the extents items of the 3rd leaf. This resulsts in deleting the entire leaf and not touching any of the other leafs. So the only leaf that is modified in the current transaction, when punching the hole, is the first leaf, which contains the inode item. During the full fsync, the only leaf that is passed to copy_items() is that first leaf, and that's not enough for the hole detection code in copy_items() to determine there's a hole between the last file extent item in the 2nd leaf and the first file extent item in the 3rd leaf (which was the 4th leaf before punching the hole). Fix this by scanning all leafs and punch holes as necessary when doing a full fsync (less common than a non-full fsync) when the NO_HOLES feature is enabled. The lack of explicit file extent items to mark holes makes it necessary to scan existing extents to determine if holes exist. A test case for fstests follows soon. Fixes: 16e7549f045d33 ("Btrfs: incompatible format change to remove hole extents") CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: skip log replay on orphaned rootsJosef Bacik2019-12-131-2/+21
| | | | | | | | | | | | | | | My fsstress modifications coupled with generic/475 uncovered a failure to mount and replay the log if we hit a orphaned root. We do not want to replay the log for an orphan root, but it's completely legitimate to have an orphaned root with a log attached. Fix this by simply skipping replaying the log. We still need to pin it's root node so that we do not overwrite it while replaying other logs, as we re-read the log root at every stage of the replay. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix missing data checksums after replaying a log treeFilipe Manana2019-12-131-3/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When logging a file that has shared extents (reflinked with other files or with itself), we can end up logging multiple checksum items that cover overlapping ranges. This confuses the search for checksums at log replay time causing some checksums to never be added to the fs/subvolume tree. Consider the following example of a file that shares the same extent at offsets 0 and 256Kb: [ bytenr 13893632, offset 64Kb, len 64Kb ] 0 64Kb [ bytenr 13631488, offset 64Kb, len 192Kb ] 64Kb 256Kb [ bytenr 13893632, offset 0, len 256Kb ] 256Kb 512Kb When logging the inode, at tree-log.c:copy_items(), when processing the file extent item at offset 0, we log a checksum item covering the range 13959168 to 14024704, which corresponds to 13893632 + 64Kb and 13893632 + 64Kb + 64Kb, respectively. Later when processing the extent item at offset 256K, we log the checksums for the range from 13893632 to 14155776 (which corresponds to 13893632 + 256Kb). These checksums get merged with the checksum item for the range from 13631488 to 13893632 (13631488 + 256Kb), logged by a previous fsync. So after this we get the two following checksum items in the log tree: (...) item 6 key (EXTENT_CSUM EXTENT_CSUM 13631488) itemoff 3095 itemsize 512 range start 13631488 end 14155776 length 524288 item 7 key (EXTENT_CSUM EXTENT_CSUM 13959168) itemoff 3031 itemsize 64 range start 13959168 end 14024704 length 65536 The first one covers the range from the second one, they overlap. So far this does not cause a problem after replaying the log, because when replaying the file extent item for offset 256K, we copy all the checksums for the extent 13893632 from the log tree to the fs/subvolume tree, since searching for an checksum item for bytenr 13893632 leaves us at the first checksum item, which covers the whole range of the extent. However if we write 64Kb to file offset 256Kb for example, we will not be able to find and copy the checksums for the last 128Kb of the extent at bytenr 13893632, referenced by the file range 384Kb to 512Kb. After writing 64Kb into file offset 256Kb we get the following extent layout for our file: [ bytenr 13893632, offset 64K, len 64Kb ] 0 64Kb [ bytenr 13631488, offset 64Kb, len 192Kb ] 64Kb 256Kb [ bytenr 14155776, offset 0, len 64Kb ] 256Kb 320Kb [ bytenr 13893632, offset 64Kb, len 192Kb ] 320Kb 512Kb After fsync'ing the file, if we have a power failure and then mount the filesystem to replay the log, the following happens: 1) When replaying the file extent item for file offset 320Kb, we lookup for the checksums for the extent range from 13959168 (13893632 + 64Kb) to 14155776 (13893632 + 256Kb), through a call to btrfs_lookup_csums_range(); 2) btrfs_lookup_csums_range() finds the checksum item that starts precisely at offset 13959168 (item 7 in the log tree, shown before); 3) However that checksum item only covers 64Kb of data, and not 192Kb of data; 4) As a result only the checksums for the first 64Kb of data referenced by the file extent item are found and copied to the fs/subvolume tree. The remaining 128Kb of data, file range 384Kb to 512Kb, doesn't get the corresponding data checksums found and copied to the fs/subvolume tree. 5) After replaying the log userspace will not be able to read the file range from 384Kb to 512Kb, because the checksums are missing and resulting in an -EIO error. The following steps reproduce this scenario: $ mkfs.btrfs -f /dev/sdc $ mount /dev/sdc /mnt/sdc $ xfs_io -f -c "pwrite -S 0xa3 0 256K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar $ xfs_io -c "pwrite -S 0xc7 256K 256K" /mnt/sdc/foobar $ xfs_io -c "reflink /mnt/sdc/foobar 320K 0 64K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar $ xfs_io -c "pwrite -S 0xe5 256K 64K" /mnt/sdc/foobar $ xfs_io -c "fsync" /mnt/sdc/foobar <power failure> $ mount /dev/sdc /mnt/sdc $ md5sum /mnt/sdc/foobar md5sum: /mnt/sdc/foobar: Input/output error $ dmesg | tail [165305.003464] BTRFS info (device sdc): no csum found for inode 257 start 401408 [165305.004014] BTRFS info (device sdc): no csum found for inode 257 start 405504 [165305.004559] BTRFS info (device sdc): no csum found for inode 257 start 409600 [165305.005101] BTRFS info (device sdc): no csum found for inode 257 start 413696 [165305.005627] BTRFS info (device sdc): no csum found for inode 257 start 417792 [165305.006134] BTRFS info (device sdc): no csum found for inode 257 start 421888 [165305.006625] BTRFS info (device sdc): no csum found for inode 257 start 425984 [165305.007278] BTRFS info (device sdc): no csum found for inode 257 start 430080 [165305.008248] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1 [165305.009550] BTRFS warning (device sdc): csum failed root 5 ino 257 off 393216 csum 0x1337385e expected csum 0x00000000 mirror 1 Fix this simply by deleting first any checksums, from the log tree, for the range of the extent we are logging at copy_items(). This ensures we do not get checksum items in the log tree that have overlapping ranges. This is a long time issue that has been present since we have the clone (and deduplication) ioctl, and can happen both when an extent is shared between different files and within the same file. A test case for fstests follows soon. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: opencode extent_buffer_getDavid Sterba2019-11-181-1/+1
| | | | | | | | | The helper is trivial and we can understand what the atomic_inc on something named refs does. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: drop unused parameter is_new from btrfs_igetDavid Sterba2019-11-181-8/+6
| | | | | | | | | The parameter is now always set to NULL and could be dropped. The last user was get_default_root but that got reworked in 05dbe6837b60 ("Btrfs: unify subvol= and subvolid= mounting") and the parameter became unused. Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Open-code name_in_log_ref in replay_one_nameNikolay Borisov2019-11-181-29/+25
| | | | | | | | | | | | | | | That function adds unnecessary indirection between backref_in_log and the caller. Furthermore it also "downgrades" backref_in_log's return value to a boolean, when in fact it could very well be an error. Rectify the situation by simply opencoding name_in_log_ref in replay_one_name and properly handling possible return codes from backref_in_log. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ update comment ] Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Properly handle backref_in_log retvalNikolay Borisov2019-11-181-11/+21
| | | | | | | | | | | | | | This function can return a negative error value if btrfs_search_slot errors for whatever reason or if btrfs_alloc_path runs out of memory. This is currently problemattic because backref_in_log is treated by its callers as if it returns boolean. Fix this by adding proper error handling in callers. That also enables the function to return the direct error code from btrfs_search_slot. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Don't opencode btrfs_find_name_in_backref in backref_in_logNikolay Borisov2019-11-181-36/+12
| | | | | | | | | | | Direct replacement, though note that the inside of the loop in btrfs_find_name_in_backref is organized in a slightly different way but is equvalent. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> [ add changelog ] Signed-off-by: David Sterba <dsterba@suse.com>
* Merge tag 'for-5.4-rc2-tag' of ↵Linus Torvalds2019-10-101-9/+27
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs fixes from David Sterba: "A few more stabitly fixes, one build warning fix. - fix inode allocation under NOFS context - fix leak in fiemap due to concurrent append writes - fix log-root tree updates - fix balance convert of single profile on 32bit architectures - silence false positive warning on old GCCs (code moved in rc1)" * tag 'for-5.4-rc2-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: silence maybe-uninitialized warning in clone_range btrfs: fix uninitialized ret in ref-verify btrfs: allocate new inode in NOFS context btrfs: fix balance convert to single on 32-bit host CPUs btrfs: fix incorrect updating of log root tree Btrfs: fix memory leak due to concurrent append writes with fiemap
| * btrfs: fix incorrect updating of log root treeJosef Bacik2019-10-011-9/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've historically had reports of being unable to mount file systems because the tree log root couldn't be read. Usually this is the "parent transid failure", but could be any of the related errors, including "fsid mismatch" or "bad tree block", depending on which block got allocated. The modification of the individual log root items are serialized on the per-log root root_mutex. This means that any modification to the per-subvol log root_item is completely protected. However we update the root item in the log root tree outside of the log root tree log_mutex. We do this in order to allow multiple subvolumes to be updated in each log transaction. This is problematic however because when we are writing the log root tree out we update the super block with the _current_ log root node information. Since these two operations happen independently of each other, you can end up updating the log root tree in between writing out the dirty blocks and setting the super block to point at the current root. This means we'll point at the new root node that hasn't been written out, instead of the one we should be pointing at. Thus whatever garbage or old block we end up pointing at complains when we mount the file system later and try to replay the log. Fix this by copying the log's root item into a local root item copy. Then once we're safely under the log_root_tree->log_mutex we update the root item in the log_root_tree. This way we do not modify the log_root_tree while we're committing it, fixing the problem. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Chris Mason <clm@fb.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Merge tag 'for-5.4-tag' of ↵Linus Torvalds2019-09-191-28/+27
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull btrfs updates from David Sterba: "This continues with work on code refactoring, sanity checks and space handling. There are some less user visible changes, nothing that would particularly stand out. User visible changes: - tree checker, more sanity checks of: - ROOT_ITEM (key, size, generation, level, alignment, flags) - EXTENT_ITEM and METADATA_ITEM checks (key, size, offset, alignment, refs) - tree block reference items - EXTENT_DATA_REF (key, hash, offset) - deprecate flag BTRFS_SUBVOL_CREATE_ASYNC for subvolume creation ioctl, scheduled removal in 5.7 - delete stale and unused UAPI definitions BTRFS_DEV_REPLACE_ITEM_STATE_* - improved export of debugging information available via existing sysfs directory structure - try harder to delete relations between qgroups and allow to delete orphan entries - remove unreliable space checks before relocation starts Core: - space handling: - improved ticket reservations and other high level logic in order to remove special cases - factor flushing infrastructure and use it for different contexts, allows to remove some special case handling - reduce metadata reservation when only updating inodes - reduce global block reserve minimum size (affects small filesystems) - improved overcommit logic wrt global block reserve - tests: - fix memory leaks in extent IO tree - catch all TRIM range Fixes: - fix ENOSPC errors, leading to transaction aborts, when cloning extents - several fixes for inode number cache (mount option inode_cache) - fix potential soft lockups during send when traversing large trees - fix unaligned access to space cache pages with SLUB debug on (PowerPC) Other: - refactoring public/private functions, moving to new or more appropriate files - defines converted to enums - error handling improvements - more assertions and comments - old code deletion" * tag 'for-5.4-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: (138 commits) btrfs: Relinquish CPUs in btrfs_compare_trees btrfs: Don't assign retval of btrfs_try_tree_write_lock/btrfs_tree_read_lock_atomic btrfs: create structure to encode checksum type and length btrfs: turn checksum type define into an enum btrfs: add enospc debug messages for ticket failure btrfs: do not account global reserve in can_overcommit btrfs: use btrfs_try_granting_tickets in update_global_rsv btrfs: always reserve our entire size for the global reserve btrfs: change the minimum global reserve size btrfs: rename btrfs_space_info_add_old_bytes btrfs: remove orig_bytes from reserve_ticket btrfs: fix may_commit_transaction to deal with no partial filling btrfs: rework wake_all_tickets btrfs: refactor the ticket wakeup code btrfs: stop partially refilling tickets when releasing space btrfs: add space reservation tracepoint for reserved bytes btrfs: roll tracepoint into btrfs_space_info_update helper btrfs: do not allow reservations if we have pending tickets btrfs: stop clearing EXTENT_DIRTY in inode I/O tree btrfs: treat RWF_{,D}SYNC writes as sync for CRCs ...
| * btrfs: tie extent buffer and it's token togetherDavid Sterba2019-09-091-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | Further simplifaction of the get/set helpers is possible when the token is uniquely tied to an extent buffer. A condition and an assignment can be avoided. The initializations are moved closer to the first use when the extent buffer is valid. There's one exception in __push_leaf_left where the token is reused. Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Make btrfs_find_name_in_ext_backref return struct btrfs_inode_extrefNikolay Borisov2019-09-091-7/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_find_name_in_ext_backref returns either 0/1 depending on whether it found a backref for the given name. If it returns true then the actual inode_ref struct is returned in one of its parameters. That's pointless, instead refactor the function such that it returns either a pointer to the btrfs_inode_extref or NULL it it didn't find anything. This streamlines the function calling convention. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Make btrfs_find_name_in_backref return btrfs_inode_ref structNikolay Borisov2019-09-091-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | btrfs_find_name_in_backref returns either 0/1 depending on whether it found a backref for the given name. If it returns true then the actual inode_ref struct is returned in one of its parameters. That's pointless, instead refactor the function such that it returns either a pointer to the btrfs_inode_ref or NULL it it didn't find anything. This streamlines the function calling convention. Signed-off-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: move cond_wake_up functions out of ctreeDavid Sterba2019-09-091-0/+1
| | | | | | | | | | | | | | | | | | The file ctree.h serves as a header for everything and has become quite bloated. Split some helpers that are generic and create a new file that should be the catch-all for code that's not btrfs-specific. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: tree-log: use symbolic name for first replay stageDavid Sterba2019-09-091-1/+1
| | | | | | | | | | | | Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: tree-log: convert defines to enumsDavid Sterba2019-09-091-8/+12
| | | | | | | | | | | | | | | | Used only for in-memory state tracking. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * btrfs: Remove unnecessary check from join_running_log_transNikolay Borisov2019-09-091-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | join_running_log_trans checks btrfs_root::log_root outside of btrfs_root::log_mutex to avoid contention on the mutex. Turns out this check is not necessary because the two callers of join_running_log_trans (both of which deal with removing entries from the tree-log during unlink) explicitly check whether the respective inode has been logged in the current transaction. If it hasn't then it won't have any items in the tree-log and call path will return before calling join_running_log_trans. If the check passes, however, then it's guaranteed that btrfs_root::log_root is set because the inode is logged. Those guarantees allows us to remove the speculative as well as the implicity and tricky memory barrier. Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | Btrfs: fix assertion failure during fsync and use of stale transactionFilipe Manana2019-09-121-8/+8
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Sometimes when fsync'ing a file we need to log that other inodes exist and when we need to do that we acquire a reference on the inodes and then drop that reference using iput() after logging them. That generally is not a problem except if we end up doing the final iput() (dropping the last reference) on the inode and that inode has a link count of 0, which can happen in a very short time window if the logging path gets a reference on the inode while it's being unlinked. In that case we end up getting the eviction callback, btrfs_evict_inode(), invoked through the iput() call chain which needs to drop all of the inode's items from its subvolume btree, and in order to do that, it needs to join a transaction at the helper function evict_refill_and_join(). However because the task previously started a transaction at the fsync handler, btrfs_sync_file(), it has current->journal_info already pointing to a transaction handle and therefore evict_refill_and_join() will get that transaction handle from btrfs_join_transaction(). From this point on, two different problems can happen: 1) evict_refill_and_join() will often change the transaction handle's block reserve (->block_rsv) and set its ->bytes_reserved field to a value greater than 0. If evict_refill_and_join() never commits the transaction, the eviction handler ends up decreasing the reference count (->use_count) of the transaction handle through the call to btrfs_end_transaction(), and after that point we have a transaction handle with a NULL ->block_rsv (which is the value prior to the transaction join from evict_refill_and_join()) and a ->bytes_reserved value greater than 0. If after the eviction/iput completes the inode logging path hits an error or it decides that it must fallback to a transaction commit, the btrfs fsync handle, btrfs_sync_file(), gets a non-zero value from btrfs_log_dentry_safe(), and because of that non-zero value it tries to commit the transaction using a handle with a NULL ->block_rsv and a non-zero ->bytes_reserved value. This makes the transaction commit hit an assertion failure at btrfs_trans_release_metadata() because ->bytes_reserved is not zero but the ->block_rsv is NULL. The produced stack trace for that is like the following: [192922.917158] assertion failed: !trans->bytes_reserved, file: fs/btrfs/transaction.c, line: 816 [192922.917553] ------------[ cut here ]------------ [192922.917922] kernel BUG at fs/btrfs/ctree.h:3532! [192922.918310] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [192922.918666] CPU: 2 PID: 883 Comm: fsstress Tainted: G W 5.1.4-btrfs-next-47 #1 [192922.919035] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [192922.919801] RIP: 0010:assfail.constprop.25+0x18/0x1a [btrfs] (...) [192922.920925] RSP: 0018:ffffaebdc8a27da8 EFLAGS: 00010286 [192922.921315] RAX: 0000000000000051 RBX: ffff95c9c16a41c0 RCX: 0000000000000000 [192922.921692] RDX: 0000000000000000 RSI: ffff95cab6b16838 RDI: ffff95cab6b16838 [192922.922066] RBP: ffff95c9c16a41c0 R08: 0000000000000000 R09: 0000000000000000 [192922.922442] R10: ffffaebdc8a27e70 R11: 0000000000000000 R12: ffff95ca731a0980 [192922.922820] R13: 0000000000000000 R14: ffff95ca84c73338 R15: ffff95ca731a0ea8 [192922.923200] FS: 00007f337eda4e80(0000) GS:ffff95cab6b00000(0000) knlGS:0000000000000000 [192922.923579] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [192922.923948] CR2: 00007f337edad000 CR3: 00000001e00f6002 CR4: 00000000003606e0 [192922.924329] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [192922.924711] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [192922.925105] Call Trace: [192922.925505] btrfs_trans_release_metadata+0x10c/0x170 [btrfs] [192922.925911] btrfs_commit_transaction+0x3e/0xaf0 [btrfs] [192922.926324] btrfs_sync_file+0x44c/0x490 [btrfs] [192922.926731] do_fsync+0x38/0x60 [192922.927138] __x64_sys_fdatasync+0x13/0x20 [192922.927543] do_syscall_64+0x60/0x1c0 [192922.927939] entry_SYSCALL_64_after_hwframe+0x49/0xbe (...) [192922.934077] ---[ end trace f00808b12068168f ]--- 2) If evict_refill_and_join() decides to commit the transaction, it will be able to do it, since the nested transaction join only increments the transaction handle's ->use_count reference counter and it does not prevent the transaction from getting committed. This means that after eviction completes, the fsync logging path will be using a transaction handle that refers to an already committed transaction. What happens when using such a stale transaction can be unpredictable, we are at least having a use-after-free on the transaction handle itself, since the transaction commit will call kmem_cache_free() against the handle regardless of its ->use_count value, or we can end up silently losing all the updates to the log tree after that iput() in the logging path, or using a transaction handle that in the meanwhile was allocated to another task for a new transaction, etc, pretty much unpredictable what can happen. In order to fix both of them, instead of using iput() during logging, use btrfs_add_delayed_iput(), so that the logging path of fsync never drops the last reference on an inode, that step is offloaded to a safe context (usually the cleaner kthread). The assertion failure issue was sporadically triggered by the test case generic/475 from fstests, which loads the dm error target while fsstress is running, which lead to fsync failing while logging inodes with -EIO errors and then trying later to commit the transaction, triggering the assertion failure. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix fsync not persisting dentry deletions due to inode evictionsFilipe Manana2019-07-021-2/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to avoid searches on a log tree when unlinking an inode, we check if the inode being unlinked was logged in the current transaction, as well as the inode of its parent directory. When any of the inodes are logged, we proceed to delete directory items and inode reference items from the log, to ensure that if a subsequent fsync of only the inode being unlinked or only of the parent directory when the other is not fsync'ed as well, does not result in the entry still existing after a power failure. That check however is not reliable when one of the inodes involved (the one being unlinked or its parent directory's inode) is evicted, since the logged_trans field is transient, that is, it is not stored on disk, so it is lost when the inode is evicted and loaded into memory again (which is set to zero on load). As a consequence the checks currently being done by btrfs_del_dir_entries_in_log() and btrfs_del_inode_ref_in_log() always return true if the inode was evicted before, regardless of the inode having been logged or not before (and in the current transaction), this results in the dentry being unlinked still existing after a log replay if after the unlink operation only one of the inodes involved is fsync'ed. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ touch /mnt/dir/foo $ xfs_io -c fsync /mnt/dir/foo # Keep an open file descriptor on our directory while we evict inodes. # We just want to evict the file's inode, the directory's inode must not # be evicted. $ ( cd /mnt/dir; while true; do :; done ) & $ pid=$! # Wait a bit to give time to background process to chdir to our test # directory. $ sleep 0.5 # Trigger eviction of the file's inode. $ echo 2 > /proc/sys/vm/drop_caches # Unlink our file and fsync the parent directory. After a power failure # we don't expect to see the file anymore, since we fsync'ed the parent # directory. $ rm -f $SCRATCH_MNT/dir/foo $ xfs_io -c fsync /mnt/dir <power failure> $ mount /dev/sdb /mnt $ ls /mnt/dir foo $ --> file still there, unlink not persisted despite explicit fsync on dir Fix this by checking if the inode has the full_sync bit set in its runtime flags as well, since that bit is set everytime an inode is loaded from disk, or for other less common cases such as after a shrinking truncate or failure to allocate extent maps for holes, and gets cleared after the first fsync. Also consider the inode as possibly logged only if it was last modified in the current transaction (besides having the full_fsync flag set). Fixes: 3a5f1d458ad161 ("Btrfs: Optimize btree walking while logging inodes") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix data loss after inode eviction, renaming it, and fsync itFilipe Manana2019-07-021-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we log an inode, regardless of logging it completely or only that it exists, we always update it as logged (logged_trans and last_log_commit fields of the inode are updated). This is generally fine and avoids future attempts to log it from having to do repeated work that brings no value. However, if we write data to a file, then evict its inode after all the dealloc was flushed (and ordered extents completed), rename the file and fsync it, we end up not logging the new extents, since the rename may result in logging that the inode exists in case the parent directory was logged before. The following reproducer shows and explains how this can happen: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ touch /mnt/dir/foo $ touch /mnt/dir/bar # Do a direct IO write instead of a buffered write because with a # buffered write we would need to make sure dealloc gets flushed and # complete before we do the inode eviction later, and we can not do that # from user space with call to things such as sync(2) since that results # in a transaction commit as well. $ xfs_io -d -c "pwrite -S 0xd3 0 4K" /mnt/dir/bar # Keep the directory dir in use while we evict inodes. We want our file # bar's inode to be evicted but we don't want our directory's inode to # be evicted (if it were evicted too, we would not be able to reproduce # the issue since the first fsync below, of file foo, would result in a # transaction commit. $ ( cd /mnt/dir; while true; do :; done ) & $ pid=$! # Wait a bit to give time for the background process to chdir. $ sleep 0.1 # Evict all inodes, except the inode for the directory dir because it is # currently in use by our background process. $ echo 2 > /proc/sys/vm/drop_caches # fsync file foo, which ends up persisting information about the parent # directory because it is a new inode. $ xfs_io -c fsync /mnt/dir/foo # Rename bar, this results in logging that this inode exists (inode item, # names, xattrs) because the parent directory is in the log. $ mv /mnt/dir/bar /mnt/dir/baz # Now fsync baz, which ends up doing absolutely nothing because of the # rename operation which logged that the inode exists only. $ xfs_io -c fsync /mnt/dir/baz <power failure> $ mount /dev/sdb /mnt $ od -t x1 -A d /mnt/dir/baz 0000000 --> Empty file, data we wrote is missing. Fix this by not updating last_sub_trans of an inode when we are logging only that it exists and the inode was not yet logged since it was loaded from disk (full_sync bit set), this is enough to make btrfs_inode_in_log() return false for this scenario and make us log the inode. The logged_trans of the inode is still always setsince that alone is used to track if names need to be deleted as part of unlink operations. Fixes: 257c62e1bce03e ("Btrfs: avoid tree log commit when there are no changes") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix race updating log root item during fsyncFilipe Manana2019-05-281-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When syncing the log, the final phase of a fsync operation, we need to either create a log root's item or update the existing item in the log tree of log roots, and that depends on the current value of the log root's log_transid - if it's 1 we need to create the log root item, otherwise it must exist already and we update it. Since there is no synchronization between updating the log_transid and checking it for deciding whether the log root's item needs to be created or updated, we end up with a tiny race window that results in attempts to update the item to fail because the item was not yet created: CPU 1 CPU 2 btrfs_sync_log() lock root->log_mutex set log root's log_transid to 1 unlock root->log_mutex btrfs_sync_log() lock root->log_mutex sets log root's log_transid to 2 unlock root->log_mutex update_log_root() sees log root's log_transid with a value of 2 calls btrfs_update_root(), which fails with -EUCLEAN and causes transaction abort Until recently the race lead to a BUG_ON at btrfs_update_root(), but after the recent commit 7ac1e464c4d47 ("btrfs: Don't panic when we can't find a root key") we just abort the current transaction. A sample trace of the BUG_ON() on a SLE12 kernel: ------------[ cut here ]------------ kernel BUG at ../fs/btrfs/root-tree.c:157! Oops: Exception in kernel mode, sig: 5 [#1] SMP NR_CPUS=2048 NUMA pSeries (...) Supported: Yes, External CPU: 78 PID: 76303 Comm: rtas_errd Tainted: G X 4.4.156-94.57-default #1 task: c00000ffa906d010 ti: c00000ff42b08000 task.ti: c00000ff42b08000 NIP: d000000036ae5cdc LR: d000000036ae5cd8 CTR: 0000000000000000 REGS: c00000ff42b0b860 TRAP: 0700 Tainted: G X (4.4.156-94.57-default) MSR: 8000000002029033 <SF,VEC,EE,ME,IR,DR,RI,LE> CR: 22444484 XER: 20000000 CFAR: d000000036aba66c SOFTE: 1 GPR00: d000000036ae5cd8 c00000ff42b0bae0 d000000036bda220 0000000000000054 GPR04: 0000000000000001 0000000000000000 c00007ffff8d37c8 0000000000000000 GPR08: c000000000e19c00 0000000000000000 0000000000000000 3736343438312079 GPR12: 3930373337303434 c000000007a3a800 00000000007fffff 0000000000000023 GPR16: c00000ffa9d26028 c00000ffa9d261f8 0000000000000010 c00000ffa9d2ab28 GPR20: c00000ff42b0bc48 0000000000000001 c00000ff9f0d9888 0000000000000001 GPR24: c00000ffa9d26000 c00000ffa9d261e8 c00000ffa9d2a800 c00000ff9f0d9888 GPR28: c00000ffa9d26028 c00000ffa9d2aa98 0000000000000001 c00000ffa98f5b20 NIP [d000000036ae5cdc] btrfs_update_root+0x25c/0x4e0 [btrfs] LR [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] Call Trace: [c00000ff42b0bae0] [d000000036ae5cd8] btrfs_update_root+0x258/0x4e0 [btrfs] (unreliable) [c00000ff42b0bba0] [d000000036b53610] btrfs_sync_log+0x2d0/0xc60 [btrfs] [c00000ff42b0bce0] [d000000036b1785c] btrfs_sync_file+0x44c/0x4e0 [btrfs] [c00000ff42b0bd80] [c00000000032e300] vfs_fsync_range+0x70/0x120 [c00000ff42b0bdd0] [c00000000032e44c] do_fsync+0x5c/0xb0 [c00000ff42b0be10] [c00000000032e8dc] SyS_fdatasync+0x2c/0x40 [c00000ff42b0be30] [c000000000009488] system_call+0x3c/0x100 Instruction dump: 7f43d378 4bffebb9 60000000 88d90008 3d220000 e8b90000 3b390009 e87a01f0 e8898e08 e8f90000 4bfd48e5 60000000 <0fe00000> e95b0060 39200004 394a0ea0 ---[ end trace 8f2dc8f919cabab8 ]--- So fix this by doing the check of log_transid and updating or creating the log root's item while holding the root's log_mutex. Fixes: 7237f1833601d ("Btrfs: fix tree logs parallel sync") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix fsync not persisting changed attributes of a directoryFilipe Manana2019-05-281-12/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While logging an inode we follow its ancestors and for each one we mark it as logged in the current transaction, even if we have not logged it. As a consequence if we change an attribute of an ancestor, such as the UID or GID for example, and then explicitly fsync it, we end up not logging the inode at all despite returning success to user space, which results in the attribute being lost if a power failure happens after the fsync. Sample reproducer: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/dir $ chown 6007:6007 /mnt/dir $ sync $ chown 9003:9003 /mnt/dir $ touch /mnt/dir/file $ xfs_io -c fsync /mnt/dir/file # fsync our directory after fsync'ing the new file, should persist the # new values for the uid and gid. $ xfs_io -c fsync /mnt/dir <power failure> $ mount /dev/sdb /mnt $ stat -c %u:%g /mnt/dir 6007:6007 --> should be 9003:9003, the uid and gid were not persisted, despite the explicit fsync on the directory prior to the power failure Fix this by not updating the logged_trans field of ancestor inodes when logging an inode, since we have not logged them. Let only future calls to btrfs_log_inode() to mark inodes as logged. This could be triggered by my recent fsync fuzz tester for fstests, for which an fstests patch exists titled "fstests: generic, fsync fuzz tester with fsstress". Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes") CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: avoid fallback to transaction commit during fsync of files with holesFilipe Manana2019-05-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a file that has holes and has file extent items spanning two or more leafs, we can end up falling to back to a full transaction commit due to a logic bug that leads to failure to insert a duplicate file extent item that is meant to represent a hole between the last file extent item of a leaf and the first file extent item in the next leaf. The failure (EEXIST error) leads to a transaction commit (as most errors when logging an inode do). For example, we have the two following leafs: Leaf N: ----------------------------------------------- | ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) | ----------------------------------------------- The file extent item at the end of leaf N has a length of 4Kb, representing the file range from 64K to 68K - 1. Leaf N + 1: ----------------------------------------------- | (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... | ----------------------------------------------- The file extent item at the first slot of leaf N + 1 has a length of 4Kb too, representing the file range from 72K to 76K - 1. During the full fsync path, when we are at tree-log.c:copy_items() with leaf N as a parameter, after processing the last file extent item, that represents the extent at offset 64K, we take a look at the first file extent item at the next leaf (leaf N + 1), and notice there's a 4K hole between the two extents, and therefore we insert a file extent item representing that hole, starting at file offset 68K and ending at offset 72K - 1. However we don't update the value of *last_extent, which is used to represent the end offset (plus 1, non-inclusive end) of the last file extent item inserted in the log, so it stays with a value of 68K and not with a value of 72K. Then, when copy_items() is called for leaf N + 1, because the value of *last_extent is smaller then the offset of the first extent item in the leaf (68K < 72K), we look at the last file extent item in the previous leaf (leaf N) and see it there's a 4K gap between it and our first file extent item (again, 68K < 72K), so we decide to insert a file extent item representing the hole, starting at file offset 68K and ending at offset 72K - 1, this insertion will fail with -EEXIST being returned from btrfs_insert_file_extent() because we already inserted a file extent item representing a hole for this offset (68K) in the previous call to copy_items(), when processing leaf N. The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(), which falls back to a full transaction commit. Fix this by adjusting *last_extent after inserting a hole when we had to look at the next leaf. Fixes: 4ee3fad34a9c ("Btrfs: fix fsync after hole punching when using no-holes feature") Cc: stable@vger.kernel.org # 4.14+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: improve performance on fsync of files with multiple hardlinksFilipe Manana2019-04-291-40/+188
| | | | | | | | | | | | | | | | | | | | | | | Commit 41bd6067692382 ("Btrfs: fix fsync of files with multiple hard links in new directories") introduced a path that makes fsync fallback to a full transaction commit in order to avoid losing hard links and new ancestors of the fsynced inode. That path is triggered only when the inode has more than one hard link and either has a new hard link created in the current transaction or the inode was evicted and reloaded in the current transaction. That path ends up getting triggered very often (hundreds of times) during the course of pgbench benchmarks, resulting in performance drops of about 20%. This change restores the performance by not triggering the full transaction commit in those cases, and instead iterate the fs/subvolume tree in search of all possible new ancestors, for all hard links, to log them. Reported-by: Zhao Yuhu <zyuhu@suse.com> Tested-by: James Wang <jnwang@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove unused parameter fs_info from btrfs_extend_itemDavid Sterba2019-04-291-3/+1
| | | | Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove unused parameter fs_info from btrfs_truncate_itemDavid Sterba2019-04-291-1/+1
| | | | Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: extent-tree: Use btrfs_ref to refactor btrfs_inc_extent_ref()Qu Wenruo2019-04-291-3/+8
| | | | | | | | | Use the new btrfs_ref structure and replace parameter list to clean up the usage of owner and level to distinguish the extent types. Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from trans in btrfs_set_log_full_commitDavid Sterba2019-04-291-10/+9
| | | | | | | We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from trans in btrfs_need_log_full_commitDavid Sterba2019-04-291-3/+3
| | | | | | | We can read fs_info from the transaction and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from eb in clean_tree_blockDavid Sterba2019-04-291-3/+3
| | | | | | | We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: get fs_info from eb in btrfs_exclude_logged_extentsDavid Sterba2019-04-291-1/+1
| | | | | | | We can read fs_info from extent buffer and can drop it from the parameters. Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: move tree block wait and write helpers to tree-logDavid Sterba2019-04-291-0/+11
| | | | | | | | | The wrapper names better describe what's happening so they're not deleted though they're trivial, but at least moved closer to their place of use. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix assertion failure on fsync with NO_HOLES enabledFilipe Manana2019-03-201-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Back in commit a89ca6f24ffe4 ("Btrfs: fix fsync after truncate when no_holes feature is enabled") I added an assertion that is triggered when an inline extent is found to assert that the length of the (uncompressed) data the extent represents is the same as the i_size of the inode, since that is true most of the time I couldn't find or didn't remembered about any exception at that time. Later on the assertion was expanded twice to deal with a case of a compressed inline extent representing a range that matches the sector size followed by an expanding truncate, and another case where fallocate can update the i_size of the inode without adding or updating existing extents (if the fallocate range falls entirely within the first block of the file). These two expansion/fixes of the assertion were done by commit 7ed586d0a8241 ("Btrfs: fix assertion on fsync of regular file when using no-holes feature") and commit 6399fb5a0b69a ("Btrfs: fix assertion failure during fsync in no-holes mode"). These however missed the case where an falloc expands the i_size of an inode to exactly the sector size and inline extent exists, for example: $ mkfs.btrfs -f -O no-holes /dev/sdc $ mount /dev/sdc /mnt $ xfs_io -f -c "pwrite -S 0xab 0 1096" /mnt/foobar wrote 1096/1096 bytes at offset 0 1 KiB, 1 ops; 0.0002 sec (4.448 MiB/sec and 4255.3191 ops/sec) $ xfs_io -c "falloc 1096 3000" /mnt/foobar $ xfs_io -c "fsync" /mnt/foobar Segmentation fault $ dmesg [701253.602385] assertion failed: len == i_size || (len == fs_info->sectorsize && btrfs_file_extent_compression(leaf, extent) != BTRFS_COMPRESS_NONE) || (len < i_size && i_size < fs_info->sectorsize), file: fs/btrfs/tree-log.c, line: 4727 [701253.602962] ------------[ cut here ]------------ [701253.603224] kernel BUG at fs/btrfs/ctree.h:3533! [701253.603503] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI [701253.603774] CPU: 2 PID: 7192 Comm: xfs_io Tainted: G W 5.0.0-rc8-btrfs-next-45 #1 [701253.604054] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.11.2-0-gf9626ccb91-prebuilt.qemu-project.org 04/01/2014 [701253.604650] RIP: 0010:assfail.constprop.23+0x18/0x1a [btrfs] (...) [701253.605591] RSP: 0018:ffffbb48c186bc48 EFLAGS: 00010286 [701253.605914] RAX: 00000000000000de RBX: ffff921d0a7afc08 RCX: 0000000000000000 [701253.606244] RDX: 0000000000000000 RSI: ffff921d36b16868 RDI: ffff921d36b16868 [701253.606580] RBP: ffffbb48c186bcf0 R08: 0000000000000000 R09: 0000000000000000 [701253.606913] R10: 0000000000000003 R11: 0000000000000000 R12: ffff921d05d2de18 [701253.607247] R13: ffff921d03b54000 R14: 0000000000000448 R15: ffff921d059ecf80 [701253.607769] FS: 00007f14da906700(0000) GS:ffff921d36b00000(0000) knlGS:0000000000000000 [701253.608163] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [701253.608516] CR2: 000056087ea9f278 CR3: 00000002268e8001 CR4: 00000000003606e0 [701253.608880] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 [701253.609250] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 [701253.609608] Call Trace: [701253.609994] btrfs_log_inode+0xdfb/0xe40 [btrfs] [701253.610383] btrfs_log_inode_parent+0x2be/0xa60 [btrfs] [701253.610770] ? do_raw_spin_unlock+0x49/0xc0 [701253.611150] btrfs_log_dentry_safe+0x4a/0x70 [btrfs] [701253.611537] btrfs_sync_file+0x3b2/0x440 [btrfs] [701253.612010] ? do_sysinfo+0xb0/0xf0 [701253.612552] do_fsync+0x38/0x60 [701253.612988] __x64_sys_fsync+0x10/0x20 [701253.613360] do_syscall_64+0x60/0x1b0 [701253.613733] entry_SYSCALL_64_after_hwframe+0x49/0xbe [701253.614103] RIP: 0033:0x7f14da4e66d0 (...) [701253.615250] RSP: 002b:00007fffa670fdb8 EFLAGS: 00000246 ORIG_RAX: 000000000000004a [701253.615647] RAX: ffffffffffffffda RBX: 0000000000000001 RCX: 00007f14da4e66d0 [701253.616047] RDX: 000056087ea9c260 RSI: 000056087ea9c260 RDI: 0000000000000003 [701253.616450] RBP: 0000000000000001 R08: 0000000000000020 R09: 0000000000000010 [701253.616854] R10: 000000000000009b R11: 0000000000000246 R12: 000056087ea9c260 [701253.617257] R13: 000056087ea9c240 R14: 0000000000000000 R15: 000056087ea9dd10 (...) [701253.619941] ---[ end trace e088d74f132b6da5 ]--- Updating the assertion again to allow for this particular case would result in a meaningless assertion, plus there is currently no risk of logging content that would result in any corruption after a log replay if the size of the data encoded in an inline extent is greater than the inode's i_size (which is not currently possibe either with or without compression), therefore just remove the assertion. CC: stable@vger.kernel.org # 4.4+ Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: remove WARN_ON in log_dir_itemsJosef Bacik2019-03-131-2/+9
| | | | | | | | | | | | | | | | | When Filipe added the recursive directory logging stuff in 2f2ff0ee5e430 ("Btrfs: fix metadata inconsistencies after directory fsync") he specifically didn't take the directory i_mutex for the children directories that we need to log because of lockdep. This is generally fine, but can lead to this WARN_ON() tripping if we happen to run delayed deletion's in between our first search and our second search of dir_item/dir_indexes for this directory. We expect this to happen, so the WARN_ON() isn't necessary. Drop the WARN_ON() and add a comment so we know why this case can happen. CC: stable@vger.kernel.org # 4.4+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix incorrect file size after shrinking truncate and fsyncFilipe Manana2019-03-131-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If we do a shrinking truncate against an inode which is already present in the respective log tree and then rename it, as part of logging the new name we end up logging an inode item that reflects the old size of the file (the one which we previously logged) and not the new smaller size. The decision to preserve the size previously logged was added by commit 1a4bcf470c886b ("Btrfs: fix fsync data loss after adding hard link to inode") in order to avoid data loss after replaying the log. However that decision is only needed for the case the logged inode size is smaller then the current size of the inode, as explained in that commit's change log. If the current size of the inode is smaller then the previously logged size, we know a shrinking truncate happened and therefore need to use that smaller size. Example to trigger the problem: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo $ xfs_io -c "fsync" /mnt/foo $ xfs_io -c "truncate 3000" /mnt/foo $ mv /mnt/foo /mnt/bar $ xfs_io -c "fsync" /mnt/bar <power failure> $ mount /dev/sdb /mnt $ od -t x1 -A d /mnt/bar 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab * 0008000 Once we rename the file, we log its name (and inode item), and because the inode was already logged before in the current transaction, we log it with a size of 8000 bytes because that is the size we previously logged (with the first fsync). As part of the rename, besides logging the inode, we do also sync the log, which is done since commit d4682ba03ef618 ("Btrfs: sync log after logging new name"), so the next fsync against our inode is effectively a no-op, since no new changes happened since the rename operation. Even if did not sync the log during the rename operation, the same problem (fize size of 8000 bytes instead of 3000 bytes) would be visible after replaying the log if the log ended up getting synced to disk through some other means, such as for example by fsyncing some other modified file. In the example above the fsync after the rename operation is there just because not every filesystem may guarantee logging/journalling the inode (and syncing the log/journal) during the rename operation, for example it is needed for f2fs, but not for ext4 and xfs. Fix this scenario by, when logging a new name (which is triggered by rename and link operations), using the current size of the inode instead of the previously logged inode size. A test case for fstests follows soon. Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695 CC: stable@vger.kernel.org # 4.4+ Reported-by: Seulbae Kim <seulbae@gatech.edu> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: add missing error handling after doing leaf/node binary searchFilipe Manana2019-02-251-0/+2
| | | | | | | | | | | | | | | The function map_private_extent_buffer() can return an -EINVAL error, and it is called by generic_bin_search() which will return back the error. The btrfs_bin_search() function in turn calls generic_bin_search() and the key_search() function calls btrfs_bin_search(), so both can return the -EINVAL error coming from the map_private_extent_buffer() function. Some callers of these functions were ignoring that these functions can return an error, so fix them to deal with error return values. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix fsync after succession of renames and unlink/rmdirFilipe Manana2019-02-251-12/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After a succession of renames operations of different files and unlinking one of them, if we fsync one of the renamed files we can end up with a log that will either fail to replay at mount time or result in a filesystem that is in an inconsistent state. One example scenario: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/testdir $ touch /mnt/testdir/fname1 $ touch /mnt/testdir/fname2 $ sync $ mv /mnt/testdir/fname1 /mnt/testdir/fname3 $ rm -f /mnt/testdir/fname2 $ ln /mnt/testdir/fname3 /mnt/testdir/fname2 $ touch /mnt/testdir/fname1 $ xfs_io -c "fsync" /mnt/testdir/fname1 <power failure> $ mount /dev/sdb /mnt $ umount /mnt $ btrfs check /dev/sdb [1/7] checking root items [2/7] checking extents [3/7] checking free space cache [4/7] checking fs roots root 5 inode 259 errors 2, no orphan item ERROR: errors found in fs roots Opening filesystem to check... Checking filesystem on /dev/sdc UUID: 20e4abb8-5a19-4492-8bb4-6084125c2d0d found 393216 bytes used, error(s) found total csum bytes: 0 total tree bytes: 131072 total fs tree bytes: 32768 total extent tree bytes: 16384 btree space waste bytes: 122986 file data blocks allocated: 262144 referenced 262144 On a kernel without the first patch in this series, titled "[PATCH] Btrfs: fix fsync after succession of renames of different files", we get instead an error when mounting the filesystem due to failure of replaying the log: $ mount /dev/sdb /mnt mount: mount /dev/sdb on /mnt failed: File exists Fix this by logging the parent directory of an inode whenever we find an inode that no longer exists (was unlinked in the current transaction), during the procedure which finds inodes that have old names that collide with new names of other inodes. A test case for fstests follows soon. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix fsync after succession of renames of different filesFilipe Manana2019-02-251-44/+197
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After a succession of rename operations of different files and fsyncing one of them, such that each file gets a new name that corresponds to an old name of another file, we can end up with a log that will cause a failure when attempted to replay at mount time (an EEXIST error). We currently have correct behaviour when such succession of renames involves only two files, but if there are more files involved, we end up not logging all the inodes that are needed, therefore resulting in a failure when attempting to replay the log. Example: $ mkfs.btrfs -f /dev/sdb $ mount /dev/sdb /mnt $ mkdir /mnt/testdir $ touch /mnt/testdir/fname1 $ touch /mnt/testdir/fname2 $ sync $ mv /mnt/testdir/fname1 /mnt/testdir/fname3 $ mv /mnt/testdir/fname2 /mnt/testdir/fname4 $ ln /mnt/testdir/fname3 /mnt/testdir/fname2 $ touch /mnt/testdir/fname1 $ xfs_io -c "fsync" /mnt/testdir/fname1 <power failure> $ mount /dev/sdb /mnt mount: mount /dev/sdb on /mnt failed: File exists So fix this by checking all inode dependencies when logging an inode. That is, if one logged inode A has a new name that matches the old name of some other inode B, check if inode B has a new name that matches the old name of some other inode C, and so on. This fix is implemented not by doing any recursive function calls but by using an iterative method using a linked list that is used in a first-in-first-out fashion. A test case for fstests follows soon. Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: open code now trivial btrfs_set_lock_blockingDavid Sterba2019-02-251-3/+3
| | | | | | | | | | btrfs_set_lock_blocking is now only a simple wrapper around btrfs_set_lock_blocking_write. The name does not bring any semantic value that could not be inferred from the new function so there's no point keeping it. Reviewed-by: Johannes Thumshirn <jthumshirn@suse.de> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: Fix typos in comments and stringsAndrea Gelmini2018-12-171-2/+2
| | | | | | | | | The typos accumulate over time so once in a while time they get fixed in a large patch. Signed-off-by: Andrea Gelmini <andrea.gelmini@gelma.net> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix fsync of files with multiple hard links in new directoriesFilipe Manana2018-12-171-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The log tree has a long standing problem that when a file is fsync'ed we only check for new ancestors, created in the current transaction, by following only the hard link for which the fsync was issued. We follow the ancestors using the VFS' dget_parent() API. This means that if we create a new link for a file in a directory that is new (or in an any other new ancestor directory) and then fsync the file using an old hard link, we end up not logging the new ancestor, and on log replay that new hard link and ancestor do not exist. In some cases, involving renames, the file will not exist at all. Example: mkfs.btrfs -f /dev/sdb mount /dev/sdb /mnt mkdir /mnt/A touch /mnt/foo ln /mnt/foo /mnt/A/bar xfs_io -c fsync /mnt/foo <power failure> In this example after log replay only the hard link named 'foo' exists and directory A does not exist, which is unexpected. In other major linux filesystems, such as ext4, xfs and f2fs for example, both hard links exist and so does directory A after mounting again the filesystem. Checking if any new ancestors are new and need to be logged was added in 2009 by commit 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes"), however only for the ancestors of the hard link (dentry) for which the fsync was issued, instead of checking for all ancestors for all of the inode's hard links. So fix this by tracking the id of the last transaction where a hard link was created for an inode and then on fsync fallback to a full transaction commit when an inode has more than one hard link and at least one new hard link was created in the current transaction. This is the simplest solution since this is not a common use case (adding frequently hard links for which there's an ancestor created in the current transaction and then fsync the file). In case it ever becomes a common use case, a solution that consists of iterating the fs/subvol btree for each hard link and check if any ancestor is new, could be implemented. This solves many unexpected scenarios reported by Jayashree Mohan and Vijay Chidambaram, and for which there is a new test case for fstests under review. Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes") CC: stable@vger.kernel.org # 4.4+ Reported-by: Vijay Chidambaram <vvijay03@gmail.com> Reported-by: Jayashree Mohan <jayashree2912@gmail.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: simpler and more efficient cleanup of a log tree's extent io treeFilipe Manana2018-12-171-14/+2
| | | | | | | | | | | | | | | | | | | | | We currently are in a loop finding each range (corresponding to a btree node/leaf) in a log root's extent io tree and then clean it up. This is a waste of time since we are traversing the extent io tree's rb_tree more times then needed (one for a range lookup and another for cleaning it up) without any good reason. We free the log trees when we are in the critical section of a transaction commit (the transaction state is set to TRANS_STATE_COMMIT_DOING), so it's of great convenience to do everything as fast as possible in order to reduce the time we block other tasks from starting a new transaction. So fix this by traversing the extent io tree once and cleaning up all its records in one go while traversing it. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: remove no longer used logged range variables when logging extentsFilipe Manana2018-12-171-8/+0
| | | | | | | | | | | | | The logged_start and logged_end variables, at btrfs_log_changed_extents, were added in commit 8c6c592831a0 ("btrfs: log csums for all modified extents"). However since the recent simplification for fsync, which makes us wait for all ordered extents to complete before logging extents, we no longer need those variables. Commit a2120a473a80 ("btrfs: clean up the left over logged_list usage") forgot to remove them. Reviewed-by: Nikolay Borisov <nborisov@suse.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* Btrfs: fix missing data checksums after a ranged fsync (msync)Filipe Manana2018-11-061-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recently we got a massive simplification for fsync, where for the fast path we no longer log new extents while their respective ordered extents are still running. However that simplification introduced a subtle regression for the case where we use a ranged fsync (msync). Consider the following example: CPU 0 CPU 1 mmap write to range [2Mb, 4Mb[ mmap write to range [512Kb, 1Mb[ msync range [512K, 1Mb[ --> triggers fast fsync (BTRFS_INODE_NEEDS_FULL_SYNC not set) --> creates extent map A for this range and adds it to list of modified extents --> starts ordered extent A for this range --> waits for it to complete writeback triggered for range [2Mb, 4Mb[ --> create extent map B and adds it to the list of modified extents --> creates ordered extent B --> start looking for and logging modified extents --> logs extent maps A and B --> finds checksums for extent A in the csum tree, but not for extent B fsync (msync) finishes --> ordered extent B finishes and its checksums are added to the csum tree <power cut> After replaying the log, we have the extent covering the range [2Mb, 4Mb[ but do not have the data checksum items covering that file range. This happens because at the very beginning of an fsync (btrfs_sync_file()) we start and wait for IO in the given range [512Kb, 1Mb[ and therefore wait for any ordered extents in that range to complete before we start logging the extents. However if right before we start logging the extent in our range [512Kb, 1Mb[, writeback is started for any other dirty range, such as the range [2Mb, 4Mb[ due to memory pressure or a concurrent fsync or msync (btrfs_sync_file() starts writeback before acquiring the inode's lock), an ordered extent is created for that other range and a new extent map is created to represent that range and added to the inode's list of modified extents. That means that we will see that other extent in that list when collecting extents for logging (done at btrfs_log_changed_extents()) and log the extent before the respective ordered extent finishes - namely before the checksum items are added to the checksums tree, which is where log_extent_csums() looks for the checksums, therefore making us log an extent without logging its checksums. Before that massive simplification of fsync, this wasn't a problem because besides looking for checkums in the checksums tree, we also looked for them in any ordered extent still running. The consequence of data checksums missing for a file range is that users attempting to read the affected file range will get -EIO errors and dmesg reports the following: [10188.358136] BTRFS info (device sdc): no csum found for inode 297 start 57344 [10188.359278] BTRFS warning (device sdc): csum failed root 5 ino 297 off 57344 csum 0x98f94189 expected csum 0x00000000 mirror 1 So fix this by skipping extents outside of our logging range at btrfs_log_changed_extents() and leaving them on the list of modified extents so that any subsequent ranged fsync may collect them if needed. Also, if we find a hole extent outside of the range still log it, just to prevent having gaps between extent items after replaying the log, otherwise fsck will complain when we are not using the NO_HOLES feature (fstest btrfs/056 triggers such case). Fixes: e7175a692765 ("btrfs: remove the wait ordered logic in the log_one_extent path") CC: stable@vger.kernel.org # 4.19+ Reviewed-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* btrfs: move the dio_sem higher up the callchainJosef Bacik2018-10-191-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're getting a lockdep splat because we take the dio_sem under the log_mutex. What we really need is to protect fsync() from logging an extent map for an extent we never waited on higher up, so just guard the whole thing with dio_sem. ====================================================== WARNING: possible circular locking dependency detected 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Not tainted ------------------------------------------------------ aio-dio-invalid/30928 is trying to acquire lock: 0000000092621cfd (&mm->mmap_sem){++++}, at: get_user_pages_unlocked+0x5a/0x1e0 but task is already holding lock: 00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400 which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #5 (&ei->dio_sem){++++}: lock_acquire+0xbd/0x220 down_write+0x51/0xb0 btrfs_log_changed_extents+0x80/0xa40 btrfs_log_inode+0xbaf/0x1000 btrfs_log_inode_parent+0x26f/0xa80 btrfs_log_dentry_safe+0x50/0x70 btrfs_sync_file+0x357/0x540 do_fsync+0x38/0x60 __ia32_sys_fdatasync+0x12/0x20 do_fast_syscall_32+0x9a/0x2f0 entry_SYSENTER_compat+0x84/0x96 -> #4 (&ei->log_mutex){+.+.}: lock_acquire+0xbd/0x220 __mutex_lock+0x86/0xa10 btrfs_record_unlink_dir+0x2a/0xa0 btrfs_unlink+0x5a/0xc0 vfs_unlink+0xb1/0x1a0 do_unlinkat+0x264/0x2b0 do_fast_syscall_32+0x9a/0x2f0 entry_SYSENTER_compat+0x84/0x96 -> #3 (sb_internal#2){.+.+}: lock_acquire+0xbd/0x220 __sb_start_write+0x14d/0x230 start_transaction+0x3e6/0x590 btrfs_evict_inode+0x475/0x640 evict+0xbf/0x1b0 btrfs_run_delayed_iputs+0x6c/0x90 cleaner_kthread+0x124/0x1a0 kthread+0x106/0x140 ret_from_fork+0x3a/0x50 -> #2 (&fs_info->cleaner_delayed_iput_mutex){+.+.}: lock_acquire+0xbd/0x220 __mutex_lock+0x86/0xa10 btrfs_alloc_data_chunk_ondemand+0x197/0x530 btrfs_check_data_free_space+0x4c/0x90 btrfs_delalloc_reserve_space+0x20/0x60 btrfs_page_mkwrite+0x87/0x520 do_page_mkwrite+0x31/0xa0 __handle_mm_fault+0x799/0xb00 handle_mm_fault+0x7c/0xe0 __do_page_fault+0x1d3/0x4a0 async_page_fault+0x1e/0x30 -> #1 (sb_pagefaults){.+.+}: lock_acquire+0xbd/0x220 __sb_start_write+0x14d/0x230 btrfs_page_mkwrite+0x6a/0x520 do_page_mkwrite+0x31/0xa0 __handle_mm_fault+0x799/0xb00 handle_mm_fault+0x7c/0xe0 __do_page_fault+0x1d3/0x4a0 async_page_fault+0x1e/0x30 -> #0 (&mm->mmap_sem){++++}: __lock_acquire+0x42e/0x7a0 lock_acquire+0xbd/0x220 down_read+0x48/0xb0 get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_fast+0xa4/0x150 iov_iter_get_pages+0xc3/0x340 do_direct_IO+0xf93/0x1d70 __blockdev_direct_IO+0x32d/0x1c20 btrfs_direct_IO+0x227/0x400 generic_file_direct_write+0xcf/0x180 btrfs_file_write_iter+0x308/0x58c aio_write+0xf8/0x1d0 io_submit_one+0x3a9/0x620 __ia32_compat_sys_io_submit+0xb2/0x270 do_int80_syscall_32+0x5b/0x1a0 entry_INT80_compat+0x88/0xa0 other info that might help us debug this: Chain exists of: &mm->mmap_sem --> &ei->log_mutex --> &ei->dio_sem Possible unsafe locking scenario: CPU0 CPU1 ---- ---- lock(&ei->dio_sem); lock(&ei->log_mutex); lock(&ei->dio_sem); lock(&mm->mmap_sem); *** DEADLOCK *** 1 lock held by aio-dio-invalid/30928: #0: 00000000cefe6b35 (&ei->dio_sem){++++}, at: btrfs_direct_IO+0x3be/0x400 stack backtrace: CPU: 0 PID: 30928 Comm: aio-dio-invalid Not tainted 4.18.0-rc4-xfstests-00025-g5de5edbaf1d4 #411 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Call Trace: dump_stack+0x7c/0xbb print_circular_bug.isra.37+0x297/0x2a4 check_prev_add.constprop.45+0x781/0x7a0 ? __lock_acquire+0x42e/0x7a0 validate_chain.isra.41+0x7f0/0xb00 __lock_acquire+0x42e/0x7a0 lock_acquire+0xbd/0x220 ? get_user_pages_unlocked+0x5a/0x1e0 down_read+0x48/0xb0 ? get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_unlocked+0x5a/0x1e0 get_user_pages_fast+0xa4/0x150 iov_iter_get_pages+0xc3/0x340 do_direct_IO+0xf93/0x1d70 ? __alloc_workqueue_key+0x358/0x490 ? __blockdev_direct_IO+0x14b/0x1c20 __blockdev_direct_IO+0x32d/0x1c20 ? btrfs_run_delalloc_work+0x40/0x40 ? can_nocow_extent+0x490/0x490 ? kvm_clock_read+0x1f/0x30 ? can_nocow_extent+0x490/0x490 ? btrfs_run_delalloc_work+0x40/0x40 btrfs_direct_IO+0x227/0x400 ? btrfs_run_delalloc_work+0x40/0x40 generic_file_direct_write+0xcf/0x180 btrfs_file_write_iter+0x308/0x58c aio_write+0xf8/0x1d0 ? kvm_clock_read+0x1f/0x30 ? __might_fault+0x3e/0x90 io_submit_one+0x3a9/0x620 ? io_submit_one+0xe5/0x620 __ia32_compat_sys_io_submit+0xb2/0x270 do_int80_syscall_32+0x5b/0x1a0 entry_INT80_compat+0x88/0xa0 CC: stable@vger.kernel.org # 4.14+ Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Josef Bacik <josef@toxicpanda.com> Signed-off-by: David Sterba <dsterba@suse.com>