summaryrefslogtreecommitdiffstats
path: root/fs/xfs (follow)
Commit message (Collapse)AuthorAgeFilesLines
* xfs: change some error-less functions to void typesEric Sandeen2019-05-0211-58/+26
| | | | | | | | | | | | There are several functions which have no opportunity to return an error, and don't contain any ASSERTs which could be argued to be better constructed as error cases. So, make them voids to simplify the callers. Signed-off-by: Eric Sandeen <sandeen@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: add online scrub for superblock countersDarrick J. Wong2019-04-3011-3/+461
| | | | | | | | | | | | | | Teach online scrub how to check the filesystem summary counters. We use the incore delalloc block counter along with the incore AG headers to compute expected values for fdblocks, icount, and ifree, and then check that the percpu counter is within a certain threshold of the expected value. This is done to avoid having to freeze or otherwise lock the filesystem, which means that we're only checking that the counters are fairly close, not that they're exactly correct. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: don't parse the mtpt mount optionChristoph Hellwig2019-04-301-5/+1
| | | | | | | | | The text isn't really any more useful than the default unknown option handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: always rejoin held resources during defer rollDarrick J. Wong2019-04-304-37/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | During testing of xfs/141 on a V4 filesystem, I observed some inconsistent behavior with regards to resources that are held (i.e. remain locked) across a defer roll. The transaction roll always gives the defer roll function a new transaction, even if committing the old transaction fails. However, the defer roll function only rejoins the held resources if the transaction commit succeedied. This means that callers of defer roll have to figure out whether the held resources are attached to the transaction being passed back. Worse yet, if the defer roll was part of a defer finish call, we have a third possibility: the defer finish could pass back a dirty transaction with dirty held resources and an error code. The only sane way to handle all of these scenarios is to require that the code that held the resource either cancel the transaction before unlocking and releasing the resources, or use functions that detach resources from a transaction properly (e.g. xfs_trans_brelse) if they need to drop the reference before committing or cancelling the transaction. In order to make this so, change the defer roll code to join held resources to the new transaction unconditionally and fix all the bhold callers to release the held buffers correctly. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: add missing error check in xfs_prepare_shift()Brian Foster2019-04-261-0/+2
| | | | | | | | | | | | | | | | | xfs_prepare_shift() fails to check the error return from xfs_flush_unmap_range(). If the latter fails, that could lead to an insert/collapse range operation over a delalloc range, which is not supported. Add an error check and return appropriately. This is reproduced rarely by generic/475. Fixes: 7f9f71be84bc ("xfs: extent shifting doesn't fully invalidate page cache") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: scrub should check incore counters against ondisk headersDarrick J. Wong2019-04-261-0/+20
| | | | | | | | In theory, the incore per-AG structure counters should match the ones on disk, so check that. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: allow scrubbers to pause background reclaimDarrick J. Wong2019-04-264-0/+23
| | | | | | | | | | | | | The forthcoming summary counter patch races with regular filesystem activity to compute rough expected values for the counters. This design was chosen to avoid having to freeze the entire filesystem to check the counters, but while that's running we'd prefer to minimize background reclamation activity to reduce the perturbations to the incore free block count. Therefore, provide a way for scrubbers to disable background posteof and cowblock reclamation. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: rename the speculative block allocation reclaim toggle functionsDarrick J. Wong2019-04-264-9/+9
| | | | | | | | | | "reclaim" is used throughout the icache code to mean reclamation of incore inode structures. It's also used for two helper functions that toggle background deletion of speculative preallocations. Separate the second of the two uses to make things less confusing. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: track delayed allocation reservations across the filesystemDarrick J. Wong2019-04-264-3/+51
| | | | | | | | | | | | | Add a percpu counter to track the number of blocks directly reserved for delayed allocations on the data device. This counter (in contrast to i_delayed_blks) does not track allocated CoW staging extents or anything going on with the realtime device. It will be used in the upcoming summary counter scrub function to check the free block counts without having to freeze the filesystem or walk all the inodes to find the delayed allocations. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: fix broken bhold behavior in xrep_roll_ag_transDarrick J. Wong2019-04-261-17/+8
| | | | | | | | | | | In xrep_roll_ag_trans, the transaction roll will always set sc->tp to the new transaction, even if committing the old one fails. A bare transaction roll leaves the buffer(s) locked but not joined to the new transaction, so it's not necessary to release the hold if the roll fails. Remove the incorrect xfs_trans_bhold_release calls. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: unlock inode when xfs_ioctl_setattr_get_trans can't get transactionDarrick J. Wong2019-04-231-1/+1
| | | | | | | | | | | | We passed an inode into xfs_ioctl_setattr_get_trans with join_flags indicating which locks are held on that inode. If we can't allocate a transaction then we need to unlock the inode before we bail out, like all the other error paths do. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: kill the xfs_dqtrx_t typedefDarrick J. Wong2019-04-232-16/+16
| | | | | | | | | There's only a few uses left, so just kill the typedef while we're at it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* xfs: widen inode delalloc block counter to 64-bitsDarrick J. Wong2019-04-232-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | Widen the incore inode's i_delayed_blks counter to be a 64-bit integer. This is necessary to fix an integer overflow problem that can be reproduced easily now that we use the counter to track blocks that are assigned to the inode in memory but not on disk. This includes actual delalloc reservations as well as real extents in the COW fork that are waiting to be remapped into the data fork. These 'delayed mapping' blocks can easily exceed 2^32 blocks if one creates a very large sparse file of size approximately 2^33 bytes with one byte written every 2^23 bytes, sets a very large COW extent size hint of 2^23 blocks, reflinks the first file into a second file, and then writes a single byte every 2^23 blocks in the original file. When this happens, we'll try to create approximately 1024 2^23 extent reservations in the COW fork, which will overflow the counter and cause problems. Note that on x64 we end up filling a 4-byte gap in the structure so this doesn't increase the incore size. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* xfs: widen quota block counters to 64-bit integersDarrick J. Wong2019-04-233-35/+34
| | | | | | | | | | | Widen the incore quota transaction delta structure to treat block counters as 64-bit integers. This is a necessary addition so that we can widen the i_delayed_blks counter to be a 64-bit integer. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Collins <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* xfs: abort unaligned nowait directio earlyDarrick J. Wong2019-04-231-3/+3
| | | | | | | | | | | | | | | Dave Chinner noticed that xfs_file_dio_aio_write returns EAGAIN without dropping the IOLOCK when its deciding not to wait, which means that we leak the IOLOCK there. Since we now make unaligned directio always wait, we have the opportunity to bail out before trying to take the lock, which should reduce the overhead of this never-gonna-work case considerably while also solving the dropped lock problem. Reported-by: Dave Chinner <david@fromorbit.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de>
* xfs: assert that we don't enter agfl freeing with a non-permanent transactionBrian Foster2019-04-231-0/+3
| | | | | | | | | | | | | Block allocation requires a permanent transaction for deferred AGFL frees. Add an assert in the block allocation path to make explicit and obvious to future callers the requirement of a transaction with a permanent reservation. Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: split this out from the previous patch per hch request] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: make tr_growdata a permanent transactionBrian Foster2019-04-231-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | The growdata transaction is used by growfs operations to increase the data size of the filesystem. Part of this sequence involves extending the size of the last preexisting AG in the fs, if necessary. This is implemented by freeing the newly available physical range to the AG. tr_growdata is not a permanent transaction, however, and block allocation transactions must be permanent to handle deferred frees of AGFL blocks. If the grow operation extends an existing AG that requires AGFL fixing, assert failures occur due to a populated dfops list on a non-permanent transaction and the AGFL free does not occur. This is reproduced (rarely) by xfs/104. Change tr_growdata to a permanent transaction with a default log count. This increases initial transaction reservation size, but growfs is an infrequent and non-performance critical operation and so should have minimal impact. Reported-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: add a comment to the assert] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: merge adjacent io completions of the same typeDarrick J. Wong2019-04-161-0/+86
| | | | | | | | | | | | | | | | | | | | | It's possible for pagecache writeback to split up a large amount of work into smaller pieces for throttling purposes or to reduce the amount of time a writeback operation is pending. Whatever the reason, XFS can end up with a bunch of IO completions that call for the same operation to be performed on a contiguous extent mapping. Since mappings are extent based in XFS, we'd prefer to run fewer transactions when we can. When we're processing an ioend on the list of io completions, check to see if the next items on the list are both adjacent and of the same type. If so, we can merge the completions to reduce transaction overhead. On fast storage this doesn't seem to make much of a difference in performance, though the number of transactions for an overnight xfstests run seems to drop by ~5%. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: remove unused m_data_workqueueDarrick J. Wong2019-04-162-10/+1
| | | | | | | Now that we're no longer using m_data_workqueue, remove it. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: implement per-inode writeback completion queuesDarrick J. Wong2019-04-164-12/+48
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When scheduling writeback of dirty file data in the page cache, XFS uses IO completion workqueue items to ensure that filesystem metadata only updates after the write completes successfully. This is essential for converting unwritten extents to real extents at the right time and performing COW remappings. Unfortunately, XFS queues each IO completion work item to an unbounded workqueue, which means that the kernel can spawn dozens of threads to try to handle the items quickly. These threads need to take the ILOCK to update file metadata, which results in heavy ILOCK contention if a large number of the work items target a single file, which is inefficient. Worse yet, the writeback completion threads get stuck waiting for the ILOCK while holding transaction reservations, which can use up all available log reservation space. When that happens, metadata updates to other parts of the filesystem grind to a halt, even if the filesystem could otherwise have handled it. Even worse, if one of the things grinding to a halt happens to be a thread in the middle of a defer-ops finish holding the same ILOCK and trying to obtain more log reservation having exhausted the permanent reservation, we now have an ABBA deadlock - writeback completion has a transaction reserved and wants the ILOCK, and someone else has the ILOCK and wants a transaction reservation. Therefore, we create a per-inode writeback io completion queue + work item. When writeback finishes, it can add the ioend to the per-inode queue and let the single worker item process that queue. This dramatically cuts down on the number of kworkers and ILOCK contention in the system, and seems to have eliminated an occasional deadlock I was seeing while running generic/476. Testing with a program that simulates a heavy random-write workload to a single file demonstrates that the number of kworkers drops from approximately 120 threads per file to 1, without dramatically changing write bandwidth or pagecache access latency. Note that we leave the xfs-conv workqueue's max_active alone because we still want to be able to run ioend processing for as many inodes as the system can handle. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: scrub should only cross-reference with healthy btreesDarrick J. Wong2019-04-163-5/+77
| | | | | | | | Skip cross-referencing with a btree if the health report tells us that it's known to be bad. This should reduce the dmesg spew considerably. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: scrub/repair should update filesystem metadata healthDarrick J. Wong2019-04-165-0/+200
| | | | | | | | Now that we have the ability to track sick metadata in-core, make scrub and repair update those health assessments after doing work. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: hoist the already_fixed variable to the scrub contextDarrick J. Wong2019-04-164-11/+10
| | | | | | | | | Now that we no longer memset the scrub context, we can move the already_fixed variable into the scrub context's state flags instead of passing around pointers to separate stack variables. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: collapse scrub bool state flags into a single unsigned intDarrick J. Wong2019-04-166-12/+17
| | | | | | | | Combine all the boolean state flags in struct xfs_scrub into a single unsigned int, because we're going to be adding more state flags soon. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: refactor scrub context initializationDarrick J. Wong2019-04-161-13/+18
| | | | | | | | | | | It's a little silly how the memset in scrub context initialization forces us to declare stack variables to preserve context variables across a retry. Since the teardown functions already null out most of the ephemeral state (buffer pointers, btree cursors, etc.), just skip the memset and move the initialization as needed. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com>
* xfs: report inode health via bulkstatDarrick J. Wong2019-04-154-1/+50
| | | | | | | | Use space in the bulkstat ioctl structure to report any problems observed with the inode. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: report AG health via AG geometry ioctlDarrick J. Wong2019-04-154-1/+52
| | | | | | | Use the AG geometry info ioctl to report health status too. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: report fs and rt health via geometry structureDarrick J. Wong2019-04-154-2/+73
| | | | | | | | Use our newly expanded geometry structure to report the overall fs and realtime health status. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: add a new ioctl to describe allocation group geometryDarrick J. Wong2019-04-155-0/+93
| | | | | | | Add a new ioctl to describe an allocation group's geometry. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: bump XFS_IOC_FSGEOMETRY to v5 structuresDave Chinner2019-04-154-59/+84
| | | | | | | | | | | | | | | Unfortunately, the V4 XFS_IOC_FSGEOMETRY structure is out of space so we can't just add a new field to it. Hence we need to bump the definition to V5 and and treat the V4 ioctl and structure similar to v1 to v3. While doing this, clean up all the definitions associated with the XFS_IOC_FSGEOMETRY ioctl. Signed-Off-By: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> [darrick: forward port to 5.1, expand structure size to 256 bytes] Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: clear BAD_SUMMARY if unmounting an unhealthy filesystemDarrick J. Wong2019-04-154-0/+81
| | | | | | | | | | If we know the filesystem metadata isn't healthy during unmount, we want to encourage the administrator to run xfs_repair right away. We can't do this if BAD_SUMMARY will cause an unclean log unmount to force summary recalculation, so turn it off if the fs is bad. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: replace the BAD_SUMMARY mount flag with the equivalent health codeDarrick J. Wong2019-04-154-9/+9
| | | | | | | | Replace the BAD_SUMMARY mount flag with calls to the equivalent health tracking code. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: track metadata health statusDarrick J. Wong2019-04-158-0/+485
| | | | | | | | | | Add the necessary in-core metadata fields to keep track of which parts of the filesystem have been observed and which parts were observed to be unhealthy, and print a warning at unmount time if we have unfixed problems. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs,fstrim: fix to return correct minlenWang Shilong2019-04-151-1/+2
| | | | | | | | | | | | | | | | This patch tries to address two problems: 1) return @minlen we used to trim to user space. 2) return EINVAL if granularity is larger than avg size, even most of cases, granularity is small(4K), but if devices return a lager granularity for some reaons (testing, bugs etc), fstrim should return failure directly. Signed-off-by: Wang Shilong <wshilong@ddn.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: don't account extra agfl blocks as availableBrian Foster2019-04-151-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The block allocation AG selection code has parameters that allow a caller to perform multiple allocations from a single AG and transaction (under certain conditions). The parameters specify the total block allocation count required by the transaction and the AG selection code selects and locks an AG that will be able to satisfy the overall requirement. If the available block accounting calculation turns out to be inaccurate and a subsequent allocation call fails with -ENOSPC, the resulting transaction cancel leads to filesystem shutdown because the transaction is dirty. This exact problem can be reproduced with a highly parallel space consumer and fsstress workload running long enough to a large filesystem against -ENOSPC conditions. A bmbt block allocation request made for inode extent to bmap format conversion after an extent allocation is expected to be satisfied by the same AG and the same transaction as the extent allocation. The bmbt block allocation fails, however, because the block availability of the AG has changed since the AG was selected (outside of the blocks used for the extent itself). The inconsistent block availability calculation is caused by the deferred block freeing behavior of the AGFL. This immediately removes extra blocks from the AGFL to free up AGFL slots, but rather than immediately freeing such blocks as was done in the past, the block free is deferred such that said blocks are not available for allocation until the current transaction commits. The AG selection logic currently considers all AGFL blocks as available and executes shortly before any extra AGFL blocks are freed. This means the block availability of the current AG can change before the first allocation even occurs, but in practice a failure is more likely to manifest via a subsequent allocation because extent allocation usually has a contiguity requirement larger than a single block that can't be satisfied from the AGFL. In general, XFS prefers operational robustness to absolute allocation efficiency. In other words, we prefer to return -ENOSPC slightly earlier at the expense of not being able to allocate every last block in an AG to avoid this kind of problem. As such, update the AG block availability calculation to consider extra AGFL blocks as unavailable since they are immediately removed following the calculation and will not become available until the current transaction commits. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: shutdown after buf release in iflush cluster abort pathBrian Foster2019-04-151-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | If xfs_iflush_cluster() fails due to corruption, the error path issues a shutdown and simulates an I/O completion to release the buffer. This code has a couple small problems. First, the shutdown sequence can issue a synchronous log force, which is unsafe to do with buffer locks held. Second, the simulated I/O completion does not guarantee the buffer is async and thus is unlocked and released. For example, if the last operation on the buffer was a read off disk prior to the corruption event, XBF_ASYNC is not set and the buffer is left locked and held upon return. This results in a memory leak as shown by the following message on module unload: BUG xfs_buf (...): Objects remaining in xfs_buf on __kmem_cache_shutdown() Fix both of these problems by setting XBF_ASYNC on the buffer prior to the simulated I/O error and performing the shutdown immediately after ioend processing when the buffer has been released. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: wake commit waiters on CIL abort before log item abortBrian Foster2019-04-151-8/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFS shutdown deadlocks have been reproduced by fstest generic/475. The deadlock signature involves log I/O completion running error handling to abort logged items and waiting for an inode cluster buffer lock in the buffer item unpin handler. The buffer lock is held by xfsaild attempting to flush an inode. The buffer happens to be pinned and so xfs_iflush() triggers an async log force to begin work required to get it unpinned. The log force is blocked waiting on the commit completion, which never occurs and thus leaves the filesystem deadlocked. The root problem is that aborted log I/O completion pots commit completion behind callback completion, which is unexpected for async log forces. Under normal running conditions, an async log force returns to the caller once the CIL ctx has been formatted/submitted and the commit completion event triggered at the tail end of xlog_cil_push(). If the filesystem has shutdown, however, we rely on xlog_cil_committed() to trigger the completion event and it happens to do so after running log item unpin callbacks. This makes it unsafe to invoke an async log force from contexts that hold locks that might also be required in log completion processing. To address this problem, wake commit completion waiters before aborting log items in the log I/O completion handler. This ensures that an async log force will not deadlock on held locks if the filesystem happens to shutdown. Note that it is still unsafe to issue a sync log force while holding such locks because a sync log force explicitly waits on the force completion, which occurs after log I/O completion processing. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: fix use after free in buf log item unlock assertBrian Foster2019-04-151-1/+3
| | | | | | | | | | | | | | | The xfs_buf_log_item ->iop_unlock() callback asserts that the buffer is unlocked when either non-stale or aborted. This assert occurs after the bli refcount has been dropped and the log item potentially freed. The aborted check is thus a potential use after free. This problem has been reproduced with KASAN enabled via generic/475. Fix up xfs_buf_item_unlock() to query aborted state before the bli reference is dropped to prevent a potential use after free. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: serialize unaligned dio writes against all other dio writesBrian Foster2019-03-261-10/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFS applies more strict serialization constraints to unaligned direct writes to accommodate things like direct I/O layer zeroing, unwritten extent conversion, etc. Unaligned submissions acquire the exclusive iolock and wait for in-flight dio to complete to ensure multiple submissions do not race on the same block and cause data corruption. This generally works in the case of an aligned dio followed by an unaligned dio, but the serialization is lost if I/Os occur in the opposite order. If an unaligned write is submitted first and immediately followed by an overlapping, aligned write, the latter submits without the typical unaligned serialization barriers because there is no indication of an unaligned dio still in-flight. This can lead to unpredictable results. To provide proper unaligned dio serialization, require that such direct writes are always the only dio allowed in-flight at one time for a particular inode. We already acquire the exclusive iolock and drain pending dio before submitting the unaligned dio. Wait once more after the dio submission to hold the iolock across the I/O and prevent further submissions until the unaligned I/O completes. This is heavy handed, but consistent with the current pre-submission serialization for unaligned direct writes. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* xfs: prohibit fstrim in norecovery modeDarrick J. Wong2019-03-251-0/+8
| | | | | | | | | | The xfs fstrim implementation uses the free space btrees to find free space that can be discarded. If we haven't recovered the log, the bnobt will be stale and we absolutely *cannot* use stale metadata to zap the underlying storage. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Eric Sandeen <sandeen@redhat.com>
* xfs: always init bma in xfs_bmapi_writeDarrick J. Wong2019-03-191-5/+5
| | | | | | | | | | Always init the tp/ip fields of bma in xfs_bmapi_write so that the bmapi_finish at the bottom never trips over null transaction or inode pointers. Coverity-id: 1443964 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: fix btree scrub checking with regards to root-in-inodeDarrick J. Wong2019-03-191-1/+10
| | | | | | | | | | | In xchk_btree_check_owner, we can be passed a null buffer pointer. This should only happen for the root of a root-in-inode btree type, but we should program defensively in case the btree cursor state ever gets screwed up and we get a null buffer anyway. Coverity-id: 1438713 Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: dabtree scrub needs to range-check levelDarrick J. Wong2019-03-191-0/+5
| | | | | | | | Make sure scrub's dabtree iterator function checks that we're not going deeper in the stack than our cursor permits. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Brian Foster <bfoster@redhat.com>
* xfs: don't trip over uninitialized buffer on extent read of corrupted inodeBrian Foster2019-03-171-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've had rather rare reports of bmap btree block corruption where the bmap root block has a level count of zero. The root cause of the corruption is so far unknown. We do have verifier checks to detect this form of on-disk corruption, but this doesn't cover a memory corruption variant of the problem. The latter is a reasonable possibility because the root block is part of the inode fork and can reside in-core for some time before inode extents are read. If this occurs, it leads to a system crash such as the following: BUG: unable to handle kernel paging request at ffffffff00000221 PF error: [normal kernel read fault] ... RIP: 0010:xfs_trans_brelse+0xf/0x200 [xfs] ... Call Trace: xfs_iread_extents+0x379/0x540 [xfs] xfs_file_iomap_begin_delay+0x11a/0xb40 [xfs] ? xfs_attr_get+0xd1/0x120 [xfs] ? iomap_write_begin.constprop.40+0x2d0/0x2d0 xfs_file_iomap_begin+0x4c4/0x6d0 [xfs] ? __vfs_getxattr+0x53/0x70 ? iomap_write_begin.constprop.40+0x2d0/0x2d0 iomap_apply+0x63/0x130 ? iomap_write_begin.constprop.40+0x2d0/0x2d0 iomap_file_buffered_write+0x62/0x90 ? iomap_write_begin.constprop.40+0x2d0/0x2d0 xfs_file_buffered_aio_write+0xe4/0x3b0 [xfs] __vfs_write+0x150/0x1b0 vfs_write+0xba/0x1c0 ksys_pwrite64+0x64/0xa0 do_syscall_64+0x5a/0x1d0 entry_SYSCALL_64_after_hwframe+0x49/0xbe The crash occurs because xfs_iread_extents() attempts to release an uninitialized buffer pointer as the level == 0 value prevented the buffer from ever being allocated or read. Change the level > 0 assert to an explicit error check in xfs_iread_extents() to avoid crashing the kernel in the event of localized, in-core inode corruption. Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* Merge tag 'xfs-5.1-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2019-03-152-30/+25
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs cleanups from Darrick Wong: "Here's a few more cleanups that trickled in for the merge window. It's all fixes for static checker complaints and slowly unwinding typedef usage. The four patches here have gone through a few days worth of fstest runs with no new problems observed. Summary: - Fix some clang/smatch/sparse warnings about uninitialized variables. - Clean up some typedef usage" * tag 'xfs-5.1-merge-5' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: clean up xfs_dir2_leaf_addname xfs: zero initialize highstale and lowstale in xfs_dir2_leaf_addname xfs: clean up xfs_dir2_leafn_add xfs: Zero initialize highstale and lowstale in xfs_dir2_leafn_add
| * xfs: clean up xfs_dir2_leaf_addnameDarrick J. Wong2019-03-121-18/+15
| | | | | | | | | | | | | | | | Remove typedefs and consolidate local variable initialization. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
| * xfs: zero initialize highstale and lowstale in xfs_dir2_leaf_addnameDarrick J. Wong2019-03-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Smatch complains about the following: fs/xfs/libxfs/xfs_dir2_leaf.c:848 xfs_dir2_leaf_addname() error: uninitialized symbol 'lowstale'. fs/xfs/libxfs/xfs_dir2_leaf.c:849 xfs_dir2_leaf_addname() error: uninitialized symbol 'highstale'. I don't think there's any incorrect behavior associated with the uninitialized variable, but as the author of the previous zero-init patch points out, it's best not to be passing around pointers to uninitialized stack areas. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Allison Henderson <allison.henderson@oracle.com> Reviewed-by: Bill O'Donnell <billodo@redhat.com>
| * xfs: clean up xfs_dir2_leafn_addDarrick J. Wong2019-03-081-12/+8
| | | | | | | | | | | | | | Remove typedefs and consolidate local variable initialization. Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
| * xfs: Zero initialize highstale and lowstale in xfs_dir2_leafn_addNathan Chancellor2019-03-081-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When building with -Wsometimes-uninitialized, Clang warns: fs/xfs/libxfs/xfs_dir2_node.c:481:6: warning: variable 'lowstale' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] fs/xfs/libxfs/xfs_dir2_node.c:481:6: warning: variable 'highstale' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] While it isn't technically wrong, it isn't a problem in practice because highstale and lowstale are only initialized in xfs_dir2_leafn_add when compact is not zero then they are passed to xfs_dir3_leaf_find_entry, where they are initialized before use when compact is zero. Regardless, it's better not to be passing around uninitialized stack memory so zero initialize these variables, which silences this warning. Link: https://github.com/ClangBuiltLinux/linux/issues/393 Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Darrick J. Wong <darrick.wong@oracle.com> Signed-off-by: Darrick J. Wong <darrick.wong@oracle.com>
* | Merge tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-blockLinus Torvalds2019-03-082-4/+6
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull block layer updates from Jens Axboe: "Not a huge amount of changes in this round, the biggest one is that we finally have Mings multi-page bvec support merged. Apart from that, this pull request contains: - Small series that avoids quiescing the queue for sysfs changes that match what we currently have (Aleksei) - Series of bcache fixes (via Coly) - Series of lightnvm fixes (via Mathias) - NVMe pull request from Christoph. Nothing major, just SPDX/license cleanups, RR mp policy (Hannes), and little fixes (Bart, Chaitanya). - BFQ series (Paolo) - Save blk-mq cpu -> hw queue mapping, removing a pointer indirection for the fast path (Jianchao) - fops->iopoll() added for async IO polling, this is a feature that the upcoming io_uring interface will use (Christoph, me) - Partition scan loop fixes (Dongli) - mtip32xx conversion from managed resource API (Christoph) - cdrom registration race fix (Guenter) - MD pull from Song, two minor fixes. - Various documentation fixes (Marcos) - Multi-page bvec feature. This brings a lot of nice improvements with it, like more efficient splitting, larger IOs can be supported without growing the bvec table size, and so on. (Ming) - Various little fixes to core and drivers" * tag 'for-5.1/block-20190302' of git://git.kernel.dk/linux-block: (117 commits) block: fix updating bio's front segment size block: Replace function name in string with __func__ nbd: propagate genlmsg_reply return code floppy: remove set but not used variable 'q' null_blk: fix checking for REQ_FUA block: fix NULL pointer dereference in register_disk fs: fix guard_bio_eod to check for real EOD errors blk-mq: use HCTX_TYPE_DEFAULT but not 0 to index blk_mq_tag_set->map block: optimize bvec iteration in bvec_iter_advance block: introduce mp_bvec_for_each_page() for iterating over page block: optimize blk_bio_segment_split for single-page bvec block: optimize __blk_segment_map_sg() for single-page bvec block: introduce bvec_nth_page() iomap: wire up the iopoll method block: add bio_set_polled() helper block: wire up block device iopoll method fs: add an iopoll method to struct file_operations loop: set GENHD_FL_NO_PART_SCAN after blkdev_reread_part() loop: do not print warn message if partition scan is successful block: bounce: make sure that bvec table is updated ...