summaryrefslogtreecommitdiffstats
path: root/fs/xfs (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linuxLinus Torvalds2024-11-026-65/+62
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull xfs fixes from Carlos Maiolino: - fix a sysbot reported crash on filestreams - Reduce cpu time spent searching for extents in a very fragmented FS - Check for delayed allocations before setting extsize * tag 'xfs-6.12-fixes-6' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux: xfs: streamline xfs_filestream_pick_ag xfs: fix finding a last resort AG in xfs_filestream_pick_ag xfs: Reduce unnecessary searches when searching for the best extents xfs: Check for delayed allocations before setting extsize
| * xfs: streamline xfs_filestream_pick_agChristoph Hellwig2024-10-301-50/+46
| | | | | | | | | | | | | | | | | | | | | | | | Directly return the error from xfs_bmap_longest_free_extent instead of breaking from the loop and handling it there, and use a done label to directly jump to the exist when we found a suitable perag structure to reduce the indentation level and pag/max_pag check complexity in the tail of the function. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
| * xfs: fix finding a last resort AG in xfs_filestream_pick_agChristoph Hellwig2024-10-302-21/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the main loop in xfs_filestream_pick_ag fails to find a suitable AG it tries to just pick the online AG. But the loop for that uses args->pag as loop iterator while the later code expects pag to be set. Fix this by reusing the max_pag case for this last resort, and also add a check for impossible case of no AG just to make sure that the uninitialized pag doesn't even escape in theory. Reported-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com Signed-off-by: Christoph Hellwig <hch@lst.de> Tested-by: syzbot+4125a3c514e3436a02e6@syzkaller.appspotmail.com Fixes: f8f1ed1ab3baba ("xfs: return a referenced perag from filestreams allocator") Cc: <stable@vger.kernel.org> # v6.3 Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
| * xfs: Reduce unnecessary searches when searching for the best extentsChi Zhiling2024-10-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recently, we found that the CPU spent a lot of time in xfs_alloc_ag_vextent_size when the filesystem has millions of fragmented spaces. The reason is that we conducted much extra searching for extents that could not yield a better result, and these searches would cost a lot of time when there were millions of extents to search through. Even if we get the same result length, we don't switch our choice to the new one, so we can definitely terminate the search early. Since the result length cannot exceed the found length, when the found length equals the best result length we already have, we can conclude the search. We did a test in that filesystem: [root@localhost ~]# xfs_db -c freesp /dev/vdb from to extents blocks pct 1 1 215 215 0.01 2 3 994476 1988952 99.99 Before this patch: 0) | xfs_alloc_ag_vextent_size [xfs]() { 0) * 15597.94 us | } After this patch: 0) | xfs_alloc_ag_vextent_size [xfs]() { 0) 19.176 us | } Signed-off-by: Chi Zhiling <chizhiling@kylinos.cn> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
| * xfs: Check for delayed allocations before setting extsizeOjaswin Mujoo2024-10-303-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Extsize should only be allowed to be set on files with no data in it. For this, we check if the files have extents but miss to check if delayed extents are present. This patch adds that check. While we are at it, also refactor this check into a helper since it's used in some other places as well like xfs_inactive() or xfs_ioctl_setattr_xflags() **Without the patch (SUCCEEDS)** $ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536' wrote 1024/1024 bytes at offset 0 1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec) **With the patch (FAILS as expected)** $ xfs_io -c 'open -f testfile' -c 'pwrite 0 1024' -c 'extsize 65536' wrote 1024/1024 bytes at offset 0 1 KiB, 1 ops; 0.0002 sec (4.628 MiB/sec and 4739.3365 ops/sec) xfs_io: FS_IOC_FSSETXATTR testfile: Invalid argument Fixes: e94af02a9cd7 ("[XFS] fix old xfs_setattr mis-merge from irix; mostly harmless esp if not using xfs rt") Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: John Garry <john.g.garry@oracle.com> Signed-off-by: Ojaswin Mujoo <ojaswin@linux.ibm.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | Merge tag 'vfs-6.12-rc6.iomap' of ↵Linus Torvalds2024-11-011-1/+1
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull iomap fixes from Christian Brauner: "Fixes for iomap to prevent data corruption bugs in the fallocate unshare range implementation of fsdax and a small cleanup to turn iomap_want_unshare_iter() into an inline function" * tag 'vfs-6.12-rc6.iomap' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: iomap: turn iomap_want_unshare_iter into an inline function fsdax: dax_unshare_iter needs to copy entire blocks fsdax: remove zeroing code from dax_unshare_iter iomap: share iomap_unshare_iter predicate code with fsdax xfs: don't allocate COW extents when unsharing a hole
| * Merge patch series "fsdax/xfs: unshare range fixes for 6.12"Christian Brauner2024-10-081-1/+1
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Darrick J. Wong <djwong@kernel.org> says: This patchset fixes multiple data corruption bugs in the fallocate unshare range implementation for fsdax. * patches from https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs: fsdax: dax_unshare_iter needs to copy entire blocks fsdax: remove zeroing code from dax_unshare_iter iomap: share iomap_unshare_iter predicate code with fsdax xfs: don't allocate COW extents when unsharing a hole Link: https://lore.kernel.org/r/172796813251.1131942.12184885574609980777.stgit@frogsfrogsfrogs Signed-off-by: Christian Brauner <brauner@kernel.org>
| | * xfs: don't allocate COW extents when unsharing a holeDarrick J. Wong2024-10-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It doesn't make sense to allocate a COW extent when unsharing a hole because holes cannot be shared. Fixes: 1f1397b7218d7 ("xfs: don't allocate into the data fork for an unshare request") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Link: https://lore.kernel.org/r/172796813277.1131942.5486112889531210260.stgit@frogsfrogsfrogs Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | xfs: update the pag for the last AG at recovery timeChristoph Hellwig2024-10-223-3/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently log recovery never updates the in-core perag values for the last allocation group when they were grown by growfs. This leads to btree record validation failures for the alloc, ialloc or finotbt trees if a transaction references this new space. Found by Brian's new growfs recovery stress test. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: don't use __GFP_RETRY_MAYFAIL in xfs_initialize_peragChristoph Hellwig2024-10-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __GFP_RETRY_MAYFAIL increases the likelyhood of allocations to fail, which isn't really helpful during log recovery. Remove the flag and stick to the default GFP_KERNEL policies. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: error out when a superblock buffer update reduces the agcountChristoph Hellwig2024-10-221-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFS currently does not support reducing the agcount, so error out if a logged sb buffer tries to shrink the agcount. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: update the file system geometry after recoverying superblock buffersChristoph Hellwig2024-10-222-8/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Primary superblock buffers that change the file system geometry after a growfs operation can affect the operation of later CIL checkpoints that make use of the newly added space and allocation groups. Apply the changes to the in-memory structures as part of recovery pass 2, to ensure recovery works fine for such cases. In the future we should apply the logic to other updates such as features bits as well. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: merge the perag freeing helpersChristoph Hellwig2024-10-224-37/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is no good reason to have two different routines for freeing perag structures for the unmount and error cases. Add two arguments to specify the range of AGs to free to xfs_free_perag, and use that to replace xfs_free_unused_perag_range. The addition RCU grace period for the error case is harmless, and the extra check for the AG to actually exist is not required now that the callers pass the exact known allocated range. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: pass the exact range to initialize to xfs_initialize_peragChristoph Hellwig2024-10-225-38/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently only the new agcount is passed to xfs_initialize_perag, which requires lookups of existing AGs to skip them and complicates error handling. Also pass the previous agcount so that the range that xfs_initialize_perag operates on is exactly defined. That way the extra lookups can be avoided, and error handling can clean up the exact range from the old count to the last added perag structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: don't fail repairs on metadata files with no attr forkDarrick J. Wong2024-10-221-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a minor bug where we fail repairs on metadata files that do not have attr forks because xrep_metadata_inode_subtype doesn't filter ENOENT. Cc: stable@vger.kernel.org # v6.8 Fixes: 5a8e07e799721b ("xfs: repair the inode core and forks of a metadata inode") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: punch delalloc extents from the COW fork for COW writesChristoph Hellwig2024-10-151-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ->iomap_end is called on a short write to the COW fork it needs to punch stale delalloc data from the COW fork and not the data fork. Ensure that IOMAP_F_NEW is set for new COW fork allocations in xfs_buffered_write_iomap_begin, and then use the IOMAP_F_SHARED flag in xfs_buffered_write_delalloc_punch to decide which fork to punch. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: set IOMAP_F_SHARED for all COW fork allocationsChristoph Hellwig2024-10-151-8/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change to always set xfs_buffered_write_iomap_begin for COW fork allocations even if they don't overlap existing data fork extents, which will allow the iomap_end callback to detect if it has to punch stale delalloc blocks from the COW fork instead of the data fork. It also means we sample the sequence counter for both the data and the COW fork when writing to the COW fork, which ensures we properly revalidate when only COW fork changes happens. This is essentially a revert of commit 72a048c1056a ("xfs: only set IOMAP_F_SHARED when providing a srcmap to a write"). This is fine because the problem that the commit fixed has now been dealt with in iomap by only looking at the actual srcmap and not the fallback to the write iomap. Note that the direct I/O path was never changed and has always set IOMAP_F_SHARED for all COW fork allocations. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: share more code in xfs_buffered_write_iomap_beginChristoph Hellwig2024-10-151-10/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce a local iomap_flags variable so that the code allocating new delalloc blocks in the data fork can fall through to the found_imap label and reuse the code to unlock and fill the iomap. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: support the COW fork in xfs_bmap_punch_delalloc_rangeChristoph Hellwig2024-10-154-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_buffered_write_iomap_begin can also create delallocate reservations that need cleaning up, prepare for that by adding support for the COW fork in xfs_bmap_punch_delalloc_range. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: IOMAP_ZERO and IOMAP_UNSHARE already hold invalidate_lockChristoph Hellwig2024-10-151-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All XFS callers of iomap_zero_range and iomap_file_unshare already hold invalidate_lock, so we can't take it again in iomap_file_buffered_write_punch_delalloc. Use the passed in flags argument to detect if we're called from a zero or unshare operation and don't take the lock again in this case. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: take XFS_MMAPLOCK_EXCL xfs_file_write_zero_eofChristoph Hellwig2024-10-152-1/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_file_write_zero_eof is the only caller of xfs_zero_range that does not take XFS_MMAPLOCK_EXCL (aka the invalidate lock). Currently that is actually the right thing, as an error in the iomap zeroing code will also take the invalidate_lock to clean up, but to fix that deadlock we need a consistent locking pattern first. The only extra thing that XFS_MMAPLOCK_EXCL will lock out are read pagefaults, which isn't really needed here, but also not actively harmful. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: factor out a xfs_file_write_zero_eof helperChristoph Hellwig2024-10-151-58/+82
| | | | | | | | | | | | | | | | | | | | | | | | | | | Split a helper from xfs_file_write_checks that just deal with the post-EOF zeroing to keep the code readable. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | iomap: move locking out of iomap_write_delalloc_releaseChristoph Hellwig2024-10-151-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFS (which currently is the only user of iomap_write_delalloc_release) already holds invalidate_lock for most zeroing operations. To be able to avoid a deadlock it needs to stop taking the lock, but doing so in iomap would leak XFS locking details into iomap. To avoid this require the caller to hold invalidate_lock when calling iomap_write_delalloc_release instead of taking it there. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | iomap: remove iomap_file_buffered_write_punch_delallocChristoph Hellwig2024-10-151-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently iomap_file_buffered_write_punch_delalloc can be called from XFS either with the invalidate lock held or not. To fix this while keeping the locking in the file system and not the iomap library code we'll need to life the locking up into the file system. To prepare for that, open code iomap_file_buffered_write_punch_delalloc in the only caller, and instead export iomap_write_delalloc_release. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: fix integer overflow in xrep_bmapDarrick J. Wong2024-10-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The variable declaration in this function predates the merge of the nrext64 (aka 64-bit extent counters) feature, which means that the variable declaration type is insufficient to avoid an integer overflow. Fix that by redeclaring the variable to be xfs_extnum_t. Coverity-id: 1630958 Fixes: 8f71bede8efd ("xfs: repair inode fork block mapping data structures") Signed-off-by: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: fix a typoAndrew Kreimer2024-10-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Fix a typo in comments. Signed-off-by: Andrew Kreimer <algonell@gmail.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: don't free cowblocks from under dirty pagecache on unshareBrian Foster2024-10-093-7/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fallocate unshare mode explicitly breaks extent sharing. When a command completes, it checks the data fork for any remaining shared extents to determine whether the reflink inode flag and COW fork preallocation can be removed. This logic doesn't consider in-core pagecache and I/O state, however, which means we can unsafely remove COW fork blocks that are still needed under certain conditions. For example, consider the following command sequence: xfs_io -fc "pwrite 0 1k" -c "reflink <file> 0 256k 1k" \ -c "pwrite 0 32k" -c "funshare 0 1k" <file> This allocates a data block at offset 0, shares it, and then overwrites it with a larger buffered write. The overwrite triggers COW fork preallocation, 32 blocks by default, which maps the entire 32k write to delalloc in the COW fork. All but the shared block at offset 0 remains hole mapped in the data fork. The unshare command redirties and flushes the folio at offset 0, removing the only shared extent from the inode. Since the inode no longer maps shared extents, unshare purges the COW fork before the remaining 28k may have written back. This leaves dirty pagecache backed by holes, which writeback quietly skips, thus leaving clean, non-zeroed pagecache over holes in the file. To verify, fiemap shows holes in the first 32k of the file and reads return different data across a remount: $ xfs_io -c "fiemap -v" <file> <file>: EXT: FILE-OFFSET BLOCK-RANGE TOTAL FLAGS ... 1: [8..511]: hole 504 ... $ xfs_io -c "pread -v 4k 8" <file> 00001000: cd cd cd cd cd cd cd cd ........ $ umount <mnt>; mount <dev> <mnt> $ xfs_io -c "pread -v 4k 8" <file> 00001000: 00 00 00 00 00 00 00 00 ........ To avoid this problem, make unshare follow the same rules used for background cowblock scanning and never purge the COW fork for inodes with dirty pagecache or in-flight I/O. Fixes: 46afb0628b86347 ("xfs: only flush the unshared range in xfs_reflink_unshare") Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: skip background cowblock trims on inodes open for writeBrian Foster2024-10-071-8/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The background blockgc scanner runs on a 5m interval by default and trims preallocation (post-eof and cow fork) from inodes that are otherwise idle. Idle effectively means that iolock can be acquired without blocking and that the inode has no dirty pagecache or I/O in flight. This simple mechanism and heuristic has worked fairly well for post-eof speculative preallocations. Support for reflink and COW fork preallocations came sometime later and plugged into the same mechanism, with similar heuristics. Some recent testing has shown that COW fork preallocation may be notably more sensitive to blockgc processing than post-eof preallocation, however. For example, consider an 8GB reflinked file with a COW extent size hint of 1MB. A worst case fully randomized overwrite of this file results in ~8k extents of an average size of ~1MB. If the same workload is interrupted a couple times for blockgc processing (assuming the file goes idle), the resulting extent count explodes to over 100k extents with an average size <100kB. This is significantly worse than ideal and essentially defeats the COW extent size hint mechanism. While this particular test is instrumented, it reflects a fairly reasonable pattern in practice where random I/Os might spread out over a large period of time with varying periods of (in)activity. For example, consider a cloned disk image file for a VM or container with long uptime and variable and bursty usage. A background blockgc scan that races and processes the image file when it happens to be clean and idle can have a significant effect on the future fragmentation level of the file, even when still in use. To help combat this, update the heuristic to skip cowblocks inodes that are currently opened for write access during non-sync blockgc scans. This allows COW fork preallocations to persist for as long as possible unless otherwise needed for functional purposes (i.e. a sync scan), the file is idle and closed, or the inode is being evicted from cache. While here, update the comments to help distinguish performance oriented heuristics from the logic that exists to maintain functional correctness. Suggested-by: Darrick Wong <djwong@kernel.org> Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: support lowmode allocations in xfs_bmap_exact_minlen_extent_allocChristoph Hellwig2024-10-071-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently the debug-only xfs_bmap_exact_minlen_extent_alloc allocation variant fails to drop into the lowmode last resort allocator, and thus can sometimes fail allocations for which the caller has a transaction block reservation. Fix this by using xfs_bmap_btalloc_low_space to do the actual allocation. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: call xfs_bmap_exact_minlen_extent_alloc from xfs_bmap_btallocChristoph Hellwig2024-10-071-48/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_bmap_exact_minlen_extent_alloc duplicates the args setup in xfs_bmap_btalloc. Switch to call it from xfs_bmap_btalloc after doing the basic setup. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: don't ifdef around the exact minlen allocationsChristoph Hellwig2024-10-073-14/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Exact minlen allocations only exist as an error injection tool for debug builds. Currently this is implemented using ifdefs, which means the code isn't even compiled for non-XFS_DEBUG builds. Enhance the compile test coverage by always building the code and use the compilers' dead code elimination to remove it from the generated binary instead. The only downside is that the alloc_minlen_only field is unconditionally added to struct xfs_alloc_args now, but by moving it around and packing it tightly this doesn't actually increase the size of the structure. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: fold xfs_bmap_alloc_userdata into xfs_bmapi_allocateChristoph Hellwig2024-10-071-45/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Userdata and metadata allocations end up in the same allocation helpers. Remove the separate xfs_bmap_alloc_userdata function to make this more clear. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: distinguish extra split from real ENOSPC from xfs_attr_node_try_addnameChristoph Hellwig2024-10-071-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just like xfs_attr3_leaf_split, xfs_attr_node_try_addname can return -ENOSPC both for an actual failure to allocate a disk block, but also to signal the caller to convert the format of the attr fork. Use magic 1 to ask for the conversion here as well. Note that unlike the similar issue in xfs_attr3_leaf_split, this one was only found by code review. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: distinguish extra split from real ENOSPC from xfs_attr3_leaf_splitChristoph Hellwig2024-10-072-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_attr3_leaf_split propagates the need for an extra btree split as -ENOSPC to it's only caller, but the same return value can also be returned from xfs_da_grow_inode when it fails to find free space. Distinguish the two cases by returning 1 for the extra split case instead of overloading -ENOSPC. This can be triggered relatively easily with the pending realtime group support and a file system with a lot of small zones that use metadata space on the main device. In this case every about 5-10th run of xfs/538 runs into the following assert: ASSERT(oldblk->magic == XFS_ATTR_LEAF_MAGIC); in xfs_attr3_leaf_split caused by an allocation failure. Note that the allocation failure is caused by another bug that will be fixed subsequently, but this commit at least sorts out the error handling. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: return bool from xfs_attr3_leaf_addChristoph Hellwig2024-10-073-27/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_attr3_leaf_add only has two potential return values, indicating if the entry could be added or not. Replace the errno return with a bool so that ENOSPC from it can't easily be confused with a real ENOSPC. Remove the return value from the xfs_attr3_leaf_add_work helper entirely, as it always return 0. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: merge xfs_attr_leaf_try_add into xfs_attr_leaf_addnameChristoph Hellwig2024-10-071-102/+74
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xfs_attr_leaf_try_add is only called by xfs_attr_leaf_addname, and merging the two will simplify a following error handling fix. To facilitate this move the remote block state save/restore helpers up in the file so that they don't need forward declarations now. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: Use try_cmpxchg() in xlog_cil_insert_pcp_aggregate()Uros Bizjak2024-10-071-7/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use !try_cmpxchg instead of cmpxchg (*ptr, old, new) != old in xlog_cil_insert_pcp_aggregate(). x86 CMPXCHG instruction returns success in ZF flag, so this change saves a compare after cmpxchg. Also, try_cmpxchg implicitly assigns old *ptr value to "old" when cmpxchg fails. There is no need to re-read the value in the loop. Note that the value from *ptr should be read using READ_ONCE to prevent the compiler from merging, refetching or reordering the read. No functional change intended. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Reviewed-by: Christoph Hellwig <hch@infradead.org> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Darrick J. Wong <djwong@kernel.org> Reviewed-by: Dave Chinner <dchinner@redhat.com> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: scrub: convert comma to semicolonYan Zhen2024-10-071-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Replace a comma between expression statements by a semicolon. Signed-off-by: Yan Zhen <yanzhen@vivo.com> Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* | | xfs: Remove empty declartion in header fileZhang Zekun2024-10-071-2/+0
|/ / | | | | | | | | | | | | | | | | | | The definition of xfs_attr_use_log_assist() has been removed since commit d9c61ccb3b09 ("xfs: move xfs_attr_use_log_assist out of xfs_log.c"). So, Remove the empty declartion in header files. Signed-off-by: Zhang Zekun <zhangzekun11@huawei.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Carlos Maiolino <cem@kernel.org>
* / move asm/unaligned.h to linux/unaligned.hAl Viro2024-10-021-1/+1
|/ | | | | | | | | | | | | | | | | | | asm/unaligned.h is always an include of asm-generic/unaligned.h; might as well move that thing to linux/unaligned.h and include that - there's nothing arch-specific in that header. auto-generated by the following: for i in `git grep -l -w asm/unaligned.h`; do sed -i -e "s/asm\/unaligned.h/linux\/unaligned.h/" $i done for i in `git grep -l -w asm-generic/unaligned.h`; do sed -i -e "s/asm-generic\/unaligned.h/linux\/unaligned.h/" $i done git mv include/asm-generic/unaligned.h include/linux/unaligned.h git mv tools/include/asm-generic/unaligned.h tools/include/linux/unaligned.h sed -i -e "/unaligned.h/d" include/asm-generic/Kbuild sed -i -e "s/__ASM_GENERIC/__LINUX/" include/linux/unaligned.h tools/include/linux/unaligned.h
* Merge tag 'pull-stable-struct_fd' of ↵Linus Torvalds2024-09-233-21/+21
|\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.
| * struct fd: representation changeAl Viro2024-08-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want the compiler to see that fdput() on empty instance is a no-op. The emptiness check is that file reference is NULL, while fdput() is "fput() if FDPUT_FPUT is present in flags". The reason why fdput() on empty instance is a no-op is something compiler can't see - it's that we never generate instances with NULL file reference combined with non-zero flags. It's not that hard to deal with - the real primitives behind fdget() et.al. are returning an unsigned long value, unpacked by (inlined) __to_fd() into the current struct file * + int. The lower bits are used to store flags, while the rest encodes the pointer. Linus suggested that keeping this unsigned long around with the extractions done by inlined accessors should generate a sane code and that turns out to be the case. Namely, turning struct fd into a struct-wrapped unsinged long, with fd_empty(f) => unlikely(f.word == 0) fd_file(f) => (struct file *)(f.word & ~3) fdput(f) => if (f.word & 1) fput(fd_file(f)) ends up with compiler doing the right thing. The cost is the patch footprint, of course - we need to switch f.file to fd_file(f) all over the tree, and it's not doable with simple search and replace; there are false positives, etc. Note that the sole member of that structure is an opaque unsigned long - all accesses should be done via wrappers and I don't want to use a name that would invite manual casts to file pointers, etc. The value of that member is equal either to (unsigned long)p | flags, p being an address of some struct file instance, or to 0 for an empty fd. For now the new predicate (fd_empty(f)) has no users; all the existing checks have form (!fd_file(f)). We will convert to fd_empty() use later; here we only define it (and tell the compiler that it's unlikely to return true). This commit only deals with representation change; there will be followups. Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * introduce fd_file(), convert all accessors to it.Al Viro2024-08-133-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'mm-stable-2024-09-20-02-31' of ↵Linus Torvalds2024-09-213-5/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ...
| * | mm,tmpfs: consider end of file write in shmem_is_hugeRik van Riel2024-09-102-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Take the end of a file write into consideration when deciding whether or not to use huge pages for tmpfs files when the tmpfs filesystem is mounted with huge=within_size This allows large writes that append to the end of a file to automatically use large pages. Doing 4MB sequential writes without fallocate to a 16GB tmpfs file with fio. The numbers without THP or with huge=always stay the same, but the performance with huge=within_size now matches that of huge=always. huge before after 4kB pages 1560 MB/s 1560 MB/s within_size 1560 MB/s 4720 MB/s always: 4720 MB/s 4720 MB/s [akpm@linux-foundation.org: coding-style cleanups] Link: https://lkml.kernel.org/r/20240903111928.7171e60c@imladris.surriel.com Signed-off-by: Rik van Riel <riel@surriel.com> Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com> Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Darrick J. Wong <djwong@kernel.org> Cc: Hugh Dickins <hughd@google.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: kvmalloc: align kvrealloc() with krealloc()Danilo Krummrich2024-09-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Besides the obvious (and desired) difference between krealloc() and kvrealloc(), there is some inconsistency in their function signatures and behavior: - krealloc() frees the memory when the requested size is zero, whereas kvrealloc() simply returns a pointer to the existing allocation. - krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas kvrealloc() does not accept a NULL pointer at all and, if passed, would fault instead. - krealloc() is self-contained, whereas kvrealloc() relies on the caller to provide the size of the previous allocation. Inconsistent behavior throughout allocation APIs is error prone, hence make kvrealloc() behave like krealloc(), which seems superior in all mentioned aspects. Besides that, implementing kvrealloc() by making use of krealloc() and vrealloc() provides oppertunities to grow (and shrink) allocations more efficiently. For instance, vrealloc() can be optimized to allocate and map additional pages to grow the allocation or unmap and free unused pages to shrink the allocation. [dakr@kernel.org: document concurrency restrictions] Link: https://lkml.kernel.org/r/20240725125442.4957-1-dakr@kernel.org [dakr@kernel.org: disable KASAN when switching to vmalloc] Link: https://lkml.kernel.org/r/20240730185049.6244-2-dakr@kernel.org [dakr@kernel.org: properly document __GFP_ZERO behavior] Link: https://lkml.kernel.org/r/20240730185049.6244-5-dakr@kernel.org Link: https://lkml.kernel.org/r/20240722163111.4766-3-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Christian König <christian.koenig@amd.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | Merge tag 'vfs-6.12.blocksize' of ↵Linus Torvalds2024-09-219-47/+51
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs Pull vfs blocksize updates from Christian Brauner: "This contains the vfs infrastructure as well as the xfs bits to enable support for block sizes (bs) larger than page sizes (ps) plus a few fixes to related infrastructure. There has been efforts over the last 16 years to enable enable Large Block Sizes (LBS), that is block sizes in filesystems where bs > page size. Through these efforts we have learned that one of the main blockers to supporting bs > ps in filesystems has been a way to allocate pages that are at least the filesystem block size on the page cache where bs > ps. Thanks to various previous efforts it is possible to support bs > ps in XFS with only a few changes in XFS itself. Most changes are to the page cache to support minimum order folio support for the target block size on the filesystem. A motivation for Large Block Sizes today is to support high-capacity (large amount of Terabytes) QLC SSDs where the internal Indirection Unit (IU) are typically greater than 4k to help reduce DRAM and so in turn cost and space. In practice this then allows different architectures to use a base page size of 4k while still enabling support for block sizes aligned to the larger IUs by relying on high order folios on the page cache when needed. It also allows to take advantage of the drive's support for atomics larger than 4k with buffered IO support in Linux. As described this year at LSFMM, supporting large atomics greater than 4k enables databases to remove the need to rely on their own journaling, so they can disable double buffered writes, which is a feature different cloud providers are already enabling through custom storage solutions" * tag 'vfs-6.12.blocksize' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (22 commits) Documentation: iomap: fix a typo iomap: remove the iomap_file_buffered_write_punch_delalloc return value iomap: pass the iomap to the punch callback iomap: pass flags to iomap_file_buffered_write_punch_delalloc iomap: improve shared block detection in iomap_unshare_iter iomap: handle a post-direct I/O invalidate race in iomap_write_delalloc_release docs:filesystems: fix spelling and grammar mistakes in iomap design page filemap: fix htmldoc warning for mapping_align_index() iomap: make zero range flush conditional on unwritten mappings iomap: fix handling of dirty folios over unwritten extents iomap: add a private argument for iomap_file_buffered_write iomap: remove set_memor_ro() on zero page xfs: enable block size larger than page size support xfs: make the calculation generic in xfs_sb_validate_fsb_count() xfs: expose block size in stat xfs: use kvmalloc for xattr buffers iomap: fix iomap_dio_zero() for fs bs > system page size filemap: cap PTE range to be created to allowed zero fill in folio_map_range() mm: split a folio in minimum folio order chunks readahead: allocate folios with mapping_min_order in readahead ...
| * | | iomap: remove the iomap_file_buffered_write_punch_delalloc return valueChristoph Hellwig2024-09-101-14/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iomap_file_buffered_write_punch_delalloc can only return errors if either the ->punch callback returned an error, or if someone changed the API of mapping_seek_hole_data to return a negative error code that is not -ENXIO. As the only instance of ->punch never returns an error, an such an error would be fatal anyway remove the entire error propagation and don't return an error code from iomap_file_buffered_write_punch_delalloc. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240910043949.3481298-6-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | iomap: pass the iomap to the punch callbackChristoph Hellwig2024-09-101-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFS will need to look at the flags in the iomap structure, so pass it down all the way to the callback. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240910043949.3481298-5-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | iomap: pass flags to iomap_file_buffered_write_punch_delallocChristoph Hellwig2024-09-101-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To fix short write error handling, We'll need to figure out what operation iomap_file_buffered_write_punch_delalloc is called for. Pass the flags argument on to it, and reorder the argument list to match that of ->iomap_end so that the compiler only has to add the new punch argument to the end of it instead of reshuffling the registers. Signed-off-by: Christoph Hellwig <hch@lst.de> Link: https://lore.kernel.org/r/20240910043949.3481298-4-hch@lst.de Reviewed-by: Darrick J. Wong <djwong@kernel.org> Signed-off-by: Christian Brauner <brauner@kernel.org>