path: root/block
Commit message  Author  Age  Files  Lines
* Merge tag 'mm-stable-2024-01-08-15-31' of ↵  Linus Torvalds  2024-01-09  1  -4/+19
|\
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

    Pull MM updates from Andrew Morton:
     "Many singleton patches against the MM code. The patch series which are
      included in this merge do the following:

      - Peng Zhang has done some mapletree maintenance work in the series
          'maple_tree: add mt_free_one() and mt_attr() helpers'
          'Some cleanups of maple tree'

      - In the series 'mm: use memmap_on_memory semantics for dax/kmem'
        Vishal Verma has altered the interworking between memory-hotplug
        and dax/kmem so that newly added 'device memory' can more easily
        have its memmap placed within that newly added memory.

      - Matthew Wilcox continues folio-related work (including a few fixes)
        in the patch series
          'Add folio_zero_tail() and folio_fill_tail()'
          'Make folio_start_writeback return void'
          'Fix fault handler's handling of poisoned tail pages'
          'Convert aops->error_remove_page to ->error_remove_folio'
          'Finish two folio conversions'
          'More swap folio conversions'

      - Kefeng Wang has also contributed folio-related work in the series
          'mm: cleanup and use more folio in page fault'

      - Jim Cromie has improved the kmemleak reporting output in the series
        'tweak kmemleak report format'.

      - In the series 'stackdepot: allow evicting stack traces' Andrey
        Konovalov permits clients (in this case KASAN) to cause eviction of
        no longer needed stack traces.

      - Charan Teja Kalla has fixed some accounting issues in the page
        allocator's atomic reserve calculations in the series 'mm:
        page_alloc: fixes for high atomic reserve caluculations'.

      - Dmitry Rokosov has added to the samples/ directory some sample code
        for a userspace memcg event listener application. See the series
        'samples: introduce cgroup events listeners'.

      - Some mapletree maintenance work from Liam Howlett in the series
        'maple_tree: iterator state changes'.

      - Nhat Pham has improved zswap's approach to writeback in the series
        'workload-specific and memory pressure-driven zswap writeback'.

      - DAMON/DAMOS feature and maintenance work from SeongJae Park in the
        series
          'mm/damon: let users feed and tame/auto-tune DAMOS'
          'selftests/damon: add Python-written DAMON functionality tests'
          'mm/damon: misc updates for 6.8'

      - Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
        memcg: subtree stats flushing and thresholds'.

      - In the series 'Multi-size THP for anonymous memory' Ryan Roberts
        has added a runtime opt-in feature to transparent hugepages which
        improves performance by allocating larger chunks of memory during
        anonymous page faults.

      - Matthew Wilcox has also contributed some cleanup and maintenance
        work against the buffer_head code in the series 'More buffer_head
        cleanups'.

      - Suren Baghdasaryan has done work on Andrea Arcangeli's series
        'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
        compaction algorithms to move userspace's pages around rather than
        UFFDIO_COPY's alloc/copy/free.

      - Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
        Add ksm advisor'. This is a governor which tunes KSM's scanning
        aggressiveness in response to userspace's current needs.

      - Chengming Zhou has optimized zswap's temporary working memory use
        in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.

      - Matthew Wilcox has performed some maintenance work on the writeback
        code, both code and within filesystems. The series is 'Clean up the
        writeback paths'.

      - Andrey Konovalov has optimized KASAN's handling of alloc and free
        stack traces for secondary-level allocators, in the series 'kasan:
        save mempool stack traces'.

      - Andrey also performed some KASAN maintenance work in the series
        'kasan: assorted clean-ups'.

      - David Hildenbrand has gone to town on the rmap code. Cleanups, more
        pte batching, folio conversions and more. See the series 'mm/rmap:
        interface overhaul'.

      - Kinsey Ho has contributed some maintenance work on the MGLRU code
        in the series 'mm/mglru: Kconfig cleanup'.

      - Matthew Wilcox has contributed lruvec page accounting code cleanups
        in the series 'Remove some lruvec page accounting functions'"

    * tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
      mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
      mm, treewide: introduce NR_PAGE_ORDERS
      selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
      selftests/mm: skip test if application doesn't has root privileges
      selftests/mm: conform test to TAP format output
      selftests: mm: hugepage-mmap: conform to TAP format output
      selftests/mm: gup_test: conform test to TAP format output
      mm/selftests: hugepage-mremap: conform test to TAP format output
      mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
      mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
      mm/memcontrol: remove __mod_lruvec_page_state()
      mm/khugepaged: use a folio more in collapse_file()
      slub: use a folio in __kmalloc_large_node
      slub: use folio APIs in free_large_kmalloc()
      slub: use alloc_pages_node() in alloc_slab_page()
      mm: remove inc/dec lruvec page state functions
      mm: ratelimit stat flush from workingset shrinker
      kasan: stop leaking stack trace handles
      mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
      mm/mglru: add dummy pmd_dirty()
      ...
| * fs: convert block_write_full_page to block_write_full_folio  Matthew Wilcox (Oracle)  2023-12-29  1  -3/+18
    Convert the function to be compatible with writepage_t so that it can be
    passed to write_cache_pages() by blkdev. This removes a call to
    compound_head(). We can also remove the function export as both callers
    are built-in.

    Link: https://lkml.kernel.org/r/20231215200245.748418-14-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fs: convert error_remove_page to error_remove_folio  Matthew Wilcox (Oracle)  2023-12-11  1  -1/+1
    There were already assertions that we were not passing a tail page to
    error_remove_page(), so make the compiler enforce that by converting
    everything to pass and use a folio.

    Link: https://lkml.kernel.org/r/20231117161447.2461643-7-willy@infradead.org
    Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Naoya Horiguchi <naoya.horiguchi@nec.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | Merge tag 'vfs-6.8.super' of ↵  Linus Torvalds  2024-01-08  2  -107/+171
|\ \
    git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

    Pull vfs super updates from Christian Brauner:
     "This contains the super work for this cycle including the long-awaited
      series by Jan to make it possible to prevent writing to mounted block
      devices:

      - Writing to mounted devices is dangerous and can lead to filesystem
        corruption as well as crashes. Furthermore syzbot comes with more
        and more involved examples of how to corrupt a block device under a
        mounted filesystem, leading to kernel crashes and reports we can do
        nothing about. Add tracking of writers to each block device and a
        kernel cmdline argument which controls whether other writeable
        opens to block devices opened with the BLK_OPEN_RESTRICT_WRITES
        flag are allowed.

        Note that this effectively only prevents modification of the
        particular block device's page cache by other writers. The actual
        device content can still be modified by other means - e.g. by
        issuing direct scsi commands, by doing writes through devices lower
        in the storage stack (e.g. in case loop devices, DM, or MD are
        involved) etc. But blocking direct modifications of the block
        device page cache is enough to give filesystems a chance to perform
        data validation when loading data from the underlying storage and
        thus prevent kernel crashes.

        Syzbot can use this cmdline argument option to avoid uninteresting
        crashes. Also users whose userspace setup does not need writing to
        mounted block devices can set this option for hardening. We expect
        that this will be interesting to quite a few workloads.

        Btrfs is currently opted out of this because they still haven't
        merged patches we require for this to work from three kernel
        releases ago.

      - Reimplement block device freezing and thawing as holder operations
        on the block device.

        This allows us to extend block device freezing to all devices
        associated with a superblock and not just the main device. It also
        allows us to remove get_active_super() and thus another function
        that scans the global list of superblocks.

        Freezing via additional block devices only works if the filesystem
        chooses to use @fs_holder_ops for these additional devices as well.
        That currently only includes ext4 and xfs.

        Earlier releases switched get_tree_bdev() and mount_bdev() to use
        @fs_holder_ops. The remaining nilfs2 open-coded version of
        mount_bdev() has been converted to rely on @fs_holder_ops as well.
        So block device freezing for the main block device will continue to
        work as before.

        There should be no regressions in functionality. The only special
        case is btrfs where block device freezing for the main block device
        never worked because sb->s_bdev isn't set. Block device freezing
        for btrfs can be fixed once they can switch to @fs_holder_ops but
        that can happen whenever they're ready"

    * tag 'vfs-6.8.super' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs: (27 commits)
      block: Fix a memory leak in bdev_open_by_dev()
      super: don't bother with WARN_ON_ONCE()
      super: massage wait event mechanism
      ext4: Block writes to journal device
      xfs: Block writes to log device
      fs: Block writes to mounted block devices
      btrfs: Do not restrict writes to btrfs devices
      block: Add config option to not allow writing to mounted devices
      block: Remove blkdev_get_by_*() functions
      bcachefs: Convert to bdev_open_by_path()
      fs: handle freezing from multiple devices
      fs: remove dead check
      nilfs2: simplify device handling
      fs: streamline thaw_super_locked
      ext4: simplify device handling
      xfs: simplify device handling
      fs: simplify setup_bdev_super() calls
      blkdev: comment fs_holder_ops
      porting: document block device freeze and thaw changes
      fs: remove unused helper
      ...
| * | block: Fix a memory leak in bdev_open_by_dev()  Christophe JAILLET  2023-12-28  1  -2/+4
    If we exit early here, 'handle' needs to be freed, otherwise the memory
    leaks.

    Fixes: ed5cc702d311 ("block: Add config option to not allow writing to mounted devices")
    Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
    Link: https://lore.kernel.org/r/8eaec334781e695810aaa383b55de00ca4ab1352.1703439383.git.christophe.jaillet@wanadoo.fr
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | block: Add config option to not allow writing to mounted devices  Jan Kara  2023-11-18  2  -1/+94
    Writing to mounted devices is dangerous and can lead to filesystem
    corruption as well as crashes. Furthermore syzbot comes with more and
    more involved examples of how to corrupt a block device under a mounted
    filesystem, leading to kernel crashes and reports we can do nothing
    about. Add tracking of writers to each block device and a kernel cmdline
    argument which controls whether other writeable opens to block devices
    opened with the BLK_OPEN_RESTRICT_WRITES flag are allowed. We will make
    filesystems use this flag for used devices.

    Note that this effectively only prevents modification of the particular
    block device's page cache by other writers. The actual device content
    can still be modified by other means - e.g. by issuing direct scsi
    commands, by doing writes through devices lower in the storage stack
    (e.g. in case loop devices, DM, or MD are involved) etc. But blocking
    direct modifications of the block device page cache is enough to give
    filesystems a chance to perform data validation when loading data from
    the underlying storage and thus prevent kernel crashes.

    Syzbot can use this cmdline argument option to avoid uninteresting
    crashes. Also users whose userspace setup does not need writing to
    mounted block devices can set this option for hardening.

    Link: https://lore.kernel.org/all/60788e5d-5c7c-1142-e554-c21d709acfd9@linaro.org
    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20231101174325.10596-3-jack@suse.cz
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
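    The writer tracking described above can be pictured with a small
    userspace model. This is only a sketch under stated assumptions: the
    struct, the flag values, the allow_write_mounted variable and the
    toy_open() helper are hypothetical stand-ins chosen for illustration,
    not the kernel's actual definitions. The model keeps one counter per
    device, where a positive value counts ordinary writers and -1 means a
    restricting opener has blocked further writers.

```c
#include <assert.h>
#include <stdbool.h>

/* Open-mode bits for this sketch; names echo the commit text, values are made up. */
enum {
    OPEN_WRITE           = 1 << 0,
    OPEN_RESTRICT_WRITES = 1 << 1,
};

/* Hypothetical per-device state: > 0 counts writers, -1 means writes are blocked. */
struct toy_bdev {
    int writers;
};

/* Stand-in for the cmdline switch: true means the restriction is not enforced. */
static bool allow_write_mounted = true;

/* Try to open the device; returns false when the policy refuses the open. */
static bool toy_open(struct toy_bdev *bdev, unsigned mode)
{
    assert(bdev->writers >= -1);

    if (allow_write_mounted)
        return true;            /* policy off: nothing is tracked or refused */

    /* A writer may not open a device whose writes are already blocked. */
    if ((mode & OPEN_WRITE) && bdev->writers == -1)
        return false;
    /* A restricting opener may not take a device that already has writers. */
    if ((mode & OPEN_RESTRICT_WRITES) && bdev->writers > 0)
        return false;

    if (mode & OPEN_RESTRICT_WRITES)
        bdev->writers = -1;     /* block later writers */
    else if (mode & OPEN_WRITE)
        bdev->writers++;        /* remember one more ordinary writer */
    return true;
}
```

    With enforcement on, a plain writer followed by a restricting opener is
    refused, and so is the reverse order; read-only opens are never affected
    in this model.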
| * | block: Remove blkdev_get_by_*() functions  Jan Kara  2023-11-18  1  -64/+30
    blkdev_get_by_*() and blkdev_put() functions are now unused. Remove them.

    Acked-by: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Jan Kara <jack@suse.cz>
    Link: https://lore.kernel.org/r/20231101174325.10596-2-jack@suse.cz
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | bdev: implement freeze and thaw holder operations  Christian Brauner  2023-11-18  1  -33/+32
    The old method of implementing block device freeze and thaw operations
    required us to rely on get_active_super() to walk the list of all
    superblocks on the system to find any superblock that might use the
    block device. This is wasteful and not very pleasant overall. Now that
    we can finally go straight from block device to owning superblock things
    become way simpler.

    Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-5-599c19f4faac@kernel.org
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | bdev: surface the error from sync_blockdev()  Christian Brauner  2023-11-18  1  -1/+1
    When freeze_super() is called, sync_filesystem() will be called which
    calls sync_blockdev() and already surfaces any errors. Do the same for
    block devices that aren't owned by a superblock and also for filesystems
    that don't call sync_blockdev() internally but implicitly rely on
    bdev_freeze() to do it.

    Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-3-599c19f4faac@kernel.org
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | bdev: rename freeze and thaw helpers  Christian Brauner  2023-11-18  1  -9/+13
    We have bdev_mark_dead() etc and we're going to move block device
    freezing to holder ops in the next patch. Make the naming consistent:

    * freeze_bdev() -> bdev_freeze()
    * thaw_bdev()   -> bdev_thaw()

    Also document the return code.

    Link: https://lore.kernel.org/r/20231024-vfs-super-freeze-v2-2-599c19f4faac@kernel.org
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
* | | Merge tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux  Linus Torvalds  2023-12-29  1  -2/+4
|\ \ \
| |_|/
|/| |
    Pull block fixes from Jens Axboe:
     "Fix for a badly numbered flag, and a regression fix for the badblocks
      updates from this merge window"

    * tag 'block-6.7-2023-12-29' of git://git.kernel.dk/linux:
      block: renumber QUEUE_FLAG_HW_WC
      badblocks: avoid checking invalid range in badblocks_check()
| * | badblocks: avoid checking invalid range in badblocks_check()  Coly Li  2023-12-24  1  -2/+4
    If prev_badblocks() returns '-1', it means there is no valid badblocks
    record before the checking range. It doesn't make sense to check whether
    the input checking range overlaps with a non-existent front range.

    This patch checks whether 'prev >= 0' is true before calling
    overlap_front(), to avoid such invalid operations.

    Fixes: 3ea3354cb9f0 ("badblocks: improve badblocks_check() for multiple ranges handling")
    Reported-and-tested-by: Ira Weiny <ira.weiny@intel.com>
    Signed-off-by: Coly Li <colyli@suse.de>
    Link: https://lore.kernel.org/nvdimm/3035e75a-9be0-4bc3-8d4a-6e52c207f277@leemhuis.info/
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Geliang Tang <geliang.tang@suse.com>
    Cc: Hannes Reinecke <hare@suse.de>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Vishal L Verma <vishal.l.verma@intel.com>
    Cc: Xiao Ni <xni@redhat.com>
    Link: https://lore.kernel.org/r/20231224002820.20234-1-colyli@suse.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
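    The guard described above can be illustrated with a simplified table of
    bad ranges. This is a sketch, assuming a sorted array of (start, len)
    pairs; struct toy_range and the toy_* helpers are hypothetical and much
    simpler than the kernel's badblocks code. The point it shows is that the
    "does the previous range cover my start" test is only meaningful when a
    previous range exists.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical bad-range record; the table is assumed sorted by start. */
struct toy_range {
    unsigned long start;
    unsigned long len;
};

/* Index of the last range starting at or before s, or -1 if there is none. */
static int toy_prev(const struct toy_range *tbl, size_t count, unsigned long s)
{
    int prev = -1;

    for (size_t i = 0; i < count && tbl[i].start <= s; i++)
        prev = (int)i;
    return prev;
}

/* True if sector s lies inside range prev; only valid for a real index. */
static bool toy_overlap_front(const struct toy_range *tbl, int prev,
                              unsigned long s)
{
    assert(prev >= 0);
    return s < tbl[prev].start + tbl[prev].len;
}

/* The guarded check: never consult the front range when prev is -1. */
static bool toy_starts_in_bad_range(const struct toy_range *tbl, size_t count,
                                    unsigned long s)
{
    int prev = toy_prev(tbl, count, s);

    return prev >= 0 && toy_overlap_front(tbl, prev, s);
}
```

    Without the `prev >= 0` test, a lookup that starts before the first
    record would index the table with -1.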
* | | Merge tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux  Linus Torvalds  2023-12-01  3  -4/+26
|\| |
    Pull block fixes from Jens Axboe:

     - NVMe pull request via Keith:
         - Invalid namespace identification error handling (Marizio Ewan,
           Keith)
         - Fabrics keep-alive tuning (Mark)

     - Fix for a bad error check regression in bcache (Markus)

     - Fix for a performance regression with O_DIRECT (Ming)

     - Fix for a flush related deadlock (Ming)

     - Make the read-only warn on per-partition (Yu)

    * tag 'block-6.7-2023-12-01' of git://git.kernel.dk/linux:
      nvme-core: check for too small lba shift
      blk-mq: don't count completed flush data request as inflight in case of quiesce
      block: Document the role of the two attribute groups
      block: warn once for each partition in bio_check_ro()
      block: move .bd_inode into 1st cacheline of block_device
      nvme: check for valid nvme_identify_ns() before using it
      nvme-core: fix a memory leak in nvme_ns_info_from_identify()
      nvme: fine-tune sending of first keep-alive
      bcache: revert replacing IS_ERR_OR_NULL with IS_ERR
| * | blk-mq: don't count completed flush data request as inflight in case of quiesce  Ming Lei  2023-12-01  1  -1/+13
    Request queue quiesce may interrupt a flush sequence, and the original
    request may have been marked as COMPLETE, but can't get finished because
    of the queue quiesce.

    This is fine from the driver's viewpoint, because the flush sequence is
    a block layer concept and isn't related to the driver.

    However, a driver (such as dm-rq) can call blk_mq_queue_inflight() to
    count & drain inflight requests; the wait & drain then never finishes
    because the completed but not-finished flush request is counted as
    inflight.

    Fix this issue by not counting a completed flush data request as
    inflight in case of quiesce.

    Cc: Mike Snitzer <snitzer@kernel.org>
    Cc: David Jeffery <djeffery@redhat.com>
    Cc: John Pittman <jpittman@redhat.com>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20231201085605.577730-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
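    The counting rule from this commit message can be modelled in a few
    lines. This is a sketch only: enum toy_state, struct toy_rq and the
    toy_* helpers are hypothetical, and the real request state machine is
    richer than three states. It shows why a completed flush data request
    must be skipped while the queue is quiesced, so that a drain loop can
    reach zero.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical request states for this model. */
enum toy_state { TOY_IDLE, TOY_IN_FLIGHT, TOY_COMPLETE };

struct toy_rq {
    enum toy_state state;
    bool flush_data;    /* data part of a flush sequence */
};

/* A started request counts, unless it is a completed flush data request
 * that cannot be finished because the queue is quiesced. */
static bool toy_counts_as_inflight(const struct toy_rq *rq, bool quiesced)
{
    bool started = rq->state != TOY_IDLE;
    bool stuck_flush = quiesced && rq->flush_data &&
                       rq->state == TOY_COMPLETE;

    return started && !stuck_flush;
}

/* Count the requests a drain loop would still have to wait for. */
static unsigned toy_count_inflight(const struct toy_rq *rqs, size_t n,
                                   bool quiesced)
{
    unsigned count = 0;

    assert(rqs != NULL || n == 0);
    for (size_t i = 0; i < n; i++)
        if (toy_counts_as_inflight(&rqs[i], quiesced))
            count++;
    return count;
}
```

    In this model a caller that waits for the count to drop to zero is no
    longer held up by a flush request that only the unquiesced queue could
    finish.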
| * | block: Document the role of the two attribute groups  Bart Van Assche  2023-11-29  1  -0/+2
    It is nontrivial to derive the role of the two attribute groups in
    source file block/blk-sysfs.c. Hence add a comment that explains their
    roles. See also commit 6d85ebf95c44 ("blk-sysfs: add a new attr_group
    for blk_mq").

    Cc: Christoph Hellwig <hch@lst.de>
    Cc: Yu Kuai <yukuai3@huawei.com>
    Signed-off-by: Bart Van Assche <bvanassche@acm.org>
    Link: https://lore.kernel.org/r/20231128194019.72762-1-bvanassche@acm.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: warn once for each partition in bio_check_ro()  Yu Kuai  2023-11-28  1  -3/+11
    Commit 1b0a151c10a6 ("blk-core: use pr_warn_ratelimited() in
    bio_check_ro()") fixed the message storm by limiting the rate; however,
    there will still be lots of messages in the long term. Fix it better by
    warning once for each partition.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231128123027.971610-3-yukuai1@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
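    A warn-once-per-partition policy amounts to one sticky flag per
    partition. The following is a minimal sketch, assuming a hypothetical
    struct toy_part with an ro_warned field; the message text and the
    helper name are illustrative and not taken from the kernel source.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical partition record with a sticky "already warned" flag. */
struct toy_part {
    const char *name;
    bool read_only;
    bool ro_warned;
};

/* Called for each write; returns true only when a warning was printed. */
static bool toy_check_ro(struct toy_part *part)
{
    assert(part != NULL);

    if (!part->read_only || part->ro_warned)
        return false;

    part->ro_warned = true;     /* later writes to this partition stay quiet */
    fprintf(stderr, "write to read-only partition %s\n", part->name);
    return true;
}
```

    Unlike a rate limit, the flag bounds the total number of messages by
    the number of partitions, however long the writes continue.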
* | | Merge tag 'vfs-6.7-rc3.fixes' of ↵  Linus Torvalds  2023-11-24  1  -0/+2
|\ \ \
| |/ /
|/| |
    git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs

    Pull vfs fixes from Christian Brauner:

     - Avoid calling back into LSMs from vfs_getattr_nosec() calls.

       IMA used to query inode properties accessing raw inode fields
       without dedicated helpers. That was finally fixed a few releases ago
       by forcing IMA to use vfs_getattr_nosec() helpers.

       The goal of the vfs_getattr_nosec() helper is to query for
       attributes without calling into the LSM layer which would be quite
       problematic because incredibly IMA is called from __fput()...

         __fput()
           -> ima_file_free()

       What it does is to call back into the filesystem to update the
       file's IMA xattr. Querying the inode without using
       vfs_getattr_nosec() meant that IMA didn't handle stacking
       filesystems such as overlayfs correctly. So the switch to
       vfs_getattr_nosec() is quite correct. But the switch to
       vfs_getattr_nosec() revealed another bug when used on stacking
       filesystems:

         __fput()
           -> ima_file_free()
           -> vfs_getattr_nosec()
           -> i_op->getattr::ovl_getattr()
           -> vfs_getattr()
           -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr()
           -> security_inode_getattr() # calls back into LSMs

       Now, if that __fput() happens from task_work_run() of an exiting
       task current->fs and various other pointers could already be NULL.
       So anything in the LSM layer relying on that not being NULL would be
       quite surprised.

       Fix that by passing the information that this is a security request
       through to the stacking filesystem by adding a new internal
       AT_GETATTR_NOSEC flag. Now the callchain becomes:

         __fput()
           -> ima_file_free()
           -> vfs_getattr_nosec()
           -> i_op->getattr::ovl_getattr()
           -> if (AT_GETATTR_NOSEC)
                  vfs_getattr_nosec()
              else
                  vfs_getattr()
           -> i_op->getattr::$WHATEVER_UNDERLYING_FS_getattr()

     - Fix a bug introduced with the iov_iter rework from last cycle.

       This broke /proc/kcore by copying too much and without the correct
       offset.

     - Add a missing NULL check when allocating the root inode in
       autofs_fill_super().

     - Fix stable writes for multi-device filesystems (xfs, btrfs etc) and
       the block device pseudo filesystem.

       Stable writes used to be a superblock flag only, making it a per
       filesystem property. Add an additional AS_STABLE_WRITES mapping flag
       to allow for fine-grained control.

     - Ensure that offset_iterate_dir() returns 0 after reaching the end of
       a directory so it adheres to getdents() convention.

    * tag 'vfs-6.7-rc3.fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/vfs:
      libfs: getdents() should return 0 after reaching EOD
      xfs: respect the stable writes flag on the RT device
      xfs: clean up FS_XFLAG_REALTIME handling in xfs_ioctl_setattr_xflags
      block: update the stable_writes flag in bdev_add
      filemap: add a per-mapping stable writes flag
      autofs: add: new_inode check in autofs_fill_super()
      iov_iter: fix copy_page_to_iter_nofault()
      fs: Pass AT_GETATTR_NOSEC flag to getattr interface function
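    The flag-based dispatch in the fixed callchain can be sketched as a toy
    model. Everything here is hypothetical: TOY_GETATTR_NOSEC, the counter
    and the toy_* functions only imitate the shape of the calls named in
    the pull message, with the LSM hook reduced to a counter increment. The
    sketch shows that a stacking filesystem which forwards the "no
    security" request never reaches the hook.

```c
#include <assert.h>

/* Hypothetical flag meaning "this query must not call into the LSM layer". */
#define TOY_GETATTR_NOSEC 0x1u

/* Counts how often the modelled LSM hook ran. */
static unsigned toy_lsm_calls;

/* Stand-in for the underlying filesystem's getattr. */
static int toy_lower_getattr(void)
{
    return 0;
}

/* Query attributes without touching the security hook. */
static int toy_vfs_getattr_nosec(void)
{
    return toy_lower_getattr();
}

/* Regular query: run the security hook first, then fetch the attributes. */
static int toy_vfs_getattr(void)
{
    toy_lsm_calls++;
    return toy_vfs_getattr_nosec();
}

/* Stacking filesystem: forward the caller's intent to the lower layer. */
static int toy_ovl_getattr(unsigned flags)
{
    assert((flags & ~TOY_GETATTR_NOSEC) == 0);  /* the model knows one flag */

    if (flags & TOY_GETATTR_NOSEC)
        return toy_vfs_getattr_nosec();
    return toy_vfs_getattr();
}
```

    Before the fix, the stacking layer in this picture would always take
    the toy_vfs_getattr() branch, which is how the hook ended up running
    from the __fput() path.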
| * | block: update the stable_writes flag in bdev_add  Christoph Hellwig  2023-11-20  1  -0/+2
| |/
    Propagate the per-queue stable_write flags into each bdev inode in
    bdev_add. This makes sure devices that require stable writes have it set
    for I/O on the block device node as well.

    Note that this doesn't cover the case of a flag changing on a live
    device yet. We should handle that as well, but I plan to cover it as
    part of a more general rework of how changing runtime parameters on
    block devices works.

    Fixes: 1cb039f3dc16 ("bdi: replace BDI_CAP_STABLE_WRITES with a queue and a sb flag")
    Reported-by: Ilya Dryomov <idryomov@gmail.com>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231025141020.192413-3-hch@lst.de
    Tested-by: Ilya Dryomov <idryomov@gmail.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
* | block: Remove blk_set_runtime_active()  Damien Le Moal  2023-11-20  1  -28/+5
    The function blk_set_runtime_active() is called only from
    blk_post_runtime_resume(), so there is no need for that function to be
    exported. Open-code this function directly in blk_post_runtime_resume()
    and remove it.

    Signed-off-by: Damien Le Moal <dlemoal@kernel.org>
    Reviewed-by: Bart Van Assche <bvanassche@acm.org>
    Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com>
    Link: https://lore.kernel.org/r/20231120070611.33951-1-dlemoal@kernel.org
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-cgroup: bypass blkcg_deactivate_policy after destroying  Ming Lei  2023-11-17  1  -0/+13
    blkcg_deactivate_policy() can be called after blkg_destroy_all()
    returns, and it isn't necessary since blkg_destroy_all has covered
    policy deactivation.

    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20231117023527.3188627-4-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-cgroup: avoid to warn !rcu_read_lock_held() in blkg_lookup()  Ming Lei  2023-11-17  1  -2/+0
    So far, all callers either hold the spin lock or the RCU read lock
    explicitly, and most of the callers have added
    WARN_ON_ONCE(!rcu_read_lock_held()) or
    lockdep_assert_held(&disk->queue->queue_lock).

    Remove WARN_ON_ONCE(!rcu_read_lock_held()) from blkg_lookup() to kill
    the false positive warning from blkg_conf_prep().

    Reported-by: Changhui Zhong <czhong@redhat.com>
    Fixes: 83462a6c971c ("blkcg: Drop unnecessary RCU read [un]locks from blkg_conf_prep/finish()")
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20231117023527.3188627-3-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-throttle: fix lockdep warning of "cgroup_mutex or RCU read lock required!"  Ming Lei  2023-11-17  1  -0/+2
    Inside blkg_for_each_descendant_pre(), both
    css_for_each_descendant_pre() and blkg_lookup() require the RCU read
    lock, and either cgroup_assert_mutex_or_rcu_locked() or
    rcu_read_lock_held() is called.

    Fix the warning by adding the RCU read lock.

    Reported-by: Changhui Zhong <czhong@redhat.com>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20231117023527.3188627-2-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | blk-mq: make sure active queue usage is held for bio_integrity_prep()  Christoph Hellwig  2023-11-13  1  -37/+38
|/
    blk_integrity_unregister() can run if the queue usage counter isn't
    held for a bio with integrity prepared, so this request may be completed
    by calling profile->complete_fn, and the kernel then panics.

    Another constraint is that bio_integrity_prep() needs to be called
    before bio merge.

    Fix the issue by:

    - call bio_integrity_prep() with one queue usage counter grabbed
      reliably

    - call bio_integrity_prep() before bio merge

    Fixes: 900e080752025f00 ("block: move queue enter logic into blk_mq_submit_bio()")
    Reported-by: Yi Zhang <yi.zhang@redhat.com>
    Cc: Christoph Hellwig <hch@lst.de>
    Signed-off-by: Ming Lei <ming.lei@redhat.com>
    Tested-by: Yi Zhang <yi.zhang@redhat.com>
    Link: https://lore.kernel.org/r/20231113035231.2708053-1-ming.lei@redhat.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* blk-core: use pr_warn_ratelimited() in bio_check_ro()  Yu Kuai  2023-11-07  1  -2/+2
    If one of the underlying disks of raid or dm is set to read-only, then
    each io will generate a new log message, which will cause a message
    storm. This environment is indeed problematic; however, we can't make
    sure our naive customer won't do this, hence use pr_warn_ratelimited()
    to prevent a message storm in this case.

    Signed-off-by: Yu Kuai <yukuai3@huawei.com>
    Fixes: 57e95e4670d1 ("block: fix and cleanup bio_check_ro")
    Signed-off-by: Ye Bin <yebin10@huawei.com>
    Link: https://lore.kernel.org/r/20231107111247.2157820-1-yukuai1@huaweicloud.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
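    The idea behind a rate-limited warning can be shown with a small
    window-and-burst limiter. This is a generic sketch, not the kernel's
    ratelimit implementation: struct toy_ratelimit, its fields and the
    explicit `now` argument are assumptions made so the example runs in
    userspace without a clock.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical limiter: at most `burst` messages per `interval` time units. */
struct toy_ratelimit {
    long interval;
    int burst;
    long window_start;  /* time at which the current window began */
    int printed;        /* messages let through in the current window */
};

/* Returns true if a message may be printed at time `now`. */
static bool toy_ratelimit_ok(struct toy_ratelimit *rl, long now)
{
    assert(rl->interval > 0 && rl->burst > 0);

    /* Start a fresh window once the old one has expired. */
    if (now - rl->window_start >= rl->interval) {
        rl->window_start = now;
        rl->printed = 0;
    }
    if (rl->printed >= rl->burst)
        return false;   /* suppress: this window's budget is used up */

    rl->printed++;
    return true;
}
```

    A limiter of this kind caps the message rate but not the total, which
    is why sustained writes to a read-only device still log something in
    every window.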
* Merge tag 'mm-nonmm-stable-2023-11-02-14-08' of ↵  Linus Torvalds  2023-11-03  1  -3/+3
|\
    git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm

    Pull non-MM updates from Andrew Morton:
     "As usual, lots of singleton and doubleton patches all over the tree
      and there's little I can say which isn't in the individual changelogs.

      The lengthier patch series are

       - 'kdump: use generic functions to simplify crashkernel reservation
         in arch', from Baoquan He. This is mainly cleanups and
         consolidation of the 'crashkernel=' kernel parameter handling

       - After much discussion, David Laight's 'minmax: Relax type checks
         in min() and max()' is here. Hopefully reduces some typecasting
         and the use of min_t() and max_t()

       - A group of patches from Oleg Nesterov which clean up and slightly
         fix our handling of reads from /proc/PID/task/... and which remove
         task_struct.thread_group"

    * tag 'mm-nonmm-stable-2023-11-02-14-08' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (64 commits)
      scripts/gdb/vmalloc: disable on no-MMU
      scripts/gdb: fix usage of MOD_TEXT not defined when CONFIG_MODULES=n
      .mailmap: add address mapping for Tomeu Vizoso
      mailmap: update email address for Claudiu Beznea
      tools/testing/selftests/mm/run_vmtests.sh: lower the ptrace permissions
      .mailmap: map Benjamin Poirier's address
      scripts/gdb: add lx_current support for riscv
      ocfs2: fix a spelling typo in comment
      proc: test ProtectionKey in proc-empty-vm test
      proc: fix proc-empty-vm test with vsyscall
      fs/proc/base.c: remove unneeded semicolon
      do_io_accounting: use sig->stats_lock
      do_io_accounting: use __for_each_thread()
      ocfs2: replace BUG_ON() at ocfs2_num_free_extents() with ocfs2_error()
      ocfs2: fix a typo in a comment
      scripts/show_delta: add __main__ judgement before main code
      treewide: mark stuff as __ro_after_init
      fs: ocfs2: check status values
      proc: test /proc/${pid}/statm
      compiler.h: move __is_constexpr() to compiler.h
      ...
| * treewide: mark stuff as __ro_after_init (Alexey Dobriyan, 2023-10-18, 1 file, -3/+3)
    __read_mostly predates __ro_after_init. Many variables which are marked __read_mostly should have been __ro_after_init from day 1.

    Also, mark some stuff as "const" and "__init" while I'm at it.

    [akpm@linux-foundation.org: revert sysctl_nr_open_min, sysctl_nr_open_max changes due to arm warning]
    [akpm@linux-foundation.org: coding-style cleanups]
    Link: https://lkml.kernel.org/r/4f6bb9c0-abba-4ee4-a7aa-89265e886817@p183
    Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | Merge tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux (Linus Torvalds, 2023-11-01, 8 files, -384/+1465)
|\ \
    Pull block updates from Jens Axboe:

     - Improvements to the queue_rqs() support, and adding null_blk support for that as well (Chengming)

     - Series improving badblocks support (Coly)

     - Key store support for sed-opal (Greg)

     - IBM partition string handling improvements (Jan)

     - Make number of ublk devices supported configurable (Mike)

     - Cancelation improvements for ublk (Ming)

     - MD pull requests via Song:
         - Handle timeout in md-cluster, by Denis Plotnikov
         - Cleanup pers->prepare_suspend, by Yu Kuai
         - Rewrite mddev_suspend(), by Yu Kuai
         - Simplify md_seq_ops, by Yu Kuai
         - Reduce unnecessary locking array_state_store(), by Mariusz Tkaczyk
         - Make rdev add/remove independent from daemon thread, by Yu Kuai
         - Refactor code around quiesce() and mddev_suspend(), by Yu Kuai

     - NVMe pull request via Keith:
         - nvme-auth updates (Mark)
         - nvme-tcp tls (Hannes)
         - nvme-fc annotations (Kees)

     - Misc cleanups and improvements (Jiapeng, Joel)

    * tag 'for-6.7/block-2023-10-30' of git://git.kernel.dk/linux: (95 commits)
      block: ublk_drv: Remove unused function
      md: cleanup pers->prepare_suspend()
      nvme-auth: allow mixing of secret and hash lengths
      nvme-auth: use transformed key size to create resp
      nvme-auth: alloc nvme_dhchap_key as single buffer
      nvmet-tcp: use 'spin_lock_bh' for state_lock()
      powerpc/pseries: PLPKS SED Opal keystore support
      block: sed-opal: keystore access for SED Opal keys
      block:sed-opal: SED Opal keystore
      ublk: simplify aborting request
      ublk: replace monitor with cancelable uring_cmd
      ublk: quiesce request queue when aborting queue
      ublk: rename mm_lock as lock
      ublk: move ublk_cancel_dev() out of ub->mutex
      ublk: make sure io cmd handled in submitter task context
      ublk: don't get ublk device reference in ublk_abort_queue()
      ublk: Make ublks_max configurable
      ublk: Limit dev_id/ub_number values
      md-cluster: check for timeout while a new disk adding
      nvme: rework NVME_AUTH Kconfig selection
      ...
| * | powerpc/pseries: PLPKS SED Opal keystore support (Greg Joyce, 2023-10-17, 1 file, -0/+1)
    Define operations for SED Opal to read/write keys from POWER LPAR Platform KeyStore (PLPKS). This allows non-volatile storage of SED Opal keys.

    Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com>
    Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev>
    Link: https://lore.kernel.org/r/20231004201957.1451669-4-gjoyce@linux.vnet.ibm.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | block: sed-opal: keystore access for SED Opal keys (Greg Joyce, 2023-10-17, 1 file, -2/+16)
    Allow for permanent SED authentication keys by reading/writing to the SED Opal non-volatile keystore.

    Signed-off-by: Greg Joyce <gjoyce@linux.vnet.ibm.com>
    Reviewed-by: Jonathan Derrick <jonathan.derrick@linux.dev>
    Link: https://lore.kernel.org/r/20231004201957.1451669-3-gjoyce@linux.vnet.ibm.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | partitions/ibm: Introduce defines for magic string length values (Jan Höppner, 2023-10-04, 1 file, -9/+13)
    The length values for volume label type and volume label id are hard-coded in several places. Provide defines for those values and replace all occurrences accordingly.

    Note that the length is defined and used, and not the size since the volume label type string and volume label id string are not nul-terminated.

    Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
    Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
    Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
    Link: https://lore.kernel.org/r/20230915131001.697070-4-sth@linux.ibm.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | partitions/ibm: Replace strncpy() and improve readability (Jan Höppner, 2023-10-04, 1 file, -25/+61)
    strncpy() is deprecated and needs to be replaced. The volume label information strings are not nul-terminated and strncpy() can simply be replaced with memcpy().

    To enhance the readability of find_label() alongside this change, the following improvements are made:
    - Introduce the array dasd_vollabels[] containing all information necessary for the label detection.
    - Provide a helper function to obtain an index value corresponding to a volume label type. This allows the use of a switch statement to reduce indentation levels.
    - The 'temp' variable is used to check against valid volume label types. In the good case, this variable already contains the volume label type, making it unnecessary to copy the information again from e.g. label->vol.vollbl. Remove the 'temp' variable and the second copy as all information is already provided.
    - Remove the 'found' variable and replace it with early returns.

    Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
    Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
    Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
    Link: https://lore.kernel.org/r/20230915131001.697070-3-sth@linux.ibm.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | partitions/ibm: Remove unnecessary memset (Jan Höppner, 2023-10-04, 1 file, -2/+0)
    The data holding the volume label information is zeroed in case no valid volume label was found. Since the label information isn't used in that case, zeroing the data doesn't provide any value whatsoever.

    Remove the unnecessary memset() call accordingly.

    Signed-off-by: Jan Höppner <hoeppner@linux.ibm.com>
    Reviewed-by: Stefan Haberland <sth@linux.ibm.com>
    Signed-off-by: Stefan Haberland <sth@linux.ibm.com>
    Link: https://lore.kernel.org/r/20230915131001.697070-2-sth@linux.ibm.com
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | badblocks: switch to the improved badblock handling code (Coly Li, 2023-09-26, 1 file, -305/+3)
    This patch removes the old code of badblocks_set(), badblocks_clear() and badblocks_check(), and makes them wrappers that call _badblocks_set(), _badblocks_clear() and _badblocks_check().

    With this change the badblock handling switches to the improved algorithms in _badblocks_set(), _badblocks_clear() and _badblocks_check().

    This patch only contains the deletion of old code; the newly added code for the improved algorithms is in previous patches.

    Signed-off-by: Coly Li <colyli@suse.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Geliang Tang <geliang.tang@suse.com>
    Cc: Hannes Reinecke <hare@suse.de>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Vishal L Verma <vishal.l.verma@intel.com>
    Cc: Xiao Ni <xni@redhat.com>
    Reviewed-by: Xiao Ni <xni@redhat.com>
    Acked-by: Geliang Tang <geliang.tang@suse.com>
    Link: https://lore.kernel.org/r/20230811170513.2300-7-colyli@suse.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | badblocks: improve badblocks_check() for multiple ranges handling (Coly Li, 2023-09-26, 1 file, -0/+97)
    This patch rewrites badblocks_check() in a coding style similar to _badblocks_set() and _badblocks_clear(). The only difference is that bad blocks checking may now handle multiple ranges in the bad table.

    If a checking range covers multiple bad block ranges in the bad block table, as in the following condition (C is the checking range; E1, E2, E3 are three bad block ranges in the bad block table),

      +------------------------------------+
      |                 C                  |
      +------------------------------------+
         +----+      +----+      +----+
         | E1 |      | E2 |      | E3 |
         +----+      +----+      +----+

    the improved badblocks_check() algorithm will divide checking range C into multiple parts, and handle them in 7 runs of a while-loop,

      +--+ +----+ +----+ +----+ +----+ +----+ +----+
      |C1| | C2 | | C3 | | C4 | | C5 | | C6 | | C7 |
      +--+ +----+ +----+ +----+ +----+ +----+ +----+
           +----+        +----+        +----+
           | E1 |        | E2 |        | E3 |
           +----+        +----+        +----+

    And the start LBA and length of range E1 will be set as first_bad and bad_sectors for the caller.

    The return value rule is consistent for multiple ranges. For example, if there are the following bad block ranges in the bad block table,

      Index No.   Start   Len   Ack
          0        400     20    1
          1        500     50    1
          2        650     20    0

    the return value, first_bad and bad_sectors from calling badblocks_check() with different checking ranges can be the following values,

      Checking Start, Len   Return Value   first_bad   bad_sectors
           100, 100               0           N/A         N/A
           100, 310               1           400         10
           100, 440               1           400         10
           100, 540               1           400         10
           100, 600              -1           400         10
           100, 800              -1           400         10

    In order to make code review easier, this patch names the improved bad block range checking routine _badblocks_check() and does not change the existing badblocks_check() code yet. A later patch will delete the old code of badblocks_check() and make it a wrapper that calls _badblocks_check(). That way the newly added code won't be mixed up with the deleted old code, which is clearer and easier for code review.

    Signed-off-by: Coly Li <colyli@suse.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Geliang Tang <geliang.tang@suse.com>
    Cc: Hannes Reinecke <hare@suse.de>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Vishal L Verma <vishal.l.verma@intel.com>
    Cc: Xiao Ni <xni@redhat.com>
    Reviewed-by: Xiao Ni <xni@redhat.com>
    Acked-by: Geliang Tang <geliang.tang@suse.com>
    Link: https://lore.kernel.org/r/20230811170513.2300-6-colyli@suse.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
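    The range splitting and the return value rule in the badblocks_check() message can be sketched in userspace Python. This is a simplified model under stated assumptions, not the kernel's _badblocks_check(): the table is a plain list of hypothetical (start, len, acked) tuples, the name check_range() is made up for illustration, and bad_sectors is deliberately not modelled.

```python
def check_range(table, start, length):
    """Model of a multi-range bad block check.

    table: sorted, non-overlapping (start, len, acked) tuples (hypothetical layout).
    Returns (rv, first_bad, runs): rv is 0 (no bad blocks), 1 (only acknowledged
    bad blocks) or -1 (at least one unacknowledged bad block); runs counts the
    loop iterations, i.e. the parts the checking range was divided into.
    """
    end = start + length
    pos = start
    first_bad = None
    acked = unacked = runs = 0
    while pos < end:
        runs += 1
        # Bad range covering the head of the remaining checking range, if any.
        hit = next((r for r in table if r[0] <= pos < r[0] + r[1]), None)
        if hit is not None:
            if hit[2]:
                acked += 1
            else:
                unacked += 1
            if first_bad is None:
                first_bad = hit[0]
            pos = min(end, hit[0] + hit[1])
        else:
            # Clean gap: skip ahead to the next bad range or to the end.
            nxt = min((r[0] for r in table if r[0] > pos), default=end)
            pos = min(end, nxt)
    rv = -1 if unacked else (1 if acked else 0)
    return rv, first_bad, runs


# The three-entry table from the commit message above.
TABLE = [(400, 20, True), (500, 50, True), (650, 20, False)]
for args in [(100, 100), (100, 310), (100, 440), (100, 540), (100, 600), (100, 800)]:
    print(args, check_range(TABLE, *args))
```

    With this table, a check of (100, 800) crosses all three bad ranges and the four gaps around them, so the loop body runs seven times, matching the C1..C7 picture.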
| * | badblocks: improve badblocks_clear() for multiple ranges handling (Coly Li, 2023-09-26, 1 file, -0/+325)
    With the fundamental ideas and helper routines from the badblocks_set() improvement, clearing bad blocks for multiple ranges is much simpler.

    With a similar idea to the badblocks_set() improvement, this patch simplifies bad block range clearing into 5 situations. No matter how complicated the clearing condition is, we just look at the head part of the clearing range relative to the already set bad block range from the bad block table. The remaining part will be handled in the next run of the while-loop.

    Based on the existing helpers added for badblocks_set(), this patch adds two more helpers,
    - front_clear()
      Clear the bad block range from the bad block table which is front overlapped with the clearing range.
    - front_splitting_clear()
      Handle the condition that the clearing range hits the middle of an already set bad block range from the bad block table.

    Similar to badblocks_set(), the first part of the clearing range is handled with the relative bad block range which is found by prev_badblocks(). In most cases a valid hint is provided to prev_badblocks() to avoid unnecessary bad block table iteration.

    This patch also explains the detailed algorithm in code comments at the beginning of badblocks.c, including how the five simplified situations are categorized and how all the bad block range clearing conditions are handled by these five situations.

    Again, in order to make the code review easier and avoid mixing the code changes together, this patch does not modify badblocks_clear() and implements another routine called _badblocks_clear() for the improvement. A later patch will delete the current code of badblocks_clear() and make it a wrapper of _badblocks_clear(), so the code change can be much clearer for review.

    Signed-off-by: Coly Li <colyli@suse.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Geliang Tang <geliang.tang@suse.com>
    Cc: Hannes Reinecke <hare@suse.de>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Vishal L Verma <vishal.l.verma@intel.com>
    Cc: Xiao Ni <xni@redhat.com>
    Reviewed-by: Xiao Ni <xni@redhat.com>
    Acked-by: Geliang Tang <geliang.tang@suse.com>
    Link: https://lore.kernel.org/r/20230811170513.2300-5-colyli@suse.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
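    The effect of clearing a range, including the case where the clearing range hits the middle of an already set bad block range and splits it, can be illustrated with a small Python model. It is only a sketch of the resulting table: it scans every entry instead of following the kernel's five-situation loop, and clear_range() and the (start, len) tuple layout are hypothetical names chosen for this example.

```python
def clear_range(table, start, length):
    """Remove [start, start + length) from a list of (start, len) bad ranges."""
    end = start + length
    out = []
    for s, l in table:
        e = s + l
        if e <= start or s >= end:
            # No overlap with the clearing range: keep the entry untouched.
            out.append((s, l))
            continue
        if s < start:
            # Part of the entry lies before the clearing range: keep that head.
            out.append((s, start - s))
        if e > end:
            # Part of the entry lies after the clearing range: keep that tail.
            out.append((end, e - end))
    return out


print(clear_range([(32, 12)], 36, 2))   # middle hit, the entry is split in two
print(clear_range([(32, 12)], 30, 4))   # front overlap, only the tail survives
```

    A middle hit turns one table entry into two, which is why clearing can need an extra slot in a size-limited table even though it removes sectors.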
| * | badblocks: improve badblocks_set() for multiple ranges handling (Coly Li, 2023-09-26, 1 file, -20/+544)
    Recently I received a bug report that the current badblocks code does not properly handle multiple ranges. For example,

      badblocks_set(bb, 32, 1, true);
      badblocks_set(bb, 34, 1, true);
      badblocks_set(bb, 36, 1, true);
      badblocks_set(bb, 32, 12, true);

    Then indeed badblocks_show() reports,

      32 3
      36 1

    But the expected bad blocks table should be,

      32 12

    Obviously only the first 2 ranges are merged, and badblocks_set() returns and ignores the rest of the setting range.

    This behavior is improper. If the caller of badblocks_set() wants to set a range of blocks into the bad blocks table, all of the blocks in the range should be handled even if the previous part encounters a failure.

    The desired way to set a bad blocks range by badblocks_set() is,
    - Set as many blocks of the setting range as possible into the bad blocks table.
    - Merge the bad blocks ranges and occupy as few slots as possible in the bad blocks table.
    - Fast.

    Indeed the above proposal is complicated, especially with the following restrictions,
    - The setting bad blocks range can be acknowledged or not acknowledged.
    - The bad blocks table size is limited.
    - Memory allocation should be avoided.

    The basic idea of the patch is to categorize all possible bad blocks range setting combinations into far fewer, simpler and less special conditions. Inside badblocks_set() there is an implicit loop composed by jumping between the labels 're_insert' and 'update_sectors'. No matter how large the setting bad blocks range is, in every loop just a minimized range from the head is handled by a pre-defined behavior from one of the categorized conditions. The logic is simple and the code flow is manageable.

    The different relative layouts between the setting range and the existing bad block range are checked and handled (merge, combine, overwrite, insert) by the helpers in the previous patch. This patch makes all the helpers work together with the above idea.

    This patch only has the algorithm improvement for badblocks_set(). Following patches contain the improvements for badblocks_clear() and badblocks_check(). But the algorithm in badblocks_set() is fundamental and typical; the other improvements in the clear and check routines are based on all the helpers and ideas in this patch.

    In order to make the change clearer for code review, this patch does not directly modify the existing badblocks_set(), and just adds a new one named _badblocks_set(). A later patch will remove the current badblocks_set() code and make it a wrapper of _badblocks_set(). So the newly added change won't be mixed with deleted code, and the code review can be easier.

    Signed-off-by: Coly Li <colyli@suse.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Geliang Tang <geliang.tang@suse.com>
    Cc: Hannes Reinecke <hare@suse.de>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Vishal L Verma <vishal.l.verma@intel.com>
    Cc: Wols Lists <antlists@youngman.org.uk>
    Cc: Xiao Ni <xni@redhat.com>
    Reviewed-by: Xiao Ni <xni@redhat.com>
    Acked-by: Geliang Tang <geliang.tang@suse.com>
    Link: https://lore.kernel.org/r/20230811170513.2300-4-colyli@suse.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
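    The expected outcome of the four badblocks_set() calls from the bug report can be reproduced with a small interval-merge model in Python. This is a sketch of the desired table contents only, assuming all ranges share the same acknowledged state and ignoring the table size limit and any per-entry length limit; set_range() and the (start, len) tuples are hypothetical names, not the kernel's data layout.

```python
def set_range(table, start, length):
    """Add [start, start + length) to a sorted list of (start, len) bad ranges,
    merging it with every entry it overlaps or touches."""
    new_start, new_end = start, start + length
    merged = []
    for s, l in table:
        e = s + l
        if e < new_start or s > new_end:
            # Disjoint and not adjacent: keep the entry as it is.
            merged.append((s, l))
        else:
            # Overlapping or adjacent: absorb the entry into the new range.
            new_start = min(new_start, s)
            new_end = max(new_end, e)
    merged.append((new_start, new_end - new_start))
    merged.sort()
    return merged


table = []
for start, length in [(32, 1), (34, 1), (36, 1), (32, 12)]:
    table = set_range(table, start, length)
print(table)
```

    After the last call the three single-sector entries are absorbed into one entry covering sectors 32 to 43, which corresponds to the "32 12" table the commit message expects.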
| * | badblocks: add helper routines for badblock ranges handling (Coly Li, 2023-09-26, 1 file, -0/+386)
    This patch adds several helper routines to improve badblock ranges handling. These helper routines will be used later in the improved version of badblocks_set()/badblocks_clear()/badblocks_check().

    - Helpers prev_by_hint() and prev_badblocks() are used to find the bad range from the bad table which the searching range starts at or after.

    - The following helpers decide the relative layout between the manipulating range and an existing bad block range from the bad table.
      - can_merge_behind()
        Return 'true' if the manipulating range can backward merge with the bad block range.
      - can_merge_front()
        Return 'true' if the manipulating range can forward merge with the bad block range.
      - can_combine_front()
        Return 'true' if two adjacent bad block ranges before the manipulating range can be merged.
      - overlap_front()
        Return 'true' if the manipulating range exactly overlaps with the bad block range in front of its range.
      - overlap_behind()
        Return 'true' if the manipulating range exactly overlaps with the bad block range behind its range.
      - can_front_overwrite()
        Return 'true' if the manipulating range can forward overwrite the bad block range in front of its range.

    - The following helpers add the manipulating range into the bad block table. A different routine is called depending on the specific relative layout between the manipulating range and the other bad block ranges in the bad block table.
      - behind_merge()
        Merge the manipulating range with the bad block range behind its range, and return the merged length in units of sectors.
      - front_merge()
        Merge the manipulating range with the bad block range in front of its range, and return the merged length in units of sectors.
      - front_combine()
        Combine the two adjacent bad block ranges before the manipulating range into a larger one.
      - front_overwrite()
        Overwrite part of or the whole bad block range which is in front of the manipulating range. The overwrite may split the existing bad block range and generate more bad block ranges in the bad block table.
      - insert_at()
        Insert the manipulating range at a specific location in the bad block table.

    All the above helpers are used in later patches to improve the bad block ranges handling for badblocks_set()/badblocks_clear()/badblocks_check().

    Signed-off-by: Coly Li <colyli@suse.de>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Geliang Tang <geliang.tang@suse.com>
    Cc: Hannes Reinecke <hare@suse.de>
    Cc: Jens Axboe <axboe@kernel.dk>
    Cc: NeilBrown <neilb@suse.de>
    Cc: Vishal L Verma <vishal.l.verma@intel.com>
    Cc: Xiao Ni <xni@redhat.com>
    Reviewed-by: Xiao Ni <xni@redhat.com>
    Acked-by: Geliang Tang <geliang.tang@suse.com>
    Link: https://lore.kernel.org/r/20230811170513.2300-3-colyli@suse.de
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
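    The lookup that prev_badblocks() is described as performing, finding the table entry that the searching range starts at or after, amounts to a "last entry with start <= s" search over a sorted table. A minimal Python sketch of that idea follows, assuming a sorted list of start sectors; prev_index() is a hypothetical name and the hint shortcut mentioned for prev_by_hint() is not modelled.

```python
import bisect


def prev_index(starts, s):
    """Index of the last entry whose start sector is <= s, or -1 if s lies
    before every entry. `starts` must be sorted in ascending order."""
    return bisect.bisect_right(starts, s) - 1


starts = [400, 500, 650]
print(prev_index(starts, 100))   # before the first entry
print(prev_index(starts, 450))   # at or after entry 0, before entry 1
print(prev_index(starts, 700))   # after the last entry's start
```

    Returning -1 for "no previous entry" gives callers one value to test before they compare the searching range against the neighbouring entries.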
| * | blk-mq: update driver tags request table when start request (Chengming Zhou, 2023-09-22, 2 files, -3/+1)
    Currently we update the driver tags request table in blk_mq_get_driver_tag(), so drivers that support queue_rqs() have to update that inflight table themselves.

    Move the update to blk_mq_start_request(), which is a better place: it is where we set up the deadline for the request timeout check, and it is exactly where the request becomes inflight.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20230913151616.3164338-5-chengming.zhou@linux.dev
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blk-mq: support batched queue_rqs() on shared tags queue (Chengming Zhou, 2023-09-22, 1 file, -6/+1)
    Since active requests are now accounted when driver tags are allocated, we can remove this limit.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20230913151616.3164338-4-chengming.zhou@linux.dev
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blk-mq: remove RQF_MQ_INFLIGHT (Chengming Zhou, 2023-09-22, 3 files, -14/+2)
    Since the previous patch changed the code to account active requests only when we really allocate the driver tag, RQF_MQ_INFLIGHT can be removed without any double accounting problem.

    1. none elevator: the flush request uses the first pending request's driver tag, so it won't be double accounted.

    2. other elevator: the flush request is accounted when its driver tag is allocated at issue time, and is unaccounted when it puts the driver tag.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20230913151616.3164338-3-chengming.zhou@linux.dev
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | blk-mq: account active requests when get driver tag (Chengming Zhou, 2023-09-22, 2 files, -36/+53)
    There is a limit that batched queue_rqs() can't work on a shared tags queue, since the accounting of active requests can't be done there.

    Currently we account the active requests only in blk_mq_get_driver_tag(), which is not the time we actually get the driver tag (with the none elevator).

    To support batched queue_rqs() on a shared tags queue, we move the accounting of active requests to where we get the driver tag:

    1. none elevator: blk_mq_get_tags() and blk_mq_get_tag()
    2. other elevator: __blk_mq_alloc_driver_tag()

    This is clearer and matches the unaccount side, which happens only when we put the driver tag.

    The other good point is that we no longer need the RQF_MQ_INFLIGHT trick, which was used to avoid double accounting of the flush request. Now we only account when we actually get the driver tag, so all is good. We will remove RQF_MQ_INFLIGHT in the next patch.

    Signed-off-by: Chengming Zhou <zhouchengming@bytedance.com>
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Link: https://lore.kernel.org/r/20230913151616.3164338-2-chengming.zhou@linux.dev
    Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | Merge tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs (Linus Torvalds, 2023-10-30, 6 files, -58/+142)
|\ \ \
    Pull vfs superblock updates from Christian Brauner:
     "This contains the work to make block device opening functions return a struct bdev_handle instead of just a struct block_device. The same struct bdev_handle is then also passed to block device closing functions.

      This allows us to propagate context from opening to closing a block device without having to modify all users every time.

      Sidenote, in the future we might even want to try and have block device opening functions return a struct file directly but that's a series on top of this.

      These are further preparatory changes to be able to count writable opens and blocking writes to mounted block devices. That's a separate piece of work for next cycle and for that we absolutely need the changes to btrfs that have been quietly dropped somehow.

      Originally the series contained a patch that removed the old blkdev_*() helpers. But since this would've caused needless churn in -next for bcachefs we ended up delaying it.

      The second piece of work addresses one of the major annoyances about the work last cycle, namely that we required dropping s_umount whenever we used the superblock and fs_holder_ops for a block device.

      The reason for that requirement had been that in some codepaths s_umount could've been taken under disk->open_mutex (that's always been the case, at least theoretically). For example, on surprise block device removal or media change. And opening and closing block devices required grabbing disk->open_mutex as well.

      So we did the work and went through the block layer and fixed all those places so that s_umount is never taken under disk->open_mutex. This means no more brittle games where we yield and reacquire s_umount during block device opening and closing and no more requirements where block devices need to be closed. Filesystems don't need to care about this.

      There's a bunch of other follow-up work such as moving block device freezing and thawing to holder operations which makes it work for all block devices and not just the main block device just as we did for surprise removal. But that is for next cycle.

      Tested with fstests for all major fses, blktests, LTP"

    * tag 'vfs-6.7.super' of gitolite.kernel.org:pub/scm/linux/kernel/git/vfs/vfs: (37 commits)
      porting: update locking requirements
      fs: assert that open_mutex isn't held over holder ops
      block: assert that we're not holding open_mutex over blk_report_disk_dead
      block: move bdev_mark_dead out of disk_check_media_change
      block: WARN_ON_ONCE() when we remove active partitions
      block: simplify bdev_del_partition()
      fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock
      jfs: fix log->bdev_handle null ptr deref in lbmStartIO
      bcache: Fixup error handling in register_cache()
      xfs: Convert to bdev_open_by_path()
      reiserfs: Convert to bdev_open_by_dev/path()
      ocfs2: Convert to use bdev_open_by_dev()
      nfs/blocklayout: Convert to use bdev_open_by_dev/path()
      jfs: Convert to bdev_open_by_dev()
      f2fs: Convert to bdev_open_by_dev/path()
      ext4: Convert to bdev_open_by_dev()
      erofs: Convert to use bdev_open_by_path()
      btrfs: Convert to bdev_open_by_path()
      fs: Convert to bdev_open_by_dev()
      mm/swap: Convert to use bdev_open_by_dev()
      ...
| * | | block: assert that we're not holding open_mutex over blk_report_disk_dead (Christian Brauner, 2023-10-28, 1 file, -0/+7)
    blk_report_disk_dead() has the following major callers:

    (1) del_gendisk()
    (2) blk_mark_disk_dead()

    Since del_gendisk() acquires disk->open_mutex it's clear that all callers are assumed to be called without disk->open_mutex held. In turn, blk_report_disk_dead() is called without disk->open_mutex held in del_gendisk().

    All callers of blk_mark_disk_dead() call it without disk->open_mutex as well.

    Ensure that it is clear that blk_report_disk_dead() is called without disk->open_mutex on purpose by asserting it and a comment in the code.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231017184823.1383356-5-hch@lst.de
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | block: move bdev_mark_dead out of disk_check_media_change (Christoph Hellwig, 2023-10-28, 2 files, -16/+11)
    disk_check_media_change is mostly called from ->open where it makes little sense to mark the file system on the device as dead, as we are just opening it. So instead of calling bdev_mark_dead from disk_check_media_change, move it into the few callers that are not in an open instance. This avoids calling into bdev_mark_dead and thus taking s_umount with open_mutex held.

    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231017184823.1383356-4-hch@lst.de
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Christian Brauner <brauner@kernel.org>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | block: WARN_ON_ONCE() when we remove active partitions (Christian Brauner, 2023-10-28, 1 file, -13/+17)
    The logic for disk->open_partitions is:

    blkdev_get_by_*()
    -> bdev_is_partition()
       -> blkdev_get_part()
          -> blkdev_get_whole() // bdev_whole->bd_openers++
          -> if (part->bd_openers == 0)
                 disk->open_partitions++
             part->bd_openers++

    In other words, when we first claim/open a partition we increment disk->open_partitions, and only when all part->bd_openers are closed will disk->open_partitions be zero. That should mean that disk->open_partitions is always > 0 as long as there's anyone that has an open partition.

    So the check for disk->open_partitions should mean that we can never remove an active partition that has a holder and holder ops set. Assert that in the code. The main disk isn't removed, so that check doesn't work for disk->part0, which is what we want. After all, we only care about partitions, not about the main disk.

    Link: https://lore.kernel.org/r/20231017184823.1383356-3-hch@lst.de
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
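    The counting rule for disk->open_partitions described in this message can be modelled with a few lines of Python. This is an illustrative sketch only: the Disk and Partition classes, open_part(), close_part() and the whole_openers field are hypothetical stand-ins, and the close path is an assumption inferred from the statement that the counter returns to zero once all openers are gone.

```python
class Disk:
    def __init__(self):
        self.open_partitions = 0   # partitions with at least one opener
        self.whole_openers = 0     # stand-in for bdev_whole->bd_openers


class Partition:
    def __init__(self, disk):
        self.disk = disk
        self.bd_openers = 0


def open_part(part):
    """Open path: the first opener of a partition bumps disk.open_partitions."""
    part.disk.whole_openers += 1
    if part.bd_openers == 0:
        part.disk.open_partitions += 1
    part.bd_openers += 1


def close_part(part):
    """Assumed close path: the last closer drops disk.open_partitions again."""
    part.bd_openers -= 1
    if part.bd_openers == 0:
        part.disk.open_partitions -= 1
    part.disk.whole_openers -= 1


disk = Disk()
p1, p2 = Partition(disk), Partition(disk)
open_part(p1)
open_part(p1)
open_part(p2)
print(disk.open_partitions)   # two distinct partitions are open
```

    Because the counter tracks distinct open partitions rather than individual opens, a second open of the same partition leaves it unchanged, and it reaches zero only after every partition has lost its last opener.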
| * | | block: simplify bdev_del_partition() (Christian Brauner, 2023-10-28, 1 file, -1/+12)
    BLKPG_DEL_PARTITION refuses to delete partitions that still have openers, i.e., that have an elevated @bdev->bd_openers count. If a device is claimed by setting @bdev->bd_holder and @bdev->bd_holder_ops, @bdev->bd_openers and @bdev->bd_holders are incremented. @bdev->bd_openers is effectively guaranteed to be >= @bdev->bd_holders. So as long as @bdev->bd_openers isn't zero we know that this partition is still in active use and that there might still be @bdev->bd_holder and @bdev->bd_holder_ops set.

    The only current example is @fs_holder_ops for filesystems. But that means bdev_mark_dead() which calls into bdev->bd_holder_ops->mark_dead::fs_bdev_mark_dead() is a nop. As long as there's an elevated @bdev->bd_openers count we can't delete the partition, and if there isn't an elevated @bdev->bd_openers count then there's no @bdev->bd_holder or @bdev->bd_holder_ops.

    So simply open-code what we need to do. This gets rid of one more instance where we acquire s_umount under @disk->open_mutex.

    Link: https://lore.kernel.org/r/20231016-fototermin-umriss-59f1ea6c1fe6@brauner
    Reviewed-by: Christoph Hellwig <hch@lst.de>
    Reviewed-by: Jan Kara <jack@suse.cz>
    Reviewed-by: Jens Axboe <axboe@kernel.dk>
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Link: https://lore.kernel.org/r/20231017184823.1383356-2-hch@lst.de
    Reviewed-by: Ming Lei <ming.lei@redhat.com>
    Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | fs: Avoid grabbing sb->s_umount under bdev->bd_holder_lock  Jan Kara  2023-10-28  2  -4/+6
| | | | The implementation of bdev holder operations such as fs_bdev_mark_dead() and
| | | | fs_bdev_sync() grab sb->s_umount semaphore under bdev->bd_holder_lock. This
| | | | is problematic because it leads to disk->open_mutex -> sb->s_umount lock
| | | | ordering, which is counterintuitive (usually we grab higher level (e.g.
| | | | filesystem) locks first and lower level (e.g. block layer) locks later) and
| | | | indeed makes lockdep complain about possible locking cycles whenever we open
| | | | a block device while holding sb->s_umount semaphore.
| | | |
| | | | Implement a function bdev_super_lock_shared() which safely transitions from
| | | | holding bdev->bd_holder_lock to holding sb->s_umount on an alive superblock
| | | | without introducing the problematic lock dependency. We use this function in
| | | | fs_bdev_sync() and fs_bdev_mark_dead().
| | | |
| | | | Signed-off-by: Jan Kara <jack@suse.cz>
| | | | Link: https://lore.kernel.org/r/20231018152924.3858-1-jack@suse.cz
| | | | Link: https://lore.kernel.org/r/20231017184823.1383356-1-hch@lst.de
| | | | Reviewed-by: Christoph Hellwig <hch@lst.de>
| | | | Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | block: Use bdev_open_by_dev() in disk_scan_partitions() and blkdev_bszset()  Jan Kara  2023-10-28  2  -8/+10
| | | | Convert disk_scan_partitions() and blkdev_bszset() to use
| | | | bdev_open_by_dev().
| | | |
| | | | Acked-by: Christoph Hellwig <hch@lst.de>
| | | | Reviewed-by: Christian Brauner <brauner@kernel.org>
| | | | Signed-off-by: Jan Kara <jack@suse.cz>
| | | | Link: https://lore.kernel.org/r/20230927093442.25915-3-jack@suse.cz
| | | | Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | block: Use bdev_open_by_dev() in blkdev_open()  Jan Kara  2023-10-28  2  -16/+31
| | | | Convert blkdev_open() to use bdev_open_by_dev(). To be able to propagate the
| | | | handle from blkdev_open() to blkdev_release() we need to stop using the
| | | | existence of file->private_data to determine exclusive block device opens.
| | | | Use bdev_handle->mode for this purpose since file->f_flags isn't usable for
| | | | this (O_EXCL is cleared from the flags during open).
| | | |
| | | | Acked-by: Christoph Hellwig <hch@lst.de>
| | | | Reviewed-by: Christian Brauner <brauner@kernel.org>
| | | | Signed-off-by: Jan Kara <jack@suse.cz>
| | | | Link: https://lore.kernel.org/r/20230927093442.25915-2-jack@suse.cz
| | | | Signed-off-by: Christian Brauner <brauner@kernel.org>
| * | | block: Provide bdev_open_* functions  Jan Kara  2023-10-28  1  -0/+48
| | | | Create struct bdev_handle that contains all parameters that need to be
| | | | passed to blkdev_put() and provide bdev_open_* functions that return this
| | | | structure instead of a plain bdev pointer. This will eventually allow us to
| | | | pass one more argument to blkdev_put() (renamed to bdev_release()) without
| | | | too much hassle.
| | | |
| | | | Acked-by: Christoph Hellwig <hch@lst.de>
| | | | Reviewed-by: Christian Brauner <brauner@kernel.org>
| | | | Signed-off-by: Jan Kara <jack@suse.cz>
| | | | Link: https://lore.kernel.org/r/20230927093442.25915-1-jack@suse.cz
| | | | Signed-off-by: Christian Brauner <brauner@kernel.org>
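The handle pattern this message describes can be sketched in userspace as follows. This is a hedged illustration, not the kernel's definition: the `model_` names are hypothetical, the fields are an assumption based on the parameters the messages in this series mention (the device, the holder, and the open mode), and `malloc()`/`free()` stand in for kernel allocation.

```c
#include <assert.h>
#include <stdlib.h>

typedef unsigned int model_mode_t; /* stand-in for the open mode flags */

/* Hypothetical stand-in for a block device. */
struct model_bdev {
	int bd_openers;
};

/* Everything the release path needs travels in one object. */
struct model_bdev_handle {
	struct model_bdev *bdev;
	void *holder;
	model_mode_t mode;
};

/* Open returns a handle instead of a bare device pointer. */
static struct model_bdev_handle *model_bdev_open(struct model_bdev *bdev,
						 model_mode_t mode,
						 void *holder)
{
	struct model_bdev_handle *handle = malloc(sizeof(*handle));

	if (!handle)
		return NULL;
	handle->bdev = bdev;
	handle->holder = holder;
	handle->mode = mode;
	bdev->bd_openers++;
	return handle;
}

/* Release takes only the handle; no extra arguments to thread through. */
static void model_bdev_release(struct model_bdev_handle *handle)
{
	handle->bdev->bd_openers--;
	free(handle);
}
```

The design benefit in this sketch is that adding another piece of per-open state later only changes the structure and the open/release pair, not every caller's argument list.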