path: root/drivers/nvdimm/pmem.c

Commit message (author, date; files changed, lines -/+)

* nvdimm/pmem: Set dax flag for all 'PFN_MAP' cases (Zhihao Cheng, 2024-08-09; 1 file, -1/+1)

  DAX is only supported on pfn-type pmem devices since commit f467fee48da4 ("block: move the dax flag to queue_limits"). Trying to mount a DAX filesystem fails with this error:

    mount: : wrong fs type, bad option, bad superblock on /dev/pmem7, missing codepage or helper program, or other error.
    dmesg(1) may have more information after failed mount system call.

  dmesg:

    EXT4-fs (pmem7): DAX unsupported by block device.

  Fix the problem by adding the dax flag setting for the missed case.

  Fixes: f467fee48da4 ("block: move the dax flag to queue_limits")
  Signed-off-by: Zhihao Cheng <chengzhihao1@huawei.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dave Jiang <dave.jiang@intel.com>
  Reviewed-by: Ira Weiny <ira.weiny@intel.com>
  Tested-by: Ira Weiny <ira.weiny@intel.com>
  Tested-by: Alison Schofield <alison.schofield@intel.com>
  Link: https://patch.msgid.link/20240809031155.2837271-1-chengzhihao1@huawei.com
  Signed-off-by: Ira Weiny <ira.weiny@intel.com>

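  The one-line nature of the fix suggests a shape like the following sketch, assuming the pmem_attach_disk() context and the BLK_FEAT_DAX queue_limits flag from the 6.11 block layer (exact placement is an assumption):

	/* sketch: advertise DAX whenever the namespace is page-mappable */
	if (pmem->pfn_flags & PFN_MAP)
		lim.features |= BLK_FEAT_DAX;
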
* Merge tag 'libnvdimm-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm (Linus Torvalds, 2024-07-20; 1 file, -0/+1)

  Pull libnvdimm updates from Ira Weiny:

  - One small cleanup to use sizeof(*pointer)
  - Add MODULE_DESCRIPTIONS() to eliminate make W=1 warnings

  * tag 'libnvdimm-for-6.11' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    testing: nvdimm: Add MODULE_DESCRIPTION() macros
    testing: nvdimm: iomap: add MODULE_DESCRIPTION()
    dax: add missing MODULE_DESCRIPTION() macros
    nvdimm: add missing MODULE_DESCRIPTION() macros
    ACPI: NFIT: add missing MODULE_DESCRIPTION() macro
    nvdimm/btt: use sizeof(*pointer) instead of sizeof(type)

| * nvdimm: add missing MODULE_DESCRIPTION() macros (Jeff Johnson, 2024-06-18; 1 file, -0/+1)

  Fix the 'make W=1' warnings:

    WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvdimm/libnvdimm.o
    WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvdimm/nd_pmem.o
    WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvdimm/nd_btt.o
    WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvdimm/nd_e820.o
    WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvdimm/of_pmem.o
    WARNING: modpost: missing MODULE_DESCRIPTION() in drivers/nvdimm/nd_virtio.o

  [iweiny: edit core module description]

  Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com>
  Reviewed-by: Dave Jiang <dave.jiang@intel.com>
  Link: https://patch.msgid.link/r/20240526-md-drivers-nvdimm-v1-1-9e583677e80f@quicinc.com
  Signed-off-by: Ira Weiny <ira.weiny@intel.com>

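  Each affected module gains a single line; for the pmem module the addition looks roughly like this (the exact description string chosen by the patch is an assumption):

	MODULE_DESCRIPTION("NVDIMM Persistent Memory Driver");
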
* | block: move the dax flag to queue_limits (Christoph Hellwig, 2024-06-19; 1 file, -5/+2)

  Move the dax flag into the queue_limits feature field so that it can be set atomically with the queue frozen.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
  Reviewed-by: Hannes Reinecke <hare@suse.de>
  Link: https://lore.kernel.org/r/20240617060532.127975-21-hch@lst.de
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

* | block: move the synchronous flag to queue_limits (Christoph Hellwig, 2024-06-19; 1 file, -2/+2)

  Move the synchronous flag into the queue_limits feature field so that it can be set atomically with the queue frozen.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
  Reviewed-by: Hannes Reinecke <hare@suse.de>
  Link: https://lore.kernel.org/r/20240617060532.127975-19-hch@lst.de
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

* | block: move the nonrot flag to queue_limits (Christoph Hellwig, 2024-06-19; 1 file, -1/+0)

  Move the nonrot flag into the queue_limits feature field so that it can be set atomically with the queue frozen.

  Use the chance to switch to defaulting to non-rotational and require the driver to opt into rotational, which matches the polarity of the sysfs interface.

  For the z2ram, ps3vram, 2x memstick, ubiblock and dcssblk the new rotational flag is not set as they clearly are not rotational despite this being a behavior change. There are some other drivers that unconditionally set the rotational flag to keep the existing behavior as they arguably can be used on rotational devices even if that is probably not their main use today (e.g. virtio_blk and drbd).

  The flag is automatically inherited in blk_stack_limits matching the existing behavior in dm and md.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
  Reviewed-by: Hannes Reinecke <hare@suse.de>
  Link: https://lore.kernel.org/r/20240617060532.127975-15-hch@lst.de
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

* | block: move cache control settings out of queue->flags (Christoph Hellwig, 2024-06-19; 1 file, -1/+3)

  Move the cache control settings into the queue_limits so that the flags can be set atomically with the device queue frozen.

  Add new features and flags fields for the driver-set flags, and internal (usually sysfs-controlled) flags in the block layer. Note that we'll eventually remove enough fields from queue_limits to bring it back to the previous size.

  The disable flag is inverted compared to the previous meaning, which means it now survives a rescan, similar to the max_sectors and max_discard_sectors user limits.

  The FLUSH and FUA flags are now inherited by blk_stack_limits, which simplified the code in dm a lot, but also causes a slight behavior change in that dm-switch and dm-unstripe now advertise a write cache despite setting num_flush_bios to 0. The I/O path will handle this gracefully, but as far as I can tell the lack of num_flush_bios and thus flush support is a pre-existing data integrity bug in those targets that really needs fixing, after which a non-zero num_flush_bios should be required in dm for targets that map to underlying devices.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Acked-by: Ulf Hansson <ulf.hansson@linaro.org>
  Reviewed-by: Damien Le Moal <dlemoal@kernel.org>
  Reviewed-by: Hannes Reinecke <hare@suse.de>
  Link: https://lore.kernel.org/r/20240617060532.127975-14-hch@lst.de
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

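  Taken together with the dax/synchronous/nonrot patches above, drivers now describe these properties in the queue_limits passed at allocation time instead of calling blk_queue_write_cache() or toggling queue flags afterwards. A rough sketch of what that looks like for a pmem-style driver (the fua variable and exact flag combination are assumptions, not necessarily what pmem sets):

	lim.features |= BLK_FEAT_WRITE_CACHE | BLK_FEAT_SYNCHRONOUS;
	if (fua)
		lim.features |= BLK_FEAT_FUA;
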
* Merge tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2024-03-15; 1 file, -11/+12)

  Pull MM updates from Andrew Morton:

  - Sumanth Korikkar has taught s390 to allocate hotplug-time page frames from hotplugged memory rather than only from main memory. Series "implement "memmap on memory" feature on s390".
  - More folio conversions from Matthew Wilcox in the series "Convert memcontrol charge moving to use folios" and "mm: convert mm counter to take a folio".
  - Chengming Zhou has optimized zswap's rbtree locking, providing significant reductions in system time and modest but measurable reductions in overall runtimes. The series is "mm/zswap: optimize the scalability of zswap rb-tree".
  - Chengming Zhou has also provided the series "mm/zswap: optimize zswap lru list" which provides measurable runtime benefits in some swap-intensive situations.
  - And Chengming Zhou further optimizes zswap in the series "mm/zswap: optimize for dynamic zswap_pools". Measured improvements are modest.
  - zswap cleanups and simplifications from Yosry Ahmed in the series "mm: zswap: simplify zswap_swapoff()".
  - In the series "Add DAX ABI for memmap_on_memory", Vishal Verma has contributed several DAX cleanups as well as adding a sysfs tunable to control the memmap_on_memory setting when the dax device is hotplugged as system memory.
  - Johannes Weiner has added the large series "mm: zswap: cleanups", which does that.
  - More DAMON work from SeongJae Park in the series "mm/damon: make DAMON debugfs interface deprecation unignorable", "selftests/damon: add more tests for core functionalities and corner cases", "Docs/mm/damon: misc readability improvements" and "mm/damon: let DAMOS feeds and tame/auto-tune itself".
  - In the series "mm/mempolicy: weighted interleave mempolicy and sysfs extension" Rakie Kim has developed a new mempolicy interleaving policy wherein we allocate memory across nodes in a weighted fashion rather than uniformly. This is beneficial in heterogeneous memory environments appearing with CXL.
  - Christophe Leroy has contributed some cleanup and consolidation work against the ARM pagetable dumping code in the series "mm: ptdump: Refactor CONFIG_DEBUG_WX and check_wx_pages debugfs attribute".
  - Luis Chamberlain has added some additional xarray selftesting in the series "test_xarray: advanced API multi-index tests".
  - Muhammad Usama Anjum has reworked the selftest code to make its human-readable output conform to the TAP ("Test Anything Protocol") format. Amongst other things, this opens up the use of third-party tools to parse and process our selftesting results.
  - Ryan Roberts has added fork()-time PTE batching of THP ptes in the series "mm/memory: optimize fork() with PTE-mapped THP". Mainly targeted at arm64, this significantly speeds up fork() when the process has a large number of pte-mapped folios.
  - David Hildenbrand also gets in on the THP pte batching game in his series "mm/memory: optimize unmap/zap with PTE-mapped THP". It implements batching during munmap() and other pte teardown situations. The microbenchmark improvements are nice.
  - And in the series "Transparent Contiguous PTEs for User Mappings" Ryan Roberts further utilizes arm's pte's contiguous bit ("contpte mappings"). Kernel build times on arm64 improved nicely. Ryan's series "Address some contpte nits" provides some followup work.
  - In the series "mm/hugetlb: Restore the reservation" Breno Leitao has fixed an obscure hugetlb race which was causing unnecessary page faults. He has also added a reproducer under the selftest code.
  - In the series "selftests/mm: Output cleanups for the compaction test", Mark Brown did what the title claims.
  - Kinsey Ho has added the series "mm/mglru: code cleanup and refactoring".
  - Even more zswap material from Nhat Pham. The series "fix and extend zswap kselftests" does as claimed.
  - In the series "Introduce cpu_dcache_is_aliasing() to fix DAX regression" Mathieu Desnoyers has cleaned up and fixed rather a mess in our handling of DAX on architectures which have virtually aliasing data caches. The arm architecture is the main beneficiary.
  - Lokesh Gidra's series "per-vma locks in userfaultfd" provides dramatic improvements in worst-case mmap_lock hold times during certain userfaultfd operations.
  - Some page_owner enhancements and maintenance work from Oscar Salvador in his series "page_owner: print stacks and their outstanding allocations" and "page_owner: Fixup and cleanup".
  - Uladzislau Rezki has contributed some vmalloc scalability improvements in his series "Mitigate a vmap lock contention". It realizes a 12x improvement for a certain microbenchmark.
  - Some kexec/crash cleanup work from Baoquan He in the series "Split crash out from kexec and clean up related config items".
  - Some zsmalloc maintenance work from Chengming Zhou in the series "mm/zsmalloc: fix and optimize objects/page migration" and "mm/zsmalloc: some cleanup for get/set_zspage_mapping()".
  - Zi Yan has taught the MM to perform compaction on folios larger than order=0. This is a step along the path to implementation of the merging of large anonymous folios. The series is named "Enable >0 order folio memory compaction".
  - Christoph Hellwig has done quite a lot of cleanup work in the pagecache writeback code in his series "convert write_cache_pages() to an iterator".
  - Some modest hugetlb cleanups and speedups in Vishal Moola's series "Handle hugetlb faults under the VMA lock".
  - Zi Yan has changed the page splitting code so we can split huge pages into sizes other than order-0 to better utilize large folios. The series is named "Split a folio to any lower order folios".
  - David Hildenbrand has contributed the series "mm: remove total_mapcount()", a cleanup.
  - Matthew Wilcox has sought to improve the performance of bulk memory freeing in his series "Rearrange batched folio freeing".
  - Gang Li's series "hugetlb: parallelize hugetlb page init on boot" provides large improvements in bootup times on large machines which are configured to use large numbers of hugetlb pages.
  - Matthew Wilcox's series "PageFlags cleanups" does that.
  - Qi Zheng's series "minor fixes and supplement for ptdesc" does that also. S390 is affected.
  - Cleanups to our pagemap utility functions from Peter Xu in his series "mm/treewide: Replace pXd_large() with pXd_leaf()".
  - Nico Pache has fixed a few things with our hugepage selftests in his series "selftests/mm: Improve Hugepage Test Handling in MM Selftests".
  - Also, of course, many singleton patches to many things. Please see the individual changelogs for details.

  * tag 'mm-stable-2024-03-13-20-04' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (435 commits)
    mm/zswap: remove the memcpy if acomp is not sleepable
    crypto: introduce: acomp_is_async to expose if comp drivers might sleep
    memtest: use {READ,WRITE}_ONCE in memory scanning
    mm: prohibit the last subpage from reusing the entire large folio
    mm: recover pud_leaf() definitions in nopmd case
    selftests/mm: skip the hugetlb-madvise tests on unmet hugepage requirements
    selftests/mm: skip uffd hugetlb tests with insufficient hugepages
    selftests/mm: dont fail testsuite due to a lack of hugepages
    mm/huge_memory: skip invalid debugfs new_order input for folio split
    mm/huge_memory: check new folio order when split a folio
    mm, vmscan: retry kswapd's priority loop with cache_trim_mode off on failure
    mm: add an explicit smp_wmb() to UFFDIO_CONTINUE
    mm: fix list corruption in put_pages_list
    mm: remove folio from deferred split list before uncharging it
    filemap: avoid unnecessary major faults in filemap_fault()
    mm,page_owner: drop unnecessary check
    mm,page_owner: check for null stack_record before bumping its refcount
    mm: swap: fix race between free_swap_and_cache() and swapoff()
    mm/treewide: align up pXd_leaf() retval across archs
    mm/treewide: drop pXd_large()
    ...

| * nvdimm/pmem: Treat alloc_dax() -EOPNOTSUPP failure as non-fatal (Mathieu Desnoyers, 2024-02-23; 1 file, -10/+12)

  In preparation for checking whether the architecture has data cache aliasing within alloc_dax(), modify the error handling of nvdimm/pmem pmem_attach_disk() to treat alloc_dax() -EOPNOTSUPP failure as non-fatal.

  [ Based on commit "nvdimm/pmem: Fix leak on dax_add_host() failure". ]

  Link: https://lkml.kernel.org/r/20240215144633.96437-4-mathieu.desnoyers@efficios.com
  Fixes: d92576f1167c ("dax: does not work correctly with virtual aliasing caches")
  Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Cc: Alasdair Kergon <agk@redhat.com>
  Cc: Mike Snitzer <snitzer@kernel.org>
  Cc: Mikulas Patocka <mpatocka@redhat.com>
  Cc: Dan Williams <dan.j.williams@intel.com>
  Cc: Vishal Verma <vishal.l.verma@intel.com>
  Cc: Dave Jiang <dave.jiang@intel.com>
  Cc: Matthew Wilcox <willy@infradead.org>
  Cc: Arnd Bergmann <arnd@arndb.de>
  Cc: Russell King <linux@armlinux.org.uk>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Dave Chinner <david@fromorbit.com>
  Cc: Heiko Carstens <hca@linux.ibm.com>
  Cc: kernel test robot <lkp@intel.com>
  Cc: Michael Sclafani <dm-devel@lists.linux.dev>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

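  In practice this means pmem_attach_disk() keeps going without a dax_device when alloc_dax() reports that the architecture cannot support DAX; a sketch of the assumed shape of the error handling:

	dax_dev = alloc_dax(pmem, &pmem_dax_ops);
	if (IS_ERR(dax_dev)) {
		rc = PTR_ERR(dax_dev);
		if (rc != -EOPNOTSUPP)		/* only real failures are fatal */
			goto out;
	} else {
		/* dax_add_host(), write-cache setup, etc. */
		pmem->dax_dev = dax_dev;
	}
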
| * nvdimm/pmem: fix leak on dax_add_host() failure (Mathieu Desnoyers, 2024-02-23; 1 file, -2/+1)

  Fix a leak on dax_add_host() error, where "goto out_cleanup_dax" is done before setting pmem->dax_dev, which therefore issues the two following calls on NULL pointers:

    out_cleanup_dax:
            kill_dax(pmem->dax_dev);
            put_dax(pmem->dax_dev);

  Link: https://lkml.kernel.org/r/20240208184913.484340-1-mathieu.desnoyers@efficios.com
  Link: https://lkml.kernel.org/r/20240208184913.484340-2-mathieu.desnoyers@efficios.com
  Signed-off-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Reviewed-by: Fan Ni <fan.ni@samsung.com>
  Reviewed-by: Dave Jiang <dave.jiang@intel.com>
  Cc: Alasdair Kergon <agk@redhat.com>
  Cc: Mike Snitzer <snitzer@kernel.org>
  Cc: Mikulas Patocka <mpatocka@redhat.com>
  Cc: Dan Williams <dan.j.williams@intel.com>
  Cc: Matthew Wilcox <willy@infradead.org>
  Cc: Arnd Bergmann <arnd@arndb.de>
  Cc: Russell King <linux@armlinux.org.uk>
  Cc: Dave Chinner <david@fromorbit.com>
  Cc: Linus Torvalds <torvalds@linux-foundation.org>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

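  One way to get that ordering right is to record the pointer before the call that can fail, so the error path's kill_dax()/put_dax() operate on a valid device. A sketch only; the actual patch may order things differently:

	pmem->dax_dev = dax_dev;
	rc = dax_add_host(dax_dev, disk);
	if (rc)
		goto out_cleanup_dax;	/* kill_dax()/put_dax() now see a valid pointer */
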
* | pmem: pass queue_limits to blk_mq_alloc_disk (Christoph Hellwig, 2024-02-20; 1 file, -4/+6)

  Pass the queue limits directly to blk_alloc_disk instead of setting them one at a time.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Chaitanya Kulkarni <kch@nvidia.com>
  Reviewed-by: Dave Jiang <dave.jiang@intel.com>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
  Link: https://lore.kernel.org/r/20240215071055.2201424-9-hch@lst.de
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

* | block: pass a queue_limits argument to blk_alloc_disk (Christoph Hellwig, 2024-02-20; 1 file, -3/+3)

  Pass a queue_limits to blk_alloc_disk and apply it if non-NULL. This will allow allocating queues with valid queue limits instead of setting the values one at a time later.

  Also change blk_alloc_disk to return an ERR_PTR instead of just NULL which can't distinguish errors.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Reviewed-by: Himanshu Madhani <himanshu.madhani@oracle.com>
  Link: https://lore.kernel.org/r/20240215071055.2201424-2-hch@lst.de
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

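  Callers therefore move from configuring the queue after allocation to passing the limits in, and must check for an ERR_PTR. A sketch of what the pmem call site could look like under these assumptions (field values illustrative):

	struct queue_limits lim = {
		.logical_block_size	= pmem_sector_size(ndns),
		.physical_block_size	= PAGE_SIZE,
	};
	struct gendisk *disk = blk_alloc_disk(&lim, nid);

	if (IS_ERR(disk))
		return PTR_ERR(disk);
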
* nvdimm: Use kstrtobool() instead of strtobool() (Christophe JAILLET, 2023-07-19; 1 file, -1/+2)

  strtobool() is the same as kstrtobool(). However, the latter is more used within the kernel.

  In order to remove strtobool() and slightly simplify kstrtox.h, switch to the other function name.

  While at it, include the corresponding header file (<linux/kstrtox.h>).

  Reviewed-by: Vishal Verma <vishal.l.verma@intel.com>
  Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
  Signed-off-by: Dave Jiang <dave.jiang@intel.com>

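  The conversion is mechanical; in a sysfs store handler it reads roughly like this sketch:

	/* <linux/kstrtox.h> now provides the helper */
	bool val;
	int rc = kstrtobool(buf, &val);	/* was: strtobool(buf, &val) */

	if (rc)
		return rc;
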
* dax: enable dax fault handler to report VM_FAULT_HWPOISON (Jane Chu, 2023-06-26; 1 file, -1/+1)

  When multiple processes mmap() a dax file, then at some point, a process issues a 'load' and consumes a hwpoison, the process receives a SIGBUS with si_code = BUS_MCEERR_AR and with si_lsb set for the poison scope. Soon after, when any other process issues a 'load' to the poisoned page (that is unmapped from the kernel side by memory_failure), it receives a SIGBUS with si_code = BUS_ADRERR and without a valid si_lsb. This is confusing to the user, is different from a page fault due to poison in RAM memory, and loses some helpful information.

  Channel the dax backend driver's poison detection to the filesystem such that instead of reporting VM_FAULT_SIGBUS, it could report VM_FAULT_HWPOISON.

  If user level block IO syscalls fail due to poison, the errno will be converted to EIO to maintain block API consistency.

  Signed-off-by: Jane Chu <jane.chu@oracle.com>
  Link: https://lore.kernel.org/r/20230615181325.1327259-2-jane.chu@oracle.com
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>

* block: remove ->rw_page (Christoph Hellwig, 2023-02-03; 1 file, -23/+1)

  The ->rw_page method is a special purpose bypass of the usual bio handling path that is limited to single-page reads and writes and is synchronous, which causes a lot of extra code in the drivers, callers and the block layer.

  The only remaining user is the MM swap code. Switch that swap code to simply submit a single-vec on-stack bio and synchronously wait on it based on a newly added QUEUE_FLAG_SYNCHRONOUS flag set by the drivers that currently implement ->rw_page instead. While this touches one extra cache line and executes extra code, it simplifies the block layer and drivers and ensures that all features are properly supported by all drivers, e.g. right now ->rw_page bypassed cgroup writeback entirely.

  [akpm@linux-foundation.org: fix comment typo, per Dan]

  Link: https://lkml.kernel.org/r/20230125133436.447864-8-hch@lst.de
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Cc: Dave Jiang <dave.jiang@intel.com>
  Cc: Ira Weiny <ira.weiny@intel.com>
  Cc: Jens Axboe <axboe@kernel.dk>
  Cc: Keith Busch <kbusch@kernel.org>
  Cc: Minchan Kim <minchan@kernel.org>
  Cc: Sergey Senozhatsky <senozhatsky@chromium.org>
  Cc: Vishal Verma <vishal.l.verma@intel.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

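  For pmem the replacement is a one-time queue flag at attach time rather than a ->rw_page implementation; a sketch, with the exact call site assumed (q being the pmem request_queue):

	blk_queue_flag_set(QUEUE_FLAG_SYNCHRONOUS, q);
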
* Merge branch 'for-6.0/dax' into libnvdimm-fixes (Dan Williams, 2022-09-25; 1 file, -3/+20)

  Pick up another "Soft Reservation" fix for v6.0-final on top of some straggling nvdimm fixes that missed v5.19.

| * Merge tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm (Linus Torvalds, 2022-08-06; 1 file, -0/+17)

  Pull MM updates from Andrew Morton:

  "Most of the MM queue. A few things are still pending.

  Liam's maple tree rework didn't make it. This has resulted in a few other minor patch series being held over for next time.

  Multi-gen LRU still isn't merged as we were waiting for mapletree to stabilize. The current plan is to merge MGLRU into -mm soon and to later reintroduce mapletree, with a view to hopefully getting both into 6.1-rc1.

  Summary:

  - The usual batches of cleanups from Baoquan He, Muchun Song, Miaohe Lin, Yang Shi, Anshuman Khandual and Mike Rapoport
  - Some kmemleak fixes from Patrick Wang and Waiman Long
  - DAMON updates from SeongJae Park
  - memcg debug/visibility work from Roman Gushchin
  - vmalloc speedup from Uladzislau Rezki
  - more folio conversion work from Matthew Wilcox
  - enhancements for coherent device memory mapping from Alex Sierra
  - addition of shared pages tracking and CoW support for fsdax, from Shiyang Ruan
  - hugetlb optimizations from Mike Kravetz
  - Mel Gorman has contributed some pagealloc changes to improve latency and realtime behaviour.
  - mprotect soft-dirty checking has been improved by Peter Xu
  - Many other singleton patches all over the place"

  [ XFS merge from hell as per Darrick Wong in https://lore.kernel.org/all/YshKnxb4VwXycPO8@magnolia/ ]

  * tag 'mm-stable-2022-08-03' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (282 commits)
    tools/testing/selftests/vm/hmm-tests.c: fix build
    mm: Kconfig: fix typo
    mm: memory-failure: convert to pr_fmt()
    mm: use is_zone_movable_page() helper
    hugetlbfs: fix inaccurate comment in hugetlbfs_statfs()
    hugetlbfs: cleanup some comments in inode.c
    hugetlbfs: remove unneeded header file
    hugetlbfs: remove unneeded hugetlbfs_ops forward declaration
    hugetlbfs: use helper macro SZ_1{K,M}
    mm: cleanup is_highmem()
    mm/hmm: add a test for cross device private faults
    selftests: add soft-dirty into run_vmtests.sh
    selftests: soft-dirty: add test for mprotect
    mm/mprotect: fix soft-dirty check in can_change_pte_writable()
    mm: memcontrol: fix potential oom_lock recursion deadlock
    mm/gup.c: fix formatting in check_and_migrate_movable_page()
    xfs: fail dax mount if reflink is enabled on a partition
    mm/memcontrol.c: remove the redundant updating of stats_flush_threshold
    userfaultfd: don't fail on unrecognized features
    hugetlb_cgroup: fix wrong hugetlb cgroup numa stat
    ...

| | * pagemap,pmem: introduce ->memory_failure() (Shiyang Ruan, 2022-07-18; 1 file, -0/+17)

  When memory-failure occurs, we call this function, which is implemented by each kind of device. For the fsdax case, the pmem device driver implements it. The pmem device driver will find the filesystem in which the corrupted page is located.

  With dax_holder notify support, we are able to notify the memory failure from the pmem driver to upper layers. If there is something not supported in the notify routine, memory_failure will fall back to the generic handler.

  Link: https://lkml.kernel.org/r/20220603053738.1218681-4-ruansy.fnst@fujitsu.com
  Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  Reviewed-by: Naoya Horiguchi <naoya.horiguchi@nec.com>
  Cc: Al Viro <viro@zeniv.linux.org.uk>
  Cc: Dan Williams <dan.j.wiliams@intel.com>
  Cc: Dave Chinner <david@fromorbit.com>
  Cc: Goldwyn Rodrigues <rgoldwyn@suse.com>
  Cc: Goldwyn Rodrigues <rgoldwyn@suse.de>
  Cc: Jane Chu <jane.chu@oracle.com>
  Cc: Matthew Wilcox <willy@infradead.org>
  Cc: Miaohe Lin <linmiaohe@huawei.com>
  Cc: Ritesh Harjani <riteshh@linux.ibm.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>

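  The hook lives in the pmem driver's dev_pagemap_ops and forwards the failing range to the dax holder (the filesystem). A sketch with assumed names and offset arithmetic, not the verbatim patch:

	static int pmem_pagemap_memory_failure(struct dev_pagemap *pgmap,
			unsigned long pfn, unsigned long nr_pages, int mf_flags)
	{
		struct pmem_device *pmem =
			container_of(pgmap, struct pmem_device, pgmap);
		u64 offset = PFN_PHYS(pfn) - pmem->phys_addr - pmem->data_offset;

		/* let the filesystem holding this dax device handle the range */
		return dax_holder_notify_failure(pmem->dax_dev, offset,
				nr_pages << PAGE_SHIFT, mf_flags);
	}

	static const struct dev_pagemap_ops fsdax_pagemap_ops = {
		.memory_failure	= pmem_pagemap_memory_failure,
	};
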
| * | block: Change the type of the last .rw_page() argument (Bart Van Assche, 2022-07-14; 1 file, -1/+1)

  All .rw_page() callers pass an enum req_op value as last argument. Make this explicit by changing the type of the last argument into enum req_op. See also commit 3f289dcb4b26 ("block: make bdev_ops->rw_page() take a REQ_OP instead of bool").

  Cc: Tejun Heo <tj@kernel.org>
  Cc: Minchan Kim <minchan@kernel.org>
  Cc: Dan Williams <dan.j.williams@intel.com>
  Cc: Hannes Reinecke <hare@suse.de>
  Cc: Damien Le Moal <damien.lemoal@wdc.com>
  Cc: Johannes Thumshirn <johannes.thumshirn@wdc.com>
  Signed-off-by: Bart Van Assche <bvanassche@acm.org>
  Link: https://lore.kernel.org/r/20220714180729.1065367-4-bvanassche@acm.org
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

| * | block: remove blk_cleanup_disk (Christoph Hellwig, 2022-06-28; 1 file, -2/+2)

  blk_cleanup_disk is nothing but a trivial wrapper for put_disk now, so remove it.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Hannes Reinecke <hare@suse.de>
  Link: https://lore.kernel.org/r/20220619060552.1850436-7-hch@lst.de
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

* / pmem: fix a name collision (Jane Chu, 2022-07-13; 1 file, -3/+3)

  Kernel test robot detected a name collision when compiled on the 'um' architecture. Rename "to_phys()" to "pmem_to_phys()".

    >> drivers/nvdimm/pmem.c:48:20: error: conflicting types for 'to_phys'; have 'phys_addr_t(struct pmem_device *, phys_addr_t)' {aka 'long long unsigned int(struct pmem_device *, long long unsigned int)'}
       48 | static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
          |                    ^~~~~~~
    In file included from arch/um/include/asm/page.h:98, from arch/um/include/asm/thread_info.h:15, from include/linux/thread_info.h:60, from include/asm-generic/preempt.h:5, from ./arch/um/include/generated/asm/preempt.h:1,
    arch/um/include/shared/mem.h:12:29: note: previous definition of 'to_phys' with type 'long unsigned int(void *)'
       12 | static inline unsigned long to_phys(void *virt)
          |                             ^~~~~~~

    vim +48 drivers/nvdimm/pmem.c
       47
     > 48  static phys_addr_t to_phys(struct pmem_device *pmem, phys_addr_t offset)
       49  {
       50          return pmem->phys_addr + offset;
       51  }
       52

  Fixes: 9409c9b6709e (pmem: refactor pmem_clear_poison())
  Reported-by: kernel test robot <lkp@intel.com>
  Signed-off-by: Jane Chu <jane.chu@oracle.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Link: https://lore.kernel.org/r/20220630182802.3250449-1-jane.chu@oracle.com
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* Merge tag 'cxl-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl (Linus Torvalds, 2022-05-28; 1 file, -1/+1)

  Pull cxl updates from Dan Williams:

  "Compute Express Link (CXL) updates for this cycle. The highlight is new driver-core infrastructure and CXL subsystem changes for allowing lockdep to validate device_lock() usage. Thanks to PeterZ for setting me straight on the current capabilities of the lockdep API, and Greg acked it as well.

  On the CXL ACPI side this update adds support for CXL _OSC so that platform firmware knows that it is safe to still grant Linux native control of PCIe hotplug and error handling in the presence of CXL devices. A circular dependency problem was discovered between suspend and CXL memory for cases where the suspend image might be stored in CXL memory where that image also contains the PCI register state to restore to re-enable the device. Disable suspend for now until an architecture is defined to clarify that conflict.

  Lastly a collection of reworks, fixes, and cleanups to the CXL subsystem where support for snooping mailbox commands and properly handling the "mem_enable" flow are the highlights.

  Summary:

  - Add driver-core infrastructure for lockdep validation of device_lock(), and fixup a deadlock report that was previously hidden behind the 'lockdep no validate' policy.
  - Add CXL _OSC support for claiming native control of CXL hotplug and error handling.
  - Disable suspend in the presence of CXL memory unless and until a protocol is identified for restoring PCI device context from memory hosted on CXL PCI devices.
  - Add support for snooping CXL mailbox commands to protect against inopportune changes, like set-partition with the 'immediate' flag set.
  - Rework how the driver detects legacy CXL 1.1 configurations (CXL DVSEC / 'mem_enable') before enabling new CXL 2.0 decode configurations (CXL HDM Capability).
  - Miscellaneous cleanups and fixes from -next exposure"

  * tag 'cxl-for-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cxl/cxl: (47 commits)
    cxl/port: Enable HDM Capability after validating DVSEC Ranges
    cxl/port: Reuse 'struct cxl_hdm' context for hdm init
    cxl/port: Move endpoint HDM Decoder Capability init to port driver
    cxl/pci: Drop @info argument to cxl_hdm_decode_init()
    cxl/mem: Merge cxl_dvsec_ranges() and cxl_hdm_decode_init()
    cxl/mem: Skip range enumeration if mem_enable clear
    cxl/mem: Consolidate CXL DVSEC Range enumeration in the core
    cxl/pci: Move cxl_await_media_ready() to the core
    cxl/mem: Validate port connectivity before dvsec ranges
    cxl/mem: Fix cxl_mem_probe() error exit
    cxl/pci: Drop wait_for_valid() from cxl_await_media_ready()
    cxl/pci: Consolidate wait_for_media() and wait_for_media_ready()
    cxl/mem: Drop mem_enabled check from wait_for_media()
    nvdimm: Fix firmware activation deadlock scenarios
    device-core: Kill the lockdep_mutex
    nvdimm: Drop nd_device_lock()
    ACPI: NFIT: Drop nfit_device_lock()
    nvdimm: Replace lockdep_mutex with local lock classes
    cxl: Drop cxl_device_lock()
    cxl/acpi: Add root device lockdep validation
    ...

| * nvdimm: Drop nd_device_lock() (Dan Williams, 2022-04-28; 1 file, -1/+1)

  Now that all NVDIMM subsystem locking is validated with custom lock classes, there is no need for the custom usage of the lockdep_mutex.

  Cc: Vishal Verma <vishal.l.verma@intel.com>
  Cc: Dave Jiang <dave.jiang@intel.com>
  Cc: Ira Weiny <ira.weiny@intel.com>
  Reviewed-by: Ira Weiny <ira.weiny@intel.com>
  Link: https://lore.kernel.org/r/165055521979.3745911.10751769706032029999.stgit@dwillia2-desk3.amr.corp.intel.com
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* | pmem: implement pmem_recovery_write() (Jane Chu, 2022-05-16; 1 file, -8/+79)

  The recovery write thread started out as a normal pwrite thread; when the filesystem was told about a potential media error in the range, the filesystem turns the normal pwrite into a dax_recovery_write.

  The recovery write consists of clearing the media poison, clearing the page HWPoison bit, re-enabling page-wide read-write permission, flushing the caches and finally writing. A competing pread thread will be held off during the recovery process since data read back might not be valid, and this is achieved by clearing the badblock records after the recovery write is complete. Competing recovery write threads are already serialized by the writer lock held by dax_iomap_rw().

  Signed-off-by: Jane Chu <jane.chu@oracle.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Link: https://lore.kernel.org/r/165247997655.53156.8381418704988035976.stgit@dwillia2-desk3.amr.corp.intel.com
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* | pmem: refactor pmem_clear_poison() (Jane Chu, 2022-05-16; 1 file, -25/+48)

  Refactor the pmem_clear_poison() function such that the common shared code between the typical write path and the recovery write path is factored out.

  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Signed-off-by: Jane Chu <jane.chu@oracle.com>
  Link: https://lore.kernel.org/r/20220422224508.440670-7-jane.chu@oracle.com
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* | dax: add .recovery_write dax_operation (Jane Chu, 2022-05-16; 1 file, -0/+7)

  Introduce the dax_recovery_write() operation. The function is used to recover a dax range that contains poison. The typical use case is when a user process receives a SIGBUS with si_code BUS_MCEERR_AR indicating poison(s) in a dax range; in response, the user process issues a pwrite() to the page-aligned dax range, thus clearing the poison and putting valid data in the range.

  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Jane Chu <jane.chu@oracle.com>
  Link: https://lore.kernel.org/r/20220422224508.440670-6-jane.chu@oracle.com
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* | dax: introduce DAX_RECOVERY_WRITE dax access mode (Jane Chu, 2022-05-16; 1 file, -3/+5)

  Up till now, dax_direct_access() is used implicitly for normal access, but for the purpose of recovery write, a dax range with poison is requested. To make the interface clear, introduce

    enum dax_access_mode {
            DAX_ACCESS,
            DAX_RECOVERY_WRITE,
    }

  where DAX_ACCESS is used for normal dax access, and DAX_RECOVERY_WRITE is used for dax recovery write.

  Suggested-by: Dan Williams <dan.j.williams@intel.com>
  Signed-off-by: Jane Chu <jane.chu@oracle.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Cc: Mike Snitzer <snitzer@redhat.com>
  Reviewed-by: Vivek Goyal <vgoyal@redhat.com>
  Link: https://lore.kernel.org/r/165247982851.52965.11024212198889762949.stgit@dwillia2-desk3.amr.corp.intel.com
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

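  Concretely, the mode becomes an argument threaded through dax_direct_access() down to the driver's handler; a sketch of the resulting pmem prototype (helper names assumed):

	static long pmem_dax_direct_access(struct dax_device *dax_dev,
			pgoff_t pgoff, long nr_pages, enum dax_access_mode mode,
			void **kaddr, pfn_t *pfn)
	{
		struct pmem_device *pmem = dax_get_private(dax_dev);

		return __pmem_direct_access(pmem, pgoff, nr_pages, mode, kaddr, pfn);
	}
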
* | mce: fix set_mce_nospec to always unmap the whole page (Jane Chu, 2022-05-16; 1 file, -23/+7)

  The set_memory_uc() approach doesn't work well in all cases. As Dan pointed out when "The VMM unmapped the bad page from guest physical space and passed the machine check to the guest." "The guest gets virtual #MC on an access to that page. When the guest tries to do set_memory_uc() and instructs cpa_flush() to do clean caches that results in taking another fault / exception perhaps because the VMM unmapped the page from the guest."

  Since the driver has special knowledge to handle NP or UC, mark the poisoned page with NP and let the driver handle it when it comes down to repair. Please refer to discussions here for more details:

    https://lore.kernel.org/all/CAPcyv4hrXPb1tASBZUg-GgdVs0OOFKXMXLiHmktg_kFi7YBMyQ@mail.gmail.com/

  Now since the poisoned page is marked as not-present, in order to avoid writing to a not-present page and triggering a kernel Oops, also fix pmem_do_write().

  Fixes: 284ce4011ba6 ("x86/memory_failure: Introduce {set, clear}_mce_nospec()")
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Signed-off-by: Jane Chu <jane.chu@oracle.com>
  Acked-by: Tony Luck <tony.luck@intel.com>
  Link: https://lore.kernel.org/r/165272615484.103830.2563950688772226611.stgit@dwillia2-desk3.amr.corp.intel.com
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* dax: remove the copy_from_iter and copy_to_iter methods (Christoph Hellwig, 2021-12-18; 1 file, -18/+2)

  These methods indirect the actual DAX read/write path. In the end pmem uses magic flush and mc safe variants and fuse and dcssblk use plain ones while device mapper picks redirects to the underlying device.

  Add set_dax_nocache() and set_dax_nomc() APIs to control which copy routines are used to remove indirect call from the read/write fast path as well as a lot of boilerplate code.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Vivek Goyal <vgoyal@redhat.com> [virtiofs]
  Link: https://lore.kernel.org/r/20211215084508.435401-5-hch@lst.de
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

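  The new APIs are opt-outs called right after alloc_dax(); pmem keeps the default flushing and machine-check-safe copies, while a driver that only needs plain memcpy would do something like this sketch:

	/* after a successful alloc_dax() */
	set_dax_nocache(dax_dev);	/* use plain memcpy, no cache flushing */
	set_dax_nomc(dax_dev);		/* skip the machine-check-safe copy variants */
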
* dax: remove the DAXDEV_F_SYNC flag (Christoph Hellwig, 2021-12-18; 1 file, -4/+3)

  Remove the DAXDEV_F_SYNC flag and thus the flags argument to alloc_dax and just let the drivers call set_dax_synchronous directly.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Pankaj Gupta <pankaj.gupta@ionos.com>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Link: https://lore.kernel.org/r/20211215084508.435401-4-hch@lst.de
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* uio: remove copy_from_iter_flushcache() and copy_mc_to_iter() (Christoph Hellwig, 2021-12-18; 1 file, -3/+1)

  These two wrappers are never used.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Link: https://lore.kernel.org/r/20211215084508.435401-2-hch@lst.de
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* dax: remove dax_capable (Christoph Hellwig, 2021-12-04; 1 file, -1/+0)

  Just open code the block size and dax_dev == NULL checks in the callers.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Acked-by: Mike Snitzer <snitzer@redhat.com>
  Reviewed-by: Gao Xiang <hsiangkao@linux.alibaba.com> [erofs]
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  Link: https://lore.kernel.org/r/20211129102203.2243509-9-hch@lst.de
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* dax: simplify the dax_device <-> gendisk association (Christoph Hellwig, 2021-12-04; 1 file, -2/+8)

  Replace the dax_host_hash with an xarray indexed by the pointer value of the gendisk, and require explicit calls from the block drivers that want to associate their gendisk with a dax_device.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Acked-by: Mike Snitzer <snitzer@redhat.com>
  Reviewed-by: Darrick J. Wong <djwong@kernel.org>
  Link: https://lore.kernel.org/r/20211129102203.2243509-5-hch@lst.de
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* Merge tag 'libnvdimm-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm (Linus Torvalds, 2021-11-10; 1 file, -0/+43)

  Pull libnvdimm update from Dan Williams:

  "A single cleanup that precedes some deeper PMEM/DAX reworks that did not settle in time for v5.16:

  - Continue the cleanup of the dax api in preparation for a dax-device block-device divorce"

  * tag 'libnvdimm-for-5.16' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    nvdimm/pmem: move dax_attribute_group from dax to pmem

| * nvdimm/pmem: move dax_attribute_group from dax to pmem (Christoph Hellwig, 2021-09-27; 1 file, -0/+43)

  dax_attribute_group is only used by the pmem driver, and can avoid the completely pointless lookup by the disk name if moved there. This leaves just a single caller of dax_get_by_host, so move dax_get_by_host into the same ifdef block as that caller.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Link: https://lore.kernel.org/r/20210922173431.2454024-3-hch@lst.de
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* | Merge tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block (Linus Torvalds, 2021-11-09; 1 file, -5/+16)

  Pull more block driver updates from Jens Axboe:

  - Last series adding error handling support for add_disk() in drivers. After this one, and once the SCSI side has been merged, we can finally annotate add_disk() as must_check. (Luis)
  - bcache fixes (Coly)
  - zram fixes (Ming)
  - ataflop locking fix (Tetsuo)
  - nbd fixes (Ye, Yu)
  - MD merge via Song
  - Cleanup (Yang)
  - sysfs fix (Guoqing)
  - Misc fixes (Geert, Wu, luo)

  * tag 'for-5.16/drivers-2021-11-09' of git://git.kernel.dk/linux-block: (34 commits)
    bcache: Revert "bcache: use bvec_virt"
    ataflop: Add missing semicolon to return statement
    floppy: address add_disk() error handling on probe
    ataflop: address add_disk() error handling on probe
    block: update __register_blkdev() probe documentation
    ataflop: remove ataflop_probe_lock mutex
    mtd/ubi/block: add error handling support for add_disk()
    block/sunvdc: add error handling support for add_disk()
    z2ram: add error handling support for add_disk()
    nvdimm/pmem: use add_disk() error handling
    nvdimm/pmem: cleanup the disk if pmem_release_disk() is yet assigned
    nvdimm/blk: add error handling support for add_disk()
    nvdimm/blk: avoid calling del_gendisk() on early failures
    nvdimm/btt: add error handling support for add_disk()
    nvdimm/btt: use goto error labels on btt_blk_init()
    loop: Remove duplicate assignments
    drbd: Fix double free problem in drbd_create_device
    nvdimm/btt: do not call del_gendisk() if not needed
    bcache: fix use-after-free problem in bcache_device_free()
    zram: replace fsync_bdev with sync_blockdev
    ...

| * | nvdimm/pmem: use add_disk() error handling (Luis Chamberlain, 2021-11-04; 1 file, -1/+7)

  Now that device_add_disk() supports returning an error, use that. We must unwind alloc_dax() on error.

  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
  Link: https://lore.kernel.org/r/20211103230437.1639990-7-mcgrof@kernel.org
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

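  With device_add_disk() now returning an error code, the pmem attach path can unwind the dax setup it has already done; a sketch, with the error-label name assumed:

	rc = device_add_disk(dev, disk, pmem_attribute_groups);
	if (rc)
		goto out_remove_host;	/* undo dax_add_host() / alloc_dax() */
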
| * | nvdimm/pmem: cleanup the disk if pmem_release_disk() is yet assigned (Luis Chamberlain, 2021-11-04; 1 file, -4/+9)

  Before devm is able to use pmem_release_disk() there are a few other failures that can occur, which we must account for by releasing the disk ourselves. Address those few cases.

  Fixes: 3dd60fb9d95d ("nvdimm/pmem: stop using q_usage_count as external pgmap refcount")
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Signed-off-by: Luis Chamberlain <mcgrof@kernel.org>
  Link: https://lore.kernel.org/r/20211103230437.1639990-6-mcgrof@kernel.org
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

* | | Merge tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block (Linus Torvalds, 2021-11-01; 1 file, -2/+1)

  Pull block updates from Jens Axboe:

  - mq-deadline accounting improvements (Bart)
  - blk-wbt timer fix (Andrea)
  - Untangle the block layer includes (Christoph)
  - Rework the poll support to be bio based, which will enable adding support for polling for bio based drivers (Christoph)
  - Block layer core support for multi-actuator drives (Damien)
  - blk-crypto improvements (Eric)
  - Batched tag allocation support (me)
  - Request completion batching support (me)
  - Plugging improvements (me)
  - Shared tag set improvements (John)
  - Concurrent queue quiesce support (Ming)
  - Cache bdev in ->private_data for block devices (Pavel)
  - bdev dio improvements (Pavel)
  - Block device invalidation and block size improvements (Xie)
  - Various cleanups, fixes, and improvements (Christoph, Jackie, Masahira, Tejun, Yu, Pavel, Zheng, me)

  * tag 'for-5.16/block-2021-10-29' of git://git.kernel.dk/linux-block: (174 commits)
    blk-mq-debugfs: Show active requests per queue for shared tags
    block: improve readability of blk_mq_end_request_batch()
    virtio-blk: Use blk_validate_block_size() to validate block size
    loop: Use blk_validate_block_size() to validate block size
    nbd: Use blk_validate_block_size() to validate block size
    block: Add a helper to validate the block size
    block: re-flow blk_mq_rq_ctx_init()
    block: prefetch request to be initialized
    block: pass in blk_mq_tags to blk_mq_rq_ctx_init()
    block: add rq_flags to struct blk_mq_alloc_data
    block: add async version of bio_set_polled
    block: kill DIO_MULTI_BIO
    block: kill unused polling bits in __blkdev_direct_IO()
    block: avoid extra iter advance with async iocb
    block: Add independent access ranges support
    blk-mq: don't issue request directly in case that current is to be blocked
    sbitmap: silence data race warning
    blk-cgroup: synchronize blkg creation against policy deactivation
    block: refactor bio_iov_bvec_set()
    block: add single bio async direct IO helper
    ...

| * | block: switch polling to be bio based (Christoph Hellwig, 2021-10-18; 1 file, -2/+1)

  Replace the blk_poll interface that requires the caller to keep a queue and cookie from the submissions with polling based on the bio.

  Polling for the bio itself leads to a few advantages:

  - the cookie construction can be made entirely private in blk-mq.c
  - the caller does not need to remember the request_queue and cookie separately and thus sidesteps their lifetime issues
  - keeping the device and the cookie inside the bio allows to trivially support polling BIOs remapping by stacking drivers
  - a lot of code to propagate the cookie back up the submission path can be removed entirely.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Tested-by: Mark Wunderlich <mark.wunderlich@intel.com>
  Link: https://lore.kernel.org/r/20211012111226.760968-15-hch@lst.de
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

* | | nvdimm/pmem: stop using q_usage_count as external pgmap refcount (Christoph Hellwig, 2021-10-26; 1 file, -31/+2)

  Originally all DAX access went through block_device operations and thus needed a queue reference. But since commit cccbce671582 ("filesystem-dax: convert to dax_direct_access()") all this happens at the DAX device level which uses its own refcounting. Having the external refcount thus wasn't needed but has otherwise been harmless for a long time.

  But now that "block: drain file system I/O on del_gendisk" waits for q_usage_count to reach 0 in del_gendisk this whole scheme can't work anymore (and pmem is the only driver abusing q_usage_count like that). So switch to the internal reference and remove the unbalanced blk_freeze_queue_start that is taken care of by del_gendisk.

  Fixes: 8e141f9eb803 ("block: drain file system I/O on del_gendisk")
  Reported-by: Yi Zhang <yi.zhang@redhat.com>
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Link: https://lore.kernel.org/r/20211019073641.2323410-2-hch@lst.de
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* / nvdimm/pmem: fix creating the dax group (Christoph Hellwig, 2021-09-27; 1 file, -4/+1)

  The recent block layer refactoring broke the way the pmem driver abused device_add_disk. Fix this by properly passing the attribute groups to device_add_disk.

  Fixes: 52b85909f85d ("block: fold register_disk into device_add_disk")
  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Ira Weiny <ira.weiny@intel.com>
  Reviewed-by: Dan Williams <dan.j.williams@intel.com>
  Tested-by: Darrick J. Wong <djwong@kernel.org>
  Link: https://lore.kernel.org/r/20210922173431.2454024-2-hch@lst.de
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* libnvdimm/pmem: Fix crash triggered when I/O in-flight during unbind (sumiyawang, 2021-08-24; 1 file, -2/+2)

  There is a use after free crash when the pmem driver tears down its mapping while I/O is still inbound. This is triggered by driver unbind, "ndctl destroy-namespace", while I/O is in flight. Fix the sequence of blk_cleanup_queue() vs memunmap().

  The crash signature is of the form:

    BUG: unable to handle page fault for address: ffffc90080200000
    CPU: 36 PID: 9606 Comm: systemd-udevd
    Call Trace:
     ? pmem_do_bvec+0xf9/0x3a0
     ? xas_alloc+0x55/0xd0
     pmem_rw_page+0x4b/0x80
     bdev_read_page+0x86/0xb0
     do_mpage_readpage+0x5d4/0x7a0
     ? lru_cache_add+0xe/0x10
     mpage_readpages+0xf9/0x1c0
     ? bd_link_disk_holder+0x1a0/0x1a0
     blkdev_readpages+0x1d/0x20
     read_pages+0x67/0x1a0

    ndctl Call Trace in vmcore:
    PID: 23473 TASK: ffff88c4fbbe8000 CPU: 1 COMMAND: "ndctl"
     __schedule
     schedule
     blk_mq_freeze_queue_wait
     blk_freeze_queue
     blk_cleanup_queue
     pmem_release_queue
     devm_action_release
     release_nodes
     devres_release_all
     device_release_driver_internal
     device_driver_detach
     unbind_store

  Cc: <stable@vger.kernel.org>
  Signed-off-by: sumiyawang <sumiyawang@tencent.com>
  Reviewed-by: yongduan <yongduan@tencent.com>
  Link: https://lore.kernel.org/r/1629632949-14749-1-git-send-email-sumiyawang@tencent.com
  Fixes: 50f44ee7248a ("mm/devm_memremap_pages: fix final page put race")
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* libnvdimm/pmem: Fix blk_cleanup_disk() usage (Dan Williams, 2021-06-09; 1 file, -3/+3)

  The queue_to_disk() helper can not be used after del_gendisk(), so communicate @disk via the pgmap->owner instead. Otherwise, queue_to_disk() returns NULL, resulting in the splat below.

    Kernel attempted to read user page (330) - exploit attempt? (uid: 0)
    BUG: Kernel NULL pointer dereference on read at 0x00000330
    Faulting instruction address: 0xc000000000906344
    Oops: Kernel access of bad area, sig: 11 [#1]
    [..]
    NIP [c000000000906344] pmem_pagemap_cleanup+0x24/0x40
    LR [c0000000004701d4] memunmap_pages+0x1b4/0x4b0
    Call Trace:
    [c000000022cbb9c0] [c0000000009063c8] pmem_pagemap_kill+0x28/0x40 (unreliable)
    [c000000022cbb9e0] [c0000000004701d4] memunmap_pages+0x1b4/0x4b0
    [c000000022cbba90] [c0000000008b28a0] devm_action_release+0x30/0x50
    [c000000022cbbab0] [c0000000008b39c8] release_nodes+0x2f8/0x3e0
    [c000000022cbbb60] [c0000000008ac440] device_release_driver_internal+0x190/0x2b0
    [c000000022cbbba0] [c0000000008a8450] unbind_store+0x130/0x170

  Reported-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
  Fixes: 87eb73b2ca7c ("nvdimm-pmem: convert to blk_alloc_disk/blk_cleanup_disk")
  Link: http://lore.kernel.org/r/DFB75BA8-603F-4A35-880B-C5B23EF8FA7D@linux.vnet.ibm.com
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Ulf Hansson <ulf.hansson@linaro.org>
  Cc: Jens Axboe <axboe@kernel.dk>
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Tested-by: Sachin Sant <sachinp@linux.vnet.ibm.com>
  Link: https://lore.kernel.org/r/162310994435.1571616.334551212901820961.stgit@dwillia2-desk3.amr.corp.intel.com
  [axboe: fold in compile warning fix]
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

* nvme-multipath: convert to blk_alloc_disk/blk_cleanup_disk (Christoph Hellwig, 2021-06-01; 1 file, -1/+0)

  Convert the nvme-multipath driver to use the blk_alloc_disk and blk_cleanup_disk helpers to simplify gendisk and request_queue allocation.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
  Link: https://lore.kernel.org/r/20210521055116.1053587-19-hch@lst.de
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

* nvdimm-pmem: convert to blk_alloc_disk/blk_cleanup_disk (Christoph Hellwig, 2021-06-01; 1 file, -10/+5)

  Convert the nvdimm-pmem driver to use the blk_alloc_disk and blk_cleanup_disk helpers to simplify gendisk and request_queue allocation.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
  Link: https://lore.kernel.org/r/20210521055116.1053587-18-hch@lst.de
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

* block: automatically enable GENHD_FL_EXT_DEVT (Christoph Hellwig, 2021-06-01; 1 file, -1/+0)

  Automatically set the GENHD_FL_EXT_DEVT flag for all disks allocated without an explicit number of minors. This is what all new block drivers should do, so make sure it is the default without boilerplate code.

  Signed-off-by: Christoph Hellwig <hch@lst.de>
  Reviewed-by: Hannes Reinecke <hare@suse.de>
  Reviewed-by: Luis Chamberlain <mcgrof@kernel.org>
  Reviewed-by: Ulf Hansson <ulf.hansson@linaro.org>
  Link: https://lore.kernel.org/r/20210521055116.1053587-4-hch@lst.de
  Signed-off-by: Jens Axboe <axboe@kernel.dk>

* include: remove pagemap.h from blkdev.h (Matthew Wilcox (Oracle), 2021-05-07; 1 file, -0/+1)

  My UEK-derived config has 1030 files depending on pagemap.h before this change. Afterwards, just 326 files need to be rebuilt when I touch pagemap.h. I think blkdev.h is probably included too widely, but untangling that dependency is harder and this solves my problem. x86 allmodconfig builds, but there may be implicit include problems on other architectures.

  Link: https://lkml.kernel.org/r/20210309195747.283796-1-willy@infradead.org
  Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org>
  Acked-by: Dan Williams <dan.j.williams@intel.com> [nvdimm]
  Acked-by: Jens Axboe <axboe@kernel.dk> [block]
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Acked-by: Coly Li <colyli@suse.de> [bcache]
  Acked-by: Martin K. Petersen <martin.petersen@oracle.com> [scsi]
  Reviewed-by: William Kucharski <william.kucharski@oracle.com>
  Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
  Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>

* libnvdimm: Notify disk drivers to revalidate region read-only (Dan Williams, 2021-04-09; 1 file, -4/+33)

  Previous kernels allowed BLKROSET to override the disk's read-only status. With that situation fixed the pmem driver needs to rely on notification events to reevaluate the disk read-only status after the host region has been marked read-write.

  Recall that when libnvdimm determines that the persistent memory has lost persistence (for example lack of energy to flush from DRAM to FLASH on an NVDIMM-N device) it marks the region read-only, but that state can be overridden by the user via:

    echo 0 > /sys/bus/nd/devices/regionX/read_only

  ...to date there is no notification that the region has restored persistence, so the user override is the only recovery.

  Fixes: 52f019d43c22 ("block: add a hard-readonly flag to struct gendisk")
  Reported-by: kernel test robot <lkp@intel.com>
  Reported-by: Vishal Verma <vishal.l.verma@intel.com>
  Tested-by: Vishal Verma <vishal.l.verma@intel.com>
  Reviewed-by: Christoph Hellwig <hch@lst.de>
  Cc: Christoph Hellwig <hch@lst.de>
  Cc: Ming Lei <ming.lei@redhat.com>
  Cc: Martin K. Petersen <martin.petersen@oracle.com>
  Cc: Hannes Reinecke <hare@suse.de>
  Cc: Jens Axboe <axboe@kernel.dk>
  Link: https://lore.kernel.org/r/161534060720.528671.2341213328968989192.stgit@dwillia2-desk3.amr.corp.intel.com
  Signed-off-by: Dan Williams <dan.j.williams@intel.com>

* Merge tag 'libnvdimm-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm (Linus Torvalds, 2021-02-24; 1 file, -3/+1)

  Pull libnvdimm and device-dax updates from Dan Williams:

  - Fix the error code polarity for the device-dax/mapping attribute
  - For the device-dax and libnvdimm bus implementations stop implementing a useless return code for the remove() callback.
  - Miscellaneous cleanups

  * tag 'libnvdimm-for-5.12' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
    dax-device: Make remove callback return void
    device-dax: Drop an empty .remove callback
    device-dax: Fix error path in dax_driver_register
    device-dax: Properly handle drivers without remove callback
    device-dax: Prevent registering drivers without probe callback
    libnvdimm: Make remove callback return void
    libnvdimm/dimm: Simplify nvdimm_remove()
    device-dax: Fix default return code of range_parse()