* Merge tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm  (Linus Torvalds, 2022-12-14; 237 files changed, -5047/+9281)

    Pull MM updates from Andrew Morton:

    - More userfaultfd work from Peter Xu
    - Several convert-to-folios series from Sidhartha Kumar and Huang Ying
    - Some filemap cleanups from Vishal Moola
    - David Hildenbrand added the ability to selftest anon memory COW handling
    - Some cpuset simplifications from Liu Shixin
    - Addition of vmalloc tracing support by Uladzislau Rezki
    - Some pagecache folioifications and simplifications from Matthew Wilcox
    - A pagemap cleanup from Kefeng Wang: we have VM_ACCESS_FLAGS, so use it
    - Miguel Ojeda contributed some cleanups for our use of the __no_sanitize_thread__ gcc keyword. This series should have been in the non-MM tree, my bad
    - Naoya Horiguchi improved the interaction between memory poisoning and memory section removal for huge pages
    - DAMON cleanups and tuneups from SeongJae Park
    - Tony Luck fixed the handling of COW faults against poisoned pages
    - Peter Xu utilized the PTE marker code for handling swapin errors
    - Hugh Dickins reworked compound page mapcount handling, simplifying it and making it more efficient
    - Removal of the autonuma savedwrite infrastructure from Nadav Amit and David Hildenbrand
    - zram support for multiple compression streams from Sergey Senozhatsky
    - David Hildenbrand reworked the GUP code's R/O long-term pinning so that drivers no longer need to use the FOLL_FORCE workaround which didn't work very well anyway
    - Mel Gorman altered the page allocator so that local IRQs can remain enabled during per-cpu page allocations
    - Vishal Moola removed the try_to_release_page() wrapper
    - Stefan Roesch added some per-BDI sysfs tunables which are used to prevent network block devices from dirtying excessive amounts of pagecache
    - David Hildenbrand did some cleanup and repair work on KSM COW breaking
    - Nhat Pham and Johannes Weiner have implemented writeback in zswap's zsmalloc backend
    - Brian Foster has fixed a longstanding corner-case oddity in file[map]_write_and_wait_range()
    - sparse-vmemmap changes for MIPS, LoongArch and NIOS2 from Feiyang Chen
    - Shiyang Ruan has done some work on fsdax, to make its reflink mode work better under xfstests. Better, but still not perfect
    - Christoph Hellwig has removed the .writepage() method from several filesystems. They only need .writepages()
    - Yosry Ahmed wrote a series which fixes the memcg reclaim target beancounting
    - David Hildenbrand has fixed some of our MM selftests for 32-bit machines
    - Many singleton patches, as usual

    * tag 'mm-stable-2022-12-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (313 commits)
        mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio
        mm: mmu_gather: allow more than one batch of delayed rmaps
        mm: fix typo in struct pglist_data code comment
        kmsan: fix memcpy tests
        mm: add cond_resched() in swapin_walk_pmd_entry()
        mm: do not show fs mm pc for VM_LOCKONFAULT pages
        selftests/vm: ksm_functional_tests: fixes for 32bit
        selftests/vm: cow: fix compile warning on 32bit
        selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions
        mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem
        mm,thp,rmap: fix races between updates of subpages_mapcount
        mm: memcg: fix swapcached stat accounting
        mm: add nodes= arg to memory.reclaim
        mm: disable top-tier fallback to reclaim on proactive reclaim
        selftests: cgroup: make sure reclaim target memcg is unprotected
        selftests: cgroup: refactor proactive reclaim code to reclaim_until()
        mm: memcg: fix stale protection of reclaim target memcg
        mm/mmap: properly unaccount memory on mas_preallocate() failure
        omfs: remove ->writepage
        jfs: remove ->writepage
        ...
| * mm/hugetlb: set head flag before setting compound_order in __prep_compound_gigantic_folio  (Sidhartha Kumar, 2022-12-13; 1 file changed, -2/+2)

    folio_set_compound_order() checks if the passed in folio is a large folio. A large folio is indicated by the PG_head flag. Call __folio_set_head() before setting the order.

    Link: https://lkml.kernel.org/r/20221212225529.22493-1-sidhartha.kumar@oracle.com
    Fixes: d1c6095572d0 ("mm/hugetlb: convert hugetlb prep functions to folios")
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reported-by: David Hildenbrand <david@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
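    As an illustration of the ordering the commit describes, a minimal sketch (not the exact mainline hunk) of the fixed sequence in __prep_compound_gigantic_folio():

        /* PG_head must be set first: folio_set_compound_order() expects a
         * folio that already tests as large (PG_head set). */
        __folio_set_head(folio);
        folio_set_compound_order(folio, order);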
| * mm: mmu_gather: allow more than one batch of delayed rmaps  (Linus Torvalds, 2022-12-12; 1 file changed, -16/+20)

    Commit 5df397dec7c4 ("mm: delay page_remove_rmap() until after the TLB has been flushed") limited the page batching for the mmu gather operation when a dirty shared page needed to delay rmap removal until after the TLB had been flushed.

    It did so because it needs to walk that array of pages while still holding the page table lock, and our mmu_gather infrastructure allows for batching quite a lot of pages. We may have thousands of pages queued up for freeing, and we wanted to walk only the last batch if we then added a dirty page to the queue.

    However, when I limited it to one batch, I didn't think of the degenerate case of the special first batch that is embedded on-stack in the mmu_gather structure (called "local") and that only has eight entries.

    So with the right pattern, that "limit delayed rmap to just one batch" will trigger over and over in that first small batch, and we'll waste a lot of time flushing TLB's every eight pages.

    And those right patterns are trivially triggered by just having shared mappings with lots of adjacent dirty pages. Like the 'page_fault3' subtest of the 'will-it-scale' benchmark, that just maps a shared area, dirties all pages, and unmaps it. Rinse and repeat.

    We still want to limit the batching, but to fix this (easily triggered) degenerate case, just expand the "only one batch" logic to instead be "only one batch that isn't the special first on-stack ('local') batch". That way, when we need to flush the delayed rmaps, we can still limit our walk to just the last batch - and that first small one.

    Link: https://lkml.kernel.org/r/CAHk-=whkL5aM1fR7kYUmhHQHBcMUc-bDoFP7EwYjTxy64DGtvw@mail.gmail.com
    Fixes: 5df397dec7c4 ("mm: delay page_remove_rmap() until after the TLB has been flushed")
    Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
    Reported-by: kernel test robot <yujie.liu@intel.com>
    Link: https://lore.kernel.org/oe-lkp/202212051534.852804af-yujie.liu@intel.com
    Tested-by: Huang, Ying <ying.huang@intel.com>
    Tested-by: Hugh Dickins <hughd@google.com>
    Cc: Feng Tang <feng.tang@intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Nadav Amit <nadav.amit@gmail.com>
    Cc: Xing Zhengjun <zhengjun.xing@linux.intel.com>
    Cc: "Yin, Fengwei" <fengwei.yin@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm: fix typo in struct pglist_data code comment  (Wang Yong, 2022-12-12; 1 file changed, -1/+1)

    Change "stat" to "start".

    Link: https://lkml.kernel.org/r/20221207074011.GA151242@cloud
    Fixes: c959924b0dc5 ("memory tiering: adjust hot threshold automatically")
    Signed-off-by: Wang Yong <yongw.kernel@gmail.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * kmsan: fix memcpy tests  (Alexander Potapenko, 2022-12-12; 1 file changed, -0/+3)

    Recent Clang changes may cause it to delete calls of memcpy(), if the source is an uninitialized volatile local. This happens because passing a pointer to a volatile local into memcpy() discards the volatile qualifier, giving the compiler a free hand to optimize the memcpy() call away. Use OPTIMIZER_HIDE_VAR() to hide the uninitialized var from the too-smart compiler.

    Link: https://lkml.kernel.org/r/20221205145740.694038-1-glider@google.com
    Signed-off-by: Alexander Potapenko <glider@google.com>
    Suggested-by: Marco Elver <elver@google.com>
    Reviewed-by: Marco Elver <elver@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
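    A hedged sketch of the pattern the fix applies (illustrative only; the real KMSAN test file and variable names differ):

        #include <linux/compiler.h>	/* OPTIMIZER_HIDE_VAR() */
        #include <linux/string.h>

        static void copy_from_uninit_volatile(int *dst)
        {
        	volatile int uninit_src;	/* deliberately left uninitialized */

        	/*
        	 * Passing &uninit_src to memcpy() drops the volatile qualifier,
        	 * so a recent Clang may elide the call entirely; hiding the
        	 * variable from the optimizer prevents that.
        	 */
        	OPTIMIZER_HIDE_VAR(uninit_src);
        	memcpy(dst, (const void *)&uninit_src, sizeof(*dst));
        }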
| * mm: add cond_resched() in swapin_walk_pmd_entry()  (Kefeng Wang, 2022-12-12; 1 file changed, -0/+1)

    When handling MADV_WILLNEED in madvise(), a softlockup may occur in swapin_walk_pmd_entry() if swapping in lots of memory on a slow device. Add a cond_resched() to avoid the possible softlockup.

    Link: https://lkml.kernel.org/r/20221205140327.72304-1-wangkefeng.wang@huawei.com
    Fixes: 1998cc048901 ("mm: make madvise(MADV_WILLNEED) support swap file prefetch")
    Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
    Cc: Shaohua Li <shli@fusionio.com>
    Cc: Hugh Dickins <hughd@google.com>
    Cc: Rik van Riel <riel@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
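    A hedged sketch of the shape of the fix (simplified; the real callback looks up swap entries under the page table lock and starts swapin for each PTE in the range):

        static int swapin_walk_pmd_entry(pmd_t *pmd, unsigned long start,
        				 unsigned long end, struct mm_walk *walk)
        {
        	unsigned long addr;

        	for (addr = start; addr < end; addr += PAGE_SIZE) {
        		/* look up the PTE and, if it is a swap entry,
        		 * kick off swapin readahead for that page */
        	}
        	cond_resched();	/* the added reschedule point between PMD ranges */
        	return 0;
        }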
| * mm: do not show fs mm pc for VM_LOCKONFAULT pages  (Jason A. Donenfeld, 2022-12-12; 1 file changed, -0/+1)

    When VM_LOCKONFAULT was added, /proc/PID/smaps wasn't hooked up to it, so looking at /proc/PID/smaps, it shows '??' instead of something intelligible. This can be reached by userspace by simply calling `mlock2(..., MLOCK_ONFAULT);`. Fix this by adding "lf" to denote VM_LOCKONFAULT.

    Link: https://lkml.kernel.org/r/20221205173007.580210-1-Jason@zx2c4.com
    Fixes: de60f5f10c58 ("mm: introduce VM_LOCKONFAULT")
    Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com>
    Acked-by: Vlastimil Babka <vbabka@suse.cz>
    Cc: Eric B Munson <emunson@akamai.com>
    Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
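    A small userspace reproducer along the lines the commit message mentions (a sketch, assuming a glibc new enough to provide the mlock2() wrapper); with the fix, the VmFlags line for this VMA in /proc/<pid>/smaps shows "lf" instead of "??":

        #define _GNU_SOURCE
        #include <stdio.h>
        #include <sys/mman.h>

        #ifndef MLOCK_ONFAULT
        #define MLOCK_ONFAULT 0x01	/* uapi value, in case libc headers lack it */
        #endif

        int main(void)
        {
        	size_t len = 16 * 4096;
        	void *p = mmap(NULL, len, PROT_READ | PROT_WRITE,
        		       MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);

        	if (p == MAP_FAILED)
        		return 1;
        	if (mlock2(p, len, MLOCK_ONFAULT))	/* sets VM_LOCKONFAULT on the VMA */
        		perror("mlock2");
        	getchar();	/* pause: inspect /proc/<pid>/smaps now */
        	return 0;
        }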
| * selftests/vm: ksm_functional_tests: fixes for 32bit  (David Hildenbrand, 2022-12-12; 2 files changed, -3/+3)

    The test currently fails on 32bit. Fixing the "-1ull" vs. "-1ul" seems to make the test pass and the compiler happy.

    Note: This test is not in mm-stable yet. This fix should be squashed into "selftests/vm: add KSM unmerge tests".

    Link: https://lkml.kernel.org/r/20221205193716.276024-5-david@redhat.com
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Yang Li <yang.lee@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * selftests/vm: cow: fix compile warning on 32bit  (David Hildenbrand, 2022-12-12; 1 file changed, -2/+2)

    The compiler complains about the conversion of a pointer to an int of different width.

    Link: https://lkml.kernel.org/r/20221205193716.276024-4-david@redhat.com
    Fixes: 6f1405efc61b ("selftests/vm: anon_cow: add R/O longterm tests via gup_test")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Yang Li <yang.lee@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * selftests/vm: madv_populate: fix missing MADV_POPULATE_(READ|WRITE) definitions  (David Hildenbrand, 2022-12-12; 1 file changed, -0/+7)

    The tests fail to compile in some environments (e.g., Debian 11.5 on x86). Let's simply conditionally define MADV_POPULATE_(READ|WRITE) if not already defined, similar to how the khugepaged.c test handles it.

    Link: https://lkml.kernel.org/r/20221205193716.276024-3-david@redhat.com
    Fixes: 39b2e5cae43d ("selftests/vm: make MADV_POPULATE_(READ|WRITE) use in-tree headers")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Yang Li <yang.lee@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
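    The conditional-define pattern the commit refers to looks roughly like this (the numeric values follow include/uapi/asm-generic/mman-common.h):

        #include <sys/mman.h>

        #ifndef MADV_POPULATE_READ
        #define MADV_POPULATE_READ 22	/* prefault page tables, readable */
        #endif
        #ifndef MADV_POPULATE_WRITE
        #define MADV_POPULATE_WRITE 23	/* prefault page tables, writable */
        #endif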
| * mm/gup_test: fix PIN_LONGTERM_TEST_READ with highmem  (David Hildenbrand, 2022-12-12; 1 file changed, -3/+7)

    Patch series "selftests/vm: fix some tests on 32bit".

    I finally had the time to run some of the selftests written by me (especially "cow") on x86 PAE. I found some unexpected "surprises" :)

    With these changes, and with [1] on top of mm-unstable, the "cow" tests and the "ksm_functional_tests" compile and pass as expected (expected failures with hugetlb in the "cow" tests). "madv_populate" has one expected test failure -- x86 does not support softdirty tracking.

    #1-#3 fix commits with stable commit ids. #4 fixes a test that is not in mm-stable yet.

    A note that there are many other compile errors/warnings when compiling on 32bit and with older Linux headers ... something for another day.

    [1] https://lkml.kernel.org/r/20221205150857.167583-1-david@redhat.com

    This patch (of 4):

    ... we have to kmap()/kunmap(), otherwise this won't work as expected with highmem.

    Link: https://lkml.kernel.org/r/20221205193716.276024-1-david@redhat.com
    Link: https://lkml.kernel.org/r/20221205193716.276024-2-david@redhat.com
    Fixes: c77369b437f9 ("mm/gup_test: start/stop/read functionality for PIN LONGTERM test")
    Signed-off-by: David Hildenbrand <david@redhat.com>
    Cc: Shuah Khan <shuah@kernel.org>
    Cc: Yang Li <yang.lee@linux.alibaba.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
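    A hedged sketch of the highmem-safe access pattern the fix describes (illustrative; the function and variable names here are not the real gup_test code):

        /* Touch the first byte of each pinned page; with CONFIG_HIGHMEM the
         * page may have no permanent kernel mapping, so map it temporarily. */
        static void read_pinned_pages(struct page **pages, unsigned long nr_pages)
        {
        	unsigned long i;

        	for (i = 0; i < nr_pages; i++) {
        		void *addr = kmap(pages[i]);

        		READ_ONCE(*(volatile char *)addr);
        		kunmap(pages[i]);
        	}
        }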
| * mm,thp,rmap: fix races between updates of subpages_mapcount  (Hugh Dickins, 2022-12-12; 1 file changed, -9/+33)

    Commit 4b51634cd16a, introducing the COMPOUND_MAPPED bit, paid attention to the impossibility of subpages_mapcount ever appearing negative; but did not attend to those races in which it can momentarily appear larger than thought possible.

    These arise from how page_remove_rmap() first decrements page->_mapcount or compound_mapcount, then, if that transition goes negative (logical 0), decrements subpages_mapcount. The initial decrement lets a racing page_add_*_rmap() reincrement _mapcount or compound_mapcount immediately, and then in rare cases its corresponding increment of subpages_mapcount may be completed before page_remove_rmap()'s decrement. There could even (with increasing unlikelihood) be a series of increments intermixed with the decrements.

    In practice, checking subpages_mapcount with a temporary WARN on range, has caught values of 0x1000000 (2*COMPOUND_MAPPED, when move_pages() was using remove_migration_pmd()) and 0x800201 (do_huge_pmd_wp_page() using __split_huge_pmd()): page_add_anon_rmap() racing page_remove_rmap(), as predicted.

    I certainly found it harder to reason about than when bit_spin_locked, but the easy case gives a clue to how to handle the harder case. The easy case being the three !(nr & COMPOUND_MAPPED) checks, which should obviously be replaced by (nr < COMPOUND_MAPPED) checks - to count a page as compound mapped, even while the bit in that position is 0.

    The harder case is when trying to decide how many subpages are newly covered or uncovered, when compound map is first added or last removed: not knowing all that racily happened between first and second atomic ops.

    But the easy way to handle that, is again to count the page as compound mapped all the while that its subpages_mapcount indicates so - ignoring the _mapcount or compound_mapcount transition while it is on the way to being reversed.

    Link: https://lkml.kernel.org/r/4388158-3092-a960-ff2d-55f2b0fe4ef8@google.com
    Fixes: 4b51634cd16a ("mm,thp,rmap: subpages_mapcount COMPOUND_MAPPED if PMD-mapped")
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: James Houghton <jthoughton@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: "Kirill A . Shutemov" <kirill@shutemov.name>
    Cc: Linus Torvalds <torvalds@linux-foundation.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Naoya Horiguchi <naoya.horiguchi@linux.dev>
    Cc: Peter Xu <peterx@redhat.com>
    Cc: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yang Shi <shy828301@gmail.com>
    Cc: Zach O'Keefe <zokeefe@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm: memcg: fix swapcached stat accounting  (Hugh Dickins, 2022-12-12; 2 files changed, -2/+7)

    I'd been worried by high "swapcached" counts in memcg OOM reports, thought we had a problem freeing swapcache, but it was just the accounting that was wrong.

    Two issues:

    1. When __remove_mapping() removes swapcache, __delete_from_swap_cache() relies on memcg_data for the right counts to be updated; but that had already been reset by mem_cgroup_swapout(). Swap those calls around - mem_cgroup_swapout() does not require the swapcached flag to be set. 6.1 commit ac35a4902374 ("mm: multi-gen LRU: minimal implementation") already made a similar swap for workingset_eviction(), but not for this.

    2. memcg's "swapcached" count was added for memcg v2 stats, but displayed on OOM even for memcg v1: so mem_cgroup_move_account() ought to move it.

    Link: https://lkml.kernel.org/r/b8b96ee0-1e1e-85f8-df97-c82a11d7cd14@google.com
    Fixes: b6038942480e ("mm: memcg: add swapcache stat for memcg v2")
    Signed-off-by: Hugh Dickins <hughd@google.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm: add nodes= arg to memory.reclaim  (Mina Almasry, 2022-12-12; 4 files changed, -21/+68)

    The nodes= arg instructs the kernel to only scan the given nodes for proactive reclaim. For example use cases, consider a 2 tier memory system:

        nodes 0,1 -> top tier
        nodes 2,3 -> second tier

        $ echo "1m nodes=0" > memory.reclaim

    This instructs the kernel to attempt to reclaim 1m memory from node 0. Since node 0 is a top tier node, demotion will be attempted first. This is useful to direct proactive reclaim to specific nodes that are under pressure.

        $ echo "1m nodes=2,3" > memory.reclaim

    This instructs the kernel to attempt to reclaim 1m memory in the second tier, since this tier of memory has no demotion targets the memory will be reclaimed.

        $ echo "1m nodes=0,1" > memory.reclaim

    Instructs the kernel to reclaim memory from the top tier nodes, which can be desirable according to the userspace policy if there is pressure on the top tiers. Since these nodes have demotion targets, the kernel will attempt demotion first.

    Since commit 3f1509c57b1b ("Revert "mm/vmscan: never demote for memcg reclaim""), the proactive reclaim interface memory.reclaim does both reclaim and demotion. Reclaim and demotion incur different latency costs to the jobs in the cgroup. Demoted memory would still be addressable by the userspace at a higher latency, but reclaimed memory would need to incur a pagefault.

    The 'nodes' arg is useful to allow the userspace to control demotion and reclaim independently according to its policy: if the memory.reclaim is called on a node with demotion targets, it will attempt demotion first; if it is called on a node without demotion targets, it will only attempt reclaim.

    Link: https://lkml.kernel.org/r/20221202223533.1785418-1-almasrymina@google.com
    Signed-off-by: Mina Almasry <almasrymina@google.com>
    Acked-by: Michal Hocko <mhocko@suse.com>
    Acked-by: Shakeel Butt <shakeelb@google.com>
    Acked-by: Muchun Song <songmuchun@bytedance.com>
    Cc: Bagas Sanjaya <bagasdotme@gmail.com>
    Cc: "Huang, Ying" <ying.huang@intel.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Jonathan Corbet <corbet@lwn.net>
    Cc: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Yang Shi <yang.shi@linux.alibaba.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Cc: zefan li <lizefan.x@bytedance.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
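    Beyond the echo examples above, the same request can be issued programmatically; a minimal sketch, assuming a cgroup v2 mount at /sys/fs/cgroup and a hypothetical cgroup named "workload":

        #include <fcntl.h>
        #include <stdio.h>
        #include <string.h>
        #include <unistd.h>

        int main(void)
        {
        	const char *path = "/sys/fs/cgroup/workload/memory.reclaim";
        	const char *req = "1m nodes=0";	/* demote/reclaim 1MiB from node 0 */
        	int fd = open(path, O_WRONLY);

        	if (fd < 0 || write(fd, req, strlen(req)) != (ssize_t)strlen(req))
        		perror("memory.reclaim");
        	if (fd >= 0)
        		close(fd);
        	return 0;
        }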
| * mm: disable top-tier fallback to reclaim on proactive reclaim  (Mina Almasry, 2022-12-12; 1 file changed, -3/+22)

    Reclaiming directly from top tier nodes breaks the aging pipeline of memory tiers. If we have a RAM -> CXL -> storage hierarchy, we should demote from RAM to CXL and from CXL to storage. If we reclaim a page from RAM, it means we 'demote' it directly from RAM to storage, bypassing potentially a huge amount of pages colder than it in CXL.

    However, disabling reclaim from top tier nodes entirely would cause OOMs in edge scenarios where lower tier memory is unreclaimable for whatever reason, e.g. memory being mlocked() or too hot to reclaim. In these cases we would rather the job run with a performance regression than OOM altogether.

    However, we can disable reclaim from top tier nodes for proactive reclaim. That reclaim is not real memory pressure, and we don't have any cause to be breaking the aging pipeline.

    [akpm@linux-foundation.org: restore comment layout, per Ying Huang]
    Link: https://lkml.kernel.org/r/20221201233317.1394958-1-almasrymina@google.com
    Signed-off-by: Mina Almasry <almasrymina@google.com>
    Reviewed-by: "Huang, Ying" <ying.huang@intel.com>
    Reviewed-by: Yang Shi <shy828301@gmail.com>
    Cc: Greg Thelen <gthelen@google.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tim Chen <tim.c.chen@linux.intel.com>
    Cc: Wei Xu <weixugc@google.com>
    Cc: Yosry Ahmed <yosryahmed@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * selftests: cgroup: make sure reclaim target memcg is unprotected  (Yosry Ahmed, 2022-12-12; 1 file changed, -0/+11)

    Make sure that we ignore protection of a memcg that is the target of memcg reclaim.

    Link: https://lkml.kernel.org/r/20221202031512.1365483-4-yosryahmed@google.com
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Chris Down <chris@chrisdown.name>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vasily Averin <vasily.averin@linux.dev>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * selftests: cgroup: refactor proactive reclaim code to reclaim_until()  (Yosry Ahmed, 2022-12-12; 1 file changed, -36/+44)

    Refactor the code that drives writing to memory.reclaim (retrying, error handling, etc) from test_memcg_reclaim() to a helper called reclaim_until(), which proactively reclaims from a memcg until its usage reaches a certain value. While we are at it, refactor and simplify the reclaim loop. This will be used in a following patch in another test.

    Link: https://lkml.kernel.org/r/20221202031512.1365483-3-yosryahmed@google.com
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Suggested-by: Roman Gushchin <roman.gushchin@linux.dev>
    Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Chris Down <chris@chrisdown.name>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vasily Averin <vasily.averin@linux.dev>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
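    A standalone approximation of the reclaim_until() idea (this is not the selftest code, which uses the kselftest cgroup helpers; the path handling and retry count here are arbitrary):

        #include <stdio.h>
        #include <stdlib.h>

        /* Keep poking memory.reclaim until memory.current in cgroup `cg`
         * (a path such as "/sys/fs/cgroup/test") drops to `target` bytes. */
        static int reclaim_until(const char *cg, long target)
        {
        	char path[512], buf[64];
        	int tries;

        	for (tries = 0; tries < 5; tries++) {
        		long current;
        		FILE *f;

        		snprintf(path, sizeof(path), "%s/memory.current", cg);
        		f = fopen(path, "r");
        		if (!f)
        			return -1;
        		if (!fgets(buf, sizeof(buf), f)) {
        			fclose(f);
        			return -1;
        		}
        		fclose(f);
        		current = atol(buf);
        		if (current <= target)
        			return 0;

        		snprintf(path, sizeof(path), "%s/memory.reclaim", cg);
        		f = fopen(path, "w");
        		if (!f)
        			return -1;
        		fprintf(f, "%ld", current - target);	/* a failed write just means retry */
        		fclose(f);
        	}
        	return -1;
        }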
| * mm: memcg: fix stale protection of reclaim target memcg  (Yosry Ahmed, 2022-12-12; 2 files changed, -15/+27)

    Patch series "mm: memcg: fix protection of reclaim target memcg", v3.

    This series fixes a bug in calculating the protection of the reclaim target memcg where we end up using stale effective protection values from the last reclaim operation, instead of completely ignoring the protection of the reclaim target as intended. More detailed explanation and examples in patch 1, which includes the fix. Patches 2 & 3 introduce a selftest case that catches the bug.

    This patch (of 3):

    When we are doing memcg reclaim, the intended behavior is that we ignore any protection (memory.min, memory.low) of the target memcg (but not its children). Ever since the patch pointed to by the "Fixes" tag, we actually read a stale value for the target memcg protection when deciding whether to skip the memcg or not because it is protected. If the stale value happens to be high enough, we don't reclaim from the target memcg.

    Essentially, in some cases we may falsely skip reclaiming from the target memcg of reclaim because we read a stale protection value from last time we reclaimed from it.

    During reclaim, mem_cgroup_calculate_protection() is used to determine the effective protection (emin and elow) values of a memcg. The protection of the reclaim target is ignored, but we cannot set their effective protection to 0 due to a limitation of the current implementation (see comment in mem_cgroup_protection()). Instead, we leave their effective protection values unchanged, and later ignore it in mem_cgroup_protection().

    However, mem_cgroup_protection() is called later in shrink_lruvec()->get_scan_count(), which is after the mem_cgroup_below_{min/low}() checks in shrink_node_memcgs(). As a result, the stale effective protection values of the target memcg may lead us to skip reclaiming from the target memcg entirely, before calling shrink_lruvec(). This can be even worse with recursive protection, where the stale target memcg protection can be higher than its standalone protection. See two examples below (a similar version of example (a) is added to test_memcontrol in a later patch).

    (a) A simple example with proactive reclaim is as follows. Consider the following hierarchy:

        ROOT
         |
         A
         |
         B (memory.min = 10M)

    Consider the following scenario:
    - B has memory.current = 10M.
    - The system undergoes global reclaim (or memcg reclaim in A).
    - In shrink_node_memcgs():
      - mem_cgroup_calculate_protection() calculates the effective min (emin) of B as 10M.
      - mem_cgroup_below_min() returns true for B, we do not reclaim from B.
    - Now if we want to reclaim 5M from B using proactive reclaim (memory.reclaim), we should be able to, as the protection of the target memcg should be ignored.
    - In shrink_node_memcgs():
      - mem_cgroup_calculate_protection() immediately returns for B without doing anything, as B is the target memcg, relying on mem_cgroup_protection() to ignore B's stale effective min (still 10M).
      - mem_cgroup_below_min() reads the stale effective min for B and we skip it instead of ignoring its protection as intended, as we never reach mem_cgroup_protection().

    (b) A more complex example with recursive protection is as follows. Consider the following hierarchy with memory_recursiveprot:

        ROOT
         |
         A (memory.min = 50M)
         |
         B (memory.min = 10M, memory.high = 40M)

    Consider the following scenario:
    - B has memory.current = 35M.
    - The system undergoes global reclaim (target memcg is NULL).
    - B will have an effective min of 50M (all of A's unclaimed protection).
    - B will not be reclaimed from.
    - Now allocate 10M more memory in B, pushing it above its high limit.
    - The system undergoes memcg reclaim from B (target memcg is B).
    - Like example (a), we do nothing in mem_cgroup_calculate_protection(), then call mem_cgroup_below_min(), which will read the stale effective min for B (50M) and skip it. In this case, it's even worse because we are not just considering B's standalone protection (10M), but we are reading a much higher stale protection (50M) which will cause us to not reclaim from B at all.

    This is an artifact of commit 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks") which made mem_cgroup_calculate_protection() only change the state without returning any value. Before that commit, we used to return MEMCG_PROT_NONE for the target memcg, which would cause us to skip the mem_cgroup_below_{min/low}() checks. After that commit we do not return anything and we end up checking the min & low effective protections for the target memcg, which are stale.

    Update mem_cgroup_supports_protection() to also check if we are reclaiming from the target, and rename it to mem_cgroup_unprotected() (now returns true if we should not protect the memcg, much simpler logic).

    Link: https://lkml.kernel.org/r/20221202031512.1365483-1-yosryahmed@google.com
    Link: https://lkml.kernel.org/r/20221202031512.1365483-2-yosryahmed@google.com
    Fixes: 45c7f7e1ef17 ("mm, memcg: decouple e{low,min} state mutations from protection checks")
    Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
    Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev>
    Cc: Chris Down <chris@chrisdown.name>
    Cc: David Rientjes <rientjes@google.com>
    Cc: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Michal Hocko <mhocko@suse.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Shakeel Butt <shakeelb@google.com>
    Cc: Tejun Heo <tj@kernel.org>
    Cc: Vasily Averin <vasily.averin@linux.dev>
    Cc: Vlastimil Babka <vbabka@suse.cz>
    Cc: Yu Zhao <yuzhao@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/mmap: properly unaccount memory on mas_preallocate() failure  (Alistair Popple, 2022-12-12; 1 file changed, -3/+3)

    security_vm_enough_memory_mm() accounts memory via a call to vm_acct_memory(). Therefore any subsequent failures should unaccount for this memory prior to returning the error.

    Link: https://lkml.kernel.org/r/20221202045339.2999017-1-apopple@nvidia.com
    Fixes: 28c5609fb236 ("mm/mmap: preallocate maple nodes for brk vma expansion")
    Signed-off-by: Alistair Popple <apopple@nvidia.com>
    Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
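    Schematically (a sketch, not the exact mainline hunk; argument lists abbreviated), the error path now mirrors the charge:

        charged = len >> PAGE_SHIFT;
        if (security_vm_enough_memory_mm(mm, charged))	/* calls vm_acct_memory() */
        	return -ENOMEM;

        if (mas_preallocate(/* ... */))			/* may fail with -ENOMEM */
        	goto unacct_fail;
        /* ... */

        unacct_fail:
        	vm_unacct_memory(charged);	/* undo the accounting before returning */
        	return -ENOMEM;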
| * omfs: remove ->writepage  (Christoph Hellwig, 2022-12-12; 1 file changed, -6/+1)

    ->writepage is a very inefficient method to write back data, and only used through write_cache_pages or as a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and remove the ->writepage implementation.

    Link: https://lkml.kernel.org/r/20221202102644.770505-8-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Bob Copeland <me@bobcopeland.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * jfs: remove ->writepage  (Christoph Hellwig, 2022-12-12; 1 file changed, -6/+1)

    ->writepage is a very inefficient method to write back data, and only used through write_cache_pages or as a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and remove the ->writepage implementation.

    Link: https://lkml.kernel.org/r/20221202102644.770505-7-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Dave Kleikamp <dave.kleikamp@oracle.com>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * hpfs: remove ->writepage  (Christoph Hellwig, 2022-12-12; 1 file changed, -7/+2)

    ->writepage is a very inefficient method to write back data, and only used through write_cache_pages or as a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and remove the ->writepage implementation.

    Link: https://lkml.kernel.org/r/20221202102644.770505-6-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * hfsplus: remove ->writepage  (Christoph Hellwig, 2022-12-12; 1 file changed, -1/+1)

    ->writepage is a very inefficient method to write back data, and only used through write_cache_pages or as a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and stop wiring up ->writepage for hfsplus_aops.

    Link: https://lkml.kernel.org/r/20221202102644.770505-5-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * hfs: remove ->writepage  (Christoph Hellwig, 2022-12-12; 1 file changed, -1/+1)

    ->writepage is a very inefficient method to write back data, and only used through write_cache_pages or as a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and stop wiring up ->writepage for hfs_aops.

    Link: https://lkml.kernel.org/r/20221202102644.770505-4-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fat: remove ->writepage  (Christoph Hellwig, 2022-12-12; 1 file changed, -7/+2)

    ->writepage is a very inefficient method to write back data, and only used through write_cache_pages or as a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and remove the ->writepage implementation.

    Link: https://lkml.kernel.org/r/20221202102644.770505-3-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * extfat: remove ->writepage  (Christoph Hellwig, 2022-12-12; 1 file changed, -7/+2)

    Patch series "start removing writepage instances v2".

    The VM doesn't need or want ->writepage for writeback and is fine with just having ->writepages as long as ->migrate_folio is implemented. This series removes all ->writepage instances that use block_write_full_page directly and also have a plain mpage_writepages based ->writepages.

    This patch (of 7):

    ->writepage is a very inefficient method to write back data, and only used through write_cache_pages or as a fallback when no ->migrate_folio method is present. Set ->migrate_folio to the generic buffer_head based helper, and remove the ->writepage implementation.

    Link: https://lkml.kernel.org/r/20221202102644.770505-1-hch@lst.de
    Link: https://lkml.kernel.org/r/20221202102644.770505-2-hch@lst.de
    Signed-off-by: Christoph Hellwig <hch@lst.de>
    Acked-by: Namjae Jeon <linkinjeon@kernel.org>
    Acked-by: Johannes Weiner <hannes@cmpxchg.org>
    Cc: Bob Copeland <me@bobcopeland.com>
    Cc: Dave Kleikamp <shaggy@kernel.org>
    Cc: Jan Kara <jack@suse.com>
    Cc: Mikulas Patocka <mikulas@artax.karlin.mff.cuni.cz>
    Cc: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp>
    Cc: Sungjong Seo <sj1557.seo@samsung.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
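    For reference, the per-filesystem change this series makes looks roughly like the following address_space_operations sketch (the example_* names are placeholders; buffer_migrate_folio() and mpage_writepages() are the generic helpers mentioned above):

        static const struct address_space_operations example_aops = {
        	.dirty_folio	  = block_dirty_folio,
        	.invalidate_folio = block_invalidate_folio,
        	.read_folio	  = example_read_folio,	  /* unchanged fs hook */
        	.writepages	  = example_writepages,	  /* mpage_writepages() based */
        	.migrate_folio	  = buffer_migrate_folio, /* added by the series */
        	/* .writepage removed: writeback now goes through ->writepages */
        };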
| * mm/page_alloc: update comments in __free_pages_ok()  (Deyan Wang, 2022-12-12; 1 file changed, -0/+5)

    Add a comment to explain why we call get_pfnblock_migratetype() twice in __free_pages_ok().

    Link: https://lkml.kernel.org/r/20221201135045.31663-1-wonder_rock@126.com
    Signed-off-by: Deyan Wang <wonder_rock@126.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * xfs: remove restrictions for fsdax and reflink  (Shiyang Ruan, 2022-12-12; 2 files changed, -8/+0)

    Since the basic function for fsdax and reflink has been implemented, remove the restrictions on them to allow wider testing.

    Link: https://lkml.kernel.org/r/1669908773-207-1-git-send-email-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fsdax,xfs: port unshare to fsdax  (Shiyang Ruan, 2022-12-12; 3 files changed, -2/+60)

    Implement unshare in fsdax mode: copy data from srcmap to iomap.

    Link: https://lkml.kernel.org/r/1669908753-169-1-git-send-email-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * xfs: use dax ops for zero and truncate in fsdax mode  (Shiyang Ruan, 2022-12-12; 1 file changed, -2/+2)

    Zero and truncate on a dax file may execute CoW. So use the dax ops, which contain the end work for CoW.

    Link: https://lkml.kernel.org/r/1669908730-131-1-git-send-email-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fsdax: dedupe: iter two files at the same time  (Shiyang Ruan, 2022-12-12; 1 file changed, -8/+8)

    The iomap_iter() on a range of one file may loop more than once. In this case, the inner dst_iter can update its iomap but the outer src_iter can't. This may cause the wrong remapping in the filesystem. Let them be called at the same time.

    Link: https://lkml.kernel.org/r/1669908701-93-1-git-send-email-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fsdax,xfs: set the shared flag when file extent is shared  (Shiyang Ruan, 2022-12-12; 2 files changed, -13/+8)

    If a dax page is shared, a mapread at different offsets can also trigger a page fault on the same dax page. So, change the flag from "cow" to "shared". And get the shared flag from the filesystem when reading.

    Link: https://lkml.kernel.org/r/1669908538-55-5-git-send-email-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fsdax: zero the edges if source is HOLE or UNWRITTEN  (Shiyang Ruan, 2022-12-12; 1 file changed, -30/+49)

    If srcmap contains invalid data, such as HOLE and UNWRITTEN, the dest page should be zeroed. Otherwise, since it's a pmem, old data may remain on the dest page, and the result of CoW will be incorrect.

    The function name is also not easy to understand, so rename it to "dax_iomap_copy_around()", which means it copies data around the range.

    [akpm@linux-foundation.org: update dax_iomap_copy_around() kerneldoc, per Darrick]
    Link: https://lkml.kernel.org/r/1669973145-318-1-git-send-email-ruansy.fnst@fujitsu.com
    Link: https://lkml.kernel.org/r/1669908538-55-4-git-send-email-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fsdax: invalidate pages when CoW  (Shiyang Ruan, 2022-12-12; 1 file changed, -4/+13)

    CoW changes the share state of a dax page, but the share count of the page isn't updated. The next time this page is accessed, it should be treated as newly accessed, but the old association still exists. So, we need to clear the share state when CoW happens, in both dax_iomap_rw() and dax_zero_iter().

    Link: https://lkml.kernel.org/r/1669908538-55-3-git-send-email-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * fsdax: introduce page->share for fsdax in reflink mode  (Shiyang Ruan, 2022-12-12; 3 files changed, -18/+27)

    Patch series "fsdax,xfs: fix warning messages", v2.

    Many testcases failed in dax+reflink mode with warning messages in dmesg, such as generic/051,075,127. The warning message is like this:

        [  775.509337] ------------[ cut here ]------------
        [  775.509636] WARNING: CPU: 1 PID: 16815 at fs/dax.c:386 dax_insert_entry.cold+0x2e/0x69
        [  775.510151] Modules linked in: auth_rpcgss oid_registry nfsv4 algif_hash af_alg af_packet nft_reject_inet nf_reject_ipv4 nf_reject_ipv6 nft_reject nft_ct nft_chain_nat iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 ip_set nf_tables nfnetlink ip6table_filter ip6_tables iptable_filter ip_tables x_tables dax_pmem nd_pmem nd_btt sch_fq_codel configfs xfs libcrc32c fuse
        [  775.524288] CPU: 1 PID: 16815 Comm: fsx Kdump: loaded Tainted: G W 6.1.0-rc4+ #164 eb34e4ee4200c7cbbb47de2b1892c5a3e027fd6d
        [  775.524904] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS Arch Linux 1.16.0-3-3 04/01/2014
        [  775.525460] RIP: 0010:dax_insert_entry.cold+0x2e/0x69
        [  775.525797] Code: c7 c7 18 eb e0 81 48 89 4c 24 20 48 89 54 24 10 e8 73 6d ff ff 48 83 7d 18 00 48 8b 54 24 10 48 8b 4c 24 20 0f 84 e3 e9 b9 ff <0f> 0b e9 dc e9 b9 ff 48 c7 c6 a0 20 c3 81 48 c7 c7 f0 ea e0 81 48
        [  775.526708] RSP: 0000:ffffc90001d57b30 EFLAGS: 00010082
        [  775.527042] RAX: 000000000000002a RBX: 0000000000000000 RCX: 0000000000000042
        [  775.527396] RDX: ffffea000a0f6c80 RSI: ffffffff81dfab1b RDI: 00000000ffffffff
        [  775.527819] RBP: ffffea000a0f6c40 R08: 0000000000000000 R09: ffffffff820625e0
        [  775.528241] R10: ffffc90001d579d8 R11: ffffffff820d2628 R12: ffff88815fc98320
        [  775.528598] R13: ffffc90001d57c18 R14: 0000000000000000 R15: 0000000000000001
        [  775.528997] FS:  00007f39fc75d740(0000) GS:ffff88817bc80000(0000) knlGS:0000000000000000
        [  775.529474] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
        [  775.529800] CR2: 00007f39fc772040 CR3: 0000000107eb6001 CR4: 00000000003706e0
        [  775.530214] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
        [  775.530592] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
        [  775.531002] Call Trace:
        [  775.531230]  <TASK>
        [  775.531444]  dax_fault_iter+0x267/0x6c0
        [  775.531719]  dax_iomap_pte_fault+0x198/0x3d0
        [  775.532002]  __xfs_filemap_fault+0x24a/0x2d0 [xfs aa8d25411432b306d9554da38096f4ebb86bdfe7]
        [  775.532603]  __do_fault+0x30/0x1e0
        [  775.532903]  do_fault+0x314/0x6c0
        [  775.533166]  __handle_mm_fault+0x646/0x1250
        [  775.533480]  handle_mm_fault+0xc1/0x230
        [  775.533810]  do_user_addr_fault+0x1ac/0x610
        [  775.534110]  exc_page_fault+0x63/0x140
        [  775.534389]  asm_exc_page_fault+0x22/0x30
        [  775.534678] RIP: 0033:0x7f39fc55820a
        [  775.534950] Code: 00 01 00 00 00 74 99 83 f9 c0 0f 87 7b fe ff ff c5 fe 6f 4e 20 48 29 fe 48 83 c7 3f 49 8d 0c 10 48 83 e7 c0 48 01 fe 48 29 f9 <f3> a4 c4 c1 7e 7f 00 c4 c1 7e 7f 48 20 c5 f8 77 c3 0f 1f 44 00 00
        [  775.535839] RSP: 002b:00007ffc66a08118 EFLAGS: 00010202
        [  775.536157] RAX: 00007f39fc772001 RBX: 0000000000042001 RCX: 00000000000063c1
        [  775.536537] RDX: 0000000000006400 RSI: 00007f39fac42050 RDI: 00007f39fc772040
        [  775.536919] RBP: 0000000000006400 R08: 00007f39fc772001 R09: 0000000000042000
        [  775.537304] R10: 0000000000000001 R11: 0000000000000246 R12: 0000000000000001
        [  775.537694] R13: 00007f39fc772000 R14: 0000000000006401 R15: 0000000000000003
        [  775.538086]  </TASK>
        [  775.538333] ---[ end trace 0000000000000000 ]---

    This also affects dax+noreflink mode if we run the test after a dax+reflink test. So, the most urgent thing is solving the warning messages.

    With these fixes, most warning messages in dax_associate_entry() are gone. But honestly, generic/388 will randomly fail with the warning. The case shuts down the xfs filesystem while fsstress is running, and does so many times. I think the reason is that dax pages in use are not able to be invalidated in time when the fs is shut down. The next time a dax page is to be associated, it still retains the mapping value set last time. I'll keep on solving it.

    The warning message in dax_writeback_one() can also be fixed because of the dax unshare.

    This patch (of 8):

    The fsdax page is used not only for CoW, but also for mapread. To make this easily understood, use 'share' to indicate that the dax page is shared by more than one extent, and add helper functions to use it. Also, the flag needs to be renamed to PAGE_MAPPING_DAX_SHARED.

    [ruansy.fnst@fujitsu.com: rename several functions]
    Link: https://lkml.kernel.org/r/1669972991-246-1-git-send-email-ruansy.fnst@fujitsu.com
    [ruansy.fnst@fujitsu.com: v2.2]
    Link: https://lkml.kernel.org/r/1670381359-53-1-git-send-email-ruansy.fnst@fujitsu.com
    Link: https://lkml.kernel.org/r/1669908538-55-1-git-send-email-ruansy.fnst@fujitsu.com
    Link: https://lkml.kernel.org/r/1669908538-55-2-git-send-email-ruansy.fnst@fujitsu.com
    Signed-off-by: Shiyang Ruan <ruansy.fnst@fujitsu.com>
    Reviewed-by: Allison Henderson <allison.henderson@oracle.com>
    Reviewed-by: Darrick J. Wong <djwong@kernel.org>
    Cc: Dan Williams <dan.j.williams@intel.com>
    Cc: Dave Chinner <david@fromorbit.com>
    Cc: Jason Gunthorpe <jgg@nvidia.com>
    Cc: Alistair Popple <apopple@nvidia.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * selftests/damon: test removed scheme sysfs dir access bug  (SeongJae Park, 2022-12-12; 2 files changed, -1/+59)

    A DAMON sysfs user could start DAMON with a scheme, remove the sysfs directory for the scheme, and then ask stats or schemes tried regions update. The related logic was not aware of the already-removed directory situation, so it could result in invalid memory accesses. The fix has been made with commit 8468b486612c ("mm/damon/sysfs-schemes: skip stats update if the scheme directory is removed"), though. Add a selftest to prevent such kinds of bugs from being introduced again.

    Link: https://lkml.kernel.org/r/20221201170834.62823-1-sj@kernel.org
    Signed-off-by: SeongJae Park <sj@kernel.org>
    Cc: Shuah Khan <shuah@kernel.org>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * kasan: fail non-kasan KUnit tests on KASAN reports  (Andrey Konovalov, 2022-12-12; 3 files changed, -0/+69)

    After the recent changes done to KUnit-enabled KASAN tests, non-KASAN KUnit tests stopped being failed when a KASAN report is detected. Recover that property by failing the currently running non-KASAN KUnit test when KASAN detects and prints a report for a bad memory access.

    Note that if the bad access happened in a kernel thread that doesn't have a reference to the currently running KUnit test available via current->kunit_test, the test won't be failed. This is a limitation of KUnit, which doesn't yet provide a thread-agnostic way to find the reference to the currently running test.

    Link: https://lkml.kernel.org/r/7be29a8ea967cee6b7e48d3d5a242d1d0bd96851.1669820505.git.andreyknvl@google.com
    Fixes: 49d9977ac909 ("kasan: check CONFIG_KASAN_KUNIT_TEST instead of CONFIG_KUNIT")
    Fixes: 7ce0ea19d50e ("kasan: switch kunit tests to console tracepoints")
    Signed-off-by: Andrey Konovalov <andreyknvl@google.com>
    Cc: Alexander Potapenko <glider@google.com>
    Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com>
    Cc: David Gow <davidgow@google.com>
    Cc: Dmitry Vyukov <dvyukov@google.com>
    Cc: Marco Elver <elver@google.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/hugetlb: change hugetlb allocation functions to return a folio  (Sidhartha Kumar, 2022-12-12; 1 file changed, -70/+64)

    Many hugetlb allocation helper functions have now been converted to folios, so update their higher level callers to be compatible with folios. alloc_pool_huge_page is reorganized to avoid a smatch warning reporting the folio variable is uninitialized.

    [sidhartha.kumar@oracle.com: update alloc_and_dissolve_hugetlb_folio comments]
    Link: https://lkml.kernel.org/r/20221206233512.146535-1-sidhartha.kumar@oracle.com
    Link: https://lkml.kernel.org/r/20221129225039.82257-11-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reported-by: Wei Chen <harperchen1110@gmail.com>
    Suggested-by: John Hubbard <jhubbard@nvidia.com>
    Suggested-by: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/hugetlb: convert hugetlb prep functions to folios  (Sidhartha Kumar, 2022-12-12; 1 file changed, -33/+30)

    Convert prep_new_huge_page() and __prep_compound_gigantic_page() to folios.

    Link: https://lkml.kernel.org/r/20221129225039.82257-10-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/hugetlb: convert free_gigantic_page() to folios  (Sidhartha Kumar, 2022-12-12; 1 file changed, -12/+17)

    Convert callers of free_gigantic_page() to use folios; the function is then renamed to free_gigantic_folio().

    Link: https://lkml.kernel.org/r/20221129225039.82257-9-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/hugetlb: convert enqueue_huge_page() to folios  (Sidhartha Kumar, 2022-12-12; 1 file changed, -11/+11)

    Convert callers of enqueue_huge_page() to pass in a folio; the function is renamed to enqueue_hugetlb_folio().

    Link: https://lkml.kernel.org/r/20221129225039.82257-8-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/hugetlb: convert add_hugetlb_page() to folios and add hugetlb_cma_folio()  (Sidhartha Kumar, 2022-12-12; 1 file changed, -21/+21)

    Convert add_hugetlb_page() to take in a folio, and also convert hugetlb_cma_page() to take in a folio.

    Link: https://lkml.kernel.org/r/20221129225039.82257-7-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/hugetlb: convert update_and_free_page() to folios  (Sidhartha Kumar, 2022-12-12; 1 file changed, -14/+16)

    Make more progress on converting the free_huge_page() destructor to operate on folios by converting update_and_free_page() to folios.

    Link: https://lkml.kernel.org/r/20221129225039.82257-6-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/hugetlb: convert remove_hugetlb_page() to folios  (Sidhartha Kumar, 2022-12-12; 1 file changed, -23/+25)

    Removes the page_folio() call by converting callers to directly pass a folio into __remove_hugetlb_page().

    Link: https://lkml.kernel.org/r/20221129225039.82257-5-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/hugetlb: convert dissolve_free_huge_page() to folios  (Sidhartha Kumar, 2022-12-12; 1 file changed, -10/+10)

    Removes the compound_head() call by using a folio rather than a head page.

    Link: https://lkml.kernel.org/r/20221129225039.82257-4-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/hugetlb: convert destroy_compound_gigantic_page() to folios  (Sidhartha Kumar, 2022-12-12; 1 file changed, -22/+21)

    Convert page operations within __destroy_compound_gigantic_page() to the corresponding folio operations.

    Link: https://lkml.kernel.org/r/20221129225039.82257-3-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Muchun Song <songmuchun@bytedance.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm: add folio dtor and order setter functions  (Sidhartha Kumar, 2022-12-12; 2 files changed, -3/+24)

    Patch series "convert core hugetlb functions to folios", v5.

    ============== OVERVIEW ===========================
    Now that many hugetlb helper functions that deal with hugetlb specific flags [1] and hugetlb cgroups [2] are converted to folios, higher level allocation, prep, and freeing functions within hugetlb can also be converted to operate in folios.

    Patch 1 of this series implements the wrapper functions around setting the compound destructor and compound order for a folio. Besides the user added in patch 1, patch 2 and patch 9 also use these helper functions.

    Patches 2-10 convert the higher level hugetlb functions to folios.

    ============== TESTING ===========================
    LTP: Ran 10 back to back rounds of the LTP hugetlb test suite.

    Gigantic Huge Pages: Test allocation and freeing via hugeadm commands:
        hugeadm --pool-pages-min 1GB:10
        hugeadm --pool-pages-min 1GB:0

    Demote: Demote 1 1GB hugepage to 512 2MB hugepages:
        echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
        echo 1 > /sys/kernel/mm/hugepages/hugepages-1048576kB/demote
        cat /sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
        # 512
        cat /sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
        # 0

    [1] https://lore.kernel.org/lkml/20220922154207.1575343-1-sidhartha.kumar@oracle.com/
    [2] https://lore.kernel.org/linux-mm/20221101223059.460937-1-sidhartha.kumar@oracle.com/

    This patch (of 10):

    Add folio equivalents for set_compound_order() and set_compound_page_dtor(). Also remove extra new-lines introduced by "mm/hugetlb: convert move_hugetlb_state() to folios" and "mm/hugetlb_cgroup: convert hugetlb_cgroup_uncharge_page() to folios".

    [sidhartha.kumar@oracle.com: clarify folio_set_compound_order() zero support]
    Link: https://lkml.kernel.org/r/20221207223731.32784-1-sidhartha.kumar@oracle.com
    Link: https://lkml.kernel.org/r/20221129225039.82257-1-sidhartha.kumar@oracle.com
    Link: https://lkml.kernel.org/r/20221129225039.82257-2-sidhartha.kumar@oracle.com
    Signed-off-by: Sidhartha Kumar <sidhartha.kumar@oracle.com>
    Suggested-by: Mike Kravetz <mike.kravetz@oracle.com>
    Suggested-by: Muchun Song <songmuchun@bytedance.com>
    Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: David Hildenbrand <david@redhat.com>
    Cc: John Hubbard <jhubbard@nvidia.com>
    Cc: Matthew Wilcox <willy@infradead.org>
    Cc: Miaohe Lin <linmiaohe@huawei.com>
    Cc: Mina Almasry <almasrymina@google.com>
    Cc: Tarun Sahu <tsahu@linux.ibm.com>
    Cc: Rasmus Villemoes <linux@rasmusvillemoes.dk>
    Cc: Wei Chen <harperchen1110@gmail.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
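    A rough sketch of the kind of wrappers this patch adds (the struct folio field names and the zero-order handling below are assumptions based on the commit message, not the exact mainline code):

        static inline void folio_set_compound_dtor(struct folio *folio,
        					   enum compound_dtor_id dtor)
        {
        	/* folio analogue of set_compound_page_dtor() */
        	folio->_folio_dtor = dtor;
        }

        static inline void folio_set_compound_order(struct folio *folio,
        					    unsigned int order)
        {
        	/* folio analogue of set_compound_order(); order 0 is supported */
        	folio->_folio_order = order;
        #ifdef CONFIG_64BIT
        	folio->_folio_nr_pages = order ? 1U << order : 0;
        #endif
        }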
| * folio-compat: remove lru_cache_add()  (Vishal Moola (Oracle), 2022-12-12; 4 files changed, -9/+5)

    There are no longer any callers of lru_cache_add(), so remove it. This saves 79 bytes of kernel text. Also clean up some comments such that they reference the new folio_add_lru() instead.

    Link: https://lkml.kernel.org/r/20221101175326.13265-6-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Miklos Szeredi <mszeredi@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * khugepage: replace lru_cache_add() with folio_add_lru()  (Vishal Moola (Oracle), 2022-12-12; 1 file changed, -4/+7)

    Replaces some calls with their folio equivalents. This is in preparation for the removal of lru_cache_add(). This replaces 3 calls to compound_head() with 1.

    Link: https://lkml.kernel.org/r/20221101175326.13265-5-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Reviewed-by: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Miklos Szeredi <mszeredi@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * userfaultfd: replace lru_cache functions with folio_add functions  (Vishal Moola (Oracle), 2022-12-12; 1 file changed, -2/+4)

    Replaces lru_cache_add() and lru_cache_add_inactive_or_unevictable() with folio_add_lru() and folio_add_lru_vma(). This is in preparation for the removal of lru_cache_add().

    Link: https://lkml.kernel.org/r/20221101175326.13265-4-vishal.moola@gmail.com
    Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com>
    Cc: Matthew Wilcox (Oracle) <willy@infradead.org>
    Cc: Mike Kravetz <mike.kravetz@oracle.com>
    Cc: Miklos Szeredi <mszeredi@redhat.com>
    Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
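    The conversion pattern across these three patches, sketched with illustrative call sites:

        struct folio *folio = page_folio(page);

        /* was: lru_cache_add(page); */
        folio_add_lru(folio);

        /* was: lru_cache_add_inactive_or_unevictable(page, vma); */
        folio_add_lru_vma(folio, vma);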