path: root/drivers/dma
2024-04-26  mm: combine free_the_page() and free_unref_page()  [Matthew Wilcox (Oracle)]  1 file, -14/+11
The pcp_allowed_order() check in free_the_page() was only being skipped by __folio_put_small() which is about to be rearranged. Link: https://lkml.kernel.org/r/20240405153228.2563754-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm: free non-hugetlb large folios in a batch  [Matthew Wilcox (Oracle)]  1 file, -2/+2
Patch series "Clean up __folio_put()". With all the changes over the last few years, __folio_put_small and __folio_put_large have become almost identical to each other ... except you can't tell because they're spread over two files. Rearrange it all so that you can tell, and then inline them both into __folio_put(). This patch (of 5): free_unref_folios() can now handle non-hugetlb large folios, so keep normal large folios in the batch. hugetlb folios still need to be handled specially. [peterx@redhat.com: fix panic] Link: https://lkml.kernel.org/r/ZikjPB0Dt5HA8-uL@x1n Link: https://lkml.kernel.org/r/20240405153228.2563754-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240405153228.2563754-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm: convert pagecache_isize_extended to use a folio  [Matthew Wilcox (Oracle)]  1 file, -19/+17
Remove four hidden calls to compound_head(). Also exit early if the filesystem block size is >= PAGE_SIZE instead of just equal to PAGE_SIZE. Link: https://lkml.kernel.org/r/20240405180038.2618624-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Pankaj Raghav <p.raghav@samsung.com> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm/hugetlb: pass correct order_per_bit to cma_declare_contiguous_nid  [Frank van der Linden]  1 file, -3/+3
The hugetlb_cma code passes 0 in the order_per_bit argument to cma_declare_contiguous_nid (the alignment, computed using the page order, is correctly passed in). This causes a bit in the cma allocation bitmap to always represent a 4k page, making the bitmaps potentially very large and slower to scan. E.g. for a 4k page size on x86, hugetlb_cma=64G would mean a bitmap size of (64G / 4k) / 8 == 2M. With HUGETLB_PAGE_ORDER as order_per_bit, as intended, this would be (64G / 2M) / 8 == 4k. So, that's quite a difference. Also, this restricted the hugetlb_cma area to ((PAGE_SIZE << MAX_PAGE_ORDER) * 8) * PAGE_SIZE (e.g. 128G on x86), since bitmap_alloc uses normal page allocation, and is thus restricted by MAX_PAGE_ORDER. Specifying anything above that would fail the CMA initialization. So, correctly pass in the order instead. Link: https://lkml.kernel.org/r/20240404162515.527802-2-fvdl@google.com Fixes: cf11e85fc08c ("mm: hugetlb: optionally allocate gigantic hugepages using cma") Signed-off-by: Frank van der Linden <fvdl@google.com> Acked-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
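The bitmap-size arithmetic quoted in the message above is easy to reproduce. The following is a small userspace C sketch (not kernel code; the 64G area, 4k base page and 2M hugepage sizes are hard-coded for the x86 example) that redoes the calculation for both order_per_bit choices.

```c
/* Standalone illustration of the bitmap sizes for hugetlb_cma=64G. */
#include <stdio.h>

int main(void)
{
	unsigned long long area = 64ULL << 30;	/* hugetlb_cma=64G */
	unsigned long long page = 4096;		/* 4k base page */
	unsigned long long huge = 2ULL << 20;	/* 2M, i.e. HUGETLB_PAGE_ORDER worth of pages */

	/* one bit per order-0 page -> 2048 KiB of bitmap */
	printf("order_per_bit = 0:                  %llu KiB\n", (area / page / 8) >> 10);
	/* one bit per hugepage-sized chunk -> 4 KiB of bitmap */
	printf("order_per_bit = HUGETLB_PAGE_ORDER: %llu KiB\n", (area / huge / 8) >> 10);
	return 0;
}
```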
2024-04-26  mm/cma: drop incorrect alignment check in cma_init_reserved_mem  [Frank van der Linden]  1 file, -4/+0
cma_init_reserved_mem uses IS_ALIGNED to check if the size represented by one bit in the cma allocation bitmask is aligned with CMA_MIN_ALIGNMENT_BYTES (pageblock size). However, this is too strict, as this will fail if order_per_bit > pageblock_order, which is a valid configuration. We could check IS_ALIGNED both ways, but since both numbers are powers of two, no check is needed at all. Link: https://lkml.kernel.org/r/20240404162515.527802-1-fvdl@google.com Fixes: de9e14eebf33 ("drivers: dma-contiguous: add initialization from device tree") Signed-off-by: Frank van der Linden <fvdl@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Marek Szyprowski <m.szyprowski@samsung.com> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  selftests/mm: fix additional build errors for selftests  [John Hubbard]  2 files, -0/+425
These build errors only occur if one fails to first run "make headers". However, that is a non-obvious and intrusive requirement, and so there was a discussion on how to get rid of it [1]. This uses that solution. The two files added here are snapshots of the generated header files created via "make headers", copied from ./usr/include/linux/ to ./tools/include/uapi/linux/. That fixes the selftests/mm build on today's Arch Linux (which required userfaultfd.h) and Ubuntu 23.04 (which additionally required memfd.h). [1] https://lore.kernel.org/all/783a4178-1dec-4e30-989a-5174b8176b09@redhat.com/ Link: https://lkml.kernel.org/r/20240328033418.203790-3-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Mark Brown <broonie@kernel.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  selftests: break the dependency upon local header files  [John Hubbard]  2 files, -1/+10
Patch series "Fix selftests/mm build without requiring "make headers"". As mentioned in each patch, this implements the solution that we discussed in December 2023, in [1]. This turned out to be very clean and easy. It should also be quite easy to maintain. This should also make Peter Zijlstra happy, because it directly addresses the root cause of his "NAK NAK NAK" reply [2]. :) [1] https://lore.kernel.org/all/783a4178-1dec-4e30-989a-5174b8176b09@redhat.com/ [2] https://lore.kernel.org/lkml/20231103121652.GA6217@noisy.programming.kicks-ass.net/ This patch (of 2): Use tools/include/uapi/ files instead. These are obtained by taking a snapshot: run "make headers" at the top level, then copy the desired header file into the appropriate subdir in tools/uapi/. This was discussed and solved in [1]. However, even before copying any additional files there, there are already quite a few in tools/include/uapi already. And these will immediately fix a number of selftests/mm build failures. So this patch: a) Adds TOOLS_INCLUDES to selftests/lib.mk, so that all selftests can immediately and easily include the snapshotted header files. b) Uses $(TOOLS_INCLUDES) in the selftests/mm build. On today's Arch Linux, this already fixes all build errors except for a few userfaultfd.h (those will be addressed in a subsequent patch). [1] https://lore.kernel.org/all/783a4178-1dec-4e30-989a-5174b8176b09@redhat.com/ Link: https://lkml.kernel.org/r/20240328033418.203790-1-jhubbard@nvidia.com Link: https://lkml.kernel.org/r/20240328033418.203790-2-jhubbard@nvidia.com Signed-off-by: John Hubbard <jhubbard@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Mark Brown <broonie@kernel.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Axel Rasmussen <axelrasmussen@google.com> Cc: Peter Xu <peterx@redhat.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  hugetlb: convert hugetlb_wp() to use struct vm_fault  [Vishal Moola (Oracle)]  1 file, -32/+32
hugetlb_wp() can use the struct vm_fault passed in from hugetlb_fault(). This alleviates the stack by consolidating 5 variables into a single struct. [vishal.moola@gmail.com: simplify hugetlb_wp() arguments] Link: https://lkml.kernel.org/r/ZhQtoFNZBNwBCeXn@fedora Link: https://lkml.kernel.org/r/20240401202651.31440-4-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  hugetlb: convert hugetlb_no_page() to use struct vm_fault  [Vishal Moola (Oracle)]  1 file, -32/+31
hugetlb_no_page() can use the struct vm_fault passed in from hugetlb_fault(). This alleviates the stack by consolidating 7 variables into a single struct. [vishal.moola@gmail.com: simplify hugetlb_no_page() arguments] Link: https://lkml.kernel.org/r/ZhQtN8y5zud8iI1u@fedora Link: https://lkml.kernel.org/r/20240401202651.31440-3-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  hugetlb: convert hugetlb_fault() to use struct vm_fault  [Vishal Moola (Oracle)]  1 file, -43/+41
Patch series "Hugetlb fault path to use struct vm_fault", v2. This patchset converts the hugetlb fault path to use struct vm_fault. This helps make the code more readable, and alleviates the stack by allowing us to consolidate many fault-related variables into an individual pointer. This patch (of 3): Now that hugetlb_fault() has a vm_fault available for fault tracking, use it throughout. This cleans up the code by removing 2 variables, and prepares hugetlb_fault() to take in a struct vm_fault argument. Link: https://lkml.kernel.org/r/20240401202651.31440-1-vishal.moola@gmail.com Link: https://lkml.kernel.org/r/20240401202651.31440-2-vishal.moola@gmail.com Signed-off-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm/ksm: remove redundant code in ksm_fork  [Jinjiang Tu]  1 file, -10/+2
Since commit 3c6f33b7273a ("mm/ksm: support fork/exec for prctl"), when a child process is forked, the MMF_VM_MERGE_ANY flag will be inherited in mm_init(). So, it's unnecessary to set the flag in ksm_fork(). Link: https://lkml.kernel.org/r/20240402024934.1093361-1-tujinjiang@huawei.com Signed-off-by: Jinjiang Tu <tujinjiang@huawei.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Nanyong Sun <sunnanyong@huawei.com> Cc: Rik van Riel <riel@surriel.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26mm: use "GUP-fast" instead "fast GUP" in remaining commentsDavid Hildenbrand2-2/+2
Let's fix up the remaining comments to consistently call that thing "GUP-fast". With this change, the naming is consistent throughout. Link: https://lkml.kernel.org/r/20240402125516.223131-4-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm/treewide: rename CONFIG_HAVE_FAST_GUP to CONFIG_HAVE_GUP_FAST  [David Hildenbrand]  14 files, -22/+22
Nowadays, we call it "GUP-fast", the external interface includes functions like "get_user_pages_fast()", and we renamed all internal functions to reflect that as well. Let's make the config option reflect that. Link: https://lkml.kernel.org/r/20240402125516.223131-3-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm/gup: consistently name GUP-fast functions  [David Hildenbrand]  1 file, -102/+103
Patch series "mm/gup: consistently call it GUP-fast". Some cleanups around function names, comments and the config option of "GUP-fast" -- GUP without "lock" safety belts on. With this cleanup it's easy to judge which functions are GUP-fast specific. We now consistently call it "GUP-fast", avoiding mixing it with "fast GUP", "lockless", or simply "gup" (which I always considered confusing in the ode). So the magic now happens in functions that contain "gup_fast", whereby gup_fast() is the entry point into that magic. Comments consistently reference either "GUP-fast" or "gup_fast()". This patch (of 3): Let's consistently call the "fast-only" part of GUP "GUP-fast" and rename all relevant internal functions to start with "gup_fast", to make it clearer that this is not ordinary GUP. The current mixture of "lockless", "gup" and "gup_fast" is confusing. Further, avoid the term "huge" when talking about a "leaf" -- for example, we nowadays check pmd_leaf() because pmd_huge() is gone. For the "hugepd"/"hugepte" stuff, it's part of the name ("is_hugepd"), so that stays. What remains is the "external" interface: * get_user_pages_fast_only() * get_user_pages_fast() * pin_user_pages_fast() The high-level internal functions for GUP-fast (+slow fallback) are now: * internal_get_user_pages_fast() -> gup_fast_fallback() * lockless_pages_from_mm() -> gup_fast() The basic GUP-fast walker functions: * gup_pgd_range() -> gup_fast_pgd_range() * gup_p4d_range() -> gup_fast_p4d_range() * gup_pud_range() -> gup_fast_pud_range() * gup_pmd_range() -> gup_fast_pmd_range() * gup_pte_range() -> gup_fast_pte_range() * gup_huge_pgd() -> gup_fast_pgd_leaf() * gup_huge_pud() -> gup_fast_pud_leaf() * gup_huge_pmd() -> gup_fast_pmd_leaf() The weird hugepd stuff: * gup_huge_pd() -> gup_fast_hugepd() * gup_hugepte() -> gup_fast_hugepte() The weird devmap stuff: * __gup_device_huge_pud() -> gup_fast_devmap_pud_leaf() * __gup_device_huge_pmd -> gup_fast_devmap_pmd_leaf() * __gup_device_huge() -> gup_fast_devmap_leaf() * undo_dev_pagemap() -> gup_fast_undo_dev_pagemap() Helper functions: * unpin_user_pages_lockless() -> gup_fast_unpin_user_pages() * gup_fast_folio_allowed() is already properly named * gup_fast_permitted() is already properly named With "gup_fast()", we now even have a function that is referred to in comment in mm/mmu_gather.c. Link: https://lkml.kernel.org/r/20240402125516.223131-1-david@redhat.com Link: https://lkml.kernel.org/r/20240402125516.223131-2-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Reviewed-by: John Hubbard <jhubbard@nvidia.com> Cc: Peter Xu <peterx@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  hugetlb: convert alloc_buddy_hugetlb_folio to use a folio  [Matthew Wilcox (Oracle)]  1 file, -17/+16
While this function returned a folio, it was still using __alloc_pages() and __free_pages(). Use __folio_alloc() and folio_put() instead. This actually removes a call to compound_head(), but more importantly, it prepares us for the move to memdescs. Link: https://lkml.kernel.org/r/20240402200656.913841-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Sidhartha Kumar <sidhartha.kumar@oracle.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm: remove struct page from get_shadow_from_swap_cache  [Matthew Wilcox (Oracle)]  1 file, -4/+4
We don't actually use any parts of struct page; all we do is check the value of the pointer. So give the pointer the appropriate name & type. Link: https://lkml.kernel.org/r/20240402201659.918308-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  x86: mm: accelerate pagefault when badaccess  [Kefeng Wang]  1 file, -9/+14
The access_error() check on the vma is already done under the per-VMA lock; if it is a bad access, directly handle the error, no need to retry with mmap_lock again. In order to release the correct lock, pass the mm_struct into bad_area_access_error(); if mm is NULL, release the vma lock, otherwise release mmap_lock. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-8-wangkefeng.wang@huawei.com Reviewed-by: Suren Baghdasaryan <surenb@google.com> Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
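The "pass mm, and use it to pick the lock to drop" convention described above can be sketched as follows. This is a hedged illustration, not the actual arch/x86/mm/fault.c code: the function name and signature are simplified, only the lock-release logic mirrors the description.

```c
#include <linux/mm.h>

/*
 * Sketch: the caller passes mm == NULL when the fault is being handled under
 * the per-VMA lock, so the error path knows which lock to release before the
 * bad access is reported.
 */
static void bad_access_error(struct pt_regs *regs, unsigned long address,
			     struct mm_struct *mm, struct vm_area_struct *vma)
{
	if (mm)
		mmap_read_unlock(mm);	/* classic mmap_lock path */
	else
		vma_end_read(vma);	/* per-VMA lock path */

	/* ... deliver SIGSEGV / handle the bad access here ... */
}
```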
2024-04-26  s390: mm: accelerate pagefault when badaccess  [Kefeng Wang]  1 file, -1/+2
The vm_flags of the vma are already checked under the per-VMA lock; if it is a bad access, directly handle the error, no need to retry with mmap_lock again. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-7-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Heiko Carstens <hca@linux.ibm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  riscv: mm: accelerate pagefault when badaccess  [Kefeng Wang]  1 file, -1/+4
The access_error() check on the vma is already done under the per-VMA lock; if it is a bad access, directly handle the error, no need to retry with mmap_lock again. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. [wangkefeng.wang@huawei.com: use `cause' rather than SIGSEGV, per Alexandre] Link: https://lkml.kernel.org/r/ac978061-ce1a-40a4-8b0a-61883b42bea7@huawei.com Link: https://lkml.kernel.org/r/20240403083805.1818160-6-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Tested-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  powerpc: mm: accelerate pagefault when badaccess  [Kefeng Wang]  1 file, -13/+20
The access_[pkey]_error() checks on the vma are already done under the per-VMA lock; if it is a bad access, directly handle the error, no need to retry with mmap_lock again. In order to release the correct lock, pass the mm_struct into bad_access_pkey()/bad_access(); if mm is NULL, release the vma lock, otherwise release mmap_lock. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-5-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  arm: mm: accelerate pagefault when VM_FAULT_BADACCESS  [Kefeng Wang]  1 file, -1/+3
The vm_flags of the vma are already checked under the per-VMA lock; if it is a bad access, directly set fault to VM_FAULT_BADACCESS and handle the error, no need to retry with mmap_lock again. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-4-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  arm64: mm: accelerate pagefault when VM_FAULT_BADACCESS  [Kefeng Wang]  1 file, -1/+3
The vm_flags of the vma are already checked under the per-VMA lock; if it is a bad access, directly set fault to VM_FAULT_BADACCESS and handle the error, no need to retry with mmap_lock again. This reduces the latency of 'lat_sig -P 1 prot lat_sig' from the lmbench testcase by 34%. Since the page fault is handled under the per-VMA lock, count it as a vma lock event with VMA_LOCK_SUCCESS. Link: https://lkml.kernel.org/r/20240403083805.1818160-3-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  arm64: mm: cleanup __do_page_fault()  [Kefeng Wang]  1 file, -20/+7
Patch series "arch/mm/fault: accelerate pagefault when badaccess", v2. After VMA lock-based page fault handling enabled, if bad access met under per-vma lock, it will fallback to mmap_lock-based handling, so it leads to unnessary mmap lock and vma find again. A test from lmbench shows 34% improve after this changes on arm64, lat_sig -P 1 prot lat_sig 0.29194 -> 0.19198 This patch (of 7): The __do_page_fault() only calls handle_mm_fault() after vm_flags checked, and it is only called by do_page_fault(), let's squash it into do_page_fault() to cleanup code. Link: https://lkml.kernel.org/r/20240403083805.1818160-1-wangkefeng.wang@huawei.com Link: https://lkml.kernel.org/r/20240403083805.1818160-2-wangkefeng.wang@huawei.com Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Russell King <linux@armlinux.org.uk> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm: madvise: avoid split during MADV_PAGEOUT and MADV_COLD  [Ryan Roberts]  4 files, -41/+92
Rework madvise_cold_or_pageout_pte_range() to avoid splitting any large folio that is fully and contiguously mapped in the pageout/cold vm range. This change means that large folios will be maintained all the way to swap storage. This both improves performance during swap-out, by eliding the cost of splitting the folio, and sets us up nicely for maintaining the large folio when it is swapped back in (to be covered in a separate series). Folios that are not fully mapped in the target range are still split, but note that behavior is changed so that if the split fails for any reason (folio locked, shared, etc.) we now leave it as is and move to the next pte in the range and continue work on the subsequent folios. Previously any failure of this sort would cause the entire operation to give up and no folios mapped at higher addresses were paged out or made cold. Given large folios are becoming more common, this old behavior would likely have led to wasted opportunities. While we are at it, change the code that clears young from the ptes to use ptep_test_and_clear_young(), via the new mkold_ptes() batch helper function. This is more efficient than get_and_clear/modify/set, especially for contpte mappings on arm64, where the old approach would require unfolding/refolding and the new approach can be done in place. Link: https://lkml.kernel.org/r/20240408183946.2991168-8-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
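The mkold_ptes() batch helper mentioned above plausibly looks like the generic fallback sketched below; the in-tree definition (in include/linux/pgtable.h) may differ in detail, and architectures such as arm64 can override it to handle contpte mappings in place.

```c
/*
 * Sketch of a generic batch helper: clear the young bit on 'nr' consecutive
 * ptes via ptep_test_and_clear_young(), instead of a get_and_clear/modify/set
 * round trip per pte.
 */
static inline void mkold_ptes(struct vm_area_struct *vma, unsigned long addr,
			      pte_t *ptep, unsigned int nr)
{
	for (; nr; nr--, ptep++, addr += PAGE_SIZE)
		ptep_test_and_clear_young(vma, addr, ptep);
}
```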
2024-04-26  mm: vmscan: avoid split during shrink_folio_list()  [Ryan Roberts]  1 file, -10/+10
Now that swap supports storing all mTHP sizes, avoid splitting large folios before swap-out. This benefits performance of the swap-out path by eliding split_folio_to_list(), which is expensive, and also sets us up for swapping in large folios in a future series. If the folio is partially mapped, we continue to split it since we want to avoid the extra IO overhead and storage of writing out pages unnecessarily. THP_SWPOUT and THP_SWPOUT_FALLBACK counters should continue to count events only for PMD-mappable folios to avoid user confusion. THP_SWPOUT already has the appropriate guard. Add a guard for THP_SWPOUT_FALLBACK. It may be appropriate to add per-size counters in future. Link: https://lkml.kernel.org/r/20240408183946.2991168-7-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Barry Song <v-songbaohua@oppo.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm: swap: allow storage of all mTHP orders  [Ryan Roberts]  2 files, -72/+98
Multi-size THP enables performance improvements by allocating large, pte-mapped folios for anonymous memory. However, I've observed that on an arm64 system running a parallel workload (e.g. kernel compilation) across many cores, under high memory pressure, the speed regresses. This is due to bottlenecking on the increased number of TLBIs added due to all the extra folio splitting when the large folios are swapped out. Therefore, solve this regression by adding support for swapping out mTHP without needing to split the folio, just like is already done for PMD-sized THP. This change only applies when CONFIG_THP_SWAP is enabled, and when the swap backing store is a non-rotating block device. These are the same constraints as for the existing PMD-sized THP swap-out support. Note that no attempt is made to swap-in (m)THP here - this is still done page-by-page, like for PMD-sized THP. But swapping-out mTHP is a prerequisite for swapping-in mTHP. The main change here is to improve the swap entry allocator so that it can allocate any power-of-2 number of contiguous entries between [1, (1 << PMD_ORDER)]. This is done by allocating a cluster for each distinct order and allocating sequentially from it until the cluster is full. This ensures that we don't need to search the map and we get no fragmentation due to alignment padding for different orders in the cluster. If there is no current cluster for a given order, we attempt to allocate a free cluster from the list. If there are no free clusters, we fail the allocation and the caller can fall back to splitting the folio and allocating individual entries (as per existing PMD-sized THP fallback). The per-order current clusters are maintained per-cpu using the existing infrastructure. This is done to avoid interleaving pages from different tasks, which would prevent IO being batched. This is already done for the order-0 allocations so we follow the same pattern. As is done for order-0 per-cpu clusters, the scanner now can steal order-0 entries from any per-cpu-per-order reserved cluster. This ensures that when the swap file is getting full, space doesn't get tied up in the per-cpu reserves. This change only modifies swap to be able to accept any order mTHP. It doesn't change the callers to elide doing the actual split. That will be done in separate changes. Link: https://lkml.kernel.org/r/20240408183946.2991168-6-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm: swap: update get_swap_pages() to take folio order  [Ryan Roberts]  3 files, -10/+11
We are about to allow swap storage of any mTHP size. To prepare for that, let's change get_swap_pages() to take a folio order parameter instead of nr_pages. This makes the interface self-documenting; a power-of-2 number of pages must be provided. We will also need the order internally so this simplifies accessing it. Link: https://lkml.kernel.org/r/20240408183946.2991168-5-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm: swap: simplify struct percpu_cluster  [Ryan Roberts]  2 files, -12/+19
struct percpu_cluster stores the index of the cpu's current cluster and the offset of the next entry that will be allocated for the cpu. These two pieces of information are redundant because the cluster index is just (offset / SWAPFILE_CLUSTER). The only reason for explicitly keeping the cluster index is because the structure used for it also has a flag to indicate "no cluster". However, this data structure also contains a spin lock, which is never used in this context; as a side effect, the code copies the spinlock_t structure, which is questionable coding practice in my view. So let's clean this up and store only the next offset, and use a sentinel value (SWAP_NEXT_INVALID) to indicate "no cluster". SWAP_NEXT_INVALID is chosen to be 0, because 0 will never be seen legitimately; the first page in the swap file is the swap header, which is always marked bad to prevent it from being allocated as an entry. This also prevents the cluster to which it belongs from being marked free, so it will never appear on the free list. This change saves 16 bytes per cpu. And given we are shortly going to extend this mechanism to be per-cpu-AND-per-order, we will end up saving 16 * 9 = 144 bytes per cpu, which adds up if you have 256 cpus in the system. Link: https://lkml.kernel.org/r/20240408183946.2991168-4-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
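The simplified per-cpu state described above can be sketched as follows; field names are approximate and include/linux/swap.h holds the authoritative definition.

```c
/*
 * Sketch: only the next offset is kept per cpu.  Offset 0 doubles as the
 * "no current cluster" sentinel, because the swap header occupies offset 0
 * and is always marked bad, so it can never be handed out as an entry.
 */
#define SWAP_NEXT_INVALID	0

struct percpu_cluster {
	unsigned int next;	/* likely offset of next allocation, or SWAP_NEXT_INVALID */
};
```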
2024-04-26  mm: swap: free_swap_and_cache_nr() as batched free_swap_and_cache()  [Ryan Roberts]  6 files, -31/+196
Now that we no longer have a convenient flag in the cluster to determine if a folio is large, free_swap_and_cache() will take a reference and lock a large folio much more often, which could lead to contention and (e.g.) failure to split large folios, etc. Let's solve that problem by batch freeing swap and cache with a new function, free_swap_and_cache_nr(), to free a contiguous range of swap entries together. This allows us to first drop a reference to each swap slot before we try to release the cache folio. This means we only try to release the folio once, only taking the reference and lock once - much better than the previous 512 times for the 2M THP case. Contiguous swap entries are gathered in zap_pte_range() and madvise_free_pte_range() in a similar way to how present ptes are already gathered in zap_pte_range(). While we are at it, let's simplify by converting the return type of both functions to void. The return value was used only by zap_pte_range() to print a bad pte, and was ignored by everyone else, so the extra reporting wasn't exactly guaranteed. We will still get the warning with most of the information from get_swap_device(). With the batch version, we wouldn't know which pte was bad anyway so could print the wrong one. [ryan.roberts@arm.com: fix a build warning on parisc] Link: https://lkml.kernel.org/r/20240409111840.3173122-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240408183946.2991168-3-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Barry Song <v-songbaohua@oppo.com> Cc: Chris Li <chrisl@kernel.org> Cc: Gao Xiang <xiang@kernel.org> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm: swap: remove CLUSTER_FLAG_HUGE from swap_cluster_info:flags  [Ryan Roberts]  3 files, -52/+8
Patch series "Swap-out mTHP without splitting", v7. This series adds support for swapping out multi-size THP (mTHP) without needing to first split the large folio via split_huge_page_to_list_to_order(). It closely follows the approach already used to swap-out PMD-sized THP. There are a couple of reasons for swapping out mTHP without splitting: - Performance: It is expensive to split a large folio and under extreme memory pressure some workloads regressed performance when using 64K mTHP vs 4K small folios because of this extra cost in the swap-out path. This series not only eliminates the regression but makes it faster to swap out 64K mTHP vs 4K small folios. - Memory fragmentation avoidance: If we can avoid splitting a large folio memory is less likely to become fragmented, making it easier to re-allocate a large folio in future. - Performance: Enables a separate series [7] to swap-in whole mTHPs, which means we won't lose the TLB-efficiency benefits of mTHP once the memory has been through a swap cycle. I've done what I thought was the smallest change possible, and as a result, this approach is only employed when the swap is backed by a non-rotating block device (just as PMD-sized THP is supported today). Discussion against the RFC concluded that this is sufficient. Performance Testing =================== I've run some swap performance tests on Ampere Altra VM (arm64) with 8 CPUs. The VM is set up with a 35G block ram device as the swap device and the test is run from inside a memcg limited to 40G memory. I've then run `usemem` from vm-scalability with 70 processes, each allocating and writing 1G of memory. I've repeated everything 6 times and taken the mean performance improvement relative to 4K page baseline: | alloc size | baseline | + this series | | | mm-unstable (~v6.9-rc1) | | |:-----------|------------------------:|------------------------:| | 4K Page | 0.0% | 1.3% | | 64K THP | -13.6% | 46.3% | | 2M THP | 91.4% | 89.6% | So with this change, the 64K swap performance goes from a 14% regression to a 46% improvement. While 2M shows a small regression I'm confident that this is just noise. [1] https://lore.kernel.org/linux-mm/20231010142111.3997780-1-ryan.roberts@arm.com/ [2] https://lore.kernel.org/linux-mm/20231017161302.2518826-1-ryan.roberts@arm.com/ [3] https://lore.kernel.org/linux-mm/20231025144546.577640-1-ryan.roberts@arm.com/ [4] https://lore.kernel.org/linux-mm/20240311150058.1122862-1-ryan.roberts@arm.com/ [5] https://lore.kernel.org/linux-mm/20240327144537.4165578-1-ryan.roberts@arm.com/ [6] https://lore.kernel.org/linux-mm/20240403114032.1162100-1-ryan.roberts@arm.com/ [7] https://lore.kernel.org/linux-mm/20240304081348.197341-1-21cnbao@gmail.com/ [8] https://lore.kernel.org/linux-mm/CAGsJ_4yMOow27WDvN2q=E4HAtDd2PJ=OQ5Pj9DG+6FLWwNuXUw@mail.gmail.com/ [9] https://lore.kernel.org/linux-mm/579d5127-c763-4001-9625-4563a9316ac3@redhat.com/ This patch (of 7): As preparation for supporting small-sized THP in the swap-out path, without first needing to split to order-0, Remove the CLUSTER_FLAG_HUGE, which, when present, always implies PMD-sized THP, which is the same as the cluster size. The only use of the flag was to determine whether a swap entry refers to a single page or a PMD-sized THP in swap_page_trans_huge_swapped(). Instead of relying on the flag, we now pass in order, which originates from the folio's order. This allows the logic to work for folios of any order. The one snag is that one of the swap_page_trans_huge_swapped() call sites does not have the folio. 
But it was only being called there to shortcut a call to __try_to_reclaim_swap() in some cases. __try_to_reclaim_swap() gets the folio and (via some other functions) calls swap_page_trans_huge_swapped(). So I've removed the problematic call site and believe the new logic should be functionally equivalent. That said, removing the fast path means that we will take a reference and trylock a large folio much more often, which we would like to avoid. The next patch will solve this. Removing CLUSTER_FLAG_HUGE also means we can remove split_swap_cluster(), which used to be called during folio splitting, since split_swap_cluster()'s only job was to remove the flag. Link: https://lkml.kernel.org/r/20240408183946.2991168-1-ryan.roberts@arm.com Link: https://lkml.kernel.org/r/20240408183946.2991168-2-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Chris Li <chrisl@kernel.org> Acked-by: David Hildenbrand <david@redhat.com> Cc: Barry Song <21cnbao@gmail.com> Cc: Gao Xiang <xiang@kernel.org> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michal Hocko <mhocko@suse.com> Cc: Yang Shi <shy828301@gmail.com> Cc: Yu Zhao <yuzhao@google.com> Cc: Barry Song <v-songbaohua@oppo.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm: page_alloc: use the correct THP order for THP PCP  [Baolin Wang]  1 file, -3/+3
Commit 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") extends the PCP allocator to store THP pages, and it determines whether to cache THP pages in PCP by comparing with pageblock_order. But pageblock_order is not always equal to the THP order; it might also be MAX_PAGE_ORDER, which could prevent PCP from caching THP pages. Therefore, use HPAGE_PMD_ORDER instead to determine the need for caching THP in PCP, which fixes this issue. Link: https://lkml.kernel.org/r/a25c9e14cd03907d5978b60546a69e6aa3fc2a7d.1712151833.git.baolin.wang@linux.alibaba.com Fixes: 44042b449872 ("mm/page_alloc: allow high-order pages to be stored on the per-cpu lists") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Mel Gorman <mgorman@techsingularity.net> Reviewed-by: Barry Song <baohua@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
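After this change, the PCP order check ends up looking roughly like the sketch below (an approximation of the mm/page_alloc.c helper; the pre-patch version compared against pageblock_order instead of HPAGE_PMD_ORDER).

```c
/* Sketch: which orders may be cached on the per-cpu page lists. */
static inline bool pcp_allowed_order(unsigned int order)
{
	if (order <= PAGE_ALLOC_COSTLY_ORDER)
		return true;
#ifdef CONFIG_TRANSPARENT_HUGEPAGE
	if (order == HPAGE_PMD_ORDER)	/* was: order == pageblock_order */
		return true;
#endif
	return false;
}
```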
2024-04-26  proc: convert smaps_pmd_entry to use a folio  [Matthew Wilcox (Oracle)]  1 file, -3/+5
Replace two calls to compound_head() with one. Link: https://lkml.kernel.org/r/20240403171456.1445117-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  proc: pass a folio to smaps_page_accumulate()  [Matthew Wilcox (Oracle)]  1 file, -6/+5
Both callers already have a folio; pass it in instead of doing the conversion each time. Link: https://lkml.kernel.org/r/20240403171456.1445117-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  proc: convert smaps_page_accumulate to use a folio  [Matthew Wilcox (Oracle)]  1 file, -3/+4
Replaces three calls to compound_head() with one. Shrinks the function from 2614 bytes to 1112 bytes in an allmodconfig build. Link: https://lkml.kernel.org/r/20240403171456.1445117-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  proc: convert gather_stats to use a folio  [Matthew Wilcox (Oracle)]  1 file, -6/+7
Patch series "Use folio APIs in procfs". We're down to very few users of the PageFoo macros, with proc being a major user. After this patchset and another patchset I have for khugepaged, we can get rid of PageActive, PageReadahead and PageSwapBacked. This patchset has the usual advantages in its own right of removing hidden calls to compound_head(). We have the page table lock, so the mapcount & refcount are stable and there can't be any races with folios suddenly becoming tail pages. This patch (of 4): Replaces six calls to compound_head() with one. Shrinks the function from 5054 bytes to 1756 bytes in an allmodconfig build. Link: https://lkml.kernel.org/r/20240403171456.1445117-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240403171456.1445117-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Christian Brauner <brauner@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm: generate PAGE_IDLE_FLAG definitions  [Matthew Wilcox (Oracle)]  2 files, -36/+10
If CONFIG_PAGE_IDLE_FLAG is not set, we can use FOLIO_FLAG_FALSE() to generate these definitions. Link: https://lkml.kernel.org/r/20240402201252.917342-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  mm: remove page_idle and page_young wrappers  [Matthew Wilcox (Oracle)]  1 file, -25/+0
All users have now been converted to the folio equivalents, so remove the page wrappers. Link: https://lkml.kernel.org/r/20240402201252.917342-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  proc: convert smaps_account() to use a folio  [Matthew Wilcox (Oracle)]  1 file, -7/+9
Replace seven calls to compound_head() with one. Link: https://lkml.kernel.org/r/20240402201252.917342-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  proc: convert clear_refs_pte_range to use a folio  [Matthew Wilcox (Oracle)]  1 file, -8/+8
Patch series "Remove page_idle and page_young wrappers". There are only a couple of places left using the page wrappers for idle & young tracking. Convert the two users in proc and then we can remove the wrappers. That enables the further simplification of autogenerating the definitions when CONFIG_PAGE_IDLE_FLAG is disabled. This patch (of 4): Replaces four calls to compound_head() with two. Link: https://lkml.kernel.org/r/20240402201252.917342-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240402201252.917342-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  khugepaged: use a folio throughout hpage_collapse_scan_file()  [Matthew Wilcox (Oracle)]  2 files, -20/+19
Replace the use of pages with folios. Saves a few calls to compound_head() and removes some uses of obsolete functions. Link: https://lkml.kernel.org/r/20240403171838.1445826-8-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  khugepaged: use a folio throughout collapse_file()  [Matthew Wilcox (Oracle)]  1 file, -59/+54
Pull folios from the page cache instead of pages. Half of this work had been done already, but we were still operating on pages for a large chunk of this function. There is no attempt in this patch to handle large folios that are smaller than a THP; that will have to wait for a future patch. [willy@infradead.org: the unlikely() is embedded in IS_ERR()] Link: https://lkml.kernel.org/r/ZhIWX8K0E2tSyMSr@casper.infradead.org Link: https://lkml.kernel.org/r/20240403171838.1445826-7-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  khugepaged: remove hpage from collapse_file()  [Matthew Wilcox (Oracle)]  2 files, -41/+42
Use new_folio throughout where we had been using hpage. Link: https://lkml.kernel.org/r/20240403171838.1445826-6-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  khugepaged: pass a folio to __collapse_huge_page_copy()  [Matthew Wilcox (Oracle)]  1 file, -19/+15
Simplify the body of __collapse_huge_page_copy() while I'm looking at it. Link: https://lkml.kernel.org/r/20240403171838.1445826-5-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  khugepaged: remove hpage from collapse_huge_page()  [Matthew Wilcox (Oracle)]  1 file, -7/+5
Work purely in terms of the folio. Removes a call to compound_head() in put_page(). Link: https://lkml.kernel.org/r/20240403171838.1445826-4-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  khugepaged: convert alloc_charge_hpage to alloc_charge_folio  [Matthew Wilcox (Oracle)]  1 file, -8/+9
Both callers want to deal with a folio, so return a folio from this function. Link: https://lkml.kernel.org/r/20240403171838.1445826-3-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  khugepaged: inline hpage_collapse_alloc_folio()  [Matthew Wilcox (Oracle)]  1 file, -15/+4
Patch series "khugepaged folio conversions". We've been kind of hacking piecemeal at converting khugepaged to use folios instead of compound pages, and so this patchset is a little larger than it should be as I undo some of our wrong moves in the past. In particular, collapse_file() now consistently uses 'new_folio' for the freshly allocated folio and 'folio' for the one that's currently in use. This patch (of 7): This function has one caller, and the combined function is simpler to read, reason about and modify. Link: https://lkml.kernel.org/r/20240403171838.1445826-1-willy@infradead.org Link: https://lkml.kernel.org/r/20240403171838.1445826-2-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Reviewed-by: Vishal Moola (Oracle) <vishal.moola@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  selftests/mm: mremap_test: use sscanf to parse /proc/self/maps  [Dev Jain]  1 file, -7/+11
Enforce consistency across files by avoiding two separate functions to parse /proc/self/maps, replacing them with a simple sscanf(). Link: https://lkml.kernel.org/r/20240330173557.2697684-4-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
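Parsing /proc/self/maps with a single sscanf(), as this patch does, looks roughly like the userspace sketch below; the exact format string used by the selftest may differ.

```c
#include <stdio.h>

int main(void)
{
	FILE *fp = fopen("/proc/self/maps", "r");
	char line[1024], perms[8];
	unsigned long start, end;

	if (!fp)
		return 1;
	/* Each line starts "start-end perms ...", e.g. "7f12a000-7f12e000 r-xp ..." */
	while (fgets(line, sizeof(line), fp)) {
		if (sscanf(line, "%lx-%lx %7s", &start, &end, perms) == 3)
			printf("%lx-%lx %s\n", start, end, perms);
	}
	fclose(fp);
	return 0;
}
```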
2024-04-26  selftests/mm: mremap_test: optimize execution time from minutes to seconds using chunkwise memcmp  [Dev Jain]  1 file, -21/+91
Mismatch index is currently being checked by a brute force iteration over the buffer. Instead, break the comparison into O(sqrt(n)) number of chunks, with the chunk size of this order only, where n is the size of the buffer. Do a brute-force iteration to print to stdout only when the highly optimized memcmp() library function returns a mismatch in the chunk. The time complexity of this algorithm is O(sqrt(n)) * t, where t is the time taken by memcmp(); for our test conditions, it is safe to assume t to be small. Link: https://lkml.kernel.org/r/20240330173557.2697684-3-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
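The chunkwise-compare idea is illustrated by the following userspace sketch (not the selftest's exact code): compare in ~sqrt(n)-sized chunks with memcmp(), and only fall back to a byte-by-byte scan inside the first chunk that mismatches. Build with -lm for sqrt().

```c
#include <math.h>
#include <stddef.h>
#include <string.h>

/* Returns the index of the first mismatching byte, or -1 if the buffers are equal. */
static long first_mismatch(const char *a, const char *b, size_t n)
{
	size_t chunk = (size_t)sqrt((double)n);
	size_t i, j, len;

	if (chunk == 0)
		chunk = 1;
	for (i = 0; i < n; i += chunk) {
		len = (n - i < chunk) ? n - i : chunk;
		if (memcmp(a + i, b + i, len) == 0)
			continue;	/* fast path: whole chunk matches */
		for (j = i; j < i + len; j++)	/* brute force only inside the bad chunk */
			if (a[j] != b[j])
				return (long)j;
	}
	return -1;
}
```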
2024-04-26  selftests/mm: mremap_test: optimize using pre-filled random array and memcpy  [Dev Jain]  1 file, -25/+53
Patch series "selftests/mm: mremap_test: Optimizations and style fixes". The mremap_test, in a worst case controlled by the -t flag, does a for loop iteration in orders of GB. Without compromising on the stdout report, the aim is to reduce this time. A pre-filled random buffer is allocated based on the seed, replacing repetitive rand() calls. The byte pattern in the memory locations is set through memcpy() from the random buffer. Replacing the loop for printing the mismatch index to stdout, employ an efficient algorithm by breaking the comparison into chunks, use the highly optimized memcmp() library function, and when a mismatch does occur, only then do a brute force iteration. Also, use sscanf() to parse /proc/self/maps for consistency across files. Execution time results (x86 system): ./mremap_test Original: 3 seconds After change: 0.8 seconds ./mremap_test -t100 Original: 17 seconds After change: 2 seconds ./mremap_test -t0 (worst case): Original: 9:40 minutes After change: 45 seconds This patch (of 3): Allocate a pre-filled random buffer using the seed. Replace iterative copying of the random sequence to buffers using the highly optimized library function memcpy(). Link: https://lkml.kernel.org/r/20240330173557.2697684-1-dev.jain@arm.com Link: https://lkml.kernel.org/r/20240330173557.2697684-2-dev.jain@arm.com Signed-off-by: Dev Jain <dev.jain@arm.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Kalesh Singh <kaleshsingh@google.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-04-26  memory: remove the now superfluous sentinel element from ctl_table array  [Joel Granados]  7 files, -7/+0
This commit comes at the tail end of a greater effort to remove the empty elements at the end of the ctl_table arrays (sentinels), which reduces the overall kernel build size and run-time memory bloat by ~64 bytes per sentinel (further information: https://lore.kernel.org/all/ZO5Yx5JFogGi%2FcBo@bombadil.infradead.org/). Remove the sentinel from all files under mm/ that register a sysctl table. Link: https://lkml.kernel.org/r/20240328-jag-sysctl_remset_misc-v1-1-47c1463b3af2@samsung.com Signed-off-by: Joel Granados <j.granados@samsung.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Reviewed-by: Miaohe Lin <linmiaohe@huawei.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
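"Dropping the sentinel" means the array no longer ends in an empty struct ctl_table entry, because the registration macros (e.g. register_sysctl()) now pass the array length via ARRAY_SIZE() rather than scanning for a zeroed terminator. The sketch below is illustrative only; the knob and field values are made up.

```c
/* Illustrative ctl_table with no trailing sentinel entry. */
static int example_knob;

static struct ctl_table example_table[] = {
	{
		.procname	= "example_knob",
		.data		= &example_knob,
		.maxlen		= sizeof(int),
		.mode		= 0644,
		.proc_handler	= proc_dointvec,
	},
	/* previously an all-zero { } sentinel entry sat here -- now removed */
};

/* register_sysctl("vm", example_table) passes ARRAY_SIZE(example_table) internally. */
```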