| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Patch series "mm: Support huge pfnmaps", v2.
Overview
========
This series implements huge pfnmaps support for mm in general. Huge
pfnmap allows e.g. VM_PFNMAP vmas to map in either PMD or PUD levels,
similar to what we do with dax / thp / hugetlb so far to benefit from TLB
hits. Now we extend that idea to PFN mappings, e.g. PCI MMIO bars where
it can grow as large as 8GB or even bigger.
Currently, only x86_64 (1G+2M) and arm64 (2M) are supported. The last
patch (from Alex Williamson) will be the first user of huge pfnmap, so as
to enable vfio-pci driver to fault in huge pfn mappings.
Implementation
==============
In reality, it's relatively simple to add such support comparing to many
other types of mappings, because of PFNMAP's specialties when there's no
vmemmap backing it, so that most of the kernel routines on huge mappings
should simply already fail for them, like GUPs or old-school follow_page()
(which is recently rewritten to be folio_walk* APIs by David).
One trick here is that we're still unmature on PUDs in generic paths here
and there, as DAX is so far the only user. This patchset will add the 2nd
user of it. Hugetlb can be a 3rd user if the hugetlb unification work can
go on smoothly, but to be discussed later.
The other trick is how to allow gup-fast working for such huge mappings
even if there's no direct sign of knowing whether it's a normal page or
MMIO mapping. This series chose to keep the pte_special solution, so that
it reuses similar idea on setting a special bit to pfnmap PMDs/PUDs so
that gup-fast will be able to identify them and fail properly.
Along the way, we'll also notice that the major pgtable pfn walker, aka,
follow_pte(), will need to retire soon due to the fact that it only works
with ptes. A new set of simple API is introduced (follow_pfnmap* API) to
be able to do whatever follow_pte() can already do, plus that it can also
process huge pfnmaps now. Half of this series is about that and
converting all existing pfnmap walkers to use the new API properly.
Hopefully the new API also looks better to avoid exposing e.g. pgtable
lock details into the callers, so that it can be used in an even more
straightforward way.
Here, three more options will be introduced and involved in huge pfnmap:
- ARCH_SUPPORTS_HUGE_PFNMAP
Arch developers will need to select this option when huge pfnmap is
supported in arch's Kconfig. After this patchset applied, both x86_64
and arm64 will start to enable it by default.
- ARCH_SUPPORTS_PMD_PFNMAP / ARCH_SUPPORTS_PUD_PFNMAP
These options are for driver developers to identify whether current
arch / config supports huge pfnmaps, making decision on whether it can
use the huge pfnmap APIs to inject them. One can refer to the last
vfio-pci patch from Alex on the use of them properly in a device
driver.
So after the whole set applied, and if one would enable some dynamic debug
lines in vfio-pci core files, we should observe things like:
vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x0: 0x100
vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x200: 0x100
vfio-pci 0000:00:06.0: vfio_pci_mmap_huge_fault(,order = 9) BAR 0 page offset 0x400: 0x100
In this specific case, it says that vfio-pci faults in PMDs properly for a
few BAR0 offsets.
Patch Layout
============
Patch 1: Introduce the new options mentioned above for huge PFNMAPs
Patch 2: A tiny cleanup
Patch 3-8: Preparation patches for huge pfnmap (include introduce
special bit for pmd/pud)
Patch 9-16: Introduce follow_pfnmap*() API, use it everywhere, and
then drop follow_pte() API
Patch 17: Add huge pfnmap support for x86_64
Patch 18: Add huge pfnmap support for arm64
Patch 19: Add vfio-pci support for all kinds of huge pfnmaps (Alex)
TODO
====
More architectures / More page sizes
------------------------------------
Currently only x86_64 (2M+1G) and arm64 (2M) are supported. There seems
to have plan to support arm64 1G later on top of this series [2].
Any arch will need to first support THP / THP_1G, then provide a special
bit in pmds/puds to support huge pfnmaps.
remap_pfn_range() support
-------------------------
Currently, remap_pfn_range() still only maps PTEs. With the new option,
remap_pfn_range() can logically start to inject either PMDs or PUDs when
the alignment requirements match on the VAs.
When the support is there, it should be able to silently benefit all
drivers that is using remap_pfn_range() in its mmap() handler on better
TLB hit rate and overall faster MMIO accesses similar to processor on
hugepages.
More driver support
-------------------
VFIO is so far the only consumer for the huge pfnmaps after this series
applied. Besides above remap_pfn_range() generic optimization, device
driver can also try to optimize its mmap() on a better VA alignment for
either PMD/PUD sizes. This may, iiuc, normally require userspace changes,
as the driver doesn't normally decide the VA to map a bar. But I don't
think I know all the drivers to know the full picture.
Credits all go to Alex on help testing the GPU/NIC use cases above.
[0] https://lore.kernel.org/r/73ad9540-3fb8-4154-9a4f-30a0a2b03d41@lucifer.local
[1] https://lore.kernel.org/r/20240807194812.819412-1-peterx@redhat.com
[2] https://lore.kernel.org/r/498e0731-81a4-4f75-95b4-a8ad0bcc7665@huawei.com
This patch (of 19):
This patch introduces the option to introduce special pte bit into
pmd/puds. Archs can start to define pmd_special / pud_special when
supported by selecting the new option. Per-arch support will be added
later.
Before that, create fallbacks for these helpers so that they are always
available.
Link: https://lkml.kernel.org/r/20240826204353.2228736-1-peterx@redhat.com
Link: https://lkml.kernel.org/r/20240826204353.2228736-2-peterx@redhat.com
Signed-off-by: Peter Xu <peterx@redhat.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Alex Williamson <alex.williamson@redhat.com>
Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Catalin Marinas <catalin.marinas@arm.com>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David Hildenbrand <david@redhat.com>
Cc: Gavin Shan <gshan@redhat.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jason Gunthorpe <jgg@nvidia.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Niklas Schnelle <schnelle@linux.ibm.com>
Cc: Paolo Bonzini <pbonzini@redhat.com>
Cc: Ryan Roberts <ryan.roberts@arm.com>
Cc: Sean Christopherson <seanjc@google.com>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Will Deacon <will@kernel.org>
Cc: Zi Yan <ziy@nvidia.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Add pgalloc_tag_copy() to transfer the codetag from the old folio to the
new one during migration. This makes original allocation sites persist
cross migration rather than lump into the get_new_folio callbacks passed
into migrate_pages(), e.g., compaction_alloc():
# echo 1 >/proc/sys/vm/compact_memory
# grep compaction_alloc /proc/allocinfo
Before this patch:
132968448 32463 mm/compaction.c:1880 func:compaction_alloc
After this patch:
0 0 mm/compaction.c:1880 func:compaction_alloc
Link: https://lkml.kernel.org/r/20240906042108.1150526-3-yuzhao@google.com
Fixes: dcfe378c81f7 ("lib: introduce support for page allocation tagging")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The current assumption is that a large folio can only be split into
order-0 folios. That is not the case for hugeTLB demotion, nor for THP
split: see commit c010d47f107f ("mm: thp: split huge page to any lower
order pages").
When a large folio is split into ones of a lower non-zero order, only the
new head pages should be tagged. Tagging tail pages can cause imbalanced
"calls" counters, since only head pages are untagged by pgalloc_tag_sub()
and the "calls" counts on tail pages are leaked, e.g.,
# echo 2048kB >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote_size
# echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/nr_hugepages
# time echo 700 >/sys/kernel/mm/hugepages/hugepages-1048576kB/demote
# echo 0 >/sys/kernel/mm/hugepages/hugepages-2048kB/nr_hugepages
# grep alloc_gigantic_folio /proc/allocinfo
Before this patch:
0 549427200 mm/hugetlb.c:1549 func:alloc_gigantic_folio
real 0m2.057s
user 0m0.000s
sys 0m2.051s
After this patch:
0 0 mm/hugetlb.c:1549 func:alloc_gigantic_folio
real 0m1.711s
user 0m0.000s
sys 0m1.704s
Not tagging tail pages also improves the splitting time, e.g., by about
15% when demoting 1GB hugeTLB folios to 2MB ones, as shown above.
Link: https://lkml.kernel.org/r/20240906042108.1150526-2-yuzhao@google.com
Fixes: be25d1d4e822 ("mm: create new codetag references during page splitting")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
Link: https://lkml.kernel.org/r/20240906042108.1150526-1-yuzhao@google.com
Fixes: 22d407b164ff ("lib: add allocation tagging support for memory allocation profiling")
Signed-off-by: Yu Zhao <yuzhao@google.com>
Acked-by: Suren Baghdasaryan <surenb@google.com>
Acked-by: Muchun Song <muchun.song@linux.dev>
Cc: Kent Overstreet <kent.overstreet@linux.dev>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In many places, in the comments, we use both "higher-order" and
"high-order" to describe the non 0-order pages. That is confusing,
because a "higher-order" statement does not reflect what it is compared
with.
Link: https://lkml.kernel.org/r/20240906095049.3486-1-urezki@gmail.com
Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Suggested-by: Baoquan He <bhe@redhat.com>
Reviewed-by: Baoquan He <bhe@redhat.com>
Cc: Christoph Hellwig <hch@infradead.org>
Cc: Michal Hocko <mhocko@suse.com>
Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sony.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
| |
Use helper function va_size() to improve code readability. No functional
modification involved.
Link: https://lkml.kernel.org/r/20240906102539.3537207-1-zhangpeng362@huawei.com
Signed-off-by: ZhangPeng <zhangpeng362@huawei.com>
Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com>
Cc: Christoph Hellwig <hch@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The tracing of invalidation and truncation operations on large files
showed that xa_get_order() is among the top functions where kernel spends
a lot of CPUs. xa_get_order() needs to traverse the tree to reach the
right node for a given index and then extract the order of the entry.
However it seems like at many places it is being called within an already
happening tree traversal where there is no need to do another traversal.
Just use xas_get_order() at those places.
Link: https://lkml.kernel.org/r/20240906230512.124643-1-shakeel.butt@linux.dev
Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev>
Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Nhat Pham <nphamcs@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
People keep trying to remove three functions that are going to be used in
a feature that is being developed. Dropping the functions entirely may
end up with people trying to use the bit for other uses, as people have
tried in the past.
Adding __maybe_unused stops compilers complaining about the unused
functions so they can be silently optimised out of the compiled code and
people won't try to claim the bit for another use.
Link: https://lore.kernel.org/all/20230726080916.17454-2-zhangpeng.00@bytedance.com/
Link: https://lore.kernel.org/all/202408310728.S7EE59BN-lkp@intel.com/
Link: https://lkml.kernel.org/r/20240907021506.4018676-1-Liam.Howlett@oracle.com
Signed-off-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Reviewed-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Kuan-Wei Chiu <visitorckw@gmail.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
A clean up to make variable names more clear and to improve code
readability.
No functional change.
Link: https://lkml.kernel.org/r/20240905003058.1859929-6-kinseyho@google.com
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently, if multiple reclaimers raced on the same position, the
reclaimers which detect the race will still reclaim from the same memcg.
Instead, the reclaimers which detect the race should move on to the next
memcg in the hierarchy.
So, in the case where multiple traversals race, jump back to the start of
the mem_cgroup_iter() function to find the next memcg in the hierarchy to
reclaim from.
Link: https://lkml.kernel.org/r/20240905003058.1859929-5-kinseyho@google.com
Reported-by: syzbot+e099d407346c45275ce9@syzkaller.appspotmail.com
Closes: https://lore.kernel.org/000000000000817cf10620e20d33@google.com/
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The generation number in struct mem_cgroup_reclaim_iter should be
incremented on every round-trip. Currently, it is possible for a
concurrent reclaimer to jump in at the end of the hierarchy, causing a
traversal restart (resetting the iteration position) without incrementing
the generation number.
By resetting the position without incrementing the generation, it's
possible for another ongoing mem_cgroup_iter() thread to walk the tree
twice.
Move the traversal restart such that the generation number is
incremented before the restart.
Link: https://lkml.kernel.org/r/20240905003058.1859929-4-kinseyho@google.com
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
To obtain the pointer to the next memcg position, mem_cgroup_iter()
currently holds css->refcnt during memcg traversal only to put css->refcnt
at the end of the routine. This isn't necessary as an rcu_read_lock is
already held throughout the function. The use of the RCU read lock with
css_next_descendant_pre() guarantees that sibling linkage is safe without
holding a ref on the passed-in @css.
Remove css->refcnt usage during traversal by leveraging RCU.
Link: https://lkml.kernel.org/r/20240905003058.1859929-3-kinseyho@google.com
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Reviewed-by: T.J. Mercier <tjmercier@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Michal Koutný <mkoutny@suse.com>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Yosry Ahmed <yosryahmed@google.com>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Patch series "Improve mem_cgroup_iter()", v4.
Incremental cgroup iteration is being used again [1]. This patchset
improves the reliability of mem_cgroup_iter(). It also improves
simplicity and code readability.
[1] https://lore.kernel.org/20240514202641.2821494-1-hannes@cmpxchg.org/
This patch (of 5):
Explicitly document that css sibling/descendant linkage is protected by
cgroup_mutex or RCU. Also, document in css_next_descendant_pre() and
similar functions that it isn't necessary to hold a ref on @pos.
The following changes in this patchset rely on this clarification for
simplification in memcg iteration code.
Link: https://lkml.kernel.org/r/20240905003058.1859929-1-kinseyho@google.com
Link: https://lkml.kernel.org/r/20240905003058.1859929-2-kinseyho@google.com
Suggested-by: Yosry Ahmed <yosryahmed@google.com>
Reviewed-by: Michal Koutný <mkoutny@suse.com>
Signed-off-by: Kinsey Ho <kinseyho@google.com>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Michal Hocko <mhocko@kernel.org>
Cc: Muchun Song <muchun.song@linux.dev>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Shakeel Butt <shakeel.butt@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: Zefan Li <lizefan.x@bytedance.com>
Cc: Hugh Dickins <hughd@google.com>
Cc: T.J. Mercier <tjmercier@google.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When has_unaccepted_memory() is unused, it prevents kernel builds
with clang, `make W=1` and CONFIG_WERROR=y:
mm/page_alloc.c:7036:20: error: unused function 'has_unaccepted_memory' [-Werror,-Wunused-function]
7036 | static inline bool has_unaccepted_memory(void)
| ^~~~~~~~~~~~~~~~~~~~~
Fix it by removeing the CONFIG_UNACCEPTED_MEMORY=n stub.
Link: https://lkml.kernel.org/r/20240905142220.49d93337a0abce5690e515d9@linux-foundation.org
Reported-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com>
Closes: https://lkml.kernel.org/r/20240905171553.275054-1-andriy.shevchenko@linux.intel.com
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
random.h is not needed since commit 6c542ab75714 ("mm/demotion: build
demotion targets based on explicit memory tiers"), all functions moved
into memory-tiers.
nsproxy.h is not needed since commit 228ebcbe634a ("Uninline
find_task_by_xxx set of functions"), no nsproxy, we only call
find_task_by_vpid() now.
hugetlb_cgroup.h is not needed since commit ab5ac90aecf5 ("mm, hugetlb: do
not rely on overcommit limit during migration"), move_hugetlb_state() is
called and it belongs to hugetlb.h, which is already included.
balloon_compaction.h, we have more general movable_operations for non-lru
movable page migration, so it could be dropped.
memremap.h, userfaultfd_k.h and oom.h are introduced for zone device page
migration, but all functions are moved into migrate_device.c, so no needed
anymore too.
Link: https://lkml.kernel.org/r/20240905152432.626877-1-wangkefeng.wang@huawei.com
Signed-off-by: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
The helper find_get_task_by_vpid() can totally replace the task_struct
find logic in split_huge_pages_pid(), so use it to simplify the code.
Also delete the needless comments for the helper function name already
explains what it's doing here.
Link: https://lkml.kernel.org/r/20240905153028.1205128-1-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Use find_get_task_by_vpid() to replace the task_struct find logic in
find_mm_struct(), note that this patch move the ptrace_may_access() call
out from rcu_read_lock() scope, this is ok because it actually does not
need it, find_get_task_by_vpid() already get the pid and task safely,
ptrace_may_access() can use the task safely, like what
sched_core_share_pid() similarly do.
Link: https://lkml.kernel.org/r/20240905153118.1205173-1-sunnanyong@huawei.com
Signed-off-by: Nanyong Sun <sunnanyong@huawei.com>
Cc: Kefeng Wang <wangkefeng.wang@huawei.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
aggr_interval is zero
The aggregation interval of test purpose damon_attrs for
damon_test_nr_accesses_to_accesses_bp() becomes zero on 32 bit
architecture, since size of int and long types are same. As a result,
damon_nr_accesses_to_accesses_bp() call with the test data triggers
divide-by-zero exception. damon_nr_accesses_to_accesses_bp() shouldn't
be called with such data, and the non-test code avoids that by checking
the case on damon_update_monitoring_results(). Skip the test code in
the case, and add an explicit caution of the case on the comment for the
test target function.
Link: https://lkml.kernel.org/r/20240905162423.74053-1-sj@kernel.org
Fixes: 5e06ad590096 ("mm/damon/core-test: test max_nr_accesses overflow caused divide-by-zero")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/c771b962-a58f-435b-89e4-1211a9323181@roeck-us.net
Cc: Brendan Higgins <brendanhiggins@google.com>
Cc: David Gow <davidgow@google.com>
Cc: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The following KASAN splat was shown:
[ 44.505448] ================================================================== 20:37:27 [3421/145075]
[ 44.505455] BUG: KASAN: slab-use-after-free in special_mapping_close+0x9c/0xc8
[ 44.505471] Read of size 8 at addr 00000000868dac48 by task sh/1384
[ 44.505479]
[ 44.505486] CPU: 51 UID: 0 PID: 1384 Comm: sh Not tainted 6.11.0-rc6-next-20240902-dirty #1496
[ 44.505503] Hardware name: IBM 3931 A01 704 (z/VM 7.3.0)
[ 44.505508] Call Trace:
[ 44.505511] [<000b0324d2f78080>] dump_stack_lvl+0xd0/0x108
[ 44.505521] [<000b0324d2f5435c>] print_address_description.constprop.0+0x34/0x2e0
[ 44.505529] [<000b0324d2f5464c>] print_report+0x44/0x138
[ 44.505536] [<000b0324d1383192>] kasan_report+0xc2/0x140
[ 44.505543] [<000b0324d2f52904>] special_mapping_close+0x9c/0xc8
[ 44.505550] [<000b0324d12c7978>] remove_vma+0x78/0x120
[ 44.505557] [<000b0324d128a2c6>] exit_mmap+0x326/0x750
[ 44.505563] [<000b0324d0ba655a>] __mmput+0x9a/0x370
[ 44.505570] [<000b0324d0bbfbe0>] exit_mm+0x240/0x340
[ 44.505575] [<000b0324d0bc0228>] do_exit+0x548/0xd70
[ 44.505580] [<000b0324d0bc1102>] do_group_exit+0x132/0x390
[ 44.505586] [<000b0324d0bc13b6>] __s390x_sys_exit_group+0x56/0x60
[ 44.505592] [<000b0324d0adcbd6>] do_syscall+0x2f6/0x430
[ 44.505599] [<000b0324d2f78434>] __do_syscall+0xa4/0x170
[ 44.505606] [<000b0324d2f9454c>] system_call+0x74/0x98
[ 44.505614]
[ 44.505616] Allocated by task 1384:
[ 44.505621] kasan_save_stack+0x40/0x70
[ 44.505630] kasan_save_track+0x28/0x40
[ 44.505636] __kasan_kmalloc+0xa0/0xc0
[ 44.505642] __create_xol_area+0xfa/0x410
[ 44.505648] get_xol_area+0xb0/0xf0
[ 44.505652] uprobe_notify_resume+0x27a/0x470
[ 44.505657] irqentry_exit_to_user_mode+0x15e/0x1d0
[ 44.505664] pgm_check_handler+0x122/0x170
[ 44.505670]
[ 44.505672] Freed by task 1384:
[ 44.505676] kasan_save_stack+0x40/0x70
[ 44.505682] kasan_save_track+0x28/0x40
[ 44.505687] kasan_save_free_info+0x4a/0x70
[ 44.505693] __kasan_slab_free+0x5a/0x70
[ 44.505698] kfree+0xe8/0x3f0
[ 44.505704] __mmput+0x20/0x370
[ 44.505709] exit_mm+0x240/0x340
[ 44.505713] do_exit+0x548/0xd70
[ 44.505718] do_group_exit+0x132/0x390
[ 44.505722] __s390x_sys_exit_group+0x56/0x60
[ 44.505727] do_syscall+0x2f6/0x430
[ 44.505732] __do_syscall+0xa4/0x170
[ 44.505738] system_call+0x74/0x98
The problem is that uprobe_clear_state() kfree's struct xol_area, which
contains struct vm_special_mapping *xol_mapping. This one is passed to
_install_special_mapping() in xol_add_vma().
__mput reads:
static inline void __mmput(struct mm_struct *mm)
{
VM_BUG_ON(atomic_read(&mm->mm_users));
uprobe_clear_state(mm);
exit_aio(mm);
ksm_exit(mm);
khugepaged_exit(mm); /* must run before exit_mmap */
exit_mmap(mm);
...
}
So uprobe_clear_state() in the beginning free's the memory area
containing the vm_special_mapping data, but exit_mmap() uses this
address later via vma->vm_private_data (which was set in
_install_special_mapping().
Fix this by moving uprobe_clear_state() to uprobes.c and use it as
close() callback.
[usama.anjum@collabora.com: remove unneeded condition]
Link: https://lkml.kernel.org/r/20240906101825.177490-1-usama.anjum@collabora.com
Link: https://lkml.kernel.org/r/20240903073629.2442754-1-svens@linux.ibm.com
Fixes: 223febc6e557 ("mm: add optional close() to struct vm_special_mapping")
Signed-off-by: Sven Schnelle <svens@linux.ibm.com>
Suggested-by: Linus Torvalds <torvalds@linux-foundation.org>
Cc: Adrian Hunter <adrian.hunter@intel.com>
Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
Cc: Arnaldo Carvalho de Melo <acme@kernel.org>
Cc: Ian Rogers <irogers@google.com>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Jiri Olsa <jolsa@kernel.org>
Cc: Kan Liang <kan.liang@linux.intel.com>
Cc: Mark Rutland <mark.rutland@arm.com>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Namhyung Kim <namhyung@kernel.org>
Cc: Oleg Nesterov <oleg@redhat.com>
Cc: Peter Zijlstra <peterz@infradead.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
PGFREE is currently updated in two code paths:
- __free_pages_ok(): for pages freed to the buddy allocator.
- free_unref_page_commit(): for pages freed to the pcplists.
Before commit df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled
with zone->lock"), free_unref_page_commit() used to fallback to freeing
isolated pages directly to the buddy allocator through free_one_page().
This was done _after_ updating PGFREE, so the counter was correctly
updated.
However, that commit moved the fallback logic to its callers (now called
free_unref_page() and free_unref_folios()), so PGFREE was no longer
updated in this fallback case.
Now that the code has developed, there are more cases in free_unref_page()
and free_unref_folios() where we fallback to calling free_one_page() (e.g.
!pcp_allowed_order(), pcp_spin_trylock() fails). These cases also miss
updating PGFREE.
To make sure PGFREE is updated in all cases where pages are freed to the
buddy allocator, move the update down the stack to free_one_page().
This was noticed through code inspection, although it should be noticeable
at runtime (at least with some workloads).
Link: https://lkml.kernel.org/r/20240904205419.821776-1-yosryahmed@google.com
Fixes: df1acc856923 ("mm/page_alloc: avoid conflating IRQs disabled with zone->lock")
Signed-off-by: Yosry Ahmed <yosryahmed@google.com>
Cc: Brendan Jackman <jackmanb@google.com>
Cc: Mel Gorman <mgorman@techsingularity.net>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
As covered in the commit log for c44357c2e76b ("x86/mm: care about shadow
stack guard gap during placement") our current mmap() implementation does
not take care to ensure that a new mapping isn't placed with existing
mappings inside it's own guard gaps. This is particularly important for
shadow stacks since if two shadow stacks end up getting placed adjacent to
each other then they can overflow into each other which weakens the
protection offered by the feature.
On x86 there is a custom arch_get_unmapped_area() which was updated by the
above commit to cover this case by specifying a start_gap for allocations
with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and
use the generic implementation of arch_get_unmapped_area() so let's make
the equivalent change there so they also don't get shadow stack pages
placed without guard pages. x86 uses a single page guard, this is also
sufficient for arm64 where we either do single word pops and pushes or
unconstrained writes.
Architectures which do not have this feature will define VM_SHADOW_STACK
to VM_NONE and hence be unaffected.
Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-3-a46b8b6dc0ed@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Suggested-by: Rick Edgecombe <rick.p.edgecombe@intel.com>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Liam R. Howlett <Liam.Howlett@Oracle.com>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In preparation for using vm_flags to ensure guard pages for shadow stacks
supply them as an argument to generic_get_unmapped_area(). The only user
outside of the core code is the PowerPC book3s64 implementation which is
trivially wrapping the generic implementation in the radix_enabled() case.
No functional changes.
Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-2-a46b8b6dc0ed@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Michael Ellerman <mpe@ellerman.id.au>
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: Helge Deller <deller@gmx.de>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Patch series "mm: Care about shadow stack guard gap when getting an
unmapped area", v2.
As covered in the commit log for c44357c2e76b ("x86/mm: care about shadow
stack guard gap during placement") our current mmap() implementation does
not take care to ensure that a new mapping isn't placed with existing
mappings inside it's own guard gaps. This is particularly important for
shadow stacks since if two shadow stacks end up getting placed adjacent to
each other then they can overflow into each other which weakens the
protection offered by the feature.
On x86 there is a custom arch_get_unmapped_area() which was updated by the
above commit to cover this case by specifying a start_gap for allocations
with VM_SHADOW_STACK. Both arm64 and RISC-V have equivalent features and
use the generic implementation of arch_get_unmapped_area() so let's make
the equivalent change there so they also don't get shadow stack pages
placed without guard pages. The arm64 and RISC-V shadow stack
implementations are currently on the list:
https://lore.kernel.org/r/20240829-arm64-gcs-v12-0-42fec94743
https://lore.kernel.org/lkml/20240403234054.2020347-1-debug@rivosinc.com/
Given the addition of the use of vm_flags in the generic implementation we
also simplify the set of possibilities that have to be dealt with in the
core code by making arch_get_unmapped_area() take vm_flags as standard.
This is a bit invasive since the prototype change touches quite a few
architectures but since the parameter is ignored the change is
straightforward, the simplification for the generic code seems worth it.
This patch (of 3):
When we introduced arch_get_unmapped_area_vmflags() in 961148704acd ("mm:
introduce arch_get_unmapped_area_vmflags()") we did so as part of properly
supporting guard pages for shadow stacks on x86_64, which uses a custom
arch_get_unmapped_area(). Equivalent features are also present on both
arm64 and RISC-V, both of which use the generic implementation of
arch_get_unmapped_area() and will require equivalent modification there.
Rather than continue to deal with having two versions of the functions
let's bite the bullet and have all implementations of
arch_get_unmapped_area() take vm_flags as a parameter.
The new parameter is currently ignored by all implementations other than
x86. The only caller that doesn't have a vm_flags available is
mm_get_unmapped_area(), as for the x86 implementation and the wrapper used
on other architectures this is modified to supply no flags.
No functional changes.
Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-0-a46b8b6dc0ed@kernel.org
Link: https://lkml.kernel.org/r/20240904-mm-generic-shadow-stack-guard-v2-1-a46b8b6dc0ed@kernel.org
Signed-off-by: Mark Brown <broonie@kernel.org>
Acked-by: Lorenzo Stoakes <lorenzo.stoakes@oracle.com>
Reviewed-by: Liam R. Howlett <Liam.Howlett@Oracle.com>
Acked-by: Helge Deller <deller@gmx.de> [parisc]
Cc: Alexander Gordeev <agordeev@linux.ibm.com>
Cc: Andreas Larsson <andreas@gaisler.com>
Cc: Borislav Petkov <bp@alien8.de>
Cc: Christian Borntraeger <borntraeger@linux.ibm.com>
Cc: Christophe Leroy <christophe.leroy@csgroup.eu>
Cc: Chris Zankel <chris@zankel.net>
Cc: Dave Hansen <dave.hansen@linux.intel.com>
Cc: David S. Miller <davem@davemloft.net>
Cc: "Edgecombe, Rick P" <rick.p.edgecombe@intel.com>
Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com>
Cc: Guo Ren <guoren@kernel.org>
Cc: Heiko Carstens <hca@linux.ibm.com>
Cc: "H. Peter Anvin" <hpa@zytor.com>
Cc: Huacai Chen <chenhuacai@kernel.org>
Cc: Ingo Molnar <mingo@redhat.com>
Cc: Ivan Kokshaysky <ink@jurassic.park.msu.ru>
Cc: James Bottomley <James.Bottomley@HansenPartnership.com>
Cc: John Paul Adrian Glaubitz <glaubitz@physik.fu-berlin.de>
Cc: Matt Turner <mattst88@gmail.com>
Cc: Max Filippov <jcmvbkbc@gmail.com>
Cc: Michael Ellerman <mpe@ellerman.id.au>
Cc: Naveen N Rao <naveen@kernel.org>
Cc: Nicholas Piggin <npiggin@gmail.com>
Cc: Richard Henderson <richard.henderson@linaro.org>
Cc: Rich Felker <dalias@libc.org>
Cc: Russell King <linux@armlinux.org.uk>
Cc: Sven Schnelle <svens@linux.ibm.com>
Cc: Thomas Bogendoerfer <tsbogend@alpha.franken.de>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Vasily Gorbik <gor@linux.ibm.com>
Cc: Vineet Gupta <vgupta@kernel.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Cc: WANG Xuerui <kernel@xen0n.name>
Cc: Yoshinori Sato <ysato@users.sourceforge.jp>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
damon_test_three_regions_in_vmas() initializes a maple tree with
MM_MT_FLAGS. The flags contains MT_FLAGS_LOCK_EXTERN, which means mt_lock
of the maple tree will not be used. And therefore the maple tree
initialization code skips initialization of the mt_lock. However,
__link_vmas(), which adds vmas for test to the maple tree, uses the
mt_lock. In other words, the uninitialized spinlock is used. The problem
becomes clear when spinlock debugging is turned on, since it reports
spinlock bad magic bug.
Fix the issue by excluding MT_FLAGS_LOCK_EXTERN from the maple tree
initialization flags. Note that we don't use empty flags to make it
further similar to the usage of mm maple tree, and to be prepared for
possible future changes, as suggested by Liam.
Link: https://lkml.kernel.org/r/20240904172931.1284-1-sj@kernel.org
Fixes: d0cf3dd47f0d ("damon: convert __damon_va_three_regions to use the VMA iterator")
Signed-off-by: SeongJae Park <sj@kernel.org>
Reported-by: Guenter Roeck <linux@roeck-us.net>
Closes: https://lore.kernel.org/1453b2b2-6119-4082-ad9e-f3c5239bf87e@roeck-us.net
Suggested-by: Liam R. Howlett <Liam.Howlett@oracle.com>
Tested-by: Guenter Roeck <linux@roeck-us.net>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zsmalloc is not exclusive to zswap. Commit b3fbd58fcbb1 ("mm: Kconfig:
simplify zswap configuration") made CONFIG_ZSMALLOC only visible when
CONFIG_ZSWAP is selected, which makes it impossible to menuconfig
zsmalloc-specific features (stats, chain-size, etc.) on systems that use
ZRAM but don't have ZSWAP enabled.
Make zsmalloc depend on both ZRAM and ZSWAP.
Link: https://lkml.kernel.org/r/20240903040143.1580705-1-senozhatsky@chromium.org
Fixes: b3fbd58fcbb1 ("mm: Kconfig: simplify zswap configuration")
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Johannes Weiner <hannes@cmpxchg.org>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
In commit b6273b55d885 ("filemap: add trace events for get_pages,
map_pages, and fault"), mm_filemap_get_pages was added to trace page cache
access. However, it tracks an extra page beyond the end of the accessed
range. This patch fixes it by replacing last_index with last_index - 1.
Link: https://lkml.kernel.org/r/20240903102100.70405-1-takayas@chromium.org
Fixes: b6273b55d885 ("filemap: add trace events for get_pages, map_pages, and fault")
Signed-off-by: Takaya Saeki <takayas@chromium.org>
Cc: Junichi Uekawa <uekawa@chromium.org>
Cc: Masami Hiramatsu <mhiramat@kernel.org>
Cc: Mathieu Desnoyers <mathieu.desnoyers@efficios.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Steven Rostedt (Google) <rostedt@goodmis.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Take the end of a file write into consideration when deciding whether or
not to use huge pages for tmpfs files when the tmpfs filesystem is mounted
with huge=within_size
This allows large writes that append to the end of a file to automatically
use large pages.
Doing 4MB sequential writes without fallocate to a 16GB tmpfs file with
fio. The numbers without THP or with huge=always stay the same, but the
performance with huge=within_size now matches that of huge=always.
huge before after
4kB pages 1560 MB/s 1560 MB/s
within_size 1560 MB/s 4720 MB/s
always: 4720 MB/s 4720 MB/s
[akpm@linux-foundation.org: coding-style cleanups]
Link: https://lkml.kernel.org/r/20240903111928.7171e60c@imladris.surriel.com
Signed-off-by: Rik van Riel <riel@surriel.com>
Reviewed-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Tested-by: Baolin Wang <baolin.wang@linux.alibaba.com>
Cc: Darrick J. Wong <djwong@kernel.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Matthew Wilcox <willy@infradead.org>
Cc: Vlastimil Babka <vbabka@suse.cz>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
recompress device attribute supports alg=NAME parameter so that we can
specify only one particular algorithm we want to perform recompression
with. However, with algo params we now can have several exactly same
secondary algorithms but each with its own params tuning (e.g. priority 1
configured to use more aggressive level, and priority 2 configured to use
a pre-trained dictionary). Support priority=NUM parameter so that we can
correctly determine which secondary algorithm we want to use.
Link: https://lkml.kernel.org/r/20240902105656.1383858-25-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Document brief description of compression algorithms' parameters:
compression level and pre-trained dictionary.
[senozhatsky@chromium.org: trivial fixup]
Link: https://lkml.kernel.org/r/20240903063722.1603592-1-senozhatsky@chromium.org
Link: https://lkml.kernel.org/r/20240902105656.1383858-24-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This adds support for pre-trained zstd dictionaries [1] Dictionary is
setup in params once (per-comp) and loaded to Cctx and Dctx by reference,
so we don't allocate extra memory.
TEST
====
*** zstd
/sys/block/zram0/mm_stat
1750654976 504565092 514203648 0 514203648 1 0 34204 34204
*** zstd dict=/etc/zstd-dict-amd64
/sys/block/zram0/mm_stat
1750638592 465851259 475373568 0 475373568 1 0 34185 34185
*** zstd level=8 dict=/etc/zstd-dict-amd64
/sys/block/zram0/mm_stat
1750642688 430765171 439955456 0 439955456 1 0 34185 34185
[1] https://github.com/facebook/zstd/blob/dev/programs/zstd.1.md#dictionary-builder
Link: https://lkml.kernel.org/r/20240902105656.1383858-23-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Support pre-trained dictionary param. Just like lz4, lz4hc doesn't
mandate specific format of the dictionary and zstd --train can be used to
train a dictionary for lz4, according to [1].
TEST
====
*** lz4hc
/sys/block/zram0/mm_stat
1750638592 608954620 621031424 0 621031424 1 0 34288 34288
*** lz4hc dict=/etc/lz4-dict-amd64
/sys/block/zram0/mm_stat
1750671360 505068582 514994176 0 514994176 1 0 34278 34278
[1] https://github.com/lz4/lz4/issues/557
Link: https://lkml.kernel.org/r/20240902105656.1383858-22-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Support pre-trained dictionary param. lz4 doesn't mandate specific format
of the dictionary and even zstd --train can be used to train a dictionary
for lz4, according to [1].
TEST
====
*** lz4
/sys/block/zram0/mm_stat
1750654976 664188565 676864000 0 676864000 1 0 34288 34288
*** lz4 dict=/etc/lz4-dict-amd64
/sys/block/zram0/mm_stat
1750638592 619891141 632053760 0 632053760 1 0 34278 34278
*** lz4 level=5 dict=/etc/lz4-dict-amd64
/sys/block/zram0/mm_stat
1750638592 727174243 740810752 0 740810752 1 0 34437 34437
[1] https://github.com/lz4/lz4/issues/557
Link: https://lkml.kernel.org/r/20240902105656.1383858-21-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Immutable params never change once comp has been allocated and setup, so
we don't need to store multiple copies of them in each per-CPU backend
context. Move those to per-comp zcomp_params and pass it to backends
callbacks for requests execution. Basically, this means parameters
sharing between different contexts.
Also introduce two new backends callbacks: setup_params() and
release_params(). First, we need to validate params in a driver-specific
way; second, driver may want to allocate its specific representation of
the params which is needed to execute requests.
Link: https://lkml.kernel.org/r/20240902105656.1383858-20-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Keep run-time driver data (scratch buffers, etc.) in zcomp_ctx structure.
This structure is allocated per-CPU because drivers (backends) need to
modify its content during requests execution.
We will split mutable and immutable driver data, this is a preparation
path.
Link: https://lkml.kernel.org/r/20240902105656.1383858-19-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
Encapsulate compression/decompression data in zcomp_req structure.
Link: https://lkml.kernel.org/r/20240902105656.1383858-18-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Handle dict=path algorithm param so that we can read a pre-trained
compression algorithm dictionary which we then pass to the backend
configuration.
Link: https://lkml.kernel.org/r/20240902105656.1383858-17-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This attribute is used to setup compression algorithms' parameters, so we
can tweak algorithms' characteristics. At this point only 'level' is
supported (to be extended in the future).
Each call sets up parameters for one particular algorithm, which should be
specified either by the algorithm's priority or algo name. This is
expected to be called after corresponding algorithm is selected via
comp_algorithm or recomp_algorithm.
echo "priority=0 level=1" > /sys/block/zram0/algorithm_params
or
echo "algo=zstd level=1" > /sys/block/zram0/algorithm_params
Link: https://lkml.kernel.org/r/20240902105656.1383858-16-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
| |
zstd compression params depends on level, but are constant for a given
instance of zstd compression backend. Calculate once (during ctx
creation).
Link: https://lkml.kernel.org/r/20240902105656.1383858-15-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
| |
We will store a per-algorithm parameters there (compression level,
dictionary, dictionary size, etc.).
Link: https://lkml.kernel.org/r/20240902105656.1383858-14-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Make sure that backends array has anything apart from the sentinel NULL
value.
We also select LZO_BACKEND if none backends were selected.
Link: https://lkml.kernel.org/r/20240902105656.1383858-13-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
Add s/w 842 compression support.
Link: https://lkml.kernel.org/r/20240902105656.1383858-12-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
Add s/w zlib (inflate/deflate) compression.
Link: https://lkml.kernel.org/r/20240902105656.1383858-11-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
zram works with PAGE_SIZE buffers, so we always know exact size of the
source buffer and hence can pass estimated_src_size to zstd_get_params().
This hint on x86_64, for example, reduces the size of the work memory
buffer from 1303520 bytes down to 90080 bytes. Given that compression
streams are per-CPU that's quite some memory saving.
Link: https://lkml.kernel.org/r/20240902105656.1383858-10-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
Add s/w zstd compression.
Link: https://lkml.kernel.org/r/20240902105656.1383858-9-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
Add s/w lz4hc compression support.
Link: https://lkml.kernel.org/r/20240902105656.1383858-8-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
Add s/w lz4 compression support.
Link: https://lkml.kernel.org/r/20240902105656.1383858-7-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
Add s/w lzo/lzorle compression support.
Link: https://lkml.kernel.org/r/20240902105656.1383858-6-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Moving to custom backends implementation gives us ability to have our own
minimalistic and extendable API, and algorithms tunings becomes possible.
The list of compression backends is empty at this point, we will add
backends in the followup patches.
Link: https://lkml.kernel.org/r/20240902105656.1383858-5-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Minchan Kim <minchan@kernel.org>
Cc: Nick Terrell <terrelln@fb.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ZSTD_createCDict_advanced2() must ensure that
ZSTD_createCDict_advanced_internal() has successfully allocated cdict.
customMalloc() may be called under low memory condition and may be unable
to allocate workspace for cdict.
Link: https://lkml.kernel.org/r/20240902105656.1383858-4-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|
|
|
|
|
|
|
|
|
|
| |
This symbol is needed to enable lz4hc dictionary support.
Link: https://lkml.kernel.org/r/20240902105656.1383858-3-senozhatsky@chromium.org
Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org>
Cc: Nick Terrell <terrelln@fb.com>
Cc: Minchan Kim <minchan@kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
|