diff options
author | Maxime Ripard <maxime@cerno.tech> | 2022-10-18 15:00:03 +0200 |
---|---|---|
committer | Maxime Ripard <maxime@cerno.tech> | 2022-10-18 15:00:03 +0200 |
commit | a140a6a2d5ec0329ad05cd3532a91ad0ce58dceb (patch) | |
tree | b2d44a1da423c53bd6c3ab3facd45ff5f2087ffd /Documentation/mm | |
parent | dma-buf: Remove obsoleted internal lock (diff) | |
parent | Linux 6.1-rc1 (diff) | |
download | linux-a140a6a2d5ec0329ad05cd3532a91ad0ce58dceb.tar.xz linux-a140a6a2d5ec0329ad05cd3532a91ad0ce58dceb.zip |
Merge drm/drm-next into drm-misc-next
Let's kick-off this release cycle.
Signed-off-by: Maxime Ripard <maxime@cerno.tech>
Diffstat (limited to 'Documentation/mm')
-rw-r--r-- | Documentation/mm/index.rst | 1 | ||||
-rw-r--r-- | Documentation/mm/ksm.rst | 2 | ||||
-rw-r--r-- | Documentation/mm/multigen_lru.rst | 159 | ||||
-rw-r--r-- | Documentation/mm/page_owner.rst | 25 | ||||
-rw-r--r-- | Documentation/mm/slub.rst | 33 | ||||
-rw-r--r-- | Documentation/mm/unevictable-lru.rst | 6 |
6 files changed, 194 insertions, 32 deletions
diff --git a/Documentation/mm/index.rst b/Documentation/mm/index.rst index 575ccd40e30c..4aa12b8be278 100644 --- a/Documentation/mm/index.rst +++ b/Documentation/mm/index.rst @@ -51,6 +51,7 @@ above structured documentation, or deleted if it has served its purpose. ksm memory-model mmu_notifier + multigen_lru numa overcommit-accounting page_migration diff --git a/Documentation/mm/ksm.rst b/Documentation/mm/ksm.rst index 9e37add068e6..f83cfbc12f4c 100644 --- a/Documentation/mm/ksm.rst +++ b/Documentation/mm/ksm.rst @@ -26,7 +26,7 @@ tree. If a KSM page is shared between less than ``max_page_sharing`` VMAs, the node of the stable tree that represents such KSM page points to a -list of struct rmap_item and the ``page->mapping`` of the +list of struct ksm_rmap_item and the ``page->mapping`` of the KSM page points to the stable tree node. When the sharing passes this threshold, KSM adds a second dimension to diff --git a/Documentation/mm/multigen_lru.rst b/Documentation/mm/multigen_lru.rst new file mode 100644 index 000000000000..d7062c6a8946 --- /dev/null +++ b/Documentation/mm/multigen_lru.rst @@ -0,0 +1,159 @@ +.. SPDX-License-Identifier: GPL-2.0 + +============= +Multi-Gen LRU +============= +The multi-gen LRU is an alternative LRU implementation that optimizes +page reclaim and improves performance under memory pressure. Page +reclaim decides the kernel's caching policy and ability to overcommit +memory. It directly impacts the kswapd CPU usage and RAM efficiency. + +Design overview +=============== +Objectives +---------- +The design objectives are: + +* Good representation of access recency +* Try to profit from spatial locality +* Fast paths to make obvious choices +* Simple self-correcting heuristics + +The representation of access recency is at the core of all LRU +implementations. In the multi-gen LRU, each generation represents a +group of pages with similar access recency. Generations establish a +(time-based) common frame of reference and therefore help make better +choices, e.g., between different memcgs on a computer or different +computers in a data center (for job scheduling). + +Exploiting spatial locality improves efficiency when gathering the +accessed bit. A rmap walk targets a single page and does not try to +profit from discovering a young PTE. A page table walk can sweep all +the young PTEs in an address space, but the address space can be too +sparse to make a profit. The key is to optimize both methods and use +them in combination. + +Fast paths reduce code complexity and runtime overhead. Unmapped pages +do not require TLB flushes; clean pages do not require writeback. +These facts are only helpful when other conditions, e.g., access +recency, are similar. With generations as a common frame of reference, +additional factors stand out. But obvious choices might not be good +choices; thus self-correction is necessary. + +The benefits of simple self-correcting heuristics are self-evident. +Again, with generations as a common frame of reference, this becomes +attainable. Specifically, pages in the same generation can be +categorized based on additional factors, and a feedback loop can +statistically compare the refault percentages across those categories +and infer which of them are better choices. + +Assumptions +----------- +The protection of hot pages and the selection of cold pages are based +on page access channels and patterns. There are two access channels: + +* Accesses through page tables +* Accesses through file descriptors + +The protection of the former channel is by design stronger because: + +1. The uncertainty in determining the access patterns of the former + channel is higher due to the approximation of the accessed bit. +2. The cost of evicting the former channel is higher due to the TLB + flushes required and the likelihood of encountering the dirty bit. +3. The penalty of underprotecting the former channel is higher because + applications usually do not prepare themselves for major page + faults like they do for blocked I/O. E.g., GUI applications + commonly use dedicated I/O threads to avoid blocking rendering + threads. + +There are also two access patterns: + +* Accesses exhibiting temporal locality +* Accesses not exhibiting temporal locality + +For the reasons listed above, the former channel is assumed to follow +the former pattern unless ``VM_SEQ_READ`` or ``VM_RAND_READ`` is +present, and the latter channel is assumed to follow the latter +pattern unless outlying refaults have been observed. + +Workflow overview +================= +Evictable pages are divided into multiple generations for each +``lruvec``. The youngest generation number is stored in +``lrugen->max_seq`` for both anon and file types as they are aged on +an equal footing. The oldest generation numbers are stored in +``lrugen->min_seq[]`` separately for anon and file types as clean file +pages can be evicted regardless of swap constraints. These three +variables are monotonically increasing. + +Generation numbers are truncated into ``order_base_2(MAX_NR_GENS+1)`` +bits in order to fit into the gen counter in ``folio->flags``. Each +truncated generation number is an index to ``lrugen->lists[]``. The +sliding window technique is used to track at least ``MIN_NR_GENS`` and +at most ``MAX_NR_GENS`` generations. The gen counter stores a value +within ``[1, MAX_NR_GENS]`` while a page is on one of +``lrugen->lists[]``; otherwise it stores zero. + +Each generation is divided into multiple tiers. A page accessed ``N`` +times through file descriptors is in tier ``order_base_2(N)``. Unlike +generations, tiers do not have dedicated ``lrugen->lists[]``. In +contrast to moving across generations, which requires the LRU lock, +moving across tiers only involves atomic operations on +``folio->flags`` and therefore has a negligible cost. A feedback loop +modeled after the PID controller monitors refaults over all the tiers +from anon and file types and decides which tiers from which types to +evict or protect. + +There are two conceptually independent procedures: the aging and the +eviction. They form a closed-loop system, i.e., the page reclaim. + +Aging +----- +The aging produces young generations. Given an ``lruvec``, it +increments ``max_seq`` when ``max_seq-min_seq+1`` approaches +``MIN_NR_GENS``. The aging promotes hot pages to the youngest +generation when it finds them accessed through page tables; the +demotion of cold pages happens consequently when it increments +``max_seq``. The aging uses page table walks and rmap walks to find +young PTEs. For the former, it iterates ``lruvec_memcg()->mm_list`` +and calls ``walk_page_range()`` with each ``mm_struct`` on this list +to scan PTEs, and after each iteration, it increments ``max_seq``. For +the latter, when the eviction walks the rmap and finds a young PTE, +the aging scans the adjacent PTEs. For both, on finding a young PTE, +the aging clears the accessed bit and updates the gen counter of the +page mapped by this PTE to ``(max_seq%MAX_NR_GENS)+1``. + +Eviction +-------- +The eviction consumes old generations. Given an ``lruvec``, it +increments ``min_seq`` when ``lrugen->lists[]`` indexed by +``min_seq%MAX_NR_GENS`` becomes empty. To select a type and a tier to +evict from, it first compares ``min_seq[]`` to select the older type. +If both types are equally old, it selects the one whose first tier has +a lower refault percentage. The first tier contains single-use +unmapped clean pages, which are the best bet. The eviction sorts a +page according to its gen counter if the aging has found this page +accessed through page tables and updated its gen counter. It also +moves a page to the next generation, i.e., ``min_seq+1``, if this page +was accessed multiple times through file descriptors and the feedback +loop has detected outlying refaults from the tier this page is in. To +this end, the feedback loop uses the first tier as the baseline, for +the reason stated earlier. + +Summary +------- +The multi-gen LRU can be disassembled into the following parts: + +* Generations +* Rmap walks +* Page table walks +* Bloom filters +* PID controller + +The aging and the eviction form a producer-consumer model; +specifically, the latter drives the former by the sliding window over +generations. Within the aging, rmap walks drive page table walks by +inserting hot densely populated page tables to the Bloom filters. +Within the eviction, the PID controller uses refaults as the feedback +to select types to evict and tiers to protect. diff --git a/Documentation/mm/page_owner.rst b/Documentation/mm/page_owner.rst index f5c954afe97c..127514955a5e 100644 --- a/Documentation/mm/page_owner.rst +++ b/Documentation/mm/page_owner.rst @@ -38,22 +38,10 @@ not affect to allocation performance, especially if the static keys jump label patching functionality is available. Following is the kernel's code size change due to this facility. -- Without page owner:: - - text data bss dec hex filename - 48392 2333 644 51369 c8a9 mm/page_alloc.o - -- With page owner:: - - text data bss dec hex filename - 48800 2445 644 51889 cab1 mm/page_alloc.o - 6662 108 29 6799 1a8f mm/page_owner.o - 1025 8 8 1041 411 mm/page_ext.o - -Although, roughly, 8 KB code is added in total, page_alloc.o increase by -520 bytes and less than half of it is in hotpath. Building the kernel with -page owner and turning it on if needed would be great option to debug -kernel memory problem. +Although enabling page owner increases kernel size by several kilobytes, +most of this code is outside page allocator and its hot path. Building +the kernel with page owner and turning it on if needed would be great +option to debug kernel memory problem. There is one notice that is caused by implementation detail. page owner stores information into the memory from struct page extension. This memory @@ -94,6 +82,11 @@ Usage Page allocated via order XXX, ... PFN XXX ... // Detailed stack + By default, it will do full pfn dump, to start with a given pfn, + page_owner supports fseek. + + FILE *fp = fopen("/sys/kernel/debug/page_owner", "r"); + fseek(fp, pfn_start, SEEK_SET); The ``page_owner_sort`` tool ignores ``PFN`` rows, puts the remaining rows in buf, uses regexp to extract the page order value, counts the times diff --git a/Documentation/mm/slub.rst b/Documentation/mm/slub.rst index 43063ade737a..4e1578186b4f 100644 --- a/Documentation/mm/slub.rst +++ b/Documentation/mm/slub.rst @@ -400,21 +400,30 @@ information: allocated objects. The output is sorted by frequency of each trace. Information in the output: - Number of objects, allocating function, minimal/average/maximal jiffies since alloc, - pid range of the allocating processes, cpu mask of allocating cpus, and stack trace. + Number of objects, allocating function, possible memory wastage of + kmalloc objects(total/per-object), minimal/average/maximal jiffies + since alloc, pid range of the allocating processes, cpu mask of + allocating cpus, numa node mask of origins of memory, and stack trace. Example::: - 1085 populate_error_injection_list+0x97/0x110 age=166678/166680/166682 pid=1 cpus=1:: - __slab_alloc+0x6d/0x90 - kmem_cache_alloc_trace+0x2eb/0x300 - populate_error_injection_list+0x97/0x110 - init_error_injection+0x1b/0x71 - do_one_initcall+0x5f/0x2d0 - kernel_init_freeable+0x26f/0x2d7 - kernel_init+0xe/0x118 - ret_from_fork+0x22/0x30 - + 338 pci_alloc_dev+0x2c/0xa0 waste=521872/1544 age=290837/291891/293509 pid=1 cpus=106 nodes=0-1 + __kmem_cache_alloc_node+0x11f/0x4e0 + kmalloc_trace+0x26/0xa0 + pci_alloc_dev+0x2c/0xa0 + pci_scan_single_device+0xd2/0x150 + pci_scan_slot+0xf7/0x2d0 + pci_scan_child_bus_extend+0x4e/0x360 + acpi_pci_root_create+0x32e/0x3b0 + pci_acpi_scan_root+0x2b9/0x2d0 + acpi_pci_root_add.cold.11+0x110/0xb0a + acpi_bus_attach+0x262/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 + device_for_each_child+0xb7/0x110 + acpi_dev_for_each_child+0x77/0xa0 + acpi_bus_attach+0x108/0x3f0 2. free_traces:: diff --git a/Documentation/mm/unevictable-lru.rst b/Documentation/mm/unevictable-lru.rst index b280367d6a44..4a0e158aa9ce 100644 --- a/Documentation/mm/unevictable-lru.rst +++ b/Documentation/mm/unevictable-lru.rst @@ -197,7 +197,7 @@ unevictable list for the memory cgroup and node being scanned. There may be situations where a page is mapped into a VM_LOCKED VMA, but the page is not marked as PG_mlocked. Such pages will make it all the way to shrink_active_list() or shrink_page_list() where they will be detected when -vmscan walks the reverse map in page_referenced() or try_to_unmap(). The page +vmscan walks the reverse map in folio_referenced() or try_to_unmap(). The page is culled to the unevictable list when it is released by the shrinker. To "cull" an unevictable page, vmscan simply puts the page back on the LRU list @@ -267,7 +267,7 @@ the LRU. Such pages can be "noticed" by memory management in several places: (4) in the fault path and when a VM_LOCKED stack segment is expanded; or (5) as mentioned above, in vmscan:shrink_page_list() when attempting to - reclaim a page in a VM_LOCKED VMA by page_referenced() or try_to_unmap(). + reclaim a page in a VM_LOCKED VMA by folio_referenced() or try_to_unmap(). mlocked pages become unlocked and rescued from the unevictable list when: @@ -547,7 +547,7 @@ vmscan's shrink_inactive_list() and shrink_page_list() also divert obviously unevictable pages found on the inactive lists to the appropriate memory cgroup and node unevictable list. -rmap's page_referenced_one(), called via vmscan's shrink_active_list() or +rmap's folio_referenced_one(), called via vmscan's shrink_active_list() or shrink_page_list(), and rmap's try_to_unmap_one() called via shrink_page_list(), check for (3) pages still mapped into VM_LOCKED VMAs, and call mlock_vma_page() to correct them. Such pages are culled to the unevictable list when released |