summaryrefslogtreecommitdiffstats
path: root/mm (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'fixes-2024-06-13' of ↵Linus Torvalds2024-06-131-0/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock Pull memblock fixes from Mike Rapoport: "Fix validation of NUMA coverage. memblock_validate_numa_coverage() was checking for a unset node ID using NUMA_NO_NODE, but x86 used MAX_NUMNODES when no node ID was specified by buggy firmware. Update memblock to substitute MAX_NUMNODES with NUMA_NO_NODE in memblock_set_node() and use NUMA_NO_NODE in x86::numa_init()" * tag 'fixes-2024-06-13' of git://git.kernel.org/pub/scm/linux/kernel/git/rppt/memblock: x86/mm/numa: Use NUMA_NO_NODE when calling memblock_set_node() memblock: make memblock_set_node() also warn about use of MAX_NUMNODES
| * memblock: make memblock_set_node() also warn about use of MAX_NUMNODESJan Beulich2024-05-311-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On an (old) x86 system with SRAT just covering space above 4Gb: ACPI: SRAT: Node 0 PXM 0 [mem 0x100000000-0xfffffffff] hotplug the commit referenced below leads to this NUMA configuration no longer being refused by a CONFIG_NUMA=y kernel (previously NUMA: nodes only cover 6144MB of your 8185MB e820 RAM. Not used. No NUMA configuration found Faking a node at [mem 0x0000000000000000-0x000000027fffffff] was seen in the log directly after the message quoted above), because of memblock_validate_numa_coverage() checking for NUMA_NO_NODE (only). This in turn led to memblock_alloc_range_nid()'s warning about MAX_NUMNODES triggering, followed by a NULL deref in memmap_init() when trying to access node 64's (NODE_SHIFT=6) node data. To compensate said change, make memblock_set_node() warn on and adjust a passed in value of MAX_NUMNODES, just like various other functions already do. Fixes: ff6c3d81f2e8 ("NUMA: optimize detection of memory with no node id assigned by firmware") Signed-off-by: Jan Beulich <jbeulich@suse.com> Cc: stable@vger.kernel.org Link: https://lore.kernel.org/r/1c8a058c-5365-4f27-a9f1-3aeb7fb3e7b2@suse.com Signed-off-by: Mike Rapoport (IBM) <rppt@kernel.org>
* | mm: fix xyz_noprof functions calling profiled functionsSuren Baghdasaryan2024-06-063-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Grepping /proc/allocinfo for "noprof" reveals several xyz_noprof functions, which means internally they are calling profiled functions. This should never happen as such calls move allocation charge from a higher level location where it should be accounted for into these lower level helpers. Fix this by replacing profiled function calls with noprof ones. Link: https://lkml.kernel.org/r/20240531205350.3973009-1-surenb@google.com Fixes: b951aaff5035 ("mm: enable page allocation tagging") Fixes: e26d8769da6d ("mempool: hook up to memory allocation profiling") Fixes: 88ae5fb755b0 ("mm: vmalloc: enable memory allocation profiling") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Reviewed-by: Kees Cook <kees@kernel.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | codetag: avoid race at alloc_slab_obj_extsThadeu Lima de Souza Cascardo2024-06-061-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_MEM_ALLOC_PROFILING_DEBUG is enabled, the following warning may be noticed: [ 48.299584] ------------[ cut here ]------------ [ 48.300092] alloc_tag was not set [ 48.300528] WARNING: CPU: 2 PID: 1361 at include/linux/alloc_tag.h:130 alloc_tagging_slab_free_hook+0x84/0xc7 [ 48.301305] Modules linked in: [ 48.301553] CPU: 2 PID: 1361 Comm: systemd-udevd Not tainted 6.10.0-rc1-00003-gac8755535862 #176 [ 48.302196] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 [ 48.302752] RIP: 0010:alloc_tagging_slab_free_hook+0x84/0xc7 [ 48.303169] Code: 8d 1c c4 48 85 db 74 4d 48 83 3b 00 75 1e 80 3d 65 02 86 04 00 75 15 48 c7 c7 11 48 1d 85 c6 05 55 02 86 04 01 e8 64 44 a5 ff <0f> 0b 48 8b 03 48 85 c0 74 21 48 83 f8 01 74 14 48 8b 50 20 48 f7 [ 48.304411] RSP: 0018:ffff8880111b7d40 EFLAGS: 00010282 [ 48.304916] RAX: 0000000000000000 RBX: ffff88800fcc9008 RCX: 0000000000000000 [ 48.305455] RDX: 0000000080000000 RSI: ffff888014060000 RDI: ffffed1002236f97 [ 48.305979] RBP: 0000000000001100 R08: fffffbfff0aa73a1 R09: 0000000000000000 [ 48.306473] R10: ffffffff814515e5 R11: 0000000000000003 R12: ffff88800fcc9000 [ 48.306943] R13: ffff88800b2e5cc0 R14: ffff8880111b7d90 R15: 0000000000000000 [ 48.307529] FS: 00007faf5d1908c0(0000) GS:ffff88806cf00000(0000) knlGS:0000000000000000 [ 48.308223] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 48.308710] CR2: 000058fb220c9118 CR3: 00000000110cc000 CR4: 0000000000750ef0 [ 48.309274] PKRU: 55555554 [ 48.309804] Call Trace: [ 48.310029] <TASK> [ 48.310290] ? show_regs+0x84/0x8d [ 48.310722] ? alloc_tagging_slab_free_hook+0x84/0xc7 [ 48.311298] ? __warn+0x13b/0x2ff [ 48.311580] ? alloc_tagging_slab_free_hook+0x84/0xc7 [ 48.311987] ? report_bug+0x2ce/0x3ab [ 48.312292] ? handle_bug+0x8c/0x107 [ 48.312563] ? exc_invalid_op+0x34/0x6f [ 48.312842] ? asm_exc_invalid_op+0x1a/0x20 [ 48.313173] ? this_cpu_in_panic+0x1c/0x72 [ 48.313503] ? alloc_tagging_slab_free_hook+0x84/0xc7 [ 48.313880] ? putname+0x143/0x14e [ 48.314152] kmem_cache_free+0xe9/0x214 [ 48.314454] putname+0x143/0x14e [ 48.314712] do_unlinkat+0x413/0x45e [ 48.315001] ? __pfx_do_unlinkat+0x10/0x10 [ 48.315388] ? __check_object_size+0x4d7/0x525 [ 48.315744] ? __sanitizer_cov_trace_pc+0x20/0x4a [ 48.316167] ? __sanitizer_cov_trace_pc+0x20/0x4a [ 48.316757] ? getname_flags+0x4ed/0x500 [ 48.317261] __x64_sys_unlink+0x42/0x4a [ 48.317741] do_syscall_64+0xe2/0x149 [ 48.318171] entry_SYSCALL_64_after_hwframe+0x76/0x7e [ 48.318602] RIP: 0033:0x7faf5d8850ab [ 48.318891] Code: fd ff ff e8 27 dd 01 00 0f 1f 80 00 00 00 00 f3 0f 1e fa b8 5f 00 00 00 0f 05 c3 0f 1f 40 00 f3 0f 1e fa b8 57 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 05 c3 0f 1f 40 00 48 8b 15 41 2d 0e 00 f7 d8 [ 48.320649] RSP: 002b:00007ffc44982b38 EFLAGS: 00000246 ORIG_RAX: 0000000000000057 [ 48.321182] RAX: ffffffffffffffda RBX: 00005ba344a44680 RCX: 00007faf5d8850ab [ 48.321667] RDX: 0000000000000000 RSI: 00005ba344a44430 RDI: 00007ffc44982b40 [ 48.322139] RBP: 00007ffc44982c00 R08: 0000000000000000 R09: 0000000000000007 [ 48.322598] R10: 00005ba344a44430 R11: 0000000000000246 R12: 0000000000000000 [ 48.323071] R13: 00007ffc44982b40 R14: 0000000000000000 R15: 0000000000000000 [ 48.323596] </TASK> This is due to a race when two objects are allocated from the same slab, which did not have an obj_exts allocated for. In such a case, the two threads will notice the NULL obj_exts and after one assigns slab->obj_exts, the second one will happily do the exchange if it reads this new assigned value. In order to avoid that, verify that the read obj_exts does not point to an allocated obj_exts before doing the exchange. Link: https://lkml.kernel.org/r/20240527183007.1595037-1-cascardo@igalia.com Fixes: 09c46563ff6d ("codetag: debug: introduce OBJEXTS_ALLOC_FAIL to mark failed slab_ext allocations") Signed-off-by: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Gustavo A. R. Silva <gustavoars@kernel.org> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <keescook@chromium.org> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Thadeu Lima de Souza Cascardo <cascardo@igalia.com> Cc: Kent Overstreet <kent.overstreet@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm/hugetlb: do not call vma_add_reservation upon ENOMEMOscar Salvador2024-06-061-2/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sysbot reported a splat [1] on __unmap_hugepage_range(). This is because vma_needs_reservation() can return -ENOMEM if allocate_file_region_entries() fails to allocate the file_region struct for the reservation. Check for that and do not call vma_add_reservation() if that is the case, otherwise region_abort() and region_del() will see that we do not have any file_regions. If we detect that vma_needs_reservation() returned -ENOMEM, we clear the hugetlb_restore_reserve flag as if this reservation was still consumed, so free_huge_folio() will not increment the resv count. [1] https://lore.kernel.org/linux-mm/0000000000004096100617c58d54@google.com/T/#ma5983bc1ab18a54910da83416b3f89f3c7ee43aa Link: https://lkml.kernel.org/r/20240528205323.20439-1-osalvador@suse.de Fixes: df7a6d1f6405 ("mm/hugetlb: restore the reservation if needed") Signed-off-by: Oscar Salvador <osalvador@suse.de> Reported-and-tested-by: syzbot+d3fe2dc5ffe9380b714b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/linux-mm/0000000000004096100617c58d54@google.com/ Cc: Breno Leitao <leitao@debian.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm/ksm: fix ksm_zero_pages accountingChengming Zhou2024-06-061-6/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We normally ksm_zero_pages++ in ksmd when page is merged with zero page, but ksm_zero_pages-- is done from page tables side, where there is no any accessing protection of ksm_zero_pages. So we can read very exceptional value of ksm_zero_pages in rare cases, such as -1, which is very confusing to users. Fix it by changing to use atomic_long_t, and the same case with the mm->ksm_zero_pages. Link: https://lkml.kernel.org/r/20240528-b4-ksm-counters-v3-2-34bb358fdc13@linux.dev Fixes: e2942062e01d ("ksm: count all zero pages placed by KSM") Fixes: 6080d19f0704 ("ksm: add ksm zero pages for each process") Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Stefan Roesch <shr@devkernel.io> Cc: xu xin <xu.xin16@zte.com.cn> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm/ksm: fix ksm_pages_scanned accountingChengming Zhou2024-06-061-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm/ksm: fix some accounting problems", v3. We encountered some abnormal ksm_pages_scanned and ksm_zero_pages during some random tests. 1. ksm_pages_scanned unchanged even ksmd scanning has progress. 2. ksm_zero_pages maybe -1 in some rare cases. This patch (of 2): During testing, I found ksm_pages_scanned is unchanged although the scan_get_next_rmap_item() did return valid rmap_item that is not NULL. The reason is the scan_get_next_rmap_item() will return NULL after a full scan, so ksm_do_scan() just return without accounting of the ksm_pages_scanned. Fix it by just putting ksm_pages_scanned accounting in that loop, and it will be accounted more timely if that loop would last for a long time. Link: https://lkml.kernel.org/r/20240528-b4-ksm-counters-v3-0-34bb358fdc13@linux.dev Link: https://lkml.kernel.org/r/20240528-b4-ksm-counters-v3-1-34bb358fdc13@linux.dev Fixes: b348b5fe2b5f ("mm/ksm: add pages scanned metric") Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: xu xin <xu.xin16@zte.com.cn> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Ran Xiaokai <ran.xiaokai@zte.com.cn> Cc: Stefan Roesch <shr@devkernel.io> Cc: Yang Yang <yang.yang29@zte.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | kmsan: do not wipe out origin when doing partial unpoisoningAlexander Potapenko2024-06-061-4/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As noticed by Brian, KMSAN should not be zeroing the origin when unpoisoning parts of a four-byte uninitialized value, e.g.: char a[4]; kmsan_unpoison_memory(a, 1); This led to false negatives, as certain poisoned values could receive zero origins, preventing those values from being reported. To fix the problem, check that kmsan_internal_set_shadow_origin() writes zero origins only to slots which have zero shadow. Link: https://lkml.kernel.org/r/20240528104807.738758-1-glider@google.com Fixes: f80be4571b19 ("kmsan: add KMSAN runtime core") Signed-off-by: Alexander Potapenko <glider@google.com> Reported-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com> Link: https://lore.kernel.org/lkml/20240524232804.1984355-1-bjohannesmeyer@gmail.com/T/ Reviewed-by: Marco Elver <elver@google.com> Tested-by: Brian Johannesmeyer <bjohannesmeyer@gmail.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Kees Cook <keescook@chromium.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | vmalloc: check CONFIG_EXECMEM in is_vmalloc_or_module_addr()Cong Wang2024-06-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit 2c9e5d4a0082 ("bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of") CONFIG_BPF_JIT does not depend on CONFIG_MODULES any more and bpf jit also uses the [MODULES_VADDR, MODULES_END] memory region. But is_vmalloc_or_module_addr() still checks CONFIG_MODULES, which then returns false for a bpf jit memory region when CONFIG_MODULES is not defined. It leads to the following kernel BUG: [ 1.567023] ------------[ cut here ]------------ [ 1.567883] kernel BUG at mm/vmalloc.c:745! [ 1.568477] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI [ 1.569367] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.9.0+ #448 [ 1.570247] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014 [ 1.570786] RIP: 0010:vmalloc_to_page+0x48/0x1ec [ 1.570786] Code: 0f 00 00 e8 eb 1a 05 00 b8 37 00 00 00 48 ba fe ff ff ff ff 1f 00 00 4c 03 25 76 49 c6 02 48 c1 e0 28 48 01 e8 48 39 d0 76 02 <0f> 0b 4c 89 e7 e8 bf 1a 05 00 49 8b 04 24 48 a9 9f ff ff ff 0f 84 [ 1.570786] RSP: 0018:ffff888007787960 EFLAGS: 00010212 [ 1.570786] RAX: 000036ffa0000000 RBX: 0000000000000640 RCX: ffffffff8147e93c [ 1.570786] RDX: 00001ffffffffffe RSI: dffffc0000000000 RDI: ffffffff840e32c8 [ 1.570786] RBP: ffffffffa0000000 R08: 0000000000000000 R09: 0000000000000000 [ 1.570786] R10: ffff888007787a88 R11: ffffffff8475d8e7 R12: ffffffff83e80ff8 [ 1.570786] R13: 0000000000000640 R14: 0000000000000640 R15: 0000000000000640 [ 1.570786] FS: 0000000000000000(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000 [ 1.570786] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.570786] CR2: ffff888006a01000 CR3: 0000000003e80000 CR4: 0000000000350ef0 [ 1.570786] Call Trace: [ 1.570786] <TASK> [ 1.570786] ? __die_body+0x1b/0x58 [ 1.570786] ? die+0x31/0x4b [ 1.570786] ? do_trap+0x9d/0x138 [ 1.570786] ? vmalloc_to_page+0x48/0x1ec [ 1.570786] ? do_error_trap+0xcd/0x102 [ 1.570786] ? vmalloc_to_page+0x48/0x1ec [ 1.570786] ? vmalloc_to_page+0x48/0x1ec [ 1.570786] ? handle_invalid_op+0x2f/0x38 [ 1.570786] ? vmalloc_to_page+0x48/0x1ec [ 1.570786] ? exc_invalid_op+0x2b/0x41 [ 1.570786] ? asm_exc_invalid_op+0x16/0x20 [ 1.570786] ? vmalloc_to_page+0x26/0x1ec [ 1.570786] ? vmalloc_to_page+0x48/0x1ec [ 1.570786] __text_poke+0xb6/0x458 [ 1.570786] ? __pfx_text_poke_memcpy+0x10/0x10 [ 1.570786] ? __pfx___mutex_lock+0x10/0x10 [ 1.570786] ? __pfx___text_poke+0x10/0x10 [ 1.570786] ? __pfx_get_random_u32+0x10/0x10 [ 1.570786] ? srso_return_thunk+0x5/0x5f [ 1.570786] text_poke_copy_locked+0x70/0x84 [ 1.570786] text_poke_copy+0x32/0x4f [ 1.570786] bpf_arch_text_copy+0xf/0x27 [ 1.570786] bpf_jit_binary_pack_finalize+0x26/0x5a [ 1.570786] bpf_int_jit_compile+0x576/0x8ad [ 1.570786] ? __pfx_bpf_int_jit_compile+0x10/0x10 [ 1.570786] ? srso_return_thunk+0x5/0x5f [ 1.570786] ? __kmalloc_node_track_caller+0x2b5/0x2e0 [ 1.570786] bpf_prog_select_runtime+0x7c/0x199 [ 1.570786] bpf_prepare_filter+0x1e9/0x25b [ 1.570786] ? __pfx_bpf_prepare_filter+0x10/0x10 [ 1.570786] ? srso_return_thunk+0x5/0x5f [ 1.570786] ? _find_next_bit+0x29/0x7e [ 1.570786] bpf_prog_create+0xb8/0xe0 [ 1.570786] ptp_classifier_init+0x75/0xa1 [ 1.570786] ? __pfx_ptp_classifier_init+0x10/0x10 [ 1.570786] ? srso_return_thunk+0x5/0x5f [ 1.570786] ? register_pernet_subsys+0x36/0x42 [ 1.570786] ? srso_return_thunk+0x5/0x5f [ 1.570786] sock_init+0x99/0xa3 [ 1.570786] ? __pfx_sock_init+0x10/0x10 [ 1.570786] do_one_initcall+0x104/0x2c4 [ 1.570786] ? __pfx_do_one_initcall+0x10/0x10 [ 1.570786] ? parameq+0x25/0x2d [ 1.570786] ? rcu_is_watching+0x1c/0x3c [ 1.570786] ? trace_kmalloc+0x81/0xb2 [ 1.570786] ? srso_return_thunk+0x5/0x5f [ 1.570786] ? __kmalloc+0x29c/0x2c7 [ 1.570786] ? srso_return_thunk+0x5/0x5f [ 1.570786] do_initcalls+0xf9/0x123 [ 1.570786] kernel_init_freeable+0x24f/0x289 [ 1.570786] ? __pfx_kernel_init+0x10/0x10 [ 1.570786] kernel_init+0x19/0x13a [ 1.570786] ret_from_fork+0x24/0x41 [ 1.570786] ? __pfx_kernel_init+0x10/0x10 [ 1.570786] ret_from_fork_asm+0x1a/0x30 [ 1.570786] </TASK> [ 1.570819] ---[ end trace 0000000000000000 ]--- [ 1.571463] RIP: 0010:vmalloc_to_page+0x48/0x1ec [ 1.572111] Code: 0f 00 00 e8 eb 1a 05 00 b8 37 00 00 00 48 ba fe ff ff ff ff 1f 00 00 4c 03 25 76 49 c6 02 48 c1 e0 28 48 01 e8 48 39 d0 76 02 <0f> 0b 4c 89 e7 e8 bf 1a 05 00 49 8b 04 24 48 a9 9f ff ff ff 0f 84 [ 1.574632] RSP: 0018:ffff888007787960 EFLAGS: 00010212 [ 1.575129] RAX: 000036ffa0000000 RBX: 0000000000000640 RCX: ffffffff8147e93c [ 1.576097] RDX: 00001ffffffffffe RSI: dffffc0000000000 RDI: ffffffff840e32c8 [ 1.577084] RBP: ffffffffa0000000 R08: 0000000000000000 R09: 0000000000000000 [ 1.578077] R10: ffff888007787a88 R11: ffffffff8475d8e7 R12: ffffffff83e80ff8 [ 1.578810] R13: 0000000000000640 R14: 0000000000000640 R15: 0000000000000640 [ 1.579823] FS: 0000000000000000(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000 [ 1.580992] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 1.581869] CR2: ffff888006a01000 CR3: 0000000003e80000 CR4: 0000000000350ef0 [ 1.582800] Kernel panic - not syncing: Fatal exception [ 1.583765] ---[ end Kernel panic - not syncing: Fatal exception ]--- Fix this by checking CONFIG_EXECMEM instead. Link: https://lkml.kernel.org/r/20240528160838.102223-1-xiyou.wangcong@gmail.com Fixes: 2c9e5d4a0082 ("bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of") Signed-off-by: Cong Wang <cong.wang@bytedance.com> Acked-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Luis Chamberlain <mcgrof@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: page_alloc: fix highatomic typing in multi-block buddiesJohannes Weiner2024-06-061-16/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Christoph reports a page allocator splat triggered by xfstests: generic/176 214s ... [ 1204.507931] run fstests generic/176 at 2024-05-27 12:52:30 XFS (nvme0n1): Mounting V5 Filesystem cd936307-415f-48a3-b99d-a2d52ae1f273 XFS (nvme0n1): Ending clean mount XFS (nvme1n1): Mounting V5 Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850 XFS (nvme1n1): Ending clean mount XFS (nvme1n1): Unmounting Filesystem ab3ee1a4-af62-4934-9a6a-6c2fde321850 XFS (nvme1n1): Mounting V5 Filesystem 7099b02d-9c58-4d1d-be1d-2cc472d12cd9 XFS (nvme1n1): Ending clean mount ------------[ cut here ]------------ page type is 3, passed migratetype is 1 (nr=512) WARNING: CPU: 0 PID: 509870 at mm/page_alloc.c:645 expand+0x1c5/0x1f0 Modules linked in: i2c_i801 crc32_pclmul i2c_smbus [last unloaded: scsi_debug] CPU: 0 PID: 509870 Comm: xfs_io Not tainted 6.10.0-rc1+ #2437 Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.16.3-debian-1.16.3-2 04/01/2014 RIP: 0010:expand+0x1c5/0x1f0 Code: 05 16 70 bf 02 01 e8 ca fc ff ff 8b 54 24 34 44 89 e1 48 c7 c7 80 a2 28 83 48 89 c6 b8 01 00 3 RSP: 0018:ffffc90003b2b968 EFLAGS: 00010082 RAX: 0000000000000000 RBX: ffffffff83fa9480 RCX: 0000000000000000 RDX: 0000000000000005 RSI: 0000000000000027 RDI: 00000000ffffffff RBP: 00000000001f2600 R08: 00000000fffeffff R09: 0000000000000001 R10: 0000000000000000 R11: ffffffff83676200 R12: 0000000000000009 R13: 0000000000000200 R14: 0000000000000001 R15: ffffea0007c98000 FS: 00007f72ca3d5780(0000) GS:ffff8881f9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007f72ca1fff38 CR3: 00000001aa0c6002 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff07f0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> ? __warn+0x7b/0x120 ? expand+0x1c5/0x1f0 ? report_bug+0x191/0x1c0 ? handle_bug+0x3c/0x80 ? exc_invalid_op+0x17/0x70 ? asm_exc_invalid_op+0x1a/0x20 ? expand+0x1c5/0x1f0 ? expand+0x1c5/0x1f0 __rmqueue_pcplist+0x3a9/0x730 get_page_from_freelist+0x7a0/0xf00 __alloc_pages_noprof+0x153/0x2e0 __folio_alloc_noprof+0x10/0xa0 __filemap_get_folio+0x16b/0x370 iomap_write_begin+0x496/0x680 While trying to service a movable allocation (page type 1), the page allocator runs into a two-pageblock buddy on the movable freelist whose second block is typed as highatomic (page type 3). This inconsistency is caused by the highatomic reservation system operating on single pageblocks, while MAX_ORDER can be bigger than that - in this configuration, pageblock_order is 9 while MAX_PAGE_ORDER is 10. The test case is observed to make several adjacent order-3 requests with __GFP_DIRECT_RECLAIM cleared, which marks the surrounding block as highatomic. Upon freeing, the blocks merge into an order-10 buddy. When the highatomic pool is drained later on, this order-10 buddy gets moved back to the movable list, but only the first pageblock is marked movable again. A subsequent expand() of this buddy warns about the tail being of a different type. This is a long-standing bug that's surfaced by the recent block type warnings added to the allocator. The consequences seem mostly benign, it just results in odd behavior: the highatomic tail blocks are not properly drained, instead they end up on the movable list first, then go back to the highatomic list after an alloc-free cycle. To fix this, make the highatomic reservation code aware that allocations/buddies can be larger than a pageblock. While it's an old quirk, the recently added type consistency warnings seem to be the most prominent consequence of it. Set the Fixes: tag accordingly to highlight this backporting dependency. Link: https://lkml.kernel.org/r/20240530114203.GA1222079@cmpxchg.org Fixes: e0932b6c1f94 ("mm: page_alloc: consolidate free page accounting") Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reported-by: Christoph Hellwig <hch@infradead.org> Reviewed-by: Zi Yan <ziy@nvidia.com> Tested-by: Christoph Hellwig <hch@lst.de> Cc: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Vlastimil Babka <vbabka@suse.cz> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | memcg: remove the lockdep assert from __mod_objcg_mlstate()Sebastian Andrzej Siewior2024-06-061-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The assert was introduced in the commit cited below as an insurance that the semantic is the same after the local_irq_save() has been removed and the function has been made static. The original requirement to disable interrupt was due the modification of per-CPU counters which require interrupts to be disabled because the counter update operation is not atomic and some of the counters are updated from interrupt context. All callers of __mod_objcg_mlstate() acquire a lock (memcg_stock.stock_lock) which disables interrupts on !PREEMPT_RT and the lockdep assert is satisfied. On PREEMPT_RT the interrupts are not disabled and the assert triggers. The safety of the counter update is already ensured by VM_WARN_ON_IRQS_ENABLED() which is part of __mod_memcg_lruvec_state() and does not require yet another check. Remove the lockdep assert from __mod_objcg_mlstate(). Link: https://lkml.kernel.org/r/20240528141341.rz_rytN_@linutronix.de Fixes: 91882c1617c1 ("memcg: simple cleanup of stats update functions") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: drop the 'anon_' prefix for swap-out mTHP countersBaolin Wang2024-06-063-6/+6
|/ | | | | | | | | | | | | | | | | | | | | | | The mTHP swap related counters: 'anon_swpout' and 'anon_swpout_fallback' are confusing with an 'anon_' prefix, since the shmem can swap out non-anonymous pages. So drop the 'anon_' prefix to keep consistent with the old swap counter names. This is needed in 6.10-rcX to avoid having an inconsistent ABI out in the field. Link: https://lkml.kernel.org/r/7a8989c13299920d7589007a30065c3e2c19f0e0.1716431702.git.baolin.wang@linux.alibaba.com Fixes: d0f048ac39f6 ("mm: add per-order mTHP anon_swpout and anon_swpout_fallback counters") Fixes: 42248b9d34ea ("mm: add docs for per-order mTHP counters and transhuge_page ABI") Signed-off-by: Baolin Wang <baolin.wang@linux.alibaba.com> Suggested-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: Barry Song <baohua@kernel.org> Cc: David Hildenbrand <david@redhat.com> Cc: Lance Yang <ioworker0@gmail.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Merge tag 'mm-hotfixes-stable-2024-05-25-09-13' of ↵Linus Torvalds2024-05-263-6/+13
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "16 hotfixes, 11 of which are cc:stable. A few nilfs2 fixes, the remainder are for MM: a couple of selftests fixes, various singletons fixing various issues in various parts" * tag 'mm-hotfixes-stable-2024-05-25-09-13' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/ksm: fix possible UAF of stable_node mm/memory-failure: fix handling of dissolved but not taken off from buddy pages mm: /proc/pid/smaps_rollup: avoid skipping vma after getting mmap_lock again nilfs2: fix potential hang in nilfs_detach_log_writer() nilfs2: fix unexpected freezing of nilfs_segctor_sync() nilfs2: fix use-after-free of timer for log writer thread selftests/mm: fix build warnings on ppc64 arm64: patching: fix handling of execmem addresses selftests/mm: compaction_test: fix bogus test success and reduce probability of OOM-killer invocation selftests/mm: compaction_test: fix incorrect write of zero to nr_hugepages selftests/mm: compaction_test: fix bogus test success on Aarch64 mailmap: update email address for Satya Priya mm/huge_memory: don't unpoison huge_zero_folio kasan, fortify: properly rename memintrinsics lib: add version into /proc/allocinfo output mm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAIL
| * mm/ksm: fix possible UAF of stable_nodeChengming Zhou2024-05-241-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit 2c653d0ee2ae ("ksm: introduce ksm_max_page_sharing per page deduplication limit") introduced a possible failure case in the stable_tree_insert(), where we may free the new allocated stable_node_dup if we fail to prepare the missing chain node. Then that kfolio return and unlock with a freed stable_node set... And any MM activities can come in to access kfolio->mapping, so UAF. Fix it by moving folio_set_stable_node() to the end after stable_node is inserted successfully. Link: https://lkml.kernel.org/r/20240513-b4-ksm-stable-node-uaf-v1-1-f687de76f452@linux.dev Fixes: 2c653d0ee2ae ("ksm: introduce ksm_max_page_sharing per page deduplication limit") Signed-off-by: Chengming Zhou <chengming.zhou@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Hugh Dickins <hughd@google.com> Cc: Stefan Roesch <shr@devkernel.io> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/memory-failure: fix handling of dissolved but not taken off from buddy pagesMiaohe Lin2024-05-241-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I did memory failure tests recently, below panic occurs: page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00 flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff) raw: 06fffe0000000000 dead000000000100 dead000000000122 0000000000000000 raw: 0000000000000000 0000000000000009 00000000ffffffff 0000000000000000 page dumped because: VM_BUG_ON_PAGE(!PageBuddy(page)) ------------[ cut here ]------------ kernel BUG at include/linux/page-flags.h:1009! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI RIP: 0010:__del_page_from_free_list+0x151/0x180 RSP: 0018:ffffa49c90437998 EFLAGS: 00000046 RAX: 0000000000000035 RBX: 0000000000000009 RCX: ffff8dd8dfd1c9c8 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff8dd8dfd1c9c0 RBP: ffffd901233b8000 R08: ffffffffab5511f8 R09: 0000000000008c69 R10: 0000000000003c15 R11: ffffffffab5511f8 R12: ffff8dd8fffc0c80 R13: 0000000000000001 R14: ffff8dd8fffc0c80 R15: 0000000000000009 FS: 00007ff916304740(0000) GS:ffff8dd8dfd00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055eae50124c8 CR3: 00000008479e0000 CR4: 00000000000006f0 Call Trace: <TASK> __rmqueue_pcplist+0x23b/0x520 get_page_from_freelist+0x26b/0xe40 __alloc_pages_noprof+0x113/0x1120 __folio_alloc_noprof+0x11/0xb0 alloc_buddy_hugetlb_folio.isra.0+0x5a/0x130 __alloc_fresh_hugetlb_folio+0xe7/0x140 alloc_pool_huge_folio+0x68/0x100 set_max_huge_pages+0x13d/0x340 hugetlb_sysctl_handler_common+0xe8/0x110 proc_sys_call_handler+0x194/0x280 vfs_write+0x387/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xc2/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7ff916114887 RSP: 002b:00007ffec8a2fd78 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 000055eae500e350 RCX: 00007ff916114887 RDX: 0000000000000004 RSI: 000055eae500e390 RDI: 0000000000000003 RBP: 000055eae50104c0 R08: 0000000000000000 R09: 000055eae50104c0 R10: 0000000000000077 R11: 0000000000000246 R12: 0000000000000004 R13: 0000000000000004 R14: 00007ff916216b80 R15: 00007ff916216a00 </TASK> Modules linked in: mce_inject hwpoison_inject ---[ end trace 0000000000000000 ]--- And before the panic, there had an warning about bad page state: BUG: Bad page state in process page-types pfn:8cee00 page: refcount:0 mapcount:0 mapping:0000000000000000 index:0x0 pfn:0x8cee00 flags: 0x6fffe0000000000(node=1|zone=2|lastcpupid=0x7fff) page_type: 0xffffff7f(buddy) raw: 06fffe0000000000 ffffd901241c0008 ffffd901240f8008 0000000000000000 raw: 0000000000000000 0000000000000009 00000000ffffff7f 0000000000000000 page dumped because: nonzero mapcount Modules linked in: mce_inject hwpoison_inject CPU: 8 PID: 154211 Comm: page-types Not tainted 6.9.0-rc4-00499-g5544ec3178e2-dirty #22 Call Trace: <TASK> dump_stack_lvl+0x83/0xa0 bad_page+0x63/0xf0 free_unref_page+0x36e/0x5c0 unpoison_memory+0x50b/0x630 simple_attr_write_xsigned.constprop.0.isra.0+0xb3/0x110 debugfs_attr_write+0x42/0x60 full_proxy_write+0x5b/0x80 vfs_write+0xcd/0x550 ksys_write+0x64/0xe0 do_syscall_64+0xc2/0x1d0 entry_SYSCALL_64_after_hwframe+0x77/0x7f RIP: 0033:0x7f189a514887 RSP: 002b:00007ffdcd899718 EFLAGS: 00000246 ORIG_RAX: 0000000000000001 RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f189a514887 RDX: 0000000000000009 RSI: 00007ffdcd899730 RDI: 0000000000000003 RBP: 00007ffdcd8997a0 R08: 0000000000000000 R09: 00007ffdcd8994b2 R10: 0000000000000000 R11: 0000000000000246 R12: 00007ffdcda199a8 R13: 0000000000404af1 R14: 000000000040ad78 R15: 00007f189a7a5040 </TASK> The root cause should be the below race: memory_failure try_memory_failure_hugetlb me_huge_page __page_handle_poison dissolve_free_hugetlb_folio drain_all_pages -- Buddy page can be isolated e.g. for compaction. take_page_off_buddy -- Failed as page is not in the buddy list. -- Page can be putback into buddy after compaction. page_ref_inc -- Leads to buddy page with refcnt = 1. Then unpoison_memory() can unpoison the page and send the buddy page back into buddy list again leading to the above bad page state warning. And bad_page() will call page_mapcount_reset() to remove PageBuddy from buddy page leading to later VM_BUG_ON_PAGE(!PageBuddy(page)) when trying to allocate this page. Fix this issue by only treating __page_handle_poison() as successful when it returns 1. Link: https://lkml.kernel.org/r/20240523071217.1696196-1-linmiaohe@huawei.com Fixes: ceaf8fbea79a ("mm, hwpoison: skip raw hwpoison page in freeing 1GB hugepage") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/huge_memory: don't unpoison huge_zero_folioMiaohe Lin2024-05-241-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I did memory failure tests recently, below panic occurs: kernel BUG at include/linux/mm.h:1135! invalid opcode: 0000 [#1] PREEMPT SMP NOPTI CPU: 9 PID: 137 Comm: kswapd1 Not tainted 6.9.0-rc4-00491-gd5ce28f156fe-dirty #14 RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0 RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246 RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0 RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492 R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00 FS: 0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0 Call Trace: <TASK> do_shrink_slab+0x14f/0x6a0 shrink_slab+0xca/0x8c0 shrink_node+0x2d0/0x7d0 balance_pgdat+0x33a/0x720 kswapd+0x1f3/0x410 kthread+0xd5/0x100 ret_from_fork+0x2f/0x50 ret_from_fork_asm+0x1a/0x30 </TASK> Modules linked in: mce_inject hwpoison_inject ---[ end trace 0000000000000000 ]--- RIP: 0010:shrink_huge_zero_page_scan+0x168/0x1a0 RSP: 0018:ffff9933c6c57bd0 EFLAGS: 00000246 RAX: 000000000000003e RBX: 0000000000000000 RCX: ffff88f61fc5c9c8 RDX: 0000000000000000 RSI: 0000000000000027 RDI: ffff88f61fc5c9c0 RBP: ffffcd7c446b0000 R08: ffffffff9a9405f0 R09: 0000000000005492 R10: 00000000000030ea R11: ffffffff9a9405f0 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: ffff88e703c4ac00 FS: 0000000000000000(0000) GS:ffff88f61fc40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 000055f4da6e9878 CR3: 0000000c71048000 CR4: 00000000000006f0 The root cause is that HWPoison flag will be set for huge_zero_folio without increasing the folio refcnt. But then unpoison_memory() will decrease the folio refcnt unexpectedly as it appears like a successfully hwpoisoned folio leading to VM_BUG_ON_PAGE(page_ref_count(page) == 0) when releasing huge_zero_folio. Skip unpoisoning huge_zero_folio in unpoison_memory() to fix this issue. We're not prepared to unpoison huge_zero_folio yet. Link: https://lkml.kernel.org/r/20240516122608.22610-1-linmiaohe@huawei.com Fixes: 478d134e9506 ("mm/huge_memory: do not overkill when splitting huge_zero_page") Signed-off-by: Miaohe Lin <linmiaohe@huawei.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Yang Shi <shy828301@gmail.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: Xu Yu <xuyu@linux.alibaba.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAILHailong.Liu2024-05-241-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc") includes support for __GFP_NOFAIL, but it presents a conflict with commit dd544141b9eb ("vmalloc: back off when the current task is OOM-killed"). A possible scenario is as follows: process-a __vmalloc_node_range(GFP_KERNEL | __GFP_NOFAIL) __vmalloc_area_node() vm_area_alloc_pages() --> oom-killer send SIGKILL to process-a if (fatal_signal_pending(current)) break; --> return NULL; To fix this, do not check fatal_signal_pending() in vm_area_alloc_pages() if __GFP_NOFAIL set. This issue occurred during OPLUS KASAN TEST. Below is part of the log -> oom-killer sends signal to process [65731.222840] [ T1308] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/apps/uid_10198,task=gs.intelligence,pid=32454,uid=10198 [65731.259685] [T32454] Call trace: [65731.259698] [T32454] dump_backtrace+0xf4/0x118 [65731.259734] [T32454] show_stack+0x18/0x24 [65731.259756] [T32454] dump_stack_lvl+0x60/0x7c [65731.259781] [T32454] dump_stack+0x18/0x38 [65731.259800] [T32454] mrdump_common_die+0x250/0x39c [mrdump] [65731.259936] [T32454] ipanic_die+0x20/0x34 [mrdump] [65731.260019] [T32454] atomic_notifier_call_chain+0xb4/0xfc [65731.260047] [T32454] notify_die+0x114/0x198 [65731.260073] [T32454] die+0xf4/0x5b4 [65731.260098] [T32454] die_kernel_fault+0x80/0x98 [65731.260124] [T32454] __do_kernel_fault+0x160/0x2a8 [65731.260146] [T32454] do_bad_area+0x68/0x148 [65731.260174] [T32454] do_mem_abort+0x151c/0x1b34 [65731.260204] [T32454] el1_abort+0x3c/0x5c [65731.260227] [T32454] el1h_64_sync_handler+0x54/0x90 [65731.260248] [T32454] el1h_64_sync+0x68/0x6c [65731.260269] [T32454] z_erofs_decompress_queue+0x7f0/0x2258 --> be->decompressed_pages = kvcalloc(be->nr_pages, sizeof(struct page *), GFP_KERNEL | __GFP_NOFAIL); kernel panic by NULL pointer dereference. erofs assume kvmalloc with __GFP_NOFAIL never return NULL. [65731.260293] [T32454] z_erofs_runqueue+0xf30/0x104c [65731.260314] [T32454] z_erofs_readahead+0x4f0/0x968 [65731.260339] [T32454] read_pages+0x170/0xadc [65731.260364] [T32454] page_cache_ra_unbounded+0x874/0xf30 [65731.260388] [T32454] page_cache_ra_order+0x24c/0x714 [65731.260411] [T32454] filemap_fault+0xbf0/0x1a74 [65731.260437] [T32454] __do_fault+0xd0/0x33c [65731.260462] [T32454] handle_mm_fault+0xf74/0x3fe0 [65731.260486] [T32454] do_mem_abort+0x54c/0x1b34 [65731.260509] [T32454] el0_da+0x44/0x94 [65731.260531] [T32454] el0t_64_sync_handler+0x98/0xb4 [65731.260553] [T32454] el0t_64_sync+0x198/0x19c Link: https://lkml.kernel.org/r/20240510100131.1865-1-hailong.liu@oppo.com Fixes: 9376130c390a ("mm/vmalloc: add support for __GFP_NOFAIL") Signed-off-by: Hailong.Liu <hailong.liu@oppo.com> Acked-by: Michal Hocko <mhocko@suse.com> Suggested-by: Barry Song <21cnbao@gmail.com> Reported-by: Oven <liyangouwen1@oppo.com> Reviewed-by: Barry Song <baohua@kernel.org> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Cc: Chao Yu <chao@kernel.org> Cc: Christoph Hellwig <hch@infradead.org> Cc: Gao Xiang <xiang@kernel.org> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Michal Hocko <mhocko@suse.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mseal: add mseal syscallJeff Xu2024-05-247-1/+431
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new mseal() is an syscall on 64 bit CPU, and with following signature: int mseal(void addr, size_t len, unsigned long flags) addr/len: memory range. flags: reserved. mseal() blocks following operations for the given memory range. 1> Unmapping, moving to another location, and shrinking the size, via munmap() and mremap(), can leave an empty space, therefore can be replaced with a VMA with a new set of attributes. 2> Moving or expanding a different VMA into the current location, via mremap(). 3> Modifying a VMA via mmap(MAP_FIXED). 4> Size expansion, via mremap(), does not appear to pose any specific risks to sealed VMAs. It is included anyway because the use case is unclear. In any case, users can rely on merging to expand a sealed VMA. 5> mprotect() and pkey_mprotect(). 6> Some destructive madvice() behaviors (e.g. MADV_DONTNEED) for anonymous memory, when users don't have write permission to the memory. Those behaviors can alter region contents by discarding pages, effectively a memset(0) for anonymous memory. Following input during RFC are incooperated into this patch: Jann Horn: raising awareness and providing valuable insights on the destructive madvise operations. Linus Torvalds: assisting in defining system call signature and scope. Liam R. Howlett: perf optimization. Theo de Raadt: sharing the experiences and insight gained from implementing mimmutable() in OpenBSD. Finally, the idea that inspired this patch comes from Stephen Röttger's work in Chrome V8 CFI. [jeffxu@chromium.org: add branch prediction hint, per Pedro] Link: https://lkml.kernel.org/r/20240423192825.1273679-2-jeffxu@chromium.org Link: https://lkml.kernel.org/r/20240415163527.626541-3-jeffxu@chromium.org Signed-off-by: Jeff Xu <jeffxu@chromium.org> Reviewed-by: Kees Cook <keescook@chromium.org> Reviewed-by: Liam R. Howlett <Liam.Howlett@oracle.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Dave Hansen <dave.hansen@intel.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Guenter Roeck <groeck@chromium.org> Cc: Jann Horn <jannh@google.com> Cc: Jeff Xu <jeffxu@google.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Jorge Lucangeli Obes <jorgelo@chromium.org> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Muhammad Usama Anjum <usama.anjum@collabora.com> Cc: Pedro Falcato <pedro.falcato@gmail.com> Cc: Stephen Röttger <sroettger@google.com> Cc: Suren Baghdasaryan <surenb@google.com> Cc: Amer Al Shanawany <amer.shanawany@gmail.com> Cc: Javier Carrasco <javier.carrasco.cruz@gmail.com> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Merge tag 'mm-stable-2024-05-22-17-22' of ↵Linus Torvalds2024-05-232-14/+5
|\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull more mm updates from Andrew Morton: "A series from Dave Chinner which cleans up and fixes the handling of nested allocations within stackdepot and page-owner" * tag 'mm-stable-2024-05-22-17-22' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/page-owner: use gfp_nested_mask() instead of open coded masking stackdepot: use gfp_nested_mask() instead of open coded masking mm: lift gfp_kmemleak_mask() to gfp.h
| * mm/page-owner: use gfp_nested_mask() instead of open coded maskingDave Chinner2024-05-191-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The page-owner tracking code records stack traces during page allocation. To do this, it must do a memory allocation for the stack information from inside an existing memory allocation context. This internal allocation must obey the high level caller allocation constraints to avoid generating false positive warnings that have nothing to do with the code they are instrumenting/tracking (e.g. through lockdep reclaim state tracking) We also don't want recording stack traces to deplete emergency memory reserves - debug code is useless if it creates new issues that can't be replicated when the debug code is disabled. Switch the stack tracking allocation masking to use gfp_nested_mask() to address these issues. gfp_nested_mask() naturally strips GFP_ZONEMASK, too, which greatly simplifies this code. Link: https://lkml.kernel.org/r/20240430054604.4169568-4-david@fromorbit.com Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Marco Elver <elver@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * mm: lift gfp_kmemleak_mask() to gfp.hDave Chinner2024-05-191-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: fix nested allocation context filtering". This patchset is the followup to the comment I made earlier today: https://lore.kernel.org/linux-xfs/ZjAyIWUzDipofHFJ@dread.disaster.area/ Tl;dr: Memory allocations that are done inside the public memory allocation API need to obey the reclaim recursion constraints placed on the allocation by the original caller, including the "don't track recursion for this allocation" case defined by __GFP_NOLOCKDEP. These nested allocations are generally in debug code that is tracking something about the allocation (kmemleak, KASAN, etc) and so are allocating private kernel objects that only that debug system will use. Neither the page-owner code nor the stack depot code get this right. They also also clear GFP_ZONEMASK as a separate operation, which is completely redundant because the constraint filter applied immediately after guarantees that GFP_ZONEMASK bits are cleared. kmemleak gets this filtering right. It preserves the allocation constraints for deadlock prevention and clears all other context flags whilst also ensuring that the nested allocation will fail quickly, silently and without depleting emergency kernel reserves if there is no memory available. This can be made much more robust, immune to whack-a-mole games and the code greatly simplified by lifting gfp_kmemleak_mask() to include/linux/gfp.h and using that everywhere. Also document it so that there is no excuse for not knowing about it when writing new debug code that nests allocations. Tested with lockdep, KASAN + page_owner=on and kmemleak=on over multiple fstests runs with XFS. This patch (of 3): Any "internal" nested allocation done from within an allocation context needs to obey the high level allocation gfp_mask constraints. This is necessary for debug code like KASAN, kmemleak, lockdep, etc that allocate memory for saving stack traces and other information during memory allocation. If they don't obey things like __GFP_NOLOCKDEP or __GFP_NOWARN, they produce false positive failure detections. kmemleak gets this right by using gfp_kmemleak_mask() to pass through the relevant context flags to the nested allocation to ensure that the allocation follows the constraints of the caller context. KASAN recently was foudn to be missing __GFP_NOLOCKDEP due to stack depot allocations, and even more recently the page owner tracking code was also found to be missing __GFP_NOLOCKDEP support. We also don't wan't want KASAN or lockdep to drive the system into OOM kill territory by exhausting emergency reserves. This is something that kmemleak also gets right by adding (__GFP_NORETRY | __GFP_NOMEMALLOC | __GFP_NOWARN) to the allocation mask. Hence it is clear that we need to define a common nested allocation filter mask for these sorts of third party nested allocations used in debug code. So to start this process, lift gfp_kmemleak_mask() to gfp.h and rename it to gfp_nested_mask(), and convert the kmemleak callers to use it. Link: https://lkml.kernel.org/r/20240430054604.4169568-1-david@fromorbit.com Link: https://lkml.kernel.org/r/20240430054604.4169568-2-david@fromorbit.com Signed-off-by: Dave Chinner <dchinner@redhat.com> Reviewed-by: Marco Elver <elver@google.com> Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Vlastimil Babka <vbabka@suse.cz> Reviewed-by: Oscar Salvador <osalvador@suse.de> Cc: Andrey Konovalov <andreyknvl@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | mm: simplify and improve print_vma_addr() outputLinus Torvalds2024-05-221-13/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | Use '%pD' to print out the filename, and print out the actual offset within the file too, rather than just what the virtual address of the mapping is (which doesn't tell you anything about any mapping offsets). Also, use the exact vma_lookup() instead of find_vma() - the latter looks up any vma _after_ the address, which is of questionable value (yes, maybe you fell off the beginning, but you'd be more likely to fall off the end). Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'pull-set_blocksize' of ↵Linus Torvalds2024-05-211-27/+2
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs blocksize updates from Al Viro: "This gets rid of bogus set_blocksize() uses, switches it over to be based on a 'struct file *' and verifies that the caller has the device opened exclusively" * tag 'pull-set_blocksize' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: make set_blocksize() fail unless block device is opened exclusive set_blocksize(): switch to passing struct file * btrfs_get_bdev_and_sb(): call set_blocksize() only for exclusive opens swsusp: don't bother with setting block size zram: don't bother with reopening - just use O_EXCL for open swapon(2): open swap with O_EXCL swapon(2)/swapoff(2): don't bother with block size pktcdvd: sort set_blocksize() calls out bcache_register(): don't bother with set_blocksize()
| * swapon(2): open swap with O_EXCLAl Viro2024-05-021-17/+2
| | | | | | | | | | | | | | | | | | ... eliminating the need to reopen block devices so they could be exclusively held. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * swapon(2)/swapoff(2): don't bother with block sizeAl Viro2024-05-021-11/+1
| | | | | | | | | | | | | | | | | | | | | | | | once upon a time that used to matter; these days we do swap IO for swap devices at the level that doesn't give a damn about block size, buffer_head or anything of that sort - just attach the page to bio, set the location and size (the latter to PAGE_SIZE) and feed into queue. Reviewed-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge tag 'mm-nonmm-stable-2024-05-19-11-56' of ↵Linus Torvalds2024-05-191-0/+11
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull non-mm updates from Andrew Morton: "Mainly singleton patches, documented in their respective changelogs. Notable series include: - Some maintenance and performance work for ocfs2 in Heming Zhao's series "improve write IO performance when fragmentation is high". - Some ocfs2 bugfixes from Su Yue in the series "ocfs2 bugs fixes exposed by fstests". - kfifo header rework from Andy Shevchenko in the series "kfifo: Clean up kfifo.h". - GDB script fixes from Florian Rommel in the series "scripts/gdb: Fixes for $lx_current and $lx_per_cpu". - After much discussion, a coding-style update from Barry Song explaining one reason why inline functions are preferred over macros. The series is "codingstyle: avoid unused parameters for a function-like macro"" * tag 'mm-nonmm-stable-2024-05-19-11-56' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (62 commits) fs/proc: fix softlockup in __read_vmcore nilfs2: convert BUG_ON() in nilfs_finish_roll_forward() to WARN_ON() scripts: checkpatch: check unused parameters for function-like macro Documentation: coding-style: ask function-like macros to evaluate parameters nilfs2: use __field_struct() for a bitwise field selftests/kcmp: remove unused open mode nilfs2: remove calls to folio_set_error() and folio_clear_error() kernel/watchdog_perf.c: tidy up kerneldoc watchdog: allow nmi watchdog to use raw perf event watchdog: handle comma separated nmi_watchdog command line nilfs2: make superblock data array index computation sparse friendly squashfs: remove calls to set the folio error flag squashfs: convert squashfs_symlink_read_folio to use folio APIs scripts/gdb: fix detection of current CPU in KGDB scripts/gdb: make get_thread_info accept pointers scripts/gdb: fix parameter handling in $lx_per_cpu scripts/gdb: fix failing KGDB detection during probe kfifo: don't use "proxy" headers media: stih-cec: add missing io.h media: rc: add missing io.h ...
| * | mm: kmsan: implement kmsan_memmove()Alexander Potapenko2024-04-261-0/+11
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Provide a hook that can be used by custom memcpy implementations to tell KMSAN that the metadata needs to be copied. Without that, false positive reports are possible in the cases where KMSAN fails to intercept memory initialization. Link: https://lore.kernel.org/all/3b7dbd88-0861-4638-b2d2-911c97a4cadf@I-love.SAKURA.ne.jp/ Link: https://lkml.kernel.org/r/20240320101851.2589698-1-glider@google.com Signed-off-by: Alexander Potapenko <glider@google.com> Suggested-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Reviewed-by: Marco Elver <elver@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | Merge tag 'mm-stable-2024-05-17-19-19' of ↵Linus Torvalds2024-05-1980-3657/+4650
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull mm updates from Andrew Morton: "The usual shower of singleton fixes and minor series all over MM, documented (hopefully adequately) in the respective changelogs. Notable series include: - Lucas Stach has provided some page-mapping cleanup/consolidation/ maintainability work in the series "mm/treewide: Remove pXd_huge() API". - In the series "Allow migrate on protnone reference with MPOL_PREFERRED_MANY policy", Donet Tom has optimized mempolicy's MPOL_PREFERRED_MANY mode, yielding almost doubled performance in one test. - In their series "Memory allocation profiling" Kent Overstreet and Suren Baghdasaryan have contributed a means of determining (via /proc/allocinfo) whereabouts in the kernel memory is being allocated: number of calls and amount of memory. - Matthew Wilcox has provided the series "Various significant MM patches" which does a number of rather unrelated things, but in largely similar code sites. - In his series "mm: page_alloc: freelist migratetype hygiene" Johannes Weiner has fixed the page allocator's handling of migratetype requests, with resulting improvements in compaction efficiency. - In the series "make the hugetlb migration strategy consistent" Baolin Wang has fixed a hugetlb migration issue, which should improve hugetlb allocation reliability. - Liu Shixin has hit an I/O meltdown caused by readahead in a memory-tight memcg. Addressed in the series "Fix I/O high when memory almost met memcg limit". - In the series "mm/filemap: optimize folio adding and splitting" Kairui Song has optimized pagecache insertion, yielding ~10% performance improvement in one test. - Baoquan He has cleaned up and consolidated the early zone initialization code in the series "mm/mm_init.c: refactor free_area_init_core()". - Baoquan has also redone some MM initializatio code in the series "mm/init: minor clean up and improvement". - MM helper cleanups from Christoph Hellwig in his series "remove follow_pfn". - More cleanups from Matthew Wilcox in the series "Various page->flags cleanups". - Vlastimil Babka has contributed maintainability improvements in the series "memcg_kmem hooks refactoring". - More folio conversions and cleanups in Matthew Wilcox's series: "Convert huge_zero_page to huge_zero_folio" "khugepaged folio conversions" "Remove page_idle and page_young wrappers" "Use folio APIs in procfs" "Clean up __folio_put()" "Some cleanups for memory-failure" "Remove page_mapping()" "More folio compat code removal" - David Hildenbrand chipped in with "fs/proc/task_mmu: convert hugetlb functions to work on folis". - Code consolidation and cleanup work related to GUP's handling of hugetlbs in Peter Xu's series "mm/gup: Unify hugetlb, part 2". - Rick Edgecombe has developed some fixes to stack guard gaps in the series "Cover a guard gap corner case". - Jinjiang Tu has fixed KSM's behaviour after a fork+exec in the series "mm/ksm: fix ksm exec support for prctl". - Baolin Wang has implemented NUMA balancing for multi-size THPs. This is a simple first-cut implementation for now. The series is "support multi-size THP numa balancing". - Cleanups to vma handling helper functions from Matthew Wilcox in the series "Unify vma_address and vma_pgoff_address". - Some selftests maintenance work from Dev Jain in the series "selftests/mm: mremap_test: Optimizations and style fixes". - Improvements to the swapping of multi-size THPs from Ryan Roberts in the series "Swap-out mTHP without splitting". - Kefeng Wang has significantly optimized the handling of arm64's permission page faults in the series "arch/mm/fault: accelerate pagefault when badaccess" "mm: remove arch's private VM_FAULT_BADMAP/BADACCESS" - GUP cleanups from David Hildenbrand in "mm/gup: consistently call it GUP-fast". - hugetlb fault code cleanups from Vishal Moola in "Hugetlb fault path to use struct vm_fault". - selftests build fixes from John Hubbard in the series "Fix selftests/mm build without requiring "make headers"". - Memory tiering fixes/improvements from Ho-Ren (Jack) Chuang in the series "Improved Memory Tier Creation for CPUless NUMA Nodes". Fixes the initialization code so that migration between different memory types works as intended. - David Hildenbrand has improved follow_pte() and fixed an errant driver in the series "mm: follow_pte() improvements and acrn follow_pte() fixes". - David also did some cleanup work on large folio mapcounts in his series "mm: mapcount for large folios + page_mapcount() cleanups". - Folio conversions in KSM in Alex Shi's series "transfer page to folio in KSM". - Barry Song has added some sysfs stats for monitoring multi-size THP's in the series "mm: add per-order mTHP alloc and swpout counters". - Some zswap cleanups from Yosry Ahmed in the series "zswap same-filled and limit checking cleanups". - Matthew Wilcox has been looking at buffer_head code and found the documentation to be lacking. The series is "Improve buffer head documentation". - Multi-size THPs get more work, this time from Lance Yang. His series "mm/madvise: enhance lazyfreeing with mTHP in madvise_free" optimizes the freeing of these things. - Kemeng Shi has added more userspace-visible writeback instrumentation in the series "Improve visibility of writeback". - Kemeng Shi then sent some maintenance work on top in the series "Fix and cleanups to page-writeback". - Matthew Wilcox reduces mmap_lock traffic in the anon vma code in the series "Improve anon_vma scalability for anon VMAs". Intel's test bot reported an improbable 3x improvement in one test. - SeongJae Park adds some DAMON feature work in the series "mm/damon: add a DAMOS filter type for page granularity access recheck" "selftests/damon: add DAMOS quota goal test" - Also some maintenance work in the series "mm/damon/paddr: simplify page level access re-check for pageout" "mm/damon: misc fixes and improvements" - David Hildenbrand has disabled some known-to-fail selftests ni the series "selftests: mm: cow: flag vmsplice() hugetlb tests as XFAIL". - memcg metadata storage optimizations from Shakeel Butt in "memcg: reduce memory consumption by memcg stats". - DAX fixes and maintenance work from Vishal Verma in the series "dax/bus.c: Fixups for dax-bus locking"" * tag 'mm-stable-2024-05-17-19-19' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (426 commits) memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_order selftests/mm: hugetlb_madv_vs_map: avoid test skipping by querying hugepage size at runtime mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wp mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_fault selftests: cgroup: add tests to verify the zswap writeback path mm: memcg: make alloc_mem_cgroup_per_node_info() return bool mm/damon/core: fix return value from damos_wmark_metric_value mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPED selftests: cgroup: remove redundant enabling of memory controller Docs/mm/damon/maintainer-profile: allow posting patches based on damon/next tree Docs/mm/damon/maintainer-profile: change the maintainer's timezone from PST to PT Docs/mm/damon/design: use a list for supported filters Docs/admin-guide/mm/damon/usage: fix wrong schemes effective quota update command Docs/admin-guide/mm/damon/usage: fix wrong example of DAMOS filter matching sysfs file selftests/damon: classify tests for functionalities and regressions selftests/damon/_damon_sysfs: use 'is' instead of '==' for 'None' selftests/damon/_damon_sysfs: find sysfs mount point from /proc/mounts selftests/damon/_damon_sysfs: check errors from nr_schemes file reads mm/damon/core: initialize ->esz_bp from damos_quota_init_priv() selftests/damon: add a test for DAMOS quota goal ...
| * | memcg, oom: cleanup unused memcg_oom_gfp_mask and memcg_oom_orderXiu Jianfeng2024-05-121-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 857f21397f71 ("memcg, oom: remove unnecessary check in mem_cgroup_oom_synchronize()"), memcg_oom_gfp_mask and memcg_oom_order are no longer used any more. Link: https://lkml.kernel.org/r/20240509032628.1217652-1-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Cc: Benjamin Segall <bsegall@google.com> Cc: Daniel Bristot de Oliveira <bristot@redhat.com> Cc: Dietmar Eggemann <dietmar.eggemann@arm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Steven Rostedt (Google) <rostedt@goodmis.org> Cc: Valentin Schneider <vschneid@redhat.com> Cc: Vincent Guittot <vincent.guittot@linaro.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_wpOscar Salvador2024-05-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 1cb9dc4b475c ("mm: hwpoison: support recovery from HugePage copy-on-write faults") added support to use the mc variants when coping hugetlb pages on CoW faults. Add the missing VM_FAULT_SET_HINDEX, so the right si_addr_lsb will be passed to userspace to report the extension of the faulty area. Link: https://lkml.kernel.org/r/20240509100148.22384-3-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/hugetlb: add missing VM_FAULT_SET_HINDEX in hugetlb_faultOscar Salvador2024-05-121-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Minor fixups for hugetlb fault path". This series contains a couple of fixups for hugetlb_fault and hugetlb_wp respectively, where a VM_FAULT_SET_HINDEX call was missing. I did not bother with a Fixes tag because the missing piece here is that we will not report to userspace the right extension of the faulty area by adjusting struct kernel_siginfo.si_addr_lsb, but I do not consider that to be a big issue because I assume that userspace already knows the size of the mapping anyway. This patch (of 2): commit af19487f00f3 ("mm: make PTE_MARKER_SWAPIN_ERROR more general") added the code to handle pte_markers in hugetlb faulting path. In case of an UFFD_POISON event, a PTE_MARKER_POISONED will be created and we will return VM_FAULT_HWPOISON_LARGE upon detecting that in the fault path. Add the missing VM_FAULT_SET_HINDEX, so the right si_addr_lsb will be passed to userspace to report the extension of the faulty area. Link: https://lkml.kernel.org/r/20240509100148.22384-1-osalvador@suse.de Link: https://lkml.kernel.org/r/20240509100148.22384-2-osalvador@suse.de Signed-off-by: Oscar Salvador <osalvador@suse.de> Acked-by: Peter Xu <peterx@redhat.com> Acked-by: Axel Rasmussen <axelrasmussen@google.com> Cc: Liu Shixin <liushixin2@huawei.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: memcg: make alloc_mem_cgroup_per_node_info() return boolXiu Jianfeng2024-05-121-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | alloc_mem_cgroup_per_node_info() returns int that doesn't map to any errno error code. The only existing caller doesn't really need an error code so change the function to return bool (true on success) because this is slightly less confusing and more consistent with the other code. Link: https://lkml.kernel.org/r/20240507132324.1158510-1-xiujianfeng@huawei.com Signed-off-by: Xiu Jianfeng <xiujianfeng@huawei.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Roman Gushchin <roman.gushchin@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/damon/core: fix return value from damos_wmark_metric_valueAlex Rusuf2024-05-121-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | damos_wmark_metric_value's return value is 'unsigned long', so returning -EINVAL as 'unsigned long' may turn out to be very different from the expected one (using 2's complement) and treat as usual matric's value. So, fix that, checking if returned value is not 0. Link: https://lkml.kernel.org/r/20240506180238.53842-1-sj@kernel.org Fixes: ee801b7dd782 ("mm/damon/schemes: activate schemes based on a watermarks mechanism") Signed-off-by: Alex Rusuf <yorha.op@gmail.com> Reviewed-by: SeongJae Park <sj@kernel.org> Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: do not update memcg stats for NR_{FILE/SHMEM}_PMDMAPPEDYosry Ahmed2024-05-121-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, all NR_VM_EVENT_ITEMS stats were maintained per-memcg, although some of those fields are not exposed anywhere. Commit 14e0f6c957e39 ("memcg: reduce memory for the lruvec and memcg stats") changed this such that we only maintain the stats we actually expose per-memcg via a translation table. Additionally, commit 514462bbe927b ("memcg: warn for unexpected events and stats") added a warning if a per-memcg stat update is attempted for a stat that is not in the translation table. The warning started firing for the NR_{FILE/SHMEM}_PMDMAPPED stat updates in the rmap code. These stats are not maintained per-memcg, and hence are not in the translation table. Do not use __lruvec_stat_mod_folio() when updating NR_FILE_PMDMAPPED and NR_SHMEM_PMDMAPPED. Use __mod_node_page_state() instead, which updates the global per-node stats only. Link: https://lkml.kernel.org/r/20240506192924.271999-1-yosryahmed@google.com Fixes: 514462bbe927 ("memcg: warn for unexpected events and stats") Signed-off-by: Yosry Ahmed <yosryahmed@google.com> Reported-by: syzbot+9319a4268a640e26b72b@syzkaller.appspotmail.com Closes: https://lore.kernel.org/lkml/0000000000001b9d500617c8b23c@google.com Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/damon/core: initialize ->esz_bp from damos_quota_init_priv()SeongJae Park2024-05-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm/damon: misc fixes and improvements". Add miscelleneous and non-urgent fixes and improvements for DAMON code, selftests, and documents. This patch (of 10): damos_quota_init_priv() function should initialize all private fields of struct damos_quota. However, it is not initializing ->esz_bp field. This could result in use of uninitialized variable from damon_feed_loop_next_input() function. There is no such issue at the moment because every caller of the function is passing damos_quota object that already having the field zero value. But we cannot guarantee the future, and the function is not doing what it is promising. A bug is a bug. This fix is for preventing possible future issues. Link: https://lkml.kernel.org/r/20240503180318.72798-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240503180318.72798-2-sj@kernel.org Fixes: 9294a037c015 ("mm/damon/core: implement goal-oriented feedback-driven quota auto-tuning") Signed-off-by: SeongJae Park <sj@kernel.org> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | thp: remove HPAGE_PMD_ORDER minimum assertionMatthew Wilcox (Oracle)2024-05-071-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | We now handle order-1 folios correctly, so we don't need this assertion any more. Link: https://lkml.kernel.org/r/20240429190114.3126789-1-willy@infradead.org Signed-off-by: Matthew Wilcox (Oracle) <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/vmscan: remove ignore_references argument of reclaim_folio_list()SeongJae Park2024-05-071-6/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All reclaim_folio_list() callers are passing 'true' for 'ignore_references' parameter. In other words, the parameter is not really being used. Simplify the code by removing the parameter. Link: https://lkml.kernel.org/r/20240429224451.67081-5-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/vmscan: remove ignore_references argument of reclaim_pages()SeongJae Park2024-05-074-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All reclaim_pages() callers are setting 'ignore_references' parameter 'true'. In other words, the parameter is not really being used. Remove the argument to make it simple. Link: https://lkml.kernel.org/r/20240429224451.67081-4-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/damon/paddr: do page level access check for pageout DAMOS action on its ownSeongJae Park2024-05-071-4/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 'pageout' DAMOS action implementation of 'paddr' DAMON operations set asks reclaim_pages() to do page level access check if the user is not asking DAMOS to do that on its own. Simplify the logic by making the check always be done by 'paddr'. Link: https://lkml.kernel.org/r/20240429224451.67081-3-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/damon/paddr: avoid unnecessary page level access check for pageout DAMOS ↵SeongJae Park2024-05-071-1/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | action Patch series "mm/damon/paddr: simplify page level access re-check for pageout. The 'pageout' DAMOS action implementation of 'paddr' asks reclaim_pages() to do page level access check again. But the user can ask 'paddr' to do the page level access check on its own, using DAMOS filter of 'young page' type. Meanwhile, 'paddr' is the only user of reclaim_pages() that asks the page level access check. Make 'paddr' does the page level access check on its own always, and simplify reclaim_pages() by removing the page level access check request handling logic. As a result of the change for reclaim_pages(), reclaim_folio_list(), which is called by reclaim_pages(), also no more need to do the page level access check. Simplify the function, too. This patch (of 4): 'pageout' DAMOS action implementation of 'paddr' asks reclaim_pages() to do the page level access check. User could ask DAMOS to do the page level access check on its own using 'young page' type DAMOS filter. In the case, pageout DAMOS action unnecessarily asks reclaim_pages() to do the check again. Ask the page level access check only if the scheme is not having the filter. Link: https://lkml.kernel.org/r/20240429224451.67081-1-sj@kernel.org Link: https://lkml.kernel.org/r/20240429224451.67081-2-sj@kernel.org Signed-off-by: SeongJae Park <sj@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/gup: fix hugepd handling in hugetlb reworkPeter Xu2024-05-071-25/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit a12083d721d7 added hugepd handling for gup-slow, reusing gup-fast functions. follow_hugepd() correctly took the vma pointer in, however didn't pass it over into the lower functions, which was overlooked. The issue is gup_fast_hugepte() uses the vma pointer to make the correct decision on whether an unshare is needed for a FOLL_PIN|FOLL_LONGTERM. Now without vma ponter it will constantly return "true" (needs an unshare) for a page cache, even though in the SHARED case it will be wrong to unshare. The other problem is, even if an unshare is needed, it now returns 0 rather than -EMLINK, which will not trigger a follow up FAULT_FLAG_UNSHARE fault. That will need to be fixed too when the unshare is wanted. gup_longterm test didn't expose this issue in the past because it didn't yet test R/O unshare in this case, another separate patch will enable that in future tests. Fix it by passing vma correctly to the bottom, rename gup_fast_hugepte() back to gup_hugepte() as it is shared between the fast/slow paths, and also allow -EMLINK to be returned properly by gup_hugepte() even though gup-fast will take it the same as zero. Link: https://lkml.kernel.org/r/20240430131303.264331-1-peterx@redhat.com Fixes: a12083d721d7 ("mm/gup: handle hugepd for follow_page()") Signed-off-by: Peter Xu <peterx@redhat.com> Reported-by: David Hildenbrand <david@redhat.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: John Hubbard <jhubbard@nvidia.com> Cc: Lorenzo Stoakes <lstoakes@gmail.com> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/hugetlb: align cma on allocation order, not demotion orderFrank van der Linden2024-05-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Align the CMA area for hugetlb gigantic pages to their size, not the size that they can be demoted to. Otherwise there might be misaligned sections at the start and end of the CMA area that will never be used for hugetlb page allocations. Link: https://lkml.kernel.org/r/20240430161437.2100295-1-fvdl@google.com Fixes: a01f43901cfb ("hugetlb: be sure to free demoted CMA pages to CMA") Signed-off-by: Frank van der Linden <fvdl@google.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: memcg: use READ_ONCE()/WRITE_ONCE() to access stock->nr_pagesBreno Leitao2024-05-071-9/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A memcg pointer in the per-cpu stock can be accessed by drain_all_stock() and consume_stock() in parallel, causing a potential race, which is believed to e harmless. KCSAN shows this data-race clearly in the splat below: BUG: KCSAN: data-race in drain_all_stock.part.0 / try_charge_memcg write to 0xffff88903f8b0788 of 4 bytes by task 35901 on cpu 2: try_charge_memcg (mm/memcontrol.c:2323 mm/memcontrol.c:2746) __mem_cgroup_charge (mm/memcontrol.c:7287 mm/memcontrol.c:7301) do_anonymous_page (mm/memory.c:1054 mm/memory.c:4375 mm/memory.c:4433) __handle_mm_fault (mm/memory.c:3878 mm/memory.c:5300 mm/memory.c:5441) handle_mm_fault (mm/memory.c:5606) do_user_addr_fault (arch/x86/mm/fault.c:1363) exc_page_fault (./arch/x86/include/asm/irqflags.h:37 ./arch/x86/include/asm/irqflags.h:72 arch/x86/mm/fault.c:1513 arch/x86/mm/fault.c:1563) asm_exc_page_fault (./arch/x86/include/asm/idtentry.h:623) read to 0xffff88903f8b0788 of 4 bytes by task 287 on cpu 27: drain_all_stock.part.0 (mm/memcontrol.c:2433) mem_cgroup_css_offline (mm/memcontrol.c:5398 mm/memcontrol.c:5687) css_killed_work_fn (kernel/cgroup/cgroup.c:5521 kernel/cgroup/cgroup.c:5794) process_one_work (kernel/workqueue.c:3254) worker_thread (kernel/workqueue.c:3329 kernel/workqueue.c:3416) kthread (kernel/kthread.c:388) ret_from_fork (arch/x86/kernel/process.c:147) ret_from_fork_asm (arch/x86/entry/entry_64.S:257) value changed: 0x00000014 -> 0x00000013 This happens because drain_all_stock() is reading stock->nr_pages, while consume_stock() might be updating the same address, causing a potential data-race. Make the shared addresses bulletproof regarding to reads and writes, similarly to what stock->cached_objcg and stock->cached. Annotate all accesses to stock->nr_pages with READ_ONCE()/WRITE_ONCE(). Link: https://lkml.kernel.org/r/20240501095420.679208-1-leitao@debian.org Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Muchun Song <songmuchun@bytedance.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: fix race between __split_huge_pmd_locked() and GUP-fastRyan Roberts2024-05-072-23/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __split_huge_pmd_locked() can be called for a present THP, devmap or (non-present) migration entry. It calls pmdp_invalidate() unconditionally on the pmdp and only determines if it is present or not based on the returned old pmd. This is a problem for the migration entry case because pmd_mkinvalid(), called by pmdp_invalidate() must only be called for a present pmd. On arm64 at least, pmd_mkinvalid() will mark the pmd such that any future call to pmd_present() will return true. And therefore any lockless pgtable walker could see the migration entry pmd in this state and start interpretting the fields as if it were present, leading to BadThings (TM). GUP-fast appears to be one such lockless pgtable walker. x86 does not suffer the above problem, but instead pmd_mkinvalid() will corrupt the offset field of the swap entry within the swap pte. See link below for discussion of that problem. Fix all of this by only calling pmdp_invalidate() for a present pmd. And for good measure let's add a warning to all implementations of pmdp_invalidate[_ad](). I've manually reviewed all other pmdp_invalidate[_ad]() call sites and believe all others to be conformant. This is a theoretical bug found during code review. I don't have any test case to trigger it in practice. Link: https://lkml.kernel.org/r/20240501143310.1381675-1-ryan.roberts@arm.com Link: https://lore.kernel.org/all/0dd7827a-6334-439a-8fd0-43c98e6af22b@arm.com/ Fixes: 84c3fc4e9c56 ("mm: thp: check pmd migration entry in common path") Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Zi Yan <ziy@nvidia.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Andreas Larsson <andreas@gaisler.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jonathan Corbet <corbet@lwn.net> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Will Deacon <will@kernel.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm/debug_vm_pgtable: test pmd_leaf() behavior with pmd_mkinvalid()Ryan Roberts2024-05-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | An invalidated pmd should still cause pmd_leaf() to return true. Let's test for that to ensure all arches remain consistent. Link: https://lkml.kernel.org/r/20240501144439.1389048-1-ryan.roberts@arm.com Signed-off-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | memcg: use proper type for mod_memcg_stateShakeel Butt2024-05-071-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The memcg stats update functions can take arbitrary integer but the only input which make sense is enum memcg_stat_item and we don't want these functions to be called with arbitrary integer, so replace the parameter type with enum memcg_stat_item and compiler will be able to warn if memcg stat update functions are called with incorrect index value. Link: https://lkml.kernel.org/r/20240501172617.678560-9-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | memcg: warn for unexpected events and statsShakeel Butt2024-05-071-16/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To reduce memory usage by the memcg events and stats, the kernel uses indirection table and only allocate stats and events which are being used by the memcg code. To make this more robust, let's add warnings where unexpected stats and events indexes are used. Link: https://lkml.kernel.org/r/20240501172617.678560-8-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | mm: cleanup WORKINGSET_NODES in workingsetShakeel Butt2024-05-071-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | WORKINGSET_NODES is not exposed in the memcg stats and thus there is no need to use the memcg specific stat update functions for it. In future if we decide to expose WORKINGSET_NODES in the memcg stats, we can revert this patch. Link: https://lkml.kernel.org/r/20240501172617.678560-7-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | memcg: cleanup __mod_memcg_lruvec_stateShakeel Butt2024-05-071-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are no memcg specific stats for NR_SHMEM_PMDMAPPED and NR_FILE_PMDMAPPED. Let's remove them. Link: https://lkml.kernel.org/r/20240501172617.678560-6-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Yosry Ahmed <yosryahmed@google.com> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Reviewed-by: T.J. Mercier <tjmercier@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | memcg: reduce memory for the lruvec and memcg statsShakeel Butt2024-05-071-20/+114
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At the moment, the amount of memory allocated for stats related structs in the mem_cgroup corresponds to the size of enum node_stat_item. However not all fields in enum node_stat_item have corresponding memcg stats. So, let's use indirection mechanism similar to the one used for memcg vmstats management. For a given x86_64 config, the size of stats with and without patch is: structs size in bytes w/o with struct lruvec_stats 1128 648 struct lruvec_stats_percpu 752 432 struct memcg_vmstats 1832 1352 struct memcg_vmstats_percpu 1280 960 The memory savings are further compounded by the fact that these structs are allocated for each cpu and for each node. To be precise, for each memcg the memory saved would be: Memory saved = ((21 * 3 * NR_NODES) + (21 * 2 * NR_NODES * NR_CPUS) + (21 * 3) + (21 * 2 * NR_CPUS)) * sizeof(long) Where 21 is the number of fields eliminated. Link: https://lkml.kernel.org/r/20240501172617.678560-5-shakeel.butt@linux.dev Signed-off-by: Shakeel Butt <shakeel.butt@linux.dev> Reviewed-by: Roman Gushchin <roman.gushchin@linux.dev> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Michal Hocko <mhocko@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: T.J. Mercier <tjmercier@google.com> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>