summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* mm/hugetlb: avoid looping to the same hugepage if !pages and !vmasZhigang Lu2019-12-011-0/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When mmapping an existing hugetlbfs file with MAP_POPULATE, we find it is very time consuming. For example, mmapping a 128GB file takes about 50 milliseconds. Sampling with perfevent shows it spends 99% time in the same_page loop in follow_hugetlb_page(). samples: 205 of event 'cycles', Event count (approx.): 136686374 - 99.04% test_mmap_huget [kernel.kallsyms] [k] follow_hugetlb_page follow_hugetlb_page __get_user_pages __mlock_vma_pages_range __mm_populate vm_mmap_pgoff sys_mmap_pgoff sys_mmap system_call_fastpath __mmap64 follow_hugetlb_page() is called with pages=NULL and vmas=NULL, so for each hugepage, we run into the same_page loop for pages_per_huge_page() times, but doing nothing. With this change, it takes less then 1 millisecond to mmap a 128GB file in hugetlbfs. Link: http://lkml.kernel.org/r/1567581712-5992-1-git-send-email-totty.lu@gmail.com Signed-off-by: Zhigang Lu <tonnylu@tencent.com> Reviewed-by: Haozhong Zhang <hzhongzhang@tencent.com> Reviewed-by: Zongming Zhang <knightzhang@tencent.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: remove unused hstate in hugetlb_fault_mutex_hash()Wei Yang2019-12-014-14/+8
| | | | | | | | | | | | | | | | | | | | | The first parameter hstate in function hugetlb_fault_mutex_hash() is not used anymore. This patch removes it. [akpm@linux-foundation.org: various build fixes] [cai@lca.pw: fix a GCC compilation warning] Link: http://lkml.kernel.org/r/1570544108-32331-1-git-send-email-cai@lca.pw Link: http://lkml.kernel.org/r/20191005003302.785-1-richardw.yang@linux.intel.com Signed-off-by: Wei Yang <richardw.yang@linux.intel.com> Signed-off-by: Qian Cai <cai@lca.pw> Suggested-by: Andrew Morton <akpm@linux-foundation.org> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Hugh Dickins <hughd@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: remove duplicated codeMina Almasry2019-12-011-62/+57
| | | | | | | | | | | | | | | | | | Remove duplicated code between region_chg and region_add, and refactor it into a common function, add_reservation_in_range. This is mostly done because there is a follow up change in another series that disables region coalescing in region_add, and I want to make that change in one place only. It should improve maintainability anyway on its own. [akpm@linux-foundation.org: coding style fixes] Link: http://lkml.kernel.org/r/20190919200428.188797-3-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlb: region_chg provides only cache entryMina Almasry2019-12-011-52/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current behavior is that region_chg provides both a cache entry in resv->region_cache, AND a placeholder entry in resv->regions. region_add first tries to use the placeholder, and if it finds that the placeholder has been deleted by a racing region_del call, it uses the cache entry. This behavior is completely unnecessary and is removed in this patch for a couple of reasons: 1. region_add needs to either find a cached file_region entry in resv->region_cache, or find an entry in resv->regions to expand. It does not need both. 2. region_chg adding a placeholder entry in resv->regions opens up a possible race with region_del, where region_chg adds a placeholder region in resv->regions, and this region is deleted by a racing call to region_del during region_chg execution or before region_add is called. Removing the race makes the code easier to reason about and maintain. In addition, a follow up patch in another series that disables region coalescing, which would be further complicated if the race with region_del exists. Link: http://lkml.kernel.org/r/20190919200428.188797-2-almasrymina@google.com Signed-off-by: Mina Almasry <almasrymina@google.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Rientjes <rientjes@google.com> Cc: Shakeel Butt <shakeelb@google.com> Cc: Greg Thelen <gthelen@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlbfs: take read_lock on i_mmap for PMD sharingWaiman Long2019-12-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A customer with large SMP systems (up to 16 sockets) with application that uses large amount of static hugepages (~500-1500GB) are experiencing random multisecond delays. These delays were caused by the long time it took to scan the VMA interval tree with mmap_sem held. The sharing of huge PMD does not require changes to the i_mmap at all. Therefore, we can just take the read lock and let other threads searching for the right VMA share it in parallel. Once the right VMA is found, either the PMD lock (2M huge page for x86-64) or the mm->page_table_lock will be acquired to perform the actual PMD sharing. Lock contention, if present, will happen in the spinlock. That is much better than contention in the rwsem where the time needed to scan the the interval tree is indeterminate. With this patch applied, the customer is seeing significant performance improvement over the unpatched kernel. Link: http://lkml.kernel.org/r/20191107211809.9539-1-longman@redhat.com Signed-off-by: Waiman Long <longman@redhat.com> Suggested-by: Mike Kravetz <mike.kravetz@oracle.com> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Davidlohr Bueso <dave@stgolabs.net> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Ingo Molnar <mingo@redhat.com> Cc: Will Deacon <will.deacon@arm.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlbfs: add O_TMPFILE supportPiotr Sarna2019-12-011-4/+24
| | | | | | | | | | | | | | | | | | | | | | | With hugetlbfs, a common pattern for mapping anonymous huge pages is to create a temporary file first. Currently libraries like libhugetlbfs and seastar create these with a standard mkstemp+unlink trick, but it would be more robust to be able to simply pass the O_TMPFILE flag to open(). O_TMPFILE is already supported by several file systems like ext4 and xfs. The implementation simply uses the existi= ng d_tmpfile utility function to instantiate the dcache entry for the file. Tested manually by successfully creating a temporary file by opening it with (O_TMPFILE|O_RDWR) on mounted hugetlbfs and successfully mapping 2M huge pages with it. Without the patch, trying to open a file with O_TMPFILE results in -ENOSUP. Link: http://lkml.kernel.org/r/bc9383eff6e1374d79f3a92257ae829ba1e6ae60.1573285189.git.p.sarna@tlen.pl Signed-off-by: Piotr Sarna <p.sarna@tlen.pl> Reviewed-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Michal Hocko <mhocko@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlbfs: convert macros to static inline, fix sparse warningMike Kravetz2019-12-011-22/+115
| | | | | | | | | | | | | | | | | | huge_pte_offset() produced a sparse warning due to an improper return type when the kernel was built with !CONFIG_HUGETLB_PAGE. Fix the bad type and also convert all the macros in this block to static inline wrappers. Two existing wrappers in this block had lines in excess of 80 columns so clean those up as well. No functional change. Link: http://lkml.kernel.org/r/20191112194558.139389-3-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Ben Dooks <ben.dooks@codethink.co.uk> Suggested-by: Jason Gunthorpe <jgg@ziepe.ca> Cc: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* powerpc/mm: remove pmd_huge/pud_huge stubs and include hugetlb.hMike Kravetz2019-12-013-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "hugetlbfs: convert macros to static inline, fix sparse warning". The definition for huge_pte_offset() in <linux/hugetlb.h> causes a sparse warning in the !CONFIG_HUGETLB_PAGE. Fix this as well as converting all macros in this block of definitions to static inlines for better type checking. When making the above changes, build errors were found in powerpc due to duplicate definitions. A separate powerpc specific patch is included as a requisite to remove the definitions and get them from <linux/hugetlb.h>. This patch (of 2): This removes the power specific stubs created by commit aad71e3928be ("powerpc/mm: Fix build break with RADIX=y & HUGETLBFS=n") used when !CONFIG_HUGETLB_PAGE. Instead, it addresses the build break by getting the definitions from <linux/hugetlb.h>. This allows the macros in <linux/hugetlb.h> to be replaced with static inlines. Link: http://lkml.kernel.org/r/20191112194558.139389-2-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Acked-by: Michael Ellerman <mpe@ellerman.id.au> Cc: Ben Dooks <ben.dooks@codethink.co.uk> Cc: Jason Gunthorpe <jgg@ziepe.ca> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/hugetlbfs: fix error handling when setting up mountsMike Kravetz2019-12-011-9/+22
| | | | | | | | | | | | | | | | | | | | | | | It is assumed that the hugetlbfs_vfsmount[] array will contain either a valid vfsmount pointer or NULL for each hstate after initialization. Changes made while converting to use fs_context broke this assumption. While fixing the hugetlbfs_vfsmount issue, it was discovered that init_hugetlbfs_fs never did correctly clean up when encountering a vfs mount error. It was found during code inspection. A small memory allocation failure would be the most likely cause of taking a error path with the bug. This is unlikely to happen as this is early init code. Link: http://lkml.kernel.org/r/94b6244d-2c24-e269-b12c-e3ba694b242d@oracle.com Reported-by: Chengguang Xu <cgxu519@mykernel.net> Fixes: 32021982a324 ("hugetlbfs: Convert to fs_context") Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Cc: David Howells <dhowells@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* hugetlbfs: hugetlb_fault_mutex_hash() cleanupMike Kravetz2019-12-014-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | A new clang diagnostic (-Wsizeof-array-div) warns about the calculation to determine the number of u32's in an array of unsigned longs. Suppress warning by adding parentheses. While looking at the above issue, noticed that the 'address' parameter to hugetlb_fault_mutex_hash is no longer used. So, remove it from the definition and all callers. No functional change. Link: http://lkml.kernel.org/r/20190919011847.18400-1-mike.kravetz@oracle.com Signed-off-by: Mike Kravetz <mike.kravetz@oracle.com> Reported-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Davidlohr Bueso <dbueso@suse.de> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Ilie Halip <ilie.halip@gmail.com> Cc: David Bolvansky <david.bolvansky@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: support memblock alloc on the exact node for sparse_buffer_init()Yunfeng Ye2019-12-013-12/+58
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | sparse_buffer_init() use memblock_alloc_try_nid_raw() to allocate memory for page management structure, if memory allocation fails from specified node, it will fall back to allocate from other nodes. Normally, the page management structure will not exceed 2% of the total memory, but a large continuous block of allocation is needed. In most cases, memory allocation from the specified node will succeed, but a node memory become highly fragmented will fail. we expect to allocate memory base section rather than by allocating a large block of memory from other NUMA nodes Add memblock_alloc_exact_nid_raw() for this situation, which allocate boot memory block on the exact node. If a large contiguous block memory allocate fail in sparse_buffer_init(), it will fall back to allocate small block memory base section. Link: http://lkml.kernel.org/r/66755ea7-ab10-8882-36fd-3e02b03775d5@huawei.com Signed-off-by: Yunfeng Ye <yeyunfeng@huawei.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Cc: Wei Yang <richardw.yang@linux.intel.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Dan Williams <dan.j.williams@intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memblock: correct doc for functionCao jin2019-12-011-1/+1
| | | | | | | | | | | | Change "max_addr" to "end" for less confusion in memblock_alloc_range_nid comments. Link: http://lkml.kernel.org/r/20191113051822.3296-1-ruansy.fnst@cn.fujitsu.com Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com> Signed-off-by: Shiyang Ruan <ruansy.fnst@cn.fujitsu.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memblock.c: cleanup docCao jin2019-12-011-24/+20
| | | | | | | | | | | | | | | | | fix typos for: elaboarte -> elaborate architecure -> architecture compltes -> completes And, convert the markup :c:func:`foo` to foo() as kernel documentation toolchain can recognize foo() as a function. Link: http://lkml.kernel.org/r/20190912123127.8694-1-caoj.fnst@cn.fujitsu.com Signed-off-by: Cao jin <caoj.fnst@cn.fujitsu.com> Suggested-by: Mike Rapoport <rppt@linux.ibm.com> Reviewed-by: Mike Rapoport <rppt@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/mempolicy.c: fix checking unmapped holes for mbindLi Xinhai2019-12-011-13/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mbind() is required to report EFAULT if range, specified by addr and len, contains unmapped holes. In current implementation, below rules are applied for this checking: 1: Unmapped holes at any part of the specified range should be reported as EFAULT if mbind() for none MPOL_DEFAULT cases; 2: Unmapped holes at any part of the specified range should be ignored (do not reprot EFAULT) if mbind() for MPOL_DEFAULT case; 3: The whole range in an unmapped hole should be reported as EFAULT; Note that rule 2 does not fullfill the mbind() API definition, but since that behavior has existed for long days (the internal flag MPOL_MF_DISCONTIG_OK is for this purpose), this patch does not plan to change it. In current code, application observed inconsistent behavior on rule 1 and rule 2 respectively. That inconsistency is fixed as below details. Cases of rule 1: - Hole at head side of range. Current code reprot EFAULT, no change by this patch. [ vma ][ hole ][ vma ] [ range ] - Hole at middle of range. Current code report EFAULT, no change by this patch. [ vma ][ hole ][ vma ] [ range ] - Hole at tail side of range. Current code do not report EFAULT, this patch fixes it. [ vma ][ hole ][ vma ] [ range ] Cases of rule 2: - Hole at head side of range. Current code reports EFAULT, this patch fixes it. [ vma ][ hole ][ vma ] [ range ] - Hole at middle of range. Current code does not report EFAULT, no change by this patch. [ vma ][ hole ][ vma] [ range ] - Hole at tail side of range. Current code does not report EFAULT, no change by this patch. [ vma ][ hole ][ vma] [ range ] This patch has no changes to rule 3. The unmapped hole checking can also be handled by using .pte_hole(), instead of .test_walk(). But .pte_hole() is called for holes inside and outside vma, which causes more cost, so this patch keeps the original design with .test_walk(). Link: http://lkml.kernel.org/r/1573218104-11021-3-git-send-email-lixinhai.lxh@gmail.com Fixes: 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()") Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: linux-man <linux-man@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/mempolicy.c: check range first in queue_pages_test_walkLi Xinhai2019-12-011-9/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: Fix checking unmapped holes for mbind", v4. This patchset fix checking unmapped holes for mbind(). First patch makes sure the vma been correctly tracked in .test_walk(), so each time when .test_walk() is called, the neighborhood of two vma is correct. Current problem is that the !vma_migratable() check could cause return immediately without update tracking to vma. Second patch fix the inconsistent report of EFAULT when mbind() is called for MPOL_DEFAULT and non MPOL_DEFAULT cases, so application do not need to have workaround code to handle this special behavior. Currently there are two problems, one is that the .test_walk() can not know there is hole at tail side of range, because .test_walk() only call for vma not for hole. The other one is that mbind_range() checks for hole at head side of range but do not consider the MPOL_MF_DISCONTIG_OK flag as done in .test_walk(). This patch (of 2): Checking unmapped hole and updating the previous vma must be handled first, otherwise the unmapped hole could be calculated from a wrong previous vma. Several commits were relevant to this error: - commit 6f4576e3687b ("mempolicy: apply page table walker on queue_pages_range()") This commit was correct, the VM_PFNMAP check was after updating previous vma - commit 48684a65b4e3 ("mm: pagewalk: fix misbehavior of walk_page_range for vma(VM_PFNMAP)") This commit added VM_PFNMAP check before updating previous vma. Then, there were two VM_PFNMAP check did same thing twice. - commit acda0c334028 ("mm/mempolicy.c: get rid of duplicated check for vma(VM_PFNMAP) in queue_page s_range()") This commit tried to fix the duplicated VM_PFNMAP check, but it wrongly removed the one which was after updating vma. Link: http://lkml.kernel.org/r/1573218104-11021-2-git-send-email-lixinhai.lxh@gmail.com Fixes: acda0c334028 (mm/mempolicy.c: get rid of duplicated check for vma(VM_PFNMAP) in queue_pages_range()) Signed-off-by: Li Xinhai <lixinhai.lxh@gmail.com> Reviewed-by: Naoya Horiguchi <n-horiguchi@ah.jp.nec.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Hugh Dickins <hughd@google.com> Cc: linux-man <linux-man@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/z3fold.c: add inter-page compactionVitaly Wool2019-12-011-72/+303
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For each page scheduled for compaction (e. g. by z3fold_free()), try to apply inter-page compaction before running the traditional/ existing intra-page compaction. That means, if the page has only one buddy, we treat that buddy as a new object that we aim to place into an existing z3fold page. If such a page is found, that object is transferred and the old page is freed completely. The transferred object is named "foreign" and treated slightly differently thereafter. Namely, we increase "foreign handle" counter for the new page. Pages with non-zero "foreign handle" count become unmovable. This patch implements "foreign handle" detection when a handle is freed to decrement the foreign handle counter accordingly, so a page may as well become movable again as the time goes by. As a result, we almost always have exactly 3 objects per page and significantly better average compression ratio. [cai@lca.pw: fix -Wunused-but-set-variable warnings] Link: http://lkml.kernel.org/r/1570542062-29144-1-git-send-email-cai@lca.pw [vitalywool@gmail.com: avoid subtle race when freeing slots] Link: http://lkml.kernel.org/r/20191127152118.6314b99074b0626d4c5a8835@gmail.com [vitalywool@gmail.com: compact objects more accurately] Link: http://lkml.kernel.org/r/20191127152216.6ad33745a21ba71c53606acb@gmail.com [vitalywool@gmail.com: protect handle reads] Link: http://lkml.kernel.org/r/20191127152345.8059852f60947686674d726d@gmail.com Link: http://lkml.kernel.org/r/20191006041457.24113-1-vitalywool@gmail.com Signed-off-by: Vitaly Wool <vitaly.vul@sony.com> Cc: Dan Streetman <ddstreet@ieee.org> Cc: Henry Burns <henrywolfeburns@gmail.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kernel: sysctl: make drop_caches write-onlyJohannes Weiner2019-12-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the drop_caches proc file and sysctl read back the last value written, suggesting this is somehow a stateful setting instead of a one-time command. Make it write-only, like e.g. compact_memory. While mitigating a VM problem at scale in our fleet, there was confusion about whether writing to this file will permanently switch the kernel into a non-caching mode. This influences the decision making in a tense situation, where tens of people are trying to fix tens of thousands of affected machines: Do we need a rollback strategy? What are the performance implications of operating in a non-caching state for several days? It also caused confusion when the kernel team said we may need to write the file several times to make sure it's effective ("But it already reads back 3?"). Link: http://lkml.kernel.org/r/20191031221602.9375-1-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Chris Down <chris@chrisdown.name> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmscan.c: fix typo in commentXianting Tian2019-12-011-1/+1
| | | | | | | | | | Fix the typo "resheduled" -> "rescheduled" in comment Link: http://lkml.kernel.org/r/1573486327-9591-1-git-send-email-xianting_tian@126.com Signed-off-by: Xianting Tian <xianting_tian@126.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: enforce inactive:active ratio at the reclaim rootJohannes Weiner2019-12-012-71/+118
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We split the LRU lists into inactive and an active parts to maximize workingset protection while allowing just enough inactive cache space to faciltate readahead and writeback for one-off file accesses (e.g. a linear scan through a file, or logging); or just enough inactive anon to maintain recent reference information when reclaim needs to swap. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, inactive:active size decisions are done on a per-cgroup level. As a result, we'll reclaim a cgroup's workingset when it doesn't have cold pages, even when one of its siblings has plenty of it that should be reclaimed first. For example: workload A has 50M worth of hot cache but doesn't do any one-off file accesses; meanwhile, parallel workload B scans files and rarely accesses the same page twice. If these workloads were to run in an uncgrouped system, A would be protected from the high rate of cache faults from B. But if they were put in parallel cgroups for memory accounting purposes, B's fast cache fault rate would push out the hot cache pages of A. This is unexpected and undesirable - the "scan resistance" of the page cache is broken. This patch moves inactive:active size balancing decisions to the root of reclaim - the same level where the LRU order is established. It does this by looking at the recursive size of the inactive and the active file sets of the cgroup subtree at the beginning of the reclaim cycle, and then making a decision - scan or skip active pages - that applies throughout the entire run and to every cgroup involved. With that in place, in the test above, the VM will recognize that there are plenty of inactive pages in the combined cache set of workloads A and B and prefer the one-off cache in B over the hot pages in A. The scan resistance of the cache is restored. [cai@lca.pw: fix some -Wenum-conversion warnings] Link: http://lkml.kernel.org/r/1573848697-29262-1-git-send-email-cai@lca.pw Link: http://lkml.kernel.org/r/20191107205334.158354-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Rik van Riel <riel@surriel.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: detect file thrashing at the reclaim rootJohannes Weiner2019-12-014-33/+73
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We use refault information to determine whether the cache workingset is stable or transitioning, and dynamically adjust the inactive:active file LRU ratio so as to maximize protection from one-off cache during stable periods, and minimize IO during transitions. With cgroups and their nested LRU lists, we currently don't do this correctly. While recursive cgroup reclaim establishes a relative LRU order among the pages of all involved cgroups, refaults only affect the local LRU order in the cgroup in which they are occuring. As a result, cache transitions can take longer in a cgrouped system as the active pages of sibling cgroups aren't challenged when they should be. [ Right now, this is somewhat theoretical, because the siblings, under continued regular reclaim pressure, should eventually run out of inactive pages - and since inactive:active *size* balancing is also done on a cgroup-local level, we will challenge the active pages eventually in most cases. But the next patch will move that relative size enforcement to the reclaim root as well, and then this patch here will be necessary to propagate refault pressure to siblings. ] This patch moves refault detection to the root of reclaim. Instead of remembering the cgroup owner of an evicted page, remember the cgroup that caused the reclaim to happen. When refaults later occur, they'll correctly influence the cross-cgroup LRU order that reclaim follows. I.e. if global reclaim kicked out pages in some subgroup A/B/C, the refault of those pages will challenge the global LRU order, and not just the local order down inside C. [hannes@cmpxchg.org: use page_memcg() instead of another lookup] Link: http://lkml.kernel.org/r/20191115160722.GA309754@cmpxchg.org Link: http://lkml.kernel.org/r/20191107205334.158354-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Cc: Shakeel Butt <shakeelb@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: move file exhaustion detection to the node levelJohannes Weiner2019-12-011-38/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: fix page aging across multiple cgroups". When applications are put into unconfigured cgroups for memory accounting purposes, the cgrouping itself should not change the behavior of the page reclaim code. We expect the VM to reclaim the coldest pages in the system. But right now the VM can reclaim hot pages in one cgroup while there is eligible cold cache in others. This is because one part of the reclaim algorithm isn't truly cgroup hierarchy aware: the inactive/active list balancing. That is the part that is supposed to protect hot cache data from one-off streaming IO. The recursive cgroup reclaim scheme will scan and rotate the physical LRU lists of each eligible cgroup at the same rate in a round-robin fashion, thereby establishing a relative order among the pages of all those cgroups. However, the inactive/active balancing decisions are made locally within each cgroup, so when a cgroup is running low on cold pages, its hot pages will get reclaimed - even when sibling cgroups have plenty of cold cache eligible in the same reclaim run. For example: [root@ham ~]# head -n1 /proc/meminfo MemTotal: 1016336 kB [root@ham ~]# ./reclaimtest2.sh Establishing 50M active files in cgroup A... Hot pages cached: 12800/12800 workingset-a Linearly scanning through 18G of file data in cgroup B: real 0m4.269s user 0m0.051s sys 0m4.182s Hot pages cached: 134/12800 workingset-a The streaming IO in B, which doesn't benefit from caching at all, pushes out most of the workingset in A. Solution This series fixes the problem by elevating inactive/active balancing decisions to the toplevel of the reclaim run. This is either a cgroup that hit its limit, or straight-up global reclaim if there is physical memory pressure. From there, it takes a recursive view of the cgroup subtree to decide whether page deactivation is necessary. In the test above, the VM will then recognize that cgroup B has plenty of eligible cold cache, and that the hot pages in A can be spared: [root@ham ~]# ./reclaimtest2.sh Establishing 50M active files in cgroup A... Hot pages cached: 12800/12800 workingset-a Linearly scanning through 18G of file data in cgroup B: real 0m4.244s user 0m0.064s sys 0m4.177s Hot pages cached: 12800/12800 workingset-a Implementation Whether active pages can be deactivated or not is influenced by two factors: the inactive list dropping below a minimum size relative to the active list, and the occurence of refaults. This patch series first moves refault detection to the reclaim root, then enforces the minimum inactive size based on a recursive view of the cgroup tree's LRUs. History Note that this actually never worked correctly in Linux cgroups. In the past it worked for global reclaim and leaf limit reclaim only (we used to have two physical LRU linkages per page), but it never worked for intermediate limit reclaim over multiple leaf cgroups. We're noticing this now because 1) we're putting everything into cgroups for accounting, not just the things we want to control and 2) we're moving away from leaf limits that invoke reclaim on individual cgroups, toward large tree reclaim, triggered by high-level limits, or physical memory pressure that is influenced by local protections such as memory.low and memory.min instead. This patch (of 3): When file pages are lower than the watermark on a node, we try to force scan anonymous pages to counter-act the balancing algorithms preference for new file pages when they are likely thrashing. This is a node-level decision, but it's currently made each time we look at an lruvec. This is unnecessarily expensive and also a layering violation that makes the code harder to understand. Clean this up by making the check once per node and setting a flag in the scan_control. Link: http://lkml.kernel.org/r/20191107205334.158354-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Shakeel Butt <shakeelb@google.com> Reviewed-by: Suren Baghdasaryan <surenb@google.com> Cc: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Rik van Riel <riel@surriel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: harmonize writeback congestion tracking for nodes & memcgsJohannes Weiner2019-12-013-64/+37
| | | | | | | | | | | | | | | | | | | The current writeback congestion tracking has separate flags for kswapd reclaim (node level) and cgroup limit reclaim (memcg-node level). This is unnecessarily complicated: the lruvec is an existing abstraction layer for that node-memcg intersection. Introduce lruvec->flags and LRUVEC_CONGESTED. Then track that at the reclaim root level, which is either the NUMA node for global reclaim, or the cgroup-node intersection for cgroup reclaim. Link: http://lkml.kernel.org/r/20191022144803.302233-9-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: split shrink_node() into node part and memcgs partJohannes Weiner2019-12-011-16/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | This function is getting long and unwieldy, split out the memcg bits. The updated shrink_node() handles the generic (node) reclaim aspects: - global vmpressure notifications - writeback and congestion throttling - reclaim/compaction management - kswapd giving up on unreclaimable nodes It then calls a new shrink_node_memcgs() which handles cgroup specifics: - the cgroup tree traversal - memory.low considerations - per-cgroup slab shrinking callbacks - per-cgroup vmpressure notifications [hannes@cmpxchg.org: rename "root" to "target_memcg", per Roman] Link: http://lkml.kernel.org/r/20191025143640.GA386981@cmpxchg.org Link: http://lkml.kernel.org/r/20191022144803.302233-8-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: turn shrink_node_memcg() into shrink_lruvec()Johannes Weiner2019-12-011-11/+10
| | | | | | | | | | | | | | | | | An lruvec holds LRU pages owned by a certain NUMA node and cgroup. Instead of awkwardly passing around a combination of a pgdat and a memcg pointer, pass down the lruvec as soon as we can look it up. Nested callers that need to access node or cgroup properties can look them them up if necessary, but there are only a few cases. Link: http://lkml.kernel.org/r/20191022144803.302233-7-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: replace shrink_node() loop with a retry jumpJohannes Weiner2019-12-011-116/+115
| | | | | | | | | | | | | | | | | | | Most of the function body is inside a loop, which imposes an additional indentation and scoping level that makes the code a bit hard to follow and modify. The looping only happens in case of reclaim-compaction, which isn't the common case. So rather than adding yet another function level to the reclaim path and have every reclaim invocation go through a level that only exists for one specific cornercase, use a retry goto. Link: http://lkml.kernel.org/r/20191022144803.302233-6-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: naming fixes: global_reclaim() and sane_reclaim()Johannes Weiner2019-12-011-20/+18
| | | | | | | | | | | | | | | | | | | | | | | | Seven years after introducing the global_reclaim() function, I still have to double take when reading a callsite. I don't know how others do it, this is a terrible name. Invert the meaning and rename it to cgroup_reclaim(). [ After all, "global reclaim" is just regular reclaim invoked from the page allocator. It's reclaim on behalf of a cgroup limit that is a special case of reclaim, and should be explicit - not the reverse. ] sane_reclaim() isn't very descriptive either: it tests whether we can use the regular writeback throttling - available during regular page reclaim or cgroup2 limit reclaim - or need to use the broken wait_on_page_writeback() method. Use "writeback_throttling_sane()". Link: http://lkml.kernel.org/r/20191022144803.302233-5-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: move inactive_list_is_low() swap check to the callerJohannes Weiner2019-12-011-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | inactive_list_is_low() should be about one thing: checking the ratio between inactive and active list. Kitchensink checks like the one for swap space makes the function hard to use and modify its callsites. Luckly, most callers already have an understanding of the swap situation, so it's easy to clean up. get_scan_count() has its own, memcg-aware swap check, and doesn't even get to the inactive_list_is_low() check on the anon list when there is no swap space available. shrink_list() is called on the results of get_scan_count(), so that check is redundant too. age_active_anon() has its own totalswap_pages check right before it checks the list proportions. The shrink_node_memcg() site is the only one that doesn't do its own swap check. Add it there. Then delete the swap check from inactive_list_is_low(). Link: http://lkml.kernel.org/r/20191022144803.302233-4-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: clean up and clarify lruvec lookup procedureJohannes Weiner2019-12-017-34/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a per-memcg lruvec and a NUMA node lruvec. Which one is being used is somewhat confusing right now, and it's easy to make mistakes - especially when it comes to global reclaim. How it works: when memory cgroups are enabled, we always use the root_mem_cgroup's per-node lruvecs. When memory cgroups are not compiled in or disabled at runtime, we use pgdat->lruvec. Document that in a comment. Due to the way the reclaim code is generalized, all lookups use the mem_cgroup_lruvec() helper function, and nobody should have to find the right lruvec manually right now. But to avoid future mistakes, rename the pgdat->lruvec member to pgdat->__lruvec and delete the convenience wrapper that suggests it's a commonly accessed member. While in this area, swap the mem_cgroup_lruvec() argument order. The name suggests a memcg operation, yet it takes a pgdat first and a memcg second. I have to double take every time I call this. Fix that. Link: http://lkml.kernel.org/r/20191022144803.302233-3-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Michal Hocko <mhocko@suse.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm: vmscan: simplify lruvec_lru_size()Johannes Weiner2019-12-011-16/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm: vmscan: cgroup-related cleanups". Here are 8 patches that clean up the reclaim code's interaction with cgroups a bit. They're not supposed to change any behavior, just make the implementation easier to understand and work with. This patch (of 8): This function currently takes the node or lruvec size and subtracts the zones that are excluded by the classzone index of the allocation. It uses four different types of counters to do this. Just add up the eligible zones. [cai@lca.pw: fix an undefined behavior for zone id] Link: http://lkml.kernel.org/r/20191108204407.1435-1-cai@lca.pw [akpm@linux-foundation.org: deal with the MAX_NR_ZONES special case. per Qian Cai] Link: http://lkml.kernel.org/r/64E60F6F-7582-427B-8DD5-EF97B1656F5A@lca.pw Link: http://lkml.kernel.org/r/20191022144803.302233-2-hannes@cmpxchg.org Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> Reviewed-by: Roman Gushchin <guro@fb.com> Reviewed-by: Shakeel Butt <shakeelb@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmscan.c: remove unused scan_control parameter from pageout()Yang Shi2019-12-011-5/+4
| | | | | | | | | | | | | | | Since lumpy reclaim was removed in v3.5 scan_control is not used by may_write_to_{queue|inode} and pageout() anymore, remove the unused parameter. Link: http://lkml.kernel.org/r/1570124498-19300-1-git-send-email-yang.shi@linux.alibaba.com Signed-off-by: Yang Shi <yang.shi@linux.alibaba.com> Acked-by: David Rientjes <rientjes@google.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmscan: remove unused lru_pages argumentAndrey Ryabinin2019-12-011-12/+5
| | | | | | | | | | | | | | | | | | Since 9092c71bb724 ("mm: use sc->priority for slab shrink targets") the argument 'unsigned long *lru_pages' passed around with no purpose. Remove it. Link: http://lkml.kernel.org/r/20190228083329.31892-4-aryabinin@virtuozzo.com Signed-off-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Acked-by: Johannes Weiner <hannes@cmpxchg.org> Acked-by: Vlastimil Babka <vbabka@suse.cz> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Michal Hocko <mhocko@kernel.org> Cc: Rik van Riel <riel@surriel.com> Cc: William Kucharski <william.kucharski@oracle.com> Cc: John Hubbard <jhubbard@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc.c: print reserved_highatomic infolijiazi2019-12-011-0/+2
| | | | | | | | | | | | | | Print nr_reserved_highatomic in show_free_areas, because when alloc_harder is false, this value will be subtracted from the free_pages in __zone_watermark_ok. Printing this value can help analyze memory allocaction failure issues. Link: http://lkml.kernel.org/r/19515f3de2fb6abe66b52e03e4b676a21e82beda.1573634806.git.lijiazi@xiaomi.com Signed-off-by: lijiazi <lijiazi@xiaomi.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Mel Gorman <mgorman@techsingularity.net> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* include/linux/mmzone.h: fix comment for ISOLATE_UNMAPPED macroHao Lee2019-12-011-1/+1
| | | | | | | | | | | | | | | | | Both file-backed pages and anonymous pages can be unmapped. ISOLATE_UNMAPPED is not just for file-backed pages. Link: http://lkml.kernel.org/r/20191024151621.GA20400@haolee.github.io Signed-off-by: Hao Lee <haolee.swjtu@gmail.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Wei Yang <richard.weiyang@gmail.com> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Johannes Weiner <hannes@cmpxchg.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, pcpu: make zone pcp updates and reset internal to the mmMel Gorman2019-12-012-3/+3
| | | | | | | | | | | | | | | | | Memory hotplug needs to be able to reset and reinit the pcpu allocator batch and high limits but this action is internal to the VM. Move the declaration to internal.h Link: http://lkml.kernel.org/r/20191021094808.28824-4-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm, pcp: share common code between memory hotplug and percpu sysctl handlerMel Gorman2019-12-011-11/+12
| | | | | | | | | | | | | | | | | | | Both the percpu_pagelist_fraction sysctl handler and memory hotplug have a common requirement of updating the pcpu page allocation batch and high values. Split the relevant helper to share common code. No functional change. Link: http://lkml.kernel.org/r/20191021094808.28824-3-mgorman@techsingularity.net Signed-off-by: Mel Gorman <mgorman@techsingularity.net> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Borislav Petkov <bp@alien8.de> Cc: Matt Fleming <matt@codeblueprint.co.uk> Cc: Qian Cai <cai@lca.pw> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/page_alloc: add alloc_contig_pages()Anshuman Khandual2019-12-013-75/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | HugeTLB helper alloc_gigantic_page() implements fairly generic allocation method where it scans over various zones looking for a large contiguous pfn range before trying to allocate it with alloc_contig_range(). Other than deriving the requested order from 'struct hstate', there is nothing HugeTLB specific in there. This can be made available for general use to allocate contiguous memory which could not have been allocated through the buddy allocator. alloc_gigantic_page() has been split carving out actual allocation method which is then made available via new alloc_contig_pages() helper wrapped under CONFIG_CONTIG_ALLOC. All references to 'gigantic' have been replaced with more generic term 'contig'. Allocated pages here should be freed with free_contig_range() or by calling __free_page() on each allocated page. Link: http://lkml.kernel.org/r/1571300646-32240-1-git-send-email-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Michal Hocko <mhocko@suse.com> Cc: David Rientjes <rientjes@google.com> Cc: Andrea Arcangeli <aarcange@redhat.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Mel Gorman <mgorman@techsingularity.net> Cc: Mike Rapoport <rppt@linux.ibm.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Pavel Tatashin <pavel.tatashin@microsoft.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: David Hildenbrand <david@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* x86/kasan: support KASAN_VMALLOCDaniel Axtens2019-12-012-0/+62
| | | | | | | | | | | | | | | | | | | | In the case where KASAN directly allocates memory to back vmalloc space, don't map the early shadow page over it. We prepopulate pgds/p4ds for the range that would otherwise be empty. This is required to get it synced to hardware on boot, allowing the lower levels of the page tables to be filled dynamically. Link: http://lkml.kernel.org/r/20191031093909.9228-5-dja@axtens.net Signed-off-by: Daniel Axtens <dja@axtens.net> Acked-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fork: support VMAP_STACK with KASAN_VMALLOCDaniel Axtens2019-12-012-4/+9
| | | | | | | | | | | | | | | | | | Supporting VMAP_STACK with KASAN_VMALLOC is straightforward: - clear the shadow region of vmapped stacks when swapping them in - tweak Kconfig to allow VMAP_STACK to be turned on with KASAN Link: http://lkml.kernel.org/r/20191031093909.9228-4-dja@axtens.net Signed-off-by: Daniel Axtens <dja@axtens.net> Reviewed-by: Dmitry Vyukov <dvyukov@google.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kasan: add test for vmallocDaniel Axtens2019-12-011-0/+26
| | | | | | | | | | | | | | | Test kasan vmalloc support by adding a new test to the module. Link: http://lkml.kernel.org/r/20191031093909.9228-3-dja@axtens.net Signed-off-by: Daniel Axtens <dja@axtens.net> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* kasan: support backing vmalloc space with real shadow memoryDaniel Axtens2019-12-019-9/+408
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "kasan: support backing vmalloc space with real shadow memory", v11. Currently, vmalloc space is backed by the early shadow page. This means that kasan is incompatible with VMAP_STACK. This series provides a mechanism to back vmalloc space with real, dynamically allocated memory. I have only wired up x86, because that's the only currently supported arch I can work with easily, but it's very easy to wire up other architectures, and it appears that there is some work-in-progress code to do this on arm64 and s390. This has been discussed before in the context of VMAP_STACK: - https://bugzilla.kernel.org/show_bug.cgi?id=202009 - https://lkml.org/lkml/2018/7/22/198 - https://lkml.org/lkml/2019/7/19/822 In terms of implementation details: Most mappings in vmalloc space are small, requiring less than a full page of shadow space. Allocating a full shadow page per mapping would therefore be wasteful. Furthermore, to ensure that different mappings use different shadow pages, mappings would have to be aligned to KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE. Instead, share backing space across multiple mappings. Allocate a backing page when a mapping in vmalloc space uses a particular page of the shadow region. This page can be shared by other vmalloc mappings later on. We hook in to the vmap infrastructure to lazily clean up unused shadow memory. Testing with test_vmalloc.sh on an x86 VM with 2 vCPUs shows that: - Turning on KASAN, inline instrumentation, without vmalloc, introuduces a 4.1x-4.2x slowdown in vmalloc operations. - Turning this on introduces the following slowdowns over KASAN: * ~1.76x slower single-threaded (test_vmalloc.sh performance) * ~2.18x slower when both cpus are performing operations simultaneously (test_vmalloc.sh sequential_test_order=1) This is unfortunate but given that this is a debug feature only, not the end of the world. The benchmarks are also a stress-test for the vmalloc subsystem: they're not indicative of an overall 2x slowdown! This patch (of 4): Hook into vmalloc and vmap, and dynamically allocate real shadow memory to back the mappings. Most mappings in vmalloc space are small, requiring less than a full page of shadow space. Allocating a full shadow page per mapping would therefore be wasteful. Furthermore, to ensure that different mappings use different shadow pages, mappings would have to be aligned to KASAN_SHADOW_SCALE_SIZE * PAGE_SIZE. Instead, share backing space across multiple mappings. Allocate a backing page when a mapping in vmalloc space uses a particular page of the shadow region. This page can be shared by other vmalloc mappings later on. We hook in to the vmap infrastructure to lazily clean up unused shadow memory. To avoid the difficulties around swapping mappings around, this code expects that the part of the shadow region that covers the vmalloc space will not be covered by the early shadow page, but will be left unmapped. This will require changes in arch-specific code. This allows KASAN with VMAP_STACK, and may be helpful for architectures that do not have a separate module space (e.g. powerpc64, which I am currently working on). It also allows relaxing the module alignment back to PAGE_SIZE. Testing with test_vmalloc.sh on an x86 VM with 2 vCPUs shows that: - Turning on KASAN, inline instrumentation, without vmalloc, introuduces a 4.1x-4.2x slowdown in vmalloc operations. - Turning this on introduces the following slowdowns over KASAN: * ~1.76x slower single-threaded (test_vmalloc.sh performance) * ~2.18x slower when both cpus are performing operations simultaneously (test_vmalloc.sh sequential_test_order=3D1) This is unfortunate but given that this is a debug feature only, not the end of the world. The full benchmark results are: Performance No KASAN KASAN original x baseline KASAN vmalloc x baseline x KASAN fix_size_alloc_test 662004 11404956 17.23 19144610 28.92 1.68 full_fit_alloc_test 710950 12029752 16.92 13184651 18.55 1.10 long_busy_list_alloc_test 9431875 43990172 4.66 82970178 8.80 1.89 random_size_alloc_test 5033626 23061762 4.58 47158834 9.37 2.04 fix_align_alloc_test 1252514 15276910 12.20 31266116 24.96 2.05 random_size_align_alloc_te 1648501 14578321 8.84 25560052 15.51 1.75 align_shift_alloc_test 147 830 5.65 5692 38.72 6.86 pcpu_alloc_test 80732 125520 1.55 140864 1.74 1.12 Total Cycles 119240774314 763211341128 6.40 1390338696894 11.66 1.82 Sequential, 2 cpus No KASAN KASAN original x baseline KASAN vmalloc x baseline x KASAN fix_size_alloc_test 1423150 14276550 10.03 27733022 19.49 1.94 full_fit_alloc_test 1754219 14722640 8.39 15030786 8.57 1.02 long_busy_list_alloc_test 11451858 52154973 4.55 107016027 9.34 2.05 random_size_alloc_test 5989020 26735276 4.46 68885923 11.50 2.58 fix_align_alloc_test 2050976 20166900 9.83 50491675 24.62 2.50 random_size_align_alloc_te 2858229 17971700 6.29 38730225 13.55 2.16 align_shift_alloc_test 405 6428 15.87 26253 64.82 4.08 pcpu_alloc_test 127183 151464 1.19 216263 1.70 1.43 Total Cycles 54181269392 308723699764 5.70 650772566394 12.01 2.11 fix_size_alloc_test 1420404 14289308 10.06 27790035 19.56 1.94 full_fit_alloc_test 1736145 14806234 8.53 15274301 8.80 1.03 long_busy_list_alloc_test 11404638 52270785 4.58 107550254 9.43 2.06 random_size_alloc_test 6017006 26650625 4.43 68696127 11.42 2.58 fix_align_alloc_test 2045504 20280985 9.91 50414862 24.65 2.49 random_size_align_alloc_te 2845338 17931018 6.30 38510276 13.53 2.15 align_shift_alloc_test 472 3760 7.97 9656 20.46 2.57 pcpu_alloc_test 118643 132732 1.12 146504 1.23 1.10 Total Cycles 54040011688 309102805492 5.72 651325675652 12.05 2.11 [dja@axtens.net: fixups] Link: http://lkml.kernel.org/r/20191120052719.7201-1-dja@axtens.net Link: https://bugzilla.kernel.org/show_bug.cgi?id=3D202009 Link: http://lkml.kernel.org/r/20191031093909.9228-2-dja@axtens.net Signed-off-by: Mark Rutland <mark.rutland@arm.com> [shadow rework] Signed-off-by: Daniel Axtens <dja@axtens.net> Co-developed-by: Mark Rutland <mark.rutland@arm.com> Acked-by: Vasily Gorbik <gor@linux.ibm.com> Reviewed-by: Andrey Ryabinin <aryabinin@virtuozzo.com> Cc: Alexander Potapenko <glider@google.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Christophe Leroy <christophe.leroy@c-s.fr> Cc: Qian Cai <cai@lca.pw> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: rework vmap_area_lockUladzislau Rezki (Sony)2019-12-011-30/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With the new allocation approach introduced in the 5.2 kernel, it becomes possible to get rid of one global spinlock. By doing that we can further improve the KVA from the performance point of view. Basically we can have two independent locks, one for allocation part and another one for deallocation, because of two different entities: "free data structures" and "busy data structures". As a result, allocation/deallocation operations can still interfere between each other in case of running simultaneously on different CPUs, it means there is still dependency, but with two locks it becomes lower. Summarizing: - it reduces the high lock contention - it allows to perform operations on "free" and "busy" trees in parallel on different CPUs. Please note it does not solve scalability issue. Test results: In order to evaluate this patch, we can run "vmalloc test driver" to see how many CPU cycles it takes to complete all test cases running sequentially. All online CPUs run it so it will cause a high lock contention. HiKey 960, ARM64, 8xCPUs, big.LITTLE: <snip> sudo ./test_vmalloc.sh sequential_test_order=1 <snip> <default> [ 390.950557] All test took CPU0=457126382 cycles [ 391.046690] All test took CPU1=454763452 cycles [ 391.128586] All test took CPU2=454539334 cycles [ 391.222669] All test took CPU3=455649517 cycles [ 391.313946] All test took CPU4=388272196 cycles [ 391.410425] All test took CPU5=384036264 cycles [ 391.492219] All test took CPU6=387432964 cycles [ 391.578433] All test took CPU7=387201996 cycles <default> <patched> [ 304.721224] All test took CPU0=391521310 cycles [ 304.821219] All test took CPU1=393533002 cycles [ 304.917120] All test took CPU2=392243032 cycles [ 305.008986] All test took CPU3=392353853 cycles [ 305.108944] All test took CPU4=297630721 cycles [ 305.196406] All test took CPU5=297548736 cycles [ 305.288602] All test took CPU6=297092392 cycles [ 305.381088] All test took CPU7=297293597 cycles <patched> ~14%-23% patched variant is better. Link: http://lkml.kernel.org/r/20191022155800.20468-1-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Andrew Morton <akpm@linux-foundation.org> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* selftests: vm: add fragment CONFIG_TEST_VMALLOCAnders Roxell2019-12-011-0/+1
| | | | | | | | | | | | | | | | | | When running test_vmalloc.sh smoke the following print out states that the fragment is missing. # ./test_vmalloc.sh: You must have the following enabled in your kernel: # CONFIG_TEST_VMALLOC=m Rework to add the fragment 'CONFIG_TEST_VMALLOC=m' to the config file. Link: http://lkml.kernel.org/r/20190916095217.19665-1-anders.roxell@linaro.org Fixes: a05ef00c9790 ("selftests/vm: add script helper for CONFIG_TEST_VMALLOC_MODULE") Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Cc: "Uladzislau Rezki (Sony)" <urezki@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: add more comments to the adjust_va_to_fit_type()Uladzislau Rezki (Sony)2019-12-011-0/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | When fit type is NE_FIT_TYPE there is a need in one extra object. Usually the "ne_fit_preload_node" per-CPU variable has it and there is no need in GFP_NOWAIT allocation, but there are exceptions. This commit just adds more explanations, as a result giving answers on questions like when it can occur, how often, under which conditions and what happens if GFP_NOWAIT gets failed. Link: http://lkml.kernel.org/r/20191016095438.12391-3-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Daniel Wagner <dwagner@suse.de> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: respect passed gfp_mask when doing preloadingUladzislau Rezki (Sony)2019-12-011-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | Allocation functions should comply with the given gfp_mask as much as possible. The preallocation code in alloc_vmap_area doesn't follow that pattern and it is using a hardcoded GFP_KERNEL. Although this doesn't really make much difference because vmalloc is not GFP_NOWAIT compliant in general (e.g. page table allocations are GFP_KERNEL) there is no reason to spread that bad habit and it is good to fix the antipattern. [mhocko@suse.com: rewrite changelog] Link: http://lkml.kernel.org/r/20191016095438.12391-2-urezki@gmail.com Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Daniel Wagner <dwagner@suse.de> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Cc: Steven Rostedt <rostedt@goodmis.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc: remove preempt_disable/enable when doing preloadingUladzislau Rezki (Sony)2019-12-011-17/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some background. The preemption was disabled before to guarantee that a preloaded object is available for a CPU, it was stored for. That was achieved by combining the disabling the preemption and taking the spin lock while the ne_fit_preload_node is checked. The aim was to not allocate in atomic context when spinlock is taken later, for regular vmap allocations. But that approach conflicts with CONFIG_PREEMPT_RT philosophy. It means that calling spin_lock() with disabled preemption is forbidden in the CONFIG_PREEMPT_RT kernel. Therefore, get rid of preempt_disable() and preempt_enable() when the preload is done for splitting purpose. As a result we do not guarantee now that a CPU is preloaded, instead we minimize the case when it is not, with this change, by populating the per cpu preload pointer under the vmap_area_lock. This implies that at least each caller that has done the preallocation will not fallback to an atomic allocation later. It is possible that the preallocation would be pointless or that no preallocation is done because of the race but the data shows that this is really rare. For example i run the special test case that follows the preload pattern and path. 20 "unbind" threads run it and each does 1000000 allocations. Only 3.5 times among 1000000 a CPU was not preloaded. So it can happen but the number is negligible. [mhocko@suse.com: changelog additions] Link: http://lkml.kernel.org/r/20191016095438.12391-1-urezki@gmail.com Fixes: 82dd23e84be3 ("mm/vmalloc.c: preload a CPU with one object for split purpose") Signed-off-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Reviewed-by: Steven Rostedt (VMware) <rostedt@goodmis.org> Acked-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Daniel Wagner <dwagner@suse.de> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Oleksiy Avramchenko <oleksiy.avramchenko@sonymobile.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/vmalloc.c: remove unnecessary highmem_mask from parameter of ↵Liu Xiang2019-12-011-1/+1
| | | | | | | | | | | | | gfpflags_allow_blocking() gfpflags_allow_blocking() does not care about __GFP_HIGHMEM, so highmem_mask can be removed. Link: http://lkml.kernel.org/r/1568812319-3467-1-git-send-email-liuxiang_1999@126.com Signed-off-by: Liu Xiang <liuxiang_1999@126.com> Reviewed-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/sparse.c: do not waste pre allocated memmap spaceMichal Hocko2019-12-011-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Vincent has noticed [1] that there is something unusual with the memmap allocations going on on his platform : I noticed this because on my ARM64 platform, with 1 GiB of memory the : first [and only] section is allocated from the zeroing path while with : 2 GiB of memory the first 1 GiB section is allocated from the : non-zeroing path. The underlying problem is that although sparse_buffer_init allocates enough memory for all sections on the node sparse_buffer_alloc is not able to consume them due to mismatch in the expected allocation alignement. While sparse_buffer_init preallocation uses the PAGE_SIZE alignment the real memmap has to be aligned to section_map_size() this results in a wasted initial chunk of the preallocated memmap and unnecessary fallback allocation for a section. While we are at it also change __populate_section_memmap to align to the requested size because at least VMEMMAP has constrains to have memmap properly aligned. [1] http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com [akpm@linux-foundation.org: tweak layout, per David] Link: http://lkml.kernel.org/r/20191119092642.31799-1-mhocko@kernel.org Fixes: 35fd1eb1e821 ("mm/sparse: abstract sparse buffer allocations") Signed-off-by: Michal Hocko <mhocko@suse.com> Reported-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Debugged-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Oscar Salvador <OSalvador@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/sparse.c: mark populate_section_memmap as __meminitIlya Leoshkevich2019-12-011-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | Building the kernel on s390 with -Og produces the following warning: WARNING: vmlinux.o(.text+0x28dabe): Section mismatch in reference from the function populate_section_memmap() to the function .meminit.text:__populate_section_memmap() The function populate_section_memmap() references the function __meminit __populate_section_memmap(). This is often because populate_section_memmap lacks a __meminit annotation or the annotation of __populate_section_memmap is wrong. While -Og is not supported, in theory this might still happen with another compiler or on another architecture. So fix this by using the correct section annotations. [iii@linux.ibm.com: v2] Link: http://lkml.kernel.org/r/20191030151639.41486-1-iii@linux.ibm.com Link: http://lkml.kernel.org/r/20191028165549.14478-1-iii@linux.ibm.com Signed-off-by: Ilya Leoshkevich <iii@linux.ibm.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Heiko Carstens <heiko.carstens@de.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Oscar Salvador <OSalvador@suse.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/sparse: consistently do not zero memmapVincent Whitchurch2019-12-011-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sparsemem without VMEMMAP has two allocation paths to allocate the memory needed for its memmap (done in sparse_mem_map_populate()). In one allocation path (sparse_buffer_alloc() succeeds), the memory is not zeroed (since it was previously allocated with memblock_alloc_try_nid_raw()). In the other allocation path (sparse_buffer_alloc() fails and sparse_mem_map_populate() falls back to memblock_alloc_try_nid()), the memory is zeroed. AFAICS this difference does not appear to be on purpose. If the code is supposed to work with non-initialized memory (__init_single_page() takes care of zeroing the struct pages which are actually used), we should consistently not zero the memory, to avoid masking bugs. ( I noticed this because on my ARM64 platform, with 1 GiB of memory the first [and only] section is allocated from the zeroing path while with 2 GiB of memory the first 1 GiB section is allocated from the non-zeroing path. ) Michal: "the main user visible problem is a memory wastage. The overal amount of memory should be small. I wouldn't call it stable material." Link: http://lkml.kernel.org/r/20191030131122.8256-1-vincent.whitchurch@axis.com Signed-off-by: Vincent Whitchurch <vincent.whitchurch@axis.com> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Oscar Salvador <osalvador@suse.de> Reviewed-by: Pavel Tatashin <pasha.tatashin@soleen.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* mm/memory_hotplug.c: don't allow to online/offline memory blocks with holesDavid Hildenbrand2019-12-011-2/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our onlining/offlining code is unnecessarily complicated. Only memory blocks added during boot can have holes (a range that is not IORESOURCE_SYSTEM_RAM). Hotplugged memory never has holes (e.g., see add_memory_resource()). All memory blocks that belong to boot memory are already online. Note that boot memory can have holes and the memmap of the holes is marked PG_reserved. However, also memory allocated early during boot is PG_reserved - basically every page of boot memory that is not given to the buddy is PG_reserved. Therefore, when we stop allowing to offline memory blocks with holes, we implicitly no longer have to deal with onlining memory blocks with holes. E.g., online_pages() will do a walk_system_ram_range(..., online_pages_range), whereby online_pages_range() will effectively only free the memory holes not falling into a hole to the buddy. The other pages (holes) are kept PG_reserved (via move_pfn_range_to_zone()->memmap_init_zone()). This allows to simplify the code. For example, we no longer have to worry about marking pages that fall into memory holes PG_reserved when onlining memory. We can stop setting pages PG_reserved completely in memmap_init_zone(). Offlining memory blocks added during boot is usually not guaranteed to work either way (unmovable data might have easily ended up on that memory during boot). So stopping to do that should not really hurt. Also, people are not even aware of a setup where onlining/offlining of memory blocks with holes used to work reliably (see [1] and [2] especially regarding the hotplug path) - I doubt it worked reliably. For the use case of offlining memory to unplug DIMMs, we should see no change. (holes on DIMMs would be weird). Please note that hardware errors (PG_hwpoison) are not memory holes and are not affected by this change when offlining. [1] https://lkml.org/lkml/2019/10/22/135 [2] https://lkml.org/lkml/2019/8/14/1365 Link: http://lkml.kernel.org/r/20191119115237.6662-1-david@redhat.com Reviewed-by: Dan Williams <dan.j.williams@intel.com> Signed-off-by: David Hildenbrand <david@redhat.com> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Oscar Salvador <osalvador@suse.de> Cc: Pavel Tatashin <pasha.tatashin@soleen.com> Cc: Dan Williams <dan.j.williams@intel.com> Cc: Anshuman Khandual <anshuman.khandual@arm.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>