From 3faa52c03f440d1b9ddef18c4f189f4790d52d7e Mon Sep 17 00:00:00 2001
From: John Hubbard
Date: Wed, 1 Apr 2020 21:05:29 -0700
Subject: mm/gup: track FOLL_PIN pages
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Add tracking of pages that were pinned via FOLL_PIN. This tracking is implemented via overloading of page->_refcount: pins are added by adding GUP_PIN_COUNTING_BIAS (1024) to the refcount. This provides a fuzzy indication of pinning, and it can have false positives (and that's OK). Please see the pre-existing Documentation/core-api/pin_user_pages.rst for details.

As mentioned in pin_user_pages.rst, callers who effectively set FOLL_PIN (typically via pin_user_pages*()) are required to ultimately free such pages via unpin_user_page().

Please also note the limitation, discussed in pin_user_pages.rst under the "TODO: for 1GB and larger huge pages" section. (That limitation will be removed in a following patch.)

The effect of a FOLL_PIN flag is similar to that of FOLL_GET, and may be thought of as "FOLL_GET for DIO and/or RDMA use".

Pages that have been pinned via FOLL_PIN are identifiable via a new function call:

    bool page_maybe_dma_pinned(struct page *page);

What to do in response to encountering such a page, is left to later patchsets. There is discussion about this in [1], [2], [3], and [4].

This also changes a BUG_ON(), to a WARN_ON(), in follow_page_mask().

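The refcount overloading described above can be illustrated with a small user-space model. This is only a sketch, not the kernel code: `struct toy_page` and every `toy_*` helper are hypothetical names invented for illustration, and the threshold comparison in `toy_maybe_dma_pinned()` is an assumption about how a "fuzzy" query could be built from the description. Only the bias value (1024) comes from the commit message.

```c
#include <assert.h>
#include <stdbool.h>

/* Bias value as stated in the commit message; the rest is a toy model. */
#define GUP_PIN_COUNTING_BIAS 1024

/* Hypothetical stand-in for struct page: only the overloaded counter. */
struct toy_page {
	int refcount;
};

/* An ordinary (FOLL_GET-style) reference adds 1 to the counter. */
static inline void toy_get(struct toy_page *p)
{
	p->refcount += 1;
}

/* A FOLL_PIN-style pin adds the bias to the very same counter. */
static inline void toy_pin(struct toy_page *p)
{
	p->refcount += GUP_PIN_COUNTING_BIAS;
}

/* Unpinning must remove exactly what toy_pin() added. */
static inline void toy_unpin(struct toy_page *p)
{
	assert(p->refcount >= GUP_PIN_COUNTING_BIAS);
	p->refcount -= GUP_PIN_COUNTING_BIAS;
}

/*
 * Fuzzy query (assumed form): report "maybe pinned" once the counter has
 * reached the bias. Enough ordinary references produce a false positive.
 */
static inline bool toy_maybe_dma_pinned(const struct toy_page *p)
{
	return p->refcount >= GUP_PIN_COUNTING_BIAS;
}
```

In this model a page holding one ordinary reference reports "not pinned", reports "maybe pinned" after a single `toy_pin()`, and also reports "maybe pinned" after 1023 further `toy_get()` calls with no pin at all, which is the kind of false positive the commit message accepts.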
[1] Some slow progress on get_user_pages() (Apr 2, 2019): https://lwn.net/Articles/784574/ [2] DMA and get_user_pages() (LPC: Dec 12, 2018): https://lwn.net/Articles/774411/ [3] The trouble with get_user_pages() (Apr 30, 2018): https://lwn.net/Articles/753027/ [4] LWN kernel index: get_user_pages(): https://lwn.net/Kernel/Index/#Memory_management-get_user_pages [jhubbard@nvidia.com: add kerneldoc] Link: http://lkml.kernel.org/r/20200307021157.235726-1-jhubbard@nvidia.com [imbrenda@linux.ibm.com: if pin fails, we need to unpin, a simple put_page will not be enough] Link: http://lkml.kernel.org/r/20200306132537.783769-2-imbrenda@linux.ibm.com [akpm@linux-foundation.org: fix put_compound_head defined but not used] Suggested-by: Jan Kara Suggested-by: Jérôme Glisse Signed-off-by: John Hubbard Signed-off-by: Claudio Imbrenda Signed-off-by: Andrew Morton Reviewed-by: Jan Kara Acked-by: Kirill A. Shutemov Cc: Ira Weiny Cc: "Matthew Wilcox (Oracle)" Cc: Al Viro Cc: Christoph Hellwig Cc: Dan Williams Cc: Dave Chinner Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Michal Hocko Cc: Mike Kravetz Cc: Shuah Khan Cc: Vlastimil Babka Link: http://lkml.kernel.org/r/20200211001536.1027652-7-jhubbard@nvidia.com Signed-off-by: Linus Torvalds --- Documentation/core-api/pin_user_pages.rst | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) (limited to 'Documentation') diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 1d490155ecd7..9829345428f8 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -173,8 +173,8 @@ CASE 4: Pinning for struct page manipulation only ------------------------------------------------- Here, normal GUP calls are sufficient, so neither flag needs to be set. 
-page_dma_pinned(): the whole point of pinning -============================================= +page_maybe_dma_pinned(): the whole point of pinning +=================================================== The whole point of marking pages as "DMA-pinned" or "gup-pinned" is to be able to query, "is this page DMA-pinned?" That allows code such as page_mkclean() @@ -186,7 +186,7 @@ and debates (see the References at the end of this document). It's a TODO item here: fill in the details once that's worked out. Meanwhile, it's safe to say that having this available: :: - static inline bool page_dma_pinned(struct page *page) + static inline bool page_maybe_dma_pinned(struct page *page) ...is a prerequisite to solving the long-running gup+DMA problem. -- cgit v1.2.3 From 47e29d32afba11b13efb51f03154a8cf22fb4360 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Wed, 1 Apr 2020 21:05:33 -0700 Subject: mm/gup: page->hpage_pinned_refcount: exact pin counts for huge pages MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit For huge pages (and in fact, any compound page), the GUP_PIN_COUNTING_BIAS scheme tends to overflow too easily, each tail page increments the head page->_refcount by GUP_PIN_COUNTING_BIAS (1024). That limits the number of huge pages that can be pinned. This patch removes that limitation, by using an exact form of pin counting for compound pages of order > 1. The "order > 1" is required because this approach uses the 3rd struct page in the compound page, and order 1 compound pages only have two pages, so that won't work there. A new struct page field, hpage_pinned_refcount, has been added, replacing a padding field in the union (so no new space is used). This enhancement also has a useful side effect: huge pages and compound pages (of order > 1) do not suffer from the "potential false positives" problem that is discussed in the page_dma_pinned() comment block. 
That is because these compound pages have extra space for tracking things, so they get exact pin counts instead of overloading page->_refcount. Documentation/core-api/pin_user_pages.rst is updated accordingly. Suggested-by: Jan Kara Signed-off-by: John Hubbard Signed-off-by: Andrew Morton Reviewed-by: Jan Kara Acked-by: Kirill A. Shutemov Cc: Ira Weiny Cc: Jérôme Glisse Cc: "Matthew Wilcox (Oracle)" Cc: Al Viro Cc: Christoph Hellwig Cc: Dan Williams Cc: Dave Chinner Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Michal Hocko Cc: Mike Kravetz Cc: Shuah Khan Cc: Vlastimil Babka Link: http://lkml.kernel.org/r/20200211001536.1027652-8-jhubbard@nvidia.com Signed-off-by: Linus Torvalds --- Documentation/core-api/pin_user_pages.rst | 40 +++++++--------- include/linux/mm.h | 26 +++++++++++ include/linux/mm_types.h | 7 ++- mm/gup.c | 78 +++++++++++++++++++++++++++---- mm/hugetlb.c | 6 +++ mm/page_alloc.c | 2 + mm/rmap.c | 6 +++ 7 files changed, 133 insertions(+), 32 deletions(-) (limited to 'Documentation') diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 9829345428f8..7e5dd8b1b3f2 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -52,8 +52,22 @@ Which flags are set by each wrapper For these pin_user_pages*() functions, FOLL_PIN is OR'd in with whatever gup flags the caller provides. The caller is required to pass in a non-null struct -pages* array, and the function then pin pages by incrementing each by a special -value. For now, that value is +1, just like get_user_pages*().:: +pages* array, and the function then pins pages by incrementing each by a special +value: GUP_PIN_COUNTING_BIAS. + +For huge pages (and in fact, any compound page of more than 2 pages), the +GUP_PIN_COUNTING_BIAS scheme is not used. Instead, an exact form of pin counting +is achieved, by using the 3rd struct page in the compound page. 
A new struct +page field, hpage_pinned_refcount, has been added in order to support this. + +This approach for compound pages avoids the counting upper limit problems that +are discussed below. Those limitations would have been aggravated severely by +huge pages, because each tail page adds a refcount to the head page. And in +fact, testing revealed that, without a separate hpage_pinned_refcount field, +page overflows were seen in some huge page stress tests. + +This also means that huge pages and compound pages (of order > 1) do not suffer +from the false positives problem that is mentioned below.:: Function -------- @@ -99,27 +113,6 @@ pages: This also leads to limitations: there are only 31-10==21 bits available for a counter that increments 10 bits at a time. -TODO: for 1GB and larger huge pages, this is cutting it close. That's because -when pin_user_pages() follows such pages, it increments the head page by "1" -(where "1" used to mean "+1" for get_user_pages(), but now means "+1024" for -pin_user_pages()) for each tail page. So if you have a 1GB huge page: - -* There are 256K (18 bits) worth of 4 KB tail pages. -* There are 21 bits available to count up via GUP_PIN_COUNTING_BIAS (that is, - 10 bits at a time) -* There are 21 - 18 == 3 bits available to count. Except that there aren't, - because you need to allow for a few normal get_page() calls on the head page, - as well. Fortunately, the approach of using addition, rather than "hard" - bitfields, within page->_refcount, allows for sharing these bits gracefully. - But we're still looking at about 8 references. - -This, however, is a missing feature more than anything else, because it's easily -solved by addressing an obvious inefficiency in the original get_user_pages() -approach of retrieving pages: stop treating all the pages as if they were -PAGE_SIZE. Retrieve huge pages as huge pages. The callers need to be aware of -this, so some work is required. 
Once that's in place, this limitation mostly -disappears from view, because there will be ample refcounting range available. - * Callers must specifically request "dma-pinned tracking of pages". In other words, just calling get_user_pages() will not suffice; a new set of functions, pin_user_page() and related, must be used. @@ -228,5 +221,6 @@ References * `Some slow progress on get_user_pages() (Apr 2, 2019) `_ * `DMA and get_user_pages() (LPC: Dec 12, 2018) `_ * `The trouble with get_user_pages() (Apr 30, 2018) `_ +* `LWN kernel index: get_user_pages() `_ John Hubbard, October, 2019 diff --git a/include/linux/mm.h b/include/linux/mm.h index 10be09c8227e..6a426f8fd1e9 100644 --- a/include/linux/mm.h +++ b/include/linux/mm.h @@ -770,6 +770,24 @@ static inline unsigned int compound_order(struct page *page) return page[1].compound_order; } +static inline bool hpage_pincount_available(struct page *page) +{ + /* + * Can the page->hpage_pinned_refcount field be used? That field is in + * the 3rd page of the compound page, so the smallest (2-page) compound + * pages cannot support it. + */ + page = compound_head(page); + return PageCompound(page) && compound_order(page) > 1; +} + +static inline int compound_pincount(struct page *page) +{ + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); + page = compound_head(page); + return atomic_read(compound_pincount_ptr(page)); +} + static inline void set_compound_order(struct page *page, unsigned int order) { page[1].compound_order = order; @@ -1084,6 +1102,11 @@ void unpin_user_pages(struct page **pages, unsigned long npages); * refcounts, and b) all the callers of this routine are expected to be able to * deal gracefully with a false positive. * + * For huge pages, the result will be exactly correct. That's because we have + * more tracking data available: the 3rd struct page in the compound page is + * used to track the pincount (instead using of the GUP_PIN_COUNTING_BIAS + * scheme). 
+ * * For more information, please see Documentation/vm/pin_user_pages.rst. * * @page: pointer to page to be queried. @@ -1092,6 +1115,9 @@ void unpin_user_pages(struct page **pages, unsigned long npages); */ static inline bool page_maybe_dma_pinned(struct page *page) { + if (hpage_pincount_available(page)) + return compound_pincount(page) > 0; + /* * page_ref_count() is signed. If that refcount overflows, then * page_ref_count() returns a negative value, and callers will avoid diff --git a/include/linux/mm_types.h b/include/linux/mm_types.h index c28911c3afa8..dd555e6d23f3 100644 --- a/include/linux/mm_types.h +++ b/include/linux/mm_types.h @@ -137,7 +137,7 @@ struct page { }; struct { /* Second tail page of compound page */ unsigned long _compound_pad_1; /* compound_head */ - unsigned long _compound_pad_2; + atomic_t hpage_pinned_refcount; /* For both global and memcg */ struct list_head deferred_list; }; @@ -226,6 +226,11 @@ static inline atomic_t *compound_mapcount_ptr(struct page *page) return &page[1].compound_mapcount; } +static inline atomic_t *compound_pincount_ptr(struct page *page) +{ + return &page[2].hpage_pinned_refcount; +} + /* * Used for sizing the vmemmap region on some architectures */ diff --git a/mm/gup.c b/mm/gup.c index ee4f14f108fe..6601df4c7682 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -29,6 +29,22 @@ struct follow_page_context { unsigned int page_mask; }; +static void hpage_pincount_add(struct page *page, int refs) +{ + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); + VM_BUG_ON_PAGE(page != compound_head(page), page); + + atomic_add(refs, compound_pincount_ptr(page)); +} + +static void hpage_pincount_sub(struct page *page, int refs) +{ + VM_BUG_ON_PAGE(!hpage_pincount_available(page), page); + VM_BUG_ON_PAGE(page != compound_head(page), page); + + atomic_sub(refs, compound_pincount_ptr(page)); +} + /* * Return the compound head page with ref appropriately incremented, * or NULL if that failed. 
@@ -70,8 +86,25 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page, if (flags & FOLL_GET) return try_get_compound_head(page, refs); else if (flags & FOLL_PIN) { - refs *= GUP_PIN_COUNTING_BIAS; - return try_get_compound_head(page, refs); + /* + * When pinning a compound page of order > 1 (which is what + * hpage_pincount_available() checks for), use an exact count to + * track it, via hpage_pincount_add/_sub(). + * + * However, be sure to *also* increment the normal page refcount + * field at least once, so that the page really is pinned. + */ + if (!hpage_pincount_available(page)) + refs *= GUP_PIN_COUNTING_BIAS; + + page = try_get_compound_head(page, refs); + if (!page) + return NULL; + + if (hpage_pincount_available(page)) + hpage_pincount_add(page, refs); + + return page; } WARN_ON_ONCE(1); @@ -106,12 +139,25 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags) if (flags & FOLL_GET) return try_get_page(page); else if (flags & FOLL_PIN) { + int refs = 1; + page = compound_head(page); if (WARN_ON_ONCE(page_ref_count(page) <= 0)) return false; - page_ref_add(page, GUP_PIN_COUNTING_BIAS); + if (hpage_pincount_available(page)) + hpage_pincount_add(page, 1); + else + refs = GUP_PIN_COUNTING_BIAS; + + /* + * Similar to try_grab_compound_head(): even if using the + * hpage_pincount_add/_sub() routines, be sure to + * *also* increment the normal page refcount field at least + * once, so that the page really is pinned. 
+ */ + page_ref_add(page, refs); } return true; @@ -120,12 +166,17 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags) #ifdef CONFIG_DEV_PAGEMAP_OPS static bool __unpin_devmap_managed_user_page(struct page *page) { - int count; + int count, refs = 1; if (!page_is_devmap_managed(page)) return false; - count = page_ref_sub_return(page, GUP_PIN_COUNTING_BIAS); + if (hpage_pincount_available(page)) + hpage_pincount_sub(page, 1); + else + refs = GUP_PIN_COUNTING_BIAS; + + count = page_ref_sub_return(page, refs); /* * devmap page refcounts are 1-based, rather than 0-based: if @@ -157,6 +208,8 @@ static bool __unpin_devmap_managed_user_page(struct page *page) */ void unpin_user_page(struct page *page) { + int refs = 1; + page = compound_head(page); /* @@ -168,7 +221,12 @@ void unpin_user_page(struct page *page) if (__unpin_devmap_managed_user_page(page)) return; - if (page_ref_sub_and_test(page, GUP_PIN_COUNTING_BIAS)) + if (hpage_pincount_available(page)) + hpage_pincount_sub(page, 1); + else + refs = GUP_PIN_COUNTING_BIAS; + + if (page_ref_sub_and_test(page, refs)) __put_page(page); } EXPORT_SYMBOL(unpin_user_page); @@ -1955,8 +2013,12 @@ EXPORT_SYMBOL(get_user_pages_unlocked); static void put_compound_head(struct page *page, int refs, unsigned int flags) { - if (flags & FOLL_PIN) - refs *= GUP_PIN_COUNTING_BIAS; + if (flags & FOLL_PIN) { + if (hpage_pincount_available(page)) + hpage_pincount_sub(page, refs); + else + refs *= GUP_PIN_COUNTING_BIAS; + } VM_BUG_ON_PAGE(page_ref_count(page) < refs, page); /* diff --git a/mm/hugetlb.c b/mm/hugetlb.c index ba1de6bc1402..3d31a235b53d 100644 --- a/mm/hugetlb.c +++ b/mm/hugetlb.c @@ -1009,6 +1009,9 @@ static void destroy_compound_gigantic_page(struct page *page, struct page *p = page + 1; atomic_set(compound_mapcount_ptr(page), 0); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); + for (i = 1; i < nr_pages; i++, p = mem_map_next(p, page, i)) { clear_compound_head(p); 
set_page_refcounted(p); @@ -1287,6 +1290,9 @@ static void prep_compound_gigantic_page(struct page *page, unsigned int order) set_compound_head(p, page); } atomic_set(compound_mapcount_ptr(page), -1); + + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); } /* diff --git a/mm/page_alloc.c b/mm/page_alloc.c index 6e7e9c1d6caa..8f3a3bf2c347 100644 --- a/mm/page_alloc.c +++ b/mm/page_alloc.c @@ -688,6 +688,8 @@ void prep_compound_page(struct page *page, unsigned int order) set_compound_head(p, page); } atomic_set(compound_mapcount_ptr(page), -1); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); } #ifdef CONFIG_DEBUG_PAGEALLOC diff --git a/mm/rmap.c b/mm/rmap.c index b3e381919835..e45b9b991e2f 100644 --- a/mm/rmap.c +++ b/mm/rmap.c @@ -1178,6 +1178,9 @@ void page_add_new_anon_rmap(struct page *page, VM_BUG_ON_PAGE(!PageTransHuge(page), page); /* increment count (starts at -1) */ atomic_set(compound_mapcount_ptr(page), 0); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); + __inc_node_page_state(page, NR_ANON_THPS); } else { /* Anon THP always mapped first with PMD */ @@ -1974,6 +1977,9 @@ void hugepage_add_new_anon_rmap(struct page *page, { BUG_ON(address < vma->vm_start || address >= vma->vm_end); atomic_set(compound_mapcount_ptr(page), 0); + if (hpage_pincount_available(page)) + atomic_set(compound_pincount_ptr(page), 0); + __page_set_anon_rmap(page, vma, address, 1); } #endif /* CONFIG_HUGETLB_PAGE */ -- cgit v1.2.3 From 1970dc6f5226416957ad0cc70ab47386ed3195a6 Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Wed, 1 Apr 2020 21:05:37 -0700 Subject: mm/gup: /proc/vmstat: pin_user_pages (FOLL_PIN) reporting MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Now that pages are "DMA-pinned" via pin_user_page*(), and unpinned via unpin_user_pages*(), we need some visibility into whether all of this is working correctly. 
Add two new fields to /proc/vmstat: nr_foll_pin_acquired nr_foll_pin_released These are documented in Documentation/core-api/pin_user_pages.rst. They represent the number of pages (since boot time) that have been pinned ("nr_foll_pin_acquired") and unpinned ("nr_foll_pin_released"), via pin_user_pages*() and unpin_user_pages*(). In the absence of long-running DMA or RDMA operations that hold pages pinned, the above two fields will normally be equal to each other. Also: update Documentation/core-api/pin_user_pages.rst, to remove an earlier (now confirmed untrue) claim about a performance problem with /proc/vmstat. Also: update Documentation/core-api/pin_user_pages.rst to rename the new /proc/vmstat entries, to the names listed here. Signed-off-by: John Hubbard Signed-off-by: Andrew Morton Reviewed-by: Jan Kara Acked-by: Kirill A. Shutemov Cc: Ira Weiny Cc: Jérôme Glisse Cc: "Matthew Wilcox (Oracle)" Cc: Al Viro Cc: Christoph Hellwig Cc: Dan Williams Cc: Dave Chinner Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Michal Hocko Cc: Mike Kravetz Cc: Shuah Khan Cc: Vlastimil Babka Link: http://lkml.kernel.org/r/20200211001536.1027652-9-jhubbard@nvidia.com Signed-off-by: Linus Torvalds --- Documentation/core-api/pin_user_pages.rst | 33 ++++++++++++++++++++++++++----- include/linux/mmzone.h | 2 ++ mm/gup.c | 13 ++++++++++++ mm/vmstat.c | 2 ++ 4 files changed, 45 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 7e5dd8b1b3f2..5c8a5f89756b 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -208,12 +208,35 @@ has the following new calls to exercise the new pin*() wrapper functions: You can monitor how many total dma-pinned pages have been acquired and released since the system was booted, via two new /proc/vmstat entries: :: - /proc/vmstat/nr_foll_pin_requested - /proc/vmstat/nr_foll_pin_requested + 
/proc/vmstat/nr_foll_pin_acquired + /proc/vmstat/nr_foll_pin_released -Those are both going to show zero, unless CONFIG_DEBUG_VM is set. This is -because there is a noticeable performance drop in unpin_user_page(), when they -are activated. +Under normal conditions, these two values will be equal unless there are any +long-term [R]DMA pins in place, or during pin/unpin transitions. + +* nr_foll_pin_acquired: This is the number of logical pins that have been + acquired since the system was powered on. For huge pages, the head page is + pinned once for each page (head page and each tail page) within the huge page. + This follows the same sort of behavior that get_user_pages() uses for huge + pages: the head page is refcounted once for each tail or head page in the huge + page, when get_user_pages() is applied to a huge page. + +* nr_foll_pin_released: The number of logical pins that have been released since + the system was powered on. Note that pages are released (unpinned) on a + PAGE_SIZE granularity, even if the original pin was applied to a huge page. + Becaused of the pin count behavior described above in "nr_foll_pin_acquired", + the accounting balances out, so that after doing this:: + + pin_user_pages(huge_page); + for (each page in huge_page) + unpin_user_page(page); + +...the following is expected:: + + nr_foll_pin_released == nr_foll_pin_acquired + +(...unless it was already out of balance due to a long-term RDMA pin being in +place.) 
References ========== diff --git a/include/linux/mmzone.h b/include/linux/mmzone.h index 462f6873905a..4bca42eeb439 100644 --- a/include/linux/mmzone.h +++ b/include/linux/mmzone.h @@ -243,6 +243,8 @@ enum node_stat_item { NR_DIRTIED, /* page dirtyings since bootup */ NR_WRITTEN, /* page writings since bootup */ NR_KERNEL_MISC_RECLAIMABLE, /* reclaimable non-slab kernel pages */ + NR_FOLL_PIN_ACQUIRED, /* via: pin_user_page(), gup flag: FOLL_PIN */ + NR_FOLL_PIN_RELEASED, /* pages returned via unpin_user_page() */ NR_VM_NODE_STAT_ITEMS }; diff --git a/mm/gup.c b/mm/gup.c index 6601df4c7682..c560c9cc0ee5 100644 --- a/mm/gup.c +++ b/mm/gup.c @@ -86,6 +86,8 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page, if (flags & FOLL_GET) return try_get_compound_head(page, refs); else if (flags & FOLL_PIN) { + int orig_refs = refs; + /* * When pinning a compound page of order > 1 (which is what * hpage_pincount_available() checks for), use an exact count to @@ -104,6 +106,9 @@ static __maybe_unused struct page *try_grab_compound_head(struct page *page, if (hpage_pincount_available(page)) hpage_pincount_add(page, refs); + mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED, + orig_refs); + return page; } @@ -158,6 +163,8 @@ bool __must_check try_grab_page(struct page *page, unsigned int flags) * once, so that the page really is pinned. 
*/ page_ref_add(page, refs); + + mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_ACQUIRED, 1); } return true; @@ -178,6 +185,7 @@ static bool __unpin_devmap_managed_user_page(struct page *page) count = page_ref_sub_return(page, refs); + mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1); /* * devmap page refcounts are 1-based, rather than 0-based: if * refcount is 1, then the page is free and the refcount is @@ -228,6 +236,8 @@ void unpin_user_page(struct page *page) if (page_ref_sub_and_test(page, refs)) __put_page(page); + + mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, 1); } EXPORT_SYMBOL(unpin_user_page); @@ -2014,6 +2024,9 @@ EXPORT_SYMBOL(get_user_pages_unlocked); static void put_compound_head(struct page *page, int refs, unsigned int flags) { if (flags & FOLL_PIN) { + mod_node_page_state(page_pgdat(page), NR_FOLL_PIN_RELEASED, + refs); + if (hpage_pincount_available(page)) hpage_pincount_sub(page, refs); else diff --git a/mm/vmstat.c b/mm/vmstat.c index 78d53378db99..c9c0d71f917f 100644 --- a/mm/vmstat.c +++ b/mm/vmstat.c @@ -1168,6 +1168,8 @@ const char * const vmstat_text[] = { "nr_dirtied", "nr_written", "nr_kernel_misc_reclaimable", + "nr_foll_pin_acquired", + "nr_foll_pin_released", /* enum writeback_stat_item counters */ "nr_dirty_threshold", -- cgit v1.2.3 From dc8fb2f282ad13e550b65958fea40c7eb766d42a Mon Sep 17 00:00:00 2001 From: John Hubbard Date: Wed, 1 Apr 2020 21:05:52 -0700 Subject: mm: dump_page(): additional diagnostics for huge pinned pages MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit As part of pin_user_pages() and related API calls, pages are "dma-pinned". For the case of compound pages of order > 1, the per-page accounting of dma pins is accomplished via the 3rd struct page in the compound page. In order to support debugging of any pin_user_pages()- related problems, enhance dump_page() so as to report the pin count in that case. 
Documentation/core-api/pin_user_pages.rst is also updated accordingly. Signed-off-by: John Hubbard Signed-off-by: Andrew Morton Acked-by: Kirill A. Shutemov Cc: Jan Kara Cc: Matthew Wilcox Cc: Ira Weiny Cc: Jérôme Glisse Cc: Al Viro Cc: Christoph Hellwig Cc: Dan Williams Cc: Dave Chinner Cc: Jason Gunthorpe Cc: Jonathan Corbet Cc: Michal Hocko Cc: Mike Kravetz Cc: Shuah Khan Cc: Vlastimil Babka Link: http://lkml.kernel.org/r/20200211001536.1027652-13-jhubbard@nvidia.com Signed-off-by: Linus Torvalds --- Documentation/core-api/pin_user_pages.rst | 7 +++++++ mm/debug.c | 21 ++++++++++++++++----- 2 files changed, 23 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/core-api/pin_user_pages.rst b/Documentation/core-api/pin_user_pages.rst index 5c8a5f89756b..2e939ff10b86 100644 --- a/Documentation/core-api/pin_user_pages.rst +++ b/Documentation/core-api/pin_user_pages.rst @@ -238,6 +238,13 @@ long-term [R]DMA pins in place, or during pin/unpin transitions. (...unless it was already out of balance due to a long-term RDMA pin being in place.) +Other diagnostics +================= + +dump_page() has been enhanced slightly, to handle these new counting fields, and +to better report on compound pages in general. Specifically, for compound pages +with order > 1, the exact (hpage_pinned_refcount) pincount is reported. + References ========== diff --git a/mm/debug.c b/mm/debug.c index f5ffb0784559..2189357f0987 100644 --- a/mm/debug.c +++ b/mm/debug.c @@ -85,11 +85,22 @@ void __dump_page(struct page *page, const char *reason) mapcount = PageSlab(head) ? 
0 : page_mapcount(page); if (compound) - pr_warn("page:%px refcount:%d mapcount:%d mapping:%p " - "index:%#lx head:%px order:%u compound_mapcount:%d\n", - page, page_ref_count(head), mapcount, - mapping, page_to_pgoff(page), head, - compound_order(head), compound_mapcount(page)); + if (hpage_pincount_available(page)) { + pr_warn("page:%px refcount:%d mapcount:%d mapping:%p " + "index:%#lx head:%px order:%u " + "compound_mapcount:%d compound_pincount:%d\n", + page, page_ref_count(head), mapcount, + mapping, page_to_pgoff(page), head, + compound_order(head), compound_mapcount(page), + compound_pincount(page)); + } else { + pr_warn("page:%px refcount:%d mapcount:%d mapping:%p " + "index:%#lx head:%px order:%u " + "compound_mapcount:%d\n", + page, page_ref_count(head), mapcount, + mapping, page_to_pgoff(page), head, + compound_order(head), compound_mapcount(page)); + } else pr_warn("page:%px refcount:%d mapcount:%d mapping:%p index:%#lx\n", page, page_ref_count(page), mapcount, -- cgit v1.2.3 From 8a931f801340c2be10552c7b5622d5f4852f3a36 Mon Sep 17 00:00:00 2001 From: Johannes Weiner Date: Wed, 1 Apr 2020 21:07:07 -0700 Subject: mm: memcontrol: recursive memory.low protection MIME-Version: 1.0 Content-Type: text/plain; charset=UTF-8 Content-Transfer-Encoding: 8bit Right now, the effective protection of any given cgroup is capped by its own explicit memory.low setting, regardless of what the parent says. The reasons for this are mostly historical and ease of implementation: to make delegation of memory.low safe, effective protection is the min() of all memory.low up the tree. Unfortunately, this limitation makes it impossible to protect an entire subtree from another without forcing the user to make explicit protection allocations all the way to the leaf cgroups - something that is highly undesirable in real life scenarios. Consider memory in a data center host. 
At the cgroup top level, we have a distinction between system management software and the actual workload the system is executing. Both branches are further subdivided into individual services, job components etc. We want to protect the workload as a whole from the system management software, but that doesn't mean we want to protect and prioritize individual workloads with respect to each other. Their memory demand can vary over time, and we'd want the VM to simply cache the hottest data within the workload subtree. Yet, the current memory.low limitations force us to allocate a fixed amount of protection to each workload component in order to get protection from system management software in general. This results in very inefficient resource distribution.

Another concern with mandating downward allocation is that, as the complexity of the cgroup tree grows, it gets harder for the lower levels to be informed about decisions made at the host level. Consider a container inside a namespace that in turn creates its own nested tree of cgroups to run multiple workloads. It'd be extremely difficult to configure memory.low parameters in those leaf cgroups that on one hand balance pressure among siblings as the container desires, while also reflecting the host-level protection from e.g. rpm upgrades, that lie beyond one or more delegation and namespacing points in the tree. It's highly unusual from a cgroup interface POV that nested levels have to be aware of and reflect decisions made at higher levels for them to be effective.

To enable such use cases and scale configurability for complex trees, this patch implements a resource inheritance model for memory that is similar to how the CPU and the IO controller implement work-conserving resource allocations: a share of a resource allocated to a subtree always applies to the entire subtree recursively, while allowing, but not mandating, children to further specify distribution rules.

That means that if protection is explicitly allocated among siblings, those configured shares are being followed during page reclaim just like they are now. However, if the memory.low set at a higher level is not fully claimed by the children in that subtree, the "floating" remainder is applied to each cgroup in the tree in proportion to its size. Since reclaim pressure is applied in proportion to size as well, each child in that tree gets the same boost, and the effect is neutral among siblings - with respect to each other, they behave as if no memory control was enabled at all, and the VM simply balances the memory demands optimally within the subtree. But collectively those cgroups enjoy a boost over the cgroups in neighboring trees.

E.g. a leaf cgroup with a memory.low setting of 0 no longer means that it's not getting a share of the hierarchically assigned resource, just that it doesn't claim a fixed amount of it to protect from its siblings.

This allows us to recursively protect one subtree (workload) from another (system management), while letting subgroups compete freely among each other - without having to assign fixed shares to each leaf, and without nested groups having to echo higher-level settings.

The floating protection composes naturally with fixed protection. Consider the following example tree:

          A      A: low = 2G
         / \    A1: low = 1G
        A1 A2   A2: low = 0G

As outside pressure is applied to this tree, A1 will enjoy a fixed protection from A2 of 1G, but the remaining, unclaimed 1G from A is split evenly among A1 and A2, coming out to 1.5G and 0.5G.

There is a slight risk of regressing theoretical setups where the top-level cgroups don't know about the true budgeting and set bogusly high "bypass" values that are meaningfully allocated down the tree. Such setups would rely on unclaimed protection to be discarded, and distributing it would change the intended behavior. Be safe and hide the new behavior behind a mount option, 'memory_recursiveprot'.

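The arithmetic of the A/A1/A2 example can be worked through with a small sketch. It follows the rule as the commit message states it (the unclaimed remainder is handed out in proportion to size) and is not the in-kernel computation, which may differ in detail. `struct toy_child`, `toy_effective_low()`, and the sizes used to exercise it are all hypothetical; the sketch also does not model children that together claim more than the parent provides.

```c
#include <assert.h>

/* Hypothetical description of one child cgroup; all values in MiB. */
struct toy_child {
	unsigned long long low;   /* explicit memory.low claim */
	unsigned long long size;  /* assumed current size of the cgroup */
};

/*
 * Effective protection of children[i]: its own claim plus a share of the
 * parent's unclaimed ("floating") protection, split in proportion to size.
 */
static inline unsigned long long
toy_effective_low(unsigned long long parent_low,
		  const struct toy_child *children, int n, int i)
{
	unsigned long long claimed = 0, total_size = 0, floating;
	int k;

	for (k = 0; k < n; k++) {
		claimed += children[k].low;
		total_size += children[k].size;
	}

	/* Nothing left to distribute: only the explicit claim applies. */
	if (parent_low <= claimed || total_size == 0)
		return children[i].low;

	floating = parent_low - claimed;
	return children[i].low + floating * children[i].size / total_size;
}
```

With A at 2048 MiB, A1 claiming 1024 MiB, A2 claiming 0, and both children assumed to be the same size, the floating 1024 MiB is split evenly, so A1 ends up with 1536 MiB and A2 with 512 MiB, matching the 1.5G and 0.5G quoted above. With unequal sizes the larger sibling would receive the larger share of the remainder.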
Signed-off-by: Johannes Weiner Signed-off-by: Andrew Morton Acked-by: Tejun Heo Acked-by: Roman Gushchin Acked-by: Chris Down Cc: Michal Hocko Cc: Michal Koutný Link: http://lkml.kernel.org/r/20200227195606.46212-4-hannes@cmpxchg.org Signed-off-by: Linus Torvalds --- Documentation/admin-guide/cgroup-v2.rst | 11 +++++++ include/linux/cgroup-defs.h | 5 ++++ kernel/cgroup/cgroup.c | 17 ++++++++++- mm/memcontrol.c | 51 ++++++++++++++++++++++++++++++--- 4 files changed, 79 insertions(+), 5 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/cgroup-v2.rst b/Documentation/admin-guide/cgroup-v2.rst index fbb111616705..bcc80269bb6a 100644 --- a/Documentation/admin-guide/cgroup-v2.rst +++ b/Documentation/admin-guide/cgroup-v2.rst @@ -188,6 +188,17 @@ cgroup v2 currently supports the following mount options. modified through remount from the init namespace. The mount option is ignored on non-init namespace mounts. + memory_recursiveprot + + Recursively apply memory.min and memory.low protection to + entire subtrees, without requiring explicit downward + propagation into leaf cgroups. This allows protecting entire + subtrees from one another, while retaining free competition + within those subtrees. This should have been the default + behavior but is a mount-option to avoid regressing setups + relying on the original semantics (e.g. specifying bogusly + high 'bypass' protection values at higher tree levels). + Organizing Processes and Threads -------------------------------- diff --git a/include/linux/cgroup-defs.h b/include/linux/cgroup-defs.h index 63097cb243cb..e1fafed22db1 100644 --- a/include/linux/cgroup-defs.h +++ b/include/linux/cgroup-defs.h @@ -94,6 +94,11 @@ enum { * Enable legacy local memory.events. 
*/ CGRP_ROOT_MEMORY_LOCAL_EVENTS = (1 << 5), + + /* + * Enable recursive subtree protection + */ + CGRP_ROOT_MEMORY_RECURSIVE_PROT = (1 << 6), }; /* cftype->flags */ diff --git a/kernel/cgroup/cgroup.c b/kernel/cgroup/cgroup.c index 915dda3f7f19..755c07d845ce 100644 --- a/kernel/cgroup/cgroup.c +++ b/kernel/cgroup/cgroup.c @@ -1813,12 +1813,14 @@ int cgroup_show_path(struct seq_file *sf, struct kernfs_node *kf_node, enum cgroup2_param { Opt_nsdelegate, Opt_memory_localevents, + Opt_memory_recursiveprot, nr__cgroup2_params }; static const struct fs_parameter_spec cgroup2_fs_parameters[] = { fsparam_flag("nsdelegate", Opt_nsdelegate), fsparam_flag("memory_localevents", Opt_memory_localevents), + fsparam_flag("memory_recursiveprot", Opt_memory_recursiveprot), {} }; @@ -1839,6 +1841,9 @@ static int cgroup2_parse_param(struct fs_context *fc, struct fs_parameter *param case Opt_memory_localevents: ctx->flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS; return 0; + case Opt_memory_recursiveprot: + ctx->flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; + return 0; } return -EINVAL; } @@ -1855,6 +1860,11 @@ static void apply_cgroup_root_flags(unsigned int root_flags) cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_LOCAL_EVENTS; else cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_LOCAL_EVENTS; + + if (root_flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) + cgrp_dfl_root.flags |= CGRP_ROOT_MEMORY_RECURSIVE_PROT; + else + cgrp_dfl_root.flags &= ~CGRP_ROOT_MEMORY_RECURSIVE_PROT; } } @@ -1864,6 +1874,8 @@ static int cgroup_show_options(struct seq_file *seq, struct kernfs_root *kf_root seq_puts(seq, ",nsdelegate"); if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_LOCAL_EVENTS) seq_puts(seq, ",memory_localevents"); + if (cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT) + seq_puts(seq, ",memory_recursiveprot"); return 0; } @@ -6412,7 +6424,10 @@ static struct kobj_attribute cgroup_delegate_attr = __ATTR_RO(delegate); static ssize_t features_show(struct kobject *kobj, struct kobj_attribute *attr, char *buf) { - return 
snprintf(buf, PAGE_SIZE, "nsdelegate\nmemory_localevents\n"); + return snprintf(buf, PAGE_SIZE, + "nsdelegate\n" + "memory_localevents\n" + "memory_recursiveprot\n"); } static struct kobj_attribute cgroup_features_attr = __ATTR_RO(features); diff --git a/mm/memcontrol.c b/mm/memcontrol.c index 0e3a8c11fb3b..07032c088608 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -6264,13 +6264,27 @@ struct cgroup_subsys memory_cgrp_subsys = { * budget is NOT proportional. A cgroup's protection from a sibling * is capped to its own memory.min/low setting. * + * 5. However, to allow protecting recursive subtrees from each other + * without having to declare each individual cgroup's fixed share + * of the ancestor's claim to protection, any unutilized - + * "floating" - protection from up the tree is distributed in + * proportion to each cgroup's *usage*. This makes the protection + * neutral wrt sibling cgroups and lets them compete freely over + * the shared parental protection budget, but it protects the + * subtree as a whole from neighboring subtrees. + * + * Note that 4. and 5. are not in conflict: 4. is about protecting + * against immediate siblings whereas 5. is about protecting against + * neighboring subtrees. */ static unsigned long effective_protection(unsigned long usage, + unsigned long parent_usage, unsigned long setting, unsigned long parent_effective, unsigned long siblings_protected) { unsigned long protected; + unsigned long ep; protected = min(usage, setting); /* @@ -6301,7 +6315,34 @@ static unsigned long effective_protection(unsigned long usage, * protection is always dependent on how memory is actually * consumed among the siblings anyway. */ - return protected; + ep = protected; + + /* + * If the children aren't claiming (all of) the protection + * afforded to them by the parent, distribute the remainder in + * proportion to the (unprotected) memory of each cgroup. 
That + * way, cgroups that aren't explicitly prioritized wrt each + * other compete freely over the allowance, but they are + * collectively protected from neighboring trees. + * + * We're using unprotected memory for the weight so that if + * some cgroups DO claim explicit protection, we don't protect + * the same bytes twice. + */ + if (!(cgrp_dfl_root.flags & CGRP_ROOT_MEMORY_RECURSIVE_PROT)) + return ep; + + if (parent_effective > siblings_protected && usage > protected) { + unsigned long unclaimed; + + unclaimed = parent_effective - siblings_protected; + unclaimed *= usage - protected; + unclaimed /= parent_usage - siblings_protected; + + ep += unclaimed; + } + + return ep; } /** @@ -6321,8 +6362,8 @@ static unsigned long effective_protection(unsigned long usage, enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, struct mem_cgroup *memcg) { + unsigned long usage, parent_usage; struct mem_cgroup *parent; - unsigned long usage; if (mem_cgroup_disabled()) return MEMCG_PROT_NONE; @@ -6347,11 +6388,13 @@ enum mem_cgroup_protection mem_cgroup_protected(struct mem_cgroup *root, goto out; } - memcg->memory.emin = effective_protection(usage, + parent_usage = page_counter_read(&parent->memory); + + memcg->memory.emin = effective_protection(usage, parent_usage, memcg->memory.min, READ_ONCE(parent->memory.emin), atomic_long_read(&parent->memory.children_min_usage)); - memcg->memory.elow = effective_protection(usage, + memcg->memory.elow = effective_protection(usage, parent_usage, memcg->memory.low, READ_ONCE(parent->memory.elow), atomic_long_read(&parent->memory.children_low_usage)); -- cgit v1.2.3 From 767e5ee54ed75a1a89d92c872d82f3fe72c15650 Mon Sep 17 00:00:00 2001 From: "Matthew Wilcox (Oracle)" Date: Wed, 1 Apr 2020 21:07:55 -0700 Subject: mm: add pagemap.h to the fine documentation The documentation currently does not include the deathless prose written to describe functions in pagemap.h because it's not included in any rst file. 
Fix up the mismatches between parameter names and the documentation and add the file to mm-api. Signed-off-by: Matthew Wilcox (Oracle) Signed-off-by: Andrew Morton Reviewed-by: Zi Yan Reviewed-by: John Hubbard Cc: Jonathan Corbet Link: http://lkml.kernel.org/r/20200221220045.24989-1-willy@infradead.org Signed-off-by: Linus Torvalds --- Documentation/core-api/mm-api.rst | 3 +++ include/linux/pagemap.h | 8 ++++---- 2 files changed, 7 insertions(+), 4 deletions(-) (limited to 'Documentation') diff --git a/Documentation/core-api/mm-api.rst b/Documentation/core-api/mm-api.rst index be726986ff75..2adffb3f7914 100644 --- a/Documentation/core-api/mm-api.rst +++ b/Documentation/core-api/mm-api.rst @@ -73,6 +73,9 @@ File Mapping and Page Cache .. kernel-doc:: mm/truncate.c :export: +.. kernel-doc:: include/linux/pagemap.h + :internal: + Memory pools ============ diff --git a/include/linux/pagemap.h b/include/linux/pagemap.h index b82eabf0268e..f56282491a48 100644 --- a/include/linux/pagemap.h +++ b/include/linux/pagemap.h @@ -33,8 +33,8 @@ enum mapping_flags { /** * mapping_set_error - record a writeback error in the address_space - * @mapping - the mapping in which an error should be set - * @error - the error to set in the mapping + * @mapping: the mapping in which an error should be set + * @error: the error to set in the mapping * * When writeback fails in some way, we must record that error so that * userspace can be informed when fsync and the like are called. We endeavor @@ -303,9 +303,9 @@ static inline struct page *find_lock_page(struct address_space *mapping, * atomic allocation! 
*/ static inline struct page *find_or_create_page(struct address_space *mapping, - pgoff_t offset, gfp_t gfp_mask) + pgoff_t index, gfp_t gfp_mask) { - return pagecache_get_page(mapping, offset, + return pagecache_get_page(mapping, index, FGP_LOCK|FGP_ACCESSED|FGP_CREAT, gfp_mask); } -- cgit v1.2.3 From 6923aa0d8c629a7853822626877dcb11f4f1d354 Mon Sep 17 00:00:00 2001 From: Sebastian Andrzej Siewior Date: Wed, 1 Apr 2020 21:10:42 -0700 Subject: mm/compaction: Disable compact_unevictable_allowed on RT Since commit 5bbe3547aa3ba ("mm: allow compaction of unevictable pages") it is allowed to examine mlocked pages and compact them by default. On -RT even minor pagefaults are problematic because it may take a few 100us to resolve them and until then the task is blocked. Make compact_unevictable_allowed = 0 default and issue a warning on RT if it is changed. [bigeasy@linutronix.de: v5] Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Link: http://lkml.kernel.org/r/20200319165536.ovi75tsr2seared4@linutronix.de Signed-off-by: Sebastian Andrzej Siewior Signed-off-by: Andrew Morton Reviewed-by: Andrew Morton Acked-by: Mel Gorman Acked-by: Vlastimil Babka Cc: Thomas Gleixner Cc: Luis Chamberlain Cc: Kees Cook Cc: Iurii Zaikin Cc: Vlastimil Babka Link: https://lore.kernel.org/linux-mm/20190710144138.qyn4tuttdq6h7kqx@linutronix.de/ Link: http://lkml.kernel.org/r/20200303202225.nhqc3v5gwlb7x6et@linutronix.de Signed-off-by: Linus Torvalds --- Documentation/admin-guide/sysctl/vm.rst | 3 +++ kernel/sysctl.c | 29 ++++++++++++++++++++++++++++- mm/compaction.c | 4 ++++ 3 files changed, 35 insertions(+), 1 deletion(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/sysctl/vm.rst b/Documentation/admin-guide/sysctl/vm.rst index 64aeee1009ca..0329a4d3fa9e 100644 --- a/Documentation/admin-guide/sysctl/vm.rst +++ b/Documentation/admin-guide/sysctl/vm.rst @@ -128,6 +128,9 @@ allowed to examine the unevictable lru (mlocked pages) for 
pages to compact. This should be used on systems where stalls for minor page faults are an acceptable trade for large contiguous free memory. Set to 0 to prevent compaction from moving pages that are unevictable. Default value is 1. +On CONFIG_PREEMPT_RT the default value is 0 in order to avoid a page fault, due +to compaction, which would block the task from becoming active until the fault +is resolved. dirty_background_bytes diff --git a/kernel/sysctl.c b/kernel/sysctl.c index cb650bb9da68..8a176d8727a3 100644 --- a/kernel/sysctl.c +++ b/kernel/sysctl.c @@ -212,6 +212,11 @@ static int proc_do_cad_pid(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); static int proc_taint(struct ctl_table *table, int write, void __user *buffer, size_t *lenp, loff_t *ppos); +#ifdef CONFIG_COMPACTION +static int proc_dointvec_minmax_warn_RT_change(struct ctl_table *table, + int write, void __user *buffer, + size_t *lenp, loff_t *ppos); +#endif #endif #ifdef CONFIG_PRINTK @@ -1467,7 +1472,7 @@ static struct ctl_table vm_table[] = { .data = &sysctl_compact_unevictable_allowed, .maxlen = sizeof(int), .mode = 0644, - .proc_handler = proc_dointvec_minmax, + .proc_handler = proc_dointvec_minmax_warn_RT_change, .extra1 = SYSCTL_ZERO, .extra2 = SYSCTL_ONE, }, @@ -2555,6 +2560,28 @@ int proc_dointvec(struct ctl_table *table, int write, return do_proc_dointvec(table, write, buffer, lenp, ppos, NULL, NULL); } +#ifdef CONFIG_COMPACTION +static int proc_dointvec_minmax_warn_RT_change(struct ctl_table *table, + int write, void __user *buffer, + size_t *lenp, loff_t *ppos) +{ + int ret, old; + + if (!IS_ENABLED(CONFIG_PREEMPT_RT) || !write) + return proc_dointvec_minmax(table, write, buffer, lenp, ppos); + + old = *(int *)table->data; + ret = proc_dointvec_minmax(table, write, buffer, lenp, ppos); + if (ret) + return ret; + if (old != *(int *)table->data) + pr_warn_once("sysctl attribute %s changed by %s[%d]\n", + table->procname, current->comm, +
task_pid_nr(current)); + return ret; +} +#endif + /** * proc_douintvec - read a vector of unsigned integers * @table: the sysctl table diff --git a/mm/compaction.c b/mm/compaction.c index 07947387244a..c589ead54fb3 100644 --- a/mm/compaction.c +++ b/mm/compaction.c @@ -1594,7 +1594,11 @@ typedef enum { * Allow userspace to control policy on scanning the unevictable LRU for * compactable pages. */ +#ifdef CONFIG_PREEMPT_RT +int sysctl_compact_unevictable_allowed __read_mostly = 0; +#else int sysctl_compact_unevictable_allowed __read_mostly = 1; +#endif static inline void update_fast_start_pfn(struct compact_control *cc, unsigned long pfn) -- cgit v1.2.3 From 6566704dafddcdff4b2a794da5b030b74e74107f Mon Sep 17 00:00:00 2001 From: Mina Almasry Date: Wed, 1 Apr 2020 21:11:41 -0700 Subject: hugetlb_cgroup: add hugetlb_cgroup reservation docs Add docs for how to use hugetlb_cgroup reservations, and their behavior. Signed-off-by: Mina Almasry Signed-off-by: Andrew Morton Cc: David Rientjes Cc: Greg Thelen Cc: Mike Kravetz Cc: Sandipan Das Cc: Shakeel Butt Cc: Shuah Khan Link: http://lkml.kernel.org/r/20200211213128.73302-9-almasrymina@google.com Signed-off-by: Linus Torvalds --- Documentation/admin-guide/cgroup-v1/hugetlb.rst | 103 +++++++++++++++++++++--- 1 file changed, 92 insertions(+), 11 deletions(-) (limited to 'Documentation') diff --git a/Documentation/admin-guide/cgroup-v1/hugetlb.rst b/Documentation/admin-guide/cgroup-v1/hugetlb.rst index a3902aa253a9..338f2c7d7a1c 100644 --- a/Documentation/admin-guide/cgroup-v1/hugetlb.rst +++ b/Documentation/admin-guide/cgroup-v1/hugetlb.rst @@ -2,13 +2,6 @@ HugeTLB Controller ================== -The HugeTLB controller allows to limit the HugeTLB usage per control group and -enforces the controller limit during page fault. Since HugeTLB doesn't -support page reclaim, enforcing the limit at page fault time implies that, -the application will get SIGBUS signal if it tries to access HugeTLB pages -beyond its limit. 
This requires the application to know beforehand how much -HugeTLB pages it would require for its use. - HugeTLB controller can be created by first mounting the cgroup filesystem. # mount -t cgroup -o hugetlb none /sys/fs/cgroup @@ -28,10 +21,14 @@ process (bash) into it. Brief summary of control files:: - hugetlb..limit_in_bytes # set/show limit of "hugepagesize" hugetlb usage - hugetlb..max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded - hugetlb..usage_in_bytes # show current usage for "hugepagesize" hugetlb - hugetlb..failcnt # show the number of allocation failure due to HugeTLB limit + hugetlb..rsvd.limit_in_bytes # set/show limit of "hugepagesize" hugetlb reservations + hugetlb..rsvd.max_usage_in_bytes # show max "hugepagesize" hugetlb reservations and no-reserve faults + hugetlb..rsvd.usage_in_bytes # show current reservations and no-reserve faults for "hugepagesize" hugetlb + hugetlb..rsvd.failcnt # show the number of allocation failure due to HugeTLB reservation limit + hugetlb..limit_in_bytes # set/show limit of "hugepagesize" hugetlb faults + hugetlb..max_usage_in_bytes # show max "hugepagesize" hugetlb usage recorded + hugetlb..usage_in_bytes # show current usage for "hugepagesize" hugetlb + hugetlb..failcnt # show the number of allocation failure due to HugeTLB usage limit For a system supporting three hugepage sizes (64k, 32M and 1G), the control files include:: @@ -40,11 +37,95 @@ files include:: hugetlb.1GB.max_usage_in_bytes hugetlb.1GB.usage_in_bytes hugetlb.1GB.failcnt + hugetlb.1GB.rsvd.limit_in_bytes + hugetlb.1GB.rsvd.max_usage_in_bytes + hugetlb.1GB.rsvd.usage_in_bytes + hugetlb.1GB.rsvd.failcnt hugetlb.64KB.limit_in_bytes hugetlb.64KB.max_usage_in_bytes hugetlb.64KB.usage_in_bytes hugetlb.64KB.failcnt + hugetlb.64KB.rsvd.limit_in_bytes + hugetlb.64KB.rsvd.max_usage_in_bytes + hugetlb.64KB.rsvd.usage_in_bytes + hugetlb.64KB.rsvd.failcnt hugetlb.32MB.limit_in_bytes hugetlb.32MB.max_usage_in_bytes hugetlb.32MB.usage_in_bytes 
  hugetlb.32MB.failcnt
+ hugetlb.32MB.rsvd.limit_in_bytes
+ hugetlb.32MB.rsvd.max_usage_in_bytes
+ hugetlb.32MB.rsvd.usage_in_bytes
+ hugetlb.32MB.rsvd.failcnt
+
+
+1. Page fault accounting
+
+hugetlb..limit_in_bytes
+hugetlb..max_usage_in_bytes
+hugetlb..usage_in_bytes
+hugetlb..failcnt
+
+The HugeTLB controller allows users to limit the HugeTLB usage (page fault) per
+control group and enforces the limit during page fault. Since HugeTLB
+doesn't support page reclaim, enforcing the limit at page fault time implies
+that the application will get a SIGBUS signal if it tries to fault in HugeTLB
+pages beyond its limit. Therefore the application needs to know exactly how many
+HugeTLB pages it uses beforehand, and the sysadmin needs to make sure that
+there are enough available on the machine for all the users to avoid processes
+getting SIGBUS.
+
+
+2. Reservation accounting
+
+hugetlb..rsvd.limit_in_bytes
+hugetlb..rsvd.max_usage_in_bytes
+hugetlb..rsvd.usage_in_bytes
+hugetlb..rsvd.failcnt
+
+The HugeTLB controller allows limiting the HugeTLB reservations per control
+group and enforces the controller limit at reservation time and at the fault of
+HugeTLB memory for which no reservation exists. Since reservation limits are
+enforced at reservation time (on mmap or shmget), reservation limits never cause
+the application to get a SIGBUS signal if the memory was reserved beforehand. For
+MAP_NORESERVE allocations, the reservation limit behaves the same as the fault
+limit, enforcing memory usage at fault time and causing the application to
+receive a SIGBUS if it's crossing its limit.
+
+Reservation limits are superior to page fault limits described above, since
+reservation limits are enforced at reservation time (on mmap or shmget), and
+never cause the application to get a SIGBUS signal if the memory was reserved
+beforehand. This allows for easier fallback to alternatives such as
+non-HugeTLB memory for example.
In the case of page fault accounting, it's very
+hard to avoid processes getting SIGBUS since the sysadmin needs to know precisely
+the HugeTLB usage of all the tasks in the system and make sure there are enough
+pages to satisfy all requests. Avoiding tasks getting SIGBUS on overcommitted
+systems is practically impossible with page fault accounting.
+
+
+3. Caveats with shared memory
+
+For shared HugeTLB memory, both HugeTLB reservation and page faults are charged
+to the first task that causes the memory to be reserved or faulted, and all
+subsequent uses of this reserved or faulted memory are done without charging.
+
+Shared HugeTLB memory is only uncharged when it is unreserved or deallocated.
+This is usually when the HugeTLB file is deleted, and not when the task that
+caused the reservation or fault has exited.
+
+
+4. Caveats with HugeTLB cgroup offline.
+
+When a HugeTLB cgroup goes offline with some reservations or faults still
+charged to it, the behavior is as follows:
+
+- The fault charges are charged to the parent HugeTLB cgroup (reparented),
+- the reservation charges remain on the offline HugeTLB cgroup.
+
+This means that if a HugeTLB cgroup gets offlined while there are still HugeTLB
+reservations charged to it, that cgroup persists as a zombie until all HugeTLB
+reservations are uncharged. HugeTLB reservations behave in this manner to match
+the memory controller whose cgroups also persist as zombie until all charged
+memory is uncharged. Also, the tracking of HugeTLB reservations is a bit more
+complex compared to the tracking of HugeTLB faults, so it is significantly
+harder to reparent reservations at offline time.
--
cgit v1.2.3
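The fallback that the reservation accounting text mentions could look roughly like the sketch below. It assumes a Linux system with glibc; map_with_fallback() is a hypothetical helper, not part of any kernel or libc interface. Because MAP_NORESERVE is not passed, the huge pages are reserved when mmap() is called, so a reservation that cannot be satisfied shows up as a failed mmap() that the program can handle.

```c
#define _GNU_SOURCE
#include <stddef.h>
#include <sys/mman.h>

/*
 * Map len bytes of anonymous memory, preferring HugeTLB pages.
 * len should be a multiple of the huge page size.
 *
 * Sets *is_huge to 1 when the HugeTLB mapping was set up and to 0
 * when regular pages were used instead. Returns NULL if both
 * attempts fail.
 */
void *map_with_fallback(size_t len, int *is_huge)
{
	void *p;

	/* No MAP_NORESERVE: the reservation is made here, at mmap time. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS | MAP_HUGETLB, -1, 0);
	if (p != MAP_FAILED) {
		*is_huge = 1;
		return p;
	}

	/* The reservation was refused: fall back to regular pages. */
	p = mmap(NULL, len, PROT_READ | PROT_WRITE,
		 MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
	if (p == MAP_FAILED)
		return NULL;

	*is_huge = 0;
	return p;
}
```

A caller would check the return value for NULL and use is_huge only to decide how to report or tune the allocation. With page fault accounting alone, the same overcommit would be noticed only when the memory is touched.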