diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2024-01-09 20:18:47 +0100 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2024-01-09 20:18:47 +0100 |
commit | fb46e22a9e3863e08aef8815df9f17d0f4b9aede (patch) | |
tree | 83e052911fa8d8d90bcf9de2796e17e19040613f /mm/memcontrol.c | |
parent | Merge tag 'slab-for-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/vba... (diff) | |
parent | mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER (diff) | |
download | linux-fb46e22a9e3863e08aef8815df9f17d0f4b9aede.tar.xz linux-fb46e22a9e3863e08aef8815df9f17d0f4b9aede.zip |
Merge tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm
Pull MM updates from Andrew Morton:
"Many singleton patches against the MM code. The patch series which are
included in this merge do the following:
- Peng Zhang has done some mapletree maintainance work in the series
'maple_tree: add mt_free_one() and mt_attr() helpers'
'Some cleanups of maple tree'
- In the series 'mm: use memmap_on_memory semantics for dax/kmem'
Vishal Verma has altered the interworking between memory-hotplug
and dax/kmem so that newly added 'device memory' can more easily
have its memmap placed within that newly added memory.
- Matthew Wilcox continues folio-related work (including a few fixes)
in the patch series
'Add folio_zero_tail() and folio_fill_tail()'
'Make folio_start_writeback return void'
'Fix fault handler's handling of poisoned tail pages'
'Convert aops->error_remove_page to ->error_remove_folio'
'Finish two folio conversions'
'More swap folio conversions'
- Kefeng Wang has also contributed folio-related work in the series
'mm: cleanup and use more folio in page fault'
- Jim Cromie has improved the kmemleak reporting output in the series
'tweak kmemleak report format'.
- In the series 'stackdepot: allow evicting stack traces' Andrey
Konovalov to permits clients (in this case KASAN) to cause eviction
of no longer needed stack traces.
- Charan Teja Kalla has fixed some accounting issues in the page
allocator's atomic reserve calculations in the series 'mm:
page_alloc: fixes for high atomic reserve caluculations'.
- Dmitry Rokosov has added to the samples/ dorectory some sample code
for a userspace memcg event listener application. See the series
'samples: introduce cgroup events listeners'.
- Some mapletree maintanance work from Liam Howlett in the series
'maple_tree: iterator state changes'.
- Nhat Pham has improved zswap's approach to writeback in the series
'workload-specific and memory pressure-driven zswap writeback'.
- DAMON/DAMOS feature and maintenance work from SeongJae Park in the
series
'mm/damon: let users feed and tame/auto-tune DAMOS'
'selftests/damon: add Python-written DAMON functionality tests'
'mm/damon: misc updates for 6.8'
- Yosry Ahmed has improved memcg's stats flushing in the series 'mm:
memcg: subtree stats flushing and thresholds'.
- In the series 'Multi-size THP for anonymous memory' Ryan Roberts
has added a runtime opt-in feature to transparent hugepages which
improves performance by allocating larger chunks of memory during
anonymous page faults.
- Matthew Wilcox has also contributed some cleanup and maintenance
work against eh buffer_head code int he series 'More buffer_head
cleanups'.
- Suren Baghdasaryan has done work on Andrea Arcangeli's series
'userfaultfd move option'. UFFDIO_MOVE permits userspace heap
compaction algorithms to move userspace's pages around rather than
UFFDIO_COPY'a alloc/copy/free.
- Stefan Roesch has developed a 'KSM Advisor', in the series 'mm/ksm:
Add ksm advisor'. This is a governor which tunes KSM's scanning
aggressiveness in response to userspace's current needs.
- Chengming Zhou has optimized zswap's temporary working memory use
in the series 'mm/zswap: dstmem reuse optimizations and cleanups'.
- Matthew Wilcox has performed some maintenance work on the writeback
code, both code and within filesystems. The series is 'Clean up the
writeback paths'.
- Andrey Konovalov has optimized KASAN's handling of alloc and free
stack traces for secondary-level allocators, in the series 'kasan:
save mempool stack traces'.
- Andrey also performed some KASAN maintenance work in the series
'kasan: assorted clean-ups'.
- David Hildenbrand has gone to town on the rmap code. Cleanups, more
pte batching, folio conversions and more. See the series 'mm/rmap:
interface overhaul'.
- Kinsey Ho has contributed some maintenance work on the MGLRU code
in the series 'mm/mglru: Kconfig cleanup'.
- Matthew Wilcox has contributed lruvec page accounting code cleanups
in the series 'Remove some lruvec page accounting functions'"
* tag 'mm-stable-2024-01-08-15-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (361 commits)
mm, treewide: rename MAX_ORDER to MAX_PAGE_ORDER
mm, treewide: introduce NR_PAGE_ORDERS
selftests/mm: add separate UFFDIO_MOVE test for PMD splitting
selftests/mm: skip test if application doesn't has root privileges
selftests/mm: conform test to TAP format output
selftests: mm: hugepage-mmap: conform to TAP format output
selftests/mm: gup_test: conform test to TAP format output
mm/selftests: hugepage-mremap: conform test to TAP format output
mm/vmstat: move pgdemote_* out of CONFIG_NUMA_BALANCING
mm: zsmalloc: return -ENOSPC rather than -EINVAL in zs_malloc while size is too large
mm/memcontrol: remove __mod_lruvec_page_state()
mm/khugepaged: use a folio more in collapse_file()
slub: use a folio in __kmalloc_large_node
slub: use folio APIs in free_large_kmalloc()
slub: use alloc_pages_node() in alloc_slab_page()
mm: remove inc/dec lruvec page state functions
mm: ratelimit stat flush from workingset shrinker
kasan: stop leaking stack trace handles
mm/mglru: remove CONFIG_TRANSPARENT_HUGEPAGE
mm/mglru: add dummy pmd_dirty()
...
Diffstat (limited to 'mm/memcontrol.c')
-rw-r--r-- | mm/memcontrol.c | 319 |
1 files changed, 196 insertions, 123 deletions
diff --git a/mm/memcontrol.c b/mm/memcontrol.c index d58ec11317c7..e4c8735e7c85 100644 --- a/mm/memcontrol.c +++ b/mm/memcontrol.c @@ -574,116 +574,6 @@ mem_cgroup_largest_soft_limit_node(struct mem_cgroup_tree_per_node *mctz) return mz; } -/* - * memcg and lruvec stats flushing - * - * Many codepaths leading to stats update or read are performance sensitive and - * adding stats flushing in such codepaths is not desirable. So, to optimize the - * flushing the kernel does: - * - * 1) Periodically and asynchronously flush the stats every 2 seconds to not let - * rstat update tree grow unbounded. - * - * 2) Flush the stats synchronously on reader side only when there are more than - * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization - * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but - * only for 2 seconds due to (1). - */ -static void flush_memcg_stats_dwork(struct work_struct *w); -static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork); -static DEFINE_PER_CPU(unsigned int, stats_updates); -static atomic_t stats_flush_ongoing = ATOMIC_INIT(0); -static atomic_t stats_flush_threshold = ATOMIC_INIT(0); -static u64 flush_next_time; - -#define FLUSH_TIME (2UL*HZ) - -/* - * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can - * not rely on this as part of an acquired spinlock_t lock. These functions are - * never used in hardirq context on PREEMPT_RT and therefore disabling preemtion - * is sufficient. - */ -static void memcg_stats_lock(void) -{ - preempt_disable_nested(); - VM_WARN_ON_IRQS_ENABLED(); -} - -static void __memcg_stats_lock(void) -{ - preempt_disable_nested(); -} - -static void memcg_stats_unlock(void) -{ - preempt_enable_nested(); -} - -static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) -{ - unsigned int x; - - if (!val) - return; - - cgroup_rstat_updated(memcg->css.cgroup, smp_processor_id()); - - x = __this_cpu_add_return(stats_updates, abs(val)); - if (x > MEMCG_CHARGE_BATCH) { - /* - * If stats_flush_threshold exceeds the threshold - * (>num_online_cpus()), cgroup stats update will be triggered - * in __mem_cgroup_flush_stats(). Increasing this var further - * is redundant and simply adds overhead in atomic update. - */ - if (atomic_read(&stats_flush_threshold) <= num_online_cpus()) - atomic_add(x / MEMCG_CHARGE_BATCH, &stats_flush_threshold); - __this_cpu_write(stats_updates, 0); - } -} - -static void do_flush_stats(void) -{ - /* - * We always flush the entire tree, so concurrent flushers can just - * skip. This avoids a thundering herd problem on the rstat global lock - * from memcg flushers (e.g. reclaim, refault, etc). - */ - if (atomic_read(&stats_flush_ongoing) || - atomic_xchg(&stats_flush_ongoing, 1)) - return; - - WRITE_ONCE(flush_next_time, jiffies_64 + 2*FLUSH_TIME); - - cgroup_rstat_flush(root_mem_cgroup->css.cgroup); - - atomic_set(&stats_flush_threshold, 0); - atomic_set(&stats_flush_ongoing, 0); -} - -void mem_cgroup_flush_stats(void) -{ - if (atomic_read(&stats_flush_threshold) > num_online_cpus()) - do_flush_stats(); -} - -void mem_cgroup_flush_stats_ratelimited(void) -{ - if (time_after64(jiffies_64, READ_ONCE(flush_next_time))) - mem_cgroup_flush_stats(); -} - -static void flush_memcg_stats_dwork(struct work_struct *w) -{ - /* - * Always flush here so that flushing in latency-sensitive paths is - * as cheap as possible. - */ - do_flush_stats(); - queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME); -} - /* Subset of vm_event_item to report for memcg event stats */ static const unsigned int memcg_vm_event_stat[] = { PGPGIN, @@ -704,6 +594,7 @@ static const unsigned int memcg_vm_event_stat[] = { #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) ZSWPIN, ZSWPOUT, + ZSWPWB, #endif #ifdef CONFIG_TRANSPARENT_HUGEPAGE THP_FAULT_ALLOC, @@ -741,6 +632,9 @@ struct memcg_vmstats_percpu { /* Cgroup1: threshold notifications & softlimit tree updates */ unsigned long nr_page_events; unsigned long targets[MEM_CGROUP_NTARGETS]; + + /* Stats updates since the last flush */ + unsigned int stats_updates; }; struct memcg_vmstats { @@ -755,8 +649,134 @@ struct memcg_vmstats { /* Pending child counts during tree propagation */ long state_pending[MEMCG_NR_STAT]; unsigned long events_pending[NR_MEMCG_EVENTS]; + + /* Stats updates since the last flush */ + atomic64_t stats_updates; }; +/* + * memcg and lruvec stats flushing + * + * Many codepaths leading to stats update or read are performance sensitive and + * adding stats flushing in such codepaths is not desirable. So, to optimize the + * flushing the kernel does: + * + * 1) Periodically and asynchronously flush the stats every 2 seconds to not let + * rstat update tree grow unbounded. + * + * 2) Flush the stats synchronously on reader side only when there are more than + * (MEMCG_CHARGE_BATCH * nr_cpus) update events. Though this optimization + * will let stats be out of sync by atmost (MEMCG_CHARGE_BATCH * nr_cpus) but + * only for 2 seconds due to (1). + */ +static void flush_memcg_stats_dwork(struct work_struct *w); +static DECLARE_DEFERRABLE_WORK(stats_flush_dwork, flush_memcg_stats_dwork); +static u64 flush_last_time; + +#define FLUSH_TIME (2UL*HZ) + +/* + * Accessors to ensure that preemption is disabled on PREEMPT_RT because it can + * not rely on this as part of an acquired spinlock_t lock. These functions are + * never used in hardirq context on PREEMPT_RT and therefore disabling preemtion + * is sufficient. + */ +static void memcg_stats_lock(void) +{ + preempt_disable_nested(); + VM_WARN_ON_IRQS_ENABLED(); +} + +static void __memcg_stats_lock(void) +{ + preempt_disable_nested(); +} + +static void memcg_stats_unlock(void) +{ + preempt_enable_nested(); +} + + +static bool memcg_should_flush_stats(struct mem_cgroup *memcg) +{ + return atomic64_read(&memcg->vmstats->stats_updates) > + MEMCG_CHARGE_BATCH * num_online_cpus(); +} + +static inline void memcg_rstat_updated(struct mem_cgroup *memcg, int val) +{ + int cpu = smp_processor_id(); + unsigned int x; + + if (!val) + return; + + cgroup_rstat_updated(memcg->css.cgroup, cpu); + + for (; memcg; memcg = parent_mem_cgroup(memcg)) { + x = __this_cpu_add_return(memcg->vmstats_percpu->stats_updates, + abs(val)); + + if (x < MEMCG_CHARGE_BATCH) + continue; + + /* + * If @memcg is already flush-able, increasing stats_updates is + * redundant. Avoid the overhead of the atomic update. + */ + if (!memcg_should_flush_stats(memcg)) + atomic64_add(x, &memcg->vmstats->stats_updates); + __this_cpu_write(memcg->vmstats_percpu->stats_updates, 0); + } +} + +static void do_flush_stats(struct mem_cgroup *memcg) +{ + if (mem_cgroup_is_root(memcg)) + WRITE_ONCE(flush_last_time, jiffies_64); + + cgroup_rstat_flush(memcg->css.cgroup); +} + +/* + * mem_cgroup_flush_stats - flush the stats of a memory cgroup subtree + * @memcg: root of the subtree to flush + * + * Flushing is serialized by the underlying global rstat lock. There is also a + * minimum amount of work to be done even if there are no stat updates to flush. + * Hence, we only flush the stats if the updates delta exceeds a threshold. This + * avoids unnecessary work and contention on the underlying lock. + */ +void mem_cgroup_flush_stats(struct mem_cgroup *memcg) +{ + if (mem_cgroup_disabled()) + return; + + if (!memcg) + memcg = root_mem_cgroup; + + if (memcg_should_flush_stats(memcg)) + do_flush_stats(memcg); +} + +void mem_cgroup_flush_stats_ratelimited(struct mem_cgroup *memcg) +{ + /* Only flush if the periodic flusher is one full cycle late */ + if (time_after64(jiffies_64, READ_ONCE(flush_last_time) + 2*FLUSH_TIME)) + mem_cgroup_flush_stats(memcg); +} + +static void flush_memcg_stats_dwork(struct work_struct *w) +{ + /* + * Deliberately ignore memcg_should_flush_stats() here so that flushing + * in latency-sensitive paths is as cheap as possible. + */ + do_flush_stats(root_mem_cgroup); + queue_delayed_work(system_unbound_wq, &stats_flush_dwork, FLUSH_TIME); +} + unsigned long memcg_page_state(struct mem_cgroup *memcg, int idx) { long x = READ_ONCE(memcg->vmstats->state[idx]); @@ -871,16 +891,15 @@ void __mod_lruvec_state(struct lruvec *lruvec, enum node_stat_item idx, __mod_memcg_lruvec_state(lruvec, idx, val); } -void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx, +void __lruvec_stat_mod_folio(struct folio *folio, enum node_stat_item idx, int val) { - struct page *head = compound_head(page); /* rmap on tail pages */ struct mem_cgroup *memcg; - pg_data_t *pgdat = page_pgdat(page); + pg_data_t *pgdat = folio_pgdat(folio); struct lruvec *lruvec; rcu_read_lock(); - memcg = page_memcg(head); + memcg = folio_memcg(folio); /* Untracked pages have no memcg, no lruvec. Update only the node */ if (!memcg) { rcu_read_unlock(); @@ -892,7 +911,7 @@ void __mod_lruvec_page_state(struct page *page, enum node_stat_item idx, __mod_lruvec_state(lruvec, idx, val); rcu_read_unlock(); } -EXPORT_SYMBOL(__mod_lruvec_page_state); +EXPORT_SYMBOL(__lruvec_stat_mod_folio); void __mod_lruvec_kmem_state(void *p, enum node_stat_item idx, int val) { @@ -1628,7 +1647,7 @@ static void memcg_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) * * Current memory state: */ - mem_cgroup_flush_stats(); + mem_cgroup_flush_stats(memcg); for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { u64 size; @@ -4178,7 +4197,7 @@ static int memcg_numa_stat_show(struct seq_file *m, void *v) int nid; struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - mem_cgroup_flush_stats(); + mem_cgroup_flush_stats(memcg); for (stat = stats; stat < stats + ARRAY_SIZE(stats); stat++) { seq_printf(m, "%s=%lu", stat->name, @@ -4259,7 +4278,7 @@ static void memcg1_stat_format(struct mem_cgroup *memcg, struct seq_buf *s) BUILD_BUG_ON(ARRAY_SIZE(memcg1_stat_names) != ARRAY_SIZE(memcg1_stats)); - mem_cgroup_flush_stats(); + mem_cgroup_flush_stats(memcg); for (i = 0; i < ARRAY_SIZE(memcg1_stats); i++) { unsigned long nr; @@ -4755,7 +4774,7 @@ void mem_cgroup_wb_stats(struct bdi_writeback *wb, unsigned long *pfilepages, struct mem_cgroup *memcg = mem_cgroup_from_css(wb->memcg_css); struct mem_cgroup *parent; - mem_cgroup_flush_stats(); + mem_cgroup_flush_stats(memcg); *pdirty = memcg_page_state(memcg, NR_FILE_DIRTY); *pwriteback = memcg_page_state(memcg, NR_WRITEBACK); @@ -5518,6 +5537,8 @@ mem_cgroup_css_alloc(struct cgroup_subsys_state *parent_css) WRITE_ONCE(memcg->soft_limit, PAGE_COUNTER_MAX); #if defined(CONFIG_MEMCG_KMEM) && defined(CONFIG_ZSWAP) memcg->zswap_max = PAGE_COUNTER_MAX; + WRITE_ONCE(memcg->zswap_writeback, + !parent || READ_ONCE(parent->zswap_writeback)); #endif page_counter_set_high(&memcg->swap, PAGE_COUNTER_MAX); if (parent) { @@ -5614,6 +5635,8 @@ static void mem_cgroup_css_offline(struct cgroup_subsys_state *css) page_counter_set_min(&memcg->memory, 0); page_counter_set_low(&memcg->memory, 0); + zswap_memcg_offline_cleanup(memcg); + memcg_offline_kmem(memcg); reparent_shrinker_deferred(memcg); wb_memcg_offline(memcg); @@ -5784,6 +5807,10 @@ static void mem_cgroup_css_rstat_flush(struct cgroup_subsys_state *css, int cpu) } } } + statc->stats_updates = 0; + /* We are in a per-cpu loop here, only do the atomic write once */ + if (atomic64_read(&memcg->vmstats->stats_updates)) + atomic64_set(&memcg->vmstats->stats_updates, 0); } #ifdef CONFIG_MMU @@ -6783,6 +6810,10 @@ static ssize_t memory_max_write(struct kernfs_open_file *of, return nbytes; } +/* + * Note: don't forget to update the 'samples/cgroup/memcg_event_listener' + * if any new events become available. + */ static void __memory_events_show(struct seq_file *m, atomic_long_t *events) { seq_printf(m, "low %lu\n", atomic_long_read(&events[MEMCG_LOW])); @@ -6839,7 +6870,7 @@ static int memory_numa_stat_show(struct seq_file *m, void *v) int i; struct mem_cgroup *memcg = mem_cgroup_from_seq(m); - mem_cgroup_flush_stats(); + mem_cgroup_flush_stats(memcg); for (i = 0; i < ARRAY_SIZE(memory_stats); i++) { int nid; @@ -8081,7 +8112,11 @@ bool obj_cgroup_may_zswap(struct obj_cgroup *objcg) break; } - cgroup_rstat_flush(memcg->css.cgroup); + /* + * mem_cgroup_flush_stats() ignores small changes. Use + * do_flush_stats() directly to get accurate stats for charging. + */ + do_flush_stats(memcg); pages = memcg_page_state(memcg, MEMCG_ZSWAP_B) / PAGE_SIZE; if (pages < max) continue; @@ -8143,11 +8178,19 @@ void obj_cgroup_uncharge_zswap(struct obj_cgroup *objcg, size_t size) rcu_read_unlock(); } +bool mem_cgroup_zswap_writeback_enabled(struct mem_cgroup *memcg) +{ + /* if zswap is disabled, do not block pages going to the swapping device */ + return !is_zswap_enabled() || !memcg || READ_ONCE(memcg->zswap_writeback); +} + static u64 zswap_current_read(struct cgroup_subsys_state *css, struct cftype *cft) { - cgroup_rstat_flush(css->cgroup); - return memcg_page_state(mem_cgroup_from_css(css), MEMCG_ZSWAP_B); + struct mem_cgroup *memcg = mem_cgroup_from_css(css); + + mem_cgroup_flush_stats(memcg); + return memcg_page_state(memcg, MEMCG_ZSWAP_B); } static int zswap_max_show(struct seq_file *m, void *v) @@ -8173,6 +8216,31 @@ static ssize_t zswap_max_write(struct kernfs_open_file *of, return nbytes; } +static int zswap_writeback_show(struct seq_file *m, void *v) +{ + struct mem_cgroup *memcg = mem_cgroup_from_seq(m); + + seq_printf(m, "%d\n", READ_ONCE(memcg->zswap_writeback)); + return 0; +} + +static ssize_t zswap_writeback_write(struct kernfs_open_file *of, + char *buf, size_t nbytes, loff_t off) +{ + struct mem_cgroup *memcg = mem_cgroup_from_css(of_css(of)); + int zswap_writeback; + ssize_t parse_ret = kstrtoint(strstrip(buf), 0, &zswap_writeback); + + if (parse_ret) + return parse_ret; + + if (zswap_writeback != 0 && zswap_writeback != 1) + return -EINVAL; + + WRITE_ONCE(memcg->zswap_writeback, zswap_writeback); + return nbytes; +} + static struct cftype zswap_files[] = { { .name = "zswap.current", @@ -8185,6 +8253,11 @@ static struct cftype zswap_files[] = { .seq_show = zswap_max_show, .write = zswap_max_write, }, + { + .name = "zswap.writeback", + .seq_show = zswap_writeback_show, + .write = zswap_writeback_write, + }, { } /* terminate */ }; #endif /* CONFIG_MEMCG_KMEM && CONFIG_ZSWAP */ |