mm: page_alloc: fair zone allocator policy

Each zone that holds userspace pages of one workload must be aged at a
speed proportional to the zone size. Otherwise, the time an individual
page gets to stay in memory depends on the zone it happened to be
allocated in. Asymmetry in the zone aging creates rather unpredictable
aging behavior and results in the wrong pages being reclaimed, activated
etc.

But exactly this happens right now because of the way the page allocator
and kswapd interact. The page allocator uses per-node lists of all zones
in the system, ordered by preference, when allocating a new page. When
the first iteration does not yield any results, kswapd is woken up and the
allocator retries. Due to the way kswapd reclaims zones below the high
watermark while a zone can be allocated from when it is above the low
watermark, the allocator may keep kswapd running while kswapd reclaim
ensures that the page allocator can keep allocating from the first zone in
the zonelist for extended periods of time. Meanwhile the other zones
rarely see new allocations and thus get aged much slower in comparison.

The result is that the occasional page placed in lower zones gets
relatively more time in memory, even gets promoted to the active list
after its peers have long been evicted. Meanwhile, the bulk of the
working set may be thrashing on the preferred zone even though there may
be significant amounts of memory available in the lower zones.

Even the most basic test -- repeatedly reading a file slightly bigger than
memory -- shows how broken the zone aging is. In this scenario, no single
page should be able to stay in memory long enough to get referenced twice
and activated, but activation happens in spades:

  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 0
      nr_active_file 8
      nr_inactive_file 1582
      nr_active_file 11994
  $ cat data data data data >/dev/null
  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 70
      nr_inactive_file 258753
      nr_active_file 443214
      nr_inactive_file 149793
      nr_active_file 12021

Fix this with a very simple round robin allocator. Each zone is allowed a
batch of allocations that is proportional to the zone's size, after which
it is treated as full. The batch counters are reset when all zones have
been tried and the allocator enters the slowpath and kicks off kswapd
reclaim. Allocation and reclaim are now fairly spread out to all
available/allowable zones:

  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 174
      nr_active_file 4865
      nr_inactive_file 53
      nr_active_file 860
  $ cat data data data data >/dev/null
  $ grep active_file /proc/zoneinfo
      nr_inactive_file 0
      nr_active_file 0
      nr_inactive_file 666622
      nr_active_file 4988
      nr_inactive_file 190969
      nr_active_file 937

When zone_reclaim_mode is enabled, allocations will now spread out to all
zones on the local node, not just the first preferred zone (which on a 4G
node might be a tiny Normal zone).

Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
Acked-by: Mel Gorman <mgorman@suse.de>
Reviewed-by: Rik van Riel <riel@redhat.com>
Cc: Andrea Arcangeli <aarcange@redhat.com>
Cc: Paul Bolle <paul.bollee@gmail.com>
Cc: Zlatko Calusic <zcalusic@bitsync.net>
Tested-by: Kevin Hilman <khilman@linaro.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
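
The round robin policy described in the message above can be illustrated with a small userspace model. This is only a sketch under stated assumptions, not the kernel code: `struct toy_zone`, `BATCH_DIVISOR`, and all function names are hypothetical, and the batch is simply taken as a fixed fraction of the zone size. The model does no reclaim; the comment in the slow path marks where the real allocator would kick off kswapd.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical, simplified stand-in for a memory zone (not struct zone). */
struct toy_zone {
	const char *name;
	long managed_pages;	/* zone size, used to scale the batch */
	long free_pages;	/* pages still available in this model */
	long alloc_batch;	/* allocations left in the current round */
	long nr_allocated;	/* bookkeeping for the example */
};

/* Assumption: one batch unit per BATCH_DIVISOR pages, at least one. */
#define BATCH_DIVISOR 4

/* Give every zone a fresh batch proportional to its size. */
static void reset_alloc_batches(struct toy_zone *zones, size_t nr_zones)
{
	for (size_t i = 0; i < nr_zones; i++) {
		long batch = zones[i].managed_pages / BATCH_DIVISOR;

		zones[i].alloc_batch = batch > 0 ? batch : 1;
	}
}

/*
 * Fast path: walk the zones in preference order and take a page from
 * the first one that still has batch budget and free pages. A zone
 * whose batch is used up is treated as full and skipped.
 */
static struct toy_zone *try_zones(struct toy_zone *zones, size_t nr_zones)
{
	for (size_t i = 0; i < nr_zones; i++) {
		struct toy_zone *z = &zones[i];

		if (z->alloc_batch <= 0 || z->free_pages <= 0)
			continue;
		z->alloc_batch--;
		z->free_pages--;
		z->nr_allocated++;
		return z;
	}
	return NULL;
}

/*
 * Allocate one page. When every zone has been tried without success,
 * enter the slow path: reset the batches (the real allocator would also
 * wake kswapd at this point) and retry once.
 */
static struct toy_zone *toy_alloc_page(struct toy_zone *zones, size_t nr_zones)
{
	struct toy_zone *z;

	assert(zones != NULL && nr_zones > 0);

	z = try_zones(zones, nr_zones);
	if (z != NULL)
		return z;

	reset_alloc_batches(zones, nr_zones);
	return try_zones(zones, nr_zones);
}
```

With two zones of 3000 and 1000 pages, calling `toy_alloc_page()` repeatedly places allocations in a 3:1 ratio, so both zones fill (and, in a real system, would age) at a speed proportional to their size instead of the first zone absorbing nearly everything.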
author: Johannes Weiner <hannes@cmpxchg.org> 2013-09-11 23:20:47 +0200
committer: Linus Torvalds <torvalds@linux-foundation.org> 2013-09-12 00:57:23 +0200
commit: 81c0a2bb515fd4daae8cab64352877480792b515 (patch)
tree: 5ef326d226fdd14332cd0e5382e6dd2759dd08e3 /mm/vmstat.c
parent: mm: page_alloc: rearrange watermark checking in get_page_from_freelist (diff)
1 file changed, 1 insertion, 0 deletions
diff --git a/mm/vmstat.c b/mm/vmstat.c
index ca06e9653827..8a8da1f9b044 100644
--- a/mm/vmstat.c
+++ b/mm/vmstat.c
@@ -703,6 +703,7 @@ static void walk_zones_in_node(struct seq_file *m, pg_data_t *pgdat,
 const char * const vmstat_text[] = {
 	/* Zoned VM counters */
 	"nr_free_pages",
+	"nr_alloc_batch",
 	"nr_inactive_anon",
 	"nr_active_anon",
 	"nr_inactive_file",
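
Since the diff adds "nr_alloc_batch" to vmstat_text, the counter would presumably show up next to the other per-zone counters that the commit message greps out of /proc/zoneinfo. The following is a minimal sketch of pulling one counter per zone out of such text. It assumes one "name value" pair per line, as in the grep output quoted in the commit message; `collect_counter` is a hypothetical helper, not a kernel or libc function, and it works on a string so it can be tried without a matching kernel.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Scan zoneinfo-style text line by line and store the value of every
 * line whose first token equals `name`. Returns the number of values
 * stored (at most max_out). Lines of 128 bytes or more are skipped.
 */
static size_t collect_counter(const char *text, const char *name,
			      long *out, size_t max_out)
{
	size_t found = 0;
	const char *line = text;

	assert(text != NULL && name != NULL && out != NULL);

	while (*line != '\0' && found < max_out) {
		const char *nl = strchr(line, '\n');
		size_t len = nl ? (size_t)(nl - line) : strlen(line);
		char buf[128];
		char key[64];
		long value;

		if (len < sizeof(buf)) {
			/* Copy the line so sscanf cannot read into the next one. */
			memcpy(buf, line, len);
			buf[len] = '\0';
			if (sscanf(buf, "%63s %ld", key, &value) == 2 &&
			    strcmp(key, name) == 0)
				out[found++] = value;
		}
		if (nl == NULL)
			break;
		line = nl + 1;
	}
	return found;
}
```

Feeding it the contents of /proc/zoneinfo and the name "nr_alloc_batch" would yield one value per zone, in the order the zones are listed, which makes it easy to watch the batches drain and refill while a workload runs.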