mm, oom: do not rely on TIF_MEMDIE for memory reserves access

For ages we have been relying on TIF_MEMDIE thread flag to mark OOM victims and then, among other things, to give these threads full access to memory reserves. There are few shortcomings of this implementation, though. First of all and the most serious one is that the full access to memory reserves is quite dangerous because we leave no safety room for the system to operate and potentially do last emergency steps to move on. Secondly this flag is per task_struct while the OOM killer operates on mm_struct granularity so all processes sharing the given mm are killed. Giving the full access to all these task_structs could lead to a quick memory reserves depletion. We have tried to reduce this risk by giving TIF_MEMDIE only to the main thread and the currently allocating task but that doesn't really solve this problem while it surely opens up a room for corner cases - e.g. GFP_NO{FS,IO} requests might loop inside the allocator without access to memory reserves because a particular thread was not the group leader. Now that we have the oom reaper and that all oom victims are reapable after 1b51e65eab64 ("oom, oom_reaper: allow to reap mm shared by the kthreads") we can be more conservative and grant only partial access to memory reserves because there are reasonable chances of the parallel memory freeing. We still want some access to reserves because we do not want other consumers to eat up the victim's freed memory. oom victims will still contend with __GFP_HIGH users but those shouldn't be so aggressive to starve oom victims completely. Introduce ALLOC_OOM flag and give all tsk_is_oom_victim tasks access to the half of the reserves. This makes the access to reserves independent on which task has passed through mark_oom_victim. Also drop any usage of TIF_MEMDIE from the page allocator proper and replace it by tsk_is_oom_victim as well which will make page_alloc.c completely TIF_MEMDIE free finally. CONFIG_MMU=n doesn't have oom reaper so let's stick to the original ALLOC_NO_WATERMARKS approach. There is a demand to make the oom killer memcg aware which will imply many tasks killed at once. This change will allow such a usecase without worrying about complete memory reserves depletion. Link: http://lkml.kernel.org/r/20170810075019.28998-2-mhocko@kernel.org Signed-off-by: Michal Hocko <mhocko@suse.com> Acked-by: Mel Gorman <mgorman@techsingularity.net> Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Cc: David Rientjes <rientjes@google.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Roman Gushchin <guro@fb.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
author: Michal Hocko <mhocko@suse.com> 2017-09-07 01:24:50 +0200
committer: Linus Torvalds <torvalds@linux-foundation.org> 2017-09-07 02:27:30 +0200
commit: cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90 (patch)
tree: 3285aa754d29c10b6ea6062eed62c65f091b0ff5 /mm/internal.h
parent: z3fold: use per-cpu unbuddied lists (diff)
download: linux-cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90.tar.xz
linux-cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90.zip
1 files changed, 11 insertions, 0 deletions
diff --git a/mm/internal.h b/mm/internal.h
index 781c0d54d75a..1df011f62480 100644
--- a/mm/internal.h
+++ b/mm/internal.h
@@ -480,6 +480,17 @@ unsigned long reclaim_clean_pages_from_list(struct zone *zone,
 /* Mask to get the watermark bits */
 #define ALLOC_WMARK_MASK	(ALLOC_NO_WATERMARKS-1)
 
+/*
+ * Only MMU archs have async oom victim reclaim - aka oom_reaper so we
+ * cannot assume a reduced access to memory reserves is sufficient for
+ * !MMU
+ */
+#ifdef CONFIG_MMU
+#define ALLOC_OOM		0x08
+#else
+#define ALLOC_OOM		ALLOC_NO_WATERMARKS
+#endif
+
 #define ALLOC_HARDER		0x10 /* try to alloc harder */
 #define ALLOC_HIGH		0x20 /* __GFP_HIGH set */
 #define ALLOC_CPUSET		0x40 /* check for correct cpuset */
author	Michal Hocko <mhocko@suse.com>	2017-09-07 01:24:50 +0200
committer	Linus Torvalds <torvalds@linux-foundation.org>	2017-09-07 02:27:30 +0200
commit	cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90 (patch)
tree	3285aa754d29c10b6ea6062eed62c65f091b0ff5 /mm/internal.h
parent	z3fold: use per-cpu unbuddied lists (diff)
download	linux-cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90.tar.xz linux-cd04ae1e2dc8e3651b8c427ec1b9500c6eed7b90.zip