[PATCH] cpuset exit NULL dereference fix

There is a race in the kernel cpuset code, between the code to handle notify_on_release, and the code to remove a cpuset. The notify_on_release code can end up trying to access a cpuset that has been removed. In the most common case, this causes a NULL pointer dereference from the routine cpuset_path. However all manner of bad things are possible, in theory at least. The existing code decrements the cpuset use count, and if the count goes to zero, processes the notify_on_release request, if appropriate. However, once the count goes to zero, unless we are holding the global cpuset_sem semaphore, there is nothing to stop another task from immediately removing the cpuset entirely, and recycling its memory. The obvious fix would be to always hold the cpuset_sem semaphore while decrementing the use count and dealing with notify_on_release. However we don't want to force a global semaphore into the mainline task exit path, as that might create a scaling problem. The actual fix is almost as easy - since this is only an issue for cpusets using notify_on_release, which the top level big cpusets don't normally need to use, only take the cpuset_sem for cpusets using notify_on_release. This code has been run for hours without a hiccup, while running a cpuset create/destroy stress test that could crash the existing kernel in seconds. This patch applies to the current -linus git kernel. Signed-off-by: Paul Jackson <pj@sgi.com> Acked-by: Simon Derr <simon.derr@bull.net> Acked-by: Dinakar Guniguntala <dino@in.ibm.com> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
author: Paul Jackson <pj@sgi.com> 2005-05-27 11:02:43 +0200
committer: Linus Torvalds <torvalds@ppc970.osdl.org> 2005-05-27 17:07:26 +0200
commit: 2efe86b809d97debaaf9fcc13b041aedf15bd3d2 (patch)
tree: 87e039397918f4c5b0a21d798589a8ce517bfa2d
parent: [PATCH] remove non-cleanroom pwc driver compression (diff)
download: linux-2efe86b809d97debaaf9fcc13b041aedf15bd3d2.tar.xz
linux-2efe86b809d97debaaf9fcc13b041aedf15bd3d2.zip
1 files changed, 19 insertions, 5 deletions
diff --git a/kernel/cpuset.c b/kernel/cpuset.c
index 961d74044deb..00e8f2575512 100644
--- a/kernel/cpuset.c
+++ b/kernel/cpuset.c
@@ -166,9 +166,8 @@ static struct super_block *cpuset_sb = NULL;
  * The hooks from fork and exit, cpuset_fork() and cpuset_exit(), don't
  * (usually) grab cpuset_sem.  These are the two most performance
  * critical pieces of code here.  The exception occurs on exit(),
- * if the last task using a cpuset exits, and the cpuset was marked
- * notify_on_release.  In that case, the cpuset_sem is taken, the
- * path to the released cpuset calculated, and a usermode call made
+ * when a task in a notify_on_release cpuset exits.  Then cpuset_sem
+ * is taken, and if the cpuset count is zero, a usermode call made
  * to /sbin/cpuset_release_agent with the name of the cpuset (path
  * relative to the root of cpuset file system) as the argument.
  *
@@ -1404,6 +1403,18 @@ void cpuset_fork(struct task_struct *tsk)
  *
  * Description: Detach cpuset from @tsk and release it.
  *
+ * Note that cpusets marked notify_on_release force every task
+ * in them to take the global cpuset_sem semaphore when exiting.
+ * This could impact scaling on very large systems.  Be reluctant
+ * to use notify_on_release cpusets where very high task exit
+ * scaling is required on large systems.
+ *
+ * Don't even think about derefencing 'cs' after the cpuset use
+ * count goes to zero, except inside a critical section guarded
+ * by the cpuset_sem semaphore.  If you don't hold cpuset_sem,
+ * then a zero cpuset use count is a license to any other task to
+ * nuke the cpuset immediately.
+ *
  **/
 
 void cpuset_exit(struct task_struct *tsk)
@@ -1415,10 +1426,13 @@ void cpuset_exit(struct task_struct *tsk)
 	tsk->cpuset = NULL;
 	task_unlock(tsk);
 
-	if (atomic_dec_and_test(&cs->count)) {
+	if (notify_on_release(cs)) {
 		down(&cpuset_sem);
-		check_for_release(cs);
+		if (atomic_dec_and_test(&cs->count))
+			check_for_release(cs);
 		up(&cpuset_sem);
+	} else {
+		atomic_dec(&cs->count);
 	}
 }
author	Paul Jackson <pj@sgi.com>	2005-05-27 11:02:43 +0200
committer	Linus Torvalds <torvalds@ppc970.osdl.org>	2005-05-27 17:07:26 +0200
commit	2efe86b809d97debaaf9fcc13b041aedf15bd3d2 (patch)
tree	87e039397918f4c5b0a21d798589a8ce517bfa2d
parent	[PATCH] remove non-cleanroom pwc driver compression (diff)
download	linux-2efe86b809d97debaaf9fcc13b041aedf15bd3d2.tar.xz linux-2efe86b809d97debaaf9fcc13b041aedf15bd3d2.zip