sched: Fix race in cpupri introduced by cpumask_var changes

Background:

Several race conditions in the scheduler have cropped up recently, which Steven and I have tracked down using ftrace. The most recent one turns out to be a race in how the scheduler determines a suitable migration target for RT tasks, introduced recently with commit:

    commit 68e74568fbe5854952355e942acca51f138096d9
    Date:   Tue Nov 25 02:35:13 2008 +1030

        sched: convert struct cpupri_vec cpumask_var_t.

The original design of cpupri allowed lockless readers to quickly determine a best-estimate target. Races between the pri_active bitmap and the vec->mask were handled in the original code because we would detect the race and return "0" when it occurred. The design was predicated on the *effective* atomicity (*) of caching the result of cpus_and() between the cpus_allowed and the vec->mask.

Commit 68e74568 changed the behavior such that vec->mask is accessed multiple times. This introduces a subtle race: the function can return "1" while leaving the caller with an empty bitmap.

*) yes, we know cpus_and() is not a locked operator across the entire composite array, but it is implicitly atomic on a per-word basis, which is all the design required to work.

Implementation:

Rather than forgoing the lockless design, or reverting to a stack-based cpumask_t, we simply check for when the race has been encountered and continue processing in the event that the race is hit. This renders the removal race as if the priority bit had been atomically cleared as well, and allows the algorithm to execute correctly.

Signed-off-by: Gregory Haskins <ghaskins@novell.com>
CC: Rusty Russell <rusty@rustcorp.com.au>
CC: Steven Rostedt <srostedt@redhat.com>
Signed-off-by: Peter Zijlstra <a.p.zijlstra@chello.nl>
LKML-Reference: <20090730145728.25226.92769.stgit@dev.haskins.net>
Signed-off-by: Ingo Molnar <mingo@elte.hu>
author: Gregory Haskins <ghaskins@novell.com> 2009-07-30 16:57:28 +0200
committer: Ingo Molnar <mingo@elte.hu> 2009-08-02 14:23:29 +0200
commit: 07903af152b0597d94e9b0030746b63c4664e787 (patch)
tree: 245f1e9d7a7021f479b0d67e922c6783e59c5d50 /kernel
parent: sched: Fix latencytop and sleep profiling vs group scheduling (diff)
1 files changed, 14 insertions, 1 deletions
diff --git a/kernel/sched_cpupri.c b/kernel/sched_cpupri.c
index e6c251790dde..d014efbf947a 100644
--- a/kernel/sched_cpupri.c
+++ b/kernel/sched_cpupri.c
@@ -81,8 +81,21 @@ int cpupri_find(struct cpupri *cp, struct task_struct *p,
 		if (cpumask_any_and(&p->cpus_allowed, vec->mask) >= nr_cpu_ids)
 			continue;
 
-		if (lowest_mask)
+		if (lowest_mask) {
 			cpumask_and(lowest_mask, &p->cpus_allowed, vec->mask);
+
+			/*
+			 * We have to ensure that we have at least one bit
+			 * still set in the array, since the map could have
+			 * been concurrently emptied between the first and
+			 * second reads of vec->mask.  If we hit this
+			 * condition, simply act as though we never hit this
+			 * priority level and continue on.
+			 */
+			if (cpumask_any(lowest_mask) >= nr_cpu_ids)
+				continue;
+		}
+
 		return 1;
 	}
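
The double read that the patch guards against can be reproduced outside the kernel. What follows is a minimal user-space sketch, not the kernel code: it assumes a single-word mask per priority level and uses C11 atomics, and the names `toy_vec`, `toy_find`, `race_hook_t` and `clear_level0` are hypothetical, introduced only for illustration. The hook stands in for a concurrent writer so that the race window is hit deterministically.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one priority level: a single-word CPU mask. */
struct toy_vec {
	_Atomic uint64_t mask;
};

/*
 * Hypothetical hook invoked between the two reads of vec->mask.  It plays
 * the role of another CPU modifying the mask inside the race window.
 */
typedef void (*race_hook_t)(struct toy_vec *vec, int idx);

/*
 * Scan the levels in order.  Returns 1 and fills *lowest_mask with a
 * non-empty set of candidate CPUs, or returns 0 if no level qualifies.
 */
static int toy_find(struct toy_vec *vecs, int nr_vecs, uint64_t allowed,
		    uint64_t *lowest_mask, race_hook_t hook)
{
	assert(vecs != NULL);

	for (int idx = 0; idx < nr_vecs; idx++) {
		struct toy_vec *vec = &vecs[idx];

		/* First read: does this level overlap the allowed CPUs? */
		if ((atomic_load(&vec->mask) & allowed) == 0)
			continue;

		if (hook)
			hook(vec, idx);	/* simulated concurrent update */

		if (lowest_mask) {
			/* Second read: may no longer match the first one. */
			*lowest_mask = atomic_load(&vec->mask) & allowed;

			/*
			 * Same idea as the check added by the patch: if the
			 * mask was emptied in between, skip this level
			 * instead of reporting success with an empty result.
			 */
			if (*lowest_mask == 0)
				continue;
		}

		return 1;
	}

	return 0;
}

/* Simulated writer: empties level 0 right after the first read. */
static void clear_level0(struct toy_vec *vec, int idx)
{
	if (idx == 0)
		atomic_store(&vec->mask, 0);
}
```

With two levels holding masks 0x1 and 0x2 and an allowed mask of 0x3, calling `toy_find()` with `clear_level0` as the hook skips level 0 and reports level 1. Without the emptiness check, the same call would return 1 with `*lowest_mask` equal to zero, which is the symptom described in the commit message.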