author | Linus Torvalds <torvalds@linux-foundation.org> | 2020-12-15 03:29:11 +0100
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2020-12-15 03:29:11 +0100
commit | adb35e8dc98ba9bda99ff79ac6a05b8fcde2a762 (patch)
tree | ceb0334110d80b5a756764c3d089257c83faaec9 /kernel/sched/membarrier.c
parent | Merge tag 'timers-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kern... (diff)
parent | sched/fair: Trivial correction of the newidle_balance() comment (diff)
Merge tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip
Pull scheduler updates from Thomas Gleixner:
- migrate_disable/enable() support, which originates from the RT tree
and is now a prerequisite for the new preemptible kmap_local() API
that aims to replace kmap_atomic() (see the sketch after this list)
- A fair amount of topology and NUMA related improvements
- Improvements for the frequency invariant calculations
- Enhanced robustness for the global CPU priority tracking and decision
making
- The usual small fixes and enhancements all over the place
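
As a rough illustration of why the preemptible kmap_local() API matters, here is a minimal sketch contrasting the two mapping interfaces. The helper names copy_out_old()/copy_out_new() and the memcpy payload are made up for this example; only kmap_atomic()/kunmap_atomic() and kmap_local_page()/kunmap_local() are real kernel interfaces.

```c
/*
 * Illustrative sketch only: kmap_atomic() disables preemption (and page
 * faults) for the lifetime of the mapping, whereas kmap_local_page()
 * gives a CPU-local mapping that stays valid across preemption, which is
 * what the migrate_disable/enable() infrastructure makes practical.
 */
#include <linux/highmem.h>
#include <linux/string.h>
#include <linux/types.h>

static void copy_out_old(struct page *page, void *dst, size_t len)
{
	void *src = kmap_atomic(page);	/* preemption disabled from here ... */

	memcpy(dst, src, len);
	kunmap_atomic(src);		/* ... until here */
}

static void copy_out_new(struct page *page, void *dst, size_t len)
{
	void *src = kmap_local_page(page);	/* CPU-local, but preemptible */

	memcpy(dst, src, len);
	kunmap_local(src);
}
```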
* tag 'sched-core-2020-12-14' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (61 commits)
sched/fair: Trivial correction of the newidle_balance() comment
sched/fair: Clear SMT siblings after determining the core is not idle
sched: Fix kernel-doc markup
x86: Print ratio freq_max/freq_base used in frequency invariance calculations
x86, sched: Use midpoint of max_boost and max_P for frequency invariance on AMD EPYC
x86, sched: Calculate frequency invariance for AMD systems
irq_work: Optimize irq_work_single()
smp: Cleanup smp_call_function*()
irq_work: Cleanup
sched: Limit the amount of NUMA imbalance that can exist at fork time
sched/numa: Allow a floating imbalance between NUMA nodes
sched: Avoid unnecessary calculation of load imbalance at clone time
sched/numa: Rename nr_running and break out the magic number
sched: Make migrate_disable/enable() independent of RT
sched/topology: Condition EAS enablement on FIE support
arm64: Rebuild sched domains on invariance status changes
sched/topology,schedutil: Wrap sched domains rebuild
sched/uclamp: Allow to reset a task uclamp constraint value
sched/core: Fix typos in comments
Documentation: scheduler: fix information on arch SD flags, sched_domain and sched_debug
...
Diffstat (limited to 'kernel/sched/membarrier.c')
-rw-r--r-- | kernel/sched/membarrier.c | 147
1 file changed, 143 insertions, 4 deletions
diff --git a/kernel/sched/membarrier.c b/kernel/sched/membarrier.c
index 9d8df34bea75..08ae45ad9261 100644
--- a/kernel/sched/membarrier.c
+++ b/kernel/sched/membarrier.c
@@ -7,6 +7,134 @@
 #include "sched.h"
 
 /*
+ * For documentation purposes, here are some membarrier ordering
+ * scenarios to keep in mind:
+ *
+ * A) Userspace thread execution after IPI vs membarrier's memory
+ *    barrier before sending the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the start of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU0 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU1 after the IPI-induced memory barrier:
+ *
+ *         CPU0                              CPU1
+ *
+ *         x = 1
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                     IPI-induced mb
+ *           c: smp_mb()
+ *         r2 = y
+ *                                           y = 1
+ *                                           barrier()
+ *                                           r1 = x
+ *
+ *                     BUG_ON(r1 == 0 && r2 == 0)
+ *
+ * The write to y and load from x by CPU1 are unordered by the hardware,
+ * so it's possible to have "r1 = x" reordered before "y = 1" at any
+ * point after (b). If the memory barrier at (a) is omitted, then "x = 1"
+ * can be reordered after (a) (although not after (c)), so we get r1 == 0
+ * and r2 == 0. This violates the guarantee that membarrier() is
+ * supposed to provide.
+ *
+ * The timing of the memory barrier at (a) has to ensure that it executes
+ * before the IPI-induced memory barrier on CPU1.
+ *
+ * B) Userspace thread execution before IPI vs membarrier's memory
+ *    barrier after completing the IPI
+ *
+ * Userspace variables:
+ *
+ * int x = 0, y = 0;
+ *
+ * The memory barrier at the end of membarrier() on CPU0 is necessary in
+ * order to enforce the guarantee that any writes occurring on CPU1 before
+ * the membarrier() is executed will be visible to any code executing on
+ * CPU0 after the membarrier():
+ *
+ *         CPU0                              CPU1
+ *
+ *                                           x = 1
+ *                                           barrier()
+ *                                           y = 1
+ *         r2 = y
+ *         membarrier():
+ *           a: smp_mb()
+ *           b: send IPI                     IPI-induced mb
+ *           c: smp_mb()
+ *         r1 = x
+ *                     BUG_ON(r1 == 0 && r2 == 1)
+ *
+ * The writes to x and y are unordered by the hardware, so it's possible to
+ * have "r2 = 1" even though the write to x doesn't execute until (b). If
+ * the memory barrier at (c) is omitted then "r1 = x" can be reordered
+ * before (b) (although not before (a)), so we get "r1 = 0". This violates
+ * the guarantee that membarrier() is supposed to provide.
+ *
+ * The timing of the memory barrier at (c) has to ensure that it executes
+ * after the IPI-induced memory barrier on CPU1.
+ *
+ * C) Scheduling userspace thread -> kthread -> userspace thread vs membarrier
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           d: switch to kthread (includes mb)
+ *           b: read rq->curr->mm == NULL
+ *                                           e: switch to user (includes mb)
+ *           c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (e). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ *
+ * D) exit_mm vs membarrier
+ *
+ * Two thread groups are created, A and B. Thread group B is created by
+ * issuing clone from group A with flag CLONE_VM set, but not CLONE_THREAD.
+ * Let's assume we have a single thread within each thread group (Thread A
+ * and Thread B). Thread A runs on CPU0, Thread B runs on CPU1.
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           exit_mm():
+ *                                             d: smp_mb()
+ *                                             e: current->mm = NULL
+ *           b: read rq->curr->mm == NULL
+ *           c: smp_mb()
+ *
+ * Using scenario (B), we can show that (c) needs to be paired with (d).
+ *
+ * E) kthread_{use,unuse}_mm vs membarrier
+ *
+ *           CPU0                            CPU1
+ *
+ *           membarrier():
+ *           a: smp_mb()
+ *                                           kthread_unuse_mm()
+ *                                             d: smp_mb()
+ *                                             e: current->mm = NULL
+ *           b: read rq->curr->mm == NULL
+ *                                           kthread_use_mm()
+ *                                             f: current->mm = mm
+ *                                             g: smp_mb()
+ *           c: smp_mb()
+ *
+ * Using the scenario from (A), we can show that (a) needs to be paired
+ * with (g). Using the scenario from (B), we can show that (c) needs to
+ * be paired with (d).
+ */
+
+/*
  * Bitmask made from a "or" of all commands within enum membarrier_cmd,
  * except MEMBARRIER_CMD_QUERY.
  */
@@ -101,6 +229,18 @@ void membarrier_exec_mmap(struct mm_struct *mm)
 	this_cpu_write(runqueues.membarrier_state, 0);
 }
 
+void membarrier_update_current_mm(struct mm_struct *next_mm)
+{
+	struct rq *rq = this_rq();
+	int membarrier_state = 0;
+
+	if (next_mm)
+		membarrier_state = atomic_read(&next_mm->membarrier_state);
+	if (READ_ONCE(rq->membarrier_state) == membarrier_state)
+		return;
+	WRITE_ONCE(rq->membarrier_state, membarrier_state);
+}
+
 static int membarrier_global_expedited(void)
 {
 	int cpu;
@@ -139,12 +279,11 @@ static int membarrier_global_expedited(void)
 			continue;
 
 		/*
-		 * Skip the CPU if it runs a kernel thread. The scheduler
-		 * leaves the prior task mm in place as an optimization when
-		 * scheduling a kthread.
+		 * Skip the CPU if it runs a kernel thread which is not using
+		 * a task mm.
 		 */
 		p = rcu_dereference(cpu_rq(cpu)->curr);
-		if (p->flags & PF_KTHREAD)
+		if (!p->mm)
 			continue;
 
 		__cpumask_set_cpu(cpu, tmpmask);
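
The ordering scenarios documented in the new comment translate directly into a userspace litmus test. Below is a minimal sketch of scenario (A) built on the membarrier(2) system call; it assumes a Linux system exposing MEMBARRIER_CMD_GLOBAL through <linux/membarrier.h>, and the thread and variable names (cpu1_thread, x, y) are illustrative rather than taken from any kernel selftest.

```c
/*
 * Sketch of scenario (A): membarrier(MEMBARRIER_CMD_GLOBAL) plays the role
 * of "a: smp_mb(); b: send IPI; c: smp_mb()" on the calling thread, and the
 * documented guarantee forbids the outcome r1 == 0 && r2 == 0.
 * Build with: gcc -O2 -pthread membarrier_litmus.c
 */
#include <linux/membarrier.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdint.h>
#include <stdio.h>
#include <sys/syscall.h>
#include <unistd.h>

static atomic_int x, y;

static long sys_membarrier(int cmd, unsigned int flags)
{
	return syscall(__NR_membarrier, cmd, flags);
}

/* "CPU1" side of the diagram: y = 1; barrier(); r1 = x */
static void *cpu1_thread(void *arg)
{
	(void)arg;
	atomic_store_explicit(&y, 1, memory_order_relaxed);
	atomic_signal_fence(memory_order_seq_cst);	/* compiler barrier only */
	int r1 = atomic_load_explicit(&x, memory_order_relaxed);
	return (void *)(intptr_t)r1;
}

int main(void)
{
	pthread_t t;
	void *ret;

	/* Make sure the kernel supports the global command. */
	long mask = sys_membarrier(MEMBARRIER_CMD_QUERY, 0);
	if (mask < 0 || !(mask & MEMBARRIER_CMD_GLOBAL)) {
		fprintf(stderr, "membarrier not supported\n");
		return 1;
	}

	pthread_create(&t, NULL, cpu1_thread, NULL);

	/* "CPU0" side: x = 1; membarrier(); r2 = y */
	atomic_store_explicit(&x, 1, memory_order_relaxed);
	sys_membarrier(MEMBARRIER_CMD_GLOBAL, 0);
	int r2 = atomic_load_explicit(&y, memory_order_relaxed);

	pthread_join(t, &ret);
	int r1 = (int)(intptr_t)ret;

	/* Per the comment above, r1 == 0 && r2 == 0 must never be observed. */
	printf("r1=%d r2=%d%s\n", r1, r2,
	       (r1 == 0 && r2 == 0) ? "  <-- forbidden by membarrier()" : "");
	return 0;
}
```

A single run rarely shows anything interesting; the point is that, as scenarios (A) and (B) describe, the expensive barrier work sits entirely on the membarrier() caller while the other thread only needs a compiler barrier.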