diff options
Diffstat (limited to 'Documentation/RCU/Design')
3 files changed, 88 insertions, 166 deletions
diff --git a/Documentation/RCU/Design/Data-Structures/Data-Structures.html b/Documentation/RCU/Design/Data-Structures/Data-Structures.html index f5120a00f511..1d2051c0c3fc 100644 --- a/Documentation/RCU/Design/Data-Structures/Data-Structures.html +++ b/Documentation/RCU/Design/Data-Structures/Data-Structures.html @@ -1227,9 +1227,11 @@ to overflow the counter, this approach corrects the CPU enters the idle loop from process context. </p><p>The <tt>->dynticks</tt> field counts the corresponding -CPU's transitions to and from dyntick-idle mode, so that this counter -has an even value when the CPU is in dyntick-idle mode and an odd -value otherwise. +CPU's transitions to and from either dyntick-idle or user mode, so +that this counter has an even value when the CPU is in dyntick-idle +mode or user mode and an odd value otherwise. The transitions to/from +user mode need to be counted for user mode adaptive-ticks support +(see timers/NO_HZ.txt). </p><p>The <tt>->rcu_need_heavy_qs</tt> field is used to record the fact that the RCU core code would really like to @@ -1372,8 +1374,7 @@ that is, if the CPU is currently idle. Accessor Functions</a></h3> <p>The following listing shows the -<tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt>, -<tt>rcu_for_each_nonleaf_node_breadth_first()</tt>, and +<tt>rcu_get_root()</tt>, <tt>rcu_for_each_node_breadth_first</tt> and <tt>rcu_for_each_leaf_node()</tt> function and macros: <pre> @@ -1386,13 +1387,9 @@ Accessor Functions</a></h3> 7 for ((rnp) = &(rsp)->node[0]; \ 8 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) 9 - 10 #define rcu_for_each_nonleaf_node_breadth_first(rsp, rnp) \ - 11 for ((rnp) = &(rsp)->node[0]; \ - 12 (rnp) < (rsp)->level[NUM_RCU_LVLS - 1]; (rnp)++) - 13 - 14 #define rcu_for_each_leaf_node(rsp, rnp) \ - 15 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ - 16 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) + 10 #define rcu_for_each_leaf_node(rsp, rnp) \ + 11 for ((rnp) = (rsp)->level[NUM_RCU_LVLS - 1]; \ + 12 (rnp) < &(rsp)->node[NUM_RCU_NODES]; (rnp)++) </pre> <p>The <tt>rcu_get_root()</tt> simply returns a pointer to the @@ -1405,10 +1402,7 @@ macro takes advantage of the layout of the <tt>rcu_node</tt> structures in the <tt>rcu_state</tt> structure's <tt>->node[]</tt> array, performing a breadth-first traversal by simply traversing the array in order. -The <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> macro operates -similarly, but traverses only the first part of the array, thus excluding -the leaf <tt>rcu_node</tt> structures. -Finally, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only +Similarly, the <tt>rcu_for_each_leaf_node()</tt> macro traverses only the last part of the array, thus traversing only the leaf <tt>rcu_node</tt> structures. @@ -1416,15 +1410,14 @@ the last part of the array, thus traversing only the leaf <tr><th> </th></tr> <tr><th align="left">Quick Quiz:</th></tr> <tr><td> - What do <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> and + What does <tt>rcu_for_each_leaf_node()</tt> do if the <tt>rcu_node</tt> tree contains only a single node? </td></tr> <tr><th align="left">Answer:</th></tr> <tr><td bgcolor="#ffffff"><font color="ffffff"> In the single-node case, - <tt>rcu_for_each_nonleaf_node_breadth_first()</tt> is a no-op - and <tt>rcu_for_each_leaf_node()</tt> traverses the single node. + <tt>rcu_for_each_leaf_node()</tt> traverses the single node. </font></td></tr> <tr><td> </td></tr> </table> diff --git a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html index 7394f034be65..e62c7c34a369 100644 --- a/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html +++ b/Documentation/RCU/Design/Expedited-Grace-Periods/Expedited-Grace-Periods.html @@ -12,10 +12,9 @@ high efficiency and minimal disturbance, expedited grace periods accept lower efficiency and significant disturbance to attain shorter latencies. <p> -There are three flavors of RCU (RCU-bh, RCU-preempt, and RCU-sched), -but only two flavors of expedited grace periods because the RCU-bh -expedited grace period maps onto the RCU-sched expedited grace period. -Each of the remaining two implementations is covered in its own section. +There are two flavors of RCU (RCU-preempt and RCU-sched), with an earlier +third RCU-bh flavor having been implemented in terms of the other two. +Each of the two implementations is covered in its own section. <ol> <li> <a href="#Expedited Grace Period Design"> @@ -158,7 +157,7 @@ whether or not the current CPU is in an RCU read-side critical section. The best that <tt>sync_sched_exp_handler()</tt> can do is to check for idle, on the off-chance that the CPU went idle while the IPI was in flight. -If the CPU is idle, then tt>sync_sched_exp_handler()</tt> reports +If the CPU is idle, then <tt>sync_sched_exp_handler()</tt> reports the quiescent state. <p> diff --git a/Documentation/RCU/Design/Requirements/Requirements.html b/Documentation/RCU/Design/Requirements/Requirements.html index 49690228b1c6..43c4e2f05f40 100644 --- a/Documentation/RCU/Design/Requirements/Requirements.html +++ b/Documentation/RCU/Design/Requirements/Requirements.html @@ -1306,8 +1306,6 @@ doing so would degrade real-time response. <p> This non-requirement appeared with preemptible RCU. -If you need a grace period that waits on non-preemptible code regions, use -<a href="#Sched Flavor">RCU-sched</a>. <h2><a name="Parallelism Facts of Life">Parallelism Facts of Life</a></h2> @@ -2165,14 +2163,9 @@ however, this is not a panacea because there would be severe restrictions on what operations those callbacks could invoke. <p> -Perhaps surprisingly, <tt>synchronize_rcu()</tt>, -<a href="#Bottom-Half Flavor"><tt>synchronize_rcu_bh()</tt></a> -(<a href="#Bottom-Half Flavor">discussed below</a>), -<a href="#Sched Flavor"><tt>synchronize_sched()</tt></a>, +Perhaps surprisingly, <tt>synchronize_rcu()</tt> and <tt>synchronize_rcu_expedited()</tt>, -<tt>synchronize_rcu_bh_expedited()</tt>, and -<tt>synchronize_sched_expedited()</tt> -will all operate normally +will operate normally during very early boot, the reason being that there is only one CPU and preemption is disabled. This means that the call <tt>synchronize_rcu()</tt> (or friends) @@ -2269,12 +2262,23 @@ Thankfully, RCU update-side primitives, including The name notwithstanding, some Linux-kernel architectures can have nested NMIs, which RCU must handle correctly. Andy Lutomirski -<a href="https://lkml.kernel.org/g/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a> +<a href="https://lkml.kernel.org/r/CALCETrXLq1y7e_dKFPgou-FKHB6Pu-r8+t-6Ds+8=va7anBWDA@mail.gmail.com">surprised me</a> with this requirement; he also kindly surprised me with -<a href="https://lkml.kernel.org/g/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a> +<a href="https://lkml.kernel.org/r/CALCETrXSY9JpW3uE6H8WYk81sg56qasA2aqmjMPsq5dOtzso=g@mail.gmail.com">an algorithm</a> that meets this requirement. +<p> +Furthermore, NMI handlers can be interrupted by what appear to RCU +to be normal interrupts. +One way that this can happen is for code that directly invokes +<tt>rcu_irq_enter()</tt> and </tt>rcu_irq_exit()</tt> to be called +from an NMI handler. +This astonishing fact of life prompted the current code structure, +which has <tt>rcu_irq_enter()</tt> invoking <tt>rcu_nmi_enter()</tt> +and <tt>rcu_irq_exit()</tt> invoking <tt>rcu_nmi_exit()</tt>. +And yes, I also learned of this requirement the hard way. + <h3><a name="Loadable Modules">Loadable Modules</a></h3> <p> @@ -2394,30 +2398,9 @@ when invoked from a CPU-hotplug notifier. <p> RCU depends on the scheduler, and the scheduler uses RCU to protect some of its data structures. -This means the scheduler is forbidden from acquiring -the runqueue locks and the priority-inheritance locks -in the middle of an outermost RCU read-side critical section unless either -(1) it releases them before exiting that same -RCU read-side critical section, or -(2) interrupts are disabled across -that entire RCU read-side critical section. -This same prohibition also applies (recursively!) to any lock that is acquired -while holding any lock to which this prohibition applies. -Adhering to this rule prevents preemptible RCU from invoking -<tt>rcu_read_unlock_special()</tt> while either runqueue or -priority-inheritance locks are held, thus avoiding deadlock. - -<p> -Prior to v4.4, it was only necessary to disable preemption across -RCU read-side critical sections that acquired scheduler locks. -In v4.4, expedited grace periods started using IPIs, and these -IPIs could force a <tt>rcu_read_unlock()</tt> to take the slowpath. -Therefore, this expedited-grace-period change required disabling of -interrupts, not just preemption. - -<p> -For RCU's part, the preemptible-RCU <tt>rcu_read_unlock()</tt> -implementation must be written carefully to avoid similar deadlocks. +The preemptible-RCU <tt>rcu_read_unlock()</tt> +implementation must therefore be written carefully to avoid deadlocks +involving the scheduler's runqueue and priority-inheritance locks. In particular, <tt>rcu_read_unlock()</tt> must tolerate an interrupt where the interrupt handler invokes both <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt>. @@ -2426,7 +2409,7 @@ negative nesting levels to avoid destructive recursion via interrupt handler's use of RCU. <p> -This pair of mutual scheduler-RCU requirements came as a +This scheduler-RCU requirement came as a <a href="https://lwn.net/Articles/453002/">complete surprise</a>. <p> @@ -2437,9 +2420,28 @@ when running context-switch-heavy workloads when built with <tt>CONFIG_NO_HZ_FULL=y</tt> <a href="http://www.rdrop.com/users/paulmck/scalability/paper/BareMetal.2015.01.15b.pdf">did come as a surprise [PDF]</a>. RCU has made good progress towards meeting this requirement, even -for context-switch-have <tt>CONFIG_NO_HZ_FULL=y</tt> workloads, +for context-switch-heavy <tt>CONFIG_NO_HZ_FULL=y</tt> workloads, but there is room for further improvement. +<p> +In the past, it was forbidden to disable interrupts across an +<tt>rcu_read_unlock()</tt> unless that interrupt-disabled region +of code also included the matching <tt>rcu_read_lock()</tt>. +Violating this restriction could result in deadlocks involving the +scheduler's runqueue and priority-inheritance spinlocks. +This restriction was lifted when interrupt-disabled calls to +<tt>rcu_read_unlock()</tt> started deferring the reporting of +the resulting RCU-preempt quiescent state until the end of that +interrupts-disabled region. +This deferred reporting means that the scheduler's runqueue and +priority-inheritance locks cannot be held while reporting an RCU-preempt +quiescent state, which lifts the earlier restriction, at least from +a deadlock perspective. +Unfortunately, real-time systems using RCU priority boosting may +need this restriction to remain in effect because deferred +quiescent-state reporting also defers deboosting, which in turn +degrades real-time latencies. + <h3><a name="Tracing and RCU">Tracing and RCU</a></h3> <p> @@ -2850,15 +2852,22 @@ The other four flavors are listed below, with requirements for each described in a separate section. <ol> -<li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor</a> -<li> <a href="#Sched Flavor">Sched Flavor</a> +<li> <a href="#Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a> +<li> <a href="#Sched Flavor">Sched Flavor (Historical)</a> <li> <a href="#Sleepable RCU">Sleepable RCU</a> <li> <a href="#Tasks RCU">Tasks RCU</a> -<li> <a href="#Waiting for Multiple Grace Periods"> - Waiting for Multiple Grace Periods</a> </ol> -<h3><a name="Bottom-Half Flavor">Bottom-Half Flavor</a></h3> +<h3><a name="Bottom-Half Flavor">Bottom-Half Flavor (Historical)</a></h3> + +<p> +The RCU-bh flavor of RCU has since been expressed in terms of +the other RCU flavors as part of a consolidation of the three +flavors into a single flavor. +The read-side API remains, and continues to disable softirq and to +be accounted for by lockdep. +Much of the material in this section is therefore strictly historical +in nature. <p> The softirq-disable (AKA “bottom-half”, @@ -2918,8 +2927,20 @@ includes <tt>call_rcu_bh()</tt>, <tt>rcu_barrier_bh()</tt>, and <tt>rcu_read_lock_bh_held()</tt>. +However, the update-side APIs are now simple wrappers for other RCU +flavors, namely RCU-sched in CONFIG_PREEMPT=n kernels and RCU-preempt +otherwise. + +<h3><a name="Sched Flavor">Sched Flavor (Historical)</a></h3> -<h3><a name="Sched Flavor">Sched Flavor</a></h3> +<p> +The RCU-sched flavor of RCU has since been expressed in terms of +the other RCU flavors as part of a consolidation of the three +flavors into a single flavor. +The read-side API remains, and continues to disable preemption and to +be accounted for by lockdep. +Much of the material in this section is therefore strictly historical +in nature. <p> Before preemptible RCU, waiting for an RCU grace period had the @@ -3139,94 +3160,14 @@ The tasks-RCU API is quite compact, consisting only of <tt>call_rcu_tasks()</tt>, <tt>synchronize_rcu_tasks()</tt>, and <tt>rcu_barrier_tasks()</tt>. - -<h3><a name="Waiting for Multiple Grace Periods"> -Waiting for Multiple Grace Periods</a></h3> - -<p> -Perhaps you have an RCU protected data structure that is accessed from -RCU read-side critical sections, from softirq handlers, and from -hardware interrupt handlers. -That is three flavors of RCU, the normal flavor, the bottom-half flavor, -and the sched flavor. -How to wait for a compound grace period? - -<p> -The best approach is usually to “just say no!” and -insert <tt>rcu_read_lock()</tt> and <tt>rcu_read_unlock()</tt> -around each RCU read-side critical section, regardless of what -environment it happens to be in. -But suppose that some of the RCU read-side critical sections are -on extremely hot code paths, and that use of <tt>CONFIG_PREEMPT=n</tt> -is not a viable option, so that <tt>rcu_read_lock()</tt> and -<tt>rcu_read_unlock()</tt> are not free. -What then? - -<p> -You <i>could</i> wait on all three grace periods in succession, as follows: - -<blockquote> -<pre> - 1 synchronize_rcu(); - 2 synchronize_rcu_bh(); - 3 synchronize_sched(); -</pre> -</blockquote> - -<p> -This works, but triples the update-side latency penalty. -In cases where this is not acceptable, <tt>synchronize_rcu_mult()</tt> -may be used to wait on all three flavors of grace period concurrently: - -<blockquote> -<pre> - 1 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched); -</pre> -</blockquote> - -<p> -But what if it is necessary to also wait on SRCU? -This can be done as follows: - -<blockquote> -<pre> - 1 static void call_my_srcu(struct rcu_head *head, - 2 void (*func)(struct rcu_head *head)) - 3 { - 4 call_srcu(&my_srcu, head, func); - 5 } - 6 - 7 synchronize_rcu_mult(call_rcu, call_rcu_bh, call_rcu_sched, call_my_srcu); -</pre> -</blockquote> - -<p> -If you needed to wait on multiple different flavors of SRCU -(but why???), you would need to create a wrapper function resembling -<tt>call_my_srcu()</tt> for each SRCU flavor. - -<table> -<tr><th> </th></tr> -<tr><th align="left">Quick Quiz:</th></tr> -<tr><td> - But what if I need to wait for multiple RCU flavors, but I also need - the grace periods to be expedited? -</td></tr> -<tr><th align="left">Answer:</th></tr> -<tr><td bgcolor="#ffffff"><font color="ffffff"> - If you are using expedited grace periods, there should be less penalty - for waiting on them in succession. - But if that is nevertheless a problem, you can use workqueues - or multiple kthreads to wait on the various expedited grace - periods concurrently. -</font></td></tr> -<tr><td> </td></tr> -</table> - -<p> -Again, it is usually better to adjust the RCU read-side critical sections -to use a single flavor of RCU, but when this is not feasible, you can use -<tt>synchronize_rcu_mult()</tt>. +In <tt>CONFIG_PREEMPT=n</tt> kernels, trampolines cannot be preempted, +so these APIs map to +<tt>call_rcu()</tt>, +<tt>synchronize_rcu()</tt>, and +<tt>rcu_barrier()</tt>, respectively. +In <tt>CONFIG_PREEMPT=y</tt> kernels, trampolines can be preempted, +and these three APIs are therefore implemented by separate functions +that check for voluntary context switches. <h2><a name="Possible Future Changes">Possible Future Changes</a></h2> @@ -3238,12 +3179,6 @@ grace-period state machine so as to avoid the need for the additional latency. <p> -Expedited grace periods scan the CPUs, so their latency and overhead -increases with increasing numbers of CPUs. -If this becomes a serious problem on large systems, it will be necessary -to do some redesign to avoid this scalability problem. - -<p> RCU disables CPU hotplug in a few places, perhaps most notably in the <tt>rcu_barrier()</tt> operations. If there is a strong reason to use <tt>rcu_barrier()</tt> in CPU-hotplug @@ -3288,11 +3223,6 @@ require extremely good demonstration of need and full exploration of alternatives. <p> -There is an embarrassingly large number of flavors of RCU, and this -number has been increasing over time. -Perhaps it will be possible to combine some at some future date. - -<p> RCU's various kthreads are reasonably recent additions. It is quite likely that adjustments will be required to more gracefully handle extreme loads. |