Merge branches 'array.2015.05.13a', 'doc.2015.05.12a', 'fixes.2015.05.12a', 'hotplug.2015.05.12a', 'init.2015.05.13a', 'list.2015.05.12a', 'tiny.2015.05.13a' and 'torture.2015.05.13a' into HEAD
array.2015.05.13a: Remove all uses of RCU-protected array indexes.
doc.2015.05.12a: Docuemntation updates.
fixes.2015.05.12a: Miscellaneous fixes.
hotplug.2015.05.12a: CPU-hotplug updates.
init.2015.05.13a: Initialization/Kconfig updates.
list.2015.05.12a: Updates for RCU-protected lists.
tiny.2015.05.13a: Updates to Tiny RCU.
torture.2015.05.13a: Torture-testing updates.
diff --git a/Documentation/RCU/arrayRCU.txt b/Documentation/RCU/arrayRCU.txt
index 453ebe69..f05a9af 100644
--- a/Documentation/RCU/arrayRCU.txt
+++ b/Documentation/RCU/arrayRCU.txt
@@ -10,7 +10,19 @@
3. Resizeable Arrays
-Each of these situations are discussed below.
+Each of these three situations involves an RCU-protected pointer to an
+array that is separately indexed. It might be tempting to consider use
+of RCU to instead protect the index into an array, however, this use
+case is -not- supported. The problem with RCU-protected indexes into
+arrays is that compilers can play way too many optimization games with
+integers, which means that the rules governing handling of these indexes
+are far more trouble than they are worth. If RCU-protected indexes into
+arrays prove to be particularly valuable (which they have not thus far),
+explicit cooperation from the compiler will be required to permit them
+to be safely used.
+
+That aside, each of the three RCU-protected pointer situations are
+described in the following sections.
Situation 1: Hash Tables
@@ -36,9 +48,9 @@
Situation 3: Resizeable Arrays
Use of RCU for resizeable arrays is demonstrated by the grow_ary()
-function used by the System V IPC code. The array is used to map from
-semaphore, message-queue, and shared-memory IDs to the data structure
-that represents the corresponding IPC construct. The grow_ary()
+function formerly used by the System V IPC code. The array is used
+to map from semaphore, message-queue, and shared-memory IDs to the data
+structure that represents the corresponding IPC construct. The grow_ary()
function does not acquire any locks; instead its caller must hold the
ids->sem semaphore.
diff --git a/Documentation/RCU/lockdep.txt b/Documentation/RCU/lockdep.txt
index cd83d23..da51d30 100644
--- a/Documentation/RCU/lockdep.txt
+++ b/Documentation/RCU/lockdep.txt
@@ -47,11 +47,6 @@
Use explicit check expression "c" along with
srcu_read_lock_held()(). This is useful in code that
is invoked by both SRCU readers and updaters.
- rcu_dereference_index_check(p, c):
- Use explicit check expression "c", but the caller
- must supply one of the rcu_read_lock_held() functions.
- This is useful in code that uses RCU-protected arrays
- that is invoked by both RCU readers and updaters.
rcu_dereference_raw(p):
Don't check. (Use sparingly, if at all.)
rcu_dereference_protected(p, c):
@@ -64,11 +59,6 @@
but retain the compiler constraints that prevent duplicating
or coalescsing. This is useful when when testing the
value of the pointer itself, for example, against NULL.
- rcu_access_index(idx):
- Return the value of the index and omit all barriers, but
- retain the compiler constraints that prevent duplicating
- or coalescsing. This is useful when when testing the
- value of the index itself, for example, against -1.
The rcu_dereference_check() check expression can be any boolean
expression, but would normally include a lockdep expression. However,
diff --git a/Documentation/RCU/rcu_dereference.txt b/Documentation/RCU/rcu_dereference.txt
index ceb05da..1e6c0da 100644
--- a/Documentation/RCU/rcu_dereference.txt
+++ b/Documentation/RCU/rcu_dereference.txt
@@ -25,17 +25,6 @@
for an example where the compiler can in fact deduce the exact
value of the pointer, and thus cause misordering.
-o Do not use single-element RCU-protected arrays. The compiler
- is within its right to assume that the value of an index into
- such an array must necessarily evaluate to zero. The compiler
- could then substitute the constant zero for the computation, so
- that the array index no longer depended on the value returned
- by rcu_dereference(). If the array index no longer depends
- on rcu_dereference(), then both the compiler and the CPU
- are within their rights to order the array access before the
- rcu_dereference(), which can cause the array access to return
- garbage.
-
o Avoid cancellation when using the "+" and "-" infix arithmetic
operators. For example, for a given variable "x", avoid
"(x-x)". There are similar arithmetic pitfalls from other
@@ -76,14 +65,15 @@
dereferencing. For example, the following (rather improbable)
code is buggy:
- int a[2];
- int index;
- int force_zero_index = 1;
+ int *p;
+ int *q;
...
- r1 = rcu_dereference(i1)
- r2 = a[r1 && force_zero_index]; /* BUGGY!!! */
+ p = rcu_dereference(gp)
+ q = &global_q;
+ q += p != &oom_p1 && p != &oom_p2;
+ r1 = *q; /* BUGGY!!! */
The reason this is buggy is that "&&" and "||" are often compiled
using branches. While weak-memory machines such as ARM or PowerPC
@@ -94,14 +84,15 @@
">", ">=", "<", or "<=") when dereferencing. For example,
the following (quite strange) code is buggy:
- int a[2];
- int index;
- int flip_index = 0;
+ int *p;
+ int *q;
...
- r1 = rcu_dereference(i1)
- r2 = a[r1 != flip_index]; /* BUGGY!!! */
+ p = rcu_dereference(gp)
+ q = &global_q;
+ q += p > &oom_p;
+ r1 = *q; /* BUGGY!!! */
As before, the reason this is buggy is that relational operators
are often compiled using branches. And as before, although
@@ -193,6 +184,11 @@
pointer. Note that the volatile cast in rcu_dereference()
will normally prevent the compiler from knowing too much.
+ However, please note that if the compiler knows that the
+ pointer takes on only one of two values, a not-equal
+ comparison will provide exactly the information that the
+ compiler needs to deduce the value of the pointer.
+
o Disable any value-speculation optimizations that your compiler
might provide, especially if you are making use of feedback-based
optimizations that take data collected from prior runs. Such
diff --git a/Documentation/RCU/whatisRCU.txt b/Documentation/RCU/whatisRCU.txt
index 88dfce1..5746b0c 100644
--- a/Documentation/RCU/whatisRCU.txt
+++ b/Documentation/RCU/whatisRCU.txt
@@ -256,7 +256,9 @@
If you are going to be fetching multiple fields from the
RCU-protected structure, using the local variable is of
course preferred. Repeated rcu_dereference() calls look
- ugly and incur unnecessary overhead on Alpha CPUs.
+ ugly, do not guarantee that the same pointer will be returned
+ if an update happened while in the critical section, and incur
+ unnecessary overhead on Alpha CPUs.
Note that the value returned by rcu_dereference() is valid
only within the enclosing RCU read-side critical section.
@@ -879,9 +881,7 @@
All: lockdep-checked RCU-protected pointer access
- rcu_access_index
rcu_access_pointer
- rcu_dereference_index_check
rcu_dereference_raw
rcu_lockdep_assert
rcu_sleep_check
diff --git a/Documentation/kernel-parameters.txt b/Documentation/kernel-parameters.txt
index b8d876b..ce8f46f 100644
--- a/Documentation/kernel-parameters.txt
+++ b/Documentation/kernel-parameters.txt
@@ -2992,11 +2992,34 @@
Set maximum number of finished RCU callbacks to
process in one batch.
+ rcutree.dump_tree= [KNL]
+ Dump the structure of the rcu_node combining tree
+ out at early boot. This is used for diagnostic
+ purposes, to verify correct tree setup.
+
+ rcutree.gp_cleanup_delay= [KNL]
+ Set the number of jiffies to delay each step of
+ RCU grace-period cleanup. This only has effect
+ when CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP is set.
+
rcutree.gp_init_delay= [KNL]
Set the number of jiffies to delay each step of
RCU grace-period initialization. This only has
- effect when CONFIG_RCU_TORTURE_TEST_SLOW_INIT is
- set.
+ effect when CONFIG_RCU_TORTURE_TEST_SLOW_INIT
+ is set.
+
+ rcutree.gp_preinit_delay= [KNL]
+ Set the number of jiffies to delay each step of
+ RCU grace-period pre-initialization, that is,
+ the propagation of recent CPU-hotplug changes up
+ the rcu_node combining tree. This only has effect
+ when CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT is set.
+
+ rcutree.rcu_fanout_exact= [KNL]
+ Disable autobalancing of the rcu_node combining
+ tree. This is used by rcutorture, and might
+ possibly be useful for architectures having high
+ cache-to-cache transfer latencies.
rcutree.rcu_fanout_leaf= [KNL]
Increase the number of CPUs assigned to each
diff --git a/Documentation/memory-barriers.txt b/Documentation/memory-barriers.txt
index f957461..360841d 100644
--- a/Documentation/memory-barriers.txt
+++ b/Documentation/memory-barriers.txt
@@ -617,16 +617,16 @@
However, stores are not speculated. This means that ordering -is- provided
for load-store control dependencies, as in the following example:
- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
if (q) {
ACCESS_ONCE(b) = p;
}
-Control dependencies pair normally with other types of barriers.
-That said, please note that ACCESS_ONCE() is not optional! Without the
-ACCESS_ONCE(), might combine the load from 'a' with other loads from
-'a', and the store to 'b' with other stores to 'b', with possible highly
-counterintuitive effects on ordering.
+Control dependencies pair normally with other types of barriers. That
+said, please note that READ_ONCE_CTRL() is not optional! Without the
+READ_ONCE_CTRL(), the compiler might combine the load from 'a' with
+other loads from 'a', and the store to 'b' with other stores to 'b',
+with possible highly counterintuitive effects on ordering.
Worse yet, if the compiler is able to prove (say) that the value of
variable 'a' is always non-zero, it would be well within its rights
@@ -636,12 +636,15 @@
q = a;
b = p; /* BUG: Compiler and CPU can both reorder!!! */
-So don't leave out the ACCESS_ONCE().
+Finally, the READ_ONCE_CTRL() includes an smp_read_barrier_depends()
+that DEC Alpha needs in order to respect control depedencies.
+
+So don't leave out the READ_ONCE_CTRL().
It is tempting to try to enforce ordering on identical stores on both
branches of the "if" statement as follows:
- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
if (q) {
barrier();
ACCESS_ONCE(b) = p;
@@ -655,7 +658,7 @@
Unfortunately, current compilers will transform this as follows at high
optimization levels:
- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
barrier();
ACCESS_ONCE(b) = p; /* BUG: No ordering vs. load from a!!! */
if (q) {
@@ -685,7 +688,7 @@
In contrast, without explicit memory barriers, two-legged-if control
ordering is guaranteed only when the stores differ, for example:
- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
if (q) {
ACCESS_ONCE(b) = p;
do_something();
@@ -694,14 +697,14 @@
do_something_else();
}
-The initial ACCESS_ONCE() is still required to prevent the compiler from
-proving the value of 'a'.
+The initial READ_ONCE_CTRL() is still required to prevent the compiler
+from proving the value of 'a'.
In addition, you need to be careful what you do with the local variable 'q',
otherwise the compiler might be able to guess the value and again remove
the needed conditional. For example:
- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
if (q % MAX) {
ACCESS_ONCE(b) = p;
do_something();
@@ -714,7 +717,7 @@
equal to zero, in which case the compiler is within its rights to
transform the above code into the following:
- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
ACCESS_ONCE(b) = p;
do_something_else();
@@ -725,7 +728,7 @@
relying on this ordering, you should make sure that MAX is greater than
one, perhaps as follows:
- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
BUILD_BUG_ON(MAX <= 1); /* Order load from a with store to b. */
if (q % MAX) {
ACCESS_ONCE(b) = p;
@@ -742,14 +745,15 @@
You must also be careful not to rely too much on boolean short-circuit
evaluation. Consider this example:
- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
if (a || 1 > 0)
ACCESS_ONCE(b) = 1;
-Because the second condition is always true, the compiler can transform
-this example as following, defeating control dependency:
+Because the first condition cannot fault and the second condition is
+always true, the compiler can transform this example as following,
+defeating control dependency:
- q = ACCESS_ONCE(a);
+ q = READ_ONCE_CTRL(a);
ACCESS_ONCE(b) = 1;
This example underscores the need to ensure that the compiler cannot
@@ -762,8 +766,8 @@
x and y both being zero:
CPU 0 CPU 1
- ===================== =====================
- r1 = ACCESS_ONCE(x); r2 = ACCESS_ONCE(y);
+ ======================= =======================
+ r1 = READ_ONCE_CTRL(x); r2 = READ_ONCE_CTRL(y);
if (r1 > 0) if (r2 > 0)
ACCESS_ONCE(y) = 1; ACCESS_ONCE(x) = 1;
@@ -783,7 +787,8 @@
assertion can fail after the combined three-CPU example completes. If you
need the three-CPU example to provide ordering, you will need smp_mb()
between the loads and stores in the CPU 0 and CPU 1 code fragments,
-that is, just before or just after the "if" statements.
+that is, just before or just after the "if" statements. Furthermore,
+the original two-CPU example is very fragile and should be avoided.
These two examples are the LB and WWC litmus tests from this paper:
http://www.cl.cam.ac.uk/users/pes20/ppc-supplemental/test6.pdf and this
@@ -791,6 +796,12 @@
In summary:
+ (*) Control dependencies must be headed by READ_ONCE_CTRL().
+ Or, as a much less preferable alternative, interpose
+ be headed by READ_ONCE() or an ACCESS_ONCE() read and must
+ have smp_read_barrier_depends() between this read and the
+ control-dependent write.
+
(*) Control dependencies can order prior loads against later stores.
However, they do -not- guarantee any other sort of ordering:
Not prior loads against later loads, nor prior stores against
@@ -1784,10 +1795,9 @@
Memory operations issued before the ACQUIRE may be completed after
the ACQUIRE operation has completed. An smp_mb__before_spinlock(),
- combined with a following ACQUIRE, orders prior loads against
- subsequent loads and stores and also orders prior stores against
- subsequent stores. Note that this is weaker than smp_mb()! The
- smp_mb__before_spinlock() primitive is free on many architectures.
+ combined with a following ACQUIRE, orders prior stores against
+ subsequent loads and stores. Note that this is weaker than smp_mb()!
+ The smp_mb__before_spinlock() primitive is free on many architectures.
(2) RELEASE operation implication:
diff --git a/arch/arm/kernel/smp.c b/arch/arm/kernel/smp.c
index cca5b87..8ef0ef0 100644
--- a/arch/arm/kernel/smp.c
+++ b/arch/arm/kernel/smp.c
@@ -218,15 +218,13 @@
return 0;
}
-static DECLARE_COMPLETION(cpu_died);
-
/*
* called on the thread which is asking for a CPU to be shutdown -
* waits until shutdown has completed, or it is timed out.
*/
void __cpu_die(unsigned int cpu)
{
- if (!wait_for_completion_timeout(&cpu_died, msecs_to_jiffies(5000))) {
+ if (!cpu_wait_death(cpu, 5)) {
pr_err("CPU%u: cpu didn't die\n", cpu);
return;
}
@@ -272,7 +270,7 @@
* this returns, power and/or clocks can be removed at any point
* from this CPU and its cache by platform_cpu_kill().
*/
- complete(&cpu_died);
+ (void)cpu_report_death();
/*
* Ensure that the cache lines associated with that completion are
diff --git a/arch/arm64/kernel/smp.c b/arch/arm64/kernel/smp.c
index 2cb0081..a899c1b 100644
--- a/arch/arm64/kernel/smp.c
+++ b/arch/arm64/kernel/smp.c
@@ -253,15 +253,13 @@
return cpu_ops[cpu]->cpu_kill(cpu);
}
-static DECLARE_COMPLETION(cpu_died);
-
/*
* called on the thread which is asking for a CPU to be shutdown -
* waits until shutdown has completed, or it is timed out.
*/
void __cpu_die(unsigned int cpu)
{
- if (!wait_for_completion_timeout(&cpu_died, msecs_to_jiffies(5000))) {
+ if (!cpu_wait_death(cpu, 5)) {
pr_crit("CPU%u: cpu didn't die\n", cpu);
return;
}
@@ -294,7 +292,7 @@
local_irq_disable();
/* Tell __cpu_die() that this CPU is now safe to dispose of */
- complete(&cpu_died);
+ (void)cpu_report_death();
/*
* Actually shutdown the CPU. This must never fail. The specific hotplug
diff --git a/arch/powerpc/include/asm/barrier.h b/arch/powerpc/include/asm/barrier.h
index a3bf5be..1124f59 100644
--- a/arch/powerpc/include/asm/barrier.h
+++ b/arch/powerpc/include/asm/barrier.h
@@ -89,5 +89,6 @@
#define smp_mb__before_atomic() smp_mb()
#define smp_mb__after_atomic() smp_mb()
+#define smp_mb__before_spinlock() smp_mb()
#endif /* _ASM_POWERPC_BARRIER_H */
diff --git a/arch/x86/kernel/cpu/mcheck/mce.c b/arch/x86/kernel/cpu/mcheck/mce.c
index e535533..d4298fe 100644
--- a/arch/x86/kernel/cpu/mcheck/mce.c
+++ b/arch/x86/kernel/cpu/mcheck/mce.c
@@ -53,9 +53,12 @@
static DEFINE_MUTEX(mce_chrdev_read_mutex);
#define rcu_dereference_check_mce(p) \
- rcu_dereference_index_check((p), \
- rcu_read_lock_sched_held() || \
- lockdep_is_held(&mce_chrdev_read_mutex))
+({ \
+ rcu_lockdep_assert(rcu_read_lock_sched_held() || \
+ lockdep_is_held(&mce_chrdev_read_mutex), \
+ "suspicious rcu_dereference_check_mce() usage"); \
+ smp_load_acquire(&(p)); \
+})
#define CREATE_TRACE_POINTS
#include <trace/events/mce.h>
@@ -1884,7 +1887,7 @@
static unsigned int mce_chrdev_poll(struct file *file, poll_table *wait)
{
poll_wait(file, &mce_chrdev_wait, wait);
- if (rcu_access_index(mcelog.next))
+ if (READ_ONCE(mcelog.next))
return POLLIN | POLLRDNORM;
if (!mce_apei_read_done && apei_check_mce())
return POLLIN | POLLRDNORM;
@@ -1929,8 +1932,8 @@
}
EXPORT_SYMBOL_GPL(register_mce_write_callback);
-ssize_t mce_chrdev_write(struct file *filp, const char __user *ubuf,
- size_t usize, loff_t *off)
+static ssize_t mce_chrdev_write(struct file *filp, const char __user *ubuf,
+ size_t usize, loff_t *off)
{
if (mce_write)
return mce_write(filp, ubuf, usize, off);
diff --git a/drivers/md/bitmap.c b/drivers/md/bitmap.c
index 2bc56e2..3290177 100644
--- a/drivers/md/bitmap.c
+++ b/drivers/md/bitmap.c
@@ -181,7 +181,7 @@
rcu_read_lock();
if (rdev == NULL)
/* start at the beginning */
- rdev = list_entry_rcu(&mddev->disks, struct md_rdev, same_set);
+ rdev = list_entry_rcu(mddev->disks.next, struct md_rdev, same_set);
else {
/* release the previous rdev and start from there. */
rdev_dec_pending(rdev, mddev);
diff --git a/include/linux/compiler.h b/include/linux/compiler.h
index 0e41ca0..16e4c12 100644
--- a/include/linux/compiler.h
+++ b/include/linux/compiler.h
@@ -248,6 +248,22 @@
#define WRITE_ONCE(x, val) \
({ typeof(x) __val = (val); __write_once_size(&(x), &__val, sizeof(__val)); __val; })
+/**
+ * READ_ONCE_CTRL - Read a value heading a control dependency
+ * @x: The value to be read, heading the control dependency
+ *
+ * Control dependencies are tricky. See Documentation/memory-barriers.txt
+ * for important information on how to use them. Note that in many cases,
+ * use of smp_load_acquire() will be much simpler. Control dependencies
+ * should be avoided except on the hottest of hotpaths.
+ */
+#define READ_ONCE_CTRL(x) \
+({ \
+ typeof(x) __val = READ_ONCE(x); \
+ smp_read_barrier_depends(); /* Enforce control dependency. */ \
+ __val; \
+})
+
#endif /* __KERNEL__ */
#endif /* __ASSEMBLY__ */
diff --git a/include/linux/rculist.h b/include/linux/rculist.h
index 6653972..c881741 100644
--- a/include/linux/rculist.h
+++ b/include/linux/rculist.h
@@ -247,10 +247,7 @@
* primitives such as list_add_rcu() as long as it's guarded by rcu_read_lock().
*/
#define list_entry_rcu(ptr, type, member) \
-({ \
- typeof(*ptr) __rcu *__ptr = (typeof(*ptr) __rcu __force *)ptr; \
- container_of((typeof(ptr))rcu_dereference_raw(__ptr), type, member); \
-})
+ container_of((typeof(ptr))rcu_dereference_raw(ptr), type, member)
/**
* Where are list_empty_rcu() and list_first_entry_rcu()?
@@ -549,8 +546,8 @@
*/
#define hlist_for_each_entry_from_rcu(pos, member) \
for (; pos; \
- pos = hlist_entry_safe(rcu_dereference((pos)->member.next),\
- typeof(*(pos)), member))
+ pos = hlist_entry_safe(rcu_dereference_raw(hlist_next_rcu( \
+ &(pos)->member)), typeof(*(pos)), member))
#endif /* __KERNEL__ */
#endif
diff --git a/include/linux/rcupdate.h b/include/linux/rcupdate.h
index 87bb0ee..03a899a 100644
--- a/include/linux/rcupdate.h
+++ b/include/linux/rcupdate.h
@@ -292,10 +292,6 @@
void rcu_bh_qs(void);
void rcu_check_callbacks(int user);
struct notifier_block;
-void rcu_idle_enter(void);
-void rcu_idle_exit(void);
-void rcu_irq_enter(void);
-void rcu_irq_exit(void);
int rcu_cpu_notify(struct notifier_block *self,
unsigned long action, void *hcpu);
@@ -628,21 +624,6 @@
((typeof(*p) __force __kernel *)(p)); \
})
-#define __rcu_access_index(p, space) \
-({ \
- typeof(p) _________p1 = READ_ONCE(p); \
- rcu_dereference_sparse(p, space); \
- (_________p1); \
-})
-#define __rcu_dereference_index_check(p, c) \
-({ \
- /* Dependency order vs. p above. */ \
- typeof(p) _________p1 = lockless_dereference(p); \
- rcu_lockdep_assert(c, \
- "suspicious rcu_dereference_index_check() usage"); \
- (_________p1); \
-})
-
/**
* RCU_INITIALIZER() - statically initialize an RCU-protected global variable
* @v: The value to statically initialize with.
@@ -787,41 +768,6 @@
#define rcu_dereference_raw_notrace(p) __rcu_dereference_check((p), 1, __rcu)
/**
- * rcu_access_index() - fetch RCU index with no dereferencing
- * @p: The index to read
- *
- * Return the value of the specified RCU-protected index, but omit the
- * smp_read_barrier_depends() and keep the READ_ONCE(). This is useful
- * when the value of this index is accessed, but the index is not
- * dereferenced, for example, when testing an RCU-protected index against
- * -1. Although rcu_access_index() may also be used in cases where
- * update-side locks prevent the value of the index from changing, you
- * should instead use rcu_dereference_index_protected() for this use case.
- */
-#define rcu_access_index(p) __rcu_access_index((p), __rcu)
-
-/**
- * rcu_dereference_index_check() - rcu_dereference for indices with debug checking
- * @p: The pointer to read, prior to dereferencing
- * @c: The conditions under which the dereference will take place
- *
- * Similar to rcu_dereference_check(), but omits the sparse checking.
- * This allows rcu_dereference_index_check() to be used on integers,
- * which can then be used as array indices. Attempting to use
- * rcu_dereference_check() on an integer will give compiler warnings
- * because the sparse address-space mechanism relies on dereferencing
- * the RCU-protected pointer. Dereferencing integers is not something
- * that even gcc will put up with.
- *
- * Note that this function does not implicitly check for RCU read-side
- * critical sections. If this function gains lots of uses, it might
- * make sense to provide versions for each flavor of RCU, but it does
- * not make sense as of early 2010.
- */
-#define rcu_dereference_index_check(p, c) \
- __rcu_dereference_index_check((p), (c))
-
-/**
* rcu_dereference_protected() - fetch RCU pointer when updates prevented
* @p: The pointer to read, prior to dereferencing
* @c: The conditions under which the dereference will take place
@@ -1153,13 +1099,13 @@
#define kfree_rcu(ptr, rcu_head) \
__kfree_rcu(&((ptr)->rcu_head), offsetof(typeof(*(ptr)), rcu_head))
-#if defined(CONFIG_TINY_RCU) || defined(CONFIG_RCU_NOCB_CPU_ALL)
+#ifdef CONFIG_TINY_RCU
static inline int rcu_needs_cpu(unsigned long *delta_jiffies)
{
*delta_jiffies = ULONG_MAX;
return 0;
}
-#endif /* #if defined(CONFIG_TINY_RCU) || defined(CONFIG_RCU_NOCB_CPU_ALL) */
+#endif /* #ifdef CONFIG_TINY_RCU */
#if defined(CONFIG_RCU_NOCB_CPU_ALL)
static inline bool rcu_is_nocb_cpu(int cpu) { return true; }
diff --git a/include/linux/rcutiny.h b/include/linux/rcutiny.h
index 937edae..3df6c1e 100644
--- a/include/linux/rcutiny.h
+++ b/include/linux/rcutiny.h
@@ -159,6 +159,22 @@
{
}
+static inline void rcu_idle_enter(void)
+{
+}
+
+static inline void rcu_idle_exit(void)
+{
+}
+
+static inline void rcu_irq_enter(void)
+{
+}
+
+static inline void rcu_irq_exit(void)
+{
+}
+
static inline void exit_rcu(void)
{
}
diff --git a/include/linux/rcutree.h b/include/linux/rcutree.h
index d2e583a..3fa4a43 100644
--- a/include/linux/rcutree.h
+++ b/include/linux/rcutree.h
@@ -31,9 +31,7 @@
#define __LINUX_RCUTREE_H
void rcu_note_context_switch(void);
-#ifndef CONFIG_RCU_NOCB_CPU_ALL
int rcu_needs_cpu(unsigned long *delta_jiffies);
-#endif /* #ifndef CONFIG_RCU_NOCB_CPU_ALL */
void rcu_cpu_stall_reset(void);
/*
@@ -93,6 +91,11 @@
void rcu_bh_force_quiescent_state(void);
void rcu_sched_force_quiescent_state(void);
+void rcu_idle_enter(void);
+void rcu_idle_exit(void);
+void rcu_irq_enter(void);
+void rcu_irq_exit(void);
+
void exit_rcu(void);
void rcu_scheduler_starting(void);
diff --git a/include/linux/spinlock.h b/include/linux/spinlock.h
index 3e18379..0063b24 100644
--- a/include/linux/spinlock.h
+++ b/include/linux/spinlock.h
@@ -120,7 +120,7 @@
/*
* Despite its name it doesn't necessarily has to be a full barrier.
* It should only guarantee that a STORE before the critical section
- * can not be reordered with a LOAD inside this section.
+ * can not be reordered with LOADs and STOREs inside this section.
* spin_lock() is the one-way barrier, this LOAD can not escape out
* of the region. So the default implementation simply ensures that
* a STORE can not move into the critical section, smp_wmb() should
diff --git a/init/Kconfig b/init/Kconfig
index dc24dec..4c081970 100644
--- a/init/Kconfig
+++ b/init/Kconfig
@@ -465,13 +465,9 @@
menu "RCU Subsystem"
-choice
- prompt "RCU Implementation"
- default TREE_RCU
-
config TREE_RCU
- bool "Tree-based hierarchical RCU"
- depends on !PREEMPT && SMP
+ bool
+ default y if !PREEMPT && SMP
help
This option selects the RCU implementation that is
designed for very large SMP system with hundreds or
@@ -479,8 +475,8 @@
smaller systems.
config PREEMPT_RCU
- bool "Preemptible tree-based hierarchical RCU"
- depends on PREEMPT
+ bool
+ default y if PREEMPT
help
This option selects the RCU implementation that is
designed for very large SMP systems with hundreds or
@@ -491,15 +487,28 @@
Select this option if you are unsure.
config TINY_RCU
- bool "UP-only small-memory-footprint RCU"
- depends on !PREEMPT && !SMP
+ bool
+ default y if !PREEMPT && !SMP
help
This option selects the RCU implementation that is
designed for UP systems from which real-time response
is not required. This option greatly reduces the
memory footprint of RCU.
-endchoice
+config RCU_EXPERT
+ bool "Make expert-level adjustments to RCU configuration"
+ default n
+ help
+ This option needs to be enabled if you wish to make
+ expert-level adjustments to RCU configuration. By default,
+ no such adjustments can be made, which has the often-beneficial
+ side-effect of preventing "make oldconfig" from asking you all
+ sorts of detailed questions about how you would like numerous
+ obscure RCU options to be set up.
+
+ Say Y if you need to make expert-level adjustments to RCU.
+
+ Say N if you are unsure.
config SRCU
bool
@@ -509,7 +518,7 @@
sections.
config TASKS_RCU
- bool "Task_based RCU implementation using voluntary context switch"
+ bool
default n
select SRCU
help
@@ -517,8 +526,6 @@
only voluntary context switch (not preemption!), idle, and
user-mode execution as quiescent states.
- If unsure, say N.
-
config RCU_STALL_COMMON
def_bool ( TREE_RCU || PREEMPT_RCU || RCU_TRACE )
help
@@ -531,9 +538,7 @@
bool
config RCU_USER_QS
- bool "Consider userspace as in RCU extended quiescent state"
- depends on HAVE_CONTEXT_TRACKING && SMP
- select CONTEXT_TRACKING
+ bool
help
This option sets hooks on kernel / userspace boundaries and
puts RCU in extended quiescent state when the CPU runs in
@@ -541,12 +546,6 @@
excluded from the global RCU state machine and thus doesn't
try to keep the timer tick on for RCU.
- Unless you want to hack and help the development of the full
- dynticks mode, you shouldn't enable this option. It also
- adds unnecessary overhead.
-
- If unsure say N
-
config CONTEXT_TRACKING_FORCE
bool "Force context tracking"
depends on CONTEXT_TRACKING
@@ -578,7 +577,7 @@
int "Tree-based hierarchical RCU fanout value"
range 2 64 if 64BIT
range 2 32 if !64BIT
- depends on TREE_RCU || PREEMPT_RCU
+ depends on (TREE_RCU || PREEMPT_RCU) && RCU_EXPERT
default 64 if 64BIT
default 32 if !64BIT
help
@@ -596,9 +595,9 @@
config RCU_FANOUT_LEAF
int "Tree-based hierarchical RCU leaf-level fanout value"
- range 2 RCU_FANOUT if 64BIT
- range 2 RCU_FANOUT if !64BIT
- depends on TREE_RCU || PREEMPT_RCU
+ range 2 64 if 64BIT
+ range 2 32 if !64BIT
+ depends on (TREE_RCU || PREEMPT_RCU) && RCU_EXPERT
default 16
help
This option controls the leaf-level fanout of hierarchical
@@ -621,23 +620,9 @@
Take the default if unsure.
-config RCU_FANOUT_EXACT
- bool "Disable tree-based hierarchical RCU auto-balancing"
- depends on TREE_RCU || PREEMPT_RCU
- default n
- help
- This option forces use of the exact RCU_FANOUT value specified,
- regardless of imbalances in the hierarchy. This is useful for
- testing RCU itself, and might one day be useful on systems with
- strong NUMA behavior.
-
- Without RCU_FANOUT_EXACT, the code will balance the hierarchy.
-
- Say N if unsure.
-
config RCU_FAST_NO_HZ
bool "Accelerate last non-dyntick-idle CPU's grace periods"
- depends on NO_HZ_COMMON && SMP
+ depends on NO_HZ_COMMON && SMP && RCU_EXPERT
default n
help
This option permits CPUs to enter dynticks-idle state even if
@@ -663,7 +648,7 @@
config RCU_BOOST
bool "Enable RCU priority boosting"
- depends on RT_MUTEXES && PREEMPT_RCU
+ depends on RT_MUTEXES && PREEMPT_RCU && RCU_EXPERT
default n
help
This option boosts the priority of preempted RCU readers that
@@ -680,6 +665,7 @@
range 0 99 if !RCU_BOOST
default 1 if RCU_BOOST
default 0 if !RCU_BOOST
+ depends on RCU_EXPERT
help
This option specifies the SCHED_FIFO priority value that will be
assigned to the rcuc/n and rcub/n threads and is also the value
diff --git a/kernel/cpu.c b/kernel/cpu.c
index 94bbe46..9c9c9fa 100644
--- a/kernel/cpu.c
+++ b/kernel/cpu.c
@@ -398,7 +398,6 @@
err = __stop_machine(take_cpu_down, &tcd_param, cpumask_of(cpu));
if (err) {
/* CPU didn't die: tell everyone. Can't complain. */
- smpboot_unpark_threads(cpu);
cpu_notify_nofail(CPU_DOWN_FAILED | mod, hcpu);
goto out_release;
}
@@ -463,6 +462,7 @@
switch (action & ~CPU_TASKS_FROZEN) {
+ case CPU_DOWN_FAILED:
case CPU_ONLINE:
smpboot_unpark_threads(cpu);
break;
@@ -479,7 +479,7 @@
.priority = CPU_PRI_SMPBOOT,
};
-void __cpuinit smpboot_thread_init(void)
+void smpboot_thread_init(void)
{
register_cpu_notifier(&smpboot_thread_notifier);
}
diff --git a/kernel/events/ring_buffer.c b/kernel/events/ring_buffer.c
index 232f00f..17fcb73 100644
--- a/kernel/events/ring_buffer.c
+++ b/kernel/events/ring_buffer.c
@@ -141,7 +141,7 @@
perf_output_get_handle(handle);
do {
- tail = ACCESS_ONCE(rb->user_page->data_tail);
+ tail = READ_ONCE_CTRL(rb->user_page->data_tail);
offset = head = local_read(&rb->head);
if (!rb->overwrite &&
unlikely(CIRC_SPACE(head, tail, perf_data_size(rb)) < size))
diff --git a/kernel/rcu/tiny.c b/kernel/rcu/tiny.c
index 069742d..591af0c 100644
--- a/kernel/rcu/tiny.c
+++ b/kernel/rcu/tiny.c
@@ -49,39 +49,6 @@
#include "tiny_plugin.h"
-/*
- * Enter idle, which is an extended quiescent state if we have fully
- * entered that mode.
- */
-void rcu_idle_enter(void)
-{
-}
-EXPORT_SYMBOL_GPL(rcu_idle_enter);
-
-/*
- * Exit an interrupt handler towards idle.
- */
-void rcu_irq_exit(void)
-{
-}
-EXPORT_SYMBOL_GPL(rcu_irq_exit);
-
-/*
- * Exit idle, so that we are no longer in an extended quiescent state.
- */
-void rcu_idle_exit(void)
-{
-}
-EXPORT_SYMBOL_GPL(rcu_idle_exit);
-
-/*
- * Enter an interrupt handler, moving away from idle.
- */
-void rcu_irq_enter(void)
-{
-}
-EXPORT_SYMBOL_GPL(rcu_irq_enter);
-
#if defined(CONFIG_DEBUG_LOCK_ALLOC) || defined(CONFIG_RCU_TRACE)
/*
@@ -170,6 +137,11 @@
/* Move the ready-to-invoke callbacks to a local list. */
local_irq_save(flags);
+ if (rcp->donetail == &rcp->rcucblist) {
+ /* No callbacks ready, so just leave. */
+ local_irq_restore(flags);
+ return;
+ }
RCU_TRACE(trace_rcu_batch_start(rcp->name, 0, rcp->qlen, -1));
list = rcp->rcucblist;
rcp->rcucblist = *rcp->donetail;
diff --git a/kernel/rcu/tree.c b/kernel/rcu/tree.c
index 4d32995..7651d7d 100644
--- a/kernel/rcu/tree.c
+++ b/kernel/rcu/tree.c
@@ -91,7 +91,7 @@
#define RCU_STATE_INITIALIZER(sname, sabbr, cr) \
DEFINE_RCU_TPS(sname) \
-DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, sname##_data); \
+static DEFINE_PER_CPU_SHARED_ALIGNED(struct rcu_data, sname##_data); \
struct rcu_state sname##_state = { \
.level = { &sname##_state.node[0] }, \
.rda = &sname##_data, \
@@ -110,20 +110,22 @@
RCU_STATE_INITIALIZER(rcu_sched, 's', call_rcu_sched);
RCU_STATE_INITIALIZER(rcu_bh, 'b', call_rcu_bh);
-static struct rcu_state *rcu_state_p;
+static struct rcu_state *const rcu_state_p;
+static struct rcu_data __percpu *const rcu_data_p;
LIST_HEAD(rcu_struct_flavors);
-/* Increase (but not decrease) the CONFIG_RCU_FANOUT_LEAF at boot time. */
-static int rcu_fanout_leaf = CONFIG_RCU_FANOUT_LEAF;
+/* Dump rcu_node combining tree at boot to verify correct setup. */
+static bool dump_tree;
+module_param(dump_tree, bool, 0444);
+/* Control rcu_node-tree auto-balancing at boot time. */
+static bool rcu_fanout_exact;
+module_param(rcu_fanout_exact, bool, 0444);
+/* Increase (but not decrease) the RCU_FANOUT_LEAF at boot time. */
+static int rcu_fanout_leaf = RCU_FANOUT_LEAF;
module_param(rcu_fanout_leaf, int, 0444);
int rcu_num_lvls __read_mostly = RCU_NUM_LVLS;
-static int num_rcu_lvl[] = { /* Number of rcu_nodes at specified level. */
- NUM_RCU_LVL_0,
- NUM_RCU_LVL_1,
- NUM_RCU_LVL_2,
- NUM_RCU_LVL_3,
- NUM_RCU_LVL_4,
-};
+/* Number of rcu_nodes at specified level. */
+static int num_rcu_lvl[] = NUM_RCU_LVL_INIT;
int rcu_num_nodes __read_mostly = NUM_RCU_NODES; /* Total # rcu_nodes in use. */
/*
@@ -159,14 +161,46 @@
static void invoke_rcu_callbacks(struct rcu_state *rsp, struct rcu_data *rdp);
/* rcuc/rcub kthread realtime priority */
+#ifdef CONFIG_RCU_KTHREAD_PRIO
static int kthread_prio = CONFIG_RCU_KTHREAD_PRIO;
+#else /* #ifdef CONFIG_RCU_KTHREAD_PRIO */
+static int kthread_prio = IS_ENABLED(CONFIG_RCU_BOOST) ? 1 : 0;
+#endif /* #else #ifdef CONFIG_RCU_KTHREAD_PRIO */
module_param(kthread_prio, int, 0644);
-/* Delay in jiffies for grace-period initialization delays. */
-static int gp_init_delay = IS_ENABLED(CONFIG_RCU_TORTURE_TEST_SLOW_INIT)
- ? CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY
- : 0;
+/* Delay in jiffies for grace-period initialization delays, debug only. */
+
+#ifdef CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT
+static int gp_preinit_delay = CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT_DELAY;
+module_param(gp_preinit_delay, int, 0644);
+#else /* #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT */
+static const int gp_preinit_delay;
+#endif /* #else #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT */
+
+#ifdef CONFIG_RCU_TORTURE_TEST_SLOW_INIT
+static int gp_init_delay = CONFIG_RCU_TORTURE_TEST_SLOW_INIT_DELAY;
module_param(gp_init_delay, int, 0644);
+#else /* #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_INIT */
+static const int gp_init_delay;
+#endif /* #else #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_INIT */
+
+#ifdef CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP
+static int gp_cleanup_delay = CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY;
+module_param(gp_cleanup_delay, int, 0644);
+#else /* #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP */
+static const int gp_cleanup_delay;
+#endif /* #else #ifdef CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP */
+
+/*
+ * Number of grace periods between delays, normalized by the duration of
+ * the delay. The longer the the delay, the more the grace periods between
+ * each delay. The reason for this normalization is that it means that,
+ * for non-zero delays, the overall slowdown of grace periods is constant
+ * regardless of the duration of the delay. This arrangement balances
+ * the need for long delays to increase some race probabilities with the
+ * need for fast grace periods to increase other race probabilities.
+ */
+#define PER_RCU_NODE_PERIOD 3 /* Number of grace periods between delays. */
/*
* Track the rcutorture test sequence number and the update version
@@ -582,7 +616,8 @@
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
trace_rcu_dyntick(TPS("Start"), oldval, rdtp->dynticks_nesting);
- if (!user && !is_idle_task(current)) {
+ if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+ !user && !is_idle_task(current)) {
struct task_struct *idle __maybe_unused =
idle_task(smp_processor_id());
@@ -601,7 +636,8 @@
smp_mb__before_atomic(); /* See above. */
atomic_inc(&rdtp->dynticks);
smp_mb__after_atomic(); /* Force ordering with next sojourn. */
- WARN_ON_ONCE(atomic_read(&rdtp->dynticks) & 0x1);
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+ atomic_read(&rdtp->dynticks) & 0x1);
rcu_dynticks_task_enter();
/*
@@ -627,7 +663,8 @@
rdtp = this_cpu_ptr(&rcu_dynticks);
oldval = rdtp->dynticks_nesting;
- WARN_ON_ONCE((oldval & DYNTICK_TASK_NEST_MASK) == 0);
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+ (oldval & DYNTICK_TASK_NEST_MASK) == 0);
if ((oldval & DYNTICK_TASK_NEST_MASK) == DYNTICK_TASK_NEST_VALUE) {
rdtp->dynticks_nesting = 0;
rcu_eqs_enter_common(oldval, user);
@@ -700,7 +737,8 @@
rdtp = this_cpu_ptr(&rcu_dynticks);
oldval = rdtp->dynticks_nesting;
rdtp->dynticks_nesting--;
- WARN_ON_ONCE(rdtp->dynticks_nesting < 0);
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+ rdtp->dynticks_nesting < 0);
if (rdtp->dynticks_nesting)
trace_rcu_dyntick(TPS("--="), oldval, rdtp->dynticks_nesting);
else
@@ -725,10 +763,12 @@
atomic_inc(&rdtp->dynticks);
/* CPUs seeing atomic_inc() must see later RCU read-side crit sects */
smp_mb__after_atomic(); /* See above. */
- WARN_ON_ONCE(!(atomic_read(&rdtp->dynticks) & 0x1));
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+ !(atomic_read(&rdtp->dynticks) & 0x1));
rcu_cleanup_after_idle();
trace_rcu_dyntick(TPS("End"), oldval, rdtp->dynticks_nesting);
- if (!user && !is_idle_task(current)) {
+ if (IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+ !user && !is_idle_task(current)) {
struct task_struct *idle __maybe_unused =
idle_task(smp_processor_id());
@@ -752,7 +792,7 @@
rdtp = this_cpu_ptr(&rcu_dynticks);
oldval = rdtp->dynticks_nesting;
- WARN_ON_ONCE(oldval < 0);
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) && oldval < 0);
if (oldval & DYNTICK_TASK_NEST_MASK) {
rdtp->dynticks_nesting += DYNTICK_TASK_NEST_VALUE;
} else {
@@ -825,7 +865,8 @@
rdtp = this_cpu_ptr(&rcu_dynticks);
oldval = rdtp->dynticks_nesting;
rdtp->dynticks_nesting++;
- WARN_ON_ONCE(rdtp->dynticks_nesting == 0);
+ WARN_ON_ONCE(IS_ENABLED(CONFIG_RCU_EQS_DEBUG) &&
+ rdtp->dynticks_nesting == 0);
if (oldval)
trace_rcu_dyntick(TPS("++="), oldval, rdtp->dynticks_nesting);
else
@@ -1132,8 +1173,9 @@
j = jiffies;
gpa = READ_ONCE(rsp->gp_activity);
if (j - gpa > 2 * HZ)
- pr_err("%s kthread starved for %ld jiffies!\n",
- rsp->name, j - gpa);
+ pr_err("%s kthread starved for %ld jiffies! g%lu c%lu f%#x\n",
+ rsp->name, j - gpa,
+ rsp->gpnum, rsp->completed, rsp->gp_flags);
}
/*
@@ -1729,6 +1771,13 @@
rcu_gp_kthread_wake(rsp);
}
+static void rcu_gp_slow(struct rcu_state *rsp, int delay)
+{
+ if (delay > 0 &&
+ !(rsp->gpnum % (rcu_num_nodes * PER_RCU_NODE_PERIOD * delay)))
+ schedule_timeout_uninterruptible(delay);
+}
+
/*
* Initialize a new grace period. Return 0 if no grace period required.
*/
@@ -1771,6 +1820,7 @@
* will handle subsequent offline CPUs.
*/
rcu_for_each_leaf_node(rsp, rnp) {
+ rcu_gp_slow(rsp, gp_preinit_delay);
raw_spin_lock_irq(&rnp->lock);
smp_mb__after_unlock_lock();
if (rnp->qsmaskinit == rnp->qsmaskinitnext &&
@@ -1827,6 +1877,7 @@
* process finishes, because this kthread handles both.
*/
rcu_for_each_node_breadth_first(rsp, rnp) {
+ rcu_gp_slow(rsp, gp_init_delay);
raw_spin_lock_irq(&rnp->lock);
smp_mb__after_unlock_lock();
rdp = this_cpu_ptr(rsp->rda);
@@ -1844,10 +1895,6 @@
raw_spin_unlock_irq(&rnp->lock);
cond_resched_rcu_qs();
WRITE_ONCE(rsp->gp_activity, jiffies);
- if (IS_ENABLED(CONFIG_RCU_TORTURE_TEST_SLOW_INIT) &&
- gp_init_delay > 0 &&
- !(rsp->gpnum % (rcu_num_nodes * 10)))
- schedule_timeout_uninterruptible(gp_init_delay);
}
return 1;
@@ -1942,6 +1989,7 @@
raw_spin_unlock_irq(&rnp->lock);
cond_resched_rcu_qs();
WRITE_ONCE(rsp->gp_activity, jiffies);
+ rcu_gp_slow(rsp, gp_cleanup_delay);
}
rnp = rcu_get_root(rsp);
raw_spin_lock_irq(&rnp->lock);
@@ -2136,6 +2184,7 @@
__releases(rcu_get_root(rsp)->lock)
{
WARN_ON_ONCE(!rcu_gp_in_progress(rsp));
+ WRITE_ONCE(rsp->gp_flags, READ_ONCE(rsp->gp_flags) | RCU_GP_FLAG_FQS);
raw_spin_unlock_irqrestore(&rcu_get_root(rsp)->lock, flags);
rcu_gp_kthread_wake(rsp);
}
@@ -2333,8 +2382,6 @@
rcu_report_qs_rdp(rdp->cpu, rsp, rdp);
}
-#ifdef CONFIG_HOTPLUG_CPU
-
/*
* Send the specified CPU's RCU callbacks to the orphanage. The
* specified CPU must be offline, and the caller must hold the
@@ -2345,7 +2392,7 @@
struct rcu_node *rnp, struct rcu_data *rdp)
{
/* No-CBs CPUs do not have orphanable callbacks. */
- if (rcu_is_nocb_cpu(rdp->cpu))
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) || rcu_is_nocb_cpu(rdp->cpu))
return;
/*
@@ -2404,7 +2451,8 @@
struct rcu_data *rdp = raw_cpu_ptr(rsp->rda);
/* No-CBs CPUs are handled specially. */
- if (rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
+ rcu_nocb_adopt_orphan_cbs(rsp, rdp, flags))
return;
/* Do the accounting first. */
@@ -2451,6 +2499,9 @@
RCU_TRACE(struct rcu_data *rdp = this_cpu_ptr(rsp->rda));
RCU_TRACE(struct rcu_node *rnp = rdp->mynode);
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
+ return;
+
RCU_TRACE(mask = rdp->grpmask);
trace_rcu_grace_period(rsp->name,
rnp->gpnum + 1 - !!(rnp->qsmask & mask),
@@ -2479,7 +2530,8 @@
long mask;
struct rcu_node *rnp = rnp_leaf;
- if (rnp->qsmaskinit || rcu_preempt_has_tasks(rnp))
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU) ||
+ rnp->qsmaskinit || rcu_preempt_has_tasks(rnp))
return;
for (;;) {
mask = rnp->grpmask;
@@ -2510,6 +2562,9 @@
struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
+ return;
+
/* Remove outgoing CPU from mask in the leaf rcu_node structure. */
mask = rdp->grpmask;
raw_spin_lock_irqsave(&rnp->lock, flags);
@@ -2531,6 +2586,9 @@
struct rcu_data *rdp = per_cpu_ptr(rsp->rda, cpu);
struct rcu_node *rnp = rdp->mynode; /* Outgoing CPU's rdp & rnp. */
+ if (!IS_ENABLED(CONFIG_HOTPLUG_CPU))
+ return;
+
/* Adjust any no-longer-needed kthreads. */
rcu_boost_kthread_setaffinity(rnp, -1);
@@ -2545,26 +2603,6 @@
cpu, rdp->qlen, rdp->nxtlist);
}
-#else /* #ifdef CONFIG_HOTPLUG_CPU */
-
-static void rcu_cleanup_dying_cpu(struct rcu_state *rsp)
-{
-}
-
-static void __maybe_unused rcu_cleanup_dead_rnp(struct rcu_node *rnp_leaf)
-{
-}
-
-static void rcu_cleanup_dying_idle_cpu(int cpu, struct rcu_state *rsp)
-{
-}
-
-static void rcu_cleanup_dead_cpu(int cpu, struct rcu_state *rsp)
-{
-}
-
-#endif /* #else #ifdef CONFIG_HOTPLUG_CPU */
-
/*
* Invoke any RCU callbacks that have made it to the end of their grace
* period. Thottle as specified by rdp->blimit.
@@ -2729,10 +2767,6 @@
mask = 0;
raw_spin_lock_irqsave(&rnp->lock, flags);
smp_mb__after_unlock_lock();
- if (!rcu_gp_in_progress(rsp)) {
- raw_spin_unlock_irqrestore(&rnp->lock, flags);
- return;
- }
if (rnp->qsmask == 0) {
if (rcu_state_p == &rcu_sched_state ||
rsp != rcu_state_p ||
@@ -2762,8 +2796,6 @@
bit = 1;
for (; cpu <= rnp->grphi; cpu++, bit <<= 1) {
if ((rnp->qsmask & bit) != 0) {
- if ((rnp->qsmaskinit & bit) == 0)
- *isidle = false; /* Pending hotplug. */
if (f(per_cpu_ptr(rsp->rda, cpu), isidle, maxj))
mask |= bit;
}
@@ -3285,7 +3317,7 @@
if (ULONG_CMP_GE((ulong)atomic_long_read(&rsp->expedited_start),
(ulong)atomic_long_read(&rsp->expedited_done) +
ULONG_MAX / 8)) {
- synchronize_sched();
+ wait_rcu_gp(call_rcu_sched);
atomic_long_inc(&rsp->expedited_wrap);
return;
}
@@ -3491,7 +3523,7 @@
* non-NULL, store an indication of whether all callbacks are lazy.
* (If there are no callbacks, all of them are deemed to be lazy.)
*/
-static int __maybe_unused rcu_cpu_has_callbacks(bool *all_lazy)
+static bool __maybe_unused rcu_cpu_has_callbacks(bool *all_lazy)
{
bool al = true;
bool hc = false;
@@ -3778,7 +3810,7 @@
rdp->gpnum = rnp->completed; /* Make CPU later note any new GP. */
rdp->completed = rnp->completed;
rdp->passed_quiesce = false;
- rdp->rcu_qs_ctr_snap = __this_cpu_read(rcu_qs_ctr);
+ rdp->rcu_qs_ctr_snap = per_cpu(rcu_qs_ctr, cpu);
rdp->qs_pending = false;
trace_rcu_grace_period(rsp->name, rdp->gpnum, TPS("cpuonl"));
raw_spin_unlock_irqrestore(&rnp->lock, flags);
@@ -3922,24 +3954,24 @@
/*
* Compute the per-level fanout, either using the exact fanout specified
- * or balancing the tree, depending on CONFIG_RCU_FANOUT_EXACT.
+ * or balancing the tree, depending on the rcu_fanout_exact boot parameter.
*/
-static void __init rcu_init_levelspread(struct rcu_state *rsp)
+static void __init rcu_init_levelspread(int *levelspread, const int *levelcnt)
{
int i;
- if (IS_ENABLED(CONFIG_RCU_FANOUT_EXACT)) {
- rsp->levelspread[rcu_num_lvls - 1] = rcu_fanout_leaf;
+ if (rcu_fanout_exact) {
+ levelspread[rcu_num_lvls - 1] = rcu_fanout_leaf;
for (i = rcu_num_lvls - 2; i >= 0; i--)
- rsp->levelspread[i] = CONFIG_RCU_FANOUT;
+ levelspread[i] = RCU_FANOUT;
} else {
int ccur;
int cprv;
cprv = nr_cpu_ids;
for (i = rcu_num_lvls - 1; i >= 0; i--) {
- ccur = rsp->levelcnt[i];
- rsp->levelspread[i] = (cprv + ccur - 1) / ccur;
+ ccur = levelcnt[i];
+ levelspread[i] = (cprv + ccur - 1) / ccur;
cprv = ccur;
}
}
@@ -3951,44 +3983,39 @@
static void __init rcu_init_one(struct rcu_state *rsp,
struct rcu_data __percpu *rda)
{
- static const char * const buf[] = {
- "rcu_node_0",
- "rcu_node_1",
- "rcu_node_2",
- "rcu_node_3" }; /* Match MAX_RCU_LVLS */
- static const char * const fqs[] = {
- "rcu_node_fqs_0",
- "rcu_node_fqs_1",
- "rcu_node_fqs_2",
- "rcu_node_fqs_3" }; /* Match MAX_RCU_LVLS */
+ static const char * const buf[] = RCU_NODE_NAME_INIT;
+ static const char * const fqs[] = RCU_FQS_NAME_INIT;
static u8 fl_mask = 0x1;
+
+ int levelcnt[RCU_NUM_LVLS]; /* # nodes in each level. */
+ int levelspread[RCU_NUM_LVLS]; /* kids/node in each level. */
int cpustride = 1;
int i;
int j;
struct rcu_node *rnp;
- BUILD_BUG_ON(MAX_RCU_LVLS > ARRAY_SIZE(buf)); /* Fix buf[] init! */
+ BUILD_BUG_ON(RCU_NUM_LVLS > ARRAY_SIZE(buf)); /* Fix buf[] init! */
- /* Silence gcc 4.8 warning about array index out of range. */
- if (rcu_num_lvls > RCU_NUM_LVLS)
- panic("rcu_init_one: rcu_num_lvls overflow");
+ /* Silence gcc 4.8 false positive about array index out of range. */
+ if (rcu_num_lvls <= 0 || rcu_num_lvls > RCU_NUM_LVLS)
+ panic("rcu_init_one: rcu_num_lvls out of range");
/* Initialize the level-tracking arrays. */
for (i = 0; i < rcu_num_lvls; i++)
- rsp->levelcnt[i] = num_rcu_lvl[i];
+ levelcnt[i] = num_rcu_lvl[i];
for (i = 1; i < rcu_num_lvls; i++)
- rsp->level[i] = rsp->level[i - 1] + rsp->levelcnt[i - 1];
- rcu_init_levelspread(rsp);
+ rsp->level[i] = rsp->level[i - 1] + levelcnt[i - 1];
+ rcu_init_levelspread(levelspread, levelcnt);
rsp->flavor_mask = fl_mask;
fl_mask <<= 1;
/* Initialize the elements themselves, starting from the leaves. */
for (i = rcu_num_lvls - 1; i >= 0; i--) {
- cpustride *= rsp->levelspread[i];
+ cpustride *= levelspread[i];
rnp = rsp->level[i];
- for (j = 0; j < rsp->levelcnt[i]; j++, rnp++) {
+ for (j = 0; j < levelcnt[i]; j++, rnp++) {
raw_spin_lock_init(&rnp->lock);
lockdep_set_class_and_name(&rnp->lock,
&rcu_node_class[i], buf[i]);
@@ -4008,10 +4035,10 @@
rnp->grpmask = 0;
rnp->parent = NULL;
} else {
- rnp->grpnum = j % rsp->levelspread[i - 1];
+ rnp->grpnum = j % levelspread[i - 1];
rnp->grpmask = 1UL << rnp->grpnum;
rnp->parent = rsp->level[i - 1] +
- j / rsp->levelspread[i - 1];
+ j / levelspread[i - 1];
}
rnp->level = i;
INIT_LIST_HEAD(&rnp->blkd_tasks);
@@ -4039,9 +4066,7 @@
{
ulong d;
int i;
- int j;
- int n = nr_cpu_ids;
- int rcu_capacity[MAX_RCU_LVLS + 1];
+ int rcu_capacity[RCU_NUM_LVLS];
/*
* Initialize any unspecified boot parameters.
@@ -4057,7 +4082,7 @@
jiffies_till_next_fqs = d;
/* If the compile-time values are accurate, just leave. */
- if (rcu_fanout_leaf == CONFIG_RCU_FANOUT_LEAF &&
+ if (rcu_fanout_leaf == RCU_FANOUT_LEAF &&
nr_cpu_ids == NR_CPUS)
return;
pr_info("RCU: Adjusting geometry for rcu_fanout_leaf=%d, nr_cpu_ids=%d\n",
@@ -4065,46 +4090,67 @@
/*
* Compute number of nodes that can be handled an rcu_node tree
- * with the given number of levels. Setting rcu_capacity[0] makes
- * some of the arithmetic easier.
+ * with the given number of levels.
*/
- rcu_capacity[0] = 1;
- rcu_capacity[1] = rcu_fanout_leaf;
- for (i = 2; i <= MAX_RCU_LVLS; i++)
- rcu_capacity[i] = rcu_capacity[i - 1] * CONFIG_RCU_FANOUT;
+ rcu_capacity[0] = rcu_fanout_leaf;
+ for (i = 1; i < RCU_NUM_LVLS; i++)
+ rcu_capacity[i] = rcu_capacity[i - 1] * RCU_FANOUT;
/*
+ * The tree must be able to accommodate the configured number of CPUs.
+ * If this limit is exceeded than we have a serious problem elsewhere.
+ *
* The boot-time rcu_fanout_leaf parameter is only permitted
* to increase the leaf-level fanout, not decrease it. Of course,
* the leaf-level fanout cannot exceed the number of bits in
- * the rcu_node masks. Finally, the tree must be able to accommodate
- * the configured number of CPUs. Complain and fall back to the
- * compile-time values if these limits are exceeded.
+ * the rcu_node masks. Complain and fall back to the compile-
+ * time values if these limits are exceeded.
*/
- if (rcu_fanout_leaf < CONFIG_RCU_FANOUT_LEAF ||
- rcu_fanout_leaf > sizeof(unsigned long) * 8 ||
- n > rcu_capacity[MAX_RCU_LVLS]) {
+ if (nr_cpu_ids > rcu_capacity[RCU_NUM_LVLS - 1])
+ panic("rcu_init_geometry: rcu_capacity[] is too small");
+ else if (rcu_fanout_leaf < RCU_FANOUT_LEAF ||
+ rcu_fanout_leaf > sizeof(unsigned long) * 8) {
WARN_ON(1);
return;
}
+ /* Calculate the number of levels in the tree. */
+ for (i = 0; nr_cpu_ids > rcu_capacity[i]; i++) {
+ }
+ rcu_num_lvls = i + 1;
+
/* Calculate the number of rcu_nodes at each level of the tree. */
- for (i = 1; i <= MAX_RCU_LVLS; i++)
- if (n <= rcu_capacity[i]) {
- for (j = 0; j <= i; j++)
- num_rcu_lvl[j] =
- DIV_ROUND_UP(n, rcu_capacity[i - j]);
- rcu_num_lvls = i;
- for (j = i + 1; j <= MAX_RCU_LVLS; j++)
- num_rcu_lvl[j] = 0;
- break;
- }
+ for (i = 0; i < rcu_num_lvls; i++) {
+ int cap = rcu_capacity[(rcu_num_lvls - 1) - i];
+ num_rcu_lvl[i] = DIV_ROUND_UP(nr_cpu_ids, cap);
+ }
/* Calculate the total number of rcu_node structures. */
rcu_num_nodes = 0;
- for (i = 0; i <= MAX_RCU_LVLS; i++)
+ for (i = 0; i < rcu_num_lvls; i++)
rcu_num_nodes += num_rcu_lvl[i];
- rcu_num_nodes -= n;
+}
+
+/*
+ * Dump out the structure of the rcu_node combining tree associated
+ * with the rcu_state structure referenced by rsp.
+ */
+static void __init rcu_dump_rcu_node_tree(struct rcu_state *rsp)
+{
+ int level = 0;
+ struct rcu_node *rnp;
+
+ pr_info("rcu_node tree layout dump\n");
+ pr_info(" ");
+ rcu_for_each_node_breadth_first(rsp, rnp) {
+ if (rnp->level != level) {
+ pr_cont("\n");
+ pr_info(" ");
+ level = rnp->level;
+ }
+ pr_cont("%d:%d ^%d ", rnp->grplo, rnp->grphi, rnp->grpnum);
+ }
+ pr_cont("\n");
}
void __init rcu_init(void)
@@ -4117,6 +4163,8 @@
rcu_init_geometry();
rcu_init_one(&rcu_bh_state, &rcu_bh_data);
rcu_init_one(&rcu_sched_state, &rcu_sched_data);
+ if (dump_tree)
+ rcu_dump_rcu_node_tree(&rcu_sched_state);
__rcu_init_preempt();
open_softirq(RCU_SOFTIRQ, rcu_process_callbacks);
diff --git a/kernel/rcu/tree.h b/kernel/rcu/tree.h
index a69d3da..f68ba68 100644
--- a/kernel/rcu/tree.h
+++ b/kernel/rcu/tree.h
@@ -35,47 +35,70 @@
* In practice, this did work well going from three levels to four.
* Of course, your mileage may vary.
*/
-#define MAX_RCU_LVLS 4
-#define RCU_FANOUT_1 (CONFIG_RCU_FANOUT_LEAF)
-#define RCU_FANOUT_2 (RCU_FANOUT_1 * CONFIG_RCU_FANOUT)
-#define RCU_FANOUT_3 (RCU_FANOUT_2 * CONFIG_RCU_FANOUT)
-#define RCU_FANOUT_4 (RCU_FANOUT_3 * CONFIG_RCU_FANOUT)
+
+#ifdef CONFIG_RCU_FANOUT
+#define RCU_FANOUT CONFIG_RCU_FANOUT
+#else /* #ifdef CONFIG_RCU_FANOUT */
+# ifdef CONFIG_64BIT
+# define RCU_FANOUT 64
+# else
+# define RCU_FANOUT 32
+# endif
+#endif /* #else #ifdef CONFIG_RCU_FANOUT */
+
+#ifdef CONFIG_RCU_FANOUT_LEAF
+#define RCU_FANOUT_LEAF CONFIG_RCU_FANOUT_LEAF
+#else /* #ifdef CONFIG_RCU_FANOUT_LEAF */
+# ifdef CONFIG_64BIT
+# define RCU_FANOUT_LEAF 64
+# else
+# define RCU_FANOUT_LEAF 32
+# endif
+#endif /* #else #ifdef CONFIG_RCU_FANOUT_LEAF */
+
+#define RCU_FANOUT_1 (RCU_FANOUT_LEAF)
+#define RCU_FANOUT_2 (RCU_FANOUT_1 * RCU_FANOUT)
+#define RCU_FANOUT_3 (RCU_FANOUT_2 * RCU_FANOUT)
+#define RCU_FANOUT_4 (RCU_FANOUT_3 * RCU_FANOUT)
#if NR_CPUS <= RCU_FANOUT_1
# define RCU_NUM_LVLS 1
# define NUM_RCU_LVL_0 1
-# define NUM_RCU_LVL_1 (NR_CPUS)
-# define NUM_RCU_LVL_2 0
-# define NUM_RCU_LVL_3 0
-# define NUM_RCU_LVL_4 0
+# define NUM_RCU_NODES NUM_RCU_LVL_0
+# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0 }
+# define RCU_NODE_NAME_INIT { "rcu_node_0" }
+# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0" }
#elif NR_CPUS <= RCU_FANOUT_2
# define RCU_NUM_LVLS 2
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
-# define NUM_RCU_LVL_2 (NR_CPUS)
-# define NUM_RCU_LVL_3 0
-# define NUM_RCU_LVL_4 0
+# define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1)
+# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1 }
+# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1" }
+# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1" }
#elif NR_CPUS <= RCU_FANOUT_3
# define RCU_NUM_LVLS 3
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
-# define NUM_RCU_LVL_3 (NR_CPUS)
-# define NUM_RCU_LVL_4 0
+# define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2)
+# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2 }
+# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2" }
+# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2" }
#elif NR_CPUS <= RCU_FANOUT_4
# define RCU_NUM_LVLS 4
# define NUM_RCU_LVL_0 1
# define NUM_RCU_LVL_1 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_3)
# define NUM_RCU_LVL_2 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_2)
# define NUM_RCU_LVL_3 DIV_ROUND_UP(NR_CPUS, RCU_FANOUT_1)
-# define NUM_RCU_LVL_4 (NR_CPUS)
+# define NUM_RCU_NODES (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3)
+# define NUM_RCU_LVL_INIT { NUM_RCU_LVL_0, NUM_RCU_LVL_1, NUM_RCU_LVL_2, NUM_RCU_LVL_3 }
+# define RCU_NODE_NAME_INIT { "rcu_node_0", "rcu_node_1", "rcu_node_2", "rcu_node_3" }
+# define RCU_FQS_NAME_INIT { "rcu_node_fqs_0", "rcu_node_fqs_1", "rcu_node_fqs_2", "rcu_node_fqs_3" }
#else
# error "CONFIG_RCU_FANOUT insufficient for NR_CPUS"
#endif /* #if (NR_CPUS) <= RCU_FANOUT_1 */
-#define RCU_SUM (NUM_RCU_LVL_0 + NUM_RCU_LVL_1 + NUM_RCU_LVL_2 + NUM_RCU_LVL_3 + NUM_RCU_LVL_4)
-#define NUM_RCU_NODES (RCU_SUM - NR_CPUS)
-
extern int rcu_num_lvls;
extern int rcu_num_nodes;
@@ -170,7 +193,6 @@
/* if there is no such task. If there */
/* is no current expedited grace period, */
/* then there can cannot be any such task. */
-#ifdef CONFIG_RCU_BOOST
struct list_head *boost_tasks;
/* Pointer to first task that needs to be */
/* priority boosted, or NULL if no priority */
@@ -208,7 +230,6 @@
unsigned long n_balk_nos;
/* Refused to boost: not sure why, though. */
/* This can happen due to race conditions. */
-#endif /* #ifdef CONFIG_RCU_BOOST */
#ifdef CONFIG_RCU_NOCB_CPU
wait_queue_head_t nocb_gp_wq[2];
/* Place for rcu_nocb_kthread() to wait GP. */
@@ -423,8 +444,6 @@
struct rcu_state {
struct rcu_node node[NUM_RCU_NODES]; /* Hierarchy. */
struct rcu_node *level[RCU_NUM_LVLS]; /* Hierarchy levels. */
- u32 levelcnt[MAX_RCU_LVLS + 1]; /* # nodes in each level. */
- u8 levelspread[RCU_NUM_LVLS]; /* kids/node in each level. */
u8 flavor_mask; /* bit in flavor mask. */
struct rcu_data __percpu *rda; /* pointer of percu rcu_data. */
void (*call)(struct rcu_head *head, /* call_rcu() flavor. */
@@ -519,14 +538,11 @@
* RCU implementation internal declarations:
*/
extern struct rcu_state rcu_sched_state;
-DECLARE_PER_CPU(struct rcu_data, rcu_sched_data);
extern struct rcu_state rcu_bh_state;
-DECLARE_PER_CPU(struct rcu_data, rcu_bh_data);
#ifdef CONFIG_PREEMPT_RCU
extern struct rcu_state rcu_preempt_state;
-DECLARE_PER_CPU(struct rcu_data, rcu_preempt_data);
#endif /* #ifdef CONFIG_PREEMPT_RCU */
#ifdef CONFIG_RCU_BOOST
diff --git a/kernel/rcu/tree_plugin.h b/kernel/rcu/tree_plugin.h
index 58b1ebd..a2f64e4 100644
--- a/kernel/rcu/tree_plugin.h
+++ b/kernel/rcu/tree_plugin.h
@@ -43,7 +43,17 @@
DEFINE_PER_CPU(unsigned int, rcu_cpu_kthread_loops);
DEFINE_PER_CPU(char, rcu_cpu_has_work);
-#endif /* #ifdef CONFIG_RCU_BOOST */
+#else /* #ifdef CONFIG_RCU_BOOST */
+
+/*
+ * Some architectures do not define rt_mutexes, but if !CONFIG_RCU_BOOST,
+ * all uses are in dead code. Provide a definition to keep the compiler
+ * happy, but add WARN_ON_ONCE() to complain if used in the wrong place.
+ * This probably needs to be excluded from -rt builds.
+ */
+#define rt_mutex_owner(a) ({ WARN_ON_ONCE(1); NULL; })
+
+#endif /* #else #ifdef CONFIG_RCU_BOOST */
#ifdef CONFIG_RCU_NOCB_CPU
static cpumask_var_t rcu_nocb_mask; /* CPUs to have callbacks offloaded. */
@@ -60,11 +70,11 @@
{
if (IS_ENABLED(CONFIG_RCU_TRACE))
pr_info("\tRCU debugfs-based tracing is enabled.\n");
- if ((IS_ENABLED(CONFIG_64BIT) && CONFIG_RCU_FANOUT != 64) ||
- (!IS_ENABLED(CONFIG_64BIT) && CONFIG_RCU_FANOUT != 32))
+ if ((IS_ENABLED(CONFIG_64BIT) && RCU_FANOUT != 64) ||
+ (!IS_ENABLED(CONFIG_64BIT) && RCU_FANOUT != 32))
pr_info("\tCONFIG_RCU_FANOUT set to non-default value of %d\n",
- CONFIG_RCU_FANOUT);
- if (IS_ENABLED(CONFIG_RCU_FANOUT_EXACT))
+ RCU_FANOUT);
+ if (rcu_fanout_exact)
pr_info("\tHierarchical RCU autobalancing is disabled.\n");
if (IS_ENABLED(CONFIG_RCU_FAST_NO_HZ))
pr_info("\tRCU dyntick-idle grace-period acceleration is enabled.\n");
@@ -74,12 +84,12 @@
pr_info("\tRCU torture testing starts during boot.\n");
if (IS_ENABLED(CONFIG_RCU_CPU_STALL_INFO))
pr_info("\tAdditional per-CPU info printed with stalls.\n");
- if (NUM_RCU_LVL_4 != 0)
- pr_info("\tFour-level hierarchy is enabled.\n");
- if (CONFIG_RCU_FANOUT_LEAF != 16)
+ if (RCU_NUM_LVLS >= 4)
+ pr_info("\tFour(or more)-level hierarchy is enabled.\n");
+ if (RCU_FANOUT_LEAF != 16)
pr_info("\tBuild-time adjustment of leaf fanout to %d.\n",
- CONFIG_RCU_FANOUT_LEAF);
- if (rcu_fanout_leaf != CONFIG_RCU_FANOUT_LEAF)
+ RCU_FANOUT_LEAF);
+ if (rcu_fanout_leaf != RCU_FANOUT_LEAF)
pr_info("\tBoot-time adjustment of leaf fanout to %d.\n", rcu_fanout_leaf);
if (nr_cpu_ids != NR_CPUS)
pr_info("\tRCU restricting CPUs from NR_CPUS=%d to nr_cpu_ids=%d.\n", NR_CPUS, nr_cpu_ids);
@@ -90,7 +100,8 @@
#ifdef CONFIG_PREEMPT_RCU
RCU_STATE_INITIALIZER(rcu_preempt, 'p', call_rcu);
-static struct rcu_state *rcu_state_p = &rcu_preempt_state;
+static struct rcu_state *const rcu_state_p = &rcu_preempt_state;
+static struct rcu_data __percpu *const rcu_data_p = &rcu_preempt_data;
static int rcu_preempted_readers_exp(struct rcu_node *rnp);
static void rcu_report_exp_rnp(struct rcu_state *rsp, struct rcu_node *rnp,
@@ -116,11 +127,11 @@
*/
static void rcu_preempt_qs(void)
{
- if (!__this_cpu_read(rcu_preempt_data.passed_quiesce)) {
+ if (!__this_cpu_read(rcu_data_p->passed_quiesce)) {
trace_rcu_grace_period(TPS("rcu_preempt"),
- __this_cpu_read(rcu_preempt_data.gpnum),
+ __this_cpu_read(rcu_data_p->gpnum),
TPS("cpuqs"));
- __this_cpu_write(rcu_preempt_data.passed_quiesce, 1);
+ __this_cpu_write(rcu_data_p->passed_quiesce, 1);
barrier(); /* Coordinate with rcu_preempt_check_callbacks(). */
current->rcu_read_unlock_special.b.need_qs = false;
}
@@ -150,7 +161,7 @@
!t->rcu_read_unlock_special.b.blocked) {
/* Possibly blocking in an RCU read-side critical section. */
- rdp = this_cpu_ptr(rcu_preempt_state.rda);
+ rdp = this_cpu_ptr(rcu_state_p->rda);
rnp = rdp->mynode;
raw_spin_lock_irqsave(&rnp->lock, flags);
smp_mb__after_unlock_lock();
@@ -180,10 +191,9 @@
if ((rnp->qsmask & rdp->grpmask) && rnp->gp_tasks != NULL) {
list_add(&t->rcu_node_entry, rnp->gp_tasks->prev);
rnp->gp_tasks = &t->rcu_node_entry;
-#ifdef CONFIG_RCU_BOOST
- if (rnp->boost_tasks != NULL)
+ if (IS_ENABLED(CONFIG_RCU_BOOST) &&
+ rnp->boost_tasks != NULL)
rnp->boost_tasks = rnp->gp_tasks;
-#endif /* #ifdef CONFIG_RCU_BOOST */
} else {
list_add(&t->rcu_node_entry, &rnp->blkd_tasks);
if (rnp->qsmask & rdp->grpmask)
@@ -263,9 +273,7 @@
bool empty_exp_now;
unsigned long flags;
struct list_head *np;
-#ifdef CONFIG_RCU_BOOST
bool drop_boost_mutex = false;
-#endif /* #ifdef CONFIG_RCU_BOOST */
struct rcu_node *rnp;
union rcu_special special;
@@ -307,9 +315,11 @@
t->rcu_read_unlock_special.b.blocked = false;
/*
- * Remove this task from the list it blocked on. The
- * task can migrate while we acquire the lock, but at
- * most one time. So at most two passes through loop.
+ * Remove this task from the list it blocked on. The task
+ * now remains queued on the rcu_node corresponding to
+ * the CPU it first blocked on, so the first attempt to
+ * acquire the task's rcu_node's ->lock will succeed.
+ * Keep the loop and add a WARN_ON() out of sheer paranoia.
*/
for (;;) {
rnp = t->rcu_blocked_node;
@@ -317,6 +327,7 @@
smp_mb__after_unlock_lock();
if (rnp == t->rcu_blocked_node)
break;
+ WARN_ON_ONCE(1);
raw_spin_unlock(&rnp->lock); /* irqs remain disabled. */
}
empty_norm = !rcu_preempt_blocked_readers_cgp(rnp);
@@ -331,12 +342,12 @@
rnp->gp_tasks = np;
if (&t->rcu_node_entry == rnp->exp_tasks)
rnp->exp_tasks = np;
-#ifdef CONFIG_RCU_BOOST
- if (&t->rcu_node_entry == rnp->boost_tasks)
- rnp->boost_tasks = np;
- /* Snapshot ->boost_mtx ownership with rcu_node lock held. */
- drop_boost_mutex = rt_mutex_owner(&rnp->boost_mtx) == t;
-#endif /* #ifdef CONFIG_RCU_BOOST */
+ if (IS_ENABLED(CONFIG_RCU_BOOST)) {
+ if (&t->rcu_node_entry == rnp->boost_tasks)
+ rnp->boost_tasks = np;
+ /* Snapshot ->boost_mtx ownership w/rnp->lock held. */
+ drop_boost_mutex = rt_mutex_owner(&rnp->boost_mtx) == t;
+ }
/*
* If this was the last task on the current list, and if
@@ -353,24 +364,21 @@
rnp->grplo,
rnp->grphi,
!!rnp->gp_tasks);
- rcu_report_unblock_qs_rnp(&rcu_preempt_state,
- rnp, flags);
+ rcu_report_unblock_qs_rnp(rcu_state_p, rnp, flags);
} else {
raw_spin_unlock_irqrestore(&rnp->lock, flags);
}
-#ifdef CONFIG_RCU_BOOST
/* Unboost if we were boosted. */
- if (drop_boost_mutex)
+ if (IS_ENABLED(CONFIG_RCU_BOOST) && drop_boost_mutex)
rt_mutex_unlock(&rnp->boost_mtx);
-#endif /* #ifdef CONFIG_RCU_BOOST */
/*
* If this was the last task on the expedited lists,
* then we need to report up the rcu_node hierarchy.
*/
if (!empty_exp && empty_exp_now)
- rcu_report_exp_rnp(&rcu_preempt_state, rnp, true);
+ rcu_report_exp_rnp(rcu_state_p, rnp, true);
} else {
local_irq_restore(flags);
}
@@ -390,7 +398,7 @@
raw_spin_unlock_irqrestore(&rnp->lock, flags);
return;
}
- t = list_entry(rnp->gp_tasks,
+ t = list_entry(rnp->gp_tasks->prev,
struct task_struct, rcu_node_entry);
list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry)
sched_show_task(t);
@@ -447,7 +455,7 @@
if (!rcu_preempt_blocked_readers_cgp(rnp))
return 0;
rcu_print_task_stall_begin(rnp);
- t = list_entry(rnp->gp_tasks,
+ t = list_entry(rnp->gp_tasks->prev,
struct task_struct, rcu_node_entry);
list_for_each_entry_continue(t, &rnp->blkd_tasks, rcu_node_entry) {
pr_cont(" P%d", t->pid);
@@ -491,8 +499,8 @@
return;
}
if (t->rcu_read_lock_nesting > 0 &&
- __this_cpu_read(rcu_preempt_data.qs_pending) &&
- !__this_cpu_read(rcu_preempt_data.passed_quiesce))
+ __this_cpu_read(rcu_data_p->qs_pending) &&
+ !__this_cpu_read(rcu_data_p->passed_quiesce))
t->rcu_read_unlock_special.b.need_qs = true;
}
@@ -500,7 +508,7 @@
static void rcu_preempt_do_callbacks(void)
{
- rcu_do_batch(&rcu_preempt_state, this_cpu_ptr(&rcu_preempt_data));
+ rcu_do_batch(rcu_state_p, this_cpu_ptr(rcu_data_p));
}
#endif /* #ifdef CONFIG_RCU_BOOST */
@@ -510,7 +518,7 @@
*/
void call_rcu(struct rcu_head *head, void (*func)(struct rcu_head *rcu))
{
- __call_rcu(head, func, &rcu_preempt_state, -1, 0);
+ __call_rcu(head, func, rcu_state_p, -1, 0);
}
EXPORT_SYMBOL_GPL(call_rcu);
@@ -711,7 +719,7 @@
void synchronize_rcu_expedited(void)
{
struct rcu_node *rnp;
- struct rcu_state *rsp = &rcu_preempt_state;
+ struct rcu_state *rsp = rcu_state_p;
unsigned long snap;
int trycount = 0;
@@ -798,7 +806,7 @@
*/
void rcu_barrier(void)
{
- _rcu_barrier(&rcu_preempt_state);
+ _rcu_barrier(rcu_state_p);
}
EXPORT_SYMBOL_GPL(rcu_barrier);
@@ -807,7 +815,7 @@
*/
static void __init __rcu_init_preempt(void)
{
- rcu_init_one(&rcu_preempt_state, &rcu_preempt_data);
+ rcu_init_one(rcu_state_p, rcu_data_p);
}
/*
@@ -830,7 +838,8 @@
#else /* #ifdef CONFIG_PREEMPT_RCU */
-static struct rcu_state *rcu_state_p = &rcu_sched_state;
+static struct rcu_state *const rcu_state_p = &rcu_sched_state;
+static struct rcu_data __percpu *const rcu_data_p = &rcu_sched_data;
/*
* Tell them what RCU they are running.
@@ -1172,7 +1181,7 @@
struct sched_param sp;
struct task_struct *t;
- if (&rcu_preempt_state != rsp)
+ if (rcu_state_p != rsp)
return 0;
if (!rcu_scheduler_fully_active || rcu_rnp_online_cpus(rnp) == 0)
@@ -1366,13 +1375,12 @@
* Because we not have RCU_FAST_NO_HZ, just check whether this CPU needs
* any flavor of RCU.
*/
-#ifndef CONFIG_RCU_NOCB_CPU_ALL
int rcu_needs_cpu(unsigned long *delta_jiffies)
{
*delta_jiffies = ULONG_MAX;
- return rcu_cpu_has_callbacks(NULL);
+ return IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL)
+ ? 0 : rcu_cpu_has_callbacks(NULL);
}
-#endif /* #ifndef CONFIG_RCU_NOCB_CPU_ALL */
/*
* Because we do not have RCU_FAST_NO_HZ, don't bother cleaning up
@@ -1479,11 +1487,15 @@
*
* The caller must have disabled interrupts.
*/
-#ifndef CONFIG_RCU_NOCB_CPU_ALL
int rcu_needs_cpu(unsigned long *dj)
{
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
+ if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL)) {
+ *dj = ULONG_MAX;
+ return 0;
+ }
+
/* Snapshot to detect later posting of non-lazy callback. */
rdtp->nonlazy_posted_snap = rdtp->nonlazy_posted;
@@ -1510,7 +1522,6 @@
}
return 0;
}
-#endif /* #ifndef CONFIG_RCU_NOCB_CPU_ALL */
/*
* Prepare a CPU for idle from an RCU perspective. The first major task
@@ -1524,7 +1535,6 @@
*/
static void rcu_prepare_for_idle(void)
{
-#ifndef CONFIG_RCU_NOCB_CPU_ALL
bool needwake;
struct rcu_data *rdp;
struct rcu_dynticks *rdtp = this_cpu_ptr(&rcu_dynticks);
@@ -1532,6 +1542,9 @@
struct rcu_state *rsp;
int tne;
+ if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL))
+ return;
+
/* Handle nohz enablement switches conservatively. */
tne = READ_ONCE(tick_nohz_active);
if (tne != rdtp->tick_nohz_enabled_snap) {
@@ -1579,7 +1592,6 @@
if (needwake)
rcu_gp_kthread_wake(rsp);
}
-#endif /* #ifndef CONFIG_RCU_NOCB_CPU_ALL */
}
/*
@@ -1589,12 +1601,11 @@
*/
static void rcu_cleanup_after_idle(void)
{
-#ifndef CONFIG_RCU_NOCB_CPU_ALL
- if (rcu_is_nocb_cpu(smp_processor_id()))
+ if (IS_ENABLED(CONFIG_RCU_NOCB_CPU_ALL) ||
+ rcu_is_nocb_cpu(smp_processor_id()))
return;
if (rcu_try_advance_all_cbs())
invoke_rcu_core();
-#endif /* #ifndef CONFIG_RCU_NOCB_CPU_ALL */
}
/*
@@ -3048,9 +3059,9 @@
if (tick_nohz_full_cpu(smp_processor_id()) &&
(!rcu_gp_in_progress(rsp) ||
ULONG_CMP_LT(jiffies, READ_ONCE(rsp->gp_start) + HZ)))
- return 1;
+ return true;
#endif /* #ifdef CONFIG_NO_HZ_FULL */
- return 0;
+ return false;
}
/*
diff --git a/lib/Kconfig.debug b/lib/Kconfig.debug
index 1767057..b908048 100644
--- a/lib/Kconfig.debug
+++ b/lib/Kconfig.debug
@@ -1233,6 +1233,7 @@
depends on DEBUG_KERNEL
select TORTURE_TEST
select SRCU
+ select TASKS_RCU
default n
help
This option provides a kernel module that runs torture tests
@@ -1261,12 +1262,38 @@
Say N here if you want the RCU torture tests to start only
after being manually enabled via /proc.
+config RCU_TORTURE_TEST_SLOW_PREINIT
+ bool "Slow down RCU grace-period pre-initialization to expose races"
+ depends on RCU_TORTURE_TEST
+ help
+ This option delays grace-period pre-initialization (the
+ propagation of CPU-hotplug changes up the rcu_node combining
+ tree) for a few jiffies between initializing each pair of
+ consecutive rcu_node structures. This helps to expose races
+ involving grace-period pre-initialization, in other words, it
+ makes your kernel less stable. It can also greatly increase
+ grace-period latency, especially on systems with large numbers
+ of CPUs. This is useful when torture-testing RCU, but in
+ almost no other circumstance.
+
+ Say Y here if you want your system to crash and hang more often.
+ Say N if you want a sane system.
+
+config RCU_TORTURE_TEST_SLOW_PREINIT_DELAY
+ int "How much to slow down RCU grace-period pre-initialization"
+ range 0 5
+ default 3
+ depends on RCU_TORTURE_TEST_SLOW_PREINIT
+ help
+ This option specifies the number of jiffies to wait between
+ each rcu_node structure pre-initialization step.
+
config RCU_TORTURE_TEST_SLOW_INIT
bool "Slow down RCU grace-period initialization to expose races"
depends on RCU_TORTURE_TEST
help
- This option makes grace-period initialization block for a
- few jiffies between initializing each pair of consecutive
+ This option delays grace-period initialization for a few
+ jiffies between initializing each pair of consecutive
rcu_node structures. This helps to expose races involving
grace-period initialization, in other words, it makes your
kernel less stable. It can also greatly increase grace-period
@@ -1281,10 +1308,35 @@
int "How much to slow down RCU grace-period initialization"
range 0 5
default 3
+ depends on RCU_TORTURE_TEST_SLOW_INIT
help
This option specifies the number of jiffies to wait between
each rcu_node structure initialization.
+config RCU_TORTURE_TEST_SLOW_CLEANUP
+ bool "Slow down RCU grace-period cleanup to expose races"
+ depends on RCU_TORTURE_TEST
+ help
+ This option delays grace-period cleanup for a few jiffies
+ between cleaning up each pair of consecutive rcu_node
+ structures. This helps to expose races involving grace-period
+ cleanup, in other words, it makes your kernel less stable.
+ It can also greatly increase grace-period latency, especially
+ on systems with large numbers of CPUs. This is useful when
+ torture-testing RCU, but in almost no other circumstance.
+
+ Say Y here if you want your system to crash and hang more often.
+ Say N if you want a sane system.
+
+config RCU_TORTURE_TEST_SLOW_CLEANUP_DELAY
+ int "How much to slow down RCU grace-period cleanup"
+ range 0 5
+ default 3
+ depends on RCU_TORTURE_TEST_SLOW_CLEANUP
+ help
+ This option specifies the number of jiffies to wait between
+ each rcu_node structure cleanup operation.
+
config RCU_CPU_STALL_TIMEOUT
int "RCU CPU stall timeout in seconds"
depends on RCU_STALL_COMMON
@@ -1321,6 +1373,17 @@
Say Y here if you want to enable RCU tracing
Say N if you are unsure.
+config RCU_EQS_DEBUG
+ bool "Use this when adding any sort of NO_HZ support to your arch"
+ depends on DEBUG_KERNEL
+ help
+ This option provides consistency checks in RCU's handling of
+ NO_HZ. These checks have proven quite helpful in detecting
+ bugs in arch-specific NO_HZ code.
+
+ Say N here if you need ultimate kernel/user switch latencies
+ Say Y if you are unsure
+
endmenu # "RCU Debugging"
config DEBUG_BLOCK_EXT_DEVT
diff --git a/net/netfilter/core.c b/net/netfilter/core.c
index e616301..ad70195 100644
--- a/net/netfilter/core.c
+++ b/net/netfilter/core.c
@@ -166,7 +166,7 @@
/* We may already have this, but read-locks nest anyway */
rcu_read_lock();
- elem = list_entry_rcu(&nf_hooks[state->pf][state->hook],
+ elem = list_entry_rcu(nf_hooks[state->pf][state->hook].next,
struct nf_hook_ops, list);
next_hook:
verdict = nf_iterate(&nf_hooks[state->pf][state->hook], skb, state,
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/CFcommon b/tools/testing/selftests/rcutorture/configs/rcu/CFcommon
index 4970121..f824b4c 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/CFcommon
+++ b/tools/testing/selftests/rcutorture/configs/rcu/CFcommon
@@ -1,3 +1,5 @@
CONFIG_RCU_TORTURE_TEST=y
CONFIG_PRINTK_TIME=y
+CONFIG_RCU_TORTURE_TEST_SLOW_CLEANUP=y
CONFIG_RCU_TORTURE_TEST_SLOW_INIT=y
+CONFIG_RCU_TORTURE_TEST_SLOW_PREINIT=y
diff --git a/tools/testing/selftests/rcutorture/configs/rcu/TINY02.boot b/tools/testing/selftests/rcutorture/configs/rcu/TINY02.boot
index 0f08027..6c1a292 100644
--- a/tools/testing/selftests/rcutorture/configs/rcu/TINY02.boot
+++ b/tools/testing/selftests/rcutorture/configs/rcu/TINY02.boot
@@ -1,2 +1,3 @@
rcupdate.rcu_self_test=1
rcupdate.rcu_self_test_bh=1
+rcutorture.torture_type=rcu_bh