rcu: Fix synchronize_rcu latency by removing CPU threshold for wake_from_gp

Remove the magic number WAKE_FROM_GP_CPU_THRESHOLD (16) and always enable
the rcu_normal_wake_from_gp optimization. This provides consistently low
latency for synchronize_rcu() regardless of system size.

The optimization wakes synchronize_rcu() waiters directly from the GP
kthread instead of waiting for softirq callback processing. With the
optimization disabled (the default on systems with more than 16 CPUs),
the FQS loop runs two iterations (jiffies_till_first_fqs=3 each) before
the waiter is woken, giving ~9.5ms latency. With the optimization
enabled, a single FQS iteration suffices, reducing latency to ~4ms
(a 58% reduction).
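
As an illustration only, the change in default behaviour can be modelled
in user-space C. This is a sketch, not the kernel code: the function
names and the use of -1 to mean "not set by the administrator" are
assumptions made for the example. Only WAKE_FROM_GP_CPU_THRESHOLD (16)
and the meaning of 0 (disable) are taken from this message.

```c
#include <assert.h>
#include <stdbool.h>

/* Value of the threshold that this patch removes. */
#define WAKE_FROM_GP_CPU_THRESHOLD 16

/*
 * Hypothetical model of the parameter encoding (an assumption):
 *   -1 = not set, 0 = explicitly disabled, 1 = explicitly enabled.
 */

/* Before the patch: an unset parameter enables the path only on small systems. */
static bool wake_from_gp_before(int param, unsigned int nr_cpus)
{
	assert(param >= -1 && param <= 1);
	if (param >= 0)
		return param != 0;
	return nr_cpus <= WAKE_FROM_GP_CPU_THRESHOLD;
}

/* After the patch: enabled on every system unless explicitly set to 0. */
static bool wake_from_gp_after(int param, unsigned int nr_cpus)
{
	assert(param >= -1 && param <= 1);
	(void)nr_cpus;	/* System size no longer matters. */
	return param != 0;
}
```

In this model a 36-CPU system with the parameter left unset goes from
disabled (wake_from_gp_before(-1, 36)) to enabled
(wake_from_gp_after(-1, 36)), while an explicit 0 still disables it.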

Performance measurements on a 36-CPU system:

Without optimization (default, >16 CPUs):
  - synchronize_rcu() latency: ~9.5ms
  - FQS iterations: 2
  - Wakeup path: softirq callback chain

With optimization enabled:
  - synchronize_rcu() latency: ~4ms (58% lower)
  - FQS iterations: 1
  - Wakeup path: direct from GP kthread

For workloads that call synchronize_rcu() in teardown paths, such as
network bridge deletion (~100ms total), this reduces overall latency by
approximately 5%.
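
The percentages quoted above can be reproduced with a few lines of C.
The helper names are hypothetical, and the teardown figure assumes that
the bridge deletion path waits on exactly one synchronize_rcu() call,
which this message does not state.

```c
#include <assert.h>

/* Percentage by which latency drops when going from before_ms to after_ms. */
static double reduction_pct(double before_ms, double after_ms)
{
	assert(before_ms > 0.0);
	return (before_ms - after_ms) / before_ms * 100.0;
}

/*
 * Share of a whole operation (total_ms) accounted for by the time saved
 * on a single synchronize_rcu() call (assumption: one call per operation).
 */
static double saved_share_pct(double before_ms, double after_ms,
			      double total_ms)
{
	assert(total_ms > 0.0);
	return (before_ms - after_ms) / total_ms * 100.0;
}
```

With the measured values, reduction_pct(9.5, 4.0) is just under 58, and
saved_share_pct(9.5, 4.0, 100.0) is about 5.5, in line with the
"approximately 5%" figure for a ~100ms bridge deletion.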

The rcutree.rcu_normal_wake_from_gp module parameter remains available
to disable the optimization if needed (set to 0).

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

1 file changed