rcu: Fix synchronize_rcu latency by removing CPU threshold for wake_from_gp

Remove the magic number WAKE_FROM_GP_CPU_THRESHOLD (16) and enable the
rcu_normal_wake_from_gp optimization by default on all systems. This
provides consistent low latency for synchronize_rcu() regardless of
system size.
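
As a rough illustration of the default selection before and after this
change, the following user-space sketch models the decision. The function
names are hypothetical, and the convention that a negative parameter
value means "not set by the user" is an assumption made for the sketch.
It is not the code in the kernel tree.

```c
#include <stdbool.h>

/* Threshold removed by this patch. */
#define WAKE_FROM_GP_CPU_THRESHOLD 16

/*
 * Hypothetical model of the old default: an explicit user setting wins,
 * otherwise the optimization is enabled only on systems with at most
 * WAKE_FROM_GP_CPU_THRESHOLD CPUs.
 * Assumption: param < 0 means the user did not set the parameter.
 */
static inline bool wake_from_gp_old(int param, unsigned int nr_cpus)
{
	if (param >= 0)
		return param != 0;
	return nr_cpus <= WAKE_FROM_GP_CPU_THRESHOLD;
}

/*
 * Hypothetical model of the new default: an explicit user setting still
 * wins, otherwise the optimization is enabled regardless of CPU count.
 */
static inline bool wake_from_gp_new(int param, unsigned int nr_cpus)
{
	(void)nr_cpus;	/* system size no longer matters */
	if (param >= 0)
		return param != 0;
	return true;
}
```

In this model a 36-CPU system with the parameter unset gets the
optimization only under the new default, while an explicit 0 disables it
in both cases.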

The optimization enables direct wakeup of synchronize_rcu() waiters from
the GP kthread instead of waiting for softirq callback processing. With
the optimization disabled (the default on systems with >16 CPUs), the FQS
loop waits multiple iterations (jiffies_till_first_fqs=3 each), leading
to ~9.5ms latency. With the optimization enabled, only a single FQS
iteration is needed, reducing latency to ~4ms (about 58% lower).

Performance measurements on a 36-CPU system:

  Without optimization (default, >16 CPUs):
    - synchronize_rcu() latency: ~9.5ms
    - FQS iterations: 2
    - Wakeup path: softirq callback chain

  With optimization enabled:
    - synchronize_rcu() latency: ~4ms (about 58% lower)
    - FQS iterations: 1
    - Wakeup path: direct from GP kthread

For workloads involving synchronize_rcu() in teardown paths, such as
network bridge deletion (~100ms total), this reduces overall latency by
approximately 5%.
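
The percentages quoted above follow from simple arithmetic, sketched
below. The helper names are hypothetical. Applying the saved time to the
bridge deletion figure assumes a single synchronize_rcu() wait on that
path, which is an assumption of the sketch and not something measured
here.

```c
/* Percentage by which after_ms is lower than before_ms. */
static inline double percent_reduction(double before_ms, double after_ms)
{
	return (before_ms - after_ms) / before_ms * 100.0;
}

/* Percentage of total_ms that a saving of saved_ms represents. */
static inline double percent_of_total(double saved_ms, double total_ms)
{
	return saved_ms / total_ms * 100.0;
}
```

With the approximate figures from this message,
percent_reduction(9.5, 4.0) gives the per-call reduction, and
percent_of_total(9.5 - 4.0, 100.0) gives the share of a ~100ms teardown
saved by one faster synchronize_rcu() call.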

The rcutree.rcu_normal_wake_from_gp module parameter remains available;
setting it to 0 disables the optimization if needed.

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>
1 file changed