refs/heads/hp-per-task - pub/scm/linux/kernel/git/jfern/linux.git

commit	ab45ffc98534050c7466201b1393198fa6fe2820
author	Joel Fernandes <joelagnelf@nvidia.com>	Sun Dec 21 04:43:48 2025 -0500
committer	Joel Fernandes <joelagnelf@nvidia.com>	Sun Dec 21 04:43:48 2025 -0500
tree	ca966be929fec6bb6655c90550226942e99c6e58
parent	efd5265b7822fcf515e11c16816b086dc3d3e3a6

rcu: Fix synchronize_rcu latency by removing CPU threshold for wake_from_gp

Remove the magic number WAKE_FROM_GP_CPU_THRESHOLD (16) and always enable
the rcu_normal_wake_from_gp optimization. This provides consistent low
latency for synchronize_rcu() regardless of system size.

The optimization wakes synchronize_rcu() waiters directly from the GP
kthread instead of waiting for softirq callback processing. With the
optimization disabled (the default on systems with more than 16 CPUs),
the FQS loop waits multiple iterations (jiffies_till_first_fqs=3 each),
leading to ~9.5ms latency. With the optimization enabled, only a single
FQS iteration is needed, reducing latency to ~4ms (a 58% reduction).
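
The change in default behaviour can be modelled in user space roughly as
follows. This is only a sketch under stated assumptions: the function
names and the -1 "not set" sentinel are hypothetical, and the real logic
lives in kernel/rcu/tree.c.

```c
#include <assert.h>
#include <stdbool.h>

/* Value quoted in the commit message; the macro is removed by this change. */
#define WAKE_FROM_GP_CPU_THRESHOLD 16

/*
 * Hypothetical user-space model, not the code in kernel/rcu/tree.c.
 * param: -1 = not set by the user (assumed sentinel), 0 = off, 1 = on.
 */

/* Before: an unset parameter enables the wakeup only on small systems. */
static bool wake_from_gp_before(int param, int ncpus)
{
	assert(ncpus > 0);
	if (param < 0)
		return ncpus <= WAKE_FROM_GP_CPU_THRESHOLD;
	return param != 0;
}

/* After: an unset parameter enables the wakeup regardless of CPU count. */
static bool wake_from_gp_after(int param, int ncpus)
{
	assert(ncpus > 0);
	if (param < 0)
		return true;
	return param != 0;
}
```

In this model a 36-CPU system with the parameter left unset goes from
disabled to enabled, while an explicit 0 still disables the optimization
in both versions.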

Performance measurements on a 36-CPU system:

  Without optimization (default, >16 CPUs):
    - synchronize_rcu() latency: ~9.5ms
    - FQS iterations: 2
    - Wakeup path: softirq callback chain

  With optimization enabled:
    - synchronize_rcu() latency: ~4ms (58% lower)
    - FQS iterations: 1
    - Wakeup path: direct from GP kthread

For workloads involving synchronize_rcu() in teardown paths, such as
network bridge deletion (~100ms total), this reduces overall latency by
approximately 5%.
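
The percentages quoted above can be rechecked with a few lines of
arithmetic. The helper names are hypothetical, and the share-of-total
figure assumes that exactly one synchronize_rcu() call's worth of
latency (~5.5ms) is saved out of the ~100ms operation, which the
measurements above do not state explicitly.

```c
#include <assert.h>

/* Relative reduction, in percent, when latency drops from before_ms to after_ms. */
static double reduction_pct(double before_ms, double after_ms)
{
	assert(before_ms > 0.0);
	return 100.0 * (before_ms - after_ms) / before_ms;
}

/* Share of a total operation time, in percent, accounted for by saved_ms. */
static double share_of_total_pct(double saved_ms, double total_ms)
{
	assert(total_ms > 0.0);
	return 100.0 * saved_ms / total_ms;
}
```

With the figures above, reduction_pct(9.5, 4.0) comes to roughly 57.9
and share_of_total_pct(9.5 - 4.0, 100.0) to 5.5, in line with the
rounded 58% and "approximately 5%" values.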

The rcutree.rcu_normal_wake_from_gp module parameter remains available
to disable the optimization if needed (set to 0).
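
To check which value a running kernel is using, the parameter can be
read back from user space. A minimal sketch, assuming the parameter is
exported at the usual sysfs location for "rcutree." parameters; the
path and the helper name are assumptions and should be verified on the
target kernel.

```c
#include <assert.h>
#include <stdio.h>

/* Assumed sysfs location; verify on the target kernel. */
#define WAKE_FROM_GP_PARAM_PATH \
	"/sys/module/rcutree/parameters/rcu_normal_wake_from_gp"

/*
 * Hypothetical helper: read a single integer from the file at path.
 * Returns 0 on success, -1 if the file cannot be opened or parsed.
 */
static int read_int_param(const char *path, int *value)
{
	FILE *f;
	int ok;

	assert(path && value);
	f = fopen(path, "r");
	if (!f)
		return -1;
	ok = (fscanf(f, "%d", value) == 1);
	fclose(f);
	return ok ? 0 : -1;
}
```

Calling read_int_param(WAKE_FROM_GP_PARAM_PATH, &v) and finding v == 0
would indicate that the optimization has been disabled, for example via
rcutree.rcu_normal_wake_from_gp=0 on the kernel command line.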

Signed-off-by: Joel Fernandes (Google) <joel@joelfernandes.org>

kernel/rcu/tree.c

1 file changed