e2e7500f58fe41bd83bea2b415fdfe9e206620e7 - pub/scm/linux/kernel/git/jfern/linux.git

commit	e2e7500f58fe41bd83bea2b415fdfe9e206620e7
author	Joel Fernandes <joelagnelf@nvidia.com>	Mon Dec 22 17:49:09 2025 -0500
committer	Joel Fernandes <joelagnelf@nvidia.com>	Mon Dec 22 19:31:44 2025 -0500
tree	3704ad1ab324978bb5a3bf5afcaa93011a240ce3
parent	2b207864a3b5a21e102d84cb0634eefc080eafe4

rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early

The RCU grace period mechanism uses a two-phase FQS (Force Quiescent
State) design where the first FQS saves dyntick-idle snapshots and the
second FQS compares them. This results in long and unnecessary latency
for synchronize_rcu() on idle systems (two FQS waits of ~3ms each with
1000HZ) whenever one FQS wait would have sufficed.

Investigation showed that the GP kthread's CPU is often the holdout CPU
after the first FQS, because it cannot be detected as "idle" while it is
actively running the FQS scan in the GP kthread.

Therefore, at the start of the first FQS, immediately report a quiescent
state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The
GP kthread cannot be in an RCU read-side critical section while running
the FQS scan, so this is safe and results in significant tail latency
improvements.

I benchmarked 6 runs of 100 synchronize_rcu() calls each. They show good
tail latency improvements per synchronize_rcu() call (default settings
for fqs jiffies):

Baseline (without fix):

| Run | Mean     | Min      | Max       |
|-----|----------|----------|-----------|
| 1   | 4.036 ms | 3.509 ms | 7.973 ms  |
| 2   | 4.049 ms | 3.904 ms | 8.003 ms  |
| 3   | 4.033 ms | 1.160 ms | 10.083 ms |
| 4   | 3.993 ms | 3.145 ms | 4.093 ms  |
| 5   | 3.988 ms | 2.675 ms | 4.123 ms  |
| 6   | 4.019 ms | 3.894 ms | 5.845 ms  |

With fix:

| Run | Mean     | Min      | Max      |
|-----|----------|----------|----------|
| 1   | 3.991 ms | 2.953 ms | 4.125 ms |
| 2   | 3.995 ms | 3.439 ms | 4.081 ms |
| 3   | 3.989 ms | 2.974 ms | 4.079 ms |
| 4   | 3.997 ms | 3.667 ms | 4.072 ms |
| 5   | 4.027 ms | 2.550 ms | 7.928 ms |
| 6   | 3.989 ms | 2.886 ms | 4.076 ms |

The fix reduces worst-case latency because the second FQS wait no longer
runs when it is not needed.

Tested rcutorture TREE and SRCU configurations.

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
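
The two-phase FQS behaviour described in the message can be illustrated with a small userspace model. This is only a sketch under simplifying assumptions, not the kernel implementation: every name in it (`toy_cpu`, `toy_fqs_scan`, `toy_count_fqs_waits`, `TOY_NR_CPUS`) is hypothetical, all CPUs other than the GP kthread's are assumed to stay idle, and the GP kthread's CPU is assumed to enter idle during every FQS wait. The real change reports the quiescent state with rcu_qs() + rcu_report_qs_rdp(), which this model reduces to setting a flag.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_NR_CPUS 4

/* Hypothetical per-CPU state; not the kernel's struct rcu_data. */
struct toy_cpu {
	bool idle;             /* currently in dyntick-idle */
	unsigned int idle_seq; /* bumped every time the CPU enters idle */
	unsigned int snap;     /* idle_seq saved by the first scan */
	bool qs_reported;      /* quiescent state already reported */
};

/* One FQS scan run on gp_cpu; returns the number of holdout CPUs left. */
static int toy_fqs_scan(struct toy_cpu *cpus, int gp_cpu, bool first_scan,
			bool report_gp_cpu_early)
{
	int holdouts = 0;

	assert(gp_cpu >= 0 && gp_cpu < TOY_NR_CPUS);

	/* The CPU running the scan is busy, so it cannot look idle. */
	cpus[gp_cpu].idle = false;

	/* Modelled fix: the scanning kthread is not in a read-side section. */
	if (first_scan && report_gp_cpu_early)
		cpus[gp_cpu].qs_reported = true;

	for (int i = 0; i < TOY_NR_CPUS; i++) {
		struct toy_cpu *c = &cpus[i];

		if (!c->qs_reported) {
			if (first_scan) {
				/* Phase 1: save a snapshot; idle CPUs pass at once. */
				c->snap = c->idle_seq;
				c->qs_reported = c->idle;
			} else {
				/* Phase 2: idle now, or idle since the snapshot. */
				c->qs_reported = c->idle || c->idle_seq != c->snap;
			}
		}
		if (!c->qs_reported)
			holdouts++;
	}
	return holdouts;
}

/* Count FQS waits until no holdouts remain on an otherwise idle system. */
static int toy_count_fqs_waits(bool report_gp_cpu_early)
{
	struct toy_cpu cpus[TOY_NR_CPUS];
	const int gp_cpu = 0;
	int waits = 0;

	for (int i = 0; i < TOY_NR_CPUS; i++)
		cpus[i] = (struct toy_cpu){ .idle = true };

	for (bool first_scan = true; ; first_scan = false) {
		/* FQS wait: the GP kthread sleeps, so its CPU enters idle. */
		waits++;
		cpus[gp_cpu].idle = true;
		cpus[gp_cpu].idle_seq++;

		if (toy_fqs_scan(cpus, gp_cpu, first_scan,
				 report_gp_cpu_early) == 0)
			return waits;
	}
}
```

In this model, toy_count_fqs_waits(false) needs two FQS waits because the scanning CPU is the only holdout after the first scan, while toy_count_fqs_waits(true) needs one.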