rcu: Reduce synchronize_rcu() latency by reporting GP kthread's CPU QS early

The RCU grace period mechanism uses a two-phase FQS (Force Quiescent
State) design where the first FQS saves dyntick-idle snapshots and
the second FQS compares them. On idle systems this adds long and
unnecessary latency to synchronize_rcu() (two FQS waits of ~3ms each
with HZ=1000) in cases where one FQS wait would have sufficed.
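
As a rough illustration of the arithmetic above, the following
user-space C sketch computes the idle-system latency for one versus two
FQS waits. It assumes the default wait is 1 + (HZ > 250) + (HZ > 500)
jiffies, which matches the ~3ms figure for HZ=1000, and it ignores any
scaling by CPU count. The helper names are hypothetical and are not
kernel APIs.

```c
#include <assert.h>

/*
 * Hypothetical helper: default FQS wait in jiffies for a given HZ.
 * Assumption: the default follows 1 + (HZ > 250) + (HZ > 500);
 * per-CPU-count adjustments are deliberately left out.
 */
static unsigned long fqs_wait_jiffies(unsigned int hz)
{
	assert(hz > 0);
	return 1 + (hz > 250) + (hz > 500);
}

/* Convert a jiffies count to milliseconds for the given HZ. */
static double jiffies_to_ms(unsigned long j, unsigned int hz)
{
	return 1000.0 * (double)j / (double)hz;
}

/*
 * Time spent sleeping in FQS waits for one grace period on an idle
 * system, given how many FQS waits were needed.
 */
static double idle_gp_latency_ms(unsigned int hz, unsigned int nr_fqs_waits)
{
	return nr_fqs_waits * jiffies_to_ms(fqs_wait_jiffies(hz), hz);
}
```

With HZ=1000 this gives 3 jiffies per wait, so idle_gp_latency_ms(1000, 2)
is 6 ms of waiting while idle_gp_latency_ms(1000, 1) is 3 ms.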

Investigation showed that the GP kthread's CPU is often the holdout CPU
after the first FQS: it cannot be detected as "idle" because it is
actively running the FQS scan in the GP kthread.

Therefore, at the start of the first FQS, immediately report a quiescent
state for the GP kthread's CPU using rcu_qs() + rcu_report_qs_rdp(). The
GP kthread cannot be in an RCU read-side critical section while running
the FQS scan, so this is safe and results in significant tail latency
improvements.
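
The effect described above can be sketched with a toy user-space model
(not kernel code). It assumes an otherwise idle system in which the only
non-idle CPU at scan time is the one running the GP kthread, and that
this CPU goes idle whenever the kthread sleeps between scans. All names
(toy_cpu, toy_fqs_waits, report_gp_cpu_early) are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>

#define TOY_MAX_CPUS 64

struct toy_cpu {
	bool idle;                 /* currently in dyntick-idle */
	unsigned int idle_entries; /* bumped each time the CPU enters idle */
	unsigned int snap;         /* idle_entries saved by the first scan */
	bool qs_reported;          /* quiescent state already reported */
};

/*
 * Return how many FQS waits one grace period needs in this toy model.
 * report_gp_cpu_early models the fix: the scanning CPU reports its own
 * quiescent state at the start of the first scan.
 */
static int toy_fqs_waits(int nr_cpus, int gp_cpu, bool report_gp_cpu_early)
{
	struct toy_cpu cpus[TOY_MAX_CPUS] = { 0 };
	int waits = 0;

	assert(nr_cpus > 0 && nr_cpus <= TOY_MAX_CPUS);
	assert(gp_cpu >= 0 && gp_cpu < nr_cpus);

	for (int i = 0; i < nr_cpus; i++)
		cpus[i].idle = true;

	for (;;) {
		int holdouts = 0;

		waits++;                   /* one FQS wait has elapsed */
		cpus[gp_cpu].idle = false; /* kthread wakes up to scan */

		if (waits == 1 && report_gp_cpu_early)
			cpus[gp_cpu].qs_reported = true;

		for (int i = 0; i < nr_cpus; i++) {
			struct toy_cpu *c = &cpus[i];

			if (c->qs_reported)
				continue;
			if (waits == 1) {
				/* First scan: snapshot, report CPUs idle now. */
				c->snap = c->idle_entries;
				c->qs_reported = c->idle;
			} else {
				/* Later scans: compare against the snapshot. */
				c->qs_reported = c->idle ||
						 c->idle_entries != c->snap;
			}
			if (!c->qs_reported)
				holdouts++;
		}
		if (!holdouts)
			return waits;

		/* Kthread sleeps again, so its CPU re-enters idle. */
		cpus[gp_cpu].idle = true;
		cpus[gp_cpu].idle_entries++;
	}
}
```

In this model, toy_fqs_waits(8, 0, false) needs two waits because CPU 0
is the only holdout after the first scan, while toy_fqs_waits(8, 0, true)
completes after one.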

I benchmarked 6 runs of 100 synchronize_rcu() calls each, with default
settings for fqs jiffies. The per-call latencies show good tail latency
improvements:

Baseline (without fix):
| Run | Mean     | Min      | Max       |
|-----|----------|----------|-----------|
| 1   | 4.036 ms | 3.509 ms | 7.973 ms  |
| 2   | 4.049 ms | 3.904 ms | 8.003 ms  |
| 3   | 4.033 ms | 1.160 ms | 10.083 ms |
| 4   | 3.993 ms | 3.145 ms | 4.093 ms  |
| 5   | 3.988 ms | 2.675 ms | 4.123 ms  |
| 6   | 4.019 ms | 3.894 ms | 5.845 ms  |

With fix:
| Run | Mean     | Min      | Max      |
|-----|----------|----------|----------|
| 1   | 3.991 ms | 2.953 ms | 4.125 ms |
| 2   | 3.995 ms | 3.439 ms | 4.081 ms |
| 3   | 3.989 ms | 2.974 ms | 4.079 ms |
| 4   | 3.997 ms | 3.667 ms | 4.072 ms |
| 5   | 4.027 ms | 2.550 ms | 7.928 ms |
| 6   | 3.989 ms | 2.886 ms | 4.076 ms |

The fix reduces worst-case latency because the second FQS wait is
skipped when it is not needed.

Tested with rcutorture TREE and SRCU configurations.

Signed-off-by: Joel Fernandes <joelagnelf@nvidia.com>
1 file changed