nohz: Synchronize sleep time stats with seqlock
When a call site uses get_cpu_*_time_us() to read a sleeptime
stat, it deduces the total sleeptime by adding the pending time
to the last sleeptime snapshot if the target CPU is idle.
This amounts to:
       sleeptime = ts($CPU)->idle_sleeptime;
       if (ts($CPU)->idle_active)
               sleeptime += NOW() - ts($CPU)->idle_entrytime;
But this only works if idle_sleeptime, idle_entrytime and idle_active are
read and updated in a well-defined order.
Let's consider the following scenario:
             CPU 0                                        CPU 1
  (seq 1)    ts(CPU 0)->idle_active = 1
             ts(CPU 0)->idle_entrytime = NOW()
  (seq 2)    sleeptime = NOW() - ts(CPU 0)->idle_entrytime
             ts(CPU 0)->idle_sleeptime += sleeptime       sleeptime = ts(CPU 0)->idle_sleeptime;
                                                          if (ts(CPU 0)->idle_active)
             ts(CPU 0)->idle_entrytime = NOW()                sleeptime += NOW() - ts(CPU 0)->idle_entrytime
The resulting value of sleeptime on CPU 1 can vary depending on the
order in which it observes the updates from CPU 0:
* If it sees the value of idle_entrytime after seq 1 and the value of idle_sleeptime
after seq 2, the value of sleeptime will be wrong because it accounts for the delta
twice, so it will be too high.
* If it sees the value of idle_entrytime after seq 2 and the value of idle_sleeptime
after seq 1, the value of sleeptime will be wrong because it misses the delta, so it
will be too low.
* If it sees the values of idle_entrytime and idle_sleeptime both as of seq 1, or
both as of seq 2, the value will be correct.
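To make the three cases concrete, here is a small worked example in plain C. The
struct, the function name and the timestamps are hypothetical and only serve the
illustration: CPU 0 enters idle at t=100 (seq 1), folds the pending delta at t=150
(seq 2), and CPU 1 reads at t=160, so the true sleeptime is 60.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical snapshot of the three fields as a remote reader observes them. */
struct snapshot {
	uint64_t idle_sleeptime;
	uint64_t idle_entrytime;
	bool idle_active;
};

/* Reader-side computation described in the changelog. */
static uint64_t reader_sleeptime(struct snapshot s, uint64_t now)
{
	uint64_t sleeptime = s.idle_sleeptime;

	if (s.idle_active) {
		assert(now >= s.idle_entrytime);
		sleeptime += now - s.idle_entrytime;
	}
	return sleeptime;
}

/* Coherent views: both fields taken after seq 1, or both after seq 2. */
static const struct snapshot after_seq1 = { 0, 100, true };
static const struct snapshot after_seq2 = { 50, 150, true };

/* Mixed views: one field from each sequence. */
static const struct snapshot entry_seq1_sleep_seq2 = { 50, 100, true };
static const struct snapshot entry_seq2_sleep_seq1 = { 0, 150, true };
```

With now = 160, both coherent snapshots give 60. The first mixed snapshot gives
110 (the 50 unit delta is counted twice) and the second gives 10 (the delta is
lost).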
Trickier scenarios can also happen if the value of idle_active is read from an earlier sequence.
Hence we must honour the following constraints:
- idle_sleeptime, idle_active and idle_entrytime must be updated and read
under some correctly enforced SMP ordering
- The three values as read by CPU 1 must belong to the same update
sequence from CPU 0. The update sequences must be delimited such that the
three values resulting from a completed sequence produce a coherent result
together when read from CPU 1.
- Readers must be prevented from fetching values from the middle of an
update sequence.
The ideal solution to implement this synchronization is to use a seqcount. Let's
use one here around these three values to enforce sequence synchronization between
updates and reads.
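The pattern can be sketched in userspace C11 as follows. This is not the patch
itself: the kernel has its own seqcount_t primitives (write_seqcount_begin(),
write_seqcount_end(), read_seqcount_begin(), read_seqcount_retry()), whereas this
sketch uses sequentially consistent C11 atomics for simplicity. The struct and
function names are hypothetical, and a single writer per struct is assumed.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for the three per-CPU fields plus their seqcount. */
struct idle_stats {
	atomic_uint seq;		/* even: stable, odd: update in progress */
	_Atomic uint64_t idle_sleeptime;
	_Atomic uint64_t idle_entrytime;
	atomic_bool idle_active;
};

static void stats_write_begin(struct idle_stats *s)
{
	atomic_fetch_add(&s->seq, 1);	/* count becomes odd */
}

static void stats_write_end(struct idle_stats *s)
{
	atomic_fetch_add(&s->seq, 1);	/* count becomes even again */
}

/* seq 1 in the scenario: the CPU enters idle. */
static void idle_enter(struct idle_stats *s, uint64_t now)
{
	stats_write_begin(s);
	atomic_store(&s->idle_active, true);
	atomic_store(&s->idle_entrytime, now);
	stats_write_end(s);
}

/* seq 2 in the scenario: fold the pending delta and restart the window. */
static void idle_update(struct idle_stats *s, uint64_t now)
{
	uint64_t entry;

	stats_write_begin(s);
	entry = atomic_load(&s->idle_entrytime);
	assert(now >= entry);
	atomic_fetch_add(&s->idle_sleeptime, now - entry);
	atomic_store(&s->idle_entrytime, now);
	stats_write_end(s);
}

/* Remote reader: retry until all three values come from one sequence. */
static uint64_t read_sleeptime(struct idle_stats *s, uint64_t now)
{
	unsigned int start;
	uint64_t sleeptime;

	do {
		/* Wait for an even count: no update in progress. */
		while ((start = atomic_load(&s->seq)) & 1)
			;
		sleeptime = atomic_load(&s->idle_sleeptime);
		if (atomic_load(&s->idle_active))
			sleeptime += now - atomic_load(&s->idle_entrytime);
		/* A changed count means a writer interfered: discard and retry. */
	} while (atomic_load(&s->seq) != start);

	return sleeptime;
}
```

Because the reader only accepts a result when the count is even and unchanged
across its three loads, it can never combine idle_entrytime from one sequence
with idle_sleeptime from another.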
This fixes a reported bug where non-monotonic sleeptime stats are returned by /proc/stat
when it is frequently read. It also fixes potential cpufreq governor bugs.
Reported-by: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Reported-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Signed-off-by: Frederic Weisbecker <fweisbec@gmail.com>
Cc: Fernando Luis Vazquez Cao <fernando_b1@lab.ntt.co.jp>
Cc: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp>
Cc: Thomas Gleixner <tglx@linutronix.de>
Cc: Ingo Molnar <mingo@kernel.org>
Cc: Peter Zijlstra <peterz@infradead.org>
Cc: Andrew Morton <akpm@linux-foundation.org>
Cc: Arjan van de Ven <arjan@linux.intel.com>
Cc: Oleg Nesterov <oleg@redhat.com>
2 files changed