releases/3.8.8/sched_clock-prevent-64bit-inatomicity-on-32bit-systems.patch - pub/scm/linux/kernel/git/stable/stable-queue - Git at Google

 From a1cbcaa9ea87b87a96b9fc465951dcf36e459ca2 Mon Sep 17 00:00:00 2001
 From: Thomas Gleixner <tglx@linutronix.de>
 Date: Sat, 6 Apr 2013 10:10:27 +0200
 Subject: sched_clock: Prevent 64bit inatomicity on 32bit systems

 From: Thomas Gleixner <tglx@linutronix.de>

 commit a1cbcaa9ea87b87a96b9fc465951dcf36e459ca2 upstream.

 The sched_clock_remote() implementation has the following inatomicity
 problem on 32bit systems when accessing the remote scd->clock, which
 is a 64bit value.

 CPU0			CPU1

 sched_clock_local()	sched_clock_remote(CPU0)
 ...
 			remote_clock = scd[CPU0]->clock
 			    read_low32bit(scd[CPU0]->clock)
 cmpxchg64(scd->clock,...)
 			    read_high32bit(scd[CPU0]->clock)

 While the update of scd->clock is using an atomic64 mechanism, the
 readout on the remote cpu is not, which can cause completely bogus
 readouts.

 It is a quite rare problem, because it requires the update to hit the
 narrow race window between the low/high readout and the update must go
 across the 32bit boundary.

 The resulting misbehaviour is, that CPU1 will see the sched_clock on
 CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value
 to this bogus timestamp. This stays that way due to the clamping
 implementation for about 4 seconds until the synchronization with
 CLOCK_MONOTONIC undoes the problem.

 The issue is hard to observe, because it might only result in a less
 accurate SCHED_OTHER timeslicing behaviour. To create observable
 damage on realtime scheduling classes, it is necessary that the bogus
 update of CPU1 sched_clock happens in the context of an realtime
 thread, which then gets charged 4 seconds of RT runtime, which results
 in the RT throttler mechanism to trigger and prevent scheduling of RT
 tasks for a little less than 4 seconds. So this is quite unlikely as
 well.

 The issue was quite hard to decode as the reproduction time is between
 2 days and 3 weeks and intrusive tracing makes it less likely, but the
 following trace recorded with trace_clock=global, which uses
 sched_clock_local(), gave the final hint:

   <idle>-0   0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
   <idle>-0   0d..30 400269.477151: hrtimer_start:  hrtimer=0xf7061e80 ...
 irq/20-S-587 1d..32 400273.772118: sched_wakeup:   comm= ... target_cpu=0
   <idle>-0   0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

 What happens is that CPU0 goes idle and invokes
 sched_clock_idle_sleep_event() which invokes sched_clock_local() and
 CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
 sched_remote_clock(). The time jump gets propagated to CPU0 via
 sched_remote_clock() and stays stale on both cores for ~4 seconds.

 There are only two other possibilities, which could cause a stale
 sched clock:

 1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
    wrong value.

 2) sched_clock() which reads the TSC returns a sporadic wrong value.

 #1 can be excluded because sched_clock would continue to increase for
    one jiffy and then go stale.

 #2 can be excluded because it would not make the clock jump
    forward. It would just result in a stale sched_clock for one jiffy.

 After quite some brain twisting and finding the same pattern on other
 traces, sched_clock_remote() remained the only place which could cause
 such a problem and as explained above it's indeed racy on 32bit
 systems.

 So while on 64bit systems the readout is atomic, we need to verify the
 remote readout on 32bit machines. We need to protect the local->clock
 readout in sched_clock_remote() on 32bit as well because an NMI could
 hit between the low and the high readout, call sched_clock_local() and
 modify local->clock.

 Thanks to Siegfried Wulsch for bearing with my debug requests and
 going through the tedious tasks of running a bunch of reproducer
 systems to generate the debug information which let me decode the
 issue.

 Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de>
 Acked-by: Peter Zijlstra <peterz@infradead.org>
 Cc: Steven Rostedt <rostedt@goodmis.org>
 Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
 Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 ---
  kernel/sched/clock.c |   26 ++++++++++++++++++++++++++
  1 file changed, 26 insertions(+)

 --- a/kernel/sched/clock.c
 +++ b/kernel/sched/clock.c
 @@ -176,10 +176,36 @@ static u64 sched_clock_remote(struct sch
  	u64 this_clock, remote_clock;
  	u64 *ptr, old_val, val;

 +#if BITS_PER_LONG != 64
 +again:
 +	/*
 +	 * Careful here: The local and the remote clock values need to
 +	 * be read out atomic as we need to compare the values and
 +	 * then update either the local or the remote side. So the
 +	 * cmpxchg64 below only protects one readout.
 +	 *
 +	 * We must reread via sched_clock_local() in the retry case on
 +	 * 32bit as an NMI could use sched_clock_local() via the
 +	 * tracer and hit between the readout of
 +	 * the low32bit and the high 32bit portion.
 +	 */
 +	this_clock = sched_clock_local(my_scd);
 +	/*
 +	 * We must enforce atomic readout on 32bit, otherwise the
 +	 * update on the remote cpu can hit inbetween the readout of
 +	 * the low32bit and the high 32bit portion.
 +	 */
 +	remote_clock = cmpxchg64(&scd->clock, 0, 0);
 +#else
 +	/*
 +	 * On 64bit the read of [my]scd->clock is atomic versus the
 +	 * update, so we can avoid the above 32bit dance.
 +	 */
  	sched_clock_local(my_scd);
  again:
  	this_clock = my_scd->clock;
  	remote_clock = scd->clock;
 +#endif

  	/*
  	 * Use the opportunity that we have both locks
	From a1cbcaa9ea87b87a96b9fc465951dcf36e459ca2 Mon Sep 17 00:00:00 2001
	From: Thomas Gleixner <tglx@linutronix.de>
	Date: Sat, 6 Apr 2013 10:10:27 +0200
	Subject: sched_clock: Prevent 64bit inatomicity on 32bit systems

	From: Thomas Gleixner <tglx@linutronix.de>

	commit a1cbcaa9ea87b87a96b9fc465951dcf36e459ca2 upstream.

	The sched_clock_remote() implementation has the following inatomicity
	problem on 32bit systems when accessing the remote scd->clock, which
	is a 64bit value.

	CPU0 CPU1

	sched_clock_local() sched_clock_remote(CPU0)
	...
	remote_clock = scd[CPU0]->clock
	read_low32bit(scd[CPU0]->clock)
	cmpxchg64(scd->clock,...)
	read_high32bit(scd[CPU0]->clock)

	While the update of scd->clock is using an atomic64 mechanism, the
	readout on the remote cpu is not, which can cause completely bogus
	readouts.

	It is a quite rare problem, because it requires the update to hit the
	narrow race window between the low/high readout and the update must go
	across the 32bit boundary.

	The resulting misbehaviour is, that CPU1 will see the sched_clock on
	CPU1 ~4 seconds ahead of it's own and update CPU1s sched_clock value
	to this bogus timestamp. This stays that way due to the clamping
	implementation for about 4 seconds until the synchronization with
	CLOCK_MONOTONIC undoes the problem.

	The issue is hard to observe, because it might only result in a less
	accurate SCHED_OTHER timeslicing behaviour. To create observable
	damage on realtime scheduling classes, it is necessary that the bogus
	update of CPU1 sched_clock happens in the context of an realtime
	thread, which then gets charged 4 seconds of RT runtime, which results
	in the RT throttler mechanism to trigger and prevent scheduling of RT
	tasks for a little less than 4 seconds. So this is quite unlikely as
	well.

	The issue was quite hard to decode as the reproduction time is between
	2 days and 3 weeks and intrusive tracing makes it less likely, but the
	following trace recorded with trace_clock=global, which uses
	sched_clock_local(), gave the final hint:

	<idle>-0 0d..30 400269.477150: hrtimer_cancel: hrtimer=0xf7061e80
	<idle>-0 0d..30 400269.477151: hrtimer_start: hrtimer=0xf7061e80 ...
	irq/20-S-587 1d..32 400273.772118: sched_wakeup: comm= ... target_cpu=0
	<idle>-0 0dN.30 400273.772118: hrtimer_cancel: hrtimer=0xf7061e80

	What happens is that CPU0 goes idle and invokes
	sched_clock_idle_sleep_event() which invokes sched_clock_local() and
	CPU1 runs a remote wakeup for CPU0 at the same time, which invokes
	sched_remote_clock(). The time jump gets propagated to CPU0 via
	sched_remote_clock() and stays stale on both cores for ~4 seconds.

	There are only two other possibilities, which could cause a stale
	sched clock:

	1) ktime_get() which reads out CLOCK_MONOTONIC returns a sporadic
	wrong value.

	2) sched_clock() which reads the TSC returns a sporadic wrong value.

	#1 can be excluded because sched_clock would continue to increase for
	one jiffy and then go stale.

	#2 can be excluded because it would not make the clock jump
	forward. It would just result in a stale sched_clock for one jiffy.

	After quite some brain twisting and finding the same pattern on other
	traces, sched_clock_remote() remained the only place which could cause
	such a problem and as explained above it's indeed racy on 32bit
	systems.

	So while on 64bit systems the readout is atomic, we need to verify the
	remote readout on 32bit machines. We need to protect the local->clock
	readout in sched_clock_remote() on 32bit as well because an NMI could
	hit between the low and the high readout, call sched_clock_local() and
	modify local->clock.

	Thanks to Siegfried Wulsch for bearing with my debug requests and
	going through the tedious tasks of running a bunch of reproducer
	systems to generate the debug information which let me decode the
	issue.

	Reported-by: Siegfried Wulsch <Siegfried.Wulsch@rovema.de>
	Acked-by: Peter Zijlstra <peterz@infradead.org>
	Cc: Steven Rostedt <rostedt@goodmis.org>
	Link: http://lkml.kernel.org/r/alpine.LFD.2.02.1304051544160.21884@ionos
	Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
	Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

	---
	kernel/sched/clock.c \| 26 ++++++++++++++++++++++++++
	1 file changed, 26 insertions(+)

	--- a/kernel/sched/clock.c
	+++ b/kernel/sched/clock.c
	@@ -176,10 +176,36 @@ static u64 sched_clock_remote(struct sch
	u64 this_clock, remote_clock;
	u64 *ptr, old_val, val;

	+#if BITS_PER_LONG != 64
	+again:
	+ /*
	+ * Careful here: The local and the remote clock values need to
	+ * be read out atomic as we need to compare the values and
	+ * then update either the local or the remote side. So the
	+ * cmpxchg64 below only protects one readout.
	+ *
	+ * We must reread via sched_clock_local() in the retry case on
	+ * 32bit as an NMI could use sched_clock_local() via the
	+ * tracer and hit between the readout of
	+ * the low32bit and the high 32bit portion.
	+ */
	+ this_clock = sched_clock_local(my_scd);
	+ /*
	+ * We must enforce atomic readout on 32bit, otherwise the
	+ * update on the remote cpu can hit inbetween the readout of
	+ * the low32bit and the high 32bit portion.
	+ */
	+ remote_clock = cmpxchg64(&scd->clock, 0, 0);
	+#else
	+ /*
	+ * On 64bit the read of [my]scd->clock is atomic versus the
	+ * update, so we can avoid the above 32bit dance.
	+ */
	sched_clock_local(my_scd);
	again:
	this_clock = my_scd->clock;
	remote_clock = scd->clock;
	+#endif

	/*
	* Use the opportunity that we have both locks