releases/3.16.40/tile-avoid-using-clocksource_cyc2ns-with-absolute-cycle-count.patch - pub/scm/linux/kernel/git/bwh/linux-stable-queue - Git at Google

 From: Chris Metcalf <cmetcalf@mellanox.com>
 Date: Wed, 16 Nov 2016 11:18:05 -0500
 Subject: tile: avoid using clocksource_cyc2ns with absolute cycle count

 commit e658a6f14d7c0243205f035979d0ecf6c12a036f upstream.

 For large values of "mult" and long uptimes, the intermediate
 result of "cycles * mult" can overflow 64 bits.  For example,
 the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock;
 we have mult = 853, and after 208.5 days, we overflow 64 bits.

 Since clocksource_cyc2ns() is intended to be used for relative
 cycle counts, not absolute cycle counts, performance is more
 importance than accepting a wider range of cycle values.  So,
 just use mult_frac() directly in tile's sched_clock().

 Commit 4cecf6d401a0 ("sched, x86: Avoid unnecessary overflow
 in sched_clock") by Salman Qazi results in essentially the same
 generated code for x86 as this change does for tile.  In fact,
 a follow-on change by Salman introduced mult_frac() and switched
 to using it, so the C code was largely identical at that point too.

 Peter Zijlstra then added mul_u64_u32_shr() and switched x86
 to use it.  This is, in principle, better; by optimizing the
 64x64->64 multiplies to be 32x32->64 multiplies we can potentially
 save some time.  However, the compiler piplines the 64x64->64
 multiplies pretty well, and the conditional branch in the generic
 mul_u64_u32_shr() causes some bubbles in execution, with the
 result that it's pretty much a wash.  If tilegx provided its own
 implementation of mul_u64_u32_shr() without the conditional branch,
 we could potentially save 3 cycles, but that seems like small gain
 for a fair amount of additional build scaffolding; no other platform
 currently provides a mul_u64_u32_shr() override, and tile doesn't
 currently have an <asm/div64.h> header to put the override in.

 Additionally, gcc currently has an optimization bug that prevents
 it from recognizing the opportunity to use a 32x32->64 multiply,
 and so the result would be no better than the existing mult_frac()
 until such time as the compiler is fixed.

 For now, just using mult_frac() seems like the right answer.

 Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
 Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
 ---
  arch/tile/kernel/time.c | 4 ++--
  1 file changed, 2 insertions(+), 2 deletions(-)

 --- a/arch/tile/kernel/time.c
 +++ b/arch/tile/kernel/time.c
 @@ -216,8 +216,8 @@ void do_timer_interrupt(struct pt_regs *
   */
  unsigned long long sched_clock(void)
  {
 -	return clocksource_cyc2ns(get_cycles(),
 -				  sched_clock_mult, SCHED_CLOCK_SHIFT);
 +	return mult_frac(get_cycles(),
 +			 sched_clock_mult, 1ULL << SCHED_CLOCK_SHIFT);
  }

  int setup_profiling_timer(unsigned int multiplier)
	From: Chris Metcalf <cmetcalf@mellanox.com>
	Date: Wed, 16 Nov 2016 11:18:05 -0500
	Subject: tile: avoid using clocksource_cyc2ns with absolute cycle count

	commit e658a6f14d7c0243205f035979d0ecf6c12a036f upstream.

	For large values of "mult" and long uptimes, the intermediate
	result of "cycles * mult" can overflow 64 bits. For example,
	the tile platform calls clocksource_cyc2ns with a 1.2 GHz clock;
	we have mult = 853, and after 208.5 days, we overflow 64 bits.

	Since clocksource_cyc2ns() is intended to be used for relative
	cycle counts, not absolute cycle counts, performance is more
	importance than accepting a wider range of cycle values. So,
	just use mult_frac() directly in tile's sched_clock().

	Commit 4cecf6d401a0 ("sched, x86: Avoid unnecessary overflow
	in sched_clock") by Salman Qazi results in essentially the same
	generated code for x86 as this change does for tile. In fact,
	a follow-on change by Salman introduced mult_frac() and switched
	to using it, so the C code was largely identical at that point too.

	Peter Zijlstra then added mul_u64_u32_shr() and switched x86
	to use it. This is, in principle, better; by optimizing the
	64x64->64 multiplies to be 32x32->64 multiplies we can potentially
	save some time. However, the compiler piplines the 64x64->64
	multiplies pretty well, and the conditional branch in the generic
	mul_u64_u32_shr() causes some bubbles in execution, with the
	result that it's pretty much a wash. If tilegx provided its own
	implementation of mul_u64_u32_shr() without the conditional branch,
	we could potentially save 3 cycles, but that seems like small gain
	for a fair amount of additional build scaffolding; no other platform
	currently provides a mul_u64_u32_shr() override, and tile doesn't
	currently have an <asm/div64.h> header to put the override in.

	Additionally, gcc currently has an optimization bug that prevents
	it from recognizing the opportunity to use a 32x32->64 multiply,
	and so the result would be no better than the existing mult_frac()
	until such time as the compiler is fixed.

	For now, just using mult_frac() seems like the right answer.

	Signed-off-by: Chris Metcalf <cmetcalf@mellanox.com>
	Signed-off-by: Ben Hutchings <ben@decadent.org.uk>
	---
	arch/tile/kernel/time.c \| 4 ++--
	1 file changed, 2 insertions(+), 2 deletions(-)

	--- a/arch/tile/kernel/time.c
	+++ b/arch/tile/kernel/time.c
	@@ -216,8 +216,8 @@ void do_timer_interrupt(struct pt_regs *
	*/
	unsigned long long sched_clock(void)
	{
	- return clocksource_cyc2ns(get_cycles(),
	- sched_clock_mult, SCHED_CLOCK_SHIFT);
	+ return mult_frac(get_cycles(),
	+ sched_clock_mult, 1ULL << SCHED_CLOCK_SHIFT);
	}

	int setup_profiling_timer(unsigned int multiplier)