releases/3.10.13/sched-x86-optimize-switch_mm-for-multi-threaded-workloads.patch - pub/scm/linux/kernel/git/stable/stable-queue - Git at Google

 From 8f898fbbe5ee5e20a77c4074472a1fd088dc47d1 Mon Sep 17 00:00:00 2001
 From: Rik van Riel <riel@redhat.com>
 Date: Wed, 31 Jul 2013 22:14:21 -0400
 Subject: sched/x86: Optimize switch_mm() for multi-threaded workloads

 From: Rik van Riel <riel@redhat.com>

 commit 8f898fbbe5ee5e20a77c4074472a1fd088dc47d1 upstream.

 Dick Fowles, Don Zickus and Joe Mario have been working on
 improvements to perf, and noticed heavy cache line contention
 on the mm_cpumask, running linpack on a 60 core / 120 thread
 system.

 The cause turned out to be unnecessary atomic accesses to the
 mm_cpumask. When in lazy TLB mode, the CPU is only removed from
 the mm_cpumask if there is a TLB flush event.

 Most of the time, no such TLB flush happens, and the kernel
 skips the TLB reload. It can also skip the atomic memory
 set & test.
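
 As an illustration only (not part of the upstream commit), the
 test-before-set idea described above can be sketched in user-space C11.
 The type cpumask_sketch and both helper names are hypothetical; the
 kernel's cpumask_test_cpu() and cpumask_set_cpu() are not reproduced
 here. The sketch assumes at most 64 CPUs, and it assumes nothing else
 can change this CPU's bit between the test and the set, which the
 comment added by this patch attributes to irqs being blocked during
 schedule.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-in for mm_cpumask: one bit per CPU, at most 64 CPUs. */
typedef struct {
	_Atomic uint64_t bits;
} cpumask_sketch;

/*
 * Old pattern: unconditional atomic read-modify-write. Every call takes
 * the cache line exclusively, even when the bit is already set.
 * Returns true if the bit was already set.
 */
static bool sketch_test_and_set(cpumask_sketch *m, unsigned int cpu)
{
	assert(cpu < 64);
	uint64_t bit = UINT64_C(1) << cpu;

	return (atomic_fetch_or(&m->bits, bit) & bit) != 0;
}

/*
 * New pattern: plain load first, atomic set only when the bit is clear.
 * In the common case (bit already set) the cache line can stay shared.
 * Assumption: no concurrent change to this CPU's bit between the load
 * and the set. Returns true if the bit was already set.
 */
static bool sketch_test_then_set(cpumask_sketch *m, unsigned int cpu)
{
	assert(cpu < 64);
	uint64_t bit = UINT64_C(1) << cpu;

	if (atomic_load_explicit(&m->bits, memory_order_relaxed) & bit)
		return true;

	atomic_fetch_or(&m->bits, bit);
	return false;
}
```

 Both helpers leave the mask in the same state; they differ only in
 whether the already-set case performs a write to shared memory.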

 Here is a summary of Joe's test results:

  * The __schedule function dropped from 24% of all program cycles down
    to 5.5%.

  * The cacheline contention/hotness for accesses to that bitmask went
    from being the 1st/2nd hottest - down to the 84th hottest (0.3% of
    all shared misses which is now quite cold)

  * The average load latency for the bit-test-n-set instruction in
    __schedule dropped from 10k-15k cycles down to an average of 600 cycles.

  * The linpack program results improved from 133 GFlops to 144 GFlops.
    Peak GFlops rose from 133 to 153.

 Reported-by: Don Zickus <dzickus@redhat.com>
 Reported-by: Joe Mario <jmario@redhat.com>
 Tested-by: Joe Mario <jmario@redhat.com>
 Signed-off-by: Rik van Riel <riel@redhat.com>
 Reviewed-by: Paul Turner <pjt@google.com>
 Acked-by: Linus Torvalds <torvalds@linux-foundation.org>
 Link: http://lkml.kernel.org/r/20130731221421.616d3d20@annuminas.surriel.com
 [ Made the comments consistent around the modified code. ]
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 ---
  arch/x86/include/asm/mmu_context.h |   20 +++++++++++++-------
  1 file changed, 13 insertions(+), 7 deletions(-)

 --- a/arch/x86/include/asm/mmu_context.h
 +++ b/arch/x86/include/asm/mmu_context.h
 @@ -45,22 +45,28 @@ static inline void switch_mm(struct mm_s
  		/* Re-load page tables */
  		load_cr3(next->pgd);

 -		/* stop flush ipis for the previous mm */
 +		/* Stop flush ipis for the previous mm */
  		cpumask_clear_cpu(cpu, mm_cpumask(prev));

 -		/*
 -		 * load the LDT, if the LDT is different:
 -		 */
 +		/* Load the LDT, if the LDT is different: */
  		if (unlikely(prev->context.ldt != next->context.ldt))
  			load_LDT_nolock(&next->context);
  	}
  #ifdef CONFIG_SMP
 -	else {
 +	  else {
  		this_cpu_write(cpu_tlbstate.state, TLBSTATE_OK);
  		BUG_ON(this_cpu_read(cpu_tlbstate.active_mm) != next);

 -		if (!cpumask_test_and_set_cpu(cpu, mm_cpumask(next))) {
 -			/* We were in lazy tlb mode and leave_mm disabled
 +		if (!cpumask_test_cpu(cpu, mm_cpumask(next))) {
 +			/*
 +			 * On established mms, the mm_cpumask is only changed
 +			 * from irq context, from ptep_clear_flush() while in
 +			 * lazy tlb mode, and here. Irqs are blocked during
 +			 * schedule, protecting us from simultaneous changes.
 +			 */
 +			cpumask_set_cpu(cpu, mm_cpumask(next));
 +			/*
 +			 * We were in lazy tlb mode and leave_mm disabled
  			 * tlb flush IPI delivery. We must reload CR3
  			 * to make sure to use no freed page tables.
  			 */