releases/5.0.14/perf-x86-amd-update-generic-hardware-cache-events-for-family-17h.patch - pub/scm/linux/kernel/git/stable/stable-queue - Git at Google

 From 0e3b74e26280f2cf8753717a950b97d424da6046 Mon Sep 17 00:00:00 2001
 From: Kim Phillips <kim.phillips@amd.com>
 Date: Thu, 2 May 2019 15:29:47 +0000
 Subject: perf/x86/amd: Update generic hardware cache events for Family 17h
 MIME-Version: 1.0
 Content-Type: text/plain; charset=UTF-8
 Content-Transfer-Encoding: 8bit

 From: Kim Phillips <kim.phillips@amd.com>

 commit 0e3b74e26280f2cf8753717a950b97d424da6046 upstream.

 Add a new amd_hw_cache_event_ids_f17h assignment structure set
 for AMD families 17h and above, since a lot has changed.  Specifically:

 L1 Data Cache

 The data cache access counter remains the same on Family 17h.

 For DC misses, PMCx041's definition changes with Family 17h,
 so instead we use the L2 cache accesses from L1 data cache
 misses counter (PMCx060,umask=0xc8).

 For DC hardware prefetch events, Family 17h breaks compatibility
 for PMCx067 "Data Prefetcher", so instead, we use PMCx05a "Hardware
 Prefetch DC Fills."

 L1 Instruction Cache

 PMCs 0x80 and 0x81 (32-byte IC fetches and misses) are backward
 compatible on Family 17h.

 For prefetches, we remove the erroneous PMCx04B assignment which
 counts how many software data cache prefetch load instructions were
 dispatched.

 LL - Last Level Cache

 Removing PMCs 7D, 7E, and 7F assignments, as they do not exist
 on Family 17h, where the last level cache is L3.  L3 counters
 can be accessed using the existing AMD Uncore driver.

 Data TLB

 On Intel machines, data TLB accesses ("dTLB-loads") are assigned
 to counters that count load/store instructions retired.  This
 is inconsistent with instruction TLB accesses, where Intel
 implementations report iTLB misses that hit in the STLB.

 Ideally, dTLB-loads would count higher level dTLB misses that hit
 in lower level TLBs, and dTLB-load-misses would report those
 that also missed in those lower-level TLBs, therefore causing
 a page table walk.  That would be consistent with instruction
 TLB operation, remove the redundancy between dTLB-loads and
 L1-dcache-loads, and prevent perf from producing artificially
 low percentage ratios, i.e. the "0.01%" below:

         42,550,869      L1-dcache-loads
         41,591,860      dTLB-loads
              4,802      dTLB-load-misses          #    0.01% of all dTLB cache hits
          7,283,682      L1-dcache-stores
          7,912,392      dTLB-stores
                310      dTLB-store-misses

 On AMD Families prior to 17h, the "Data Cache Accesses" counter is
 used, which is slightly better than load/store instructions retired,
 but still counts in terms of individual load/store operations
 instead of TLB operations.

 So, for AMD Families 17h and higher, this patch assigns "dTLB-loads"
 to a counter for L1 dTLB misses that hit in the L2 dTLB, and
 "dTLB-load-misses" to a counter for L1 DTLB misses that caused
 L2 DTLB misses and therefore also caused page table walks.  This
 results in a much more accurate view of data TLB performance:

         60,961,781      L1-dcache-loads
              4,601      dTLB-loads
                963      dTLB-load-misses          #   20.93% of all dTLB cache hits

 Note that for all AMD families, data loads and stores are combined
 in a single accesses counter, so no 'L1-dcache-stores' are reported
 separately, and stores are counted with loads in 'L1-dcache-loads'.

 Also note that the "% of all dTLB cache hits" string is misleading
 because (a) "dTLB cache": although TLBs can be considered caches for
 page tables, in this context, it can be misinterpreted as data cache
 hits because the figures are similar (at least on Intel), and (b) not
 all those loads (technically accesses) technically "hit" at that
 hardware level.  "% of all dTLB accesses" would be more clear/accurate.

 Instruction TLB

 On Intel machines, 'iTLB-loads' measure iTLB misses that hit in the
 STLB, and 'iTLB-load-misses' measure iTLB misses that also missed in
 the STLB and completed a page table walk.

 For AMD Family 17h and above, for 'iTLB-loads' we replace the
 erroneous instruction cache fetches counter with PMCx084
 "L1 ITLB Miss, L2 ITLB Hit".

 For 'iTLB-load-misses' we still use PMCx085 "L1 ITLB Miss,
 L2 ITLB Miss", but set a 0xff umask because without it the event
 does not get counted.

 Branch Predictor (BPU)

 PMCs 0xc2 and 0xc3 continue to be valid across all AMD Families.

 Node Level Events

 Family 17h does not have a PMCx0e9 counter, and corresponding counters
 have not been made available publicly, so for now, we mark them as
 unsupported for Families 17h and above.

 Reference:

   "Open-Source Register Reference For AMD Family 17h Processors Models 00h-2Fh"
   Released 7/17/2018, Publication #56255, Revision 3.03:
   https://www.amd.com/system/files/TechDocs/56255_OSRR.pdf

 [ mingo: tidied up the line breaks. ]
 Signed-off-by: Kim Phillips <kim.phillips@amd.com>
 Cc: <stable@vger.kernel.org> # v4.9+
 Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
 Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
 Cc: Borislav Petkov <bp@alien8.de>
 Cc: H. Peter Anvin <hpa@zytor.com>
 Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
 Cc: Jiri Olsa <jolsa@redhat.com>
 Cc: Linus Torvalds <torvalds@linux-foundation.org>
 Cc: Martin Liška <mliska@suse.cz>
 Cc: Namhyung Kim <namhyung@kernel.org>
 Cc: Peter Zijlstra <peterz@infradead.org>
 Cc: Pu Wen <puwen@hygon.cn>
 Cc: Stephane Eranian <eranian@google.com>
 Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
 Cc: Thomas Gleixner <tglx@linutronix.de>
 Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
 Cc: Vince Weaver <vincent.weaver@maine.edu>
 Cc: linux-kernel@vger.kernel.org
 Cc: linux-perf-users@vger.kernel.org
 Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors")
 Signed-off-by: Ingo Molnar <mingo@kernel.org>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

 ---
  arch/x86/events/amd/core.c |  111 +++++++++++++++++++++++++++++++++++++++++++--
  1 file changed, 108 insertions(+), 3 deletions(-)

 --- a/arch/x86/events/amd/core.c
 +++ b/arch/x86/events/amd/core.c
 @@ -116,6 +116,110 @@ static __initconst const u64 amd_hw_cach
   },
  };

 +static __initconst const u64 amd_hw_cache_event_ids_f17h
 +				[PERF_COUNT_HW_CACHE_MAX]
 +				[PERF_COUNT_HW_CACHE_OP_MAX]
 +				[PERF_COUNT_HW_CACHE_RESULT_MAX] = {
 +[C(L1D)] = {
 +	[C(OP_READ)] = {
 +		[C(RESULT_ACCESS)] = 0x0040, /* Data Cache Accesses */
 +		[C(RESULT_MISS)]   = 0xc860, /* L2$ access from DC Miss */
 +	},
 +	[C(OP_WRITE)] = {
 +		[C(RESULT_ACCESS)] = 0,
 +		[C(RESULT_MISS)]   = 0,
 +	},
 +	[C(OP_PREFETCH)] = {
 +		[C(RESULT_ACCESS)] = 0xff5a, /* h/w prefetch DC Fills */
 +		[C(RESULT_MISS)]   = 0,
 +	},
 +},
 +[C(L1I)] = {
 +	[C(OP_READ)] = {
 +		[C(RESULT_ACCESS)] = 0x0080, /* Instruction cache fetches  */
 +		[C(RESULT_MISS)]   = 0x0081, /* Instruction cache misses   */
 +	},
 +	[C(OP_WRITE)] = {
 +		[C(RESULT_ACCESS)] = -1,
 +		[C(RESULT_MISS)]   = -1,
 +	},
 +	[C(OP_PREFETCH)] = {
 +		[C(RESULT_ACCESS)] = 0,
 +		[C(RESULT_MISS)]   = 0,
 +	},
 +},
 +[C(LL)] = {
 +	[C(OP_READ)] = {
 +		[C(RESULT_ACCESS)] = 0,
 +		[C(RESULT_MISS)]   = 0,
 +	},
 +	[C(OP_WRITE)] = {
 +		[C(RESULT_ACCESS)] = 0,
 +		[C(RESULT_MISS)]   = 0,
 +	},
 +	[C(OP_PREFETCH)] = {
 +		[C(RESULT_ACCESS)] = 0,
 +		[C(RESULT_MISS)]   = 0,
 +	},
 +},
 +[C(DTLB)] = {
 +	[C(OP_READ)] = {
 +		[C(RESULT_ACCESS)] = 0xff45, /* All L2 DTLB accesses */
 +		[C(RESULT_MISS)]   = 0xf045, /* L2 DTLB misses (PT walks) */
 +	},
 +	[C(OP_WRITE)] = {
 +		[C(RESULT_ACCESS)] = 0,
 +		[C(RESULT_MISS)]   = 0,
 +	},
 +	[C(OP_PREFETCH)] = {
 +		[C(RESULT_ACCESS)] = 0,
 +		[C(RESULT_MISS)]   = 0,
 +	},
 +},
 +[C(ITLB)] = {
 +	[C(OP_READ)] = {
 +		[C(RESULT_ACCESS)] = 0x0084, /* L1 ITLB misses, L2 ITLB hits */
 +		[C(RESULT_MISS)]   = 0xff85, /* L1 ITLB misses, L2 misses */
 +	},
 +	[C(OP_WRITE)] = {
 +		[C(RESULT_ACCESS)] = -1,
 +		[C(RESULT_MISS)]   = -1,
 +	},
 +	[C(OP_PREFETCH)] = {
 +		[C(RESULT_ACCESS)] = -1,
 +		[C(RESULT_MISS)]   = -1,
 +	},
 +},
 +[C(BPU)] = {
 +	[C(OP_READ)] = {
 +		[C(RESULT_ACCESS)] = 0x00c2, /* Retired Branch Instr.      */
 +		[C(RESULT_MISS)]   = 0x00c3, /* Retired Mispredicted BI    */
 +	},
 +	[C(OP_WRITE)] = {
 +		[C(RESULT_ACCESS)] = -1,
 +		[C(RESULT_MISS)]   = -1,
 +	},
 +	[C(OP_PREFETCH)] = {
 +		[C(RESULT_ACCESS)] = -1,
 +		[C(RESULT_MISS)]   = -1,
 +	},
 +},
 +[C(NODE)] = {
 +	[C(OP_READ)] = {
 +		[C(RESULT_ACCESS)] = 0,
 +		[C(RESULT_MISS)]   = 0,
 +	},
 +	[C(OP_WRITE)] = {
 +		[C(RESULT_ACCESS)] = -1,
 +		[C(RESULT_MISS)]   = -1,
 +	},
 +	[C(OP_PREFETCH)] = {
 +		[C(RESULT_ACCESS)] = -1,
 +		[C(RESULT_MISS)]   = -1,
 +	},
 +},
 +};
 +
  /*
   * AMD Performance Monitor K7 and later, up to and including Family 16h:
   */
 @@ -865,9 +969,10 @@ __init int amd_pmu_init(void)
  		x86_pmu.amd_nb_constraints = 0;
  	}

 -	/* Events are common for all AMDs */
 -	memcpy(hw_cache_event_ids, amd_hw_cache_event_ids,
 -	       sizeof(hw_cache_event_ids));
 +	if (boot_cpu_data.x86 >= 0x17)
 +		memcpy(hw_cache_event_ids, amd_hw_cache_event_ids_f17h, sizeof(hw_cache_event_ids));
 +	else
 +		memcpy(hw_cache_event_ids, amd_hw_cache_event_ids, sizeof(hw_cache_event_ids));

  	return 0;
  }
	From 0e3b74e26280f2cf8753717a950b97d424da6046 Mon Sep 17 00:00:00 2001
	From: Kim Phillips <kim.phillips@amd.com>
	Date: Thu, 2 May 2019 15:29:47 +0000
	Subject: perf/x86/amd: Update generic hardware cache events for Family 17h
	MIME-Version: 1.0
	Content-Type: text/plain; charset=UTF-8
	Content-Transfer-Encoding: 8bit

	From: Kim Phillips <kim.phillips@amd.com>

	commit 0e3b74e26280f2cf8753717a950b97d424da6046 upstream.

	Add a new amd_hw_cache_event_ids_f17h assignment structure set
	for AMD families 17h and above, since a lot has changed. Specifically:

	L1 Data Cache

	The data cache access counter remains the same on Family 17h.

	For DC misses, PMCx041's definition changes with Family 17h,
	so instead we use the L2 cache accesses from L1 data cache
	misses counter (PMCx060,umask=0xc8).

	For DC hardware prefetch events, Family 17h breaks compatibility
	for PMCx067 "Data Prefetcher", so instead, we use PMCx05a "Hardware
	Prefetch DC Fills."

	L1 Instruction Cache

	PMCs 0x80 and 0x81 (32-byte IC fetches and misses) are backward
	compatible on Family 17h.

	For prefetches, we remove the erroneous PMCx04B assignment which
	counts how many software data cache prefetch load instructions were
	dispatched.

	LL - Last Level Cache

	Removing PMCs 7D, 7E, and 7F assignments, as they do not exist
	on Family 17h, where the last level cache is L3. L3 counters
	can be accessed using the existing AMD Uncore driver.

	Data TLB

	On Intel machines, data TLB accesses ("dTLB-loads") are assigned
	to counters that count load/store instructions retired. This
	is inconsistent with instruction TLB accesses, where Intel
	implementations report iTLB misses that hit in the STLB.

	Ideally, dTLB-loads would count higher level dTLB misses that hit
	in lower level TLBs, and dTLB-load-misses would report those
	that also missed in those lower-level TLBs, therefore causing
	a page table walk. That would be consistent with instruction
	TLB operation, remove the redundancy between dTLB-loads and
	L1-dcache-loads, and prevent perf from producing artificially
	low percentage ratios, i.e. the "0.01%" below:

	42,550,869 L1-dcache-loads
	41,591,860 dTLB-loads
	4,802 dTLB-load-misses # 0.01% of all dTLB cache hits
	7,283,682 L1-dcache-stores
	7,912,392 dTLB-stores
	310 dTLB-store-misses

	On AMD Families prior to 17h, the "Data Cache Accesses" counter is
	used, which is slightly better than load/store instructions retired,
	but still counts in terms of individual load/store operations
	instead of TLB operations.

	So, for AMD Families 17h and higher, this patch assigns "dTLB-loads"
	to a counter for L1 dTLB misses that hit in the L2 dTLB, and
	"dTLB-load-misses" to a counter for L1 DTLB misses that caused
	L2 DTLB misses and therefore also caused page table walks. This
	results in a much more accurate view of data TLB performance:

	60,961,781 L1-dcache-loads
	4,601 dTLB-loads
	963 dTLB-load-misses # 20.93% of all dTLB cache hits

	Note that for all AMD families, data loads and stores are combined
	in a single accesses counter, so no 'L1-dcache-stores' are reported
	separately, and stores are counted with loads in 'L1-dcache-loads'.

	Also note that the "% of all dTLB cache hits" string is misleading
	because (a) "dTLB cache": although TLBs can be considered caches for
	page tables, in this context, it can be misinterpreted as data cache
	hits because the figures are similar (at least on Intel), and (b) not
	all those loads (technically accesses) technically "hit" at that
	hardware level. "% of all dTLB accesses" would be more clear/accurate.

	Instruction TLB

	On Intel machines, 'iTLB-loads' measure iTLB misses that hit in the
	STLB, and 'iTLB-load-misses' measure iTLB misses that also missed in
	the STLB and completed a page table walk.

	For AMD Family 17h and above, for 'iTLB-loads' we replace the
	erroneous instruction cache fetches counter with PMCx084
	"L1 ITLB Miss, L2 ITLB Hit".

	For 'iTLB-load-misses' we still use PMCx085 "L1 ITLB Miss,
	L2 ITLB Miss", but set a 0xff umask because without it the event
	does not get counted.

	Branch Predictor (BPU)

	PMCs 0xc2 and 0xc3 continue to be valid across all AMD Families.

	Node Level Events

	Family 17h does not have a PMCx0e9 counter, and corresponding counters
	have not been made available publicly, so for now, we mark them as
	unsupported for Families 17h and above.

	Reference:

	"Open-Source Register Reference For AMD Family 17h Processors Models 00h-2Fh"
	Released 7/17/2018, Publication #56255, Revision 3.03:
	https://www.amd.com/system/files/TechDocs/56255_OSRR.pdf

	[ mingo: tidied up the line breaks. ]
	Signed-off-by: Kim Phillips <kim.phillips@amd.com>
	Cc: <stable@vger.kernel.org> # v4.9+
	Cc: Alexander Shishkin <alexander.shishkin@linux.intel.com>
	Cc: Arnaldo Carvalho de Melo <acme@redhat.com>
	Cc: Borislav Petkov <bp@alien8.de>
	Cc: H. Peter Anvin <hpa@zytor.com>
	Cc: Janakarajan Natarajan <Janakarajan.Natarajan@amd.com>
	Cc: Jiri Olsa <jolsa@redhat.com>
	Cc: Linus Torvalds <torvalds@linux-foundation.org>
	Cc: Martin Liška <mliska@suse.cz>
	Cc: Namhyung Kim <namhyung@kernel.org>
	Cc: Peter Zijlstra <peterz@infradead.org>
	Cc: Pu Wen <puwen@hygon.cn>
	Cc: Stephane Eranian <eranian@google.com>
	Cc: Suravee Suthikulpanit <Suravee.Suthikulpanit@amd.com>
	Cc: Thomas Gleixner <tglx@linutronix.de>
	Cc: Thomas Lendacky <Thomas.Lendacky@amd.com>
	Cc: Vince Weaver <vincent.weaver@maine.edu>
	Cc: linux-kernel@vger.kernel.org
	Cc: linux-perf-users@vger.kernel.org
	Fixes: e40ed1542dd7 ("perf/x86: Add perf support for AMD family-17h processors")
	Signed-off-by: Ingo Molnar <mingo@kernel.org>
	Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>

	---
	arch/x86/events/amd/core.c \| 111 +++++++++++++++++++++++++++++++++++++++++++--
	1 file changed, 108 insertions(+), 3 deletions(-)

	--- a/arch/x86/events/amd/core.c
	+++ b/arch/x86/events/amd/core.c
	@@ -116,6 +116,110 @@ static __initconst const u64 amd_hw_cach
	},
	};

	+static __initconst const u64 amd_hw_cache_event_ids_f17h
	+ [PERF_COUNT_HW_CACHE_MAX]
	+ [PERF_COUNT_HW_CACHE_OP_MAX]
	+ [PERF_COUNT_HW_CACHE_RESULT_MAX] = {
	+[C(L1D)] = {
	+ [C(OP_READ)] = {
	+ [C(RESULT_ACCESS)] = 0x0040, /* Data Cache Accesses */
	+ [C(RESULT_MISS)] = 0xc860, /* L2$ access from DC Miss */
	+ },
	+ [C(OP_WRITE)] = {
	+ [C(RESULT_ACCESS)] = 0,
	+ [C(RESULT_MISS)] = 0,
	+ },
	+ [C(OP_PREFETCH)] = {
	+ [C(RESULT_ACCESS)] = 0xff5a, /* h/w prefetch DC Fills */
	+ [C(RESULT_MISS)] = 0,
	+ },
	+},
	+[C(L1I)] = {
	+ [C(OP_READ)] = {
	+ [C(RESULT_ACCESS)] = 0x0080, /* Instruction cache fetches */
	+ [C(RESULT_MISS)] = 0x0081, /* Instruction cache misses */
	+ },
	+ [C(OP_WRITE)] = {
	+ [C(RESULT_ACCESS)] = -1,
	+ [C(RESULT_MISS)] = -1,
	+ },
	+ [C(OP_PREFETCH)] = {
	+ [C(RESULT_ACCESS)] = 0,
	+ [C(RESULT_MISS)] = 0,
	+ },
	+},
	+[C(LL)] = {
	+ [C(OP_READ)] = {
	+ [C(RESULT_ACCESS)] = 0,
	+ [C(RESULT_MISS)] = 0,
	+ },
	+ [C(OP_WRITE)] = {
	+ [C(RESULT_ACCESS)] = 0,
	+ [C(RESULT_MISS)] = 0,
	+ },
	+ [C(OP_PREFETCH)] = {
	+ [C(RESULT_ACCESS)] = 0,
	+ [C(RESULT_MISS)] = 0,
	+ },
	+},
	+[C(DTLB)] = {
	+ [C(OP_READ)] = {
	+ [C(RESULT_ACCESS)] = 0xff45, /* All L2 DTLB accesses */
	+ [C(RESULT_MISS)] = 0xf045, /* L2 DTLB misses (PT walks) */
	+ },
	+ [C(OP_WRITE)] = {
	+ [C(RESULT_ACCESS)] = 0,
	+ [C(RESULT_MISS)] = 0,
	+ },
	+ [C(OP_PREFETCH)] = {
	+ [C(RESULT_ACCESS)] = 0,
	+ [C(RESULT_MISS)] = 0,
	+ },
	+},
	+[C(ITLB)] = {
	+ [C(OP_READ)] = {
	+ [C(RESULT_ACCESS)] = 0x0084, /* L1 ITLB misses, L2 ITLB hits */
	+ [C(RESULT_MISS)] = 0xff85, /* L1 ITLB misses, L2 misses */
	+ },
	+ [C(OP_WRITE)] = {
	+ [C(RESULT_ACCESS)] = -1,
	+ [C(RESULT_MISS)] = -1,
	+ },
	+ [C(OP_PREFETCH)] = {
	+ [C(RESULT_ACCESS)] = -1,
	+ [C(RESULT_MISS)] = -1,
	+ },
	+},
	+[C(BPU)] = {
	+ [C(OP_READ)] = {
	+ [C(RESULT_ACCESS)] = 0x00c2, /* Retired Branch Instr. */
	+ [C(RESULT_MISS)] = 0x00c3, /* Retired Mispredicted BI */
	+ },
	+ [C(OP_WRITE)] = {
	+ [C(RESULT_ACCESS)] = -1,
	+ [C(RESULT_MISS)] = -1,
	+ },
	+ [C(OP_PREFETCH)] = {
	+ [C(RESULT_ACCESS)] = -1,
	+ [C(RESULT_MISS)] = -1,
	+ },
	+},
	+[C(NODE)] = {
	+ [C(OP_READ)] = {
	+ [C(RESULT_ACCESS)] = 0,
	+ [C(RESULT_MISS)] = 0,
	+ },
	+ [C(OP_WRITE)] = {
	+ [C(RESULT_ACCESS)] = -1,
	+ [C(RESULT_MISS)] = -1,
	+ },
	+ [C(OP_PREFETCH)] = {
	+ [C(RESULT_ACCESS)] = -1,
	+ [C(RESULT_MISS)] = -1,
	+ },
	+},
	+};
	+
	/*
	* AMD Performance Monitor K7 and later, up to and including Family 16h:
	*/
	@@ -865,9 +969,10 @@ __init int amd_pmu_init(void)
	x86_pmu.amd_nb_constraints = 0;
	}

	- /* Events are common for all AMDs */
	- memcpy(hw_cache_event_ids, amd_hw_cache_event_ids,
	- sizeof(hw_cache_event_ids));
	+ if (boot_cpu_data.x86 >= 0x17)
	+ memcpy(hw_cache_event_ids, amd_hw_cache_event_ids_f17h, sizeof(hw_cache_event_ids));
	+ else
	+ memcpy(hw_cache_event_ids, amd_hw_cache_event_ids, sizeof(hw_cache_event_ids));

	return 0;
	}