hmem: add performance attributes

Add performance information found in the HMAT to the sysfs representation.
This information lives as an attribute group named "via_mem_initX" in the
memory target:

  # tree mem_tgt2
  mem_tgt2
  ├── firmware_id
  ├── is_cached
  ├── is_enabled
  ├── is_isolated
  ├── node2 -> ../../node/node2
  ├── phys_addr_base
  ├── phys_length_bytes
  ├── power
  │   ├── async
  │   ...
  ├── subsystem -> ../../../../bus/hmem
  ├── uevent
  └── via_mem_init0
      ├── mem_init0 -> ../../mem_init0
      ├── mem_tgt2 -> ../../mem_tgt2
      ├── read_bw_MBps
      ├── read_lat_nsec
      ├── write_bw_MBps
      └── write_lat_nsec

This attribute group surfaces latency and bandwidth performance for a given
(initiator,target) pairing.  For example:

  # grep . mem_tgt2/via_mem_init0/* 2>/dev/null
  mem_tgt2/via_mem_init0/read_bw_MBps:40960
  mem_tgt2/via_mem_init0/read_lat_nsec:50
  mem_tgt2/via_mem_init0/write_bw_MBps:40960
  mem_tgt2/via_mem_init0/write_lat_nsec:50
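
For reference, a named attribute group like "via_mem_init0" can be built from
the standard sysfs primitives.  The sketch below is illustrative only, not the
patch code; struct hmem_target, to_hmem_target() and the hard-coded group name
are assumptions:

  /* Sketch only: illustrative types, not the actual driver code. */
  #include <linux/device.h>
  #include <linux/sysfs.h>

  struct hmem_target {                    /* hypothetical per-target state */
          struct device dev;
          u32 read_lat_nsec;
          u32 write_lat_nsec;
          u32 read_bw_MBps;
          u32 write_bw_MBps;
  };

  #define to_hmem_target(d) container_of(d, struct hmem_target, dev)

  static ssize_t read_lat_nsec_show(struct device *dev,
                  struct device_attribute *attr, char *buf)
  {
          return sprintf(buf, "%u\n", to_hmem_target(dev)->read_lat_nsec);
  }
  static DEVICE_ATTR_RO(read_lat_nsec);

  /* write_lat_nsec, read_bw_MBps and write_bw_MBps follow the same pattern. */

  static struct attribute *tgt_perf_attrs[] = {
          &dev_attr_read_lat_nsec.attr,
          NULL,
  };

  /* A named group puts its attributes in their own subdirectory. */
  static const struct attribute_group tgt_perf_group = {
          .name  = "via_mem_init0",       /* generated per local initiator */
          .attrs = tgt_perf_attrs,
  };

Registering the group against the target device with sysfs_create_group() is
what produces the subdirectory seen in the tree above.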

The initiator has a symlink to the performance information which lives in
the target's attribute group:

  # ls -l mem_init0/via_mem_tgt2
  lrwxrwxrwx. 1 root root 0 Jun  1 10:00 mem_init0/via_mem_tgt2 ->
  ../mem_tgt2/via_mem_init0
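
Below is a minimal sketch of creating such a link from the initiator side.
Note that sysfs_create_link() points at a kobject's directory, so making the
link resolve to the "via_mem_initX" group subdirectory itself (as in the
listing above) takes extra plumbing omitted here, and the link name is
hard-coded purely for illustration:

  #include <linux/device.h>
  #include <linux/sysfs.h>

  /* Sketch only: link the initiator to the target holding the shared
   * performance attributes. */
  static int hmem_link_initiator(struct device *initiator,
                  struct device *target)
  {
          return sysfs_create_link(&initiator->kobj, &target->kobj,
                                   "via_mem_tgt2");
  }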

We create performance attribute groups only for local (initiator,target)
pairings, where the local initiator for a given target is defined by the
"Processor Proximity Domain" field in the HMAT's Memory Subsystem Address
Range Structure table.
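
A rough sketch of that selection is below.  The struct only mirrors the HMAT
fields of interest rather than using the ACPICA definitions, and
hmat_register_local_pair() is a hypothetical helper standing in for the group
and symlink creation shown above:

  #include <linux/types.h>

  /* Simplified stand-in for the Memory Subsystem Address Range Structure;
   * only the fields used here, not the ACPICA layout. */
  struct hmat_addr_range {
          u16 flags;              /* bit 0: processor proximity domain valid */
          u32 processor_pxm;      /* local initiator proximity domain */
          u32 memory_pxm;         /* memory target proximity domain */
  };

  #define HMAT_PROC_PXM_VALID     (1 << 0)

  /* Hypothetical helper that would create the "via_mem_initX" group and the
   * matching "via_mem_tgtX" symlink for one local pairing. */
  void hmat_register_local_pair(u32 initiator_pxm, u32 target_pxm);

  static void hmat_parse_address_range(struct hmat_addr_range *range)
  {
          /* Only the local pairing gets performance attributes. */
          if (!(range->flags & HMAT_PROC_PXM_VALID))
                  return;

          hmat_register_local_pair(range->processor_pxm, range->memory_pxm);
  }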

A given target is only local to a single initiator, so each target will
have at most one "via_mem_initX" attribute group.  A given memory initiator
may have multiple local memory targets, so multiple "via_mem_tgtX" links
may exist for a given initiator.

If a given memory target is cached, we give performance numbers only for the
media itself and rely on the "is_cached" attribute to convey the fact that
there is a caching layer.

Exposing only a subset of the performance information present in the HMAT via
sysfs is a compromise, driven by the fact that the local pairings will be the
highest performing and that representing all possible paths could cause an
unmanageable explosion of sysfs entries.

If we dump everything from the HMAT into sysfs we end up with
O(num_targets * num_initiators * num_caching_levels) attributes.  Each of
these attributes only takes up 2 bytes in a System Locality Latency and
Bandwidth Information Structure, but if we have to create a directory entry
for each it becomes much more expensive.

For example, very large systems today can have on the order of thousands of
NUMA nodes.  Say we have a system which used to have 1,000 NUMA nodes that
each had both a CPU and local memory.  The HMAT allows us to separate the
CPUs and memory into separate NUMA nodes, so we can end up with 1,000 CPU
initiator NUMA nodes and 1,000 memory target NUMA nodes.  If we represented
the performance information for each possible CPU/memory pair in sysfs we
would end up with 1,000,000 attribute groups.

This is a lot to pass in a set of packed data tables, but I think we'll
break sysfs if we try to create millions of attributes, regardless of how
we nest them in a directory hierarchy.

By only representing performance information for local (initiator,target)
pairings, we reduce the number of sysfs entries to O(num_targets).

Signed-off-by: Ross Zwisler <ross.zwisler@linux.intel.com>