Add a tiering-specific README document to the top level
Add a README that details setup steps and notes for running and evaluating
the tiering-specific features present in this branch.
Signed-off-by: Vishal Verma <vishal.l.verma@intel.com>
diff --git a/README-tiering.txt b/README-tiering.txt
new file mode 100644
index 0000000..ebd5c26
--- /dev/null
+++ b/README-tiering.txt
@@ -0,0 +1,122 @@
+0. Introduction
+===============
+
+This document describes the software setup steps and usage hints that can be
+helpful in setting up a system to use tiered memory for evaluation.
+
+The document will go over:
+1. Any kernel config options required to enable the tiered memory features
+2. Any additional userspace tooling required, and related instructions
+3. Any post-boot setup for configurable or tunable knobs
+
+Note:
+Any instructions/settings described here may be specific to the branch that
+contains this document. Setup steps may change from release to release, so
+for each release branch, consult the setup document accompanying that
+branch.
+
+1. Kernel build and configuration
+=================================
+
+a. The recommended starting point is a distro-default kernel config. We
+ use, and recommend, a recent Fedora config as a starting point.
+
+b. Ensure the following:
+ CONFIG_DEV_DAX_KMEM=m
+ CONFIG_MEMORY_HOTPLUG_DEFAULT_ONLINE=n
+ CONFIG_NUMA_BALANCING=y
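+
+ As a rough sketch, these options can be set from the top of the kernel
+ source tree with the in-tree scripts/config helper, after copying the
+ distro config to .config (the /boot/config-* location is an assumption;
+ adjust for your distro):
+ $ cp /boot/config-$(uname -r) .config
+ $ scripts/config --module DEV_DAX_KMEM \
+       --disable MEMORY_HOTPLUG_DEFAULT_ONLINE \
+       --enable NUMA_BALANCING
+ $ make olddefconfig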
+
+
+2. Tooling setup
+================
+
+a. Install 'ndctl' and 'daxctl' from your distro, or from upstream:
+ https://github.com/pmem/ndctl
+ Building from upstream may be required if the distro version of daxctl
+ is too old to support the daxctl reconfigure-device command [1].
+
+ [1]: https://pmem.io/ndctl/daxctl-reconfigure-device.html
+
+b. Assuming that persistent memory devices are the next demotion tier
+ for system memory, perform the following steps to allow a pmem device
+ to be hot-plugged as system RAM:
+
+ First, convert 'fsdax' namespace(s) to 'devdax':
+ ndctl create-namespace -fe namespaceX.Y -m devdax
+
+c. Reconfigure the device-dax devices to system-ram using the kmem facility:
+ daxctl reconfigure-device -m system-ram daxX.Y
+
+ The JSON emitted at this step contains the 'target_node' for this
+ hotplugged memory. This is the memory-only NUMA node where this
+ memory appears, and can be used explicitly with normal libnuma/numactl
+ techniques.
+
+d. Ensure the newly created NUMA nodes for the hotplugged memory are in
+ ZONE_MOVABLE. The JSON from daxctl in the above step should indicate
+ this with a 'movable: true' attribute. Depending on the distribution, there
+ may be udev rules that interfere with memory onlining. They may race to
+ online memory into ZONE_NORMAL rather than movable. If this is the case,
+ find and disable any such udev rules.
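+
+ As a rough sketch of steps b through d, assuming the namespace is
+ namespace0.0, the resulting device is dax0.0, and daxctl reports
+ target_node 2 (all three are example values; substitute your own):
+ # ndctl create-namespace -fe namespace0.0 -m devdax
+ # daxctl reconfigure-device -m system-ram dax0.0
+ # numactl --hardware
+ # numactl --membind=2 <your workload>
+ The numactl --hardware output should list node 2 as a node with memory
+ but no CPUs. If the memory did not come up as movable, the kernel's
+ auto-online policy can be inspected as a first check:
+ # cat /sys/devices/system/memory/auto_online_blocks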
+
+3. Post-boot setup
+==================
+
+a. Set up migration targets
+ # echo 2 > /sys/devices/system/node/node0/migration_path
+ # echo 3 > /sys/devices/system/node/node1/migration_path
+ etc.
+ The number of nodes this needs to be done for depends on the CPU nodes
+ in the system and the CPU-less nodes added in the previous section.
+ In this example, the hotplugged pmem physically attached to node0 is node2,
+ so the demotion path is node0 -> node2.
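+
+ As a sketch, CPU-less nodes can be identified by comparing the standard
+ sysfs node masks, and the writes above can be scripted. The node pairs
+ below (0:2 and 1:3) are the example values from this step; they are not
+ discovered automatically:
+ # cat /sys/devices/system/node/has_cpu
+ # cat /sys/devices/system/node/has_memory
+ # for p in 0:2 1:3; do \
+       echo ${p#*:} > /sys/devices/system/node/node${p%:*}/migration_path; \
+   done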
+
+b. Enable 'autonuma'
+ # echo 2 > /proc/sys/kernel/numa_balancing
+ # echo 30 > /proc/sys/kernel/numa_balancing_rate_limit_mbps
+
+c. Enable swapcache promotion
+ This is only valid when swap is turned on. When the option is enabled,
+ hot swapcache pages are migrated to a fast DRAM node when possible.
+ # sysctl -w vm.enable_swapcache_promotion=1
+
+ The number of promoted pages can be checked in /proc/vmstat:
+ hmem_swapcache_promote_src
+ hmem_swapcache_promote_dst
+
+d. Enable page cache promotion
+ This branch supports promoting page cache pages on shmem and generic file
+ reads and writes. Demotion happens on the page reclaim path. User
+ interfaces are provided to control the rate limits of promotion and
+ demotion. By default, promotion is disabled and demotion is enabled.
+
+ To enable promotion, set a non-zero ratelimit. Using '-1' disables any
+ ratelimiting, and also enables promotion. For performance experiments,
+ the recommended settings are:
+ # echo -1 > /proc/sys/vm/promotion_ratelimit_mbytes_per_sec
+ # echo -1 > /proc/sys/vm/demotion_ratelimit_mbytes_per_sec
+
+ Enable node-reclaim for HMEM cold page demotion (new in tiering-0.5).
+ After the device-dax instances are onlined, node-reclaim needs to be
+ enabled to start migrating 'cold' pages from DRAM to pmem:
+ # echo 15 > /proc/sys/vm/zone_reclaim_mode
+
+ The number of promoted pages can be checked by the following counters in
+ /proc/vmstat or /sys/devices/system/node/node[n]/vmstat:
+ hmem_reclaim_promote_src
+ hmem_reclaim_promote_dst
+ nr_promoted
+
+ The number of pages demoted can be checked by the following counters:
+ hmem_reclaim_demote_src
+ hmem_reclaim_demote_dst
+
+ The number of pages that failed promotion can be checked by the
+ following counters:
+ nr_promote_fail
+ nr_promote_isolate_fail
+
+ The number of pages that have been limited by the ratelimit can be checked
+ by the following counter:
+ nr_promote_ratelimit
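+
+ As a sketch, all of the counters named above can be pulled in one pass,
+ system-wide or for a single node (node0 is an example):
+ # grep -E 'hmem_reclaim_|nr_promote' /proc/vmstat
+ # grep -E 'hmem_reclaim_|nr_promote' /sys/devices/system/node/node0/vmstat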
+