autonuma: Support asynchronous page table scanning

In AutoNUMA, the page tables of processes are scanned periodically to
trigger NUMA hint page faults.  The scanning is done synchronously, in
the context of the process being scanned.  This has several advantages
over asynchronous scanning, including

- The processes which benefit from AutoNUMA pay the overhead

- The hot pages will be scanned more often, while the cold pages will
  not

- Cache ping-pong for the page table itself, and TLB flushing, are
  reduced

One drawback is that this may introduce latency outliers.  With the
default configuration, the scanning code could take several
milliseconds to complete in tests.  This is acceptable for most
workloads, but may not be desirable for others.

One possible solution is to keep triggering page table scanning
synchronously, but to offload the actual page table scanning to
kernel threads (in fact, a work queue).  Users can switch between
synchronous and asynchronous scanning at run time via a sysfs knob.
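
As a rough illustration only, the split between a synchronous trigger
and an offloaded scan can be modelled in userspace C.  This is not the
code in this patch: a pthread stands in for the work queue, and every
name here (scan_async, struct scan_work, trigger_scan(), flush_scan())
is hypothetical, including the boolean that stands in for the sysfs
knob.

```c
#include <pthread.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the sysfs knob: true selects async scanning. */
static bool scan_async;

struct scan_work {
	pthread_t worker;	/* models a work item on a work queue */
	bool queued;		/* an offloaded scan is still outstanding */
	unsigned long nr;	/* number of pages to visit in this pass */
	unsigned long scanned;	/* total pages visited so far */
};

/* The expensive part: visit every page of the requested range. */
static void do_scan(struct scan_work *w)
{
	for (unsigned long i = 0; i < w->nr; i++)
		w->scanned++;
}

/* Thread entry point: runs the scan outside the triggering context. */
static void *scan_worker(void *arg)
{
	do_scan(arg);
	return NULL;
}

/*
 * Always called synchronously by the task.  In sync mode the task pays
 * for the scan itself; in async mode it only hands the work off.
 * Returns 0 on success, -1 if the worker could not be started.
 */
static int trigger_scan(struct scan_work *w, unsigned long nr)
{
	if (w->queued)
		return 0;	/* previous scan not finished, skip this one */

	w->nr = nr;
	if (!scan_async) {
		do_scan(w);
		return 0;
	}

	if (pthread_create(&w->worker, NULL, scan_worker, w))
		return -1;
	w->queued = true;
	return 0;
}

/* Wait for an outstanding offloaded scan, if there is one. */
static int flush_scan(struct scan_work *w)
{
	if (!w->queued)
		return 0;
	if (pthread_join(w->worker, NULL))
		return -1;
	w->queued = false;
	return 0;
}
```

In this model the latency seen by the task in asynchronous mode is only
the cost of handing the work off, while the scan itself runs elsewhere;
that is the property the sysfs knob is meant to let users choose.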

The patch has been tested with pmbench (which can measure memory
access latency) on a 2-socket server machine with 256 GB memory.

Latency (ms)            Base (count)            Async (count)
0.5-1                           2399                        1
  1-2                          13436                        0
  2-4                           7435                        1

In the test, the pmbench score shows no measurable change between the
base kernel and the patched kernel with asynchronous scanning.  But as
the above table shows, the number of latency outliers drops from tens
of thousands (23270 in total) to nearly 0 (2 in total).  The test time
is 3600s.

TODO: ABI document

Signed-off-by: "Huang, Ying" <ying.huang@intel.com>
5 files changed