| From: Chris Li <chrisl@kernel.org> |
| Subject: docs/mm: add document for swap table |
| Date: Wed, 17 Sep 2025 00:00:46 +0800 |
| |
| Patch series "mm, swap: introduce swap table as swap cache (phase I)", v4. |
| |
| This is the first phase of the bigger series implementing basic |
| infrastructures for the Swap Table idea proposed at the LSF/MM/BPF topic |
| "Integrate swap cache, swap maps with swap allocator" [1]. To give credit |
| where it is due, this is based on Chris Li's idea and a prototype of using |
| cluster size atomic arrays to implement swap cache. |
| |
| This phase I contains 15 patches. It introduces the swap table |
| infrastructure and uses it as the swap cache backend. By doing so, we |
| see an up to ~5-20% performance gain in throughput, RPS or build time |
| across benchmark and workload tests. The speedup comes from less |
| contention on swap cache access and a shallower swap cache lookup |
| path. The cluster size is much finer-grained than the 64M address |
| space split, which is removed in this phase. Phase I also unifies and |
| cleans up the swap code base. |
| |
| Each swap cluster dynamically allocates its swap table, an atomic |
| array covering every swap slot in the cluster, replacing the |
| XArray-backed swap cache. In phase I, the statically allocated |
| swap_map still co-exists with the swap table, and average memory |
| usage is about the same as the original. A few exceptional test cases |
| show about 1% higher memory usage. In the following phases of the |
| series, swap_map will be merged into the swap table without |
| additional memory allocation, resulting in a net memory reduction |
| compared to the original swap cache. |
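As an illustration only, the per-cluster array idea can be sketched in
userspace C. All identifiers, the cluster size, and the memory orderings
below are hypothetical and chosen for the sketch; they are not the
kernel's actual names or implementation:

```c
/*
 * Illustrative userspace sketch of the swap table idea: one atomic
 * pointer slot per swap entry in a cluster.  Names and sizes here are
 * assumptions for the sketch, not the kernel's actual identifiers.
 */
#include <stdatomic.h>
#include <stddef.h>

#define CLUSTER_SLOTS 512 /* assumed number of swap slots per cluster */

struct swap_cluster {
	/* each slot holds NULL, a folio pointer, or a shadow value */
	_Atomic(void *) table[CLUSTER_SLOTS];
};

/* Lookup is a single indexed atomic load: no tree walk as with an XArray. */
static void *swap_table_get(struct swap_cluster *c, unsigned int off)
{
	return atomic_load_explicit(&c->table[off], memory_order_acquire);
}

/* In the kernel, stores would additionally be done under the cluster lock. */
static void swap_table_set(struct swap_cluster *c, unsigned int off, void *val)
{
	atomic_store_explicit(&c->table[off], val, memory_order_release);
}
```

The point of the sketch is the lookup path: given the cluster and an
in-cluster offset, the cache value is one array access away, instead of
a multi-node tree descent.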
| |
| Testing has shown that phase I brings a significant performance |
| improvement, from an 8c/1G ARM machine to 48c96t/128G x86_64 servers, |
| in many practical workloads. |
| |
| The full picture with a summary can be found at [2]. An older bigger |
| series of 28 patches is posted at [3]. |
| |
| vm-scalability test: |
| ==================== |
| Test with: |
| usemem --init-time -O -y -x -n 31 1G (4G memcg, PMEM as swap) |
| Before: After: |
| System time: 219.12s 158.16s (-27.82%) |
| Sum Throughput: 4767.13 MB/s 6128.59 MB/s (+28.55%) |
| Single process Throughput: 150.21 MB/s 196.52 MB/s (+30.83%) |
| Free latency: 175047.58 us 131411.87 us (-24.92%) |
| |
| usemem --init-time -O -y -x -n 32 1536M (16G memory, global pressure, |
| PMEM as swap) |
| Before: After: |
| System time: 356.16s 284.68s (-20.06%) |
| Sum Throughput: 4648.35 MB/s 5453.52 MB/s (+17.32%) |
| Single process Throughput: 141.63 MB/s 168.35 MB/s (+18.86%) |
| Free latency: 499907.71 us 484977.03 us (-2.99%) |
| |
| This shows an improvement of more than 20% in most readings. |
| |
| Build kernel test: |
| ================== |
| The following result matrix is from building kernel with defconfig on |
| tmpfs with ZSWAP / ZRAM, using different memory pressure and setups. |
| Measuring sys and real time in seconds, less is better (user time is |
| almost identical as expected): |
| |
| -j<NR> / Mem | Sys before / after | Real before / after |
| Using 16G ZRAM with memcg limit: |
| 6 / 192M | 9686 / 9472 -2.21% | 2130 / 2096 -1.59% |
| 12 / 256M | 6610 / 6451 -2.41% | 827 / 812 -1.81% |
| 24 / 384M | 5938 / 5701 -3.37% | 414 / 405 -2.17% |
| 48 / 768M | 4696 / 4409 -6.11% | 188 / 182 -3.19% |
| With 64k folio: |
| 24 / 512M | 4222 / 4162 -1.42% | 326 / 321 -1.53% |
| 48 / 1G | 3688 / 3622 -1.79% | 151 / 149 -1.32% |
| With ZSWAP with 3G memcg (using higher limit due to kmem account): |
| 48 / 3G | 603 / 581 -3.65% | 81 / 80 -1.23% |
| |
| Testing extremely high global memory and schedule pressure: Using ZSWAP |
| with 32G NVMEs in a 48c VM that has 4G memory, no memcg limit, system |
| components take up about 1.5G already, using make -j48 to build defconfig: |
| |
| Before: sys time: 2069.53s real time: 135.76s |
| After: sys time: 2021.13s (-2.34%) real time: 134.23s (-1.12%) |
| |
| On another 48c 4G memory VM, using 16G ZRAM as swap, testing make |
| -j48 with same config: |
| |
| Before: sys time: 1756.96s real time: 111.01s |
| After: sys time: 1715.90s (-2.34%) real time: 109.51s (-1.35%) |
| |
| All cases are more or less faster, and no regression even under extremely |
| heavy global memory pressure. |
| |
| Redis / Valkey bench: |
| ===================== |
| The test machine is an ARM64 VM with 1536M memory and 12 cores. Redis |
| is set to use 2500M memory, and the ZRAM swap size is set to 5G: |
| |
| Testing with: |
| redis-benchmark -r 2000000 -n 2000000 -d 1024 -c 12 -P 32 -t get |
| |
| no BGSAVE with BGSAVE |
| Before: 487576.06 RPS 280016.02 RPS |
| After: 487541.76 RPS (-0.01%) 300155.32 RPS (+7.19%) |
| |
| Testing with: |
| redis-benchmark -r 2500000 -n 2500000 -d 1024 -c 12 -P 32 -t get |
| no BGSAVE with BGSAVE |
| Before: 466789.59 RPS 281213.92 RPS |
| After: 466402.89 RPS (-0.08%) 298411.84 RPS (+6.12%) |
| |
| With BGSAVE enabled, most Redis memory will have a swap count > 1, so |
| the swap cache is heavily in use. We can see about a 6% performance |
| gain. The no-BGSAVE case is very slightly slower (<0.1%) due to the |
| higher memory pressure from the co-existence of swap_map and the swap |
| table. This will be optimized into a net gain, up to a 20% gain in |
| the BGSAVE case, in the following phases. |
| |
| HDD swap is also ~40% faster with usemem because we removed an old |
| contention workaround. |
| |
| |
| This patch (of 15): |
| |
| Swap table is the new swap cache. |
| |
| [chrisl@kernel.org: move swap table document, redo swap table size sentence] |
| Link: https://lkml.kernel.org/r/CACePvbXjaUyzB_9RSSSgR6BNvz+L9anvn0vcNf_J0jD7-4Yy6Q@mail.gmail.com |
| Link: https://lkml.kernel.org/r/20250916160100.31545-1-ryncsn@gmail.com |
| Link: https://lore.kernel.org/linux-mm/20250514201729.48420-1-ryncsn@gmail.com/ [3] |
| Link: https://lkml.kernel.org/r/20250916160100.31545-2-ryncsn@gmail.com |
| Link: https://lore.kernel.org/CAMgjq7BvQ0ZXvyLGp2YP96+i+6COCBBJCYmjXHGBnfisCAb8VA@mail.gmail.com [1] |
| Link: https://github.com/ryncsn/linux/tree/kasong/devel/swap-table [2] |
| Signed-off-by: Chris Li <chrisl@kernel.org> |
| Signed-off-by: Kairui Song <kasong@tencent.com> |
| Suggested-by: Chris Li <chrisl@kernel.org> |
| Cc: Baolin Wang <baolin.wang@linux.alibaba.com> |
| Cc: Baoquan He <bhe@redhat.com> |
| Cc: Barry Song <baohua@kernel.org> |
| Cc: David Hildenbrand <david@redhat.com> |
| Cc: "Huang, Ying" <ying.huang@linux.alibaba.com> |
| Cc: Hugh Dickins <hughd@google.com> |
| Cc: Johannes Weiner <hannes@cmpxchg.org> |
| Cc: Kemeng Shi <shikemeng@huaweicloud.com> |
| Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> |
| Cc: Matthew Wilcox (Oracle) <willy@infradead.org> |
| Cc: Nhat Pham <nphamcs@gmail.com> |
| Cc: Yosry Ahmed <yosryahmed@google.com> |
| Cc: Zi Yan <ziy@nvidia.com> |
| Cc: kernel test robot <oliver.sang@intel.com> |
| Cc: SeongJae Park <sj@kernel.org> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| Documentation/mm/index.rst | 1 |
| Documentation/mm/swap-table.rst | 69 ++++++++++++++++++++++++++++++ |
| MAINTAINERS | 1 |
| 3 files changed, 71 insertions(+) |
| |
| --- a/Documentation/mm/index.rst~docs-mm-add-document-for-swap-table |
| +++ a/Documentation/mm/index.rst |
| @@ -20,6 +20,7 @@ see the :doc:`admin guide <../admin-guid |
| highmem |
| page_reclaim |
| swap |
| + swap-table |
| page_cache |
| shmfs |
| oom |
| diff --git a/Documentation/mm/swap-table.rst a/Documentation/mm/swap-table.rst |
| new file mode 100644 |
| --- /dev/null |
| +++ a/Documentation/mm/swap-table.rst |
| @@ -0,0 +1,69 @@ |
| +.. SPDX-License-Identifier: GPL-2.0 |
| + |
| +:Author: Chris Li <chrisl@kernel.org>, Kairui Song <kasong@tencent.com> |
| + |
| +========== |
| +Swap Table |
| +========== |
| + |
| +Swap table implements swap cache as a per-cluster swap cache value array. |
| + |
| +Swap Entry |
| +---------- |
| + |
| +A swap entry contains the information required to serve the anonymous page |
| +fault. |
| + |
| +A swap entry is encoded as two parts: the swap type and the swap offset. |
| + |
| +The swap type indicates which swap device to use. |
| +The swap offset is the offset within the swap file to read the page data from. |
| + |
| +Swap Cache |
| +---------- |
| + |
| +Swap cache is a map to look up folios using a swap entry as the key. The |
| +resulting value can have three possible types, depending on which stage |
| +this swap entry is in: |
| + |
| +1. NULL: This swap entry is not used. |
| + |
| +2. folio: A folio has been allocated and bound to this swap entry. This |
| +   is the transient state during swap out or swap in. The data can be in |
| +   the folio, in the swap file, or in both. |
| + |
| +3. shadow: The shadow contains the working set information of the swapped |
| + out folio. This is the normal state for a swapped out page. |
| + |
| +Swap Table Internals |
| +-------------------- |
| + |
| +The previous swap cache was implemented with an XArray, which is a tree |
| +structure: each lookup goes through multiple nodes. Can we do better? |
| + |
| +Notice that most of the time when we look up the swap cache, we are |
| +either in the swap-in or swap-out path, where we should already have |
| +the swap cluster that contains the swap entry. |
| + |
| +If we have a per-cluster array to store the swap cache values, a swap |
| +cache lookup within the cluster becomes a simple array lookup. |
| + |
| +We give such a per-cluster swap cache value array a name: the swap table. |
| + |
| +A swap table is an array of pointers. Each pointer is the same size as a |
| +PTE. The size of a swap table for one swap cluster typically matches a PTE |
| +page table, which is one page on modern 64-bit systems. |
| + |
| +With the swap table, swap cache lookups achieve good locality and are |
| +simpler and faster. |
| + |
| +Locking |
| +------- |
| + |
| +Swap table modification requires taking the cluster lock. If a folio |
| +is being added to or removed from the swap table, the folio must be |
| +locked prior to the cluster lock. After adding or removing is done, the |
| +folio shall be unlocked. |
| + |
| +Swap table lookup is protected by RCU and atomic reads. If the lookup |
| +returns a folio, the user must lock the folio before use. |
| --- a/MAINTAINERS~docs-mm-add-document-for-swap-table |
| +++ a/MAINTAINERS |
| @@ -16225,6 +16225,7 @@ R: Barry Song <baohua@kernel.org> |
| R: Chris Li <chrisl@kernel.org> |
| L: linux-mm@kvack.org |
| S: Maintained |
| +F: Documentation/mm/swap-table.rst |
| F: include/linux/swap.h |
| F: include/linux/swapfile.h |
| F: include/linux/swapops.h |
| _ |