| From: cuishiwei <cuishw@inspur.com> |
| Subject: mm: disable demotion during memory reclamation |
| Date: Tue, 9 Sep 2025 09:21:41 +0800 |
| |
| I've found an issue while using CXL memory. My machine has one DRAM NUMA |
| node and one CXL NUMA node: |
| |
| node 1 cpus: 96 97 98 99... - DRAM NUMA node |
| node 1 size: 772048 MB |
| node 1 free: 759737 MB |
| node 3 cpus: - CXL memory NUMA node |
| node 3 size: 524288 MB |
| node 3 free: 524287 MB |
| 1. Enable demotion |
| echo 1 > /sys/kernel/mm/numa/demotion_enabled |
| 2. Execute a memory allocation program in a memcg |
| cgexec -g memory:test numactl -N 1 ./allocate_memory 20 - allocate 20G memory |
| numastat allocate_memory: |
| Node 0 Node 1 Node 3 |
| --------------- --------------- --------------- |
| Huge 0.00 0.00 0.00 |
| Heap 0.00 0.00 0.00 |
| Stack 0.00 0.01 0.00 |
| Private 0.05 20481.56 0.01 |
| 3. Set the memcg's memory limit below its current usage |
| echo 15G > /sys/fs/cgroup/test/memory.max |
| numastat allocate_memory: |
| Node 0 Node 1 Node 3 |
| --------------- --------------- --------------- |
| Huge 0.00 0.00 0.00 |
| Heap 0.00 0.00 0.00 |
| Stack 0.00 0.01 0.00 |
| Private 0.00 4011.54 10560.00 |
| |
| This happens because demotion was enabled: when the memcg's memory limit |
| was exceeded, memory on the DRAM NUMA node was first migrated to the CXL |
| NUMA node. Memory reclaim was performed only after that, so the |
| migration was unnecessary. |
| |
| When a memory cgroup exceeds its memory limit, the system reclaims its |
| cold memory. However, if /sys/kernel/mm/numa/demotion_enabled is set to |
| 1, memory on fast memory nodes will also be demoted to slow memory nodes. |
| |
| This demotion contradicts the goal of reclaiming cold memory within the |
| memcg. At this point, demoting cold memory from fast to slow nodes is |
| pointless; it doesn't reduce the memcg's memory usage. Therefore, we |
| should set no_demotion when reclaiming memory in a memcg. |
| |
| Link: https://lkml.kernel.org/r/20250909012141.1467-1-cuishw@inspur.com |
| Signed-off-by: cuishiwei <cuishw@inspur.com> |
| Cc: Axel Rasmussen <axelrasmussen@google.com> |
| Cc: David Hildenbrand <david@redhat.com> |
| Cc: Johannes Weiner <hannes@cmpxchg.org> |
| Cc: Lorenzo Stoakes <lorenzo.stoakes@oracle.com> |
| Cc: Qi Zheng <zhengqi.arch@bytedance.com> |
| Cc: Shakeel Butt <shakeel.butt@linux.dev> |
| Cc: Wei Xu <weixugc@google.com> |
| Cc: Yuanchu Xie <yuanchu@google.com> |
| Cc: Michal Hocko <mhocko@suse.com> |
| Cc: Roman Gushchin <roman.gushchin@linux.dev> |
| Cc: Muchun Song <songmuchun@bytedance.com> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| mm/vmscan.c | 1 + |
| 1 file changed, 1 insertion(+) |
| |
| --- a/mm/vmscan.c~disable-demotion-during-memory-reclamation |
| +++ a/mm/vmscan.c |
| @@ -6714,6 +6714,7 @@ unsigned long try_to_free_mem_cgroup_pag |
| .may_unmap = 1, |
| .may_swap = !!(reclaim_options & MEMCG_RECLAIM_MAY_SWAP), |
| .proactive = !!(reclaim_options & MEMCG_RECLAIM_PROACTIVE), |
| + .no_demotion = 1, |
| }; |
| /* |
| * Traverse the ZONELIST_FALLBACK zonelist of the current node to put |
| _ |