From: Ge Yang <yangge1116@126.com>
Subject: mm/cma: using per-CMA locks to improve concurrent allocation performance
Date: Mon, 10 Feb 2025 09:56:06 +0800

Ideally, concurrent allocations from different CMA areas should not need
to synchronize with one another.  Currently, however, a single global
cma_mutex serializes all CMA allocations, which hurts the performance of
concurrent allocations across different CMA areas.

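As a condensed sketch of the pre-patch code (the retry loop and error
handling of cma_range_alloc() are omitted here; see the diff below for
the real context), every allocation funnels through one kernel-wide
mutex:

	/* One mutex shared by every CMA area in the system. */
	static DEFINE_MUTEX(cma_mutex);

	/*
	 * In cma_range_alloc(): taken for every allocation, so
	 * allocations from two unrelated CMA areas still serialize
	 * on this single lock.
	 */
	mutex_lock(&cma_mutex);
	ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
	mutex_unlock(&cma_mutex);
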
To test the performance impact, follow these steps:
1. Boot the kernel with the command-line argument hugetlb_cma=30G to
   reserve a 30GB CMA area for huge page allocations.  (Note: my
   machine has 3 NUMA nodes, so each node is initialized with 10GB of
   CMA.)
2. Run dd if=/dev/zero of=/dev/shm/file bs=1G count=30 to fully
   utilize the CMA area by writing zeroes to a file in /dev/shm.
3. Open three terminals and run the following commands simultaneously.
   (Note: each command attempts to allocate 10GB of CMA memory, i.e.
   2621440 4KB pages.)
   On Terminal 1: time echo 2621440 > /sys/kernel/debug/cma/hugetlb1/alloc
   On Terminal 2: time echo 2621440 > /sys/kernel/debug/cma/hugetlb2/alloc
   On Terminal 3: time echo 2621440 > /sys/kernel/debug/cma/hugetlb3/alloc

The pages are allocated through the CMA debugfs interface, and the time
command measures the duration of each allocation.
Performance comparison:
             Without this patch    With this patch
Terminal 1   ~7s                   ~7s
Terminal 2   ~14s                  ~8s
Terminal 3   ~21s                  ~7s

To solve the problem above, introduce a per-CMA lock.  Each CMA area is
then managed independently, which removes the contention on a single
global lock and improves scalability and concurrent allocation
performance.

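Condensed from the diff below, the change moves the mutex into struct
cma (initialized in cma_activate_area()), so only allocations from the
same area contend:

	struct cma {
		...
		spinlock_t lock;
		/* Serializes alloc_contig_range() within this area only. */
		struct mutex alloc_mutex;
		...
	};

	/* In cma_range_alloc(): per-area lock instead of the global one. */
	mutex_lock(&cma->alloc_mutex);
	ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
	mutex_unlock(&cma->alloc_mutex);
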
Link: https://lkml.kernel.org/r/1739152566-744-1-git-send-email-yangge1116@126.com
Signed-off-by: Ge Yang <yangge1116@126.com>
Reviewed-by: Barry Song <baohua@kernel.org>
Acked-by: David Hildenbrand <david@redhat.com>
Reviewed-by: Oscar Salvador <osalvador@suse.de>
Cc: Aisheng Dong <aisheng.dong@nxp.com>
Cc: Baolin Wang <baolin.wang@linux.alibaba.com>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
---

 mm/cma.c | 7 ++++---
 mm/cma.h | 1 +
 2 files changed, 5 insertions(+), 3 deletions(-)

--- a/mm/cma.c~mm-cma-using-per-cma-locks-to-improve-concurrent-allocation-performance
+++ a/mm/cma.c
@@ -34,7 +34,6 @@
 
 struct cma cma_areas[MAX_CMA_AREAS];
 unsigned int cma_area_count;
-static DEFINE_MUTEX(cma_mutex);
 
 static int __init __cma_declare_contiguous_nid(phys_addr_t base,
 			phys_addr_t size, phys_addr_t limit,
@@ -175,6 +174,8 @@ static void __init cma_activate_area(str
 
 	spin_lock_init(&cma->lock);
 
+	mutex_init(&cma->alloc_mutex);
+
 #ifdef CONFIG_CMA_DEBUGFS
 	INIT_HLIST_HEAD(&cma->mem_head);
 	spin_lock_init(&cma->mem_head_lock);
@@ -813,9 +814,9 @@ static int cma_range_alloc(struct cma *c
 		spin_unlock_irq(&cma->lock);
 
 		pfn = cmr->base_pfn + (bitmap_no << cma->order_per_bit);
-		mutex_lock(&cma_mutex);
+		mutex_lock(&cma->alloc_mutex);
 		ret = alloc_contig_range(pfn, pfn + count, MIGRATE_CMA, gfp);
-		mutex_unlock(&cma_mutex);
+		mutex_unlock(&cma->alloc_mutex);
 		if (ret == 0) {
 			page = pfn_to_page(pfn);
 			break;
--- a/mm/cma.h~mm-cma-using-per-cma-locks-to-improve-concurrent-allocation-performance
+++ a/mm/cma.h
@@ -39,6 +39,7 @@ struct cma {
 	unsigned long available_count;
 	unsigned int order_per_bit; /* Order of pages represented by one bit */
 	spinlock_t lock;
+	struct mutex alloc_mutex;
 #ifdef CONFIG_CMA_DEBUGFS
 	struct hlist_head mem_head;
 	spinlock_t mem_head_lock;
_