| From: Sergey Senozhatsky <senozhatsky@chromium.org> |
| Subject: zram: sleepable entry locking |
| Date: Mon, 3 Mar 2025 11:03:10 +0900 |
| |
| Patch series "zsmalloc/zram: there be preemption", v10. |
| |
| Currently zram runs compression and decompression in non-preemptible |
| sections, e.g. |
| |
| zcomp_stream_get() // grabs CPU local lock |
| zcomp_compress() |
| |
| or |
| |
| zram_slot_lock() // grabs entry spin-lock |
| zcomp_stream_get() // grabs CPU local lock |
| zs_map_object() // grabs rwlock and CPU local lock |
| zcomp_decompress() |
| |
| This is potentially troublesome for a number of reasons. |
| |
| For instance, this makes it impossible to use async compression algorithms |
| and/or H/W compression algorithms, which can wait for operation completion |
| or resource availability. This also restricts what compression algorithms |
| can do internally; for example, zstd can allocate internal state memory |
| for C/D dictionaries: |
| |
| do_fsync() |
| do_writepages() |
| zram_bio_write() |
| zram_write_page() // becomes non-preemptible |
| zcomp_compress() |
| zstd_compress() |
| ZSTD_compress_usingCDict() |
| ZSTD_compressBegin_usingCDict_internal() |
| ZSTD_resetCCtx_usingCDict() |
| ZSTD_resetCCtx_internal() |
| zstd_custom_alloc() // memory allocation |
| |
| Not to mention that the system can be configured to maximize compression |
| ratio at the cost of CPU/HW time (e.g. lz4hc or deflate with a very high |
| compression level), so zram can stay in a non-preemptible section (even |
| under a spin-lock and/or an rwlock) for an extended period of time. Aside |
| from compression algorithms, this also restricts what zram itself can do. |
| One particular example is zsmalloc handle allocation in zram_write_page(), |
| which has an optimistic allocation path (disallowing direct reclaim) and a |
| pessimistic fallback path, which then forces zram to compress the page one |
| more time. |
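| |
| For illustration, the allocation pattern in question looks roughly like |
| the sketch below. This is a simplified illustration, not the actual |
| zram_write_page() code: the helper name is made up, and the GFP flags and |
| error handling are abbreviated. |
| |
| static int write_compressed_sketch(struct zram *zram, void *src, |
|                                    unsigned int comp_len) |
| { |
|         unsigned long handle; |
| |
|         /* optimistic attempt: direct reclaim disallowed, stays atomic */ |
|         handle = zs_malloc(zram->mem_pool, comp_len, |
|                            __GFP_KSWAPD_RECLAIM | __GFP_NOWARN | __GFP_HIGHMEM); |
|         if (IS_ERR_VALUE(handle)) { |
|                 /* |
|                  * pessimistic fallback: direct reclaim allowed, may sleep. |
|                  * The per-CPU stream holding the compressed data had to be |
|                  * released first, so the page must be compressed again. |
|                  */ |
|                 handle = zs_malloc(zram->mem_pool, comp_len, |
|                                    GFP_NOIO | __GFP_HIGHMEM); |
|                 if (IS_ERR_VALUE(handle)) |
|                         return -ENOMEM; |
|                 /* ... re-compress the page into a new per-CPU stream ... */ |
|         } |
| |
|         /* ... copy the compressed data into the object, set the handle ... */ |
|         return 0; |
| } |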
| |
| This series changes zram so that it does not directly impose atomicity |
| restrictions on compression algorithms (or on itself), which makes zram |
| write() fully preemptible; zram read(), sadly, is not always preemptible |
| yet. There are still indirect atomicity restrictions imposed by zsmalloc. |
| One notable example is the object mapping API, which returns with: |
| a) the local CPU lock held and b) the zspage rwlock held. |
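| |
| For illustration, the (pre-series) read-side pattern looks roughly like |
| the sketch below; the helper name is made up and decompression is reduced |
| to a memcpy(): |
| |
| static int read_object_locked_sketch(struct zram *zram, unsigned long handle, |
|                                      void *dst, unsigned int size) |
| { |
|         void *src; |
| |
|         /* returns with a local CPU lock and the zspage rwlock held */ |
|         src = zs_map_object(zram->mem_pool, handle, ZS_MM_RO); |
|         /* |
|          * non-preemptible region: whatever runs here (memcpy() for |
|          * incompressible pages, zcomp_decompress() otherwise) must |
|          * not schedule, sleep or wait |
|          */ |
|         memcpy(dst, src, size); |
|         zs_unmap_object(zram->mem_pool, handle); |
|         return 0; |
| } |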
| |
| First, zsmalloc's zspage lock is converted from an rwlock to a special |
| type of RW-lookalike lock with some extra guarantees/features. Second, a |
| new handle mapping is introduced which doesn't use per-CPU buffers (and |
| hence no local CPU lock) and does fewer memcpy() calls, but requires users |
| to provide a pointer to a temporary buffer for object copy-in (when |
| needed). Third, zram is converted to the new zsmalloc mapping API and |
| thus zram read() becomes preemptible. |
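| |
| The shape of the new mapping scheme is roughly as follows. The function |
| names below are illustrative placeholders (they are not the exact API |
| added later in the series); the point is that the caller supplies the |
| copy-in buffer: |
| |
| static void read_object_sketch(struct zs_pool *pool, unsigned long handle, |
|                                void *copy_buf, void *dst, unsigned int size) |
| { |
|         void *obj; |
| |
|         /* |
|          * @copy_buf is caller-provided (e.g. a preallocated per-stream |
|          * buffer) and is used only when the object spans two physical |
|          * pages; otherwise a direct mapping is returned, with no local |
|          * CPU lock taken in either case. |
|          */ |
|         obj = sketch_obj_read_begin(pool, handle, copy_buf); |
|         /* the owner is allowed to schedule here */ |
|         memcpy(dst, obj, size); |
|         sketch_obj_read_end(pool, handle, obj); |
| } |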
| |
| |
| This patch (of 19): |
| |
| Concurrent modification of meta table entries is currently handled by a |
| per-entry spin-lock. This has a number of shortcomings. |
| |
| First, this imposes atomicity requirements on compression backends. zram |
| can call both zcomp_compress() and zcomp_decompress() under the entry |
| spin-lock, which implies that only compression algorithms that don't |
| schedule/sleep/wait during compression and decompression can be used. |
| This, for instance, makes it impossible to use some ASYNC compression |
| algorithm implementations (H/W compression, etc.). |
| |
| Second, this can potentially trigger watchdogs. For example, entry |
| re-compression with secondary algorithms is performed under the entry |
| spin-lock. Given that we chain secondary compression algorithms and that |
| some of them can be configured for the best compression ratio (and worst |
| compression speed), zram can stay under the spin-lock for quite some time. |
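| |
| A simplified sketch of that re-compression pattern (the helper is made |
| up, the zcomp calls are reduced to a comment) is: |
| |
| static void recompress_under_lock_sketch(struct zram *zram, u32 index) |
| { |
|         u32 prio; |
| |
|         zram_slot_lock(zram, index);    /* entry spin-lock before this patch */ |
|         for (prio = ZRAM_SECONDARY_COMP; prio < ZRAM_MAX_COMPS; prio++) { |
|                 /* |
|                  * zcomp_stream_get() + zcomp_compress() with the backend |
|                  * registered for @prio; a ratio-tuned backend (e.g. zstd |
|                  * at a high level) can take a long time here while the |
|                  * task is still non-preemptible |
|                  */ |
|         } |
|         zram_slot_unlock(zram, index); |
| } |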
| |
| Having a per-entry mutex (or, for instance, a rw-semaphore) significantly |
| increases the sizeof() of each entry and hence of the meta table. |
| Therefore entry locking returns to bit locking, as before; however, this |
| time it is also preempt-rt friendly, because it waits on the bit instead |
| of spinning on it. Lock owners are now also permitted to schedule, which |
| is the first step on the path to making zram non-atomic. |
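| |
| As a usage sketch of the functions added below (the wrapper is made up |
| for illustration): process-context paths take the lock directly and may |
| schedule while holding it, while atomic callers must use the trylock |
| variant and handle failure: |
| |
| static void slot_access_sketch(struct zram *zram, u32 index, bool atomic_ctx) |
| { |
|         if (!atomic_ctx) { |
|                 /* may wait for the bit; the owner may then schedule */ |
|                 zram_slot_lock(zram, index); |
|                 /* ... read/modify zram->table[index], compress, etc. ... */ |
|                 zram_slot_unlock(zram, index); |
|                 return; |
|         } |
| |
|         /* atomic context (e.g. slot-free notification): trylock only */ |
|         if (zram_slot_trylock(zram, index)) { |
|                 /* ... */ |
|                 zram_slot_unlock(zram, index); |
|         } |
|         /* on failure: back off and let the current owner finish */ |
| } |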
| |
| Link: https://lkml.kernel.org/r/20250303022425.285971-1-senozhatsky@chromium.org |
| Link: https://lkml.kernel.org/r/20250303022425.285971-2-senozhatsky@chromium.org |
| Signed-off-by: Sergey Senozhatsky <senozhatsky@chromium.org> |
| Cc: Hillf Danton <hdanton@sina.com> |
| Cc: Kairui Song <ryncsn@gmail.com> |
| Cc: Minchan Kim <minchan@kernel.org> |
| Cc: Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
| Cc: Yosry Ahmed <yosry.ahmed@linux.dev> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| drivers/block/zram/zram_drv.c | 54 +++++++++++++++++++++++++------- |
| drivers/block/zram/zram_drv.h | 15 +++++--- |
| 2 files changed, 52 insertions(+), 17 deletions(-) |
| |
| --- a/drivers/block/zram/zram_drv.c~zram-sleepable-entry-locking |
| +++ a/drivers/block/zram/zram_drv.c |
| @@ -58,19 +58,56 @@ static void zram_free_page(struct zram * |
| static int zram_read_from_zspool(struct zram *zram, struct page *page, |
| u32 index); |
| |
| -static int zram_slot_trylock(struct zram *zram, u32 index) |
| +#define slot_dep_map(zram, index) (&(zram)->table[(index)].dep_map) |
| + |
| +static void zram_slot_lock_init(struct zram *zram, u32 index) |
| +{ |
| + static struct lock_class_key __key; |
| + |
| + lockdep_init_map(slot_dep_map(zram, index), "zram->table[index].lock", |
| + &__key, 0); |
| +} |
| + |
| +/* |
| + * entry locking rules: |
| + * |
| + * 1) Lock is exclusive |
| + * |
| + * 2) lock() function can sleep waiting for the lock |
| + * |
| + * 3) Lock owner can sleep |
| + * |
| + * 4) Use TRY lock variant when in atomic context |
| + * - must check return value and handle locking failures |
| + */ |
| +static __must_check bool zram_slot_trylock(struct zram *zram, u32 index) |
| { |
| - return spin_trylock(&zram->table[index].lock); |
| + unsigned long *lock = &zram->table[index].flags; |
| + |
| + if (!test_and_set_bit_lock(ZRAM_ENTRY_LOCK, lock)) { |
| + mutex_acquire(slot_dep_map(zram, index), 0, 1, _RET_IP_); |
| + lock_acquired(slot_dep_map(zram, index), _RET_IP_); |
| + return true; |
| + } |
| + |
| + return false; |
| } |
| |
| static void zram_slot_lock(struct zram *zram, u32 index) |
| { |
| - spin_lock(&zram->table[index].lock); |
| + unsigned long *lock = &zram->table[index].flags; |
| + |
| + mutex_acquire(slot_dep_map(zram, index), 0, 0, _RET_IP_); |
| + wait_on_bit_lock(lock, ZRAM_ENTRY_LOCK, TASK_UNINTERRUPTIBLE); |
| + lock_acquired(slot_dep_map(zram, index), _RET_IP_); |
| } |
| |
| static void zram_slot_unlock(struct zram *zram, u32 index) |
| { |
| - spin_unlock(&zram->table[index].lock); |
| + unsigned long *lock = &zram->table[index].flags; |
| + |
| + mutex_release(slot_dep_map(zram, index), _RET_IP_); |
| + clear_and_wake_up_bit(ZRAM_ENTRY_LOCK, lock); |
| } |
| |
| static inline bool init_done(struct zram *zram) |
| @@ -93,7 +130,6 @@ static void zram_set_handle(struct zram |
| zram->table[index].handle = handle; |
| } |
| |
| -/* flag operations require table entry bit_spin_lock() being held */ |
| static bool zram_test_flag(struct zram *zram, u32 index, |
| enum zram_pageflags flag) |
| { |
| @@ -1473,15 +1509,11 @@ static bool zram_meta_alloc(struct zram |
| huge_class_size = zs_huge_class_size(zram->mem_pool); |
| |
| for (index = 0; index < num_pages; index++) |
| - spin_lock_init(&zram->table[index].lock); |
| + zram_slot_lock_init(zram, index); |
| + |
| return true; |
| } |
| |
| -/* |
| - * To protect concurrent access to the same index entry, |
| - * caller should hold this table index entry's bit_spinlock to |
| - * indicate this index entry is accessing. |
| - */ |
| static void zram_free_page(struct zram *zram, size_t index) |
| { |
| unsigned long handle; |
| --- a/drivers/block/zram/zram_drv.h~zram-sleepable-entry-locking |
| +++ a/drivers/block/zram/zram_drv.h |
| @@ -28,7 +28,6 @@ |
| #define ZRAM_SECTOR_PER_LOGICAL_BLOCK \ |
| (1 << (ZRAM_LOGICAL_BLOCK_SHIFT - SECTOR_SHIFT)) |
| |
| - |
| /* |
| * ZRAM is mainly used for memory efficiency so we want to keep memory |
| * footprint small and thus squeeze size and zram pageflags into a flags |
| @@ -46,6 +45,7 @@ |
| /* Flags for zram pages (table[page_no].flags) */ |
| enum zram_pageflags { |
| ZRAM_SAME = ZRAM_FLAG_SHIFT, /* Page consists the same element */ |
| + ZRAM_ENTRY_LOCK, /* entry access lock bit */ |
| ZRAM_WB, /* page is stored on backing_device */ |
| ZRAM_PP_SLOT, /* Selected for post-processing */ |
| ZRAM_HUGE, /* Incompressible page */ |
| @@ -58,16 +58,19 @@ enum zram_pageflags { |
| __NR_ZRAM_PAGEFLAGS, |
| }; |
| |
| -/*-- Data structures */ |
| - |
| -/* Allocated for each disk page */ |
| +/* |
| + * Allocated for each disk page. We use bit-lock (ZRAM_ENTRY_LOCK bit |
| + * of flags) to save memory. There can be plenty of entries and standard |
| + * locking primitives (e.g. mutex) will significantly increase sizeof() |
| + * of each entry and hence of the meta table. |
| + */ |
| struct zram_table_entry { |
| unsigned long handle; |
| - unsigned int flags; |
| - spinlock_t lock; |
| + unsigned long flags; |
| #ifdef CONFIG_ZRAM_TRACK_ENTRY_ACTIME |
| ktime_t ac_time; |
| #endif |
| + struct lockdep_map dep_map; |
| }; |
| |
| struct zram_stats { |
| _ |