| From: Huang Ying <ying.huang@intel.com> |
| Subject: migrate_pages: fix deadlock in batched migration |
| Date: Fri, 3 Mar 2023 11:01:53 +0800 |
| |
| Patch series "migrate_pages: fix deadlock in batched synchronous |
| migration", v2. |
| |
| Two deadlock bugs were reported for the migrate_pages() batching series. |
| Thanks Hugh and Pengfei. Analysis shows that if we have locked some other |
| folios except the one we are migrating, it's not safe in general to wait |
| synchronously, for example, to wait the writeback to complete or wait to |
| lock the buffer head. |
| |
| So 1/3 fixes the deadlock in a simple way, where the batching support for |
| the synchronous migration is disabled. The change is straightforward and |
| easy to be understood. While 3/3 re-introduce the batching for |
| synchronous migration via trying to migrate asynchronously in batch |
| optimistically, then fall back to migrate synchronously one by one for |
| fail-to-migrate folios. Test shows that this can restore the TLB flushing |
| batching performance for synchronous migration effectively. |
| |
| |
| This patch (of 3): |
| |
| Two deadlock bugs were reported for the migrate_pages() batching series. |
| Thanks Hugh and Pengfei! For example, in the following deadlock trace |
| snippet, |
| |
| INFO: task kworker/u4:0:9 blocked for more than 147 seconds. |
| Not tainted 6.2.0-rc4-kvm+ #1314 |
| "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| task:kworker/u4:0 state:D stack:0 pid:9 ppid:2 flags:0x00004000 |
| Workqueue: loop4 loop_rootcg_workfn |
| Call Trace: |
| <TASK> |
| __schedule+0x43b/0xd00 |
| schedule+0x6a/0xf0 |
| io_schedule+0x4a/0x80 |
| folio_wait_bit_common+0x1b5/0x4e0 |
| ? __pfx_wake_page_function+0x10/0x10 |
| __filemap_get_folio+0x73d/0x770 |
| shmem_get_folio_gfp+0x1fd/0xc80 |
| shmem_write_begin+0x91/0x220 |
| generic_perform_write+0x10e/0x2e0 |
| __generic_file_write_iter+0x17e/0x290 |
| ? generic_write_checks+0x12b/0x1a0 |
| generic_file_write_iter+0x97/0x180 |
| ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 |
| do_iter_readv_writev+0x13c/0x210 |
| ? __sanitizer_cov_trace_const_cmp4+0x1a/0x20 |
| do_iter_write+0xf6/0x330 |
| vfs_iter_write+0x46/0x70 |
| loop_process_work+0x723/0xfe0 |
| loop_rootcg_workfn+0x28/0x40 |
| process_one_work+0x3cc/0x8d0 |
| worker_thread+0x66/0x630 |
| ? __pfx_worker_thread+0x10/0x10 |
| kthread+0x153/0x190 |
| ? __pfx_kthread+0x10/0x10 |
| ret_from_fork+0x29/0x50 |
| </TASK> |
| |
| INFO: task repro:1023 blocked for more than 147 seconds. |
| Not tainted 6.2.0-rc4-kvm+ #1314 |
| "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| task:repro state:D stack:0 pid:1023 ppid:360 flags:0x00004004 |
| Call Trace: |
| <TASK> |
| __schedule+0x43b/0xd00 |
| schedule+0x6a/0xf0 |
| io_schedule+0x4a/0x80 |
| folio_wait_bit_common+0x1b5/0x4e0 |
| ? compaction_alloc+0x77/0x1150 |
| ? __pfx_wake_page_function+0x10/0x10 |
| folio_wait_bit+0x30/0x40 |
| folio_wait_writeback+0x2e/0x1e0 |
| migrate_pages_batch+0x555/0x1ac0 |
| ? __pfx_compaction_alloc+0x10/0x10 |
| ? __pfx_compaction_free+0x10/0x10 |
| ? __this_cpu_preempt_check+0x17/0x20 |
| ? lock_is_held_type+0xe6/0x140 |
| migrate_pages+0x100e/0x1180 |
| ? __pfx_compaction_free+0x10/0x10 |
| ? __pfx_compaction_alloc+0x10/0x10 |
| compact_zone+0xe10/0x1b50 |
| ? lock_is_held_type+0xe6/0x140 |
| ? check_preemption_disabled+0x80/0xf0 |
| compact_node+0xa3/0x100 |
| ? __sanitizer_cov_trace_const_cmp8+0x1c/0x30 |
| ? _find_first_bit+0x7b/0x90 |
| sysctl_compaction_handler+0x5d/0xb0 |
| proc_sys_call_handler+0x29d/0x420 |
| proc_sys_write+0x2b/0x40 |
| vfs_write+0x3a3/0x780 |
| ksys_write+0xb7/0x180 |
| __x64_sys_write+0x26/0x30 |
| do_syscall_64+0x3b/0x90 |
| entry_SYSCALL_64_after_hwframe+0x72/0xdc |
| RIP: 0033:0x7f3a2471f59d |
| RSP: 002b:00007ffe567f7288 EFLAGS: 00000217 ORIG_RAX: 0000000000000001 |
| RAX: ffffffffffffffda RBX: 0000000000000000 RCX: 00007f3a2471f59d |
| RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000005 |
| RBP: 00007ffe567f72a0 R08: 0000000000000010 R09: 0000000000000010 |
| R10: 0000000000000010 R11: 0000000000000217 R12: 00000000004012e0 |
| R13: 00007ffe567f73e0 R14: 0000000000000000 R15: 0000000000000000 |
| </TASK> |
| |
| The page migration task has held the lock of the shmem folio A, and is |
| waiting the writeback of the folio B of the file system on the loop block |
| device to complete. While the loop worker task which writes back the |
| folio B is waiting to lock the shmem folio A, because the folio A backs |
| the folio B in the loop device. Thus deadlock is triggered. |
| |
| In general, if we have locked some other folios except the one we are |
| migrating, it's not safe to wait synchronously, for example, to wait the |
| writeback to complete or wait to lock the buffer head. |
| |
| To fix the deadlock, in this patch, we avoid to batch the page migration |
| except for MIGRATE_ASYNC mode. In MIGRATE_ASYNC mode, synchronous waiting |
| is avoided. |
| |
| The fix can be improved further. We will do that as soon as possible. |
| |
| Link: https://lkml.kernel.org/r/20230303030155.160983-1-ying.huang@intel.com |
| Link: https://lore.kernel.org/linux-mm/87a6c8c-c5c1-67dc-1e32-eb30831d6e3d@google.com/ |
| Link: https://lore.kernel.org/linux-mm/874jrg7kke.fsf@yhuang6-desk2.ccr.corp.intel.com/ |
| Link: https://lore.kernel.org/linux-mm/20230227110614.dngdub2j3exr6dfp@quack3/ |
| Link: https://lkml.kernel.org/r/20230303030155.160983-2-ying.huang@intel.com |
| Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") |
| Signed-off-by: "Huang, Ying" <ying.huang@intel.com> |
| Reported-by: Hugh Dickins <hughd@google.com> |
| Reported-by: "Xu, Pengfei" <pengfei.xu@intel.com> |
| Cc: Jan Kara <jack@suse.cz> |
| Cc: Baolin Wang <baolin.wang@linux.alibaba.com> |
| Cc: Christoph Hellwig <hch@lst.de> |
| Cc: Stefan Roesch <shr@devkernel.io> |
| Cc: Tejun Heo <tj@kernel.org> |
| Cc: Xin Hao <xhao@linux.alibaba.com> |
| Cc: Zi Yan <ziy@nvidia.com> |
| Cc: Yang Shi <shy828301@gmail.com> |
| Cc: Matthew Wilcox <willy@infradead.org> |
| Cc: Mike Kravetz <mike.kravetz@oracle.com> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| |
| --- a/mm/migrate.c~migrate_pages-fix-deadlock-in-batched-migration |
| +++ a/mm/migrate.c |
| @@ -1112,7 +1112,7 @@ static void migrate_folio_done(struct fo |
| /* Obtain the lock on page, remove all ptes. */ |
| static int migrate_folio_unmap(new_page_t get_new_page, free_page_t put_new_page, |
| unsigned long private, struct folio *src, |
| - struct folio **dstp, int force, bool avoid_force_lock, |
| + struct folio **dstp, int force, |
| enum migrate_mode mode, enum migrate_reason reason, |
| struct list_head *ret) |
| { |
| @@ -1163,17 +1163,6 @@ static int migrate_folio_unmap(new_page_ |
| if (current->flags & PF_MEMALLOC) |
| goto out; |
| |
| - /* |
| - * We have locked some folios and are going to wait to lock |
| - * this folio. To avoid a potential deadlock, let's bail |
| - * out and not do that. The locked folios will be moved and |
| - * unlocked, then we can wait to lock this folio. |
| - */ |
| - if (avoid_force_lock) { |
| - rc = -EDEADLOCK; |
| - goto out; |
| - } |
| - |
| folio_lock(src); |
| } |
| locked = true; |
| @@ -1253,7 +1242,7 @@ static int migrate_folio_unmap(new_page_ |
| /* Establish migration ptes */ |
| VM_BUG_ON_FOLIO(folio_test_anon(src) && |
| !folio_test_ksm(src) && !anon_vma, src); |
| - try_to_migrate(src, TTU_BATCH_FLUSH); |
| + try_to_migrate(src, mode == MIGRATE_ASYNC ? TTU_BATCH_FLUSH : 0); |
| page_was_mapped = 1; |
| } |
| |
| @@ -1267,7 +1256,7 @@ out: |
| * A folio that has not been unmapped will be restored to |
| * right list unless we want to retry. |
| */ |
| - if (rc == -EAGAIN || rc == -EDEADLOCK) |
| + if (rc == -EAGAIN) |
| ret = NULL; |
| |
| migrate_folio_undo_src(src, page_was_mapped, anon_vma, locked, ret); |
| @@ -1618,6 +1607,11 @@ static int migrate_hugetlbs(struct list_ |
| /* |
| * migrate_pages_batch() first unmaps folios in the from list as many as |
| * possible, then move the unmapped folios. |
| + * |
| + * We only batch migration if mode == MIGRATE_ASYNC to avoid to wait a |
| + * lock or bit when we have locked more than one folio. Which may cause |
| + * deadlock (e.g., for loop device). So, if mode != MIGRATE_ASYNC, the |
| + * length of the from list must be <= 1. |
| */ |
| static int migrate_pages_batch(struct list_head *from, new_page_t get_new_page, |
| free_page_t put_new_page, unsigned long private, |
| @@ -1640,11 +1634,11 @@ static int migrate_pages_batch(struct li |
| LIST_HEAD(dst_folios); |
| bool nosplit = (reason == MR_NUMA_MISPLACED); |
| bool no_split_folio_counting = false; |
| - bool avoid_force_lock; |
| |
| + VM_WARN_ON_ONCE(mode != MIGRATE_ASYNC && |
| + !list_empty(from) && !list_is_singular(from)); |
| retry: |
| rc_saved = 0; |
| - avoid_force_lock = false; |
| retry = 1; |
| for (pass = 0; |
| pass < NR_MAX_MIGRATE_PAGES_RETRY && (retry || large_retry); |
| @@ -1689,15 +1683,14 @@ retry: |
| } |
| |
| rc = migrate_folio_unmap(get_new_page, put_new_page, private, |
| - folio, &dst, pass > 2, avoid_force_lock, |
| - mode, reason, ret_folios); |
| + folio, &dst, pass > 2, mode, |
| + reason, ret_folios); |
| /* |
| * The rules are: |
| * Success: folio will be freed |
| * Unmap: folio will be put on unmap_folios list, |
| * dst folio put on dst_folios list |
| * -EAGAIN: stay on the from list |
| - * -EDEADLOCK: stay on the from list |
| * -ENOMEM: stay on the from list |
| * Other errno: put on ret_folios list |
| */ |
| @@ -1749,14 +1742,6 @@ retry: |
| goto out; |
| else |
| goto move; |
| - case -EDEADLOCK: |
| - /* |
| - * The folio cannot be locked for potential deadlock. |
| - * Go move (and unlock) all locked folios. Then we can |
| - * try again. |
| - */ |
| - rc_saved = rc; |
| - goto move; |
| case -EAGAIN: |
| if (is_large) { |
| large_retry++; |
| @@ -1771,11 +1756,6 @@ retry: |
| stats->nr_thp_succeeded += is_thp; |
| break; |
| case MIGRATEPAGE_UNMAP: |
| - /* |
| - * We have locked some folios, don't force lock |
| - * to avoid deadlock. |
| - */ |
| - avoid_force_lock = true; |
| list_move_tail(&folio->lru, &unmap_folios); |
| list_add_tail(&dst->lru, &dst_folios); |
| break; |
| @@ -1900,17 +1880,15 @@ out: |
| */ |
| list_splice_init(from, ret_folios); |
| list_splice_init(&split_folios, from); |
| + /* |
| + * Force async mode to avoid to wait lock or bit when we have |
| + * locked more than one folios. |
| + */ |
| + mode = MIGRATE_ASYNC; |
| no_split_folio_counting = true; |
| goto retry; |
| } |
| |
| - /* |
| - * We have unlocked all locked folios, so we can force lock now, let's |
| - * try again. |
| - */ |
| - if (rc == -EDEADLOCK) |
| - goto retry; |
| - |
| return rc; |
| } |
| |
| @@ -1945,7 +1923,7 @@ int migrate_pages(struct list_head *from |
| enum migrate_mode mode, int reason, unsigned int *ret_succeeded) |
| { |
| int rc, rc_gather; |
| - int nr_pages; |
| + int nr_pages, batch; |
| struct folio *folio, *folio2; |
| LIST_HEAD(folios); |
| LIST_HEAD(ret_folios); |
| @@ -1959,6 +1937,11 @@ int migrate_pages(struct list_head *from |
| mode, reason, &stats, &ret_folios); |
| if (rc_gather < 0) |
| goto out; |
| + |
| + if (mode == MIGRATE_ASYNC) |
| + batch = NR_MAX_BATCHED_MIGRATION; |
| + else |
| + batch = 1; |
| again: |
| nr_pages = 0; |
| list_for_each_entry_safe(folio, folio2, from, lru) { |
| @@ -1969,11 +1952,11 @@ again: |
| } |
| |
| nr_pages += folio_nr_pages(folio); |
| - if (nr_pages > NR_MAX_BATCHED_MIGRATION) |
| + if (nr_pages >= batch) |
| break; |
| } |
| - if (nr_pages > NR_MAX_BATCHED_MIGRATION) |
| - list_cut_before(&folios, from, &folio->lru); |
| + if (nr_pages >= batch) |
| + list_cut_before(&folios, from, &folio2->lru); |
| else |
| list_splice_init(from, &folios); |
| rc = migrate_pages_batch(&folios, get_new_page, put_new_page, private, |
| _ |