| From 6a4db2a60306eb65bfb14ccc9fde035b74a4b4e7 Mon Sep 17 00:00:00 2001 |
| From: Zhao Heming <heming.zhao@suse.com> |
| Date: Sat, 3 Apr 2021 11:01:25 +0800 |
| Subject: md: md_open returns -EBUSY when entering racing area |
| |
| From: Zhao Heming <heming.zhao@suse.com> |
| |
| commit 6a4db2a60306eb65bfb14ccc9fde035b74a4b4e7 upstream. |
| |
| commit d3374825ce57 ("md: make devices disappear when they are no longer |
| needed.") introduced protection between mddev creating & removing. The |
| md_open shouldn't create mddev when all_mddevs list doesn't contain |
| mddev. With currently code logic, there will be very easy to trigger |
| soft lockup in non-preempt env. |
| |
| This patch changes md_open returning from -ERESTARTSYS to -EBUSY, which |
| will break the infinitely retry when md_open enter racing area. |
| |
| This patch is partly fix soft lockup issue, full fix needs mddev_find |
| is split into two functions: mddev_find & mddev_find_or_alloc. And |
| md_open should call new mddev_find (it only does searching job). |
| |
| For more detail, please refer with Christoph's "split mddev_find" patch |
| in later commits. |
| |
| *** env *** |
| kvm-qemu VM 2C1G with 2 iscsi luns |
| kernel should be non-preempt |
| |
| *** script *** |
| |
| about trigger every time with below script |
| |
| ``` |
| 1 node1="mdcluster1" |
| 2 node2="mdcluster2" |
| 3 |
| 4 mdadm -Ss |
| 5 ssh ${node2} "mdadm -Ss" |
| 6 wipefs -a /dev/sda /dev/sdb |
| 7 mdadm -CR /dev/md0 -b clustered -e 1.2 -n 2 -l mirror /dev/sda \ |
| /dev/sdb --assume-clean |
| 8 |
| 9 for i in {1..10}; do |
| 10 echo ==== $i ====; |
| 11 |
| 12 echo "test ...." |
| 13 ssh ${node2} "mdadm -A /dev/md0 /dev/sda /dev/sdb" |
| 14 sleep 1 |
| 15 |
| 16 echo "clean ....." |
| 17 ssh ${node2} "mdadm -Ss" |
| 18 done |
| ``` |
| |
| I use mdcluster env to trigger soft lockup, but it isn't mdcluster |
| speical bug. To stop md array in mdcluster env will do more jobs than |
| non-cluster array, which will leave enough time/gap to allow kernel to |
| run md_open. |
| |
| *** stack *** |
| |
| ``` |
| [ 884.226509] mddev_put+0x1c/0xe0 [md_mod] |
| [ 884.226515] md_open+0x3c/0xe0 [md_mod] |
| [ 884.226518] __blkdev_get+0x30d/0x710 |
| [ 884.226520] ? bd_acquire+0xd0/0xd0 |
| [ 884.226522] blkdev_get+0x14/0x30 |
| [ 884.226524] do_dentry_open+0x204/0x3a0 |
| [ 884.226531] path_openat+0x2fc/0x1520 |
| [ 884.226534] ? seq_printf+0x4e/0x70 |
| [ 884.226536] do_filp_open+0x9b/0x110 |
| [ 884.226542] ? md_release+0x20/0x20 [md_mod] |
| [ 884.226543] ? seq_read+0x1d8/0x3e0 |
| [ 884.226545] ? kmem_cache_alloc+0x18a/0x270 |
| [ 884.226547] ? do_sys_open+0x1bd/0x260 |
| [ 884.226548] do_sys_open+0x1bd/0x260 |
| [ 884.226551] do_syscall_64+0x5b/0x1e0 |
| [ 884.226554] entry_SYSCALL_64_after_hwframe+0x44/0xa9 |
| ``` |
| |
| *** rootcause *** |
| |
| "mdadm -A" (or other array assemble commands) will start a daemon "mdadm |
| --monitor" by default. When "mdadm -Ss" is running, the stop action will |
| wakeup "mdadm --monitor". The "--monitor" daemon will immediately get |
| info from /proc/mdstat. This time mddev in kernel still exist, so |
| /proc/mdstat still show md device, which makes "mdadm --monitor" to open |
| /dev/md0. |
| |
| The previously "mdadm -Ss" is removing action, the "mdadm --monitor" |
| open action will trigger md_open which is creating action. Racing is |
| happening. |
| |
| ``` |
| <thread 1>: "mdadm -Ss" |
| md_release |
| mddev_put deletes mddev from all_mddevs |
| queue_work for mddev_delayed_delete |
| at this time, "/dev/md0" is still available for opening |
| |
| <thread 2>: "mdadm --monitor ..." |
| md_open |
| + mddev_find can't find mddev of /dev/md0, and create a new mddev and |
| | return. |
| + trigger "if (mddev->gendisk != bdev->bd_disk)" and return |
| -ERESTARTSYS. |
| ``` |
| |
| In non-preempt kernel, <thread 2> is occupying on current CPU. and |
| mddev_delayed_delete which was created in <thread 1> also can't be |
| schedule. |
| |
| In preempt kernel, it can also trigger above racing. But kernel doesn't |
| allow one thread running on a CPU all the time. after <thread 2> running |
| some time, the later "mdadm -A" (refer above script line 13) will call |
| md_alloc to alloc a new gendisk for mddev. it will break md_open |
| statement "if (mddev->gendisk != bdev->bd_disk)" and return 0 to caller, |
| the soft lockup is broken. |
| |
| Cc: stable@vger.kernel.org |
| Reviewed-by: Christoph Hellwig <hch@lst.de> |
| Signed-off-by: Zhao Heming <heming.zhao@suse.com> |
| Signed-off-by: Song Liu <song@kernel.org> |
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| --- |
| drivers/md/md.c | 3 +-- |
| 1 file changed, 1 insertion(+), 2 deletions(-) |
| |
| diff --git a/drivers/md/md.c b/drivers/md/md.c |
| index 368cad6cd53a..464cca5d5952 100644 |
| --- a/drivers/md/md.c |
| +++ b/drivers/md/md.c |
| @@ -7821,8 +7821,7 @@ static int md_open(struct block_device *bdev, fmode_t mode) |
| /* Wait until bdev->bd_disk is definitely gone */ |
| if (work_pending(&mddev->del_work)) |
| flush_workqueue(md_misc_wq); |
| - /* Then retry the open from the top */ |
| - return -ERESTARTSYS; |
| + return -EBUSY; |
| } |
| BUG_ON(mddev != bdev->bd_disk->private_data); |
| |
| -- |
| 2.31.1 |
| |