releases/5.4.160/btrfs-do-not-take-the-uuid_mutex-in-btrfs_rm_device.patch - pub/scm/linux/kernel/git/stable/stable-queue - Git at Google

 From cdeec5af7258fbbc4dc60eb86848be7bcb30c9ed Mon Sep 17 00:00:00 2001
 From: Sasha Levin <sashal@kernel.org>
 Date: Tue, 27 Jul 2021 17:01:14 -0400
 Subject: btrfs: do not take the uuid_mutex in btrfs_rm_device

 From: Josef Bacik <josef@toxicpanda.com>

 [ Upstream commit 8ef9dc0f14ba6124c62547a4fdc59b163d8b864e ]

 We got the following lockdep splat while running fstests (specifically
 btrfs/003 and btrfs/020 in a row) with the new rc.  This was uncovered
 by 87579e9b7d8d ("loop: use worker per cgroup instead of kworker") which
 converted loop to using workqueues, which comes with lockdep
 annotations that don't exist with kworkers.  The lockdep splat is as
 follows:

   WARNING: possible circular locking dependency detected
   5.14.0-rc2-custom+ #34 Not tainted
   ------------------------------------------------------
   losetup/156417 is trying to acquire lock:
   ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600

   but task is already holding lock:
   ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]

   which lock already depends on the new lock.

   the existing dependency chain (in reverse order) is:

   -> #5 (&lo->lo_mutex){+.+.}-{3:3}:
 	 __mutex_lock+0xba/0x7c0
 	 lo_open+0x28/0x60 [loop]
 	 blkdev_get_whole+0x28/0xf0
 	 blkdev_get_by_dev.part.0+0x168/0x3c0
 	 blkdev_open+0xd2/0xe0
 	 do_dentry_open+0x163/0x3a0
 	 path_openat+0x74d/0xa40
 	 do_filp_open+0x9c/0x140
 	 do_sys_openat2+0xb1/0x170
 	 __x64_sys_openat+0x54/0x90
 	 do_syscall_64+0x3b/0x90
 	 entry_SYSCALL_64_after_hwframe+0x44/0xae

   -> #4 (&disk->open_mutex){+.+.}-{3:3}:
 	 __mutex_lock+0xba/0x7c0
 	 blkdev_get_by_dev.part.0+0xd1/0x3c0
 	 blkdev_get_by_path+0xc0/0xd0
 	 btrfs_scan_one_device+0x52/0x1f0 [btrfs]
 	 btrfs_control_ioctl+0xac/0x170 [btrfs]
 	 __x64_sys_ioctl+0x83/0xb0
 	 do_syscall_64+0x3b/0x90
 	 entry_SYSCALL_64_after_hwframe+0x44/0xae

   -> #3 (uuid_mutex){+.+.}-{3:3}:
 	 __mutex_lock+0xba/0x7c0
 	 btrfs_rm_device+0x48/0x6a0 [btrfs]
 	 btrfs_ioctl+0x2d1c/0x3110 [btrfs]
 	 __x64_sys_ioctl+0x83/0xb0
 	 do_syscall_64+0x3b/0x90
 	 entry_SYSCALL_64_after_hwframe+0x44/0xae

   -> #2 (sb_writers#11){.+.+}-{0:0}:
 	 lo_write_bvec+0x112/0x290 [loop]
 	 loop_process_work+0x25f/0xcb0 [loop]
 	 process_one_work+0x28f/0x5d0
 	 worker_thread+0x55/0x3c0
 	 kthread+0x140/0x170
 	 ret_from_fork+0x22/0x30

   -> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
 	 process_one_work+0x266/0x5d0
 	 worker_thread+0x55/0x3c0
 	 kthread+0x140/0x170
 	 ret_from_fork+0x22/0x30

   -> #0 ((wq_completion)loop0){+.+.}-{0:0}:
 	 __lock_acquire+0x1130/0x1dc0
 	 lock_acquire+0xf5/0x320
 	 flush_workqueue+0xae/0x600
 	 drain_workqueue+0xa0/0x110
 	 destroy_workqueue+0x36/0x250
 	 __loop_clr_fd+0x9a/0x650 [loop]
 	 lo_ioctl+0x29d/0x780 [loop]
 	 block_ioctl+0x3f/0x50
 	 __x64_sys_ioctl+0x83/0xb0
 	 do_syscall_64+0x3b/0x90
 	 entry_SYSCALL_64_after_hwframe+0x44/0xae

   other info that might help us debug this:
   Chain exists of:
     (wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
    Possible unsafe locking scenario:
 	 CPU0                    CPU1
 	 ----                    ----
     lock(&lo->lo_mutex);
 				 lock(&disk->open_mutex);
 				 lock(&lo->lo_mutex);
     lock((wq_completion)loop0);

    *** DEADLOCK ***
   1 lock held by losetup/156417:
    #0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]

   stack backtrace:
   CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34
   Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
   Call Trace:
    dump_stack_lvl+0x57/0x72
    check_noncircular+0x10a/0x120
    __lock_acquire+0x1130/0x1dc0
    lock_acquire+0xf5/0x320
    ? flush_workqueue+0x84/0x600
    flush_workqueue+0xae/0x600
    ? flush_workqueue+0x84/0x600
    drain_workqueue+0xa0/0x110
    destroy_workqueue+0x36/0x250
    __loop_clr_fd+0x9a/0x650 [loop]
    lo_ioctl+0x29d/0x780 [loop]
    ? __lock_acquire+0x3a0/0x1dc0
    ? update_dl_rq_load_avg+0x152/0x360
    ? lock_is_held_type+0xa5/0x120
    ? find_held_lock.constprop.0+0x2b/0x80
    block_ioctl+0x3f/0x50
    __x64_sys_ioctl+0x83/0xb0
    do_syscall_64+0x3b/0x90
    entry_SYSCALL_64_after_hwframe+0x44/0xae
   RIP: 0033:0x7f645884de6b

 Usually the uuid_mutex exists to protect the fs_devices that map
 together all of the devices that match a specific uuid.  In rm_device
 we're messing with the uuid of a device, so it makes sense to protect
 that here.

 However in doing that it pulls in a whole host of lockdep dependencies,
 as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus
 we end up with the dependency chain under the uuid_mutex being added
 under the normal sb write dependency chain, which causes problems with
 loop devices.

 We don't need the uuid mutex here however.  If we call
 btrfs_scan_one_device() before we scratch the super block we will find
 the fs_devices and not find the device itself and return EBUSY because
 the fs_devices is open.  If we call it after the scratch happens it will
 not appear to be a valid btrfs file system.

 We do not need to worry about other fs_devices modifying operations here
 because we're protected by the exclusive operations locking.

 So drop the uuid_mutex here in order to fix the lockdep splat.

 A more detailed explanation from the discussion:

 We are worried about rm and scan racing with each other, before this
 change we'll zero the device out under the UUID mutex so when scan does
 run it'll make sure that it can go through the whole device scan thing
 without rm messing with us.

 We aren't worried if the scratch happens first, because the result is we
 don't think this is a btrfs device and we bail out.

 The only case we are concerned with is we scratch _after_ scan is able
 to read the superblock and gets a seemingly valid super block, so lets
 consider this case.

 Scan will call device_list_add() with the device we're removing.  We'll
 call find_fsid_with_metadata_uuid() and get our fs_devices for this
 UUID.  At this point we lock the fs_devices->device_list_mutex.  This is
 what protects us in this case, but we have two cases here.

 1. We aren't to the device removal part of the RM.  We found our device,
    and device name matches our path, we go down and we set total_devices
    to our super number of devices, which doesn't affect anything because
    we haven't done the remove yet.

 2. We are past the device removal part, which is protected by the
    device_list_mutex.  Scan doesn't find the device, it goes down and
    does the

    if (fs_devices->opened)
 	   return -EBUSY;

    check and we bail out.

 Nothing about this situation is ideal, but the lockdep splat is real,
 and the fix is safe, tho admittedly a bit scary looking.

 Reviewed-by: Anand Jain <anand.jain@oracle.com>
 Signed-off-by: Josef Bacik <josef@toxicpanda.com>
 Reviewed-by: David Sterba <dsterba@suse.com>
 [ copy more from the discussion ]
 Signed-off-by: David Sterba <dsterba@suse.com>
 Signed-off-by: Sasha Levin <sashal@kernel.org>
 ---
  fs/btrfs/volumes.c | 10 +++++-----
  1 file changed, 5 insertions(+), 5 deletions(-)

 diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
 index 8af92e44deae4..344d18de1f08c 100644
 --- a/fs/btrfs/volumes.c
 +++ b/fs/btrfs/volumes.c
 @@ -2162,8 +2162,11 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
  	u64 num_devices;
  	int ret = 0;

 -	mutex_lock(&uuid_mutex);
 -
 +	/*
 +	 * The device list in fs_devices is accessed without locks (neither
 +	 * uuid_mutex nor device_list_mutex) as it won't change on a mounted
 +	 * filesystem and another device rm cannot run.
 +	 */
  	num_devices = btrfs_num_devices(fs_info);

  	ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1);
 @@ -2207,11 +2210,9 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
  		mutex_unlock(&fs_info->chunk_mutex);
  	}

 -	mutex_unlock(&uuid_mutex);
  	ret = btrfs_shrink_device(device, 0);
  	if (!ret)
  		btrfs_reada_remove_dev(device);
 -	mutex_lock(&uuid_mutex);
  	if (ret)
  		goto error_undo;

 @@ -2293,7 +2294,6 @@ int btrfs_rm_device(struct btrfs_fs_info *fs_info, const char *device_path,
  	}

  out:
 -	mutex_unlock(&uuid_mutex);
  	return ret;

  error_undo:
 --
 2.33.0
	From cdeec5af7258fbbc4dc60eb86848be7bcb30c9ed Mon Sep 17 00:00:00 2001
	From: Sasha Levin <sashal@kernel.org>
	Date: Tue, 27 Jul 2021 17:01:14 -0400
	Subject: btrfs: do not take the uuid_mutex in btrfs_rm_device

	From: Josef Bacik <josef@toxicpanda.com>

	[ Upstream commit 8ef9dc0f14ba6124c62547a4fdc59b163d8b864e ]

	We got the following lockdep splat while running fstests (specifically
	btrfs/003 and btrfs/020 in a row) with the new rc. This was uncovered
	by 87579e9b7d8d ("loop: use worker per cgroup instead of kworker") which
	converted loop to using workqueues, which comes with lockdep
	annotations that don't exist with kworkers. The lockdep splat is as
	follows:

	WARNING: possible circular locking dependency detected
	5.14.0-rc2-custom+ #34 Not tainted
	------------------------------------------------------
	losetup/156417 is trying to acquire lock:
	ffff9c7645b02d38 ((wq_completion)loop0){+.+.}-{0:0}, at: flush_workqueue+0x84/0x600

	but task is already holding lock:
	ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]

	which lock already depends on the new lock.

	the existing dependency chain (in reverse order) is:

	-> #5 (&lo->lo_mutex){+.+.}-{3:3}:
	__mutex_lock+0xba/0x7c0
	lo_open+0x28/0x60 [loop]
	blkdev_get_whole+0x28/0xf0
	blkdev_get_by_dev.part.0+0x168/0x3c0
	blkdev_open+0xd2/0xe0
	do_dentry_open+0x163/0x3a0
	path_openat+0x74d/0xa40
	do_filp_open+0x9c/0x140
	do_sys_openat2+0xb1/0x170
	__x64_sys_openat+0x54/0x90
	do_syscall_64+0x3b/0x90
	entry_SYSCALL_64_after_hwframe+0x44/0xae

	-> #4 (&disk->open_mutex){+.+.}-{3:3}:
	__mutex_lock+0xba/0x7c0
	blkdev_get_by_dev.part.0+0xd1/0x3c0
	blkdev_get_by_path+0xc0/0xd0
	btrfs_scan_one_device+0x52/0x1f0 [btrfs]
	btrfs_control_ioctl+0xac/0x170 [btrfs]
	__x64_sys_ioctl+0x83/0xb0
	do_syscall_64+0x3b/0x90
	entry_SYSCALL_64_after_hwframe+0x44/0xae

	-> #3 (uuid_mutex){+.+.}-{3:3}:
	__mutex_lock+0xba/0x7c0
	btrfs_rm_device+0x48/0x6a0 [btrfs]
	btrfs_ioctl+0x2d1c/0x3110 [btrfs]
	__x64_sys_ioctl+0x83/0xb0
	do_syscall_64+0x3b/0x90
	entry_SYSCALL_64_after_hwframe+0x44/0xae

	-> #2 (sb_writers#11){.+.+}-{0:0}:
	lo_write_bvec+0x112/0x290 [loop]
	loop_process_work+0x25f/0xcb0 [loop]
	process_one_work+0x28f/0x5d0
	worker_thread+0x55/0x3c0
	kthread+0x140/0x170
	ret_from_fork+0x22/0x30

	-> #1 ((work_completion)(&lo->rootcg_work)){+.+.}-{0:0}:
	process_one_work+0x266/0x5d0
	worker_thread+0x55/0x3c0
	kthread+0x140/0x170
	ret_from_fork+0x22/0x30

	-> #0 ((wq_completion)loop0){+.+.}-{0:0}:
	__lock_acquire+0x1130/0x1dc0
	lock_acquire+0xf5/0x320
	flush_workqueue+0xae/0x600
	drain_workqueue+0xa0/0x110
	destroy_workqueue+0x36/0x250
	__loop_clr_fd+0x9a/0x650 [loop]
	lo_ioctl+0x29d/0x780 [loop]
	block_ioctl+0x3f/0x50
	__x64_sys_ioctl+0x83/0xb0
	do_syscall_64+0x3b/0x90
	entry_SYSCALL_64_after_hwframe+0x44/0xae

	other info that might help us debug this:
	Chain exists of:
	(wq_completion)loop0 --> &disk->open_mutex --> &lo->lo_mutex
	Possible unsafe locking scenario:
	CPU0 CPU1
	---- ----
	lock(&lo->lo_mutex);
	lock(&disk->open_mutex);
	lock(&lo->lo_mutex);
	lock((wq_completion)loop0);

	* DEADLOCK *
	1 lock held by losetup/156417:
	#0: ffff9c7647395468 (&lo->lo_mutex){+.+.}-{3:3}, at: __loop_clr_fd+0x41/0x650 [loop]

	stack backtrace:
	CPU: 8 PID: 156417 Comm: losetup Not tainted 5.14.0-rc2-custom+ #34
	Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015
	Call Trace:
	dump_stack_lvl+0x57/0x72
	check_noncircular+0x10a/0x120
	__lock_acquire+0x1130/0x1dc0
	lock_acquire+0xf5/0x320
	? flush_workqueue+0x84/0x600
	flush_workqueue+0xae/0x600
	? flush_workqueue+0x84/0x600
	drain_workqueue+0xa0/0x110
	destroy_workqueue+0x36/0x250
	__loop_clr_fd+0x9a/0x650 [loop]
	lo_ioctl+0x29d/0x780 [loop]
	? __lock_acquire+0x3a0/0x1dc0
	? update_dl_rq_load_avg+0x152/0x360
	? lock_is_held_type+0xa5/0x120
	? find_held_lock.constprop.0+0x2b/0x80
	block_ioctl+0x3f/0x50
	__x64_sys_ioctl+0x83/0xb0
	do_syscall_64+0x3b/0x90
	entry_SYSCALL_64_after_hwframe+0x44/0xae
	RIP: 0033:0x7f645884de6b

	Usually the uuid_mutex exists to protect the fs_devices that map
	together all of the devices that match a specific uuid. In rm_device
	we're messing with the uuid of a device, so it makes sense to protect
	that here.

	However in doing that it pulls in a whole host of lockdep dependencies,
	as we call mnt_may_write() on the sb before we grab the uuid_mutex, thus
	we end up with the dependency chain under the uuid_mutex being added
	under the normal sb write dependency chain, which causes problems with
	loop devices.

	We don't need the uuid mutex here however. If we call
	btrfs_scan_one_device() before we scratch the super block we will find
	the fs_devices and not find the device itself and return EBUSY because
	the fs_devices is open. If we call it after the scratch happens it will
	not appear to be a valid btrfs file system.

	We do not need to worry about other fs_devices modifying operations here
	because we're protected by the exclusive operations locking.

	So drop the uuid_mutex here in order to fix the lockdep splat.

	A more detailed explanation from the discussion:

	We are worried about rm and scan racing with each other, before this
	change we'll zero the device out under the UUID mutex so when scan does
	run it'll make sure that it can go through the whole device scan thing
	without rm messing with us.

	We aren't worried if the scratch happens first, because the result is we
	don't think this is a btrfs device and we bail out.

	The only case we are concerned with is we scratch _after_ scan is able
	to read the superblock and gets a seemingly valid super block, so lets
	consider this case.

	Scan will call device_list_add() with the device we're removing. We'll
	call find_fsid_with_metadata_uuid() and get our fs_devices for this
	UUID. At this point we lock the fs_devices->device_list_mutex. This is
	what protects us in this case, but we have two cases here.

	1. We aren't to the device removal part of the RM. We found our device,
	and device name matches our path, we go down and we set total_devices
	to our super number of devices, which doesn't affect anything because
	we haven't done the remove yet.

	2. We are past the device removal part, which is protected by the
	device_list_mutex. Scan doesn't find the device, it goes down and
	does the

	if (fs_devices->opened)
	return -EBUSY;

	check and we bail out.

	Nothing about this situation is ideal, but the lockdep splat is real,
	and the fix is safe, tho admittedly a bit scary looking.

	Reviewed-by: Anand Jain <anand.jain@oracle.com>
	Signed-off-by: Josef Bacik <josef@toxicpanda.com>
	Reviewed-by: David Sterba <dsterba@suse.com>
	[ copy more from the discussion ]
	Signed-off-by: David Sterba <dsterba@suse.com>
	Signed-off-by: Sasha Levin <sashal@kernel.org>
	---
	fs/btrfs/volumes.c \| 10 +++++-----
	1 file changed, 5 insertions(+), 5 deletions(-)

	diff --git a/fs/btrfs/volumes.c b/fs/btrfs/volumes.c
	index 8af92e44deae4..344d18de1f08c 100644
	--- a/fs/btrfs/volumes.c
	+++ b/fs/btrfs/volumes.c
	@@ -2162,8 +2162,11 @@ int btrfs_rm_device(struct btrfs_fs_info fs_info, const char device_path,
	u64 num_devices;
	int ret = 0;

	- mutex_lock(&uuid_mutex);
	-
	+ /*
	+ * The device list in fs_devices is accessed without locks (neither
	+ * uuid_mutex nor device_list_mutex) as it won't change on a mounted
	+ * filesystem and another device rm cannot run.
	+ */
	num_devices = btrfs_num_devices(fs_info);

	ret = btrfs_check_raid_min_devices(fs_info, num_devices - 1);
	@@ -2207,11 +2210,9 @@ int btrfs_rm_device(struct btrfs_fs_info fs_info, const char device_path,
	mutex_unlock(&fs_info->chunk_mutex);
	}

	- mutex_unlock(&uuid_mutex);
	ret = btrfs_shrink_device(device, 0);
	if (!ret)
	btrfs_reada_remove_dev(device);
	- mutex_lock(&uuid_mutex);
	if (ret)
	goto error_undo;

	@@ -2293,7 +2294,6 @@ int btrfs_rm_device(struct btrfs_fs_info fs_info, const char device_path,
	}

	out:
	- mutex_unlock(&uuid_mutex);
	return ret;

	error_undo:
	--
	2.33.0