| From: Roman Pen <roman.penyaev@profitbricks.com> |
| Date: Tue, 26 Apr 2016 13:15:35 +0200 |
| Subject: workqueue: fix ghost PENDING flag while doing MQ IO |
| |
| commit 346c09f80459a3ad97df1816d6d606169a51001a upstream. |
| |
| A bug in the workqueue code leads to a stalled IO request in the MQ ctx->rq_list |
| with the following backtrace: |
| |
| [ 601.347452] INFO: task kworker/u129:5:1636 blocked for more than 120 seconds. |
| [ 601.347574] Tainted: G O 4.4.5-1-storage+ #6 |
| [ 601.347651] "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. |
| [ 601.348142] kworker/u129:5 D ffff880803077988 0 1636 2 0x00000000 |
| [ 601.348519] Workqueue: ibnbd_server_fileio_wq ibnbd_dev_file_submit_io_worker [ibnbd_server] |
| [ 601.348999] ffff880803077988 ffff88080466b900 ffff8808033f9c80 ffff880803078000 |
| [ 601.349662] ffff880807c95000 7fffffffffffffff ffffffff815b0920 ffff880803077ad0 |
| [ 601.350333] ffff8808030779a0 ffffffff815b01d5 0000000000000000 ffff880803077a38 |
| [ 601.350965] Call Trace: |
| [ 601.351203] [<ffffffff815b0920>] ? bit_wait+0x60/0x60 |
| [ 601.351444] [<ffffffff815b01d5>] schedule+0x35/0x80 |
| [ 601.351709] [<ffffffff815b2dd2>] schedule_timeout+0x192/0x230 |
| [ 601.351958] [<ffffffff812d43f7>] ? blk_flush_plug_list+0xc7/0x220 |
| [ 601.352208] [<ffffffff810bd737>] ? ktime_get+0x37/0xa0 |
| [ 601.352446] [<ffffffff815b0920>] ? bit_wait+0x60/0x60 |
| [ 601.352688] [<ffffffff815af784>] io_schedule_timeout+0xa4/0x110 |
| [ 601.352951] [<ffffffff815b3a4e>] ? _raw_spin_unlock_irqrestore+0xe/0x10 |
| [ 601.353196] [<ffffffff815b093b>] bit_wait_io+0x1b/0x70 |
| [ 601.353440] [<ffffffff815b056d>] __wait_on_bit+0x5d/0x90 |
| [ 601.353689] [<ffffffff81127bd0>] wait_on_page_bit+0xc0/0xd0 |
| [ 601.353958] [<ffffffff81096db0>] ? autoremove_wake_function+0x40/0x40 |
| [ 601.354200] [<ffffffff81127cc4>] __filemap_fdatawait_range+0xe4/0x140 |
| [ 601.354441] [<ffffffff81127d34>] filemap_fdatawait_range+0x14/0x30 |
| [ 601.354688] [<ffffffff81129a9f>] filemap_write_and_wait_range+0x3f/0x70 |
| [ 601.354932] [<ffffffff811ced3b>] blkdev_fsync+0x1b/0x50 |
| [ 601.355193] [<ffffffff811c82d9>] vfs_fsync_range+0x49/0xa0 |
| [ 601.355432] [<ffffffff811cf45a>] blkdev_write_iter+0xca/0x100 |
| [ 601.355679] [<ffffffff81197b1a>] __vfs_write+0xaa/0xe0 |
| [ 601.355925] [<ffffffff81198379>] vfs_write+0xa9/0x1a0 |
| [ 601.356164] [<ffffffff811c59d8>] kernel_write+0x38/0x50 |
| |
| The underlying device is a null_blk, with default parameters: |
| |
| queue_mode = MQ |
| submit_queues = 1 |
| |
| Verification that nullb0 has something inflight: |
| |
| root@pserver8:~# cat /sys/block/nullb0/inflight |
| 0 1 |
| root@pserver8:~# find /sys/block/nullb0/mq/0/cpu* -name rq_list -print -exec cat {} \; |
| ... |
| /sys/block/nullb0/mq/0/cpu2/rq_list |
| CTX pending: |
| ffff8838038e2400 |
| ... |
| |
| During debugging it became clear that the stalled request is always inserted into |
| the rq_list from the following path: |
| |
| save_stack_trace_tsk + 34 |
| blk_mq_insert_requests + 231 |
| blk_mq_flush_plug_list + 281 |
| blk_flush_plug_list + 199 |
| wait_on_page_bit + 192 |
| __filemap_fdatawait_range + 228 |
| filemap_fdatawait_range + 20 |
| filemap_write_and_wait_range + 63 |
| blkdev_fsync + 27 |
| vfs_fsync_range + 73 |
| blkdev_write_iter + 202 |
| __vfs_write + 170 |
| vfs_write + 169 |
| kernel_write + 56 |
| |
| So blk_flush_plug_list() was called with from_schedule == true. |
| |
| When from_schedule is true, blk_mq_insert_requests() eventually offloads |
| execution of __blk_mq_run_hw_queue() to the kblockd workqueue, i.e. it calls |
| kblockd_schedule_delayed_work_on(). |
| |
| That means we race with another CPU, which is about to execute the |
| __blk_mq_run_hw_queue() work. |
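| |
| The trace below relies on the contract of the work-queueing path: a work item |
| is guarded by a single PENDING bit, and only the caller that flips that bit |
| from 0 to 1 actually queues the work; a caller that finds the bit already set |
| gets 0 back and has to rely on the queued (or running) work to pick up its |
| update. That is what "returns 1" / "returns 0" below refers to. A minimal |
| user-space sketch of this contract (hypothetical names, C11 atomics standing |
| in for the kernel bitops): |
| |
| #include <stdatomic.h> |
| #include <stdbool.h> |
| #include <stdio.h> |
| |
| static atomic_int work_pending;         /* models WORK_STRUCT_PENDING_BIT */ |
| |
| /* models the return value of queue_work_on() / kblockd_schedule...() */ |
| static bool queue_work_sketch(void) |
| { |
|         /* test-and-set of PENDING: old value 0 means "we queued the work" */ |
|         return atomic_exchange(&work_pending, 1) == 0; |
| } |
| |
| int main(void) |
| { |
|         printf("%d\n", queue_work_sketch()); /* 1: this caller queued the work */ |
|         printf("%d\n", queue_work_sketch()); /* 0: already pending, not queued */ |
|         return 0; |
| } |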
| |
| Further debugging shows the following traces from different CPUs: |
| |
| CPU#0                                  CPU#1 |
| ----------------------------------     ------------------------------- |
| request A inserted |
| STORE hctx->ctx_map[0] bit marked |
| kblockd_schedule...() returns 1 |
| <schedule to kblockd workqueue> |
|                                        request B inserted |
|                                        STORE hctx->ctx_map[1] bit marked |
|                                        kblockd_schedule...() returns 0 |
| *** WORK PENDING bit is cleared *** |
| flush_busy_ctxs() is executed, but |
| bit 1, set by CPU#1, is not observed |
| |
| As a result, request B is left pending forever. |
| |
| This behaviour is explained by a speculative LOAD of hctx->ctx_map on CPU#0, |
| which is reordered with the clearing of the PENDING bit and thus executed |
| _before_ CPU#1's STORE of bit 1 becomes visible. |
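| |
| The same window can be written down as a small C11 model (a user-space sketch |
| with hypothetical names; the relaxed atomics stand in for the plain kernel |
| STOREs/LOADs, and the exchange for the test_and_set_bit() done by the |
| queueing path): |
| |
| #include <stdatomic.h> |
| |
| static atomic_int pending;     /* WORK_STRUCT_PENDING_BIT              */ |
| static atomic_int ctx_map;     /* the hctx->ctx_map bit for request B  */ |
| |
| /* CPU#1: insert request B and try to kick the hw queue */ |
| void cpu1_insert_request_b(void) |
| { |
|         atomic_store_explicit(&ctx_map, 1, memory_order_relaxed); /* STORE bit 1 */ |
|         if (atomic_exchange(&pending, 1) != 0) { |
|                 /* "returns 0": PENDING looked set, so CPU#1 relies on the |
|                  * already-scheduled work to observe ctx_map */ |
|         } |
| } |
| |
| /* CPU#0: the work function, right after the PENDING bit was cleared */ |
| void cpu0_run_hw_queue_work(void) |
| { |
|         atomic_store_explicit(&pending, 0, memory_order_relaxed); /* clear PENDING */ |
|         /* no full barrier here: the LOAD below may be satisfied before the |
|          * clear above becomes visible, i.e. before CPU#1's STORE of bit 1 |
|          * is seen */ |
|         if (atomic_load_explicit(&ctx_map, memory_order_relaxed) == 0) |
|                 return;        /* flush_busy_ctxs() sees no bit: B is missed */ |
|         /* ... dispatch the pending requests ... */ |
| } |
| |
| This is the classic store/load ("store buffering") pattern: each side writes |
| one location and then reads the other, and unless a full barrier separates |
| the write from the read on both sides, both reads can observe the old values. |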
| |
| The proper fix is an explicit full barrier (<mfence> on x86), which guarantees |
| that the clearing of the PENDING bit cannot be reordered with any LOADs or |
| STOREs inside the actual work function. |
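| |
| In the model above the fix is a full fence between the clear of PENDING and |
| the later LOADs, the user-space analogue of the smp_mb() added by this patch |
| (the queueing side already gets its barrier from the test_and_set_bit() in |
| queue_work_on(), so only the work-execution side needs one): |
| |
| #include <stdatomic.h> |
| |
| static atomic_int pending;     /* as in the sketch above */ |
| static atomic_int ctx_map; |
| |
| void cpu0_run_hw_queue_work_fixed(void) |
| { |
|         atomic_store_explicit(&pending, 0, memory_order_relaxed); /* clear PENDING */ |
|         atomic_thread_fence(memory_order_seq_cst);                /* ~ smp_mb()    */ |
|         /* the LOAD below can no longer be hoisted before the clear of PENDING, |
|          * which closes the window shown above */ |
|         if (atomic_load_explicit(&ctx_map, memory_order_relaxed)) { |
|                 /* ... dispatch request B ... */ |
|         } |
| } |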
| |
| Signed-off-by: Roman Pen <roman.penyaev@profitbricks.com> |
| Cc: Gioh Kim <gi-oh.kim@profitbricks.com> |
| Cc: Michael Wang <yun.wang@profitbricks.com> |
| Cc: Tejun Heo <tj@kernel.org> |
| Cc: Jens Axboe <axboe@kernel.dk> |
| Cc: linux-block@vger.kernel.org |
| Cc: linux-kernel@vger.kernel.org |
| Signed-off-by: Tejun Heo <tj@kernel.org> |
| Signed-off-by: Ben Hutchings <ben@decadent.org.uk> |
| --- |
| kernel/workqueue.c | 29 +++++++++++++++++++++++++++++ |
| 1 file changed, 29 insertions(+) |
| |
| --- a/kernel/workqueue.c |
| +++ b/kernel/workqueue.c |
| @@ -634,6 +634,35 @@ static void set_work_pool_and_clear_pend |
| */ |
| smp_wmb(); |
| set_work_data(work, (unsigned long)pool_id << WORK_OFFQ_POOL_SHIFT, 0); |
| + /* |
| + * The following mb guarantees that previous clear of a PENDING bit |
| + * will not be reordered with any speculative LOADS or STORES from |
| + * work->current_func, which is executed afterwards. This possible |
| + * reordering can lead to a missed execution on attempt to queue |
| + * the same @work. E.g. consider this case: |
| + * |
| + *   CPU#0                          CPU#1 |
| + *   ----------------------------   -------------------------------- |
| + * |
| + * 1  STORE event_indicated |
| + * 2  queue_work_on() { |
| + * 3    test_and_set_bit(PENDING) |
| + * 4  }                             set_..._and_clear_pending() { |
| + * 5                                  set_work_data() # clear bit |
| + * 6                                  smp_mb() |
| + * 7                                  work->current_func() { |
| + * 8                                    LOAD event_indicated |
| + *                                    } |
| + * |
| + * Without an explicit full barrier speculative LOAD on line 8 can |
| + * be executed before CPU#0 does STORE on line 1. If that happens, |
| + * CPU#0 observes the PENDING bit is still set and new execution of |
| + * a @work is not queued in the hope that CPU#1 will eventually |
| + * finish the queued @work. Meanwhile CPU#1 does not see |
| + * event_indicated is set, because speculative LOAD was executed |
| + * before actual STORE. |
| + */ |
| + smp_mb(); |
| } |
| |
| static void clear_work_data(struct work_struct *work) |