release/5.2.35/Btrfs-only-associate-the-locked-page-with-one-async_.patch (pub/scm/linux/kernel/git/paulg/longterm-queue-5.2)

 From 3d214a4fb55c279617c046f97c45159646b81c4f Mon Sep 17 00:00:00 2001
 From: Chris Mason <clm@fb.com>
 Date: Wed, 10 Jul 2019 12:28:16 -0700
 Subject: [PATCH] Btrfs: only associate the locked page with one async_chunk
  struct

 commit 1d53c9e6723022b12e4a5ed4b141f67c834b7f6f upstream.

 The btrfs writepages function collects a large range of pages flagged
 for delayed allocation, and then sends them down through the COW code
 for processing.  When compression is on, we allocate one async_chunk
 structure for every 512K, and then run those pages through the
 compression code for IO submission.

 writepages starts all of this off with a single page, locked by the
 original call to extent_write_cache_pages(), and it's important to keep
 track of this page because it has already been through
 clear_page_dirty_for_io().

 The btrfs async_chunk struct has a pointer to the locked_page, and when
 we're redirtying the page because compression had to fall back to
 uncompressed IO, we use page->index to decide if a given async_chunk
 struct really owns that page.

 But this is racy.  If a given delalloc range is broken up into two
 async_chunks (chunkA and chunkB), we can end up with something like
 this:

  compress_file_range(chunkA)
  submit_compressed_extents(chunkA)
  submit compressed bios(chunkA)
  put_page(locked_page)

 				 compress_file_range(chunkB)
 				 ...

 Or:

  async_cow_submit
   submit_compressed_extents <--- falls back to buffered writeout
    cow_file_range
     extent_clear_unlock_delalloc
      __process_pages_contig
        put_page(locked_page)

 					    async_cow_submit

 The end result is that chunkA is completed and cleaned up before chunkB
 even starts processing.  This means we can free locked_page and reuse
 it elsewhere.  If we get really lucky, it'll have the same page->index
 in its new home as it did before.

 While we're processing chunkB, we might decide we need to fall back to
 uncompressed IO, and so compress_file_range() will call
 __set_page_dirty_nobuffers() on chunkB->locked_page.

 Without cgroups in use, this creates a phantom dirty page, which isn't
 great but isn't the end of the world.  What can happen is that the page
 goes through the fixup worker and the whole COW machinery again:

 in submit_compressed_extents():
   while (async extents) {
   ...
     cow_file_range
     if (!page_started ...)
       extent_write_locked_range
     else if (...)
       unlock_page
     continue;

 This hasn't been observed in practice but is still possible.

 With cgroups in use, we might crash in the accounting code because
 page->mapping->i_wb isn't set.

   BUG: unable to handle kernel NULL pointer dereference at 00000000000000d0
   IP: percpu_counter_add_batch+0x11/0x70
   PGD 66534e067 P4D 66534e067 PUD 66534f067 PMD 0
   Oops: 0000 [#1] SMP DEBUG_PAGEALLOC
   CPU: 16 PID: 2172 Comm: rm Not tainted
   RIP: 0010:percpu_counter_add_batch+0x11/0x70
   RSP: 0018:ffffc9000a97bbe0 EFLAGS: 00010286
   RAX: 0000000000000005 RBX: 0000000000000090 RCX: 0000000000026115
   RDX: 0000000000000030 RSI: ffffffffffffffff RDI: 0000000000000090
   RBP: 0000000000000000 R08: fffffffffffffff5 R09: 0000000000000000
   R10: 00000000000260c0 R11: ffff881037fc26c0 R12: ffffffffffffffff
   R13: ffff880fe4111548 R14: ffffc9000a97bc90 R15: 0000000000000001
   FS:  00007f5503ced480(0000) GS:ffff880ff7200000(0000) knlGS:0000000000000000
   CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
   CR2: 00000000000000d0 CR3: 00000001e0459005 CR4: 0000000000360ee0
   DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
   DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
   Call Trace:
    account_page_cleaned+0x15b/0x1f0
    __cancel_dirty_page+0x146/0x200
    truncate_cleanup_page+0x92/0xb0
    truncate_inode_pages_range+0x202/0x7d0
    btrfs_evict_inode+0x92/0x5a0
    evict+0xc1/0x190
    do_unlinkat+0x176/0x280
    do_syscall_64+0x63/0x1a0
    entry_SYSCALL_64_after_hwframe+0x42/0xb7

 The fix here is to make async_chunk->locked_page NULL everywhere but the
 one async_chunk struct that's allowed to do things to the locked page.

 Link: https://lore.kernel.org/linux-btrfs/c2419d01-5c84-3fb4-189e-4db519d08796@suse.com/
 Fixes: 771ed689d2cd ("Btrfs: Optimize compressed writeback and reads")
 Reviewed-by: Josef Bacik <josef@toxicpanda.com>
 Signed-off-by: Chris Mason <clm@fb.com>
 [ update changelog from mail thread discussion ]
 Signed-off-by: David Sterba <dsterba@suse.com>
 Signed-off-by: Paul Gortmaker <paul.gortmaker@windriver.com>

 diff --git a/fs/btrfs/extent_io.c b/fs/btrfs/extent_io.c
 index 927bb4ce3fa1..5be8b876990f 100644
 --- a/fs/btrfs/extent_io.c
 +++ b/fs/btrfs/extent_io.c
 @@ -1838,7 +1838,7 @@ static int __process_pages_contig(struct address_space *mapping,
  			if (page_ops & PAGE_SET_PRIVATE2)
  				SetPagePrivate2(pages[i]);

 -			if (pages[i] == locked_page) {
 +			if (locked_page && pages[i] == locked_page) {
  				put_page(pages[i]);
  				pages_locked++;
  				continue;
 diff --git a/fs/btrfs/inode.c b/fs/btrfs/inode.c
 index 047188759cc4..25515b097125 100644
 --- a/fs/btrfs/inode.c
 +++ b/fs/btrfs/inode.c
 @@ -701,10 +701,12 @@ static noinline void compress_file_range(struct async_chunk *async_chunk,
  	 * to our extent and set things up for the async work queue to run
  	 * cow_file_range to do the normal delalloc dance.
  	 */
 -	if (page_offset(async_chunk->locked_page) >= start &&
 -	    page_offset(async_chunk->locked_page) <= end)
 +	if (async_chunk->locked_page &&
 +	    (page_offset(async_chunk->locked_page) >= start &&
 +	     page_offset(async_chunk->locked_page) <= end)) {
  		__set_page_dirty_nobuffers(async_chunk->locked_page);
  		/* unlocked later on in the async handlers */
 +	}

  	if (redirty)
  		extent_range_redirty_for_io(inode, start, end);
 @@ -794,7 +796,7 @@ static noinline void submit_compressed_extents(struct async_chunk *async_chunk)
  						  async_extent->start +
  						  async_extent->ram_size - 1,
  						  WB_SYNC_ALL);
 -			else if (ret)
 +			else if (ret && async_chunk->locked_page)
  				unlock_page(async_chunk->locked_page);
  			kfree(async_extent);
  			cond_resched();
 @@ -1271,10 +1273,25 @@ static int cow_file_range_async(struct inode *inode, struct page *locked_page,
  		async_chunk[i].inode = inode;
  		async_chunk[i].start = start;
  		async_chunk[i].end = cur_end;
 -		async_chunk[i].locked_page = locked_page;
  		async_chunk[i].write_flags = write_flags;
  		INIT_LIST_HEAD(&async_chunk[i].extents);

 +		/*
 +		 * The locked_page comes all the way from writepage and it's
 +		 * the original page we were actually given.  As we spread
 +		 * this large delalloc region across multiple async_chunk
 +		 * structs, only the first struct needs a pointer to locked_page.
 +		 *
 +		 * This way we don't need racy decisions about who is supposed
 +		 * to unlock it.
 +		 */
 +		if (locked_page) {
 +			async_chunk[i].locked_page = locked_page;
 +			locked_page = NULL;
 +		} else {
 +			async_chunk[i].locked_page = NULL;
 +		}
 +
  		btrfs_init_work(&async_chunk[i].work, async_cow_start,
  				async_cow_submit, async_cow_free);

 --
 2.7.4
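
 The ownership hand-off in the cow_file_range_async() hunk above can be
 modelled in userspace roughly as follows.  This is only a sketch under
 stated assumptions: struct page and struct async_chunk here are reduced,
 hypothetical stand-ins rather than the kernel definitions, and
 split_range(), count_owners() and CHUNK_SIZE are names invented for the
 illustration.  It shows that once locked_page is handed to the first
 chunk and then cleared, at most one chunk can ever act on it.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for the kernel types; not the real definitions. */
struct page {
	unsigned long index;
};

struct async_chunk {
	unsigned long start;
	unsigned long end;
	struct page *locked_page;	/* non-NULL in at most one chunk */
};

/* One async_chunk per 512K of delalloc range, as the changelog describes. */
#define CHUNK_SIZE (512UL * 1024)

/*
 * Split the inclusive byte range [start, end] into chunks and give
 * locked_page to the first chunk only, mirroring the patched loop.
 * Returns the number of chunks filled in.
 */
static size_t split_range(struct async_chunk *chunks, size_t max_chunks,
			  unsigned long start, unsigned long end,
			  struct page *locked_page)
{
	size_t i = 0;

	assert(start <= end);
	while (i < max_chunks) {
		unsigned long cur_end = end;

		if (end - start >= CHUNK_SIZE)
			cur_end = start + CHUNK_SIZE - 1;

		chunks[i].start = start;
		chunks[i].end = cur_end;
		/* First iteration stores the page, later ones store NULL. */
		chunks[i].locked_page = locked_page;
		locked_page = NULL;
		i++;

		if (cur_end == end)
			break;
		start = cur_end + 1;
	}
	return i;
}

/* Count how many chunks would touch the locked page. */
static size_t count_owners(const struct async_chunk *chunks, size_t n)
{
	size_t owners = 0;

	for (size_t i = 0; i < n; i++)
		if (chunks[i].locked_page)
			owners++;
	return owners;
}
```

 With a 1.5M range, split_range() fills three chunks and count_owners()
 reports a single owner, so the later chunks have nothing to redirty or
 unlock regardless of the order in which the workers run.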