releases/4.9.119/btrfs-fix-file-data-corruption-after-cloning-a-range-and-fsync.patch - pub/scm/linux/kernel/git/stable/stable-queue - Git at Google

 From bd3599a0e142cd73edd3b6801068ac3f48ac771a Mon Sep 17 00:00:00 2001
 From: Filipe Manana <fdmanana@suse.com>
 Date: Thu, 12 Jul 2018 01:36:43 +0100
 Subject: Btrfs: fix file data corruption after cloning a range and fsync

 From: Filipe Manana <fdmanana@suse.com>

 commit bd3599a0e142cd73edd3b6801068ac3f48ac771a upstream.

 When we clone a range into a file we can end up dropping existing
 extent maps (or trimming them) and replacing them with new ones if the
 range to be cloned overlaps with a range in the destination inode.
 When that happens we add the new extent maps to the list of modified
 extents in the inode's extent map tree, so that a "fast" fsync (the flag
 BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
 and log corresponding extent items. However, at the end of range cloning
 operation we do truncate all the pages in the affected range (in order to
 ensure future reads will not get stale data). Sometimes this truncation
 will release the corresponding extent maps besides the pages from the page
 cache. If this happens, then a "fast" fsync operation will miss logging
 some extent items, because it relies exclusively on the extent maps being
 present in the inode's extent tree, leading to data loss/corruption if
 the fsync ends up using the same transaction used by the clone operation
 (that transaction was not committed in the meanwhile). An extent map is
 released through the callback btrfs_invalidatepage(), which gets called by
 truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
 later ends up calling try_release_extent_mapping() which will release the
 extent map if some conditions are met, like the file size being greater
 than 16Mb, gfp flags allow blocking and the range not being locked (which
 is the case during the clone operation) nor being the extent map flagged
 as pinned (also the case for cloning).

 The following example, turned into a test for fstests, reproduces the
 issue:

   $ mkfs.btrfs -f /dev/sdb
   $ mount /dev/sdb /mnt

   $ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
   $ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar

   $ xfs_io -c "fsync" /mnt/bar
   # reflink destination offset corresponds to the size of file bar,
   # 2728Kb minus 4Kb.
   $ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
   $ xfs_io -c "fsync" /mnt/bar

   $ md5sum /mnt/bar
   95a95813a8c2abc9aa75a6c2914a077e  /mnt/bar

   <power fail>

   $ mount /dev/sdb /mnt
   $ md5sum /mnt/bar
   207fd8d0b161be8a84b945f0df8d5f8d  /mnt/bar
   # digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
   # power failure

 In the above example, the destination offset of the clone operation
 corresponds to the size of the "bar" file minus 4Kb. So during the clone
 operation, the extent map covering the range from 2572Kb to 2728Kb gets
 trimmed so that it ends at offset 2724Kb, and a new extent map covering
 the range from 2724Kb to 11724Kb is created. So at the end of the clone
 operation when we ask to truncate the pages in the range from 2724Kb to
 2724Kb + 15908Kb, the page invalidation callback ends up removing the new
 extent map (through try_release_extent_mapping()) when the page at offset
 2724Kb is passed to that callback.

 Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
 map is removed at try_release_extent_mapping(), forcing the next fsync to
 search for modified extents in the fs/subvolume tree instead of relying on
 the presence of extent maps in memory. This way we can continue doing a
 "fast" fsync if the destination range of a clone operation does not
 overlap with an existing range or if any of the criteria necessary to
 remove an extent map at try_release_extent_mapping() is not met (file
 size not bigger then 16Mb or gfp flags do not allow blocking).

 CC: stable@vger.kernel.org # 3.16+
 Signed-off-by: Filipe Manana <fdmanana@suse.com>
 Signed-off-by: David Sterba <dsterba@suse.com>
 Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
 Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 ---
  fs/btrfs/extent_io.c |    3 +++
  1 file changed, 3 insertions(+)

 --- a/fs/btrfs/extent_io.c
 +++ b/fs/btrfs/extent_io.c
 @@ -4298,6 +4298,7 @@ int try_release_extent_mapping(struct ex
  	struct extent_map *em;
  	u64 start = page_offset(page);
  	u64 end = start + PAGE_SIZE - 1;
 +	struct btrfs_inode *btrfs_inode = BTRFS_I(page->mapping->host);

  	if (gfpflags_allow_blocking(mask) &&
  	    page->mapping->host->i_size > SZ_16M) {
 @@ -4320,6 +4321,8 @@ int try_release_extent_mapping(struct ex
  					    extent_map_end(em) - 1,
  					    EXTENT_LOCKED | EXTENT_WRITEBACK,
  					    0, NULL)) {
 +				set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
 +					&btrfs_inode->runtime_flags);
  				remove_extent_mapping(map, em);
  				/* once for the rb tree */
  				free_extent_map(em);
	From bd3599a0e142cd73edd3b6801068ac3f48ac771a Mon Sep 17 00:00:00 2001
	From: Filipe Manana <fdmanana@suse.com>
	Date: Thu, 12 Jul 2018 01:36:43 +0100
	Subject: Btrfs: fix file data corruption after cloning a range and fsync

	From: Filipe Manana <fdmanana@suse.com>

	commit bd3599a0e142cd73edd3b6801068ac3f48ac771a upstream.

	When we clone a range into a file we can end up dropping existing
	extent maps (or trimming them) and replacing them with new ones if the
	range to be cloned overlaps with a range in the destination inode.
	When that happens we add the new extent maps to the list of modified
	extents in the inode's extent map tree, so that a "fast" fsync (the flag
	BTRFS_INODE_NEEDS_FULL_SYNC not set in the inode) will see the extent maps
	and log corresponding extent items. However, at the end of range cloning
	operation we do truncate all the pages in the affected range (in order to
	ensure future reads will not get stale data). Sometimes this truncation
	will release the corresponding extent maps besides the pages from the page
	cache. If this happens, then a "fast" fsync operation will miss logging
	some extent items, because it relies exclusively on the extent maps being
	present in the inode's extent tree, leading to data loss/corruption if
	the fsync ends up using the same transaction used by the clone operation
	(that transaction was not committed in the meanwhile). An extent map is
	released through the callback btrfs_invalidatepage(), which gets called by
	truncate_inode_pages_range(), and it calls __btrfs_releasepage(). The
	later ends up calling try_release_extent_mapping() which will release the
	extent map if some conditions are met, like the file size being greater
	than 16Mb, gfp flags allow blocking and the range not being locked (which
	is the case during the clone operation) nor being the extent map flagged
	as pinned (also the case for cloning).

	The following example, turned into a test for fstests, reproduces the
	issue:

	$ mkfs.btrfs -f /dev/sdb
	$ mount /dev/sdb /mnt

	$ xfs_io -f -c "pwrite -S 0x18 9000K 6908K" /mnt/foo
	$ xfs_io -f -c "pwrite -S 0x20 2572K 156K" /mnt/bar

	$ xfs_io -c "fsync" /mnt/bar
	# reflink destination offset corresponds to the size of file bar,
	# 2728Kb minus 4Kb.
	$ xfs_io -c ""reflink ${SCRATCH_MNT}/foo 0 2724K 15908K" /mnt/bar
	$ xfs_io -c "fsync" /mnt/bar

	$ md5sum /mnt/bar
	95a95813a8c2abc9aa75a6c2914a077e /mnt/bar

	<power fail>

	$ mount /dev/sdb /mnt
	$ md5sum /mnt/bar
	207fd8d0b161be8a84b945f0df8d5f8d /mnt/bar
	# digest should be 95a95813a8c2abc9aa75a6c2914a077e like before the
	# power failure

	In the above example, the destination offset of the clone operation
	corresponds to the size of the "bar" file minus 4Kb. So during the clone
	operation, the extent map covering the range from 2572Kb to 2728Kb gets
	trimmed so that it ends at offset 2724Kb, and a new extent map covering
	the range from 2724Kb to 11724Kb is created. So at the end of the clone
	operation when we ask to truncate the pages in the range from 2724Kb to
	2724Kb + 15908Kb, the page invalidation callback ends up removing the new
	extent map (through try_release_extent_mapping()) when the page at offset
	2724Kb is passed to that callback.

	Fix this by setting the bit BTRFS_INODE_NEEDS_FULL_SYNC whenever an extent
	map is removed at try_release_extent_mapping(), forcing the next fsync to
	search for modified extents in the fs/subvolume tree instead of relying on
	the presence of extent maps in memory. This way we can continue doing a
	"fast" fsync if the destination range of a clone operation does not
	overlap with an existing range or if any of the criteria necessary to
	remove an extent map at try_release_extent_mapping() is not met (file
	size not bigger then 16Mb or gfp flags do not allow blocking).

	CC: stable@vger.kernel.org # 3.16+
	Signed-off-by: Filipe Manana <fdmanana@suse.com>
	Signed-off-by: David Sterba <dsterba@suse.com>
	Signed-off-by: Sudip Mukherjee <sudipm.mukherjee@gmail.com>
	Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
	---
	fs/btrfs/extent_io.c \| 3 +++
	1 file changed, 3 insertions(+)

	--- a/fs/btrfs/extent_io.c
	+++ b/fs/btrfs/extent_io.c
	@@ -4298,6 +4298,7 @@ int try_release_extent_mapping(struct ex
	struct extent_map *em;
	u64 start = page_offset(page);
	u64 end = start + PAGE_SIZE - 1;
	+ struct btrfs_inode *btrfs_inode = BTRFS_I(page->mapping->host);

	if (gfpflags_allow_blocking(mask) &&
	page->mapping->host->i_size > SZ_16M) {
	@@ -4320,6 +4321,8 @@ int try_release_extent_mapping(struct ex
	extent_map_end(em) - 1,
	EXTENT_LOCKED \| EXTENT_WRITEBACK,
	0, NULL)) {
	+ set_bit(BTRFS_INODE_NEEDS_FULL_SYNC,
	+ &btrfs_inode->runtime_flags);
	remove_extent_mapping(map, em);
	/* once for the rb tree */
	free_extent_map(em);