| From bf504110bc8aa05df48b0e5f0aa84bfb81e0574b Mon Sep 17 00:00:00 2001 |
| From: Filipe Manana <fdmanana@suse.com> |
| Date: Mon, 4 Mar 2019 14:06:12 +0000 |
| Subject: Btrfs: fix incorrect file size after shrinking truncate and fsync |
| |
| From: Filipe Manana <fdmanana@suse.com> |
| |
| commit bf504110bc8aa05df48b0e5f0aa84bfb81e0574b upstream. |
| |
| If we do a shrinking truncate against an inode which is already present |
| in the respective log tree and then rename it, as part of logging the new |
| name we end up logging an inode item that reflects the old size of the |
| file (the one which we previously logged) and not the new smaller size. |
| The decision to preserve the size previously logged was added by commit |
| 1a4bcf470c886b ("Btrfs: fix fsync data loss after adding hard link to |
| inode") in order to avoid data loss after replaying the log. However that |
| decision is only needed for the case the logged inode size is smaller then |
| the current size of the inode, as explained in that commit's change log. |
| If the current size of the inode is smaller then the previously logged |
| size, we know a shrinking truncate happened and therefore need to use |
| that smaller size. |
| |
| Example to trigger the problem: |
| |
| $ mkfs.btrfs -f /dev/sdb |
| $ mount /dev/sdb /mnt |
| |
| $ xfs_io -f -c "pwrite -S 0xab 0 8000" /mnt/foo |
| $ xfs_io -c "fsync" /mnt/foo |
| $ xfs_io -c "truncate 3000" /mnt/foo |
| |
| $ mv /mnt/foo /mnt/bar |
| $ xfs_io -c "fsync" /mnt/bar |
| |
| <power failure> |
| |
| $ mount /dev/sdb /mnt |
| $ od -t x1 -A d /mnt/bar |
| 0000000 ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab ab |
| * |
| 0008000 |
| |
| Once we rename the file, we log its name (and inode item), and because |
| the inode was already logged before in the current transaction, we log it |
| with a size of 8000 bytes because that is the size we previously logged |
| (with the first fsync). As part of the rename, besides logging the inode, |
| we do also sync the log, which is done since commit d4682ba03ef618 |
| ("Btrfs: sync log after logging new name"), so the next fsync against our |
| inode is effectively a no-op, since no new changes happened since the |
| rename operation. Even if did not sync the log during the rename |
| operation, the same problem (fize size of 8000 bytes instead of 3000 |
| bytes) would be visible after replaying the log if the log ended up |
| getting synced to disk through some other means, such as for example by |
| fsyncing some other modified file. In the example above the fsync after |
| the rename operation is there just because not every filesystem may |
| guarantee logging/journalling the inode (and syncing the log/journal) |
| during the rename operation, for example it is needed for f2fs, but not |
| for ext4 and xfs. |
| |
| Fix this scenario by, when logging a new name (which is triggered by |
| rename and link operations), using the current size of the inode instead |
| of the previously logged inode size. |
| |
| A test case for fstests follows soon. |
| |
| Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=202695 |
| CC: stable@vger.kernel.org # 4.4+ |
| Reported-by: Seulbae Kim <seulbae@gatech.edu> |
| Signed-off-by: Filipe Manana <fdmanana@suse.com> |
| Signed-off-by: David Sterba <dsterba@suse.com> |
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| |
| --- |
| fs/btrfs/tree-log.c | 13 +++++++++++++ |
| 1 file changed, 13 insertions(+) |
| |
| --- a/fs/btrfs/tree-log.c |
| +++ b/fs/btrfs/tree-log.c |
| @@ -4501,6 +4501,19 @@ static int logged_inode_size(struct btrf |
| item = btrfs_item_ptr(path->nodes[0], path->slots[0], |
| struct btrfs_inode_item); |
| *size_ret = btrfs_inode_size(path->nodes[0], item); |
| + /* |
| + * If the in-memory inode's i_size is smaller then the inode |
| + * size stored in the btree, return the inode's i_size, so |
| + * that we get a correct inode size after replaying the log |
| + * when before a power failure we had a shrinking truncate |
| + * followed by addition of a new name (rename / new hard link). |
| + * Otherwise return the inode size from the btree, to avoid |
| + * data loss when replaying a log due to previously doing a |
| + * write that expands the inode's size and logging a new name |
| + * immediately after. |
| + */ |
| + if (*size_ret > inode->vfs_inode.i_size) |
| + *size_ret = inode->vfs_inode.i_size; |
| } |
| |
| btrfs_release_path(path); |