| From: Filipe Manana <fdmanana@suse.com> |
| Date: Wed, 28 Nov 2018 14:54:28 +0000 |
| Subject: Btrfs: fix fsync of files with multiple hard links in new directories |
| |
| commit 41bd60676923822de1df2c50b3f9a10171f4338a upstream. |
| |
| The log tree has a long standing problem that when a file is fsync'ed we |
| only check for new ancestors, created in the current transaction, by |
| following only the hard link for which the fsync was issued. We follow the |
| ancestors using the VFS' dget_parent() API. This means that if we create a |
| new link for a file in a directory that is new (or in an any other new |
| ancestor directory) and then fsync the file using an old hard link, we end |
| up not logging the new ancestor, and on log replay that new hard link and |
| ancestor do not exist. In some cases, involving renames, the file will not |
| exist at all. |
| |
| Example: |
| |
| mkfs.btrfs -f /dev/sdb |
| mount /dev/sdb /mnt |
| |
| mkdir /mnt/A |
| touch /mnt/foo |
| ln /mnt/foo /mnt/A/bar |
| xfs_io -c fsync /mnt/foo |
| |
| <power failure> |
| |
| In this example after log replay only the hard link named 'foo' exists |
| and directory A does not exist, which is unexpected. In other major linux |
| filesystems, such as ext4, xfs and f2fs for example, both hard links exist |
| and so does directory A after mounting again the filesystem. |
| |
| Checking if any new ancestors are new and need to be logged was added in |
| 2009 by commit 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes"), |
| however only for the ancestors of the hard link (dentry) for which the |
| fsync was issued, instead of checking for all ancestors for all of the |
| inode's hard links. |
| |
| So fix this by tracking the id of the last transaction where a hard link |
| was created for an inode and then on fsync fallback to a full transaction |
| commit when an inode has more than one hard link and at least one new hard |
| link was created in the current transaction. This is the simplest solution |
| since this is not a common use case (adding frequently hard links for |
| which there's an ancestor created in the current transaction and then |
| fsync the file). In case it ever becomes a common use case, a solution |
| that consists of iterating the fs/subvol btree for each hard link and |
| check if any ancestor is new, could be implemented. |
| |
| This solves many unexpected scenarios reported by Jayashree Mohan and |
| Vijay Chidambaram, and for which there is a new test case for fstests |
| under review. |
| |
| Fixes: 12fcfd22fe5b ("Btrfs: tree logging unlink/rename fixes") |
| Reported-by: Vijay Chidambaram <vvijay03@gmail.com> |
| Reported-by: Jayashree Mohan <jayashree2912@gmail.com> |
| Signed-off-by: Filipe Manana <fdmanana@suse.com> |
| Signed-off-by: David Sterba <dsterba@suse.com> |
| [bwh: Backported to 3.16: |
| - In btrfs_log_inode_parent(), inode is a struct inode pointer not a |
| struct btrfs_inode pointer |
| - Adjust context] |
| Signed-off-by: Ben Hutchings <ben@decadent.org.uk> |
| --- |
| fs/btrfs/btrfs_inode.h | 6 ++++++ |
| fs/btrfs/inode.c | 17 +++++++++++++++++ |
| fs/btrfs/tree-log.c | 16 ++++++++++++++++ |
| 3 files changed, 39 insertions(+) |
| |
| --- a/fs/btrfs/btrfs_inode.h |
| +++ b/fs/btrfs/btrfs_inode.h |
| @@ -144,6 +144,12 @@ struct btrfs_inode { |
| u64 last_unlink_trans; |
| |
| /* |
| + * Track the transaction id of the last transaction used to create a |
| + * hard link for the inode. This is used by the log tree (fsync). |
| + */ |
| + u64 last_link_trans; |
| + |
| + /* |
| * Number of bytes outstanding that are going to need csums. This is |
| * used in ENOSPC accounting. |
| */ |
| --- a/fs/btrfs/inode.c |
| +++ b/fs/btrfs/inode.c |
| @@ -3561,6 +3561,21 @@ cache_index: |
| * inode is not a directory, logging its parent unnecessarily. |
| */ |
| BTRFS_I(inode)->last_unlink_trans = BTRFS_I(inode)->last_trans; |
| + /* |
| + * Similar reasoning for last_link_trans, needs to be set otherwise |
| + * for a case like the following: |
| + * |
| + * mkdir A |
| + * touch foo |
| + * ln foo A/bar |
| + * echo 2 > /proc/sys/vm/drop_caches |
| + * fsync foo |
| + * <power failure> |
| + * |
| + * Would result in link bar and directory A not existing after the power |
| + * failure. |
| + */ |
| + BTRFS_I(inode)->last_link_trans = BTRFS_I(inode)->last_trans; |
| |
| path->slots[0]++; |
| if (inode->i_nlink != 1 || |
| @@ -6183,6 +6198,7 @@ static int btrfs_link(struct dentry *old |
| if (err) |
| goto fail; |
| } |
| + BTRFS_I(inode)->last_link_trans = trans->transid; |
| d_instantiate(dentry, inode); |
| btrfs_log_new_name(trans, inode, NULL, parent); |
| } |
| @@ -8250,6 +8266,7 @@ struct inode *btrfs_alloc_inode(struct s |
| ei->index_cnt = (u64)-1; |
| ei->dir_index = 0; |
| ei->last_unlink_trans = 0; |
| + ei->last_link_trans = 0; |
| ei->last_log_commit = 0; |
| |
| spin_lock_init(&ei->lock); |
| --- a/fs/btrfs/tree-log.c |
| +++ b/fs/btrfs/tree-log.c |
| @@ -4430,6 +4430,22 @@ static int btrfs_log_inode_parent(struct |
| goto end_trans; |
| } |
| |
| + /* |
| + * If a new hard link was added to the inode in the current transaction |
| + * and its link count is now greater than 1, we need to fallback to a |
| + * transaction commit, otherwise we can end up not logging all its new |
| + * parents for all the hard links. Here just from the dentry used to |
| + * fsync, we can not visit the ancestor inodes for all the other hard |
| + * links to figure out if any is new, so we fallback to a transaction |
| + * commit (instead of adding a lot of complexity of scanning a btree, |
| + * since this scenario is not a common use case). |
| + */ |
| + if (inode->i_nlink > 1 && |
| + BTRFS_I(inode)->last_link_trans > last_committed) { |
| + ret = -EMLINK; |
| + goto end_trans; |
| + } |
| + |
| inode_only = LOG_INODE_EXISTS; |
| while (1) { |
| if (!parent || !parent->d_inode || sb != parent->d_inode->i_sb) |