xfs_repair: estimate per-AG btree slack better
The slack calculation for per-AG btrees is a bit inaccurate: it only
disables slack space in the new btrees when the amount of free space
in the AG (not counting the btrees) is less than 3/32nds of the AG.
In other words, it assumes that the btrees will fit in roughly 9
percent of the space.
However, there's one scenario where this goes wrong -- if the rmapbt
consumes a significant portion of the AG space. Say a filesystem is
hosting a VM image farm that starts with perfectly shared images. As
time goes by, random writes to those images will slowly cause the rmapbt
to increase in size as blocks within those images get COWed.
Suppose that the rmapbt now consumes 20% of the space in the AG, that
the AG is nearly full, and that the blocks in the old rmapbt are mostly
full. At the start of phase5_func, mk_incore_fstree will report that
num_freeblocks is ~20% of the AG size. Hence the slack calculation
will conclude that there's plenty of space in the AG, and the new
btrees will be built with 25% slack in their blocks. If these new,
expanded btrees are larger than the remaining free space in the AG,
repair will be unable to allocate btree blocks and will abort,
causing severe filesystem damage.
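To make the arithmetic concrete, here is a toy example with made-up
round numbers; the 4/3 factor models btree blocks that are only 75%
full because of the 25% slack:

	#include <stdint.h>
	#include <stdio.h>

	int main(void)
	{
		/* Toy numbers, not from a real filesystem. */
		uint64_t ag_blocks      = 1000000; /* AG size in blocks */
		uint64_t num_freeblocks =  200000; /* == old rmapbt size */
		uint64_t packed_size    =  180000; /* new rmapbt, no slack */

		uint64_t threshold  = ag_blocks * 3 / 32;  /* 93750 */
		uint64_t slack_size = packed_size * 4 / 3; /* 240000 */

		/* 200000 >= 93750, so slack stays enabled... */
		printf("threshold %ju free %ju new rmapbt %ju\n",
				(uintmax_t)threshold,
				(uintmax_t)num_freeblocks,
				(uintmax_t)slack_size);
		/* ...and 240000 > 200000: the rebuild runs out of space. */
		return 0;
	}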
To combat this, estimate the worst-case size of the AG btrees given
the number of records we intend to put in them, subtract that
worst-case figure from num_freeblocks, and feed the result to
bulkload_estimate_ag_slack. This results in tighter packing of new
btree blocks when space is dear, and hopefully fewer problems. The
original failure /can/ be reproduced with generic/333 if you hack the
test to keep COWing blocks until the filesystem is totally out of
space, even if reflink has long since refused to share more blocks.
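Here is a sketch of the shape of that fix, assuming the btree
invariant that blocks are at least half full in the worst case; the
helper names are hypothetical, and the real change feeds the adjusted
figure to bulkload_estimate_ag_slack:

	#include <stdint.h>

	/*
	 * Hypothetical sketch: worst-case block count for a btree
	 * holding nr_records if every block is only half full.
	 * Assumes recs_per_block >= 2.
	 */
	static uint64_t
	worst_case_btree_blocks(uint64_t nr_records,
			unsigned int recs_per_block)
	{
		uint64_t blocks = 0;
		uint64_t nr = nr_records;
		unsigned int half = recs_per_block / 2;

		do {
			nr = (nr + half - 1) / half; /* blocks per level */
			blocks += nr;
		} while (nr > 1);
		return blocks;
	}

	/* Shrink the free-space figure before estimating slack. */
	static uint64_t
	adjusted_freeblocks(uint64_t num_freeblocks, uint64_t nr_records,
			unsigned int recs_per_block)
	{
		uint64_t worst = worst_case_btree_blocks(nr_records,
				recs_per_block);

		return num_freeblocks > worst ? num_freeblocks - worst : 0;
	}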
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Carlos Maiolino <cem@kernel.org>