blob: 29c78feb04fa225be37aabf613cd475d5cc74c71 [file] [log] [blame]
[[Data_Extents]]
= Data Extents
XFS manages space using extents, which are defined as a starting location and
length. A fork in an XFS inode maps a logical offset to a space extent. This
enables a file's extent map to support sparse files (i.e. ``holes'' in the file).
A flag is also used to specify if the extent has been preallocated but has not
yet been written (unwritten extent).
A file can have more than one extent if one chunk of contiguous disk space is
not available for the file. As a file grows, the XFS space allocator will
attempt to keep space contiguous and to merge extents. If more than one file is
being allocated space in the same AG at the same time, multiple extents for the
files will occur as the extent allocations interleave. The effect of this can
vary depending on the extent allocator used in the XFS driver.
An extent is 128 bits in size and uses the following packed layout:
.Extent record format
[cols="1,1,1,1"]
|===
|bit[127]
|bits[73-126]
|bits[21-72]
|bits[0-20]
|flag
|logical file block offset
|absolute block number
|# of blocks
|===
The extent is represented by the +xfs_bmbt_rec+ structure which uses a big
endian format on-disk. In-core management of extents use the +xfs_bmbt_irec+
structure which is the unpacked version of +xfs_bmbt_rec+:
[source, c]
----
struct xfs_bmbt_irec {
xfs_fileoff_t br_startoff;
xfs_fsblock_t br_startblock;
xfs_filblks_t br_blockcount;
xfs_exntst_t br_state;
};
----
*br_startoff*::
Logical block offset of this mapping.
*br_startblock*::
Filesystem block of this mapping.
*br_blockcount*::
The length of this mapping.
*br_state*::
The extent +br_state+ field uses the following enum declaration:
[source, c]
----
typedef enum {
XFS_EXT_NORM,
XFS_EXT_UNWRITTEN,
XFS_EXT_INVALID
} xfs_exntst_t;
----
Some other points about extents:
* The +xfs_bmbt_rec_32_t+ and +xfs_bmbt_rec_64_t+ structures were effectively
the same as +xfs_bmbt_rec_t+, just different representations of the same 128
bits in on-disk big endian format. +xfs_bmbt_rec_32_t+ was removed and
+xfs_bmbt_rec_64_t+ renamed to +xfs_bmbt_rec_t+ some time ago.
* When a file is created and written to, XFS will endeavour to keep the extents
within the same AG as the inode. It may use a different AG if the AG is busy
or there is no space left in it.
* If a file is zero bytes long, it will have no extents and +di_nblocks+ and
+di_nexents+ will be zero. Any file with data will have at least one extent, and
each extent can use from 1 to over 2 million blocks (2^21^) on the filesystem.
For a default 4KB block size filesystem, a single extent can be up to 8GB in
length.
The following two subsections cover the two methods of storing extent
information for a file. The first is the fastest and simplest where the inode
completely contains an extent array to the file's data. The second is slower and
more complex B+tree which can handle thousands to millions of extents
efficiently.
[[Extent_List]]
== Extent List
If the entire extent list is short enough to fit within the inode's fork
region, we say that the fork is in ``extent list'' format. This is the most
optimal in terms of speed and resource consumption. The trade-off is the file
can only have a few extents before the inode runs out of space.
The data fork of the inode contains an array of extents; the size of the array
is determined by the inode's +di_nextents+ value.
.Inode data fork extent layout
image::images/32.png[]
The number of extents that can fit in the inode depends on the inode size and
+di_forkoff+. For a default 256 byte inode with no extended attributes, a file
can have up to 9 extents with this format. On a default v5 filesystem with 512
byte inodes, a file can have up to 21 extents with this format. Beyond that,
extents have to use the B+tree format.
=== xfs_db Inode Data Fork Extents Example
An 8MB file with one extent:
----
xfs_db> inode <inode#>
xfs_db> p
core.magic = 0x494e
core.mode = 0100644
core.version = 1
core.format = 2 (extents)
...
core.size = 8294400
core.nblocks = 2025
core.extsize = 0
core.nextents = 1
core.naextents = 0
core.forkoff = 0
...
u.bmx[0] = [startoff,startblock,blockcount,extentflag]
0:[0,25356,2025,0]
----
A 24MB file with three extents:
----
xfs_db> inode <inode#>
xfs_db> p
...
core.format = 2 (extents)
...
core.size = 24883200
core.nblocks = 6075
core.nextents = 3
...
u.bmx[0-2] = [startoff,startblock,blockcount,extentflag]
0:[0,27381,2025,0]
1:[2025,31431,2025,0]
2:[4050,35481,2025,0]
----
Raw disk version of the inode with the third extent highlighted (+di_u+
starts at offset 0x64):
[subs="quotes"]
----
xfs_db> type text
xfs_db> p
00: 49 4e 81 a4 01 02 00 01 00 00 00 00 00 00 00 00 IN..............
10: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 ................
20: 44 b6 88 dd 2f 8a ed d0 44 b6 88 f7 10 8c 5b de D.......D.......
30: 44 b6 88 f7 10 8c 5b d0 00 00 00 00 01 7b b0 00 D...............
40: 00 00 00 00 00 00 17 bb 00 00 00 00 00 00 00 03 ................
50: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 ................
60: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 0d ................
70: 5e a0 07 e9 00 00 00 00 00 0f d2 00 00 00 00 0f ................
80: 58 e0 07 e9 *00 00 00 00 00 1f a4 00 00 00 00 11 X...............
90: 53 20 07 e9* 00 00 00 00 00 00 00 00 00 00 00 00 S...............
a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
be: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
co: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
do: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
fo: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
----
We can expand the highlighted section into the following bit array from MSB to
LSB with the file offset and the block count highlighted:
[subs="quotes"]
----
127-96: 0**000 0000 0000 0000 0000 0000 0000 0000**
95-64: **0000 0000 0001 1111 1010 010**0 0000 0000
63-32: 0000 0000 0000 0000 0000 0000 0000 1111
31-0 : 0101 1000 111**0 0000 0000 0111 1110 1001**
Grouping by highlights we get:
file offset = 0x0fd2 (4050)
start block = 0x7ac7 (31431)
block count = 0x07e9 (2025)
----
A 4MB file with two extents and a hole in the middle, the first extent
containing 64KB of data, the second about 4MB in containing 32KB (+write+ 64KB,
+lseek+ 4MB, +write+ 32KB operations):
----
xfs_db> inode <inode#>
xfs_db> p
...
core.format = 2 (extents)
...
core.size = 4063232
core.nblocks = 24
core.nextents = 2
...
u.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
0:[0,37506,16,0]
1:[984,37522,8,0]
----
[[Btree_Extent_List]]
== B+tree Extent List
To manage extent maps that cannot fit in the inode fork area, XFS uses
xref:Long_Format_Btrees[long format B+trees]. The root node of the B+tree is
stored in the inode's data fork. All block pointers for extent B+trees are
64-bit filesystem block numbers.
For a single level B+tree, the root node points to the B+tree's leaves. Each
leaf occupies one filesystem block and contains a header and an array of extents
sorted by the file's offset. Each leaf has left and right (or backward and
forward) block pointers to adjacent leaves. For a standard 4KB filesystem block,
a leaf can contain up to 254 extents before a B+tree rebalance is triggered.
For a multi-level B+tree, the root node points to other B+tree nodes which
eventually point to the extent leaves. B+tree keys are based on the file's
offset and have pointers to the next level down. Nodes at each level in the
B+tree also have pointers to the adjacent nodes.
The base B+tree node is used for extents, directories and extended attributes.
The structures used for an inode's B+tree root are:
[source, c]
----
struct xfs_bmdr_block {
__be16 bb_level;
__be16 bb_numrecs;
};
struct xfs_bmbt_key {
xfs_fileoff_t br_startoff;
};
typedef xfs_fsblock_t xfs_bmbt_ptr_t, xfs_bmdr_ptr_t;
----
* On disk, the B+tree node starts with the +xfs_bmdr_block_t+ header followed by
an array of +xfs_bmbt_key_t+ values and then an array of +xfs_bmbt_ptr_t+
values. The size of both arrays is specified by the header's +bb_numrecs+ value.
* The root node in the inode can only contain up to 9 key/pointer pairs for a
standard 256 byte inode before a new level of nodes is added between the root
and the leaves. This will be less if +di_forkoff+ is not zero (i.e. attributes
are in use on the inode).
* The magic number for a BMBT block is ``BMAP'' (0x424d4150). On a v5
filesystem, this is ``BMA3'' (0x424d4133).
* For intermediate nodes, the data following +xfs_btree_lblock+ is the same as
the root node: array of +xfs_bmbt_key+ value followed by an array of
+xfs_bmbt_ptr_t+ values that starts halfway through the block (offset 0x808 for
a 4096 byte filesystem block).
* For leaves, an array of +xfs_bmbt_rec+ extents follow the +xfs_btree_lblock+
header.
* Nodes and leaves use the same value for +bb_magic+.
* The +bb_level+ value determines if the node is an intermediate node or a leaf.
Leaves have a +bb_level+ of zero, nodes are one or greater.
* Intermediate nodes, like leaves, can contain up to 254 pointers to leaf blocks
for a standard 4KB filesystem block size as both the keys and pointers are 64
bits in size.
.Single level extent B+tree
image::images/35.png[]
.Multiple level extent B+tree
image::images/36.png[]
=== xfs_db bmbt Example
In this example, we dissect the data fork of a VM image that is sufficiently
sparse and interleaved to have become a B+tree.
----
xfs_db> inode 132
xfs_db> p
core.magic = 0x494e
core.mode = 0100600
core.version = 3
core.format = 3 (btree)
...
u3.bmbt.level = 1
u3.bmbt.numrecs = 3
u3.bmbt.keys[1-3] = [startoff] 1:[0] 2:[9072] 3:[13136]
u3.bmbt.ptrs[1-3] = 1:8568 2:8569 3:8570
----
As you can see, the block map B+tree is rooted in the inode. This tree has two
levels, so let's go down a level to look at the records:
----
xfs_db> addr u3.bmbt.ptrs[1]
xfs_db> p
magic = 0x424d4133
level = 0
numrecs = 251
leftsib = null
rightsib = 8569
bno = 68544
lsn = 0x100000006
uuid = 9579903c-333f-4673-a7d4-3254c05816ea
owner = 132
crc = 0xc61513dc (correct)
recs[1-251] = [startoff,startblock,blockcount,extentflag]
1:[0,8520,48,0] 2:[48,4421,16,0] 3:[80,9136,16,0] 4:[96,8569,16,0]
5:[144,8601,32,0] 6:[192,8637,16,0] 7:[240,8680,16,0] 8:[288,9870,16,0]
9:[320,9920,16,0] 10:[336,9950,16,0] 11:[384,4004,32,0]
12:[432,6771,16,0] 13:[480,2702,16,0] 14:[528,8420,16,0]
...
----