| [[Data_Extents]] |
| = Data Extents |
| |
| XFS manages space using extents, which are defined as a starting location and |
| length. A fork in an XFS inode maps a logical offset to a space extent. This |
| enables a file's extent map to support sparse files (i.e. ``holes'' in the file). |
| A flag is also used to specify if the extent has been preallocated but has not |
| yet been written (unwritten extent). |
| |
| A file can have more than one extent if one chunk of contiguous disk space is |
| not available for the file. As a file grows, the XFS space allocator will |
| attempt to keep space contiguous and to merge extents. If more than one file is |
| being allocated space in the same AG at the same time, multiple extents for the |
| files will occur as the extent allocations interleave. The effect of this can |
| vary depending on the extent allocator used in the XFS driver. |
| |
| An extent is 128 bits in size and uses the following packed layout: |
| |
| .Extent record format |
| image::images/31.png[] |
| |
| The extent is represented by the +xfs_bmbt_rec+ structure which uses a big |
| endian format on-disk. In-core management of extents use the +xfs_bmbt_irec+ |
| structure which is the unpacked version of +xfs_bmbt_rec+: |
| |
| [source, c] |
| ---- |
| struct xfs_bmbt_irec { |
| xfs_fileoff_t br_startoff; |
| xfs_fsblock_t br_startblock; |
| xfs_filblks_t br_blockcount; |
| xfs_exntst_t br_state; |
| }; |
| ---- |
| |
| *br_startoff*:: |
| Logical block offset of this mapping. |
| |
| *br_startblock*:: |
| Filesystem block of this mapping. |
| |
| *br_blockcount*:: |
| The length of this mapping. |
| |
| *br_state*:: |
| The extent +br_state+ field uses the following enum declaration: |
| |
| [source, c] |
| ---- |
| typedef enum { |
| XFS_EXT_NORM, |
| XFS_EXT_UNWRITTEN, |
| XFS_EXT_INVALID |
| } xfs_exntst_t; |
| ---- |
| |
| Some other points about extents: |
| |
| * The +xfs_bmbt_rec_32_t+ and +xfs_bmbt_rec_64_t+ structures were effectively |
| the same as +xfs_bmbt_rec_t+, just different representations of the same 128 |
| bits in on-disk big endian format. +xfs_bmbt_rec_32_t+ was removed and |
| +xfs_bmbt_rec_64_t+ renamed to +xfs_bmbt_rec_t+ some time ago. |
| |
| * When a file is created and written to, XFS will endeavour to keep the extents |
| within the same AG as the inode. It may use a different AG if the AG is busy |
| or there is no space left in it. |
| |
| * If a file is zero bytes long, it will have no extents and +di_nblocks+ and |
| +di_nexents+ will be zero. Any file with data will have at least one extent, and |
| each extent can use from 1 to over 2 million blocks (2^21^) on the filesystem. |
| For a default 4KB block size filesystem, a single extent can be up to 8GB in |
| length. |
| |
| The following two subsections cover the two methods of storing extent |
| information for a file. The first is the fastest and simplest where the inode |
| completely contains an extent array to the file's data. The second is slower and |
| more complex B+tree which can handle thousands to millions of extents |
| efficiently. |
| |
| |
| [[Extent_List]] |
| == Extent List |
| |
| If the entire extent list is short enough to fit within the inode's fork |
| region, we say that the fork is in ``extent list'' format. This is the most |
| optimal in terms of speed and resource consumption. The trade-off is the file |
| can only have a few extents before the inode runs out of space. |
| |
| The data fork of the inode contains an array of extents; the size of the array |
| is determined by the inode's +di_nextents+ value. |
| |
| .Inode data fork extent layout |
| image::images/32.png[] |
| |
| The number of extents that can fit in the inode depends on the inode size and |
| +di_forkoff+. For a default 256 byte inode with no extended attributes, a file |
| can have up to 9 extents with this format. On a default v5 filesystem with 512 |
| byte inodes, a file can have up to 21 extents with this format. Beyond that, |
| extents have to use the B+tree format. |
| |
| === xfs_db Inode Data Fork Extents Example |
| |
| An 8MB file with one extent: |
| |
| ---- |
| xfs_db> inode <inode#> |
| xfs_db> p |
| core.magic = 0x494e |
| core.mode = 0100644 |
| core.version = 1 |
| core.format = 2 (extents) |
| ... |
| core.size = 8294400 |
| core.nblocks = 2025 |
| core.extsize = 0 |
| core.nextents = 1 |
| core.naextents = 0 |
| core.forkoff = 0 |
| ... |
| u.bmx[0] = [startoff,startblock,blockcount,extentflag] |
| 0:[0,25356,2025,0] |
| ---- |
| |
| A 24MB file with three extents: |
| |
| ---- |
| xfs_db> inode <inode#> |
| xfs_db> p |
| ... |
| core.format = 2 (extents) |
| ... |
| core.size = 24883200 |
| core.nblocks = 6075 |
| core.nextents = 3 |
| ... |
| u.bmx[0-2] = [startoff,startblock,blockcount,extentflag] |
| 0:[0,27381,2025,0] |
| 1:[2025,31431,2025,0] |
| 2:[4050,35481,2025,0] |
| ---- |
| |
| Raw disk version of the inode with the third extent highlighted (+di_u+ |
| starts at offset 0x64): |
| |
| [subs="quotes"] |
| ---- |
| xfs_db> type text |
| xfs_db> p |
| 00: 49 4e 81 a4 01 02 00 01 00 00 00 00 00 00 00 00 IN.............. |
| 10: 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 01 ................ |
| 20: 44 b6 88 dd 2f 8a ed d0 44 b6 88 f7 10 8c 5b de D.......D....... |
| 30: 44 b6 88 f7 10 8c 5b d0 00 00 00 00 01 7b b0 00 D............... |
| 40: 00 00 00 00 00 00 17 bb 00 00 00 00 00 00 00 03 ................ |
| 50: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| 60: ff ff ff ff 00 00 00 00 00 00 00 00 00 00 00 0d ................ |
| 70: 5e a0 07 e9 00 00 00 00 00 0f d2 00 00 00 00 0f ................ |
| 80: 58 e0 07 e9 *00 00 00 00 00 1f a4 00 00 00 00 11 X............... |
| 90: 53 20 07 e9* 00 00 00 00 00 00 00 00 00 00 00 00 S............... |
| a0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| be: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| co: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| do: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| e0: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| fo: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| ---- |
| |
| We can expand the highlighted section into the following bit array from MSB to |
| LSB with the file offset and the block count highlighted: |
| |
| [subs="quotes"] |
| ---- |
| 127-96: 0**000 0000 0000 0000 0000 0000 0000 0000** |
| 95-64: **0000 0000 0001 1111 1010 010**0 0000 0000 |
| 63-32: 0000 0000 0000 0000 0000 0000 0000 1111 |
| 31-0 : 0101 1000 111**0 0000 0000 0111 1110 1001** |
| |
| Grouping by highlights we get: |
| file offset = 0x0fd2 (4050) |
| start block = 0x7ac7 (31431) |
| block count = 0x07e9 (2025) |
| ---- |
| |
| A 4MB file with two extents and a hole in the middle, the first extent |
| containing 64KB of data, the second about 4MB in containing 32KB (+write+ 64KB, |
| +lseek+ 4MB, +write+ 32KB operations): |
| |
| ---- |
| xfs_db> inode <inode#> |
| xfs_db> p |
| ... |
| core.format = 2 (extents) |
| ... |
| core.size = 4063232 |
| core.nblocks = 24 |
| core.nextents = 2 |
| ... |
| u.bmx[0-1] = [startoff,startblock,blockcount,extentflag] |
| 0:[0,37506,16,0] |
| 1:[984,37522,8,0] |
| ---- |
| |
| |
| [[Btree_Extent_List]] |
| == B+tree Extent List |
| |
| To manage extent maps that cannot fit in the inode fork area, XFS uses |
| xref:Long_Format_Btrees[long format B+trees]. The root node of the B+tree is |
| stored in the inode's data fork. All block pointers for extent B+trees are |
| 64-bit filesystem block numbers. |
| |
| For a single level B+tree, the root node points to the B+tree's leaves. Each |
| leaf occupies one filesystem block and contains a header and an array of extents |
| sorted by the file's offset. Each leaf has left and right (or backward and |
| forward) block pointers to adjacent leaves. For a standard 4KB filesystem block, |
| a leaf can contain up to 254 extents before a B+tree rebalance is triggered. |
| |
| For a multi-level B+tree, the root node points to other B+tree nodes which |
| eventually point to the extent leaves. B+tree keys are based on the file's |
| offset and have pointers to the next level down. Nodes at each level in the |
| B+tree also have pointers to the adjacent nodes. |
| |
| The base B+tree node is used for extents, directories and extended attributes. |
| The structures used for an inode's B+tree root are: |
| |
| [source, c] |
| ---- |
| struct xfs_bmdr_block { |
| __be16 bb_level; |
| __be16 bb_numrecs; |
| }; |
| struct xfs_bmbt_key { |
| xfs_fileoff_t br_startoff; |
| }; |
| typedef xfs_fsblock_t xfs_bmbt_ptr_t, xfs_bmdr_ptr_t; |
| ---- |
| |
| * On disk, the B+tree node starts with the +xfs_bmdr_block_t+ header followed by |
| an array of +xfs_bmbt_key_t+ values and then an array of +xfs_bmbt_ptr_t+ |
| values. The size of both arrays is specified by the header's +bb_numrecs+ value. |
| |
| * The root node in the inode can only contain up to 9 key/pointer pairs for a |
| standard 256 byte inode before a new level of nodes is added between the root |
| and the leaves. This will be less if +di_forkoff+ is not zero (i.e. attributes |
| are in use on the inode). |
| |
| * The magic number for a BMBT block is ``BMAP'' (0x424d4150). On a v5 |
| filesystem, this is ``BMA3'' (0x424d4133). |
| |
| * For intermediate nodes, the data following +xfs_btree_lblock+ is the same as |
| the root node: array of +xfs_bmbt_key+ value followed by an array of |
| +xfs_bmbt_ptr_t+ values that starts halfway through the block (offset 0x808 for |
| a 4096 byte filesystem block). |
| |
| * For leaves, an array of +xfs_bmbt_rec+ extents follow the +xfs_btree_lblock+ |
| header. |
| |
| * Nodes and leaves use the same value for +bb_magic+. |
| |
| * The +bb_level+ value determines if the node is an intermediate node or a leaf. |
| Leaves have a +bb_level+ of zero, nodes are one or greater. |
| |
| * Intermediate nodes, like leaves, can contain up to 254 pointers to leaf blocks |
| for a standard 4KB filesystem block size as both the keys and pointers are 64 |
| bits in size. |
| |
| .Single level extent B+tree |
| image::images/35.png[] |
| |
| .Multiple level extent B+tree |
| image::images/36.png[] |
| |
| === xfs_db bmbt Example |
| |
| In this example, we dissect the data fork of a VM image that is sufficiently |
| sparse and interleaved to have become a B+tree. |
| |
| ---- |
| xfs_db> inode 132 |
| xfs_db> p |
| core.magic = 0x494e |
| core.mode = 0100600 |
| core.version = 3 |
| core.format = 3 (btree) |
| ... |
| u3.bmbt.level = 1 |
| u3.bmbt.numrecs = 3 |
| u3.bmbt.keys[1-3] = [startoff] 1:[0] 2:[9072] 3:[13136] |
| u3.bmbt.ptrs[1-3] = 1:8568 2:8569 3:8570 |
| ---- |
| |
| As you can see, the block map B+tree is rooted in the inode. This tree has two |
| levels, so let's go down a level to look at the records: |
| |
| ---- |
| xfs_db> addr u3.bmbt.ptrs[1] |
| xfs_db> p |
| magic = 0x424d4133 |
| level = 0 |
| numrecs = 251 |
| leftsib = null |
| rightsib = 8569 |
| bno = 68544 |
| lsn = 0x100000006 |
| uuid = 9579903c-333f-4673-a7d4-3254c05816ea |
| owner = 132 |
| crc = 0xc61513dc (correct) |
| recs[1-251] = [startoff,startblock,blockcount,extentflag] |
| 1:[0,8520,48,0] 2:[48,4421,16,0] 3:[80,9136,16,0] 4:[96,8569,16,0] |
| 5:[144,8601,32,0] 6:[192,8637,16,0] 7:[240,8680,16,0] 8:[288,9870,16,0] |
| 9:[320,9920,16,0] 10:[336,9950,16,0] 11:[384,4004,32,0] |
| 12:[432,6771,16,0] 13:[480,2702,16,0] 14:[528,8420,16,0] |
| ... |
| ---- |