<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
]>
<chapter id="Data_Extents">
<title>Data Extents</title>
<para>XFS allocates space for a file using extents, each described by a starting location and a length. An XFS extent also records its logical starting offset within the file, which allows a file's extent map to represent sparse files (i.e. files with "holes") directly. A flag indicates whether the extent has been preallocated but not yet written to (an unwritten extent).</para>
<para>A file can have more than one extent if one chunk of contiguous disk space is not available for the file. As a file grows, the XFS space allocator will attempt to keep space contiguous and merge extents. If more than one file is being allocated space in the same AG at the same time, the files' extents become interleaved and each file ends up with multiple extents. The effect of this can vary depending on the extent allocator used in the XFS driver.</para>
<para>An extent is 128 bits in size and uses the following packed layout:</para>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/31.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>31</phrase></textobject>
</mediaobject>
</para>
<para>The extent is represented by the <command>xfs_bmbt_rec_t</command> structure, which uses a big endian format on-disk. In-core management of extents uses the <command>xfs_bmbt_irec_t</command> structure, which is the unpacked version of <command>xfs_bmbt_rec_t</command>:</para>
<programlisting>
typedef struct xfs_bmbt_irec {
xfs_fileoff_t br_startoff;
xfs_fsblock_t br_startblock;
xfs_filblks_t br_blockcount;
xfs_exntst_t br_state;
} xfs_bmbt_irec_t;
</programlisting>
<para>The extent <command>br_state</command> field uses the following enum declaration:</para>
<programlisting>
typedef enum {
XFS_EXT_NORM,
XFS_EXT_UNWRITTEN,
XFS_EXT_INVALID
} xfs_exntst_t;
</programlisting>
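<para>As an illustration of how the packed record relates to the unpacked structure, the following is a minimal sketch of an unpacking routine. It assumes the 128 bits are laid out, from most to least significant, as a 1 bit flag, a 54 bit file offset, a 52 bit start block and a 21 bit block count. The names <command>struct irec</command>, <command>get_be64</command> and <command>unpack_extent</command> are hypothetical and use plain fixed-width types rather than the XFS typedefs.</para>
<programlisting><![CDATA[
```c
#include <stdint.h>

/* Unpacked extent, mirroring xfs_bmbt_irec_t with plain fixed-width types. */
struct irec {
    uint64_t br_startoff;   /* logical file offset in blocks (54 bits) */
    uint64_t br_startblock; /* absolute filesystem block (52 bits) */
    uint64_t br_blockcount; /* extent length in blocks (21 bits) */
    int      br_state;      /* 0 = XFS_EXT_NORM, 1 = XFS_EXT_UNWRITTEN */
};

/* Read a big endian 64-bit value from a byte buffer. */
static uint64_t get_be64(const unsigned char *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Unpack one 16 byte on-disk extent record (hypothetical helper). */
static struct irec unpack_extent(const unsigned char rec[16])
{
    uint64_t l0 = get_be64(rec);      /* upper 64 bits */
    uint64_t l1 = get_be64(rec + 8);  /* lower 64 bits */
    struct irec e;

    e.br_state      = (int)(l0 >> 63);                        /* bit 127 */
    e.br_startoff   = (l0 >> 9) & ((UINT64_C(1) << 54) - 1);  /* bits 126-73 */
    e.br_startblock = ((l0 & 0x1ff) << 43) | (l1 >> 21);      /* bits 72-21 */
    e.br_blockcount = l1 & ((UINT64_C(1) << 21) - 1);         /* bits 20-0 */
    return e;
}
```
]]></programlisting>
<para>The start block straddles the two 64-bit halves, which is why its upper 9 bits are taken from the first word and its lower 43 bits from the second.</para>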
<para>Some other points about extents:</para>
<itemizedlist>
<listitem>
<para>The <command>xfs_bmbt_rec_32_t</command> and <command>xfs_bmbt_rec_64_t</command> structures are effectively the same as <command>xfs_bmbt_rec_t</command>, just different representations of the same 128 bits in on-disk big endian format.</para>
</listitem>
<listitem>
<para>When a file is created and written to, XFS will endeavour to keep the extents within the same AG as the inode. It may use a different AG if the AG is busy or there is no space left in it.</para>
</listitem>
<listitem>
<para>If a file is zero bytes long, it will have no extents, and both <command>di_nblocks</command> and <command>di_nextents</command> will be zero. Any file with data will have at least one extent, and each extent can cover from 1 to over 2 million blocks (2<superscript>21</superscript>) on the filesystem. For a default 4KB block size filesystem, a single extent can be up to 8GB in length.</para>
</listitem>
</itemizedlist>
<para>The following two subsections cover the two methods of storing extent information for a file. The first is the fastest and simplest: the inode itself contains the complete extent array for the file's data. The second is a slower and more complex B+tree, which can handle thousands to millions of extents efficiently.</para>
<section id="Extent_List"><title>Extent List</title>
<para>Local extents are where the entire extent array is stored within the inode's data fork itself. This is optimal in terms of speed and resource consumption. The trade-off is that the file can only have a few extents before the inode runs out of space.</para>
<para>The "data fork" of the inode contains an array of extents, the size of the array determined by the inode's <command>di_nextents</command> value.</para>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/32.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>32</phrase></textobject>
</mediaobject>
</para>
<para>The number of extents that can fit in the inode depends on the inode size and <command>di_forkoff</command>. For a default 256 byte inode with no extended attributes, the data fork has 156 bytes available after the 100 byte inode core, so a file can have up to 9 extents of 16 bytes each with this format. Beyond this, extents have to use the B+tree format.</para>
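<para>The capacity calculation can be sketched as follows. This is an illustration only: it assumes a 100 byte inode core (<command>di_u</command> at offset 0x64), 16 byte extent records, and that a non-zero <command>di_forkoff</command> gives the size of the data fork in units of 8 bytes. The function names are hypothetical.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>

#define INODE_CORE_SIZE 100 /* assumed: di_u starts at offset 0x64 */
#define EXTENT_REC_SIZE 16  /* one 128-bit extent record */

/*
 * Bytes available to the data fork. A non-zero di_forkoff is assumed to
 * count 8 byte units from the start of the area following the inode core;
 * zero means there is no attribute fork and the data fork uses it all.
 */
static size_t data_fork_size(size_t inode_size, unsigned di_forkoff)
{
    if (di_forkoff != 0)
        return (size_t)di_forkoff * 8;
    return inode_size - INODE_CORE_SIZE;
}

/* Number of whole extent records that fit in the data fork. */
static size_t max_inline_extents(size_t inode_size, unsigned di_forkoff)
{
    return data_fork_size(inode_size, di_forkoff) / EXTENT_REC_SIZE;
}
```
]]></programlisting>
<para>With these assumptions, <command>max_inline_extents(256, 0)</command> evaluates to 9, since 156 bytes hold nine whole 16 byte records.</para>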
<bridgehead>xfs_db Example:</bridgehead>
<para>An 8MB file with one extent:</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
core.magic = 0x494e
core.mode = 0100644
core.version = 1
core.format = 2 (extents)
...
core.size = 8294400
core.nblocks = 2025
core.extsize = 0
core.nextents = 1
core.naextents = 0
core.forkoff = 0
...
u.bmx[0] = [startoff,startblock,blockcount,extentflag]
0:[0,25356,2025,0]
</programlisting>
<para>A 24MB file with three extents:</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
...
core.format = 2 (extents)
...
core.size = 24883200
core.nblocks = 6075
core.nextents = 3
...
u.bmx[0-2] = [startoff,startblock,blockcount,extentflag]
0:[0,27381,2025,0]
1:[2025,31431,2025,0]
2:[4050,35481,2025,0]
</programlisting>
<para>Raw disk version of the inode with the third extent highlighted (<command>di_u</command> always starts at offset 0x64):</para>
<mediaobject>
<imageobject><imagedata fileref="images/code/33a.png" format="PNG" /></imageobject>
<textobject><phrase>code33a</phrase></textobject>
</mediaobject>
<para>We can expand the highlighted section into the following bit array from MSB to LSB with the file offset and the block count highlighted:</para>
<mediaobject>
<imageobject><imagedata fileref="images/code/33b.png" format="PNG" /></imageobject>
<textobject><phrase>code33b</phrase></textobject>
</mediaobject>
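<para>For readers without access to the images, the byte pattern of a record can be reproduced with a small packing routine. This is a sketch under the same layout assumption (1 bit flag, 54 bit file offset, 52 bit start block, 21 bit block count, stored big endian); <command>put_be64</command> and <command>pack_extent</command> are hypothetical names, not XFS functions.</para>
<programlisting><![CDATA[
```c
#include <stdint.h>

#define MASK(n) ((UINT64_C(1) << (n)) - 1)

/* Store a 64-bit value into a byte buffer in big endian order. */
static void put_be64(unsigned char *p, uint64_t v)
{
    for (int i = 7; i >= 0; i--) {
        p[i] = (unsigned char)(v & 0xff);
        v >>= 8;
    }
}

/* Pack the four extent fields into a 16 byte on-disk record. */
static void pack_extent(unsigned char rec[16], uint64_t startoff,
                        uint64_t startblock, uint64_t blockcount,
                        int unwritten)
{
    uint64_t l0, l1;

    /* Truncate each field to its on-disk width. */
    startoff   &= MASK(54);
    startblock &= MASK(52);
    blockcount &= MASK(21);

    /* Upper word: flag, file offset, top 9 bits of the start block. */
    l0 = ((uint64_t)(unwritten ? 1 : 0) << 63) |
         (startoff << 9) | (startblock >> 43);
    /* Lower word: low 43 bits of the start block, then the block count. */
    l1 = (startblock << 21) | blockcount;

    put_be64(rec, l0);
    put_be64(rec + 8, l1);
}
```
]]></programlisting>
<para>Packing the third extent of the example above (file offset 4050, start block 35481, block count 2025, flag 0) yields the bytes <command>00 00 00 00 00 1f a4 00 00 00 00 11 53 20 07 e9</command>.</para>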
<para>A 4MB file with two extents and a hole in the middle. The first extent contains 64KB of data and the second, located about 4MB into the file, contains 32KB (created with a <command>write</command> of 64KB, an <command>lseek</command> to ~4MB and a <command>write</command> of 32KB):</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
...
core.format = 2 (extents)
...
core.size = 4063232
core.nblocks = 24
core.nextents = 2
...
u.bmx[0-1] = [startoff,startblock,blockcount,extentflag]
0:[0,37506,16,0]
1:[984,37522,8,0]
</programlisting>
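<para>Because each extent carries its own file offset, holes are simply the gaps between consecutive extents. The following sketch walks an extent array sorted by file offset and reports those gaps; <command>struct extent</command>, <command>struct hole</command> and <command>find_holes</command> are hypothetical names used for illustration.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

/* The first three fields of an xfs_db extent listing, in blocks. */
struct extent {
    uint64_t startoff;
    uint64_t startblock;
    uint64_t blockcount;
};

struct hole {
    uint64_t startoff;   /* first file block of the hole */
    uint64_t blockcount; /* length of the hole in blocks */
};

/*
 * Scan extents sorted by startoff and record the gaps between them.
 * At most max holes are stored in out; the total number of holes found
 * is returned, which may exceed max.
 */
static size_t find_holes(const struct extent *ext, size_t n,
                         struct hole *out, size_t max)
{
    size_t found = 0;
    uint64_t next = 0; /* first file block not yet covered */

    for (size_t i = 0; i < n; i++) {
        if (ext[i].startoff > next) {
            if (found < max) {
                out[found].startoff = next;
                out[found].blockcount = ext[i].startoff - next;
            }
            found++;
        }
        next = ext[i].startoff + ext[i].blockcount;
    }
    return found;
}
```
]]></programlisting>
<para>For the two extents listed above, this reports a single hole starting at file block 16 and spanning 968 blocks, which ends where the second extent begins at block 984.</para>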
</section>
<section id="Btree_Extent_List"><title>B+tree Extent List</title>
<para>Beyond the simple extent array, to efficiently manage large extent maps, XFS uses B+trees. The root node of the B+tree is stored in the inode's data fork. All block pointers for extent B+trees are 64-bit absolute block numbers.</para>
<para>For a single level B+tree, the root node points to the B+tree's leaves. Each leaf occupies one filesystem block and contains a header and an array of extents sorted by the file's offset. Each leaf has left and right (or backward and forward) block pointers to adjacent leaves. For a standard 4KB filesystem block, a leaf can contain up to 254 extents before a B+tree rebalance is triggered. </para>
<para>For a multi-level B+tree, the root node points to other B+tree nodes which eventually point to the extent leaves. B+tree keys are based on the file's offset. The nodes at each level in the B+tree point to the adjacent nodes.</para>
<para>The base B+tree node is used for extents, directories and extended attributes. The structures used for the inode's B+tree root are:</para>
<programlisting>
typedef struct xfs_bmdr_block {
__be16 bb_level;
__be16 bb_numrecs;
} xfs_bmdr_block_t;
typedef struct xfs_bmbt_key {
xfs_dfiloff_t br_startoff;
} xfs_bmbt_key_t, xfs_bmdr_key_t;
typedef xfs_dfsbno_t xfs_bmbt_ptr_t, xfs_bmdr_ptr_t;
</programlisting>
<itemizedlist>
<listitem>
<para>On disk, the B+tree node starts with the <command>xfs_bmdr_block_t</command> header followed by an array of <command>xfs_bmbt_key_t</command> values and then an array of <command>xfs_bmbt_ptr_t</command> values. The size of both arrays is specified by the header's <command>bb_numrecs</command> value.</para>
</listitem>
<listitem>
<para>The root node in the inode can only contain up to 9 key/pointer pairs for a standard 256 byte inode before a new level of nodes is added between the root and the leaves: the 156 byte data fork holds the 4 byte header plus 16 bytes per pair. This will be less if <command>di_forkoff</command> is not zero (i.e. attributes are in use on the inode).</para>
</listitem>
</itemizedlist>
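<para>The root node capacity follows directly from the structure sizes. The sketch below assumes a 4 byte <command>xfs_bmdr_block_t</command> header and 8 byte keys and pointers; <command>bmdr_max_pairs</command> is a hypothetical helper that takes the size of the data fork in bytes.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>

#define BMDR_HDR_SIZE 4 /* bb_level + bb_numrecs, 2 bytes each */
#define BMBT_KEY_SIZE 8 /* 64-bit file offset key */
#define BMBT_PTR_SIZE 8 /* 64-bit absolute block number */

/* Number of key/pointer pairs that fit in a root node of fork_size bytes. */
static size_t bmdr_max_pairs(size_t fork_size)
{
    if (fork_size < BMDR_HDR_SIZE)
        return 0;
    return (fork_size - BMDR_HDR_SIZE) / (BMBT_KEY_SIZE + BMBT_PTR_SIZE);
}
```
]]></programlisting>
<para>For a 156 byte data fork (a 256 byte inode less a 100 byte inode core, with no attribute fork), <command>bmdr_max_pairs(156)</command> evaluates to 9.</para>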
<para>The subsequent nodes and leaves of the B+tree use the <command>xfs_bmbt_block_t</command> declaration:</para>
<programlisting>
typedef struct xfs_btree_lblock xfs_bmbt_block_t;
typedef struct xfs_btree_lblock {
__be32 bb_magic;
__be16 bb_level;
__be16 bb_numrecs;
__be64 bb_leftsib;
__be64 bb_rightsib;
} xfs_btree_lblock_t;
</programlisting>
<itemizedlist>
<listitem>
<para>For intermediate nodes, the data following <command>xfs_bmbt_block_t</command> is the same as the root node: an array of <command>xfs_bmbt_key_t</command> values followed by an array of <command>xfs_bmbt_ptr_t</command> values that starts halfway through the block (offset 0x808 for a 4096 byte filesystem block).</para>
</listitem>
<listitem>
<para>For leaves, an array of <command>xfs_bmbt_rec_t</command> extents follows the <command>xfs_bmbt_block_t</command> header.</para>
</listitem>
<listitem>
<para>Nodes and leaves use the same value for <command>bb_magic</command>: </para>
</listitem>
</itemizedlist>
<programlisting>
#define XFS_BMAP_MAGIC        0x424d4150        /* 'BMAP' */
</programlisting>
<itemizedlist>
<listitem>
<para>The <command>bb_level</command> value determines if the node is an intermediate node or a leaf. Leaves have a <command>bb_level</command> of zero; intermediate nodes have a <command>bb_level</command> of one or greater.</para>
</listitem>
<listitem>
<para>Intermediate nodes can contain up to 254 key/pointer pairs for a standard 4KB filesystem block size, the same count as the extents in a leaf, as both the keys and pointers are 64 bits in size.</para>
</listitem>
</itemizedlist>
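<para>The points above can be combined into a small sketch that decodes the block header and derives the block geometry. It assumes the 24 byte <command>xfs_btree_lblock</command> header shown earlier and 16 bytes per leaf record or key/pointer pair; <command>struct bmbt_hdr</command>, <command>parse_bmbt_hdr</command>, <command>bmbt_max_recs</command> and <command>bmbt_ptr_offset</command> are hypothetical names.</para>
<programlisting><![CDATA[
```c
#include <stddef.h>
#include <stdint.h>

#define XFS_BMAP_MAGIC  0x424d4150u /* 'BMAP' */
#define LBLOCK_HDR_SIZE 24          /* 4 + 2 + 2 + 8 + 8 bytes */

/* Header fields converted to host byte order. */
struct bmbt_hdr {
    uint32_t magic;
    uint16_t level;   /* 0 for a leaf, 1 or more for a node */
    uint16_t numrecs;
    uint64_t leftsib;
    uint64_t rightsib;
};

/* Read an nbytes wide big endian value from a byte buffer. */
static uint64_t get_be(const unsigned char *p, size_t nbytes)
{
    uint64_t v = 0;
    for (size_t i = 0; i < nbytes; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Decode the header; returns 0 on success, -1 if the magic is wrong. */
static int parse_bmbt_hdr(const unsigned char *blk, struct bmbt_hdr *h)
{
    h->magic    = (uint32_t)get_be(blk, 4);
    h->level    = (uint16_t)get_be(blk + 4, 2);
    h->numrecs  = (uint16_t)get_be(blk + 6, 2);
    h->leftsib  = get_be(blk + 8, 8);
    h->rightsib = get_be(blk + 16, 8);
    return h->magic == XFS_BMAP_MAGIC ? 0 : -1;
}

/* Maximum leaf records or node key/pointer pairs per block. */
static size_t bmbt_max_recs(size_t blocksize)
{
    return (blocksize - LBLOCK_HDR_SIZE) / 16;
}

/* Byte offset of the pointer array within an intermediate node. */
static size_t bmbt_ptr_offset(size_t blocksize)
{
    return LBLOCK_HDR_SIZE + bmbt_max_recs(blocksize) * 8;
}
```
]]></programlisting>
<para>For a 4096 byte block, <command>bmbt_max_recs</command> gives 254 and <command>bmbt_ptr_offset</command> gives 0x808, matching the figures quoted above. The pointer array is placed after space for the maximum number of keys, not after the keys currently in use.</para>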
<para>The following diagram illustrates a single level extent B+tree:</para>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/35.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>35</phrase></textobject>
</mediaobject>
</para>
<para>The following diagram illustrates a two level extent B+tree:</para>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/36.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>36</phrase></textobject>
</mediaobject>
</para>
<bridgehead>xfs_db Example:</bridgehead>
<para>TODO:</para></section></chapter>