<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
]>
<chapter id="Directories"><title>
Directories</title>
<itemizedlist>
<listitem>
<para>Only v2 directories covered here. v1 directories are obsolete.</para>
</listitem>
<listitem>
<para>The size of a "directory block" is defined by the superblock's (<xref linkend="Superblocks"/>) <command>sb_dirblklog</command> value. The size in bytes = <command>sb_blocksize</command> * <command>2<superscript>sb_dirblklog</superscript></command>. For example, if <command>sb_blocksize</command> = 4096 and <command>sb_dirblklog</command> = 2, the directory block size is 16384 bytes. Directory blocks are always allocated in multiples based on <command>sb_dirblklog</command>. Directory blocks cannot be more than 65536 bytes in size.</para>
<note>
<para>
The term "block" in this section refers to directory blocks, not filesystem blocks, unless otherwise specified.
</para>
</note>
</listitem>
<listitem>
<para>All directory entries contain the following "data":</para>
<itemizedlist>
<listitem>
<para>
Entry's name (a counted string: a single byte <command>namelen</command> followed by <command>name</command>, an array of 8-bit chars without a NULL terminator).
</para>
</listitem>
<listitem>
<para>
Entry's absolute inode number (<xref linkend="Inode_Numbers"/>), which is always 64 bits (8 bytes) in size, except for a special case in shortform directories.
</para>
</listitem>
<listitem>
<para>
An <command>offset</command> or <command>tag</command> used for iterative readdir calls.
</para>
</listitem>
</itemizedlist>
</listitem>
<listitem>
<para>All non-shortform directories also contain two additional structures: "leaves" and "freespace indexes".</para>
<itemizedlist>
<listitem>
<para>
Leaves contain the sorted hashed name value (<command>xfs_da_hashname()</command> in xfs_da_btree.c) and associated "address" which points to the effective offset into the directory's data structures. Leaves are used to optimise lookup operations.
</para>
</listitem>
<listitem>
<para>
Freespace indexes contain free space/empty entry tracking for quickly finding an appropriately sized location for new entries. They maintain the largest free space for each "data" block.
</para>
</listitem>
</itemizedlist>
</listitem>
<listitem>
<para>A few common types are used for the directory structures:</para>
<programlisting language="C">
typedef __uint16_t xfs_dir2_data_off_t;
typedef __uint32_t xfs_dir2_dataptr_t;
</programlisting>
</listitem>
</itemizedlist>
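<para>The directory block size calculation above can be sketched in C. This is a minimal illustration rather than kernel code: <command>dir_block_bytes()</command> is a hypothetical helper name, and returning 0 for sizes above 65536 bytes is a choice made only for this sketch.</para>
<programlisting language="C"><![CDATA[
#include <stdint.h>

/* Largest directory block size, as stated in the text above. */
#define DIR_BLOCK_MAX_BYTES 65536u

/*
 * Directory block size in bytes: sb_blocksize * 2^sb_dirblklog.
 * Returns 0 if the result would exceed DIR_BLOCK_MAX_BYTES.
 * Hypothetical helper for illustration only.
 */
uint32_t dir_block_bytes(uint32_t sb_blocksize, uint8_t sb_dirblklog)
{
    uint64_t bytes;

    if (sb_dirblklog > 16)      /* too large for any non-zero block size */
        return 0;
    bytes = (uint64_t)sb_blocksize << sb_dirblklog;
    return bytes > DIR_BLOCK_MAX_BYTES ? 0 : (uint32_t)bytes;
}
]]></programlisting>
<para>With the values from the example above, <command>dir_block_bytes(4096, 2)</command> yields 16384.</para>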
<section id="Shortform_Directories"><title>
Shortform Directories</title>
<itemizedlist>
<listitem>
<para>Directory entries are stored within the inode.</para>
</listitem>
<listitem>
<para>The only data stored is the name, inode number and offset. No "leaf" or "freespace index" information is required, as an inode can only store a few entries.</para>
</listitem>
<listitem>
<para>"." is not stored (as it's in the inode itself), and ".." is a dedicated <command>parent</command> field in the header.</para>
</listitem>
<listitem>
<para>Whether a directory fits in an inode depends on the inode size (<xref linkend="On-disk_Inode"/>), the number of entries, the length of the entry names and the extended attribute data.</para>
</listitem>
<listitem>
<para>Once the entries no longer fit in the space available in the inode, the format is converted to a "Block Directory".</para>
</listitem>
<listitem>
<para>Shortform directory data is packed as tightly as possible on the disk with the remaining space zeroed:</para>
<programlisting>
typedef struct xfs_dir2_sf {
xfs_dir2_sf_hdr_t hdr;
xfs_dir2_sf_entry_t list[1];
} xfs_dir2_sf_t;
typedef struct xfs_dir2_sf_hdr {
__uint8_t count;
__uint8_t i8count;
xfs_dir2_inou_t parent;
} xfs_dir2_sf_hdr_t;
typedef struct xfs_dir2_sf_entry {
__uint8_t namelen;
xfs_dir2_sf_off_t offset;
__uint8_t name[1];
xfs_dir2_inou_t inumber;
} xfs_dir2_sf_entry_t;
</programlisting>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/39.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>39</phrase></textobject>
</mediaobject>
</para>
</listitem>
<listitem>
<para>
Inode numbers are stored using 4 or 8 bytes depending on whether all the inode numbers for the directory fit in 4 bytes (32 bits) or not. The header's <command>count</command> value always specifies the number of entries in the directory. If all inode numbers fit in 4 bytes, <command>i8count</command> is zero. If any inode number exceeds 4 bytes, all inode numbers are stored in 8 bytes and <command>i8count</command> holds the number of entries whose inode numbers require 8 bytes. The following union covers the shortform inode number structure:</para>
<programlisting>
typedef struct { __uint8_t i[8]; } xfs_dir2_ino8_t;
typedef struct { __uint8_t i[4]; } xfs_dir2_ino4_t;
typedef union {
xfs_dir2_ino8_t i8;
xfs_dir2_ino4_t i4;
} xfs_dir2_inou_t;
</programlisting>
</listitem>
</itemizedlist>
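<para>Because shortform data is tightly packed, its size follows directly from the structures above. The sketch below is an illustration only: <command>sf_dir_bytes()</command> is a hypothetical helper, and it assumes that <command>xfs_dir2_sf_off_t</command> occupies two bytes, which is not defined in this chapter.</para>
<programlisting language="C"><![CDATA[
#include <stddef.h>
#include <stdint.h>

/*
 * Bytes used by a shortform directory inside the inode.
 * namelens[] holds the name length of each entry; use_i8 is non-zero
 * when inode numbers are stored in 8 bytes instead of 4.
 * Assumption: the per-entry offset field is 2 bytes wide.
 */
size_t sf_dir_bytes(const uint8_t *namelens, size_t nentries, int use_i8)
{
    size_t inosize = use_i8 ? 8 : 4;
    size_t total = 1 + 1 + inosize;     /* count, i8count, parent */
    size_t i;

    for (i = 0; i < nentries; i++)
        total += 1 + 2 + namelens[i] + inosize; /* namelen, offset, name, inumber */
    return total;
}
]]></programlisting>
<para>For four entries with 15-character names and 4-byte inode numbers this gives 94 bytes, and for three such entries 72 bytes, which matches the <command>core.size</command> values in the xfs_db example that follows.</para>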
<bridgehead>xfs_db Example:</bridgehead>
<para>A directory is created with 4 files, all inode numbers fitting within 4 bytes:</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
core.magic = 0x494e
core.mode = 040755
core.version = 1
core.format = 1 (local)
core.nlinkv1 = 2
...
core.size = 94
core.nblocks = 0
core.extsize = 0
core.nextents = 0
...
u.sfdir2.hdr.count = 4
u.sfdir2.hdr.i8count = 0
u.sfdir2.hdr.parent.i4 = 128 /* parent = root inode */
u.sfdir2.list[0].namelen = 15
u.sfdir2.list[0].offset = 0x30
u.sfdir2.list[0].name = "frame000000.tst"
u.sfdir2.list[0].inumber.i4 = 25165953
u.sfdir2.list[1].namelen = 15
u.sfdir2.list[1].offset = 0x50
u.sfdir2.list[1].name = "frame000001.tst"
u.sfdir2.list[1].inumber.i4 = 25165954
u.sfdir2.list[2].namelen = 15
u.sfdir2.list[2].offset = 0x70
u.sfdir2.list[2].name = "frame000002.tst"
u.sfdir2.list[2].inumber.i4 = 25165955
u.sfdir2.list[3].namelen = 15
u.sfdir2.list[3].offset = 0x90
u.sfdir2.list[3].name = "frame000003.tst"
u.sfdir2.list[3].inumber.i4 = 25165956
</programlisting>
<para>The raw data on disk is shown below, with the first entry highlighted. The six-byte header precedes the first entry:</para>
<mediaobject>
<imageobject><imagedata fileref="images/code/40.png" format="PNG" /></imageobject>
<textobject><phrase>code40</phrase></textobject>
</mediaobject>
<para>Next, an entry is deleted (frame000001.tst), and any entries after the deleted entry are moved or compacted to "cover" the hole:</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
core.magic = 0x494e
core.mode = 040755
core.version = 1
core.format = 1 (local)
core.nlinkv1 = 2
...
core.size = 72
core.nblocks = 0
core.extsize = 0
core.nextents = 0
...
u.sfdir2.hdr.count = 3
u.sfdir2.hdr.i8count = 0
u.sfdir2.hdr.parent.i4 = 128
u.sfdir2.list[0].namelen = 15
u.sfdir2.list[0].offset = 0x30
u.sfdir2.list[0].name = "frame000000.tst"
u.sfdir2.list[0].inumber.i4 = 25165953
u.sfdir2.list[1].namelen = 15
u.sfdir2.list[1].offset = 0x70
u.sfdir2.list[1].name = "frame000002.tst"
u.sfdir2.list[1].inumber.i4 = 25165955
u.sfdir2.list[2].namelen = 15
u.sfdir2.list[2].offset = 0x90
u.sfdir2.list[2].name = "frame000003.tst"
u.sfdir2.list[2].inumber.i4 = 25165956
</programlisting>
<para>In the raw disk data, the space beyond the shortform entries is invalid and could be non-zero:</para>
<programlisting>
xfs_db&gt; type text
xfs_db&gt; p
00: 49 4e 41 ed 01 01 00 02 00 00 00 00 00 00 00 00 INA.............
10: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 03 ................
20: 44 b2 45 a2 09 fd e4 50 44 b2 45 a3 12 ee b5 d0 D.E....PD.E.....
30: 44 b2 45 a3 12 ee b5 d0 00 00 00 00 00 00 00 48 D.E............H
40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
50: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 ................
60: ff ff ff ff 03 00 00 00 00 80 0f 00 30 66 72 61 ............0fra
70: 6d 65 30 30 30 30 30 30 2e 74 73 74 01 80 00 81 me000000.tst....
80: 0f 00 70 66 72 61 6d 65 30 30 30 30 30 32 2e 74 ..pframe000002.t
90: 73 74 01 80 00 83 0f 00 90 66 72 61 6d 65 30 30 st.......frame00
a0: 30 30 30 33 2e 74 73 74 01 80 00 84 0f 00 90 66 0003.tst.......f
b0: 72 61 6d 65 30 30 30 30 30 33 2e 74 73 74 01 80 rame000003.tst..
c0: 00 84 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................
</programlisting>
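<para>The dump above can be decoded by hand with a few lines of C. The following sketch handles only the case shown in this example, 4-byte inode numbers with <command>i8count</command> equal to zero. The names <command>sf_decode()</command>, <command>struct sf_entry</command> and <command>get_be()</command> are hypothetical, and the two-byte width of the offset field is inferred from the bytes in the dump rather than defined in this chapter.</para>
<programlisting language="C"><![CDATA[
#include <stddef.h>
#include <stdint.h>

/* One decoded shortform entry; name points into the caller's buffer. */
struct sf_entry {
    uint8_t        namelen;
    uint16_t       offset;
    const uint8_t *name;        /* not NUL-terminated */
    uint64_t       inumber;
};

/* Read an n-byte big-endian unsigned integer. */
static uint64_t get_be(const uint8_t *p, size_t n)
{
    uint64_t v = 0;

    while (n--)
        v = (v << 8) | *p++;
    return v;
}

/*
 * Decode shortform directory data starting at the header.
 * Stores the parent inode number in *parent and up to max entries in out[].
 * Returns the number of entries decoded, or -1 if the buffer is too short
 * or 8-byte inode numbers are in use (not handled by this sketch).
 */
int sf_decode(const uint8_t *buf, size_t len, uint64_t *parent,
              struct sf_entry *out, int max)
{
    size_t pos = 6;             /* count + i8count + 4-byte parent */
    int i;

    if (len < 6 || buf[1] != 0)
        return -1;
    *parent = get_be(buf + 2, 4);
    for (i = 0; i < buf[0] && i < max; i++) {
        if (pos + 3 > len)
            return -1;
        out[i].namelen = buf[pos];
        out[i].offset = (uint16_t)get_be(buf + pos + 1, 2);
        if (pos + 3 + out[i].namelen + 4 > len)
            return -1;
        out[i].name = buf + pos + 3;
        out[i].inumber = get_be(out[i].name + out[i].namelen, 4);
        pos += 3 + (size_t)out[i].namelen + 4;
    }
    return i;
}
]]></programlisting>
<para>Fed the bytes from offset 0x64 of the dump (the shortform header follows the inode core), this yields three entries with parent inode 128, the first being frame000000.tst with inode number 25165953.</para>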
<para>TODO: 8-byte inode number example</para></section>
<section id="Block_Directories"><title>
Block Directories</title>
<para>When the shortform directory space exceeds the space in an inode, the directory data is moved into a new single directory block outside the inode. The inode's format is changed from "local" to "extent". Following is a list of points about block directories.</para>
<itemizedlist>
<listitem>
<para>All directory data is stored within the one directory block, including "." and ".." entries which are mandatory. </para>
</listitem>
<listitem>
<para>The block also contains "leaf" and "freespace index" information.</para>
</listitem>
<listitem>
<para>The location of the block is defined by the inode's in-core extent list (<xref linkend="Extent_List"/>): the <command>di_u.u_bmx[0]</command> value. The file offset in the extent must always be zero and the <command>length</command> = (directory block size / filesystem block size). The block number points to the filesystem block containing the directory data.</para>
</listitem>
<listitem>
<para>Block directory data is stored in the following structures:</para>
<programlisting>
#define XFS_DIR2_DATA_FD_COUNT 3
typedef struct xfs_dir2_block {
xfs_dir2_data_hdr_t hdr;
xfs_dir2_data_union_t u[1];
xfs_dir2_leaf_entry_t leaf[1];
xfs_dir2_block_tail_t tail;
} xfs_dir2_block_t;
typedef struct xfs_dir2_data_hdr {
__uint32_t magic;
xfs_dir2_data_free_t bestfree[XFS_DIR2_DATA_FD_COUNT];
} xfs_dir2_data_hdr_t;
typedef struct xfs_dir2_data_free {
xfs_dir2_data_off_t offset;
xfs_dir2_data_off_t length;
} xfs_dir2_data_free_t;
typedef union {
xfs_dir2_data_entry_t entry;
xfs_dir2_data_unused_t unused;
} xfs_dir2_data_union_t;
typedef struct xfs_dir2_data_entry {
xfs_ino_t inumber;
__uint8_t namelen;
__uint8_t name[1];
xfs_dir2_data_off_t tag;
} xfs_dir2_data_entry_t;
typedef struct xfs_dir2_data_unused {
__uint16_t freetag; /* 0xffff */
xfs_dir2_data_off_t length;
xfs_dir2_data_off_t tag;
} xfs_dir2_data_unused_t;
typedef struct xfs_dir2_leaf_entry {
xfs_dahash_t hashval;
xfs_dir2_dataptr_t address;
} xfs_dir2_leaf_entry_t;
typedef struct xfs_dir2_block_tail {
__uint32_t count;
__uint32_t stale;
} xfs_dir2_block_tail_t;
</programlisting>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/43.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>43</phrase></textobject>
</mediaobject>
</para>
</listitem>
<listitem>
<para>The <command>tag</command> in the <command>xfs_dir2_data_entry_t</command> structure stores its offset from the start of the block.</para>
</listitem>
<listitem>
<para>The start of a free space region is marked with the <command>xfs_dir2_data_unused_t</command> structure, where the <command>freetag</command> is <command>0xffff</command>. The <command>freetag</command> and <command>length</command> overwrite the <command>inumber</command> of an entry. The <command>tag</command> is located at <command>length - sizeof(tag)</command> from the start of the <command>unused</command> entry on disk.</para>
</listitem>
<listitem>
<para>The <command>bestfree</command> array in the header points to as many as three of the largest regions of free space within the block for storing new entries, sorted from largest to third largest. If there are fewer than three empty regions, the remaining <command>bestfree</command> elements are zeroed. The <command>offset</command> specifies the offset from the start of the block in bytes, and the <command>length</command> specifies the size of the free space in bytes. The location each points to must contain the above <command>xfs_dir2_data_unused_t</command> structure. As a block cannot exceed 64KB in size, each is a 16-bit value. <command>bestfree</command> is used to optimise the time required to locate space to create an entry. It saves scanning through the block to find a suitable location for every entry created.</para>
</listitem>
<listitem>
<para>The <command>tail</command> structure specifies the number of elements in the <command>leaf</command> array and the number of <command>stale</command> entries in the array. The <command>tail</command> is always located at the end of the block. The <command>leaf</command> data immediately precedes the <command>tail</command> structure.</para>
</listitem>
<listitem>
<para>The <command>leaf</command> array, which grows from the end of the block just before the <command>tail</command> structure, contains an array of hash/address pairs for quickly looking up a name by a hash value. Hash values are covered by the introduction to directories. The <command>address</command> on-disk is the offset into the block divided by 8 (<command>XFS_DIR2_DATA_ALIGN</command>). Hash/address pairs are stored on disk to optimise lookup speed for large directories. If they were not stored, the hashes would have to be calculated for all entries each time a lookup occurs in a directory.</para>
</listitem>
</itemizedlist>
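<para>Since the <command>tail</command> sits at the very end of the block and the <command>leaf</command> array immediately precedes it, both can be located from the block size and <command>tail.count</command> alone. The sketch below illustrates the arithmetic. The helper names are hypothetical, and it assumes that <command>xfs_dahash_t</command> is a 32-bit type, so that each leaf entry occupies 8 bytes.</para>
<programlisting language="C"><![CDATA[
#include <stdint.h>

#define LEAF_ENTRY_BYTES 8u     /* hashval (4) + address (4), assumed sizes */
#define BLOCK_TAIL_BYTES 8u     /* count (4) + stale (4) */

/* Byte offset of xfs_dir2_block_tail_t within a directory block. */
uint32_t block_tail_off(uint32_t dirblksize)
{
    return dirblksize - BLOCK_TAIL_BYTES;
}

/* Byte offset of leaf[0], given the tail's count of leaf entries. */
uint32_t block_leaf_off(uint32_t dirblksize, uint32_t count)
{
    return block_tail_off(dirblksize) - count * LEAF_ENTRY_BYTES;
}
]]></programlisting>
<para>For a 4096-byte directory block with 10 leaf entries, as in the example below, the tail is at offset 0xff8 and the first leaf entry at 0xfa8.</para>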
<bridgehead>xfs_db Example:</bridgehead>
<para>A directory is created with 8 entries, directory block size = filesystem block size:</para>
<programlisting>
xfs_db&gt; sb 0
xfs_db&gt; p
magicnum = 0x58465342
blocksize = 4096
...
dirblklog = 0
...
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
core.magic = 0x494e
core.mode = 040755
core.version = 1
core.format = 2 (extents)
core.nlinkv1 = 2
...
core.size = 4096
core.nblocks = 1
core.extsize = 0
core.nextents = 1
...
u.bmx[0] = [startoff,startblock,blockcount,extentflag] 0:[0,2097164,1,0]
</programlisting>
<para>Go to the "startblock" and show the raw disk data:</para>
<programlisting>
xfs_db&gt; dblock 0
xfs_db&gt; type text
xfs_db&gt; p
000: 58 44 32 42 01 30 0e 78 00 00 00 00 00 00 00 00 XD2B.0.x........
010: 00 00 00 00 02 00 00 80 01 2e 00 00 00 00 00 10 ................
020: 00 00 00 00 00 00 00 80 02 2e 2e 00 00 00 00 20 ................
030: 00 00 00 00 02 00 00 81 0f 66 72 61 6d 65 30 30 .........frame00
040: 30 30 30 30 2e 74 73 74 80 8e 59 00 00 00 00 30 0000.tst..Y....0
050: 00 00 00 00 02 00 00 82 0f 66 72 61 6d 65 30 30 .........frame00
060: 30 30 30 31 2e 74 73 74 d0 ca 5c 00 00 00 00 50 0001.tst.......P
070: 00 00 00 00 02 00 00 83 0f 66 72 61 6d 65 30 30 .........frame00
080: 30 30 30 32 2e 74 73 74 00 00 00 00 00 00 00 70 0002.tst.......p
090: 00 00 00 00 02 00 00 84 0f 66 72 61 6d 65 30 30 .........frame00
0a0: 30 30 30 33 2e 74 73 74 00 00 00 00 00 00 00 90 0003.tst........
0b0: 00 00 00 00 02 00 00 85 0f 66 72 61 6d 65 30 30 .........frame00
0c0: 30 30 30 34 2e 74 73 74 00 00 00 00 00 00 00 b0 0004.tst........
0d0: 00 00 00 00 02 00 00 86 0f 66 72 61 6d 65 30 30 .........frame00
0e0: 30 30 30 35 2e 74 73 74 00 00 00 00 00 00 00 d0 0005.tst........
0f0: 00 00 00 00 02 00 00 87 0f 66 72 61 6d 65 30 30 .........frame00
100: 30 30 30 36 2e 74 73 74 00 00 00 00 00 00 00 f0 0006.tst........
110: 00 00 00 00 02 00 00 88 0f 66 72 61 6d 65 30 30 .........frame00
120: 30 30 30 37 2e 74 73 74 00 00 00 00 00 00 01 10 0007.tst........
130: ff ff 0e 78 00 00 00 00 00 00 00 00 00 00 00 00 ...x............
</programlisting>
<para>The "leaf" and "tail" structures are stored at the end of the block, so as the directory grows, the middle is filled in:</para>
<programlisting>
fa0: 00 00 00 00 00 00 01 30 00 00 00 2e 00 00 00 02 .......0........
fb0: 00 00 17 2e 00 00 00 04 83 a0 40 b4 00 00 00 0e ................
fc0: 93 a0 40 b4 00 00 00 12 a3 a0 40 b4 00 00 00 06 ................
fd0: b3 a0 40 b4 00 00 00 0a c3 a0 40 b4 00 00 00 1e ................
fe0: d3 a0 40 b4 00 00 00 22 e3 a0 40 b4 00 00 00 16 ................
ff0: f3 a0 40 b4 00 00 00 1a 00 00 00 0a 00 00 00 00 ................
</programlisting>
<para>In a readable format:</para>
<programlisting>
xfs_db&gt; type dir2
xfs_db&gt; p
bhdr.magic = 0x58443242
bhdr.bestfree[0].offset = 0x130
bhdr.bestfree[0].length = 0xe78
bhdr.bestfree[1].offset = 0
bhdr.bestfree[1].length = 0
bhdr.bestfree[2].offset = 0
bhdr.bestfree[2].length = 0
bu[0].inumber = 33554560
bu[0].namelen = 1
bu[0].name = "."
bu[0].tag = 0x10
bu[1].inumber = 128
bu[1].namelen = 2
bu[1].name = ".."
bu[1].tag = 0x20
bu[2].inumber = 33554561
bu[2].namelen = 15
bu[2].name = "frame000000.tst"
bu[2].tag = 0x30
bu[3].inumber = 33554562
bu[3].namelen = 15
bu[3].name = "frame000001.tst"
bu[3].tag = 0x50
...
bu[8].inumber = 33554567
bu[8].namelen = 15
bu[8].name = "frame000006.tst"
bu[8].tag = 0xf0
bu[9].inumber = 33554568
bu[9].namelen = 15
bu[9].name = "frame000007.tst"
bu[9].tag = 0x110
bu[10].freetag = 0xffff
bu[10].length = 0xe78
bu[10].tag = 0x130
bleaf[0].hashval = 0x2e
bleaf[0].address = 0x2
bleaf[1].hashval = 0x172e
bleaf[1].address = 0x4
bleaf[2].hashval = 0x83a040b4
bleaf[2].address = 0xe
...
bleaf[8].hashval = 0xe3a040b4
bleaf[8].address = 0x16
bleaf[9].hashval = 0xf3a040b4
bleaf[9].address = 0x1a
btail.count = 10
btail.stale = 0
</programlisting>
<note>
<para>Note that with block directories, all xfs_db fields are preceded with "b". </para>
</note>
<para>For a simple lookup example, the hash of frame000000.tst is 0xa3a040b4. Looking up that value, we get an address of 0x6. Multiplying that by 8 gives offset 0x30, and the inode at that point is 33554561.</para>
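<para>The hash values in the listing can be reproduced in user space. The function below is a sketch modelled on the kernel's <command>xfs_da_hashname()</command>. It is written from the known behaviour of that function, not copied from this chapter, so treat it as an illustration. The names <command>dir_hashname()</command> and <command>leaf_address_to_offset()</command> are hypothetical.</para>
<programlisting language="C"><![CDATA[
#include <stddef.h>
#include <stdint.h>

/* Rotate a 32-bit value left by s bits (s must be in the range 1..31). */
static uint32_t rol32(uint32_t v, unsigned s)
{
    return (v << s) | (v >> (32 - s));
}

/* Hash a directory entry name, consuming four bytes per round. */
uint32_t dir_hashname(const uint8_t *name, size_t len)
{
    uint32_t hash = 0;

    for (; len >= 4; len -= 4, name += 4)
        hash = ((uint32_t)name[0] << 21) ^ ((uint32_t)name[1] << 14) ^
               ((uint32_t)name[2] << 7) ^ (uint32_t)name[3] ^
               rol32(hash, 28);

    /* Fold in the remaining 0 to 3 bytes. */
    switch (len) {
    case 3:
        return ((uint32_t)name[0] << 14) ^ ((uint32_t)name[1] << 7) ^
               (uint32_t)name[2] ^ rol32(hash, 21);
    case 2:
        return ((uint32_t)name[0] << 7) ^ (uint32_t)name[1] ^
               rol32(hash, 14);
    case 1:
        return (uint32_t)name[0] ^ rol32(hash, 7);
    default:
        return hash;
    }
}

/* Convert a leaf entry's address to a byte offset (XFS_DIR2_DATA_ALIGN = 8). */
uint32_t leaf_address_to_offset(uint32_t address)
{
    return address * 8u;
}
]]></programlisting>
<para>Hashing "." and ".." gives 0x2e and 0x172e, the first two <command>bleaf</command> hash values in the listing above, and an address of 0x6 converts to byte offset 0x30.</para>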
<para>When we remove an entry from the middle (frame000004.tst), we can see how the freespace details are adjusted:</para>
<programlisting>
bhdr.magic = 0x58443242
bhdr.bestfree[0].offset = 0x130
bhdr.bestfree[0].length = 0xe78
bhdr.bestfree[1].offset = 0xb0
bhdr.bestfree[1].length = 0x20
bhdr.bestfree[2].offset = 0
bhdr.bestfree[2].length = 0
...
bu[5].inumber = 33554564
bu[5].namelen = 15
bu[5].name = "frame000003.tst"
bu[5].tag = 0x90
bu[6].freetag = 0xffff
bu[6].length = 0x20
bu[6].tag = 0xb0
bu[7].inumber = 33554566
bu[7].namelen = 15
bu[7].name = "frame000005.tst"
bu[7].tag = 0xd0
...
bleaf[7].hashval = 0xd3a040b4
bleaf[7].address = 0x22
bleaf[8].hashval = 0xe3a040b4
bleaf[8].address = 0
bleaf[9].hashval = 0xf3a040b4
bleaf[9].address = 0x1a
btail.count = 10
btail.stale = 1
</programlisting>
<para>A new "bestfree" value is added for the entry. The start of the entry is marked as unused with 0xffff (which overwrites the inode number of the former entry), followed by the length of the free space. The tag remains intact at <command>offset+length - sizeof(tag)</command>. The address for the hash is also cleared. The affected areas are highlighted below:</para>
<mediaobject>
<imageobject><imagedata fileref="images/code/46.png" format="PNG" /></imageobject>
<textobject><phrase>code46</phrase></textobject>
</mediaobject>
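<para>The size of the freed region (0x20) and the position of its tag follow from the entry layout. The sketch below is an illustration with hypothetical helper names. Rounding the entry size up to 8 bytes is an assumption inferred from <command>XFS_DIR2_DATA_ALIGN</command> and from the tag values in the listings above.</para>
<programlisting language="C"><![CDATA[
#include <stdint.h>

#define DIR2_DATA_ALIGN 8u      /* XFS_DIR2_DATA_ALIGN */

/*
 * On-disk size of a data entry: inumber (8) + namelen (1) + name + tag (2),
 * rounded up to the 8-byte alignment (assumed, see the lead-in).
 */
uint32_t data_entry_bytes(uint8_t namelen)
{
    uint32_t raw = 8u + 1u + namelen + 2u;

    return (raw + DIR2_DATA_ALIGN - 1u) & ~(DIR2_DATA_ALIGN - 1u);
}

/* Block offset of the 2-byte tag that ends an unused region. */
uint32_t unused_tag_off(uint32_t offset, uint32_t length)
{
    return offset + length - 2u;
}
]]></programlisting>
<para>A 15-character name occupies 0x20 bytes, so deleting frame000004.tst leaves a 0x20-byte hole at 0xb0 whose tag is stored at offset 0xce.</para>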
</section>
<section id="Leaf_Directories"><title>
Leaf Directories</title>
<para>Once a Block Directory (<xref linkend="Block_Directories"/>) has filled the block, the directory data is changed into a new format. It still uses extents (<xref linkend="Data_Extents"/>) and the same basic structures, but the "data" and "leaf" are split up into their own extents. The "leaf" information only occupies one extent. As "leaf" information is more compact than "data" information, more than one "data" extent is common.</para>
<itemizedlist>
<listitem>
<para>Block to Leaf conversions retain the existing block for the data entries and allocate a new block for the leaf and freespace index information.</para>
</listitem>
<listitem>
<para>As with all directories, data blocks must start at logical offset zero. </para>
</listitem>
<listitem>
<para>The "leaf" block has a special offset defined by <command>XFS_DIR2_LEAF_OFFSET</command>. Currently, this is 32GB, which in the extent view is a block offset of 32GB / <command>sb_blocksize</command>. On a 4KB block filesystem, this is 0x800000 (8388608 decimal).</para>
</listitem>
<listitem>
<para>The "data" extents have a new header (no "leaf" data):</para>
<programlisting>
typedef struct xfs_dir2_data {
xfs_dir2_data_hdr_t hdr;
xfs_dir2_data_union_t u[1];
} xfs_dir2_data_t;
</programlisting>
</listitem>
<listitem>
<para>The "leaf" extent uses the following structures:</para>
<programlisting>
typedef struct xfs_dir2_leaf {
xfs_dir2_leaf_hdr_t hdr;
xfs_dir2_leaf_entry_t ents[1];
xfs_dir2_data_off_t bests[1];
xfs_dir2_leaf_tail_t tail;
} xfs_dir2_leaf_t;
typedef struct xfs_dir2_leaf_hdr {
xfs_da_blkinfo_t info;
__uint16_t count;
__uint16_t stale;
} xfs_dir2_leaf_hdr_t;
typedef struct xfs_dir2_leaf_tail {
__uint32_t bestcount;
} xfs_dir2_leaf_tail_t;
</programlisting>
</listitem>
<listitem>
<para>The leaves use the <command>xfs_da_blkinfo_t</command> filesystem block header. This header is used for directory and extended attribute (<xref linkend="Extended_Attributes"/>) leaves and B+tree nodes:</para>
<programlisting>
typedef struct xfs_da_blkinfo {
__be32 forw;
__be32 back;
__be16 magic;
__be16 pad;
} xfs_da_blkinfo_t;
</programlisting>
</listitem>
<listitem>
<para>The size of the <command>ents</command> array is specified by <command>hdr.count</command>.</para>
</listitem>
<listitem>
<para>The size of the <command>bests</command> array is specified by <command>tail.bestcount</command>, which is also the number of "data" blocks for the directory. The <command>bests</command> array maintains each data block's <command>bestfree[0].length</command> value.</para>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/48.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>48</phrase></textobject>
</mediaobject>
</para>
</listitem>
</itemizedlist>
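<para>The leaf offset arithmetic above can be sketched as follows. The helper names are hypothetical and the code is an illustration of the 32GB rule only, not kernel code.</para>
<programlisting language="C"><![CDATA[
#include <stdint.h>

/* XFS_DIR2_LEAF_OFFSET expressed in bytes: 32GB, per the text above. */
#define DIR2_LEAF_OFFSET_BYTES (32ULL << 30)

/* Logical filesystem block offset at which the "leaf" extent starts. */
uint64_t leaf_first_fsblock(uint32_t sb_blocksize)
{
    return sb_blocksize ? DIR2_LEAF_OFFSET_BYTES / sb_blocksize : 0;
}

/* Non-zero if an extent starting at startoff lies in the "data" region. */
int is_data_extent(uint64_t startoff, uint32_t sb_blocksize)
{
    return startoff < leaf_first_fsblock(sb_blocksize);
}
]]></programlisting>
<para>For a 4096-byte filesystem block, <command>leaf_first_fsblock()</command> returns 8388608 (0x800000), so an extent with a <command>startoff</command> of 0 or 1 holds "data" and one at 8388608 does not.</para>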
<bridgehead>xfs_db Example:</bridgehead>
<para>For this example, a directory was created with 256 entries (frame000000.tst to frame000255.tst) and then some files were deleted (frame00005*, frame00018* and frame000240.tst) to show free list characteristics.</para>
<programlisting>
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
core.magic = 0x494e
core.mode = 040755
core.version = 1
core.format = 2 (extents)
core.nlinkv1 = 2
...
core.size = 12288
core.nblocks = 4
core.extsize = 0
core.nextents = 3
...
u.bmx[0-2] = [startoff,startblock,blockcount,extentflag]
0:[0,4718604,1,0]
1:[1,4718610,2,0]
2:[8388608,4718605,1,0]
</programlisting>
<para>As can be seen in this example, three blocks are used for "data" in two extents, and the "leaf" extent has a logical offset of 8388608 blocks (32GB).</para>
<para>Examining the first block:</para>
<programlisting>
xfs_db&gt; dblock 0
xfs_db&gt; type dir2
xfs_db&gt; p
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0x670
dhdr.bestfree[0].length = 0x140
dhdr.bestfree[1].offset = 0xff0
dhdr.bestfree[1].length = 0x10
dhdr.bestfree[2].offset = 0
dhdr.bestfree[2].length = 0
du[0].inumber = 75497600
du[0].namelen = 1
du[0].name = "."
du[0].tag = 0x10
du[1].inumber = 128
du[1].namelen = 2
du[1].name = ".."
du[1].tag = 0x20
du[2].inumber = 75497601
du[2].namelen = 15
du[2].name = "frame000000.tst"
du[2].tag = 0x30
du[3].inumber = 75497602
du[3].namelen = 15
du[3].name = "frame000001.tst"
du[3].tag = 0x50
...
du[51].inumber = 75497650
du[51].namelen = 15
du[51].name = "frame000049.tst"
du[51].tag = 0x650
du[52].freetag = 0xffff
du[52].length = 0x140
du[52].tag = 0x670
du[53].inumber = 75497661
du[53].namelen = 15
du[53].name = "frame000060.tst"
du[53].tag = 0x7b0
...
du[118].inumber = 75497758
du[118].namelen = 15
du[118].name = "frame000125.tst"
du[118].tag = 0xfd0
du[119].freetag = 0xffff
du[119].length = 0x10
du[119].tag = 0xff0
</programlisting>
<para>Note that the xfs_db field output is preceded by a "d" for "data".</para>
<para>The next "data" block:</para>
<programlisting>
xfs_db&gt; dblock 1
xfs_db&gt; type dir2
xfs_db&gt; p
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0x6d0
dhdr.bestfree[0].length = 0x140
dhdr.bestfree[1].offset = 0xe50
dhdr.bestfree[1].length = 0x20
dhdr.bestfree[2].offset = 0xff0
dhdr.bestfree[2].length = 0x10
du[0].inumber = 75497759
du[0].namelen = 15
du[0].name = "frame000126.tst"
du[0].tag = 0x10
...
du[53].inumber = 75497844
du[53].namelen = 15
du[53].name = "frame000179.tst"
du[53].tag = 0x6b0
du[54].freetag = 0xffff
du[54].length = 0x140
du[54].tag = 0x6d0
du[55].inumber = 75497855
du[55].namelen = 15
du[55].name = "frame000190.tst"
du[55].tag = 0x810
...
du[104].inumber = 75497904
du[104].namelen = 15
du[104].name = "frame000239.tst"
du[104].tag = 0xe30
du[105].freetag = 0xffff
du[105].length = 0x20
du[105].tag = 0xe50
du[106].inumber = 75497906
du[106].namelen = 15
du[106].name = "frame000241.tst"
du[106].tag = 0xe70
...
du[117].inumber = 75497917
du[117].namelen = 15
du[117].name = "frame000252.tst"
du[117].tag = 0xfd0
du[118].freetag = 0xffff
du[118].length = 0x10
du[118].tag = 0xff0
</programlisting>
<para>And the last data block:</para>
<programlisting>
xfs_db&gt; dblock 2
xfs_db&gt; type dir2
xfs_db&gt; p
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0x70
dhdr.bestfree[0].length = 0xf90
dhdr.bestfree[1].offset = 0
dhdr.bestfree[1].length = 0
dhdr.bestfree[2].offset = 0
dhdr.bestfree[2].length = 0
du[0].inumber = 75497918
du[0].namelen = 15
du[0].name = "frame000253.tst"
du[0].tag = 0x10
du[1].inumber = 75497919
du[1].namelen = 15
du[1].name = "frame000254.tst"
du[1].tag = 0x30
du[2].inumber = 75497920
du[2].namelen = 15
du[2].name = "frame000255.tst"
du[2].tag = 0x50
du[3].freetag = 0xffff
du[3].length = 0xf90
du[3].tag = 0x70
</programlisting>
<para>Examining the "leaf" block (with the fields preceded by an "l" for "leaf"):</para>
<para>The directory before deleting some entries:</para>
<programlisting>
xfs_db&gt; dblock 8388608
xfs_db&gt; type dir2
xfs_db&gt; p
lhdr.info.forw = 0
lhdr.info.back = 0
lhdr.info.magic = 0xd2f1
lhdr.count = 258
lhdr.stale = 0
lbests[0-2] = 0:0x10 1:0x10 2:0xf90
lents[0].hashval = 0x2e
lents[0].address = 0x2
lents[1].hashval = 0x172e
lents[1].address = 0x4
lents[2].hashval = 0x23a04084
lents[2].address = 0x116
...
lents[257].hashval = 0xf3a048bc
lents[257].address = 0x366
ltail.bestcount = 3
</programlisting>
<para>Note how the <command>lbests</command> array corresponds with the <command>bestfree[0].length</command> values in the "data" blocks:</para>
<programlisting>
xfs_db&gt; dblock 0
xfs_db&gt; type dir2
xfs_db&gt; p
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0xff0
dhdr.bestfree[0].length = 0x10
...
xfs_db&gt; dblock 1
xfs_db&gt; type dir2
xfs_db&gt; p
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0xff0
dhdr.bestfree[0].length = 0x10
...
xfs_db&gt; dblock 2
xfs_db&gt; type dir2
xfs_db&gt; p
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0x70
dhdr.bestfree[0].length = 0xf90
</programlisting>
<para>Now after the entries have been deleted:</para>
<programlisting>
xfs_db&gt; dblock 8388608
xfs_db&gt; type dir2
xfs_db&gt; p
lhdr.info.forw = 0
lhdr.info.back = 0
lhdr.info.magic = 0xd2f1
lhdr.count = 258
lhdr.stale = 21
lbests[0-2] = 0:0x140 1:0x140 2:0xf90
lents[0].hashval = 0x2e
lents[0].address = 0x2
lents[1].hashval = 0x172e
lents[1].address = 0x4
lents[2].hashval = 0x23a04084
lents[2].address = 0x116
...
</programlisting>
<para>As can be seen, the <command>lbests</command> values have been updated to contain each data block's <command>hdr.bestfree[0].length</command> value. The leaf's <command>hdr.stale</command> value has also been updated to specify the number of stale entries in the array. The stale entries have an address of zero.</para>
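<para>The relationship between the leaf block's bests and the data blocks can be expressed as a small consistency check. This is a sketch with a hypothetical function name. It assumes the caller has already read the <command>lbests</command> array and each data block's <command>bestfree[0].length</command> into host-order arrays.</para>
<programlisting language="C"><![CDATA[
#include <stddef.h>
#include <stdint.h>

/*
 * Compare the leaf block's bests[] array with the bestfree[0].length value
 * read from each data block. Returns the index of the first data block
 * whose values differ, or -1 if all nblocks entries agree.
 */
long bests_first_mismatch(const uint16_t *lbests,
                          const uint16_t *bestfree0_len, size_t nblocks)
{
    size_t i;

    for (i = 0; i < nblocks; i++)
        if (lbests[i] != bestfree0_len[i])
            return (long)i;
    return -1;
}
]]></programlisting>
<para>With the values from the example after deletion (0x140, 0x140 and 0xf90 on both sides), the check reports no mismatch.</para>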
<para>TODO: Need an example for where new entries get inserted with several large free spaces.</para></section>
<section id="Node_Directories"><title>
Node Directories</title>
<para>When the "leaf" information fills a block, the extents undergo another separation. All "freeindex" information moves into its own extent. In Leaf Directories (<xref linkend="Leaf_Directories"/>), the single "leaf" block maintained the best free space information for each "data" block. This is not possible with more than one leaf.</para>
<itemizedlist>
<listitem>
<para>The "data" blocks stay the same as leaf directories.</para>
</listitem>
<listitem>
<para>The "leaf" blocks eventually change into a B+tree with the generic B+tree header pointing to directory "leaves" as described in Leaf Directories. The top-level blocks are called "nodes". The directory can exist in a state where there is still a single leaf block before it is split. Whether a block is a node or a leaf has to be determined by inspecting the magic value in the header. The combined leaf/freeindex block has a magic value of <command>XFS_DIR2_LEAF1_MAGIC (0xd2f1)</command>, a node directory's leaf/leaves have a magic value of <command>XFS_DIR2_LEAFN_MAGIC (0xd2ff)</command> and intermediate nodes have a magic value of <command>XFS_DA_NODE_MAGIC (0xfebe)</command>.</para>
</listitem>
<listitem>
<para>The new "freeindex" block(s) contain only the bests for each data block.</para>
</listitem>
<listitem>
<para>The freeindex block uses the following structures:</para>
<programlisting>
typedef struct xfs_dir2_free_hdr {
__uint32_t magic;
__int32_t firstdb;
__int32_t nvalid;
__int32_t nused;
} xfs_dir2_free_hdr_t;
typedef struct xfs_dir2_free {
xfs_dir2_free_hdr_t hdr;
xfs_dir2_data_off_t bests[1];
} xfs_dir2_free_t;
</programlisting>
</listitem>
<listitem>
<para>The leaf blocks can be located in any order; the only way to determine the appropriate leaf block is by the node block's hash/before values. Given a hash to look up, read the node's <command>btree</command> array and find the first <command>hashval</command> in the array that exceeds the given hash. The hash can then be found in the block pointed to by the <command>before</command> value.</para>
<programlisting>
typedef struct xfs_da_intnode {
struct xfs_da_node_hdr {
xfs_da_blkinfo_t info;
__uint16_t count;
__uint16_t level;
} hdr;
struct xfs_da_node_entry {
xfs_dahash_t hashval;
xfs_dablk_t before;
} btree[1];
} xfs_da_intnode_t;
</programlisting>
</listitem>
<listitem>
<para>The freeindex's <command>bests</command> array starts from the end of the block and grows to the start of the block.</para>
</listitem>
<listitem>
<para>When a data block becomes unused (i.e. all entries in it have been deleted), the block is freed, the data extents contain a hole, the freeindex's <command>hdr.nused</command> value is decremented and the associated <command>bests[]</command> entry is set to 0xffff.</para>
</listitem>
<listitem>
<para>As the first data block always contains "." and "..", it's invalid for the directory to have a hole at the start.</para>
</listitem>
<listitem>
<para>The freeindex's <command>hdr.nused</command> should always be the same as the number of allocated data directory blocks containing name/inode data and will always be less than or equal to <command>hdr.nvalid</command>. <command>hdr.nvalid</command> should be the same as the index of the last data directory block plus one (i.e. when the last data block is freed, <command>nused</command> and <command>nvalid</command> are decremented).</para>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/54.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>54</phrase></textobject>
</mediaobject>
</para>
</listitem>
</itemizedlist>
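<para>The node lookup step described above can be sketched as a scan over the <command>btree</command> array. This is an illustration only: the function and structure names are hypothetical, both fields are assumed to be 32 bits wide and already converted to host byte order, and entries are assumed to be sorted by ascending <command>hashval</command>. The comparison uses greater-than-or-equal on the assumption that each <command>hashval</command> is the largest hash stored in the block it points to.</para>
<programlisting language="C"><![CDATA[
#include <stddef.h>
#include <stdint.h>

/* Host-order copy of one node entry (field widths are assumptions). */
struct node_entry {
    uint32_t hashval;
    uint32_t before;
};

/*
 * Find the child block that may contain hash.
 * Returns 1 and stores the block's directory offset in *child when an
 * entry with hashval >= hash exists, otherwise returns 0.
 */
int node_pick_child(const struct node_entry *btree, size_t count,
                    uint32_t hash, uint32_t *child)
{
    size_t i;

    for (i = 0; i < count; i++) {
        if (btree[i].hashval >= hash) {
            *child = btree[i].before;
            return 1;
        }
    }
    return 0;   /* hash is larger than anything in this directory */
}
]]></programlisting>
<para>With the two entries from the xfs_db example in this section, [0xa3a440ac,8388616] and [0xf3a440bc,8388612], a lookup of hash 0x2e selects offset 8388616, a hash of 0xb0000000 selects 8388612, and 0xffffffff is not found. A real implementation would typically use a binary search; the linear scan keeps the sketch short.</para>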
<bridgehead>xfs_db Example:</bridgehead>
<para>For the node directory examples, we are using a filesystem with a 4KB block size and a 16KB directory block size. The directory has over 2000 entries:</para>
<programlisting>
xfs_db&gt; sb 0
xfs_db&gt; p
magicnum = 0x58465342
blocksize = 4096
...
dirblklog = 2
...
xfs_db&gt; inode &lt;inode#&gt;
xfs_db&gt; p
core.magic = 0x494e
core.mode = 040755
core.version = 1
core.format = 2 (extents)
...
core.size = 81920
core.nblocks = 36
core.extsize = 0
core.nextents = 8
...
u.bmx[0-7] = [startoff,startblock,blockcount,extentflag] 0:[0,7368,4,0]
1:[4,7408,4,0] 2:[8,7444,4,0] 3:[12,7480,4,0] 4:[16,7520,4,0]
5:[8388608,7396,4,0] 6:[8388612,7524,8,0] 7:[16777216,7516,4,0]
</programlisting>
<para>As can already be observed, all extents are allocated in multiples of 4 blocks.</para>
<para>Blocks 0 to 19 (16+4-1) are used for the data. Looking at blocks 16-19, it can be seen that the layout is the same as in the single-leaf format, except that the <command>length</command> values are a lot larger to accommodate the increased directory block size:</para>
<programlisting>
xfs_db&gt; dblock 16
xfs_db&gt; type dir2
xfs_db&gt; p
dhdr.magic = 0x58443244
dhdr.bestfree[0].offset = 0xb0
dhdr.bestfree[0].length = 0x3f50
dhdr.bestfree[1].offset = 0
dhdr.bestfree[1].length = 0
dhdr.bestfree[2].offset = 0
dhdr.bestfree[2].length = 0
du[0].inumber = 120224
du[0].namelen = 15
du[0].name = "frame002043.tst"
du[0].tag = 0x10
du[1].inumber = 120225
du[1].namelen = 15
du[1].name = "frame002044.tst"
du[1].tag = 0x30
du[2].inumber = 120226
du[2].namelen = 15
du[2].name = "frame002045.tst"
du[2].tag = 0x50
du[3].inumber = 120227
du[3].namelen = 15
du[3].name = "frame002046.tst"
du[3].tag = 0x70
du[4].inumber = 120228
du[4].namelen = 15
du[4].name = "frame002047.tst"
du[4].tag = 0x90
du[5].freetag = 0xffff
du[5].length = 0x3f50
du[5].tag = 0
</programlisting>
<para>Next is the "node" block. The fields of node blocks are prefixed with 'n':</para>
<programlisting>
xfs_db&gt; dblock 8388608
xfs_db&gt; type dir2
xfs_db&gt; p
nhdr.info.forw = 0
nhdr.info.back = 0
nhdr.info.magic = 0xfebe
nhdr.count = 2
nhdr.level = 1
nbtree[0-1] = [hashval,before] 0:[0xa3a440ac,8388616] 1:[0xf3a440bc,8388612]
</programlisting>
<para>The following leaf blocks were allocated at once, as XFS knows it needs at least two blocks when allocating a B+tree, so the extent length is 8 fsblocks. Entries with hashes less than or equal to 0xa3a440ac are located in the leaf at directory offset 8388616, and the remaining entries with hashes up to and including 0xf3a440bc are in the leaf at offset 8388612. Hashes above 0xf3a440bc don't exist in this directory.</para>
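<para>Choosing the leaf for a given hash can be sketched as a search for the first node entry whose <command>hashval</command> is not smaller than the target. The Python below is a simplified illustration with hypothetical names. It ignores duplicate hash values that span leaves and simply returns <command>None</command> when the hash is above every key, which is not necessarily how the kernel handles those cases.</para>
<programlisting><![CDATA[
```python
from bisect import bisect_left

# (hashval, before) pairs copied from nbtree[0-1] in the node block above
NODE_ENTRIES = [(0xA3A440AC, 8388616), (0xF3A440BC, 8388612)]


def child_for_hash(entries, hashval):
    """Return the directory offset of the child block covering `hashval`.

    `entries` must be sorted by hashval; each key is the highest hash
    stored in the child it points to.
    """
    keys = [key for key, _before in entries]
    index = bisect_left(keys, hashval)   # first key >= hashval
    if index == len(entries):
        return None                      # hash is above everything in the tree
    return entries[index][1]
```
]]></programlisting>
<para>For example, a hash of 0x2e resolves to the leaf at 8388616, while 0xf3a26094 resolves to the leaf at 8388612.</para>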
<programlisting>
xfs_db&gt; dblock 8388616
xfs_db&gt; type dir2
xfs_db&gt; p
lhdr.info.forw = 8388612
lhdr.info.back = 0
lhdr.info.magic = 0xd2ff
lhdr.count = 1023
lhdr.stale = 0
lents[0].hashval = 0x2e
lents[0].address = 0x2
lents[1].hashval = 0x172e
lents[1].address = 0x4
lents[2].hashval = 0x23a04084
lents[2].address = 0x116
...
lents[1021].hashval = 0xa3a440a4
lents[1021].address = 0x1fa2
lents[1022].hashval = 0xa3a440ac
lents[1022].address = 0x1fca
xfs_db&gt; dblock 8388612
xfs_db&gt; type dir2
xfs_db&gt; p
lhdr.info.forw = 0
lhdr.info.back = 8388616
lhdr.info.magic = 0xd2ff
lhdr.count = 1027
lhdr.stale = 0
lents[0].hashval = 0xa3a440b4
lents[0].address = 0x1f52
lents[1].hashval = 0xa3a440bc
lents[1].address = 0x1f7a
...
lents[1025].hashval = 0xf3a440b4
lents[1025].address = 0x1f66
lents[1026].hashval = 0xf3a440bc
lents[1026].address = 0x1f8e
</programlisting>
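<para>The <command>hashval</command> fields in the leaf entries above are produced by <command>xfs_da_hashname()</command>. The following Python sketch reproduces that function as it is commonly implemented (a rotate-and-XOR over four bytes at a time). It is meant as an aid for checking values by hand and should be verified against the kernel source. The Python names are hypothetical.</para>
<programlisting><![CDATA[
```python
def rol32(value, shift):
    """Rotate a 32-bit value left by `shift` bits (0 < shift < 32)."""
    value &= 0xFFFFFFFF
    return ((value << shift) | (value >> (32 - shift))) & 0xFFFFFFFF


def da_hashname(name):
    """Hash a directory entry name the way xfs_da_hashname() is assumed to."""
    data = name.encode("utf-8") if isinstance(name, str) else bytes(name)
    h = 0
    # Consume the name four bytes at a time.
    while len(data) >= 4:
        h = ((data[0] << 21) ^ (data[1] << 14) ^ (data[2] << 7) ^ data[3]
             ^ rol32(h, 7 * 4))
        data = data[4:]
    # Fold in the remaining 0 to 3 bytes.
    if len(data) == 3:
        return (data[0] << 14) ^ (data[1] << 7) ^ data[2] ^ rol32(h, 7 * 3)
    if len(data) == 2:
        return (data[0] << 7) ^ data[1] ^ rol32(h, 7 * 2)
    if len(data) == 1:
        return data[0] ^ rol32(h, 7 * 1)
    return h
```
]]></programlisting>
<para>Under these assumptions, <command>da_hashname(".")</command> gives 0x2e and <command>da_hashname("..")</command> gives 0x172e, matching <command>lents[0]</command> and <command>lents[1]</command> in the first leaf block.</para>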
<para>An example lookup using xfs_db:</para>
<programlisting>
xfs_db&gt; hash frame001845.tst
0xf3a26094
Doing a binary search through the array, we get address 0x1ce6, which is
offset 0xe730. Each fsblock is 4KB in size (0x1000), so it will be offset
0x730 into directory offset 14. From the extent map, this will be fsblock
7482:
xfs_db&gt; fsblock 7482
xfs_db&gt; type text
xfs_db&gt; p
...
730: 00 00 00 00 00 01 d4 da 0f 66 72 61 6d 65 30 30 .........frame00
740: 31 38 34 35 2e 74 73 74 00 00 00 00 00 00 27 30 1845.tst.......0
</programlisting>
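<para>The arithmetic in this lookup can be written out as a short Python sketch. It assumes leaf addresses are expressed in 8-byte units (which is how 0x1ce6 becomes byte offset 0xe730) and that a v2 data entry is laid out as an 8-byte inode number, a 1-byte name length, the name, padding to an 8-byte boundary, and a trailing 2-byte tag. The function names are hypothetical.</para>
<programlisting><![CDATA[
```python
import struct

BLOCKSIZE = 4096     # sb_blocksize in the example
ADDRESS_UNIT = 8     # assumption: leaf addresses count 8-byte units

# Data extents from the inode's extent map: (startoff, startblock, blockcount)
DATA_EXTENTS = [(0, 7368, 4), (4, 7408, 4), (8, 7444, 4),
                (12, 7480, 4), (16, 7520, 4)]


def locate(address, extents=DATA_EXTENTS):
    """Map a leaf address to (fsblock, byte offset within that fsblock)."""
    dir_fsb, offset = divmod(address * ADDRESS_UNIT, BLOCKSIZE)
    for startoff, startblock, blockcount in extents:
        if startoff <= dir_fsb < startoff + blockcount:
            return startblock + (dir_fsb - startoff), offset
    raise LookupError("directory offset %d falls in a hole" % dir_fsb)


def parse_entry(raw):
    """Decode one data entry: returns (inumber, name, tag)."""
    inumber, namelen = struct.unpack_from(">QB", raw, 0)   # big-endian on disk
    name = raw[9:9 + namelen].decode("ascii")
    size = (8 + 1 + namelen + 2 + 7) & ~7                  # round up to 8 bytes
    (tag,) = struct.unpack_from(">H", raw, size - 2)
    return inumber, name, tag


# The 32 bytes shown at offset 0x730 of fsblock 7482
RAW = bytes.fromhex(
    "00 00 00 00 00 01 d4 da 0f 66 72 61 6d 65 30 30 "
    "31 38 34 35 2e 74 73 74 00 00 00 00 00 00 27 30"
)
```
]]></programlisting>
<para>Here <command>locate(0x1ce6)</command> yields fsblock 7482 at offset 0x730, and decoding the bytes gives inode 120026 with tag 0x2730, which is the entry's offset within its 16KB directory block (0xe730 - 3 * 0x4000).</para>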
<para>Looking at the freeindex information (fields with an 'f' tag):</para>
<programlisting>
xfs_db&gt; fsblock 7516
xfs_db&gt; type dir2
xfs_db&gt; p
fhdr.magic = 0x58443246
fhdr.firstdb = 0
fhdr.nvalid = 5
fhdr.nused = 5
fbests[0-4] = 0:0x10 1:0x10 2:0x10 3:0x10 4:0x3f50
</programlisting>
<para>Like the Leaf Directory (<xref linkend="Leaf_Directories"/>), each <command>fbests</command> value corresponds to a data block's <command>bestfree[0].length</command> value.</para>
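<para>One use of the <command>fbests</command> array is to find a data block that can hold a new entry without reading every data block. The Python sketch below illustrates the idea with a simple first-fit scan. The entry size formula assumes the v2 data entry layout (8-byte inode number, 1-byte name length, name, 2-byte tag, rounded up to 8 bytes), and the function names are hypothetical. The kernel's actual search strategy is more involved.</para>
<programlisting><![CDATA[
```python
NULLDATAOFF = 0xFFFF  # bests[] value for a freed data block (a hole)


def entry_size(namelen):
    """Bytes needed for a data entry with a name of `namelen` bytes."""
    return (8 + 1 + namelen + 2 + 7) & ~7


def find_block_with_space(bests, namelen):
    """Return the index of the first data block whose largest free region
    can hold the entry, or None if a new data block would be needed."""
    need = entry_size(namelen)
    for index, best in enumerate(bests):
        if best != NULLDATAOFF and best >= need:
            return index
    return None


# fbests[0-4] from the freeindex block above
FBESTS = [0x10, 0x10, 0x10, 0x10, 0x3F50]
```
]]></programlisting>
<para>A 15-character name such as those in this directory needs 32 bytes, which matches the 0x20 spacing of the <command>tag</command> values in the data block listing, so only block 4 qualifies.</para>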
<para>In the raw disk layout, old data after the array is not cleared. The <command>fbests</command> array is highlighted:</para>
<mediaobject>
<imageobject><imagedata fileref="images/code/57.png" format="PNG" /></imageobject>
<textobject><phrase>code57</phrase></textobject>
</mediaobject>
<para>TODO: Example with a hole in the middle</para></section>
<section id="Btree_Directories">
<title>B+tree Directories</title>
<para>When the extent map in an inode grows beyond the inode's space, the inode format is changed to a "btree". The inode contains a filesystem block pointer to the B+tree extent map for the directory's blocks. The B+tree extents contain the extent map for the "data", "node", "leaf" and "freeindex" information as described in Node Directories (<xref linkend="Node_Directories"/>).</para>
<para>Refer to the previous section on B+tree Data Extents (<xref linkend="Btree_Extent_List"/>) for more information on XFS B+tree extents.</para>
<para>The following changes relative to Node Directories can apply here, because an inode's in-line extent list generally cannot map as many directory blocks as a B+tree can:</para>
<itemizedlist>
<listitem>
<para>The node/leaf trees can be more than one level deep. </para>
</listitem>
<listitem>
<para>More than one freeindex block may exist, but this will be quite rare. It would require hundreds of thousands of files with quite long file names (or millions with shorter names) to get a second freeindex block.</para>
</listitem>
</itemizedlist>
<bridgehead>xfs_db Example:</bridgehead>
<para>A directory has been created with 200,000 entries, each with a name 100 characters long. The filesystem block size and directory block size are 4KB:</para>
<programlisting>
xfs_db&gt; inode 772
xfs_db&gt; p
core.magic = 0x494e
core.mode = 040755
core.version = 1
core.format = 3 (btree)
...
core.size = 22757376
core.nblocks = 6145
core.extsize = 0
core.nextents = 234
core.naextents = 0
core.forkoff = 0
...
u.bmbt.level = 1
u.bmbt.numrecs = 1
u.bmbt.keys[1] = [startoff] 1:[0]
u.bmbt.ptrs[1] = 1:89
xfs_db&gt; fsblock 89
xfs_db&gt; type bmapbtd
xfs_db&gt; p
magic = 0x424d4150
level = 0
numrecs = 234
leftsib = null
rightsib = null
recs[1-234] = [startoff,startblock,blockcount,extentflag]
1:[0,53,1,0] 2:[1,55,13,0] 3:[14,69,1,0] 4:[15,72,13,0]
5:[28,86,2,0] 6:[30,90,21,0] 7:[51,112,1,0] 8:[52,114,11,0]
...
125:[5177,902,15,0] 126:[5192,918,6,0] 127:[5198,524786,358,0]
128:[8388608,54,1,0] 129:[8388609,70,2,0] 130:[8388611,85,1,0]
...
229:[8389164,917,1,0] 230:[8389165,924,19,0] 231:[8389184,944,9,0]
232:[16777216,68,1,0] 233:[16777217,7340114,1,0] 234:[16777218,5767362,1,0]
</programlisting>
<para>We have 127 extents and a total of 5556 blocks (directory blocks 0 to 5555) being used to store name/inode pairs. As a single freeindex block can hold only about 2000 values (2040 in this example), 3 blocks have been allocated for this information. The <command>firstdb</command> field specifies the starting directory block number for each array:</para>
<programlisting>
xfs_db&gt; dblock 16777216
xfs_db&gt; type dir2
xfs_db&gt; p
fhdr.magic = 0x58443246
fhdr.firstdb = 0
fhdr.nvalid = 2040
fhdr.nused = 2040
fbests[0-2039] = ...
xfs_db&gt; dblock 16777217
xfs_db&gt; type dir2
xfs_db&gt; p
fhdr.magic = 0x58443246
fhdr.firstdb = 2040
fhdr.nvalid = 2040
fhdr.nused = 2040
fbests[0-2039] = ...
xfs_db&gt; dblock 16777218
xfs_db&gt; type dir2
xfs_db&gt; p
fhdr.magic = 0x58443246
fhdr.firstdb = 4080
fhdr.nvalid = 1476
fhdr.nused = 1476
fbests[0-1475] = ...
</programlisting>
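<para>The <command>firstdb</command> and <command>nvalid</command> values above follow from simple arithmetic, sketched here in Python. The sketch assumes a 16-byte freeindex header (the four 32-bit fields shown by xfs_db) followed by 16-bit <command>bests</command> entries, and a directory with no hole at its tail. The function name is hypothetical.</para>
<programlisting><![CDATA[
```python
DIRBLKSIZE = 4096      # directory block size in this example
FREE_HDR_SIZE = 16     # assumption: magic, firstdb, nvalid, nused (4 bytes each)
BEST_SIZE = 2          # each bests[] entry is a 16-bit value

BESTS_PER_FREE_BLOCK = (DIRBLKSIZE - FREE_HDR_SIZE) // BEST_SIZE   # 2040


def free_block_layout(data_blocks, per_block=BESTS_PER_FREE_BLOCK):
    """Return (firstdb, nvalid) for each freeindex block needed to
    describe `data_blocks` consecutive data blocks."""
    layout = []
    firstdb = 0
    while firstdb < data_blocks:
        nvalid = min(per_block, data_blocks - firstdb)
        layout.append((firstdb, nvalid))
        firstdb += per_block
    return layout


# core.size divided by the directory block size gives the data block count.
DATA_BLOCKS = 22757376 // DIRBLKSIZE
```
]]></programlisting>
<para>For this directory, <command>DATA_BLOCKS</command> is 5556 and the layout is three freeindex blocks starting at data blocks 0, 2040 and 4080, the last one describing 1476 blocks.</para>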
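<para>Looking at the root node in the node block, the tree is deeper than in the previous example: <command>nhdr.level</command> is 2, so there is an intermediate level of node blocks between the root and the leaves:</para>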
<programlisting>
xfs_db&gt; dblock 8388608
xfs_db&gt; type dir2
xfs_db&gt; p
nhdr.info.forw = 0
nhdr.info.back = 0
nhdr.info.magic = 0xfebe
nhdr.count = 2
nhdr.level = 2
nbtree[0-1] = [hashval,before] 0:[0x6bbf6f39,8389121] 1:[0xfbbf7f79,8389120]
xfs_db&gt; dblock 8389121
xfs_db&gt; type dir2
xfs_db&gt; p
nhdr.info.forw = 8389120
nhdr.info.back = 0
nhdr.info.magic = 0xfebe
nhdr.count = 263
nhdr.level = 1
nbtree[0-262] = ... 262:[0x6bbf6f39,8388928]
xfs_db&gt; dblock 8389120
xfs_db&gt; type dir2
xfs_db&gt; p
nhdr.info.forw = 0
nhdr.info.back = 8389121
nhdr.info.magic = 0xfebe
nhdr.count = 319
nhdr.level = 1
nbtree[0-318] = [hashval,before] 0:[0x70b14711,8388919] ...
</programlisting>
<para>The leaves at each end of a node always point to the end leaves of the adjacent nodes. Directory block 8388928's forward pointer is to block 8388919, and vice versa, as highlighted in the following example:</para>
<mediaobject>
<imageobject><imagedata fileref="images/code/60.png" format="PNG" /></imageobject>
<textobject><phrase>code60</phrase></textobject>
</mediaobject></section>
</chapter>