| [[Directories]] |
| = Directories |
| |
| [NOTE] |
| Only v2 directories covered here. v1 directories are obsolete. |
| |
| [NOTE] |
| The term ``block'' in this section will refer to directory blocks, not filesystem |
| blocks unless otherwise specified. |
| |
| The size of a ``directory block'' is defined by the xref:Superblocks[superblock's] |
| +sb_dirblklog+ value. The size in bytes = +sb_blocksize+ × 2^sb_dirblklog^. |
| For example, if +sb_blocksize+ = 4096 and +sb_dirblklog+ = 2, the directory block |
| size is 16384 bytes. Directory blocks are always allocated in multiples based on |
| +sb_dirblklog+. Directory blocks cannot be more that 65536 bytes in size. |
| |
| All directory entries contain the following ``data'': |
| |
| * The entry's name (counted string consisting of a single byte +namelen+ |
| followed by +name+ consisting of an array of 8-bit chars without a NULL |
| terminator). |
| |
| * The entry's absolute xref:Inode_Numbers[inode number], which are |
| always 64 bits (8 bytes) in size except a special case for shortform |
| directories. |
| |
| * An +offset+ or +tag+ used for iterative readdir calls. |
| |
| * If the +XFS_SB_FEAT_INCOMPAT_FTYPE+ feature flag is set, each |
| directory entry contains an +ftype+ field that caches the inode's type |
| to avoid having to perform an inode lookup. |
| |
| .ftype values |
| [options="header"] |
| |===== |
| | Flag | Description |
| | +XFS_DIR3_FT_UNKNOWN+ | |
| Entry points to an unknown inode type. This should never appear on disk. |
| | +XFS_DIR3_FT_REG_FILE+ | |
| Entry points to a file. |
| | +XFS_DIR3_FT_DIR+ | |
| Entry points to another directory. |
| | +XFS_DIR3_FT_CHRDEV+ | |
| Entry points to a character device. |
| | +XFS_DIR3_FT_BLKDEV+ | |
| Entry points to a block device. |
| | +XFS_DIR3_FT_FIFO+ | |
| Entry points to a FIFO. |
| | +XFS_DIR3_FT_SOCK+ | |
| Entry points to a socket. |
| | +XFS_DIR3_FT_SYMLINK+ | |
| Entry points to a symbolic link. |
| | +XFS_DIR3_FT_WHT+ | |
| Entry points to an overlayfs whiteout file. This (as far as the author |
| knows) has never appeared on disk. |
| |===== |
| |
| All non-shortform directories also contain two additional structures: ``leaves'' |
| and ``freespace indexes''. |
| |
| * Leaves contain the sorted hashed name value (+xfs_da_hashname()+ in |
| xfs_da_btree.c) and associated ``address'' which points to the effective offset |
| into the directory's data structures. Leaves are used to optimise lookup |
| operations. |
| |
| * Freespace indexes contain free space/empty entry tracking for quickly finding an |
| appropriately sized location for new entries. They maintain the largest free |
| space for each ``data'' block. |
| |
| A few common types are used for the directory structures: |
| |
| [source, c] |
| ---- |
| typedef __uint16_t xfs_dir2_data_off_t; |
| typedef __uint32_t xfs_dir2_dataptr_t; |
| ---- |
| |
| |
| [[Shortform_Directories]] |
| == Short Form Directories |
| |
| * Directory entries are stored within the inode. |
| |
| * The only data stored is the name, inode number, and offset. No ``leaf'' or |
| ``freespace index'' information is required as an inode can only store a few |
| entries. |
| |
| * ``.'' is not stored (as it's in the inode itself), and ``..'' is a dedicated |
| +parent+ field in the header. |
| |
| * The number of directories that can be stored in an inode depends on the |
| xref:On-disk_Inode[inode] size, the number of entries, the length of the entry |
| names, and extended attribute data. |
| |
| * Once the number of entries exceeds the space available in the inode, the |
| format is converted to a xref:Block_Directories[block directory]. |
| |
| * Shortform directory data is packed as tightly as possible on the disk with the |
| remaining space zeroed: |
| |
| [source, c] |
| ---- |
| typedef struct xfs_dir2_sf { |
| xfs_dir2_sf_hdr_t hdr; |
| xfs_dir2_sf_entry_t list[1]; |
| } xfs_dir2_sf_t; |
| ---- |
| |
| *hdr*:: |
| Short form directory header. |
| |
| *list*:: |
| An array of variable-length directory entry records. |
| |
| [source, c] |
| ---- |
| typedef struct xfs_dir2_sf_hdr { |
| __uint8_t count; |
| __uint8_t i8count; |
| xfs_dir2_inou_t parent; |
| } xfs_dir2_sf_hdr_t; |
| ---- |
| |
| *count*:: |
| Number of directory entries. |
| |
| *i8count*:: |
| Number of directory entries requiring 64-bit entries, if any inode numbers |
| require 64-bits. Zero otherwise. |
| |
| *parent*:: |
| The absolute inode number of this directory's parent. |
| |
| [source, c] |
| ---- |
| typedef struct xfs_dir2_sf_entry { |
| __uint8_t namelen; |
| xfs_dir2_sf_off_t offset; |
| __uint8_t name[1]; |
| __uint8_t ftype; |
| xfs_dir2_inou_t inumber; |
| } xfs_dir2_sf_entry_t; |
| ---- |
| |
| *namelen*:: |
| Length of the name, in bytes. |
| |
| *offset*:: |
| Offset tag used to assist with directory iteration. |
| |
| *name*:: |
| The name of the directory entry. The entry is not NULL-terminated. |
| |
| *ftype*:: |
| The type of the inode. This is used to avoid reading the inode while iterating |
| a directory. The +XFS_SB_VERSION2_FTYPE+ feature must be set, or this field |
| will not be present. |
| |
| *inumber*:: |
| The inode number that this entry points to. The length is either 32 or 64 |
| bits, depending on whether +icount+ or +i8count+, respectively, are set in the |
| header. |
| |
| .Short form directory layout |
| image::images/39.png[] |
| |
| * Inode numbers are stored using 4 or 8 bytes depending on whether all the inode |
| numbers for the directory fit in 4 bytes (32 bits) or not. If all inode numbers |
| fit in 4 bytes, the header's +count+ value specifies the number of entries in |
| the directory and +i8count+ will be zero. If any inode number exceeds 4 bytes, |
| all inode numbers will be 8 bytes in size and the header's +i8count+ value |
| specifies the number of entries requiring larger inodes. +i4count+ is still |
| the number of entries. The following union covers the shortform inode number |
| structure: |
| |
| [source, c] |
| ---- |
| typedef struct { __uint8_t i[8]; } xfs_dir2_ino8_t; |
| typedef struct { __uint8_t i[4]; } xfs_dir2_ino4_t; |
| typedef union { |
| xfs_dir2_ino8_t i8; |
| xfs_dir2_ino4_t i4; |
| } xfs_dir2_inou_t; |
| ---- |
| |
| === xfs_db Short Form Directory Example |
| |
| A directory is created with 4 files, all inode numbers fitting within 4 bytes: |
| |
| ---- |
| xfs_db> inode <inode#> |
| xfs_db> p |
| core.magic = 0x494e |
| core.mode = 040755 |
| core.version = 1 |
| core.format = 1 (local) |
| core.nlinkv1 = 2 |
| ... |
| core.size = 94 |
| core.nblocks = 0 |
| core.extsize = 0 |
| core.nextents = 0 |
| ... |
| u.sfdir2.hdr.count = 4 |
| u.sfdir2.hdr.i8count = 0 |
| u.sfdir2.hdr.parent.i4 = 128 /* parent = root inode */ |
| u.sfdir2.list[0].namelen = 15 |
| u.sfdir2.list[0].offset = 0x30 |
| u.sfdir2.list[0].name = "frame000000.tst" |
| u.sfdir2.list[0].inumber.i4 = 25165953 |
| u.sfdir2.list[1].namelen = 15 |
| u.sfdir2.list[1].offset = 0x50 |
| u.sfdir2.list[1].name = "frame000001.tst" |
| u.sfdir2.list[1].inumber.i4 = 25165954 |
| u.sfdir2.list[2].namelen = 15 |
| u.sfdir2.list[2].offset = 0x70 |
| u.sfdir2.list[2].name = "frame000002.tst" |
| u.sfdir2.list[2].inumber.i4 = 25165955 |
| u.sfdir2.list[3].namelen = 15 |
| u.sfdir2.list[3].offset = 0x90 |
| u.sfdir2.list[3].name = "frame000003.tst" |
| u.sfdir2.list[3].inumber.i4 = 25165956 |
| ---- |
| |
| The raw data on disk with the first entry highlighted. The six byte header |
| precedes the first entry: |
| |
| [subs="quotes"] |
| ---- |
| xfs_db> type text |
| xfs_db> p |
| 00: 49 4e 41 ed 01 01 00 02 00 00 00 00 00 00 00 00 INA............. |
| 10: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 02 ................ |
| 20: 44 ad 3a 83 1d a9 4a d0 44 ad 3a ab 0b c7 a7 d0 D.....J.D....... |
| 30: 44 ad 3a ab 0b c7 a7 d0 00 00 00 00 00 00 00 5e D............... |
| 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| 50: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| 60: ff ff ff ff 04 00 00 00 00 80 *0f 00 30 66 72 61* ............0fra |
| 70: *6d 65 30 30 30 30 30 30 2e 74 73 74 01 80 00 81* me000000.tst.... |
| 80: 0f 00 50 66 72 61 6d 65 30 30 30 30 30 31 2e 74 ..Pframe000001.t |
| 90: 73 74 01 80 00 82 0f 00 70 66 72 61 6d 65 30 30 st......pframe00 |
| a0: 30 30 30 32 2e 74 73 74 01 80 00 83 0f 00 90 66 0002.tst........ |
| b0: 72 61 6d 65 30 30 30 30 30 33 2e 74 73 74 01 80 rame000003.tst.. |
| cO: 00 84 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| ---- |
| |
| Next, an entry is deleted (frame000001.tst), and any entries after the deleted |
| entry are moved or compacted to ``cover'' the hole: |
| |
| ---- |
| xfs_db> inode <inode#> |
| xfs_db> p |
| core.magic = 0x494e |
| core.mode = 040755 |
| core.version = 1 |
| core.format = 1 (local) |
| core.nlinkv1 = 2 |
| ... |
| core.size = 72 |
| core.nblocks = 0 |
| core.extsize = 0 |
| core.nextents = 0 |
| ... |
| u.sfdir2.hdr.count = 3 |
| u.sfdir2.hdr.i8count = 0 |
| u.sfdir2.hdr.parent.i4 = 128 |
| u.sfdir2.list[0].namelen = 15 |
| u.sfdir2.list[0].offset = 0x30 |
| u.sfdir2.list[0].name = "frame000000.tst" |
| u.sfdir2.list[0].inumber.i4 = 25165953 |
| u.sfdir2.list[1].namelen = 15 |
| u.sfdir2.list[1].offset = 0x70 |
| u.sfdir2.list[1].name = "frame000002.tst" |
| u.sfdir2.list[1].inumber.i4 = 25165955 |
| u.sfdir2.list[2].namelen = 15 |
| u.sfdir2.list[2].offset = 0x90 |
| u.sfdir2.list[2].name = "frame000003.tst" |
| u.sfdir2.list[2].inumber.i4 = 25165956 |
| ---- |
| |
| Raw disk data, the space beyond the shortform entries is invalid and could be non-zero: |
| |
| ---- |
| xfs_db> type text |
| xfs_db> p |
| 00: 49 4e 41 ed 01 01 00 02 00 00 00 00 00 00 00 00 INA............. |
| 10: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 03 ................ |
| 20: 44 b2 45 a2 09 fd e4 50 44 b2 45 a3 12 ee b5 d0 D.E....PD.E..... |
| 30: 44 b2 45 a3 12 ee b5 d0 00 00 00 00 00 00 00 48 D.E............H |
| 40: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| 50: 00 00 00 02 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| 60: ff ff ff ff 03 00 00 00 00 80 0f 00 30 66 72 61 ............0fra |
| 70: 6d 65 30 30 30 30 30 30 2e 74 73 74 01 80 00 81 me000000.tst.... |
| 80: 0f 00 70 66 72 61 6d 65 30 30 30 30 30 32 2e 74 ..pframe000002.t |
| 90: 73 74 01 80 00 83 0f 00 90 66 72 61 6d 65 30 30 st.......frame00 |
| a0: 30 30 30 33 2e 74 73 74 01 80 00 84 0f 00 90 66 0003.tst.......f |
| b0: 72 61 6d 65 30 30 30 30 30 33 2e 74 73 74 01 80 rame000003.tst.. |
| c0: 00 84 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ |
| ---- |
| |
| This is an example of mixed 4-byte and 8-byte inodes in a directory: |
| |
| ---- |
| xfs_db> inode 1024 |
| xfs_db> p |
| core.magic = 0x494e |
| core.mode = 040755 |
| core.version = 3 |
| core.format = 1 (local) |
| core.nlinkv2 = 9 |
| ... |
| core.size = 125 |
| core.nblocks = 0 |
| core.extsize = 0 |
| core.nextents = 0 |
| ... |
| u3.sfdir3.hdr.count = 7 |
| u3.sfdir3.hdr.i8count = 4 |
| u3.sfdir3.hdr.parent.i8 = 1024 |
| u3.sfdir3.list[0].namelen = 3 |
| u3.sfdir3.list[0].offset = 0x60 |
| u3.sfdir3.list[0].name = "git" |
| u3.sfdir3.list[0].inumber.i8 = 1027 |
| u3.sfdir3.list[0].filetype = 2 |
| u3.sfdir3.list[1].namelen = 4 |
| u3.sfdir3.list[1].offset = 0x70 |
| u3.sfdir3.list[1].name = "home" |
| u3.sfdir3.list[1].inumber.i8 = 13422826546 |
| u3.sfdir3.list[1].filetype = 2 |
| u3.sfdir3.list[2].namelen = 10 |
| u3.sfdir3.list[2].offset = 0x80 |
| u3.sfdir3.list[2].name = "mike" |
| u3.sfdir3.list[2].inumber.i8 = 4299308032 |
| u3.sfdir3.list[2].filetype = 2 |
| u3.sfdir3.list[3].namelen = 3 |
| u3.sfdir3.list[3].offset = 0x98 |
| u3.sfdir3.list[3].name = "mtr" |
| u3.sfdir3.list[3].inumber.i8 = 13433252916 |
| u3.sfdir3.list[3].filetype = 2 |
| u3.sfdir3.list[4].namelen = 3 |
| u3.sfdir3.list[4].offset = 0xa8 |
| u3.sfdir3.list[4].name = "vms" |
| u3.sfdir3.list[4].inumber.i8 = 16647516355 |
| u3.sfdir3.list[4].filetype = 2 |
| u3.sfdir3.list[5].namelen = 5 |
| u3.sfdir3.list[5].offset = 0xb8 |
| u3.sfdir3.list[5].name = "rsync" |
| u3.sfdir3.list[5].inumber.i8 = 3494912 |
| u3.sfdir3.list[5].filetype = 2 |
| u3.sfdir3.list[6].namelen = 3 |
| u3.sfdir3.list[6].offset = 0xd0 |
| u3.sfdir3.list[6].name = "tmp" |
| u3.sfdir3.list[6].inumber.i8 = 1593379 |
| u3.sfdir3.list[6].filetype = 2 |
| ---- |
| |
| [[Block_Directories]] |
| == Block Directories |
| |
| When the shortform directory space exceeds the space in an inode, the directory |
| data is moved into a new single directory block outside the inode. The inode's |
| format is changed from ``local'' to ``extent'' Following is a list of points about |
| block directories. |
| |
| * All directory data is stored within the one directory block, including ``.'' and |
| ``..'' entries which are mandatory. |
| |
| * The block also contains ``leaf'' and ``freespace index'' information. |
| |
| * The location of the block is defined by the inode's in-core |
| xref:Extent_List[extent list]: the +di_u.u_bmx[0]+ value. The file offset in |
| the extent must always be zero and the +length+ = (directory block size / |
| filesystem block size). The block number points to the filesystem block |
| containing the directory data. |
| |
| * Block directory data is stored in the following structures: |
| |
| [source, c] |
| ---- |
| #define XFS_DIR2_DATA_FD_COUNT 3 |
| typedef struct xfs_dir2_block { |
| xfs_dir2_data_hdr_t hdr; |
| xfs_dir2_data_union_t u[1]; |
| xfs_dir2_leaf_entry_t leaf[1]; |
| xfs_dir2_block_tail_t tail; |
| } xfs_dir2_block_t; |
| ---- |
| |
| *hdr*:: |
| Directory block header. On a v5 filesystem this is +xfs_dir3_data_hdr_t+. |
| |
| *u*:: |
| Union of directory and unused entries. |
| |
| *leaf*:: |
| Hash values of the entries in this block. |
| |
| *tail*:: |
| Bookkeeping for the leaf entries. |
| |
| [source, c] |
| ---- |
| typedef struct xfs_dir2_data_hdr { |
| __uint32_t magic; |
| xfs_dir2_data_free_t bestfree[XFS_DIR2_DATA_FD_COUNT]; |
| } xfs_dir2_data_hdr_t; |
| ---- |
| |
| *magic*:: |
| Magic number for this directory block. |
| |
| *bestfree*:: |
| An array pointing to free regions in the directory block. |
| |
| On a v5 filesystem, directory and attribute blocks are formatted with v3 |
| headers, which contain extra data: |
| |
| [source, c] |
| ---- |
| struct xfs_dir3_blk_hdr { |
| __be32 magic; |
| __be32 crc; |
| __be64 blkno; |
| __be64 lsn; |
| uuid_t uuid; |
| __be64 owner; |
| }; |
| ---- |
| |
| *magic*:: |
| Magic number for this directory block. |
| |
| *crc*:: |
| Checksum of the directory block. |
| |
| *blkno*:: |
| Block number of this directory block. |
| |
| *lsn*:: |
| Log sequence number of the last write to this block. |
| |
| *uuid*:: |
| The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+ |
| depending on which features are set. |
| |
| *owner*:: |
| The inode number that this directory block belongs to. |
| |
| [source, c] |
| ---- |
| struct xfs_dir3_data_hdr { |
| struct xfs_dir3_blk_hdr hdr; |
| xfs_dir2_data_free_t best_free[XFS_DIR2_DATA_FD_COUNT]; |
| __be32 pad; |
| }; |
| ---- |
| |
| *hdr*:: |
| The v5 directory/attribute block header. |
| |
| *best_free*:: |
| An array pointing to free regions in the directory block. |
| |
| *pad*:: |
| Padding to maintain a 64-bit alignment. |
| |
| Within the block, data structures are as follows: |
| |
| [source, c] |
| ----- |
| typedef struct xfs_dir2_data_free { |
| xfs_dir2_data_off_t offset; |
| xfs_dir2_data_off_t length; |
| } xfs_dir2_data_free_t; |
| ---- |
| |
| *offset*:: |
| Block offset of a free block, in bytes. |
| |
| *length*:: |
| Length of the free block, in bytes. |
| |
| Space inside the directory block can be used for directory entries or unused |
| entries. This is signified via a union of the two types: |
| |
| [source, c] |
| ----- |
| typedef union { |
| xfs_dir2_data_entry_t entry; |
| xfs_dir2_data_unused_t unused; |
| } xfs_dir2_data_union_t; |
| ---- |
| |
| *entry*:: |
| A directory entry. |
| |
| *unused*:: |
| An unused entry. |
| |
| [source, c] |
| ----- |
| typedef struct xfs_dir2_data_entry { |
| xfs_ino_t inumber; |
| __uint8_t namelen; |
| __uint8_t name[1]; |
| __uint8_t ftype; |
| xfs_dir2_data_off_t tag; |
| } xfs_dir2_data_entry_t; |
| ---- |
| |
| *inumber*:: |
| The inode number that this entry points to. |
| |
| *namelen*:: |
| Length of the name, in bytes. |
| |
| *name*:: |
| The name associated with this entry. |
| |
| *ftype*:: |
| The type of the inode. This is used to avoid reading the inode while iterating |
| a directory. The +XFS_SB_VERSION2_FTYPE+ feature must be set, or this field |
| will not be present. |
| |
| *tag*:: |
| Starting offset of the entry, in bytes. This is used for directory iteration. |
| |
| [source, c] |
| ----- |
| typedef struct xfs_dir2_data_unused { |
| __uint16_t freetag; /* 0xffff */ |
| xfs_dir2_data_off_t length; |
| xfs_dir2_data_off_t tag; |
| } xfs_dir2_data_unused_t; |
| ---- |
| |
| *freetag*:: |
| Magic number signifying that this is an unused entry. Must be 0xFFFF. |
| |
| *length*:: |
| Length of this unused entry, in bytes. |
| |
| *tag*:: |
| Starting offset of the entry, in bytes. |
| |
| [source, c] |
| ----- |
| typedef struct xfs_dir2_leaf_entry { |
| xfs_dahash_t hashval; |
| xfs_dir2_dataptr_t address; |
| } xfs_dir2_leaf_entry_t; |
| ---- |
| |
| *hashval*:: |
| Hash value of the name of the directory entry. This is used to speed up entry |
| lookups. |
| |
| *address*:: |
| Block offset of the entry, in eight byte units. |
| |
| [source, c] |
| ----- |
| typedef struct xfs_dir2_block_tail { |
| __uint32_t count; |
| __uint32_t stale; |
| } xfs_dir2_block_tail_t; |
| ---- |
| |
| *count*:: |
| Number of leaf entries. |
| |
| *stale*:: |
| Number of free leaf entries. |
| |
| Following is a diagram of how these pieces fit together for a block directory. |
| |
| .Block directory layout |
| image::images/43.png[] |
| |
| * The magic number in the header is ``XD2B'' (0x58443242), or ``XDB3'' (0x58444233) |
| on a v5 filesystem. |
| |
| * The +tag+ in the +xfs_dir2_data_entry_t+ structure stores its offset from the |
| start of the block. |
| |
| * The start of a free space region is marked with the +xfs_dir2_data_unused_t+ |
| structure where the +freetag+ is +0xffff+. The +freetag+ and +length+ overwrites |
| the +inumber+ for an entry. The +tag+ is located at +length - sizeof(tag)+ from |
| the start of the +unused+ entry on-disk. |
| |
| * The +bestfree+ array in the header points to as many as three of the largest |
| spaces of free space within the block for storing new entries sorted by largest |
| to third largest. If there are less than 3 empty regions, the remaining |
| +bestfree+ elements are zeroed. The +offset+ specifies the offset from the start |
| of the block in bytes, and the +length+ specifies the size of the free space in |
| bytes. The location each points to must contain the above |
| +xfs_dir2_data_unused_t+ structure. As a block cannot exceed 64KB in size, each |
| is a 16-bit value. +bestfree+ is used to optimise the time required to locate |
| space to create an entry. It saves scanning through the block to find a location |
| suitable for every entry created. |
| |
| * The +tail+ structure specifies the number of elements in the +leaf+ array and |
| the number of +stale+ entries in the array. The +tail+ is always located at the |
| end of the block. The +leaf+ data immediately precedes the +tail+ structure. |
| |
| * The +leaf+ array, which grows from the end of the block just before the +tail+ |
| structure, contains an array of hash/address pairs for quickly looking up a name |
| by a hash value. Hash values are covered by the introduction to directories. The |
| +address+ on-disk is the offset into the block divided by 8 |
| (+XFS_DIR2_DATA_ALIGN+). Hash/address pairs are stored on disk to optimise |
| lookup speed for large directories. If they were not stored, the hashes would |
| have to be calculated for all entries each time a lookup occurs in a directory. |
| |
| |
| === xfs_db Block Directory Example |
| |
| A directory is created with 8 entries, directory block size = filesystem block size: |
| |
| ---- |
| xfs_db> sb 0 |
| xfs_db> p |
| magicnum = 0x58465342 |
| blocksize = 4096 |
| ... |
| dirblklog = 0 |
| ... |
| xfs_db> inode <inode#> |
| xfs_db> p |
| core.magic = 0x494e |
| core.mode = 040755 |
| core.version = 1 |
| core.format = 2 (extents) |
| core.nlinkv1 = 2 |
| ... |
| core.size = 4096 |
| core.nblocks = 1 |
| core.extsize = 0 |
| core.nextents = 1 |
| ... |
| u.bmx[0] = [startoff,startblock,blockcount,extentflag] 0:[0,2097164,1,0] |
| ---- |
| |
| Go to the ``startblock'' and show the raw disk data: |
| |
| ---- |
| xfs_db> dblock 0 |
| xfs_db> type text |
| xfs_db> p |
| 000: 58 44 32 42 01 30 0e 78 00 00 00 00 00 00 00 00 XD2B.0.x........ |
| 010: 00 00 00 00 02 00 00 80 01 2e 00 00 00 00 00 10 ................ |
| 020: 00 00 00 00 00 00 00 80 02 2e 2e 00 00 00 00 20 ................ |
| 030: 00 00 00 00 02 00 00 81 0f 66 72 61 6d 65 30 30 .........frame00 |
| 040: 30 30 30 30 2e 74 73 74 80 8e 59 00 00 00 00 30 0000.tst..Y....0 |
| 050: 00 00 00 00 02 00 00 82 0f 66 72 61 6d 65 30 30 .........frame00 |
| 060: 30 30 30 31 2e 74 73 74 d0 ca 5c 00 00 00 00 50 0001.tst.......P |
| 070: 00 00 00 00 02 00 00 83 0f 66 72 61 6d 65 30 30 .........frame00 |
| 080: 30 30 30 32 2e 74 73 74 00 00 00 00 00 00 00 70 0002.tst.......p |
| 090: 00 00 00 00 02 00 00 84 0f 66 72 61 6d 65 30 30 .........frame00 |
| 0a0: 30 30 30 33 2e 74 73 74 00 00 00 00 00 00 00 90 0003.tst........ |
| 0b0: 00 00 00 00 02 00 00 85 0f 66 72 61 6d 65 30 30 .........frame00 |
| 0c0: 30 30 30 34 2e 74 73 74 00 00 00 00 00 00 00 b0 0004.tst........ |
| 0d0: 00 00 00 00 02 00 00 86 0f 66 72 61 6d 65 30 30 .........frame00 |
| 0e0: 30 30 30 35 2e 74 73 74 00 00 00 00 00 00 00 d0 0005.tst........ |
| 0f0: 00 00 00 00 02 00 00 87 0f 66 72 61 6d 65 30 30 .........frame00 |
| 100: 30 30 30 36 2e 74 73 74 00 00 00 00 00 00 00 f0 0006.tst........ |
| 110: 00 00 00 00 02 00 00 88 0f 66 72 61 6d 65 30 30 .........frame00 |
| 120: 30 30 30 37 2e 74 73 74 00 00 00 00 00 00 01 10 0007.tst........ |
| 130: ff ff 0e 78 00 00 00 00 00 00 00 00 00 00 00 00 ...x............ |
| ---- |
| |
| The ``leaf'' and ``tail'' structures are stored at the end of the block, so as the |
| directory grows, the middle is filled in: |
| |
| ---- |
| fa0: 00 00 00 00 00 00 01 30 00 00 00 2e 00 00 00 02 .......0........ |
| fb0: 00 00 17 2e 00 00 00 04 83 a0 40 b4 00 00 00 0e ................ |
| fc0: 93 a0 40 b4 00 00 00 12 a3 a0 40 b4 00 00 00 06 ................ |
| fd0: b3 a0 40 b4 00 00 00 0a c3 a0 40 b4 00 00 00 1e ................ |
| fe0: d3 a0 40 b4 00 00 00 22 e3 a0 40 b4 00 00 00 16 ................ |
| ff0: f3 a0 40 b4 00 00 00 1a 00 00 00 0a 00 00 00 00 ................ |
| ---- |
| |
| In a readable format: |
| |
| ---- |
| xfs_db> type dir2 |
| xfs_db> p |
| bhdr.magic = 0x58443242 |
| bhdr.bestfree[0].offset = 0x130 |
| bhdr.bestfree[0].length = 0xe78 |
| bhdr.bestfree[1].offset = 0 |
| bhdr.bestfree[1].length = 0 |
| bhdr.bestfree[2].offset = 0 |
| bhdr.bestfree[2].length = 0 |
| bu[0].inumber = 33554560 |
| bu[0].namelen = 1 |
| bu[0].name = "." |
| bu[0].tag = 0x10 |
| bu[1].inumber = 128 |
| bu[1].namelen = 2 |
| bu[1].name = ".." |
| bu[1].tag = 0x20 |
| bu[2].inumber = 33554561 |
| bu[2].namelen = 15 |
| bu[2].name = "frame000000.tst" |
| bu[2].tag = 0x30 |
| bu[3].inumber = 33554562 |
| bu[3].namelen = 15 |
| bu[3].name = "frame000001.tst" |
| bu[3].tag = 0x50 |
| ... |
| bu[8].inumber = 33554567 |
| bu[8].namelen = 15 |
| bu[8].name = "frame000006.tst" |
| bu[8].tag = 0xf0 |
| bu[9].inumber = 33554568 |
| bu[9].namelen = 15 |
| bu[9].name = "frame000007.tst" |
| bu[9].tag = 0x110 |
| bu[10].freetag = 0xffff |
| bu[10].length = 0xe78 |
| bu[10].tag = 0x130 |
| bleaf[0].hashval = 0x2e |
| bleaf[0].address = 0x2 |
| bleaf[1].hashval = 0x172e |
| bleaf[1].address = 0x4 |
| bleaf[2].hashval = 0x83a040b4 |
| bleaf[2].address = 0xe |
| ... |
| bleaf[8].hashval = 0xe3a040b4 |
| bleaf[8].address = 0x16 |
| bleaf[9].hashval = 0xf3a040b4 |
| bleaf[9].address = 0x1a |
| btail.count = 10 |
| btail.stale = 0 |
| ---- |
| |
| [NOTE] |
| For block directories, all xfs_db fields are preceded with ``b''. |
| |
| For a simple lookup example, the hash of frame000000.tst is 0xb3a040b4. Looking |
| up that value, we get an address of 0x6. Multiply that by 8, it becomes offset |
| 0x30 and the inode at that point is 33554561. |
| |
| When we remove an entry from the middle (frame000004.tst), we can see how the |
| freespace details are adjusted: |
| |
| ---- |
| bhdr.magic = 0x58443242 |
| bhdr.bestfree[0].offset = 0x130 |
| bhdr.bestfree[0].length = 0xe78 |
| bhdr.bestfree[1].offset = 0xb0 |
| bhdr.bestfree[1].length = 0x20 |
| bhdr.bestfree[2].offset = 0 |
| bhdr.bestfree[2].length = 0 |
| ... |
| bu[5].inumber = 33554564 |
| bu[5].namelen = 15 |
| bu[5].name = "frame000003.tst" |
| bu[5].tag = 0x90 |
| bu[6].freetag = 0xffff |
| bu[6].length = 0x20 |
| bu[6].tag = 0xb0 |
| bu[7].inumber = 33554566 |
| bu[7].namelen = 15 |
| bu[7].name = "frame000005.tst" |
| bu[7].tag = 0xd0 |
| ... |
| bleaf[7].hashval = 0xd3a040b4 |
| bleaf[7].address = 0x22 |
| bleaf[8].hashval = 0xe3a040b4 |
| bleaf[8].address = 0 |
| bleaf[9].hashval = 0xf3a040b4 |
| bleaf[9].address = 0x1a |
| btail.count = 10 |
| btail.stale = 1 |
| ---- |
| |
| A new ``bestfree'' value is added for the entry, the start of the entry is marked |
| as unused with 0xffff (which overwrites the inode number for an actual entry), |
| and the length of the space. The tag remains intact at the +offset+length - |
| sizeof(tag)+. The address for the hash is also cleared. The affected areas are |
| highlighted below: |
| |
| [subs="quotes"] |
| ---- |
| 090: 00 00 00 00 02 00 00 84 0f 66 72 61 6d 65 30 30 ..........frame00 |
| 0a0: 30 30 30 33 2e 74 73 74 00 00 00 00 00 00 00 90 0003.tst......... |
| 0b0: *ff ff 00 20* 02 00 00 85 0f 66 72 61 6d 65 30 30 ..........frame00 |
| 0c0: 30 30 30 34 2e 74 73 74 00 00 00 00 *00 00 00 b0* 0004.tst......... |
| 0d0: 00 00 00 00 02 00 00 86 0f 66 72 61 6d 65 30 30 ..........frame00 |
| 0e0: 30 30 30 35 2e 74 73 74 00 00 00 00 00 00 00 0d 0005.tst......... |
| ... |
| fb0: 00 00 17 2e 00 00 00 04 83 a0 40 b4 00 00 00 0e ................. |
| fc0: 93 a0 40 b4 00 00 00 12 a3 a0 40 b4 00 00 00 06 ................. |
| fd0: b3 a0 40 b4 00 00 00 0a c3 a0 40 b4 00 00 00 1e ................. |
| fe0: d3 a0 40 b4 00 00 00 22 e3 a0 40 b4 *00 00 00 00* ................. |
| ff0: f3 a0 40 b4 00 00 00 1a 00 00 00 0a *00 00 00 01* ................. |
| ---- |
| |
| |
| |
| [[Leaf_Directories]] |
| == Leaf Directories |
| |
| Once a Block Directory has filled the block, the directory data is changed into |
| a new format. It still uses xref:Data_Extents[extents] and the same basic |
| structures, but the ``data'' and ``leaf'' are split up into their own extents. The |
| ``leaf'' information only occupies one extent. As ``leaf'' information is more |
| compact than ``data'' information, more than one ``data'' extent is common. |
| |
| * Block to Leaf conversions retain the existing block for the data entries and |
| allocate a new block for the leaf and freespace index information. |
| |
| * As with all directories, data blocks must start at logical offset zero. |
| |
| * The ``leaf'' block has a special offset defined by +XFS_DIR2_LEAF_OFFSET+. |
| Currently, this is 32GB and in the extent view, a block offset of |
| 32GB / +sb_blocksize+. On a 4KB block filesystem, this is 0x800000 (8388608 |
| decimal). |
| |
| * Blocks with directory entries (``data'' extents) have the magic number ``X2D2'' |
| (0x58443244), or ``XDD3'' (0x58444433) on a v5 filesystem. |
| |
| * The ``data'' extents have a new header (no ``leaf'' data): |
| |
| [source, c] |
| ---- |
| typedef struct xfs_dir2_data { |
| xfs_dir2_data_hdr_t hdr; |
| xfs_dir2_data_union_t u[1]; |
| } xfs_dir2_data_t; |
| ---- |
| |
| *hdr*:: |
| Data block header. On a v5 filesystem, this field is +struct xfs_dir3_data_hdr+. |
| |
| *u*:: |
| Union of directory and unused entries, exactly the same as in a block directory. |
| |
| // split lists |
| |
| * The ``leaf'' extent uses the following structures: |
| |
| [source, c] |
| ---- |
| typedef struct xfs_dir2_leaf { |
| xfs_dir2_leaf_hdr_t hdr; |
| xfs_dir2_leaf_entry_t ents[1]; |
| xfs_dir2_data_off_t bests[1]; |
| xfs_dir2_leaf_tail_t tail; |
| } xfs_dir2_leaf_t; |
| ---- |
| |
| *hdr*:: |
| Directory leaf header. On a v5 filesystem this is +struct |
| xfs_dir3_leaf_hdr_t+. |
| |
| *ents*:: |
| Hash values of the entries in this block. |
| |
| *bests*:: |
| An array pointing to free regions in the directory block. |
| |
| *tail*:: |
| Bookkeeping for the leaf entries. |
| |
| [source, c] |
| ---- |
| typedef struct xfs_dir2_leaf_hdr { |
| xfs_da_blkinfo_t info; |
| __uint16_t count; |
| __uint16_t stale; |
| } xfs_dir2_leaf_hdr_t; |
| ---- |
| |
| *info*:: |
| Leaf btree block header. |
| |
| *count*:: |
| Number of leaf entries. |
| |
| *stale*:: |
| Number of stale/zeroed leaf entries. |
| |
| [source, c] |
| ---- |
| struct xfs_dir3_leaf_hdr { |
| struct xfs_da3_blkinfo info; |
| __uint16_t count; |
| __uint16_t stale; |
| __be32 pad; |
| }; |
| ---- |
| |
| *info*:: |
| Leaf B+tree block header. |
| |
| *count*:: |
| Number of leaf entries. |
| |
| *stale*:: |
| Number of stale/zeroed leaf entries. |
| |
| *pad*:: |
| Padding to maintain alignment rules. |
| |
| [source, c] |
| ---- |
| typedef struct xfs_dir2_leaf_tail { |
| __uint32_t bestcount; |
| } xfs_dir2_leaf_tail_t; |
| ---- |
| |
| *bestcount*:: |
| Number of best free entries. |
| |
| // split lists |
| |
| * The magic number of the leaf block is +XFS_DIR2_LEAF1_MAGIC+ (0xd2f1); on a |
| v5 filesystem it is +XFS_DIR3_LEAF1_MAGIC+ (0x3df1). |
| |
| * The size of the +ents+ array is specified by +hdr.count+. |
| |
| * The size of the +bests+ array is specified by the +tail.bestcount+, which is |
| also the number of ``data'' blocks for the directory. The bests array maintains |
| each data block's +bestfree[0].length+ value. |
| |
| .Leaf directory free entry detail |
| image::images/48.png[] |
| |
| === xfs_db Leaf Directory Example |
| |
| For this example, a directory was created with 256 entries (frame000000.tst to |
| frame000255.tst). Some files were deleted (frame00005*, frame00018* and |
| frame000240.tst) to show free list characteristics. |
| |
| ---- |
| xfs_db> inode <inode#> |
| xfs_db> p |
| core.magic = 0x494e |
| core.mode = 040755 |
| core.version = 1 |
| core.format = 2 (extents) |
| core.nlinkv1 = 2 |
| ... |
| core.size = 12288 |
| core.nblocks = 4 |
| core.extsize = 0 |
| core.nextents = 3 |
| ... |
| u.bmx[0-2] = [startoff,startblock,blockcount,extentflag] |
| 0:[0,4718604,1,0] |
| 1:[1,4718610,2,0] |
| 2:[8388608,4718605,1,0] |
| ---- |
| |
| As can be seen in this example, three blocks are used for ``data'' in two extents, |
| and the ``leaf'' extent has a logical offset of 8388608 blocks (32GB). |
| |
| Examining the first block: |
| |
| ---- |
| xfs_db> dblock 0 |
| xfs_db> type dir2 |
| xfs_db> p |
| dhdr.magic = 0x58443244 |
| dhdr.bestfree[0].offset = 0x670 |
| dhdr.bestfree[0].length = 0x140 |
| dhdr.bestfree[1].offset = 0xff0 |
| dhdr.bestfree[1].length = 0x10 |
| dhdr.bestfree[2].offset = 0 |
| dhdr.bestfree[2].length = 0 |
| du[0].inumber = 75497600 |
| du[0].namelen = 1 |
| du[0].name = "." |
| du[0].tag = 0x10 |
| du[1].inumber = 128 |
| du[1].namelen = 2 |
| du[1].name = ".." |
| du[1].tag = 0x20 |
| du[2].inumber = 75497601 |
| du[2].namelen = 15 |
| du[2].name = "frame000000.tst" |
| du[2].tag = 0x30 |
| du[3].inumber = 75497602 |
| du[3].namelen = 15 |
| du[3].name = "frame000001.tst" |
| du[3].tag = 0x50 |
| ... |
| du[51].inumber = 75497650 |
| du[51].namelen = 15 |
| du[51].name = "frame000049.tst" |
| du[51].tag = 0x650 |
| du[52].freetag = 0xffff |
| du[52].length = 0x140 |
| du[52].tag = 0x670 |
| du[53].inumber = 75497661 |
| du[53].namelen = 15 |
| du[53].name = "frame000060.tst" |
| du[53].tag = 0x7b0 |
| ... |
| du[118].inumber = 75497758 |
| du[118].namelen = 15 |
| du[118].name = "frame000125.tst" |
| du[118].tag = 0xfd0 |
| du[119].freetag = 0xffff |
| du[119].length = 0x10 |
| du[119].tag = 0xff0 |
| ---- |
| |
| [NOTE] |
| The xfs_db field output is preceded by a ``d'' for ``data''. |
| |
| The next ``data'' block: |
| |
| ---- |
| xfs_db> dblock 1 |
| xfs_db> type dir2 |
| xfs_db> p |
| dhdr.magic = 0x58443244 |
| dhdr.bestfree[0].offset = 0x6d0 |
| dhdr.bestfree[0].length = 0x140 |
| dhdr.bestfree[1].offset = 0xe50 |
| dhdr.bestfree[1].length = 0x20 |
| dhdr.bestfree[2].offset = 0xff0 |
| dhdr.bestfree[2].length = 0x10 |
| du[0].inumber = 75497759 |
| du[0].namelen = 15 |
| du[0].name = "frame000126.tst" |
| du[0].tag = 0x10 |
| ... |
| du[53].inumber = 75497844 |
| du[53].namelen = 15 |
| du[53].name = "frame000179.tst" |
| du[53].tag = 0x6b0 |
| du[54].freetag = 0xffff |
| du[54].length = 0x140 |
| du[54].tag = 0x6d0 |
| du[55].inumber = 75497855 |
| du[55].namelen = 15 |
| du[55].name = "frame000190.tst" |
| du[55].tag = 0x810 |
| ... |
| du[104].inumber = 75497904 |
| du[104].namelen = 15 |
| du[104].name = "frame000239.tst" |
| du[104].tag = 0xe30 |
| du[105].freetag = 0xffff |
| du[105].length = 0x20 |
| du[105].tag = 0xe50 |
| du[106].inumber = 75497906 |
| du[106].namelen = 15 |
| du[106].name = "frame000241.tst" |
| du[106].tag = 0xe70 |
| ... |
| du[117].inumber = 75497917 |
| du[117].namelen = 15 |
| du[117].name = "frame000252.tst" |
| du[117].tag = 0xfd0 |
| du[118].freetag = 0xffff |
| du[118].length = 0x10 |
| du[118].tag = 0xff0 |
| ---- |
| |
| And the last data block: |
| |
| |
| ---- |
| xfs_db> dblock 2 |
| xfs_db> type dir2 |
| xfs_db> p |
| dhdr.magic = 0x58443244 |
| dhdr.bestfree[0].offset = 0x70 |
| dhdr.bestfree[0].length = 0xf90 |
| dhdr.bestfree[1].offset = 0 |
| dhdr.bestfree[1].length = 0 |
| dhdr.bestfree[2].offset = 0 |
| dhdr.bestfree[2].length = 0 |
| du[0].inumber = 75497918 |
| du[0].namelen = 15 |
| du[0].name = "frame000253.tst" |
| du[0].tag = 0x10 |
| du[1].inumber = 75497919 |
| du[1].namelen = 15 |
| du[1].name = "frame000254.tst" |
| du[1].tag = 0x30 |
| du[2].inumber = 75497920 |
| du[2].namelen = 15 |
| du[2].name = "frame000255.tst" |
| du[2].tag = 0x50 |
| du[3].freetag = 0xffff |
| du[3].length = 0xf90 |
| du[3].tag = 0x70 |
| ---- |
| |
| Examining the ``leaf'' block (with the fields preceded by an ``l'' for ``leaf''): |
| |
| ---- |
| xfs_db> dblock 8388608 |
| xfs_db> type dir2 |
| xfs_db> p |
| lhdr.info.forw = 0 |
| lhdr.info.back = 0 |
| lhdr.info.magic = 0xd2f1 |
| lhdr.count = 258 |
| lhdr.stale = 0 |
| lbests[0-2] = 0:0x10 1:0x10 2:0xf90 |
| lents[0].hashval = 0x2e |
| lents[0].address = 0x2 |
| lents[1].hashval = 0x172e |
| lents[1].address = 0x4 |
| lents[2].hashval = 0x23a04084 |
| lents[2].address = 0x116 |
| ... |
| lents[257].hashval = 0xf3a048bc |
| lents[257].address = 0x366 |
| ltail.bestcount = 3 |
| ---- |
| |
| Note how the +lbests+ array correspond with the +bestfree[0].length+ values in |
| the ``data'' blocks: |
| |
| ---- |
| xfs_db> dblock 0 |
| xfs_db> type dir2 |
| xfs_db> p |
| dhdr.magic = 0x58443244 |
| dhdr.bestfree[0].offset = 0xff0 |
| dhdr.bestfree[0].length = 0x10 |
| ... |
| xfs_db> dblock 1 |
| xfs_db> type dir2 |
| xfs_db> p |
| dhdr.magic = 0x58443244 |
| dhdr.bestfree[0].offset = 0xff0 |
| dhdr.bestfree[0].length = 0x10 |
| ... |
| xfs_db> dblock 2 |
| xfs_db> type dir2 |
| xfs_db> p |
| dhdr.magic = 0x58443244 |
| dhdr.bestfree[0].offset = 0x70 |
| dhdr.bestfree[0].length = 0xf90 |
| ---- |
| |
| Now after the entries have been deleted: |
| |
| ---- |
| xfs_db> dblock 8388608 |
| xfs_db> type dir2 |
| xfs_db> p |
| lhdr.info.forw = 0 |
| lhdr.info.back = 0 |
| lhdr.info.magic = 0xd2f1 |
| lhdr.count = 258 |
| lhdr.stale = 21 |
| lbests[0-2] = 0:0x140 1:0x140 2:0xf90 |
| lents[0].hashval = 0x2e |
| lents[0].address = 0x2 |
| lents[1].hashval = 0x172e |
| lents[1].address = 0x4 |
| lents[2].hashval = 0x23a04084 |
| lents[2].address = 0x116 |
| ... |
| ---- |
| |
| As can be seen, the +lbests+ values have been update to contain each |
| +hdr.bestfree[0].length+ values. The leaf's +hdr.stale+ value has also been |
| updated to specify the number of stale entries in the array. The stale entries |
| have an address of zero. |
| |
| TODO: Need an example for where new entries get inserted with several large free |
| spaces. |
| |
| |
| [[Node_Directories]] |
| == Node Directories |
| |
| When the ``leaf'' information fills a block, the extents undergo another |
| separation. All ``freeindex'' information moves into its own extent. Like Leaf |
| Directories, the ``leaf'' block maintained the best free space information for |
| each ``data'' block. This is not possible with more than one leaf. |
| |
| * The ``data'' blocks stay the same as leaf directories. |
| |
| * After the ``freeindex'' data moves to its own block, it is possible for the |
| leaf data to fit within a single leaf block. This single leaf block has a |
| magic number of +XFS_DIR2_LEAFN_MAGIC+ (0xd2ff) or on a v5 filesystem, |
| +XFS_DIR3_LEAFN_MAGIC+ (0x3dff). |
| |
| * The ``leaf'' blocks eventually change into a B+tree with the generic B+tree |
| header pointing to directory ``leaves'' as described in |
| xref:Leaf_Directories[Leaf Directories]. Blocks with leaf data still have the |
| +LEAFN_MAGIC+ magic number as outlined above. The top-level tree blocks are |
| called ``nodes'' and have a magic number of +XFS_DA_NODE_MAGIC+ (0xfebe), or on |
| a v5 filesystem, +XFS_DA3_NODE_MAGIC+ (0x3ebe). |
| |
| * Distinguishing between a combined leaf/freeindex block (+LEAF1_MAGIC+), a |
| leaf-only block (+LEAFN_MAGIC+), and a btree node block (+NODE_MAGIC+) can only |
| be done by examining the magic number. |
| |
| * The new ``freeindex'' block(s) only contains the bests for each data block. |
| |
| * The freeindex block uses the following structures: |
| |
| [source, c] |
| ---- |
| typedef struct xfs_dir2_free_hdr { |
| __uint32_t magic; |
| __int32_t firstdb; |
| __int32_t nvalid; |
| __int32_t nused; |
| } xfs_dir2_free_hdr_t; |
| ---- |
| |
| *magic*:: |
| The magic number of the free block, ``XD2F'' (0x0x58443246). |
| |
| *firstdb*:: |
| The starting directory block number for the bests array. |
| |
| *nvalid*:: |
| Number of elements in the bests array. |
| |
| *nused*:: |
| Number of valid elements in the bests array. |
| |
| [source, c] |
| ---- |
| typedef struct xfs_dir2_free { |
| xfs_dir2_free_hdr_t hdr; |
| xfs_dir2_data_off_t bests[1]; |
| } xfs_dir2_free_t; |
| ---- |
| |
| *hdr*:: |
| Free block header. |
| |
| *bests*:: |
| An array specifying the best free counts in each directory data block. |
| |
| // split lists |
| |
| * On a v5 filesystem, the freeindex block uses the following structures: |
| |
| [source, c] |
| ---- |
| struct xfs_dir3_free_hdr { |
| struct xfs_dir3_blk_hdr hdr; |
| __int32_t firstdb; |
| __int32_t nvalid; |
| __int32_t nused; |
| __int32_t pad; |
| }; |
| ---- |
| |
| *hdr*:: |
| v3 directory block header. The magic number is "XDF3" (0x0x58444633). |
| |
| *firstdb*:: |
| The starting directory block number for the bests array. |
| |
| *nvalid*:: |
| Number of elements in the bests array. |
| |
| *nused*:: |
| Number of valid elements in the bests array. |
| |
| *pad*:: |
| Padding to maintain alignment. |
| |
| [source, c] |
| ---- |
| struct xfs_dir3_free { |
| xfs_dir3_free_hdr_t hdr; |
| __be16 bests[1]; |
| }; |
| ---- |
| |
| *hdr*:: |
| Free block header. |
| |
| *bests*:: |
| An array specifying the best free counts in each directory data block. |
| |
| // split lists |
| |
| * The location of the leaf blocks can be in any order, the only way to determine |
| the appropriate is by the node block hash/before values. Given a hash to look up, |
| you read the node's +btree+ array and first +hashval+ in the array that exceeds |
| the given hash and it can then be found in the block pointed to by the +before+ |
| value. |
| |
| // split lists |
| |
| * The freeindex's +bests+ array starts from the end of the block and grows to the |
| start of the block. |
| |
| * When an data block becomes unused (ie. all entries in it have been deleted), the |
| block is freed, the data extents contain a hole, and the freeindex's +hdr.nused+ |
| value is decremented and the associated +bests[]+ entry is set to 0xffff. |
| |
| * As the first data block always contains ``.'' and ``..'', it's invalid for the |
| directory to have a hole at the start. |
| |
| * The freeindex's +hdr.nvalid+ should always be the same as the number of |
| allocated data directory blocks containing name/inode data and will always be |
| less than or equal to +hdr.nused+. The value of +hdr.nused+ should be the same |
| as the index of the last data directory block plus one (i.e. when the last data |
| block is freed, +nused+ and +nvalid+ are decremented). |
| |
| .Node directory layout |
| image::images/54.png[] |
| |
| === xfs_db Node Directory Example |
| |
| With the node directory examples, we are using a filesystems with 4KB block |
| size, and a 16KB directory size. The directory has over 2000 entries: |
| |
| ---- |
| xfs_db> sb 0 |
| xfs_db> p |
| magicnum = 0x58465342 |
| blocksize = 4096 |
| ... |
| dirblklog = 2 |
| ... |
| xfs_db> inode <inode#> |
| xfs_db> p |
| core.magic = 0x494e |
| core.mode = 040755 |
| core.version = 1 |
| core.format = 2 (extents) |
| ... |
| core.size = 81920 |
| core.nblocks = 36 |
| core.extsize = 0 |
| core.nextents = 8 |
| ... |
| u.bmx[0-7] = [startoff,startblock,blockcount,extentflag] 0:[0,7368,4,0] |
| 1:[4,7408,4,0] 2:[8,7444,4,0] 3:[12,7480,4,0] 4:[16,7520,4,0] |
| 5:[8388608,7396,4,0] 6:[8388612,7524,8,0] 7:[16777216,7516,4,0] |
| ---- |
| |
| As can already be observed, all extents are allocated is multiples of 4 blocks. |
| |
| Blocks 0 to 19 (16+4-1) are used for directory data blocks. Looking at blocks |
| 16-19, we can seen that it's the same as the single-leaf format, except the |
| +length+ values are a lot larger to accommodate the increased directory block |
| size: |
| |
| ---- |
| xfs_db> dblock 16 |
| xfs_db> type dir2 |
| xfs_db> p |
| dhdr.magic = 0x58443244 |
| dhdr.bestfree[0].offset = 0xb0 |
| dhdr.bestfree[0].length = 0x3f50 |
| dhdr.bestfree[1].offset = 0 |
| dhdr.bestfree[1].length = 0 |
| dhdr.bestfree[2].offset = 0 |
| dhdr.bestfree[2].length = 0 |
| du[0].inumber = 120224 |
| du[0].namelen = 15 |
| du[0].name = "frame002043.tst" |
| du[0].tag = 0x10 |
| du[1].inumber = 120225 |
| du[1].namelen = 15 |
| du[1].name = "frame002044.tst" |
| du[1].tag = 0x30 |
| du[2].inumber = 120226 |
| du[2].namelen = 15 |
| du[2].name = "frame002045.tst" |
| du[2].tag = 0x50 |
| du[3].inumber = 120227 |
| du[3].namelen = 15 |
| du[3].name = "frame002046.tst" |
| du[3].tag = 0x70 |
| du[4].inumber = 120228 |
| du[4].namelen = 15 |
| du[4].name = "frame002047.tst" |
| du[4].tag = 0x90 |
| du[5].freetag = 0xffff |
| du[5].length = 0x3f50 |
| du[5].tag = 0 |
| ---- |
| |
| Next, the ``node'' block, the fields are preceded with 'n' for node blocks: |
| |
| ---- |
| xfs_db> dblock 8388608 |
| xfs_db> type dir2 |
| xfs_db> p |
| nhdr.info.forw = 0 |
| nhdr.info.back = 0 |
| nhdr.info.magic = 0xfebe |
| nhdr.count = 2 |
| nhdr.level = 1 |
| nbtree[0-1] = [hashval,before] 0:[0xa3a440ac,8388616] 1:[0xf3a440bc,8388612] |
| ---- |
| |
| The two following leaf blocks were allocated as part of the directory's |
| conversion to node format. All hashes less than 0xa3a440ac are located at |
| directory offset 8,388,616, and hashes less than 0xf3a440bc are located at |
| directory offset 8,388,612. Hashes greater or equal to 0xf3a440bc don't exist |
| in this directory. |
| |
| ---- |
| xfs_db> dblock 8388616 |
| xfs_db> type dir2 |
| xfs_db> p |
| lhdr.info.forw = 8388612 |
| lhdr.info.back = 0 |
| lhdr.info.magic = 0xd2ff |
| lhdr.count = 1023 |
| lhdr.stale = 0 |
| lents[0].hashval = 0x2e |
| lents[0].address = 0x2 |
| lents[1].hashval = 0x172e |
| lents[1].address = 0x4 |
| lents[2].hashval = 0x23a04084 |
| lents[2].address = 0x116 |
| ... |
| lents[1021].hashval = 0xa3a440a4 |
| lents[1021].address = 0x1fa2 |
| lents[1022].hashval = 0xa3a440ac |
| lents[1022].address = 0x1fca |
| xfs_db> dblock 8388612 |
| xfs_db> type dir2 |
| xfs_db> p |
| lhdr.info.forw = 0 |
| lhdr.info.back = 8388616 |
| lhdr.info.magic = 0xd2ff |
| lhdr.count = 1027 |
| lhdr.stale = 0 |
| lents[0].hashval = 0xa3a440b4 |
| lents[0].address = 0x1f52 |
| lents[1].hashval = 0xa3a440bc |
| lents[1].address = 0x1f7a |
| ... |
| lents[1025].hashval = 0xf3a440b4 |
| lents[1025].address = 0x1f66 |
| lents[1026].hashval = 0xf3a440bc |
| lents[1026].address = 0x1f8e |
| ---- |
| |
| An example lookup using xfs_db: |
| |
| ---- |
| xfs_db> hash frame001845.tst |
| 0xf3a26094 |
| ---- |
| |
| Doing a binary search through the array, we get address 0x1ce6, which is offset |
| 0xe730. Each fsblock is 4KB in size (0x1000), so it will be offset 0x730 into |
| directory offset 14. From the extent map, this will be fsblock 7482: |
| |
| ---- |
| xfs_db> fsblock 7482 |
| xfs_db> type text |
| xfs_db> p |
| ... |
| 730: 00 00 00 00 00 01 d4 da 0f 66 72 61 6d 65 30 30 .........frame00 |
| 740: 31 38 34 35 2e 74 73 74 00 00 00 00 00 00 27 30 1845.tst.......0 |
| ---- |
| |
| Looking at the freeindex information (fields with an 'f' tag): |
| |
| ---- |
| xfs_db> fsblock 7516 |
| xfs_db> type dir2 |
| xfs_db> p |
| fhdr.magic = 0x58443246 |
| fhdr.firstdb = 0 |
| fhdr.nvalid = 5 |
| fhdr.nused = 5 |
| fbests[0-4] = 0:0x10 1:0x10 2:0x10 3:0x10 4:0x3f50 |
| ---- |
| |
| Like the Leaf Directory, each of the +fbests+ values correspond to each data |
| block's +bestfree[0].length+ value. |
| |
| The +fbests+ array is highlighted in a raw block dump: |
| |
| [subs="quotes"] |
| ---- |
| xfs_db> type text |
| xfs_db> p |
| 000: 58 44 32 46 00 00 00 00 00 00 00 05 00 00 00 05 XD2F............ |
| 010: *00 10 00 10 00 10 00 10 3f 50* 00 00 1f 01 ff ff .........P...... |
| ---- |
| |
| TODO: Example with a hole in the middle |
| |
| |
| [[Btree_Directories]] |
| == B+tree Directories |
| |
| When the extent map in an inode grows beyond the inode's space, the inode format |
| is changed to a ``btree''. The inode contains a filesystem block point to the |
| B+tree extent map for the directory's blocks. The B+tree extents contain the |
| extent map for the ``data'', ``node'', ``leaf'', and ``freeindex'' information as |
| described in Node Directories. |
| |
| Refer to the previous section on B+tree xref:Btree_Extent_List[Data Extents] for |
| more information on XFS B+tree extents. |
| |
| The following properties apply to both node and B+tree directories: |
| |
| * The node/leaf trees can be more than one level deep. |
| |
| * More than one freeindex block may exist, but this will be quite rare. It would |
| required hundreds of thousand files with quite long file names (or millions with |
| shorter names) to get a second freeindex block. |
| |
| === xfs_db B+tree Directory Example |
| |
| A directory has been created with 200,000 entries with each entry being 100 |
| characters long. The filesystem block size and directory block size are 4KB: |
| |
| ---- |
| xfs_db> inode <inode#> |
| xfs_db> p |
| core.magic = 0x494e |
| core.mode = 040755 |
| core.version = 1 |
| core.format = 3 (btree) |
| ... |
| core.size = 22757376 |
| core.nblocks = 6145 |
| core.extsize = 0 |
| core.nextents = 234 |
| core.naextents = 0 |
| core.forkoff = 0 |
| ... |
| u.bmbt.level = 1 |
| u.bmbt.numrecs = 1 |
| u.bmbt.keys[1] = [startoff] 1:[0] |
| u.bmbt.ptrs[1] = 1:89 |
| xfs_db> fsblock 89 |
| xfs_db> type bmapbtd |
| xfs_db> p |
| magic = 0x424d4150 |
| level = 0 |
| numrecs = 234 |
| leftsib = null |
| rightsib = null |
| recs[1-234] = [startoff,startblock,blockcount,extentflag] |
| 1:[0,53,1,0] 2:[1,55,13,0] 3:[14,69,1,0] 4:[15,72,13,0] |
| 5:[28,86,2,0] 6:[30,90,21,0] 7:[51,112,1,0] 8:[52,114,11,0] |
| ... |
| 125:[5177,902,15,0] 126:[5192,918,6,0] 127:[5198,524786,358,0] |
| 128:[8388608,54,1,0] 129:[8388609,70,2,0] 130:[8388611,85,1,0] |
| ... |
| 229:[8389164,917,1,0] 230:[8389165,924,19,0] 231:[8389184,944,9,0] |
| 232:[16777216,68,1,0] 233:[16777217,7340114,1,0] 234:[16777218,5767362,1,0] |
| ---- |
| |
| We have 128 extents and a total of 5555 blocks being used to store name/inode |
| pairs. With only about 2000 values that can be stored in the freeindex block, 3 |
| blocks have been allocated for this information. The +firstdb+ field specifies |
| the starting directory block number for each array: |
| |
| ---- |
| xfs_db> dblock 16777216 |
| xfs_db> type dir2 |
| xfs_db> p |
| fhdr.magic = 0x58443246 |
| fhdr.firstdb = 0 |
| fhdr.nvalid = 2040 |
| fhdr.nused = 2040 |
| fbests[0-2039] = ... |
| xfs_db> dblock 16777217 |
| xfs_db> type dir2 |
| xfs_db> p |
| fhdr.magic = 0x58443246 |
| fhdr.firstdb = 2040 |
| fhdr.nvalid = 2040 |
| fhdr.nused = 2040 |
| fbests[0-2039] = ... |
| xfs_db> dblock 16777218 |
| xfs_db> type dir2 |
| xfs_db> p |
| fhdr.magic = 0x58443246 |
| fhdr.firstdb = 4080 |
| fhdr.nvalid = 1476 |
| fhdr.nused = 1476 |
| fbests[0-1475] = ... |
| ---- |
| |
| Looking at the root node in the node block, it's a pretty deep tree: |
| |
| ---- |
| xfs_db> dblock 8388608 |
| xfs_db> type dir2 |
| xfs_db> p |
| nhdr.info.forw = 0 |
| nhdr.info.back = 0 |
| nhdr.info.magic = 0xfebe |
| nhdr.count = 2 |
| nhdr.level = 2 |
| nbtree[0-1] = [hashval,before] 0:[0x6bbf6f39,8389121] 1:[0xfbbf7f79,8389120] |
| xfs_db> dblock 8389121 |
| xfs_db> type dir2 |
| xfs_db> p |
| nhdr.info.forw = 8389120 |
| nhdr.info.back = 0 |
| nhdr.info.magic = 0xfebe |
| nhdr.count = 263 |
| nhdr.level = 1 |
| nbtree[0-262] = ... 262:[0x6bbf6f39,8388928] |
| xfs_db> dblock 8389120 |
| xfs_db> type dir2 |
| xfs_db> p |
| nhdr.info.forw = 0 |
| nhdr.info.back = 8389121 |
| nhdr.info.magic = 0xfebe |
| nhdr.count = 319 |
| nhdr.level = 1 |
| nbtree[0-318] = [hashval,before] 0:[0x70b14711,8388919] ... |
| ---- |
| |
| The leaves at each the end of a node always point to the end leaves in adjacent |
| nodes. Directory block 8388928 has a forward pointer to block 8388919 and block |
| 8388919 has a previous pointer to block 8388928, as highlighted in the |
| following example: |
| |
| [subs="quotes"] |
| ---- |
| xfs_db> dblock 8388928 |
| xfs_db> type dir2 |
| xfs_db> p |
| lhdr.info.forw = *8388919* |
| lhdr.info.back = 8388937 |
| lhdr.info.magic = 0xd2ff |
| ... |
| |
| xfs_db> dblock 8388919 |
| xfs_db> type dir2 |
| xfs_db> p |
| lhdr.info.forw = 8388706 |
| lhdr.info.back = *8388928* |
| lhdr.info.magic = 0xd2ff |
| ... |
| ---- |