| = Fixed Length Record B+trees |
| |
| XFS uses b+trees to index all metadata records. This well known data structure |
| is used to provide efficient random and sequential access to metadata records |
| while minimizing seek times. There are two btree formats: a short format |
| for records pertaining to a single allocation group, since all block pointers |
| in an AG are 32-bits in size; and a long format for records pertaining to a |
| file, since file data can have 64-bit block offsets. Each b+tree block is |
| either a leaf node containing records, or an internal node containing keys and |
| pointers to other b+tree blocks. The tree consists of a root block which may |
| point to some number of other blocks; blocks in the bottom level of the b+tree |
| contains only records. |
| |
| Leaf blocks of both types of b+trees have the same general format: a header |
| describing the data in the block, and an array of records. The specific header |
| formats are given in the next two sections, and the record format is provided |
| by the b+tree client itself. The generic b+tree code does not have any |
| specific knowledge of the record format. |
| |
| ---- |
| +--------+------------+------------+ |
| | header | record | records... | |
| +--------+------------+------------+ |
| ---- |
| |
| Internal node blocks of both types of b+trees also have the same general |
| format: a header describing the data in the block, an array of keys, and an |
| array of pointers. Each pointer may be associated with one or two keys. The |
| first key uniquely identifies the first record accessible via the leftmost path |
| down the branch of the tree. |
| |
| If the records in a b+tree are indexed by an interval, then a range of keys can |
| uniquely identify a single record. For example, if a record covers blocks |
| 12-16, then any one of the keys 12, 13, 14, 15, or 16 return the same record. |
| In this case, the key for the record describing "12-16" is 12. If none of the |
| records overlap, we only need to store one key. |
| |
| This is the format of a standard b+tree node: |
| |
| ---- |
| +--------+---------+---------+---------+---------+ |
| | header | key | keys... | ptr | ptrs... | |
| +--------+---------+---------+---------+---------+ |
| ---- |
| |
| If the b+tree records do not overlap, performing a b+tree lookup is simple. |
| Start with the root. If it is a leaf block, perform a binary search of the |
| records until we find the record with a lower key than our search key. If the |
| block is a node block, perform a binary search of the keys until we find a |
| key lower than our search key, then follow the pointer to the next block. |
| Repeat until we find a record. |
| |
| However, if b+tree records contain intervals and are allowed to overlap, the |
| internal nodes of the b+tree become larger: |
| |
| ---- |
| +--------+---------+----------+---------+-------------+---------+---------+ |
| | header | low key | high key | low key | high key... | ptr | ptrs... | |
| +--------+---------+----------+---------+-------------+---------+---------+ |
| ---- |
| |
| The low keys are exactly the same as the keys in the non-overlapping b+tree. |
| High keys, however, are a little different. Recall that a record with a key |
| consisting of an interval can be referenced by a number of keys. Since the low |
| key of a record indexes the low end of that key range, the high key indexes the |
| high end of the key range. Returning to the example above, the high key for |
| the record describing "12-16" is 16. The high key recorded in a b+tree node |
| is the largest of the high keys of all records accessible under the subtree |
| rooted by the pointer. For a level 1 node, this is the largest high key in |
| the pointed-to leaf node; for any other node, this is the largest of the high |
| keys in the pointed-to node. |
| |
| Nodes and leaves use the same magic numbers. |
| |
| [[Short_Format_Btrees]] |
| == Short Format B+trees |
| |
| Each allocation group uses a ``short format'' B+tree to index various |
| information about the allocation group. The structure is called short format |
| because all block pointers are AG block numbers. The trees use the following |
| header: |
| |
| [source, c] |
| ---- |
| struct xfs_btree_sblock { |
| __be32 bb_magic; |
| __be16 bb_level; |
| __be16 bb_numrecs; |
| __be32 bb_leftsib; |
| __be32 bb_rightsib; |
| |
| /* version 5 filesystem fields start here */ |
| __be64 bb_blkno; |
| __be64 bb_lsn; |
| uuid_t bb_uuid; |
| __be32 bb_owner; |
| __le32 bb_crc; |
| }; |
| ---- |
| |
| *bb_magic*:: |
| Specifies the magic number for the per-AG B+tree block. |
| |
| *bb_level*:: |
| The level of the tree in which this block is found. If this value is 0, this |
| is a leaf block and contains records; otherwise, it is a node block and |
| contains keys and pointers. Level values increase towards the root. |
| |
| *bb_numrecs*:: |
| Number of records in this block. |
| |
| *bb_leftsib*:: |
| AG block number of the left sibling of this B+tree node. |
| |
| *bb_rightsib*:: |
| AG block number of the right sibling of this B+tree node. |
| |
| *bb_blkno*:: |
| FS block number of this B+tree block. |
| |
| *bb_lsn*:: |
| Log sequence number of the last write to this block. |
| |
| *bb_uuid*:: |
| The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+ |
| depending on which features are set. |
| |
| *bb_owner*:: |
| The AG number that this B+tree block ought to be in. |
| |
| *bb_crc*:: |
| Checksum of the B+tree block. |
| |
| [[Long_Format_Btrees]] |
| == Long Format B+trees |
| |
| Long format B+trees are similar to short format B+trees, except that their |
| block pointers are 64-bit filesystem block numbers instead of 32-bit AG block |
| numbers. Because of this, long format b+trees can be (and usually are) rooted |
| in an inode's data or attribute fork. The nodes and leaves of this B+tree use |
| the +xfs_btree_lblock+ declaration: |
| |
| [source, c] |
| ---- |
| struct xfs_btree_lblock { |
| __be32 bb_magic; |
| __be16 bb_level; |
| __be16 bb_numrecs; |
| __be64 bb_leftsib; |
| __be64 bb_rightsib; |
| |
| /* version 5 filesystem fields start here */ |
| __be64 bb_blkno; |
| __be64 bb_lsn; |
| uuid_t bb_uuid; |
| __be64 bb_owner; |
| __le32 bb_crc; |
| __be32 bb_pad; |
| }; |
| ---- |
| |
| *bb_magic*:: |
| Specifies the magic number for the btree block. |
| |
| *bb_level*:: |
| The level of the tree in which this block is found. If this value is 0, this |
| is a leaf block and contains records; otherwise, it is a node block and |
| contains keys and pointers. |
| |
| *bb_numrecs*:: |
| Number of records in this block. |
| |
| *bb_leftsib*:: |
| FS block number of the left sibling of this B+tree node. |
| |
| *bb_rightsib*:: |
| FS block number of the right sibling of this B+tree node. |
| |
| *bb_blkno*:: |
| FS block number of this B+tree block. |
| |
| *bb_lsn*:: |
| Log sequence number of the last write to this block. |
| |
| *bb_uuid*:: |
| The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+ |
| depending on which features are set. |
| |
| *bb_owner*:: |
| The AG number that this B+tree block ought to be in. |
| |
| *bb_crc*:: |
| Checksum of the B+tree block. |
| |
| *bb_pad*:: |
| Pads the structure to 64 bytes. |