design/XFS_Filesystem_Structure/allocation_groups.asciidoc - pub/scm/fs/xfs/xfs-documentation - Git at Google

 [[Allocation_Groups]]
 = Allocation Groups

 As mentioned earlier, XFS filesystems are divided into a number of equally
 sized chunks called Allocation Groups. Each AG can almost be thought of as an
 individual filesystem that maintains its own space usage. Each AG can be up to
 one terabyte in size (512 bytes × 2^31^), regardless of the underlying device's
 sector size.

 Each AG has the following characteristics:

          * A super block describing overall filesystem info
          * Free space management
          * Inode allocation and tracking
          * Reverse block-mapping index (optional)
          * Data block reference count index (optional)

 Having multiple AGs allows XFS to handle most operations in parallel without
 degrading performance as the number of concurrent accesses increases.

 The only global information maintained by the first AG (primary) is free space
 across the filesystem and total inode counts. If the
 +XFS_SB_VERSION2_LAZYSBCOUNTBIT+ flag is set in the superblock, these are only
 updated on-disk when the filesystem is cleanly unmounted (umount or shutdown).

 Immediately after a +mkfs.xfs+, the primary AG has the following disk layout;
 the subsequent AGs do not have any inodes allocated:

 .Allocation group layout
 image::images/6.png[]

 Each of these structures are expanded upon in the following sections.

 [[Superblocks]]
 == Superblocks

 Each AG starts with a superblock. The first one, in AG 0, is the primary
 superblock which stores aggregate AG information. Secondary superblocks are
 only used by xfs_repair when the primary superblock has been corrupted.  A
 superblock is one sector in length.

 The superblock is defined by the following structure. The description of each
 field follows.

 [source, c]
 ----
 struct xfs_sb
 {
 	__uint32_t		sb_magicnum;
 	__uint32_t		sb_blocksize;
 	xfs_rfsblock_t		sb_dblocks;
 	xfs_rfsblock_t		sb_rblocks;
 	xfs_rtblock_t		sb_rextents;
 	uuid_t			sb_uuid;
 	xfs_fsblock_t		sb_logstart;
 	xfs_ino_t		sb_rootino;
 	xfs_ino_t		sb_rbmino;
 	xfs_ino_t		sb_rsumino;
 	xfs_agblock_t		sb_rextsize;
 	xfs_agblock_t		sb_agblocks;
 	xfs_agnumber_t		sb_agcount;
 	xfs_extlen_t		sb_rbmblocks;
 	xfs_extlen_t		sb_logblocks;
 	__uint16_t		sb_versionnum;
 	__uint16_t		sb_sectsize;
 	__uint16_t		sb_inodesize;
 	__uint16_t		sb_inopblock;
 	char			sb_fname[12];
 	__uint8_t		sb_blocklog;
 	__uint8_t		sb_sectlog;
 	__uint8_t		sb_inodelog;
 	__uint8_t		sb_inopblog;
 	__uint8_t		sb_agblklog;
 	__uint8_t		sb_rextslog;
 	__uint8_t		sb_inprogress;
 	__uint8_t		sb_imax_pct;
 	__uint64_t		sb_icount;
 	__uint64_t		sb_ifree;
 	__uint64_t		sb_fdblocks;
 	__uint64_t		sb_frextents;
 	xfs_ino_t		sb_uquotino;
 	xfs_ino_t		sb_gquotino;
 	__uint16_t		sb_qflags;
 	__uint8_t		sb_flags;
 	__uint8_t		sb_shared_vn;
 	xfs_extlen_t		sb_inoalignmt;
 	__uint32_t		sb_unit;
 	__uint32_t		sb_width;
 	__uint8_t		sb_dirblklog;
 	__uint8_t		sb_logsectlog;
 	__uint16_t		sb_logsectsize;
 	__uint32_t		sb_logsunit;
 	__uint32_t		sb_features2;
 	__uint32_t		sb_bad_features2;

 	/* version 5 superblock fields start here */
 	__uint32_t		sb_features_compat;
 	__uint32_t		sb_features_ro_compat;
 	__uint32_t		sb_features_incompat;
 	__uint32_t		sb_features_log_incompat;

 	__uint32_t		sb_crc;
 	xfs_extlen_t		sb_spino_align;

 	xfs_ino_t		sb_pquotino;
 	xfs_lsn_t		sb_lsn;
 	uuid_t			sb_meta_uuid;
 	xfs_ino_t		sb_rrmapino;
 };
 ----
 *sb_magicnum*::
 Identifies the filesystem. Its value is +XFS_SB_MAGIC+ ``XFSB'' (0x58465342).

 *sb_blocksize*::
 The size of a basic unit of space allocation in bytes. Typically, this is 4096
 (4KB) but can range from 512 to 65536 bytes.

 *sb_dblocks*::
 Total number of blocks available for data and metadata on the filesystem.

 *sb_rblocks*::
 Number blocks in the real-time disk device. Refer to
 xref:Real-time_Devices[real-time sub-volumes] for more information.

 *sb_rextents*::
 Number of extents on the real-time device.

 *sb_uuid*::
 UUID (Universally Unique ID) for the filesystem. Filesystems can be mounted by
 the UUID instead of device name.

 *sb_logstart*::
 First block number for the journaling log if the log is internal (ie. not on a
 separate disk device). For an external log device, this will be zero (the log
 will also start on the first block on the log device).  The identity of the log
 devices is not recorded in the filesystem, but the UUIDs of the filesystem and
 the log device are compared to prevent corruption.

 *sb_rootino*::
 Root inode number for the filesystem.  Normally, the root inode is at the
 start of the first possible inode chunk in AG 0.  This is 128 when using a 4KB
 block size.

 *sb_rbmino*::
 Bitmap inode for real-time extents.

 *sb_rsumino*::
 Summary inode for real-time bitmap.

 *sb_rextsize*::
 Realtime extent size in blocks.

 *sb_agblocks*::
 Size of each AG in blocks. For the actual size of the last AG, refer to the
 xref:AG_Free_Space_Management[free space] +agf_length+ value.

 *sb_agcount*::
 Number of AGs in the filesystem.

 *sb_rbmblocks*::
 Number of real-time bitmap blocks.

 *sb_logblocks*::
 Number of blocks for the journaling log.

 *sb_versionnum*::
 Filesystem version number. This is a bitmask specifying the features enabled
 when creating the filesystem. Any disk checking tools or drivers that do not
 recognize any set bits must not operate upon the filesystem. Most of the flags
 indicate features introduced over time. If the value of the lower nibble is >=
 4, the higher bits indicate feature flags as follows:

 .Version 4 Superblock version flags
 [options="header"]
 |=====
 | Flag				| Description
 | +XFS_SB_VERSION_ATTRBIT+	| Set if any inode have extended attributes.
 | +XFS_SB_VERSION_NLINKBIT+	| Set if any inodes use 32-bit di_nlink values.
 | +XFS_SB_VERSION_QUOTABIT+	|
 Quotas are enabled on the filesystem. This
 also brings in the various quota fields in the superblock.

 | +XFS_SB_VERSION_ALIGNBIT+	| Set if sb_inoalignmt is used.
 | +XFS_SB_VERSION_DALIGNBIT+	| Set if sb_unit and sb_width are used.
 | +XFS_SB_VERSION_SHAREDBIT+	| Set if sb_shared_vn is used.
 | +XFS_SB_VERSION_LOGV2BIT+	| Version 2 journaling logs are used.
 | +XFS_SB_VERSION_SECTORBIT+	| Set if sb_sectsize is not 512.
 | +XFS_SB_VERSION_EXTFLGBIT+	| Unwritten extents are used. This is always set.
 | +XFS_SB_VERSION_DIRV2BIT+	|
 Version 2 directories are used. This is always set.

 | +XFS_SB_VERSION_MOREBITSBIT+	|
 Set if the sb_features2 field in the superblock contains more flags.
 |=====

 If the lower nibble of this value is 5, then this is a v5 filesystem; the
 +XFS_SB_VERSION2_CRCBIT+ feature must be set in +sb_features2+.

 *sb_sectsize*::
 Specifies the underlying disk sector size in bytes.  Typically this is 512 or
 4096 bytes. This determines the minimum I/O alignment, especially for direct I/O.

 *sb_inodesize*::
 Size of the inode in bytes. The default is 256 (2 inodes per standard sector)
 but can be made as large as 2048 bytes when creating the filesystem.  On a v5
 filesystem, the default and minimum inode size are both 512 bytes.

 *sb_inopblock*::
 Number of inodes per block. This is equivalent to +sb_blocksize / sb_inodesize+.

 *sb_fname[12]*::
 Name for the filesystem. This value can be used in the mount command.

 *sb_blocklog*::
 log~2~ value of +sb_blocksize+. In other terms, +sb_blocksize = 2^sb_blocklog^+.

 *sb_sectlog*::
 log~2~ value of +sb_sectsize+.

 *sb_inodelog*::
 log~2~ value of +sb_inodesize+.

 *sb_inopblog*::
 log~2~ value of +sb_inopblock+.

 *sb_agblklog*::
 log~2~ value of +sb_agblocks+ (rounded up). This value is used to generate inode
 numbers and absolute block numbers defined in extent maps.

 *sb_rextslog*::
 log~2~ value of +sb_rextents+.

 *sb_inprogress*::
 Flag specifying that the filesystem is being created.

 *sb_imax_pct*::
 Maximum percentage of filesystem space that can be used for inodes. The default
 value is 5%.

 *sb_icount*::
 Global count for number inodes allocated on the filesystem. This is only
 maintained in the first superblock.

 *sb_ifree*::
 Global count of free inodes on the filesystem. This is only maintained in the
 first superblock.

 *sb_fdblocks*::
 Global count of free data blocks on the filesystem. This is only maintained in
 the first superblock.

 *sb_frextents*::
 Global count of free real-time extents on the filesystem. This is only
 maintained in the first superblock.

 *sb_uquotino*::
 Inode for user quotas. This and the following two quota fields only apply if
 +XFS_SB_VERSION_QUOTABIT+ flag is set in +sb_versionnum+. Refer to
 xref:Quota_Inodes[quota inodes] for more information

 *sb_gquotino*::
 Inode for group or project quotas. Group and Project quotas cannot be used at
 the same time.

 *sb_qflags*::
 Quota flags. It can be a combination of the following flags:

 .Superblock quota flags
 [options="header"]
 |=====
 | Flag				| Description
 | +XFS_UQUOTA_ACCT+		| User quota accounting is enabled.
 | +XFS_UQUOTA_ENFD+		| User quotas are enforced.
 | +XFS_UQUOTA_CHKD+		| User quotas have been checked.
 | +XFS_PQUOTA_ACCT+		| Project quota accounting is enabled.
 | +XFS_OQUOTA_ENFD+		| Other (group/project) quotas are enforced.
 | +XFS_OQUOTA_CHKD+		| Other (group/project) quotas have been checked.
 | +XFS_GQUOTA_ACCT+		| Group quota accounting is enabled.
 |=====

 *sb_flags*::
 Miscellaneous flags.

 .Superblock flags
 [options="header"]
 |=====
 | Flag                          | Description
 | +XFS_SBF_READONLY+            | Only read-only mounts allowed.
 |=====

 *sb_shared_vn*::
 Reserved and must be zero (``vn'' stands for version number).

 *sb_inoalignmt*::
 Inode chunk alignment in fsblocks.  Prior to v5, the default value provided for
 inode chunks to have an 8KiB alignment.  Starting with v5, the default value
 scales with the multiple of the inode size over 256 bytes.  Concretely, this
 means an alignment of 16KiB for 512-byte inodes, 32KiB for 1024-byte inodes,
 etc.  If sparse inodes are enabled, the +ir_startino+ field of each inode
 B+tree record must be aligned to this block granularity, even if the inode
 given by +ir_startino+ itself is sparse.

 *sb_unit*::
 Underlying stripe or raid unit in blocks.

 *sb_width*::
 Underlying stripe or raid width in blocks.

 *sb_dirblklog*::
 log~2~ multiplier that determines the granularity of directory block allocations
 in fsblocks.

 *sb_logsectlog*::
 log~2~ value of the log subvolume's sector size. This is only used if the
 journaling log is on a separate disk device (i.e. not internal).

 *sb_logsectsize*::
 The log's sector size in bytes if the filesystem uses an external log device.

 *sb_logsunit*::
 The log device's stripe or raid unit size. This only applies to version 2 logs
 +XFS_SB_VERSION_LOGV2BIT+ is set in +sb_versionnum+.

 *sb_features2*::
 Additional version flags if +XFS_SB_VERSION_MOREBITSBIT+ is set in
 +sb_versionnum+. The currently defined additional features include:

 .Extended Version 4 Superblock flags
 [options="header"]
 |=====
 | Flag				| Description
 | +XFS_SB_VERSION2_LAZYSBCOUNTBIT+ |
 Lazy global counters. Making a filesystem with this bit set can improve
 performance. The global free space and inode counts are only updated in the
 primary superblock when the filesystem is cleanly unmounted.

 | +XFS_SB_VERSION2_ATTR2BIT+	|
 Extended attributes version 2. Making a filesystem with this optimises the inode
 layout of extended attributes.  See the section about
 xref:Extended_Attribute_Versions[extended attribute versions] for more
 information.

 | +XFS_SB_VERSION2_PARENTBIT+	|
 Parent pointers. All inodes must have an extended attribute that points back to
 its parent inode. The primary purpose for this information is in backup systems.

 | +XFS_SB_VERSION2_PROJID32BIT+	|
 32-bit Project ID.  Inodes can be associated with a project ID number, which
 can be used to enforce disk space usage quotas for a particular group of
 directories.  This flag indicates that project IDs can be 32 bits in size.

 | +XFS_SB_VERSION2_CRCBIT+	|
 Metadata checksumming.  All metadata blocks have an extended header containing
 the block checksum, a copy of the metadata UUID, the log sequence number of the
 last update to prevent stale replays, and a back pointer to the owner of the
 block.  This feature must be and can only be set if the lowest nibble of
 +sb_versionnum+ is set to 5.

 | +XFS_SB_VERSION2_FTYPE+	|
 Directory file type.  Each directory entry records the type of the inode to
 which the entry points.  This speeds up directory iteration by removing the
 need to load every inode into memory.
 |=====

 *sb_bad_features2*::
 This field mirrors +sb_features2+, due to past 64-bit alignment errors.

 *sb_features_compat*::
 Read-write compatible feature flags.  The kernel can still read and write this
 FS even if it doesn't understand the flag.  Currently, there are no valid
 flags.

 *sb_features_ro_compat*::
 Read-only compatible feature flags.  The kernel can still read this FS even if
 it doesn't understand the flag.

 .Extended Version 5 Superblock Read-Only compatibility flags
 [options="header"]
 |=====
 | Flag				| Description
 | +XFS_SB_FEAT_RO_COMPAT_FINOBT+ |
 Free inode B+tree.  Each allocation group contains a B+tree to track inode chunks
 containing free inodes.  This is a performance optimization to reduce the time
 required to allocate inodes.

 | +XFS_SB_FEAT_RO_COMPAT_RMAPBT+ |
 Reverse mapping B+tree.  Each allocation group contains a B+tree containing
 records mapping AG blocks to their owners.  See the section about
 xref:Reconstruction[reconstruction] for more details.

 | +XFS_SB_FEAT_RO_COMPAT_REFLINK+ |
 Reference count B+tree.  Each allocation group contains a B+tree to track the
 reference counts of AG blocks.  This enables files to share data blocks safely.
 See the section about xref:Reflink_Deduplication[reflink and deduplication] for
 more details.

 |=====

 *sb_features_incompat*::
 Read-write incompatible feature flags.  The kernel cannot read or write this
 FS if it doesn't understand the flag.

 .Extended Version 5 Superblock Read-Write incompatibility flags
 [options="header"]
 |=====
 | Flag				| Description
 | +XFS_SB_FEAT_INCOMPAT_FTYPE+ |
 Directory file type.  Each directory entry tracks the type of the inode to
 which the entry points.  This is a performance optimization to remove the need
 to load every inode into memory to iterate a directory.

 | +XFS_SB_FEAT_INCOMPAT_SPINODES+ |
 Sparse inodes.  This feature relaxes the requirement to allocate inodes in
 chunks of 64.  When the free space is heavily fragmented, there might exist
 plenty of free space but not enough contiguous free space to allocate a new
 inode chunk.  With this feature, the user can continue to create files until
 all free space is exhausted.

 Unused space in the inode B+tree records are used to track which parts of the
 inode chunk are not inodes.

 See the chapter on xref:Sparse_Inodes[Sparse Inodes] for more information.

 | +XFS_SB_FEAT_INCOMPAT_META_UUID+ |
 Metadata UUID.  The UUID stamped into each metadata block must match the value
 in +sb_meta_uuid+.  This enables the administrator to change +sb_uuid+ at will
 without having to rewrite the entire filesystem.
 |=====

 *sb_features_log_incompat*::
 Read-write incompatible feature flags for the log.  The kernel cannot read or
 write this FS log if it doesn't understand the flag.  Currently, no flags are
 defined.

 *sb_crc*::
 Superblock checksum.

 *sb_spino_align*::
 Sparse inode alignment, in fsblocks.  Each chunk of inodes referenced by a
 sparse inode B+tree record must be aligned to this block granularity.

 *sb_pquotino*::
 Project quota inode.

 *sb_lsn*::
 Log sequence number of the last superblock update.

 *sb_meta_uuid*::
 If the +XFS_SB_FEAT_INCOMPAT_META_UUID+ feature is set, then the UUID field in
 all metadata blocks must match this UUID.  If not, the block header UUID field
 must match +sb_uuid+.

 *sb_rrmapino*::
 If the +XFS_SB_FEAT_RO_COMPAT_RMAPBT+ feature is set and a real-time
 device is present (+sb_rblocks+ > 0), this field points to an inode
 that contains the root to the
 xref:Real_time_Reverse_Mapping_Btree[Real-Time Reverse Mapping B+tree].
 This field is zero otherwise.

 === xfs_db Superblock Example

 A filesystem is made on a single disk with the following command:

 ----
 # mkfs.xfs -i attr=2 -n size=16384 -f /dev/sda7
 meta-data=/dev/sda7              isize=256    agcount=16, agsize=3923122 blks
          =                       sectsz=512   attr=2
 data     =                       bsize=4096   blocks=62769952, imaxpct=25
          =                       sunit=0      swidth=0 blks, unwritten=1
 naming   =version 2              bsize=16384
 log      =internal log           bsize=4096   blocks=30649, version=1
          =                       sectsz=512   sunit=0 blks
 realtime =none                   extsz=65536  blocks=0, rtextents=0
 ----

 And in xfs_db, inspecting the superblock:

 ----
 xfs_db> sb
 xfs_db> p
 magicnum = 0x58465342
 blocksize = 4096
 dblocks = 62769952
 rblocks = 0
 rextents = 0
 uuid = 32b24036-6931-45b4-b68c-cd5e7d9a1ca5
 logstart = 33554436
 rootino = 128
 rbmino = 129
 rsumino = 130
 rextsize = 16
 agblocks = 3923122
 agcount = 16
 rbmblocks = 0
 logblocks = 30649
 versionnum = 0xb084
 sectsize = 512
 inodesize = 256
 inopblock = 16
 fname = "\000\000\000\000\000\000\000\000\000\000\000\000"
 blocklog = 12
 sectlog = 9
 inodelog = 8
 inopblog = 4
 agblklog = 22
 rextslog = 0
 inprogress = 0
 imax_pct = 25
 icount = 64
 ifree = 61
 fdblocks = 62739235
 frextents = 0
 uquotino = 0
 gquotino = 0
 qflags = 0
 flags = 0
 shared_vn = 0
 inoalignmt = 2
 unit = 0
 width = 0
 dirblklog = 2
 logsectlog = 0
 logsectsize = 0
 logsunit = 0
 features2 = 8
 ----


 [[AG_Free_Space_Management]]
 == AG Free Space Management

 The XFS filesystem tracks free space in an allocation group using two B+trees.
 One B+tree tracks space by block number, the second by the size of the free
 space block. This scheme allows XFS to find quickly free space near a given
 block or of a given size.

 All block numbers, indexes, and counts are AG relative.

 [[AG_Free_Space_Block]]
 === AG Free Space Block

 The second sector in an AG contains the information about the two free space
 B+trees and associated free space information for the AG. The ``AG Free Space
 Block'' also knows as the +AGF+, uses the following structure:

 [source, c]
 ----
 struct xfs_agf {
      __be32              agf_magicnum;
      __be32              agf_versionnum;
      __be32              agf_seqno;
      __be32              agf_length;
      __be32              agf_roots[XFS_BTNUM_AGF];
      __be32              agf_levels[XFS_BTNUM_AGF];
      __be32              agf_flfirst;
      __be32              agf_fllast;
      __be32              agf_flcount;
      __be32              agf_freeblks;
      __be32              agf_longest;
      __be32              agf_btreeblks;

      /* version 5 filesystem fields start here */
      uuid_t              agf_uuid;
      __be32              agf_rmap_blocks;
      __be32              agf_refcount_blocks;
      __be32              agf_refcount_root;
      __be32              agf_refcount_level;
      __be64              agf_spare64[14];

      /* unlogged fields, written during buffer writeback. */
      __be64              agf_lsn;
      __be32              agf_crc;
      __be32              agf_spare2;
 };
 ----

 The rest of the bytes in the sector are zeroed. +XFS_BTNUM_AGF+ is set to 3:
 index 0 for the free space B+tree indexed by block number; index 1 for the free
 space B+tree indexed by extent size; and index 2 for the reverse-mapping
 B+tree.

 *agf_magicnum*::
 Specifies the magic number for the AGF sector: ``XAGF'' (0x58414746).

 *agf_versionnum*::
 Set to +XFS_AGF_VERSION+ which is currently 1.

 *agf_seqno*::
 Specifies the AG number for the sector.

 *agf_length*::
 Specifies the size of the AG in filesystem blocks. For all AGs except the last,
 this must be equal to the superblock's +sb_agblocks+ value. For the last AG,
 this could be less than the +sb_agblocks+ value. It is this value that should
 be used to determine the size of the AG.

 *agf_roots*::
 Specifies the block number for the root of the two free space B+trees and the
 reverse-mapping B+tree, if enabled.

 *agf_levels*::
 Specifies the level or depth of the two free space B+trees and the
 reverse-mapping B+tree, if enabled.  For a fresh AG, this value will be one,
 and the ``roots'' will point to a single leaf of level 0.

 *agf_flfirst*::
 Specifies the index of the first ``free list'' block. Free lists are covered in
 more detail later on.

 *agf_fllast*::
 Specifies the index of the last ``free list'' block.

 *agf_flcount*::
 Specifies the number of blocks in the ``free list''.

 *agf_freeblks*::
 Specifies the current number of free blocks in the AG.

 *agf_longest*::
 Specifies the number of blocks of longest contiguous free space in the AG.

 *agf_btreeblks*::
 Specifies the number of blocks used for the free space B+trees. This is only
 used if the +XFS_SB_VERSION2_LAZYSBCOUNTBIT+ bit is set in +sb_features2+.

 *agf_uuid*::
 The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
 depending on which features are set.

 *agf_rmap_blocks*::
 The size of the reverse mapping B+tree in this allocation group, in blocks.

 *agf_refcount_blocks*::
 The size of the reference count B+tree in this allocation group, in blocks.

 *agf_refcount_root*::
 Block number for the root of the reference count B+tree, if enabled.

 *agf_refcount_level*::
 Depth of the reference count B+tree, if enabled.

 *agf_spare64*::
 Empty space in the logged part of the AGF sector, for use for future features.

 *agf_lsn*::
 Log sequence number of the last AGF write.

 *agf_crc*::
 Checksum of the AGF sector.

 *agf_spare2*::
 Empty space in the unlogged part of the AGF sector.

 [[AG_Free_Space_Btrees]]
 === AG Free Space B+trees

 The two Free Space B+trees store a sorted array of block offset and block
 counts in the leaves of the B+tree. The first B+tree is sorted by the offset,
 the second by the count or size.

 Leaf nodes contain a sorted array of offset/count pairs which are also used for
 node keys:

 [source, c]
 ----
 struct xfs_alloc_rec {
      __be32                    ar_startblock;
      __be32                    ar_blockcount;
 };
 ----

 *ar_startblock*::
 AG block number of the start of the free space.

 *ar_blockcount*::
 Length of the free space.

 Node pointers are an AG relative block pointer:

 [source, c]
 ----
 typedef __be32 xfs_alloc_ptr_t;
 ----

 * As the free space tracking is AG relative, all the block numbers are only
 32-bits.
 * The +bb_magic+ value depends on the B+tree: ``ABTB'' (0x41425442) for the block
 offset B+tree, ``ABTC'' (0x41425443) for the block count B+tree.  On a v5
 filesystem, these are ``AB3B'' (0x41423342) and ``AB3C'' (0x41423343),
 respectively.
 * The +xfs_btree_sblock_t+ header is used for intermediate B+tree node as well
 as the leaves.
 * For a typical 4KB filesystem block size, the offset for the +xfs_alloc_ptr_t+
 array would be +0xab0+ (2736 decimal).
 * There are a series of macros in +xfs_btree.h+ for deriving the offsets,
 counts, maximums, etc for the B+trees used in XFS.

 The following diagram shows a single level B+tree which consists of one leaf:

 .Freespace B+tree with one leaf.
 image::images/15a.png[]

 With the intermediate nodes, the associated leaf pointers are stored in a
 separate array about two thirds into the block. The following diagram
 illustrates a 2-level B+tree for a free space B+tree:

 .Multi-level freespace B+tree.
 image::images/15b.png[]

 [[AG_Free_List]]
 === AG Free List

 The AG Free List is located in the 4^th^ sector of each AG and is known as the
 AGFL. It is an array of AG relative block pointers for reserved space for
 growing the free space B+trees. This space cannot be used for general user data
 including inodes, data, directories and extended attributes.

 With a freshly made filesystem, 4 blocks are reserved immediately after the free
 space B+tree root blocks (blocks 4 to 7). As they are used up as the free space
 fragments, additional blocks will be reserved from the AG and added to the free
 list array.  This size may increase as features are added.

 As the free list array is located within a single sector, a typical device will
 have space for 128 elements in the array (512 bytes per sector, 4 bytes per AG
 relative block pointer). The actual size can be determined by using the
 +XFS_AGFL_SIZE+ macro.

 Active elements in the array are specified by the
 xref:AG_Free_Space_Block[AGF's] +agf_flfirst+, +agf_fllast+ and +agf_flcount+
 values. The array is managed as a circular list.

 On a v5 filesystem, the following header precedes the free list entries:

 [source, c]
 ----
 struct xfs_agfl {
      __be32              agfl_magicnum;
      __be32              agfl_seqno;
      uuid_t              agfl_uuid;
      __be64              agfl_lsn;
      __be32              agfl_crc;
 };
 ----

 *agfl_magicnum*::
 Specifies the magic number for the AGFL sector: "XAFL" (0x5841464c).

 *agfl_seqno*::
 Specifies the AG number for the sector.

 *agfl_uuid*::
 The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
 depending on which features are set.

 *agfl_lsn*::
 Log sequence number of the last AGFL write.

 *agfl_crc*::
 Checksum of the AGFL sector.

 On a v4 filesystem there is no header; the array of free block numbers begins
 at the beginning of the sector.

 .AG Free List layout
 image::images/16.png[]

 The presence of these reserved blocks guarantees that the free space B+trees
 can be updated if any blocks are freed by extent changes in a full AG.

 ==== xfs_db AGF Example

 These examples are derived from an AG that has been deliberately fragmented.
 The AGF:

 ----
 xfs_db> agf 0
 xfs_db> p
 magicnum = 0x58414746
 versionnum = 1
 seqno = 0
 length = 3923122
 bnoroot = 7
 cntroot = 83343
 bnolevel = 2
 cntlevel = 2
 flfirst = 22
 fllast = 27
 flcount = 6
 freeblks = 3654234
 longest = 3384327
 btreeblks = 0
 ----

 In the AGFL, the active elements are from 22 to 27 inclusive which are obtained
 from the +flfirst+ and +fllast+ values from the +agf+ in the previous example:

 ----
 xfs_db> agfl 0
 xfs_db> p
 bno[0-127] = 0:4 1:5 2:6 3:7 4:83342 5:83343 6:83344 7:83345 8:83346 9:83347
              10:4 11:5 12:80205 13:80780 14:81496 15:81766 16:83346 17:4 18:5
              19:80205 20:82449 21:81496 22:81766 23:82455 24:80780 25:5
              26:80205 27:83344
 ----

 The root block of the free space B+tree sorted by block offset is found in the
 AGF's +bnoroot+ value:

 ----
 xfs_db> fsblock 7
 xfs_db> type bnobt
 xfs_db> p
 magic = 0x41425442
 level = 1
 numrecs = 4
 leftsib = null
 rightsib = null
 keys[1-4] = [startblock,blockcount]
            1:[12,16] 2:[184586,3] 3:[225579,1] 4:[511629,1]
 ptrs[1-4] = 1:2 2:83347 3:6 4:4
 ----

 Blocks 2, 83347, 6 and 4 contain the leaves for the free space B+tree by
 starting block. Block 2 would contain offsets 12 up to but not including 184586
 while block 4 would have all offsets from 511629 to the end of the AG.

 The root block of the free space B+tree sorted by block count is found in the
 AGF's +cntroot+ value:

 ----
 xfs_db> fsblock 83343
 xfs_db> type cntbt
 xfs_db> p
 magic = 0x41425443
 level = 1
 numrecs = 4
 leftsib = null
 rightsib = null
 keys[1-4] = [blockcount,startblock]
            1:[1,81496] 2:[1,511729] 3:[3,191875] 4:[6,184595]
 ptrs[1-4] = 1:3 2:83345 3:83342 4:83346
 ----

 The leaf in block 3, in this example, would only contain single block counts.
 The offsets are sorted in ascending order if the block count is the same.

 Inspecting the leaf in block 83346, we can see the largest block at the end:

 ----
 xfs_db> fsblock 83346
 xfs_db> type cntbt
 xfs_db> p
 magic = 0x41425443
 level = 0
 numrecs = 344
 leftsib = 83342
 rightsib = null
 recs[1-344] = [startblock,blockcount]
            1:[184595,6] 2:[187573,6] 3:[187776,6]
            ...
            342:[513712,755] 343:[230317,258229] 344:[538795,3384327]
 ----

 The longest block count (3384327) must be the same as the AGF's +longest+ value.

 [[AG_Inode_Management]]
 == AG Inode Management

 [[Inode_Numbers]]
 === Inode Numbers

 Inode numbers in XFS come in two forms: AG relative and absolute.

 AG relative inode numbers always fit within 32 bits. The number of bits actually
 used is determined by the sum of the xref:Superblocks[superblock's] +sb_inoplog+
 and +sb_agblklog+ values. Relative inode numbers are found within the AG's inode
 structures.

 Absolute inode numbers include the AG number in the high bits, above the bits
 used for the AG relative inode number. Absolute inode numbers are found in
 xref:Directories[directory] entries and the superblock.

 .Inode number formats
 image::images/18.png[]

 [[Inode_Information]]
 === Inode Information

 Each AG manages its own inodes. The third sector in the AG contains information
 about the AG's inodes and is known as the AGI.

 The AGI uses the following structure:

 [source, c]
 ----
 struct xfs_agi {
      __be32              agi_magicnum;
      __be32              agi_versionnum;
      __be32              agi_seqno
      __be32              agi_length;
      __be32              agi_count;
      __be32              agi_root;
      __be32              agi_level;
      __be32              agi_freecount;
      __be32              agi_newino;
      __be32              agi_dirino;
      __be32              agi_unlinked[64];

      /*
       * v5 filesystem fields start here; this marks the end of logging region 1
       * and start of logging region 2.
       */
      uuid_t              agi_uuid;
      __be32              agi_crc;
      __be32              agi_pad32;
      __be64              agi_lsn;

      __be32              agi_free_root;
      __be32              agi_free_level;
 }
 ----
 *agi_magicnum*::
 Specifies the magic number for the AGI sector: ``XAGI'' (0x58414749).

 *agi_versionnum*::
 Set to +XFS_AGI_VERSION+ which is currently 1.

 *agi_seqno*::
 Specifies the AG number for the sector.

 *agi_length*::
 Specifies the size of the AG in filesystem blocks.

 *agi_count*::
 Specifies the number of inodes allocated for the AG.

 *agi_root*::
 Specifies the block number in the AG containing the root of the inode B+tree.

 *agi_level*::
 Specifies the number of levels in the inode B+tree.

 *agi_freecount*::
 Specifies the number of free inodes in the AG.

 *agi_newino*::
 Specifies AG-relative inode number of the most recently allocated chunk.

 *agi_dirino*::
 Deprecated and not used, this is always set to NULL (-1).

 *agi_unlinked[64]*::
 Hash table of unlinked (deleted) inodes that are still being referenced. Refer
 to xref:Unlinked_Pointer[unlinked list pointers] for more information.

 *agi_uuid*::
 The UUID of this block, which must match either +sb_uuid+ or +sb_meta_uuid+
 depending on which features are set.

 *agi_crc*::
 Checksum of the AGI sector.

 *agi_pad32*::
 Padding field, otherwise unused.

 *agi_lsn*::
 Log sequence number of the last write to this block.

 *agi_free_root*::
 Specifies the block number in the AG containing the root of the free inode
 B+tree.

 *agi_free_level*::
 Specifies the number of levels in the free inode B+tree.

 [[Inode_Btrees]]
 == Inode B+trees

 Inodes are traditionally allocated in chunks of 64, and a B+tree is used to
 track these chunks of inodes as they are allocated and freed. The block
 containing root of the B+tree is defined by the AGI's +agi_root+ value.  If the
 +XFS_SB_FEAT_RO_COMPAT_FINOBT+ feature is enabled, a second B+tree is used to
 track the chunks containing free inodes; this is an optimization to speed up
 inode allocation.

 The B+tree header for the nodes and leaves use the +xfs_btree_sblock+ structure
 which is the same as the header used in the xref:AG_Free_Space_Btrees[AGF
 B+trees].

 The magic number of the inode B+tree is ``IABT'' (0x49414254).  On a v5
 filesystem, the magic number is ``IAB3'' (0x49414233).

 The magic number of the free inode B+tree is ``FIBT'' (0x46494254).  On a v5
 filesystem, the magic number is ``FIB3'' (0x46494254).

 Leaves contain an array of the following structure:

 [source,c]
 ----
 struct xfs_inobt_rec {
      __be32                    ir_startino;
      __be32                    ir_freecount;
      __be64                    ir_free;
 };
 ----

 *ir_startino*::
 The lowest-numbered inode in this chunk.

 *ir_freecount*::
 Number of free inodes in this chunk.

 *ir_free*::
 A 64 element bitmap showing which inodes in this chunk are free.

 Nodes contain key/pointer pairs using the following types:

 [source,c]
 ----
 struct xfs_inobt_key {
      __be32                     ir_startino;
 };
 typedef __be32 xfs_inobt_ptr_t;
 ----

 The following diagram illustrates a single level inode B+tree:

 .Single Level inode B+tree
 image::images/20a.png[]


 And a 2-level inode B+tree:

 .Multi-Level inode B+tree
 image::images/20b.png[]


 ==== xfs_db AGI Example

 This is an AGI of a freshly populated filesystem:

 ----
 xfs_db> agi 0
 xfs_db> p
 magicnum = 0x58414749
 versionnum = 1
 seqno = 0
 length = 825457
 count = 5440
 root = 3
 level = 1
 freecount = 9
 newino = 5792
 dirino = null
 unlinked[0-63] =
 uuid = 3dfa1e5c-5a5f-4ca2-829a-000e453600fe
 lsn = 0x1000032c2
 crc = 0x14cb7e5c (correct)
 free_root = 4
 free_level = 1
 ----

 From this example, we see that the inode B+tree is rooted at AG block 3 and
 that the free inode B+tree is rooted at AG block 4.  Let's look at the
 inode B+tree:

 ----
 xfs_db> addr root
 xfs_db> p
 magic = 0x49414233
 level = 0
 numrecs = 85
 leftsib = null
 rightsib = null
 bno = 24
 lsn = 0x1000032c2
 uuid = 3dfa1e5c-5a5f-4ca2-829a-000e453600fe
 owner = 0
 crc = 0x768f9592 (correct)
 recs[1-85] = [startino,freecount,free]
         1:[96,0,0] 2:[160,0,0] 3:[224,0,0] 4:[288,0,0]
         5:[352,0,0] 6:[416,0,0] 7:[480,0,0] 8:[544,0,0]
         9:[608,0,0] 10:[672,0,0] 11:[736,0,0] 12:[800,0,0]
         ...
         85:[5792,9,0xff80000000000000]
 ----

 Most of the inode chunks on this filesystem are totally full, since the +free+
 value is zero.  This means that we ought to expect inode 160 to be linked
 somewhere in the directory structure.  However, notice that 0xff80000000000000
 in record 85 -- this means that we would expect inode 5856 to be free.  Moving
 on to the free inode B+tree, we see that this is indeed the case:

 ----
 xfs_db> addr free_root
 xfs_db> p
 magic = 0x46494233
 level = 0
 numrecs = 1
 leftsib = null
 rightsib = null
 bno = 32
 lsn = 0x1000032c2
 uuid = 3dfa1e5c-5a5f-4ca2-829a-000e453600fe
 owner = 0
 crc = 0x338af88a (correct)
 recs[1] = [startino,freecount,free] 1:[5792,9,0xff80000000000000]
 ----

 Observe also that the AGI's +agi_newino+ points to this chunk, which has never
 been fully allocated.

 [[Sparse_Inodes]]
 == Sparse Inodes

 As mentioned in the previous section, XFS allocates inodes in chunks of 64.  If
 there are no free extents large enough to hold a full chunk of 64 inodes, the
 inode allocation fails and XFS claims to have run out of space.  On a
 filesystem with highly fragmented free space, this can lead to out of space
 errors long before the filesystem runs out of free blocks.

 The sparse inode feature tracks inode chunks in the inode B+tree as if they
 were full chunks but uses some previously unused bits in the freecount field to
 track which parts of the inode chunk are not allocated for use as inodes.  This
 allows XFS to allocate inodes one block at a time if absolutely necessary.

 The inode and free inode B+trees operate in the same manner as they do without
 the sparse inode feature; the B+tree header for the nodes and leaves use the
 +xfs_btree_sblock+ structure which is the same as the header used in the
 xref:AG_Free_Space_Btrees[AGF B+trees].

 It is theoretically possible for a sparse inode B+tree record to reference
 multiple non-contiguous inode chunks.

 Leaves contain an array of the following structure:

 [source,c]
 ----
 struct xfs_inobt_rec {
      __be32                    ir_startino;
      __be16                    ir_holemask;
      __u8                      ir_count;
      __u8                      ir_freecount;
      __be64                    ir_free;
 };
 ----

 *ir_startino*::
 The lowest-numbered inode in this chunk, rounded down to the nearest multiple
 of 64, even if the start of this chunk is sparse.

 *ir_holemask*::
 A 16 element bitmap showing which parts of the chunk are not allocated to
 inodes.  Each bit represents four inodes; if a bit is marked here, the
 corresponding bits in ir_free must also be marked.

 *ir_count*::
 Number of inodes allocated to this chunk.

 *ir_freecount*::
 Number of free inodes in this chunk.

 *ir_free*::
 A 64 element bitmap showing which inodes in this chunk are not available for
 allocation.

 ==== xfs_db Sparse Inode AGI Example

 This example derives from an AG that has been deliberately fragmented.  The
 inode B+tree:

 ----
 xfs_db> agi 0
 xfs_db> p
 magicnum = 0x58414749
 versionnum = 1
 seqno = 0
 length = 6400
 count = 10432
 root = 2381
 level = 2
 freecount = 0
 newino = 14912
 dirino = null
 unlinked[0-63] =
 uuid = b9b4623b-f678-4d48-8ce7-ce08950e3cd6
 lsn = 0x600000ac4
 crc = 0xef550dbc (correct)
 free_root = 4
 free_level = 1
 ----

 This AGI was formatted on a v5 filesystem; notice the extra v5 fields.  So far
 everything else looks much the same as always.

 ----
 xfs_db> addr root
 magic = 0x49414233
 level = 1
 numrecs = 2
 leftsib = null
 rightsib = null
 bno = 19048
 lsn = 0x50000192b
 uuid = b9b4623b-f678-4d48-8ce7-ce08950e3cd6
 owner = 0
 crc = 0xd98cd2ca (correct)
 keys[1-2] = [startino] 1:[128] 2:[35136]
 ptrs[1-2] = 1:3 2:2380
 xfs_db> addr ptrs[1]
 xfs_db> p
 magic = 0x49414233
 level = 0
 numrecs = 159
 leftsib = null
 rightsib = 2380
 bno = 24
 lsn = 0x600000ac4
 uuid = b9b4623b-f678-4d48-8ce7-ce08950e3cd6
 owner = 0
 crc = 0x836768a6 (correct)
 recs[1-159] = [startino,holemask,count,freecount,free]
         1:[128,0,64,0,0]
         2:[14912,0xff,32,0,0xffffffff]
         3:[15040,0,64,0,0]
         4:[15168,0xff00,32,0,0xffffffff00000000]
         5:[15296,0,64,0,0]
         6:[15424,0xff,32,0,0xffffffff]
         7:[15552,0,64,0,0]
         8:[15680,0xff00,32,0,0xffffffff00000000]
         9:[15808,0,64,0,0]
         10:[15936,0xff,32,0,0xffffffff]
 ----

 Here we see the difference in the inode B+tree records.  For example, in record
 2, we see that the holemask has a value of 0xff.  This means that the first
 sixteen inodes in this chunk record do not actually map to inode blocks; the
 first inode in this chunk is actually inode 14944:

 ----
 xfs_db> inode 14912
 Metadata corruption detected at block 0x3a40/0x2000
 ...
 Metadata CRC error detected for ino 14912
 xfs_db> p core.magic
 core.magic = 0
 xfs_db> inode 14944
 xfs_db> p core.magic
 core.magic = 0x494e
 ----

 The chunk record also indicates that this chunk has 32 inodes, and that the
 missing inodes are also ``free''.

 [[Real-time_Devices]]
 == Real-time Devices

 The performance of the standard XFS allocator varies depending on the internal
 state of the various metadata indices enabled on the filesystem.  For
 applications which need to minimize the jitter of allocation latency, XFS
 supports the notion of a ``real-time device''.  This is a special device
 separate from the regular filesystem where extent allocations are tracked with
 a bitmap and free space is indexed with a two-dimensional array.  If an inode
 is flagged with +XFS_DIFLAG_REALTIME+, its data will live on the real time
 device.  The metadata for real time devices is discussed in the section about
 xref:Real-time_Inodes[real time inodes].

 By placing the real time device (and the journal) on separate high-performance
 storage devices, it is possible to reduce most of the unpredictability in I/O
 response times that come from metadata operations.

 None of the XFS per-AG B+trees are involved with real time files.  It is not
 possible for real time files to share data blocks.