| [[On-disk_Inode]] |
| = On-disk Inode |
| |
| All files, directories, and links are stored on disk with inodes and descend from |
| the root inode with its number defined in the xref:Superblocks[superblock]. The |
| previous section on xref:AG_Inode_Management[AG Inode Management] describes the |
| allocation and management of inodes on disk. This section describes the contents |
| of inodes themselves. |
| |
| An inode is divided into 3 parts: |
| |
| .On-disk inode sections |
| image::images/23.png[] |
| |
| * The core contains what the inode represents, stat data, and information |
| describing the data and attribute forks. |
| |
| * The +di_u+ ``data fork'' contains normal data related to the inode. Its contents |
| depends on the file type specified by +di_core.di_mode+ (eg. regular file, |
| directory, link, etc) and how much information is contained in the file which |
| determined by +di_core.di_format+. The following union to represent this data is |
| declared as follows: |
| |
| [source, c] |
| ---- |
| union { |
| xfs_bmdr_block_t di_bmbt; |
| xfs_bmbt_rec_t di_bmx[1]; |
| xfs_dir2_sf_t di_dir2sf; |
| char di_c[1]; |
| xfs_dev_t di_dev; |
| uuid_t di_muuid; |
| char di_symlink[1]; |
| } di_u; |
| ---- |
| |
| * The +di_a+ ``attribute fork'' contains extended attributes. Its layout is |
| determined by the +di_core.di_aformat+ value. Its representation is declared as |
| follows: |
| |
| [source, c] |
| ---- |
| union { |
| xfs_bmdr_block_t di_abmbt; |
| xfs_bmbt_rec_t di_abmx[1]; |
| xfs_attr_shortform_t di_attrsf; |
| } di_a; |
| ---- |
| |
| [NOTE] |
| The above two unions are rarely used in the XFS code, but the structures |
| within the union are directly cast depending on the +di_mode/di_format+ and |
| +di_aformat+ values. They are referenced in this document to make it easier to |
| explain the various structures in use within the inode. |
| |
| The remaining space in the inode after +di_next_unlinked+ where the two forks |
| are located is called the inode's ``literal area''. This starts at offset 100 |
| (0x64) in a version 1 or 2 inode, and offset 176 (0xb0) in a version 3 inode. |
| |
| The space for each of the two forks in the literal area is determined by the |
| inode size, and +di_core.di_forkoff+. The data fork is located between the start |
| of the literal area and +di_forkoff+. The attribute fork is located between |
| +di_forkoff+ and the end of the inode. |
| |
| |
| [[Inode_Core]] |
| == Inode Core |
| |
| The inode's core is 96 bytes on a V4 filesystem and 176 bytes on a V5 |
| filesystem. It contains information about the file itself including most stat |
| data information about data and attribute forks after the core within the |
| inode. It uses the following structure: |
| |
| [source, c] |
| ---- |
| struct xfs_dinode_core { |
| __uint16_t di_magic; |
| __uint16_t di_mode; |
| __int8_t di_version; |
| __int8_t di_format; |
| __uint16_t di_onlink; |
| __uint32_t di_uid; |
| __uint32_t di_gid; |
| __uint32_t di_nlink; |
| __uint16_t di_projid; |
| __uint16_t di_projid_hi; |
| __uint8_t di_pad[6]; |
| __uint16_t di_flushiter; |
| xfs_timestamp_t di_atime; |
| xfs_timestamp_t di_mtime; |
| xfs_timestamp_t di_ctime; |
| xfs_fsize_t di_size; |
| xfs_rfsblock_t di_nblocks; |
| xfs_extlen_t di_extsize; |
| xfs_extnum_t di_nextents; |
| xfs_aextnum_t di_anextents; |
| __uint8_t di_forkoff; |
| __int8_t di_aformat; |
| __uint32_t di_dmevmask; |
| __uint16_t di_dmstate; |
| __uint16_t di_flags; |
| __uint32_t di_gen; |
| |
| /* di_next_unlinked is the only non-core field in the old dinode */ |
| __be32 di_next_unlinked; |
| |
| /* version 5 filesystem (inode version 3) fields start here */ |
| __le32 di_crc; |
| __be64 di_changecount; |
| __be64 di_lsn; |
| __be64 di_flags2; |
| __be32 di_cowextsize; |
| __u8 di_pad2[12]; |
| xfs_timestamp_t di_crtime; |
| __be64 di_ino; |
| uuid_t di_uuid; |
| |
| }; |
| ---- |
| |
| *di_magic*:: |
| The inode signature; these two bytes are ``IN'' (0x494e). |
| |
| *di_mode*:: |
| Specifies the mode access bits and type of file using the standard S_Ixxx values |
| defined in stat.h. |
| |
| *di_version*:: |
| Specifies the inode version which currently can only be 1, 2, or 3. The inode |
| version specifies the usage of the +di_onlink+, +di_nlink+ and +di_projid+ |
| values in the inode core. Initially, inodes are created as v1 but can be |
| converted on the fly to v2 when required. v3 inodes are created only for v5 |
| filesystems. |
| |
| *di_format*:: |
| Specifies the format of the data fork in conjunction with the +di_mode+ type. |
| This can be one of several values. For directories and links, it can be ``local'' |
| where all metadata associated with the file is within the inode; ``extents'' where |
| the inode contains an array of extents to other filesystem blocks which contain |
| the associated metadata or data; or ``btree'' where the inode contains a B+tree |
| root node which points to filesystem blocks containing the metadata or data. |
| Migration between the formats depends on the amount of metadata associated with |
| the inode. ``dev'' is used for character and block devices while ``uuid'' is |
| currently not used. ``rmap'' indicates that a reverse-mapping B+tree |
| is rooted in the fork. |
| |
| [source, c] |
| ---- |
| typedef enum xfs_dinode_fmt { |
| XFS_DINODE_FMT_DEV, |
| XFS_DINODE_FMT_LOCAL, |
| XFS_DINODE_FMT_EXTENTS, |
| XFS_DINODE_FMT_BTREE, |
| XFS_DINODE_FMT_UUID, |
| XFS_DINODE_FMT_RMAP, |
| } xfs_dinode_fmt_t; |
| ---- |
| |
| *di_onlink*:: |
| In v1 inodes, this specifies the number of links to the inode from directories. |
| When the number exceeds 65535, the inode is converted to v2 and the link count |
| is stored in +di_nlink+. |
| |
| *di_uid*:: |
| Specifies the owner's UID of the inode. |
| |
| *di_gid*:: |
| Specifies the owner's GID of the inode. |
| |
| *di_nlink*:: |
| Specifies the number of links to the inode from directories. This is maintained |
| for both inode versions for current versions of XFS. Prior to v2 inodes, this |
| field was part of +di_pad+. |
| |
| *di_projid*:: |
| Specifies the owner's project ID in v2 inodes. An inode is converted to v2 if |
| the project ID is set. This value must be zero for v1 inodes. |
| |
| *di_projid_hi*:: |
| Specifies the high 16 bits of the owner's project ID in v2 inodes, if the |
| +XFS_SB_VERSION2_PROJID32BIT+ feature is set; and zero otherwise. |
| |
| *di_pad[6]*:: |
| Reserved, must be zero. |
| |
| *di_flushiter*:: |
| Incremented on flush. |
| |
| *di_atime*:: |
| |
| Specifies the last access time of the files using UNIX time conventions the |
| following structure. This value may be undefined if the filesystem is mounted |
| with the ``noatime'' option. XFS supports timestamps with nanosecond resolution: |
| |
| [source, c] |
| ---- |
| struct xfs_timestamp { |
| __int32_t t_sec; |
| __int32_t t_nsec; |
| }; |
| ---- |
| |
| If the +XFS_SB_FEAT_INCOMPAT_BIGTIME+ feature is enabled, the 64 bits used by |
| the timestamp field are interpreted as a flat 64-bit nanosecond counter. |
| See the section about xref:Inode_Timestamps[inode timestamps] for more details. |
| |
| *di_mtime*:: |
| Specifies the last time the file was modified. |
| |
| *di_ctime*:: |
| Specifies when the inode's status was last changed. |
| |
| *di_size*:: |
| Specifies the EOF of the inode in bytes. This can be larger or smaller than the |
| extent space (therefore actual disk space) used for the inode. For regular |
| files, this is the filesize in bytes, directories, the space taken by directory |
| entries and for links, the length of the symlink. |
| |
| *di_nblocks*:: |
| Specifies the number of filesystem blocks used to store the inode's data |
| including relevant metadata like B+trees. This does not include blocks used for |
| extended attributes. |
| |
| *di_extsize*:: |
| Specifies the extent size for filesystems with real-time devices or an extent |
| size hint for standard filesystems. For normal filesystems, and with |
| directories, the +XFS_DIFLAG_EXTSZINHERIT+ flag must be set in +di_flags+ if |
| this field is used. Inodes created in these directories will inherit the |
| di_extsize value and have +XFS_DIFLAG_EXTSIZE+ set in their +di_flags+. When a |
| file is written to beyond allocated space, XFS will attempt to allocate |
| additional disk space based on this value. |
| |
| *di_nextents*:: |
| Specifies the number of data extents associated with this inode. |
| |
| *di_anextents*:: |
| Specifies the number of extended attribute extents associated with this inode. |
| |
| *di_forkoff*:: |
| Specifies the offset into the inode's literal area where the extended attribute |
| fork starts. This is an 8-bit value that is multiplied by 8 to determine the |
| actual offset in bytes (ie. attribute data is 64-bit aligned). This also limits |
| the maximum size of the inode to 2048 bytes. This value is initially zero until |
| an extended attribute is created. When in attribute is added, the nature of |
| +di_forkoff+ depends on the +XFS_SB_VERSION2_ATTR2BIT+ flag in the superblock. |
| Refer to xref:Extended_Attribute_Versions[Extended Attribute Versions] for more |
| details. |
| |
| *di_aformat*:: |
| |
| Specifies the format of the attribute fork. This uses the same values as |
| +di_format+, but restricted to ``local'', ``extents'' and ``btree'' formats for |
| extended attribute data. |
| |
| *di_dmevmask*:: |
| DMAPI event mask. |
| |
| *di_dmstate*:: |
| DMAPI state. |
| |
| *di_flags*:: |
| Specifies flags associated with the inode. This can be a combination of the |
| following values: |
| |
| .Version 2 Inode flags |
| [options="header"] |
| |===== |
| | Flag | Description |
| | +XFS_DIFLAG_REALTIME+ | The inode's data is located on the real-time device. |
| | +XFS_DIFLAG_PREALLOC+ | The inode's extents have been preallocated. |
| | +XFS_DIFLAG_NEWRTBM+ | |
| Specifies the +sb_rbmino+ uses the new real-time bitmap format |
| |
| | +XFS_DIFLAG_IMMUTABLE+ | Specifies the inode cannot be modified. |
| | +XFS_DIFLAG_APPEND+ | The inode is in append only mode. |
| | +XFS_DIFLAG_SYNC+ | The inode is written synchronously. |
| | +XFS_DIFLAG_NOATIME+ | The inode's +di_atime+ is not updated. |
| | +XFS_DIFLAG_NODUMP+ | Specifies the inode is to be ignored by xfsdump. |
| | +XFS_DIFLAG_RTINHERIT+ | |
| For directory inodes, new inodes inherit the +XFS_DIFLAG_REALTIME+ bit. |
| |
| | +XFS_DIFLAG_PROJINHERIT+ | |
| For directory inodes, new inodes inherit the +di_projid+ value. |
| |
| | +XFS_DIFLAG_NOSYMLINKS+ | |
| For directory inodes, symlinks cannot be created. |
| |
| | +XFS_DIFLAG_EXTSIZE+ | |
| Specifies the extent size for real-time files or an extent size hint for regular files. |
| |
| | +XFS_DIFLAG_EXTSZINHERIT+ | |
| For directory inodes, new inodes inherit the +di_extsize+ value. |
| |
| | +XFS_DIFLAG_NODEFRAG+ | |
| Specifies the inode is to be ignored when defragmenting the filesystem. |
| |
| | +XFS_DIFLAG_FILESTREAMS+ | |
| Use the filestream allocator. The filestreams allocator allows a directory to |
| reserve an entire allocation group for exclusive use by files created in that |
| directory. Files in other directories cannot use AGs reserved by other |
| directories. |
| |
| |===== |
| |
| *di_gen*:: |
| A generation number used for inode identification. This is used by tools that do |
| inode scanning such as backup tools and xfsdump. An inode's generation number |
| can change by unlinking and creating a new file that reuses the inode. |
| |
| *di_next_unlinked*:: |
| See the section on xref:Unlinked_Pointer[unlinked inode pointers] for more |
| information. |
| |
| *di_crc*:: |
| Checksum of the inode. |
| |
| *di_changecount*:: |
| Counts the number of changes made to the attributes in this inode. |
| |
| *di_lsn*:: |
| Log sequence number of the last inode write. |
| |
| *di_flags2*:: |
| Specifies extended flags associated with a v3 inode. |
| |
| .Version 3 Inode flags |
| [options="header"] |
| |===== |
| | Flag | Description |
| | +XFS_DIFLAG2_DAX+ | |
| For a file, enable DAX to increase performance on persistent-memory storage. |
| If set on a directory, files created in the directory will inherit this flag. |
| | +XFS_DIFLAG2_REFLINK+ | |
| This inode shares (or has shared) data blocks with another inode. |
| | +XFS_DIFLAG2_COWEXTSIZE+ | |
| For files, this is the extent size hint for copy on write operations; see |
| +di_cowextsize+ for details. For directories, the value in +di_cowextsize+ |
| will be copied to all newly created files and directories. |
| |===== |
| |
| *di_cowextsize*:: |
| Specifies the extent size hint for copy on write operations. When allocating |
| extents for a copy on write operation, the allocator will be asked to align |
| its allocations to either +di_cowextsize+ blocks or +di_extsize+ blocks, |
| whichever is greater. The +XFS_DIFLAG2_COWEXTSIZE+ flag must be set if this |
| field is used. If this field and its flag are set on a directory file, the |
| value will be copied into any files or directories created within this |
| directory. During a block sharing operation, this value will be copied from |
| the source file to the destination file if the sharing operation completely |
| overwrites the destination file's contents and the destination file does not |
| already have +di_cowextsize+ set. |
| |
| *di_pad2*:: |
| Padding for future expansion of the inode. |
| |
| *di_crtime*:: |
| Specifies the time when this inode was created. |
| |
| *di_ino*:: |
| The full inode number of this inode. |
| |
| *di_uuid*:: |
| The UUID of this inode, which must match either +sb_uuid+ or +sb_meta_uuid+ |
| depending on which features are set. |
| |
| [[Unlinked_Pointer]] |
| == Unlinked Pointer |
| |
| The +di_next_unlinked+ value in the inode is used to track inodes that have |
| been unlinked (deleted) but are still open by a program. When an inode is |
| in this state, the inode is added to one of the xref:AG_Inode_Management[AGI's] |
| +agi_unlinked+ hash buckets. The AGI unlinked bucket points to an inode and the |
| +di_next_unlinked+ value points to the next inode in the chain. The last inode |
| in the chain has +di_next_unlinked+ set to NULL (-1). |
| |
| Once the last reference is released, the inode is removed from the unlinked hash |
| chain and +di_next_unlinked+ is set to NULL. In the case of a system crash, XFS |
| recovery will complete the unlink process for any inodes found in these lists. |
| |
| The only time the unlinked fields can be seen to be used on disk is either on an |
| active filesystem or a crashed system. A cleanly unmounted or recovered |
| filesystem will not have any inodes in these unlink hash chains. |
| |
| |
| .Unlinked inode pointer |
| image::images/28.png[] |
| |
| [[Data_Fork]] |
| == Data Fork |
| |
| The structure of the inode's data fork based is on the inode's type and |
| +di_format+. The data fork begins at the start of the inode's ``literal area''. |
| This area starts at offset 100 (0x64), or offset 176 (0xb0) in a v3 inode. The |
| size of the data fork is determined by the type and format. The maximum size is |
| determined by the inode size and +di_forkoff+. In code, use the +XFS_DFORK_PTR+ |
| macro specifying +XFS_DATA_FORK+ for the ``which'' parameter. Alternatively, |
| the +XFS_DFORK_DPTR+ macro can be used. |
| |
| Each of the following sub-sections summarises the contents of the data fork |
| based on the inode type. |
| |
| |
| [[Regular_Files_S_IFREG]] |
| === Regular Files (S_IFREG) |
| |
| The data fork specifies the file's data extents. The extents specify where the |
| file's actual data is located within the filesystem. Extents can have 2 formats |
| which is defined by the di_format value: |
| |
| * +XFS_DINODE_FMT_EXTENTS+: The extent data is fully contained within the inode |
| which contains an array of extents to the filesystem blocks for the file's data. |
| To access the extents, cast the return value from +XFS_DFORK_DPTR+ to |
| +xfs_bmbt_rec_t*+. |
| |
| * +XFS_DINODE_FMT_BTREE+: The extent data is contained in the leaves of a B+tree. |
| The inode contains the root node of the tree and is accessed by casting the |
| return value from +XFS_DFORK_DPTR+ to +xfs_bmdr_block_t*+. |
| |
| |
| Details for each of these data extent formats are covered in the |
| xref:Data_Extents[Data Extents] later on. |
| |
| |
| |
| [[Directories_S_IFDIR]] |
| === Directories (S_IFDIR) |
| |
| The data fork contains the directory's entries and associated data. The format |
| of the entries is also determined by the +di_format+ value and can be one of 3 |
| formats: |
| |
| * +XFS_DINODE_FMT_LOCAL+: The directory entries are fully contained within the |
| inode. This is accessed by casting the value from +XFS_DFORK_DPTR+ to |
| +xfs_dir2_sf_t*+. |
| |
| * +XFS_DINODE_FMT_EXTENTS+: The actual directory entries are located in another |
| filesystem block, the inode contains an array of extents to these filesystem |
| blocks (+xfs_bmbt_rec_t*+). |
| |
| * +XFS_DINODE_FMT_BTREE+: The directory entries are contained in the leaves of a |
| B+tree. The inode contains the root node (+xfs_bmdr_block_t*+). |
| |
| Details for each of these directory formats are covered in the |
| xref:Directories[Directories] later on. |
| |
| |
| [[Symbolic_Links_S_IFLNK]] |
| === Symbolic Links (S_IFLNK) |
| |
| The data fork contains the contents of the symbolic link. The format of the link |
| is determined by the +di_format+ value and can be one of 2 formats: |
| |
| * +XFS_DINODE_FMT_LOCAL+: The symbolic link is fully contained within the inode. |
| This is accessed by casting the return value from +XFS_DFORK_DPTR+ to +char*+. |
| |
| * +XFS_DINODE_FMT_EXTENTS+: The actual symlink is located in another filesystem |
| block, the inode contains the extents to these filesystem blocks |
| (+xfs_bmbt_rec_t*+). |
| |
| Details for symbolic links is covered in the section about |
| xref:Symbolic_Links[Symbolic Links]. |
| |
| [[Other_File_Types]] |
| === Other File Types |
| |
| For character and block devices (+S_IFCHR+ and +S_IFBLK+), cast the value from |
| +XFS_DFORK_DPTR+ to +xfs_dev_t*+. |
| |
| |
| |
| |
| [[Attribute_Fork]] |
| == Attribute Fork |
| |
| The attribute fork in the inode always contains the location of the extended |
| attributes associated with the inode. |
| |
| The location of the attribute fork in the inode's literal area is specified by |
| the +di_forkoff+ value in the inode's core. If this value is zero, the inode |
| does not contain any extended attributes. If non-zero, the attribute fork's |
| byte offset into the literal area can be computed from +di_forkoff × 8+. |
| Attributes must be allocated on a 64-bit boundary on the disk. To access the |
| extended attributes in code, use the +XFS_DFORK_PTR+ macro specifying |
| +XFS_ATTR_FORK+ for the ``which'' parameter. Alternatively, the +XFS_DFORK_APTR+ |
| macro can be used. |
| |
| The structure of the attribute fork depends on the +di_aformat+ value |
| in the inode. It can be one of the following values: |
| |
| * +XFS_DINODE_FMT_LOCAL+: The extended attributes are contained entirely within |
| the inode. This is accessed by casting the value from +XFS_DFORK_APTR+ to |
| +xfs_attr_shortform_t*+. |
| |
| * +XFS_DINODE_FMT_EXTENTS+: The attributes are located in another filesystem |
| block, the inode contains an array of pointers to these filesystem blocks. They |
| are accessed by casting the value from +XFS_DFORK_APTR+ to +xfs_bmbt_rec_t*+. |
| |
| * +XFS_DINODE_FMT_BTREE+: The extents for the attributes are contained in the |
| leaves of a B+tree. The inode contains the root node of the tree and is accessed |
| by casting the value from +XFS_DFORK_APTR+ to +xfs_bmdr_block_t*+. |
| |
| Detailed information on the layouts of extended attributes are covered in the |
| xref:Extended_Attributes[Extended Attributes] in this document. |
| |
| |
| |
| [[Extended_Attribute_Versions]] |
| === Extended Attribute Versions |
| |
| Extended attributes come in two versions: ``attr1'' or ``attr2''. The attribute |
| version is specified by the +XFS_SB_VERSION2_ATTR2BIT+ flag in the |
| +sb_features2+ field in the superblock. It determines how the inode's extra |
| space is split between +di_u+ and +di_a+ forks which also determines how the |
| +di_forkoff+ value is maintained in the inode's core. |
| |
| With ``attr1'' attributes, the +di_forkoff+ is set to somewhere in the middle of |
| the space between the core and end of the inode and never changes (which has the |
| effect of artificially limiting the space for data information). As the data |
| fork grows, when it gets to +di_forkoff+, it will move the data to the next |
| format level (ie. local < extent < btree). If very little space is used |
| for either attributes or data, then a good portion of the available inode space |
| is wasted with this version. |
| |
| ``attr2'' was introduced to maximum the utilisation of the inode's literal area. |
| The +di_forkoff+ starts at the end of the inode and works its way to the data |
| fork as attributes are added. Attr2 is highly recommended if extended attributes |
| are used. |
| |
| The following diagram compares the two versions: |
| |
| .Extended attribute layouts |
| image::images/30.png[] |
| |
| Note that because +di_forkoff+ is an 8-bit value measuring units of 8 bytes, |
| the maximum size of an inode is 2^8^ × 2^3^ = 2^11^ = 2048 bytes. |