<?xml version='1.0'?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
]>
<chapter id="On-disk_Inode">
<title>On-disk Inode</title>
<para>All files, directories and links are stored on disk with inodes and descend from the root inode, whose number is defined in the superblock (<xref linkend="Superblocks"/>). The previous section on AG Inode Management (<xref linkend="AG_Inode_Management"/>) describes the allocation and management of inodes on disk. This section describes the contents of inodes themselves.</para>
<para>An inode is divided into 3 parts:</para>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/23.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>23</phrase></textobject>
</mediaobject>
</para>
<itemizedlist>
<listitem>
<para>The core contains what the inode represents, stat data and information describing the data and attribute forks. </para>
</listitem>
<listitem>
<para>The <command>di_u</command> "data fork" contains normal data related to the inode. Its contents depend on the file type specified by <command>di_core.di_mode</command> (e.g. regular file, directory, link, etc.) and on how much information is contained in the file, which is determined by <command>di_core.di_format</command>. The union representing this data is declared as follows:</para>
<programlisting>
union {
xfs_bmdr_block_t di_bmbt;
xfs_bmbt_rec_t di_bmx[1];
xfs_dir2_sf_t di_dir2sf;
char di_c[1];
xfs_dev_t di_dev;
uuid_t di_muuid;
char di_symlink[1];
} di_u;
</programlisting>
</listitem>
<listitem>
<para>The <command>di_a</command> "attribute fork" contains extended attributes. Its layout is determined by the <command>di_core.di_aformat</command> value. Its representation is declared as follows:</para>
<programlisting>
union {
xfs_bmdr_block_t di_abmbt;
xfs_bmbt_rec_t di_abmx[1];
xfs_attr_shortform_t di_attrsf;
} di_a;
</programlisting>
</listitem>
</itemizedlist>
<note><para>The above two unions are rarely used in the XFS code; instead, the structures within the unions are cast directly depending on the <command>di_mode/di_format</command> and <command>di_aformat</command> values. They are referenced in this document to make it easier to explain the various structures in use within the inode.</para></note>
<para>The remaining space in the inode after <command>di_next_unlinked</command> where the two forks are located is called the inode's "literal area". This starts at offset 100 (0x64) in the inode.</para>
<para>The space for each of the two forks in the literal area is determined by the inode size, and <command>di_core.di_forkoff</command>. The data fork is located between the start of the literal area and <command>di_forkoff</command>. The attribute fork is located between <command>di_forkoff</command> and the end of the inode.</para>
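<para>As an illustration of the layout described above, the following C sketch computes the byte range of each fork from the inode size and <command>di_forkoff</command>. The <command>fork_layout</command> structure and the <command>compute_fork_layout</command> function are hypothetical names used only for this example and are not part of the XFS code. The sketch assumes that <command>di_forkoff</command> is counted in 8-byte units and that a value of zero means no attribute fork is present, so the data fork may use the whole literal area.</para>
<programlisting><![CDATA[
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Offset of the literal area within the inode, as stated in the text. */
#define LITERAL_AREA_OFFSET 100u

/* Hypothetical helper type: byte ranges of both forks within one inode. */
struct fork_layout {
    size_t data_off;  /* start of the data fork (always 100) */
    size_t data_len;  /* bytes available to the data fork */
    size_t attr_off;  /* start of the attribute fork, 0 if absent */
    size_t attr_len;  /* bytes available to the attribute fork */
};

/*
 * inode_size is the on-disk inode size in bytes; di_forkoff is the raw
 * value from the inode core (assumed to be in 8-byte units).
 */
static struct fork_layout compute_fork_layout(size_t inode_size,
                                              uint8_t di_forkoff)
{
    struct fork_layout l;
    size_t literal_len, attr_rel;

    assert(inode_size > LITERAL_AREA_OFFSET);
    literal_len = inode_size - LITERAL_AREA_OFFSET;
    attr_rel = (size_t)di_forkoff * 8;
    assert(attr_rel <= literal_len);

    l.data_off = LITERAL_AREA_OFFSET;
    if (di_forkoff == 0) {
        /* No attribute fork: the data fork owns the literal area. */
        l.data_len = literal_len;
        l.attr_off = 0;
        l.attr_len = 0;
    } else {
        l.data_len = attr_rel;
        l.attr_off = LITERAL_AREA_OFFSET + attr_rel;
        l.attr_len = literal_len - attr_rel;
    }
    return l;
}
]]></programlisting>
<para>For a 256-byte inode with a <command>di_forkoff</command> of 15, this gives a data fork of 120 bytes at offset 100 and an attribute fork of 36 bytes at offset 220.</para>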
<section id="Inode_Core"><title>Inode Core</title>
<para>The inode's core is 96 bytes in size and contains information about the file itself, including most stat data, as well as information about the data and attribute forks that follow the core within the inode. It uses the following structure:</para>
<programlisting>
typedef struct xfs_dinode_core {
__uint16_t di_magic;
__uint16_t di_mode;
__int8_t di_version;
__int8_t di_format;
__uint16_t di_onlink;
__uint32_t di_uid;
__uint32_t di_gid;
__uint32_t di_nlink;
__uint16_t di_projid;
__uint8_t di_pad[8];
__uint16_t di_flushiter;
xfs_timestamp_t di_atime;
xfs_timestamp_t di_mtime;
xfs_timestamp_t di_ctime;
xfs_fsize_t di_size;
xfs_drfsbno_t di_nblocks;
xfs_extlen_t di_extsize;
xfs_extnum_t di_nextents;
xfs_aextnum_t di_anextents;
__uint8_t di_forkoff;
__int8_t di_aformat;
__uint32_t di_dmevmask;
__uint16_t di_dmstate;
__uint16_t di_flags;
__uint32_t di_gen;
} xfs_dinode_core_t;
</programlisting>
<variablelist>
<varlistentry>
<term>di_magic</term>
<listitem><para>The inode signature where these two bytes are 0x494e, or "IN" in ASCII.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_mode</term>
<listitem><para>Specifies the mode access bits and type of file using the standard S_Ixxx values defined in stat.h.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_version</term>
<listitem><para>Specifies the inode version which currently can only be 1 or 2. The inode version specifies the usage of the <command>di_onlink</command>, <command>di_nlink</command> and <command>di_projid</command> values in the inode core. Initially, inodes are created as v1 but can be converted on the fly to v2 when required.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_format</term>
<listitem><para>Specifies the format of the data fork in conjunction with the <command>di_mode</command> type. This can be one of several values. For directories and links, it can be "local", where all metadata associated with the file is within the inode; "extents", where the inode contains an array of extents to other filesystem blocks which contain the associated metadata or data; or "btree", where the inode contains a B+tree root node which points to filesystem blocks containing the metadata or data. Migration between the formats depends on the amount of metadata associated with the inode. "dev" is used for character and block devices while "uuid" is currently not used.
<programlisting>
typedef enum xfs_dinode_fmt {
XFS_DINODE_FMT_DEV,
XFS_DINODE_FMT_LOCAL,
XFS_DINODE_FMT_EXTENTS,
XFS_DINODE_FMT_BTREE,
XFS_DINODE_FMT_UUID
} xfs_dinode_fmt_t;
</programlisting></para></listitem>
</varlistentry>
<varlistentry>
<term>di_onlink</term>
<listitem><para>In v1 inodes, this specifies the number of links to the inode from directories. When the number exceeds 65535, the inode is converted to v2 and the link count is stored in <command>di_nlink</command>.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_uid</term>
<listitem><para>Specifies the owner's UID of the inode. </para></listitem>
</varlistentry>
<varlistentry>
<term>di_gid</term>
<listitem><para>Specifies the owner's GID of the inode.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_nlink</term>
<listitem><para>Specifies the number of links to the inode from directories. This is maintained for both inode versions for current versions of XFS. Old versions of XFS did not support v2 inodes, and therefore this value was never updated and was classed as reserved space (part of <command>di_pad</command>).</para></listitem>
</varlistentry>
<varlistentry>
<term>di_projid</term>
<listitem><para>Specifies the owner's project ID in v2 inodes. An inode is converted to v2 if the project ID is set.  This value must be zero for v1 inodes.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_pad[8]</term>
<listitem><para>Reserved, must be zero.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_flushiter</term>
<listitem><para>Incremented on flush.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_atime</term>
<listitem><para>Specifies the last access time of the file using UNIX time conventions, stored in the following structure. This value may be undefined if the filesystem is mounted with the "noatime" option.
<programlisting>
typedef struct xfs_timestamp {
__int32_t t_sec;
__int32_t t_nsec;
} xfs_timestamp_t;
</programlisting></para></listitem>
</varlistentry>
<varlistentry>
<term>di_mtime</term>
<listitem><para>Specifies the last time the file was modified.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_ctime</term>
<listitem><para>Specifies when the inode's status was last changed.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_size</term>
<listitem><para>Specifies the EOF of the inode in bytes. This can be larger or smaller than the extent space (and therefore the actual disk space) used for the inode. For regular files, this is the file size in bytes; for directories, the space taken by directory entries; and for links, the length of the symlink.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_nblocks</term>
<listitem><para>Specifies the number of filesystem blocks used to store the inode's data including relevant metadata like B+trees. This does not include blocks used for extended attributes.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_extsize</term>
<listitem><para>Specifies the extent size for filesystems with real-time devices and an extent size hint for standard filesystems. On standard filesystems, if this field is used on a directory, the <command>XFS_DIFLAG_EXTSZINHERIT</command> flag must be set in <command>di_flags</command>. Inodes created in such directories inherit the <command>di_extsize</command> value and have <command>XFS_DIFLAG_EXTSIZE</command> set in their <command>di_flags</command>. When a file is written to beyond its allocated space, XFS attempts to allocate additional disk space based on this value.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_nextents</term>
<listitem><para>Specifies the number of data extents associated with this inode.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_anextents</term>
<listitem><para>Specifies the number of extended attribute extents associated with this inode.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_forkoff</term>
<listitem><para>Specifies the offset into the inode's literal area where the extended attribute fork starts. This is an 8-bit value that is multiplied by 8 to determine the actual offset in bytes (i.e. attribute data is 64-bit aligned). This also limits the maximum size of the inode to 2048 bytes. This value is initially zero until an extended attribute is created. When an attribute is added, the nature of <command>di_forkoff</command> depends on the <command>XFS_SB_VERSION2_ATTR2BIT</command> flag in the superblock. Refer to the Extended Attribute Versions section (<xref linkend="Extended_Attribute_Versions"/>) for more details.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_aformat</term>
<listitem><para>Specifies the format of the attribute fork. This uses the same values as <command>di_format</command>, but restricted to "local", "extents" and "btree" formats for extended attribute data.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_dmevmask</term>
<listitem><para>DMAPI event mask.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_dmstate</term>
<listitem><para>DMAPI state.</para></listitem>
</varlistentry>
<varlistentry>
<term>di_flags</term>
<listitem><para> Specifies flags associated with the inode. This can be a combination of the following values:
<informaltable frame="all">
<tgroup cols="2"><thead><row>
<entry>
<para>Flag</para>
</entry>
<entry>
<para>Description</para>
</entry>
</row>
</thead><tbody>
<row>
<entry>
<para><command>XFS_DIFLAG_REALTIME</command></para>
</entry>
<entry>
<para>The inode's data is located on the real-time device.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_PREALLOC</command></para>
</entry>
<entry>
<para>The inode's extents have been preallocated.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_NEWRTBM</command></para>
</entry>
<entry>
<para>Specifies that <command>sb_rbmino</command> uses the new real-time bitmap format.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_IMMUTABLE</command></para>
</entry>
<entry>
<para>Specifies the inode cannot be modified.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_APPEND</command></para>
</entry>
<entry>
<para>The inode is in append only mode.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_SYNC</command></para>
</entry>
<entry>
<para>The inode is written synchronously.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_NOATIME</command></para>
</entry>
<entry>
<para>The inode's <command>di_atime</command> is not updated.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_NODUMP</command></para>
</entry>
<entry>
<para>Specifies the inode is to be ignored by xfsdump.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_RTINHERIT</command></para>
</entry>
<entry>
<para>For directory inodes, new inodes inherit the <command>XFS_DIFLAG_REALTIME</command> bit.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_PROJINHERIT</command></para>
</entry>
<entry>
<para>For directory inodes, new inodes inherit the <command>di_projid</command> value.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_NOSYMLINKS</command></para>
</entry>
<entry>
<para>For directory inodes, symlinks cannot be created.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_EXTSIZE</command></para>
</entry>
<entry>
<para>Specifies the extent size for real-time files or an extent size hint for regular files.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_EXTSZINHERIT</command></para>
</entry>
<entry>
<para>For directory inodes, new inodes inherit the <command>di_extsize</command> value.</para>
</entry>
</row>
<row>
<entry>
<para><command>XFS_DIFLAG_NODEFRAG</command></para>
</entry>
<entry>
<para>Specifies the inode is to be ignored when defragmenting the filesystem.</para>
</entry>
</row></tbody></tgroup>
</informaltable></para></listitem>
</varlistentry>
<varlistentry>
<term>di_gen</term>
<listitem>
<para>A generation number used for inode identification. This is used by tools that do inode scanning such as backup tools and xfsdump. An inode's generation number can change by unlinking and creating a new file that reuses the inode.
</para>
</listitem>
</varlistentry>
</variablelist>
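<para>To make the interaction between <command>di_magic</command>, <command>di_version</command>, <command>di_onlink</command> and <command>di_nlink</command> concrete, the following C sketch validates an inode core held in a raw byte buffer and returns its link count. It is a minimal illustration rather than the XFS implementation: the helper names (<command>be16</command>, <command>be32</command>, <command>inode_link_count</command>) and the <command>OFF_*</command> constants are hypothetical. The byte offsets are derived from the field order and widths in <command>xfs_dinode_core</command>, assuming no padding between fields, and the fields are read as big-endian values, which is how XFS stores its on-disk metadata.</para>
<programlisting><![CDATA[
#include <stdint.h>

#define XFS_DINODE_MAGIC 0x494e  /* "IN" in ASCII */

/* Assumed byte offsets of selected fields within the 96-byte core. */
enum {
    OFF_MAGIC   = 0,   /* di_magic,   16 bits */
    OFF_VERSION = 4,   /* di_version,  8 bits */
    OFF_ONLINK  = 6,   /* di_onlink,  16 bits */
    OFF_NLINK   = 16   /* di_nlink,   32 bits */
};

/* Read big-endian integers one byte at a time (host independent). */
static uint16_t be16(const uint8_t *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

static uint32_t be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8)  |  (uint32_t)p[3];
}

/*
 * Store the link count of the inode core in *nlink.
 * Returns 0 on success, -1 if the magic or version is not recognised.
 */
static int inode_link_count(const uint8_t *core, uint32_t *nlink)
{
    if (be16(core + OFF_MAGIC) != XFS_DINODE_MAGIC)
        return -1;

    switch (core[OFF_VERSION]) {
    case 1:  /* v1: 16-bit count in di_onlink */
        *nlink = be16(core + OFF_ONLINK);
        return 0;
    case 2:  /* v2: 32-bit count in di_nlink */
        *nlink = be32(core + OFF_NLINK);
        return 0;
    default:
        return -1;
    }
}
]]></programlisting>
<para>Reading <command>di_onlink</command> for v1 inodes is the conservative choice here, since old versions of XFS did not maintain <command>di_nlink</command>.</para>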
</section>
<section id="Unlinked_Pointer"><title>Unlinked Pointer</title>
<para>The <command>di_next_unlinked</command> value in the inode is used to track inodes that have been unlinked (deleted) but which are still referenced. When an inode is unlinked and there is still an outstanding reference, the inode is added to one of the AGI's (<xref linkend="AG_Inode_Management"/>) <command>agi_unlinked</command> hash buckets. The AGI unlinked bucket points to an inode and the <command>di_next_unlinked</command> value points to the next inode in the chain. The last inode in the chain has <command>di_next_unlinked</command> set to NULL (-1).</para>
<para>Once the last reference is released, the inode is removed from the unlinked hash chain, and <command>di_next_unlinked</command> is set to NULL. In the case of a system crash, XFS recovery will complete the unlink process for any inodes found in these lists.</para>
<para>The only time the unlinked fields can be seen to be used on disk is either on an active filesystem or a crashed system. A cleanly unmounted or recovered filesystem will not have any inodes in these unlink hash chains.</para>
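<para>The chain traversal described above can be sketched in C as follows. This is an illustrative outline only: <command>next_unlinked_fn</command>, <command>table_next</command> and <command>count_unlinked</command> are hypothetical names, and reading an inode from disk is abstracted behind a callback. The sketch assumes the chain links are 32-bit AG-relative inode numbers, that the NULL (-1) terminator is the all-ones value (called <command>NULLAGINO</command> here), and that an empty bucket holds the same value.</para>
<programlisting><![CDATA[
#include <stddef.h>
#include <stdint.h>

/* The "NULL (-1)" terminator expressed as a 32-bit value. */
#define NULLAGINO 0xFFFFFFFFu

/*
 * Hypothetical callback: returns the di_next_unlinked value of the
 * inode identified by agino. A real tool would read the inode here.
 */
typedef uint32_t (*next_unlinked_fn)(uint32_t agino, void *ctx);

/* Example callback for in-memory testing: ctx is an array indexed by agino. */
static uint32_t table_next(uint32_t agino, void *ctx)
{
    return ((const uint32_t *)ctx)[agino];
}

/*
 * Count the inodes on the chain starting at one agi_unlinked bucket.
 * max_steps guards against cycles caused by corruption.
 * Returns the count, or -1 if the chain did not end within max_steps.
 */
static long count_unlinked(uint32_t bucket_head, next_unlinked_fn next,
                           void *ctx, long max_steps)
{
    long n = 0;
    uint32_t agino = bucket_head;

    while (agino != NULLAGINO) {
        if (n >= max_steps)
            return -1;
        n++;
        agino = next(agino, ctx);
    }
    return n;
}
]]></programlisting>
<para>On a cleanly unmounted filesystem, such a walk would be expected to find every bucket empty.</para>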
<para>
<mediaobject>
<imageobject><imagedata fileref="images/28.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>28</phrase></textobject>
</mediaobject>
</para></section>
<section id="Data_Fork"><title>Data Fork</title>
<para>The structure of the inode's data fork is based on the inode's type and <command>di_format</command>. It always starts at offset 100 (0x64) in the inode's space, which is the start of the inode's "literal area". The size of the data fork is determined by the type and format. The maximum size is determined by the inode size and <command>di_forkoff</command>. In code, use the <command>XFS_DFORK_PTR</command> macro specifying <command>XFS_DATA_FORK</command> for the "which" parameter. Alternatively, the <command>XFS_DFORK_DPTR</command> macro can be used.</para>
<para>Each of the following sub-sections summarises the contents of the data fork based on the inode type.</para>
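<para>As a compact overview of those sub-sections, the following C sketch maps a <command>di_mode</command> and <command>di_format</command> pair to the kind of structure found at the start of the data fork. It is an illustration under stated assumptions, not code from XFS: the <command>fork_kind</command> enumeration, the <command>data_fork_kind</command> function and the <command>MODE_*</command> and <command>FMT_*</command> names are hypothetical. The <command>MODE_*</command> values are the traditional S_IFxxx octal values as found in stat.h on Linux, and the <command>FMT_*</command> values mirror the order of <command>xfs_dinode_fmt_t</command>. Combinations not described in this chapter are reported as unknown.</para>
<programlisting><![CDATA[
#include <stdint.h>

/* File type bits of di_mode (assumed: traditional S_IFxxx values). */
#define MODE_IFMT  0170000
#define MODE_IFCHR 0020000
#define MODE_IFDIR 0040000
#define MODE_IFBLK 0060000
#define MODE_IFREG 0100000
#define MODE_IFLNK 0120000

/* Mirrors the order of xfs_dinode_fmt_t. */
enum dinode_fmt { FMT_DEV, FMT_LOCAL, FMT_EXTENTS, FMT_BTREE, FMT_UUID };

/* Hypothetical classification of what the data fork pointer refers to. */
enum fork_kind {
    FORK_UNKNOWN,        /* combination not covered in this chapter */
    FORK_EXTENT_LIST,    /* xfs_bmbt_rec_t array */
    FORK_BTREE_ROOT,     /* xfs_bmdr_block_t */
    FORK_SHORTFORM_DIR,  /* xfs_dir2_sf_t */
    FORK_SYMLINK_TEXT,   /* char array holding the link target */
    FORK_DEVICE          /* xfs_dev_t */
};

static enum fork_kind data_fork_kind(uint16_t di_mode, int di_format)
{
    switch (di_mode & MODE_IFMT) {
    case MODE_IFREG:
        if (di_format == FMT_EXTENTS) return FORK_EXTENT_LIST;
        if (di_format == FMT_BTREE)   return FORK_BTREE_ROOT;
        break;
    case MODE_IFDIR:
        if (di_format == FMT_LOCAL)   return FORK_SHORTFORM_DIR;
        if (di_format == FMT_EXTENTS) return FORK_EXTENT_LIST;
        if (di_format == FMT_BTREE)   return FORK_BTREE_ROOT;
        break;
    case MODE_IFLNK:
        if (di_format == FMT_LOCAL)   return FORK_SYMLINK_TEXT;
        if (di_format == FMT_EXTENTS) return FORK_EXTENT_LIST;
        break;
    case MODE_IFCHR:
    case MODE_IFBLK:
        if (di_format == FMT_DEV)     return FORK_DEVICE;
        break;
    default:
        break;
    }
    return FORK_UNKNOWN;
}
]]></programlisting>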
<section id="Regular_Files_S_IFREG">
<title>Regular Files (S_IFREG)</title>
<para>The data fork specifies the file's data extents. The extents specify where the file's actual data is located within the filesystem. Extents can be stored in one of two formats, selected by the <command>di_format</command> value:</para>
<itemizedlist>
<listitem>
<para><command>XFS_DINODE_FMT_EXTENTS</command>: The extent data is fully contained within the inode which contains an array of extents to the filesystem blocks for the file's data. To access the extents, cast the return value from <command>XFS_DFORK_DPTR</command> to <command>xfs_bmbt_rec_t*</command>.</para>
</listitem>
<listitem>
<para><command>XFS_DINODE_FMT_BTREE</command>: The extent data is contained in the leaves of a B+tree. The inode contains the root node of the tree and is accessed by casting the return value from <command>XFS_DFORK_DPTR</command> to <command>xfs_bmdr_block_t*</command>.</para>
</listitem>
</itemizedlist>
<para>Details for each of these data extent formats are covered in the Data Extents section (<xref linkend="Data_Extents"/>) later on.</para></section>
<section id="Directories_S_IFDIR"><title>Directories (S_IFDIR)</title>
<para>The data fork contains the directory's entries and associated data. The format of the entries is also determined by the <command>di_format</command> value and can be one of 3 formats:</para>
<itemizedlist>
<listitem>
<para><command>XFS_DINODE_FMT_LOCAL</command>: The directory entries are fully contained within the inode. This is accessed by casting the value from <command>XFS_DFORK_DPTR</command> to <command>xfs_dir2_sf_t*</command>.</para>
</listitem>
<listitem>
<para><command>XFS_DINODE_FMT_EXTENTS</command>: The actual directory entries are located in other filesystem blocks; the inode contains an array of extents to these filesystem blocks (<command>xfs_bmbt_rec_t*</command>).</para>
</listitem>
<listitem>
<para><command>XFS_DINODE_FMT_BTREE</command>: The directory entries are contained in the leaves of a B+tree. The inode contains the root node (<command>xfs_bmdr_block_t*</command>).</para>
</listitem>
</itemizedlist>
<para>Details for each of these directory formats are covered in the Directories section (<xref linkend="Directories"/>) later on.</para></section>
<section id="Symbolic_Links_S_IFLNK"><title>Symbolic Links (S_IFLNK)</title>
<para>The data fork contains the contents of the symbolic link. The format of the link is determined by the <command>di_format</command> value and can be one of 2 formats:</para>
<itemizedlist>
<listitem>
<para><command>XFS_DINODE_FMT_LOCAL</command>: The symbolic link is fully contained within the inode. This is accessed by casting the return value from <command>XFS_DFORK_DPTR</command> to <command>char*</command>.</para>
</listitem>
<listitem>
<para><command>XFS_DINODE_FMT_EXTENTS</command>: The actual symlink is located in another filesystem block; the inode contains the extents to these filesystem blocks (<command>xfs_bmbt_rec_t*</command>).</para>
</listitem>
</itemizedlist>
<para>Details for symbolic links are covered in the Symbolic Links section (<xref linkend="Symbolic_Links"/>) later on.</para></section>
<section id="Other_File_Types"><title>Other File Types</title>
<para>For character and block devices (<command>S_IFCHR</command> and <command>S_IFBLK</command>), cast the value from <command>XFS_DFORK_DPTR</command> to <command>xfs_dev_t*</command>.</para></section></section>
<section id="Attribute_Fork"><title>Attribute Fork</title>
<para>The attribute fork in the inode always contains the location of the extended attributes associated with the inode. </para>
<para>The location of the attribute fork in the inode's literal area (offset 100 to the end of the inode) is specified by the <command>di_forkoff</command> value in the inode's core. If this value is zero, the inode does not contain any extended attributes. If it is non-zero, the byte offset into the literal area is <command>di_forkoff</command> * 8, which also determines the 2048 byte maximum size for an inode. Attributes must be allocated on a 64-bit boundary on the disk. To access the extended attributes in code, use the <command>XFS_DFORK_PTR</command> macro specifying <command>XFS_ATTR_FORK</command> for the "which" parameter. Alternatively, the <command>XFS_DFORK_APTR</command> macro can be used.</para>
<para>Which structure in the attribute fork is used depends on the <command>di_aformat</command> value in the inode. It can be one of the following values:</para>
<itemizedlist>
<listitem>
<para><command>XFS_DINODE_FMT_LOCAL</command>: The extended attributes are contained entirely within the inode. This is accessed by casting the value from <command>XFS_DFORK_APTR</command> to <command>xfs_attr_shortform_t*</command>.</para>
</listitem>
<listitem>
<para><command>XFS_DINODE_FMT_EXTENTS</command>: The attributes are located in other filesystem blocks; the inode contains an array of extents pointing to these filesystem blocks. They are accessed by casting the value from <command>XFS_DFORK_APTR</command> to <command>xfs_bmbt_rec_t*</command>.</para>
</listitem>
<listitem>
<para><command>XFS_DINODE_FMT_BTREE</command>: The extents for the attributes are contained in the leaves of a B+tree. The inode contains the root node of the tree and is accessed by casting the value from <command>XFS_DFORK_APTR</command> to <command>xfs_bmdr_block_t*</command>.</para>
</listitem>
</itemizedlist>
<para>Detailed information on the layouts of extended attributes is covered in the Extended Attributes section (<xref linkend="Extended_Attributes"/>) later on in this document.</para>
<section id="Extended_Attribute_Versions"><title>Extended Attribute Versions</title>
<para>Extended attributes come in two versions: "attr1" or "attr2". The attribute version is specified by the <command>XFS_SB_VERSION2_ATTR2BIT</command> flag in the <command>sb_features2</command> field in the superblock. It determines how the inode's extra space is split between the <command>di_u</command> and <command>di_a</command> forks, which in turn determines how the <command>di_forkoff</command> value is maintained in the inode's core.</para>
<para>With "attr1" attributes, <command>di_forkoff</command> is set to somewhere in the middle of the space between the core and the end of the inode and never changes (which has the effect of artificially limiting the space for data information). As the data fork grows and reaches <command>di_forkoff</command>, the data is moved to the next format level (i.e. local &gt; extents &gt; btree). If very little space is used for either attributes or data, then a good portion of the available inode space is wasted with this version.</para>
<para>"Attr2" was introduced to maximise the utilisation of the inode's literal area. The <command>di_forkoff</command> starts at the end of the inode and works its way towards the data fork as attributes are added. Attr2 is highly recommended if extended attributes are used.</para>
<para>The following diagram compares the two versions:</para>
<para>
<mediaobject>
<imageobject><imagedata fileref="images/30.png" format="PNG" width="100%" scalefit="0"/></imageobject>
<textobject><phrase>30</phrase></textobject>
</mediaobject>
</para></section></section></chapter>