[[On-disk_Inode]]
= On-disk Inode
All files, directories, and links are stored on disk with inodes and descend from
the root inode, whose number is defined in the xref:Superblocks[superblock]. The
previous section on xref:AG_Inode_Management[AG Inode Management] describes the
allocation and management of inodes on disk. This section describes the contents
of inodes themselves.
An inode is divided into 3 parts:
.On-disk inode sections
image::images/23.png[]
* The core contains what the inode represents, stat data, and information
describing the data and attribute forks.
* The +di_u+ ``data fork'' contains normal data related to the inode. Its contents
depend on the file type specified by +di_core.di_mode+ (eg. regular file,
directory, link, etc) and on how much information is contained in the file, which
is determined by +di_core.di_format+. The union representing this data is
declared as follows:
[source, c]
----
union {
    xfs_bmdr_block_t    di_bmbt;
    xfs_bmbt_rec_t      di_bmx[1];
    xfs_dir2_sf_t       di_dir2sf;
    char                di_c[1];
    xfs_dev_t           di_dev;
    uuid_t              di_muuid;
    char                di_symlink[1];
} di_u;
----
* The +di_a+ ``attribute fork'' contains extended attributes. Its layout is
determined by the +di_core.di_aformat+ value. Its representation is declared as
follows:
[source, c]
----
union {
    xfs_bmdr_block_t        di_abmbt;
    xfs_bmbt_rec_t          di_abmx[1];
    xfs_attr_shortform_t    di_attrsf;
} di_a;
----
[NOTE]
The two unions above are rarely used directly in the XFS code; instead, the
structures within them are cast directly depending on the +di_mode/di_format+ and
+di_aformat+ values. They are referenced in this document to make it easier to
explain the various structures in use within the inode.
The remaining space in the inode after +di_next_unlinked+ where the two forks
are located is called the inode's ``literal area''. This starts at offset 100
(0x64) in a version 1 or 2 inode, and offset 176 (0xb0) in a version 3 inode.
The space for each of the two forks in the literal area is determined by the
inode size, and +di_core.di_forkoff+. The data fork is located between the start
of the literal area and +di_forkoff+. The attribute fork is located between
+di_forkoff+ and the end of the inode.
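The split of the literal area described above can be sketched in C. This is a minimal sketch rather than the kernel's implementation: +struct fork_layout+ and +compute_fork_layout()+ are hypothetical names, and the sketch assumes that a +di_forkoff+ of zero leaves the whole literal area to the data fork.
[source, c]
----
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Byte ranges of the two forks, measured from the start of the inode. */
struct fork_layout {
    size_t data_off;    /* start of the data fork (start of the literal area) */
    size_t data_len;    /* bytes available to the data fork */
    size_t attr_off;    /* start of the attribute fork, 0 if there is none */
    size_t attr_len;    /* bytes available to the attribute fork */
};

/* Literal area offsets quoted in the text: 100 for v1/v2, 176 for v3. */
static size_t literal_area_offset(int di_version)
{
    return di_version >= 3 ? 176 : 100;
}

/*
 * Hypothetical helper: split the literal area using di_forkoff, which
 * counts 8-byte units from the start of the literal area.
 * Assumption: di_forkoff == 0 means there is no attribute fork and the
 * data fork may use the whole literal area.
 */
static struct fork_layout compute_fork_layout(size_t inode_size,
                                              int di_version,
                                              uint8_t di_forkoff)
{
    struct fork_layout l;
    size_t lit_off = literal_area_offset(di_version);
    size_t lit_len = inode_size - lit_off;
    size_t fork_bytes = (size_t)di_forkoff * 8;

    assert(inode_size > lit_off && fork_bytes <= lit_len);

    l.data_off = lit_off;
    if (di_forkoff == 0) {
        l.data_len = lit_len;
        l.attr_off = 0;
        l.attr_len = 0;
    } else {
        l.data_len = fork_bytes;
        l.attr_off = lit_off + fork_bytes;
        l.attr_len = lit_len - fork_bytes;
    }
    return l;
}
----
For example, a 256-byte v2 inode with a +di_forkoff+ of 15 gives a 120-byte data fork at offset 100 and a 36-byte attribute fork at offset 220.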
[[Inode_Core]]
== Inode Core
The inode's core is 96 bytes on a V4 filesystem and 176 bytes on a V5
filesystem. It contains information about the file itself, including most stat
data, as well as information about the data and attribute forks that follow the
core within the inode. It uses the following structure:
[source, c]
----
struct xfs_dinode_core {
    __uint16_t          di_magic;
    __uint16_t          di_mode;
    __int8_t            di_version;
    __int8_t            di_format;
    __uint16_t          di_onlink;
    __uint32_t          di_uid;
    __uint32_t          di_gid;
    __uint32_t          di_nlink;
    __uint16_t          di_projid;
    __uint16_t          di_projid_hi;
    __uint8_t           di_pad[6];
    __uint16_t          di_flushiter;
    xfs_timestamp_t     di_atime;
    xfs_timestamp_t     di_mtime;
    xfs_timestamp_t     di_ctime;
    xfs_fsize_t         di_size;
    xfs_rfsblock_t      di_nblocks;
    xfs_extlen_t        di_extsize;
    xfs_extnum_t        di_nextents;
    xfs_aextnum_t       di_anextents;
    __uint8_t           di_forkoff;
    __int8_t            di_aformat;
    __uint32_t          di_dmevmask;
    __uint16_t          di_dmstate;
    __uint16_t          di_flags;
    __uint32_t          di_gen;
    /* di_next_unlinked is the only non-core field in the old dinode */
    __be32              di_next_unlinked;
    /* version 5 filesystem (inode version 3) fields start here */
    __le32              di_crc;
    __be64              di_changecount;
    __be64              di_lsn;
    __be64              di_flags2;
    __be32              di_cowextsize;
    __u8                di_pad2[12];
    xfs_timestamp_t     di_crtime;
    __be64              di_ino;
    uuid_t              di_uuid;
};
----
*di_magic*::
The inode signature; these two bytes are ``IN'' (0x494e).
*di_mode*::
Specifies the mode access bits and type of file using the standard S_Ixxx values
defined in stat.h.
*di_version*::
Specifies the inode version which currently can only be 1, 2, or 3. The inode
version specifies the usage of the +di_onlink+, +di_nlink+ and +di_projid+
values in the inode core. Initially, inodes are created as v1 but can be
converted on the fly to v2 when required. v3 inodes are created only for v5
filesystems.
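The checks implied by +di_magic+ and +di_version+ can be sketched as follows. This is a minimal sketch, assuming a raw inode buffer in on-disk (big-endian) byte order with +di_magic+ at byte 0 and +di_version+ at byte 4, as laid out in the structure above; +read_be16()+ and +inode_header_ok()+ are hypothetical helper names.
[source, c]
----
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Read a big-endian 16-bit value one byte at a time. */
static uint16_t read_be16(const unsigned char *p)
{
    return (uint16_t)((p[0] << 8) | p[1]);
}

/*
 * Hypothetical helper: sanity-check the start of a raw inode buffer.
 * Bytes 0-1 hold di_magic ("IN", 0x494e); byte 4 holds di_version.
 */
static bool inode_header_ok(const unsigned char *buf, bool v5_filesystem)
{
    uint16_t magic = read_be16(buf);
    int version = (signed char)buf[4];

    assert(buf != NULL);

    if (magic != 0x494e)
        return false;
    if (version < 1 || version > 3)
        return false;
    /* v3 inodes are created only for v5 filesystems. */
    if (version == 3 && !v5_filesystem)
        return false;
    return true;
}
----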
*di_format*::
Specifies the format of the data fork in conjunction with the +di_mode+ type.
This can be one of several values. For directories and links, it can be ``local''
where all metadata associated with the file is within the inode; ``extents'' where
the inode contains an array of extents to other filesystem blocks which contain
the associated metadata or data; or ``btree'' where the inode contains a B+tree
root node which points to filesystem blocks containing the metadata or data.
Migration between the formats depends on the amount of metadata associated with
the inode. ``dev'' is used for character and block devices while ``uuid'' is
currently not used. ``rmap'' indicates that a reverse-mapping B+tree
is rooted in the fork.
[source, c]
----
typedef enum xfs_dinode_fmt {
    XFS_DINODE_FMT_DEV,
    XFS_DINODE_FMT_LOCAL,
    XFS_DINODE_FMT_EXTENTS,
    XFS_DINODE_FMT_BTREE,
    XFS_DINODE_FMT_UUID,
    XFS_DINODE_FMT_RMAP,
} xfs_dinode_fmt_t;
----
*di_onlink*::
In v1 inodes, this specifies the number of links to the inode from directories.
When the number exceeds 65535, the inode is converted to v2 and the link count
is stored in +di_nlink+.
*di_uid*::
Specifies the owner's UID of the inode.
*di_gid*::
Specifies the owner's GID of the inode.
*di_nlink*::
Specifies the number of links to the inode from directories. This is maintained
for both inode versions for current versions of XFS. Prior to v2 inodes, this
field was part of +di_pad+.
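The relationship between +di_onlink+, +di_nlink+ and the inode version can be sketched as follows. Both function names are hypothetical, and the sketch assumes the values have already been converted from on-disk byte order.
[source, c]
----
#include <assert.h>
#include <stdint.h>

/* v1 inodes keep the link count in di_onlink; later versions use di_nlink. */
static uint32_t inode_link_count(int di_version, uint16_t di_onlink,
                                 uint32_t di_nlink)
{
    assert(di_version >= 1 && di_version <= 3);
    return di_version == 1 ? di_onlink : di_nlink;
}

/*
 * A v1 inode is converted to v2 once the link count no longer fits in the
 * 16-bit di_onlink field (that is, once it exceeds 65535).
 */
static int version_after_link_update(int di_version, uint32_t new_links)
{
    assert(di_version >= 1 && di_version <= 3);
    if (di_version == 1 && new_links > 65535)
        return 2;
    return di_version;
}
----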
*di_projid*::
Specifies the owner's project ID in v2 inodes. An inode is converted to v2 if
the project ID is set. This value must be zero for v1 inodes.
*di_projid_hi*::
Specifies the high 16 bits of the owner's project ID in v2 inodes, if the
+XFS_SB_VERSION2_PROJID32BIT+ feature is set; and zero otherwise.
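Combining the two project ID fields can be sketched as follows. The function name +inode_project_id()+ is hypothetical, and the sketch assumes that +di_projid+ holds the low 16 bits and that +di_projid_hi+ is ignored when the +XFS_SB_VERSION2_PROJID32BIT+ feature is not set.
[source, c]
----
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Build the full project ID from its low and high 16-bit halves. */
static uint32_t inode_project_id(uint16_t di_projid, uint16_t di_projid_hi,
                                 bool projid32bit)
{
    if (!projid32bit)
        return di_projid;   /* only 16-bit project IDs are in use */
    return ((uint32_t)di_projid_hi << 16) | di_projid;
}
----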
*di_pad[6]*::
Reserved, must be zero.
*di_flushiter*::
Incremented on flush.
*di_atime*::
Specifies the last access time of the file using UNIX time conventions, stored
in the following structure. This value may be undefined if the filesystem is
mounted with the ``noatime'' option. XFS supports timestamps with nanosecond
resolution:
[source, c]
----
struct xfs_timestamp {
    __int32_t    t_sec;
    __int32_t    t_nsec;
};
----
If the +XFS_SB_FEAT_INCOMPAT_BIGTIME+ feature is enabled, the 64 bits used by
the timestamp field are interpreted as a flat 64-bit nanosecond counter.
See the section about xref:Inode_Timestamps[inode timestamps] for more details.
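The two interpretations of the 8-byte timestamp field can be sketched as follows. This is a sketch under stated assumptions: the field is read in big-endian byte order, and the bigtime counter is assumed to start 2^31^ seconds before the Unix epoch (the smallest classic timestamp), a detail that is not stated in this section and should be checked against the inode timestamps section. +struct decoded_time+ and +decode_timestamp()+ are hypothetical names.
[source, c]
----
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

struct decoded_time {
    int64_t  sec;    /* seconds relative to the Unix epoch */
    uint32_t nsec;   /* nanoseconds within that second */
};

/* Read a big-endian 64-bit value one byte at a time. */
static uint64_t read_be64(const unsigned char *p)
{
    uint64_t v = 0;

    for (int i = 0; i < 8; i++)
        v = (v << 8) | p[i];
    return v;
}

/* Assumption: bigtime counter value 0 is 2^31 seconds before the epoch. */
#define BIGTIME_EPOCH_OFFSET_SEC 2147483648LL
#define NSEC_PER_SEC             1000000000ULL

static struct decoded_time decode_timestamp(const unsigned char *raw,
                                            bool bigtime)
{
    struct decoded_time t;
    uint64_t v = read_be64(raw);

    assert(raw != NULL);

    if (bigtime) {
        /* Flat 64-bit nanosecond counter. */
        t.sec = (int64_t)(v / NSEC_PER_SEC) - BIGTIME_EPOCH_OFFSET_SEC;
        t.nsec = (uint32_t)(v % NSEC_PER_SEC);
    } else {
        /* Classic layout: signed 32-bit t_sec, then 32-bit t_nsec. */
        t.sec = (int32_t)(uint32_t)(v >> 32);
        t.nsec = (uint32_t)v;
    }
    return t;
}
----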
*di_mtime*::
Specifies the last time the file was modified.
*di_ctime*::
Specifies when the inode's status was last changed.
*di_size*::
Specifies the EOF of the inode in bytes. This can be larger or smaller than the
extent space (therefore actual disk space) used for the inode. For regular
files, this is the file size in bytes; for directories, the space taken by
directory entries; and for links, the length of the symlink.
*di_nblocks*::
Specifies the number of filesystem blocks used to store the inode's data
including relevant metadata like B+trees. This does not include blocks used for
extended attributes.
*di_extsize*::
Specifies the extent size for filesystems with real-time devices or an extent
size hint for standard filesystems. For normal filesystems, and with
directories, the +XFS_DIFLAG_EXTSZINHERIT+ flag must be set in +di_flags+ if
this field is used. Inodes created in these directories will inherit the
+di_extsize+ value and have +XFS_DIFLAG_EXTSIZE+ set in their +di_flags+. When a
file is written to beyond allocated space, XFS will attempt to allocate
additional disk space based on this value.
*di_nextents*::
Specifies the number of data extents associated with this inode.
*di_anextents*::
Specifies the number of extended attribute extents associated with this inode.
*di_forkoff*::
Specifies the offset into the inode's literal area where the extended attribute
fork starts. This is an 8-bit value that is multiplied by 8 to determine the
actual offset in bytes (ie. attribute data is 64-bit aligned). This also limits
the maximum size of the inode to 2048 bytes. This value is initially zero until
an extended attribute is created. When an attribute is added, the nature of
+di_forkoff+ depends on the +XFS_SB_VERSION2_ATTR2BIT+ flag in the superblock.
Refer to xref:Extended_Attribute_Versions[Extended Attribute Versions] for more
details.
*di_aformat*::
Specifies the format of the attribute fork. This uses the same values as
+di_format+, but restricted to ``local'', ``extents'' and ``btree'' formats for
extended attribute data.
*di_dmevmask*::
DMAPI event mask.
*di_dmstate*::
DMAPI state.
*di_flags*::
Specifies flags associated with the inode. This can be a combination of the
following values:
.Version 2 Inode flags
[options="header"]
|=====
| Flag | Description
| +XFS_DIFLAG_REALTIME+ | The inode's data is located on the real-time device.
| +XFS_DIFLAG_PREALLOC+ | The inode's extents have been preallocated.
| +XFS_DIFLAG_NEWRTBM+ |
Specifies that +sb_rbmino+ uses the new real-time bitmap format.
| +XFS_DIFLAG_IMMUTABLE+ | Specifies the inode cannot be modified.
| +XFS_DIFLAG_APPEND+ | The inode is in append only mode.
| +XFS_DIFLAG_SYNC+ | The inode is written synchronously.
| +XFS_DIFLAG_NOATIME+ | The inode's +di_atime+ is not updated.
| +XFS_DIFLAG_NODUMP+ | Specifies the inode is to be ignored by xfsdump.
| +XFS_DIFLAG_RTINHERIT+ |
For directory inodes, new inodes inherit the +XFS_DIFLAG_REALTIME+ bit.
| +XFS_DIFLAG_PROJINHERIT+ |
For directory inodes, new inodes inherit the +di_projid+ value.
| +XFS_DIFLAG_NOSYMLINKS+ |
For directory inodes, symlinks cannot be created.
| +XFS_DIFLAG_EXTSIZE+ |
Specifies the extent size for real-time files or an extent size hint for regular files.
| +XFS_DIFLAG_EXTSZINHERIT+ |
For directory inodes, new inodes inherit the +di_extsize+ value.
| +XFS_DIFLAG_NODEFRAG+ |
Specifies the inode is to be ignored when defragmenting the filesystem.
| +XFS_DIFLAG_FILESTREAMS+ |
Use the filestream allocator. The filestreams allocator allows a directory to
reserve an entire allocation group for exclusive use by files created in that
directory. Files in other directories cannot use AGs reserved by other
directories.
|=====
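The inheritance flags in the table above can be illustrated with a short sketch of what a new file might pick up from its parent directory. The bit positions are an assumption: they follow the order of the table starting at bit 0 and should be verified against the XFS headers. +struct child_settings+ and +inherit_from_dir()+ are hypothetical names, and the sketch ignores any further conditions the real code applies.
[source, c]
----
#include <assert.h>
#include <stdint.h>

/* Assumed bit positions: table order, starting at bit 0. */
#define XFS_DIFLAG_REALTIME      (1u << 0)
#define XFS_DIFLAG_RTINHERIT     (1u << 8)
#define XFS_DIFLAG_PROJINHERIT   (1u << 9)
#define XFS_DIFLAG_EXTSIZE       (1u << 11)
#define XFS_DIFLAG_EXTSZINHERIT  (1u << 12)

struct child_settings {
    uint16_t di_flags;
    uint32_t di_extsize;
    uint32_t projid;
};

/* Hypothetical helper: settings a new file takes from its directory. */
static struct child_settings inherit_from_dir(uint16_t dir_flags,
                                              uint32_t dir_extsize,
                                              uint32_t dir_projid)
{
    struct child_settings c = { 0, 0, 0 };

    assert((dir_flags & XFS_DIFLAG_REALTIME) == 0); /* directories only */

    if (dir_flags & XFS_DIFLAG_RTINHERIT)
        c.di_flags |= XFS_DIFLAG_REALTIME;
    if (dir_flags & XFS_DIFLAG_EXTSZINHERIT) {
        /* The child gets the hint and the flag that makes it valid. */
        c.di_flags |= XFS_DIFLAG_EXTSIZE;
        c.di_extsize = dir_extsize;
    }
    if (dir_flags & XFS_DIFLAG_PROJINHERIT)
        c.projid = dir_projid;
    return c;
}
----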
*di_gen*::
A generation number used for inode identification. This is used by tools that do
inode scanning such as backup tools and xfsdump. An inode's generation number
can change by unlinking and creating a new file that reuses the inode.
*di_next_unlinked*::
See the section on xref:Unlinked_Pointer[unlinked inode pointers] for more
information.
*di_crc*::
Checksum of the inode.
*di_changecount*::
Counts the number of changes made to the attributes in this inode.
*di_lsn*::
Log sequence number of the last inode write.
*di_flags2*::
Specifies extended flags associated with a v3 inode.
.Version 3 Inode flags
[options="header"]
|=====
| Flag | Description
| +XFS_DIFLAG2_DAX+ |
For a file, enable DAX to increase performance on persistent-memory storage.
If set on a directory, files created in the directory will inherit this flag.
| +XFS_DIFLAG2_REFLINK+ |
This inode shares (or has shared) data blocks with another inode.
| +XFS_DIFLAG2_COWEXTSIZE+ |
For files, this is the extent size hint for copy on write operations; see
+di_cowextsize+ for details. For directories, the value in +di_cowextsize+
will be copied to all newly created files and directories.
|=====
*di_cowextsize*::
Specifies the extent size hint for copy on write operations. When allocating
extents for a copy on write operation, the allocator will be asked to align
its allocations to either +di_cowextsize+ blocks or +di_extsize+ blocks,
whichever is greater. The +XFS_DIFLAG2_COWEXTSIZE+ flag must be set if this
field is used. If this field and its flag are set on a directory file, the
value will be copied into any files or directories created within this
directory. During a block sharing operation, this value will be copied from
the source file to the destination file if the sharing operation completely
overwrites the destination file's contents and the destination file does not
already have +di_cowextsize+ set.
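The ``whichever is greater'' rule can be sketched as follows. The function names are hypothetical, a result of zero simply means ``no hint'' in this sketch, and any default the allocator applies when neither hint is set is not modelled.
[source, c]
----
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* A hint only counts when its flag is set; 0 means "no hint" here. */
static uint32_t effective_hint(uint32_t blocks, bool flag_set)
{
    return flag_set ? blocks : 0;
}

/* Alignment, in filesystem blocks, requested for a copy on write allocation. */
static uint32_t cow_alignment_blocks(uint32_t di_extsize, bool extsize_flag,
                                     uint32_t di_cowextsize,
                                     bool cowextsize_flag)
{
    uint32_t ext = effective_hint(di_extsize, extsize_flag);
    uint32_t cow = effective_hint(di_cowextsize, cowextsize_flag);

    return cow > ext ? cow : ext;
}
----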
*di_pad2*::
Padding for future expansion of the inode.
*di_crtime*::
Specifies the time when this inode was created.
*di_ino*::
The full inode number of this inode.
*di_uuid*::
The UUID of this inode, which must match either +sb_uuid+ or +sb_meta_uuid+
depending on which features are set.
[[Unlinked_Pointer]]
== Unlinked Pointer
The +di_next_unlinked+ value in the inode is used to track inodes that have
been unlinked (deleted) but are still open by a program. When an inode is
in this state, the inode is added to one of the xref:AG_Inode_Management[AGI's]
+agi_unlinked+ hash buckets. The AGI unlinked bucket points to an inode and the
+di_next_unlinked+ value points to the next inode in the chain. The last inode
in the chain has +di_next_unlinked+ set to NULL (-1).
Once the last reference is released, the inode is removed from the unlinked hash
chain and +di_next_unlinked+ is set to NULL. In the case of a system crash, XFS
recovery will complete the unlink process for any inodes found in these lists.
The unlinked fields are only seen in use on disk on an active filesystem or
after a crash. A cleanly unmounted or recovered filesystem will not have any
inodes in these unlinked hash chains.
.Unlinked inode pointer
image::images/28.png[]
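Walking one unlinked chain can be sketched as follows. This is a minimal sketch using an in-memory array as a stand-in for reading +di_next_unlinked+ from each on-disk inode; +walk_unlinked_chain()+ and its parameters are hypothetical, and the choice of AGI bucket is left to the caller.
[source, c]
----
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define NULLAGINO 0xffffffffu   /* "NULL (-1)" as a 32-bit value */

/*
 * Hypothetical stand-in for on-disk reads: next_unlinked[i] is the
 * di_next_unlinked value of inode i within the allocation group.
 * Returns the chain length and stores up to out_cap inode numbers in out.
 */
static size_t walk_unlinked_chain(uint32_t bucket_head,
                                  const uint32_t *next_unlinked,
                                  size_t inode_count,
                                  uint32_t *out, size_t out_cap)
{
    size_t n = 0;
    uint32_t ino = bucket_head;

    /* The n < inode_count bound stops the walk if the chain loops. */
    while (ino != NULLAGINO && n < inode_count) {
        assert(ino < inode_count);
        if (n < out_cap)
            out[n] = ino;
        n++;
        ino = next_unlinked[ino];
    }
    return n;
}
----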
[[Data_Fork]]
== Data Fork
The structure of the inode's data fork is based on the inode's type and
+di_format+. The data fork begins at the start of the inode's ``literal area''.
This area starts at offset 100 (0x64) in a v1 or v2 inode, or offset 176 (0xb0)
in a v3 inode. The
size of the data fork is determined by the type and format. The maximum size is
determined by the inode size and +di_forkoff+. In code, use the +XFS_DFORK_PTR+
macro specifying +XFS_DATA_FORK+ for the ``which'' parameter. Alternatively,
the +XFS_DFORK_DPTR+ macro can be used.
Each of the following sub-sections summarises the contents of the data fork
based on the inode type.
[[Regular_Files_S_IFREG]]
=== Regular Files (S_IFREG)
The data fork specifies the file's data extents. The extents specify where the
file's actual data is located within the filesystem. Extents can have one of two
formats, selected by the +di_format+ value:
* +XFS_DINODE_FMT_EXTENTS+: The extent data is fully contained within the inode
which contains an array of extents to the filesystem blocks for the file's data.
To access the extents, cast the return value from +XFS_DFORK_DPTR+ to
+xfs_bmbt_rec_t*+.
* +XFS_DINODE_FMT_BTREE+: The extent data is contained in the leaves of a B+tree.
The inode contains the root node of the tree and is accessed by casting the
return value from +XFS_DFORK_DPTR+ to +xfs_bmdr_block_t*+.
Details for each of these data extent formats are covered in the
xref:Data_Extents[Data Extents] section later on.
[[Directories_S_IFDIR]]
=== Directories (S_IFDIR)
The data fork contains the directory's entries and associated data. The format
of the entries is also determined by the +di_format+ value and can be one of 3
formats:
* +XFS_DINODE_FMT_LOCAL+: The directory entries are fully contained within the
inode. This is accessed by casting the value from +XFS_DFORK_DPTR+ to
+xfs_dir2_sf_t*+.
* +XFS_DINODE_FMT_EXTENTS+: The actual directory entries are located in other
filesystem blocks; the inode contains an array of extents to these filesystem
blocks (+xfs_bmbt_rec_t*+).
* +XFS_DINODE_FMT_BTREE+: The directory entries are contained in the leaves of a
B+tree. The inode contains the root node (+xfs_bmdr_block_t*+).
Details for each of these directory formats are covered in the
xref:Directories[Directories] section later on.
[[Symbolic_Links_S_IFLNK]]
=== Symbolic Links (S_IFLNK)
The data fork contains the contents of the symbolic link. The format of the link
is determined by the +di_format+ value and can be one of 2 formats:
* +XFS_DINODE_FMT_LOCAL+: The symbolic link is fully contained within the inode.
This is accessed by casting the return value from +XFS_DFORK_DPTR+ to +char*+.
* +XFS_DINODE_FMT_EXTENTS+: The actual symlink is located in another filesystem
block; the inode contains the extents to these filesystem blocks
(+xfs_bmbt_rec_t*+).
Details for symbolic links are covered in the section about
xref:Symbolic_Links[Symbolic Links].
[[Other_File_Types]]
=== Other File Types
For character and block devices (+S_IFCHR+ and +S_IFBLK+), cast the value from
+XFS_DFORK_DPTR+ to +xfs_dev_t*+.
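The combinations of file type and +di_format+ described in the sub-sections above can be summarised in a small lookup. This is an illustrative sketch only: +data_fork_type()+ is a hypothetical function, the mode constants are the conventional octal values from stat.h redefined locally, and combinations not described above return +NULL+.
[source, c]
----
#include <assert.h>
#include <string.h>

/* Same order as the xfs_dinode_fmt enum shown earlier. */
enum {
    XFS_DINODE_FMT_DEV,
    XFS_DINODE_FMT_LOCAL,
    XFS_DINODE_FMT_EXTENTS,
    XFS_DINODE_FMT_BTREE
};

/* Conventional file type bits from stat.h, redefined for portability. */
#define MODE_TYPE_MASK 0170000u
#define MODE_CHR       0020000u
#define MODE_DIR       0040000u
#define MODE_BLK       0060000u
#define MODE_REG       0100000u
#define MODE_LNK       0120000u

/* Name of the type the data fork pointer would be cast to, or NULL. */
static const char *data_fork_type(unsigned di_mode, int di_format)
{
    assert(di_format >= 0);

    switch (di_mode & MODE_TYPE_MASK) {
    case MODE_REG:
        if (di_format == XFS_DINODE_FMT_EXTENTS) return "xfs_bmbt_rec_t";
        if (di_format == XFS_DINODE_FMT_BTREE)   return "xfs_bmdr_block_t";
        break;
    case MODE_DIR:
        if (di_format == XFS_DINODE_FMT_LOCAL)   return "xfs_dir2_sf_t";
        if (di_format == XFS_DINODE_FMT_EXTENTS) return "xfs_bmbt_rec_t";
        if (di_format == XFS_DINODE_FMT_BTREE)   return "xfs_bmdr_block_t";
        break;
    case MODE_LNK:
        if (di_format == XFS_DINODE_FMT_LOCAL)   return "char";
        if (di_format == XFS_DINODE_FMT_EXTENTS) return "xfs_bmbt_rec_t";
        break;
    case MODE_CHR:
    case MODE_BLK:
        if (di_format == XFS_DINODE_FMT_DEV)     return "xfs_dev_t";
        break;
    default:
        break;
    }
    return NULL;    /* combination not described in this section */
}
----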
[[Attribute_Fork]]
== Attribute Fork
The attribute fork in the inode always contains the location of the extended
attributes associated with the inode.
The location of the attribute fork in the inode's literal area is specified by
the +di_forkoff+ value in the inode's core. If this value is zero, the inode
does not contain any extended attributes. If non-zero, the attribute fork's
byte offset into the literal area can be computed from +di_forkoff × 8+.
Attributes must be allocated on a 64-bit boundary on the disk. To access the
extended attributes in code, use the +XFS_DFORK_PTR+ macro specifying
+XFS_ATTR_FORK+ for the ``which'' parameter. Alternatively, the +XFS_DFORK_APTR+
macro can be used.
The structure of the attribute fork depends on the +di_aformat+ value
in the inode. It can be one of the following values:
* +XFS_DINODE_FMT_LOCAL+: The extended attributes are contained entirely within
the inode. This is accessed by casting the value from +XFS_DFORK_APTR+ to
+xfs_attr_shortform_t*+.
* +XFS_DINODE_FMT_EXTENTS+: The attributes are located in other filesystem
blocks; the inode contains an array of extents pointing to these filesystem
blocks. They are accessed by casting the value from +XFS_DFORK_APTR+ to
+xfs_bmbt_rec_t*+.
* +XFS_DINODE_FMT_BTREE+: The extents for the attributes are contained in the
leaves of a B+tree. The inode contains the root node of the tree and is accessed
by casting the value from +XFS_DFORK_APTR+ to +xfs_bmdr_block_t*+.
Detailed information on the layouts of extended attributes is covered in the
xref:Extended_Attributes[Extended Attributes] section of this document.
[[Extended_Attribute_Versions]]
=== Extended Attribute Versions
Extended attributes come in two versions: ``attr1'' or ``attr2''. The attribute
version is specified by the +XFS_SB_VERSION2_ATTR2BIT+ flag in the
+sb_features2+ field in the superblock. It determines how the inode's extra
space is split between +di_u+ and +di_a+ forks which also determines how the
+di_forkoff+ value is maintained in the inode's core.
With ``attr1'' attributes, the +di_forkoff+ is set to somewhere in the middle of
the space between the core and end of the inode and never changes (which has the
effect of artificially limiting the space for data information). As the data
fork grows, when it gets to +di_forkoff+, it will move the data to the next
format level (ie. local < extent < btree). If very little space is used
for either attributes or data, then a good portion of the available inode space
is wasted with this version.
``attr2'' was introduced to maximise the utilisation of the inode's literal area.
The +di_forkoff+ starts at the end of the inode and works its way to the data
fork as attributes are added. Attr2 is highly recommended if extended attributes
are used.
The following diagram compares the two versions:
.Extended attribute layouts
image::images/30.png[]
Note that because +di_forkoff+ is an 8-bit value measuring units of 8 bytes,
the maximum size of an inode is 2^8^ × 2^3^ = 2^11^ = 2048 bytes.