<?xml version='1.0' encoding='utf-8' ?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
]>
<chapter id="xfs-overview">
<title>XFS Overview</title>
<section>
<title>XFS Filesystem Structure</title>
<para>This section gives an overview of the structure of an XFS filesystem</para>
<para>A more detailed examination of the filesystem structure is covered later in the course</para>
<para>An XFS filesystem is divided evenly into allocation groups</para>
<para>An allocation group can be from 16MB to 1TB in size</para>
<para>See <command>xfs(5)</command></para>
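<para>The even division into allocation groups can be sketched with a little arithmetic. This is a minimal illustration only: the function name <literal>allocation_group_size</literal> is hypothetical, the bounds are the 16MB and 1TB figures quoted above, and the heuristics that mkfs.xfs actually uses to choose the number of allocation groups are not modelled.</para>
<programlisting><![CDATA[
```python
MIB = 1024 ** 2
TIB = 1024 ** 4

AG_MIN_BYTES = 16 * MIB   # lower bound quoted in the text
AG_MAX_BYTES = 1 * TIB    # upper bound quoted in the text


def allocation_group_size(fs_bytes: int, ag_count: int) -> int:
    """Return the per-AG size when fs_bytes is split evenly into ag_count groups."""
    if ag_count < 1:
        raise ValueError("ag_count must be at least 1")
    ag_bytes = fs_bytes // ag_count
    if not AG_MIN_BYTES <= ag_bytes <= AG_MAX_BYTES:
        raise ValueError(f"AG size {ag_bytes} is outside the 16MB..1TB range")
    return ag_bytes


if __name__ == "__main__":
    # A hypothetical 4 TiB filesystem split into 4 allocation groups.
    print(allocation_group_size(4 * TIB, 4) // MIB, "MiB per AG")
```
]]></programlisting>
<para>Asking for only 2 groups on the same hypothetical 4 TiB filesystem raises an error in this sketch, because each group would exceed the 1TB upper bound.</para>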
</section>
<section>
<title>Allocation Groups</title>
<mediaobject><imageobject>
<imagedata fileref="images/XFS-allocation-groups.png" />
</imageobject></mediaobject>
</section>
<section>
<title>Allocation Group Structure</title>
<para>Each allocation group includes</para>
<itemizedlist>
<listitem><para>Superblock information describing the entire filesystem</para></listitem>
<listitem><para>Free space management (within the allocation group)</para></listitem>
<listitem><para>Inode allocation and tracking (within the allocation group)</para></listitem>
</itemizedlist>
<para>Inode clusters within an allocation group are created when needed</para>
<itemizedlist>
<listitem><para>mkfs.xfs does not pre-create inodes throughout the filesystem</para></listitem>
</itemizedlist>
<mediaobject><imageobject>
<imagedata fileref="images/XFS-allocation-group-structure.png" />
</imageobject></mediaobject>
</section>
<section>
<title>XFS Limits</title>
<para>32-bit Linux</para>
<itemizedlist>
<listitem><para>Maximum File Size = 16TB (O_LARGEFILE)</para></listitem>
<listitem><para>Maximum Filesystem Size = 16TB</para></listitem>
</itemizedlist>
<para>64-bit Linux</para>
<itemizedlist>
<listitem><para>Maximum File Size = 9 million TB = 9 EB</para></listitem>
<listitem><para>Maximum Filesystem Size = 18 million TB = 18 EB</para></listitem>
</itemizedlist>
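<para>The figures above are consistent with simple powers of two. The sketch below is a hedged reconstruction, not a statement of how the kernel derives its limits: it assumes a 32-bit page cache index with 4KB pages on 32-bit Linux, a signed 64-bit file offset, and a filesystem size bound of 2**64 bytes.</para>
<programlisting><![CDATA[
```python
EB = 10 ** 18    # decimal exabyte
TIB = 2 ** 40    # binary terabyte

PAGE_SIZE = 4096  # assumed page size on 32-bit Linux

# Assumption: on 32-bit Linux the page cache index is a 32-bit value.
limit_32bit = (2 ** 32) * PAGE_SIZE

# Assumption: file offsets are signed 64-bit values.
max_file_64bit = 2 ** 63 - 1

# Assumption: the filesystem size bound corresponds to 2**64 bytes.
max_fs_64bit = 2 ** 64

print(limit_32bit // TIB, "TiB on 32-bit Linux")
print(round(max_file_64bit / EB, 1), "EB maximum file size on 64-bit Linux")
print(round(max_fs_64bit / EB, 1), "EB maximum filesystem size on 64-bit Linux")
```
]]></programlisting>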
</section>
<section>
<title>Filesystem Block Size (FSB)</title>
<para>Filesystem blocks (FSBs) are the unit of space for a filesystem</para>
<itemizedlist>
<listitem><para>Filesystem blocks are composed of one or more device-level sectors.</para></listitem>
</itemizedlist>
<para>The page management implementation in Linux limits the maximum FSB size to the page size</para>
<itemizedlist>
<listitem><para>4KB on ia32 and x86_64 architectures</para></listitem>
<listitem><para>16KB on ia64</para></listitem>
</itemizedlist>
<para>Performance can improve with different block sizes depending on the size of I/O requests and the size of files</para>
<itemizedlist>
<listitem><para>Larger blocks will also use more disk space for small (&lt;1FSB) files</para></listitem>
</itemizedlist>
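<para>The space cost of larger blocks for small files follows from rounding file data up to whole filesystem blocks. The helper below, with the hypothetical name <literal>allocated_bytes</literal>, is a minimal sketch that ignores metadata overhead.</para>
<programlisting><![CDATA[
```python
def allocated_bytes(file_size: int, fsb_size: int) -> int:
    """Space consumed when file data is rounded up to whole filesystem blocks."""
    if file_size == 0:
        return 0
    blocks = -(-file_size // fsb_size)  # ceiling division
    return blocks * fsb_size


if __name__ == "__main__":
    # A 100-byte file occupies one full block whatever the block size is.
    for fsb in (512, 1024, 4096):
        used = allocated_bytes(100, fsb)
        print(f"FSB {fsb:5d}: {used} bytes allocated, {used - 100} bytes unused")
```
]]></programlisting>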
</section>
<section>
<title>Extents</title>
<para>An extent is a set of one or more contiguous FSBs that defines a region in the filesystem for file data or metadata</para>
<itemizedlist>
<listitem><para>A single extent can be up to 8GB in length (with 4KB filesystem blocks)</para></listitem>
</itemizedlist>
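<para>The 8GB figure can be reproduced with a short calculation. The sketch assumes that the block count in an extent record is a 21-bit field; that width is an assumption stated here for illustration, not something this chapter specifies.</para>
<programlisting><![CDATA[
```python
EXTENT_LENGTH_BITS = 21  # assumption: width of the block-count field in an extent record


def max_extent_bytes(fsb_size: int) -> int:
    """Largest extent, in bytes, for a given filesystem block size."""
    max_blocks = (1 << EXTENT_LENGTH_BITS) - 1
    return max_blocks * fsb_size


if __name__ == "__main__":
    for fsb in (512, 4096):
        print(f"FSB {fsb}: about {max_extent_bytes(fsb) / 2 ** 30:.2f} GiB per extent")
```
]]></programlisting>
<para>Under this assumption the limit scales with the block size, so a filesystem with blocks smaller than 4KB has a proportionally smaller maximum extent.</para>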
<para>A file’s inode lists the extents associated with that file</para>
<itemizedlist>
<listitem><para>For very large files, the inode may reference thousands of extents or a single very large extent; usually it is something in between.</para></listitem>
</itemizedlist>
<para>Extents are used for files, directory metadata and extended attributes when the information exceeds the space reserved in the inode</para>
<para>Using extents helps to</para>
<itemizedlist>
<listitem><para>minimize the disk space required to store a file's block map</para></listitem>
<listitem><para>reduce the effects of fragmentation</para></listitem>
<listitem><para>improve I/O performance by allowing fewer and larger I/O operations</para></listitem>
</itemizedlist>
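<para>To see why extents shrink a block map, consider collapsing a per-block list into (start, length) pairs. This is a simplified sketch: the name <literal>blocks_to_extents</literal> is hypothetical, and a real extent record carries more than a start and a length.</para>
<programlisting><![CDATA[
```python
from typing import Iterable, List, Tuple

Extent = Tuple[int, int]  # (starting block, length in blocks)


def blocks_to_extents(blocks: Iterable[int]) -> List[Extent]:
    """Collapse an ordered list of block numbers into contiguous extents."""
    extents: List[Extent] = []
    for block in blocks:
        if extents and extents[-1][0] + extents[-1][1] == block:
            # The block directly follows the last extent, so extend it.
            start, length = extents[-1]
            extents[-1] = (start, length + 1)
        else:
            # A gap starts a new extent.
            extents.append((block, 1))
    return extents


if __name__ == "__main__":
    block_map = list(range(1000, 2000)) + list(range(5000, 5010))
    extents = blocks_to_extents(block_map)
    print(len(block_map), "block entries become", len(extents), "extents:", extents)
```
]]></programlisting>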
</section>
<section>
<title>Unwritten Extents</title>
<para>An unwritten extent is an extent which has been marked as "not yet written" on disk.</para>
<para>Unwritten extents can be created by preallocating file space using:</para>
<itemizedlist>
<listitem><para>XFS specific interfaces (<command>xfsctl(3)</command>)</para></listitem>
<listitem><para><command>sys_fallocate</command> on kernels 2.6.23 and later</para></listitem>
<listitem><para><command>posix_fallocate(3)</command> on recent glibc
<itemizedlist>
<listitem><para>falls back to writing zeros if the kernel or filesystem lacks support</para></listitem>
</itemizedlist>
</para></listitem>
<listitem><para>the <command>fallocate(2)</command> wrapper on newer glibc versions</para></listitem>
<listitem><para>Direct I/O writes of specific alignment (such as stripe boundaries)</para></listitem>
</itemizedlist>
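<para>As a rough illustration of preallocation from user space, the sketch below calls <command>posix_fallocate(3)</command> through Python's <literal>os.posix_fallocate</literal>. The helper name <literal>preallocate</literal> and the file name are hypothetical. Whether the reserved space ends up as unwritten extents depends on the filesystem underneath; where there is no support, the zero-writing fallback described above applies instead.</para>
<programlisting><![CDATA[
```python
import os
import tempfile


def preallocate(path: str, length: int) -> int:
    """Reserve length bytes for path and return the resulting file size."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        # Ask for space covering bytes [0, length) of the file.
        os.posix_fallocate(fd, 0, length)
        return os.fstat(fd).st_size
    finally:
        os.close(fd)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        size = preallocate(os.path.join(tmp, "prealloc.dat"), 10 * 1024 * 1024)
        print("file size after preallocation:", size)
```
]]></programlisting>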
<para>Unwritten extents apply only to regular files.</para>
<para>The unwritten state prevents the uninitialised data in the extent from being exposed to the user.</para>
<para>Once such an extent is written to, in whole or in part, a transaction is
issued to convert the written part into a regular written extent and to leave the
remainder (up to two extents) marked as unwritten.</para>
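<para>The conversion can be modelled as splitting one extent into at most three. The function name <literal>write_into_unwritten</literal> and the tuple layout are hypothetical; the sketch only models the bookkeeping, not the transaction itself.</para>
<programlisting><![CDATA[
```python
from typing import List, Tuple

Extent = Tuple[int, int, str]  # (start block, length in blocks, state)


def write_into_unwritten(start: int, length: int,
                         w_start: int, w_len: int) -> List[Extent]:
    """Split an unwritten extent around a write that falls inside it."""
    end = start + length
    w_end = w_start + w_len
    if w_len <= 0 or w_start < start or w_end > end:
        raise ValueError("write must fall inside the extent")
    result: List[Extent] = []
    if w_start > start:
        result.append((start, w_start - start, "unwritten"))  # leading remainder
    result.append((w_start, w_len, "written"))                # converted part
    if w_end < end:
        result.append((w_end, end - w_end, "unwritten"))      # trailing remainder
    return result


if __name__ == "__main__":
    # Writing 20 blocks into the middle of a 100-block unwritten extent.
    for extent in write_into_unwritten(0, 100, 40, 20):
        print(extent)
```
]]></programlisting>
<para>A write in the middle leaves two unwritten remainders, a write at either end leaves one, and a write covering the whole extent leaves none.</para>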
<para>Use the <option>-p</option> option of <command>xfs_bmap</command> to view unwritten extents.</para>
<para><command># xfs_io -f -c 'resvsp 0 10m' -c 'bmap -vp' /tmp/foo</command></para>
</section>
<section>
<title>Delayed Allocation</title>
<para>Delayed allocation splits file block allocation into two stages:</para>
<itemizedlist>
<listitem><para>Reservation - disk space is reserved (but not allocated) when writing to cache
<itemizedlist>
<listitem><para>decrements free block count</para></listitem>
<listitem><para>creates a virtual 'delalloc' extent</para></listitem>
</itemizedlist>
</para></listitem>
<listitem><para>Allocation - disk blocks are allocated when flushing data from cache to disk
<itemizedlist>
<listitem><para>converts 'delalloc' extent to real extent</para></listitem>
</itemizedlist>
</para></listitem>
</itemizedlist>
<para>Benefits of delayed allocation</para>
<itemizedlist>
<listitem><para>Fragmentation is reduced by combining writes and allocating extents in large chunks</para></listitem>
<listitem><para>Short lived files may never need to be allocated</para></listitem>
<listitem><para>Files written randomly (such as those that are memory mapped) can now be allocated contiguously</para></listitem>
</itemizedlist>
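<para>The two stages and their benefits can be illustrated with a toy model. Everything here is hypothetical (the class <literal>DelallocModel</literal>, its methods, and the trivial allocator that hands out blocks sequentially); it only shows how reservations are counted at write time and turned into extents at flush time.</para>
<programlisting><![CDATA[
```python
from typing import Dict, List, Tuple


class DelallocModel:
    """Toy model of reservation at write time and allocation at flush time."""

    def __init__(self, free_blocks: int) -> None:
        self.free_blocks = free_blocks
        self.delalloc: Dict[str, int] = {}                   # reserved, not yet allocated
        self.extents: Dict[str, List[Tuple[int, int]]] = {}  # (start block, length)
        self._next_block = 0                                 # trivial sequential allocator

    def buffered_write(self, name: str, blocks: int) -> None:
        """Stage 1: reserve space when data lands in the cache."""
        if blocks > self.free_blocks:
            raise OSError("no space left on device")
        self.free_blocks -= blocks
        self.delalloc[name] = self.delalloc.get(name, 0) + blocks

    def flush(self, name: str) -> None:
        """Stage 2: convert the whole reservation into one real extent."""
        blocks = self.delalloc.pop(name, 0)
        if blocks:
            self.extents.setdefault(name, []).append((self._next_block, blocks))
            self._next_block += blocks

    def unlink(self, name: str) -> None:
        """Removing a file returns its reservation and any allocated blocks."""
        self.free_blocks += self.delalloc.pop(name, 0)
        for _, length in self.extents.pop(name, []):
            self.free_blocks += length


if __name__ == "__main__":
    fs = DelallocModel(free_blocks=100)
    for _ in range(3):
        fs.buffered_write("data", 4)   # three small writes are combined
    fs.flush("data")
    fs.buffered_write("scratch", 5)
    fs.unlink("scratch")               # short-lived file is never allocated
    print(fs.extents, fs.free_blocks)
```
]]></programlisting>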
</section>
<section>
<title>Direct I/O</title>
<para>Direct I/O allows an application to transfer data directly to disk from an application buffer and vice versa.</para>
<itemizedlist>
<listitem><para>Data does not pass through the filesystem cache</para></listitem>
<listitem><para>Data is transferred by DMA directly to and from the application buffer, avoiding the CPU overhead of copying through the cache</para></listitem>
<listitem><para>Synchronous I/O</para></listitem>
<listitem><para>XFS allows for parallel writes to same file</para></listitem>
</itemizedlist>
<para>Uses of direct I/O</para>
<itemizedlist>
<listitem><para>Backup programs, so that they can work without polluting the page cache</para></listitem>
<listitem><para>Applications that need 'intelligent' caching</para></listitem>
<listitem><para>High performance, bandwidth intensive workloads</para></listitem>
</itemizedlist>
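<para>A minimal sketch of a direct I/O write from user space follows. It assumes Linux, where <literal>O_DIRECT</literal> is available, and assumes that a 4096-byte, page-aligned buffer satisfies the alignment rules; the real requirement depends on the device and filesystem. The helper name <literal>write_direct</literal> is hypothetical, and it simply reports failure where direct I/O is not supported.</para>
<programlisting><![CDATA[
```python
import mmap
import os
import tempfile

BLOCK = 4096  # assumed alignment and transfer size


def write_direct(path: str, payload: bytes) -> bool:
    """Write one aligned block with O_DIRECT; return False if that is not possible."""
    o_direct = getattr(os, "O_DIRECT", None)
    if o_direct is None or len(payload) > BLOCK:
        return False
    buf = mmap.mmap(-1, BLOCK)                 # anonymous mappings are page aligned
    buf.write(payload.ljust(BLOCK, b"\0"))     # pad the payload to a full block
    try:
        fd = os.open(path, os.O_WRONLY | os.O_CREAT | o_direct, 0o600)
    except OSError:
        buf.close()
        return False                           # the filesystem may reject O_DIRECT
    try:
        return os.write(fd, buf) == BLOCK      # data bypasses the page cache
    except OSError:
        return False
    finally:
        os.close(fd)
        buf.close()


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        print("direct write succeeded:", write_direct(os.path.join(tmp, "dio.dat"), b"hello"))
```
]]></programlisting>
<para>The anonymous <literal>mmap</literal> buffer is used only because it is guaranteed to be page aligned; an ordinary <literal>bytes</literal> object gives no such guarantee.</para>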
</section>
<section>
<title>Stripe Alignment</title>
<para>Delayed allocations can be aligned to stripe unit/width boundaries when allocating past the end of file (EOF)</para>
<para>Direct I/O can align block allocations on stripe unit/width boundaries</para>
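<para>Aligning an allocation to a stripe boundary amounts to rounding the starting block up to the next multiple of the stripe unit. The sketch below uses the hypothetical helper <literal>align_up</literal> and an assumed stripe unit of 16 filesystem blocks; it shows the arithmetic only, not the allocator's actual policy.</para>
<programlisting><![CDATA[
```python
def align_up(block: int, stripe_unit: int) -> int:
    """Round a block number up to the next stripe unit boundary."""
    if stripe_unit < 1:
        raise ValueError("stripe_unit must be at least 1")
    return -(-block // stripe_unit) * stripe_unit


if __name__ == "__main__":
    STRIPE_UNIT = 16  # assumed stripe unit, in filesystem blocks
    for candidate in (0, 37, 48):
        print(candidate, "->", align_up(candidate, STRIPE_UNIT))
```
]]></programlisting>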
</section>
<section>
<title>Inodes</title>
<para>XFS has three inode structures</para>
<para>XFS inode</para>
<itemizedlist>
<listitem><para>In-memory XFS inode used only by the filesystem</para></listitem>
</itemizedlist>
<para>On-disk inode</para>
<itemizedlist>
<listitem><para>Used for storing the metadata for files, directories and other file types</para></listitem>
<listitem><para>Default size is 256 bytes and can be up to 2KB</para></listitem>
<listitem><para>Embedded within the XFS inode</para></listitem>
</itemizedlist>
<para>Linux inode</para>
<itemizedlist>
<listitem><para>Generic inode structure used by VFS</para></listitem>
<listitem><para>Embedded within the XFS inode</para></listitem>
</itemizedlist>
</section>
<section>
<title>Directory and File Inodes</title>
<mediaobject><imageobject>
<imagedata fileref="images/XFS-directory-file-inodes.png" />
</imageobject></mediaobject>
</section>
<section>
<title>Journal Log</title>
<para>The XFS journal logs all metadata changes</para>
<itemizedlist>
<listitem><para>Only filesystem metadata is logged, not user data</para></listitem>
</itemizedlist>
<para>Allows the filesystem to replay the log and recover the filesystem quickly after a crash</para>
<itemizedlist>
<listitem><para>No requirement to run fsck</para></listitem>
</itemizedlist>
<para>During a mount, log replay applies metadata changes that had been logged
but may not yet have been applied to the filesystem</para>
<para>The log may be located on a separate device</para>
<itemizedlist>
<listitem><para>Can improve performance due to reduced disk contention</para></listitem>
</itemizedlist>
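<para>The idea of logging metadata changes and replaying them at mount time can be shown with a toy model. The class <literal>ToyJournal</literal> and its methods are hypothetical and bear no relation to the XFS log format; the sketch only demonstrates that changes recorded in the log survive a crash and are applied on the next mount.</para>
<programlisting><![CDATA[
```python
from typing import Dict, List, Tuple

LogRecord = Tuple[str, int]  # (metadata key, new value)


class ToyJournal:
    """Toy metadata journal: changes are logged first and applied later."""

    def __init__(self) -> None:
        self.log: List[LogRecord] = []       # persistent log
        self.metadata: Dict[str, int] = {}   # persistent metadata

    def commit(self, key: str, value: int) -> None:
        """Record a metadata change in the log before touching the metadata."""
        self.log.append((key, value))

    def checkpoint(self) -> None:
        """Apply logged changes in order, then drop them from the log."""
        for key, value in self.log:
            self.metadata[key] = value
        self.log.clear()

    def mount(self) -> int:
        """Replay whatever is still in the log and return the record count."""
        replayed = len(self.log)
        self.checkpoint()
        return replayed


if __name__ == "__main__":
    fs = ToyJournal()
    fs.commit("free_blocks", 90)
    fs.commit("inode_count", 3)
    # A crash here leaves both changes in the log but not yet in the metadata.
    print("records replayed at mount:", fs.mount())
    print("metadata after replay:", fs.metadata)
```
]]></programlisting>
<para>Only metadata keys appear in the model, which mirrors the point above that user data is not logged.</para>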
</section>
</chapter>