| <?xml version='1.0' encoding='utf-8' ?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [ |
| ]> |
| <chapter id="xfs-overview"> |
| <title>XFS Overview</title> |
| <section> |
| <title>XFS Filesystem Structure</title> |
| <para>This section gives an overview of the structure of an XFS filesystem</para> |
| <para>More detailed examination of the filesystem structure is covered later in the course</para> |
| <para>An XFS filesystem is divided evenly into allocation groups</para> |
| <para>An allocation group can be from 16MB to 1TB in size</para> |
| <para>See <command>xfs(5)</command></para> |
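| <para>As a rough illustration of the even division described above, the following sketch computes the size of each allocation group and checks it against the 16MB to 1TB range. The function name <literal>ag_size_bytes</literal> and the example sizes are assumptions made for illustration, and binary units are assumed; this is not how <command>mkfs.xfs</command> chooses its geometry.</para> |
| <programlisting><![CDATA[
```python
MIB = 1024 ** 2
TIB = 1024 ** 4

AG_MIN_BYTES = 16 * MIB   # lower bound quoted in the text (binary units assumed)
AG_MAX_BYTES = 1 * TIB    # upper bound quoted in the text (binary units assumed)


def ag_size_bytes(fs_bytes, ag_count):
    """Size of each allocation group for an evenly divided filesystem."""
    if ag_count < 1:
        raise ValueError("ag_count must be at least 1")
    size = fs_bytes // ag_count
    if not AG_MIN_BYTES <= size <= AG_MAX_BYTES:
        raise ValueError("allocation group size outside the 16MB to 1TB range")
    return size


# Hypothetical 4TB filesystem split into 4 allocation groups.
print(ag_size_bytes(4 * TIB, 4) // MIB, "MB per allocation group")
```
]]></programlisting> |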
| </section> |
| <section> |
| <title>Allocation Groups</title> |
| <mediaobject><imageobject> |
| <imagedata fileref="images/XFS-allocation-groups.png" /> |
| </imageobject></mediaobject> |
| </section> |
| <section> |
| <title>Allocation Group Structure</title> |
| <para>Each allocation group includes</para> |
| <itemizedlist> |
| <listitem><para>Superblock information about the entire filesystem</para></listitem> |
| <listitem><para>Free space management (within the allocation group)</para></listitem> |
| <listitem><para>Inode allocation and tracking (within the allocation group)</para></listitem> |
| </itemizedlist> |
| <para>Inode clusters within an allocation group are created when needed</para> |
| <itemizedlist> |
| <listitem><para>mkfs.xfs does not pre-create inodes throughout the filesystem</para></listitem> |
| </itemizedlist> |
| <mediaobject><imageobject> |
| <imagedata fileref="images/XFS-allocation-group-structure.png" /> |
| </imageobject></mediaobject> |
| </section> |
| <section> |
| <title>XFS Limits</title> |
| <para>32 bit Linux</para> |
| <itemizedlist> |
| <listitem><para>Maximum File Size = 16TB (O_LARGEFILE)</para></listitem> |
| <listitem><para>Maximum Filesystem Size = 16TB</para></listitem> |
| </itemizedlist> |
| <para>64 bit Linux</para> |
| <itemizedlist> |
| <listitem><para>Maximum File Size = 9 million TB (9 EB)</para></listitem> |
| <listitem><para>Maximum Filesystem Size = 18 million TB (18 EB)</para></listitem> |
| </itemizedlist> |
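| <para>The figures above can be reproduced with simple arithmetic. The sketch below assumes that the 32 bit limit comes from a 32 bit page index with 4KB pages, and that the 64 bit figures correspond to signed and unsigned 64 bit byte offsets expressed in decimal exabytes. These are assumptions made to illustrate the numbers, not statements taken from the XFS sources.</para> |
| <programlisting><![CDATA[
```python
PAGE_SIZE = 4096        # assumed page size in bytes (4KB)
TB = 1024 ** 4          # binary terabyte, as used for the 16TB figure
EB_DECIMAL = 10 ** 18   # decimal exabyte (one million decimal terabytes)

# Assumption: a 32 bit page index addresses 2**32 pages of 4KB each.
limit_32bit = (2 ** 32) * PAGE_SIZE
print(limit_32bit // TB, "TB")                        # 16 TB

# Assumption: the 64 bit limits are 2**63 and 2**64 bytes.
max_file_64bit = 2 ** 63
max_fs_64bit = 2 ** 64
print(round(max_file_64bit / EB_DECIMAL, 1), "EB")    # 9.2 EB
print(round(max_fs_64bit / EB_DECIMAL, 1), "EB")      # 18.4 EB
```
]]></programlisting> |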
| </section> |
| <section> |
| <title>Filesystem Block Size (FSB)</title> |
| <para>Filesystem blocks (FSBs) are the unit of space for a filesystem</para> |
| <itemizedlist> |
| <listitem><para>Filesystem blocks are composed of one or more device-level sectors.</para></listitem> |
| </itemizedlist> |
| <para>The page management implementation in Linux limits the maximum FSB size to the page size</para> |
| <itemizedlist> |
| <listitem><para>4KB on ia32 and x86_64 architectures</para></listitem> |
| <listitem><para>16KB on ia64</para></listitem> |
| </itemizedlist> |
| <para>Performance can improve with different block sizes depending on the size of I/O requests and the size of files</para> |
| <itemizedlist> |
| <listitem><para>Larger blocks will also use more disk space for small (&lt;1 FSB) files</para></listitem> |
| </itemizedlist> |
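| <para>The space cost of larger blocks for small files can be estimated with a short calculation. The sketch below assumes that file data is stored in whole filesystem blocks; the function name <literal>allocated_bytes</literal> and the 700 byte file are hypothetical.</para> |
| <programlisting><![CDATA[
```python
def allocated_bytes(file_size, fsb_size):
    """Bytes of disk space used when data is stored in whole filesystem blocks."""
    blocks = -(-file_size // fsb_size)   # ceiling division
    return blocks * fsb_size


# Hypothetical 700 byte file stored with different block sizes.
for fsb_size in (512, 1024, 4096):
    used = allocated_bytes(700, fsb_size)
    print(fsb_size, used, used - 700)    # block size, space used, space wasted
```
]]></programlisting> |
| <para>With a 4KB block the 700 byte file occupies 4096 bytes, so most of the block is unused.</para> |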
| </section> |
| <section> |
| <title>Extents</title> |
| <para>An extent is a set of one or more contiguous FSBs that defines a region in the filesystem for file data or metadata</para> |
| <itemizedlist> |
| <listitem><para>A single extent can be up to 8GB in length</para></listitem> |
| </itemizedlist> |
| <para>A file’s inode lists the extents associated with that file</para> |
| <itemizedlist> |
| <listitem><para>For very large files, the inode may reference thousands of extents or a single very large extent; usually it is somewhere in between.</para></listitem> |
| </itemizedlist> |
| <para>Extents are used for files, directory metadata and extended attributes when the information exceeds the space reserved in the inode</para> |
| <para>Using extents helps to</para> |
| <itemizedlist> |
| <listitem><para>minimize the disk space required to store a file's block map</para></listitem> |
| <listitem><para>reduce the effects of fragmentation</para></listitem> |
| <listitem><para>improve I/O performance by allowing fewer and larger I/O operations</para></listitem> |
| </itemizedlist> |
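| <para>To see why extents shrink a file's block map, the sketch below collapses a list of individual block numbers into (start, length) pairs. It is a simplified illustration; the function name <literal>blocks_to_extents</literal> and the block numbers are hypothetical, and the real on-disk extent record carries more fields than this.</para> |
| <programlisting><![CDATA[
```python
def blocks_to_extents(block_numbers):
    """Collapse a sorted list of block numbers into (start, length) extents."""
    extents = []
    for block in block_numbers:
        if extents and extents[-1][0] + extents[-1][1] == block:
            start, length = extents[-1]
            extents[-1] = (start, length + 1)   # block extends the last extent
        else:
            extents.append((block, 1))          # block starts a new extent
    return extents


# Hypothetical file occupying blocks 100-103 and 200-201.
print(blocks_to_extents([100, 101, 102, 103, 200, 201]))
# [(100, 4), (200, 2)]
```
]]></programlisting> |
| <para>Six block numbers are described by two extents; a fully contiguous file needs only one.</para> |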
| </section> |
| <section> |
| <title>Unwritten Extents</title> |
| <para>An unwritten extent is an extent which has been marked as "not yet written" on disk.</para> |
| <para>Unwritten extents can be created by preallocating file space using:</para> |
| <itemizedlist> |
| <listitem><para>XFS specific interfaces (<command>xfsctl(3)</command>)</para></listitem> |
| <listitem><para><command>sys_fallocate</command> on kernels >= 2.6.23</para></listitem> |
| <listitem><para><command>posix_fallocate(3)</command> on recent glibc |
| <itemizedlist> |
| <listitem><para>falls back to writing zeros if the kernel or filesystem has no support</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| <listitem><para><command>fallocate(1)</command> from util-linux</para></listitem> |
| <listitem><para>Direct I/O with specific alignment (such as stripe boundaries)</para></listitem> |
| </itemizedlist> |
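| <para>A minimal sketch of preallocation through the <command>posix_fallocate(3)</command> interface, here reached via Python's <literal>os.posix_fallocate</literal>. The helper name <literal>preallocate</literal> and the scratch file are hypothetical. Whether the reserved range ends up as unwritten extents depends on the filesystem and kernel; on filesystems without support the C library may write zeros instead.</para> |
| <programlisting><![CDATA[
```python
import os
import tempfile


def preallocate(path, size):
    """Reserve `size` bytes for the file at `path` without writing user data."""
    fd = os.open(path, os.O_RDWR | os.O_CREAT, 0o600)
    try:
        os.posix_fallocate(fd, 0, size)   # offset 0, length `size`
    finally:
        os.close(fd)


# Hypothetical scratch file in a temporary directory.
with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "prealloc.dat")
    preallocate(path, 10 * 1024 * 1024)
    print(os.path.getsize(path))          # file size covers the reserved range
    with open(path, "rb") as f:
        print(f.read(16) == bytes(16))    # reserved space reads back as zeros
```
]]></programlisting> |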
| <para>Unwritten extents apply only to regular files.</para> |
| <para>The unwritten state prevents the uninitialised data in the extent from being exposed to the user.</para> |
| <para>When such an extent is written to, in whole or in part, a transaction is
issued to convert the written range into a regular written extent and to leave the
remaining ranges (up to two) as unwritten extents.</para> |
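| <para>The conversion can be pictured with a small model. The sketch below splits one unwritten extent around a write that falls inside it, giving one written extent and up to two remaining unwritten extents. The function name <literal>write_into_unwritten</literal> and the block numbers are hypothetical; this models the outcome only, not the XFS transaction itself.</para> |
| <programlisting><![CDATA[
```python
def write_into_unwritten(ext_start, ext_len, write_start, write_len):
    """Split one unwritten extent after a write that lies inside it.

    Returns a list of (start, length, state) tuples in block units.
    """
    ext_end = ext_start + ext_len
    write_end = write_start + write_len
    if not (ext_start <= write_start and write_len > 0 and write_end <= ext_end):
        raise ValueError("write must lie inside the extent")
    pieces = []
    if write_start > ext_start:
        pieces.append((ext_start, write_start - ext_start, "unwritten"))
    pieces.append((write_start, write_len, "written"))
    if write_end < ext_end:
        pieces.append((write_end, ext_end - write_end, "unwritten"))
    return pieces


# Hypothetical 100 block unwritten extent with a 10 block write in the middle.
print(write_into_unwritten(0, 100, 40, 10))
# [(0, 40, 'unwritten'), (40, 10, 'written'), (50, 50, 'unwritten')]
```
]]></programlisting> |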
| <para>Use the <option>-p</option> option of <command>xfs_bmap</command> to view unwritten extents, for example:</para> |
| <para><command># xfs_io -f -c 'resvsp 0 10m' -c 'bmap -vp' /tmp/foo</command></para> |
| </section> |
| <section> |
| <title>Delayed Allocation</title> |
| <para>Delayed allocation splits file block allocation into two stages:</para> |
| <itemizedlist> |
| <listitem><para>Reservation - disk space is reserved (but not allocated) when writing to cache |
| <itemizedlist> |
| <listitem><para>decrements free block count</para></listitem> |
| <listitem><para>creates a virtual 'delalloc' extent</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| <listitem><para>Allocation - disk blocks are allocated when flushing data from cache to disk |
| <itemizedlist> |
| <listitem><para>converts 'delalloc' extent to real extent</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| </itemizedlist> |
| <para>Benefits of delayed allocation</para> |
| <itemizedlist> |
| <listitem><para>Fragmentation is reduced by combining writes and allocating extents in large chunks</para></listitem> |
| <listitem><para>Short lived files may never need to be allocated</para></listitem> |
| <listitem><para>Files written randomly (such as those that are memory mapped) can now be allocated contiguously</para></listitem> |
| </itemizedlist> |
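| <para>The two stages can be sketched with a toy model. The class <literal>ToyDelallocFile</literal> below is hypothetical and greatly simplified: it tracks a free block count, holds reservations while data is cached, and turns the whole reservation into a single real extent on flush. It is meant only to show why many small buffered writes can end up as one contiguous extent.</para> |
| <programlisting><![CDATA[
```python
class ToyDelallocFile:
    """Toy model of two-stage allocation; not the XFS implementation."""

    def __init__(self, free_blocks, next_free_block=0):
        self.free_blocks = free_blocks          # free block count
        self.next_free_block = next_free_block  # where the next real extent starts
        self.reserved = 0                       # blocks held as 'delalloc'
        self.extents = []                       # real (start, length) extents

    def write(self, nblocks):
        """Stage 1: reserve space when data lands in the cache."""
        if nblocks > self.free_blocks:
            raise OSError("no space left")
        self.free_blocks -= nblocks
        self.reserved += nblocks

    def flush(self):
        """Stage 2: convert the whole reservation into one real extent."""
        if self.reserved:
            self.extents.append((self.next_free_block, self.reserved))
            self.next_free_block += self.reserved
            self.reserved = 0


toy = ToyDelallocFile(free_blocks=1000)
for _ in range(8):      # eight small buffered writes of 4 blocks each
    toy.write(4)
toy.flush()
print(toy.extents, toy.free_blocks)   # [(0, 32)] 968
```
]]></programlisting> |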
| </section> |
| <section> |
| <title>Direct I/O</title> |
| <para>Direct I/O allows an application to transfer data directly to disk from an application buffer and vice versa.</para> |
| <itemizedlist> |
| <listitem><para>Data does not pass through the filesystem cache</para></listitem> |
| <listitem><para>Data is transferred by DMA, avoiding the CPU overhead of copying it through the cache</para></listitem> |
| <listitem><para>Synchronous I/O</para></listitem> |
| <listitem><para>XFS allows parallel writes to the same file</para></listitem> |
| </itemizedlist> |
| <para>Uses of direct I/O</para> |
| <itemizedlist> |
| <listitem><para>Backup programs, so that they can work without polluting the page cache</para></listitem> |
| <listitem><para>Applications that need 'intelligent' caching</para></listitem> |
| <listitem><para>High performance, bandwidth intensive workloads</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Stripe Alignment</title> |
| <para>Delayed allocations can be aligned to stripe unit/width boundaries when they extend past the end of the file (EOF)</para> |
| <para>Direct I/O can align block allocations on stripe unit/width boundaries</para> |
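| <para>Aligning an allocation to a stripe boundary amounts to rounding its starting block up to a multiple of the stripe unit. The sketch below shows that rounding; the function name <literal>align_up</literal> and the stripe unit of 16 blocks are hypothetical, and the real allocator weighs more factors than this.</para> |
| <programlisting><![CDATA[
```python
def align_up(block, stripe_unit):
    """Round a block number up to the next stripe unit boundary."""
    if stripe_unit <= 0:
        raise ValueError("stripe_unit must be positive")
    return -(-block // stripe_unit) * stripe_unit


# Hypothetical stripe unit of 16 blocks (64KB with 4KB blocks).
print(align_up(37, 16))   # 48
print(align_up(48, 16))   # 48, already aligned
```
]]></programlisting> |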
| </section> |
| <section> |
| <title>Inodes</title> |
| <para>XFS has three inode structures</para> |
| <para>XFS inode</para> |
| <itemizedlist> |
| <listitem><para>In-memory XFS inode used only by the filesystem</para></listitem> |
| </itemizedlist> |
| <para>On-disk inode</para> |
| <itemizedlist> |
| <listitem><para>Used for storing the metadata for files, directories and other file types</para></listitem> |
| <listitem><para>Default size is 256 bytes and can be up to 2KB</para></listitem> |
| <listitem><para>Embedded within the XFS inode</para></listitem> |
| </itemizedlist> |
| <para>Linux inode</para> |
| <itemizedlist> |
| <listitem><para>Generic inode structure used by VFS</para></listitem> |
| <listitem><para>Embedded within the XFS inode</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Directory and File Inodes</title> |
| <mediaobject><imageobject> |
| <imagedata fileref="images/XFS-directory-file-inodes.png" /> |
| </imageobject></mediaobject> |
| </section> |
| <section> |
| <title>Journal Log</title> |
| <para>The XFS journal logs all metadata changes</para> |
| <itemizedlist> |
| <listitem><para>Only filesystem metadata is logged, not user data</para></listitem> |
| </itemizedlist> |
| <para>Allows the filesystem to replay the log and recover the filesystem quickly after a crash</para> |
| <itemizedlist> |
| <listitem><para>No requirement to run fsck</para></listitem> |
| </itemizedlist> |
| <para>During a mount, log replay applies metadata changes that had been
logged but may not yet have been written to the filesystem</para> |
| <para>The log may be located on a separate device</para> |
| <itemizedlist> |
| <listitem><para>Can improve performance due to reduced disk contention</para></listitem> |
| </itemizedlist> |
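| <para>The idea behind log replay can be sketched with a toy write-ahead log. The record layout, the function name <literal>replay</literal> and the metadata keys below are all hypothetical and bear no relation to the XFS on-disk log format; the sketch only shows that changes from committed transactions are reapplied while an incomplete transaction is ignored.</para> |
| <programlisting><![CDATA[
```python
def replay(metadata, log):
    """Apply logged changes from committed transactions; skip incomplete ones."""
    committed = {rec[1] for rec in log if rec[0] == "commit"}
    for rec in log:
        if rec[0] == "change" and rec[1] in committed:
            _, _, key, value = rec
            metadata[key] = value
    return metadata


# Hypothetical log: transaction 1 committed, transaction 2 cut off by a crash.
log = [
    ("change", 1, "inode 128 size", 4096),
    ("commit", 1),
    ("change", 2, "inode 129 size", 8192),
]
print(replay({"inode 128 size": 0, "inode 129 size": 0}, log))
# {'inode 128 size': 4096, 'inode 129 size': 0}
```
]]></programlisting> |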
| </section> |
| </chapter> |