| <?xml version='1.0' encoding='utf-8' ?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [ |
| ]> |
| <chapter id="xfs-allocators"> |
| <title>Allocators</title> |
| <section> |
| <title>Allocation Policy</title> |
| <para>The default allocation behaviour of XFS for new files is to place them in the same allocation group as their parent directory.</para> |
| <para>Since files and their parent directories are often accessed in close succession, this minimises costly disk seeks.</para> |
| <para>The allocator will also attempt to place newly created directories in different allocation groups.</para> |
| <para>Combined, these policies help group directories of files together on disk, even if they're being written to concurrently.</para> |
| <para>Allocation policies in XFS can change when:</para> |
| <itemizedlist> |
| <listitem><para>32 bit inode numbers are used on large file systems</para></listitem> |
| <listitem><para>Certain mount options are used</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Allocation Policy - Directories</title> |
| <para>New directories are placed in different AGs where possible.</para> |
| <para>Watch the inode numbers as directory inodes are created:</para> |
| <para><programlisting> |
| > mkdir a b |
| > ls -li |
| total 0 |
| 131 drwxr-xr-x 2 sjv users 6 2006-10-20 12:12 a |
| 33554624 drwxr-xr-x 2 sjv users 6 2006-10-20 12:12 b</programlisting></para> |
| </section> |
| <section> |
| <title>Allocation Policy - Files</title> |
| <para>Files are created in the same AG as their parent directory where possible, which is also evident in their inode numbers:</para> |
| <para><programlisting> |
| > touch a/1 b/1 a/2 b/2 |
| > ls -1id * */* |
| 131 a |
| 132 a/1 |
| 133 a/2 |
| 33554624 b |
| 33554625 b/1 |
| 33554626 b/2</programlisting></para> |
| </section> |
| <section> |
| <title>Inode Numbers</title> |
| <para>Every inode on disk has a unique inode number associated with it.</para> |
| <para>It is a requirement that inode numbers be persistent across unmounts and reboots, |
| so once an inode is written to disk its inode number is fixed.</para> |
| <para>For performance reasons it must be possible to quickly find an inode on disk using its inode number.</para> |
| <para>XFS uses the physical location of the inode on disk to encode the inode number, which makes |
| finding the inode on disk using the inode number a trivial task.</para> |
| </section> |
| <section> |
| <title>Inode Number Format</title> |
| <para>An inode's location consists of three distinct parts:</para> |
| <itemizedlist> |
| <listitem><para>an allocation group number</para></listitem> |
| <listitem><para>a file system block number within its AG</para></listitem> |
| <listitem><para>an inode number inside that file system block</para></listitem> |
| </itemizedlist> |
| <para>The number of bits required to store each of these values varies with the filesystem geometry.</para> |
| <para>Larger filesystems can easily require more than 32 bits. When inode numbers are restricted to 32 bits,
inode allocation is confined to a region at the start of the volume.</para> |
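| <para>As a rough illustration of this three-part layout, the sketch below packs and unpacks an inode number with plain bit operations. The function names and the field widths (21 bits for the block number within the AG, 4 bits for the inode index within a block) are assumptions chosen for the example, not values stated in this chapter; a real filesystem derives the widths from its geometry. Under these assumed widths, the directory inode numbers 131 and 33554624 from the earlier listing decode to AG 0 and AG 1.</para> |
| <para><programlisting><![CDATA[
```python
# Assumed field widths for illustration only; real values depend on geometry.
AGBLK_BITS = 21    # bits for the block number within an AG (assumption)
INOPBLK_BITS = 4   # bits for the inode index within a block (assumption)


def split_inode_number(ino, agblk_bits=AGBLK_BITS, inopblk_bits=INOPBLK_BITS):
    """Return (AG number, block within AG, inode index within block)."""
    offset = ino & ((1 << inopblk_bits) - 1)
    agbno = (ino >> inopblk_bits) & ((1 << agblk_bits) - 1)
    agno = ino >> (inopblk_bits + agblk_bits)
    return agno, agbno, offset


def make_inode_number(agno, agbno, offset,
                      agblk_bits=AGBLK_BITS, inopblk_bits=INOPBLK_BITS):
    """Pack the three location fields back into a single inode number."""
    return (agno << (agblk_bits + inopblk_bits)) | (agbno << inopblk_bits) | offset


if __name__ == "__main__":
    for ino in (131, 33554624):
        print(ino, split_inode_number(ino))
```
]]></programlisting></para> |
| <para>Because the location is recovered with shifts and masks alone, no lookup table is needed to find an inode from its number.</para> |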
| </section> |
| <section> |
| <title>Inode Number Size</title> |
| <para>File systems aren't free to use inode numbers of arbitrary size.</para> |
| <para>Operating system interfaces and legacy software products often mandate the use of 32 |
| bit inode numbers even on systems that support 64 bit inode numbers.</para> |
| <para>This can be a problem on large file systems, since 32 bit inode numbers only provide |
| enough bits to encode inode locations in the first 1TB of a volume when 256 byte inodes |
| are used, up to 8TB in the case of 2kB inodes.</para> |
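| <para>The 1TB and 8TB figures can be reproduced with a short calculation. The sketch below assumes that every 32 bit value names one inode slot and that slots are packed back to back from the start of the volume, and it uses binary terabytes; the function name is illustrative.</para> |
| <para><programlisting><![CDATA[
```python
TIB = 2 ** 40  # one binary terabyte in bytes


def inode_addressable_bytes(inode_size, ino_bits=32):
    """Bytes of volume reachable by inode numbers of the given width.

    Assumes each inode number maps to one inode-sized slot, with slots
    packed contiguously from the start of the volume.
    """
    return (2 ** ino_bits) * inode_size


if __name__ == "__main__":
    for size in (256, 2048):
        print(size, "byte inodes:", inode_addressable_bytes(size) / TIB, "TiB")
```
]]></programlisting></para> |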
| <para>For best performance, a file system needs to keep a file's data blocks close to its |
| inode to minimise seeks when performing I/O. XFS's ability to do this suffers on large |
| volumes when 32 bit inode numbers are used.</para> |
| </section> |
| <section> |
| <title>32bit and 64bit Inodes</title> |
| <para>By default, XFS will use 32 bit inode numbers.</para> |
| <para>If the system supports it, pass the -o inode64 option to mount to allow 64 bit inode numbers.</para> |
| <para>Once an inode has been written somewhere on the disk that requires a 64 bit inode number, the
file system can no longer be used with 32 bit inode numbers.</para> |
| <itemizedlist> |
| <listitem><para>The inode64 mount option should not be removed once used</para></listitem> |
| </itemizedlist> |
| <para>(IRIX can move inodes to 32 bit numbers with <command>xfs_reno</command>; this tool has not
yet been ported to Linux.)</para> |
| </section> |
| <section> |
| <title>32bit and 64bit Inodes</title> |
| <para>Inode numbers are stored in big endian format on disk, and host endian format in-core.</para> |
| <para>Applications that pass 64 bit inode numbers using 32 bit variables will truncate the |
| 32 most-significant bits.</para> |
| <para>Since XFS stores the AG number an inode belongs to in the most significant bits, a result |
| of this truncation can be an inode number that points to an inode in a lower AG by mistake.</para> |
| <para>Using that inode number will result in either a lookup on the incorrect inode, or the |
| referencing of an area on disk that doesn't contain inodes at all.</para> |
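| <para>The effect of this truncation can be shown with a small example. The sketch below assumes, purely for illustration, that the AG number starts at bit 25 of the inode number; the constant and function names are hypothetical. An inode in AG 300 then appears to live in a much lower AG once the upper 32 bits are dropped.</para> |
| <para><programlisting><![CDATA[
```python
AG_SHIFT = 25  # assumed position of the AG number within the inode number


def truncate_to_32_bits(ino):
    """Mimic storing a 64 bit inode number in a 32 bit variable."""
    return ino & 0xFFFFFFFF


def ag_of(ino):
    """AG number encoded in the most significant bits of the inode number."""
    return ino >> AG_SHIFT


original = (300 << AG_SHIFT) | 131       # an inode in AG 300 (needs more than 32 bits)
truncated = truncate_to_32_bits(original)

if __name__ == "__main__":
    print("original AG:", ag_of(original))
    print("AG after truncation:", ag_of(truncated))
```
]]></programlisting></para> |
| <para>The low bits survive unchanged, which is why the truncated value can still look like a plausible inode number.</para> |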
| </section> |
| <section> |
| <title>32 bit Inodes on >1TB Filesystems</title> |
| <para>When 32 bit inode numbers are used on a volume larger than 1TB in size, several changes occur.</para> |
| <para>A 100TB volume using 256 byte inodes mounted in the default inode32 mode has just |
| one percent of its space available for allocating inodes.</para> |
| <para>XFS will reserve the first 1TB of disk space exclusively for inodes, so that file data
allocations cannot make this imbalance any worse.</para> |
| <para>It is no longer possible for file data to reside in the same AG as the parent directory's inode.</para> |
| <para>XFS will instead "rotor" through the upper AGs as it allocates space for files, putting |
| each file in a new AG to evenly spread the I/O load.</para> |
| </section> |
| <section> |
| <title>Rotor Step</title> |
| <para>The performance of some workloads will suffer when each file is placed in a different
AG, so the "rotor step" sysctl was added to adjust this behaviour.</para> |
| <para>For example, to keep at least a second of ingested 24fps video files in the same |
| AG before moving to the next AG:</para> |
| <para><programlisting> |
| # sysctl fs.xfs.rotorstep |
| fs.xfs.rotorstep = 1 |
# sudo sysctl -w fs.xfs.rotorstep=24
| fs.xfs.rotorstep = 24</programlisting></para> |
| <note><para>The rotorstep value is a global one, so setting it will affect the behaviour of |
| all mounted file systems over 1TB in size that use 32 bit inode numbers.</para></note> |
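| <para>The effect of the rotor step can be pictured with a simplified model. The sketch below is not the kernel's implementation: it assumes a single global file counter, and the names <literal>first_ag</literal> (the first AG outside the region reserved for inodes) and <literal>ag_count</literal> are hypothetical parameters. With a step of 1 each new file lands in the next AG; with a step of 24, 24 consecutive files share an AG before the rotor advances.</para> |
| <para><programlisting><![CDATA[
```python
def rotor_ag(file_index, rotorstep, first_ag, ag_count):
    """Pick the AG for the Nth newly created file in a simplified rotor model.

    file_index : 0-based count of files created so far (assumed global)
    rotorstep  : number of consecutive files placed in one AG
    first_ag   : first AG available for file data (assumption)
    ag_count   : total number of AGs in the filesystem
    """
    upper_ags = ag_count - first_ag
    return first_ag + (file_index // rotorstep) % upper_ags


if __name__ == "__main__":
    for step in (1, 24):
        placement = [rotor_ag(i, step, first_ag=1, ag_count=8) for i in range(50)]
        print("rotorstep", step, placement)
```
]]></programlisting></para> |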
| </section> |
| <section> |
| <title>Realtime Allocator</title> |
| <para>Certain classes of applications require deterministic latencies on file allocation operations.</para> |
| <itemizedlist> |
| <listitem><para>The performance of the standard XFS allocator varies depending on the |
| internal data structures used to manage the filesystem content</para></listitem> |
| </itemizedlist> |
| <para>The realtime allocator uses a bitmap algorithm that gives consistent allocation |
| latencies regardless of the filesystem's contents.</para> |
| <para>By using the realtime allocator in conjunction with an external log volume, it's possible |
| to remove most of the unpredictability in disk response times that's caused by metadata overheads.</para> |
| <note><para>The realtime allocator is only available when the XFS kernel code is built with
CONFIG_XFS_RT enabled.</para></note> |
| </section> |
| <section> |
| <title>Realtime Allocator Limitations</title> |
| <para>In practice, the realtime allocator is not widely used.</para> |
| <para>It effectively uses a single large allocation group with a single set of data structures,
losing the parallelism of XFS’s allocation groups.</para> |
| <itemizedlist> |
| <listitem><para>The locks associated with this central data structure result in the serialisation of |
| concurrent operations to a realtime device</para></listitem> |
| </itemizedlist> |
| <para>The realtime allocator is incapable of maintaining a spatial separation on disk for concurrent operations.</para> |
| <itemizedlist> |
| <listitem><para>It tries to start new files at random points in the bitmap to reduce this
problem, but this has a negative impact on some workloads</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Traditional allocator impact on some workloads</title> |
| <para>A certain class of applications will write many large files to a directory in sequence.</para> |
| <para>A film scanner ingesting video may write each video frame to a separate file. To play back
the video frames in realtime, it's important that these files are contiguous on disk for optimal
read-ahead performance by the hardware RAID.</para> |
| <para>At 24 frames per second, a new frame is needed roughly every 42ms, so it is important to keep the disks
busy reading the next frame to be displayed into cache.</para> |
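| <para>The per-frame time budget follows directly from the frame rate. The sketch below computes it and, as an additional illustration, estimates how many frames must already be in cache to ride out a storage stall of a given length; the stall length and both function names are hypothetical.</para> |
| <para><programlisting><![CDATA[
```python
import math


def frame_interval_ms(frames_per_second):
    """Time available to deliver each frame, in milliseconds."""
    return 1000.0 / frames_per_second


def frames_to_cover_stall(stall_ms, frames_per_second):
    """Whole frames that must be cached to keep playing through a stall."""
    return math.ceil(stall_ms * frames_per_second / 1000.0)


if __name__ == "__main__":
    print("interval at 24 fps: %.2f ms" % frame_interval_ms(24))
    print("frames needed for a 100 ms stall:", frames_to_cover_stall(100, 24))
```
]]></programlisting></para> |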
| </section> |
| <section> |
| <title>Traditional Stream Allocation</title> |
| <para>Initially each directory is allocated to a separate AG.</para> |
| <para>Each stream writes to that AG until it is full.</para> |
| <para>Additional allocations then go in the next consecutive AG that has enough free space.</para> |
| <para>Multiple streams will start writing to the same AG, interleaving their files and negating any read-ahead.</para> |
| <mediaobject><imageobject> |
| <imagedata fileref="images/XFS-traditional-stream-allocation.png" /> |
| </imageobject></mediaobject> |
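| <para>The interleaving described above can be reproduced with a toy simulation. The sketch below is a deliberately simplified model, not the XFS allocator: every file occupies one unit of space, each AG holds a fixed number of units, and streams take turns writing one file each. All names and sizes are assumptions made for the example.</para> |
| <para><programlisting><![CDATA[
```python
def simulate_traditional_streams(ag_count, ag_capacity, stream_count, files_per_stream):
    """Return, for each AG, the stream ids of the files it holds in write order."""
    free = [ag_capacity] * ag_count
    current = list(range(stream_count))      # stream i starts in AG i
    layout = [[] for _ in range(ag_count)]

    for _ in range(files_per_stream):
        for stream in range(stream_count):
            ag = current[stream]
            tried = 0
            # Move to the next consecutive AG that still has free space.
            while free[ag] == 0:
                ag = (ag + 1) % ag_count
                tried += 1
                if tried == ag_count:
                    raise RuntimeError("filesystem model is full")
            current[stream] = ag
            free[ag] -= 1
            layout[ag].append(stream)
    return layout


if __name__ == "__main__":
    for ag, files in enumerate(simulate_traditional_streams(4, 4, 2, 6)):
        print("AG", ag, files)
```
]]></programlisting></para> |
| <para>In this model the first two AGs each contain files from a single stream, while the AG that both streams spill into holds their files in alternating order.</para> |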
| |
| </section> |
| <section> |
| <title>RAID performance with interleaved streams</title> |
| <para>With only 1.3% read cache hits, the RAID is reading 545MB/s from disk to return 184MB/s to the client (roughly 200% backend overhead).</para> |
| <para><programlisting> |
| System Performance Statistics |
| All Ports Port 1 Port 2 Port 3 Port 4 |
| Read MB/s: 183.9 45.8 46.8 46.3 45.0 |
| Write MB/s: 0.0 0.0 0.0 0.0 0.0 |
| Total MB/s: 183.9 45.8 46.8 46.3 45.0 |
| |
| Read IO/s: 520 133 129 129 129 |
| Write IO/s: 0 0 0 0 0 |
| Total IO/s: 522 134 131 128 129 |
| |
| Read Hits: 1.3% 1.6% 2.2% 0.6% 0.6% |
| Prefetch Hits: 0.8% 1.1% 1.1% 0.6% 0.6% |
| Prefetches: 46.3% 46.3% 46.0% 46.1% 46.8% |
| Writebacks: 0.0% 0.0% 0.0% 0.0% 0.0% |
| Rebuild MB/s: 0.0 0.0 0.0 0.0 0.0 |
| Verify MB/s: 0.0 0.0 0.0 0.0 0.0 |
| |
| Total Reads Writes Pieces Reads Writes |
| Disk IO/s: 518 518 0 1: 4910 0 |
| Disk MB/s: 544.6 544.6 0.0 2: 29890 0 |
| Disk Pieces: 65710 65710 0 3: 340 0 |
| BDB Pieces: 0 4: 0 0 |
| 5: 0 0 |
| Cache Writeback Data: 0.0% 6: 0 0 |
| Rebuild/Verify Data: 0.0% 0.0% 7: 0 0 |
| Cache Data locked: 0.0% 8: 0 0</programlisting></para> |
| </section> |
| <section> |
| <title>Filestreams Allocator</title> |
| <para>A new allocation algorithm was added to XFS that associates a parent directory with an AG |
| until a preset inactivity timeout elapses.</para> |
| <para>A stream that moves to a new AG will cause that AG to be locked, so other streams looking
for a new AG will not use the same AG.</para> |
| <para>The new algorithm is called the Filestreams allocator and it is enabled in one of two ways:</para> |
| <itemizedlist> |
| <listitem><para>the filesystem is mounted with the -o filestreams option, or</para></listitem> |
| <listitem><para>the filestreams chattr flag is applied to a directory to indicate that all allocations |
| beneath that point in the directory hierarchy should use the filestreams allocator</para></listitem> |
| </itemizedlist> |
| <para>Filestreams will have a negative impact on workloads that continue to grow files
in the same directory, causing more fragmentation than the default allocator.</para> |
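| <para>The association between a parent directory and an AG can be sketched as a small table with an inactivity timeout. The model below is only an illustration of the idea described in this section: the class name, the choice of the lowest-numbered unused AG, and the error raised when every AG is taken are assumptions, and the real allocator's selection logic is more involved.</para> |
| <para><programlisting><![CDATA[
```python
class FilestreamModel:
    """Toy model: each active directory owns one AG until it goes idle."""

    def __init__(self, ag_count, timeout):
        self.ag_count = ag_count
        self.timeout = timeout
        self.streams = {}  # directory -> (ag, time of last allocation)

    def pick_ag(self, directory, now):
        # Drop associations that have been idle for longer than the timeout.
        self.streams = {
            d: (ag, last) for d, (ag, last) in self.streams.items()
            if now - last <= self.timeout
        }
        if directory in self.streams:
            ag = self.streams[directory][0]
        else:
            in_use = {ag for ag, _ in self.streams.values()}
            unused = [ag for ag in range(self.ag_count) if ag not in in_use]
            if not unused:
                # This sketch does not model sharing an AG between streams.
                raise RuntimeError("no unused AG available in this model")
            ag = unused[0]
        self.streams[directory] = (ag, now)
        return ag


if __name__ == "__main__":
    model = FilestreamModel(ag_count=4, timeout=30)
    print(model.pick_ag("/scan/a", now=0))
    print(model.pick_ag("/scan/b", now=1))
    print(model.pick_ag("/scan/a", now=10))
```
]]></programlisting></para> |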
| </section> |
| <section> |
| <title>RAID performance with filestreams</title> |
| <para>Almost all data is now found in the RAID cache, with only 15% backend disk I/O overhead.</para> |
| <para><programlisting> |
| System Performance Statistics |
| All Ports Port 1 Port 2 Port 3 Port 4 |
| Read MB/s: 299.1 74.0 74.7 75.1 75.2 |
| Write MB/s: 0.0 0.0 0.0 0.0 0.0 |
| Total MB/s: 299.1 74.0 74.7 75.1 75.2 |
| |
| Read IO/s: 840 209 210 211 210 |
| Write IO/s: 0 0 0 0 0 |
| Total IO/s: 836 209 210 208 209 |
| |
| Read Hits: 99.5% 98.3% 99.6% 100.0% 100.0% |
| Prefetch Hits: 98.8% 97.6% 98.9% 99.6% 99.0% |
| Prefetches: 42.0% 41.5% 42.0% 42.9% 41.7% |
| Writebacks: 0.0% 0.0% 0.0% 0.0% 0.0% |
| Rebuild MB/s: 0.0 0.0 0.0 0.0 0.0 |
| Verify MB/s: 0.0 0.0 0.0 0.0 0.0 |
| |
| Total Reads Writes Pieces Reads Writes |
| Disk IO/s: 614 614 0 1: 39068 0 |
| Disk MB/s: 345.5 345.5 0.0 2: 111 0 |
| Disk Pieces: 39290 39290 0 3: 0 0 |
| BDB Pieces: 0 4: 0 0 |
| 5: 0 0 |
| Cache Writeback Data: 0.0% 6: 0 0 |
| Rebuild/Verify Data: 0.0% 0.0% 7: 0 0 |
| Cache Data locked: 0.0% 8: 0 0</programlisting></para> |
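| <para>The backend overhead quoted for the two runs can be checked against the Disk MB/s and Read MB/s rows of the statistics listings. The sketch below assumes that overhead means the extra data read from disk relative to the data returned to the client; the function name is illustrative.</para> |
| <para><programlisting><![CDATA[
```python
def backend_overhead(disk_mb_s, client_mb_s):
    """Extra disk traffic as a fraction of the data delivered to the client."""
    return disk_mb_s / client_mb_s - 1.0


# Figures taken from the "Disk MB/s" and "Read MB/s" rows of the two listings.
interleaved = backend_overhead(544.6, 183.9)
filestreams = backend_overhead(345.5, 299.1)

if __name__ == "__main__":
    print("interleaved streams: %.1f%% overhead" % (interleaved * 100))
    print("filestreams:         %.1f%% overhead" % (filestreams * 100))
```
]]></programlisting></para> |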
| </section> |
| <section> |
| <title>Fragmentation</title> |
| <para>Despite the use of extents and the various allocation schemes, XFS files and filesystems
may still become fragmented over time.</para> |
| <para>xfs_db can display the level of fragmentation in the filesystem:</para> |
| <itemizedlist> |
| <listitem><para>xfs_db -r /dev/sda3 |
| <itemizedlist> |
| <listitem><para>frag -f file fragmentation percentage</para></listitem> |
| <listitem><para>frag -d directory fragmentation percentage</para></listitem> |
| <listitem><para>freesp freespace</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Fragmentation Example</title> |
| <para><programlisting> |
> xfs_db -r device
| xfs_db: freesp |
| from to extents blocks pct |
| 1 1 94807 94807 1.36 |
| 2 3 63041 145012 2.08 |
| 4 7 30374 152890 2.19 |
| 8 15 19437 207742 2.98 |
| 16 31 15173 331559 4.76 |
| 32 63 14099 636086 9.13 |
| 64 127 16804 1497220 21.48 |
| 128 255 8390 1470464 21.10 |
| 256 511 3003 1033383 14.83 |
| 512 1023 810 551813 7.92 |
| 1024 2047 258 370811 5.32 |
| 2048 4095 101 282202 4.05 |
| 4096 8191 27 145550 2.09 |
| xfs_db: frag -d |
| actual 45966, ideal 12398, fragmentation factor 73.03% |
| xfs_db: frag -f |
| actual 2104856, ideal 2100484, fragmentation factor 0.21%</programlisting></para> |
| <note><para>The fragmentation factor value can be misleading.</para> |
<para>It is derived from (actual - ideal) / actual, so an average of 5 extents
per file, where 1 would be ideal, will yield 80%. For multi-gigabyte files, 5 extents is
not harmful, and the 80% is not representative of a problem.</para>
| </note> |
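| <para>The percentages in the listing above can be reproduced from the actual and ideal extent counts. The sketch below divides the number of excess extents by the actual extent count, which matches both reported values; the function name is illustrative and this is not xfs_db's own code.</para> |
| <para><programlisting><![CDATA[
```python
def fragmentation_factor(actual, ideal):
    """Percentage of extents that exceed the ideal count."""
    if actual == 0:
        return 0.0
    return (actual - ideal) * 100.0 / actual


if __name__ == "__main__":
    print("frag -d: %.2f%%" % fragmentation_factor(45966, 12398))
    print("frag -f: %.2f%%" % fragmentation_factor(2104856, 2100484))
    print("5 extents where 1 is ideal: %.0f%%" % fragmentation_factor(5, 1))
```
]]></programlisting></para> |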
| </section> |
| <section> |
| <title>xfs_fsr</title> |
| <para>xfs_fsr is a simple defragmentation tool that works as follows:</para> |
| <itemizedlist> |
| <listitem><para>Searches for files that are fragmented</para></listitem> |
| <listitem><para>Creates a temporary inode</para></listitem> |
| <listitem><para>Asks the filesystem to create new extents for the temporary inode</para></listitem> |
| <listitem><para>If the new extents are less fragmented, it copies the data from the original file to the new extents</para></listitem> |
| <listitem><para>The extents of the temporary inode are then swapped with those of the original file, so the original file keeps its inode number</para></listitem> |
| </itemizedlist> |
| <para>xfs_fsr takes no account of the used and free space within an allocation group
and does not rearrange files to create larger contiguous free space.</para> |
| <para>As a result, xfs_fsr may fragment free space over time.</para> |
| </section> |
| </chapter> |