<?xml version='1.0' encoding='utf-8' ?>
<!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [
]>
<chapter id="xfs-allocators">
<title>Allocators</title>
<section>
<title>Allocation Policy</title>
<para>The default allocation behaviour of XFS for new files is to place them in the same allocation group as their parent directory.</para>
<para>Since files and their parent directories are often accessed in close succession, this minimises costly disk seeks.</para>
<para>The allocator will also attempt to place newly created directories in different allocation groups.</para>
<para>Combined, these policies help group directories of files together on disk, even if they're being written to concurrently.</para>
<para>Allocation policies in XFS can change when</para>
<itemizedlist>
<listitem><para>32 bit inode numbers are used on large file systems</para></listitem>
<listitem><para>certain mount options are used</para></listitem>
</itemizedlist>
</section>
<section>
<title>Allocation Policy - Directories</title>
<para>New directories are placed in different AGs where possible.</para>
<para>Watch the inode numbers as directory inodes are created:</para>
<para><programlisting>
> mkdir a b
> ls -li
total 0
131 drwxr-xr-x 2 sjv users 6 2006-10-20 12:12 a
33554624 drwxr-xr-x 2 sjv users 6 2006-10-20 12:12 b</programlisting></para>
</section>
<section>
<title>Allocation Policy - Files</title>
<para>Files are created in the same AG as their parent directory where possible, which is also evident in their inode numbers:</para>
<para><programlisting>
> touch a/1 b/1 a/2 b/2
> ls -1id * */*
131 a
132 a/1
133 a/2
33554624 b
33554625 b/1
33554626 b/2</programlisting></para>
</section>
<section>
<title>Inode Numbers</title>
<para>Every inode on disk has a unique inode number associated with it.</para>
<para>It is a requirement that inode numbers be persistent across unmounts and reboots,
so once an inode is written to disk its inode number is fixed.</para>
<para>For performance reasons it must be possible to quickly find an inode on disk using its inode number.</para>
<para>XFS uses the physical location of the inode on disk to encode the inode number, which makes
finding the inode on disk using the inode number a trivial task.</para>
</section>
<section>
<title>Inode Number Format</title>
<para>An inode's location consists of three distinct parts</para>
<itemizedlist>
<listitem><para>an allocation group number</para></listitem>
<listitem><para>a file system block number within its AG</para></listitem>
<listitem><para>an inode number inside that file system block</para></listitem>
</itemizedlist>
<para>The number of bits required to store each of these values varies with the filesystem geometry.</para>
<para>Larger filesystems can easily require more than 32 bits; keeping inode numbers within 32 bits then limits inode
allocation to a region at the start of the volume.</para>
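<para>As a sketch of this encoding, the AG number can be recovered by shifting the inode number right
by the sum of the superblock's agblklog and inopblog fields. The field widths below (agblklog of 21,
inopblog of 4) are assumed values that happen to be consistent with the inode numbers in the earlier
examples:</para>
<para><programlisting>
> # assumed geometry: shift = agblklog + inopblog = 21 + 4 = 25
> echo $(( 33554624 / 2**25 ))        # allocation group number
1
> echo $(( 33554624 % 2**25 / 2**4 )) # block number within the AG
12
> echo $(( 33554624 % 2**4 ))         # inode within that block
0</programlisting></para>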
</section>
<section>
<title>Inode Number Size</title>
<para>File systems aren't free to use inode numbers of arbitrary size.</para>
<para>Operating system interfaces and legacy software products often mandate the use of 32
bit inode numbers even on systems that support 64 bit inode numbers.</para>
<para>This can be a problem on large file systems, since 32 bit inode numbers only provide
enough bits to encode inode locations in the first 1TB of a volume when 256 byte inodes
are used, or up to 8TB in the case of 2kB inodes.</para>
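<para>The 1TB and 8TB figures follow from simple arithmetic: 2^32 distinct inode numbers, each
addressing a location at inode-size granularity:</para>
<para><programlisting>
> echo $(( 2**32 * 256 / 2**40 ))   # 256 byte inodes: addressable TB
1
> echo $(( 2**32 * 2048 / 2**40 ))  # 2kB inodes: addressable TB
8</programlisting></para>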
<para>For best performance, a file system needs to keep a file's data blocks close to its
inode to minimise seeks when performing I/O. XFS's ability to do this suffers on large
volumes when 32 bit inode numbers are used.</para>
</section>
<section>
<title>32bit and 64bit Inodes</title>
<para>By default, XFS will use 32 bit inode numbers.</para>
<para>If the system supports it, the -o inode64 option can be passed to mount to allow 64 bit inode numbers.</para>
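<para>A minimal example, with a placeholder device and mount point:</para>
<para><programlisting>
# mount -o inode64 /dev/sdb1 /filesystem</programlisting></para>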
<para>Once an inode has been written somewhere on the disk that requires a 64 bit inode number, the
file system can no longer be used with 32 bit inode numbers.</para>
<itemizedlist>
<listitem><para>The inode64 mount option should not be removed once used</para></listitem>
</itemizedlist>
<para>(IRIX can move inodes back to 32 bit numbers with <command>xfs_reno</command>; this tool has not
yet been ported to Linux.)</para>
</section>
<section>
<title>32bit and 64bit Inodes</title>
<para>Inode numbers are stored in big endian format on disk, and host endian format in-core.</para>
<para>Applications that pass 64 bit inode numbers using 32 bit variables will truncate the
32 most-significant bits.</para>
<para>Since XFS stores the AG number an inode belongs to in the most significant bits, a result
of this truncation can be an inode number that points to an inode in a lower AG by mistake.</para>
<para>Using that inode number will result in either a lookup on the incorrect inode, or the
referencing of an area on disk that doesn't contain inodes at all.</para>
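<para>The effect can be sketched with shell arithmetic; the 64 bit inode number below is a made-up
value whose AG bits lie above bit 31:</para>
<para><programlisting>
> echo $(( 2**40 + 131 ))           # a 64 bit inode number in a high AG
1099511627907
> echo $(( (2**40 + 131) % 2**32 )) # the same number truncated to 32 bits
131</programlisting></para>
<para>The truncated value aliases inode 131 in AG 0, producing exactly the misdirected lookup
described above.</para>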
</section>
<section>
<title>32 bit Inodes on &gt;1TB Filesystems</title>
<para>When 32 bit inode numbers are used on a volume larger than 1TB in size, several changes occur.</para>
<para>A 100TB volume using 256 byte inodes mounted in the default inode32 mode has just
one percent of its space available for allocating inodes.</para>
<para>XFS will reserve the first 1TB of disk space exclusively for inodes, ensuring that file data
allocations cannot make the imbalance any worse than this.</para>
<para>It is no longer possible for file data to reside in the same AG as the parent directory's inode.</para>
<para>XFS will instead "rotor" through the upper AGs as it allocates space for files, putting
each file in a new AG to evenly spread the I/O load.</para>
</section>
<section>
<title>Rotor Step</title>
<para>The performance of some workloads will suffer from the placement of each file in a different
AG, so the "rotor step" sysctl was added to adjust this behaviour.</para>
<para>For example, to keep at least a second of ingested 24fps video files in the same
AG before moving to the next AG:</para>
<para><programlisting>
# sysctl fs.xfs.rotorstep
fs.xfs.rotorstep = 1
# sysctl -w fs.xfs.rotorstep=24
fs.xfs.rotorstep = 24</programlisting></para>
<note><para>The rotorstep value is a global one, so setting it will affect the behaviour of
all mounted file systems over 1TB in size that use 32 bit inode numbers.</para></note>
</section>
<section>
<title>Realtime Allocator</title>
<para>Certain classes of applications require deterministic latencies on file allocation operations</para>
<itemizedlist>
<listitem><para>The performance of the standard XFS allocator varies depending on the
internal data structures used to manage the filesystem content</para></listitem>
</itemizedlist>
<para>The realtime allocator uses a bitmap algorithm that gives consistent allocation
latencies regardless of the filesystem's contents.</para>
<para>By using the realtime allocator in conjunction with an external log volume, it's possible
to remove most of the unpredictability in disk response times that's caused by metadata overheads.</para>
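<para>A sketch of such a setup, with placeholder device names; the rtinherit flag is set on the
mount point with xfs_io so that files created below it are allocated on the realtime device:</para>
<para><programlisting>
# mkfs.xfs -r rtdev=/dev/sdc -l logdev=/dev/sdd /dev/sdb
# mount -o rtdev=/dev/sdc,logdev=/dev/sdd /dev/sdb /rt
# xfs_io -c "chattr +t" /rt</programlisting></para>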
<note><para>The realtime allocator is only available when XFS is built with
CONFIG_XFS_RT enabled.</para></note>
</section>
<section>
<title>Realtime Allocator Limitations</title>
<para>In practice, the realtime allocator is not widely used.</para>
<para>It effectively uses a single large allocation group with a single set of data structures,
losing the parallelism of XFS's allocation groups.</para>
<itemizedlist>
<listitem><para>The locks associated with this central data structure result in the serialisation of
concurrent operations to a realtime device</para></listitem>
</itemizedlist>
<para>The realtime allocator is incapable of maintaining spatial separation on disk for concurrent operations.</para>
<itemizedlist>
<listitem><para>It tries to start new files at random points in the bitmap to reduce this
problem, but this has a negative impact on some workloads</para></listitem>
</itemizedlist>
</section>
<section>
<title>Traditional allocator impact on some workloads</title>
<para>A certain class of applications will write many large files to a directory in sequence.</para>
<para>A film scanner ingesting video may write each video frame to a separate file. To play back
the video frames in realtime, it's important that these files are contiguous on disk for optimal
read-ahead performance by the hardware RAID.</para>
<para>At 24 frames per second, a new frame is needed roughly every 40ms, so it is important to keep the disks
busy reading the next frame to be displayed into the cache.</para>
</section>
<section>
<title>Traditional Stream Allocation</title>
<para>Initially, each directory is allocated to a separate AG.</para>
<para>Each stream writes to that AG until it is full.</para>
<para>Additional allocations then go to the next consecutive AG that has enough free space.</para>
<para>Multiple streams will start writing to the same AG, interleaving their files and negating any read-ahead.</para>
<mediaobject><imageobject>
<imagedata fileref="images/XFS-traditional-stream-allocation.png" />
</imageobject></mediaobject>
</section>
<section>
<title>RAID performance with interleaved streams</title>
<para>With only 1.3% read cache hits, the RAID is reading 545MB/s from disk to return 184MB/s to the client (a 200% backend overhead).</para>
<para><programlisting>
System Performance Statistics
All Ports Port 1 Port 2 Port 3 Port 4
Read MB/s: 183.9 45.8 46.8 46.3 45.0
Write MB/s: 0.0 0.0 0.0 0.0 0.0
Total MB/s: 183.9 45.8 46.8 46.3 45.0
Read IO/s: 520 133 129 129 129
Write IO/s: 0 0 0 0 0
Total IO/s: 522 134 131 128 129
Read Hits: 1.3% 1.6% 2.2% 0.6% 0.6%
Prefetch Hits: 0.8% 1.1% 1.1% 0.6% 0.6%
Prefetches: 46.3% 46.3% 46.0% 46.1% 46.8%
Writebacks: 0.0% 0.0% 0.0% 0.0% 0.0%
Rebuild MB/s: 0.0 0.0 0.0 0.0 0.0
Verify MB/s: 0.0 0.0 0.0 0.0 0.0
Total Reads Writes Pieces Reads Writes
Disk IO/s: 518 518 0 1: 4910 0
Disk MB/s: 544.6 544.6 0.0 2: 29890 0
Disk Pieces: 65710 65710 0 3: 340 0
BDB Pieces: 0 4: 0 0
5: 0 0
Cache Writeback Data: 0.0% 6: 0 0
Rebuild/Verify Data: 0.0% 0.0% 7: 0 0
Cache Data locked: 0.0% 8: 0 0</programlisting></para>
</section>
<section>
<title>Filestreams Allocator</title>
<para>A new allocation algorithm was added to XFS that associates a parent directory with an AG
until a preset inactivity timeout elapses.</para>
<para>A stream that moves to a new AG will cause that AG to be locked, so other streams looking
for a new AG will not use the same AG.</para>
<para>The new algorithm is called the Filestreams allocator and it is enabled in one of two ways:</para>
<itemizedlist>
<listitem><para>the filesystem is mounted with the -o filestreams option, or</para></listitem>
<listitem><para>the filestreams chattr flag is applied to a directory to indicate that all allocations
beneath that point in the directory hierarchy should use the filestreams allocator</para></listitem>
</itemizedlist>
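<para>For example, with placeholder names; the first command enables filestreams for the whole
filesystem at mount time, the second applies the flag to a single directory tree via xfs_io:</para>
<para><programlisting>
# mount -o filestreams /dev/sdb /ingest
# xfs_io -c "chattr +f" /ingest/capture1</programlisting></para>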
<para>Filestreams will have a negative impact on workloads that continue to grow files
in the same directory, causing more fragmentation than the default allocator.</para>
</section>
<section>
<title>RAID performance with filestreams</title>
<para>Almost all data is now found in the RAID cache, leaving only 15% backend disk I/O overhead.</para>
<para><programlisting>
System Performance Statistics
All Ports Port 1 Port 2 Port 3 Port 4
Read MB/s: 299.1 74.0 74.7 75.1 75.2
Write MB/s: 0.0 0.0 0.0 0.0 0.0
Total MB/s: 299.1 74.0 74.7 75.1 75.2
Read IO/s: 840 209 210 211 210
Write IO/s: 0 0 0 0 0
Total IO/s: 836 209 210 208 209
Read Hits: 99.5% 98.3% 99.6% 100.0% 100.0%
Prefetch Hits: 98.8% 97.6% 98.9% 99.6% 99.0%
Prefetches: 42.0% 41.5% 42.0% 42.9% 41.7%
Writebacks: 0.0% 0.0% 0.0% 0.0% 0.0%
Rebuild MB/s: 0.0 0.0 0.0 0.0 0.0
Verify MB/s: 0.0 0.0 0.0 0.0 0.0
Total Reads Writes Pieces Reads Writes
Disk IO/s: 614 614 0 1: 39068 0
Disk MB/s: 345.5 345.5 0.0 2: 111 0
Disk Pieces: 39290 39290 0 3: 0 0
BDB Pieces: 0 4: 0 0
5: 0 0
Cache Writeback Data: 0.0% 6: 0 0
Rebuild/Verify Data: 0.0% 0.0% 7: 0 0
Cache Data locked: 0.0% 8: 0 0</programlisting></para>
</section>
<section>
<title>Fragmentation</title>
<para>Despite the use of extents and the various allocation schemes, XFS files and filesystems
may still become fragmented over time.</para>
<para>xfs_db can display the level of fragmentation in a filesystem:</para>
<itemizedlist>
<listitem><para><command>xfs_db -r /dev/sda3</command>
<itemizedlist>
<listitem><para><command>frag -f</command>: file fragmentation percentage</para></listitem>
<listitem><para><command>frag -d</command>: directory fragmentation percentage</para></listitem>
<listitem><para><command>freesp</command>: free space</para></listitem>
</itemizedlist>
</para></listitem>
</itemizedlist>
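<para>The same commands can be run non-interactively via the -c option:</para>
<para><programlisting>
# xfs_db -r -c "frag -f" /dev/sda3</programlisting></para>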
</section>
<section>
<title>Fragmentation Example</title>
<para><programlisting>
> xfs_db -r device
xfs_db> freesp
from to extents blocks pct
1 1 94807 94807 1.36
2 3 63041 145012 2.08
4 7 30374 152890 2.19
8 15 19437 207742 2.98
16 31 15173 331559 4.76
32 63 14099 636086 9.13
64 127 16804 1497220 21.48
128 255 8390 1470464 21.10
256 511 3003 1033383 14.83
512 1023 810 551813 7.92
1024 2047 258 370811 5.32
2048 4095 101 282202 4.05
4096 8191 27 145550 2.09
xfs_db> frag -d
actual 45966, ideal 12398, fragmentation factor 73.03%
xfs_db> frag -f
actual 2104856, ideal 2100484, fragmentation factor 0.21%</programlisting></para>
<note><para>The fragmentation factor value can be misleading.</para>
<para>It is derived from (actual - ideal) / actual, so an average of 5 extents
per file (against an ideal of 1) will yield 80%. For multi-gigabyte files, 5 extents is
not harmful, and the 80% figure does not represent a problem.</para>
</note>
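<para>Checking the formula against the directory figures above reproduces the reported factor:</para>
<para><programlisting>
> awk 'BEGIN { printf "%.2f%%\n", (45966 - 12398) / 45966 * 100 }'
73.03%</programlisting></para>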
</section>
<section>
<title>xfs_fsr</title>
<para>xfs_fsr is a simple defragmentation tool that</para>
<itemizedlist>
<listitem><para>Searches for files that are fragmented</para></listitem>
<listitem><para>Creates a temporary inode</para></listitem>
<listitem><para>Asks the filesystem to allocate new extents for the temporary inode</para></listitem>
<listitem><para>If the new extents are less fragmented, copies the data from the original file into them</para></listitem>
<listitem><para>Renames the temporary inode to replace the original file</para></listitem>
</itemizedlist>
<para>xfs_fsr takes no account of the used and free space within the file's allocation group
and does not rearrange files to create larger contiguous free space.</para>
<para>As a result, it may fragment free space over time.</para>
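<para>A typical invocation, with a placeholder mount point (-v enables verbose output):</para>
<para><programlisting>
# xfs_fsr -v /filesystem</programlisting></para>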
</section>
</chapter>