| <?xml version='1.0' encoding='utf-8' ?> |
| <!DOCTYPE chapter PUBLIC "-//OASIS//DTD DocBook XML V4.4//EN" "http://www.oasis-open.org/docbook/xml/4.4/docbookx.dtd" [ |
| ]> |
| <chapter id="xfs-internals"> |
| <title>XFS Internals</title> |
| <section> |
| <title>XFS Internals</title> |
| <para>Understand some of the unique features in XFS</para> |
| <para>Why XFS differs from other Linux filesystems</para> |
| <para>How these differences are implemented</para> |
| </section> |
| <section> |
| <title>xfs_ioctl</title> |
| <para>XFS specific system calls (xfsctl()) are dispatched by xfs_ioctl()</para> |
| <itemizedlist> |
| <listitem><para>fs/xfs/xfs_fs.h</para></listitem> |
| <listitem><para>fs/xfs/linux-2.6/xfs_ioctl.c</para></listitem> |
| </itemizedlist> |
| <para>Can be exercised with xfs_io</para> |
| <para>geometry, fscounts, [get|set]resblks, shutdown, freeze/thaw</para> |
| <itemizedlist> |
| <listitem><para>filesystem level manipulation</para></listitem> |
| </itemizedlist> |
| <para>grow[fs|fslog|fsrt]</para> |
| <itemizedlist> |
| <listitem><para>filesystem size (and maximum inode count) expansion</para></listitem> |
| </itemizedlist> |
| <para>[get|set]xflags, fs[get|set]xattr, fs[get|set]xattra, dioinfo</para> |
| <itemizedlist> |
| <listitem><para>inode attribute information</para></listitem> |
| <listitem><para>direct I/O parameters (min/max/align)</para></listitem> |
| </itemizedlist> |
| <para>allocsp, freesp, resvsp, unresvsp</para> |
| <itemizedlist> |
| <listitem><para>space allocation and/or preallocation</para></listitem> |
| </itemizedlist> |
| <para>bulkstat</para> |
| <itemizedlist> |
| <listitem><para>attributes of many (sequential) inodes, as returned by stat(2)</para></listitem> |
| </itemizedlist> |
| <para>xfsdump, quotacheck, dmapi</para> |
| <itemizedlist> |
| <listitem><para>by-handle (open, fd-to-, path-to-, readlink, attrlist, attrmulti, ...)</para></listitem> |
| <listitem><para>manipulating inodes by “handles” (inum/igen/fsid)</para></listitem> |
| </itemizedlist> |
| <para>getbmap, getbmapa, swapext</para> |
| <itemizedlist> |
| <listitem><para>inode data/attr fork extent information</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>xfs_ioctl - Miscellaneous</title> |
| <para>geometry</para> |
| <itemizedlist> |
| <listitem><para>Data displayed by xfs_info</para></listitem> |
| </itemizedlist> |
| <para>fscounts</para> |
| <itemizedlist> |
| <listitem><para>XFS specific stat information</para></listitem> |
| </itemizedlist> |
| <para>resblks</para> |
| <itemizedlist> |
| <listitem><para>allows dmapi to set aside some disk space</para></listitem> |
| <listitem><para>threads can be marked to say they can use it if about to run out of space</para></listitem> |
| </itemizedlist> |
| <para>grow</para> |
| <itemizedlist> |
| <listitem><para>upward only</para></listitem> |
| <listitem><para>can't grow the log, not implemented in the kernel</para></listitem> |
| <listitem><para>also can be used to change the maximum space for inodes</para></listitem> |
| </itemizedlist> |
| <para>dioinfo reports direct I/O parameters</para> |
| <itemizedlist> |
| <listitem><para>max direct I/O size is huge, no real limit</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>xfs_ioctl - Attribute Flags</title> |
| <para>xflags</para> |
| <itemizedlist> |
| <listitem><para>inode flags, see definition of struct fsxattr</para></listitem> |
| </itemizedlist> |
| <para>xattr</para> |
| <itemizedlist> |
| <listitem><para>original IRIX version for inode flags</para></listitem> |
| <listitem><para>project id</para></listitem> |
| <listitem><para>extent size hint</para></listitem> |
| <listitem><para>how many extents are allocated on the data fork</para></listitem> |
| </itemizedlist> |
| <para>xattra passes out the same structure as xattr, but applies to the attribute rather than data fork.</para> |
| </section> |
| <section> |
| <title>xfs_ioctl - Space Allocation</title> |
| <para>allocsp/freesp</para> |
| <itemizedlist> |
| <listitem><para>allocate space</para></listitem> |
| <listitem><para>zeros</para></listitem> |
| <listitem><para>updates the inode size</para></listitem> |
| </itemizedlist> |
| <para>resvsp/unresvsp is for allocating but not zeroing</para> |
| <itemizedlist> |
| <listitem><para>efficient way for applications to take advantage of unwritten extents</para></listitem> |
| </itemizedlist> |
| <para>posix fallocate has no generic interface to call into the kernel</para> |
| <itemizedlist> |
| <listitem><para>current implementation just writes zeros from user space</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>xfs_ioctl - Bulkstat</title> |
| <para>bulkstat returns multiple inodes</para> |
| <itemizedlist> |
| <listitem><para>scans entire filesystem, no way to start at a particular directory |
| <itemizedlist> |
| <listitem><para>hence xfsdump cannot dump a directory</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| <listitem><para>flags to pass in to use incore or ondisk inode clusters</para></listitem> |
| </itemizedlist> |
| <para>byhandle interfaces</para> |
| <itemizedlist> |
| <listitem><para>handles usually obtained from bulkstat</para></listitem> |
| <listitem><para>a handle is the combination of inum, igen and fsid</para></listitem> |
| <listitem><para>see DMAPI for more information on byhandle interfaces</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>XFS sysctls - Daemons</title> |
| <para>fs.xfs.xfssyncd_centisecs (Min: 100 Default: 3000 Max: 720000)</para> |
| <itemizedlist> |
| <listitem><para>The interval at which the xfssyncd thread flushes metadata out to disk. This thread will flush log activity out, and do some processing on unlinked inodes.</para></listitem> |
| </itemizedlist> |
| <para>fs.xfs.xfsbufd_centisecs (Min: 50 Default: 100 Max: 3000)</para> |
| <itemizedlist> |
| <listitem><para>The interval at which xfsbufd scans the dirty metadata buffers list.</para></listitem> |
| </itemizedlist> |
| <para>fs.xfs.age_buffer_centisecs (Min: 100 Default: 1500 Max: 720000)</para> |
| <itemizedlist> |
| <listitem><para>The age at which xfsbufd flushes dirty metadata buffers to disk.</para></listitem> |
| </itemizedlist> |
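| <para>The tunables above are expressed in centiseconds (hundredths of a second), so the default of 3000 for xfssyncd_centisecs corresponds to 30 seconds. The following is a minimal sketch of that unit conversion and of checking a candidate value against the quoted minimum and maximum. The helper names are hypothetical and are not taken from the XFS source.</para> |
| <para><programlisting><![CDATA[ |
```c
#include <stdint.h>

/* Hypothetical helpers for illustration only; not part of XFS. */

/* Convert a value in centiseconds (1/100 s) to milliseconds. */
static inline uint64_t centisecs_to_msecs(uint64_t centisecs)
{
    return centisecs * 10;
}

/* Return 1 if value lies within the inclusive [min, max] range, else 0. */
static inline int tunable_in_range(uint64_t value, uint64_t min, uint64_t max)
{
    return value >= min && value <= max;
}
```
| ]]></programlisting></para> |
| <para>For example, centisecs_to_msecs(100) gives 1000, so the default xfsbufd_centisecs setting is a one second interval, and tunable_in_range(40, 50, 3000) is false because 40 is below the minimum quoted for that tunable.</para> |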
| </section> |
| <section> |
| <title>XFS sysctls - Debug</title> |
| <para>fs.xfs.error_level (Min: 0 Default: 3 Max: 11)</para> |
| <itemizedlist> |
| <listitem><para>A volume knob for error reporting when internal errors occur.</para></listitem> |
| <listitem><para>This will generate detailed messages and backtraces for filesystem shutdowns, for example.</para></listitem> |
| <listitem><para>Current threshold values are:</para> |
| <para><programlisting> |
| XFS_ERRLEVEL_OFF: 0 |
| XFS_ERRLEVEL_LOW: 1 |
| XFS_ERRLEVEL_HIGH: 5</programlisting></para> |
| </listitem> |
| </itemizedlist> |
| <para>fs.xfs.panic_mask (Min: 0 Default: 0 Max: 127)</para> |
| <itemizedlist> |
| <listitem><para>Causes certain error conditions to call BUG(). The value is a bitmask.</para></listitem> |
| <listitem><para>OR together the tags that represent the errors which should cause panics:</para> |
| <para><programlisting> |
| XFS_NO_PTAG 0 |
| XFS_PTAG_IFLUSH 0x00000001 |
| XFS_PTAG_LOGRES 0x00000002 |
| XFS_PTAG_AILDELETE 0x00000004 |
| XFS_PTAG_ERROR_REPORT 0x00000008 |
| XFS_PTAG_SHUTDOWN_CORRUPT 0x00000010 |
| XFS_PTAG_SHUTDOWN_IOERROR 0x00000020 |
| XFS_PTAG_SHUTDOWN_LOGERROR 0x00000040</programlisting></para> |
| </listitem> |
| </itemizedlist> |
| <para>This option is intended for debugging only.</para> |
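| <para>Because panic_mask is a bitmask, selecting several error conditions means combining their tag values with a bitwise OR. The sketch below shows that arithmetic using the tag values from the listing above. The two helper functions are hypothetical and exist only for this illustration.</para> |
| <para><programlisting><![CDATA[ |
```c
#include <stdint.h>

/* Tag values copied from the panic_mask listing. */
#define XFS_PTAG_IFLUSH             0x00000001u
#define XFS_PTAG_LOGRES             0x00000002u
#define XFS_PTAG_AILDELETE          0x00000004u
#define XFS_PTAG_ERROR_REPORT       0x00000008u
#define XFS_PTAG_SHUTDOWN_CORRUPT   0x00000010u
#define XFS_PTAG_SHUTDOWN_IOERROR   0x00000020u
#define XFS_PTAG_SHUTDOWN_LOGERROR  0x00000040u

/* Hypothetical helper: a mask that selects the three shutdown tags. */
static inline uint32_t panic_mask_shutdowns(void)
{
    return XFS_PTAG_SHUTDOWN_CORRUPT |
           XFS_PTAG_SHUTDOWN_IOERROR |
           XFS_PTAG_SHUTDOWN_LOGERROR;
}

/* Hypothetical helper: return 1 if the given tag is selected in mask. */
static inline int panic_mask_has(uint32_t mask, uint32_t tag)
{
    return (mask & tag) != 0;
}
```
| ]]></programlisting></para> |
| <para>Here panic_mask_shutdowns() evaluates to 0x70, which is 112 in decimal, so the setting could be applied with something like sysctl -w fs.xfs.panic_mask=112. Selecting all seven tags gives 127, which matches the documented maximum.</para> |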
| </section> |
| <section> |
| <title>XFS sysctls - Compatibility</title> |
| <para>fs.xfs.irix_symlink_mode (Min: 0 Default: 0 Max: 1)</para> |
| <itemizedlist> |
| <listitem><para>Controls whether symlinks are created with mode 0777 (default) or whether their mode is affected by the umask (irix mode).</para></listitem> |
| </itemizedlist> |
| <para>fs.xfs.irix_sgid_inherit (Min: 0 Default: 0 Max: 1)</para> |
| <itemizedlist> |
| <listitem><para>Controls files created in SGID directories</para></listitem> |
| <listitem><para>If the group ID of the new file does not match the effective group ID or one of the supplementary group IDs of the parent dir, the ISGID bit is cleared if the irix_sgid_inherit compatibility sysctl is set. </para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>XFS sysctls - Attribute Inheritance</title> |
| <para>fs.xfs.inherit_sync (Min: 0 Default: 1 Max: 1)</para> |
| <itemizedlist> |
| <listitem><para>Setting this to "1" will cause the "sync" flag set by the xfs_io(8) chattr command on a directory to be inherited by files in that directory.</para></listitem> |
| </itemizedlist> |
| <para>fs.xfs.inherit_nodump (Min: 0 Default: 1 Max: 1)</para> |
| <itemizedlist> |
| <listitem><para>Setting this to "1" will cause the "nodump" flag set by the xfs_io(8) chattr command on a directory to be inherited by files in that directory.</para></listitem> |
| </itemizedlist> |
| <para>fs.xfs.inherit_noatime (Min: 0 Default: 1 Max: 1)</para> |
| <itemizedlist> |
| <listitem><para>Setting this to "1" will cause the "noatime" flag set by the xfs_io(8) chattr command on a directory to be inherited by files in that directory.</para></listitem> |
| </itemizedlist> |
| <para>fs.xfs.inherit_nosymlinks (Min: 0 Default: 1 Max: 1)</para> |
| <itemizedlist> |
| <listitem><para>Setting this to "1" will cause the "nosymlinks" flag set by the xfs_io(8) chattr command on a directory to be inherited by files in that directory.</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>XFS sysctls - Misc</title> |
| <para>fs.xfs.stats_clear (Min: 0 Default: 0 Max: 1)</para> |
| <itemizedlist> |
| <listitem><para>Setting this to "1" clears accumulated XFS statistics in /proc/fs/xfs/stat. It then immediately resets to "0".</para></listitem> |
| </itemizedlist> |
| <para>fs.xfs.rotorstep (Min: 1 Default: 1 Max: 256)</para> |
| <itemizedlist> |
| <listitem><para>In "inode32" allocation mode, this option determines how many files the allocator attempts to allocate in the same allocation group before moving to the next allocation group. The intent is to control the rate at which the allocator moves between allocation groups when allocating extents for new files.</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Generic sysctls</title> |
| <para>dentry-state</para> |
| <itemizedlist> |
| <listitem><para>Number of directory entries</para></listitem> |
| <listitem><para>Number of unused entries</para></listitem> |
| <listitem><para>Age in seconds after which unused entries may be reclaimed when short on memory</para></listitem> |
| </itemizedlist> |
| <para>file-max</para> |
| <itemizedlist> |
| <listitem><para>Maximum number of files system wide</para></listitem> |
| </itemizedlist> |
| <para>file-nr</para> |
| <itemizedlist> |
| <listitem><para>Number of files allocated</para></listitem> |
| <listitem><para>Number of files in use</para></listitem> |
| <listitem><para>Max number of files system wide</para></listitem> |
| </itemizedlist> |
| <para>inode-state</para> |
| <itemizedlist> |
| <listitem><para>Number of active inodes</para></listitem> |
| <listitem><para>Number of free inode entries</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Mount Path</title> |
| <para>Mounting an XFS filesystem has several steps</para> |
| <itemizedlist> |
| <listitem><para>xfs_fs_fill_super in xfs_super.c:</para></listitem> |
| </itemizedlist> |
| <orderedlist> |
| <listitem><para>Allocate a xfs_mount structure</para></listitem> |
| <listitem><para>Parse mount options (xfs_parseargs)</para></listitem> |
| <listitem><para>Open up the log and realtime device |
| <itemizedlist> |
| <listitem><para>xfs_alloc_buftarg starts a kernel thread for delayed write buffers for each device</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| <listitem><para>Call xfs_readsb to read the super block</para></listitem> |
| <listitem><para>Call xfs_finish_flags to reconcile the mount options with what was in the super block</para></listitem> |
| <listitem><para>Tell the buffers what the sector size is going to be with xfs_setsize_buftarg</para></listitem> |
| <listitem><para>Check to see if this device supports write barriers</para></listitem> |
| <listitem><para>Then we call xfs_mountfs to complete the mount</para></listitem> |
| <listitem><para>Finally we read the root inode and start the sync daemon |
| (xfssyncd)</para></listitem> |
| </orderedlist> |
| </section> |
| <section> |
| <title>Mount - xfs_mountfs</title> |
| <para>Calls xfs_mount_common, which sets up additional mount_t fields from the superblock</para> |
| <para>Check that the end of the filesystem really does exist on each device</para> |
| <para>Initialise various data structures</para> |
| <itemizedlist> |
| <listitem><para>inode and stripe alignment</para></listitem> |
| <listitem><para>allocate and initialise inode tree for this filesystem</para></listitem> |
| </itemizedlist> |
| <para>Set up the transactions and log, and do log recovery</para> |
| <para>Call out to the quota manager to do the quota check</para> |
| </section> |
| <section> |
| <title>Transactions</title> |
| <para>Transactions are used to record metadata changes to the filesystem</para> |
| <para>Each transaction is an atomic change to the filesystem</para> |
| <para>Only in the core of XFS, not in the higher layers of XFS</para> |
| <para>After reserving space for a transaction, it is very difficult to cancel the transaction</para> |
| <itemizedlist> |
| <listitem><para>Calling routines will try to pull inodes incore and set up xfs_inode_t and dquots beforehand, to avoid problems from here on</para></listitem> |
| <listitem><para>They must have references on any objects to be attached to the transaction</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Creating Transactions</title> |
| <para>A transaction is typically coded as</para> |
| <para>Create the incore structure for the transaction</para> |
| <para><programlisting>tp = xfs_trans_alloc(type);</programlisting></para> |
| <para>Reserve space for the transaction, quick lookup in mount_t structure</para> |
| <itemizedlist> |
| <listitem><para>can return ENOSPC</para></listitem> |
| <listitem><para>reserved space can be large to cater for large ondisk structure changes</para></listitem> |
| <listitem><para>exceeding the reservation will cause XFS to dump a lot of diagnostics before shutting down</para></listitem> |
| </itemizedlist> |
| <para><programlisting>error = xfs_trans_reserve(tp, data, log, rt, ...);</programlisting></para> |
| <para>Now make changes, allocate space, free space, etc.</para> |
| <para>Attach superblock/inode(s)/buffers etc, log ranges within these objects, typically via</para> |
| <para><programlisting>xfs_trans_log_inode(tp, ip, XFS_ILOG_CORE);</programlisting></para> |
| <para>Commit the transaction, copying data from the attached objects</para> |
| <para><programlisting>error = xfs_trans_commit(tp);</programlisting></para> |
| </section> |
| <section> |
| <title>In Core Logs</title> |
| <para>There are normally 8 in-core log buffers (iclogs)</para> |
| <itemizedlist> |
| <listitem><para>depends on memory of system</para></listitem> |
| <listitem><para>can be set by a mount option.</para></listitem> |
| </itemizedlist> |
| <para>An in-core log is written when</para> |
| <itemizedlist> |
| <listitem><para>it fills up</para></listitem> |
| <listitem><para>a synchronous operation arrives (most transactions are asynchronous)</para></listitem> |
| <listitem><para>a sync is issued, which forces the in-core buffer to be written</para></listitem> |
| </itemizedlist> |
| <para>When XFS receives an I/O completion, it can unpin the first metadata buffers</para> |
| <itemizedlist> |
| <listitem><para>Once unpinned they can be written to disk</para></listitem> |
| </itemizedlist> |
| <para>The active item list (AIL) is used to prevent metadata buffers from being written multiple times if they are in multiple transactions.</para> |
| <para>Therefore, transactions and metadata buffers have a lifecycle.</para> |
| </section> |
| <section> |
| <title>Log Sequence Numbers</title> |
| <para>An LSN is the log sequence number</para> |
| <itemizedlist> |
| <listitem><para>a 64-bit value made up of two 32-bit values</para></listitem> |
| <listitem><para>the first is the cycle number</para></listitem> |
| <listitem><para>the second is the block number</para></listitem> |
| </itemizedlist> |
| <para>The block number is assigned when it is committed</para> |
| <para>The cycle number is incremented each time we have cycled through the log</para> |
| <para>The metadata in the log is only the metadata that has changed, in theory, but the whole inode tends to be logged.</para> |
| <para>One flaw of XFS logging is that every change is logged: even if the same field is changed again later, the original change is still logged.</para> |
| <itemizedlist> |
| <listitem><para>The XFS holy grail is to avoid duplicates in the log</para></listitem> |
| </itemizedlist> |
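| <para>The layout described above can be sketched in a few lines of C. This is an illustration only: the function names are hypothetical, and it assumes the cycle number occupies the upper 32 bits and the block number the lower 32 bits.</para> |
| <para><programlisting><![CDATA[ |
```c
#include <stdint.h>

/* Hypothetical helpers; assume cycle = upper 32 bits, block = lower 32 bits. */

/* Combine a cycle number and a block number into one 64-bit LSN. */
static inline uint64_t lsn_pack(uint32_t cycle, uint32_t block)
{
    return ((uint64_t)cycle << 32) | block;
}

/* Extract the cycle number (upper half). */
static inline uint32_t lsn_cycle(uint64_t lsn)
{
    return (uint32_t)(lsn >> 32);
}

/* Extract the block number (lower half). */
static inline uint32_t lsn_block(uint64_t lsn)
{
    return (uint32_t)(lsn & 0xffffffffu);
}
```
| ]]></programlisting></para> |
| <para>Under this assumption, comparing two LSNs as plain 64-bit integers orders them by cycle first and by block second, so any LSN from a later pass through the log compares greater than every LSN from an earlier pass.</para> |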
| </section> |
| <section> |
| <title>XFS Log Diagram</title> |
| <para>XXX insert pretty picture here</para> |
| </section> |
| <section> |
| <title>xfs_bmapi</title> |
| <para>Many routines within XFS have to allocate ondisk space</para> |
| <itemizedlist> |
| <listitem><para>metadata</para></listitem> |
| <listitem><para>inodes</para></listitem> |
| <listitem><para>extents</para></listitem> |
| </itemizedlist> |
| <para>This is all done through xfs_bmapi</para> |
| <itemizedlist> |
| <listitem><para>access extent map for reading</para></listitem> |
| <listitem><para>setup delayed allocation</para></listitem> |
| <listitem><para>perform actual allocation</para></listitem> |
| <listitem><para>convert unwritten extents to written extents</para></listitem> |
| </itemizedlist> |
| <para>If you trace calls through XFS, xfs_bmapi is at the centre</para> |
| <para>In write mode you are usually in a transaction</para> |
| <itemizedlist> |
| <listitem><para>except when doing delayed allocation, transaction pointer will be null</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>xfs_bmap_alloc</title> |
| <para>xfs_bmap_alloc is the switch between the different allocators</para> |
| <itemizedlist> |
| <listitem><para>normal (xfs_bmap_btalloc)</para></listitem> |
| <listitem><para>realtime (xfs_bmap_rtalloc)</para></listitem> |
| <listitem><para>filestreams</para></listitem> |
| </itemizedlist> |
| <para>Realtime uses two bitmaps</para> |
| <itemizedlist> |
| <listitem><para>one for free space</para></listitem> |
| <listitem><para>one for larger clusters of freespace</para></listitem> |
| </itemizedlist> |
| <para>Quota accounting is also done in bmap routines</para> |
| <para>Inodes have two pointers to dquots</para> |
| <itemizedlist> |
| <listitem><para>one for the user</para></listitem> |
| <listitem><para>one for the group or project</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Memory Allocation</title> |
| <para>IRIX was very good at ensuring memory allocations succeeded</para> |
| <itemizedlist> |
| <listitem><para>XFS was written on IRIX</para></listitem> |
| </itemizedlist> |
| <para>Linux is not as good with memory allocations, so this has been the source of many XFS problems</para> |
| <para><programlisting> |
| # head -2 /proc/slabinfo; grep -i xfs /proc/slabinfo |
| # slabtop</programlisting></para> |
| </section> |
| <section> |
| <title>Memory allocations for transactions</title> |
| <para>If a thread is in a transaction, the memory allocation code will behave differently to try to ensure this critical thread does not sleep</para> |
| <itemizedlist> |
| <listitem><para>This flag is a generic linux change</para></listitem> |
| </itemizedlist> |
| <para>In some cases you could ask for many MBs of contiguous space</para> |
| <itemizedlist> |
| <listitem><para>Linux does not like this.</para></listitem> |
| </itemizedlist> |
| <para>You could have a whole lot of dirty page cache pages but in order to flush those out you have to call back into XFS which can allocate memory.</para> |
| <para>Recent changes to XFS mean it is much better at managing memory allocation, which helps to avoid these issues</para> |
| <itemizedlist> |
| <listitem><para>in 2.6.17, and should be included in SLES10 SP1</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Metadata Buffering</title> |
| <para>xfs_buf.{h,c} implements xfs_buf_t, the XFS metadata buffer cache</para> |
| <itemizedlist> |
| <listitem><para>Multi-page buffers</para></listitem> |
| <listitem><para>Buffer “pinning” |
| <itemizedlist> |
| <listitem><para>Prevent them from being written even if “dirty”</para></listitem> |
| <listitem><para>Transaction log must be written first</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| <listitem><para>Several “private” buffer pointers</para></listitem> |
| <listitem><para>Locking, iodone semaphore for I/O waiters</para></listitem> |
| <listitem><para>Callbacks for: iodone, relse, pre-write</para></listitem> |
| </itemizedlist> |
| <para>In-core log buffers also implemented via xfs_buf_t</para> |
| <para>This causes some oddities since they use</para> |
| <itemizedlist> |
| <listitem><para>sub-buffer-sized I/Os</para></listitem> |
| <listitem><para>non-page-cache buffers</para></listitem> |
| </itemizedlist> |
| <para>Separate address space from bdev</para> |
| <itemizedlist> |
| <listitem><para>Prevents user space buffering from impacting metadata buffering</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Metadata I/O Completion</title> |
| <para>xfslogd/N</para> |
| <itemizedlist> |
| <listitem><para>per-CPU daemon</para></listitem> |
| <listitem><para>Threads that handle I/O completion work for iclog buffers |
| <itemizedlist> |
| <listitem><para>xlog_state_do_callbacks</para></listitem> |
| <listitem><para>multiple completions to unpin metadata buffers waiting for this transaction</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| <listitem><para>and also metadata |
| <itemizedlist> |
| <listitem><para>xfs_buf_do_callbacks</para></listitem> |
| <listitem><para>typically, removing from AIL and freeing up buffer_item memory</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| </itemizedlist> |
| <para>xfsdatad/N</para> |
| <itemizedlist> |
| <listitem><para>similar idea for data path</para></listitem> |
| <listitem><para>very different between 2.4 and 2.6</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>Delayed write buffers</title> |
| <para>xfsbufd</para> |
| <itemizedlist> |
| <listitem><para>kernel thread, one per filesystem device</para></listitem> |
| <listitem><para>buffers are time stamped when queued</para></listitem> |
| <listitem><para>xfsbufd walks the xfs_buftarg_t (“buffer target”) hash table finding delayed write buffers |
| <itemizedlist> |
| <listitem><para>default is every 5 seconds look for buffers more than 30 seconds old</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| <listitem><para>metadata and log data specific, file data is handled differently</para></listitem> |
| </itemizedlist> |
| <para>Tweak the age at which unpinned and dirty metadata buffers will be considered for flushing</para> |
| <itemizedlist> |
| <listitem><para>/proc/sys/fs/xfs/age_buffer_centisecs</para></listitem> |
| </itemizedlist> |
| <para>Tunable daemon wakeup interval</para> |
| <itemizedlist> |
| <listitem><para>/proc/sys/fs/xfs/xfsbufd_centisecs</para></listitem> |
| </itemizedlist> |
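| <para>The ageing rule described above, where a queued buffer is only considered for writing once it is older than a configured age, can be modelled as below. This is a hypothetical user-space model with made-up names, not the xfsbufd implementation; all times are in centiseconds.</para> |
| <para><programlisting><![CDATA[ |
```c
#include <stdint.h>

/*
 * Hypothetical model of the ageing check: return 1 if a buffer queued at
 * time queued_at is more than age_limit old at time now, else 0.
 * All three values are in centiseconds.
 */
static inline int delwri_buffer_is_due(uint64_t queued_at, uint64_t now,
                                       uint64_t age_limit)
{
    if (now < queued_at)
        return 0;   /* clock went backwards: treat the buffer as fresh */
    return (now - queued_at) > age_limit;
}
```
| ]]></programlisting></para> |
| <para>With an age limit of 3000 centiseconds (30 seconds), a buffer queued at time 0 is not yet due at time 3000 but is due at time 3001. A scanning thread in this model would wake at the configured interval and write out only the buffers for which the check returns true.</para> |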
| </section> |
| <section> |
| <title>I/O Path</title> |
| <para>Has two parts</para> |
| <itemizedlist> |
| <listitem><para>code that handles read and write calls |
| <itemizedlist> |
| <listitem><para>both buffered and direct I/O</para></listitem> |
| <listitem><para>xfs_lrw.c</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| <listitem><para>code that actually writes something out</para></listitem> |
| </itemizedlist> |
| <para>Uses Linux get_block_t interfaces and struct buffer_head</para> |
| <para>XFS code is different to other filesystems because of</para> |
| <itemizedlist> |
| <listitem><para>different locking</para></listitem> |
| <listitem><para>delayed allocation</para></listitem> |
| <listitem><para>dmapi</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>I/O Path - Locking</title> |
| <para>There are two locks within the linux inode</para> |
| <itemizedlist> |
| <listitem><para>the most interesting is the i_mutex</para></listitem> |
| <listitem><para>the second, innermost lock is not used by XFS</para></listitem> |
| </itemizedlist> |
| <para>XFS will hold the i_mutex for the entire buffered write call</para> |
| <para>This is not the case for direct I/O</para> |
| <itemizedlist> |
| <listitem><para>ext3/reiserfs will hold the i_mutex for the whole direct I/O which will serialise their direct I/O path.</para></listitem> |
| </itemizedlist> |
| <para>i_mutex is often taken outside of xfs as well</para> |
| <para>XFS also has the iolock and ilock mrlocks on the xfs inode</para> |
| <para>Must be taken in this order</para> |
| <itemizedlist> |
| <listitem><para>i_mutex</para></listitem> |
| <listitem><para>iolock</para></listitem> |
| <listitem><para>ilock</para></listitem> |
| </itemizedlist> |
| </section> |
| <section> |
| <title>I/O Path - DMAPI Integration</title> |
| <para>DMAPI requires extra hooks for DMAPI events on reads and writes to a file</para> |
| <itemizedlist> |
| <listitem><para>This code includes lots of branches and is not coded to be enabled at compile time</para></listitem> |
| </itemizedlist> |
| <para>DMAPI also needs to perform invisible I/O to remove and replace the file data without changing the inode access/change/modification times</para> |
| </section> |
| <section> |
| <title>I/O Path - Delayed Allocation</title> |
| <para>Initial write reserves space only</para> |
| <itemizedlist> |
| <listitem><para>Must ensure that when write to disk occurs there is space available</para></listitem> |
| </itemizedlist> |
| <para>Allocation of real extents occurs when data is actually being written</para> |
| <itemizedlist> |
| <listitem><para>Data can be coalesced into much larger I/Os to disk</para></listitem> |
| <listitem><para>Allows allocator to allocate much larger extents than just individual I/O request sizes from the application</para></listitem> |
| </itemizedlist> |
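| <para>The two phases described above, reserving space at write time and allocating real extents at writeout, can be illustrated with a toy accounting model. Everything in this sketch is hypothetical (the structure, its fields and both functions); it only mirrors the bookkeeping idea and does not reflect the actual XFS data structures.</para> |
| <para><programlisting><![CDATA[ |
```c
#include <stdint.h>

/* Hypothetical toy model of delayed allocation accounting. */
struct toy_fs {
    uint64_t free_blocks;      /* blocks neither reserved nor allocated */
    uint64_t reserved_blocks;  /* delayed allocation reservations */
    uint64_t allocated_blocks; /* blocks backed by real extents */
    uint64_t extents;          /* number of real extents created */
};

/* Buffered write: reserve space only. Returns 0, or -1 if out of space. */
static inline int toy_buffered_write(struct toy_fs *fs, uint64_t nblocks)
{
    if (nblocks > fs->free_blocks)
        return -1;
    fs->free_blocks -= nblocks;
    fs->reserved_blocks += nblocks;
    return 0;
}

/*
 * Writeout: convert all outstanding reservations into a single extent.
 * Returns the length of the extent created, or 0 if nothing was pending.
 */
static inline uint64_t toy_writeout(struct toy_fs *fs)
{
    uint64_t len = fs->reserved_blocks;

    if (len == 0)
        return 0;
    fs->reserved_blocks = 0;
    fs->allocated_blocks += len;
    fs->extents += 1;
    return len;
}
```
| ]]></programlisting></para> |
| <para>In this model, three separate four-block writes followed by one writeout produce a single twelve-block extent rather than three four-block extents. Because each write has already taken its blocks out of the free count, the later allocation cannot run out of space.</para> |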
| </section> |
| <section> |
| <title>sync(2)</title> |
| <para>XFS implements an optimization to sync(2) of metadata:</para> |
| <itemizedlist> |
| <listitem><para>XFS will only force the log out, such that any dirty metadata that is incore is written to the log <emphasis>only</emphasis>, the metadata itself is <emphasis>not</emphasis> necessarily written</para></listitem> |
| <listitem><para>This is safe, since every change is on disk in the log</para></listitem> |
| <listitem><para>File data is guaranteed too (even barriers)</para></listitem> |
| </itemizedlist> |
| <para>Log and metadata are written to disk for</para> |
| <itemizedlist> |
| <listitem><para>freeze/thaw</para></listitem> |
| <listitem><para>remount ro</para></listitem> |
| <listitem><para>unmount</para></listitem> |
| </itemizedlist> |
| <para>Applications like grub have been bitten by this in the past, but it is fixed nowadays</para> |
| </section> |
| <section> |
| <title>Data writeout</title> |
| <para>Triggered by the VM subsystem calling into XFS</para> |
| <itemizedlist> |
| <listitem><para>xfs_aops.c |
| <itemizedlist> |
| <listitem><para>xfs_vm_writepage(s)</para></listitem> |
| <listitem><para>xfs_page_state_convert</para></listitem> |
| </itemizedlist> |
| </para></listitem> |
| </itemizedlist> |
| <para>Page cache pages attached to inodes via a radix-tree (2.6)</para> |
| <itemizedlist> |
| <listitem><para>inode->i_mapping</para></listitem> |
| <listitem><para>page->mapping</para></listitem> |
| </itemizedlist> |
| <para>XFS does its own writeout, sort of</para> |
| <itemizedlist> |
| <listitem><para>due to delayed allocation and unwritten extents</para></listitem> |
| <listitem><para>extent conversion requires a transaction</para></listitem> |
| </itemizedlist> |
| <para>Unlike writepage, the read path is mostly the generic Linux implementation</para> |
| </section> |
| <section> |
| <title>XFS writepage</title> |
| <para>Rewritten in 2.6</para> |
| <para>XFS tries very hard to cluster pages</para> |
| <itemizedlist> |
| <listitem><para>xfs_add_to_ioend</para></listitem> |
| <listitem><para>xfs_cluster_write</para></listitem> |
| </itemizedlist> |
| <para>This allows XFS to do a much smaller number of extent conversions</para> |
| <itemizedlist> |
| <listitem><para>rather than converting for every buffer head</para></listitem> |
| </itemizedlist> |
| <para>Uses struct bio for writing more than 1 page at a time</para> |
| </section> |
| </chapter> |
| |