| [[Journaling_Log]] |
| = Journaling Log |
| |
| [NOTE] |
| Only v2 log format is covered here. |
| |
| The XFS journal exists on disk as a reserved extent of blocks within the |
| filesystem, or as a separate journal device. The journal itself can be thought |
| of as a series of log records; each log record contains a part of or a whole |
| transaction. A transaction consists of a series of log operation headers |
| (``log items''), formatting structures, and raw data. The first operation in a |
| transaction establishes the transaction ID and the last operation is a commit |
| record. The operations recorded between the start and commit operations |
| represent the metadata changes made by the transaction. If the commit |
| operation is missing, the transaction is incomplete and cannot be recovered. |
| |
| [[Log_Records]] |
| == Log Records |
| |
| The XFS log is split into a series of log records. Log records seem to |
| correspond to an in-core log buffer, which can be up to 256KiB in size. Each |
| record has a log sequence number, which is the same LSN recorded in the v5 |
| metadata integrity fields. |
| |
| Log sequence numbers are a 64-bit quantity consisting of two 32-bit quantities. |
| The upper 32 bits are the ``cycle number'', which increments every time XFS |
| cycles through the log. The lower 32 bits are the ``block number'', which is |
| assigned when a transaction is committed, and should correspond to the block |
| offset within the log. |
| |
| A log record begins with the following header, which occupies 512 bytes on |
| disk: |
| |
| [source, c] |
| ---- |
| typedef struct xlog_rec_header { |
| __be32 h_magicno; |
| __be32 h_cycle; |
| __be32 h_version; |
| __be32 h_len; |
| __be64 h_lsn; |
| __be64 h_tail_lsn; |
| __le32 h_crc; |
| __be32 h_prev_block; |
| __be32 h_num_logops; |
| __be32 h_cycle_data[XLOG_HEADER_CYCLE_SIZE / BBSIZE]; |
| /* new fields */ |
| __be32 h_fmt; |
| uuid_t h_fs_uuid; |
| __be32 h_size; |
| } xlog_rec_header_t; |
| ---- |
| |
| *h_magicno*:: |
| The magic number of log records, 0xfeedbabe. |
| |
| *h_cycle*:: |
| Cycle number of this log record. |
| |
| *h_version*:: |
| Log record version, currently 2. |
| |
| *h_len*:: |
| Length of the log record, in bytes. Must be aligned to a 64-bit boundary. |
| |
| *h_lsn*:: |
| Log sequence number of this record. |
| |
| *h_tail_lsn*:: |
| Log sequence number of the first log record with uncommitted buffers. |
| |
| *h_crc*:: |
| Checksum of the log record header, the cycle data, and the log records |
| themselves. |
| |
| *h_prev_block*:: |
| Block number of the previous log record. |
| |
| *h_num_logops*:: |
| The number of log operations in this record. |
| |
| *h_cycle_data*:: |
| The first u32 of each log sector must contain the cycle number. Since log |
| item buffers are formatted without regard to this requirement, the original |
| contents of the first four bytes of each sector in the log are copied into the |
| corresponding element of this array. After that, the first four bytes of those |
| sectors are stamped with the cycle number. This process is reversed at |
| recovery time. If there are more sectors in this log record than there are |
| slots in this array, the cycle data continues for as many sectors are needed; |
| each sector is formatted as type +xlog_rec_ext_header+. |
| |
| *h_fmt*:: |
| Format of the log record. This is one of the following values: |
| |
| .Log record formats |
| [options="header"] |
| |===== |
| | Format value | Log format |
| | +XLOG_FMT_UNKNOWN+ | Unknown. Perhaps this log is corrupt. |
| | +XLOG_FMT_LINUX_LE+ | Little-endian Linux. |
| | +XLOG_FMT_LINUX_BE+ | Big-endian Linux. |
| | +XLOG_FMT_IRIX_BE+ | Big-endian Irix. |
| |===== |
| |
| *h_fs_uuid*:: |
| Filesystem UUID. |
| |
| *h_size*:: |
| In-core log record size. This is somewhere between 16 and 256KiB, with 32KiB |
| being the default. |
| |
| As mentioned earlier, if this log record is longer than 256 sectors, the cycle |
| data overflows into the next sector(s) in the log. Each of those sectors is |
| formatted as follows: |
| |
| [source, c] |
| ---- |
| typedef struct xlog_rec_ext_header { |
| __be32 xh_cycle; |
| __be32 xh_cycle_data[XLOG_HEADER_CYCLE_SIZE / BBSIZE]; |
| } xlog_rec_ext_header_t; |
| ---- |
| |
| *xh_cycle*:: |
| Cycle number of this log record. Should match +h_cycle+. |
| |
| *xh_cycle_data*:: |
| Overflow cycle data. |
| |
| [[Log_Operations]] |
| == Log Operations |
| |
| Within a log record, log operations are recorded as a series consisting of an |
| operation header immediately followed by a data region. The operation header |
| has the following format: |
| |
| [source, c] |
| ---- |
| typedef struct xlog_op_header { |
| __be32 oh_tid; |
| __be32 oh_len; |
| __u8 oh_clientid; |
| __u8 oh_flags; |
| __u16 oh_res2; |
| } xlog_op_header_t; |
| ---- |
| |
| *oh_tid*:: |
| Transaction ID of this operation. |
| |
| *oh_len*:: |
| Number of bytes in the data region. |
| |
| *oh_clientid*:: |
| The originator of this operation. This can be one of the following: |
| |
| .Log Operation Client ID |
| [options="header"] |
| |===== |
| | Client ID | Originator |
| | +XFS_TRANSACTION+ | Operation came from a transaction. |
| | +XFS_VOLUME+ | ??? |
| | +XFS_LOG+ | ??? |
| |===== |
| |
| *oh_flags*:: |
| Specifies flags associated with this operation. This can be a combination of |
| the following values (though most likely only one will be set at a time): |
| |
| .Log Operation Flags |
| [options="header"] |
| |===== |
| | Flag | Description |
| | +XLOG_START_TRANS+ | Start a new transaction. The next operation header should describe a transaction header. |
| | +XLOG_COMMIT_TRANS+ | Commit this transaction. |
| | +XLOG_CONTINUE_TRANS+ | Continue this trans into new log record. |
| | +XLOG_WAS_CONT_TRANS+ | This transaction started in a previous log record. |
| | +XLOG_END_TRANS+ | End of a continued transaction. |
| | +XLOG_UNMOUNT_TRANS+ | Transaction to unmount a filesystem. |
| |===== |
| |
| *oh_res2*:: |
| Padding. |
| |
| The data region follows immediately after the operation header and is exactly |
| +oh_len+ bytes long. These payloads are in host-endian order, which means that |
| one cannot replay the log from an unclean XFS filesystem on a system with a |
| different byte order. |
| |
| [[Log_Items]] |
| == Log Items |
| |
| Following are the types of log item payloads that can follow an |
| +xlog_op_header+. Except for buffer data and inode cores, all log items have a |
| magic number to distinguish themselves. Buffer data items only appear after |
| +xfs_buf_log_format+ items; and inode core items only appear after |
| +xfs_inode_log_format+ items. |
| |
| .Log Operation Magic Numbers |
| [options="header"] |
| |===== |
| | Magic | Hexadecimal | Operation Type |
| | +XFS_TRANS_HEADER_MAGIC+ | 0x5452414e | xref:Log_Transaction_Headers[Log Transaction Header] |
| | +XFS_LI_EFI+ | 0x1236 | xref:EFI_Log_Item[Extent Freeing Intent] |
| | +XFS_LI_EFD+ | 0x1237 | xref:EFD_Log_Item[Extent Freeing Done] |
| | +XFS_LI_IUNLINK+ | 0x1238 | Unknown? |
| | +XFS_LI_INODE+ | 0x123b | xref:Inode_Log_Item[Inode Updates] |
| | +XFS_LI_BUF+ | 0x123c | xref:Buffer_Log_Item[Buffer Writes] |
| | +XFS_LI_DQUOT+ | 0x123d | xref:Quota_Update_Log_Item[Update Quota] |
| | +XFS_LI_QUOTAOFF+ | 0x123e | xref:Quota_Off_Log_Item[Quota Off] |
| | +XFS_LI_ICREATE+ | 0x123f | xref:Inode_Create_Log_Item[Inode Creation] |
| | +XFS_LI_RUI+ | 0x1240 | xref:RUI_Log_Item[Reverse Mapping Update Intent] |
| | +XFS_LI_RUD+ | 0x1241 | xref:RUD_Log_Item[Reverse Mapping Update Done] |
| | +XFS_LI_CUI+ | 0x1242 | xref:CUI_Log_Item[Reference Count Update Intent] |
| | +XFS_LI_CUD+ | 0x1243 | xref:CUD_Log_Item[Reference Count Update Done] |
| | +XFS_LI_BUI+ | 0x1244 | xref:BUI_Log_Item[File Block Mapping Update Intent] |
| | +XFS_LI_BUD+ | 0x1245 | xref:BUD_Log_Item[File Block Mapping Update Done] |
| |===== |
| |
| Note that all log items (except for transaction headers) MUST start with |
| the following header structure. The type and size fields are baked into |
| each log item header, but there is not a separately defined header. |
| |
| [source, c] |
| ---- |
| struct xfs_log_item { |
| __uint16_t magic; |
| __uint16_t size; |
| }; |
| ---- |
| |
| [[Log_Transaction_Headers]] |
| === Transaction Headers |
| |
| A transaction header is an operation payload that starts a transaction. |
| |
| [source, c] |
| ---- |
| typedef struct xfs_trans_header { |
| uint th_magic; |
| uint th_type; |
| __int32_t th_tid; |
| uint th_num_items; |
| } xfs_trans_header_t; |
| ---- |
| |
| *th_magic*:: |
| The signature of a transaction header, ``TRAN'' (0x5452414e). Note that this |
| value is in host-endian order, not big-endian like the rest of XFS. |
| |
| *th_type*:: |
| Transaction type. This is one of the following values: |
| |
| [options="header"] |
| |===== |
| | Type | Description |
| | +XFS_TRANS_SETATTR_NOT_SIZE+ | Set an inode attribute that isn't the inode's size. |
| | +XFS_TRANS_SETATTR_SIZE+ | Setting the size attribute of an inode. |
| | +XFS_TRANS_INACTIVE+ | Freeing blocks from an unlinked inode. |
| | +XFS_TRANS_CREATE+ | Create a file. |
| | +XFS_TRANS_CREATE_TRUNC+ | Unused? |
| | +XFS_TRANS_TRUNCATE_FILE+ | Truncate a quota file. |
| | +XFS_TRANS_REMOVE+ | Remove a file. |
| | +XFS_TRANS_LINK+ | Link an inode into a directory. |
| | +XFS_TRANS_RENAME+ | Rename a path. |
| | +XFS_TRANS_MKDIR+ | Create a directory. |
| | +XFS_TRANS_RMDIR+ | Remove a directory. |
| | +XFS_TRANS_SYMLINK+ | Create a symbolic link. |
| | +XFS_TRANS_SET_DMATTRS+ | Set the DMAPI attributes of an inode. |
| | +XFS_TRANS_GROWFS+ | Expand the filesystem. |
| | +XFS_TRANS_STRAT_WRITE+ | Convert an unwritten extent or delayed-allocate some blocks to handle a write. |
| | +XFS_TRANS_DIOSTRAT+ | Allocate some blocks to handle a direct I/O write. |
| | +XFS_TRANS_WRITEID+ | Update an inode's preallocation flag. |
| | +XFS_TRANS_ADDAFORK+ | Add an attribute fork to an inode. |
| | +XFS_TRANS_ATTRINVAL+ | Erase the attribute fork of an inode. |
| | +XFS_TRANS_ATRUNCATE+ | Unused? |
| | +XFS_TRANS_ATTR_SET+ | Set an extended attribute. |
| | +XFS_TRANS_ATTR_RM+ | Remove an extended attribute. |
| | +XFS_TRANS_ATTR_FLAG+ | Unused? |
| | +XFS_TRANS_CLEAR_AGI_BUCKET+ | Clear a bad inode pointer in the AGI unlinked inode hash bucket. |
| | +XFS_TRANS_SB_CHANGE+ | Write the superblock to disk. |
| | +XFS_TRANS_QM_QUOTAOFF+ | Start disabling quotas. |
| | +XFS_TRANS_QM_DQALLOC+ | Allocate a disk quota structure. |
| | +XFS_TRANS_QM_SETQLIM+ | Adjust quota limits. |
| | +XFS_TRANS_QM_DQCLUSTER+ | Unused? |
| | +XFS_TRANS_QM_QINOCREATE+ | Create a (quota) inode with reference taken. |
| | +XFS_TRANS_QM_QUOTAOFF_END+ | Finish disabling quotas. |
| | +XFS_TRANS_FSYNC_TS+ | Update only inode timestamps. |
| | +XFS_TRANS_GROWFSRT_ALLOC+ | Grow the realtime bitmap and summary data for growfs. |
| | +XFS_TRANS_GROWFSRT_ZERO+ | Zero space in the realtime bitmap and summary data. |
| | +XFS_TRANS_GROWFSRT_FREE+ | Free space in the realtime bitmap and summary data. |
| | +XFS_TRANS_SWAPEXT+ | Swap data fork of two inodes. |
| | +XFS_TRANS_CHECKPOINT+ | Checkpoint the log. |
| | +XFS_TRANS_ICREATE+ | Unknown? |
| | +XFS_TRANS_CREATE_TMPFILE+ | Create a temporary file. |
| |===== |
| |
| *th_tid*:: |
| Transaction ID. |
| |
| *th_num_items*:: |
| The number of operations appearing after this operation, not including the |
| commit operation. In effect, this tracks the number of metadata change |
| operations in this transaction. |
| |
| [[EFI_Log_Item]] |
| === Intent to Free an Extent |
| |
| The next two operation types work together to handle the freeing of filesystem |
| blocks. Naturally, the ranges of blocks to be freed can be expressed in terms |
| of extents: |
| |
| [source, c] |
| ---- |
| typedef struct xfs_extent_32 { |
| __uint64_t ext_start; |
| __uint32_t ext_len; |
| } __attribute__((packed)) xfs_extent_32_t; |
| |
| typedef struct xfs_extent_64 { |
| __uint64_t ext_start; |
| __uint32_t ext_len; |
| __uint32_t ext_pad; |
| } xfs_extent_64_t; |
| ---- |
| |
| *ext_start*:: |
| Start block of this extent. |
| |
| *ext_len*:: |
| Length of this extent. |
| |
| The ``extent freeing intent'' operation comes first; it tells the log that XFS |
| wants to free some extents. This record is crucial for correct log recovery |
| because it prevents the log from replaying blocks that are subsequently freed. |
| If the log lacks a corresponding ``extent freeing done'' operation, the |
| recovery process will free the extents. |
| |
| [source, c] |
| ---- |
| typedef struct xfs_efi_log_format { |
| __uint16_t efi_type; |
| __uint16_t efi_size; |
| __uint32_t efi_nextents; |
| __uint64_t efi_id; |
| xfs_extent_t efi_extents[1]; |
| } xfs_efi_log_format_t; |
| ---- |
| |
| *efi_type*:: |
| The signature of an EFI operation, 0x1236. This value is in host-endian order, |
| not big-endian like the rest of XFS. |
| |
| *efi_size*:: |
| Size of this log item. Should be 1. |
| |
| *efi_nextents*:: |
| Number of extents to free. |
| |
| *efi_id*:: |
| A 64-bit number that binds the corresponding EFD log item to this EFI log item. |
| |
| *efi_extents*:: |
| Variable-length array of extents to be freed. The array length is given by |
| +efi_nextents+. The record type will be either +xfs_extent_64_t+ or |
| +xfs_extent_32_t+; this can be determined from the log item size (+oh_len+) and |
| the number of extents (+efi_nextents+). |
| |
| [[EFD_Log_Item]] |
| === Completion of Intent to Free an Extent |
| |
| The ``extent freeing done'' operation complements the ``extent freeing intent'' |
| operation. This second operation indicates that the block freeing actually |
| happened, so that log recovery needn't try to free the blocks. Typically, the |
| operations to update the free space B+trees follow immediately after the EFD. |
| |
| [source, c] |
| ---- |
| typedef struct xfs_efd_log_format { |
| __uint16_t efd_type; |
| __uint16_t efd_size; |
| __uint32_t efd_nextents; |
| __uint64_t efd_efi_id; |
| xfs_extent_t efd_extents[1]; |
| } xfs_efd_log_format_t; |
| ---- |
| |
| *efd_type*:: |
| The signature of an EFD operation, 0x1237. This value is in host-endian order, |
| not big-endian like the rest of XFS. |
| |
| *efd_size*:: |
| Size of this log item. Should be 1. |
| |
| *efd_nextents*:: |
| Number of extents to free. |
| |
| *efd_id*:: |
| A 64-bit number that binds the corresponding EFI log item to this EFD log item. |
| |
| *efd_extents*:: |
| Variable-length array of extents to be freed. The array length is given by |
| +efd_nextents+. The record type will be either +xfs_extent_64_t+ or |
| +xfs_extent_32_t+; this can be determined from the log item size (+oh_len+) and |
| the number of extents (+efd_nextents+). |
| |
| [[RUI_Log_Item]] |
| === Reverse Mapping Updates Intent |
| |
| The next two operation types work together to handle deferred reverse mapping |
| updates. Naturally, the mappings to be updated can be expressed in terms of |
| mapping extents: |
| |
| [source, c] |
| ---- |
| struct xfs_map_extent { |
| __uint64_t me_owner; |
| __uint64_t me_startblock; |
| __uint64_t me_startoff; |
| __uint32_t me_len; |
| __uint32_t me_flags; |
| }; |
| ---- |
| |
| *me_owner*:: |
| Owner of this reverse mapping. See the values in the section about |
| xref:Reverse_Mapping_Btree[reverse mapping] for more information. |
| |
| *me_startblock*:: |
| Filesystem block of this mapping. |
| |
| *me_startoff*:: |
| Logical block offset of this mapping. |
| |
| *me_len*:: |
| The length of this mapping. |
| |
| *me_flags*:: |
| The lower byte of this field is a type code indicating what sort of |
| reverse mapping operation we want. The upper three bytes are flag bits. |
| |
| .Reverse mapping update log intent types |
| [options="header"] |
| |===== |
| | Value | Description |
| | +XFS_RMAP_EXTENT_MAP+ | Add a reverse mapping for file data. |
| | +XFS_RMAP_EXTENT_MAP_SHARED+ | Add a reverse mapping for file data for a file with shared blocks. |
| | +XFS_RMAP_EXTENT_UNMAP+ | Remove a reverse mapping for file data. |
| | +XFS_RMAP_EXTENT_UNMAP_SHARED+ | Remove a reverse mapping for file data for a file with shared blocks. |
| | +XFS_RMAP_EXTENT_CONVERT+ | Convert a reverse mapping for file data between unwritten and normal. |
| | +XFS_RMAP_EXTENT_CONVERT_SHARED+ | Convert a reverse mapping for file data between unwritten and normal for a file with shared blocks. |
| | +XFS_RMAP_EXTENT_ALLOC+ | Add a reverse mapping for non-file data. |
| | +XFS_RMAP_EXTENT_FREE+ | Remove a reverse mapping for non-file data. |
| |===== |
| |
| .Reverse mapping update log intent flags |
| [options="header"] |
| |===== |
| | Value | Description |
| | +XFS_RMAP_EXTENT_ATTR_FORK+ | Extent is for the attribute fork. |
| | +XFS_RMAP_EXTENT_BMBT_BLOCK+ | Extent is for a block mapping btree block. |
| | +XFS_RMAP_EXTENT_UNWRITTEN+ | Extent is unwritten. |
| |===== |
| |
| The ``rmap update intent'' operation comes first; it tells the log that XFS |
| wants to update some reverse mappings. This record is crucial for correct log |
| recovery because it enables us to spread a complex metadata update across |
| multiple transactions while ensuring that a crash midway through the complex |
| update will be replayed fully during log recovery. |
| |
| [source, c] |
| ---- |
| struct xfs_rui_log_format { |
| __uint16_t rui_type; |
| __uint16_t rui_size; |
| __uint32_t rui_nextents; |
| __uint64_t rui_id; |
| struct xfs_map_extent rui_extents[1]; |
| }; |
| ---- |
| |
| *rui_type*:: |
| The signature of an RUI operation, 0x1240. This value is in host-endian order, |
| not big-endian like the rest of XFS. |
| |
| *rui_size*:: |
| Size of this log item. Should be 1. |
| |
| *rui_nextents*:: |
| Number of reverse mappings. |
| |
| *rui_id*:: |
| A 64-bit number that binds the corresponding RUD log item to this RUI log item. |
| |
| *rui_extents*:: |
| Variable-length array of reverse mappings to update. |
| |
| [[RUD_Log_Item]] |
| === Completion of Reverse Mapping Updates |
| |
| The ``reverse mapping update done'' operation complements the ``reverse mapping |
| update intent'' operation. This second operation indicates that the update |
| actually happened, so that log recovery needn't replay the update. The RUD and |
| the actual updates are typically found in a new transaction following the |
| transaction in which the RUI was logged. |
| |
| [source, c] |
| ---- |
| struct xfs_rud_log_format { |
| __uint16_t rud_type; |
| __uint16_t rud_size; |
| __uint32_t __pad; |
| __uint64_t rud_rui_id; |
| }; |
| ---- |
| |
| *rud_type*:: |
| The signature of an RUD operation, 0x1241. This value is in host-endian order, |
| not big-endian like the rest of XFS. |
| |
| *rud_size*:: |
| Size of this log item. Should be 1. |
| |
| *rud_rui_id*:: |
| A 64-bit number that binds the corresponding RUI log item to this RUD log item. |
| |
| [[CUI_Log_Item]] |
| === Reference Count Updates Intent |
| |
| The next two operation types work together to handle reference count updates. |
| Naturally, the ranges of extents having reference count updates can be |
| expressed in terms of physical extents: |
| |
| [source, c] |
| ---- |
| struct xfs_phys_extent { |
| __uint64_t pe_startblock; |
| __uint32_t pe_len; |
| __uint32_t pe_flags; |
| }; |
| ---- |
| |
| *pe_startblock*:: |
| Filesystem block of this extent. |
| |
| *pe_len*:: |
| The length of this extent. |
| |
| *pe_flags*:: |
| The lower byte of this field is a type code indicating what sort of |
| reverse mapping operation we want. The upper three bytes are flag bits. |
| |
| .Reference count update log intent types |
| [options="header"] |
| |===== |
| | Value | Description |
| | +XFS_REFCOUNT_EXTENT_INCREASE+ | Increase the reference count for this extent. |
| | +XFS_REFCOUNT_EXTENT_DECREASE+ | Decrease the reference count for this extent. |
| | +XFS_REFCOUNT_EXTENT_ALLOC_COW+ | Reserve an extent for staging copy on write. |
| | +XFS_REFCOUNT_EXTENT_FREE_COW+ | Unreserve an extent for staging copy on write. |
| |===== |
| |
| The ``reference count update intent'' operation comes first; it tells the log |
| that XFS wants to update some reference counts. This record is crucial for |
| correct log recovery because it enables us to spread a complex metadata update |
| across multiple transactions while ensuring that a crash midway through the |
| complex update will be replayed fully during log recovery. |
| |
| [source, c] |
| ---- |
| struct xfs_cui_log_format { |
| __uint16_t cui_type; |
| __uint16_t cui_size; |
| __uint32_t cui_nextents; |
| __uint64_t cui_id; |
| struct xfs_map_extent cui_extents[1]; |
| }; |
| ---- |
| |
| *cui_type*:: |
| The signature of an CUI operation, 0x1242. This value is in host-endian order, |
| not big-endian like the rest of XFS. |
| |
| *cui_size*:: |
| Size of this log item. Should be 1. |
| |
| *cui_nextents*:: |
| Number of reference count updates. |
| |
| *cui_id*:: |
| A 64-bit number that binds the corresponding RUD log item to this RUI log item. |
| |
| *cui_extents*:: |
| Variable-length array of reference count update information. |
| |
| [[CUD_Log_Item]] |
| === Completion of Reference Count Updates |
| |
| The ``reference count update done'' operation complements the ``reference count |
| update intent'' operation. This second operation indicates that the update |
| actually happened, so that log recovery needn't replay the update. The CUD and |
| the actual updates are typically found in a new transaction following the |
| transaction in which the CUI was logged. |
| |
| [source, c] |
| ---- |
| struct xfs_cud_log_format { |
| __uint16_t cud_type; |
| __uint16_t cud_size; |
| __uint32_t __pad; |
| __uint64_t cud_cui_id; |
| }; |
| ---- |
| |
| *cud_type*:: |
| The signature of an RUD operation, 0x1243. This value is in host-endian order, |
| not big-endian like the rest of XFS. |
| |
| *cud_size*:: |
| Size of this log item. Should be 1. |
| |
| *cud_cui_id*:: |
| A 64-bit number that binds the corresponding CUI log item to this CUD log item. |
| |
| [[BUI_Log_Item]] |
| === File Block Mapping Intent |
| |
| The next two operation types work together to handle deferred file block |
| mapping updates. The extents to be mapped are expressed via the |
| +xfs_map_extent+ structure discussed in the section about |
| xref:RUI_Log_Item[reverse mapping intents]. |
| |
| The lower byte of the +me_flags+ field is a type code indicating what sort of |
| file block mapping operation we want. The upper three bytes are flag bits. |
| |
| .File block mapping update log intent types |
| [options="header"] |
| |===== |
| | Value | Description |
| | +XFS_BMAP_EXTENT_MAP+ | Add a mapping for file data. |
| | +XFS_BMAP_EXTENT_UNMAP+ | Remove a mapping for file data. |
| |===== |
| |
| .File block mapping update log intent flags |
| [options="header"] |
| |===== |
| | Value | Description |
| | +XFS_BMAP_EXTENT_ATTR_FORK+ | Extent is for the attribute fork. |
| | +XFS_BMAP_EXTENT_UNWRITTEN+ | Extent is unwritten. |
| |===== |
| |
| The ``file block mapping update intent'' operation comes first; it tells the |
| log that XFS wants to map or unmap some extents in a file. This record is |
| crucial for correct log recovery because it enables us to spread a complex |
| metadata update across multiple transactions while ensuring that a crash midway |
| through the complex update will be replayed fully during log recovery. |
| |
| [source, c] |
| ---- |
| struct xfs_bui_log_format { |
| __uint16_t bui_type; |
| __uint16_t bui_size; |
| __uint32_t bui_nextents; |
| __uint64_t bui_id; |
| struct xfs_map_extent bui_extents[1]; |
| }; |
| ---- |
| |
| *bui_type*:: |
| The signature of an BUI operation, 0x1244. This value is in host-endian order, |
| not big-endian like the rest of XFS. |
| |
| *bui_size*:: |
| Size of this log item. Should be 1. |
| |
| *bui_nextents*:: |
| Number of file mappings. Should be 1. |
| |
| *bui_id*:: |
| A 64-bit number that binds the corresponding BUD log item to this BUI log item. |
| |
| *bui_extents*:: |
| Variable-length array of file block mappings to update. There should only |
| be one mapping present. |
| |
| [[BUD_Log_Item]] |
| === Completion of File Block Mapping Updates |
| |
| The ``file block mapping update done'' operation complements the ``file block |
| mapping update intent'' operation. This second operation indicates that the |
| update actually happened, so that log recovery needn't replay the update. The |
| BUD and the actual updates are typically found in a new transaction following |
| the transaction in which the BUI was logged. |
| |
| [source, c] |
| ---- |
| struct xfs_bud_log_format { |
| __uint16_t bud_type; |
| __uint16_t bud_size; |
| __uint32_t __pad; |
| __uint64_t bud_bui_id; |
| }; |
| ---- |
| |
| *bud_type*:: |
| The signature of an BUD operation, 0x1245. This value is in host-endian order, |
| not big-endian like the rest of XFS. |
| |
| *bud_size*:: |
| Size of this log item. Should be 1. |
| |
| *bud_bui_id*:: |
| A 64-bit number that binds the corresponding BUI log item to this BUD log item. |
| |
| [[Inode_Log_Item]] |
| === Inode Updates |
| |
| This operation records changes to an inode record. There are several types of |
| inode updates, each corresponding to different parts of the inode record. |
| Allowing updates to proceed at a sub-inode granularity reduces contention for |
| the inode, since different parts of the inode can be updated simultaneously. |
| |
| The actual buffer data are stored in subsequent log items. |
| |
| The inode log format header is as follows: |
| |
| [source, c] |
| ---- |
| typedef struct xfs_inode_log_format_64 { |
| __uint16_t ilf_type; |
| __uint16_t ilf_size; |
| __uint32_t ilf_fields; |
| __uint16_t ilf_asize; |
| __uint16_t ilf_dsize; |
| __uint32_t ilf_pad; |
| __uint64_t ilf_ino; |
| union { |
| __uint32_t ilfu_rdev; |
| uuid_t ilfu_uuid; |
| } ilf_u; |
| __int64_t ilf_blkno; |
| __int32_t ilf_len; |
| __int32_t ilf_boffset; |
| } xfs_inode_log_format_64_t; |
| ---- |
| |
| *ilf_type*:: |
| The signature of an inode update operation, 0x123b. This value is in |
| host-endian order, not big-endian like the rest of XFS. |
| |
| *ilf_size*:: |
| Number of operations involved in this update, including this format operation. |
| |
| *ilf_fields*:: |
| Specifies which parts of the inode are being updated. This can be certain |
| combinations of the following: |
| |
| [options="header"] |
| |===== |
| | Flag | Inode changes to log include: |
| | +XFS_ILOG_CORE+ | The standard inode fields. |
| | +XFS_ILOG_DDATA+ | Data fork's local data. |
| | +XFS_ILOG_DEXT+ | Data fork's extent list. |
| | +XFS_ILOG_DBROOT+ | Data fork's B+tree root. |
| | +XFS_ILOG_DEV+ | Data fork's device number. |
| | +XFS_ILOG_UUID+ | Data fork's UUID contents. |
| | +XFS_ILOG_ADATA+ | Attribute fork's local data. |
| | +XFS_ILOG_AEXT+ | Attribute fork's extent list. |
| | +XFS_ILOG_ABROOT+ | Attribute fork's B+tree root. |
| | +XFS_ILOG_DOWNER+ | Change the data fork owner on replay. |
| | +XFS_ILOG_AOWNER+ | Change the attr fork owner on replay. |
| | +XFS_ILOG_TIMESTAMP+ | Timestamps are dirty, but not necessarily anything else. Should never appear on disk. |
| | +XFS_ILOG_NONCORE+ | ( +XFS_ILOG_DDATA+ \| +XFS_ILOG_DEXT+ \| +XFS_ILOG_DBROOT+ \| +XFS_ILOG_DEV+ \| +XFS_ILOG_UUID+ \| +XFS_ILOG_ADATA+ \| +XFS_ILOG_AEXT+ \| +XFS_ILOG_ABROOT+ \| +XFS_ILOG_DOWNER+ \| +XFS_ILOG_AOWNER+ ) |
| | +XFS_ILOG_DFORK+ | ( +XFS_ILOG_DDATA+ \| +XFS_ILOG_DEXT+ \| +XFS_ILOG_DBROOT+ ) |
| | +XFS_ILOG_AFORK+ | ( +XFS_ILOG_ADATA+ \| +XFS_ILOG_AEXT+ \| +XFS_ILOG_ABROOT+ ) |
| | +XFS_ILOG_ALL+ | ( +XFS_ILOG_CORE+ \| +XFS_ILOG_DDATA+ \| +XFS_ILOG_DEXT+ \| +XFS_ILOG_DBROOT+ \| +XFS_ILOG_DEV+ \| +XFS_ILOG_UUID+ \| +XFS_ILOG_ADATA+ \| +XFS_ILOG_AEXT+ \| +XFS_ILOG_ABROOT+ \| +XFS_ILOG_TIMESTAMP+ \| +XFS_ILOG_DOWNER+ \| +XFS_ILOG_AOWNER+ ) |
| |===== |
| |
| *ilf_asize*:: |
| Size of the attribute fork, in bytes. |
| |
| *ilf_dsize*:: |
| Size of the data fork, in bytes. |
| |
| *ilf_ino*:: |
| Absolute node number. |
| |
| *ilfu_rdev*:: |
| Device number information, for a device file update. |
| |
| *ilfu_uuid*:: |
| UUID, for a UUID update? |
| |
| *ilf_blkno*:: |
| Block number of the inode buffer, in sectors. |
| |
| *ilf_len*:: |
| Length of inode buffer, in sectors. |
| |
| *ilf_boffset*:: |
| Byte offset of the inode in the buffer. |
| |
| Be aware that there is a nearly identical +xfs_inode_log_format_32+ which may |
| appear on disk. It is the same as +xfs_inode_log_format_64+, except that it is |
| missing the +ilf_pad+ field and is 52 bytes long as opposed to 56 bytes. |
| |
| [[Inode_Data_Log_Item]] |
| === Inode Data Log Item |
| |
| This region contains the new contents of a part of an inode, as described in |
| the xref:Inode_Log_Item[previous section]. There are no magic numbers. |
| |
| If +XFS_ILOG_CORE+ is set in +ilf_fields+, the correpsonding data buffer must |
| be in the format +struct xfs_icdinode+, which has the same format as the first |
| 96 bytes of an xref:On-disk_Inode[inode], but is recorded in host byte order. |
| |
| [[Buffer_Log_Item]] |
| === Buffer Log Item |
| |
| This operation writes parts of a buffer to disk. The regions to write are |
| tracked in the data map; the actual buffer data are stored in subsequent log |
| items. |
| |
| [source, c] |
| ---- |
| typedef struct xfs_buf_log_format { |
| unsigned short blf_type; |
| unsigned short blf_size; |
| ushort blf_flags; |
| ushort blf_len; |
| __int64_t blf_blkno; |
| unsigned int blf_map_size; |
| unsigned int blf_data_map[XFS_BLF_DATAMAP_SIZE]; |
| } xfs_buf_log_format_t; |
| ---- |
| |
| *blf_type*:: |
| Magic number to specify a buffer log item, 0x123c. |
| |
| *blf_size*:: |
| Number of buffer data items following this item. |
| |
| *blf_flags*:: |
| Specifies flags associated with the buffer item. This can be any of the |
| following: |
| |
| [options="header"] |
| |===== |
| | Flag | Description |
| | +XFS_BLF_INODE_BUF+ | Inode buffer. These must be recovered before replaying items that change this buffer. |
| | +XFS_BLF_CANCEL+ | Don't recover this buffer, blocks are being freed. |
| | +XFS_BLF_UDQUOT_BUF+ | User quota buffer, don't recover if there's a subsequent quotaoff. |
| | +XFS_BLF_PDQUOT_BUF+ | Project quota buffer, don't recover if there's a subsequent quotaoff. |
| | +XFS_BLF_GDQUOT_BUF+ | Group quota buffer, don't recover if there's a subsequent quotaoff. |
| |===== |
| |
| *blf_len*:: |
| Number of sectors affected by this buffer. |
| |
| *blf_blkno*:: |
| Block number to write, in sectors. |
| |
| *blf_map_size*:: |
| The size of +blf_data_map+, in 32-bit words. |
| |
| *blf_data_map*:: |
| This variable-sized array acts as a dirty bitmap for the logged buffer. Each |
| 1 bit represents a dirty region in the buffer, and each run of 1 bits |
| corresponds to a subsequent log item containing the new contents of the buffer |
| area. Each bit represents +(blf_len * 512) / (blf_map_size * NBBY)+ bytes. |
| |
| [[Buffer_Data_Log_Item]] |
| === Buffer Data Log Item |
| |
| This region contains the new contents of a part of a buffer, as described in |
| the xref:Buffer_Log_Item[previous section]. There are no magic numbers. |
| |
| [[Quota_Update_Log_Item]] |
| === Update Quota File |
| |
| This updates a block in a quota file. The buffer data must be in the next log |
| item. |
| |
| [source, c] |
| ---- |
| typedef struct xfs_dq_logformat { |
| __uint16_t qlf_type; |
| __uint16_t qlf_size; |
| xfs_dqid_t qlf_id; |
| __int64_t qlf_blkno; |
| __int32_t qlf_len; |
| __uint32_t qlf_boffset; |
| } xfs_dq_logformat_t; |
| ---- |
| |
| *qlf_type*:: |
| The signature of an inode create operation, 0x123e. This value is in |
| host-endian order, not big-endian like the rest of XFS. |
| |
| *qlf_size*:: |
| Size of this log item. Should be 2. |
| |
| *qlf_id*:: |
| The user/group/project ID to alter. |
| |
| *qlf_blkno*:: |
| Block number of the quota buffer, in sectors. |
| |
| *qlf_len*:: |
| Length of the quota buffer, in sectors. |
| |
| *qlf_boffset*:: |
| Buffer offset of the quota data to update, in bytes. |
| |
| [[Quota_Update_Data_Log_Item]] |
| === Quota Update Data Log Item |
| |
| This region contains the new contents of a part of a buffer, as described in |
| the xref:Quota_Update_Log_Item[previous section]. There are no magic numbers. |
| |
| [[Quota_Off_Log_Item]] |
| === Disable Quota Log Item |
| |
| A request to disable quota controls has the following format: |
| |
| [source, c] |
| ---- |
| typedef struct xfs_qoff_logformat { |
| unsigned short qf_type; |
| unsigned short qf_size; |
| unsigned int qf_flags; |
| char qf_pad[12]; |
| } xfs_qoff_logformat_t; |
| ---- |
| |
| *qf_type*:: |
| The signature of an inode create operation, 0x123d. This value is in |
| host-endian order, not big-endian like the rest of XFS. |
| |
| *qf_size*:: |
| Size of this log item. Should be 1. |
| |
| *qf_flags*:: |
| Specifies which quotas are being turned off. Can be a combination of the |
| following: |
| |
| [options="header"] |
| |===== |
| | Flag | Quota type to disable |
| | +XFS_UQUOTA_ACCT+ | User quotas. |
| | +XFS_PQUOTA_ACCT+ | Project quotas. |
| | +XFS_GQUOTA_ACCT+ | Group quotas. |
| |===== |
| |
| [[Inode_Create_Log_Item]] |
| === Inode Creation Log Item |
| |
| This log item is created when inodes are allocated in-core. When replaying |
| this item, the specified inode records will be zeroed and some of the inode |
| fields populated with default values. |
| |
| [source, c] |
| ---- |
| struct xfs_icreate_log { |
| __uint16_t icl_type; |
| __uint16_t icl_size; |
| __be32 icl_ag; |
| __be32 icl_agbno; |
| __be32 icl_count; |
| __be32 icl_isize; |
| __be32 icl_length; |
| __be32 icl_gen; |
| }; |
| ---- |
| |
| *icl_type*:: |
| The signature of an inode create operation, 0x123f. This value is in |
| host-endian order, not big-endian like the rest of XFS. |
| |
| *icl_size*:: |
| Size of this log item. Should be 1. |
| |
| *icl_ag*:: |
| AG number of the inode chunk to create. |
| |
| *icl_agbno*:: |
| AG block number of the inode chunk. |
| |
| *icl_count*:: |
| Number of inodes to initialize. |
| |
| *icl_isize*:: |
| Size of each inode, in bytes. |
| |
| *icl_length*:: |
| Length of the extent being initialized, in blocks. |
| |
| *icl_gen*:: |
| Inode generation number to write into the new inodes. |
| |
| == xfs_logprint Example |
| |
| Here's an example of dumping the XFS log contents with +xfs_logprint+: |
| |
| ---- |
| # xfs_logprint /dev/sda |
| xfs_logprint: /dev/sda contains a mounted and writable filesystem |
| xfs_logprint: |
| data device: 0xfc03 |
| log device: 0xfc03 daddr: 900931640 length: 879816 |
| |
| cycle: 48 version: 2 lsn: 48,0 tail_lsn: 47,879760 |
| length of Log Record: 19968 prev offset: 879808 num ops: 53 |
| uuid: 24afeec2-f418-46a2-a573-10091f5e200e format: little endian linux |
| h_size: 32768 |
| ---- |
| |
| This is the log record header. |
| |
| ---- |
| Oper (0): tid: 30483aec len: 0 clientid: TRANS flags: START |
| ---- |
| |
| This operation indicates that we're starting a transaction, so the next |
| operation should record the transaction header. |
| |
| ---- |
| Oper (1): tid: 30483aec len: 16 clientid: TRANS flags: none |
| TRAN: type: CHECKPOINT tid: 30483aec num_items: 50 |
| ---- |
| |
| This operation records a transaction header. There should be fifty operations |
| in this transaction and the transaction ID is 0x30483aec. |
| |
| ---- |
| Oper (2): tid: 30483aec len: 24 clientid: TRANS flags: none |
| BUF: #regs: 2 start blkno: 145400496 (0x8aaa2b0) len: 8 bmap size: 1 flags: 0x2000 |
| Oper (3): tid: 30483aec len: 3712 clientid: TRANS flags: none |
| BUF DATA |
| ... |
| Oper (4): tid: 30483aec len: 24 clientid: TRANS flags: none |
| BUF: #regs: 3 start blkno: 59116912 (0x3860d70) len: 8 bmap size: 1 flags: 0x2000 |
| Oper (5): tid: 30483aec len: 128 clientid: TRANS flags: none |
| BUF DATA |
| 0 43544241 49010000 fa347000 2c357000 3a40b200 13000000 2343c200 13000000 |
| 8 3296d700 13000000 375deb00 13000000 8a551501 13000000 56be1601 13000000 |
| 10 af081901 13000000 ec741c01 13000000 9e911c01 13000000 69073501 13000000 |
| 18 4e539501 13000000 6549501 13000000 5d0e7f00 14000000 c6908200 14000000 |
| |
| Oper (6): tid: 30483aec len: 640 clientid: TRANS flags: none |
| BUF DATA |
| 0 7f47c800 21000000 23c0e400 21000000 2d0dfe00 21000000 e7060c01 21000000 |
| 8 34b91801 21000000 9cca9100 22000000 26e69800 22000000 4c969900 22000000 |
| ... |
| 90 1cf69900 27000000 42f79c00 27000000 6a99e00 27000000 6a99e00 27000000 |
| 98 6a99e00 27000000 6a99e00 27000000 6a99e00 27000000 6a99e00 27000000 |
| ---- |
| |
| Operations 4-6 describe two updates to a single dirty buffer at disk address |
| 59,116,912. The first chunk of dirty data is 128 bytes long. Notice how the |
| first four bytes of the first chunk is 0x43544241? Remembering that log items |
| are in host byte order, reverse that to 0x41425443, which is the magic number |
| for the free space B+tree ordered by size. |
| |
| The second chunk is 640 bytes. There are more buffer changes, so we'll skip |
| ahead a few operations: |
| |
| ---- |
| Oper (19): tid: 30483aec len: 56 clientid: TRANS flags: none |
| INODE: #regs: 2 ino: 0x63a73b4e flags: 0x1 dsize: 40 |
| blkno: 1412688704 len: 16 boff: 7168 |
| Oper (20): tid: 30483aec len: 96 clientid: TRANS flags: none |
| INODE CORE |
| magic 0x494e mode 0100600 version 2 format 3 |
| nlink 1 uid 1000 gid 1000 |
| atime 0x5633d58d mtime 0x563a391b ctime 0x563a391b |
| size 0x109dc8 nblocks 0x111 extsize 0x0 nextents 0x1b |
| naextents 0x0 forkoff 0 dmevmask 0x0 dmstate 0x0 |
| flags 0x0 gen 0x389071be |
| ---- |
| |
| This is an update to the core of inode 0x63a73b4e. There were similar inode |
| core updates after this, so we'll skip ahead a bit: |
| |
| ---- |
| Oper (32): tid: 30483aec len: 56 clientid: TRANS flags: none |
| INODE: #regs: 3 ino: 0x4bde428 flags: 0x5 dsize: 16 |
| blkno: 79553568 len: 16 boff: 4096 |
| Oper (33): tid: 30483aec len: 96 clientid: TRANS flags: none |
| INODE CORE |
| magic 0x494e mode 0100644 version 2 format 2 |
| nlink 1 uid 1000 gid 1000 |
| atime 0x563a3924 mtime 0x563a3931 ctime 0x563a3931 |
| size 0x1210 nblocks 0x2 extsize 0x0 nextents 0x1 |
| naextents 0x0 forkoff 0 dmevmask 0x0 dmstate 0x0 |
| flags 0x0 gen 0x2829c6f9 |
| Oper (34): tid: 30483aec len: 16 clientid: TRANS flags: none |
| EXTENTS inode data |
| ---- |
| |
| This inode update changes both the core and also the data fork. Since we're |
| changing the block map, it's unsurprising that one of the subsequent operations |
| is an EFI: |
| |
| ---- |
| Oper (37): tid: 30483aec len: 32 clientid: TRANS flags: none |
| EFI: #regs: 1 num_extents: 1 id: 0xffff8801147b5c20 |
| (s: 0x720daf, l: 1) |
| \---------------------------------------------------------------------------- |
| Oper (38): tid: 30483aec len: 32 clientid: TRANS flags: none |
| EFD: #regs: 1 num_extents: 1 id: 0xffff8801147b5c20 |
| \---------------------------------------------------------------------------- |
| Oper (39): tid: 30483aec len: 24 clientid: TRANS flags: none |
| BUF: #regs: 2 start blkno: 8 (0x8) len: 8 bmap size: 1 flags: 0x2800 |
| Oper (40): tid: 30483aec len: 128 clientid: TRANS flags: none |
| AGF Buffer: XAGF |
| ver: 1 seq#: 0 len: 56308224 |
| root BNO: 18174905 CNT: 18175030 |
| level BNO: 2 CNT: 2 |
| 1st: 41 last: 46 cnt: 6 freeblks: 35790503 longest: 19343245 |
| \---------------------------------------------------------------------------- |
| Oper (41): tid: 30483aec len: 24 clientid: TRANS flags: none |
| BUF: #regs: 3 start blkno: 145398760 (0x8aa9be8) len: 8 bmap size: 1 flags: 0x2000 |
| Oper (42): tid: 30483aec len: 128 clientid: TRANS flags: none |
| BUF DATA |
| Oper (43): tid: 30483aec len: 128 clientid: TRANS flags: none |
| BUF DATA |
| \---------------------------------------------------------------------------- |
| Oper (44): tid: 30483aec len: 24 clientid: TRANS flags: none |
| BUF: #regs: 3 start blkno: 145400224 (0x8aaa1a0) len: 8 bmap size: 1 flags: 0x2000 |
| Oper (45): tid: 30483aec len: 128 clientid: TRANS flags: none |
| BUF DATA |
| Oper (46): tid: 30483aec len: 3584 clientid: TRANS flags: none |
| BUF DATA |
| \---------------------------------------------------------------------------- |
| Oper (47): tid: 30483aec len: 24 clientid: TRANS flags: none |
| BUF: #regs: 3 start blkno: 59066216 (0x3854768) len: 8 bmap size: 1 flags: 0x2000 |
| Oper (48): tid: 30483aec len: 128 clientid: TRANS flags: none |
| BUF DATA |
| Oper (49): tid: 30483aec len: 768 clientid: TRANS flags: none |
| BUF DATA |
| ---- |
| |
| Here we see an EFI, followed by an EFD, followed by updates to the AGF and the |
| free space B+trees. Most probably, we just unmapped a few blocks from a file. |
| |
| ---- |
| Oper (50): tid: 30483aec len: 56 clientid: TRANS flags: none |
| INODE: #regs: 2 ino: 0x3906f20 flags: 0x1 dsize: 16 |
| blkno: 59797280 len: 16 boff: 0 |
| Oper (51): tid: 30483aec len: 96 clientid: TRANS flags: none |
| INODE CORE |
| magic 0x494e mode 0100644 version 2 format 2 |
| nlink 1 uid 1000 gid 1000 |
| atime 0x563a3938 mtime 0x563a3938 ctime 0x563a3938 |
| size 0x0 nblocks 0x0 extsize 0x0 nextents 0x0 |
| naextents 0x0 forkoff 0 dmevmask 0x0 dmstate 0x0 |
| flags 0x0 gen 0x35ed661 |
| \---------------------------------------------------------------------------- |
| Oper (52): tid: 30483aec len: 0 clientid: TRANS flags: COMMIT |
| ---- |
| |
| One more inode core update and this transaction commits. |