.. SPDX-License-Identifier: GPL-2.0

==================
XFS Logging Design
==================

Preamble
========

This document describes the design and algorithms that the XFS journalling
subsystem is based on, so that readers may familiarize themselves with the
general concepts of how transaction processing in XFS works.

We begin with an overview of transactions in XFS, followed by a description of
how transaction reservations are structured and accounted, and then move into
how we guarantee forwards progress for long running transactions with finite
initial reservation bounds. At that point we need to explain how relogging
works. With the basic concepts covered, the design of the delayed logging
mechanism is documented.


Introduction
============

XFS uses Write Ahead Logging for ensuring changes to the filesystem metadata
are atomic and recoverable. For reasons of space and time efficiency, the
logging mechanisms are varied and complex, combining intents, logical and
physical logging mechanisms to provide the necessary recovery guarantees the
filesystem requires.

Some objects, such as inodes and dquots, are logged in logical format where the
details logged are made up of the changes to in-core structures rather than
on-disk structures. Other objects - typically buffers - have their physical
changes logged. Long running atomic modifications have individual changes
chained together by intents, ensuring that journal recovery can restart and
finish an operation that was only partially done when the system stopped
functioning.

The reason for these differences is to keep the amount of log space and CPU time
required to process objects being modified as small as possible and hence the
logging overhead as low as possible. Some items are very frequently modified,
and some parts of objects are more frequently modified than others, so keeping
the overhead of metadata logging low is of prime importance.

The method used to log an item or chain modifications together isn't
particularly important in the scope of this document. It suffices to know that
the methods used for logging a particular object or chaining modifications
together are different and depend on the object and/or the modification being
performed. The logging subsystem only cares that certain specific rules are
followed to guarantee forwards progress and prevent deadlocks.


Transactions in XFS
===================

XFS has two types of high level transactions, defined by the type of log space
reservation they take. These are known as "one shot" and "permanent"
transactions. Permanent transactions can take reservations that span commit
boundaries, whilst "one shot" transactions are for a single atomic
modification.

The type and size of reservation must be matched to the modification taking
place.  This means that permanent transactions can be used for one-shot
modifications, but one-shot reservations cannot be used for permanent
transactions.

In the code, a one-shot transaction pattern looks somewhat like this::

    tp = xfs_trans_alloc(<reservation>)
    <lock items>
    <join item to transaction>
    <do modification>
    xfs_trans_commit(tp);

As items are modified in the transaction, the dirty regions in those items are
tracked via the transaction handle.  Once the transaction is committed, all
resources joined to it are released, along with the remaining unused reservation
space that was taken at the transaction allocation time.

In contrast, a permanent transaction is made up of multiple linked individual
transactions, and the pattern looks like this::

    tp = xfs_trans_alloc(<reservation>)
    xfs_ilock(ip, XFS_ILOCK_EXCL)

    loop {
        xfs_trans_ijoin(tp, 0);
        <do modification>
        xfs_trans_log_inode(tp, ip);
        xfs_trans_roll(&tp);
    }

    xfs_trans_commit(tp);
    xfs_iunlock(ip, XFS_ILOCK_EXCL);

While this might look similar to a one-shot transaction, there is an important
difference: xfs_trans_roll() performs a specific operation that links two
transactions together::

    ntp = xfs_trans_dup(tp);
    xfs_trans_commit(tp);
    xfs_trans_reserve(ntp);

This results in a series of "rolling transactions" where the inode is locked
across the entire chain of transactions.  Hence while this series of rolling
transactions is running, nothing else can read from or write to the inode and
this provides a mechanism for complex changes to appear atomic from an external
observer's point of view.

It is important to note that a series of rolling transactions in a permanent
transaction does not form an atomic change in the journal. While each
individual modification is atomic, the chain is *not atomic*. If we crash half
way through, then recovery will only replay up to the last transactional
modification the loop made that was committed to the journal.

This affects long running permanent transactions in that it is not possible to
predict how much of a long running operation will actually be recovered because
there is no guarantee of how much of the operation reached stale storage. Hence
if a long running operation requires multiple transactions to fully complete,
the high level operation must use intents and deferred operations to guarantee
recovery can complete the operation once the first transaction is persisted in
the on-disk journal.


Transactions are Asynchronous
=============================

In XFS, all high level transactions are asynchronous by default. This means that
xfs_trans_commit() does not guarantee that the modification has been committed
to stable storage when it returns. Hence when a system crashes, not all the
completed transactions will be replayed during recovery.

However, the logging subsystem does provide global ordering guarantees, such
that if a specific change is seen after recovery, all metadata modifications
that were committed prior to that change will also be seen.

For single shot operations that need to reach stable storage immediately, or to
ensure that a long running permanent transaction is fully committed once it is
complete, we can explicitly tag a transaction as synchronous. This will trigger
a "log force" to flush the outstanding committed transactions to stable storage
in the journal and wait for that to complete.

Synchronous transactions are rarely used, however, because they limit logging
throughput to the IO latency limitations of the underlying storage. Instead, we
tend to use log forces to ensure modifications are on stable storage only when
a user operation requires a synchronisation point to occur (e.g. fsync).


Transaction Reservations
========================

It has been mentioned a number of times now that the logging subsystem needs to
provide a forwards progress guarantee so that no modification ever stalls
because it can't be written to the journal due to a lack of space in the
journal. This is achieved by the transaction reservations that are made when
a transaction is first allocated. For permanent transactions, these reservations
are maintained as part of the transaction rolling mechanism.

A transaction reservation provides a guarantee that there is physical log space
available to write the modification into the journal before we start making
modifications to objects and items. As such, the reservation needs to be large
enough to take into account the amount of metadata that the change might need to
log in the worst case. This means that if we are modifying a btree in the
transaction, we have to reserve enough space to record a full leaf-to-root split
of the btree. As such, the reservations are quite complex because we have to
take into account all the hidden changes that might occur.

For example, a user data extent allocation involves allocating an extent from
free space, which modifies the free space trees. That's two btrees.  Inserting
the extent into the inode's extent map might require a split of the extent map
btree, which requires another allocation that can modify the free space trees
again.  Then we might have to update reverse mappings, which modifies yet
another btree which might require more space. And so on.  Hence the amount of
metadata that a "simple" operation can modify can be quite large.

This "worst case" calculation provides us with the static "unit reservation"
for the transaction that is calculated at mount time. We must guarantee that the
log has this much space available before the transaction is allowed to proceed
so that when we come to write the dirty metadata into the log we don't run out
of log space half way through the write.

For one-shot transactions, a single unit space reservation is all that is
required for the transaction to proceed. For permanent transactions, however, we
also have a "log count" that affects the size of the reservation that is to be
made.

While a permanent transaction can get by with a single unit of space
reservation, it is somewhat inefficient to do this as it requires the
transaction rolling mechanism to re-reserve space on every transaction roll. We
know from the implementation of the permanent transactions how many transaction
rolls are likely for the common modifications that need to be made.

For example, an inode allocation is typically two transactions - one to
physically allocate a free inode chunk on disk, and another to allocate an inode
from an inode chunk that has free inodes in it.  Hence for an inode allocation
transaction, we might set the reservation log count to a value of 2 to indicate
that the common/fast path transaction will commit two linked transactions in a
chain. Each time a permanent transaction rolls, it consumes an entire unit
reservation.

Hence when the permanent transaction is first allocated, the log space
reservation is increased from a single unit reservation to multiple unit
reservations. That multiple is defined by the reservation log count, and this
means we can roll the transaction multiple times before we have to re-reserve
log space when we roll the transaction. This ensures that the common
modifications we make only need to reserve log space once.
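
To make the accounting concrete, here is a small stand-alone sketch of unit
reservations and log counts. The structure and helper names are illustrative
assumptions, not the kernel's (the real accounting lives in the transaction
allocation and rolling code); it only models the rule that each roll consumes
one prepaid unit, and that space must be re-reserved once the units run out::

    #include <stdio.h>

    /* Illustrative reservation state - not the kernel's structures. */
    struct trans_res {
        unsigned int unit_bytes;   /* worst case space for one commit */
        unsigned int log_count;    /* prepaid unit reservations left */
    };

    /* Initial allocation reserves unit_bytes * log_count up front. */
    static unsigned int initial_reservation(const struct trans_res *t)
    {
        return t->unit_bytes * t->log_count;
    }

    /*
     * Each roll consumes an entire unit. Only when the prepaid units
     * are exhausted do we go back to the log to re-reserve space.
     */
    static void trans_roll(struct trans_res *t)
    {
        if (--t->log_count == 0) {
            printf("  units exhausted: re-reserve %u bytes\n",
                   t->unit_bytes);
            t->log_count = 1;
        }
    }

    int main(void)
    {
        /* e.g. inode allocation: a log count of 2 for the fast path */
        struct trans_res t = { .unit_bytes = 64 * 1024, .log_count = 2 };
        int i;

        printf("initial reservation: %u bytes\n", initial_reservation(&t));
        for (i = 0; i < 4; i++) {
            printf("roll %d\n", i + 1);
            trans_roll(&t);
        }
        return 0;
    }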

If the log count for a permanent transaction reaches zero, then it needs to
re-reserve physical space in the log. This is somewhat complex, and requires
an understanding of how the log accounts for space that has been reserved.


Log Space Accounting
====================

The position in the log is typically referred to as a Log Sequence Number (LSN).
The log is circular, so the positions in the log are defined by the combination
of a cycle number - the number of times the log has been overwritten - and the
offset into the log.  A LSN carries the cycle in the upper 32 bits and the
offset in the lower 32 bits. The offset is in units of "basic blocks" (512
bytes). Hence we can do relatively simple LSN based math to keep track of
available space in the log.
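
As an aside, the {cycle, offset} split can be sketched in a few lines of
stand-alone C. The helper names below are illustrative - they mirror the
spirit of the kernel's CYCLE_LSN() and BLOCK_LSN() macros rather than
reproducing the kernel implementation::

    #include <stdint.h>
    #include <stdio.h>

    typedef int64_t xfs_lsn_t;

    /* Pack a cycle count and basic block offset into a 64 bit LSN. */
    static xfs_lsn_t lsn_pack(uint32_t cycle, uint32_t block)
    {
        return ((xfs_lsn_t)cycle << 32) | block;
    }

    static uint32_t lsn_cycle(xfs_lsn_t lsn) { return lsn >> 32; }
    static uint32_t lsn_block(xfs_lsn_t lsn) { return lsn; }

    int main(void)
    {
        /* third pass over the log, 2048 basic blocks (1MB) in */
        xfs_lsn_t lsn = lsn_pack(3, 2048);

        printf("cycle %u, offset %u blocks (%u bytes)\n",
               lsn_cycle(lsn), lsn_block(lsn), lsn_block(lsn) * 512);

        /* cycle lives in the high bits, so plain integer comparison
         * orders LSNs correctly across log wraps */
        printf("later in the log? %d\n", lsn_pack(4, 10) > lsn);
        return 0;
    }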

Log space accounting is done via a pair of constructs called "grant heads".  The
position of the grant heads is an absolute value, so the amount of space
available in the log is defined by the distance between the position of the
grant head and the current log tail. That is, how much space can be
reserved/consumed before the grant heads would fully wrap the log and overtake
the tail position.
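
The wrap-aware distance calculation can be modelled simply. The sketch below
is a simplified, byte-based stand-in for the kernel's actual space
calculation; all the names in it are assumptions for illustration. Note that
a negative result is meaningful for the reserve head - it represents an
overcommit that throttles new reservations, as described below::

    #include <stdint.h>
    #include <stdio.h>

    /* A head or tail position: which lap of the log we are on, and
     * how far into that lap, in bytes. */
    struct log_pos {
        uint32_t cycle;
        uint32_t bytes;
    };

    static int64_t space_left(uint32_t log_size, struct log_pos head,
                              struct log_pos tail)
    {
        if (head.cycle == tail.cycle)    /* head on the same lap */
            return (int64_t)log_size - (head.bytes - tail.bytes);
        /* head has wrapped and is one lap ahead of the tail */
        return (int64_t)tail.bytes - head.bytes;
    }

    int main(void)
    {
        uint32_t size = 1 << 20;    /* a 1MB log */
        struct log_pos tail = { .cycle = 7, .bytes = 600000 };
        struct log_pos head = { .cycle = 8, .bytes = 200000 };

        /* head wrapped: 600000 - 200000 = 400000 bytes remain */
        printf("%lld bytes free\n", (long long)space_left(size, head, tail));

        head.bytes = 700000;    /* reserve head overcommits the log */
        printf("%lld bytes free\n", (long long)space_left(size, head, tail));
        return 0;
    }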

The first grant head is the "reserve" head. This tracks the byte count of the
reservations currently held by active transactions. It is a purely in-memory
accounting of the space reservation and, as such, actually tracks byte offsets
into the log rather than basic blocks. Hence it technically isn't using LSNs to
represent the log position, but it is still treated like a split {cycle,offset}
tuple for the purposes of tracking reservation space.

The reserve grant head is used to accurately account for exact transaction
reservation amounts and the exact byte count that modifications actually make
and need to write into the log. The reserve head is used to prevent new
transactions from taking new reservations when the head reaches the current
tail. It will block new reservations in a FIFO queue and as the log tail moves
forward it will wake them in order once sufficient space is available. This FIFO
mechanism ensures no transaction is starved of resources when log space
shortages occur.

The other grant head is the "write" head. Unlike the reserve head, this grant
head contains an LSN and it tracks the physical space usage in the log. While
this might sound like it is accounting the same state as the reserve grant head
- and it mostly does track exactly the same location as the reserve grant head -
there are critical differences in behaviour between them that provide the
forwards progress guarantees that rolling permanent transactions require.

These differences matter when a permanent transaction is rolled and the
internal "log count" reaches zero and the initial set of unit reservations
have been exhausted. At this point, we still require a log space reservation
to continue the next transaction in the sequence, but we have none remaining.
We cannot sleep during the transaction commit process waiting for new log
space to become available, as we may end up on the end of the FIFO queue and
the items we have locked while we sleep could end up pinning the tail of the
log before there is enough free space in the log to fulfill all of the pending
reservations and then wake up the transaction commit in progress.

To take a new reservation without sleeping requires us to be able to take a
reservation even if there is no reservation space currently available. That is,
we need to be able to *overcommit* the log reservation space. As has already
been detailed, we cannot overcommit physical log space. However, the reserve
grant head does not track physical space - it only accounts for the amount of
reservations we currently have outstanding. Hence if the reserve head passes
over the tail of the log all it means is that new reservations will be throttled
immediately and remain throttled until the log tail is moved forward far enough
to remove the overcommit and start taking new reservations. In other words, we
can overcommit the reserve head without violating the physical log head and tail
rules.

As a result, permanent transactions only "regrant" reservation space during
xfs_trans_commit() calls, while the physical log space reservation - tracked by
the write head - is then reserved separately by a call to xfs_log_reserve()
after the commit completes. Once the commit completes, we can sleep waiting for
physical log space to be reserved from the write grant head, but only if one
critical rule has been observed::

    Code using permanent reservations must always log the items they hold
    locked across each transaction they roll in the chain.

|  | "Re-logging" the locked items on every transaction roll ensures that the items | 
|  | attached to the transaction chain being rolled are always relocated to the | 
|  | physical head of the log and so do not pin the tail of the log. If a locked item | 
|  | pins the tail of the log when we sleep on the write reservation, then we will | 
|  | deadlock the log as we cannot take the locks needed to write back that item and | 
|  | move the tail of the log forwards to free up write grant space. Re-logging the | 
|  | locked items avoids this deadlock and guarantees that the log reservation we are | 
|  | making cannot self-deadlock. | 
|  |  | 
|  | If all rolling transactions obey this rule, then they can all make forwards | 
|  | progress independently because nothing will block the progress of the log | 
|  | tail moving forwards and hence ensuring that write grant space is always | 
|  | (eventually) made available to permanent transactions no matter how many times | 
|  | they roll. | 
|  |  | 

Re-logging Explained
====================

XFS allows multiple separate modifications to a single object to be carried in
the log at any given time.  This allows the log to avoid needing to flush each
change to disk before recording a new change to the object. XFS does this via a
method called "re-logging". Conceptually, this is quite simple - all it requires
is that any new change to the object is recorded with a *new copy* of all the
existing changes in the new transaction that is written to the log.

That is, if we have a sequence of changes A through to F, and the object was
written to disk after change D, we would see in the log the following series
of transactions, their contents and the log sequence number (LSN) of the
transaction::

    Transaction     Contents        LSN
       A               A             X
       B              A+B           X+n
       C             A+B+C         X+n+m
       D            A+B+C+D       X+n+m+o
    <object written to disk>
       E               E             Y (> X+n+m+o)
       F              E+F           Y+p

In other words, each time an object is relogged, the new transaction contains
the aggregation of all the previous changes currently held only in the log.

This relogging technique allows objects to be moved forward in the log so that
an object being relogged does not prevent the tail of the log from ever moving
forward.  This can be seen in the table above by the changing (increasing) LSN
of each subsequent transaction, and it's the technique that allows us to
implement long-running, multiple-commit permanent transactions.
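
The toy program below reproduces the shape of the table above: every relog
writes the aggregate of all changes still held only in the log, and the set of
carried changes is emptied once the object itself is written back. This is a
conceptual model only; nothing in it corresponds to real XFS code::

    #include <stdio.h>
    #include <string.h>

    int main(void)
    {
        const char *changes = "ABCDEF";
        char carried[8] = "";
        int lsn = 100;
        int i;

        for (i = 0; changes[i]; i++) {
            size_t n = strlen(carried);

            /* relog: the new copy aggregates all prior changes */
            carried[n] = changes[i];
            carried[n + 1] = '\0';
            printf("Transaction %c logs \"%s\" at LSN %d\n",
                   changes[i], carried, lsn);
            lsn += 3;    /* each relog lands further along the log */

            if (changes[i] == 'D') {
                carried[0] = '\0';  /* object written back to disk */
                lsn = 200;          /* "Y": later again in the log */
            }
        }
        return 0;
    }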

A typical example of a rolling transaction is the removal of extents from an
inode which can only be done at a rate of two extents per transaction because
of reservation size limitations. Hence a rolling extent removal transaction
keeps relogging the inode and btree buffers as they get modified in each
removal operation. This keeps them moving forward in the log as the operation
progresses, ensuring that the current operation never gets blocked by itself if
the log wraps around.

Hence it can be seen that the relogging operation is fundamental to the correct
working of the XFS journalling subsystem. From the above description, most
people should be able to see why XFS metadata operations write so much to
the log - repeated operations to the same objects write the same changes to
the log over and over again. Worse is the fact that objects tend to get
dirtier as they get relogged, so each subsequent transaction is writing more
metadata into the log.

It should now also be obvious how relogging and asynchronous transactions go
hand in hand. That is, transactions don't get written to the physical journal
until either a log buffer is filled (a log buffer can hold multiple
transactions) or a synchronous operation forces the log buffers holding the
transactions to disk. This means that XFS is doing aggregation of transactions
in memory - batching them, if you like - to minimise the impact of the log IO on
transaction throughput.

The limitation on asynchronous transaction throughput is the number and size of
log buffers made available by the log manager. By default there are 8 log
buffers available and the size of each is 32kB - the size can be increased up
to 256kB by use of a mount option.

Effectively, this gives us the maximum bound of outstanding metadata changes
that can be made to the filesystem at any point in time - if all the log
buffers are full and under IO, then no more transactions can be committed until
the current batch completes. It is now common for a single CPU core to be able
to issue enough transactions to keep the log buffers full and under IO
permanently. Hence the XFS journalling subsystem can be considered to be IO
bound.

Delayed Logging: Concepts
=========================

The key thing to note about the asynchronous logging combined with the
relogging technique XFS uses is that we can be relogging changed objects
multiple times before they are committed to disk in the log buffers. If we
return to the previous relogging example, it is entirely possible that
transactions A through D are committed to disk in the same log buffer.

That is, a single log buffer may contain multiple copies of the same object,
but only one of those copies needs to be there - the last one "D", as it
contains all the previous changes. In other words, we have one necessary copy
in the log buffer, and three stale copies that are simply wasting space. When
we are doing repeated operations on the same set of objects, these "stale
objects" can be over 90% of the space used in the log buffers. It is clear
that reducing the number of stale objects written to the log would greatly
reduce the amount of metadata we write to the log, and this is the fundamental
goal of delayed logging.

From a conceptual point of view, XFS is already doing relogging in memory (where
memory == log buffer), only it is doing it extremely inefficiently. It is using
logical to physical formatting to do the relogging because there is no
infrastructure to keep track of logical changes in memory prior to physically
formatting the changes in a transaction to the log buffer. Hence we cannot avoid
accumulating stale objects in the log buffers.

Delayed logging is the name we've given to keeping and tracking transactional
changes to objects in memory outside the log buffer infrastructure. Because of
the relogging concept fundamental to the XFS journalling subsystem, this is
actually relatively easy to do - all the changes to logged items are already
tracked in the current infrastructure. The big problem is how to accumulate
them and get them to the log in a consistent, recoverable manner.
Describing the problems and how they have been solved is the focus of this
document.

One of the key changes that delayed logging makes to the operation of the
journalling subsystem is that it disassociates the amount of outstanding
metadata changes from the size and number of log buffers available. In other
words, instead of there only being a maximum of 2MB of transaction changes not
written to the log at any point in time, there may be a much greater amount
being accumulated in memory. Hence the potential for loss of metadata on a
crash is much greater than for the existing logging mechanism.

It should be noted that this does not change the guarantee that log recovery
will result in a consistent filesystem. What it does mean is that as far as the
recovered filesystem is concerned, there may be many thousands of transactions
that simply did not occur as a result of the crash. This makes it even more
important that applications that care about their data use fsync() where they
need to ensure application level data integrity is maintained.

It should be noted that delayed logging is not an innovative new concept that
warrants rigorous proofs to determine whether it is correct or not. The method
of accumulating changes in memory for some period before writing them to the
log is used effectively in many filesystems including ext3 and ext4. Hence
no time is spent in this document trying to convince the reader that the
concept is sound. Instead it is simply considered a "solved problem" and as
such implementing it in XFS is purely an exercise in software engineering.

The fundamental requirements for delayed logging in XFS are simple:

1. Reduce the amount of metadata written to the log by at least
   an order of magnitude.
2. Supply sufficient statistics to validate Requirement #1.
3. Supply sufficient new tracing infrastructure to be able to debug
   problems with the new code.
4. No on-disk format change (metadata or log format).
5. Enable and disable with a mount option.
6. No performance regressions for synchronous transaction workloads.

Delayed Logging: Design
=======================

Storing Changes
---------------

The problem with accumulating changes at a logical level (i.e. just using the
existing log item dirty region tracking) is that when it comes to writing the
changes to the log buffers, we need to ensure that the object we are formatting
is not changing while we do this. This requires locking the object to prevent
concurrent modification. Hence flushing the logical changes to the log would
require us to lock every object, format them, and then unlock them again.

This introduces lots of scope for deadlocks with transactions that are already
running. For example, a transaction has object A locked and modified, but needs
the delayed logging tracking lock to commit the transaction. However, the
flushing thread has the delayed logging tracking lock already held, and is
trying to get the lock on object A to flush it to the log buffer. This appears
to be an unsolvable deadlock condition, and it was solving this problem that
was the barrier to implementing delayed logging for so long.

The solution is relatively simple - it just took a long time to recognise it.
Put simply, the current logging code formats the changes to each item into a
vector array that points to the changed regions in the item. The log write code
simply copies the memory these vectors point to into the log buffer during
transaction commit while the item is locked in the transaction. Instead of
using the log buffer as the destination of the formatting code, we can use an
allocated memory buffer big enough to fit the formatted vector.

If we then copy the vector into the memory buffer and rewrite the vector to
point to the memory buffer rather than the object itself, we now have a copy of
the changes in a format that is compatible with the log buffer writing code
and that does not require us to lock the item to access. This formatting and
rewriting can all be done while the object is locked during transaction commit,
resulting in a vector that is transactionally consistent and can be accessed
without needing to lock the owning item.

Hence we avoid the need to lock items when we need to flush outstanding
asynchronous transactions to the log. The differences between the existing
formatting method and the delayed logging formatting can be seen in the
diagram below.

Current format log vector::

    Object    +---------------------------------------------+
    Vector 1      +----+
    Vector 2                    +----+
    Vector 3                                   +----------+

After formatting::

    Log Buffer    +-V1-+-V2-+----V3----+

Delayed logging vector::

    Object    +---------------------------------------------+
    Vector 1      +----+
    Vector 2                    +----+
    Vector 3                                   +----------+

After formatting::

    Memory Buffer +-V1-+-V2-+----V3----+
    Vector 1      +----+
    Vector 2           +----+
    Vector 3                +----------+

The memory buffer and associated vector need to be passed as a single object,
but still need to be associated with the parent object so if the object is
relogged we can replace the current memory buffer with a new memory buffer that
contains the latest changes.
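
The pointer rewriting at the heart of this scheme can be sketched in
stand-alone C. The structures below are illustrative stand-ins for a log
vector (the kernel has its own definitions); the point is that after
detach_vectors() runs, the vector describes a private, transactionally
consistent copy that can be written to the log without the object lock::

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* One changed region of an object (illustrative only). */
    struct log_iovec {
        void   *addr;
        size_t len;
    };

    /*
     * Copy each changed region into one allocated buffer and rewrite
     * the vector to point into that buffer. Called with the object
     * locked during transaction commit; the result can be used later
     * without taking the object lock again.
     */
    static char *detach_vectors(struct log_iovec *vec, int nvecs)
    {
        size_t total = 0;
        char *buf, *p;
        int i;

        for (i = 0; i < nvecs; i++)
            total += vec[i].len;
        buf = p = malloc(total);
        if (!buf)
            return NULL;

        for (i = 0; i < nvecs; i++) {
            memcpy(p, vec[i].addr, vec[i].len);
            vec[i].addr = p;    /* vector now points at the copy */
            p += vec[i].len;
        }
        return buf;
    }

    int main(void)
    {
        char object[] = "..aaaa......bbb.....cccccccc....";
        struct log_iovec vec[] = {
            { object + 2, 4 }, { object + 12, 3 }, { object + 20, 8 },
        };
        char *buf = detach_vectors(vec, 3);

        object[2] = 'X';    /* a later modification of the object... */
        printf("%.4s\n", (char *)vec[0].addr);  /* ...copy still "aaaa" */
        free(buf);
        return 0;
    }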

The reason for keeping the vector around after we've formatted the memory
buffer is to support splitting vectors across log buffer boundaries correctly.
If we don't keep the vector around, we do not know where the region boundaries
are in the item, so we'd need a new encapsulation method for regions in the log
buffer writing (i.e. double encapsulation). This would be an on-disk format
change and as such is not desirable.  It also means we'd have to write the log
region headers in the formatting stage, which is problematic as there is per
region state that needs to be placed into the headers during the log write.

Hence we need to keep the vector, but by attaching the memory buffer to it and
rewriting the vector addresses to point at the memory buffer we end up with a
self-describing object that can be passed to the log buffer write code to be
handled in exactly the same manner as the existing log vectors are handled.
Hence we avoid needing a new on-disk format to handle items that have been
relogged in memory.



Tracking Changes
----------------

Now that we can record transactional changes in memory in a form that allows
them to be used without limitations, we need to be able to track and accumulate
them so that they can be written to the log at some later point in time.  The
log item is the natural place to store this vector and buffer, and it also
makes sense for it to be the object used to track committed objects as it will
always exist once the object has been included in a transaction.

The log item is already used to track the log items that have been written to
the log but not yet written to disk. Such log items are considered "active"
and as such are stored in the Active Item List (AIL) which is an LSN-ordered
doubly linked list. Items are inserted into this list during log buffer IO
completion, after which they are unpinned and can be written to disk. An object
that is in the AIL can be relogged, which causes the object to be pinned again
and then moved forward in the AIL when the log buffer IO completes for that
transaction.

Essentially, this shows that an item that is in the AIL can still be modified
and relogged, so any tracking must be separate to the AIL infrastructure. As
such, we cannot reuse the AIL list pointers for tracking committed items, nor
can we store state in any field that is protected by the AIL lock. Hence the
committed item tracking needs its own locks, lists and state fields in the log
item.

Similar to the AIL, tracking of committed items is done through a new list
called the Committed Item List (CIL).  The list tracks log items that have been
committed and have formatted memory buffers attached to them. It tracks objects
in transaction commit order, so when an object is relogged it is removed from
its place in the list and re-inserted at the tail. This is entirely arbitrary
and done to make it easy for debugging - the last items in the list are the
ones that are most recently modified. Ordering of the CIL is not necessary for
transactional integrity (as discussed in the next section) so the ordering is
done for convenience/sanity of the developers.
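
In list terms, relogging an item that is already present is just a move to
the tail. The fragment below sketches that behaviour with a minimal intrusive
list standing in for the kernel's list_head; the structures are illustrative,
not the real CIL implementation::

    #include <stdio.h>

    /* Minimal intrusive doubly linked list (like the kernel's list_head). */
    struct list {
        struct list *prev, *next;
    };

    static void list_init(struct list *l) { l->prev = l->next = l; }

    static void list_del(struct list *e)
    {
        e->prev->next = e->next;
        e->next->prev = e->prev;
    }

    static void list_add_tail(struct list *e, struct list *head)
    {
        e->prev = head->prev;
        e->next = head;
        head->prev->next = e;
        head->prev = e;
    }

    struct log_item {
        struct list cil;    /* position in the CIL; must be first here */
        const char *name;
        int in_cil;
    };

    /* Insert on first commit, move to the tail on every relog. */
    static void cil_insert(struct list *cil, struct log_item *item)
    {
        if (item->in_cil)
            list_del(&item->cil);
        list_add_tail(&item->cil, cil);
        item->in_cil = 1;
    }

    int main(void)
    {
        struct list cil;
        struct log_item a = { .name = "A" }, b = { .name = "B" };
        struct list *l;

        list_init(&cil);
        cil_insert(&cil, &a);
        cil_insert(&cil, &b);
        cil_insert(&cil, &a);    /* relog A: it moves behind B */

        for (l = cil.next; l != &cil; l = l->next)
            printf("%s ", ((struct log_item *)l)->name);
        printf("\n");            /* prints: B A */
        return 0;
    }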


Delayed Logging: Checkpoints
----------------------------

When we have a log synchronisation event, commonly known as a "log force",
all the items in the CIL must be written into the log via the log buffers.
We need to write these items in the order that they exist in the CIL, and they
need to be written as an atomic transaction. The need for all the objects to be
written as an atomic transaction comes from the requirements of relogging and
log replay - all the changes in all the objects in a given transaction must
either be completely replayed during log recovery, or not replayed at all. If
a transaction is not replayed because it is not complete in the log, then
no later transactions should be replayed, either.

To fulfill this requirement, we need to write the entire CIL in a single log
transaction. Fortunately, the XFS log code has no fixed limit on the size of a
transaction, nor does the log replay code. The only fundamental limit is that
the transaction cannot be larger than just under half the size of the log.  The
reason for this limit is that to find the head and tail of the log, there must
be at least one complete transaction in the log at any given time. If a
transaction is larger than half the log, then there is the possibility that a
crash during the write of such a transaction could partially overwrite the
only complete previous transaction in the log. This will result in a recovery
failure and an inconsistent filesystem and hence we must enforce the maximum
size of a checkpoint to be slightly less than half the log.

Apart from this size requirement, a checkpoint transaction looks no different
to any other transaction - it contains a transaction header, a series of
formatted log items and a commit record at the tail. From a recovery
perspective, the checkpoint transaction is also no different - just a lot
bigger with a lot more items in it. The worst case effect of this is that we
might need to tune the recovery transaction object hash size.

Because the checkpoint is just another transaction and all the changes to log
items are stored as log vectors, we can use the existing log buffer writing
code to write the changes into the log. To do this efficiently, we need to
minimise the time we hold the CIL locked while writing the checkpoint
transaction. The current log write code enables us to do this easily with the
way it separates the writing of the transaction contents (the log vectors) from
the transaction commit record, but tracking this requires us to have a
per-checkpoint context that travels through the log write process through to
checkpoint completion.

Hence a checkpoint has a context that tracks the state of the current
checkpoint from initiation to checkpoint completion. A new context is initiated
at the same time a checkpoint transaction is started. That is, when we remove
all the current items from the CIL during a checkpoint operation, we move all
those changes into the current checkpoint context. We then initialise a new
context and attach that to the CIL for aggregation of new transactions.

This allows us to unlock the CIL immediately after transfer of all the
committed items and effectively allows new transactions to be issued while we
are formatting the checkpoint into the log. It also allows concurrent
checkpoints to be written into the log buffers in the case of log force heavy
workloads, just like the existing transaction commit code does. This, however,
requires that we strictly order the commit records in the log so that
checkpoint sequence order is maintained during log replay.

Because we can be writing an item into a checkpoint transaction at the same
time another transaction modifies the item and inserts the log item into the
new CIL, the checkpoint transaction commit code cannot use log items to store
the list of log vectors that need to be written into the transaction. Hence
log vectors need to be able to be chained together to allow them to be
detached from the log items. That is, when the CIL is flushed the memory
buffer and log vector attached to each log item needs to be attached to the
checkpoint context so that the log item can be released. In diagrammatic form,
the CIL would look like this before the flush::

    CIL Head
       |
       V
    Log Item <-> log vector 1      -> memory buffer
       |                           -> vector array
       V
    Log Item <-> log vector 2      -> memory buffer
       |                           -> vector array
       V
    ......
       |
       V
    Log Item <-> log vector N-1    -> memory buffer
       |                           -> vector array
       V
    Log Item <-> log vector N      -> memory buffer
                                   -> vector array

And after the flush the CIL head is empty, and the checkpoint context log
vector list would look like::

    Checkpoint Context
       |
       V
    log vector 1      -> memory buffer
       |              -> vector array
       |              -> Log Item
       V
    log vector 2      -> memory buffer
       |              -> vector array
       |              -> Log Item
       V
    ......
       |
       V
    log vector N-1    -> memory buffer
       |              -> vector array
       |              -> Log Item
       V
    log vector N      -> memory buffer
                      -> vector array
                      -> Log Item

Once this transfer is done, the CIL can be unlocked and new transactions can
start, while the checkpoint flush code works over the log vector chain to
commit the checkpoint.
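
The transfer itself amounts to walking the CIL in commit order, stealing each
item's formatted vector and chaining it onto the context. A stand-alone
sketch, with illustrative structures rather than the kernel's own::

    #include <stddef.h>
    #include <stdio.h>

    struct log_vec {
        struct log_vec *next;   /* chain position in the checkpoint */
        const char *buffer;     /* the formatted memory buffer */
    };

    struct log_item {
        struct log_item *cil_next;  /* CIL order (simplified to singly
                                     * linked for this sketch) */
        struct log_vec *lv;         /* attached formatted changes */
    };

    /*
     * CIL flush: detach every item's log vector and chain the vectors
     * onto the checkpoint context in commit order. Afterwards the
     * items can be relogged into the next context while this
     * checkpoint writes from its private chain.
     */
    static struct log_vec *cil_detach_vectors(struct log_item *cil)
    {
        struct log_vec *chain = NULL, **tailp = &chain;
        struct log_item *li;

        for (li = cil; li; li = li->cil_next) {
            *tailp = li->lv;
            tailp = &li->lv->next;
            li->lv = NULL;      /* item is free to be relogged */
        }
        *tailp = NULL;
        return chain;
    }

    int main(void)
    {
        struct log_vec v2 = { .buffer = "vector 2" };
        struct log_vec v1 = { .buffer = "vector 1" };
        struct log_item i2 = { .cil_next = NULL, .lv = &v2 };
        struct log_item i1 = { .cil_next = &i2, .lv = &v1 };
        struct log_vec *lv;

        for (lv = cil_detach_vectors(&i1); lv; lv = lv->next)
            printf("%s\n", lv->buffer);
        return 0;
    }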

Once the checkpoint is written into the log buffers, the checkpoint context is
attached to the log buffer that the commit record was written to along with a
completion callback. Log IO completion will call that callback, which can then
run transaction committed processing for the log items (i.e. insert into AIL
and unpin) in the log vector chain and then free the log vector chain and
checkpoint context.

Discussion Point: I am uncertain as to whether the log item is the most
efficient way to track vectors, even though it seems like the natural way to do
it. The fact that we walk the log items (in the CIL) just to chain the log
vectors and break the link between the log item and the log vector means that
we take a cache line hit for the log item list modification, then another for
the log vector chaining. If we track by the log vectors, then we only need to
break the link between the log item and the log vector, which means we should
dirty only the log item cachelines. Normally I wouldn't be concerned about one
vs two dirty cachelines except for the fact I've seen upwards of 80,000 log
vectors in one checkpoint transaction. I'd guess this is a "measure and
compare" situation that can be done after a working and reviewed implementation
is in the dev tree....

Delayed Logging: Checkpoint Sequencing
--------------------------------------

One of the key aspects of the XFS transaction subsystem is that it tags
committed transactions with the log sequence number of the transaction commit.
This allows transactions to be issued asynchronously even though there may be
future operations that cannot be completed until that transaction is fully
committed to the log. In the rare case that a dependent operation occurs (e.g.
re-using a freed metadata extent for a data extent), a special, optimised log
force can be issued to force the dependent transaction to disk immediately.

To do this, transactions need to record the LSN of the commit record of the
transaction. This LSN comes directly from the log buffer the transaction is
written into. While this works just fine for the existing transaction
mechanism, it does not work for delayed logging because transactions are not
written directly into the log buffers. Hence some other method of sequencing
transactions is required.

|  |  | 
|  | As discussed in the checkpoint section, delayed logging uses per-checkpoint | 
|  | contexts, and as such it is simple to assign a sequence number to each | 
|  | checkpoint. Because the switching of checkpoint contexts must be done | 
|  | atomically, it is simple to ensure that each new context has a monotonically | 
|  | increasing sequence number assigned to it without the need for an external | 
|  | atomic counter - we can just take the current context sequence number and add | 
|  | one to it for the new context. | 
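
A sketch of that context switch follows (single-threaded here; in the real
code this happens under the CIL flush lock, which is what makes the swap
atomic). The structures and names are illustrative assumptions::

    #include <stdint.h>
    #include <stdio.h>
    #include <stdlib.h>

    /* Illustrative checkpoint context - not the kernel's structure. */
    struct cil_ctx {
        uint64_t sequence;
        /* ... log vector chain, space accounting, commit LSN ... */
    };

    struct cil {
        struct cil_ctx *current;
    };

    /*
     * Start a checkpoint: capture the current context and install a
     * new one with the next sequence number. Done under the CIL
     * flush lock in the real code, so no atomic counter is needed.
     */
    static struct cil_ctx *cil_switch_context(struct cil *cil)
    {
        struct cil_ctx *old = cil->current;
        struct cil_ctx *new = calloc(1, sizeof(*new));

        new->sequence = old->sequence + 1;
        cil->current = new;
        return old;    /* the caller now owns this checkpoint */
    }

    int main(void)
    {
        struct cil_ctx first = { .sequence = 1 };
        struct cil cil = { .current = &first };
        struct cil_ctx *ckpt = cil_switch_context(&cil);

        printf("flushing seq %llu; new commits aggregate in seq %llu\n",
               (unsigned long long)ckpt->sequence,
               (unsigned long long)cil.current->sequence);
        free(cil.current);
        return 0;
    }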

Then, instead of assigning a log buffer LSN to the transaction commit LSN
during the commit, we can assign the current checkpoint sequence. This allows
operations that track transactions that have not yet completed to know what
checkpoint sequence needs to be committed before they can continue. As a
result, the code that forces the log to a specific LSN now needs to ensure that
the log forces to a specific checkpoint.

To ensure that we can do this, we need to track all the checkpoint contexts
that are currently committing to the log. When we flush a checkpoint, the
context gets added to a "committing" list which can be searched. When a
checkpoint commit completes, it is removed from the committing list. Because
the checkpoint context records the LSN of the commit record for the checkpoint,
we can also wait on the log buffer that contains the commit record, thereby
using the existing log force mechanisms to execute synchronous forces.

It should be noted that the synchronous forces may need to be extended with
mitigation algorithms similar to the current log buffer code to allow
aggregation of multiple synchronous transactions if there are already
synchronous transactions being flushed. Investigation of the performance of the
current design is needed before making any decisions here.

The main concern with log forces is to ensure that all the previous checkpoints
are also committed to disk before the one we need to wait for. Therefore we
need to check that all the prior contexts in the committing list are also
complete before waiting on the one we need to complete. We do this
synchronisation in the log force code so that we don't need to wait anywhere
else for such serialisation - it only matters when we do a log force.
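
The ordering walk can be modelled as below. This is a single-threaded sketch
of the check only - the real code sleeps on wait queues and rescans rather
than returning - and every name in it is an illustrative assumption::

    #include <stdint.h>
    #include <stdio.h>

    /* Illustrative committing-checkpoint record. */
    struct ckpt {
        uint64_t seq;
        uint64_t commit_lsn;    /* 0 until the commit record is written */
    };

    /*
     * A log force to 'target' must first see a commit record for every
     * committing checkpoint at or below that sequence.
     */
    static int can_force_to(const struct ckpt *list, int n, uint64_t target)
    {
        int i;

        for (i = 0; i < n; i++) {
            if (list[i].seq > target)
                continue;
            if (list[i].commit_lsn == 0) {
                printf("wait: seq %llu has no commit record yet\n",
                       (unsigned long long)list[i].seq);
                return 0;
            }
        }
        return 1;    /* safe to force the log buffers to disk */
    }

    int main(void)
    {
        struct ckpt committing[] = {
            { .seq = 41, .commit_lsn = 9001 },
            { .seq = 42, .commit_lsn = 0 },    /* still writing */
        };

        if (!can_force_to(committing, 2, 42))
            printf("log force to seq 42 must block\n");
        return 0;
    }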

The only remaining complexity is that a log force now also has to handle the
case where the forcing sequence number is the same as the current context. That
is, we need to flush the CIL and potentially wait for it to complete. This is a
simple addition to the existing log forcing code to check the sequence numbers
and push if required. Indeed, placing the current sequence checkpoint flush in
the log force code enables the current mechanism for issuing synchronous
transactions to remain untouched (i.e. commit an asynchronous transaction, then
force the log at the LSN of that transaction) and so the higher level code
behaves the same regardless of whether delayed logging is being used or not.

Delayed Logging: Checkpoint Log Space Accounting
------------------------------------------------

The big issue for a checkpoint transaction is the log space reservation for the
transaction. We don't know how big a checkpoint transaction is going to be
ahead of time, nor how many log buffers it will take to write out, nor the
number of split log vector regions that are going to be used. We can track the
amount of log space required as we add items to the commit item list, but we
still need to reserve the space in the log for the checkpoint.

A typical transaction reserves enough space in the log for the worst case space
usage of the transaction. The reservation accounts for log record headers,
transaction and region headers, headers for split regions, buffer tail padding,
etc. as well as the actual space for all the changed metadata in the
transaction. While some of this is fixed overhead, much of it is dependent on
the size of the transaction and the number of regions being logged (the number
of log vectors in the transaction).

An example of the differences would be logging directory changes versus logging
inode changes. If you modify lots of inode cores (e.g. ``chmod -R g+w *``), then
there are lots of transactions that only contain an inode core and an inode log
format structure. That is, two vectors totaling roughly 150 bytes. If we modify
10,000 inodes, we have about 1.5MB of metadata to write in 20,000 vectors. Each
vector is 12 bytes, so the total to be logged is approximately 1.75MB. In
comparison, if we are logging full directory buffers, they are typically 4KB
each, so in 1.5MB of directory buffers we'd have roughly 400 buffers and a
buffer format structure for each buffer - roughly 800 vectors or 1.51MB total
space.  From this, it should be obvious that a static log space reservation is
not particularly flexible and is difficult to select the "optimal value" for
all workloads.

Further, if we are going to use a static reservation, which bit of the entire
reservation does it cover? We account for space used by the transaction
reservation by tracking the space currently used by the object in the CIL and
then calculating the increase or decrease in space used as the object is
relogged. This allows for a checkpoint reservation to only have to account for
log buffer metadata used such as log header records.

However, even using a static reservation for just the log metadata is
problematic. Typically log record headers use at least 16KB of log space per
1MB of log space consumed (512 bytes per 32k) and the reservation needs to be
large enough to handle arbitrary sized checkpoint transactions. This
reservation needs to be made before the checkpoint is started, and we need to
be able to reserve the space without sleeping.  For an 8MB checkpoint, we need
a reservation of around 150KB (8 x 16KB of record headers, plus the headers
for regions split across log buffer boundaries), which is a non-trivial amount
of space.

A static reservation needs to manipulate the log grant counters - we can take a
permanent reservation on the space, but we still need to make sure we refresh
the write reservation (the actual space available to the transaction) after
every checkpoint transaction completion. Unfortunately, if this space is not
available when required, then the regrant code will sleep waiting for it.

The problem with this is that it can lead to deadlocks as we may need to commit
checkpoints to be able to free up log space (refer back to the description of
rolling transactions for an example of this).  Hence we *must* always have
space available in the log if we are to use static reservations, and that is
very difficult and complex to arrange. It is possible to do, but there is a
simpler way.

The simpler way of doing this is tracking the entire log space used by the
items in the CIL and using this to dynamically calculate the amount of log
space required by the log metadata. If this log metadata space changes as a
result of a transaction commit inserting a new memory buffer into the CIL, then
the difference in space required is removed from the transaction that causes
the change. Transactions at this level will *always* have enough space
available in their reservation for this as they have already reserved the
maximal amount of log metadata space they require, and such a delta reservation
will always be less than or equal to the maximal amount in the reservation.
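
The delta accounting can be sketched as follows. Everything here is an
illustrative model of the rule just described - track the bytes an item
already accounts for in the CIL, and charge only the difference to the
committing transaction::

    #include <stdio.h>

    /* Illustrative per-item CIL accounting state. */
    struct cil_item {
        int bytes_in_cil;    /* formatted size already accounted */
    };

    /*
     * Charge the growth (or credit the shrinkage) of an item's
     * formatted size to the committing transaction, which always has
     * reservation to cover it. The running total gives the checkpoint
     * size without any up-front checkpoint reservation.
     */
    static int cil_charge(struct cil_item *item, int new_bytes,
                          int *ckpt_bytes)
    {
        int delta = new_bytes - item->bytes_in_cil;

        item->bytes_in_cil = new_bytes;
        *ckpt_bytes += delta;
        return delta;    /* taken from the transaction reservation */
    }

    int main(void)
    {
        struct cil_item inode = { 0 };
        int ckpt = 0;

        printf("first commit charges %d\n", cil_charge(&inode, 176, &ckpt));
        printf("relog charges only %d\n", cil_charge(&inode, 208, &ckpt));
        printf("checkpoint is now %d bytes\n", ckpt);
        return 0;
    }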

Hence we can grow the checkpoint transaction reservation dynamically as items
are added to the CIL and avoid the need for reserving and regranting log space
up front. This avoids deadlocks and removes a blocking point from the
checkpoint flush code.

As mentioned earlier, transactions can't grow to more than half the size of the
log. Hence as part of the reservation growing, we need to also check the size
of the reservation against the maximum allowed transaction size. If we reach
the maximum threshold, we need to push the CIL to the log. This is effectively
a "background flush" and is done on demand. This is identical to
a CIL push triggered by a log force, only that there is no waiting for the
checkpoint commit to complete. This background push is checked and executed by
transaction commit code.
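
The check itself is just a threshold comparison made at transaction commit
time. The fraction used below (one eighth of the log) is a placeholder
assumption, chosen only to sit comfortably below the hard half-the-log limit;
the kernel uses its own tuned limits::

    #include <stdint.h>
    #include <stdio.h>

    /*
     * Background push check made at transaction commit. The "log
     * size / 8" threshold is a placeholder - the point is only that
     * it is well below the hard half-the-log checkpoint limit.
     */
    static int cil_over_background_limit(uint64_t cil_bytes,
                                         uint64_t log_bytes)
    {
        return cil_bytes > log_bytes / 8;
    }

    int main(void)
    {
        uint64_t log_size = 128ull << 20;    /* a 128MB log */

        printf("push at 10MB? %d\n",
               cil_over_background_limit(10ull << 20, log_size));
        printf("push at 20MB? %d\n",
               cil_over_background_limit(20ull << 20, log_size));
        return 0;
    }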

If the transaction subsystem goes idle while we still have items in the CIL,
they will be flushed by the periodic log force issued by the xfssyncd. This log
force will push the CIL to disk, and if the transaction subsystem stays idle,
allow the idle log to be covered (effectively marked clean) in exactly the same
manner that is done for the existing logging method. A discussion point is
whether this log force needs to be done more frequently than the current rate
which is once every 30s.


Delayed Logging: Log Item Pinning
---------------------------------

Currently log items are pinned during transaction commit while the items are
still locked. This happens just after the items are formatted, though it could
be done any time before the items are unlocked. The result of this mechanism is
that items get pinned once for every transaction that is committed to the log
buffers. Hence items that are relogged in the log buffers will have a pin count
for every outstanding transaction they were dirtied in. When each of these
transactions is completed, they will unpin the item once. As a result, the item
only becomes unpinned when all the transactions complete and there are no
pending transactions. Thus the pinning and unpinning of a log item is symmetric
as there is a 1:1 relationship with transaction commit and log item completion.

For delayed logging, however, we have an asymmetric transaction commit to
completion relationship. Every time an object is relogged in the CIL it goes
through the commit process without a corresponding completion being registered.
That is, we now have a many-to-one relationship between transaction commit and
log item completion. The result of this is that pinning and unpinning of the
log items becomes unbalanced if we retain the "pin on transaction commit, unpin
on transaction completion" model.

To keep pin/unpin symmetry, the algorithm needs to change to a "pin on
insertion into the CIL, unpin on checkpoint completion". In other words, the
pinning and unpinning becomes symmetric around a checkpoint context. We have to
pin the object the first time it is inserted into the CIL - if it is already in
the CIL during a transaction commit, then we do not pin it again. Because there
can be multiple outstanding checkpoint contexts, we can still see elevated pin
counts, but as each checkpoint completes the pin count will retain the correct
value according to its context.

Just to make matters slightly more complex, this checkpoint level context
for the pin count means that the pinning of an item must take place under the
CIL commit/flush lock. If we pin the object outside this lock, we cannot
guarantee which context the pin count is associated with. This is because
whether we pin the item depends on whether the item is already present in the
current CIL. If we don't lock the CIL first before we check and pin the
object, we have a race with the CIL being flushed between the check and the
pin (or not pinning, as the case may be). Hence we must hold the CIL
flush/commit lock to guarantee that we pin the items correctly.
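
A sketch of the check-and-pin under the CIL lock follows, folding flush and
checkpoint completion together for brevity. Everything here is a simplified
illustration, not the kernel's pinning code::

    #include <stdbool.h>
    #include <stdio.h>

    struct log_item {
        int pin_count;
        bool in_cil;
    };

    /*
     * Pin on first insertion into the CIL only. In the real code this
     * runs under the CIL commit/flush lock so that the pin is
     * attributed to the right checkpoint context; relogging an item
     * already in the CIL must not pin it again.
     */
    static void cil_insert_and_pin(struct log_item *item)
    {
        /* cil_lock(); */
        if (!item->in_cil) {
            item->in_cil = true;
            item->pin_count++;    /* one pin per checkpoint context */
        }
        /* cil_unlock(); */
    }

    /* Checkpoint completion drops exactly the pin its context took.
     * (CIL removal actually happens at flush; simplified here.) */
    static void checkpoint_complete(struct log_item *item)
    {
        item->in_cil = false;
        item->pin_count--;
    }

    int main(void)
    {
        struct log_item item = { 0 };

        cil_insert_and_pin(&item);    /* first commit pins */
        cil_insert_and_pin(&item);    /* relog: no second pin */
        printf("pin count after two commits: %d\n", item.pin_count);

        checkpoint_complete(&item);
        printf("pin count after checkpoint: %d\n", item.pin_count);
        return 0;
    }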

Delayed Logging: Concurrent Scalability
---------------------------------------

A fundamental requirement for the CIL is that accesses through transaction
commits must scale to many concurrent commits. The current transaction commit
code does not break down even when there are transactions coming from 2048
processors at once. The current transaction code does not go any faster than if
there was only one CPU using it, but it does not slow down either.

As a result, the delayed logging transaction commit code needs to be designed
for concurrency from the ground up. It is obvious that there are serialisation
points in the design - the three important ones are:

1. Locking out new transaction commits while flushing the CIL
2. Adding items to the CIL and updating item space accounting
3. Checkpoint commit ordering

|  | Looking at the transaction commit and CIL flushing interactions, it is clear | 
|  | that we have a many-to-one interaction here. That is, the only restriction on | 
|  | the number of concurrent transactions that can be trying to commit at once is | 
|  | the amount of space available in the log for their reservations. The practical | 
|  | limit here is in the order of several hundred concurrent transactions for a | 
|  | 128MB log, which means that it is generally one per CPU in a machine. | 
|  |  | 
A transaction commit needs to hold out a flush for a relatively long period of
time - the pinning of log items needs to be done while we are holding out a
CIL flush, so at the moment that means the lock is held across the formatting
of the objects into memory buffers (i.e. while the memcpy()s are in progress).
Ultimately, a two-pass algorithm in which the formatting is done separately
from the pinning of objects could be used to reduce the hold time on the
transaction commit side.
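
A sketch of what such a two-pass commit might look like follows. All of the
helper names here are hypothetical; the point is only that the expensive
memcpy()s move outside the lock hold::

    /* Hypothetical helpers - declarations only, purely illustrative. */
    struct transaction;
    void format_items_into_vectors(struct transaction *tp);
    void pin_and_insert_vectors(struct transaction *tp);
    void cil_commit_lock(void);
    void cil_commit_unlock(void);

    static void two_pass_commit(struct transaction *tp)
    {
            /* Pass 1: format items into log vectors; no CIL lock held. */
            format_items_into_vectors(tp);

            /* Pass 2: only the cheap pin-and-insert step holds the lock. */
            cil_commit_lock();
            pin_and_insert_vectors(tp);
            cil_commit_unlock();
    }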
|  |  | 
Because of the number of potential transaction commit side holders, the lock
really needs to be a sleeping lock - if the CIL flush takes the lock, we do
not want every other CPU in the machine spinning on the CIL lock. Given that
flushing the CIL could involve walking a list of tens of thousands of log
items, the lock will be held for a significant time and so spin contention is
a real concern. Preventing lots of CPUs from spinning doing nothing is the
main reason for choosing a sleeping lock, even though nothing on either the
transaction commit or CIL flush side sleeps with the lock held.
|  |  | 
It should also be noted that CIL flushing is a relatively rare operation
compared to transaction commit for asynchronous transaction workloads - only
time will tell whether using a read-write semaphore for exclusion will limit
transaction commit concurrency due to cache line bouncing of the lock on the
read side.
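
In sketch form, the exclusion described above maps onto a read-write
semaphore: transaction commits take the read side and so do not serialise
against each other, while the CIL flush takes the write side to hold out all
committers. The lock name below is invented purely for illustration::

    #include <linux/rwsem.h>

    static DECLARE_RWSEM(cil_ctx_lock);     /* illustrative name */

    /* Transaction commit side: many concurrent read holders. */
    static void example_commit_side(void)
    {
            down_read(&cil_ctx_lock);
            /* ... pin items and insert them into the CIL ... */
            up_read(&cil_ctx_lock);
    }

    /*
     * CIL flush side: rare, but excludes all committers while the
     * current CIL context is swapped out for an empty one.
     */
    static void example_flush_side(void)
    {
            down_write(&cil_ctx_lock);
            /* ... detach the current context, install a new one ... */
            up_write(&cil_ctx_lock);
    }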
|  |  | 
The second serialisation point is on the transaction commit side where items
are inserted into the CIL. Because transactions can enter this code
concurrently, the CIL needs to be protected separately from the above
commit/flush exclusion. It also needs to be an exclusive lock, but it is only
held for a very short time, so a spin lock is appropriate here. It is possible
that this lock will become a contention point, but given the short hold time
once per transaction, contention is unlikely.
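
A sketch of this insertion path, again with illustrative names and a simple
byte counter standing in for the real space accounting::

    #include <linux/list.h>
    #include <linux/spinlock.h>

    /* Illustrative CIL fragment - not the real XFS structure. */
    struct cil {
            spinlock_t item_lock;           /* the second serialisation point */
            struct list_head items;
            unsigned int space_used;        /* log space consumed, in bytes */
    };

    /*
     * Insert one transaction's items and account for the space their
     * formatted log vectors consume. The lock is taken once per commit
     * and held only for a few list and counter updates.
     */
    static void cil_insert_items(struct cil *cil,
                                 struct list_head *item_list,
                                 unsigned int bytes)
    {
            spin_lock(&cil->item_lock);
            list_splice_tail_init(item_list, &cil->items);
            cil->space_used += bytes;
            spin_unlock(&cil->item_lock);
    }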
|  |  | 
|  | The final serialisation point is the checkpoint commit record ordering code | 
|  | that is run as part of the checkpoint commit and log force sequencing. The code | 
|  | path that triggers a CIL flush (i.e. whatever triggers the log force) will enter | 
|  | an ordering loop after writing all the log vectors into the log buffers but | 
|  | before writing the commit record. This loop walks the list of committing | 
|  | checkpoints and needs to block waiting for checkpoints to complete their commit | 
|  | record write. As a result it needs a lock and a wait variable. Log force | 
|  | sequencing also requires the same lock, list walk, and blocking mechanism to | 
|  | ensure completion of checkpoints. | 
|  |  | 
These two sequencing operations can use the same mechanism even though the
events they are waiting for are different. The checkpoint commit record
sequencing needs to wait until checkpoint contexts contain a commit LSN
(obtained through completion of a commit record write), while log force
sequencing needs to wait until previous checkpoint contexts are removed from
the committing list (i.e. they've completed). A simple wait variable and
broadcast wakeups (thundering herds) have been used to implement these two
serialisation queues. They use the same lock as the CIL, too. If we see too
much contention on the CIL lock, or too many context switches as a result of
the broadcast wakeups, these operations can be put under a new spinlock and
given separate wait lists to reduce lock contention and the number of
processes woken by the wrong event.
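
The following sketch shows one way such an ordering loop could be built from a
standard wait queue with broadcast wakeups. The context layout, list and lock
names are all assumptions for illustration::

    #include <linux/list.h>
    #include <linux/sched.h>
    #include <linux/spinlock.h>
    #include <linux/types.h>
    #include <linux/wait.h>

    /* Illustrative checkpoint context - not the real XFS structure. */
    struct ctx {
            struct list_head committing;    /* entry in the committing list */
            unsigned long sequence;         /* checkpoint sequence number */
            u64 commit_lsn;                 /* zero until the commit record
                                             * has been written */
    };

    static DEFINE_SPINLOCK(cil_lock);       /* shared with the CIL */
    static LIST_HEAD(committing_list);
    static DECLARE_WAIT_QUEUE_HEAD(commit_wait);

    /*
     * Wait until every earlier checkpoint has written its commit record,
     * then return so the caller can write its own. Every wakeup is a
     * broadcast, so all waiters rescan the list - the thundering herd
     * described above.
     */
    static void wait_for_prior_commit_records(struct ctx *ctx)
    {
            struct ctx *prev;
            DEFINE_WAIT(wait);

    restart:
            spin_lock(&cil_lock);
            list_for_each_entry(prev, &committing_list, committing) {
                    if (prev->sequence >= ctx->sequence || prev->commit_lsn)
                            continue;
                    /*
                     * Queue ourselves before dropping the lock so the
                     * wakeup cannot be lost, then sleep and rescan.
                     */
                    prepare_to_wait(&commit_wait, &wait, TASK_UNINTERRUPTIBLE);
                    spin_unlock(&cil_lock);
                    schedule();
                    finish_wait(&commit_wait, &wait);
                    goto restart;
            }
            spin_unlock(&cil_lock);
    }

    /* Completion side: record the LSN and wake all waiters. */
    static void commit_record_done(struct ctx *ctx, u64 lsn)
    {
            spin_lock(&cil_lock);
            ctx->commit_lsn = lsn;
            spin_unlock(&cil_lock);
            wake_up_all(&commit_wait);
    }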
|  |  | 
|  |  | 
|  | Lifecycle Changes | 
|  | ----------------- | 
|  |  | 
|  | The existing log item life cycle is as follows:: | 
|  |  | 
    1. Transaction allocate
    2. Transaction reserve
    3. Lock item
    4. Join item to transaction
        If not already attached,
            Allocate log item
            Attach log item to owner item
        Attach log item to transaction
    5. Modify item
        Record modifications in log item
    6. Transaction commit
        Pin item in memory
        Format item into log buffer
        Write commit LSN into transaction
        Unlock item
        Attach transaction to log buffer

    <log buffer IO dispatched>
    <log buffer IO completes>

    7. Transaction completion
        Mark log item committed
        Insert log item into AIL
            Write commit LSN into log item
        Unpin log item
    8. AIL traversal
        Lock item
        Mark log item clean
        Flush item to disk

    <item IO completion>

    9. Log item removed from AIL
        Moves log tail
        Item unlocked
|  |  | 
Essentially, steps 1-6 operate independently of step 7, which is in turn
independent of steps 8-9. An item can be locked in steps 1-6 or in steps 8-9
while step 7 is occurring, but because both require the item lock, only one of
steps 1-6 or steps 8-9 can be in progress at any given time. If the log item
is in the AIL or between steps 6 and 7 when steps 1-6 are re-entered, then the
item is relogged. Only when steps 8-9 have been entered and completed is the
object considered clean.
|  |  | 
|  | With delayed logging, there are new steps inserted into the life cycle:: | 
|  |  | 
    1. Transaction allocate
    2. Transaction reserve
    3. Lock item
    4. Join item to transaction
        If not already attached,
            Allocate log item
            Attach log item to owner item
        Attach log item to transaction
    5. Modify item
        Record modifications in log item
    6. Transaction commit
        Pin item in memory if not pinned in CIL
        Format item into log vector + buffer
        Attach log vector and buffer to log item
        Insert log item into CIL
        Write CIL context sequence into transaction
        Unlock item

    <next log force>

    7. CIL push
        lock CIL flush
        Chain log vectors and buffers together
        Remove items from CIL
        unlock CIL flush
        write log vectors into log
        sequence commit records
        attach checkpoint context to log buffer

    <log buffer IO dispatched>
    <log buffer IO completes>

    8. Checkpoint completion
        Mark log item committed
        Insert item into AIL
            Write commit LSN into log item
        Unpin log item
    9. AIL traversal
        Lock item
        Mark log item clean
        Flush item to disk

    <item IO completion>

    10. Log item removed from AIL
        Moves log tail
        Item unlocked
|  |  | 
From this, it can be seen that the only life cycle differences between the two
logging methods are in the middle of the life cycle - they still have the same
beginning and end, and the same execution constraints. The differences are
confined to the committing of the log items to the log itself and to the
completion processing. Hence delayed logging should not introduce any
constraints on log item behaviour, allocation or freeing that don't already
exist.
|  |  | 
As a result of this zero-impact "insertion" of the delayed logging
infrastructure, and because the internal structures were designed to avoid
on-disk format changes, we can switch between delayed logging and the existing
mechanism with a mount option (the delaylog/nodelaylog pair at the time this
was written). Fundamentally, there is no reason why the log manager could not
swap methods automatically and transparently depending on load
characteristics, but this should not be necessary if delayed logging works as
designed.