| struct printk_ringbuffer |
| ------------------------ |
| John Ogness <john.ogness@linutronix.de> |
| |
| Overview |
| ~~~~~~~~ |
| As the name suggests, this ring buffer was implemented specifically to serve |
| the needs of the printk() infrastructure. The ring buffer itself is not |
| specific to printk and could be used for other purposes. _However_, the |
| requirements and semantics of printk are rather unique. If you intend to use |
| this ring buffer for anything other than printk, you need to be very clear on |
| its features, behavior, and pitfalls. |
| |
| Features |
| ^^^^^^^^ |
| The printk ring buffer has the following features: |
| |
| - single global buffer |
| - resides in initialized data section (available at early boot) |
| - lockless readers |
| - supports multiple writers |
| - supports multiple non-consuming readers |
| - safe from any context (including NMI) |
| - groups bytes into variable length blocks (referenced by entries) |
| - entries tagged with sequence numbers |
| |
| Behavior |
| ^^^^^^^^ |
Since the printk ring buffer readers are lockless, there is no
synchronization between readers and writers. Writers are the tasks in
control and may overwrite any and all committed data at any time and from
any context. For this reason, readers can miss entries if they are
overwritten before the reader is able to access the data. The reader API
implementation
| is such that reader access to entries is atomic, so there is no risk of |
| readers having to deal with partial or corrupt data. Also, entries are |
| tagged with sequence numbers so readers can recognize if entries were missed. |
| |
| Writing to the ring buffer consists of 2 steps. First a writer must reserve |
| an entry of desired size. After this step the writer has exclusive access |
| to the memory region. Once the data has been written to memory, it needs to |
| be committed to the ring buffer. After this step the entry has been inserted |
| into the ring buffer and assigned an appropriate sequence number. |
| |
| Once committed, a writer must no longer access the data directly. This is |
| because the data may have been overwritten and no longer exists. If a |
| writer must access the data, it should either keep a private copy before |
| committing the entry or use the reader API to gain access to the data. |
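
For example, a writer that needs the data after committing could keep a
private copy, as in this sketch (prb_reserve() and prb_commit() are
described in the Writer API section below):

....
int write_and_keep(struct printk_ringbuffer *rb, char *data, int size,
                   char *copy)
{
        struct prb_handle h;
        char *buf;

        buf = prb_reserve(&h, rb, size);
        if (!buf)
                return -1;
        memcpy(buf, data, size);
        /* make the private copy (at least size bytes) before committing */
        memcpy(copy, data, size);
        prb_commit(&h);

        /* only "copy" may be used now; "buf" may already be overwritten */
        return 0;
}
....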
| |
| Because of how the data backend is implemented, entries that have been |
| reserved but not yet committed act as barriers, preventing future writers |
| from filling the ring buffer beyond the location of the reserved but not |
| yet committed entry region. For this reason it is *important* that writers |
perform both reserve and commit as quickly as possible. Also, be aware that
during the reserve/commit window preemption and local interrupts are disabled
and writing to the ring buffer is serialized by a processor-reentrant spin
lock. Writers in
| NMI contexts can still preempt any other writers, but as long as these |
| writers do not write a large amount of data with respect to the ring buffer |
| size, this should not become an issue. |
| |
| API |
| ~~~ |
| |
| Declaration |
| ^^^^^^^^^^^ |
| The printk ring buffer can be instantiated as a static structure: |
| |
/* declare a static struct prb_cpulock */
#define DECLARE_STATIC_PRINTKRB_CPULOCK(name)

/* declare a static struct printk_ringbuffer */
#define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr)
| |
The value of szbits specifies the size of the ring buffer as a power of 2
(2^szbits bytes). The cpulockptr field is a pointer to a prb_cpulock struct
that is used to perform processor-reentrant spin locking for the writers. It
is specified externally because it may be used by multiple ring buffers (or
other code) to synchronize writers without risk of deadlock.
| |
Here is an example declaration of a 32KB (2^15 bytes) printk ring buffer:
| |
| .... |
| DECLARE_STATIC_PRINTKRB_CPULOCK(rb_cpulock); |
| DECLARE_STATIC_PRINTKRB(rb, 15, &rb_cpulock); |
| .... |
| |
If writers will be using multiple ring buffers and the ordering of that usage
is not clear, the same prb_cpulock should be used for all of those ring
buffers.
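
For example, two ring buffers (the names here are arbitrary) sharing a
single prb_cpulock could be declared as:

....
DECLARE_STATIC_PRINTKRB_CPULOCK(shared_cpulock);
DECLARE_STATIC_PRINTKRB(rb_a, 15, &shared_cpulock);
DECLARE_STATIC_PRINTKRB(rb_b, 12, &shared_cpulock);
....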
| |
| Writer API |
| ^^^^^^^^^^ |
The writer API consists of 2 functions. The first reserves an entry in the
ring buffer and the second commits that data to the ring buffer. The
reserved entry information is stored within a provided `struct prb_handle`.
| |
| /* reserve an entry */ |
| char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb, |
| unsigned int size); |
| |
| /* commit a reserved entry to the ring buffer */ |
| void prb_commit(struct prb_handle *h); |
| |
| Here is an example of a function to write data to a ring buffer: |
| |
| .... |
| int write_data(struct printk_ringbuffer *rb, char *data, int size) |
| { |
| struct prb_handle h; |
| char *buf; |
| |
| buf = prb_reserve(&h, rb, size); |
| if (!buf) |
| return -1; |
| memcpy(buf, data, size); |
| prb_commit(&h); |
| |
| return 0; |
| } |
| .... |
| |
| Pitfalls |
| ++++++++ |
| Be aware that prb_reserve() can fail. A retry might be successful, but it |
| depends entirely on whether or not the next part of the ring buffer to |
| overwrite belongs to reserved but not yet committed entries of other writers. |
| Writers can use the prb_inc_lost() function to allow readers to notice that a |
| message was lost. |
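
For example, here is a sketch of a write function that records the loss
rather than retrying (prb_inc_lost() is described in the Utility API
section below):

....
int write_data_or_lose(struct printk_ringbuffer *rb, char *data, int size)
{
        struct prb_handle h;
        char *buf;

        buf = prb_reserve(&h, rb, size);
        if (!buf) {
                /* no entry available: let readers see a gap in seq */
                prb_inc_lost(rb);
                return -1;
        }
        memcpy(buf, data, size);
        prb_commit(&h);

        return 0;
}
....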
| |
| Reader API |
| ^^^^^^^^^^ |
| The reader API utilizes a `struct prb_iterator` to track the reader's |
| position in the ring buffer. |
| |
| /* declare a pre-initialized static iterator for a ring buffer */ |
| #define DECLARE_STATIC_PRINTKRB_ITER(name, rbaddr) |
| |
| /* initialize iterator for a ring buffer (if static macro NOT used) */ |
| void prb_iter_init(struct prb_iterator *iter, |
| struct printk_ringbuffer *rb, u64 *seq); |
| |
| /* make a deep copy of an iterator */ |
| void prb_iter_copy(struct prb_iterator *dest, |
| struct prb_iterator *src); |
| |
| /* non-blocking, advance to next entry (and read the data) */ |
| int prb_iter_next(struct prb_iterator *iter, char *buf, |
| int size, u64 *seq); |
| |
| /* blocking, advance to next entry (and read the data) */ |
| int prb_iter_wait_next(struct prb_iterator *iter, char *buf, |
| int size, u64 *seq); |
| |
| /* position iterator at the entry seq */ |
| int prb_iter_seek(struct prb_iterator *iter, u64 seq); |
| |
| /* read data at current position */ |
| int prb_iter_data(struct prb_iterator *iter, char *buf, |
| int size, u64 *seq); |
| |
| Typically prb_iter_data() is not needed because the data can be retrieved |
| directly with prb_iter_next(). |
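
As a sketch, a function to read the entry tagged with a given sequence
number might look like this (assuming, as in the example below, that
negative return values from the iterator functions indicate failure):

....
int read_entry(struct printk_ringbuffer *rb, u64 target_seq,
               char *buf, int size)
{
        struct prb_iterator iter;
        u64 seq;
        int ret;

        prb_iter_init(&iter, rb, NULL);

        /* position the iterator at the requested entry */
        ret = prb_iter_seek(&iter, target_seq);
        if (ret < 0)
                return ret;

        /* read the data at the current position */
        return prb_iter_data(&iter, buf, size, &seq);
}
....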
| |
| Here is an example of a non-blocking function that will read all the data in |
| a ring buffer: |
| |
| .... |
| void read_all_data(struct printk_ringbuffer *rb, char *buf, int size) |
| { |
| struct prb_iterator iter; |
| u64 prev_seq = 0; |
| u64 seq; |
| int ret; |
| |
| prb_iter_init(&iter, rb, NULL); |
| |
| for (;;) { |
| ret = prb_iter_next(&iter, buf, size, &seq); |
| if (ret > 0) { |
| if (seq != ++prev_seq) { |
| /* "seq - prev_seq" entries missed */ |
| prev_seq = seq; |
| } |
| /* process buf here */ |
| } else if (ret == 0) { |
| /* hit the end, done */ |
| break; |
| } else if (ret < 0) { |
| /* |
| * iterator is invalid, a writer overtook us, reset the |
| * iterator and keep going, entries were missed |
| */ |
| prb_iter_init(&iter, rb, NULL); |
| } |
| } |
| } |
| .... |
| |
| Pitfalls |
| ++++++++ |
| The reader's iterator can become invalid at any time because the reader was |
| overtaken by a writer. Typically the reader should reset the iterator back |
| to the current oldest entry (which will be newer than the entry the reader |
| was at) and continue, noting the number of entries that were missed. |
| |
| Utility API |
| ^^^^^^^^^^^ |
Several convenience functions are available for external code.
| |
| /* query the size of the data buffer */ |
| int prb_buffer_size(struct printk_ringbuffer *rb); |
| |
| /* skip a seq number to signify a lost record */ |
| void prb_inc_lost(struct printk_ringbuffer *rb); |
| |
| /* processor-reentrant spin lock */ |
| void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store); |
| |
| /* processor-reentrant spin unlock */ |
void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);
| |
| Pitfalls |
| ++++++++ |
Although the value returned by prb_buffer_size() represents an absolute
upper bound, the amount of data that can be stored within the ring buffer
is actually less because each entry also requires storage space for its
header.
| |
| The prb_lock() and prb_unlock() functions can be used to synchronize between |
| ring buffer writers and other external activities. The function of a |
| processor-reentrant spin lock is to disable preemption and local interrupts |
| and synchronize against other processors. It does *not* protect against |
multiple contexts of a single processor, i.e. NMI.
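
For example, here is a sketch of an external activity synchronizing with
all writers that use the rb_cpulock declared earlier:

....
void sync_with_writers(void)
{
        unsigned int cpu_store;

        prb_lock(&rb_cpulock, &cpu_store);
        /*
         * No writer using rb_cpulock (on any processor) is inside its
         * reserve/commit window at this point.
         */
        prb_unlock(&rb_cpulock, cpu_store);
}
....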
| |
| Implementation |
| ~~~~~~~~~~~~~~ |
| This section describes several of the implementation concepts and details to |
| help developers better understand the code. |
| |
| Entries |
| ^^^^^^^ |
| All ring buffer data is stored within a single static byte array. The reason |
| for this is to ensure that any pointers to the data (past and present) will |
| always point to valid memory. This is important because the lockless readers |
| may be preempted for long periods of time and when they resume may be working |
| with expired pointers. |
| |
Entries are identified by start index and size. (The start index plus size
is the start index of the next entry.) The start index is not simply an
offset into the byte array, but rather a logical position (lpos) that maps
to a byte array offset (lpos modulo the byte array size).
| |
For example, for a byte array of size 1000, an entry may have a start index
of 100. Another entry may have a start index of 1100. And yet another 2100.
All of these entries point to the same memory region, but only the most
recent entry is valid. The other entries point to valid memory, but
represent entries that have been overwritten.
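
Since the ring buffer size is a power of 2, the lpos to offset mapping can
be a simple mask, as in this sketch (the macro and field names here are
hypothetical):

....
/* size_bits corresponds to szbits given at declaration time */
#define PRB_SIZE(rb)          (1 << (rb)->size_bits)
#define PRB_OFFSET(rb, lpos)  ((lpos) & (PRB_SIZE(rb) - 1))
....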
| |
Note that due to lpos overflow, the most recent entry is not necessarily the one
| with the highest lpos value. Indeed, the printk ring buffer initializes its |
| data such that an overflow happens relatively quickly in order to validate the |
| handling of this situation. The implementation assumes that an lpos (unsigned |
| long) will never completely wrap while a reader is preempted. If this were to |
| become an issue, the seq number (which never wraps) could be used to increase |
| the robustness of handling this situation. |
| |
| Buffer Wrapping |
| ^^^^^^^^^^^^^^^ |
| If an entry starts near the end of the byte array but would extend beyond it, |
| a special terminating entry (size = -1) is inserted into the byte array and |
| the real entry is placed at the beginning of the byte array. This can waste |
| space at the end of the byte array, but simplifies the implementation by |
| allowing writers to always work with contiguous buffers. |
| |
| Note that the size field is the first 4 bytes of the entry header. Also note |
| that calc_next() always ensures that there are at least 4 bytes left at the |
| end of the byte array to allow room for a terminating entry. |
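
Here is a sketch of that wrap handling, reusing the hypothetical macro and
field names from the previous section:

....
if (PRB_OFFSET(rb, begin_lpos) + size > PRB_SIZE(rb)) {
        /* insert a terminating entry (size = -1) ... */
        *(int *)&rb->buffer[PRB_OFFSET(rb, begin_lpos)] = -1;
        /* ... and wrap: advance begin_lpos to offset 0 */
        begin_lpos += PRB_SIZE(rb) - PRB_OFFSET(rb, begin_lpos);
}
....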
| |
| Ring Buffer Pointers |
| ^^^^^^^^^^^^^^^^^^^^ |
| Three pointers (lpos values) are used to manage the ring buffer: |
| |
| - _tail_: points to the oldest entry |
| - _head_: points to where the next new committed entry will be |
| - _reserve_: points to where the next new reserved entry will be |
| |
| These pointers always maintain a logical ordering: |
| |
| tail <= head <= reserve |
| |
| The reserve pointer moves forward when a writer reserves a new entry. The |
| head pointer moves forward when a writer commits a new entry. |
| |
| The reserve pointer cannot overwrite the tail pointer in a wrap situation. In |
| such a situation, the tail pointer must be "pushed forward", thus |
| invalidating that oldest entry. Readers identify if they are accessing a |
| valid entry by ensuring their entry pointer is `>= tail && < head`. |
| |
| If the tail pointer is equal to the head pointer, it cannot be pushed and any |
| reserve operation will fail. The only resolution is for writers to commit |
| their reserved entries. |
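
Because lpos values increase monotonically, the validity check can be
written with wrap-safe unsigned arithmetic, as in this sketch:

....
/* true if lpos lies within [tail, head), even across lpos overflow */
static bool entry_is_valid(unsigned long lpos, unsigned long tail,
                           unsigned long head)
{
        return (lpos - tail) < (head - tail);
}
....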
| |
| Processor-Reentrant Locking |
| ^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
The purpose of the processor-reentrant locking is to limit the interruption
scenarios of writers to 2 contexts: the writer holding the lock and a writer
in NMI context that interrupts it. This allows for a simplified
implementation where:
| |
| - The reserve/commit window only exists on 1 processor at a time. A reserve |
| can never fail due to uncommitted entries of other processors. |
| |
| - When committing entries, it is trivial to handle the situation when |
| subsequent entries have already been committed, i.e. managing the head |
| pointer. |
| |
| Performance |
| ~~~~~~~~~~~ |
| Some basic tests were performed on a quad Intel(R) Xeon(R) CPU E5-2697 v4 at |
| 2.30GHz (36 cores / 72 threads). All tests involved writing a total of |
| 32,000,000 records at an average of 33 bytes each. Each writer was pinned to |
| its own CPU and would write as fast as it could until a total of 32,000,000 |
| records were written. All tests involved 2 readers that were both pinned |
| together to another CPU. Each reader would read as fast as it could and track |
| how many of the 32,000,000 records it could read. All tests used a ring buffer |
| of 16KB in size, which holds around 350 records (header + data for each |
| entry). |
| |
The only difference between the tests is the number of writers (and thus also
the number of records per writer). As more writers are added, the time to
write a record increases. This is because the data pointers, which are
modified via cmpxchg, and global data access in general become more
contended.
| |
| 1 writer |
| ^^^^^^^^ |
| runtime: 0m 18s |
| reader1: 16219900/32000000 (50%) records |
| reader2: 16141582/32000000 (50%) records |
| |
| 2 writers |
| ^^^^^^^^^ |
| runtime: 0m 32s |
| reader1: 16327957/32000000 (51%) records |
| reader2: 16313988/32000000 (50%) records |
| |
| 4 writers |
| ^^^^^^^^^ |
| runtime: 0m 42s |
| reader1: 16421642/32000000 (51%) records |
| reader2: 16417224/32000000 (51%) records |
| |
| 8 writers |
| ^^^^^^^^^ |
| runtime: 0m 43s |
| reader1: 16418300/32000000 (51%) records |
| reader2: 16432222/32000000 (51%) records |
| |
| 16 writers |
| ^^^^^^^^^^ |
| runtime: 0m 54s |
| reader1: 16539189/32000000 (51%) records |
| reader2: 16542711/32000000 (51%) records |
| |
| 32 writers |
| ^^^^^^^^^^ |
| runtime: 1m 13s |
| reader1: 16731808/32000000 (52%) records |
| reader2: 16735119/32000000 (52%) records |
| |
| Comments |
| ^^^^^^^^ |
It is particularly interesting to compare/contrast the 1-writer and 32-writer
tests. Even though writing the 32,000,000 records took over 4 times longer
with 32 writers, the readers (which perform no cmpxchg) were still unable to
keep up. This shows that the memory contention between the increasing number
of CPUs also has a dramatic effect on readers.
| |
| It should also be noted that in all cases each reader was able to read >=50% |
| of the records. This means that a single reader would have been able to keep |
| up with the writer(s) in all cases, becoming slightly easier as more writers |
| are added. This was the purpose of pinning 2 readers to 1 CPU: to observe how |
| maximum reader performance changes. |