| From: John Ogness <john.ogness@linutronix.de> |
| Date: Tue, 12 Feb 2019 15:29:39 +0100 |
| Subject: [PATCH 01/25] printk-rb: add printk ring buffer documentation |
| |
| The full documentation file for the printk ring buffer. |
| |
| Signed-off-by: John Ogness <john.ogness@linutronix.de> |
| Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
| --- |
| Documentation/printk-ringbuffer.txt | 377 ++++++++++++++++++++++++++++++++++++ |
| 1 file changed, 377 insertions(+) |
| create mode 100644 Documentation/printk-ringbuffer.txt |
| |
| --- /dev/null |
| +++ b/Documentation/printk-ringbuffer.txt |
| @@ -0,0 +1,377 @@ |
| +struct printk_ringbuffer |
| +------------------------ |
| +John Ogness <john.ogness@linutronix.de> |
| + |
| +Overview |
| +~~~~~~~~ |
| +As the name suggests, this ring buffer was implemented specifically to serve |
| +the needs of the printk() infrastructure. The ring buffer itself is not |
| +specific to printk and could be used for other purposes. _However_, the |
| +requirements and semantics of printk are rather unique. If you intend to use |
| +this ring buffer for anything other than printk, you need to be very clear on |
| +its features, behavior, and pitfalls. |
| + |
| +Features |
| +^^^^^^^^ |
| +The printk ring buffer has the following features: |
| + |
| +- single global buffer |
| +- resides in initialized data section (available at early boot) |
| +- lockless readers |
| +- supports multiple writers |
| +- supports multiple non-consuming readers |
| +- safe from any context (including NMI) |
| +- groups bytes into variable length blocks (referenced by entries) |
| +- entries tagged with sequence numbers |
| + |
| +Behavior |
| +^^^^^^^^ |
| +Since the printk ring buffer readers are lockless, there exists no |
| +synchronization between readers and writers. Basically writers are the tasks |
| +in control and may overwrite any and all committed data at any time and from |
| +any context. For this reason readers can miss entries if they are overwritten
| +before the reader is able to access the data. The reader API implementation
| +is such that reader access to entries is atomic, so there is no risk of |
| +readers having to deal with partial or corrupt data. Also, entries are |
| +tagged with sequence numbers so readers can recognize if entries were missed. |
| + |
| +Writing to the ring buffer consists of 2 steps. First a writer must reserve |
| +an entry of desired size. After this step the writer has exclusive access |
| +to the memory region. Once the data has been written to memory, it needs to |
| +be committed to the ring buffer. After this step the entry has been inserted |
| +into the ring buffer and assigned an appropriate sequence number. |
| + |
| +Once committed, a writer must no longer access the data directly. This is |
| +because the data may have been overwritten and no longer exists. If a |
| +writer must access the data, it should either keep a private copy before |
| +committing the entry or use the reader API to gain access to the data. |
| + |
| +Because of how the data backend is implemented, entries that have been
| +reserved but not yet committed act as barriers: future writers cannot fill
| +the ring buffer beyond the location of such an entry. For this reason it is
| +*important* that writers
| +perform both reserve and commit as quickly as possible. Also, be aware that |
| +preemption and local interrupts are disabled and writing to the ring buffer
| +is protected by a processor-reentrant lock during the reserve/commit window.
| +Writers in NMI context can still interrupt any other writers, but as long as
| +these writers do not write a large amount of data with respect to the ring
| +buffer size, this should not become an issue.
| + |
| +API |
| +~~~ |
| + |
| +Declaration |
| +^^^^^^^^^^^ |
| +The printk ring buffer and its cpulock can be instantiated as static
| +structures:
| +
| + /* declare a static struct prb_cpulock for the writers */
| + #define DECLARE_STATIC_PRINTKRB_CPULOCK(name)
| +
| + /* declare a static struct printk_ringbuffer */
| + #define DECLARE_STATIC_PRINTKRB(name, szbits, cpulockptr)
| + |
| +The value of szbits specifies the size of the ring buffer as a power of 2,
| +i.e. 2^szbits bytes. The
| +cpulockptr field is a pointer to a prb_cpulock struct that is used to |
| +perform processor-reentrant spin locking for the writers. It is specified |
| +externally because it may be used for multiple ring buffers (or other |
| +code) to synchronize writers without risk of deadlock. |
| + |
| +Here is an example declaration of a 32KB (2^15 bytes) printk ring buffer:
| + |
| +.... |
| +DECLARE_STATIC_PRINTKRB_CPULOCK(rb_cpulock); |
| +DECLARE_STATIC_PRINTKRB(rb, 15, &rb_cpulock); |
| +.... |
| + |
| +If writers will be using multiple ring buffers and the ordering of that usage
| +is not clear, the same prb_cpulock should be used for all of those ring
| +buffers, as shown in the sketch below.
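| +
| +For example, two ring buffers whose usage ordering is unclear could share a
| +single cpulock (the names rb_a and rb_b are hypothetical):
| +
| +....
| +DECLARE_STATIC_PRINTKRB_CPULOCK(shared_cpulock);
| +DECLARE_STATIC_PRINTKRB(rb_a, 14, &shared_cpulock);
| +DECLARE_STATIC_PRINTKRB(rb_b, 14, &shared_cpulock);
| +....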
| + |
| +Writer API |
| +^^^^^^^^^^ |
| +The writer API consists of 2 functions: the first reserves an entry in the
| +ring buffer and the second commits that data to the ring buffer. The
| +reserved entry information is stored within a provided `struct prb_handle`.
| + |
| + /* reserve an entry */ |
| + char *prb_reserve(struct prb_handle *h, struct printk_ringbuffer *rb, |
| + unsigned int size); |
| + |
| + /* commit a reserved entry to the ring buffer */ |
| + void prb_commit(struct prb_handle *h); |
| + |
| +Here is an example of a function to write data to a ring buffer: |
| + |
| +.... |
| +int write_data(struct printk_ringbuffer *rb, char *data, int size) |
| +{ |
| + struct prb_handle h; |
| + char *buf; |
| + |
| + buf = prb_reserve(&h, rb, size); |
| + if (!buf) |
| + return -1; |
| + memcpy(buf, data, size); |
| + prb_commit(&h); |
| + |
| + return 0; |
| +} |
| +.... |
| + |
| +Pitfalls |
| +++++++++ |
| +Be aware that prb_reserve() can fail. A retry might be successful, but it
| +depends entirely on whether or not the next region of the ring buffer to
| +overwrite belongs to reserved but not yet committed entries of other writers.
| +If a record must be dropped, the writer can call prb_inc_lost() so that
| +readers are able to notice that a message was lost, as sketched below.
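| +
| +For example, the earlier write_data() function could be extended to record
| +the loss on failure (write_data_lossy() is a hypothetical name):
| +
| +....
| +int write_data_lossy(struct printk_ringbuffer *rb, char *data, int size)
| +{
| +    struct prb_handle h;
| +    char *buf;
| +
| +    buf = prb_reserve(&h, rb, size);
| +    if (!buf) {
| +        /* create a gap in the seq numbers for readers to notice */
| +        prb_inc_lost(rb);
| +        return -1;
| +    }
| +    memcpy(buf, data, size);
| +    prb_commit(&h);
| +
| +    return 0;
| +}
| +....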
| + |
| +Reader API |
| +^^^^^^^^^^ |
| +The reader API utilizes a `struct prb_iterator` to track the reader's |
| +position in the ring buffer. |
| + |
| + /* declare a pre-initialized static iterator for a ring buffer */ |
| + #define DECLARE_STATIC_PRINTKRB_ITER(name, rbaddr) |
| + |
| + /* initialize iterator for a ring buffer (if static macro NOT used) */ |
| + void prb_iter_init(struct prb_iterator *iter, |
| + struct printk_ringbuffer *rb, u64 *seq); |
| + |
| + /* make a deep copy of an iterator */ |
| + void prb_iter_copy(struct prb_iterator *dest, |
| + struct prb_iterator *src); |
| + |
| + /* non-blocking, advance to next entry (and read the data) */ |
| + int prb_iter_next(struct prb_iterator *iter, char *buf, |
| + int size, u64 *seq); |
| + |
| + /* blocking, advance to next entry (and read the data) */ |
| + int prb_iter_wait_next(struct prb_iterator *iter, char *buf, |
| + int size, u64 *seq); |
| + |
| + /* position iterator at the entry seq */ |
| + int prb_iter_seek(struct prb_iterator *iter, u64 seq); |
| + |
| + /* read data at current position */ |
| + int prb_iter_data(struct prb_iterator *iter, char *buf, |
| + int size, u64 *seq); |
| + |
| +Typically prb_iter_data() is not needed because the data can be retrieved |
| +directly with prb_iter_next(). |
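| +
| +One use of prb_iter_copy() is to peek at the next entry without losing the
| +current position. A minimal sketch (peek_next() is a hypothetical helper):
| +
| +....
| +/* read the next entry without advancing @iter */
| +int peek_next(struct prb_iterator *iter, char *buf, int size, u64 *seq)
| +{
| +    struct prb_iterator tmp;
| +
| +    prb_iter_copy(&tmp, iter);
| +    return prb_iter_next(&tmp, buf, size, seq);
| +}
| +....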
| + |
| +Here is an example of a non-blocking function that will read all the data in |
| +a ring buffer: |
| + |
| +.... |
| +void read_all_data(struct printk_ringbuffer *rb, char *buf, int size) |
| +{ |
| + struct prb_iterator iter; |
| + u64 prev_seq = 0; |
| + u64 seq; |
| + int ret; |
| + |
| + prb_iter_init(&iter, rb, NULL); |
| + |
| + for (;;) { |
| + ret = prb_iter_next(&iter, buf, size, &seq); |
| + if (ret > 0) { |
| + if (seq != ++prev_seq) { |
| + /* "seq - prev_seq" entries missed */ |
| + prev_seq = seq; |
| + } |
| + /* process buf here */ |
| + } else if (ret == 0) { |
| + /* hit the end, done */ |
| + break; |
| + } else if (ret < 0) { |
| + /* |
| + * iterator is invalid, a writer overtook us, reset the |
| + * iterator and keep going, entries were missed |
| + */ |
| + prb_iter_init(&iter, rb, NULL); |
| + } |
| + } |
| +} |
| +.... |
| + |
| +Pitfalls |
| +++++++++ |
| +The reader's iterator can become invalid at any time because the reader was |
| +overtaken by a writer. Typically the reader should reset the iterator back |
| +to the current oldest entry (which will be newer than the entry the reader |
| +was at) and continue, noting the number of entries that were missed. |
| + |
| +Utility API |
| +^^^^^^^^^^^ |
| +Several convenience functions are available to external code.
| + |
| + /* query the size of the data buffer */ |
| + int prb_buffer_size(struct printk_ringbuffer *rb); |
| + |
| + /* skip a seq number to signify a lost record */ |
| + void prb_inc_lost(struct printk_ringbuffer *rb); |
| + |
| + /* processor-reentrant spin lock */ |
| + void prb_lock(struct prb_cpulock *cpu_lock, unsigned int *cpu_store); |
| + |
| + /* processor-reentrant spin unlock */ |
| + void prb_unlock(struct prb_cpulock *cpu_lock, unsigned int cpu_store);
| + |
| +Pitfalls |
| +++++++++ |
| +Although the value returned by prb_buffer_size() does represent an absolute
| +upper bound, the amount of data that can be stored within the ring buffer
| +is actually less because each entry also requires storage space for its
| +header.
| + |
| +The prb_lock() and prb_unlock() functions can be used to synchronize between |
| +ring buffer writers and other external activities. The function of a |
| +processor-reentrant spin lock is to disable preemption and local interrupts |
| +and synchronize against other processors. It does *not* protect against |
| +multiple contexts on a single processor, i.e. NMI.
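| +
| +As a sketch, external code might bracket an activity with the lock like this
| +(external_activity() is a hypothetical function):
| +
| +....
| +void external_activity(struct prb_cpulock *cpu_lock)
| +{
| +    unsigned int cpu_store;
| +
| +    prb_lock(cpu_lock, &cpu_store);
| +    /* preemption/irqs disabled, synchronized against other processors */
| +    prb_unlock(cpu_lock, cpu_store);
| +}
| +....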
| + |
| +Implementation |
| +~~~~~~~~~~~~~~ |
| +This section describes several of the implementation concepts and details to |
| +help developers better understand the code. |
| + |
| +Entries |
| +^^^^^^^ |
| +All ring buffer data is stored within a single static byte array. The reason |
| +for this is to ensure that any pointers to the data (past and present) will |
| +always point to valid memory. This is important because the lockless readers |
| +may be preempted for long periods of time and when they resume may be working |
| +with expired pointers. |
| + |
| +Entries are identified by start index and size. (The start index plus size |
| +is the start index of the next entry.) The start index is not simply an |
| +offset into the byte array, but rather a logical position (lpos) that maps |
| +directly to byte array offsets. |
| + |
| +For example, for a byte array of size 1000, an entry may have a start index
| +of 100. Another entry may have a start index of 1100. And yet another 2100.
| +All of these entries point to the same memory region, but only the most
| +recent entry is valid. The other entries point to valid memory, but
| +represent entries that have been overwritten.
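| +
| +The above example uses modulo-1000 mapping for clarity. Since the real ring
| +buffer size is a power of 2 (szbits), the mapping from an lpos to a byte
| +array offset can be a simple mask; a sketch with hypothetical names:
| +
| +....
| +/* hypothetical: map a logical position to a byte array offset */
| +#define DATA_INDEX(lpos, szbits) ((lpos) & ((1UL << (szbits)) - 1))
| +....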
| + |
| +Note that due to overflowing, the most recent entry is not necessarily the one |
| +with the highest lpos value. Indeed, the printk ring buffer initializes its |
| +data such that an overflow happens relatively quickly in order to validate the |
| +handling of this situation. The implementation assumes that an lpos (unsigned |
| +long) will never completely wrap while a reader is preempted. If this were to |
| +become an issue, the seq number (which never wraps) could be used to increase |
| +the robustness of handling this situation. |
| + |
| +Buffer Wrapping |
| +^^^^^^^^^^^^^^^ |
| +If an entry starts near the end of the byte array but would extend beyond it, |
| +a special terminating entry (size = -1) is inserted into the byte array and |
| +the real entry is placed at the beginning of the byte array. This can waste |
| +space at the end of the byte array, but simplifies the implementation by |
| +allowing writers to always work with contiguous buffers. |
| + |
| +Note that the size field is the first 4 bytes of the entry header. Also note |
| +that calc_next() always ensures that there are at least 4 bytes left at the |
| +end of the byte array to allow room for a terminating entry. |
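| +
| +In sketch form, the wrap decision might look like this (all names here are
| +hypothetical, this is not the calc_next() code):
| +
| +....
| +/* hypothetical sketch of wrap handling for a new entry */
| +if (offset + size > array_size) {
| +    *(s32 *)&array[offset] = -1; /* terminating entry: readers wrap */
| +    offset = 0;                  /* place the real entry at the start */
| +}
| +....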
| + |
| +Ring Buffer Pointers |
| +^^^^^^^^^^^^^^^^^^^^ |
| +Three pointers (lpos values) are used to manage the ring buffer: |
| + |
| + - _tail_: points to the oldest entry |
| + - _head_: points to where the next new committed entry will be |
| + - _reserve_: points to where the next new reserved entry will be |
| + |
| +These pointers always maintain a logical ordering: |
| + |
| + tail <= head <= reserve |
| + |
| +The reserve pointer moves forward when a writer reserves a new entry. The |
| +head pointer moves forward when a writer commits a new entry. |
| + |
| +The reserve pointer cannot overwrite the tail pointer in a wrap situation. In |
| +such a situation, the tail pointer must be "pushed forward", thus |
| +invalidating that oldest entry. Readers identify if they are accessing a |
| +valid entry by ensuring their entry pointer is `>= tail && < head`. |
| + |
| +If the tail pointer is equal to the head pointer, it cannot be pushed and any |
| +reserve operation will fail. The only resolution is for writers to commit |
| +their reserved entries. |
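| +
| +In sketch form, a reader's validity check might be written with wrap-safe
| +unsigned arithmetic (names are hypothetical):
| +
| +....
| +/* hypothetical: valid if tail <= lpos < head, safe across lpos wraps */
| +static bool entry_is_valid(unsigned long lpos, unsigned long tail,
| +                           unsigned long head)
| +{
| +    return (lpos - tail) < (head - tail);
| +}
| +....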
| + |
| +Processor-Reentrant Locking |
| +^^^^^^^^^^^^^^^^^^^^^^^^^^^ |
| +The purpose of the processor-reentrant locking is to limit the interruption |
| +scenarios of writers to 2 contexts. This allows for a simplified |
| +implementation where: |
| + |
| +- The reserve/commit window only exists on 1 processor at a time. A reserve |
| + can never fail due to uncommitted entries of other processors. |
| + |
| +- When committing entries, it is trivial to handle the situation when |
| + subsequent entries have already been committed, i.e. managing the head |
| + pointer. |
| + |
| +Performance |
| +~~~~~~~~~~~ |
| +Some basic tests were performed on a dual Intel(R) Xeon(R) CPU E5-2697 v4 at
| +2.30GHz (36 cores / 72 threads). All tests involved writing a total of |
| +32,000,000 records at an average of 33 bytes each. Each writer was pinned to |
| +its own CPU and would write as fast as it could until a total of 32,000,000 |
| +records were written. All tests involved 2 readers that were both pinned |
| +together to another CPU. Each reader would read as fast as it could and track |
| +how many of the 32,000,000 records it could read. All tests used a ring buffer |
| +of 16KB in size, which holds around 350 records (header + data for each |
| +entry). |
| + |
| +The only difference between the tests is the number of writers (and thus also |
| +the number of records per writer). As more writers are added, the time to |
| +write a record increases. This is because the data pointers, which are
| +modified via cmpxchg, and global data access in general become more contended.
| + |
| +1 writer |
| +^^^^^^^^ |
| + runtime: 0m 18s |
| + reader1: 16219900/32000000 (50%) records |
| + reader2: 16141582/32000000 (50%) records |
| + |
| +2 writers |
| +^^^^^^^^^ |
| + runtime: 0m 32s |
| + reader1: 16327957/32000000 (51%) records |
| + reader2: 16313988/32000000 (50%) records |
| + |
| +4 writers |
| +^^^^^^^^^ |
| + runtime: 0m 42s |
| + reader1: 16421642/32000000 (51%) records |
| + reader2: 16417224/32000000 (51%) records |
| + |
| +8 writers |
| +^^^^^^^^^ |
| + runtime: 0m 43s |
| + reader1: 16418300/32000000 (51%) records |
| + reader2: 16432222/32000000 (51%) records |
| + |
| +16 writers |
| +^^^^^^^^^^ |
| + runtime: 0m 54s |
| + reader1: 16539189/32000000 (51%) records |
| + reader2: 16542711/32000000 (51%) records |
| + |
| +32 writers |
| +^^^^^^^^^^ |
| + runtime: 1m 13s |
| + reader1: 16731808/32000000 (52%) records |
| + reader2: 16735119/32000000 (52%) records |
| + |
| +Comments |
| +^^^^^^^^ |
| +It is particularly interesting to compare/contrast the 1-writer and 32-writer |
| +tests. Despite the writing of the 32,000,000 records taking over 4 times |
| +longer, the readers (which perform no cmpxchg) were still unable to keep up. |
| +This shows that the memory contention between the increasing number of CPUs |
| +also has a dramatic effect on readers. |
| + |
| +It should also be noted that in all cases each reader was able to read >=50% |
| +of the records. This means that a single reader would have been able to keep |
| +up with the writer(s) in all cases, becoming slightly easier as more writers |
| +are added. This was the purpose of pinning 2 readers to 1 CPU: to observe how |
| +maximum reader performance changes. |