| ===================== |
| Restartable Sequences |
| ===================== |
| |
| Restartable Sequences allow a thread to register a per-thread userspace |
| memory area which is used as an ABI between kernel and userspace for three |
| purposes: |
| |
| * userspace restartable sequences |
| |
| * quick access to the current CPU number and node ID from userspace |
| |
| * scheduler time slice extensions |
| |
| Restartable sequences (per-cpu atomics) |
| --------------------------------------- |
| |
| Restartable sequences allow userspace to perform update operations on |
| per-cpu data without requiring heavyweight atomic operations. The actual |
| ABI is unfortunately only available in the code and selftests. |
| |
| Quick access to CPU number, node ID |
| ----------------------------------- |
| |
| This allows per-CPU data to be implemented efficiently. Documentation is |
| unfortunately only available in the code and selftests. |
| |
| Optimized RSEQ V2 |
| ----------------- |
| |
| On architectures which utilize the generic entry code and generic TIF bits |
| the kernel supports runtime optimizations for RSEQ, which also enable |
| enhanced features like scheduler time slice extensions. |
| |
| To enable them a task has to register the RSEQ region with at least the |
| length advertised by getauxval(AT_RSEQ_FEATURE_SIZE). |
| |
| If existing binaries register with RSEQ_ORIG_SIZE (32 bytes), the kernel |
| keeps the legacy low performance mode enabled to fulfil the expectations |
| of existing users regarding the original RSEQ implementation behaviour. |
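| |
| A minimal sketch of how userspace might pick the registration length is |
| shown below. The helper names are hypothetical, the fallback value 27 for |
| ``AT_RSEQ_FEATURE_SIZE`` is an assumption mirroring the UAPI auxvec header |
| for libc headers which lack the definition, and treating 32 bytes as the |
| lower bound is an assumption as well. |
| |
```c
#include <stddef.h>
#include <sys/auxv.h>

/* Assumption: older libc headers may lack this; 27 mirrors linux/auxvec.h. */
#ifndef AT_RSEQ_FEATURE_SIZE
#define AT_RSEQ_FEATURE_SIZE 27
#endif

/* Size of the original rseq area as stated in the text above. */
#define RSEQ_ORIG_SIZE 32UL

/*
 * Hypothetical helper: never go below the original 32 byte area and
 * otherwise use the length which the kernel advertises.
 */
size_t rseq_registration_len(unsigned long advertised)
{
	return advertised > RSEQ_ORIG_SIZE ? advertised : RSEQ_ORIG_SIZE;
}

/* getauxval() returns 0 when the kernel does not provide the entry. */
size_t rseq_advertised_len(void)
{
	return rseq_registration_len(getauxval(AT_RSEQ_FEATURE_SIZE));
}
```
| |
| If the result equals 32, the registration cannot be told apart from one |
| made by an existing binary, which according to the paragraph above keeps |
| the legacy mode enabled. |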
| |
| The following table documents the ABI and behavioral guarantees of the |
| legacy and the optimized V2 mode. |
| |
| .. list-table:: RSEQ modes |
| :header-rows: 1 |
| |
| * - Nr |
| - What |
| |
| - Legacy |
| - Optimized V2 |
| |
| * - 1 |
| - The cpu_id_start, cpu_id, node_id and mm_cid fields (User mode read |
| only) |
| .. Legacy |
| - Updated by the kernel unconditionally after each context switch and |
| before signal delivery |
| .. Optimized V2 |
| - Updated by the kernel if and only if they change, i.e. if the task |
| is migrated or mm_cid changes |
| |
| * - 2 |
| - The rseq_cs critical section field |
| .. Legacy |
| - Evaluated and handled unconditionally after each context switch and |
| before signal delivery |
| .. Optimized V2 |
| - Evaluated and handled conditionally only when user space was |
| interrupted and was scheduled out or before delivering a signal in |
| the interrupted context. |
| |
| * - 3 |
| - Read only fields |
| .. Legacy |
| - No strict enforcement except in debug mode |
| .. Optimized V2 |
| - Strict enforcement |
| |
| * - 4 |
| - membarrier(...RSEQ) |
| .. Legacy |
| - All running threads of the process are interrupted, the ID fields are |
| rewritten and any active critical sections are aborted before the |
| threads return to user space. All threads which are scheduled out, |
| whether voluntarily or not, are covered by #1/#2 above. |
| .. Optimized V2 |
| - All running threads of the process are interrupted and any active |
| critical sections are aborted before these threads return to user |
| space. The ID fields are only updated if changed as a consequence of |
| the interrupt. All threads which are scheduled out, whether |
| voluntarily or not, are covered by #1/#2 above. |
| |
| * - 5 |
| - Time slice extensions |
| .. Legacy |
| - Not supported |
| .. Optimized V2 |
| - Supported |
| |
| The legacy mode is obviously less performant as it does unconditional |
| updates and critical section checks even if not strictly required by the |
| ABI contract. That can't be changed anymore as some users depend on that |
| observed behavior, which in turn enables them to violate the ABI and |
| overwrite the cpu_id_start field for their own purposes. This is obviously |
| discouraged as it renders RSEQ incompatible with the intended usage and |
| breaks the expectation of other libraries in the same application. |
| |
| The ABI compliant optimized V2 mode, which respects the read only fields, |
| does not require unconditional updates and is therefore considerably more |
| performant. The kernel validates the read only fields for compliance. If |
| user space modifies them, the process is killed. Compliant usage allows |
| multiple libraries in the same application to benefit from the RSEQ |
| functionality without disturbing each other. The ABI compliant optimized V2 |
| mode also enables extended RSEQ features like time slice extensions. |
| |
| |
| Scheduler time slice extensions |
| ------------------------------- |
| |
| This allows a thread to request a time slice extension when it enters a |
| critical section to avoid contention on a resource when the thread is |
| scheduled out inside of the critical section. |
| |
| The prerequisites for this functionality are: |
| |
| * Enabled in Kconfig |
| |
| * Enabled at boot time (default is enabled) |
| |
| * An rseq userspace pointer has been registered for the thread in |
| optimized V2 mode |
| |
| The thread has to enable the functionality via prctl(2):: |
| |
| prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_SET, |
| PR_RSEQ_SLICE_EXT_ENABLE, 0, 0); |
| |
| prctl() returns 0 on success. Otherwise it fails with one of the following |
| error codes: |
| |
| ========= ============================================================== |
| Errorcode Meaning |
| ========= ============================================================== |
| EINVAL    Functionality not available or invalid function arguments. |
|           Note: arg4 and arg5 must be zero |
| ENOTSUPP  Functionality was disabled on the kernel command line |
| ENXIO     Available, but no rseq user struct registered |
| ========= ============================================================== |
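| |
| The enable call and its error handling could be wrapped as sketched below. |
| The helper name ``slice_ext_enable()`` is hypothetical. The ``PR_RSEQ_*`` |
| constants are taken from the installed UAPI headers; returning ``-ENOSYS`` |
| when the headers do not provide them is a choice made for this sketch, not |
| kernel behaviour. |
| |
```c
#include <errno.h>
#include <sys/prctl.h>	/* pulls in linux/prctl.h for the PR_* constants */

/*
 * Hypothetical helper: returns 0 on success or a negative errno value.
 * arg4 and arg5 are passed as zero as required by the table above.
 */
int slice_ext_enable(void)
{
#ifdef PR_RSEQ_SLICE_EXTENSION
	if (prctl(PR_RSEQ_SLICE_EXTENSION,
		  (unsigned long)PR_RSEQ_SLICE_EXTENSION_SET,
		  (unsigned long)PR_RSEQ_SLICE_EXT_ENABLE, 0UL, 0UL) == 0)
		return 0;
	return -errno;
#else
	/* Sketch-only fallback: the headers predate the feature. */
	return -ENOSYS;
#endif
}
```
| |
| A caller would typically treat any negative return value as "run without |
| time slice extensions" rather than as a fatal error. |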
| |
| The state can also be queried via prctl(2):: |
| |
| prctl(PR_RSEQ_SLICE_EXTENSION, PR_RSEQ_SLICE_EXTENSION_GET, 0, 0, 0); |
| |
| prctl() returns ``PR_RSEQ_SLICE_EXT_ENABLE`` when it is enabled or 0 if |
| disabled. Otherwise it fails with the following error code: |
| |
| ========= ============================================================== |
| Errorcode Meaning |
| ========= ============================================================== |
| EINVAL    Functionality not available or invalid function arguments. |
|           Note: arg3, arg4 and arg5 must be zero |
| ========= ============================================================== |
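| |
| The query can be wrapped in the same way. This is a sketch only: the helper |
| name ``slice_ext_is_enabled()`` and the ``-ENOSYS`` fallback for headers |
| without the ``PR_RSEQ_*`` constants are assumptions made for illustration. |
| |
```c
#include <errno.h>
#include <sys/prctl.h>	/* pulls in linux/prctl.h for the PR_* constants */

/*
 * Hypothetical helper: returns 1 if enabled, 0 if disabled and a
 * negative errno value on failure. arg3, arg4 and arg5 are zero.
 */
int slice_ext_is_enabled(void)
{
#ifdef PR_RSEQ_SLICE_EXTENSION
	int ret = prctl(PR_RSEQ_SLICE_EXTENSION,
			(unsigned long)PR_RSEQ_SLICE_EXTENSION_GET,
			0UL, 0UL, 0UL);

	if (ret < 0)
		return -errno;
	return ret == PR_RSEQ_SLICE_EXT_ENABLE;
#else
	/* Sketch-only fallback: the headers predate the feature. */
	return -ENOSYS;
#endif
}
```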
| |
| The availability and status are also exposed in the flags field of the |
| rseq ABI struct through ``RSEQ_CS_FLAG_SLICE_EXT_AVAILABLE_BIT`` and |
| ``RSEQ_CS_FLAG_SLICE_EXT_ENABLED_BIT``. These bits are read-only for user |
| space and serve informational purposes only. |
| |
| If the mechanism was enabled via prctl(), the thread can request a time |
| slice extension by setting rseq::slice_ctrl::request to 1. If the thread is |
| interrupted and the interrupt results in a reschedule request in the |
| kernel, then the kernel can grant a time slice extension and return to |
| userspace instead of scheduling out. The length of the extension is |
| determined by debugfs:rseq/slice_ext_nsec. The default value is 5 usecs, |
| which is also the minimum value. It can be increased up to 50 usecs, but |
| doing so can and will affect the minimum scheduling latency. |
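| |
| For inspection, the configured value could be read as sketched below. The |
| helper names are hypothetical. That the file contains a single decimal |
| number of nanoseconds is an assumption, and so is the usual debugfs mount |
| point, which would make the path /sys/kernel/debug/rseq/slice_ext_nsec. |
| |
```c
#include <stdio.h>

/* Limits from the text above: 5 usecs minimum, 50 usecs maximum. */
#define SLICE_EXT_MIN_NSEC	5000L
#define SLICE_EXT_MAX_NSEC	50000L

/*
 * Hypothetical helper: read a decimal nanosecond value from @path.
 * Returns -1 if the file cannot be opened or parsed, e.g. when
 * debugfs is not mounted or the caller lacks the privileges.
 */
long read_slice_ext_nsec(const char *path)
{
	FILE *f = fopen(path, "r");
	long nsec = -1;

	if (!f)
		return -1;
	if (fscanf(f, "%ld", &nsec) != 1)
		nsec = -1;
	fclose(f);
	return nsec;
}

/* Check a value against the documented 5 to 50 usecs window. */
int slice_ext_nsec_in_range(long nsec)
{
	return nsec >= SLICE_EXT_MIN_NSEC && nsec <= SLICE_EXT_MAX_NSEC;
}
```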
| |
| Any proposed changes to this default will have to come with a selftest and |
| rseq-slice-hist.py output that shows the new value has merit. |
| |
| The kernel indicates the grant by clearing rseq::slice_ctrl::request and |
| setting rseq::slice_ctrl::granted to 1. If there is a reschedule of the |
| thread after granting the extension, the kernel clears the granted bit to |
| indicate that to userspace. |
| |
| If the request bit is still set when leaving the critical section, |
| userspace can clear it and continue. |
| |
| If the granted bit is set, then userspace invokes rseq_slice_yield(2) when |
| leaving the critical section to relinquish the CPU. The kernel enforces |
| this by arming a timer to prevent misbehaving userspace from abusing this |
| mechanism. |
| |
| If both the request bit and the granted bit are false when leaving the |
| critical section, then this indicates that a grant was revoked and no |
| further action is required by userspace. |
| |
| The required code flow is as follows:: |
| |
| rseq->slice_ctrl.request = 1; |
| barrier(); // Prevent compiler reordering |
| critical_section(); |
| barrier(); // Prevent compiler reordering |
| rseq->slice_ctrl.request = 0; |
| if (rseq->slice_ctrl.granted) |
|         rseq_slice_yield(); |
| |
| As all of this is strictly CPU local, there are no atomicity requirements. |
| Checking the granted state is racy, but that cannot be avoided at all:: |
| |
| if (rseq->slice_ctrl.granted) |
|                 -> Interrupt results in schedule and grant revocation |
|         rseq_slice_yield(); |
| |
| So there is no point in pretending that this might be solved by an atomic |
| operation. |
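| |
| The exit handling described above can be exercised without a real rseq |
| area by modelling the two bits. Everything in the following sketch is |
| hypothetical: ``struct slice_ctrl_model`` is only a stand-in for |
| rseq::slice_ctrl and not the real layout, the yield operation is passed in |
| as a callback instead of invoking rseq_slice_yield(2), and the ``fake_*`` |
| functions merely simulate what the kernel would do during an interrupt. |
| |
```c
#include <stdint.h>

/* Stand-in for rseq::slice_ctrl. NOT the real UAPI layout. */
struct slice_ctrl_model {
	volatile uint8_t request;
	volatile uint8_t granted;
};

enum slice_exit {
	SLICE_NOT_GRANTED,	/* request bit was still set on exit */
	SLICE_YIELDED,		/* grant was active, CPU relinquished */
	SLICE_REVOKED,		/* both bits clear: grant was revoked */
};

/* Compiler barrier only. The state is CPU local, no atomics needed. */
static inline void barrier(void)
{
	__asm__ __volatile__("" ::: "memory");
}

enum slice_exit run_with_slice_request(struct slice_ctrl_model *ctrl,
				       void (*critical_section)(void *),
				       void *arg, void (*yield)(void))
{
	uint8_t still_requested;

	ctrl->request = 1;
	barrier();
	critical_section(arg);
	barrier();
	still_requested = ctrl->request;
	ctrl->request = 0;
	/* Racy by design: the grant can be revoked right after this check. */
	if (ctrl->granted) {
		yield();
		return SLICE_YIELDED;
	}
	return still_requested ? SLICE_NOT_GRANTED : SLICE_REVOKED;
}

/* Test doubles which simulate the kernel side of the protocol. */
int yield_calls;

void fake_yield(void)
{
	yield_calls++;
}

void fake_no_interrupt(void *arg)
{
	(void)arg;
}

void fake_grant(void *arg)
{
	struct slice_ctrl_model *ctrl = arg;

	ctrl->request = 0;
	ctrl->granted = 1;
}

void fake_grant_then_revoke(void *arg)
{
	struct slice_ctrl_model *ctrl = arg;

	ctrl->request = 0;
	ctrl->granted = 0;
}
```
| |
| The returned state is informational only. As explained above, userspace |
| has nothing further to do in the not granted and revoked cases. |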
| |
| If the thread issues a syscall other than rseq_slice_yield(2) within the |
| granted timeslice extension, the grant is also revoked and the CPU is |
| relinquished immediately when entering the kernel. This is required as |
| syscalls might consume arbitrary CPU time until they reach a scheduling |
| point when the preemption model is either NONE or VOLUNTARY and therefore |
| might exceed the grant by far. |
| |
| The preferred solution for user space is to use rseq_slice_yield(2) which |
| is side effect free. The support for arbitrary syscalls is required to |
| support applications with an onion layer architecture, where the code |
| handling the critical section and requesting the time slice extension has |
| no control over the code within the critical section. |
| |
| The kernel enforces flag consistency and terminates the thread with SIGSEGV |
| if it detects a violation. |