| .\" Page by b.hubert |
| .\" and Copyright (C) 2015, Thomas Gleixner <tglx@linutronix.de> |
| .\" and Copyright (C) 2015, Michael Kerrisk <mtk.manpages@gmail.com> |
| .\" |
| .\" %%%LICENSE_START(FREELY_REDISTRIBUTABLE) |
| .\" may be freely modified and distributed |
| .\" %%%LICENSE_END |
| .\" |
| .\" Niki A. Rahimi (LTC Security Development, narahimi@us.ibm.com) |
| .\" added ERRORS section. |
| .\" |
| .\" Modified 2004-06-17 mtk |
| .\" Modified 2004-10-07 aeb, added FUTEX_REQUEUE, FUTEX_CMP_REQUEUE |
| .\" |
| .\" FIXME Still to integrate are some points from Torvald Riegel's mail of |
| .\" 2015-01-23: |
| .\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=7977 |
| .\" |
| .\" FIXME Do we need to add some text regarding Torvald Riegel's 2015-01-24 mail |
| .\" http://thread.gmane.org/gmane.linux.kernel/1703405/focus=1873242 |
| .\" |
| .TH FUTEX 2 2017-09-15 "Linux" "Linux Programmer's Manual" |
| .SH NAME |
| futex \- fast user-space locking |
| .SH SYNOPSIS |
| .nf |
| .PP |
| .B "#include <linux/futex.h>" |
| .B "#include <sys/time.h>" |
| .PP |
| .BI "int futex(int *" uaddr ", int " futex_op ", int " val , |
| .BI " const struct timespec *" timeout , \ |
| " \fR /* or: \fBuint32_t \fIval2\fP */ |
| .BI " int *" uaddr2 ", int " val3 ); |
| .fi |
| .PP |
| .IR Note : |
| There is no glibc wrapper for this system call; see NOTES. |
| .SH DESCRIPTION |
| .PP |
| The |
| .BR futex () |
| system call provides a method for waiting until a certain condition becomes |
| true. |
| It is typically used as a blocking construct in the context of |
| shared-memory synchronization. |
| When using futexes, the majority of |
| the synchronization operations are performed in user space. |
| A user-space program employs the |
| .BR futex () |
| system call only when it is likely that the program has to block for |
| a longer time until the condition becomes true. |
| Other |
| .BR futex () |
| operations can be used to wake any processes or threads waiting |
| for a particular condition. |
| .PP |
| A futex is a 32-bit value\(emreferred to below as a |
| .IR "futex word" \(emwhose |
| address is supplied to the |
| .BR futex () |
| system call. |
| (Futexes are 32 bits in size on all platforms, including 64-bit systems.) |
| All futex operations are governed by this value. |
| In order to share a futex between processes, |
| the futex is placed in a region of shared memory, |
| created using (for example) |
| .BR mmap (2) |
| or |
| .BR shmat (2). |
| (Thus, the futex word may have different |
| virtual addresses in different processes, |
| but these addresses all refer to the same location in physical memory.) |
| In a multithreaded program, it is sufficient to place the futex word |
| in a global variable shared by all threads. |
| .PP |
| When executing a futex operation that requests to block a thread, |
| the kernel will block only if the futex word has the value that the |
| calling thread supplied (as one of the arguments of the |
| .BR futex () |
| call) as the expected value of the futex word. |
| The loading of the futex word's value, |
| the comparison of that value with the expected value, |
| and the actual blocking will happen atomically and will be totally ordered |
| with respect to concurrent operations performed by other threads |
| on the same futex word. |
| .\" Notes from Darren Hart (Dec 2015): |
| .\" Totally ordered with respect futex operations refers to semantics |
| .\" of the ACQUIRE/RELEASE operations and how they impact ordering of |
| .\" memory reads and writes. The kernel futex operations are protected |
| .\" by spinlocks, which ensure that all operations are serialized |
| .\" with respect to one another. |
| .\" |
| .\" This is a lot to attempt to define in this document. Perhaps a |
| .\" reference to linux/Documentation/memory-barriers.txt as a footnote |
| .\" would be sufficient? Or perhaps for this manual, "serialized" would |
| .\" be sufficient, with a footnote regarding "totally ordered" and a |
| .\" pointer to the memory-barrier documentation? |
| Thus, the futex word is used to connect the synchronization in user space |
| with the implementation of blocking by the kernel. |
| Analogously to an atomic |
| compare-and-exchange operation that potentially changes shared memory, |
| blocking via a futex is an atomic compare-and-block operation. |
| .\" FIXME(Torvald Riegel): |
| .\" Eventually we want to have some text in NOTES to satisfy |
| .\" the reference in the following sentence |
| .\" See NOTES for a detailed specification of |
| .\" the synchronization semantics. |
| .PP |
| One use of futexes is for implementing locks. |
| The state of the lock (i.e., acquired or not acquired) |
| can be represented as an atomically accessed flag in shared memory. |
| In the uncontended case, |
| a thread can access or modify the lock state with atomic instructions, |
| for example atomically changing it from not acquired to acquired |
| using an atomic compare-and-exchange instruction. |
| (Such instructions are performed entirely in user mode, |
| and the kernel maintains no information about the lock state.) |
| On the other hand, a thread may be unable to acquire a lock because |
| it is already acquired by another thread. |
| It then may pass the lock's flag as a futex word and the value |
| representing the acquired state as the expected value to a |
| .BR futex () |
| wait operation. |
| This |
| .BR futex () |
| operation will block if and only if the lock is still acquired |
| (i.e., the value in the futex word still matches the "acquired state"). |
| When releasing the lock, a thread has to first reset the |
| lock state to not acquired and then execute a futex |
| operation that wakes threads blocked on the lock flag used as a futex word |
| (this can be further optimized to avoid unnecessary wake-ups). |
| See |
| .BR futex (7) |
| for more detail on how to use futexes. |
| .PP |
| Besides the basic wait and wake-up futex functionality, there are further |
| futex operations aimed at supporting more complex use cases. |
| .PP |
| Note that |
| no explicit initialization or destruction is necessary to use futexes; |
| the kernel maintains a futex |
| (i.e., the kernel-internal implementation artifact) |
| only while operations such as |
| .BR FUTEX_WAIT , |
| described below, are being performed on a particular futex word. |
| .\" |
| .SS Arguments |
| The |
| .I uaddr |
| argument points to the futex word. |
| On all platforms, futexes are four-byte |
| integers that must be aligned on a four-byte boundary. |
| The operation to perform on the futex is specified in the |
| .I futex_op |
| argument; |
| .IR val |
| is a value whose meaning and purpose depends on |
| .IR futex_op . |
| .PP |
| The remaining arguments |
| .RI ( timeout , |
| .IR uaddr2 , |
| and |
| .IR val3 ) |
| are required only for certain of the futex operations described below. |
| Where one of these arguments is not required, it is ignored. |
| .PP |
| For several blocking operations, the |
| .I timeout |
| argument is a pointer to a |
| .IR timespec |
| structure that specifies a timeout for the operation. |
| However, notwithstanding the prototype shown above, for some operations, |
| the least significant four bytes of this argument are instead |
| used as an integer whose meaning is determined by the operation. |
| For these operations, the kernel casts the |
| .I timeout |
| value first to |
| .IR "unsigned long", |
| then to |
| .IR uint32_t , |
| and in the remainder of this page, this argument is referred to as |
| .I val2 |
| when interpreted in this fashion. |
| .PP |
| Where it is required, the |
| .IR uaddr2 |
| argument is a pointer to a second futex word that is employed |
| by the operation. |
| .PP |
| The interpretation of the final integer argument, |
| .IR val3 , |
| depends on the operation. |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .SS Futex operations |
| The |
| .I futex_op |
| argument consists of two parts: |
| a command that specifies the operation to be performed, |
| bit-wise ORed with zero or more options that |
| modify the behaviour of the operation. |
| The options that may be included in |
| .I futex_op |
| are as follows: |
| .TP |
| .BR FUTEX_PRIVATE_FLAG " (since Linux 2.6.22)" |
| .\" commit 34f01cc1f512fa783302982776895c73714ebbc2 |
| This option bit can be employed with all futex operations. |
| It tells the kernel that the futex is process-private and not shared |
| with another process (i.e., it is being used for synchronization |
| only between threads of the same process). |
| This allows the kernel to make some additional performance optimizations. |
| .\" I.e., It allows the kernel choose the fast path for validating |
| .\" the user-space address and avoids expensive VMA lookups, |
| .\" taking reference counts on file backing store, and so on. |
| .IP |
| As a convenience, |
| .IR <linux/futex.h> |
| defines a set of constants with the suffix |
| .BR _PRIVATE |
| that are equivalents of all of the operations listed below, |
| .\" except the obsolete FUTEX_FD, for which the "private" flag was |
| .\" meaningless |
| but with the |
| .BR FUTEX_PRIVATE_FLAG |
| ORed into the constant value. |
| Thus, there are |
| .BR FUTEX_WAIT_PRIVATE , |
| .BR FUTEX_WAKE_PRIVATE , |
| and so on. |
| .TP |
| .BR FUTEX_CLOCK_REALTIME " (since Linux 2.6.28)" |
| .\" commit 1acdac104668a0834cfa267de9946fac7764d486 |
| This option bit can be employed only with the |
| .BR FUTEX_WAIT_BITSET , |
| .BR FUTEX_WAIT_REQUEUE_PI , |
| and |
| (since Linux 4.5) |
| .\" commit 337f13046ff03717a9e99675284a817527440a49 |
| .BR FUTEX_WAIT |
| operations. |
| .IP |
| If this option is set, the kernel measures the |
| .I timeout |
| against the |
| .BR CLOCK_REALTIME |
| clock. |
| .IP |
| If this option is not set, the kernel measures the |
| .I timeout |
| against the |
| .BR CLOCK_MONOTONIC |
| clock. |
| .PP |
| The operation specified in |
| .I futex_op |
| is one of the following: |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_WAIT " (since Linux 2.6.0)" |
| .\" Strictly speaking, since some time in 2.5.x |
| This operation tests that the value at the |
| futex word pointed to by the address |
| .I uaddr |
| still contains the expected value |
| .IR val , |
| and if so, then sleeps waiting for a |
| .B FUTEX_WAKE |
| operation on the futex word. |
| The load of the value of the futex word is an atomic memory |
| access (i.e., using atomic machine instructions of the respective |
| architecture). |
| This load, the comparison with the expected value, and |
| starting to sleep are performed atomically |
| .\" FIXME: Torvald, I think we may need to add some explanation of |
| .\" "totally ordered" here. |
| and totally ordered |
| with respect to other futex operations on the same futex word. |
| If the thread starts to sleep, |
| it is considered a waiter on this futex word. |
| If the futex value does not match |
| .IR val , |
| then the call fails immediately with the error |
| .BR EAGAIN . |
| .IP |
| The purpose of the comparison with the expected value is to prevent lost |
| wake-ups. |
| If another thread changed the value of the futex word after the |
| calling thread decided to block based on the prior value, |
| and if the other thread executed a |
| .BR FUTEX_WAKE |
| operation (or similar wake-up) after the value change and before this |
| .BR FUTEX_WAIT |
| operation, then the calling thread will observe the |
| value change and will not start to sleep. |
| .IP |
| If the |
| .I timeout |
| is not NULL, the structure it points to specifies a |
| timeout for the wait. |
| (This interval will be rounded up to the system clock granularity, |
| and is guaranteed not to expire early.) |
| The timeout is by default measured according to the |
| .BR CLOCK_MONOTONIC |
| clock, but, since Linux 4.5, the |
| .BR CLOCK_REALTIME |
| clock can be selected by specifying |
| .BR FUTEX_CLOCK_REALTIME |
| in |
| .IR futex_op . |
| If |
| .I timeout |
| is NULL, the call blocks indefinitely. |
| .IP |
| .IR Note : |
| for |
| .BR FUTEX_WAIT , |
| .IR timeout |
| is interpreted as a |
| .IR relative |
| value. |
| This differs from other futex operations, where |
| .I timeout |
| is interpreted as an absolute value. |
| To obtain the equivalent of |
| .BR FUTEX_WAIT |
| with an absolute timeout, employ |
| .BR FUTEX_WAIT_BITSET |
| with |
| .IR val3 |
| specified as |
| .BR FUTEX_BITSET_MATCH_ANY . |
| .IP |
| The arguments |
| .I uaddr2 |
| and |
| .I val3 |
| are ignored. |
| .\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to a |
| .\" different example. |
| .\" |
| .\" For |
| .\" .BR futex (7), |
| .\" this call is executed if decrementing the count gave a negative value |
| .\" (indicating contention), |
| .\" and will sleep until another process or thread releases |
| .\" the futex and executes the |
| .\" .B FUTEX_WAKE |
| .\" operation. |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_WAKE " (since Linux 2.6.0)" |
| .\" Strictly speaking, since Linux 2.5.x |
| This operation wakes at most |
| .I val |
| of the waiters that are waiting (e.g., inside |
| .BR FUTEX_WAIT ) |
| on the futex word at the address |
| .IR uaddr . |
| Most commonly, |
| .I val |
| is specified as either 1 (wake up a single waiter) or |
| .BR INT_MAX |
| (wake up all waiters). |
| No guarantee is provided about which waiters are awoken |
| (e.g., a waiter with a higher scheduling priority is not guaranteed |
| to be awoken in preference to a waiter with a lower priority). |
| .IP |
| The arguments |
| .IR timeout , |
| .IR uaddr2 , |
| and |
| .I val3 |
| are ignored. |
| .\" FIXME . (Torvald) I think we should remove this. Or maybe adapt to |
| .\" a different example. |
| .\" |
| .\" For |
| .\" .BR futex (7), |
| .\" this is executed if incrementing the count showed that |
| .\" there were waiters, |
| .\" once the futex value has been set to 1 |
| .\" (indicating that it is available). |
| .\" |
| .\" How does "incrementing the count show that there were waiters"? |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_FD " (from Linux 2.6.0 up to and including Linux 2.6.25)" |
| .\" Strictly speaking, from Linux 2.5.x to 2.6.25 |
| This operation creates a file descriptor that is associated with |
| the futex at |
| .IR uaddr . |
| The caller must close the returned file descriptor after use. |
| When another process or thread performs a |
| .BR FUTEX_WAKE |
| on the futex word, the file descriptor indicates as being readable with |
| .BR select (2), |
| .BR poll (2), |
| and |
| .BR epoll (7) |
| .IP |
| The file descriptor can be used to obtain asynchronous notifications: if |
| .I val |
| is nonzero, then, when another process or thread executes a |
| .BR FUTEX_WAKE , |
| the caller will receive the signal number that was passed in |
| .IR val . |
| .IP |
| The arguments |
| .IR timeout , |
| .I uaddr2 |
| and |
| .I val3 |
| are ignored. |
| .IP |
| Because it was inherently racy, |
| .B FUTEX_FD |
| has been removed |
| .\" commit 82af7aca56c67061420d618cc5a30f0fd4106b80 |
| from Linux 2.6.26 onward. |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_REQUEUE " (since Linux 2.6.0)" |
| This operation performs the same task as |
| .BR FUTEX_CMP_REQUEUE |
| (see below), except that no check is made using the value in |
| .IR val3 . |
| (The argument |
| .I val3 |
| is ignored.) |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_CMP_REQUEUE " (since Linux 2.6.7)" |
| This operation first checks whether the location |
| .I uaddr |
| still contains the value |
| .IR val3 . |
| If not, the operation fails with the error |
| .BR EAGAIN . |
| Otherwise, the operation wakes up a maximum of |
| .I val |
| waiters that are waiting on the futex at |
| .IR uaddr . |
| If there are more than |
| .I val |
| waiters, then the remaining waiters are removed |
| from the wait queue of the source futex at |
| .I uaddr |
| and added to the wait queue of the target futex at |
| .IR uaddr2 . |
| The |
| .I val2 |
| argument specifies an upper limit on the number of waiters |
| that are requeued to the futex at |
| .IR uaddr2 . |
| .IP |
| .\" FIXME(Torvald) Is the following correct? Or is just the decision |
| .\" which threads to wake or requeue part of the atomic operation? |
| The load from |
| .I uaddr |
| is an atomic memory access (i.e., using atomic machine instructions of |
| the respective architecture). |
| This load, the comparison with |
| .IR val3 , |
| and the requeueing of any waiters are performed atomically and totally |
| ordered with respect to other operations on the same futex word. |
| .\" Notes from a f2f conversation with Thomas Gleixner (Aug 2015): ### |
| .\" The operation is serialized with respect to operations on both |
| .\" source and target futex. No other waiter can enqueue itself |
| .\" for waiting and no other waiter can dequeue itself because of |
| .\" a timeout or signal. |
| .IP |
| Typical values to specify for |
| .I val |
| are 0 or 1. |
| (Specifying |
| .BR INT_MAX |
| is not useful, because it would make the |
| .BR FUTEX_CMP_REQUEUE |
| operation equivalent to |
| .BR FUTEX_WAKE .) |
| The limit value specified via |
| .I val2 |
| is typically either 1 or |
| .BR INT_MAX . |
| (Specifying the argument as 0 is not useful, because it would make the |
| .BR FUTEX_CMP_REQUEUE |
| operation equivalent to |
| .BR FUTEX_WAIT .) |
| .IP |
| The |
| .B FUTEX_CMP_REQUEUE |
| operation was added as a replacement for the earlier |
| .BR FUTEX_REQUEUE . |
| The difference is that the check of the value at |
| .I uaddr |
| can be used to ensure that requeueing happens only under certain |
| conditions, which allows race conditions to be avoided in certain use cases. |
| .\" But, as Rich Felker points out, there remain valid use cases for |
| .\" FUTEX_REQUEUE, for example, when the calling thread is requeuing |
| .\" the target(s) to a lock that the calling thread owns |
| .\" From: Rich Felker <dalias@libc.org> |
| .\" Date: Wed, 29 Oct 2014 22:43:17 -0400 |
| .\" To: Darren Hart <dvhart@infradead.org> |
| .\" CC: libc-alpha@sourceware.org, ... |
| .\" Subject: Re: Add futex wrapper to glibc? |
| .IP |
| Both |
| .BR FUTEX_REQUEUE |
| and |
| .BR FUTEX_CMP_REQUEUE |
| can be used to avoid "thundering herd" wake-ups that could occur when using |
| .B FUTEX_WAKE |
| in cases where all of the waiters that are woken need to acquire |
| another futex. |
| Consider the following scenario, |
| where multiple waiter threads are waiting on B, |
| a wait queue implemented using a futex: |
| .IP |
| .in +4n |
| .EX |
| lock(A) |
| while (!check_value(V)) { |
| unlock(A); |
| block_on(B); |
| lock(A); |
| }; |
| unlock(A); |
| .EE |
| .in |
| .IP |
| If a waker thread used |
| .BR FUTEX_WAKE , |
| then all waiters waiting on B would be woken up, |
| and they would all try to acquire lock A. |
| However, waking all of the threads in this manner would be pointless because |
| all except one of the threads would immediately block on lock A again. |
| By contrast, a requeue operation wakes just one waiter and moves |
| the other waiters to lock A, |
| and when the woken waiter unlocks A then the next waiter can proceed. |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_WAKE_OP " (since Linux 2.6.14)" |
| .\" commit 4732efbeb997189d9f9b04708dc26bf8613ed721 |
| .\" Author: Jakub Jelinek <jakub@redhat.com> |
| .\" Date: Tue Sep 6 15:16:25 2005 -0700 |
| .\" FIXME. (Torvald) The glibc condvar implementation is currently being |
| .\" revised (e.g., to not use an internal lock anymore). |
| .\" It is probably more future-proof to remove this paragraph. |
| .\" [Torvald, do you have an update here?] |
| This operation was added to support some user-space use cases |
| where more than one futex must be handled at the same time. |
| The most notable example is the implementation of |
| .BR pthread_cond_signal (3), |
| which requires operations on two futexes, |
| the one used to implement the mutex and the one used in the implementation |
| of the wait queue associated with the condition variable. |
| .BR FUTEX_WAKE_OP |
| allows such cases to be implemented without leading to |
| high rates of contention and context switching. |
| .IP |
| The |
| .BR FUTEX_WAKE_OP |
| operation is equivalent to executing the following code atomically |
| and totally ordered with respect to other futex operations on |
| any of the two supplied futex words: |
| .IP |
| .in +4n |
| .EX |
| int oldval = *(int *) uaddr2; |
| *(int *) uaddr2 = oldval \fIop\fP \fIoparg\fP; |
| futex(uaddr, FUTEX_WAKE, val, 0, 0, 0); |
| if (oldval \fIcmp\fP \fIcmparg\fP) |
| futex(uaddr2, FUTEX_WAKE, val2, 0, 0, 0); |
| .EE |
| .in |
| .IP |
| In other words, |
| .BR FUTEX_WAKE_OP |
| does the following: |
| .RS |
| .IP * 3 |
| saves the original value of the futex word at |
| .IR uaddr2 |
| and performs an operation to modify the value of the futex at |
| .IR uaddr2 ; |
| this is an atomic read-modify-write memory access (i.e., using atomic |
| machine instructions of the respective architecture) |
| .IP * |
| wakes up a maximum of |
| .I val |
| waiters on the futex for the futex word at |
| .IR uaddr ; |
| and |
| .IP * |
| dependent on the results of a test of the original value of the |
| futex word at |
| .IR uaddr2 , |
| wakes up a maximum of |
| .I val2 |
| waiters on the futex for the futex word at |
| .IR uaddr2 . |
| .RE |
| .IP |
| The operation and comparison that are to be performed are encoded |
| in the bits of the argument |
| .IR val3 . |
| Pictorially, the encoding is: |
| .IP |
| .in +8n |
| .EX |
| +---+---+-----------+-----------+ |
| |op |cmp| oparg | cmparg | |
| +---+---+-----------+-----------+ |
| 4 4 12 12 <== # of bits |
| .EE |
| .in |
| .IP |
| Expressed in code, the encoding is: |
| .IP |
| .in +4n |
| .EX |
| #define FUTEX_OP(op, oparg, cmp, cmparg) \\ |
| (((op & 0xf) << 28) | \\ |
| ((cmp & 0xf) << 24) | \\ |
| ((oparg & 0xfff) << 12) | \\ |
| (cmparg & 0xfff)) |
| .EE |
| .in |
| .IP |
| In the above, |
| .I op |
| and |
| .I cmp |
| are each one of the codes listed below. |
| The |
| .I oparg |
| and |
| .I cmparg |
| components are literal numeric values, except as noted below. |
| .IP |
| The |
| .I op |
| component has one of the following values: |
| .IP |
| .in +4n |
| .EX |
| FUTEX_OP_SET 0 /* uaddr2 = oparg; */ |
| FUTEX_OP_ADD 1 /* uaddr2 += oparg; */ |
| FUTEX_OP_OR 2 /* uaddr2 |= oparg; */ |
| FUTEX_OP_ANDN 3 /* uaddr2 &= ~oparg; */ |
| FUTEX_OP_XOR 4 /* uaddr2 ^= oparg; */ |
| .EE |
| .in |
| .IP |
| In addition, bit-wise ORing the following value into |
| .I op |
| causes |
| .IR "(1\ <<\ oparg)" |
| to be used as the operand: |
| .IP |
| .in +4n |
| .EX |
| FUTEX_OP_ARG_SHIFT 8 /* Use (1 << oparg) as operand */ |
| .EE |
| .in |
| .IP |
| The |
| .I cmp |
| field is one of the following: |
| .IP |
| .in +4n |
| .EX |
| FUTEX_OP_CMP_EQ 0 /* if (oldval == cmparg) wake */ |
| FUTEX_OP_CMP_NE 1 /* if (oldval != cmparg) wake */ |
| FUTEX_OP_CMP_LT 2 /* if (oldval < cmparg) wake */ |
| FUTEX_OP_CMP_LE 3 /* if (oldval <= cmparg) wake */ |
| FUTEX_OP_CMP_GT 4 /* if (oldval > cmparg) wake */ |
| FUTEX_OP_CMP_GE 5 /* if (oldval >= cmparg) wake */ |
| .EE |
| .in |
| .IP |
| The return value of |
| .BR FUTEX_WAKE_OP |
| is the sum of the number of waiters woken on the futex |
| .IR uaddr |
| plus the number of waiters woken on the futex |
| .IR uaddr2 . |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_WAIT_BITSET " (since Linux 2.6.25)" |
| .\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d |
| This operation is like |
| .BR FUTEX_WAIT |
| except that |
| .I val3 |
| is used to provide a 32-bit bit mask to the kernel. |
| This bit mask, in which at least one bit must be set, |
| is stored in the kernel-internal state of the waiter. |
| See the description of |
| .BR FUTEX_WAKE_BITSET |
| for further details. |
| .IP |
| If |
| .I timeout |
| is not NULL, the structure it points to specifies |
| an absolute timeout for the wait operation. |
| If |
| .I timeout |
| is NULL, the operation can block indefinitely. |
| .IP |
| .IP |
| The |
| .I uaddr2 |
| argument is ignored. |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_WAKE_BITSET " (since Linux 2.6.25)" |
| .\" commit cd689985cf49f6ff5c8eddc48d98b9d581d9475d |
| This operation is the same as |
| .BR FUTEX_WAKE |
| except that the |
| .I val3 |
| argument is used to provide a 32-bit bit mask to the kernel. |
| This bit mask, in which at least one bit must be set, |
| is used to select which waiters should be woken up. |
| The selection is done by a bit-wise AND of the "wake" bit mask |
| (i.e., the value in |
| .IR val3 ) |
| and the bit mask which is stored in the kernel-internal |
| state of the waiter (the "wait" bit mask that is set using |
| .BR FUTEX_WAIT_BITSET ). |
| All of the waiters for which the result of the AND is nonzero are woken up; |
| the remaining waiters are left sleeping. |
| .IP |
| The effect of |
| .BR FUTEX_WAIT_BITSET |
| and |
| .BR FUTEX_WAKE_BITSET |
| is to allow selective wake-ups among multiple waiters that are blocked |
| on the same futex. |
| However, note that, depending on the use case, |
| employing this bit-mask multiplexing feature on a |
| futex can be less efficient than simply using multiple futexes, |
| because employing bit-mask multiplexing requires the kernel |
| to check all waiters on a futex, |
| including those that are not interested in being woken up |
| (i.e., they do not have the relevant bit set in their "wait" bit mask). |
| .\" According to http://locklessinc.com/articles/futex_cheat_sheet/: |
| .\" |
| .\" "The original reason for the addition of these extensions |
| .\" was to improve the performance of pthread read-write locks |
| .\" in glibc. However, the pthreads library no longer uses the |
| .\" same locking algorithm, and these extensions are not used |
| .\" without the bitset parameter being all ones. |
| .\" |
| .\" The page goes on to note that the FUTEX_WAIT_BITSET operation |
| .\" is nevertheless used (with a bit mask of all ones) in order to |
| .\" obtain the absolute timeout functionality that is useful |
| .\" for efficiently implementing Pthreads APIs (which use absolute |
| .\" timeouts); FUTEX_WAIT provides only relative timeouts. |
| .IP |
| The constant |
| .BR FUTEX_BITSET_MATCH_ANY , |
| which corresponds to all 32 bits set in the bit mask, can be used as the |
| .I val3 |
| argument for |
| .BR FUTEX_WAIT_BITSET |
| and |
| .BR FUTEX_WAKE_BITSET . |
| Other than differences in the handling of the |
| .I timeout |
| argument, the |
| .BR FUTEX_WAIT |
| operation is equivalent to |
| .BR FUTEX_WAIT_BITSET |
| with |
| .IR val3 |
| specified as |
| .BR FUTEX_BITSET_MATCH_ANY ; |
| that is, allow a wake-up by any waker. |
| The |
| .BR FUTEX_WAKE |
| operation is equivalent to |
| .BR FUTEX_WAKE_BITSET |
| with |
| .IR val3 |
| specified as |
| .BR FUTEX_BITSET_MATCH_ANY ; |
| that is, wake up any waiter(s). |
| .IP |
| The |
| .I uaddr2 |
| and |
| .I timeout |
| arguments are ignored. |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .SS Priority-inheritance futexes |
| Linux supports priority-inheritance (PI) futexes in order to handle |
| priority-inversion problems that can be encountered with |
| normal futex locks. |
| Priority inversion is the problem that occurs when a high-priority |
| task is blocked waiting to acquire a lock held by a low-priority task, |
| while tasks at an intermediate priority continuously preempt |
| the low-priority task from the CPU. |
| Consequently, the low-priority task makes no progress toward |
| releasing the lock, and the high-priority task remains blocked. |
| .PP |
| Priority inheritance is a mechanism for dealing with |
| the priority-inversion problem. |
| With this mechanism, when a high-priority task becomes blocked |
| by a lock held by a low-priority task, |
| the priority of the low-priority task is temporarily raised |
| to that of the high-priority task, |
| so that it is not preempted by any intermediate level tasks, |
| and can thus make progress toward releasing the lock. |
| To be effective, priority inheritance must be transitive, |
| meaning that if a high-priority task blocks on a lock |
| held by a lower-priority task that is itself blocked by a lock |
| held by another intermediate-priority task |
| (and so on, for chains of arbitrary length), |
| then both of those tasks |
| (or more generally, all of the tasks in a lock chain) |
| have their priorities raised to be the same as the high-priority task. |
| .PP |
| From a user-space perspective, |
| what makes a futex PI-aware is a policy agreement (described below) |
| between user space and the kernel about the value of the futex word, |
| coupled with the use of the PI-futex operations described below. |
| (Unlike the other futex operations described above, |
| the PI-futex operations are designed |
| for the implementation of very specific IPC mechanisms.) |
| .\" |
| .\" Quoting Darren Hart: |
| .\" These opcodes paired with the PI futex value policy (described below) |
| .\" defines a "futex" as PI aware. These were created very specifically |
| .\" in support of PI pthread_mutexes, so it makes a lot more sense to |
| .\" talk about a PI aware pthread_mutex, than a PI aware futex, since |
| .\" there is a lot of policy and scaffolding that has to be built up |
| .\" around it to use it properly (this is what a PI pthread_mutex is). |
| .PP |
| .\" mtk: The following text is drawn from the Hart/Guniguntala paper |
| .\" (listed in SEE ALSO), but I have reworded some pieces |
| .\" significantly. |
| .\" |
| The PI-futex operations described below differ from the other |
| futex operations in that they impose policy on the use of the value of the |
| futex word: |
| .IP * 3 |
| If the lock is not acquired, the futex word's value shall be 0. |
| .IP * |
| If the lock is acquired, the futex word's value shall |
| be the thread ID (TID; |
| see |
| .BR gettid (2)) |
| of the owning thread. |
| .IP * |
| If the lock is owned and there are threads contending for the lock, |
| then the |
| .B FUTEX_WAITERS |
| bit shall be set in the futex word's value; in other words, this value is: |
| .IP |
| FUTEX_WAITERS | TID |
| .IP |
| (Note that is invalid for a PI futex word to have no owner and |
| .BR FUTEX_WAITERS |
| set.) |
| .PP |
| With this policy in place, |
| a user-space application can acquire an unacquired |
| lock or release a lock using atomic instructions executed in user mode |
| (e.g., a compare-and-swap operation such as |
| .I cmpxchg |
| on the x86 architecture). |
| Acquiring a lock simply consists of using compare-and-swap to atomically |
| set the futex word's value to the caller's TID if its previous value was 0. |
| Releasing a lock requires using compare-and-swap to set the futex word's |
| value to 0 if the previous value was the expected TID. |
| .PP |
| If a futex is already acquired (i.e., has a nonzero value), |
| waiters must employ the |
| .B FUTEX_LOCK_PI |
| operation to acquire the lock. |
| If other threads are waiting for the lock, then the |
| .B FUTEX_WAITERS |
| bit is set in the futex value; |
| in this case, the lock owner must employ the |
| .B FUTEX_UNLOCK_PI |
| operation to release the lock. |
| .PP |
| In the cases where callers are forced into the kernel |
| (i.e., required to perform a |
| .BR futex () |
| call), |
| they then deal directly with a so-called RT-mutex, |
| a kernel locking mechanism which implements the required |
| priority-inheritance semantics. |
| After the RT-mutex is acquired, the futex value is updated accordingly, |
| before the calling thread returns to user space. |
| .PP |
| It is important to note |
| .\" tglx (July 2015): |
| .\" If there are multiple waiters on a pi futex then a wake pi operation |
| .\" will wake the first waiter and hand over the lock to this waiter. This |
| .\" includes handing over the rtmutex which represents the futex in the |
| .\" kernel. The strict requirement is that the futex owner and the rtmutex |
| .\" owner must be the same, except for the update period which is |
| .\" serialized by the futex internal locking. That means the kernel must |
| .\" update the user-space value prior to returning to user space |
| that the kernel will update the futex word's value prior |
| to returning to user space. |
| (This prevents the possibility of the futex word's value ending |
| up in an invalid state, such as having an owner but the value being 0, |
| or having waiters but not having the |
| .B FUTEX_WAITERS |
| bit set.) |
| .PP |
| If a futex has an associated RT-mutex in the kernel |
| (i.e., there are blocked waiters) |
| and the owner of the futex/RT-mutex dies unexpectedly, |
| then the kernel cleans up the RT-mutex and hands it over to the next waiter. |
| This in turn requires that the user-space value is updated accordingly. |
| To indicate that this is required, the kernel sets the |
| .B FUTEX_OWNER_DIED |
| bit in the futex word along with the thread ID of the new owner. |
| User space can detect this situation via the presence of the |
| .B FUTEX_OWNER_DIED |
| bit and is then responsible for cleaning up the stale state left over by |
| the dead owner. |
| .\" tglx (July 2015): |
| .\" The FUTEX_OWNER_DIED bit can also be set on uncontended futexes, where |
| .\" the kernel has no state associated. This happens via the robust futex |
| .\" mechanism. In that case the futex value will be set to |
| .\" FUTEX_OWNER_DIED. The robust futex mechanism is also available for non |
| .\" PI futexes. |
| .PP |
| PI futexes are operated on by specifying one of the values listed below in |
| .IR futex_op . |
| Note that the PI futex operations must be used as paired operations |
| and are subject to some additional requirements: |
| .IP * 3 |
| .B FUTEX_LOCK_PI |
| and |
| .B FUTEX_TRYLOCK_PI |
| pair with |
| .BR FUTEX_UNLOCK_PI. |
| .B FUTEX_UNLOCK_PI |
| must be called only on a futex owned by the calling thread, |
| as defined by the value policy, otherwise the error |
| .B EPERM |
| results. |
| .IP * |
| .B FUTEX_WAIT_REQUEUE_PI |
| pairs with |
| .BR FUTEX_CMP_REQUEUE_PI . |
| This must be performed from a non-PI futex to a distinct PI futex |
| (or the error |
| .B EINVAL |
| results). |
| Additionally, |
| .I val |
| (the number of waiters to be woken) must be 1 |
| (or the error |
| .B EINVAL |
| results). |
| .PP |
| The PI futex operations are as follows: |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_LOCK_PI " (since Linux 2.6.18)" |
| .\" commit c87e2837be82df479a6bae9f155c43516d2feebc |
| This operation is used after an attempt to acquire |
| the lock via an atomic user-mode instruction failed |
| because the futex word has a nonzero value\(emspecifically, |
| because it contained the (PID-namespace-specific) TID of the lock owner. |
| .IP |
| The operation checks the value of the futex word at the address |
| .IR uaddr . |
| If the value is 0, then the kernel tries to atomically set |
| the futex value to the caller's TID. |
| If the futex word's value is nonzero, |
| the kernel atomically sets the |
| .B FUTEX_WAITERS |
| bit, which signals the futex owner that it cannot unlock the futex in |
| user space atomically by setting the futex value to 0. |
| .\" tglx (July 2015): |
| .\" The operation here is similar to the FUTEX_WAIT logic. When the user |
| .\" space atomic acquire does not succeed because the futex value was non |
| .\" zero, then the waiter goes into the kernel, takes the kernel internal |
| .\" lock and retries the acquisition under the lock. If the acquisition |
| .\" does not succeed either, then it sets the FUTEX_WAITERS bit, to signal |
| .\" the lock owner that it needs to go into the kernel. Here is the pseudo |
| .\" code: |
| .\" |
| .\" lock(kernel_lock); |
| .\" retry: |
| .\" |
| .\" /* |
| .\" * Owner might have unlocked in userspace before we |
| .\" * were able to set the waiter bit. |
| .\" */ |
| .\" if (atomic_acquire(futex) == SUCCESS) { |
| .\" unlock(kernel_lock()); |
| .\" return 0; |
| .\" } |
| .\" |
| .\" /* |
| .\" * Owner might have unlocked after the above atomic_acquire() |
| .\" * attempt. |
| .\" */ |
| .\" if (atomic_set_waiters_bit(futex) != SUCCESS) |
| .\" goto retry; |
| .\" |
| .\" queue_waiter(); |
| .\" unlock(kernel_lock); |
| .\" block(); |
| .\" |
| After that, the kernel: |
| .RS |
| .IP 1. 3 |
| Tries to find the thread which is associated with the owner TID. |
| .IP 2. |
| Creates or reuses kernel state on behalf of the owner. |
| (If this is the first waiter, there is no kernel state for this |
| futex, so kernel state is created by locking the RT-mutex |
| and the futex owner is made the owner of the RT-mutex. |
| If there are existing waiters, then the existing state is reused.) |
| .IP 3. |
| Attaches the waiter to the futex |
| (i.e., the waiter is enqueued on the RT-mutex waiter list). |
| .RE |
| .IP |
| If more than one waiter exists, |
| the enqueueing of the waiter is in descending priority order. |
| (For information on priority ordering, see the discussion of the |
| .BR SCHED_DEADLINE , |
| .BR SCHED_FIFO , |
| and |
| .BR SCHED_RR |
| scheduling policies in |
| .BR sched (7).) |
| The owner inherits either the waiter's CPU bandwidth |
| (if the waiter is scheduled under the |
| .BR SCHED_DEADLINE |
| policy) or the waiter's priority (if the waiter is scheduled under the |
| .BR SCHED_RR |
| or |
| .BR SCHED_FIFO |
| policy). |
| .\" August 2015: |
| .\" mtk: If the realm is restricted purely to SCHED_OTHER (SCHED_NORMAL) |
| .\" processes, does the nice value come into play also? |
| .\" |
| .\" tglx: No. SCHED_OTHER/NORMAL tasks are handled in FIFO order |
| This inheritance follows the lock chain in the case of nested locking |
| .\" (i.e., task 1 blocks on lock A, held by task 2, |
| .\" while task 2 blocks on lock B, held by task 3) |
| and performs deadlock detection. |
| .IP |
| The |
| .I timeout |
| argument provides a timeout for the lock attempt. |
| If |
| .I timeout |
| is not NULL, the structure it points to specifies |
| an absolute timeout, measured against the |
| .BR CLOCK_REALTIME |
| clock. |
| .\" 2016-07-07 response from Thomas Gleixner on LKML: |
| .\" From: Thomas Gleixner <tglx@linutronix.de> |
| .\" Date: 6 July 2016 at 20:57 |
| .\" Subject: Re: futex: Allow FUTEX_CLOCK_REALTIME with FUTEX_WAIT op |
| .\" |
| .\" On Thu, 23 Jun 2016, Michael Kerrisk (man-pages) wrote: |
| .\" > On 06/23/2016 08:28 PM, Darren Hart wrote: |
| .\" > > And as a follow-on, what is the reason for FUTEX_LOCK_PI only using |
| .\" > > CLOCK_REALTIME? It seems reasonable to me that a user may want to wait a |
| .\" > > specific amount of time, regardless of wall time. |
| .\" > |
| .\" > Yes, that's another weird inconsistency. |
| .\" |
| .\" The reason is that phtread_mutex_timedlock() uses absolute timeouts based on |
| .\" CLOCK_REALTIME. glibc folks asked to make that the default behaviour back |
| .\" then when we added LOCK_PI. |
| If |
| .I timeout |
| is NULL, the operation will block indefinitely. |
| .IP |
| The |
| .IR uaddr2 , |
| .IR val , |
| and |
| .IR val3 |
| arguments are ignored. |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_TRYLOCK_PI " (since Linux 2.6.18)" |
| .\" commit c87e2837be82df479a6bae9f155c43516d2feebc |
| This operation tries to acquire the lock at |
| .IR uaddr . |
| It is invoked when a user-space atomic acquire did not |
| succeed because the futex word was not 0. |
| .IP |
| Because the kernel has access to more state information than user space, |
| acquisition of the lock might succeed if performed by the |
| kernel in cases where the futex word |
| (i.e., the state information accessible to use-space) contains stale state |
| .RB ( FUTEX_WAITERS |
| and/or |
| .BR FUTEX_OWNER_DIED ). |
| This can happen when the owner of the futex died. |
| User space cannot handle this condition in a race-free manner, |
| but the kernel can fix this up and acquire the futex. |
| .\" Paraphrasing a f2f conversation with Thomas Gleixner about the |
| .\" above point (Aug 2015): ### |
| .\" There is a rare possibility of a race condition involving an |
| .\" uncontended futex with no owner, but with waiters. The |
| .\" kernel-user-space contract is that if a futex is nonzero, you must |
| .\" go into kernel. The futex was owned by a task, and that task dies |
| .\" but there are no waiters, so the futex value is non zero. |
| .\" Therefore, the next locker has to go into the kernel, |
| .\" so that the kernel has a chance to clean up. (CMXCH on zero |
| .\" in user space would fail, so kernel has to clean up.) |
| .\" Darren Hart (Oct 2015): |
| .\" The trylock in the kernel has more state, so it can independently |
| .\" verify the flags that userspace must trust implicitly. |
| .IP |
| The |
| .IR uaddr2 , |
| .IR val , |
| .IR timeout , |
| and |
| .IR val3 |
| arguments are ignored. |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_UNLOCK_PI " (since Linux 2.6.18)" |
| .\" commit c87e2837be82df479a6bae9f155c43516d2feebc |
| This operation wakes the top priority waiter that is waiting in |
| .B FUTEX_LOCK_PI |
| on the futex address provided by the |
| .I uaddr |
| argument. |
| .IP |
| This is called when the user-space value at |
| .I uaddr |
| cannot be changed atomically from a TID (of the owner) to 0. |
| .IP |
| The |
| .IR uaddr2 , |
| .IR val , |
| .IR timeout , |
| and |
| .IR val3 |
| arguments are ignored. |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_CMP_REQUEUE_PI " (since Linux 2.6.31)" |
| .\" commit 52400ba946759af28442dee6265c5c0180ac7122 |
| This operation is a PI-aware variant of |
| .BR FUTEX_CMP_REQUEUE . |
| It requeues waiters that are blocked via |
| .B FUTEX_WAIT_REQUEUE_PI |
| on |
| .I uaddr |
| from a non-PI source futex |
| .RI ( uaddr ) |
| to a PI target futex |
| .RI ( uaddr2 ). |
| .IP |
| As with |
| .BR FUTEX_CMP_REQUEUE , |
| this operation wakes up a maximum of |
| .I val |
| waiters that are waiting on the futex at |
| .IR uaddr . |
| However, for |
| .BR FUTEX_CMP_REQUEUE_PI , |
| .I val |
| is required to be 1 |
| (since the main point is to avoid a thundering herd). |
| The remaining waiters are removed from the wait queue of the source futex at |
| .I uaddr |
| and added to the wait queue of the target futex at |
| .IR uaddr2 . |
| .IP |
| The |
| .I val2 |
| .\" val2 is the cap on the number of requeued waiters. |
| .\" In the glibc pthread_cond_broadcast() implementation, this argument |
| .\" is specified as INT_MAX, and for pthread_cond_signal() it is 0. |
| and |
| .I val3 |
| arguments serve the same purposes as for |
| .BR FUTEX_CMP_REQUEUE . |
| .\" |
| .\" The page at http://locklessinc.com/articles/futex_cheat_sheet/ |
| .\" notes that "priority-inheritance Futex to priority-inheritance |
| .\" Futex requeues are currently unsupported". However, probably |
| .\" the page does not need to say nothing about this, since |
| .\" Thomas Gleixner commented (July 2015): "they never will be |
| .\" supported because they make no sense at all" |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .TP |
| .BR FUTEX_WAIT_REQUEUE_PI " (since Linux 2.6.31)" |
| .\" commit 52400ba946759af28442dee6265c5c0180ac7122 |
| .\" |
| Wait on a non-PI futex at |
| .I uaddr |
| and potentially be requeued (via a |
| .BR FUTEX_CMP_REQUEUE_PI |
| operation in another task) onto a PI futex at |
| .IR uaddr2 . |
| The wait operation on |
| .I uaddr |
| is the same as for |
| .BR FUTEX_WAIT . |
| .IP |
| The waiter can be removed from the wait on |
| .I uaddr |
| without requeueing on |
| .IR uaddr2 |
| via a |
| .BR FUTEX_WAKE |
| operation in another task. |
| In this case, the |
| .BR FUTEX_WAIT_REQUEUE_PI |
| operation fails with the error |
| .BR EAGAIN . |
| .IP |
| If |
| .I timeout |
| is not NULL, the structure it points to specifies |
| an absolute timeout for the wait operation. |
| If |
| .I timeout |
| is NULL, the operation can block indefinitely. |
| .IP |
| The |
| .I val3 |
| argument is ignored. |
| .IP |
| The |
| .BR FUTEX_WAIT_REQUEUE_PI |
| and |
| .BR FUTEX_CMP_REQUEUE_PI |
| were added to support a fairly specific use case: |
| support for priority-inheritance-aware POSIX threads condition variables. |
| The idea is that these operations should always be paired, |
| in order to ensure that user space and the kernel remain in sync. |
| Thus, in the |
| .BR FUTEX_WAIT_REQUEUE_PI |
| operation, the user-space application pre-specifies the target |
| of the requeue that takes place in the |
| .BR FUTEX_CMP_REQUEUE_PI |
| operation. |
| .\" |
| .\" Darren Hart notes that a patch to allow glibc to fully support |
| .\" PI-aware pthreads condition variables has not yet been accepted into |
| .\" glibc. The story is complex, and can be found at |
| .\" https://sourceware.org/bugzilla/show_bug.cgi?id=11588 |
| .\" Darren notes that in the meantime, the patch is shipped with various |
| .\" PREEMPT_RT-enabled Linux systems. |
| .\" |
| .\" Related to the preceding, Darren proposed that somewhere, man-pages |
| .\" should document the following point: |
| .\" |
| .\" While the Linux kernel, since 2.6.31, supports requeueing of |
| .\" priority-inheritance (PI) aware mutexes via the |
| .\" FUTEX_WAIT_REQUEUE_PI and FUTEX_CMP_REQUEUE_PI futex operations, |
| .\" the glibc implementation does not yet take full advantage of this. |
| .\" Specifically, the condvar internal data lock remains a non-PI aware |
| .\" mutex, regardless of the type of the pthread_mutex associated with |
| .\" the condvar. This can lead to an unbounded priority inversion on |
| .\" the internal data lock even when associating a PI aware |
| .\" pthread_mutex with a condvar during a pthread_cond*_wait |
| .\" operation. For this reason, it is not recommended to rely on |
| .\" priority inheritance when using pthread condition variables. |
| .\" |
| .\" The problem is that the obvious location for this text is |
| .\" the pthread_cond*wait(3) man page. However, such a man page |
| .\" does not currently exist. |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .SH RETURN VALUE |
| .PP |
| In the event of an error (and assuming that |
| .BR futex () |
| was invoked via |
| .BR syscall (2)), |
| all operations return \-1 and set |
| .I errno |
| to indicate the cause of the error. |
| .PP |
| The return value on success depends on the operation, |
| as described in the following list: |
| .TP |
| .B FUTEX_WAIT |
| Returns 0 if the caller was woken up. |
| Note that a wake-up can also be caused by common futex usage patterns |
| in unrelated code that happened to have previously used the futex word's |
| memory location (e.g., typical futex-based implementations of |
| Pthreads mutexes can cause this under some conditions). |
| Therefore, callers should always conservatively assume that a return |
| value of 0 can mean a spurious wake-up, and use the futex word's value |
| (i.e., the user-space synchronization scheme) |
| to decide whether to continue to block or not. |
| .TP |
| .B FUTEX_WAKE |
| Returns the number of waiters that were woken up. |
| .TP |
| .B FUTEX_FD |
| Returns the new file descriptor associated with the futex. |
| .TP |
| .B FUTEX_REQUEUE |
| Returns the number of waiters that were woken up. |
| .TP |
| .B FUTEX_CMP_REQUEUE |
| Returns the total number of waiters that were woken up or |
| requeued to the futex for the futex word at |
| .IR uaddr2 . |
| If this value is greater than |
| .IR val , |
| then the difference is the number of waiters requeued to the futex for the |
| futex word at |
| .IR uaddr2 . |
| .TP |
| .B FUTEX_WAKE_OP |
| Returns the total number of waiters that were woken up. |
| This is the sum of the woken waiters on the two futexes for |
| the futex words at |
| .I uaddr |
| and |
| .IR uaddr2 . |
| .TP |
| .B FUTEX_WAIT_BITSET |
| Returns 0 if the caller was woken up. |
| See |
| .B FUTEX_WAIT |
| for how to interpret this correctly in practice. |
| .TP |
| .B FUTEX_WAKE_BITSET |
| Returns the number of waiters that were woken up. |
| .TP |
| .B FUTEX_LOCK_PI |
| Returns 0 if the futex was successfully locked. |
| .TP |
| .B FUTEX_TRYLOCK_PI |
| Returns 0 if the futex was successfully locked. |
| .TP |
| .B FUTEX_UNLOCK_PI |
| Returns 0 if the futex was successfully unlocked. |
| .TP |
| .B FUTEX_CMP_REQUEUE_PI |
| Returns the total number of waiters that were woken up or |
| requeued to the futex for the futex word at |
| .IR uaddr2 . |
| If this value is greater than |
| .IR val , |
| then difference is the number of waiters requeued to the futex for |
| the futex word at |
| .IR uaddr2 . |
| .TP |
| .B FUTEX_WAIT_REQUEUE_PI |
| Returns 0 if the caller was successfully requeued to the futex for |
| the futex word at |
| .IR uaddr2 . |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .SH ERRORS |
| .TP |
| .B EACCES |
| No read access to the memory of a futex word. |
| .TP |
| .B EAGAIN |
| .RB ( FUTEX_WAIT , |
| .BR FUTEX_WAIT_BITSET , |
| .BR FUTEX_WAIT_REQUEUE_PI ) |
| The value pointed to by |
| .I uaddr |
| was not equal to the expected value |
| .I val |
| at the time of the call. |
| .IP |
| .BR Note : |
| on Linux, the symbolic names |
| .B EAGAIN |
| and |
| .B EWOULDBLOCK |
| (both of which appear in different parts of the kernel futex code) |
| have the same value. |
| .TP |
| .B EAGAIN |
| .RB ( FUTEX_CMP_REQUEUE , |
| .BR FUTEX_CMP_REQUEUE_PI ) |
| The value pointed to by |
| .I uaddr |
| is not equal to the expected value |
| .IR val3 . |
| .TP |
| .BR EAGAIN |
| .RB ( FUTEX_LOCK_PI , |
| .BR FUTEX_TRYLOCK_PI , |
| .BR FUTEX_CMP_REQUEUE_PI ) |
| The futex owner thread ID of |
| .I uaddr |
| (for |
| .BR FUTEX_CMP_REQUEUE_PI : |
| .IR uaddr2 ) |
| is about to exit, |
| but has not yet handled the internal state cleanup. |
| Try again. |
| .TP |
| .BR EDEADLK |
| .RB ( FUTEX_LOCK_PI , |
| .BR FUTEX_TRYLOCK_PI , |
| .BR FUTEX_CMP_REQUEUE_PI ) |
| The futex word at |
| .I uaddr |
| is already locked by the caller. |
| .TP |
| .BR EDEADLK |
| .\" FIXME . I see that kernel/locking/rtmutex.c uses EDEADLK in some |
| .\" places, and EDEADLOCK in others. On almost all architectures |
| .\" these constants are synonymous. Is there a reason that both |
| .\" names are used? |
| .\" |
| .\" tglx (July 2015): "No. We should probably fix that." |
| .\" |
| .RB ( FUTEX_CMP_REQUEUE_PI ) |
| While requeueing a waiter to the PI futex for the futex word at |
| .IR uaddr2 , |
| the kernel detected a deadlock. |
| .TP |
| .B EFAULT |
| A required pointer argument (i.e., |
| .IR uaddr , |
| .IR uaddr2 , |
| or |
| .IR timeout ) |
| did not point to a valid user-space address. |
| .TP |
| .B EINTR |
| A |
| .B FUTEX_WAIT |
| or |
| .B FUTEX_WAIT_BITSET |
| operation was interrupted by a signal (see |
| .BR signal (7)). |
| In kernels before Linux 2.6.22, this error could also be returned for |
| a spurious wakeup; since Linux 2.6.22, this no longer happens. |
| .TP |
| .B EINVAL |
| The operation in |
| .IR futex_op |
| is one of those that employs a timeout, but the supplied |
| .I timeout |
| argument was invalid |
| .RI ( tv_sec |
| was less than zero, or |
| .IR tv_nsec |
| was not less than 1,000,000,000). |
| .TP |
| .B EINVAL |
| The operation specified in |
| .IR futex_op |
| employs one or both of the pointers |
| .I uaddr |
| and |
| .IR uaddr2 , |
| but one of these does not point to a valid object\(emthat is, |
| the address is not four-byte-aligned. |
| .TP |
| .B EINVAL |
| .RB ( FUTEX_WAIT_BITSET , |
| .BR FUTEX_WAKE_BITSET ) |
| The bit mask supplied in |
| .IR val3 |
| is zero. |
| .TP |
| .B EINVAL |
| .RB ( FUTEX_CMP_REQUEUE_PI ) |
| .I uaddr |
| equals |
| .IR uaddr2 |
| (i.e., an attempt was made to requeue to the same futex). |
| .TP |
| .BR EINVAL |
| .RB ( FUTEX_FD ) |
| The signal number supplied in |
| .I val |
| is invalid. |
| .TP |
| .B EINVAL |
| .RB ( FUTEX_WAKE , |
| .BR FUTEX_WAKE_OP , |
| .BR FUTEX_WAKE_BITSET , |
| .BR FUTEX_REQUEUE , |
| .BR FUTEX_CMP_REQUEUE ) |
| The kernel detected an inconsistency between the user-space state at |
| .I uaddr |
| and the kernel state\(emthat is, it detected a waiter which waits in |
| .BR FUTEX_LOCK_PI |
| on |
| .IR uaddr . |
| .TP |
| .B EINVAL |
| .RB ( FUTEX_LOCK_PI , |
| .BR FUTEX_TRYLOCK_PI , |
| .BR FUTEX_UNLOCK_PI ) |
| The kernel detected an inconsistency between the user-space state at |
| .I uaddr |
| and the kernel state. |
| This indicates either state corruption |
| or that the kernel found a waiter on |
| .I uaddr |
| which is waiting via |
| .BR FUTEX_WAIT |
| or |
| .BR FUTEX_WAIT_BITSET . |
| .TP |
| .B EINVAL |
| .RB ( FUTEX_CMP_REQUEUE_PI ) |
| The kernel detected an inconsistency between the user-space state at |
| .I uaddr2 |
| and the kernel state; |
| .\" From a conversation with Thomas Gleixner (Aug 2015): ### |
| .\" The kernel sees: I have non PI state for a futex you tried to |
| .\" tell me was PI |
| that is, the kernel detected a waiter which waits via |
| .BR FUTEX_WAIT |
| or |
| .BR FUTEX_WAIT_BITSET |
| on |
| .IR uaddr2 . |
| .TP |
| .B EINVAL |
| .RB ( FUTEX_CMP_REQUEUE_PI ) |
| The kernel detected an inconsistency between the user-space state at |
| .I uaddr |
| and the kernel state; |
| that is, the kernel detected a waiter which waits via |
| .BR FUTEX_WAIT |
| or |
| .BR FUTEX_WAIT_BITESET |
| on |
| .IR uaddr . |
| .TP |
| .B EINVAL |
| .RB ( FUTEX_CMP_REQUEUE_PI ) |
| The kernel detected an inconsistency between the user-space state at |
| .I uaddr |
| and the kernel state; |
| that is, the kernel detected a waiter which waits on |
| .I uaddr |
| via |
| .BR FUTEX_LOCK_PI |
| (instead of |
| .BR FUTEX_WAIT_REQUEUE_PI ). |
| .TP |
| .B EINVAL |
| .RB ( FUTEX_CMP_REQUEUE_PI ) |
| .\" This deals with the case: |
| .\" wait_requeue_pi(A, B); |
| .\" requeue_pi(A, C); |
| An attempt was made to requeue a waiter to a futex other than that |
| specified by the matching |
| .B FUTEX_WAIT_REQUEUE_PI |
| call for that waiter. |
| .TP |
| .B EINVAL |
| .RB ( FUTEX_CMP_REQUEUE_PI ) |
| The |
| .I val |
| argument is not 1. |
| .TP |
| .B EINVAL |
| Invalid argument. |
| .TP |
| .B ENFILE |
| .RB ( FUTEX_FD ) |
| The system-wide limit on the total number of open files has been reached. |
| .TP |
| .BR ENOMEM |
| .RB ( FUTEX_LOCK_PI , |
| .BR FUTEX_TRYLOCK_PI , |
| .BR FUTEX_CMP_REQUEUE_PI ) |
| The kernel could not allocate memory to hold state information. |
| .TP |
| .B ENOSYS |
| Invalid operation specified in |
| .IR futex_op . |
| .TP |
| .B ENOSYS |
| The |
| .BR FUTEX_CLOCK_REALTIME |
| option was specified in |
| .IR futex_op , |
| but the accompanying operation was neither |
| .BR FUTEX_WAIT , |
| .BR FUTEX_WAIT_BITSET , |
| nor |
| .BR FUTEX_WAIT_REQUEUE_PI . |
| .TP |
| .BR ENOSYS |
| .RB ( FUTEX_LOCK_PI , |
| .BR FUTEX_TRYLOCK_PI , |
| .BR FUTEX_UNLOCK_PI , |
| .BR FUTEX_CMP_REQUEUE_PI , |
| .BR FUTEX_WAIT_REQUEUE_PI ) |
| A run-time check determined that the operation is not available. |
| The PI-futex operations are not implemented on all architectures and |
| are not supported on some CPU variants. |
| .TP |
| .BR EPERM |
| .RB ( FUTEX_LOCK_PI , |
| .BR FUTEX_TRYLOCK_PI , |
| .BR FUTEX_CMP_REQUEUE_PI ) |
| The caller is not allowed to attach itself to the futex at |
| .I uaddr |
| (for |
| .BR FUTEX_CMP_REQUEUE_PI : |
| the futex at |
| .IR uaddr2 ). |
| (This may be caused by a state corruption in user space.) |
| .TP |
| .BR EPERM |
| .RB ( FUTEX_UNLOCK_PI ) |
| The caller does not own the lock represented by the futex word. |
| .TP |
| .BR ESRCH |
| .RB ( FUTEX_LOCK_PI , |
| .BR FUTEX_TRYLOCK_PI , |
| .BR FUTEX_CMP_REQUEUE_PI ) |
| The thread ID in the futex word at |
| .I uaddr |
| does not exist. |
| .TP |
| .BR ESRCH |
| .RB ( FUTEX_CMP_REQUEUE_PI ) |
| The thread ID in the futex word at |
| .I uaddr2 |
| does not exist. |
| .TP |
| .B ETIMEDOUT |
| The operation in |
| .IR futex_op |
| employed the timeout specified in |
| .IR timeout , |
| and the timeout expired before the operation completed. |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .SH VERSIONS |
| .PP |
| Futexes were first made available in a stable kernel release |
| with Linux 2.6.0. |
| .PP |
| Initial futex support was merged in Linux 2.5.7 but with different |
| semantics from what was described above. |
| A four-argument system call with the semantics |
| described in this page was introduced in Linux 2.5.40. |
| A fifth argument was added in Linux 2.5.70, |
| and a sixth argument was added in Linux 2.6.7. |
| .SH CONFORMING TO |
| This system call is Linux-specific. |
| .SH NOTES |
| Glibc does not provide a wrapper for this system call; call it using |
| .BR syscall (2). |
| .PP |
| Several higher-level programming abstractions are implemented via futexes, |
| including POSIX semaphores and |
| various POSIX threads synchronization mechanisms |
| (mutexes, condition variables, read-write locks, and barriers). |
| .\" TODO FIXME(Torvald) Above, we cite this section and claim it contains |
| .\" details on the synchronization semantics; add the C11 equivalents |
| .\" here (or whatever we find consensus for). |
| .\" |
| .\"""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""""" |
| .\" |
| .SH EXAMPLE |
| The program below demonstrates use of futexes in a program where a parent |
| process and a child process use a pair of futexes located inside a |
| shared anonymous mapping to synchronize access to a shared resource: |
| the terminal. |
| The two processes each write |
| .IR nloops |
| (a command-line argument that defaults to 5 if omitted) |
| messages to the terminal and employ a synchronization protocol |
| that ensures that they alternate in writing messages. |
| Upon running this program we see output such as the following: |
| .PP |
| .in +4n |
| .EX |
| $ \fB./futex_demo\fP |
| Parent (18534) 0 |
| Child (18535) 0 |
| Parent (18534) 1 |
| Child (18535) 1 |
| Parent (18534) 2 |
| Child (18535) 2 |
| Parent (18534) 3 |
| Child (18535) 3 |
| Parent (18534) 4 |
| Child (18535) 4 |
| .EE |
| .in |
| .SS Program source |
| \& |
| .EX |
| /* futex_demo.c |
| |
| Usage: futex_demo [nloops] |
| (Default: 5) |
| |
| Demonstrate the use of futexes in a program where parent and child |
| use a pair of futexes located inside a shared anonymous mapping to |
| synchronize access to a shared resource: the terminal. The two |
| processes each write \(aqnum\-loops\(aq messages to the terminal and employ |
| a synchronization protocol that ensures that they alternate in |
| writing messages. |
| */ |
| #define _GNU_SOURCE |
| #include <stdio.h> |
| #include <errno.h> |
| #include <stdlib.h> |
| #include <unistd.h> |
| #include <sys/wait.h> |
| #include <sys/mman.h> |
| #include <sys/syscall.h> |
| #include <linux/futex.h> |
| #include <sys/time.h> |
| |
| #define errExit(msg) do { perror(msg); exit(EXIT_FAILURE); \\ |
| } while (0) |
| |
| static int *futex1, *futex2, *iaddr; |
| |
| static int |
| futex(int *uaddr, int futex_op, int val, |
| const struct timespec *timeout, int *uaddr2, int val3) |
| { |
| return syscall(SYS_futex, uaddr, futex_op, val, |
| timeout, uaddr, val3); |
| } |
| |
| /* Acquire the futex pointed to by \(aqfutexp\(aq: wait for its value to |
| become 1, and then set the value to 0. */ |
| |
| static void |
| fwait(int *futexp) |
| { |
| int s; |
| |
| /* __sync_bool_compare_and_swap(ptr, oldval, newval) is a gcc |
| built\-in function. It atomically performs the equivalent of: |
| |
| if (*ptr == oldval) |
| *ptr = newval; |
| |
| It returns true if the test yielded true and *ptr was updated. |
| The alternative here would be to employ the equivalent atomic |
| machine\-language instructions. For further information, see |
| the GCC Manual. */ |
| |
| while (1) { |
| |
| /* Is the futex available? */ |
| |
| if (__sync_bool_compare_and_swap(futexp, 1, 0)) |
| break; /* Yes */ |
| |
| /* Futex is not available; wait */ |
| |
| s = futex(futexp, FUTEX_WAIT, 0, NULL, NULL, 0); |
| if (s == \-1 && errno != EAGAIN) |
| errExit("futex\-FUTEX_WAIT"); |
| } |
| } |
| |
| /* Release the futex pointed to by \(aqfutexp\(aq: if the futex currently |
| has the value 0, set its value to 1 and the wake any futex waiters, |
| so that if the peer is blocked in fpost(), it can proceed. */ |
| |
| static void |
| fpost(int *futexp) |
| { |
| int s; |
| |
| /* __sync_bool_compare_and_swap() was described in comments above */ |
| |
| if (__sync_bool_compare_and_swap(futexp, 0, 1)) { |
| |
| s = futex(futexp, FUTEX_WAKE, 1, NULL, NULL, 0); |
| if (s == \-1) |
| errExit("futex\-FUTEX_WAKE"); |
| } |
| } |
| |
| int |
| main(int argc, char *argv[]) |
| { |
| pid_t childPid; |
| int j, nloops; |
| |
| setbuf(stdout, NULL); |
| |
| nloops = (argc > 1) ? atoi(argv[1]) : 5; |
| |
| /* Create a shared anonymous mapping that will hold the futexes. |
| Since the futexes are being shared between processes, we |
| subsequently use the "shared" futex operations (i.e., not the |
| ones suffixed "_PRIVATE") */ |
| |
| iaddr = mmap(NULL, sizeof(int) * 2, PROT_READ | PROT_WRITE, |
| MAP_ANONYMOUS | MAP_SHARED, \-1, 0); |
| if (iaddr == MAP_FAILED) |
| errExit("mmap"); |
| |
| futex1 = &iaddr[0]; |
| futex2 = &iaddr[1]; |
| |
| *futex1 = 0; /* State: unavailable */ |
| *futex2 = 1; /* State: available */ |
| |
| /* Create a child process that inherits the shared anonymous |
| mapping */ |
| |
| childPid = fork(); |
| if (childPid == \-1) |
| errExit("fork"); |
| |
| if (childPid == 0) { /* Child */ |
| for (j = 0; j < nloops; j++) { |
| fwait(futex1); |
| printf("Child (%ld) %d\\n", (long) getpid(), j); |
| fpost(futex2); |
| } |
| |
| exit(EXIT_SUCCESS); |
| } |
| |
| /* Parent falls through to here */ |
| |
| for (j = 0; j < nloops; j++) { |
| fwait(futex2); |
| printf("Parent (%ld) %d\\n", (long) getpid(), j); |
| fpost(futex1); |
| } |
| |
| wait(NULL); |
| |
| exit(EXIT_SUCCESS); |
| } |
| .EE |
| .SH SEE ALSO |
| .ad l |
| .BR get_robust_list (2), |
| .BR restart_syscall (2), |
| .BR pthread_mutexattr_getprotocol (3), |
| .BR futex (7), |
| .BR sched (7) |
| .PP |
| The following kernel source files: |
| .IP * 2 |
| .I Documentation/pi-futex.txt |
| .IP * |
| .I Documentation/futex-requeue-pi.txt |
| .IP * |
| .I Documentation/locking/rt-mutex.txt |
| .IP * |
| .I Documentation/locking/rt-mutex-design.txt |
| .IP * |
| .I Documentation/robust-futex-ABI.txt |
| .PP |
| Franke, H., Russell, R., and Kirwood, M., 2002. |
| \fIFuss, Futexes and Furwocks: Fast Userlevel Locking in Linux\fP |
| (from proceedings of the Ottawa Linux Symposium 2002), |
| .br |
| .UR http://kernel.org\:/doc\:/ols\:/2002\:/ols2002\-pages\-479\-495.pdf |
| .UE |
| .PP |
| Hart, D., 2009. \fIA futex overview and update\fP, |
| .UR http://lwn.net/Articles/360699/ |
| .UE |
| .PP |
| Hart, D. and Guniguntala, D., 2009. |
| \fIRequeue-PI: Making Glibc Condvars PI-Aware\fP |
| (from proceedings of the 2009 Real-Time Linux Workshop), |
| .UR http://lwn.net/images/conf/rtlws11/papers/proc/p10.pdf |
| .UE |
| .PP |
| Drepper, U., 2011. \fIFutexes Are Tricky\fP, |
| .UR http://www.akkadia.org/drepper/futex.pdf |
| .UE |
| .PP |
| Futex example library, futex-*.tar.bz2 at |
| .br |
| .UR ftp://ftp.kernel.org\:/pub\:/linux\:/kernel\:/people\:/rusty/ |
| .UE |
| .\" |
| .\" FIXME(Torvald) We should probably refer to the glibc code here, in |
| .\" particular the glibc-internal futex wrapper functions that are |
| .\" WIP, and the generic pthread_mutex_t and perhaps condvar |
| .\" implementations. |