futex: Add support for private attached futexes

This patch adds support for the futex OP FUTEX_ATTACHED which cannot
be used together with FLAGS_SHARED: it is limited to private FUTEXes.
The FUTEX_ATTACHED flag cannot be made the default because it changes
the API/ABI.

ATTACHED futex, usage howto:
- before usage it needs to be `attached' to initialize the in-kernel
  state.
  cookie = sys_futex(&mutex->__data.__lock,
		     FUTEX_ATTACH | FUTEX_PRIVATE_FLAG,
		     0, 0, 0, 0);
  The return value is <0 on error. A value >= 0 is the `cookie' which
  must be used for further operations.

- any operation on this FUTEX should use the `cookie', for example the
  LOCK_PI operation:
  ret = sys_futex((void *)(unsigned long)cookie, FUTEX_LOCK_PI |
		  FUTEX_PRIVATE_FLAG | FUTEX_ATTACHED,
		  0, 0, 0, 0);
  The return value is <0 for an error and 0 for success.

- once the lock is no longer in use, FUTEX_DETACH should be invoked in
  order to remove the in-kernel state for the FUTEX. The return value
  is 0 for success and <0 for failure. A FUTEX cannot be detached while
  an operation is pending, e.g. a LOCK_PI which has not completed yet.
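
The three steps above can be sketched as a small userspace wrapper. This
is only an illustration under stated assumptions: the SK_* numbers for
ATTACH, DETACH and ATTACHED are made up (the description does not list
them), the -EEXIST error code is assumed, passing the cookie to DETACH
is assumed, and fake_futex() is a single-slot stand-in for sys_futex so
that the flow can be exercised without a patched kernel.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

enum {
    SK_FUTEX_LOCK_PI      = 6,    /* same value as mainline FUTEX_LOCK_PI */
    SK_FUTEX_PRIVATE_FLAG = 128,  /* same value as mainline FUTEX_PRIVATE_FLAG */
    SK_FUTEX_ATTACH       = 13,   /* assumed value, not from the patch text */
    SK_FUTEX_DETACH       = 14,   /* assumed value, not from the patch text */
    SK_FUTEX_ATTACHED     = 512,  /* assumed flag bit, not from the patch text */
};

/* Backend with the six-argument shape used in the howto; <0 means error. */
typedef long (*futex_call_t)(void *uaddr, int op, unsigned int val,
                             void *timeout, void *uaddr2, unsigned int val3);

struct attached_lock {
    uint32_t *word;  /* userspace lock word */
    long cookie;     /* value returned by ATTACH, or -1 when not attached */
};

static long attached_lock_init(struct attached_lock *l, uint32_t *word,
                               futex_call_t fx)
{
    long ret = fx(word, SK_FUTEX_ATTACH | SK_FUTEX_PRIVATE_FLAG,
                  0, NULL, NULL, 0);

    l->word = word;
    l->cookie = ret < 0 ? -1 : ret;
    return ret < 0 ? ret : 0;
}

static long attached_lock_pi(struct attached_lock *l, futex_call_t fx)
{
    assert(l->cookie >= 0);
    /* The cookie travels in the uaddr argument instead of the lock address. */
    return fx((void *)(uintptr_t)l->cookie,
              SK_FUTEX_LOCK_PI | SK_FUTEX_PRIVATE_FLAG | SK_FUTEX_ATTACHED,
              0, NULL, NULL, 0);
}

static long attached_lock_destroy(struct attached_lock *l, futex_call_t fx)
{
    /* Assumption: DETACH takes the cookie like every other attached op. */
    long ret = fx((void *)(uintptr_t)l->cookie,
                  SK_FUTEX_DETACH | SK_FUTEX_PRIVATE_FLAG | SK_FUTEX_ATTACHED,
                  0, NULL, NULL, 0);

    if (ret == 0)
        l->cookie = -1;
    return ret;
}

/* Stand-in backend with a single slot (cookie 0); not the kernel code. */
static void *fake_word;  /* address attached to slot 0, or NULL */

static long fake_futex(void *uaddr, int op, unsigned int val,
                       void *timeout, void *uaddr2, unsigned int val3)
{
    int cmd = op & ~(SK_FUTEX_PRIVATE_FLAG | SK_FUTEX_ATTACHED);

    (void)val; (void)timeout; (void)uaddr2; (void)val3;
    if (cmd == SK_FUTEX_ATTACH) {
        if (fake_word)
            return -EEXIST;  /* assumed error code for "slot taken" */
        fake_word = uaddr;
        return 0;            /* cookie of the only slot */
    }
    /* All other ops need the flag and the cookie of the attached slot. */
    if (!(op & SK_FUTEX_ATTACHED) || uaddr != NULL || !fake_word)
        return -EINVAL;
    if (cmd == SK_FUTEX_DETACH)
        fake_word = NULL;
    return 0;
}
```

On a kernel carrying this patch, the backend would forward its
arguments to the futex syscall instead of fake_futex().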

The implementation:
struct mm_struct is extended by struct futex_cache. This struct holds the
following members:
- slots
  an array of struct futex_cache_slot. Each entry is deployed after an
  `FUTEX_ATTACH' operation and holds a pointer to struct futex_state.
  The array is extended on demand (never shrunk) and RCU protected.

- cache_map
  a bit is set if the corresponding `slots' entry is in use. The size is
  limited to 4096 bits which means there cannot be more than 4096
  FUTEXes per process attached / in use.

- cache_size
  Size in bits of the currently deployed slots member.

- cache_lock
  A lock which is taken in the slowpath when extending the slots member
  and when removing the fs members.
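
The members listed above can be pictured with a single-threaded
userspace model. It is a rough sketch, not the kernel code: the *_model
type names, the initial size of 16 slots and the doubling growth policy
are assumptions, realloc() stands in for the RCU-protected replacement
of the array, and cache_lock is left out because nothing here runs
concurrently.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_MAX_BITS 4096
#define BITS_PER_WORD  (sizeof(unsigned long) * CHAR_BIT)
#define MAP_WORDS      (CACHE_MAX_BITS / BITS_PER_WORD)

struct futex_state_model { void *uaddr; };

struct futex_cache_slot_model { struct futex_state_model *fs; };

struct futex_cache_model {
    struct futex_cache_slot_model *slots;  /* grown on demand, never shrunk */
    unsigned long cache_map[MAP_WORDS];    /* bit set: slot is in use */
    unsigned int cache_size;               /* number of deployed slots */
};

/* Returns the slot number (the cookie) or -1 on failure. */
static int cache_attach(struct futex_cache_model *fc,
                        struct futex_state_model *fs)
{
    for (unsigned int bit = 0; bit < CACHE_MAX_BITS; bit++) {
        unsigned long *word = &fc->cache_map[bit / BITS_PER_WORD];
        unsigned long mask = 1UL << (bit % BITS_PER_WORD);

        if (*word & mask)
            continue;
        /* Used bits are always below cache_size, so the first free one
         * is at most cache_size itself. */
        assert(bit <= fc->cache_size);
        if (bit >= fc->cache_size) {
            unsigned int new_size = fc->cache_size ? fc->cache_size * 2 : 16;
            void *p;

            if (new_size > CACHE_MAX_BITS)
                new_size = CACHE_MAX_BITS;
            p = realloc(fc->slots, new_size * sizeof(*fc->slots));
            if (!p)
                return -1;
            fc->slots = p;
            memset(fc->slots + fc->cache_size, 0,
                   (new_size - fc->cache_size) * sizeof(*fc->slots));
            fc->cache_size = new_size;
        }
        fc->slots[bit].fs = fs;
        *word |= mask;
        return (int)bit;
    }
    return -1;  /* all 4096 slots are in use */
}

/* Returns 0 on success, -1 if the cookie does not name a used slot. */
static int cache_detach(struct futex_cache_model *fc, int cookie)
{
    unsigned long *word;
    unsigned long mask;

    if (cookie < 0 || (unsigned int)cookie >= fc->cache_size)
        return -1;
    word = &fc->cache_map[cookie / BITS_PER_WORD];
    mask = 1UL << (cookie % BITS_PER_WORD);
    if (!(*word & mask))
        return -1;
    *word &= ~mask;
    fc->slots[cookie].fs = NULL;
    return 0;
}
```

In this model a detached slot number is handed out again by the next
attach, and the array only ever grows.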

On each `FUTEX_ATTACH' operation an in kernel state of the userland
FUTEX is allocated: futex_state. This state contains a dedicated
futex_hash_bucket which is used exclusively for the lock. This avoids
lock contention on the global futex_hash_bucket, because two different
locks never share the same futex_hash_bucket. Also the memory for the
in-kernel state is allocated on the current NUMA node which should
reduce cross-NUMA memory access when accessing the futex_hash_bucket.
The global futex_hash_bucket is used to ensure that a FUTEX is only
enqueued once. A second FUTEX_ATTACH operation on the same uaddr will
fail because it already exists in the global futex_hash_bucket.

Upon a `FUTEX_ATTACH' operation the slot number of the ->slots array
which holds the in-kernel state is returned. This number is used in the
following FUTEX operations, e.g. FUTEX_LOCK_PI. In the hotpath, the
cache_map is checked to see if the array member is deployed. The slots
array and the fs member are dereferenced within an RCU read section. This
avoids holding any locks in the hotpath. The futex_state has a `users'
reference counter. A value of zero means that the structure still exists
within this RCU read section but is subject to removal and therefore
shall not be used. A successful atomic_inc_not_zero() ensures that the
object can still be used after leaving the RCU read section.
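
The `users' rule can be illustrated in userspace with C11 atomics. This
is a sketch only: fs_model, fs_get() and fs_put() are hypothetical
names, inc_not_zero() is a hand-written stand-in for the kernel's
atomic_inc_not_zero(), and there is no RCU here, so the comments mark
where the read section would be.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

struct fs_model {
    atomic_int users;  /* 0: object is on its way out, do not use */
};

/* Increment *v unless it is zero; returns true if the increment happened. */
static bool inc_not_zero(atomic_int *v)
{
    int old = atomic_load(v);

    while (old != 0) {
        /* On failure the CAS reloads 'old' with the current value. */
        if (atomic_compare_exchange_weak(v, &old, old + 1))
            return true;
    }
    return false;
}

/* Lookup side: in the kernel this would run inside the RCU read section. */
static struct fs_model *fs_get(struct fs_model *fs)
{
    if (fs && inc_not_zero(&fs->users))
        return fs;   /* reference held, safe to use after the section */
    return NULL;     /* absent or already being removed */
}

/* Drop a reference; returns true when the last one is gone. */
static bool fs_put(struct fs_model *fs)
{
    int old = atomic_fetch_sub(&fs->users, 1);

    assert(old > 0);
    return old == 1;
}
```

A lookup that sees users == 0 leaves the counter untouched and reports
failure, so a racing removal never has its object resurrected.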

Mixing operations with and without the `FUTEX_ATTACHED' flag on the same
FUTEX has the same outcome as mixing the `FUTEX_PRIVATE_FLAG' flag: the
kernel won't find the correct futex_hash_bucket and the operation will
block.

It is believed that the `FUTEX_ATTACH' operation can be hidden within the
pthread_mutex_init() function and the `FUTEX_DETACH' operation within
pthread_mutex_destroy(). glibc could turn it on for all private locks.
An automatic in-kernel switch-on does not exist because the current
interface supplies the address of the lock instead of the returned
cookie. A lookup in the kernel would involve a lock-protected list or
hashtable, which would bring back the locks which we try to avoid with
the per-lock futex_hash_bucket.

Signed-off-by: Thomas Gleixner <tglx@linutronix.de>
7 files changed