xattr: ensure simple xattrs scale better

A while ago we got reports of soft lockups and hung tasks when dealing
with a large number of xattrs set on tmpfs or kernfs based instances.
(This was originally reported as a security issue but was determined
not to be one.) While the number of user.* xattrs on kernfs inodes is
limited and tmpfs doesn't support them, security.* and trusted.* xattrs
have no such restriction. In particular, security.* xattrs can be set
by root in a userns.

Since there are already users that rely on large numbers of xattrs on
tmpfs/kernfs, limiting them retroactively isn't an option without a
high probability of regressions.

So we decided to switch to a better data structure to avoid the lockup
issues users have seen. Instead of relying on a linked list, switch to
an rhashtable. The ->set() operation continues to be protected by the
per-inode spinlock while ->get() and ->list() operations are lockless.
The ->get() operation relies on an increment-if-not-zero pattern and
the ->list() operation just checks that any xattr still in the
rhashtable has a non-zero refcount. This gives sufficient consistency
guarantees.
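
As a rough userspace illustration of the increment-if-not-zero pattern
mentioned above (a sketch with C11 atomics, not the code in this patch;
struct demo_ref and both helpers are made-up names), a reader only
gains a reference if the count has not already dropped to zero. In the
kernel, refcount_inc_not_zero() covers the same pattern.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical stand-in for a refcounted xattr entry. */
struct demo_ref {
	atomic_uint refs;	/* 0 means the entry is already dead */
};

/* Take a reference only if the entry is still live (count != 0). */
static bool demo_get_unless_zero(struct demo_ref *r)
{
	unsigned int old = atomic_load(&r->refs);

	while (old != 0) {
		/* On failure 'old' is refreshed with the current value. */
		if (atomic_compare_exchange_weak(&r->refs, &old, old + 1))
			return true;
	}
	return false;
}

/* Drop a reference; returns true if this was the last one. */
static bool demo_put(struct demo_ref *r)
{
	unsigned int old = atomic_fetch_sub(&r->refs, 1);

	assert(old > 0);	/* dropping a dead entry would be a bug */
	return old == 1;
}
```

A lookup that finds an entry but fails demo_get_unless_zero() treats
the entry as if it had not been found.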

Tests were run with hundreds of thousands of xattrs set on tmpfs inodes.
No lockups or other issues were observed, in contrast to the linked
list implementation. The tests have been running for a long time.

Of course, there's currently a limit to the number of xattrs that can be
reasonably retrieved via listxattr() as the kernel will start returning
E2BIG at some point. While this is a bug in the implementation of xattr
retrieval that we should probably fix at some point in the future
(a readdir()-like interface for xattrs?), switching to an rhashtable is
something we can and should do right now to get rid of the immediate
scaling issues.

Since the listxattr() system call is explicitly documented with
> The list of names is returned as an unordered array [...]
we're not breaking the API with this change either.

Now listxattr() can run concurrently with ->set() operations but will
provide RCU consistency as the whole list walk takes place under the
RCU read lock. Any xattr that has been removed but is still visible in
the rhashtable under RCU can be identified by checking whether its
refcount is zero. If it is, we skip it.
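
The skip-on-zero-refcount rule can be modelled in userspace roughly as
follows. This is a single-threaded sketch with no RCU and a plain array
instead of an rhashtable; struct demo_xattr and list_live_names() are
hypothetical names. It emits names in the NUL-separated format that
listxattr() uses.

```c
#include <assert.h>
#include <errno.h>
#include <stdatomic.h>
#include <string.h>
#include <sys/types.h>

/* Hypothetical entry: refs == 0 means removed but not yet reclaimed. */
struct demo_xattr {
	const char *name;
	atomic_uint refs;
};

/*
 * Copy the names of all live entries into buf as NUL-terminated
 * strings. With buf == NULL only the required size is computed.
 * Returns the number of bytes used/needed, or -ERANGE if buf is
 * too small.
 */
static ssize_t list_live_names(struct demo_xattr *tab, size_t n,
			       char *buf, size_t size)
{
	size_t used = 0;

	assert(tab != NULL || n == 0);
	for (size_t i = 0; i < n; i++) {
		size_t len;

		/* Removed entries are still reachable; skip them. */
		if (atomic_load(&tab[i].refs) == 0)
			continue;

		len = strlen(tab[i].name) + 1;
		if (buf) {
			if (len > size - used)
				return -ERANGE;
			memcpy(buf + used, tab[i].name, len);
		}
		used += len;
	}
	return (ssize_t)used;
}
```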

The consistency guarantees don't change. If an xattr retrieval races
with a set, this is something that userspace already needs to deal with
today. It needs to provide a well-estimated buffer for a first try and,
if the buffer is too small, reallocate it. In the meantime the set of
xattrs could have shrunk or grown no matter the data structure.
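
For reference, the usual userspace retry loop looks roughly like the
sketch below. It assumes Linux and glibc's <sys/xattr.h>;
list_xattrs_retry() is a hypothetical helper name, and error handling
is kept minimal.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>
#include <sys/types.h>
#include <sys/xattr.h>

/*
 * Returns a malloc()ed, NUL-separated name list and stores its length
 * in *len_out, or returns NULL with errno set.
 */
static char *list_xattrs_retry(const char *path, ssize_t *len_out)
{
	char *buf = NULL;

	assert(path != NULL && len_out != NULL);
	for (;;) {
		int saved;
		char *tmp;
		ssize_t got;
		/* size 0 asks for the currently required buffer size. */
		ssize_t need = listxattr(path, NULL, 0);

		if (need < 0)
			goto fail;

		tmp = realloc(buf, need ? (size_t)need : 1);
		if (!tmp) {
			errno = ENOMEM;
			goto fail;
		}
		buf = tmp;

		got = listxattr(path, buf, (size_t)need);
		if (got >= 0) {
			*len_out = got;
			return buf;
		}
		/* ERANGE: the names grew between the two calls; retry. */
		if (errno == ERANGE)
			continue;
fail:
		saved = errno;
		free(buf);
		errno = saved;
		return NULL;
	}
}
```

The loop only retries on ERANGE, so a concurrent ->set() that grows
the name list between the two calls costs one more round trip.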

Also add proper kernel documentation to all the functions.

Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>

Notes:

To: linux-fsdevel@vger.kernel.org
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: kernel@openvz.org
Cc: Vasily Averin <vvs@openvz.org>
4 files changed