xattr: use rbtree for simple_xattrs
A while ago Vasily reported that it is possible to set a large number of
xattrs on inodes of filesystems that make use of the simple xattr
infrastructure. This includes all kernfs-based filesystems that support
xattrs (e.g., cgroupfs) and tmpfs. Both cgroupfs and tmpfs can be
mounted by unprivileged users in unprivileged containers and root in an
unprivileged container can set an unrestricted number of security.*
xattrs and privileged users can also set unlimited trusted.* xattrs. As
there are apparently users that have a fairly large number of xattrs we
should scale a bit better. Other xattrs such as user.* are restricted
for kernfs-based instances to a fairly limited number.
Using a simple linked list protected by a spinlock used for set, get,
and list operations doesn't scale well if users use a lot of xattrs even
if it's not a crazy number. There's no need to bring in the big guns
like rhashtables or rw semaphores for this. An rbtree with a rwlock, or
limited rcu semanics and seqlock is enough.
It scales within the constraints we are working in. By far the most
common operation is getting an xattr. Setting xattrs should be a
moderately rare operation. And listxattr() often only happens when
copying xattrs between files or with the filey. Holding a lock across
listxattr() is unproblematic because it doesn't list the values of
xattrs. It can only be used to list the names of all xattrs set on a
file. And the number of xattr names that can be listed with listxattr()
is limited to XATTR_LIST_MAX aka 65536 bytes. If a larger buffer is
passed then vfs_listxattr() caps it to XATTR_LIST_MAX and if more xattr
names are found it will return -EFBIG. In short, the maximum amount of
memory that can be retrieved via listxattr() is limited.
Of course, the API is broken as documented on xattr(7) already. In the
future we might want to address this but for now this is the world we
live in and have lived for a long time. But it does indeed mean that
once an application goes over XATTR_LIST_MAX limit of xattrs set on an
inode it isn't possible to copy the file and include its xattrs in the
copy unless the caller knows all xattrs or limits the copy of the xattrs
to important ones it knows by name (At least for tmpfs, and kernfs-based
filesystems. Other filesystems might provide ways of achieving this.).
Bonus of this port to rbtree+rwlock is that we shrink the memory
consumption for users of the simple xattr infrastructure.
Also add proper kernel documentation to all the functions.
A big thanks to Paul for his comments.
Cc: Vasily Averin <vvs@openvz.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
Notes:
To: linux-fsdevel@vger.kernel.org
Cc: Vasily Averin <vvs@openvz.org>
Cc: Hugh Dickins <hughd@google.com>
Cc: Seth Forshee (Digital Ocean) <sforshee@kernel.org>
Cc: Stéphane Graber <stgraber@ubuntu.com>
Cc: Roman Gushchin <roman.gushchin@linux.dev>
Cc: Tejun Heo <tj@kernel.org>
Cc: "Paul E. McKenney" <paulmck@kernel.org>
Cc: Al Viro <viro@zeniv.linux.org.uk>
In the v1 and v2 of this patch we used an rbtree which was protected by an
rcu+seqlock combination. During the discussion it became clear that there's
some valid concern how safe it is to combine rcu with rbtrees. While most of
the issues are highly unlikely to matter in practice the code here can be
reached by unprivileged users rather directly so we should not be adventurous.
Instead of the rcu+seqlock combination simply use an rwlock. This will scale
sufficiently as well and had been one of the implementations I considered and
wrote a little while ago. Thanks to Paul for some deeper insights into issues
associated with rcu and rbtrees!
In addition to this patch I would like to propose that we restrict the number
of xattrs for the simple xattr infrastructure via XATTR_MAX_LIST bytes. In
other words, we restrict the number of xattrs for simple xattr filesystems to
the number of xattrs names that can be retrieved via listxattr(). That should
be about 2000 to 3000 xattrs per inode which is more than enough. We should
risk this and see if we get any regression reports from userswith this
approach.
This should be as simple as adding a max_list member to struct simple_xattrs
and initialize it with XATTR_MAX_LIST. Set operations would then check against
this field whether the new xattr they are trying to set will fit and return
-EFBIG otherwise. I think that might be a good approach to get rid of the in
principle unbounded number of xattrs that can be set via the simple xattr
infrastructure. I think this is a regression risk worth taking.
/* v2 */
Christian Brauner <brauner@kernel.org>:
- Fix kernel doc.
- Remove accidental leftover union from previous iteration.
/* v3 */
Port the whole thing to use a simple rwlock instead of rcu+seqlock.
"Paul E. McKenney" <paulmck@kernel.org>:
- Fix simple_xattr_add() by searching the correct slot in the rbtree first.
Roman Gushchin <roman.gushchin@linux.dev>:
- Avoid calling strcmp() multiple times.
3 files changed