| From 04a2c4b4511d186b0fce685da21085a5d4acd370 Mon Sep 17 00:00:00 2001 |
| From: Sasha Levin <sashal@kernel.org> |
| Date: Sun, 29 Jun 2025 03:40:21 -0400 |
| Subject: fs: Prevent file descriptor table allocations exceeding INT_MAX |
| |
| From: Sasha Levin <sashal@kernel.org> |
| |
| commit 04a2c4b4511d186b0fce685da21085a5d4acd370 upstream. |
| |
| When sysctl_nr_open is set to a very high value (for example, 1073741816 |
| as set by systemd), processes attempting to use file descriptors near |
| the limit can trigger massive memory allocation attempts that exceed |
| INT_MAX, resulting in a WARNING in mm/slub.c: |
| |
| WARNING: CPU: 0 PID: 44 at mm/slub.c:5027 __kvmalloc_node_noprof+0x21a/0x288 |
| |
| This happens because kvmalloc_array() and kvmalloc() check if the |
| requested size exceeds INT_MAX and emit a warning when the allocation is |
| not flagged with __GFP_NOWARN. |
| |
| Specifically, when nr_open is set to 1073741816 (0x3ffffff8) and a |
| process calls dup2(oldfd, 1073741880), the kernel attempts to allocate: |
| - File descriptor array: 1073741880 * 8 bytes = 8,589,935,040 bytes |
| - Multiple bitmaps: ~400MB |
| - Total allocation size: > 8GB (exceeding INT_MAX = 2,147,483,647) |
| |
| Reproducer: |
| 1. Set /proc/sys/fs/nr_open to 1073741816: |
| # echo 1073741816 > /proc/sys/fs/nr_open |
| |
| 2. Run a program that uses a high file descriptor: |
| #include <unistd.h> |
| #include <sys/resource.h> |
| |
| int main() { |
| struct rlimit rlim = {1073741824, 1073741824}; |
| setrlimit(RLIMIT_NOFILE, &rlim); |
| dup2(2, 1073741880); // Triggers the warning |
| return 0; |
| } |
| |
| 3. Observe WARNING in dmesg at mm/slub.c:5027 |
| |
| systemd commit a8b627a introduced automatic bumping of fs.nr_open to the |
| maximum possible value. The rationale was that systems with memory |
| control groups (memcg) no longer need separate file descriptor limits |
| since memory is properly accounted. However, this change overlooked |
| that: |
| |
| 1. The kernel's allocation functions still enforce INT_MAX as a maximum |
| size regardless of memcg accounting |
| 2. Programs and tests that legitimately test file descriptor limits can |
| inadvertently trigger massive allocations |
| 3. The resulting allocations (>8GB) are impractical and will always fail |
| |
| systemd's algorithm starts with INT_MAX and keeps halving the value |
| until the kernel accepts it. On most systems, this results in nr_open |
| being set to 1073741816 (0x3ffffff8), which is just under 2^30 (about |
| 1.07 billion) file descriptors. |
| |
| While processes rarely use file descriptors near this limit in normal |
| operation, certain selftests (like |
| tools/testing/selftests/core/unshare_test.c) and programs that test file |
| descriptor limits can trigger this issue. |
| |
| Fix this by adding a check in alloc_fdtable() to ensure the requested |
| allocation size does not exceed INT_MAX. This causes the operation to |
| fail with -EMFILE instead of triggering a kernel warning and avoids the |
| impractical >8GB memory allocation request. |
| |
| Fixes: 9cfe015aa424 ("get rid of NR_OPEN and introduce a sysctl_nr_open") |
| Cc: stable@vger.kernel.org |
| Signed-off-by: Sasha Levin <sashal@kernel.org> |
| Link: https://lore.kernel.org/20250629074021.1038845-1-sashal@kernel.org |
| Signed-off-by: Christian Brauner <brauner@kernel.org> |
| Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| --- |
| fs/file.c | 15 +++++++++++++++ |
| 1 file changed, 15 insertions(+) |
| |
| --- a/fs/file.c |
| +++ b/fs/file.c |
| @@ -126,6 +126,21 @@ static struct fdtable * alloc_fdtable(un |
| if (unlikely(nr > sysctl_nr_open)) |
| nr = ((sysctl_nr_open - 1) | (BITS_PER_LONG - 1)) + 1; |
| |
| + /* |
| + * Check if the allocation size would exceed INT_MAX. kvmalloc_array() |
| + * and kvmalloc() will warn if the allocation size is greater than |
| + * INT_MAX, as filp_cachep objects are not __GFP_NOWARN. |
| + * |
| + * This can happen when sysctl_nr_open is set to a very high value and |
| + * a process tries to use a file descriptor near that limit. For example, |
| + * if sysctl_nr_open is set to 1073741816 (0x3ffffff8) - which is what |
| + * systemd typically sets it to - then trying to use a file descriptor |
| + * close to that value will require allocating a file descriptor table |
| + * that exceeds 8GB in size. |
| + */ |
| + if (unlikely(nr > INT_MAX / sizeof(struct file *))) |
| + return ERR_PTR(-EMFILE); |
| + |
| fdt = kmalloc(sizeof(struct fdtable), GFP_KERNEL_ACCOUNT); |
| if (!fdt) |
| goto out; |