
 From bippy-5f407fcff5a0 Mon Sep 17 00:00:00 2001
 From: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
 To: <linux-cve-announce@vger.kernel.org>
 Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org>
 Subject: CVE-2021-47128: bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks

 Description
 ===========

 In the Linux kernel, the following vulnerability has been resolved:

 bpf, lockdown, audit: Fix buggy SELinux lockdown permission checks

 Commit 59438b46471a ("security,lockdown,selinux: implement SELinux lockdown")
 added an implementation of the locked_down LSM hook to SELinux, with the aim
 of restricting which domains are allowed to perform operations that would breach
 lockdown. This indirectly also gets the audit subsystem involved to report
 events. The latter is problematic, as reported by Ondrej and Serhei, since it
 can bring down the whole system via audit:

   1) The audit events that are triggered due to calls to security_locked_down()
      can OOM kill a machine, see details below [0].

   2) It also seems to be causing a deadlock via avc_has_perm()/slow_avc_audit()
      when trying to wake up kauditd, for example, when using the
      trace_sched_switch() tracepoint, see details in [1]. This was not triggered
      via some hypothetical corner case, but with existing tools like runqlat &
      runqslower from bcc, for example, which make use of this tracepoint. The
      rough call sequence is:

      rq_lock(rq) -> -------------------------+
        trace_sched_switch() ->               |
          bpf_prog_xyz() ->                   +-> deadlock
            selinux_lockdown() ->             |
              audit_log_end() ->              |
                wake_up_interruptible() ->    |
                  try_to_wake_up() ->         |
                    rq_lock(rq) --------------+
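
 The call sequence above can be imitated in user space. The following is a
 toy model only, not kernel code: the Python functions borrow the kernel
 function names from the diagram, threading.Lock stands in for the
 non-recursive runqueue lock, and a non-blocking acquire is used (an
 assumption made for the demo) so the script reports the problem instead of
 hanging.

```python
import threading

# Toy stand-in for the runqueue lock. threading.Lock is non-recursive:
# a second acquire from the same thread cannot succeed while it is held.
rq_lock = threading.Lock()

def try_to_wake_up():
    # The wakeup path needs the runqueue lock. A blocking acquire here
    # would hang forever if the caller already holds it.
    got_it = rq_lock.acquire(blocking=False)
    if got_it:
        rq_lock.release()
    return got_it

def audit_log_end():
    # Finishing an audit record wakes up the audit daemon thread.
    return try_to_wake_up()

def selinux_lockdown():
    # A denied lockdown check is reported through audit.
    return audit_log_end()

def bpf_prog_xyz():
    # Pre-fix behaviour: the helper triggers the lockdown check on every call.
    return selinux_lockdown()

def trace_sched_switch():
    return bpf_prog_xyz()

def schedule():
    # The scheduler holds the runqueue lock while the tracepoint fires.
    with rq_lock:
        return trace_sched_switch()

print("wakeup could take rq_lock:", schedule())
```

 With the lock held by schedule(), the nested wakeup cannot take it; in the
 kernel the equivalent acquire spins instead of returning, which is the
 deadlock shown in the diagram.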

 What's worse is that the intention of 59438b46471a to further restrict lockdown
 settings for specific applications with respect to the global lockdown policy is
 completely broken for BPF. The SELinux policy rule for the current lockdown check
 looks something like this:

   allow <who> <who> : lockdown { <reason> };

 However, this doesn't match the 'current' task in which security_locked_down()
 is executed. For example: httpd does a syscall. There is a tracing program attached
 to the syscall which triggers a BPF program to run, which ends up doing a
 bpf_probe_read_kernel{,_str}() helper call. The selinux_lockdown() hook does
 the permission check against 'current', that is, httpd in this example. httpd
 has literally zero relation to this tracing program, and it would be nonsensical
 to have to write an SELinux policy rule against httpd to let the tracing helper
 pass. The policy in this case needs to be against the entity that is installing
 the BPF program. For example, suppose bpftrace generates a histogram of syscall
 counts by user space application:

   bpftrace -e 'tracepoint:raw_syscalls:sys_enter { @[comm] = count(); }'

 bpftrace would then go and generate a BPF program from this internally. One way
 of doing it [for the sake of the example] could be to call the
 bpf_get_current_task() helper and then access current->comm via one of the
 bpf_probe_read_kernel{,_str}() helpers. So the program itself has nothing to do
 with httpd or any other random app doing a syscall here. The BPF program
 _explicitly initiated_ the lockdown check. The allow/deny policy belongs in the
 context of bpftrace: meaning, you want to grant bpftrace access to use these
 helpers, but _not_ other tracers on the system like my_random_tracer.
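
 The mismatch described above can be made concrete with a small model. This
 is a sketch under stated assumptions: the domain names (httpd_t, bpftrace_t,
 my_random_tracer_t) and the ALLOWED set are hypothetical stand-ins for an
 SELinux policy, and locked_down() only mimics the return convention of the
 hook (0 for allowed, negative for denied).

```python
# Hypothetical policy: (SELinux domain, lockdown reason) pairs that are
# allowed, mirroring "allow <who> <who> : lockdown { <reason> };".
ALLOWED = {("bpftrace_t", "confidentiality")}

def locked_down(current_domain, reason):
    """Return 0 if the policy allows it, -1 (think -EPERM) otherwise."""
    return 0 if (current_domain, reason) in ALLOWED else -1

def check_at_helper_call(running_task_domain):
    # Pre-fix: 'current' is whichever task happened to hit the tracepoint.
    return locked_down(running_task_domain, "confidentiality")

def check_at_program_load(loader_domain):
    # Post-fix: 'current' is the task that loads the BPF program.
    return locked_down(loader_domain, "confidentiality")

# bpftrace is allowed by policy, yet the pre-fix check is evaluated
# against httpd and is denied.
print("helper call while httpd runs:", check_at_helper_call("httpd_t"))
print("load by bpftrace:", check_at_program_load("bpftrace_t"))
print("load by my_random_tracer:", check_at_program_load("my_random_tracer_t"))
```

 Only the load-time variant lets the policy distinguish one tracer from
 another; the call-time variant would require a rule for every traced
 application.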

 Therefore fix all three issues at the same time by taking a completely different
 approach for the security_locked_down() hook, that is, move the check into the
 program verification phase where we actually retrieve the BPF func proto. This
 also reliably gets the task (current) that is trying to install the BPF tracing
 program, e.g. bpftrace/bcc/perf/systemtap/etc, and it also fixes the OOM since
 we're moving this out of the BPF helper's fast-path, which can be called several
 million times per second.
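
 The effect of moving the check can be modelled as follows. This is a
 user-space sketch, not the kernel patch: get_func_proto(), load_and_run()
 and the string identifiers are hypothetical, and the model only assumes what
 the text above states, namely that the check runs once when the helper's
 func proto is retrieved and no longer on each helper call.

```python
audit_events = []

def security_locked_down(task, reason, allowed_tasks):
    """Return 0 if allowed; otherwise record one audit event and return -1."""
    if task in allowed_tasks:
        return 0
    audit_events.append((task, reason))
    return -1

def get_func_proto(helper, loader, allowed_tasks):
    # Post-fix shape: the check happens while the verifier looks up the
    # helper's prototype. No prototype means the program cannot be loaded.
    if helper == "probe_read_kernel":
        if security_locked_down(loader, "bpf_read", allowed_tasks) < 0:
            return None
    return "proto:" + helper

def load_and_run(loader, allowed_tasks, invocations):
    proto = get_func_proto("probe_read_kernel", loader, allowed_tasks)
    if proto is None:
        return "load rejected"
    calls = 0
    for _ in range(invocations):
        calls += 1  # fast path: no lockdown check, hence no audit event
    return "loaded, %d helper calls" % calls

print(load_and_run("my_random_tracer", {"bpftrace"}, 1000000), len(audit_events))
```

 A denied loader produces a single audit event at load time, and an allowed
 loader produces none no matter how often the helper runs, which is why the
 audit flood behind the OOM goes away.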

 The check is then also in line with other security_locked_down() hooks in the
 system where the enforcement is performed at open/load time, for example,
 open_kcore() for /proc/kcore access or module_sig_check() for module signatures,
 to pick just a few random ones. What's out of scope in the fix as well as in
 other security_locked_down() hook locations /outside/ of the BPF subsystem is that
 if the lockdown policy changes on the fly there is no retrospective action.
 This requires a different discussion, potentially complex infrastructure, and
 it's also not clear whether this can be solved generically. Either way, it is
 out of scope for a suitable stable fix which this one is targeting. Note that
 the breakage is specifically on 59438b46471a where it started to rely on 'current'
 as UAPI behavior, and _not_ earlier infrastructure such as 9d1f8be5cf42 ("bpf:
 Restrict bpf when kernel lockdown is in confidentiality mode").

 [0] https://bugzilla.redhat.com/show_bug.cgi?id=1955585, Jakub Hrozek says:

   I starting seeing this with F-34. When I run a container that is traced with
   BPF to record the syscalls it is doing, auditd is flooded with messages like:

   type=AVC msg=audit(1619784520.593:282387): avc:  denied  { confidentiality }
     for pid=476 comm="auditd" lockdown_reason="use of bpf to read kernel RAM"
       scontext=system_u:system_r:auditd_t:s0 tcontext=system_u:system_r:auditd_t:s0
         tclass=lockdown permissive=0

   This seems to be leading to auditd running out of space in the backlog buffer
   and eventually OOMs the machine.

   [...]
   auditd running at 99% CPU presumably processing all the messages, eventually I get:
   Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
   Apr 30 12:20:42 fedora kernel: audit: backlog limit exceeded
   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152579 > audit_backlog_limit=64
   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152626 > audit_backlog_limit=64
   Apr 30 12:20:42 fedora kernel: audit: audit_backlog=2152694 > audit_backlog_limit=64
   Apr 30 12:20:42 fedora kernel: audit: audit_lost=6878426 audit_rate_limit=0 audit_backlog_limit=64
   Apr 30 12:20:45 fedora kernel: oci-seccomp-bpf invoked oom-killer: gfp_mask=0x100cca(GFP_HIGHUSER_MOVABLE), order=0, oom_score_adj=-1000
   Apr 30 12:20:45 fedora kernel: CPU: 0 PID: 13284 Comm: oci-seccomp-bpf Not tainted 5.11.12-300.fc34.x86_64 #1
   Apr 30 12:20:45 fedora kernel: Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014
   [...]

 [1] https://lore.kernel.org/linux-audit/CANYvDQN7H5tVp47fbYcRasv4XF07eUbsDwT_eDCHXJUj43J7jQ@mail.gmail.com/,
     Serhei Makarov says:

   Upstream kernel 5.11.0-rc7 and later was found to deadlock during a
   bpf_probe_read_compat() call within a sched_switch tracepoint. The problem
   is reproducible with the reg_alloc3 testcase from SystemTap's BPF backend
   testsuite on x86_64 as well as the runqlat, runqslower tools from bcc on
   ppc64le. Example stack trace:

   [...]
   [  730.868702] stack backtrace:
   [  730.869590] CPU: 1 PID: 701 Comm: in:imjournal Not tainted, 5.12.0-0.rc2.20210309git144c79ef3353.166.fc35.x86_64 #1
   [  730.871605] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.13.0-2.fc32 04/01/2014
   [  730.873278] Call Trace:
   [  730.873770]  dump_stack+0x7f/0xa1
   [  730.874433]  check_noncircular+0xdf/0x100
   [  730.875232]  __lock_acquire+0x1202/0x1e10
   [  730.876031]  ? __lock_acquire+0xfc0/0x1e10
   [  730.876844]  lock_acquire+0xc2/0x3a0
   [  730.877551]  ? __wake_up_common_lock+0x52/0x90
   [  730.878434]  ? lock_acquire+0xc2/0x3a0
   [  730.879186]  ? lock_is_held_type+0xa7/0x120
   [  730.880044]  ? skb_queue_tail+0x1b/0x50
   [  730.880800]  _raw_spin_lock_irqsave+0x4d/0x90
   [  730.881656]  ? __wake_up_common_lock+0x52/0x90
   [  730.882532]  __wake_up_common_lock+0x52/0x90
   [  730.883375]  audit_log_end+0x5b/0x100
   [  730.884104]  slow_avc_audit+0x69/0x90
   [  730.884836]  avc_has_perm+0x8b/0xb0
   [  730.885532]  selinux_lockdown+0xa5/0xd0
   [  730.886297]  security_locked_down+0x20/0x40
   [  730.887133]  bpf_probe_read_compat+0x66/0xd0
   [  730.887983]  bpf_prog_250599c5469ac7b5+0x10f/0x820
   [  730.888917]  trace_call_bpf+0xe9/0x240
   [  730.889672]  perf_trace_run_bpf_submit+0x4d/0xc0
   [  730.890579]  perf_trace_sched_switch+0x142/0x180
   [  730.891485]  ? __schedule+0x6d8/0xb20
   [  730.892209]  __schedule+0x6d8/0xb20
   [  730.892899]  schedule+0x5b/0xc0
   [  730.893522]  exit_to_user_mode_prepare+0x11d/0x240
   [  730.894457]  syscall_exit_to_user_mode+0x27/0x70
   [  730.895361]  entry_SYSCALL_64_after_hwframe+0x44/0xae
   [...]

 The Linux kernel CVE team has assigned CVE-2021-47128 to this issue.


 Affected and fixed versions
 ===========================

 	Issue introduced in 5.6 with commit 59438b46471ae6cdfb761afc8c9beaf1e428a331 and fixed in 5.10.43 with commit ff5039ec75c83d2ed5b781dc7733420ee8c985fc
 	Issue introduced in 5.6 with commit 59438b46471ae6cdfb761afc8c9beaf1e428a331 and fixed in 5.12.10 with commit acc43fc6cf0d50612193813c5906a1ab9d433e1e
 	Issue introduced in 5.6 with commit 59438b46471ae6cdfb761afc8c9beaf1e428a331 and fixed in 5.13 with commit ff40e51043af63715ab413995ff46996ecf9583f
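
 As a rough aid for reading the list above, the version ranges can be turned
 into a small check. This is only a sketch: it assumes plain upstream version
 strings such as "5.10.42" (no -rc or distribution suffixes), the function
 names are made up for illustration, and it cannot know about fixes that a
 distribution has backported to its own kernels.

```python
def parse(version):
    """Turn '5.10.43' into (5, 10, 43); a missing patch level counts as 0."""
    parts = [int(p) for p in version.split(".")]
    while len(parts) < 3:
        parts.append(0)
    return tuple(parts[:3])

# First fixed release per stable branch, taken from the list above.
FIXED_IN_BRANCH = {(5, 10): 43, (5, 12): 10}
INTRODUCED = (5, 6, 0)      # issue introduced in 5.6
FIXED_MAINLINE = (5, 13, 0)  # fixed in 5.13

def is_affected(version):
    v = parse(version)
    if v < INTRODUCED or v >= FIXED_MAINLINE:
        return False
    branch = v[:2]
    if branch in FIXED_IN_BRANCH:
        return v[2] < FIXED_IN_BRANCH[branch]
    # Branches between 5.6 and 5.13 with no fix listed above.
    return True

for candidate in ("5.5.19", "5.10.42", "5.10.43", "5.11.12", "5.13"):
    print(candidate, "affected" if is_affected(candidate) else "not affected")
```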

 Please see https://www.kernel.org for a full list of kernel versions
 currently supported by the kernel community.

 Unaffected versions might change over time as fixes are backported to
 older supported kernel versions.  The official CVE entry at
 	https://cve.org/CVERecord/?id=CVE-2021-47128
 will be updated if fixes are backported; please check that for the most
 up-to-date information about this issue.


 Affected files
 ==============

 The file(s) affected by this issue are:
 	kernel/bpf/helpers.c
 	kernel/trace/bpf_trace.c


 Mitigation
 ==========

 The Linux kernel CVE team recommends that you update to the latest
 stable kernel version for this, and many other bugfixes.  Individual
 changes are never tested alone, but rather are part of a larger kernel
 release.  Cherry-picking individual commits is not recommended or
 supported by the Linux kernel community at all.  If, however, updating to
 the latest release is impossible, the individual changes to resolve this
 issue can be found at these commits:
 	https://git.kernel.org/stable/c/ff5039ec75c83d2ed5b781dc7733420ee8c985fc
 	https://git.kernel.org/stable/c/acc43fc6cf0d50612193813c5906a1ab9d433e1e
 	https://git.kernel.org/stable/c/ff40e51043af63715ab413995ff46996ecf9583f