| From bippy-7c5fe7eed585 Mon Sep 17 00:00:00 2001 |
| From: Greg Kroah-Hartman <gregkh@linuxfoundation.org> |
| To: <linux-cve-announce@vger.kernel.org> |
| Reply-to: <cve@kernel.org>, <linux-kernel@vger.kernel.org> |
| Subject: CVE-2021-47094: KVM: x86/mmu: Don't advance iterator after restart due to yielding |
| |
| Description |
| =========== |
| |
| In the Linux kernel, the following vulnerability has been resolved: |
| |
| KVM: x86/mmu: Don't advance iterator after restart due to yielding |
| |
| After dropping mmu_lock in the TDP MMU, restart the iterator during |
| tdp_iter_next() and do not advance the iterator. Advancing the iterator |
| results in skipping the top-level SPTE and all its children, which is |
| fatal if any of the skipped SPTEs were not visited before yielding. |
| |
| When zapping all SPTEs, i.e. when min_level == root_level, restarting the |
| iter and then invoking tdp_iter_next() is always fatal if the current gfn |
| has as a valid SPTE, as advancing the iterator results in try_step_side() |
| skipping the current gfn, which wasn't visited before yielding. |
| |
| Sprinkle WARNs on iter->yielded being true in various helpers that are |
| often used in conjunction with yielding, and tag the helper with |
| __must_check to reduce the probabily of improper usage. |
| |
| Failing to zap a top-level SPTE manifests in one of two ways. If a valid |
| SPTE is skipped by both kvm_tdp_mmu_zap_all() and kvm_tdp_mmu_put_root(), |
| the shadow page will be leaked and KVM will WARN accordingly. |
| |
| WARNING: CPU: 1 PID: 3509 at arch/x86/kvm/mmu/tdp_mmu.c:46 [kvm] |
| RIP: 0010:kvm_mmu_uninit_tdp_mmu+0x3e/0x50 [kvm] |
| Call Trace: |
| <TASK> |
| kvm_arch_destroy_vm+0x130/0x1b0 [kvm] |
| kvm_destroy_vm+0x162/0x2a0 [kvm] |
| kvm_vcpu_release+0x34/0x60 [kvm] |
| __fput+0x82/0x240 |
| task_work_run+0x5c/0x90 |
| do_exit+0x364/0xa10 |
| ? futex_unqueue+0x38/0x60 |
| do_group_exit+0x33/0xa0 |
| get_signal+0x155/0x850 |
| arch_do_signal_or_restart+0xed/0x750 |
| exit_to_user_mode_prepare+0xc5/0x120 |
| syscall_exit_to_user_mode+0x1d/0x40 |
| do_syscall_64+0x48/0xc0 |
| entry_SYSCALL_64_after_hwframe+0x44/0xae |
| |
| If kvm_tdp_mmu_zap_all() skips a gfn/SPTE but that SPTE is then zapped by |
| kvm_tdp_mmu_put_root(), KVM triggers a use-after-free in the form of |
| marking a struct page as dirty/accessed after it has been put back on the |
| free list. This directly triggers a WARN due to encountering a page with |
| page_count() == 0, but it can also lead to data corruption and additional |
| errors in the kernel. |
| |
| WARNING: CPU: 7 PID: 1995658 at arch/x86/kvm/../../../virt/kvm/kvm_main.c:171 |
| RIP: 0010:kvm_is_zone_device_pfn.part.0+0x9e/0xd0 [kvm] |
| Call Trace: |
| <TASK> |
| kvm_set_pfn_dirty+0x120/0x1d0 [kvm] |
| __handle_changed_spte+0x92e/0xca0 [kvm] |
| __handle_changed_spte+0x63c/0xca0 [kvm] |
| __handle_changed_spte+0x63c/0xca0 [kvm] |
| __handle_changed_spte+0x63c/0xca0 [kvm] |
| zap_gfn_range+0x549/0x620 [kvm] |
| kvm_tdp_mmu_put_root+0x1b6/0x270 [kvm] |
| mmu_free_root_page+0x219/0x2c0 [kvm] |
| kvm_mmu_free_roots+0x1b4/0x4e0 [kvm] |
| kvm_mmu_unload+0x1c/0xa0 [kvm] |
| kvm_arch_destroy_vm+0x1f2/0x5c0 [kvm] |
| kvm_put_kvm+0x3b1/0x8b0 [kvm] |
| kvm_vcpu_release+0x4e/0x70 [kvm] |
| __fput+0x1f7/0x8c0 |
| task_work_run+0xf8/0x1a0 |
| do_exit+0x97b/0x2230 |
| do_group_exit+0xda/0x2a0 |
| get_signal+0x3be/0x1e50 |
| arch_do_signal_or_restart+0x244/0x17f0 |
| exit_to_user_mode_prepare+0xcb/0x120 |
| syscall_exit_to_user_mode+0x1d/0x40 |
| do_syscall_64+0x4d/0x90 |
| entry_SYSCALL_64_after_hwframe+0x44/0xae |
| |
| Note, the underlying bug existed even before commit 1af4a96025b3 ("KVM: |
| x86/mmu: Yield in TDU MMU iter even if no SPTES changed") moved calls to |
| tdp_mmu_iter_cond_resched() to the beginning of loops, as KVM could still |
| incorrectly advance past a top-level entry when yielding on a lower-level |
| entry. But with respect to leaking shadow pages, the bug was introduced |
| by yielding before processing the current gfn. |
| |
| Alternatively, tdp_mmu_iter_cond_resched() could simply fall through, or |
| callers could jump to their "retry" label. The downside of that approach |
| is that tdp_mmu_iter_cond_resched() _must_ be called before anything else |
| in the loop, and there's no easy way to enfornce that requirement. |
| |
| Ideally, KVM would handling the cond_resched() fully within the iterator |
| macro (the code is actually quite clean) and avoid this entire class of |
| bugs, but that is extremely difficult do while also supporting yielding |
| after tdp_mmu_set_spte_atomic() fails. Yielding after failing to set a |
| SPTE is very desirable as the "owner" of the REMOVED_SPTE isn't strictly |
| bounded, e.g. if it's zapping a high-level shadow page, the REMOVED_SPTE |
| may block operations on the SPTE for a significant amount of time. |
| |
| The Linux kernel CVE team has assigned CVE-2021-47094 to this issue. |
| |
| |
| Affected and fixed versions |
| =========================== |
| |
| Issue introduced in 5.10 with commit faaf05b00aecdb347ffd1d763d024394ec0329f8 and fixed in 5.15.12 with commit d884eefd75cc54887bc2e9e724207443525dfb2c |
| Issue introduced in 5.10 with commit faaf05b00aecdb347ffd1d763d024394ec0329f8 and fixed in 5.16 with commit 3a0f64de479cae75effb630a2e0a237ca0d0623c |
| |
| Please see https://www.kernel.org for a full list of currently supported |
| kernel versions by the kernel community. |
| |
| Unaffected versions might change over time as fixes are backported to |
| older supported kernel versions. The official CVE entry at |
| https://cve.org/CVERecord/?id=CVE-2021-47094 |
| will be updated if fixes are backported, please check that for the most |
| up to date information about this issue. |
| |
| |
| Affected files |
| ============== |
| |
| The file(s) affected by this issue are: |
| arch/x86/kvm/mmu/tdp_iter.c |
| arch/x86/kvm/mmu/tdp_iter.h |
| arch/x86/kvm/mmu/tdp_mmu.c |
| |
| |
| Mitigation |
| ========== |
| |
| The Linux kernel CVE team recommends that you update to the latest |
| stable kernel version for this, and many other bugfixes. Individual |
| changes are never tested alone, but rather are part of a larger kernel |
| release. Cherry-picking individual commits is not recommended or |
| supported by the Linux kernel community at all. If however, updating to |
| the latest release is impossible, the individual changes to resolve this |
| issue can be found at these commits: |
| https://git.kernel.org/stable/c/d884eefd75cc54887bc2e9e724207443525dfb2c |
| https://git.kernel.org/stable/c/3a0f64de479cae75effb630a2e0a237ca0d0623c |