| From: Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
| Date: Fri, 5 Aug 2016 13:51:17 +0200 |
| Subject: [PATCH] x86/mm: disable preemption during CR3 read+write |
| MIME-Version: 1.0 |
| Content-Type: text/plain; charset=UTF-8 |
| Content-Transfer-Encoding: 8bit |
| |
| Usually current->mm (and therefore mm->pgd) stays the same during the |
| lifetime of a task so it does not matter if a task gets preempted during |
| the read and write of CR3. |
| |
| But then, there is this scenario on x86-UP: |
| TaskA is in do_exit() and exit_mm() sets current->mm = NULL followed by |
| mmput() -> exit_mmap() -> tlb_finish_mmu() -> tlb_flush_mmu() -> |
| tlb_flush_mmu_tlbonly() -> tlb_flush() -> flush_tlb_mm_range() -> |
| __flush_tlb_up() -> __flush_tlb() -> __native_flush_tlb(). |
| |
| At this point current->mm is NULL but current->active_mm still points to |
| the "old" mm. |
| Let's preempt taskA _after_ native_read_cr3() by taskB. TaskB has its |
| own mm so CR3 has changed. |
| Now preempt back to taskA. TaskA has no ->mm set so it borrows taskB's |
| mm and so CR3 remains unchanged. Once taskA gets active it continues |
| where it was interrupted and that means it writes its old CR3 value |
| back. Everything is fine because userland won't need its memory |
| anymore. |
| |
| Now the fun part. Let's preempt taskA one more time and get back to |
| taskB. This time switch_mm() won't do a thing because oldmm |
| (->active_mm) is the same as mm (as per context_switch()). So we remain |
| with a bad CR3 / pgd and return to userland. |
| The next thing that happens is handle_mm_fault() with an address for the |
| execution of its code in userland. handle_mm_fault() realizes that it |
| has a PTE with proper rights so it returns doing nothing. But the CPU |
| looks at the wrong pgd and insists that something is wrong and faults |
| again. And again. And one more time… |
| |
| This page fault loop continues until the scheduler gets tired of it and |
| puts another task on the CPU. It gets a little difficult if the task is a |
| RT task with a high priority. The system will either freeze or it gets |
| fixed by the software watchdog thread which usually runs at RT-max prio. |
| But waiting for the watchdog will increase the latency of the RT task |
| which is no good. |
| |
| Cc: stable@vger.kernel.org |
| Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> |
| --- |
| arch/x86/include/asm/tlbflush.h | 7 +++++++ |
| 1 file changed, 7 insertions(+) |
| |
| --- a/arch/x86/include/asm/tlbflush.h |
| +++ b/arch/x86/include/asm/tlbflush.h |
| @@ -135,7 +135,14 @@ static inline void cr4_set_bits_and_upda |
| |
| static inline void __native_flush_tlb(void) |
| { |
| + /* |
| + * if current->mm == NULL then we borrow a mm which may change during a |
| + * task switch and therefore we must not be preempted while we write CR3 |
| + * back. |
| + */ |
| + preempt_disable(); |
| native_write_cr3(native_read_cr3()); |
| + preempt_enable(); |
| } |
| |
| static inline void __native_flush_tlb_global_irq_disabled(void) |