| From: Johannes Weiner <hannes@cmpxchg.org> |
| Subject: psi: avoid divide-by-zero crash inside virtual machines |
| |
| We've been seeing hard-to-trigger psi crashes when running inside VM |
| instances: |
| |
| [ 1255.840325] divide error: 0000 [#1] SMP PTI |
| [ 1255.840325] Modules linked in: [...] |
| [ 1255.840325] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119 |
| [ 1255.840325] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015 |
| [ 1255.840325] Workqueue: events psi_clock |
| [ 1255.840325] RIP: 0010:psi_update_stats+0x270/0x490 |
| [ 1255.840325] RSP: 0018:ffffc90001117e10 EFLAGS: 00010246 |
| [ 1255.840325] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8 |
| [ 1255.840325] RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000 |
| [ 1255.840325] RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000 |
| [ 1255.840325] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 |
| [ 1255.840325] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502 |
| [ 1255.840325] FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000 |
| [ 1255.840325] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 |
| [ 1255.840325] CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0 |
| [ 1255.840325] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 |
| [ 1255.840325] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 |
| [ 1255.840325] Call Trace: |
| [ 1255.840325] psi_clock+0x12/0x50 |
| [ 1255.840325] process_one_work+0x1e0/0x390 |
| [ 1255.840325] worker_thread+0x2b/0x3c0 |
| [ 1255.840325] ? rescuer_thread+0x330/0x330 |
| [ 1255.840325] kthread+0x113/0x130 |
| [ 1255.840325] ? kthread_create_worker_on_cpu+0x40/0x40 |
| [ 1255.840325] ? SyS_exit_group+0x10/0x10 |
| [ 1255.840325] ret_from_fork+0x35/0x40 |
| [ 1255.840325] Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48 |
| |
| The Code-line points to `period` being 0 inside update_stats(), and we |
| divide by that when calculating that period's pressure percentage. |
| |
| The elapsed period should never be 0. The reason this can happen is due |
| to an off-by-one in the idle time / missing period calculation combined |
| with a coarse sched_clock() in the virtual machine. |
| |
| The target time for aggregation is advanced into the future on a fixed |
| grid to prevent clock drift. So when an aggregation runs after some idle |
| period, we can not just set it to "now + psi_period", but have to |
| calculate the downtime and advance the target time relative to itself. |
| |
| However, if the aggregator was disabled exactly one psi_period (ns), we |
| drop one idle period in the calculation due to a > when we should do >=. |
| In that case, next_update will be advanced from 'now - psi_period' to |
| 'now' when it should be moved to 'now + psi_period'. The run finishes |
| with last_update == next_update == sched_clock(). |
| |
| With hardware clocks, this exact nanosecond match isn't likely in the |
| first place; but if it does happen, the clock will still have moved on and |
| the period non-zero by the time the worker runs. A pointlessly short |
| period, but besides the extra work, no harm no foul. However, a slow |
| sched_clock() like we have on VMs might not have advanced either by the |
| time the worker runs again. And when we calculate the elapsed period, the |
| result, our pressure divisor, will be 0. Ouch. |
| |
| Fix this by correctly handling the situation when the elapsed time between |
| aggregation runs is precisely two periods, and advance the expiration |
| timestamp correctly to period into the future. |
| |
| Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org |
| Signed-off-by: Johannes Weiner <hannes@cmpxchg.org> |
| Reported-by: Ćukasz Siudut <lsiudut@fb.com |
| Reviewed-by: Andrew Morton <akpm@linux-foundation.org> |
| Cc: Peter Zijlstra <peterz@infradead.org> |
| Signed-off-by: Andrew Morton <akpm@linux-foundation.org> |
| --- |
| |
| |
| --- a/kernel/sched/psi.c~psi-avoid-divide-by-zero-crash-inside-virtual-machines |
| +++ a/kernel/sched/psi.c |
| @@ -322,7 +322,7 @@ static bool update_stats(struct psi_grou |
| expires = group->next_update; |
| if (now < expires) |
| goto out; |
| - if (now - expires > psi_period) |
| + if (now - expires >= psi_period) |
| missed_periods = div_u64(now - expires, psi_period); |
| |
| /* |
| _ |