patches/old/psi-avoid-divide-by-zero-crash-inside-virtual-machines.patch - pub/scm/linux/kernel/git/akpm/25-new - Git at Google

 From: Johannes Weiner <hannes@cmpxchg.org>
 Subject: psi: avoid divide-by-zero crash inside virtual machines

 We've been seeing hard-to-trigger psi crashes when running inside VM
 instances:

 [ 1255.840325] divide error: 0000 [#1] SMP PTI
 [ 1255.840325] Modules linked in: [...]
 [ 1255.840325] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
 [ 1255.840325] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
 [ 1255.840325] Workqueue: events psi_clock
 [ 1255.840325] RIP: 0010:psi_update_stats+0x270/0x490
 [ 1255.840325] RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
 [ 1255.840325] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
 [ 1255.840325] RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
 [ 1255.840325] RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
 [ 1255.840325] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
 [ 1255.840325] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
 [ 1255.840325] FS:  0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
 [ 1255.840325] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
 [ 1255.840325] CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
 [ 1255.840325] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
 [ 1255.840325] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
 [ 1255.840325] Call Trace:
 [ 1255.840325]  psi_clock+0x12/0x50
 [ 1255.840325]  process_one_work+0x1e0/0x390
 [ 1255.840325]  worker_thread+0x2b/0x3c0
 [ 1255.840325]  ? rescuer_thread+0x330/0x330
 [ 1255.840325]  kthread+0x113/0x130
 [ 1255.840325]  ? kthread_create_worker_on_cpu+0x40/0x40
 [ 1255.840325]  ? SyS_exit_group+0x10/0x10
 [ 1255.840325]  ret_from_fork+0x35/0x40
 [ 1255.840325] Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48

 The Code-line points to `period` being 0 inside update_stats(), and we
 divide by that when calculating that period's pressure percentage.

 The elapsed period should never be 0.  The reason this can happen is due
 to an off-by-one in the idle time / missing period calculation combined
 with a coarse sched_clock() in the virtual machine.

 The target time for aggregation is advanced into the future on a fixed
 grid to prevent clock drift.  So when an aggregation runs after some idle
 period, we can not just set it to "now + psi_period", but have to
 calculate the downtime and advance the target time relative to itself.

 However, if the aggregator was disabled exactly one psi_period (ns), we
 drop one idle period in the calculation due to a > when we should do >=.
 In that case, next_update will be advanced from 'now - psi_period' to
 'now' when it should be moved to 'now + psi_period'.  The run finishes
 with last_update == next_update == sched_clock().

 With hardware clocks, this exact nanosecond match isn't likely in the
 first place; but if it does happen, the clock will still have moved on and
 the period non-zero by the time the worker runs.  A pointlessly short
 period, but besides the extra work, no harm no foul.  However, a slow
 sched_clock() like we have on VMs might not have advanced either by the
 time the worker runs again.  And when we calculate the elapsed period, the
 result, our pressure divisor, will be 0.  Ouch.

 Fix this by correctly handling the situation when the elapsed time between
 aggregation runs is precisely two periods, and advance the expiration
 timestamp correctly to period into the future.

 Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org
 Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
 Reported-by: Łukasz Siudut <lsiudut@fb.com
 Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
 Cc: Peter Zijlstra <peterz@infradead.org>
 Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
 ---


 --- a/kernel/sched/psi.c~psi-avoid-divide-by-zero-crash-inside-virtual-machines
 +++ a/kernel/sched/psi.c
 @@ -322,7 +322,7 @@ static bool update_stats(struct psi_grou
  	expires = group->next_update;
  	if (now < expires)
  		goto out;
 -	if (now - expires > psi_period)
 +	if (now - expires >= psi_period)
  		missed_periods = div_u64(now - expires, psi_period);

  	/*
 _
	From: Johannes Weiner <hannes@cmpxchg.org>
	Subject: psi: avoid divide-by-zero crash inside virtual machines

	We've been seeing hard-to-trigger psi crashes when running inside VM
	instances:

	[ 1255.840325] divide error: 0000 [#1] SMP PTI
	[ 1255.840325] Modules linked in: [...]
	[ 1255.840325] CPU: 0 PID: 212 Comm: kworker/0:2 Not tainted 4.16.18-119_fbk9_3817_gfe944c98d695 #119
	[ 1255.840325] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 0.0.0 02/06/2015
	[ 1255.840325] Workqueue: events psi_clock
	[ 1255.840325] RIP: 0010:psi_update_stats+0x270/0x490
	[ 1255.840325] RSP: 0018:ffffc90001117e10 EFLAGS: 00010246
	[ 1255.840325] RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff8800a35a13f8
	[ 1255.840325] RDX: 0000000000000000 RSI: ffff8800a35a1340 RDI: 0000000000000000
	[ 1255.840325] RBP: 0000000000000658 R08: ffff8800a35a1470 R09: 0000000000000000
	[ 1255.840325] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
	[ 1255.840325] R13: 0000000000000000 R14: 0000000000000000 R15: 00000000000f8502
	[ 1255.840325] FS: 0000000000000000(0000) GS:ffff88023fc00000(0000) knlGS:0000000000000000
	[ 1255.840325] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
	[ 1255.840325] CR2: 00007fbe370fa000 CR3: 00000000b1e3a000 CR4: 00000000000006f0
	[ 1255.840325] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
	[ 1255.840325] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
	[ 1255.840325] Call Trace:
	[ 1255.840325] psi_clock+0x12/0x50
	[ 1255.840325] process_one_work+0x1e0/0x390
	[ 1255.840325] worker_thread+0x2b/0x3c0
	[ 1255.840325] ? rescuer_thread+0x330/0x330
	[ 1255.840325] kthread+0x113/0x130
	[ 1255.840325] ? kthread_create_worker_on_cpu+0x40/0x40
	[ 1255.840325] ? SyS_exit_group+0x10/0x10
	[ 1255.840325] ret_from_fork+0x35/0x40
	[ 1255.840325] Code: 48 0f 47 c7 48 01 c2 45 85 e4 48 89 16 0f 85 e6 00 00 00 4c 8b 49 10 4c 8b 51 08 49 69 d9 f2 07 00 00 48 6b c0 64 4c 8b 29 31 d2 <48> f7 f7 49 69 d5 8d 06 00 00 48 89 c5 4c 69 f0 00 98 0b 00 48

	The Code-line points to `period` being 0 inside update_stats(), and we
	divide by that when calculating that period's pressure percentage.

	The elapsed period should never be 0. The reason this can happen is due
	to an off-by-one in the idle time / missing period calculation combined
	with a coarse sched_clock() in the virtual machine.

	The target time for aggregation is advanced into the future on a fixed
	grid to prevent clock drift. So when an aggregation runs after some idle
	period, we can not just set it to "now + psi_period", but have to
	calculate the downtime and advance the target time relative to itself.

	However, if the aggregator was disabled exactly one psi_period (ns), we
	drop one idle period in the calculation due to a > when we should do >=.
	In that case, next_update will be advanced from 'now - psi_period' to
	'now' when it should be moved to 'now + psi_period'. The run finishes
	with last_update == next_update == sched_clock().

	With hardware clocks, this exact nanosecond match isn't likely in the
	first place; but if it does happen, the clock will still have moved on and
	the period non-zero by the time the worker runs. A pointlessly short
	period, but besides the extra work, no harm no foul. However, a slow
	sched_clock() like we have on VMs might not have advanced either by the
	time the worker runs again. And when we calculate the elapsed period, the
	result, our pressure divisor, will be 0. Ouch.

	Fix this by correctly handling the situation when the elapsed time between
	aggregation runs is precisely two periods, and advance the expiration
	timestamp correctly to period into the future.

	Link: http://lkml.kernel.org/r/20190214193157.15788-1-hannes@cmpxchg.org
	Signed-off-by: Johannes Weiner <hannes@cmpxchg.org>
	Reported-by: Łukasz Siudut <lsiudut@fb.com
	Reviewed-by: Andrew Morton <akpm@linux-foundation.org>
	Cc: Peter Zijlstra <peterz@infradead.org>
	Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
	---


	--- a/kernel/sched/psi.c~psi-avoid-divide-by-zero-crash-inside-virtual-machines
	+++ a/kernel/sched/psi.c
	@@ -322,7 +322,7 @@ static bool update_stats(struct psi_grou
	expires = group->next_update;
	if (now < expires)
	goto out;
	- if (now - expires > psi_period)
	+ if (now - expires >= psi_period)
	missed_periods = div_u64(now - expires, psi_period);

	/*
	_