MCE: Add Action-Required support

Implement core MCA recovery. This is used for errors
that happen in the current execution context.

The kernel has to first pass the error information
to a function running on the current process stack.
This is done by setting a new work flag, so that the code
runs after the exception through do_notify_resume.

Then hwpoison is allowed to sleep and can try to recover
from the error.

To pass the information about the error around we need
to use a field in the current process. The old way
of handling this (a per-CPU buffer) doesn't work because
the process could be moved to another CPU before reaching
the handler code.
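
A minimal user-space sketch of the hand-off described above,
not the kernel code itself. The struct, the flag name and both
function names are hypothetical stand-ins: the exception side
only records the address in the task and sets a flag, and the
notify-resume side picks it up later in process context.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical work flag; the real flag name and bit are not shown here. */
#define SKETCH_WORK_MCE_NOTIFY (1u << 0)

/* Hypothetical stand-in for the per-process state. */
struct sketch_task {
	unsigned int work_flags;  /* work pending on return from the exception */
	uint64_t mce_error_addr;  /* address reported by the machine check */
};

/*
 * Exception context: must not sleep, so only record the error in the
 * current task and flag it. Because the data lives in the task and not
 * in a per-CPU buffer, it follows the task if it runs on another CPU.
 */
void sketch_mce_exception(struct sketch_task *cur, uint64_t paddr)
{
	assert(cur != NULL);
	cur->mce_error_addr = paddr;
	cur->work_flags |= SKETCH_WORK_MCE_NOTIFY;
}

/*
 * Process context (stand-in for the do_notify_resume path): consume the
 * pending error. Returns the recorded address, or 0 if nothing is pending.
 * Real code would pass the address on to hwpoison here, which may sleep.
 */
uint64_t sketch_notify_resume(struct sketch_task *cur)
{
	uint64_t paddr;

	assert(cur != NULL);
	if (!(cur->work_flags & SKETCH_WORK_MCE_NOTIFY))
		return 0;

	cur->work_flags &= ~SKETCH_WORK_MCE_NOTIFY;
	paddr = cur->mce_error_addr;
	cur->mce_error_addr = 0;
	return paddr;
}
```

The sketch clears the flag before returning the address so
that a second pass through the resume path finds nothing
pending.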

For kernel recovery we only handle errors that happen in
code covered by the copy_*_user() exception tables, and
inject EFAULT. When the tolerance level is sufficiently
high we also do an unsafe, oops-like kill through
do_exit(), which has some deadlock potential.
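
The decision above can be sketched in user space roughly as
follows. This is an illustration under assumptions, not the
patch: all names are hypothetical, the threshold value is made
up, the table lookup is a plain linear search, and the "give
up" outcome for the remaining case is an assumption, since the
text above does not say what happens then.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical exception table entry: faulting insn -> fixup address. */
struct sketch_extable_entry {
	uintptr_t insn;
	uintptr_t fixup;
};

/* Hypothetical register snapshot: instruction pointer and return value. */
struct sketch_regs {
	uintptr_t ip;
	long ret;
};

enum sketch_action { SKETCH_FIXUP, SKETCH_KILL, SKETCH_GIVE_UP };

/* Assumed threshold; the real tolerance levels are not given in the text. */
#define SKETCH_TOLERANT_KILL 2

/* Linear search for a fixup covering ip; returns 0 if there is none. */
uintptr_t sketch_search_extable(const struct sketch_extable_entry *tbl,
				size_t n, uintptr_t ip)
{
	for (size_t i = 0; i < n; i++)
		if (tbl[i].insn == ip)
			return tbl[i].fixup;
	return 0;
}

/* Decide what to do with a machine check taken in kernel mode. */
enum sketch_action sketch_kernel_mce(const struct sketch_extable_entry *tbl,
				     size_t n, struct sketch_regs *regs,
				     int tolerant)
{
	uintptr_t fixup = sketch_search_extable(tbl, n, regs->ip);

	if (fixup) {
		/* Covered by an exception table: resume at the fixup, fail the copy. */
		regs->ip = fixup;
		regs->ret = -EFAULT;
		return SKETCH_FIXUP;
	}
	if (tolerant >= SKETCH_TOLERANT_KILL)
		return SKETCH_KILL;   /* unsafe do_exit()-style kill */
	return SKETCH_GIVE_UP;        /* assumed outcome, not from the text */
}
```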

FIXME: fix 386 handling of mce notify bit in entry_32.S after mce

Signed-off-by: Andi Kleen <>
4 files changed