x86, apic: Disable BSP if boot cpu is AP

Currently, on the x86 architecture, if a crash happens on an AP in the
kdump 1st kernel, the 2nd kernel fails to wake up multiple CPUs. The
typical behaviour we actually see is an immediate system reset or a
hang.

This comes from the hardware specification: a processor with the BSP
flag set jumps to the BIOS init code when it receives INIT; the
behaviour we then see depends on that init code.

This never happens if we use only one cpu in the 2nd kernel. So far,
we have worked around the issue by specifying maxcpus=1 or nr_cpus=1
in the kernel parameters of the 2nd kernel.

In order to address the issue, this patch disables the BSP if the boot
cpu is an AP, so that we never try to wake up the BSP by sending INIT.

Before settling on this idea we discussed the following two
alternatives, but neither can be adopted, for the reasons given:

  1. Switch CPU from AP to BSP via IPI NMI at crash in the 1st kernel

    This would be done in the kdump crash path, where the kernel is in
    an inconsistent state. Any part of memory can be corrupted,
    including the hardware-related tables that are accessed, for
    example, when paging is performed or an interrupt is delivered.

  2. Unset BSP flag of the boot cpu in the 1st kernel

    Unsetting the BSP flag can badly affect some real-world firmware.
    For example, Ma verified that some HP systems fail to reboot under
    this configuration. See:
    http://lkml.indiana.edu/hypermail/linux/kernel/1308.1/03574.html

Due to idea 1, we have to address the issue in the 2nd kernel, running
on an AP. There, it's impossible to find out which CPU is the BSP with
the rdmsr instruction, because rdmsr reads only the MSRs of the CPU
executing it, and the BSP is exactly the CPU we would be trying to
wake up. For the same reason, it's also impossible to unset the BSP
flag of the BSP with the wrmsr instruction.
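
For reference, the BSP flag discussed here is bit 8 of the
IA32_APIC_BASE MSR (address 0x1B). The following is only a minimal
sketch of decoding that bit from a value already read on the local
CPU; the macro and function names are hypothetical and are not taken
from this patch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* IA32_APIC_BASE MSR address (illustrative name, not the kernel's). */
#define APIC_BASE_MSR_ADDR 0x1BU
/* Bit 8 of IA32_APIC_BASE is the BSP flag. */
#define APIC_BASE_BSP_BIT  (UINT64_C(1) << 8)

static_assert(APIC_BASE_BSP_BIT == 0x100, "BSP flag must be bit 8");

/*
 * Decode the BSP flag from a value previously read from IA32_APIC_BASE.
 * The value always describes the CPU that executed rdmsr, so an AP
 * cannot use this to inspect another CPU.
 */
static bool apic_base_is_bsp(uint64_t msr_value)
{
	return (msr_value & APIC_BASE_BSP_BIT) != 0;
}
```

Since the value is per-CPU, this check can only tell the running CPU
whether it is itself the BSP.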

Next, due to idea 2, the BSP is halted in the 1st kernel with its BSP
flag still set (or could possibly be running somewhere in a
catastrophic state). In general, every CPU other than the boot cpu of
the 2nd kernel (the cpu on which the crash happened) may have been
left in an inconsistent state by the 1st kernel. For APs, it's
possible to recover a sane state by sending INIT to them; see 3.7.3
Processor-specific INIT in the MultiProcessor Specification. However,
there's no such way for the BSP. Therefore, there's no other way than
to disable the BSP.

My motivation is to generate a crash dump quickly on a system with
huge memory. We can assume such a system also has a large number N of
cpus, of which N-1 are still available.

To identify which CPU is the BSP, we look up the ACPI table or the MP
table. One concern is that the ACPI guideline says the BIOS *should*
list the BSP in the first MADT LAPIC entry; not *must*. In this sense,
this logic relies on the BIOS following ACPI's guideline. On the other
hand, we don't need to worry about this in the MP table case because
it has an explicit BSP flag.
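
The lookup described above can be sketched roughly as follows. This
is an illustration under stated assumptions, not the code in this
patch: struct cpu_entry, enum cpu_source and both functions are
hypothetical, and only the MP table flag bits (EN = bit 0, BP = bit 1)
come from the MultiProcessor Specification.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define MP_CPU_ENABLED       0x01U /* MP spec CPU flags, bit 0 (EN) */
#define MP_CPU_BOOTPROCESSOR 0x02U /* MP spec CPU flags, bit 1 (BP) */

/* Hypothetical, simplified view of one enumerated processor entry. */
struct cpu_entry {
	uint32_t apic_id;
	uint8_t  mp_cpuflag; /* only meaningful for MP table entries */
};

enum cpu_source { SRC_ACPI_MADT, SRC_MP_TABLE, SRC_OTHER };

/* Decide whether entry number 'index' describes the BSP. */
static bool entry_is_bsp(enum cpu_source src, const struct cpu_entry *e,
			 size_t index)
{
	assert(e != NULL);
	switch (src) {
	case SRC_ACPI_MADT: /* relies on BIOS listing the BSP first */
		return index == 0;
	case SRC_MP_TABLE:  /* explicit BP flag */
		return (e->mp_cpuflag & MP_CPU_BOOTPROCESSOR) != 0;
	default:            /* no BSP information available */
		return false;
	}
}

/* Count the CPUs that may be woken up: never the BSP, never ourselves. */
static size_t count_wakeable(enum cpu_source src,
			     const struct cpu_entry *entries, size_t n,
			     uint32_t boot_apic_id)
{
	size_t count = 0;

	for (size_t i = 0; i < n; i++) {
		if (entries[i].apic_id == boot_apic_id)
			continue; /* the CPU we are already running on */
		if (entry_is_bsp(src, &entries[i], i))
			continue; /* do not send INIT to the BSP */
		count++;
	}
	return count;
}
```

With four entries whose first one is the BSP, booting on the AP with
APIC ID 2 leaves two wakeable CPUs in this sketch, while booting on
the BSP itself leaves three.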

To avoid any undesirable behaviour caused by a broken BIOS that
doesn't conform to the guideline, it's enough to limit the number of
cpus to 1 by specifying maxcpus=1 or nr_cpus=1, as is currently done
in the default kdump configuration. (But of course, the maxcpus=1 case
is still problematic if other cpus are woken up later from user
space.)

SFI and devicetree don't provide BSP information, so there's no
functional change in their code; they only assign false for all
entries, to keep the interface uniform.

[ hpa: it might be better in the future for the primary kernel to
  explicitly export the APIC ID for the BSP.  The primary kernel will
  know, both due to which CPU it was booted from and because it can
  see the BSP flag.  However, this seems like a net win for now. ]

Signed-off-by: HATAYAMA Daisuke <d.hatayama@jp.fujitsu.com>
Link: http://lkml.kernel.org/r/20130829092804.5476.95588.stgit@localhost6.localdomain6
Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
6 files changed