| |
| [NMI watchdog is available for x86 and x86-64 architectures] |
| |
| Is your system locking up unpredictably? No keyboard activity, just |
| a frustrating complete hard lockup? Do you want to help us debugging |
| such lockups? If all yes then this document is definitely for you. |
| |
| On many x86/x86-64 type hardware there is a feature that enables |
| us to generate 'watchdog NMI interrupts'. (NMI: Non Maskable Interrupt |
| which get executed even if the system is otherwise locked up hard). |
| This can be used to debug hard kernel lockups. By executing periodic |
| NMI interrupts, the kernel can monitor whether any CPU has locked up, |
| and print out debugging messages if so. |
| |
| In order to use the NMI watchdoc, you need to have APIC support in your |
| kernel. For SMP kernels, APIC support gets compiled in automatically. For |
| UP, enable either CONFIG_X86_UP_APIC (Processor type and features -> Local |
| APIC support on uniprocessors) or CONFIG_X86_UP_IOAPIC (Processor type and |
| features -> IO-APIC support on uniprocessors) in your kernel config. |
| CONFIG_X86_UP_APIC is for uniprocessor machines without an IO-APIC. |
| CONFIG_X86_UP_IOAPIC is for uniprocessor with an IO-APIC. [Note: certain |
| kernel debugging options, such as Kernel Stack Meter or Kernel Tracer, |
| may implicitly disable the NMI watchdog.] |
| |
| For x86-64, the needed APIC is always compiled in, and the NMI watchdog is |
| always enabled with I/O-APIC mode (nmi_watchdog=1). Currently, local APIC |
| mode (nmi_watchdog=2) does not work on x86-64. |
| |
| Using local APIC (nmi_watchdog=2) needs the first performance register, so |
| you can't use it for other purposes (such as high precision performance |
| profiling.) However, at least oprofile and the perfctr driver disable the |
| local APIC NMI watchdog automatically. |
| |
| To actually enable the NMI watchdog, use the 'nmi_watchdog=N' boot |
| parameter. Eg. the relevant lilo.conf entry: |
| |
| append="nmi_watchdog=1" |
| |
| For SMP machines and UP machines with an IO-APIC use nmi_watchdog=1. |
| For UP machines without an IO-APIC use nmi_watchdog=2, this only works |
| for some processor types. If in doubt, boot with nmi_watchdog=1 and |
| check the NMI count in /proc/interrupts; if the count is zero then |
| reboot with nmi_watchdog=2 and check the NMI count. If it is still |
| zero then log a problem, you probably have a processor that needs to be |
| added to the nmi code. |
| |
| A 'lockup' is the following scenario: if any CPU in the system does not |
| execute the period local timer interrupt for more than 5 seconds, then |
| the NMI handler generates an oops and kills the process. This |
| 'controlled crash' (and the resulting kernel messages) can be used to |
| debug the lockup. Thus whenever the lockup happens, wait 5 seconds and |
| the oops will show up automatically. If the kernel produces no messages |
| then the system has crashed so hard (eg. hardware-wise) that either it |
| cannot even accept NMI interrupts, or the crash has made the kernel |
| unable to print messages. |
| |
| NOTE: starting with 2.4.2-ac18 the NMI-oopser is disabled by default, |
| you have to enable it with a boot time parameter. Prior to 2.4.2-ac18 |
| the NMI-oopser is enabled unconditionally on x86 SMP boxes. |
| |
| [ feel free to send bug reports, suggestions and patches to |
| Ingo Molnar <mingo@redhat.com> or the Linux SMP mailing |
| list at <linux-smp@vger.kernel.org> ] |
| |