MCE Stress Test HOWTO
====================
Oct 10th, 2009
Haicheng Li
Abstract
--------
This document explains the design and structure of the MCE stress test suite,
the kernel configurations and user space tools required for automated
stress testing, and how to use the suite.
0. Quick Shortcut
-----------------
- Install the Linux kernel (2.6.32 or newer) with full MCA recovery support.
Make sure the following configuration options are enabled:
CONFIG_X86_MCE=y
CONFIG_MEMORY_FAILURE=y
With these two options enabled, you can do stress testing thru the madvise
syscall (sec 4.1).
- Install the page-types tool (sec 3.3), which ships with the Linux kernel
source (2.6.32 or newer).
# cd $KERNEL_SRC/Documentation/vm/
# gcc -o page-types page-types.c
# cp page-types /usr/bin/
- Get the latest LTP (Linux Test Project) package from http://ltp.sf.net.
Refer to the INSTALL file of LTP to install LTP on your machine.
- Build and run the stress test
# make
# cd stress
# ./hwpoison.sh -d $YOUR_PARTITION -M -o $YOUR_LTP_DIR -N
Note that '-d $YOUR_PARTITION' is a mandatory option. The test creates
all temporary files on $YOUR_PARTITION, and error injection only affects
the pages associated with $YOUR_PARTITION. So you must provide a free disk
partition to the stress test driver!
This runs the stress test thru the madvise syscall (sec 4.1). More advanced
test methods are also provided (sec 4.2, 4.3).
Note that all examples in the rest of this document assume that $PWD is the
stress subdir.
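The kernel options named in the first step can be verified before building.
The following is a minimal sketch and not part of the test suite:
check_kconfig is a hypothetical helper, and the config file location in the
usage comment is an assumption that varies between distributions.

```shell
# check_kconfig CONFIG_FILE OPTION...
# Hypothetical helper (not part of the test suite): prints every OPTION
# that is not set to "y" in CONFIG_FILE and returns non-zero if any is
# missing.
check_kconfig() {
    config_file=$1
    shift
    missing=0
    for opt in "$@"; do
        if ! grep -q "^${opt}=y\$" "$config_file"; then
            echo "missing: $opt"
            missing=1
        fi
    done
    return "$missing"
}

# Usage (assumed config location; some systems provide /proc/config.gz
# instead):
#   check_kconfig "/boot/config-$(uname -r)" CONFIG_X86_MCE CONFIG_MEMORY_FAILURE
```

The same helper can be reused with the longer option lists in sec 4.2 and 4.3.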
1. Overview
-----------
The MCE stress test suite is a collection of tools and test scripts for
stress testing the Linux kernel MCA high level handlers, which include
HWPoison page recovery, soft page offline, and so on.
In general, this test suite is designed to do stress testing thru various
test interfaces, namely the madvise syscall, the HWPoison page injector, and
the APEI injector (see the ACPI 4.0 spec). It supports most popular Linux
file systems (FS): an option lets the user specify which FS type the test
runs on.
If you just want to start testing as quickly as possible, you can skip
sections 2 and 3 and go directly to section 4.
2. Design Details
-----------------
The MCE stress test suite consists of four parts: test driver, workload
controller, customized workloads, and background workloads.
The main test idea is as follows:
- The test driver launches various customized workloads to continuously
generate lots of pages with expected page states. Note that all of these
workloads know their expected results, which should not be affected by the
Linux MCE high level handlers.
- The test driver then injects MCE errors into these pages thru either the
madvise syscall, the HWPoison injector, or the APEI injector. While the Linux
kernel handles these MCE errors, all the workloads continue running normally.
- After a long run, the test driver collects the test result of each
workload to see whether any unexpected failure happened. In this way, it can
decide whether a bug has been found.
- If a system panic or FS corruption happens, there must be a bug. This is
the bottom line for deciding whether the test passes.
2.1 Test Driver
The test driver (a.k.a. hwpoison.sh) drives the whole test procedure. It is
responsible for managing the test environment, setting up the error injection
interface, controlling test progress, launching workloads, injecting page
errors, as well as recording test logs and reporting the test result.
For detailed usage of the hwpoison.sh test driver, please refer to:
# ./hwpoison.sh -h
2.2 Workload Controller
The workload controller needs to keep various test workloads running in
parallel and continuously for a required duration. We select the ltp-pan
program of the Linux Test Project (LTP) as the workload controller of this
stress test suite.
The test driver (hwpoison.sh) interacts with ltp-pan in the following ways:
- hwpoison.sh generates a test config file that lists the workload types
to be launched by ltp-pan.
- hwpoison.sh also passes the test duration and other workload specific
parameters to ltp-pan via the test config file.
- ltp-pan makes each workload run and finish in time; the test driver can
then get the result of each workload via the corresponding result files.
- Finally, hwpoison.sh decides the overall test result based on each
workload result, and reports the final result.
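The command file that ltp-pan reads is plain text with one "tag command"
pair per line. A minimal sketch of generating such a file is shown here;
write_pan_cmdfile and the workload commands in the usage comment are
hypothetical and are not taken from hwpoison.sh.

```shell
# write_pan_cmdfile FILE "TAG COMMAND [ARGS...]" ...
# Hypothetical helper: writes one "tag command" line per workload, the
# layout ltp-pan reads with -f. Entries without both a tag and a command
# are rejected.
write_pan_cmdfile() {
    cmdfile=$1
    shift
    : > "$cmdfile" || return 1
    for entry in "$@"; do
        tag=${entry%% *}    # text before the first space
        cmd=${entry#* }     # text after the first space
        if [ -z "$tag" ] || [ "$tag" = "$entry" ] || [ -z "$cmd" ]; then
            echo "bad entry: $entry" >&2
            return 1
        fi
        printf '%s\n' "$entry" >> "$cmdfile"
    done
}

# Usage (workload commands and file names are placeholders; the ltp-pan
# flags follow its usage text and should be checked against your LTP
# version):
#   write_pan_cmdfile pan.cmd "workload-a ./workload-a" "workload-b ./workload-b"
#   ltp-pan -n hwpoison -f pan.cmd -t 3600s -l pan_log -o pan_output
```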
2.3 Customized Workloads
There are three types of customized workloads, which are intended to generate
pages with various page states.
* Type0: page-poisoning workload, meant to cover:
- anonymous page operations.
- file data operations.
* Type1: fs-metadata workload, meant to cover:
- inode operations.
* Type2: fs_type specific workload, meant to cover:
- extended functions of some special FS.
2.4 Background Workloads
LTP is selected as the background workload to simulate normal system
operations in the background while stress testing is running.
Besides LTP, there are also some alternatives, like AIM. We might add more
background workloads in the future.
2.5 Test Result
How to determine whether stress testing passes?
- No kernel panic happens during stress testing.
- fsck on the target disk passes at the end of stress testing.
- No failure is found by the customized workloads, especially by the
page-poisoning workload.
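The three criteria above can be combined as in the following sketch.
stress_verdict is a hypothetical helper, not the logic used by hwpoison.sh,
and it assumes that only an fsck exit status of 0 counts as a pass.

```shell
# stress_verdict PANICS FSCK_STATUS FAILED_WORKLOADS
# Hypothetical helper: prints PASS only when no kernel panic was seen,
# fsck exited with status 0 (assumption: any other status is a failure),
# and no customized workload failed. Otherwise prints FAIL and returns 1.
stress_verdict() {
    panics=$1
    fsck_status=$2
    failed_workloads=$3
    if [ "$panics" -eq 0 ] && [ "$fsck_status" -eq 0 ] &&
       [ "$failed_workloads" -eq 0 ]; then
        echo PASS
    else
        echo FAIL
        return 1
    fi
}

# Usage: pass the number of panics observed, the exit status of your fsck
# run on the target partition, and the number of failed workloads:
#   stress_verdict 0 "$fsck_status" 0
```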
Where to get the detailed test result?
- When stress testing is done, the general test result is recorded in
result/hwpoison.result, and the general test log is in result/hwpoison.log.
However, you can specify other files in the following way:
# hwpoison.sh -r $YOUR_RESULT -l $YOUR_LOG
- The test result and test log of each workload are recorded as
log/$workload/$workload.result and log/$workload/$workload.log.
For example, for the page-poisoning workload, the test result and test log are
log/page-poisoning/page-poisoning.result and
log/page-poisoning/page-poisoning.log.
- Besides, under each workload result dir, you can find extra logs such as
pan_log and pan_output. These logs are generated by the ltp-pan workload
controller. Usually they can help you understand what has been going on with
ltp-pan while the workload was running. Please refer to the ltp-pan
documentation for details.
3. Tools
--------
3.1 page-poisoning
It is the page-poisoning workload, an extension of the tinjpage test program
with a multi-process model. It spawns thousands of processes that inject
HWPoison errors into various pages simultaneously thru the madvise syscall.
Then it checks whether these errors get handled correctly, i.e. whether each
test process receives or doesn't receive the SIGBUS signal as expected.
For more info about the page-poisoning workload, please read the README file
under stress/tools/page-poisoning/.
3.2 fs-metadata
It is the fs-metadata workload. fs-metadata is designed to test inode
operations under heavy workload and make sure every inode operation gets
the expected result. In detail, it first generates a huge directory
hierarchy on the target disk, then performs unlink operations on this
directory hierarchy and duplicates a copy of the directory, and finally
checks whether these two directories are the same, as expected.
For more info about the fs-metadata workload, please read the README file
under stress/tools/fs-metadata/.
3.3 page-types
page-types is a tool to query the page type of every memory page in the
system. We use it to select pages with the required page types. The test
injects errors into these pages via the error injector, although the page
filter of the HWPoison handler in the Linux kernel filters them a second
time. Note that the only reason to do the first filtering with page-types
is performance.
To install page-types on your test machine:
# cd $KERNEL_SRC/Documentation/vm/
# gcc -o page-types page-types.c
# cp page-types /usr/bin/
3.4 ltp-pan
It's the workload controller of this stress test suite. In fact, ltp-pan
is the test harness of LTP (Linux Test Project) and is included in the
LTP package. For more information, please refer to the ltp-pan documentation
in LTP.
4. Usage Guide
--------------
This section shows how to conduct the stress testing thru various test
interfaces.
As an example, we choose to run stress testing on partition /dev/sda1
for 1 hour. Note that we've installed LTP to /ltp.
4.1 Stress Test thru Madvise Syscall
To run this stress testing, you need to strictly follow the test
instructions below.
* Test instructions:
- make sure the following kernel options are enabled:
CONFIG_X86_MCE=y
CONFIG_MEMORY_FAILURE=y
- build and run stress testing
# make
# ./hwpoison.sh -d $YOUR_PARTITION -M -o $YOUR_LTP_DIR
* Example:
- launch testing
# ./hwpoison.sh -d /dev/sda1 -M -t 3600
- general test results
result: result/hwpoison.result
logs: result/hwpoison.log
- detailed workload results
result: log/page-poisoning/page-poisoning.result
log: log/page-poisoning/page-poisoning.log
4.2 Stress Test thru HWPoison Page Injector
This is the default test method of this stress test suite.
To run this stress testing, you need to strictly follow the test
instructions below.
* Test instructions:
- make sure the following kernel options are enabled:
CONFIG_X86_MCE=y
CONFIG_MEMORY_FAILURE=y
CONFIG_DEBUG_KERNEL=y
CONFIG_CGROUP_MEM_RES_CTLR_SWAP=y
CONFIG_HWPOISON_INJECT=y
- build and run stress testing
# make
# ./hwpoison.sh -d $YOUR_PARTITION -o $YOUR_LTP_DIR -L
* Example:
- launch testing
# ./hwpoison.sh -d /dev/sda1 -t 3600 -L
- general test results
result: result/hwpoison.result
logs: result/hwpoison.log
- detailed workload results
fs-metadata result: log/fs-metadata/fs-metadata.result
fs-metadata log: log/fs-metadata/fs-metadata.log
ltp result: log/ltp/ltp.result
ltp log: log/ltp/ltp.log
fs-specific result: log/fs-specific/fs-specific.result
fs-specific log: log/fs-specific/fs-specific.log
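Before launching, it can help to confirm that the HWPoison injector interface
is visible in debugfs. The sketch below is a pre-check only and is not part
of hwpoison.sh, which sets up the injection interface itself;
have_hwpoison_injector is a hypothetical helper, and the default debugfs
mount point is an assumption.

```shell
# have_hwpoison_injector [DEBUGFS_ROOT]
# Hypothetical helper: succeeds if the HWPoison injector's corrupt-pfn file
# exists under DEBUGFS_ROOT (default: /sys/kernel/debug, an assumed mount
# point).
have_hwpoison_injector() {
    debugfs_root=${1:-/sys/kernel/debug}
    [ -e "$debugfs_root/hwpoison/corrupt-pfn" ]
}

# Usage (run as root; mounting debugfs and loading the module are only
# needed if this has not been done already):
#   mount -t debugfs none /sys/kernel/debug
#   modprobe hwpoison-inject
#   have_hwpoison_injector || echo "HWPoison injector not available"
```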
4.3 Stress Test thru APEI Injector
To run this stress testing, you need to follow the test instructions below.
* Test instructions:
- make sure the following kernel options are enabled:
CONFIG_X86_MCE=y
CONFIG_X86_MCE_INTEL=y
CONFIG_MEMORY_FAILURE=y
CONFIG_ACPI_APEI=y
CONFIG_ACPI_APEI_EINJ=y
- build and run stress testing
# make
# ./hwpoison.sh -d $YOUR_PARTITION -o $YOUR_LTP_DIR -L -A
* Example:
- launch testing
# ./hwpoison.sh -d /dev/sda1 -t 3600 -L -A
- general test results
result: result/hwpoison.result
logs: result/hwpoison.log
- detailed workload results
fs-metadata result: log/fs-metadata/fs-metadata.result
fs-metadata log: log/fs-metadata/fs-metadata.log
ltp result: log/ltp/ltp.result
ltp log: log/ltp/ltp.log
fs-specific result: log/fs-specific/fs-specific.result
fs-specific log: log/fs-specific/fs-specific.log
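With CONFIG_ACPI_APEI_EINJ enabled and supported by the firmware, the kernel
exposes the injector under apei/einj in debugfs. The following sketch checks
for the basic control files before a run; missing_einj_files is a
hypothetical helper, and the default debugfs mount point is an assumption.

```shell
# missing_einj_files [DEBUGFS_ROOT]
# Hypothetical helper: prints the EINJ debugfs files that are absent under
# DEBUGFS_ROOT/apei/einj (default root: /sys/kernel/debug) and returns
# non-zero if any is missing.
missing_einj_files() {
    einj_dir=${1:-/sys/kernel/debug}/apei/einj
    status=0
    for f in available_error_type error_type error_inject; do
        if [ ! -e "$einj_dir/$f" ]; then
            echo "$f"
            status=1
        fi
    done
    return "$status"
}

# Usage (run as root):
#   missing_einj_files || echo "APEI EINJ interface incomplete"
```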
5. FAQs
-------
Here is a collection of frequently asked questions:
Q: How to tell the test driver not to format my disk partition?
A: Use the option '-N'.
Q: Can the three types of tests run on the same system simultaneously?
A: No. There are limitations in Linux kernel HWPoison page filtering.
Q: Can I run this stress testing on multiple disks in parallel?
A: Yes. But it requires updated kernel patches for HWPoison page filtering.
Currently, it only supports running the same test with the same pagetype
flags specified.