Evaluating Realtime Linux system performance with rteval
Clark Williams <williams@redhat.com>
--------------------------------------------------------
Abstract
--------
One of the problems of developing and fielding any software product
that runs on a wide range of hardware platforms is determining how
well the product performs on those platforms. This is especially true
when developing something as closely tied to the hardware as a Realtime
Linux system. So, how do we measure the performance of a realtime Linux
kernel on a particular hardware platform? And what defines "good
performance" for a realtime system?
A real time system is one which must meet deadlines of some
sort. These deadlines may be periodic (e.g. occurring every 200
milliseconds) or they may be a time limit following the occurrence
of an event (e.g. no more than 100 milliseconds following the arrival
of a network packet). To give a realtime application the best chance
of meeting its deadline(s), a realtime OS must minimize the time
between the occurrence of an event and the servicing of that event (the latency).
This paper describes the 'rteval' program, a Python 2.x program
developed at Red Hat to help quantify realtime performance on the
Realtime kernel. Rteval is an attempt to put together a synthetic
benchmark which mimics a well-behaved realtime application running on
a heavily loaded realtime Linux system. It uses the 'cyclictest'
program in the role of the realtime app and two loads, a parallel
build of the Linux kernel and the scheduler benchmark 'hackbench', to
boost the system load.
Rteval runs for a specified length of time (typically 12 hours). When
an rteval run is completed, a statistical analysis of the results is
done and an XML file is generated containing system state, raw result
data and the statistical analysis results; optionally, the XML is sent
via XML-RPC to a database server for reporting.
The Load Applications
---------------------
Rteval uses two system loads. The first loops a scheduler
benchmark called 'hackbench'. Hackbench creates pairs of threads which
send data from a sender to a receiver via a pipe or socket. It creates
many small processes which do lots of I/O, exercising the kernel
scheduler by forcing many scheduling decisions. The rteval wrapper
class for hackbench runs the hackbench program continually until
signaled to stop by the main logic. The number of hackbench threads is
determined by the number of cpu cores available on the test system.
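
For illustration, here is a minimal sketch (not the actual rteval
hackbench module; the '-g' group count and the thread-based stop signal
are assumptions) of the kind of wrapper loop described above:

    import multiprocessing
    import os
    import subprocess
    import threading

    def hackbench_load(stop_event, groups=None):
        # Size the load from the number of online cpu cores.
        if groups is None:
            groups = multiprocessing.cpu_count()
        devnull = open(os.devnull, "w")
        while not stop_event.is_set():
            # Each hackbench run creates many communicating processes,
            # forcing lots of scheduler activity.
            subprocess.call(["hackbench", "-g", str(groups)],
                            stdout=devnull, stderr=devnull)
        devnull.close()

    stop = threading.Event()
    loader = threading.Thread(target=hackbench_load, args=(stop,))
    loader.start()
    # ... run the measurement for the desired time, then:
    # stop.set(); loader.join()
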
The other load used by rteval is a parallel compile of the Linux
kernel. Rteval has a module named 'kcompile' which controls the kernel
build process by invoking make with twice as many simultaneous build
jobs as there are online cpus. The clean, bzImage and modules
targets are built with an 'allmodconfig' configuration to
maximize the amount of compilation done. This results in a large
amount of process creation (preprocessors, compilers, assemblers and
linkers) as well as a moderately heavy file I/O load. The kernel build
load is repeated until the rteval runtime is reached.
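
A rough sketch of the kcompile idea (hypothetical code, assuming a
prepared kernel source tree in 'kernel_dir'):

    import multiprocessing
    import subprocess
    import time

    def kcompile_load(kernel_dir, stop_time):
        # Twice as many simultaneous build jobs as online cpus.
        jobs = str(2 * multiprocessing.cpu_count())
        subprocess.call(["make", "-C", kernel_dir, "allmodconfig"])
        # Repeat the build cycle until the rteval runtime is reached.
        while time.time() < stop_time:
            for target in ("clean", "bzImage", "modules"):
                subprocess.call(["make", "-C", kernel_dir, "-j", jobs, target])
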
The intent behind the load programs is to generate enough
threads doing a balanced mix of operations (disk I/O, computation,
IPC, etc.) that no processor core in the system under test is ever
without a process ready to run. The success of the loads can be
measured by watching the system 'load average' (either by examining
/proc/loadavg or by running the 'uptime' or 'top' programs).
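
As a quick sanity check (a sketch, not part of rteval), the 1-minute
load average can be compared against the core count directly:

    import multiprocessing

    with open("/proc/loadavg") as f:
        one_minute = float(f.read().split()[0])
    cores = multiprocessing.cpu_count()
    # The loads are doing their job if the load average stays at or
    # above the number of cores for the whole run.
    print("load average %.2f on %d cores" % (one_minute, cores))
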
The Measurement Application
---------------------------
The cyclictest program is used as the realtime application. Cyclictest
measures the delay between timer expiration and the time when the program
waiting for the timer actually runs. The way it does this is by taking
a timestamp (t1) just before calling the timer wait function, then
sleeping for a specified interval. Upon waking up, a second timestamp
(t2) is taken. Then the difference is calculated between the timer
expiration time (t1 + interval) and the actual wakeup time (t2). This
is the event latency. For example, if the initial time stamp t1 is
1000 and the interval is 100, then the calculated wakeup time is
1100. If the wakeup time stamp (t2) is 1110, then cyclictest would
report a latency of 10.
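
A simplified illustration of that measurement loop (in Python rather
than the C used by cyclictest, and without the realtime scheduling and
memory locking setup):

    import time

    INTERVAL = 0.000100   # 100 microseconds

    def one_sample():
        t1 = time.clock_gettime(time.CLOCK_MONOTONIC)  # timestamp before the wait
        time.sleep(INTERVAL)                           # wait for the timer
        t2 = time.clock_gettime(time.CLOCK_MONOTONIC)  # timestamp after wakeup
        expected = t1 + INTERVAL                       # when we should have woken
        return (t2 - expected) * 1e6                   # event latency in microseconds
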
The cyclictest program is run in one of two modes, with either the
--smp option or the --numa option, based on the number of memory nodes
detected on the system. Both of these cases create a measurement
thread for each online cpu in the system and these threads are run
with a SCHED_FIFO scheduling policy at priority 95. All memory
allocations done by cyclictest are locked into memory using the
mlockall(2) system call (to eliminate major page faults). The
measurement threads are run with the same interval (100 microseconds)
using the clock_gettime(2) call to get time stamps and the
clock_nanosleep(2) call to actually invoke a timer. Cyclictest keeps a
histogram of observed latency values for each thread, which is dumped
to standard output and read by rteval when the run is complete.
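
A hedged sketch of how such an invocation might be put together (the
option spellings follow current cyclictest usage; check
'cyclictest --help' on your version, and the sysfs node counting is an
assumption):

    import os
    import subprocess

    def run_cyclictest(duration_secs, hist_size=2000):
        # Use --numa when more than one memory node is present, --smp otherwise.
        nodes = [d for d in os.listdir("/sys/devices/system/node")
                 if d.startswith("node")]
        mode = "--numa" if len(nodes) > 1 else "--smp"
        cmd = ["cyclictest", mode,
               "-m", "-q",              # lock memory, quiet run
               "-p95",                  # SCHED_FIFO priority 95
               "-i100",                 # 100 microsecond interval
               "-h", str(hist_size),    # dump per-thread latency histograms
               "-D", str(duration_secs)]
        return subprocess.check_output(cmd).decode()
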
The Results
-----------
The main idea behind the 'rteval' program was to get two pieces of
information about the performance of the RT Linux kernel on a
particular hardware platform:
1. What is the maximum latency seen?
2. How variable are the service times?
The first is easy: just iterate through the histograms returned for
each cpu and find the highest index with a non-zero value, as in the
sketch below. The second is a little more complicated.
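
The first calculation might look like this (the histogram layout, one
bucket per microsecond of latency keyed by cpu, is an assumption):

    def max_latency(histograms):
        # histograms: dict mapping cpu number -> list of sample counts,
        # one bucket per microsecond of observed latency.
        worst = 0
        for buckets in histograms.values():
            for index, count in enumerate(buckets):
                if count:
                    worst = max(worst, index)
        return worst
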
Early in rteval development, rather than use a histogram, cyclictest
would just dump every sample to a file and rteval would parse the file
after the run. Unfortunately, when you're sampling at a rate of once
every 100 microseconds for each cpu in the system, you generate a
*lot* of data, especially since we want to run rteval for many hours,
possibly days. The output from cyclictest in that mode is a
26-character string for each sample recorded, so when sampling at
100us you generate 10,000 samples per second per cpu; for a 1 hour run
on a four core box, you'd get:
10,000 * 60 * 60 * 4 == 144,000,000 samples/hr
Multiply that by the 26-character string written by cyclictest for
each sample and you write 3,744,000,000 bytes (roughly 3.7 gigabytes)
per hour to disk. A 12 hour run on a four core system would generate
nearly 45 gigabytes of data. This was deemed excessive...
So the decision was made to record the latency values in histogram
form, one histogram for each measurement thread. This has the
advantage of using only a fixed amount of memory to record samples,
but has the disadvantage of losing the temporal ordering information
that would allow you to detect periodic latencies by looking at the
time stamps of spikes. It also complicates statistics calculations
which presume you have the entire data set for analysis. This was
worked around by treating each non-zero histogram bucket as a series
of samples of that index value.
The variability calculation is basically a stock standard deviation
calculation, where a mean is calculated for the data set and then the
variance and standard deviation are calculated. Other measures of
variability, such as Mean Absolute Deviation, are also calculated, but
to date Standard Deviation has been a reliable indicator of the
variability of service times. This variability is sometimes called
'jitter' in realtime parlance, due to the plot the data would make on
an oscilloscope.
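
A sketch of that bucket-weighted calculation (using the population
variance; rteval's actual statistics code may differ in detail):

    import math

    def hist_stats(buckets):
        # Treat bucket i, holding count c, as c samples with the value i.
        n = float(sum(buckets))
        mean = sum(i * c for i, c in enumerate(buckets)) / n
        variance = sum(c * (i - mean) ** 2 for i, c in enumerate(buckets)) / n
        stddev = math.sqrt(variance)
        # Mean Absolute Deviation from the same histogram.
        mad = sum(c * abs(i - mean) for i, c in enumerate(buckets)) / n
        return mean, variance, stddev, mad
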
In addition to calculating mean, variance and standard deviation,
rteval's statistics code calculates and stores the usual suspects for
statistics, such as min, max, mode, median, and range, both for each
cpu core and aggregated for all cores.
Another challenge was in identifying the underlying hardware platform
so that runs on the same system could be grouped properly. The Desktop
Management Interface (DMI) tables maintained by the BIOS were a good
starting point, since they record information about the cpu, memory,
and peripheral devices in the system. Added to that information is
some state about the system while the test was running: kernel
version, active clocksource, number of NUMA nodes available, kernel
threads and their priorities, kernel modules loaded, and the state of
network interfaces.
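
On reasonably recent kernels the basic DMI identity fields can also be
read from sysfs, as in this sketch (an illustration only, not
necessarily how rteval gathers the data):

    def dmi_summary():
        # Identity fields exposed by the kernel's DMI/SMBIOS support.
        fields = ("sys_vendor", "product_name", "product_version", "board_name")
        info = {}
        for name in fields:
            try:
                with open("/sys/class/dmi/id/" + name) as f:
                    info[name] = f.read().strip()
            except IOError:
                info[name] = "unknown"
        return info
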
Problems
--------
Using rteval has helped Red Hat locate areas that cause performance
problems with realtime Linux kernels. Some of these problem areas are:
1. BIOS/SMI issues
Many systems use System Management Interrupts (SMIs) to perform system
critical operations without help from the running operating system by
trapping back to BIOS code. Unfortunately, this causes 'gaps in time'
for the kernel, since nothing can run while an SMI is being handled in
the BIOS. Most of the time the SMI impact is negligible, since it's
mostly thermal management: reading a thermocouple and turning fans on
or off. Sometimes, though, the operation takes a long time (e.g. when
EDAC hardware needs to correct a memory error), and that can cause
deadlines to be missed by many hundreds of microseconds. Red Hat has been working
with hardware vendors to identify these hotspots and reduce their
impact.
2. Kernel scalability issues
In the past few years, the number of cores per socket on a motherboard
has gone up from 2 to 8, resulting in some scalability problems in the
kernel. One area that has received a lot of attention is the load
balancer for SCHED_OTHER tasks. This is logic in the kernel that
attempts to make sure that each core in the system has tasks to run
and that no one core is overloaded with tasks. During a load balancer
pass, a core with a long run queue (indicating there are many tasks
ready on that core) will have some of those tasks migrated to other
cores, which requires that both the source and destination cores' run
queue locks be held (meaning nothing can run on those cores).
In a stock Linux kernel, long SCHED_OTHER load balancer passes result
in better cpu utilization and an overall throughput gain.
Unfortunately long load balancer passes can result in missed
deadlines because a task on the run queue for a core cannot run while
the load balancer is running. To compensate for this, on realtime Linux
the load balancer has a lower target number of migrations and looks
for contention on the run queue locks (meaning that a task is trying
to be scheduled on one of the cores on which the balancer is
operating). Research in this area is ongoing.
There is also a load balancer for realtime (SCHED_FIFO and SCHED_RR)
threads, and similar research is being done toward reducing the
overhead of that load balancer as well.
3. NUMA
In conjunction with the increase in the number of cpu cores per die
has been the desire to reduce the number of interconnect traces
between cpus and memory nodes (as you pack the cores tighter, you have
less room to run connections to memory and other cores). One way to do
this is to route a core's address/data/signal lines to some sort of
switch module, such as AMD's HyperTransport (HT) mechanism or Intel's
Quick Path Interconnect (QPI). With a series of switches in place many
cores can access memory and other cpu resources through the switch
network without programs running on them knowing they're going through
a switch. This results in a Non-Uniform Memory Access (NUMA)
architecture, which means that some memory accesses will take longer
than others due to traversing the switch network. NUMA is great for
scaling up throughput-oriented servers, but tends to hurt determinism
if the programs are not aware of the memory topology.
A --numa option was added to the cyclictest program to use the libnuma
library to bind threads to local memory nodes and allocate memory on
the closest memory node, so as to minimize the time required to access
memory.
Further Development
-------------------
Once we started getting rteval run information it was natural that we
would want to store it in a database for further analysis (especially
watching for performance regressions). David Sommerseth created a set
of tables for a PostgreSQL database and then added an option to rteval
to ship the results back to a database server using XML-RPC. This
option is currently used internally at Red Hat to ship rteval run data
back to our internal DB server. There are no plans to open this data
up to the public, but the XML-RPC code is there if someone else wants
to use the facility. (No, there are no backdoors in the code that ships
run data back to Red Hat; it's Python code, look and see!)