% advsync/rt.tex
% mainfile: ../perfbook.tex
% SPDX-License-Identifier: CC-BY-SA-3.0
\section{Parallel Real-Time Computing}
\label{sec:advsync:Parallel Real-Time Computing}
%
\epigraph{One always has time enough if one applies it well.}
{Johann Wolfgang von G\"othe}
% \epigraph{The difference between you and me is that I was right in time.}
% {Konrad Adenauer}
% No support for this quote found.
An important emerging area in computing is that of parallel real-time
computing.
\Cref{sec:advsync:What is Real-Time Computing?}
looks at a number of definitions of ``real-time computing'', moving
beyond the usual sound bites to more meaningful criteria.
\Cref{sec:advsync:Who Needs Real-Time?}
surveys the sorts of applications that need real-time response.
\Cref{sec:advsync:Who Needs Parallel Real-Time?}
notes that parallel real-time computing is upon us, and discusses
when and why parallel real-time computing can be useful.
\Cref{sec:advsync:Implementing Parallel Real-Time Systems}
gives a brief overview of how parallel real-time systems may be implemented,
with
\cref{sec:advsync:Implementing Parallel Real-Time Operating Systems,%
sec:advsync:Implementing Parallel Real-Time Applications}
focusing on operating systems and applications, respectively.
Finally,
\cref{sec:advsync:Real Time vs. Real Fast: How to Choose?}
outlines how to decide whether or not your application needs real-time
facilities.
\subsection{What is Real-Time Computing?}
\label{sec:advsync:What is Real-Time Computing?}
One traditional way of classifying real-time computing is into the
categories of \emph{hard real time} and \emph{soft real time}, where
the macho hard real-time applications never miss their deadlines, but
the wimpy soft real-time applications miss their deadlines quite often.
\subsubsection{Soft Real Time}
\label{sec:Soft Real Time}
It should be easy to see problems with this definition of soft real time.
For one thing, by this definition, \emph{any} piece of software could be
said to be a soft real-time application:
``My application computes million-point Fourier transforms in half a
picosecond.''
``No way!!!
The clock cycle on this system is more than \emph{three hundred} picoseconds!''
``Ah, but it is a \emph{soft} real-time application!''
If the term ``soft real time'' is to be of any use whatsoever, some limits
are clearly required.
We might therefore say that a given soft real-time application must meet
its response-time requirements at least some fraction of the time, for
example, that it must execute in less than 20 microseconds
99.9\pct\ of the time.
This of course raises the question of what is to be done when the application
fails to meet its response-time requirements.
The answer varies with the application, but one possibility
is that the system being controlled has sufficient stability and inertia
to render harmless the occasional late control action.
Another possibility is that the application has two ways of computing
the result, a fast and deterministic but inaccurate method on the
one hand and
a very accurate method with unpredictable compute time on the other.
One reasonable approach would be to start
both methods in parallel, and if the accurate method fails to finish
in time, kill it and use the answer from the fast but inaccurate method.
One candidate for the fast but inaccurate method is to take
no control action during the current time period, and another candidate is
to take the same control action as was taken during the preceding time
period.
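One way to structure such a fallback is sketched in
\cref{lst:advsync:Racing an Accurate Method Against the Deadline}:
The accurate method runs in a separate thread, the control thread waits
for it with an absolute deadline, and if the deadline arrives first, the
previous control action is simply repeated.
This is only a rough sketch assuming POSIX threads, with error checking
omitted and with the hypothetical \co{accurate_control()} and
\co{apply_control()} functions standing in for application-specific code.
A production system would likely keep a persistent worker thread rather
than joining a possibly still-running one at the end of each period.

\begin{listing}
\begin{VerbatimL}
#include <errno.h>
#include <pthread.h>
#include <time.h>

double accurate_control(void *sensors); /* hypothetical, slow but accurate */
void apply_control(double action);      /* hypothetical actuator output */

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t cond = PTHREAD_COND_INITIALIZER;
static int done;
static double accurate_result;

static void *accurate_worker(void *arg)
{
        double v = accurate_control(arg);   /* unpredictable duration */

        pthread_mutex_lock(&lock);
        accurate_result = v;
        done = 1;
        pthread_cond_signal(&cond);
        pthread_mutex_unlock(&lock);
        return NULL;
}

/* One control period; *deadline is an absolute CLOCK_REALTIME time. */
double control_period(void *sensors, double prev_action,
                      const struct timespec *deadline)
{
        pthread_t tid;
        double action = prev_action;        /* fast, deterministic fallback */
        int ret = 0;

        done = 0;
        pthread_create(&tid, NULL, accurate_worker, sensors);
        pthread_mutex_lock(&lock);
        while (!done && ret != ETIMEDOUT)
                ret = pthread_cond_timedwait(&cond, &lock, deadline);
        if (done)
                action = accurate_result;
        pthread_mutex_unlock(&lock);
        apply_control(action);              /* output by the deadline either way */
        pthread_join(tid, NULL);            /* a late result is simply discarded */
        return action;
}
\end{VerbatimL}
\caption{Racing an Accurate Method Against the Deadline (Sketch)}
\label{lst:advsync:Racing an Accurate Method Against the Deadline}
\end{listing}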
In short, it does not make sense to talk about soft real time without
some measure of exactly how soft it is.
\subsubsection{Hard Real Time}
\label{sec:Hard Real Time}
In contrast, the definition of hard real time is quite definite.
After all, a given system either always meets its deadlines or it
doesn't.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/realtime-smash}}
\caption{Real-Time Response, Meet Hammer}
\ContributedBy{fig:advsync:Hard Real-Time Response; Meet Hammer}{Melissa Broussard}
\end{figure}
Unfortunately, a strict application of this definition would mean that
there can never be any hard real-time systems.
The reason for this is fancifully depicted in
\cref{fig:advsync:Hard Real-Time Response; Meet Hammer}.
And although you can always construct a more robust system, perhaps with
redundancy, your adversary can always get a bigger hammer.
But don't take \emph{my} word for it:
Ask the dinosaurs.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/realtime-lifesupport-nobomb}}
\caption{Real-Time Response:
Hardware Matters}
\ContributedBy{fig:advsync:Real-Time Response: Hardware Matters}{Melissa Broussard}
\end{figure}
Then again, perhaps it is unfair to blame the software for what is clearly
not just a hardware problem, but a bona fide big-iron hardware problem
at that.\footnote{
Or, given modern hammers, a big-steel problem.}
This suggests that we define hard real-time software as software that
will always meet its deadlines, but only in the absence of a hardware
failure.
Unfortunately, failure is not always an option, as fancifully depicted in
\cref{fig:advsync:Real-Time Response: Hardware Matters}.
We simply cannot expect the poor gentleman depicted in that figure to be
reassured by our saying ``Rest assured that if a missed deadline results
in your tragic death, it most certainly will not have been due to a
software problem!''
Hard real-time response is a property of the entire system, not
just of the software.
But if we cannot demand perfection, perhaps we can make do with
notification, similar to the soft real-time approach noted earlier.
Then if the Life-a-Tron in
\cref{fig:advsync:Real-Time Response: Hardware Matters}
is about to miss its deadline,
it can alert the hospital staff.
\begin{figure*}
\centering
\resizebox{\onecolumntextwidth}{!}{\rotatebox{90}{\includegraphics{cartoons/realtime-lazy-crop}}}
\caption{Real-Time Response:
Notification Insufficient}
\ContributedBy{fig:advsync:Real-Time Response: Notification Insufficient}{Melissa Broussard}
\end{figure*}
Unfortunately, this approach has the trivial solution fancifully depicted in
\cref{fig:advsync:Real-Time Response: Notification Insufficient}.
A system that always immediately issues a notification that it won't
be able to meet its deadline complies with the letter of the law,
but is completely useless.
There clearly must also be a requirement that the system meets its deadline
some fraction of the time, or perhaps that it be prohibited from missing
its deadlines on more than a certain number of consecutive operations.
We clearly cannot take a sound-bite approach to either hard or soft
real time.
The next section therefore takes a more real-world approach.
\subsubsection{Real-World Real Time}
\label{sec:advsync:Real-World Real Time}
Although sentences like ``Hard real-time systems \emph{always} meet
their deadlines!\@'' are catchy and easy to memorize, something else is
needed for real-world real-time systems.
Although the resulting specifications are
harder to memorize, they can simplify construction of a real-time
system by imposing constraints on the environment, the workload, and
the real-time application itself.
\paragraph{Environmental Constraints}
\label{sec:advsync:Environmental Constraints}
Constraints on the environment address the objection to open-ended
promises of response times implied by ``hard real time''.
These constraints might specify permissible operating temperatures,
air quality, levels and types of electromagnetic radiation, and, to
\cref{fig:advsync:Hard Real-Time Response; Meet Hammer}'s
point, levels of shock and vibration.
Of course, some constraints are easier to meet than others.
Any number of people have learned the hard way that
commodity computer components often refuse to operate at sub-freezing
temperatures, which suggests a set of climate-control requirements.
An old college friend once had the challenge of operating
a real-time system in an atmosphere featuring some rather aggressive
chlorine compounds, a challenge that he wisely handed off to his
colleagues designing the hardware.
In effect, my friend imposed an atmospheric-composition constraint
on the environment immediately surrounding the computer, a constraint
that the hardware designers met through use of physical seals.
Another old college friend worked on a computer-controlled system
that sputtered ingots of titanium using an industrial-strength arc
in a vacuum.
From time to time, the arc would decide that it was bored with its path
through the ingot of titanium and choose a far shorter and more
entertaining path to ground.
As we all learned in our physics classes, a sudden shift in the flow of
electrons creates an electromagnetic wave, with larger shifts in larger
flows creating higher-power electromagnetic waves.
And in this case, the resulting electromagnetic pulses were sufficient
to induce a quarter of a volt potential difference in the leads of
a small ``rubber ducky'' antenna located more than 400 meters away.
This meant that nearby conductors experienced higher voltages, courtesy
of the inverse-square law.
This included those conductors making up the computer controlling the
sputtering process.
In particular, the voltage induced on that computer's reset line was
sufficient to actually reset the computer, mystifying everyone involved.
This situation was addressed using hardware, including some elaborate
shielding and a fiber-optic network with the lowest bitrate I have ever
heard of, namely 9600 baud.
Less spectacular electromagnetic environments can often be handled by
software through use of error detection and correction codes.
That said, it is important to remember that although error detection and
correction codes can reduce failure rates, they normally cannot reduce
them all the way down to zero, which can present yet another obstacle
to achieving hard real-time response.
There are also situations where a minimum level of energy
is required, for example, through the power leads of the system and
through the devices through which the system is to communicate with
that portion of the outside world that is to be monitored or controlled.
\QuickQuiz{
But what about battery-powered systems?
They don't require energy flowing into the system as a whole.
}\QuickQuizAnswer{
Sooner or later, the battery must be recharged, which requires
energy to flow into the system.
}\QuickQuizEnd
A number of systems are intended to operate in environments with impressive
levels of shock and vibration, for example, engine control systems.
More strenuous requirements may be found when we move away from
continuous vibrations to intermittent shocks.
For example, during my undergraduate studies, I encountered an old Athena
ballistics computer, which was designed to continue operating normally even if
a hand grenade went off nearby.\footnote{
Decades later, the acceptance tests for some types of computer
systems involve large detonations, and some types of
communications networks must deal with what is delicately
termed ``ballistic jamming.''}
And finally, the ``black boxes'' used in airliners must continue operating
before, during, and after a crash.
Of course, it is possible to make hardware more robust against
environmental shocks and insults.
Any number of ingenious mechanical shock-absorbing devices can reduce the
effects of shock and vibration, multiple layers of shielding can reduce
the effects of low-energy electromagnetic radiation, error-correction
coding can reduce the effects of high-energy radiation, various potting
and sealing techniques can reduce the effect of air quality, and any
number of heating and cooling systems can counter the effects of temperature.
In extreme cases, triple modular redundancy can reduce the probability that
a fault in one part of the system will result in incorrect behavior from
the overall system.
However, all of these methods have one thing in common:
Although they can reduce the probability of failure, they cannot reduce
it to zero.
These environmental challenges are often met via robust hardware, however,
the workload and application constraints in the next two sections are
often handled in software.
\paragraph{Workload Constraints}
\label{sec:advsync:Workload Constraints}
Just as with people, it is often possible to prevent a real-time system
from meeting its deadlines by overloading it.
For example, if the system is being interrupted too frequently, it might
not have sufficient CPU bandwidth to handle its real-time application.
A hardware solution to this problem might limit the rate at which
interrupts were delivered to the system.
Possible software solutions include disabling interrupts for some time if
they are being received too frequently,
resetting the device generating too-frequent interrupts,
or even avoiding interrupts altogether in favor of polling.
Overloading can also degrade response times due to queueing effects,
so it is not unusual for real-time systems to overprovision CPU bandwidth,
so that a running system has (say) 80\pct\ idle time.
This approach also applies to storage and networking devices.
In some cases, separate storage and networking hardware might be reserved
for the sole use of high-priority portions of the real-time application.
In short, it is not unusual for this hardware to be mostly idle, given
that response time is more important than throughput in
real-time systems.
\QuickQuiz{
But given the results from queueing theory, won't low utilization
merely improve the average response time rather than improving
the worst-case response time?
And isn't worst-case response time all that most
real-time systems really care about?
}\QuickQuizAnswer{
Yes, but \ldots
Those queueing-theory results assume infinite ``calling populations'',
which in the Linux kernel might correspond to an infinite number
of tasks.
As of early 2021, no real system supports an infinite number of
tasks, so results assuming infinite calling populations should
be expected to have less-than-infinite applicability.
Other queueing-theory results have \emph{finite}
calling populations, which feature sharply bounded response
times~\cite{Hillier86}.
These results better model real systems, and these models do
predict reductions in both average and worst-case response
times as utilizations decrease.
These results can be extended to model concurrent systems that use
synchronization mechanisms such as
locking~\cite{BjoernBrandenburgPhD,DipankarSarma2004OLSscalability}.
In short, queueing-theory results that accurately describe
real-world real-time systems show that worst-case response
time decreases with decreasing utilization.
}\QuickQuizEnd
Of course, maintaining sufficiently low utilization requires great
discipline throughout the design and implementation.
There is nothing quite like a little feature creep to destroy deadlines.
\paragraph{Application Constraints}
\label{sec:advsync:Application Constraints}
It is easier to provide bounded response time for some operations than
for others.
For example, it is quite common to see response-time specifications for
interrupts and for wake-up operations, but quite rare for (say)
filesystem unmount operations.
One reason for this is that it is quite difficult to bound the amount
of work that a filesystem-unmount operation might need to do, given that
the unmount is required to flush all of that filesystem's in-memory
data to mass storage.
This means that real-time applications must be confined to operations
for which bounded latencies can reasonably be provided.
Other operations must either be pushed out into the non-real-time portions
of the application or forgone entirely.
There might also be constraints on the non-real-time portions of the
application.
For example, is the non-real-time application permitted to use the CPUs
intended for the real-time portion?
Are there time periods during which the real-time portion of the application
is expected to be unusually busy, and if so, is the non-real-time portion
of the application permitted to run at all during those times?
Finally, by what amount is the real-time portion of the application permitted
to degrade the throughput of the non-real-time portion?
\paragraph{Real-World Real-Time Specifications}
\label{sec:advsync:Real-World Real-Time Specifications}
As can be seen from the preceding sections, a real-world real-time
specification needs to include constraints on the environment,
on the workload, and on the application itself.
In addition, for the operations that the real-time portion of the
application is permitted to make use of, there must be constraints
on the hardware and software implementing those operations.
For each such operation, these constraints might include a maximum
response time (and possibly also a minimum response time) and a
probability of meeting that response time.
A probability of 100\pct\ indicates that the corresponding operation
must provide hard real-time service.
In some cases, both the response times and the required probabilities of
meeting them might vary depending on the parameters to the operation in
question.
For example, a network operation over a local LAN would be much more likely
to complete in (say) 100~microseconds than would that same network operation
over a transcontinental WAN\@.
Furthermore, a network operation over a copper or fiber
LAN might have an extremely
high probability of completing without time-consuming retransmissions,
while that same networking operation over a lossy WiFi network might
have a much higher probability of missing tight deadlines.
Similarly, a read from a tightly coupled solid-state disk (SSD) could be
expected to complete much more quickly than that same read to an old-style
USB-connected rotating-rust disk drive.\footnote{
Important safety tip:
Worst-case response times from USB devices can be extremely long.
Real-time systems should therefore take care to place any USB
devices well away from critical paths.}
Some real-time applications pass through different phases of operation.
For example, a real-time system controlling a plywood lathe that peels
a thin sheet of wood (called ``veneer'') from a spinning log must:
\begin{enumerate*}[(1)]
\item Load the log into the lathe,
\item Position the log on the lathe's chucks so as to expose the largest
cylinder contained within that log to the blade,
\item Start spinning the log,
\item Continuously vary the knife's position so as to peel the log into veneer,
\item Remove the remaining core of the log that is too small to peel, and
\item Wait for the next log.
\end{enumerate*}
Each of these six phases of operation might well have its own set of
deadlines and environmental constraints,
for example, one would expect phase~4's deadlines to be much more severe
than those of phase~6, as in milliseconds rather than seconds.
One might therefore expect that low-priority work would be performed in
phase~6 rather than in phase~4.
In any case, careful choices of hardware, drivers, and software
configuration would be required to support phase~4's more severe
requirements.
A key advantage of this phase-by-phase approach is that the \IX{latency}
budgets can be broken down, so that the application's various components
can be developed independently, each with its own latency budget.
Of course, as with any other kind of budget, there will likely be the
occasional conflict as to which component gets which fraction of the
overall budget, and as with any other kind of budget, strong leadership
and a sense of shared goals can help to resolve these conflicts in
a timely fashion.
And, again as with other kinds of technical budget, a strong validation
effort is required in order to ensure proper focus on latencies and to
give early warning of latency problems.
A successful validation effort will almost always include a good test
suite, which might be unsatisfying to the theorists, but has the virtue
of helping to get the job done.
As a point of fact, as of early 2021, most real-world real-time systems
use an acceptance test rather than formal proofs.
However, the widespread use of test suites to validate real-time systems
does have a very real disadvantage, namely that real-time software is
validated only on specific configurations of hardware and software.
Adding additional configurations requires additional costly and
time-consuming testing.
Perhaps the field of formal verification will advance sufficiently to
change this situation, but as of early 2021, rather
large advances are required.
\QuickQuiz{
Formal verification is already quite capable, benefiting from
decades of intensive study.
Are additional advances \emph{really} required, or is this just
a practitioner's excuse to continue to lazily ignore the awesome
power of formal verification?
}\QuickQuizAnswer{
Perhaps this situation is just a theoretician's excuse to avoid
diving into the messy world of real software?
Perhaps more constructively, the following advances are required:
\begin{enumerate}
\item Formal verification needs to handle larger software
artifacts.
The largest verification efforts have been for systems
of only about 10,000 lines of code, and those have been
verifying much simpler properties than real-time latencies.
\item Hardware vendors will need to publish formal timing
guarantees.
This used to be common practice back when hardware was
much simpler, but today's complex hardware results in
excessively complex expressions for worst-case performance.
Unfortunately, energy-efficiency concerns are pushing
vendors in the direction of even more complexity.
\item Timing analysis needs to be integrated into development
methodologies and IDEs.
\end{enumerate}
All that said, there is hope, given recent work formalizing
the memory models of real computer
systems~\cite{JadeAlglave2011ppcmem,Alglave:2013:SVW:2450268.2450306}.
On the other hand, formal verification has just as much trouble
as does testing with the astronomical number of variants of the
Linux kernel that can be constructed from different combinations
of its tens of thousands of Kconfig options.
Sometimes life is hard!
}\QuickQuizEnd
In addition to latency requirements for the real-time portions of the
application, there will likely be performance and scalability requirements
for the non-real-time portions of the application.
These additional requirements reflect the fact that ultimate real-time
latencies are often attained by degrading scalability and average performance.
Software-engineering requirements can also be important, especially for
large applications that must be developed and maintained by large teams.
These requirements often favor increased modularity and fault isolation.
This is a mere outline of the work that would be required to specify
deadlines and environmental constraints for a production real-time system.
It is hoped that this outline clearly demonstrates the inadequacy of
the sound-bite-based approach to real-time computing.
\subsection{Who Needs Real-Time?}
\label{sec:advsync:Who Needs Real-Time?}
It is possible to argue that all computing is in fact real-time computing.
For one example, when you purchase a birthday gift online, you expect
the gift to arrive before the recipient's birthday.
And in fact even turn-of-the-millennium web services observed sub-second
response constraints~\cite{KristofferBohmann2001a}, and requirements have
not eased with the passage of time~\cite{DeCandia:2007:DAH:1323293.1294281}.
It is nevertheless useful to focus on those real-time applications
whose response-time requirements cannot be achieved straightforwardly
by non-real-time systems and applications.
Of course, as hardware costs decrease and bandwidths and memory sizes
increase, the line between real-time and non-real-time will continue
to shift, but such progress is by no means a bad thing.
\QuickQuiz{
Differentiating real-time from non-real-time based on what can
``be achieved straightforwardly by non-real-time systems and
applications'' is a travesty!
There is absolutely no theoretical basis for such a distinction!!!
Can't we do better than that???
}\QuickQuizAnswer{
This distinction is admittedly unsatisfying from a strictly
theoretical perspective.
But on the other hand, it is exactly what the developer needs
in order to decide whether the application can be cheaply and
easily developed using standard non-real-time approaches, or
whether the more difficult and expensive real-time approaches
are required.
In other words, although theory is quite important, for those of
us called upon to complete practical projects, theory supports
practice, never the other way around.
}\QuickQuizEnd
Real-time computing is used in industrial-control applications, ranging from
manufacturing to avionics;
scientific applications, perhaps most spectacularly in the adaptive
optics used by
large Earth-bound telescopes to de-twinkle starlight;
military applications, including the afore-mentioned avionics;
and financial-services applications, where the first computer to recognize
an opportunity is likely to reap most of the profit.
These four areas could be characterized as ``in search of production'',
``in search of life'', ``in search of death'', and ``in search of money''.
Financial-services applications differ subtly from applications in
the other three categories in that money is non-material, meaning that
non-computational latencies are quite small.
In contrast, mechanical delays inherent in the other three categories
provide a very real point of diminishing returns beyond which further
reductions in the application's real-time response provide little or
no benefit.
This means that financial-services applications, along with other
real-time information-processing applications, face an arms race,
where the application with the lowest latencies normally wins.
Although the resulting latency requirements can still be specified
as described in
\pararef{sec:advsync:Real-World Real-Time Specifications},
the unusual nature of these requirements has led some to refer to
financial and information-processing applications as ``low latency''
rather than ``real time''.
Regardless of exactly what we choose to call it, there is substantial
need for real-time
computing~\cite{JeremyWPeters2006NYTDec11,BillInmon2007a}.
\subsection{Who Needs Parallel Real-Time?}
\label{sec:advsync:Who Needs Parallel Real-Time?}
It is less clear who really needs parallel real-time computing, but
the advent of low-cost multicore systems has brought it to the fore
regardless.
Unfortunately, the traditional mathematical basis for real-time
computing assumes single-CPU systems, with a few exceptions that
prove the rule~\cite{BjoernBrandenburgPhD}.
Fortunately, there are a couple of ways of squaring modern computing
hardware to fit the real-time mathematical circle, and a few Linux-kernel
hackers have been encouraging academics to make this
transition~\cite{DanielBristot2019RTtrace,ThomasGleixner2010AcademiaVsReality}.
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{advsync/rt-reflexes}}
\caption{Real-Time Reflexes}
\label{fig:advsync:Real-Time Reflexes}
\end{figure}
One approach is to recognize the fact that many real-time systems
resemble biological nervous systems, with responses ranging from
real-time reflexes to non-real-time strategizing and planning,
as depicted in
\cref{fig:advsync:Real-Time Reflexes}.
The hard real-time reflexes, which read from sensors and control
actuators, run real-time on a single CPU or on special-purpose hardware
such as an FPGA\@.
The non-real-time strategy and planning portion of the application runs
on the remaining CPUs.
Strategy and planning activities might include statistical analysis,
periodic calibration, user interface, supply-chain activities, and
preparation.
For an example of high-compute-load preparation activities, think back
to the veneer-peeling application discussed in
\pararef{sec:advsync:Real-World Real-Time Specifications}.
While one CPU is attending to the high-speed real-time computations
required to peel one log, the other CPUs might be analyzing the size
and shape of the next log in order to determine how to position the
next log so as to obtain the largest cylinder of high-quality wood.
It turns out that many applications have non-real-time and real-time
components~\cite{RobertBerry2008IBMSysJ}, so this approach can
often be used to allow traditional real-time analysis to be combined
with modern multicore hardware.
Another trivial approach is to shut off all but one hardware thread so as
to return to the settled mathematics of uniprocessor real-time
computing.
However, this approach gives up potential cost and energy-efficiency
advantages.
That said, obtaining these advantages requires overcoming the parallel
performance obstacles covered in
\cref{chp:Hardware and its Habits},
and not merely on average, but instead in the worst case.
Implementing parallel real-time systems can therefore be quite a
challenge.
Ways of meeting this challenge are outlined in the following section.
\subsection{Implementing Parallel Real-Time Systems}
\label{sec:advsync:Implementing Parallel Real-Time Systems}
We will look at two major styles of real-time systems, event-driven and
polling.
An event-driven real-time system remains idle much of the time, responding
in real time to events passed up through the operating system to the
application.
Alternatively, the system could instead be running a background
non-real-time workload.
A polling real-time system features a real-time thread that is CPU
bound, running in a tight loop that polls inputs and updates outputs on
each pass.
This tight polling loop often executes entirely in user mode, reading from
and writing to hardware registers that have been mapped into the user-mode
application's address space.
Alternatively, some applications place the polling loop into the kernel,
for example, using loadable kernel modules.
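\Cref{lst:advsync:User-Mode Polling Loop} provides a rough sketch of such
a user-mode polling loop.
It assumes a memory-mappable device reached through a hypothetical
\co{/dev/uio0} special file, a hypothetical two-register device layout,
and a hypothetical \co{compute_output()} control function, and it omits
all error checking.
The sketch pins itself to a dedicated CPU, requests a real-time scheduling
policy, and locks down its memory in order to avoid page faults within
the loop.

\begin{listing}
\begin{VerbatimL}
#define _GNU_SOURCE
#include <fcntl.h>
#include <sched.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

struct dev_regs {                       /* hypothetical register layout */
        volatile uint32_t input;
        volatile uint32_t output;
};

uint32_t compute_output(uint32_t in);   /* hypothetical control law */

int main(void)
{
        cpu_set_t mask;
        struct sched_param sp = { .sched_priority = 80 };
        int fd = open("/dev/uio0", O_RDWR | O_SYNC);
        struct dev_regs *regs = mmap(NULL, sizeof(*regs),
                                     PROT_READ | PROT_WRITE,
                                     MAP_SHARED, fd, 0);

        CPU_ZERO(&mask);
        CPU_SET(3, &mask);                      /* dedicate CPU 3 */
        sched_setaffinity(0, sizeof(mask), &mask);
        sched_setscheduler(0, SCHED_FIFO, &sp); /* real-time priority */
        mlockall(MCL_CURRENT | MCL_FUTURE);     /* avoid page faults */

        for (;;)                                /* CPU-bound polling loop */
                regs->output = compute_output(regs->input);
}
\end{VerbatimL}
\caption{User-Mode Polling Loop (Sketch)}
\label{lst:advsync:User-Mode Polling Loop}
\end{listing}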
\begin{figure}
\centering
\resizebox{3in}{!}{\includegraphics{advsync/rt-regimes}}
\caption{Real-Time Response Regimes}
\label{fig:advsync:Real-Time Response Regimes}
\end{figure}
Regardless of the style chosen, the approach used to implement a real-time
system will depend on the deadlines, for example, as shown in
\cref{fig:advsync:Real-Time Response Regimes}.
Starting from the top of this figure, if you can live with response times in
excess of one second, you might well be able to use scripting languages
to implement your real-time application---and scripting languages are
in fact used surprisingly often, not that I necessarily recommend this
practice.
If the required latencies exceed several tens of milliseconds,
old 2.4 versions of the Linux kernel can be used, not that I necessarily
recommend this practice, either.
Special real-time Java implementations can provide real-time response
latencies of a few milliseconds, even when the garbage collector is
used.
The Linux 2.6.x and 3.x kernels can provide real-time latencies of
a few hundred microseconds if painstakingly configured, tuned, and run
on real-time-friendly hardware.
Special real-time Java implementations can provide real-time latencies
below 100 microseconds if use of the garbage collector is carefully avoided.
(But note that avoiding the garbage collector means also avoiding
Java's large standard libraries, thus also avoiding Java's productivity
advantages.)
The Linux 4.x and 5.x kernels can provide sub-hundred-microsecond
latencies, but with all the same caveats as for the 2.6.x and 3.x kernels.
A Linux kernel incorporating the \rt\ patchset can provide latencies
well below 20 microseconds, and specialty real-time operating systems (RTOSes)
running without MMUs can provide sub-ten-microsecond
latencies.
Achieving sub-microsecond latencies typically requires hand-coded assembly
or even special-purpose hardware.
Of course, careful configuration and tuning are required all the way down
the stack.
In particular, if the hardware or firmware fails to provide real-time
latencies, there is nothing that the software can do to make up for the
lost time.
Worse yet, high-performance hardware sometimes sacrifices worst-case behavior
to obtain greater throughput.
In fact, timings from tight loops run with interrupts disabled can
provide the basis for a high-quality random-number
generator~\cite{PeterOkech2009InherentRandomness}.
Furthermore, some firmware does cycle-stealing to carry out various
housekeeping tasks, in some cases attempting to cover its tracks by
reprogramming the victim CPU's hardware clocks.
Of course, cycle stealing is expected behavior in virtualized
environments, but people are nevertheless working towards real-time
response in virtualized
environments~\cite{ThomasGleixner2012KVMrealtime,JanKiszka2014virtRT}.
It is therefore critically important to evaluate your hardware's and
firmware's real-time capabilities.
But given competent real-time hardware and firmware, the next
layer up the stack is the operating system, which is covered in
the next section.
\subsection{Implementing Parallel Real-Time Operating Systems}
\label{sec:advsync:Implementing Parallel Real-Time Operating Systems}
\begin{figure}
\centering
\resizebox{2.2in}{!}{\includegraphics{advsync/Linux-on-RTOS}}
\caption{Linux Ported to RTOS}
\label{fig:advsync:Linux Ported to RTOS}
\end{figure}
There are a number of strategies that may be used to implement a
real-time system.
One approach is to port a general-purpose non-real-time OS on top
of a special purpose real-time operating system (RTOS), as shown in
\cref{fig:advsync:Linux Ported to RTOS}.
The green ``Linux Process'' boxes represent non-real-time processes
running on the Linux kernel, while the yellow ``RTOS Process''
boxes represent real-time processes running on the RTOS\@.
This was a very popular approach before the Linux kernel gained
real-time capabilities, and is still in
use~\cite{Xenomai2014,VictorYodaiken2004a}.
However, this approach requires that the application be split into
one portion that runs on the RTOS and another that runs on Linux.
Although it is possible to make the two environments look similar,
for example, by forwarding POSIX system calls from the RTOS to a
utility thread running on Linux, there are invariably rough edges.
In addition, the RTOS must interface to both the hardware and to
the Linux kernel, thus requiring significant maintenance with
changes in both hardware and kernel.
Furthermore, each such RTOS often has its own system-call interface
and set of system libraries, which can balkanize both ecosystems and
developers.
In fact, these problems seem to be what drove the combination of
RTOSes with Linux, as this approach allowed access to the full real-time
capabilities of the RTOS, while allowing the application's non-real-time
code full access to Linux's open-source ecosystem.
\begin{figure*}
\centering
\IfEbookSize{
\resizebox{.87\onecolumntextwidth}{!}{\includegraphics{advsync/preemption}}
}{
\resizebox{4.4in}{!}{\includegraphics{advsync/preemption}}
}
\caption{Linux-Kernel Real-Time Implementations}
\label{fig:advsync:Linux-Kernel Real-Time Implementations}
\end{figure*}
Although pairing RTOSes with the Linux kernel was a clever and useful
short-term response during the time that the Linux kernel had minimal
real-time capabilities, it also motivated adding real-time capabilities
to the Linux kernel.
Progress towards this goal is shown in
\cref{fig:advsync:Linux-Kernel Real-Time Implementations}.
The upper row shows a diagram of the Linux kernel with preemption disabled,
thus having essentially no real-time capabilities.
The middle row shows a set of diagrams showing the increasing real-time
capabilities of the mainline Linux kernel with preemption enabled.
Finally, the bottom row shows a diagram of the Linux kernel with the
\rt\ patchset applied, maximizing real-time capabilities.
Functionality from the \rt\ patchset is added to mainline,
hence the increasing capabilities of the mainline Linux kernel over time.
Nevertheless, the most demanding real-time applications continue to use
the \rt\ patchset.
The non-preemptible kernel shown at the top of
\cref{fig:advsync:Linux-Kernel Real-Time Implementations}
is built with \co{CONFIG_PREEMPT=n}, so that execution within the Linux
kernel cannot be preempted.
This means that the kernel's real-time response latency is bounded below
by the longest code path in the Linux kernel, which is indeed long.
However, user-mode execution is preemptible, so that one of the
real-time Linux processes shown in the upper right may preempt any of the
non-real-time Linux processes shown in the upper left anytime the
non-real-time process is executing in user mode.
The middle row of
\cref{fig:advsync:Linux-Kernel Real-Time Implementations}
shows three stages (from left to right) in the development of Linux's
preemptible kernels.
In all three stages, most process-level code within the Linux kernel
can be preempted.
This of course greatly improves real-time response latency, but
preemption is still disabled
within RCU read-side critical sections,
spinlock critical sections,
interrupt handlers,
interrupt-disabled code regions, and
preempt-disabled code regions, as indicated by the red boxes in the
left-most diagram in the middle row of the figure.
The advent of preemptible RCU allowed RCU read-side critical sections
to be preempted, as shown in the central diagram,
and the advent of threaded interrupt handlers allowed device-interrupt
handlers to be preempted, as shown in the right-most diagram.
Of course, a great deal of other real-time functionality was added
during this time, however, it cannot be as easily represented on this
diagram.
It will instead be discussed in
\cref{sec:advsync:Event-Driven Real-Time Support}.
The bottom row of
\cref{fig:advsync:Linux-Kernel Real-Time Implementations}
shows the \rt\ patchset, which features threaded (and thus preemptible)
interrupt handlers for many devices, which also allows the corresponding
``interrupt-disabled'' regions of these drivers to be preempted.
These drivers instead use locking to coordinate the process-level
portions of each driver with its threaded interrupt handlers.
Finally, in some cases, disabling of preemption is replaced by
disabling of migration.
These measures result in excellent response times in many systems running
the \rt\ patchset~\cite{Reghenzani:2019:RLK:3309872.3297714,DanielBristot2019RTtrace}.
\begin{figure}
\centering
\resizebox{2.5in}{!}{\includegraphics{advsync/nohzfull}}
\caption{CPU Isolation}
\label{fig:advsync:CPU Isolation}
\end{figure}
A final approach is simply to get everything out of the way of the
real-time process, clearing all other processing off of any CPUs that
this process needs, as shown in \cref{fig:advsync:CPU Isolation}.
This was implemented in the 3.10 Linux kernel via the \co{CONFIG_NO_HZ_FULL}
Kconfig parameter~\cite{JonCorbet2013NO-HZ-FULL,FredericWeisbecker2013nohz}.
It is important to note that this approach requires at least one
\emph{housekeeping CPU} to do background processing, for example running
kernel daemons.
However, when there is only one runnable task on a given non-housekeeping CPU,
scheduling-clock interrupts are shut off on that CPU, removing an important
source of interference and \emph{OS jitter}.
With a few exceptions, the kernel does not force other processing off of the
non-housekeeping CPUs, but instead simply provides better performance
when only one runnable task is present on a given CPU\@.
Any number of userspace tools may be used to force a given CPU to have
no more than one runnable task.
If configured properly, a non-trivial undertaking, \co{CONFIG_NO_HZ_FULL}
offers real-time threads levels of performance that come close to those of
bare-metal systems~\cite{AbdullahAljuhni2018nohzfull}.
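For example, on a hypothetical eight-CPU system dedicating CPUs~2--7 to
the real-time workload, one plausible starting point is to boot with
kernel parameters along the lines of
\co{nohz_full=2-7 rcu_nocbs=2-7 isolcpus=2-7}
and to then bind the real-time threads to those CPUs with a tool such as
\co{taskset}, leaving CPUs~0 and~1 for housekeeping.
The details vary from one kernel version to the next.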
\ppl{Fr\'{e}d\'{e}ric}{Weisbecker} produced a practical guide to
\co{CONFIG_NO_HZ_FULL} configuration~\cite{
FredericWeisbecker2022nohzIntro,
FredericWeisbecker2022nohzFullDynticksInternals,
FredericWeisbecker2022nohzfull,
FredericWeisbecker2022HousekeepingTradeoffs,
FredericWeisbecker2022practicalExample,
FredericWeisbecker2022nohzfullTSC}.
There has of course been much debate over which of these approaches
is best for real-time systems, and this debate has been going on for
quite some
time~\cite{JonCorbet2004RealTimeLinuxPart1,JonCorbet2004RealTimeLinuxPart2}.
As usual, the answer seems to be ``It depends,'' as discussed in the
following sections.
\Cref{sec:advsync:Event-Driven Real-Time Support}
considers event-driven real-time systems, and
\cref{sec:advsync:Polling-Loop Real-Time Support}
considers real-time systems that use a CPU-bound polling loop.
\subsubsection{Event-Driven Real-Time Support}
\label{sec:advsync:Event-Driven Real-Time Support}
The operating-system support required for event-driven real-time
applications is quite extensive, however, this section will focus
on only a few items, namely
timers,
threaded interrupts,
priority inheritance,
preemptible RCU,
and
preemptible spinlocks.
\paragraph{Timers} are clearly critically important for real-time
operations.
After all, if you cannot specify that something be done at a specific
time, how are you going to respond by that time?
Even in non-real-time systems, large numbers of timers are generated,
so they must be handled extremely efficiently.
Example uses include retransmit timers for TCP connections (which are
almost always canceled before they have a chance to fire),\footnote{
At least assuming reasonably low packet-loss rates!}
timed delays (as in \co{sleep(1)}, which are rarely canceled),
and timeouts for the \co{poll()} system call (which are often
canceled before they have a chance to fire).
A good data structure for such timers would therefore be a priority queue
whose addition and deletion primitives were fast and $\O{1}$ in the number
of timers posted.
The classic data structure for this purpose is the \emph{calendar queue},
which in the Linux kernel is called the \co{timer wheel}.
This age-old data structure is also heavily used in discrete-event
simulation.
The idea is that time is quantized, for example, in the Linux kernel,
the duration of the time quantum is the period of the scheduling-clock
interrupt.
A given time can be represented by an integer, and any attempt to post
a timer at some non-integral time will be rounded to a convenient nearby
integral time quantum.
One straightforward implementation would be to allocate a single array,
indexed by the low-order bits of the time.
This works in theory, but in practice systems create large numbers of
long-duration timeouts (for example, the two-hour keepalive timeouts for TCP
sessions) that are almost always canceled.
These long-duration timeouts cause problems for small arrays because
much time is wasted skipping timeouts that have not yet expired.
On the other hand, an array that is large enough to gracefully accommodate
a large number of long-duration timeouts would consume too much memory,
especially given that performance and scalability concerns require one
such array for each and every CPU\@.
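\Cref{lst:advsync:Single-Array Timer Wheel} shows a rough sketch of such
a single-array timer wheel, using a deliberately tiny 16-bucket array and
omitting locking and cancellation.
Insertion is $\O{1}$, but each tick must skip over any not-yet-expired
long-duration timeouts that happen to share the current bucket, which is
exactly the wasted work called out above.

\begin{listing}
\begin{VerbatimL}
#define WHEEL_BITS 4
#define WHEEL_SIZE (1 << WHEEL_BITS)
#define WHEEL_MASK (WHEEL_SIZE - 1)

struct timer {
        struct timer *next;
        unsigned long expires;          /* absolute time in ticks */
        void (*func)(struct timer *);
};

static struct timer *wheel[WHEEL_SIZE];
static unsigned long now;

void timer_add(struct timer *t)         /* O(1) insertion */
{
        int idx = t->expires & WHEEL_MASK;

        t->next = wheel[idx];
        wheel[idx] = t;
}

void timer_tick(void)                   /* once per time quantum */
{
        struct timer **pp;
        struct timer *t;

        now++;
        pp = &wheel[now & WHEEL_MASK];
        while ((t = *pp) != NULL) {
                if (t->expires == now) {
                        *pp = t->next;  /* expired: dequeue and invoke */
                        t->func(t);
                } else {
                        pp = &t->next;  /* not yet: skip (wasted work) */
                }
        }
}
\end{VerbatimL}
\caption{Single-Array Timer Wheel (Sketch)}
\label{lst:advsync:Single-Array Timer Wheel}
\end{listing}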
\begin{figure}
\centering
\resizebox{2.0in}{!}{\includegraphics{advsync/timerwheel}}
\caption{Timer Wheel}
\label{fig:advsync:Timer Wheel}
\end{figure}
A common approach for resolving this conflict is to provide multiple
arrays in a hierarchy.
At the lowest level of this hierarchy, each array element represents
one unit of time.
At the second level, each array element represents $N$ units of time,
where $N$ is the number of elements in each array.
At the third level, each array element represents $N^2$ units of time,
and so on up the hierarchy.
This approach allows the individual arrays to be indexed by different
bits, as illustrated by
\cref{fig:advsync:Timer Wheel}
for an unrealistically small eight-bit clock.
Here, each array has 16 elements, so the low-order four bits of the time
(currently \co{0xf}) index the low-order (rightmost) array, and the
next four bits (currently \co{0x1}) index the next level up.
Thus, we have two arrays each with 16 elements, for a total of 32 elements,
which, taken together, is much smaller than the 256-element array that
would be required for a single array.
This approach works extremely well for throughput-based systems.
Each timer operation is $\O{1}$ with small constant, and each timer
element is touched at most $m+1$ times, where $m$ is the number of
levels.
\begin{figure}
\centering
\resizebox{3.0in}{!}{\includegraphics{cartoons/1kHz}}
\caption{Timer Wheel at 1\,kHz}
\ContributedBy{fig:advsync:Timer Wheel at 1kHz}{Melissa Broussard}
\end{figure}
\begin{figure}
\centering
\resizebox{3.0in}{!}{\includegraphics{cartoons/100kHz}}
\caption{Timer Wheel at 100\,kHz}
\ContributedBy{fig:advsync:Timer Wheel at 100kHz}{Melissa Broussard}
\end{figure}
Unfortunately, timer wheels do not work well for real-time systems, and for
two reasons.
The first reason is that there is a harsh tradeoff between timer
accuracy and timer overhead, which is fancifully illustrated by
\cref{fig:advsync:Timer Wheel at 1kHz,fig:advsync:Timer Wheel at 100kHz}.
In \cref{fig:advsync:Timer Wheel at 1kHz},
timer processing happens only once per millisecond, which keeps overhead
acceptably low for many (but not all!\@) workloads, but which also means
that timeouts cannot be set for finer than one-millisecond granularities.
On the other hand, \cref{fig:advsync:Timer Wheel at 100kHz}
shows timer processing taking place every ten microseconds, which
provides acceptably fine timer granularity for most (but not all!\@)
workloads, but which processes timers so frequently that the system
might well not have time to do anything else.
The second reason is the need to cascade timers from higher levels to
lower levels.
Referring back to \cref{fig:advsync:Timer Wheel},
we can see that any timers enqueued on element \co{1x} in the upper
(leftmost) array must be cascaded down to the lower (rightmost)
array so that they may be invoked when their time arrives.
Unfortunately, there could be a large number of timeouts
waiting to be cascaded, especially for timer wheels with larger numbers
of levels.
The power of statistics causes this cascading to be a non-problem for
throughput-oriented systems, but cascading can result in problematic
degradations of latency in real-time systems.
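\Cref{lst:advsync:Timer-Wheel Cascade} extends the single-array sketch in
\cref{lst:advsync:Single-Array Timer Wheel} with an upper-level array and
shows the cascade step, which must run each time the low-order bits of
the clock wrap around.
Every timer in the selected upper-level bucket must be re-filed into the
lower-level array, so the latency of this step grows with the number of
timers awaiting cascading.

\begin{listing}
\begin{VerbatimL}
static struct timer *upper[WHEEL_SIZE]; /* each bucket spans 16 ticks */

/* Called from timer_tick() whenever (now & WHEEL_MASK) == 0. */
static void cascade(void)
{
        int idx = (now >> WHEEL_BITS) & WHEEL_MASK;
        struct timer *t = upper[idx];

        upper[idx] = NULL;
        while (t != NULL) {             /* re-file every timer in the bucket */
                struct timer *next = t->next;
                int lo = t->expires & WHEEL_MASK;

                t->next = wheel[lo];
                wheel[lo] = t;
                t = next;
        }
}
\end{VerbatimL}
\caption{Timer-Wheel Cascade (Sketch)}
\label{lst:advsync:Timer-Wheel Cascade}
\end{listing}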
\IfTwoColumn{
\begin{figure*}
\centering
\resizebox{1.4\twocolumnwidth}{!}{\includegraphics{advsync/irq}}
\caption{Non-Threaded Interrupt Handler}
\label{fig:advsync:Non-Threaded Interrupt Handler}
\end{figure*}
}{}
\IfTwoColumn{
\begin{figure*}
\centering
\resizebox{1.4\twocolumnwidth}{!}{\includegraphics{advsync/threaded-irq}}
\caption{Threaded Interrupt Handler}
\label{fig:advsync:Threaded Interrupt Handler}
\end{figure*}
}{}
Of course, real-time systems could simply choose a different data
structure, for example, some form of heap or tree, giving up
$\O{1}$ bounds on insertion and deletion operations to gain $\O{\log n}$
limits on data-structure-maintenance operations.
This can be a good choice for special-purpose RTOSes, but is inefficient
for general-purpose systems such as Linux, which routinely support
extremely large numbers of timers.
The solution chosen for the Linux kernel's \rt\ patchset is to differentiate
between timers that schedule later activity and timeouts that schedule
error handling for low-probability errors such as TCP packet losses.
One key observation is that error handling is normally not particularly
time-critical, so that a timer wheel's millisecond-level granularity
is good and sufficient.
Another key observation is that error-handling timeouts are normally
canceled very early, often before they can be cascaded.
In addition, systems commonly have many more error-handling timeouts
than they do timer events, so that an $\O{\log n}$ data structure should
provide acceptable performance for timer events.
However, it is possible to do better, namely by simply refusing to
cascade timers.
Instead of cascading, the timers that would otherwise have been cascaded
all the way down the calendar queue are handled in place.
This does result in up to a few percent error for the time duration,
but the few situations where this is a problem can instead use tree-based
high-resolution timers (hrtimers).
In short, the Linux kernel's \rt\ patchset uses timer wheels for
error-handling timeouts and a tree for timer events, providing each
category the required quality of service.
\IfTwoColumn{}{
\begin{figure}
\centering
\IfEbookSize{
\resizebox{\onecolumntextwidth}{!}{\includegraphics{advsync/irq}}
}{
\resizebox{1.4\twocolumnwidth}{!}{\includegraphics{advsync/irq}}
}
\caption{Non-Threaded Interrupt Handler}
\label{fig:advsync:Non-Threaded Interrupt Handler}
\end{figure}
}
\paragraph{Threaded interrupts}
are used to address a significant source of degraded real-time latencies,
namely long-running interrupt handlers,
as shown in \cref{fig:advsync:Non-Threaded Interrupt Handler}.
These latencies can be especially problematic for devices that can
deliver a large number of events with a single interrupt, which means
that the interrupt handler will run for an extended period of time
processing all of these events.
Worse yet are devices that can deliver new events to a still-running
interrupt handler, as such an interrupt handler might well run
indefinitely, thus indefinitely degrading real-time latencies.
\IfTwoColumn{}{
\begin{figure}
\centering
\IfEbookSize{
\resizebox{\onecolumntextwidth}{!}{\includegraphics{advsync/threaded-irq}}
}{
\resizebox{1.4\twocolumnwidth}{!}{\includegraphics{advsync/threaded-irq}}
}
\caption{Threaded Interrupt Handler}
\label{fig:advsync:Threaded Interrupt Handler}
\end{figure}
}
One way of addressing this problem is the use of threaded interrupts shown in
\cref{fig:advsync:Threaded Interrupt Handler}.
Interrupt handlers run in the context of a preemptible \IXacr{irq} thread,
which runs at a configurable priority.
The device interrupt handler then runs for only a short time, just
long enough to make the \IRQ\ thread aware of the new event.
As shown in the figure, threaded interrupts can greatly improve
real-time latencies, in part because interrupt handlers running in
the context of the \IRQ\ thread may be preempted by high-priority real-time
threads.
However, there is no such thing as a free lunch, and there are downsides
to threaded interrupts.
One downside is increased interrupt latency.
Instead of immediately running the interrupt handler, the handler's execution
is deferred until the \IRQ\ thread gets around to running it.
Of course, this is not a problem unless the device generating the interrupt
is on the real-time application's critical path.
Another downside is that poorly written high-priority real-time code
might starve the interrupt handler, for example, preventing networking
code from running, in turn making it very difficult to debug the problem.
Developers must therefore take great care when writing high-priority
real-time code.
This has been dubbed the \emph{Spiderman principle}:
With great power comes great responsibility.
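For reference,
\cref{lst:advsync:Threaded Interrupt Handler Registration}
sketches how a Linux-kernel driver can request a threaded interrupt
handler using \co{request_threaded_irq()}.
The \co{my_dev} structure and the \co{my_dev_*()} helper functions are
hypothetical stand-ins for device-specific code, and the
\co{IRQF_ONESHOT} flag keeps the interrupt masked until the handler
thread completes.

\begin{listing}
\begin{VerbatimL}
#include <linux/interrupt.h>

static irqreturn_t my_quick_check(int irq, void *dev_id)
{
        struct my_dev *dev = dev_id;

        if (!my_dev_has_event(dev))     /* hypothetical device check */
                return IRQ_NONE;        /* not our interrupt */
        return IRQ_WAKE_THREAD;         /* defer real work to IRQ thread */
}

static irqreturn_t my_thread_fn(int irq, void *dev_id)
{
        struct my_dev *dev = dev_id;

        my_dev_process_events(dev);     /* possibly lengthy, but preemptible */
        return IRQ_HANDLED;
}

static int my_dev_request_irq(struct my_dev *dev)
{
        return request_threaded_irq(dev->irq, my_quick_check, my_thread_fn,
                                    IRQF_ONESHOT, "my_dev", dev);
}
\end{VerbatimL}
\caption{Threaded Interrupt Handler Registration (Sketch)}
\label{lst:advsync:Threaded Interrupt Handler Registration}
\end{listing}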
\paragraph{Priority inheritance} is used to handle priority inversion,
which can be caused by, among other things, locks acquired by
preemptible interrupt handlers~\cite{LuiSha1990PriorityInheritance}.
Suppose that a low-priority thread holds a lock, but is preempted by
a group of medium-priority threads, at least one such thread per CPU\@.
If an interrupt occurs, a high-priority \IRQ\ thread will preempt one
of the medium-priority threads, but only until it decides to acquire
the lock held by the low-priority thread.
Unfortunately, the low-priority thread cannot release the lock until
it starts running, which the medium-priority threads prevent it from
doing.
So the high-priority \IRQ\ thread cannot acquire the lock until after one
of the medium-priority threads releases its CPU\@.
In short, the medium-priority threads are indirectly blocking the
high-priority \IRQ\ threads, a classic case of priority inversion.
Note that this priority inversion could not happen with non-threaded
interrupts because the low-priority thread would have to disable interrupts
while holding the lock, which would prevent the medium-priority
threads from preempting it.
In the priority-inheritance solution, the high-priority thread attempting
to acquire the lock donates its priority to the low-priority thread holding
the lock until such time as the lock is released, thus preventing long-term
priority inversion.
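In user-level code, priority inheritance must typically be requested
explicitly.
For example, the sketch in
\cref{lst:advsync:Priority-Inheritance Mutex Initialization}
initializes a POSIX mutex with the \co{PTHREAD_PRIO_INHERIT} protocol,
so that a low-priority lock holder is boosted to the priority of the
highest-priority waiter (error checking omitted).

\begin{listing}
\begin{VerbatimL}
#include <pthread.h>

pthread_mutex_t my_lock;

void my_lock_init(void)
{
        pthread_mutexattr_t attr;

        pthread_mutexattr_init(&attr);
        pthread_mutexattr_setprotocol(&attr, PTHREAD_PRIO_INHERIT);
        pthread_mutex_init(&my_lock, &attr);
        pthread_mutexattr_destroy(&attr);
}
\end{VerbatimL}
\caption{Priority-Inheritance Mutex Initialization (Sketch)}
\label{lst:advsync:Priority-Inheritance Mutex Initialization}
\end{listing}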
\begin{figure}
\centering
\resizebox{3.4in}{!}{\includegraphics{cartoons/Priority_Boost_2}}
\caption{Priority Inversion and User Input}
\ContributedBy{fig:advsync:Priority Inversion and User Input}{Melissa Broussard}
\end{figure}
Of course, priority inheritance does have its limitations.
For example, if you can design your application to avoid priority
inversion entirely, you will likely obtain somewhat better
latencies~\cite{VictorYodaiken2004a}.
This should be no surprise, given that priority inheritance adds
a pair of context switches to the worst-case latency.
That said, priority inheritance can convert indefinite postponement
into a limited increase in latency, and the software-engineering
benefits of priority inheritance may outweigh its latency costs in
many applications.
Another limitation is that it addresses only lock-based priority
inversions within the context of a given operating system.
One priority-inversion scenario that it cannot address is a high-priority
thread waiting on a network socket for a message that is to be written
by a low-priority process that is preempted by a set of CPU-bound
medium-priority processes.
In addition, a potential disadvantage of applying priority inheritance
to user input is fancifully depicted in
\cref{fig:advsync:Priority Inversion and User Input}.
A final limitation involves reader-writer locking.
Suppose that we have a very large number of low-priority threads, perhaps
even thousands of them, each
of which read-holds a particular reader-writer lock.
Suppose that all of these threads are preempted by a set of medium-priority
threads, with at least one medium-priority thread per CPU\@.
Finally, suppose that a high-priority thread awakens and attempts to
write-acquire this same reader-writer lock.
No matter how vigorously we boost the priority of the threads read-holding
this lock, it could well be a good long time before the high-priority
thread can complete its write-acquisition.
There are a number of possible solutions to this reader-writer lock
priority-inversion conundrum:
\begin{enumerate}
\item Only allow one read-acquisition of a given reader-writer lock
at a time.
(This is the approach traditionally taken by the Linux
kernel's \rt\ patchset.)
\item Only allow $N$ read-acquisitions of a given reader-writer lock
at a time, where $N$ is the number of CPUs.
\item Only allow $N$ read-acquisitions of a given reader-writer lock
at a time, where $N$ is a number specified somehow by the
developer.
\item Prohibit high-priority threads from write-acquiring reader-writer
locks that are ever read-acquired by threads running at lower
priorities.
(This is a variant of the \emph{priority ceiling}
protocol~\cite{LuiSha1990PriorityInheritance}.)
\end{enumerate}
\QuickQuiz{
But if you only allow one reader at a time to read-acquire
a reader-writer lock, isn't that the same as an exclusive
lock???
}\QuickQuizAnswer{
Indeed it is, other than the API\@.
And the API is important because it allows the Linux kernel
to offer real-time capabilities without having the \rt\ patchset
grow to ridiculous sizes.
However, this approach clearly and severely limits read-side
scalability.
The Linux kernel's \rt\ patchset was long able to live with this
limitation for several reasons:
\begin{enumerate*}[(1)]
\item Real-time systems have traditionally been relatively small,
\item Real-time systems have generally focused on process control,
thus being unaffected by scalability limitations in the
I/O subsystems, and
\item Many of the Linux kernel's reader-writer locks have been
converted to RCU\@.
\end{enumerate*}
However, the day came when it was absolutely necessary to
permit concurrent readers, as described in the text following
this quiz.
}\QuickQuizEnd
The no-concurrent-readers restriction eventually became intolerable, so
the \rt\ developers looked more carefully at how the Linux kernel uses
reader-writer spinlocks.
They learned that time-critical code rarely uses those parts of the
kernel that write-acquire reader-writer locks, so that the prospect
of writer \IX{starvation} was not a show-stopper.
They therefore constructed a real-time reader-writer lock in which
write-side acquisitions use priority inheritance among each other,
but where read-side acquisitions take absolute priority over
write-side acquisitions.
This approach appears to be working well in practice, and is another
lesson in the importance of clearly understanding what your users
really need.
One interesting detail of this implementation is that both the
\co{rt_read_lock()} and the \co{rt_write_lock()} functions enter an RCU
read-side critical section and both the \co{rt_read_unlock()} and the
\co{rt_write_unlock()} functions exit that critical section.
This is necessary because non-realtime kernels' reader-writer locking
functions disable preemption across their critical sections, and
there really are reader-writer locking use cases that rely on the fact
that \co{synchronize_rcu()} will therefore wait for all pre-existing
reader-writer-lock critical sections to complete.
Let this be a lesson to you:
Understanding what your users really need is critically important to
correct operation, not just to performance.
Not only that, but what your users really need changes over time.
This has the side-effect that all of a \rt\ kernel's reader-writer locking
critical sections are subject to RCU priority boosting.
This provides at least a partial solution to the problem of reader-writer
lock readers being preempted for extended periods of time.
It is also possible to avoid reader-writer lock priority inversion by
converting the reader-writer lock to RCU, as briefly discussed in the
next section.
\paragraph{Preemptible RCU}
can sometimes be used as a replacement for reader-writer
locking~\cite{PaulEMcKenney2007WhatIsRCUFundamentally,PaulMcKenney2012RCUUsage,PaulEMcKenney2024RCUAPI},
as was discussed in \cref{sec:defer:Read-Copy Update (RCU)}.
Where it can be used, it permits readers and updaters to run concurrently,
which prevents low-priority readers from inflicting any sort of
priority-inversion scenario on high-priority updaters.
However, for this to be useful, it is necessary to be able to preempt
long-running RCU read-side critical
sections~\cite{DinakarGuniguntala2008IBMSysJ}.
Otherwise, long RCU read-side critical sections would result in
excessive real-time latencies.
\begin{listing}
\begin{fcvlabel}[ln:advsync:Preemptible Linux-Kernel RCU]
\begin{VerbatimL}[commandchars=\\\[\]]
void __rcu_read_lock(void) \lnlbl[lock:b]
{
current->rcu_read_lock_nesting++; \lnlbl[lock:inc]
barrier(); \lnlbl[lock:bar]
} \lnlbl[lock:e]
void __rcu_read_unlock(void) \lnlbl[unl:b]
{
barrier(); \lnlbl[unl:bar1]
	if (!--current->rcu_read_lock_nesting) { \lnlbl[unl:decchk]
		barrier(); \lnlbl[unl:bar2]
		if (READ_ONCE(current->rcu_read_unlock_special.s)) { \lnlbl[unl:chks]
			rcu_read_unlock_special(current); \lnlbl[unl:unls]
		} \lnlbl[unl:if:e]
	}
} \lnlbl[unl:e]
\end{VerbatimL}
\end{fcvlabel}
\caption{Preemptible Linux-Kernel RCU}
\label{lst:advsync:Preemptible Linux-Kernel RCU}
\end{listing}
A preemptible RCU implementation was therefore added to the Linux kernel.
This implementation avoids the need to individually track the state of
each and every task in the kernel by keeping lists of tasks that have
been preempted within their current RCU read-side critical sections.
A \IX{grace period} is permitted to end:
\begin{enumerate*}[(1)]
\item Once all CPUs have completed any RCU read-side critical sections
that were in effect before the start of the current grace period and
\item Once all tasks that were preempted
while in one of those pre-existing critical sections have removed
themselves from their lists.
\end{enumerate*}
A simplified version of this implementation is shown in
\cref{lst:advsync:Preemptible Linux-Kernel RCU}.
\begin{fcvref}[ln:advsync:Preemptible Linux-Kernel RCU]
The \co{__rcu_read_lock()} function spans \clnrefrange{lock:b}{lock:e} and
the \co{__rcu_read_unlock()} function spans \clnrefrange{unl:b}{unl:e}.
\end{fcvref}
\begin{fcvref}[ln:advsync:Preemptible Linux-Kernel RCU:lock]
\Clnref{inc} of \co{__rcu_read_lock()} increments a per-task count of the
number of nested \co{rcu_read_lock()} calls, and
\clnref{bar} prevents the compiler from reordering the subsequent code in the
RCU read-side critical section to precede the \co{rcu_read_lock()}.
\end{fcvref}
\begin{fcvref}[ln:advsync:Preemptible Linux-Kernel RCU:unl]
\Clnref{bar1} of \co{__rcu_read_unlock()} prevents the compiler from
reordering the code in the critical section with the remainder of
this function.
\Clnref{decchk} decrements the nesting count and checks to see if it
has become zero, in other words, if this corresponds to the outermost
\co{rcu_read_unlock()} of a nested set.
If so, \clnref{bar2} prevents the compiler from reordering this nesting
update with \clnref{chks}'s check for special handling.
If special handling is required, then the call to
\co{rcu_read_unlock_special()} on \clnref{unls} carries it out.
There are several types of special handling that can be required, but
we will focus on that required when the RCU read-side critical section
has been preempted.
In this case, the task must remove itself from the list that it was
added to when it was first preempted within its
RCU read-side critical section.
However, it is important to note that these lists are protected by locks,
which means that \co{rcu_read_unlock()} is no longer lockless.
Fortunately, the highest-priority threads will not be preempted, and therefore,
for those highest-priority threads, \co{rcu_read_unlock()} will never
attempt to acquire any locks.
In addition, if implemented carefully, locking can be used to synchronize
real-time software~\cite{BjoernBrandenburgPhD,DipankarSarma2004OLSscalability}.
\end{fcvref}
\QuickQuiz{
\begin{fcvref}[ln:advsync:Preemptible Linux-Kernel RCU:unl]
Suppose that preemption occurs just after the load from
	\co{current->rcu_read_unlock_special.s} on \clnref{chks} of
\cref{lst:advsync:Preemptible Linux-Kernel RCU}.
Mightn't that result in the task failing to invoke
\co{rcu_read_unlock_special()}, thus failing to remove itself
from the list of tasks blocking the current grace period,
in turn causing that grace period to extend indefinitely?
\end{fcvref}
}\QuickQuizAnswer{
That is a real problem, and it is solved in RCU's scheduler hook.
	If that scheduler hook sees that the outgoing task's
	\co{rcu_read_lock_nesting} value is negative, it invokes
\co{rcu_read_unlock_special()} if needed before allowing
the context switch to complete.
}\QuickQuizEnd
Another important real-time feature of RCU, whether preemptible or
not, is the ability to offload RCU callback execution to a kernel
thread.
To use this, your kernel must be built with \co{CONFIG_RCU_NOCB_CPU=y}
and booted with the \co{rcu_nocbs=} kernel boot parameter specifying
which CPUs are to be offloaded.
Alternatively, any CPU specified by the \co{nohz_full=} kernel boot parameter
described in
\cref{sec:advsync:Polling-Loop Real-Time Support}
will also have its RCU callbacks offloaded.
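For example, a kernel built with \co{CONFIG_RCU_NOCB_CPU=y} might be booted
with something like the following purely illustrative parameter, which
offloads RCU callback invocation for CPUs~2--7 to kthreads that can in
turn be confined to the housekeeping CPUs using the techniques described in
\cref{sec:advsync:Polling-Loop Real-Time Support}:
\begin{VerbatimU}
rcu_nocbs=2-7
\end{VerbatimU}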
In short, this preemptible RCU implementation enables real-time response for
read-mostly data structures without the delays inherent to priority
boosting of large numbers of readers, and also without delays due to
callback invocation.
\paragraph{Preemptible spinlocks}
are an important part of the \rt\ patchset due to the long-duration
spinlock-based critical sections in the Linux kernel.
This functionality has not yet reached mainline:
Although preemptible spinlocks amount to a conceptually simple substitution
of sleeplocks for spinlocks, they have proven relatively controversial.
In addition, the real-time functionality that is already in the mainline
Linux kernel suffices for a great many use cases, which slowed the \rt\
patchset's development rate in the early
2010s~\cite{JakeEdge2013Future-rtLinux,JakeEdge2014Future-rtLinux}.
However, preemptible spinlocks are absolutely necessary to the task of
achieving real-time latencies down in the tens of microseconds.
Fortunately, the Linux Foundation organized an effort to fund moving the
remaining code from the \rt\ patchset to mainline.
\paragraph{Per-CPU variables}\ are used heavily in the Linux kernel
for performance reasons.
Unfortunately for real-time applications, many use cases for per-CPU
variables require coordinated update of multiple such variables,
which is normally provided by disabling preemption, which in turn
degrades real-time latencies.
Real-time applications clearly need some other way of coordinating
per-CPU variable updates.
One alternative is to supply per-CPU spinlocks, which as noted above
are actually sleeplocks, so that their critical sections can be
preempted and so that priority inheritance is provided.
In this approach, code updating groups of per-CPU variables must
acquire the current CPU's spinlock, carry out the update, then
release whichever lock was acquired, keeping in mind that a preemption
might have resulted in a migration to some other CPU\@.
However, this approach introduces both overhead and \IXpl{deadlock}.
Another alternative, which is used in the \rt\ patchset as of early 2021,
is to convert preemption disabling to migration disabling.
This ensures that a given kernel thread remains on its CPU through
the duration of the per-CPU-variable update, but could also allow some
other kernel thread to intersperse its own update of those same variables,
courtesy of preemption.
There are cases such as statistics gathering where this is not a problem.
In the surprisingly rare case where such mid-update preemption is a problem,
the use case at hand must properly synchronize the updates, perhaps through
a set of per-CPU locks specific to that use case.
Although introducing locks again introduces the possibility of deadlock,
the per-use-case nature of these locks makes any such deadlocks easier
to manage and avoid.
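As a rough sketch of this approach, the following invented example uses the
\co{migrate_disable()} and \co{migrate_enable()} kernel primitives to update
a pair of hypothetical per-CPU statistics counters without disabling
preemption.
A preempting thread might intersperse its own updates, which is harmless
for simple statistics gathering:
\begin{VerbatimU}
struct my_stats {
	unsigned long events;
	unsigned long bytes;
};
DEFINE_PER_CPU(struct my_stats, my_stats_pcpu);

static void my_stats_add(unsigned long nbytes)
{
	struct my_stats *sp;

	migrate_disable(); /* Stay on this CPU, but remain preemptible. */
	sp = this_cpu_ptr(&my_stats_pcpu);
	sp->events++;
	sp->bytes += nbytes;
	migrate_enable();
}
\end{VerbatimU}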
\paragraph{Closing event-driven remarks.}
There are of course any number of other Linux-kernel components that are
critically important to achieving world-class real-time latencies,
for example, deadline
scheduling~\cite{DanielBristot2018deadlinesched-1,DanielBristot2018deadlinesched-2}.
However, those listed in this section give a good feeling for the workings
of the Linux kernel augmented by the \rt\ patchset.
\subsubsection{Polling-Loop Real-Time Support}
\label{sec:advsync:Polling-Loop Real-Time Support}
At first glance, use of a polling loop might seem to avoid all possible
operating-system interference problems.
After all, if a given CPU never enters the kernel, the kernel is
completely out of the picture.
And the traditional approach to keeping the kernel out of the way is
simply not to have a kernel, and many real-time applications do
indeed run on bare metal, particularly those running on eight-bit
microcontrollers.
One might hope to get bare-metal performance on a modern operating-system
kernel simply by running a single CPU-bound user-mode thread on a
given CPU, avoiding all causes of interference.
Although the reality is of course more complex, it is becoming
possible to do just that,
courtesy of the \co{NO_HZ_FULL} implementation led by
Frederic Weisbecker~\cite{JonCorbet2013NO-HZ-FULL,FredericWeisbecker2013nohz}
that was accepted into version 3.10 of the Linux kernel.
Nevertheless, considerable care is required to properly set up such
an environment, as it is necessary to control a number of possible
sources of OS jitter.
The discussion below covers the control of several sources of OS
jitter, including device interrupts, kernel threads and daemons,
scheduler real-time throttling (this is a feature, not a bug!),
timers, non-real-time device drivers, in-kernel global synchronization,
scheduling-clock interrupts, page faults, and finally, non-real-time
hardware and firmware.
Interrupts are an excellent source of large amounts of OS jitter.
Unfortunately, in most cases interrupts are absolutely required in order
for the system to communicate with the outside world.
One way of resolving this conflict between OS jitter and maintaining
contact with the outside world is to reserve a small number of
housekeeping CPUs, and to force all interrupts to these CPUs.
The \path{Documentation/IRQ-affinity.txt} file in the Linux source tree
describes how to direct device interrupts to specified CPUs,
which as of early 2021 involves something like the following:
\begin{VerbatimU}
$ echo 0f > /proc/irq/44/smp_affinity
\end{VerbatimU}
This command would confine interrupt \#44 to CPUs~0--3.
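On more recent kernels, the \path{smp_affinity_list} file accepts a CPU list
rather than a hexadecimal mask, so that the following command (again for
interrupt \#44) should have the same effect:
\begin{VerbatimU}
$ echo 0-3 > /proc/irq/44/smp_affinity_list
\end{VerbatimU}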
Note that scheduling-clock interrupts require special handling, and are
discussed later in this section.
A second source of OS jitter is due to kernel threads and daemons.
Individual kernel threads, such as RCU's grace-period kthreads
(\co{rcu_bh}, \co{rcu_preempt}, and \co{rcu_sched}), may be forced
onto any desired CPUs using the \co{taskset} command, the
\co{sched_setaffinity()} system call, or \co{cgroups}.
Per-CPU kthreads are often more challenging, sometimes constraining
hardware configuration and workload layout.
Preventing OS jitter from these kthreads requires either that certain
types of hardware
not be attached to real-time systems, that all interrupts and I/O
initiation take place on housekeeping CPUs, that special kernel
Kconfig or boot parameters be selected in order to direct work away from
the worker CPUs, or that worker CPUs never enter the kernel.
Specific per-kthread advice may be found in the Linux kernel source
\path{Documentation} directory at
\path{admin-guide/kernel-per-CPU-kthreads.rst}.
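For example, assuming that CPUs~0 and~1 are the housekeeping CPUs, the
following command (with a purely hypothetical kthread PID of 17) would
confine that kthread to those two CPUs:
\begin{VerbatimU}
$ taskset -pc 0,1 17
\end{VerbatimU}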
A third source of OS jitter in the Linux kernel for CPU-bound threads
running at real-time priority is the scheduler itself, more specifically,
its real-time throttling.
This throttling is an intentional debugging feature, designed to ensure that
important non-realtime work is allotted at least 50 milliseconds
out of each second, even if there is an infinite-loop bug in
your real-time application.
However, when you are running a polling-loop-style real-time application,
you will need to disable this debugging feature.
This can be done as follows:
\begin{VerbatimU}
$ echo -1 > /proc/sys/kernel/sched_rt_runtime_us
\end{VerbatimU}
You will of course need to be running as root to execute this command,
and you will also need to carefully consider the aforementioned Spiderman
principle.
One way to minimize the risks is to offload interrupts and
kernel threads/daemons from all CPUs running CPU-bound real-time
threads, as described in the paragraphs above.
In addition, you should carefully read the material in the
\path{Documentation/scheduler} directory.
The material in the \path{sched-rt-group.rst} file is particularly
important, especially if you are using the \co{cgroups} real-time features
enabled by the \co{CONFIG_RT_GROUP_SCHED} Kconfig parameter.
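By way of illustration, the throttling period and budget may be inspected
via the following files, whose usual defaults (shown here) allow real-time
tasks to consume at most 950 milliseconds out of each 1,000-millisecond
period:
\begin{VerbatimU}
$ cat /proc/sys/kernel/sched_rt_period_us
1000000
$ cat /proc/sys/kernel/sched_rt_runtime_us
950000
\end{VerbatimU}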
A fourth source of OS jitter comes from timers.
In most cases, keeping a given CPU out of the kernel will prevent
timers from being scheduled on that CPU\@.
One important exception is recurring timers, where a given timer
handler posts a later occurrence of that same timer.
If such a timer gets started on a given CPU for any reason, that
timer will continue to run periodically on that CPU, inflicting
OS jitter indefinitely.
One crude but effective way to offload recurring timers is to
use CPU hotplug to offline all worker CPUs that are to run CPU-bound
real-time application threads, online these same CPUs, then start
your real-time application.
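For example, the following pair of commands offlines the hypothetical
worker CPU~3 and then brings it back online, which has the side effect of
migrating any recurring timers off of that CPU:
\begin{VerbatimU}
$ echo 0 > /sys/devices/system/cpu/cpu3/online
$ echo 1 > /sys/devices/system/cpu/cpu3/online
\end{VerbatimU}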
A fifth source of OS jitter is provided by device drivers that were
not intended for real-time use.
For an old canonical example, in 2005, the VGA driver would blank
the screen by zeroing the frame buffer with interrupts disabled,
which resulted in tens of milliseconds of OS jitter.
One way of avoiding device-driver-induced OS jitter is to carefully
select devices that have been used heavily in real-time systems,
and which have therefore had their real-time bugs fixed.
Another way is to confine the device's interrupts and all code using
that device to designated housekeeping CPUs.
A third way is to test the device's ability to support real-time
workloads and fix any real-time bugs.\footnote{
If you take this approach, please submit your fixes upstream
so that others can benefit.
After all, when you need to port your application to
a later version of the Linux kernel, \emph{you} will be one of those
``others''.}
A sixth source of OS jitter is provided by some in-kernel
full-system synchronization algorithms, perhaps most notably
the global TLB-flush algorithm.
This can be avoided by avoiding memory-unmapping operations, and especially
avoiding unmapping operations within the kernel.
As of early 2021, the way to avoid in-kernel
unmapping operations is to avoid unloading kernel modules.
A seventh source of OS jitter is provided by
scheduling-clock interrupts and RCU callback invocation.
These may be avoided by building your kernel with the
\co{NO_HZ_FULL} Kconfig parameter enabled, and then booting
with the \co{nohz_full=} parameter specifying the list of
worker CPUs that are to run real-time threads.
For example, \co{nohz_full=2-7} would designate CPUs~2, 3, 4, 5, 6, and~7
as worker CPUs, thus leaving CPUs~0 and~1 as housekeeping CPUs.
The worker CPUs would not incur scheduling-clock interrupts as long
as there is no more than one runnable task on each worker CPU,
and each worker CPU's RCU callbacks would be invoked on one of the
housekeeping CPUs.
A CPU that has suppressed scheduling-clock interrupts due to there
only being one runnable task on that CPU is said to be in
\emph{adaptive ticks mode} or in \co{nohz_full} mode.
It is important to ensure that you have designated enough
housekeeping CPUs to handle the housekeeping load imposed by the
rest of the system, which requires careful benchmarking and tuning.
An eighth source of OS jitter is page faults.
Because most Linux implementations use an MMU for memory protection,
real-time applications running on these systems can be subject
to page faults.
Use the \co{mlock()} and \co{mlockall()} system calls to pin your
application's pages into memory, thus avoiding major page faults.
Of course, the Spiderman principle applies, because locking down
too much memory may prevent the system from getting other work done.
A ninth source of OS jitter is unfortunately the hardware and firmware.
It is therefore important to use systems that have been designed for
real-time use.
\begin{listing}
\begin{fcvlabel}[ln:advsync:Locating Sources of OS Jitter]
\begin{VerbatimL}[commandchars=\\\[\]]
cd /sys/kernel/debug/tracing
echo 1 > max_graph_depth \lnlbl[echo1]
echo function_graph > current_tracer
# run workload
cat per_cpu/cpuN/trace \lnlbl[cat]
\end{VerbatimL}
\end{fcvlabel}
\caption{Locating Sources of OS Jitter}
\label{lst:advsync:Locating Sources of OS Jitter}
\end{listing}
\begin{fcvref}[ln:advsync:Locating Sources of OS Jitter]
Unfortunately, this list of OS-jitter sources can never be complete,
as it will change with each new version of the kernel.
This makes it necessary to be able to track down additional sources
of OS jitter.
Given a CPU~$N$ running a CPU-bound user-mode thread, the
commands shown in
\cref{lst:advsync:Locating Sources of OS Jitter}
will produce a list of all the times that this CPU entered the kernel.
Of course, the \co{N} on \clnref{cat} must be replaced with the
number of the CPU in question, and the \co{1} on \clnref{echo1} may be
increased
to show additional levels of function call within the kernel.
The resulting trace can help track down the source of the OS jitter.
\end{fcvref}
As always, there is no free lunch, and \co{NO_HZ_FULL} is no exception.
As noted earlier,
\co{NO_HZ_FULL} makes kernel/user transitions more expensive due to the
need for delta process accounting and the need to inform kernel subsystems
(such as RCU) of the transitions.
As a rough rule of thumb, \co{NO_HZ_FULL} helps with many types of
real-time and heavy-compute workloads, but hurts other workloads
that feature high rates of system calls and
I/O~\cite{AbdullahAljuhni2018nohzfull}.
Additional limitations, tradeoffs, and configuration advice may be
found in the \path{Documentation/timers/no_hz.rst} file in the Linux source tree.
As you can see, obtaining bare-metal performance when running
CPU-bound real-time threads on a general-purpose OS such as Linux
requires painstaking attention to detail.
Automation would of course help, and some automation has been applied,
but given the relatively small number of users, automation can be
expected to appear relatively slowly.
Nevertheless, the ability to gain near-bare-metal performance while
running a general-purpose operating system promises to ease construction
of some types of real-time systems.
\subsection{Implementing Parallel Real-Time Applications}
\label{sec:advsync:Implementing Parallel Real-Time Applications}
Developing real-time applications is a wide-ranging topic, and this
section can only touch on a few aspects.
To this end,
\cref{sec:advsync:Real-Time Components}
looks at a few software components commonly used in real-time applications,
\cref{sec:advsync:Polling-Loop Applications}
provides a brief overview of how polling-loop-based applications may
be implemented,
\cref{sec:advsync:Streaming Applications}
gives a similar overview of streaming applications, and
\cref{sec:advsync:Event-Driven Applications}
briefly covers event-based applications.
\subsubsection{Real-Time Components}
\label{sec:advsync:Real-Time Components}
As in all areas of engineering, a robust set of components is essential
to \IX{productivity} and \IX{reliability}.
This section is not a full catalog of real-time software components---such
a catalog would fill multiple books---but rather a brief overview of the
types of components available.
A natural place to look for real-time software components would be
algorithms offering wait-free
synchronization~\cite{Herlihy91}, and in fact lockless
algorithms are very important to real-time computing.
However, wait-free synchronization only guarantees forward progress in
finite time.
Although a century is finite, this is unhelpful when your deadlines are
measured in milliseconds, let alone microseconds.
Nevertheless, there are some important wait-free algorithms that do
provide bounded response time, including atomic test and set,
atomic exchange,
atomic fetch-and-add,
single-producer/single-consumer FIFO queues based on circular arrays,
and numerous per-thread partitioned algorithms.
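As a rough sketch of one such component, the following invented example
implements a single-producer/single-consumer FIFO atop a fixed-size circular
array using C11 atomics, with \co{MY_FIFO_SIZE}, the structure, and the
function names all made up for illustration.
Both \co{my_fifo_put()} and \co{my_fifo_get()} complete in a bounded number
of instructions, returning failure rather than blocking when the FIFO is
full or empty, respectively:
\begin{VerbatimU}
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define MY_FIFO_SIZE 256 /* Must be a power of two. */

struct my_fifo {
	void *buf[MY_FIFO_SIZE];
	atomic_size_t head; /* Next slot filled by producer. */
	atomic_size_t tail; /* Next slot drained by consumer. */
};

static bool my_fifo_put(struct my_fifo *fp, void *item)
{
	size_t h = atomic_load_explicit(&fp->head, memory_order_relaxed);
	size_t t = atomic_load_explicit(&fp->tail, memory_order_acquire);

	if (h - t >= MY_FIFO_SIZE)
		return false; /* Full. */
	fp->buf[h % MY_FIFO_SIZE] = item;
	atomic_store_explicit(&fp->head, h + 1, memory_order_release);
	return true;
}

static void *my_fifo_get(struct my_fifo *fp)
{
	size_t t = atomic_load_explicit(&fp->tail, memory_order_relaxed);
	size_t h = atomic_load_explicit(&fp->head, memory_order_acquire);
	void *item;

	if (t == h)
		return NULL; /* Empty. */
	item = fp->buf[t % MY_FIFO_SIZE];
	atomic_store_explicit(&fp->tail, t + 1, memory_order_release);
	return item;
}
\end{VerbatimU}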
In addition, recent research has confirmed the observation that
algorithms with lock-free guarantees\footnote{
Wait-free algorithms guarantee that all threads make progress in
finite time, while lock-free algorithms only guarantee that at
least one thread will make progress in finite time.
See \cref{sec:advsync:Non-Blocking Synchronization} for more details.}
also provide, in practice, the same latencies as their wait-free counterparts,
assuming a stochastically fair scheduler and absence of fail-stop
bugs~\cite{DanAlitarh2013PracticalProgress}.
This means that many non-wait-free stacks and queues are nevertheless
appropriate for real-time use.
\QuickQuiz{
But isn't correct operation despite fail-stop bugs
a valuable fault-tolerance property?
}\QuickQuizAnswer{
Yes and no.
Yes in that non-blocking algorithms can provide fault tolerance
in the face of fail-stop bugs, but no in that this is grossly
insufficient for practical fault tolerance.
For example, suppose you had a wait-free queue, and further
suppose that a thread has just dequeued an element.
If that thread now succumbs to a fail-stop bug, the element
it has just dequeued is effectively lost.
True fault tolerance requires way more than mere non-blocking
properties, and is beyond the scope of this book.
}\QuickQuizEnd
In practice, locking is often used in real-time programs, theoretical
concerns notwithstanding.
However, under more severe constraints, lock-based
algorithms can also provide bounded latencies~\cite{BjoernBrandenburgPhD}.
These constraints include:
\begin{enumerate}
\item Fair scheduler.
In the common case of a fixed-priority scheduler, the bounded
latencies are provided only to the highest-priority threads.
\item Sufficient bandwidth to support the workload.
An implementation rule supporting this constraint might be
``There will be at least 50\pct\ idle time on all CPUs
during normal operation,''
or, more formally, ``The offered load will be sufficiently low
to allow the workload to be schedulable at all times.''
\item No fail-stop bugs.
\item FIFO locking primitives with bounded acquisition, handoff,
and release latencies.
Again, in the common case of a locking primitive that is FIFO
within priorities, the bounded latencies are provided only
to the highest-priority threads.
\item Some way of preventing unbounded priority inversion.
The priority-ceiling and priority-inheritance disciplines
mentioned earlier in this chapter suffice.
\item Bounded nesting of lock acquisitions.
We can have an unbounded number of locks, but only as long as a
given thread never acquires more than a few of them (ideally only
one of them) at a time.
\item Bounded number of threads.
In combination with the earlier constraints, this constraint means
that there will be a bounded number of threads waiting on any
given lock.
\item Bounded time spent in any given critical section.
Given a bounded number of threads waiting on any given lock and
a bounded critical-section duration, the wait time will be bounded.
\end{enumerate}
\QuickQuiz{
I couldn't help but spot the word ``include'' before this list.
Are there other constraints?
}\QuickQuizAnswer{
Indeed there are, and lots of them.
However, they tend to be specific to a given situation,
and many of them can be thought of as refinements of some of
the constraints listed above.
For example, the many constraints on choices of data structure
	will help meet the ``Bounded time spent in any given critical
section'' constraint.
}\QuickQuizEnd
This result opens a vast cornucopia of algorithms and data structures
for use in real-time software---and validates long-standing real-time practice.
Of course, a careful and simple application design is also extremely
important.
The best real-time components in the world cannot make up for a
poorly thought-out design.
For parallel real-time applications, synchronization overheads clearly
must be a key component of the design.
\subsubsection{Polling-Loop Applications}
\label{sec:advsync:Polling-Loop Applications}
Many real-time applications consist of a single CPU-bound loop that
reads sensor data, computes a control law, and writes control output.
If the hardware registers providing sensor data and taking control
output are mapped into the application's address space, this loop
might be completely free of system calls.
But beware of the Spiderman principle:
With great power comes great responsibility, in this case the
responsibility to avoid bricking the hardware by making inappropriate
references to the hardware registers.
This arrangement is often run on bare metal, without the benefits of
(or the interference from) an operating system.
However, increasing hardware capability and increasing levels of
automation motivate increased software functionality, for example,
user interfaces, logging, and reporting, all of which can benefit from
an operating system.
One way of gaining much of the benefit of running on bare metal while
still having access to the full features and functions of a
general-purpose operating system is to use the Linux kernel's
\co{NO_HZ_FULL} capability, described in
\cref{sec:advsync:Polling-Loop Real-Time Support}.
\subsubsection{Streaming Applications}
\label{sec:advsync:Streaming Applications}
One type of big-data real-time application takes input from numerous
sources, processes it internally, and outputs alerts and summaries.
These \emph{streaming applications} are often highly parallel, processing
different information sources concurrently.
One approach for implementing streaming applications is to use
dense-array circular FIFOs to connect different processing
steps~\cite{AdrianSutton2013LCA:Disruptor}.
Each such FIFO has only a single thread producing into it and a
(presumably different) single thread consuming from it.
Fan-in and fan-out points use threads rather than data structures,
so if the output of several FIFOs needed to be merged, a separate
thread would input from them and output to another FIFO for which
this separate thread was the sole producer.
Similarly, if the output of a given FIFO needed to be split, a separate
thread would input from this FIFO and output to several FIFOs as needed.
This discipline might seem restrictive, but it allows communication
among threads with minimal synchronization overhead, and minimal
synchronization overhead is important when attempting to meet
tight latency constraints.
This is especially true when the amount of processing for each step
is small, so that the synchronization overhead is significant compared
to the processing overhead.
The individual threads might be CPU-bound, in which case the advice in
\cref{sec:advsync:Polling-Loop Applications} applies.
On the other hand, if the individual threads block waiting for
data from their input FIFOs, the advice of the next section applies.
\subsubsection{Event-Driven Applications}
\label{sec:advsync:Event-Driven Applications}
We will use fuel injection into a mid-sized industrial engine as a
fanciful example for event-driven applications.
Under normal operating conditions, this engine requires that the fuel
be injected within a one-degree interval surrounding top dead center.
If we assume a 1,500-RPM rotation rate, we have 25 rotations per second,
or about 9,000 degrees of rotation per second, which translates to
111 microseconds per degree.
We therefore need to schedule the fuel injection to within a time
interval of about 100 microseconds.
\begin{listing}
\begin{fcvlabel}[ln:advsync:Timed-Wait Test Program]
\begin{VerbatimL}
if (clock_gettime(CLOCK_REALTIME, &timestart) != 0) {
perror("clock_gettime 1");
exit(-1);
}
if (nanosleep(&timewait, NULL) != 0) {
perror("nanosleep");
exit(-1);
}
if (clock_gettime(CLOCK_REALTIME, &timeend) != 0) {
perror("clock_gettime 2");
exit(-1);
}
\end{VerbatimL}
\end{fcvlabel}
\caption{Timed-Wait Test Program}
\label{lst:advsync:Timed-Wait Test Program}
\end{listing}
Suppose that a timed wait was to be used to initiate fuel injection,
although if you are building an engine, I hope you supply a rotation
sensor.
We need to test the timed-wait functionality, perhaps using the test program
shown in
\cref{lst:advsync:Timed-Wait Test Program}.
Unfortunately, if we run this program, we can get unacceptable timer
jitter, even in a \rt\ kernel.
One problem is that POSIX \co{CLOCK_REALTIME} is, oddly enough, not intended
for real-time use.
Instead, it means ``realtime'' as opposed to the amount of CPU time
consumed by a process or thread.
For real-time use, you should instead use \co{CLOCK_MONOTONIC}.
However, even with this change, results are still unacceptable.
Another problem is that the thread must be raised to a real-time
priority by using the \co{sched_setscheduler()} system call.
But even this change is insufficient, because we can still see
page faults.
We also need to use the \co{mlockall()} system call to pin the
application's memory, preventing page faults.
With all of these changes, results might finally be acceptable.
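Pulling these changes together, a rough sketch of the adjusted test might
look as follows, reusing the \co{timestart}, \co{timewait}, and
\co{timeend} variables from
\cref{lst:advsync:Timed-Wait Test Program} and using a real-time priority
of one purely for illustration:
\begin{VerbatimU}
struct sched_param sp = { .sched_priority = 1 };

if (sched_setscheduler(0, SCHED_FIFO, &sp) != 0) {
	perror("sched_setscheduler");
	exit(-1);
}
if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
	perror("mlockall");
	exit(-1);
}
if (clock_gettime(CLOCK_MONOTONIC, &timestart) != 0) {
	perror("clock_gettime 1");
	exit(-1);
}
if (nanosleep(&timewait, NULL) != 0) {
	perror("nanosleep");
	exit(-1);
}
if (clock_gettime(CLOCK_MONOTONIC, &timeend) != 0) {
	perror("clock_gettime 2");
	exit(-1);
}
\end{VerbatimU}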
In other situations, further adjustments might be needed.
It might be necessary to affinity time-critical threads onto their
own CPUs, and it might also be necessary to affinity interrupts
away from those CPUs.
It might be necessary to carefully select hardware and drivers,
and it will very likely be necessary to carefully select kernel
configuration.
As can be seen from this example, real-time computing can be quite
unforgiving.
\subsubsection{The Role of RCU}
\label{sec:advsync:The Role of RCU}
Suppose that you are writing a parallel real-time application that needs
to access
data that is subject to gradual change, perhaps due to changes in
temperature, humidity, and barometric pressure.
The real-time response constraints on this program are so severe that
it is not permissible to spin or block, thus ruling out locking,
nor is it permissible to use a retry loop, thus ruling out sequence locks
and \IXpl{hazard pointer}.
Fortunately, the temperature and pressure are normally controlled,
so that a default hard-coded set of data is usually sufficient.
However, the temperature, humidity, and pressure occasionally deviate too far
from the defaults, and in such situations it is necessary to provide
data that replaces the defaults.
Because the temperature, humidity, and pressure change gradually,
providing the updated values is not a matter of urgency, though
it must happen within a few minutes.
The program is to use a global pointer imaginatively named \co{cur_cal}
that normally references \co{default_cal}, which is a statically allocated
and initialized structure that contains the default calibration values
in fields imaginatively named \co{a}, \co{b}, and \co{c}.
Otherwise, \co{cur_cal} points to a dynamically allocated
structure providing the current calibration values.
\begin{listing}
\begin{fcvlabel}[ln:advsync:Real-Time Calibration Using RCU]
\begin{VerbatimL}[commandchars=\\\[\]]
struct calibration {
short a;
short b;
short c;
};
struct calibration default_cal = { 62, 33, 88 };
struct calibration *cur_cal = &default_cal;
short calc_control(short t, short h, short press) \lnlbl[calc:b]
{
struct calibration *p;
p = rcu_dereference(cur_cal);
return do_control(t, h, press, p->a, p->b, p->c);
} \lnlbl[calc:e]
bool update_cal(short a, short b, short c) \lnlbl[upd:b]
{
struct calibration *p;
struct calibration *old_p;
old_p = rcu_dereference(cur_cal);
p = malloc(sizeof(*p));
if (!p)
return false;
p->a = a;
p->b = b;
p->c = c;
rcu_assign_pointer(cur_cal, p);
if (old_p == &default_cal)
return true;
synchronize_rcu();
free(old_p);
return true;
} \lnlbl[upd:e]
\end{VerbatimL}
\end{fcvlabel}
\caption{Real-Time Calibration Using RCU}
\label{lst:advsync:Real-Time Calibration Using RCU}
\end{listing}
\begin{fcvref}[ln:advsync:Real-Time Calibration Using RCU]
\Cref{lst:advsync:Real-Time Calibration Using RCU}
shows how RCU can be used to solve this problem.
Lookups are deterministic, as shown in \co{calc_control()}
on \clnrefrange{calc:b}{calc:e}, consistent with real-time requirements.
Updates are more complex, as shown by \co{update_cal()}
on \clnrefrange{upd:b}{upd:e}.
\end{fcvref}
\QuickQuizSeries{%
\QuickQuizB{
Given that real-time systems are often used for safety-critical
applications, and given that runtime memory allocation is
forbidden in many safety-critical situations, what is with
the call to \co{malloc()}???
}\QuickQuizAnswerB{
In early 2016, projects forbidding runtime memory allocation
were also not at all interested in multithreaded computing.
So the runtime memory allocation is not an additional
obstacle to safety criticality.
However, by 2020 runtime memory allocation in multi-core
real-time systems was gaining some traction.
}\QuickQuizEndB
%
\QuickQuizE{
Don't you need some kind of synchronization to protect
\co{update_cal()}?
}\QuickQuizAnswerE{
Indeed you do, and you could use any of a number of techniques
discussed earlier in this book.
One of those techniques is use of a single updater thread,
which would result in exactly the code shown in \co{update_cal()}
in \cref{lst:advsync:Real-Time Calibration Using RCU}.
}\QuickQuizEndE
}
This example shows how RCU can provide deterministic read-side
data-structure access to real-time programs.
\subsection{Real Time vs.~Real Fast:
How to Choose?}
\label{sec:advsync:Real Time vs. Real Fast: How to Choose?}
The choice between real-time and real-fast computing can be a difficult one.
Because real-time systems often inflict a throughput penalty on
non-real-time computing, using real-time when it is not required is
unwise, as fancifully depicted by
\cref{fig:advsync:The Dark Side of Real-Time Computing}.
\begin{figure}
\centering
\resizebox{3.2in}{!}{\includegraphics{cartoons/RealTimeNotRealFast}}
\caption{The Dark Side of Real-Time Computing}
\ContributedBy{fig:advsync:The Dark Side of Real-Time Computing}{Sarah McKenney}
\end{figure}
On the other hand, failing to use real-time when it \emph{is} required
can also cause problems, as fancifully depicted by
\cref{fig:advsync:The Dark Side of Real-Fast Computing}.
It is almost enough to make you feel sorry for the boss!
\begin{figure}
\centering
\resizebox{3.2in}{!}{\includegraphics{cartoons/RealFastNotRealTime}}
\caption{The Dark Side of Real-Fast Computing}
\ContributedBy{fig:advsync:The Dark Side of Real-Fast Computing}{Sarah McKenney}
\end{figure}
One rule of thumb uses the following four questions to help you choose:
\begin{enumerate}
\item Is average long-term throughput the only goal?
\item Is it permissible for heavy loads to degrade response times?
\item Is there high memory pressure, ruling out use of
the \co{mlockall()} system call?
\item Does the basic work item of your application take more than
100 milliseconds to complete?
\end{enumerate}
If the answer to any of these questions is ``yes'', you should choose
real-fast over real-time, otherwise, real-time might be for you.
Choose wisely, and if you do choose real-time, make sure that your
hardware, firmware, and operating system are up to the job!