% rt/rt.tex

\QuickQuizChapter{chp:Parallel Real-Time Computing}{Parallel Real-Time Computing}

\epigraph{The difference between you and me is that I was right in time.}
{\emph{Konrad Adenauer}}

An important emerging area in computing is that of parallel real-time
computing.
Section~\ref{sec:rt:What is Real-Time Computing?}
looks at a number of definitions of ``real-time computing'', moving
beyond the usual sound bites to more meaningful criteria.
Section~\ref{sec:rt:Who Needs Real-Time Computing?}
surveys the sorts of applications that need real-time response.
Section~\ref{sec:rt:Who Needs Parallel Real-Time Computing?}
notes that parallel real-time computing is upon us, and discusses
when and why parallel real-time computing can be useful.
Section~\ref{sec:rt:Implementing Parallel Real-Time Systems}
gives a brief overview of how parallel real-time systems may be implemented,
and finally,
Section~\ref{sec:rt:Real Time vs. Real Fast: How to Choose?}
outlines how to decide whether or not your application needs real-time
facilities.

\section{What is Real-Time Computing?}
\label{sec:rt:What is Real-Time Computing?}

One traditional way of classifying real-time computing is into the categories
of \emph{hard real time} and \emph{soft real time}, where the macho
hard real-time applications never ever miss their deadlines, but the
wimpy soft real-time applications might well miss their deadlines
quite often.

\subsection{Soft Real Time}
\label{sec:Soft Real Time}

It should be easy to see problems with this definition of soft real time.
For one thing, by this definition, \emph{any} piece of software could be
said to be a soft real-time application:
``My application computes million-point Fourier transforms in half a
picosecond.''
``No way!!!
The clock cycle on this system is more than \emph{three hundred} picoseconds!''
``Ah, but it is a \emph{soft} real-time application!''
If the term ``soft real time'' is to be of any use whatsoever, some limits
are clearly required.

We might therefore say that a given soft real-time application must meet
its response-time requirements at least some fraction of the time; for
example, we might require that it execute in less than 20~microseconds
99.9\% of the time.

This of course raises the question of what is to be done when the application
fails to meet its response-time requirements.
The answer varies with the application, but one possibility
is that the system being controlled has sufficient stability and inertia
to render harmless the occasional late control action.
Another possibility is that the application has two ways of computing
the result, a fast and deterministic but inaccurate method on the
one hand and
a very accurate method with unpredictable compute time on the other.
One reasonable approach would be to start
both methods in parallel, and if the accurate method fails to finish
in time, kill it and use the answer from the fast but inaccurate method.
One candidate for the fast but inaccurate method is to take
no control action during the current time period, and another candidate is
to take the same control action as was taken during the preceding time
period.

In short, it does not make sense to talk about soft real time without
some measure of exactly how soft it is.

\subsection{Hard Real Time}
\label{sec:Hard Real Time}

\begin{figure}[bt]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/realtime-smash}}
\caption{Real-Time Response Guarantee, Meet Hammer}
\ContributedBy{Figure}{fig:rt:Hard Real-Time Response Guarantee, Meet Hammer}{Melissa Broussard}
\end{figure}

In contrast, the definition of hard real time is quite definite.
After all, a given system either always meets its deadlines or it
doesn't.
Unfortunately, a strict application of this definition would mean that
there can never be any hard real-time systems.
The reason for this is fancifully depicted in
Figure~\ref{fig:rt:Hard Real-Time Response Guarantee, Meet Hammer}.
It is true that you could construct a more robust system, perhaps even
with added redundancy.
But it is also true that I can always get a bigger hammer.

\begin{figure}[bt]
\centering
\resizebox{3in}{!}{\includegraphics{cartoons/realtime-lifesupport-nobomb}}
\caption{Real-Time Response: Hardware Matters}
\ContributedBy{Figure}{fig:rt:Real-Time Response: Hardware Matters}{Melissa Broussard}
\end{figure}

Then again, perhaps it is unfair to blame the software for what is clearly
not just a hardware problem, but a bona fide big-iron hardware problem
at that.\footnote{
	Or, given modern hammers, a big-steel problem.}
This suggests that we define hard real-time software as software that
will always meet its deadlines, but only in the absence of a hardware
failure.
Unfortunately, failure is not always an option, as fancifully depicted in
Figure~\ref{fig:rt:Real-Time Response: Hardware Matters}.
We simply cannot expect the poor gentleman depicted in that figure to be
reassured by our saying ``Rest assured that if a missed deadline results
in your tragic death, it most certainly will not have been due to a
software problem!''
Hard real-time response is a property of the entire system, not
just of the software.

But if we cannot demand perfection, perhaps we can make do with
notification, similar to the soft real-time approach noted earlier.
Then if the Life-a-Tron in
Figure~\ref{fig:rt:Real-Time Response: Hardware Matters}
is about to miss its deadline,
it can alert the hospital staff.

\begin{figure*}[bt]
\centering
\resizebox{6in}{!}{\rotatebox{90}{\includegraphics{cartoons/realtime-lazy-crop}}}
\caption{Real-Time Response: Notification Insufficient}
\ContributedBy{Figure}{fig:rt:Real-Time Response: Notification Insufficient}{Melissa Broussard}
\end{figure*}

Unfortunately, this approach has the trivial solution fancifully depicted in
Figure~\ref{fig:rt:Real-Time Response: Notification Insufficient}.
A system that always immediately issues a notification that it won't
be able to meet its deadline complies with the letter of the law,
but is completely useless.
There clearly must also be a requirement that the system meet its deadline
some fraction of the time, or perhaps that it be prohibited from missing
its deadlines on more than a certain number of consecutive operations.

We clearly cannot take a sound-bite approach to either hard or soft
real time.
The next section therefore takes a more real-world approach.

\subsection{Real-World Real Time}
\label{sec:rt:Real-World Real Time}

Although sentences like ``Hard real-time systems \emph{always} meet
their deadlines!'' can be catchy and are no doubt easy to memorize,
something else is needed for real-world real-time systems.
Although the resulting specifications are
harder to memorize, they can simplify construction of a real-time
system by imposing constraints on the environment, the workload, and
the real-time application itself.

\subsubsection{Environmental Constraints}
\label{sec:rt:Environmental Constraints}

Constraints on the environment address the objection to open-ended
promises of response times implied by ``hard real time''.
These constraints might specify permissible operating temperatures,
air quality, levels and types of electromagnetic radiation, and, to
Figure~\ref{fig:rt:Hard Real-Time Response Guarantee, Meet Hammer}'s
point, levels of shock and vibration.

Of course, some constraints are easier to meet than others.
Any number of people have learned the hard way that
commodity computer components often refuse to operate at sub-freezing
temperatures, which suggests a set of climate-control requirements.

An old college friend once had to meet the challenge of operating
a real-time system in an atmosphere featuring some rather aggressive
chlorine compounds, a challenge that he wisely handed off to his
colleagues designing the hardware.
In effect, my friend imposed an atmospheric-composition constraint
on the environment immediately surrounding the computer, a constraint
that the hardware designers met through use of physical seals.
Another old college friend worked on a computer-controlled system
that sputtered ingots of titanium using an industrial-strength arc
in a vacuum.
From time to time, the arc would decide that it was bored with its path
through the ingot of titanium and choose a far shorter and more
entertaining path to ground.
As we all learned in our physics classes, a sudden shift in the flow of
electrons creates an electromagnetic wave, with larger shifts in larger
flows creating higher-power electromagnetic waves.
And in this case, the resulting electromagnetic pulses were sufficient
to induce a quarter of a volt potential difference in the leads of
a small ``rubber ducky'' antenna located more than 400 meters away.
This means that nearby conductors saw larger voltages, courtesy of the
inverse-square law.
This includes those
conductors making up the computer controlling the sputtering process.
In particular, the voltage induced on that computer's reset line was
sufficient to actually reset the computer, to the consternation of everyone
involved.
In this case, the challenge was also met using hardware, including some
elaborate shielding and a fiber-optic network with the lowest bitrate
I have ever heard of, namely 9600 baud.
That said, less spectacular electromagnetic environments can often be
handled by software through use of error detection and correction codes.
Nevertheless, it is important to remember that although error detection and
correction codes can reduce failure rates, they normally cannot reduce
them all the way down to zero, which can form yet another obstacle
to achieving hard real-time response.

There are also situations where a minimum level of energy
is required, for example, through the power leads of the system and
through the devices through which the system is to communicate with
that portion of the outside world that is to be monitored or controlled.

\QuickQuiz{}
	But what about battery-powered systems?
	They don't require energy flowing into the system as a whole.
\QuickQuizAnswer{
	Sooner or later, either the battery must be recharged, which
	requires energy to flow into the system, or the system will
	stop operating.
} \QuickQuizEnd

A number of systems are intended to operate in environments with impressive
levels of shock and vibration, for example, engine control systems.
More strenuous requirements may be found when we move away from
continuous vibrations to intermittent shocks.
For example, during my undergraduate studies, I encountered an old Athena
ballistics computer, which was designed to continue operating normally even if
a hand grenade went off nearby.\footnote{
	Decades later, the acceptance tests for some types of computer
	systems involve large detonations, and some types of
	communications networks must deal with what is delicately
	termed ``ballistic jamming.''}
And finally, the ``black boxes'' used in airliners must continue operating
before, during, and after a crash.

Of course, it is possible to make hardware more robust against
environmental shocks and insults.
Any number of ingenious mechanical shock-absorbing devices can reduce the
effects of shock and vibration, multiple layers of shielding can reduce
the effects of low-energy electromagnetic radiation, error-correction
coding can reduce the effects of high-energy radiation, various potting
and sealing techniques can reduce the effect of air quality, and any
number of heating and cooling systems can counter the effects of temperature.
In extreme cases, triple modular redundancy can reduce the probability that
a fault in one part of the system will result in incorrect behavior from
the overall system.
However, all of these methods have one thing in common: Although they
can reduce the probability of failure, they cannot reduce it to zero.

Although these severe environmental conditions are often addressed by using
more robust hardware, the
workload and application constraints in the next two sections are often
handled in software.
\subsubsection{Workload Constraints}
\label{sec:rt:Workload Constraints}

Just as with people, it is often possible to prevent a real-time system
from meeting its deadlines by overloading it.
For example, if the system is being interrupted too frequently, it might
not have sufficient CPU bandwidth to handle its real-time application.
A hardware solution to this problem might limit the rate at which
interrupts were delivered to the system.
Possible software solutions include disabling interrupts for some time if
they are being received too frequently,
resetting the device generating too-frequent interrupts,
or even avoiding interrupts altogether in favor of polling.

Overloading can also degrade response times due to queueing effects,
so it is not unusual for real-time systems to overprovision CPU bandwidth,
leaving a running system with (say) 80\% idle time.
This approach also applies to storage and networking devices.
In some cases, separate storage and networking hardware might be reserved
for the sole use of high-priority portions of the real-time application.
It is of course not unusual for this hardware to be mostly idle, given
that response time is more important than throughput in
real-time systems.

\QuickQuiz{}
	But given the results from queueing theory, won't low utilization
	merely improve the average response time rather than improving
	the worst-case response time?
	And isn't worst-case response time all that most
	real-time systems really care about?
\QuickQuizAnswer{
	Yes, but...

	Those queueing-theory results assume infinite ``calling populations'',
	which in the Linux kernel might correspond to an infinite number
	of tasks.
	As of mid-2016, no real system supports an infinite number of tasks,
	so results assuming infinite calling populations should be
	expected to have less-than-infinite applicability.

	Other queueing-theory results have \emph{finite}
	calling populations, which feature sharply bounded response
	times~\cite{Hillier86}.
	These results better model real systems, and these models do
	predict reductions in both average and worst-case response
	times as utilizations decrease.
	These results can be extended to model concurrent systems that use
	synchronization mechanisms such as
	locking~\cite{BjoernBrandenburgPhD}.

	In short, queueing-theory results that accurately describe
	real-world real-time systems show that worst-case response
	time decreases with decreasing utilization.
} \QuickQuizEnd

Of course, maintaining sufficiently low utilization requires great
discipline throughout the design and implementation.
There is nothing quite like a little feature creep to destroy deadlines.

\subsubsection{Application Constraints}
\label{sec:rt:Application Constraints}

It is easier to provide bounded response time for some operations than
for others.
For example, it is quite common to see response-time specifications for
interrupts and for wake-up operations, but quite rare for (say)
filesystem unmount operations.
One reason for this is that it is quite difficult to bound the amount
of work that a filesystem-unmount operation might need to do, given that
the unmount is required to flush all of that filesystem's in-memory
data to mass storage.

This means that real-time applications must be confined to operations
for which bounded latencies can reasonably be provided.
Other operations must either be pushed out into the non-real-time portions
of the application or forgone entirely.

There might also be constraints on the non-real-time portions of the
application.
For example, is the non-real-time application permitted to use CPUs used
by the real-time portion?
Are there time periods during which the real-time portion of the application
is expected to be unusually busy, and if so, is the non-real-time portion
of the application permitted to run at all during those times?
Finally, by what amount is the real-time portion of the application permitted
to degrade the throughput of the non-real-time portion?
\subsubsection{Real-World Real-Time Specifications}
\label{sec:rt:Real-World Real-Time Specifications}

As can be seen from the preceding sections, a real-world real-time
specification needs to include constraints on the environment,
on the workload, and on the application itself.
In addition, for the operations that the real-time portion of the
application is permitted to make use of, there must be constraints
on the hardware and software implementing those operations.

For each such operation, these constraints might include a maximum
response time (and possibly also a minimum response time) and a
probability of meeting that response time.
A probability of 100\% indicates that the corresponding operation
must provide hard real-time service.

In some cases, both the response times and the required probabilities of
meeting them might vary depending on the parameters to the operation in
question.
For example, a network operation over a local-area network (LAN)
would be much more likely
to complete in (say) 100~microseconds than would that same network operation
over a transcontinental WAN.
Furthermore, a network operation over a copper or fiber
LAN might have an extremely
high probability of completing without time-consuming retransmissions,
while that same networking operation over a lossy WiFi network might
have a much higher probability of missing tight deadlines.
Similarly, a read from a tightly coupled solid-state disk (SSD) could be
expected to complete much more quickly than that same read to an old-style
USB-connected rotating-rust disk drive.\footnote{
	Important safety tip: Worst-case response times from USB devices
	can be extremely long.
	Real-time systems should therefore take care to place any USB
	devices well away from critical paths.}

Some real-time applications pass through different phases of operation.
For example, a real-time system controlling a plywood lathe that peels
a thin sheet of wood (called ``veneer'') from a spinning log must:
(1)~Load the log into the lathe,
(2)~Position the log on the lathe's chucks so as to expose the largest
cylinder contained in the log to the blade,
(3)~Start spinning the log,
(4)~Continuously vary the knife's position so as to peel the log into veneer,
(5)~Remove the remaining core of the log that is too small to peel, and
(6)~Wait for the next log.
Each of these six phases of operation might well have its own set of
deadlines and environmental constraints,
for example, one would expect phase~4's deadlines to be much more severe
than those of phase~6, milliseconds instead of seconds.
One might therefore expect that low-priority work would be performed in
phase~6 rather than in phase~4.
That said, careful choices of hardware, drivers, and software configuration
would be required to support phase~4's more severe requirements.

A key advantage of this phase-by-phase approach is that the latency
budgets can be broken down, so that the application's various components
can be developed independently, each with its own latency budget.
Of course, as with any other kind of budget, there will likely be the
occasional conflict as to which component gets which fraction of the
overall budget, and as with any other kind of budget, strong leadership
and a sense of shared goals can help to resolve these conflicts in
a timely fashion.
And, again as with other kinds of technical budget, a strong validation
effort is required in order to ensure proper focus on latencies and to
give early warning of latency problems.
A successful validation effort will almost always include a good test
suite, which might be unsatisfying to the theorists, but has the virtue
of helping to get the job done.
As a point of fact, as of early 2015, most real-world real-time systems
use an acceptance test rather than formal proofs.

That said, the widespread use of test suites to validate real-time systems
does have a very real disadvantage, namely that real-time software is
validated only on specific hardware in specific hardware and software
configurations.
Adding additional hardware and configurations requires additional costly and
time-consuming testing.
Perhaps the field of formal verification will advance sufficiently to
change this situation, but as of early 2015, rather
large advances are required.
\QuickQuiz{}
	Formal verification is already quite capable, benefiting from
	decades of intensive study.
	Are additional advances \emph{really} required, or is this
	just a practitioner's excuse to continue to be lazy and ignore
	the awesome power of formal verification?
\QuickQuizAnswer{
	Perhaps this situation is just a theoretician's excuse to avoid
	diving into the messy world of real software?
	Perhaps more constructively, the following advances are required:

	\begin{enumerate}
	\item	Formal verification needs to handle larger software
		artifacts.
		The largest verification efforts have been for systems
		of only about 10,000 lines of code, and those have been
		verifying much simpler properties than real-time latencies.
	\item	Hardware vendors will need to publish formal timing
		guarantees.
		This used to be common practice back when hardware was
		much simpler, but today's complex hardware results in
		excessively complex expressions for worst-case performance.
		Unfortunately, energy-efficiency concerns are pushing
		vendors in the direction of even more complexity.
	\item	Timing analysis needs to be integrated into development
		methodologies and IDEs.
	\end{enumerate}

	All that said, there is hope, given recent work formalizing
	the memory models of real computer
	systems~\cite{JadeAlglave2011ppcmem,Alglave:2013:SVW:2450268.2450306}.
} \QuickQuizEnd

In addition to latency requirements for the real-time portions of the
application, there will likely be performance and scalability requirements
for the non-real-time portions of the application.
These additional requirements reflect the fact that ultimate real-time
latencies are often attained by degrading scalability and average performance.

Software-engineering requirements can also be important, especially for
large applications that must be developed and maintained by large teams.
These requirements often favor increased modularity and fault isolation.

This is a mere outline of the work that would be required to specify
deadlines and environmental constraints for a production real-time system.
It is hoped that this outline clearly demonstrates the inadequacy of
the sound-bite-based approach to real-time computing.
\section{Who Needs Real-Time Computing?}
\label{sec:rt:Who Needs Real-Time Computing?}

It is possible to argue that all computing is in fact real-time computing.
For one moderately extreme example, when you purchase a birthday gift online,
you would like the gift to arrive before the recipient's birthday.
And in fact even turn-of-the-millennium web services observed sub-second
response constraints~\cite{KristofferBohmann2001a}, and requirements have
not eased with the passage of time~\cite{DeCandia:2007:DAH:1323293.1294281}.
It is nevertheless useful to focus on those real-time applications
whose response-time requirements cannot be achieved straightforwardly
by non-real-time systems and applications.
Of course, as hardware costs decrease and bandwidths and memory sizes
increase, the line between real-time and non-real-time will continue
to shift, but such progress is by no means a bad thing.

\QuickQuiz{}
	Differentiating real-time from non-real-time based on what can
	``be achieved straightforwardly by non-real-time systems and
	applications'' is a travesty!
	There is absolutely no theoretical basis for such a distinction!!!
	Can't we do better than that???
\QuickQuizAnswer{
	This distinction is admittedly unsatisfying from a strictly
	theoretical perspective.
	But on the other hand, it is exactly what the developer needs
	in order to decide whether the application can be cheaply and
	easily developed using standard non-real-time approaches, or
	whether the more difficult and expensive real-time approaches
	are required.
	In other words, theory is quite important; however, for those
	of us who like to get things done, theory supports practice,
	never the other way around.
} \QuickQuizEnd

Real-time computing is used in industrial-control applications, ranging from
manufacturing to avionics;
scientific applications, perhaps most spectacularly in the adaptive
optics used by
large Earth-bound telescopes to de-twinkle starlight;
military applications, including the aforementioned avionics;
and financial-services applications, where the first computer to recognize
an opportunity is likely to reap most of the resulting profit.
These four areas could be characterized as ``in search of production'',
``in search of life'', ``in search of death'', and ``in search of money''.

Financial-services applications differ subtly from applications in
the other three categories in that money is non-material, meaning that
non-computational latencies are quite small.
In contrast, mechanical delays inherent in the other three categories
provide a very real point of diminishing returns beyond which further
reductions in the application's real-time response provide little or
no benefit.
This means that financial-services applications, along with other
real-time information-processing applications, face an arms race,
where the application with the lowest latencies normally wins.
Although the resulting latency requirements can still be specified
as described in
Section~\ref{sec:rt:Real-World Real-Time Specifications},
the unusual nature of these requirements has led some to refer to
financial and information-processing applications as ``low latency''
rather than ``real time''.

Regardless of exactly what we choose to call it, there is substantial
need for real-time
computing~\cite{JeremyWPeters2006NYTDec11,BillInmon2007a}.
\section{Who Needs Parallel Real-Time Computing?}
\label{sec:rt:Who Needs Parallel Real-Time Computing?}

It is less clear who really needs parallel real-time computing, but
the advent of low-cost multicore systems has brought it to the fore
regardless.
Unfortunately, the traditional mathematical basis for real-time
computing assumes single-CPU systems, with a few exceptions that
prove the rule~\cite{BjoernBrandenburgPhD}.
That said, there are a couple of ways of squaring modern computing
hardware to fit the real-time mathematical circle, and a few Linux-kernel
hackers have been encouraging academics to make this
transition~\cite{ThomasGleixner2010AcademiaVsReality}.

\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{rt/rt-reflexes}}
\caption{Real-Time Reflexes}
\label{fig:rt:Real-Time Reflexes}
\end{figure}

One approach is to recognize the fact that many real-time systems
reflect biological nervous systems, with responses ranging from
real-time reflexes to non-real-time strategizing and planning,
as depicted in
Figure~\ref{fig:rt:Real-Time Reflexes}.
The hard real-time reflexes, which read from sensors and control
actuators, run real-time on a single CPU, while the non-real-time
strategy and planning portion of the application runs on the multiple
remaining CPUs.
Strategy and planning activities might include statistical analysis,
periodic calibration, user interface, supply-chain activities, and
preparation.
For an example of high-compute-load preparation activities, think back
to the veneer-peeling application discussed in
Section~\ref{sec:rt:Real-World Real-Time Specifications}.
While one CPU is attending to the high-speed real-time computations
required to peel one log, the other CPUs might be analyzing the size
and shape of the next log in order to determine how to position the
next log so as to obtain the greatest possible quantity of high-quality
veneer.
It turns out that many applications have non-real-time and real-time
components~\cite{RobertBerry2008IBMSysJ}, so this approach can
often be used to allow traditional real-time analysis to be combined
with modern multicore hardware.

Another approach is simply to shut off all but one hardware thread so as
to return to the settled mathematics of uniprocessor real-time
computing.
However, this approach gives up potential cost and energy-efficiency
advantages.
That said, obtaining these advantages requires overcoming the parallel
performance obstacles covered in
Chapter~\ref{chp:Hardware and its Habits},
and not merely on average, but instead in the worst case.

Implementing parallel real-time systems can therefore be quite a
challenge.
Ways of meeting this challenge are outlined in the following section.
\section{Implementing Parallel Real-Time Systems}
\label{sec:rt:Implementing Parallel Real-Time Systems}

We will look at two major styles of real-time systems, event-driven and
polling.
An event-driven real-time system remains idle much of the time, responding
in real time to events passed up through the operating system to the
application.
Alternatively, the system could be running a background non-real-time
workload instead of remaining mostly idle.
A polling real-time system features a real-time thread that is CPU bound,
running in a tight loop that polls inputs and updates outputs on each
pass through the loop.
This tight polling loop often executes entirely in user mode, reading from
and writing to hardware registers that have been mapped into the user-mode
application's address space.
Alternatively, some applications place the polling loop into the kernel,
for example, via use of loadable kernel modules.

\begin{figure}[tb]
\centering
\resizebox{3in}{!}{\includegraphics{rt/rt-regimes}}
\caption{Real-Time Response Regimes}
\label{fig:rt:Real-Time Response Regimes}
\end{figure}

Regardless of the style chosen, the approach used to implement a real-time
system will depend on the deadlines, for example, as shown in
Figure~\ref{fig:rt:Real-Time Response Regimes}.
Starting from the top of this figure, if you can live with response times in
excess of one second, you might well be able to use scripting languages
to implement your real-time application---and scripting languages are
in fact used surprisingly often, not that I necessarily recommend this
practice.
If the required latencies exceed several tens of milliseconds,
old 2.4 versions of the Linux kernel can be used, not that I necessarily
recommend this practice, either.
Special real-time Java implementations can provide real-time response
latencies of a few milliseconds, even when the garbage collector is
used.
The Linux 2.6.x and 3.x kernels can provide real-time latencies of
a few hundred microseconds if carefully configured, tuned, and run
on real-time friendly hardware.
Special real-time Java implementations can provide real-time latencies
below 100 microseconds if use of the garbage collector is carefully avoided.
(But note that avoiding the garbage collector means also avoiding
Java's large standard libraries, thus also avoiding Java's productivity
advantages.)
A Linux kernel incorporating the -rt patchset can provide latencies
below 20 microseconds, and specialty real-time operating systems (RTOSes)
running without memory translation can provide sub-ten-microsecond
latencies.
Achieving sub-microsecond latencies typically requires hand-coded assembly
or even special-purpose hardware.

Of course, careful configuration and tuning are required all the way down
the stack.
In particular, if the hardware or firmware fails to provide real-time
latencies, there is nothing that the software can do to make up the
lost time.
And high-performance hardware sometimes sacrifices worst-case behavior
| to obtain greater throughput. |
| In fact, timings from tight loops run with interrupts disabled can |
| provide the basis for a high-quality random-number |
| generator~\cite{PeterOkech2009InherentRandomness}. |
| Furthermore, some firmware does cycle-stealing to carry out various |
| housekeeping tasks, in some cases attempting to cover its tracks by |
| reprogramming the victim CPU's hardware clocks. |
| Of course, cycle stealing is expected behavior in virtualized |
environments, but people are nevertheless working towards real-time
| response in virtualized |
| environments~\cite{ThomasGleixner2012KVMrealtime,JanKiszka2014virtRT}. |
| It is therefore critically important to evaluate your hardware's and |
| firmware's real-time capabilities. |
| There are organizations who carry out such evaluations, including |
| the Open Source Automation Development Lab (OSADL). |
| |
| But given competent real-time hardware and firmware, the next |
| layer up the stack is the operating system, which is covered in |
| the next section. |
| |
| \subsection{Implementing Parallel Real-Time Operating Systems} |
| \label{sec:rt:Implementing Parallel Real-Time Operating Systems} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.2in}{!}{\includegraphics{rt/Linux-on-RTOS}} |
| \caption{Linux Ported to RTOS} |
| \label{fig:rt:Linux Ported to RTOS} |
| \end{figure} |
| |
| There are a number of strategies that may be used to implement a |
| real-time system. |
| One approach is to port a general-purpose non-real-time OS on top |
| of a special purpose real-time operating system (RTOS), as shown in |
| Figure~\ref{fig:rt:Linux Ported to RTOS}. |
| The green ``Linux Process'' boxes represent non-real-time processes |
| running on the Linux kernel, while the yellow ``RTOS Process'' |
| boxes represent real-time processes running on the RTOS. |
| |
| This was a very popular approach before the Linux kernel gained |
| real-time capabilities, and is still in use |
| today~\cite{Xenomai2014,VictorYodaiken2004a}. |
| However, this approach requires that the application be split into |
| one portion that runs on the RTOS and another that runs on Linux. |
| Although it is possible to make the two environments look similar, |
| for example, by forwarding POSIX system calls from the RTOS to a |
| utility thread running on Linux, there are invariably rough edges. |
| |
| In addition, the RTOS must interface to both the hardware and to |
| the Linux kernel, thus requiring significant maintenance with |
| changes in both hardware and kernel. |
| Furthermore, each such RTOS often has its own system-call interface |
| and set of system libraries, which can balkanize both ecosystems and |
| developers. |
| In fact, these problems seem to be what drove the combination of |
| RTOSes with Linux, as this approach allowed access to the full real-time |
| capabilities of the RTOS, while allowing the application's non-real-time |
| code full access to Linux's rich and vibrant open-source ecosystem. |
| |
| \begin{figure*}[p] |
| \centering |
| \resizebox{4.4in}{!}{\includegraphics{rt/preemption}} |
| \caption{Linux-Kernel Real-Time Implementations} |
| \label{fig:rt:Linux-Kernel Real-Time Implementations} |
| \end{figure*} |
| |
| Although pairing RTOSes with the Linux kernel was a clever and useful |
| short-term response during the time that the Linux kernel had minimal |
| real-time capabilities, it also motivated adding real-time capabilities |
| to the Linux kernel. |
| Progress towards this goal is shown in |
| Figure~\ref{fig:rt:Linux-Kernel Real-Time Implementations}. |
| The upper row shows a diagram of the Linux kernel with preemption disabled, |
| thus having essentially no real-time capabilities. |
| The middle row shows a set of diagrams showing the increasing real-time |
| capabilities of the mainline Linux kernel with preemption enabled. |
| Finally, the bottom row shows a diagram of the Linux kernel with the |
| -rt patchset applied, maximizing real-time capabilities. |
| Functionality from the -rt patchset is added to mainline, |
| hence the increasing capabilities of the mainline Linux kernel over time. |
Nevertheless, the most demanding real-time applications continue to use
| the -rt patchset. |
| |
| The non-preemptible kernel shown at the top of |
| Figure~\ref{fig:rt:Linux-Kernel Real-Time Implementations} |
| is built with \co{CONFIG_PREEMPT=n}, so that execution within the Linux |
| kernel cannot be preempted. |
| This means that the kernel's real-time response latency is bounded below |
| by the longest code path in the Linux kernel, which is indeed long. |
| However, user-mode execution is preemptible, so that one of the |
| real-time Linux processes shown in the upper right may preempt any of the |
| non-real-time Linux processes shown in the upper left anytime the |
| non-real-time process is executing in user mode. |
| |
| The preemptible kernels shown in the middle row of |
| Figure~\ref{fig:rt:Linux-Kernel Real-Time Implementations} |
| are built with \co{CONFIG_PREEMPT=y}, so that most process-level code |
| within the Linux kernel can be preempted. |
| This of course greatly improves real-time response latency, but |
| preemption is still disabled |
| within RCU read-side critical sections, |
| spinlock critical sections, |
| interrupt handlers, |
| interrupt-disabled code regions, and |
| preempt-disabled code regions, as indicated by the red boxes in the |
| left-most diagram in the middle row of the figure. |
| The advent of preemptible RCU allowed RCU read-side critical sections |
| to be preempted, as shown in the central diagram, |
| and the advent of threaded interrupt handlers allowed device-interrupt |
| handlers to be preempted, as shown in the right-most diagram. |
| Of course, a great deal of other real-time functionality was added |
during this time; however, it cannot be as easily represented on this
| diagram. |
| It will instead be discussed in |
| Section~\ref{sec:rt:Event-Driven Real-Time Support}. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.5in}{!}{\includegraphics{rt/nohzfull}} |
| \caption{CPU Isolation} |
| \label{fig:rt:CPU Isolation} |
| \end{figure} |
| |
| A final approach is simply to get everything out of the way of the |
| real-time process, clearing all other processing off of any CPUs that |
| this process needs. |
| This was implemented in the 3.10 Linux kernel via the \co{CONFIG_NO_HZ_FULL} |
| Kconfig parameter~\cite{FredericWeisbecker2013nohz}. |
| % Need to fweisbec citation's URL. @@@ |
| It is important to note that this approach requires at least one |
| \emph{housekeeping CPU} to do background processing, for example running |
| kernel daemons. |
| However, when there is only one runnable task on a given non-housekeeping CPU, |
| scheduling-clock interrupts are shut off on that CPU, removing an important |
| source of interference and \emph{OS jitter}.\footnote{ |
| A once-per-second residual scheduling-clock interrupt remains |
| due to process-accounting concerns. |
| Future work includes addressing these concerns and eliminating |
| this residual interrupt.} |
| With a few exceptions, the kernel does not force other processing off of the |
| non-housekeeping CPUs, but instead simply provides better performance |
| when only one runnable task is present on a given CPU. |
If configured properly (a non-trivial undertaking), \co{CONFIG_NO_HZ_FULL}
offers real-time threads levels of performance nearly rivaling that of
bare-metal systems.
| |
| There has of course been much debate over which of these approaches |
| is best for real-time systems, and this debate has been going on for |
| quite some |
| time~\cite{JonCorbet2004RealTimeLinuxPart1,JonCorbet2004RealTimeLinuxPart2}. |
| As usual, the answer seems to be ``It depends,'' as discussed in the |
| following sections. |
| Section~\ref{sec:rt:Event-Driven Real-Time Support} |
| considers event-driven real-time systems, and |
| Section~\ref{sec:rt:Polling-Loop Real-Time Support} |
| considers real-time systems that use a CPU-bound polling loop. |
| |
| \subsubsection{Event-Driven Real-Time Support} |
| \label{sec:rt:Event-Driven Real-Time Support} |
| |
| The operating-system support required for event-driven real-time |
applications is quite extensive; however, this section will focus
| on only a few items, namely |
| timers, |
| threaded interrupts, |
| priority inheritance, |
| preemptible RCU, |
| and |
| preemptible spinlocks. |
| |
| \paragraph{Timers} are clearly critically important for real-time |
| operations. |
| After all, if you cannot specify that something be done at a specific |
| time, how are you going to respond by that time? |
| Even in non-real-time systems, large numbers of timers are generated, |
| so they must be handled extremely efficiently. |
| Example uses include retransmit timers for TCP connections (which are |
| almost always cancelled before they have a chance to fire),\footnote{ |
| At least assuming reasonably low packet-loss rates!} |
| timed delays (as in \co{sleep(1)}, which are rarely cancelled), |
| and timeouts for the \co{poll()} system call (which are often |
| cancelled before they have a chance to fire). |
| A good data structure for such timers would therefore be a priority queue |
| whose addition and deletion primitives were fast and $O(1)$ in the number |
| of timers posted. |
| |
| The classic data structure for this purpose is the \emph{calendar queue}, |
| which in the Linux kernel is called the \co{timer wheel}. |
| This age-old data structure is also heavily used in discrete-event |
| simulation. |
| The idea is that time is quantized, for example, in the Linux kernel, |
| the duration of the time quantum is the period of the scheduling-clock |
| interrupt. |
| A given time can be represented by an integer, and any attempt to post |
| a timer at some non-integral time will be rounded to a convenient nearby |
| integral time quantum. |
| |
| One straightforward implementation would be to allocate a single array, |
| indexed by the low-order bits of the time. |
| This works in theory, but in practice systems create large numbers of |
| long-duration timeouts (for example, 45-minute keepalive timeouts for TCP |
| sessions) that are almost always cancelled. |
| %@@@ verify duration of keepalive timers. |
| These long-duration timeouts cause problems for small arrays because |
| much time is wasted skipping timeouts that have not yet expired. |
| On the other hand, an array that is large enough to gracefully accommodate |
| a large number of long-duration timeouts would consume too much memory, |
| especially given that performance and scalability concerns require one |
| such array for each and every CPU. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{2.0in}{!}{\includegraphics{rt/timerwheel}} |
| \caption{Timer Wheel} |
| \label{fig:rt:Timer Wheel} |
| \end{figure} |
| |
| A common approach for resolving this conflict is to provide multiple |
| arrays in a hierarchy. |
| At the lowest level of this hierarchy, each array element represents |
| one unit of time. |
| At the second level, each array element represents $N$ units of time, |
| where $N$ is the number of elements in each array. |
| At the third level, each array element represents $N^2$ units of time, |
| and so on up the hierarchy. |
| This approach allows the individual arrays to be indexed by different |
| bits, as illustrated by |
| Figure~\ref{fig:rt:Timer Wheel} |
| for an unrealistically small eight-bit clock. |
| Here, each array has 16 elements, so the low-order four bits of the time |
| (currently \co{0xf}) index the low-order (rightmost) array, and the |
| next four bits (currently \co{0x1}) index the next level up. |
| Thus, we have two arrays each with 16 elements, for a total of 32 elements, |
| which, taken together, is much smaller than the 256-element array that |
| would be required for a single array. |
| |
| This approach works extremely well for throughput-based systems. |
| Each timer operation is $O(1)$ with small constant, and each timer |
| element is touched at most $m+1$ times, where $m$ is the number of |
| levels. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3.0in}{!}{\includegraphics{cartoons/1kHz}} |
| \caption{Timer Wheel at 1kHz} |
| \ContributedBy{Figure}{fig:rt:Timer Wheel at 1kHz}{Melissa Broussard} |
| \end{figure} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3.0in}{!}{\includegraphics{cartoons/100kHz}} |
| \caption{Timer Wheel at 100kHz} |
| \ContributedBy{Figure}{fig:rt:Timer Wheel at 100kHz}{Melissa Broussard} |
| \end{figure} |
| |
| Unfortunately, timer wheels do not work well for real-time systems, and for |
| two reasons. |
| The first reason is that there is a harsh tradeoff between timer |
| accuracy and timer overhead, which is fancifully illustrated by |
| Figures~\ref{fig:rt:Timer Wheel at 1kHz} |
| and~\ref{fig:rt:Timer Wheel at 100kHz}. |
| In |
| Figure~\ref{fig:rt:Timer Wheel at 1kHz}, |
| timer processing happens only once per millisecond, which keeps overhead |
| acceptably low for many (but not all!) workloads, but which also means |
| that timeouts cannot be set for finer than one-millisecond granularities. |
| On the other hand, |
| Figure~\ref{fig:rt:Timer Wheel at 100kHz} |
| shows timer processing taking place every ten microseconds, which |
| provides acceptably fine timer granularity for most (but not all!) |
| workloads, but which processes timers so frequently that the system |
| might well not have time to do anything else. |
| |
| The second reason is the need to cascade timers from higher levels to |
| lower levels. |
| Referring back to |
| Figure~\ref{fig:rt:Timer Wheel}, |
| we can see that any timers enqueued on element \co{1x} in the upper |
| (leftmost) array must be cascaded down to the lower (rightmost) |
array so that they may be invoked when their time arrives.
| Unfortunately, there could be a large number of timeouts |
| waiting to be cascaded, especially for timer wheels with larger numbers |
| of levels. |
| The power of statistics causes this cascading to be a non-problem for |
| throughput-oriented systems, but cascading can result in problematic |
| degradations of latency in real-time systems. |
| |
| Of course, real-time systems could simply choose a different data |
| structure, for example, some form of heap or tree, giving up |
| $O(1)$ bounds on insertion and deletion operations to gain $O(\log n)$ |
| limits on data-structure-maintenance operations. |
| This can be a good choice for special-purpose RTOSes, but is inefficient |
| for general-purpose systems such as Linux, which routinely support |
| extremely large numbers of timers. |
| |
| The solution chosen for the Linux kernel's -rt patchset is to differentiate |
| between timers that schedule later activity and timeouts that schedule |
| error handling for low-probability errors such as TCP packet losses. |
| One key observation is that error handling is normally not particularly |
| time-critical, so that a timer wheel's millisecond-level granularity |
| is good and sufficient. |
| Another key observation is that error-handling timeouts are normally |
| cancelled very early, often before they can be cascaded. |
| A final observation is that systems commonly have many more error-handling |
| timeouts than they do timer events, so that an $O(\log n)$ |
| data structure should provide acceptable performance for timer events. |
| |
| In short, the Linux kernel's -rt patchset uses timer wheels for |
| error-handling timeouts and a tree for timer events, providing each |
| category the required quality of service. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3.0in}{!}{\includegraphics{rt/irq}} |
| \caption{Non-Threaded Interrupt Handler} |
| \label{fig:rt:Non-Threaded Interrupt Handler} |
| \end{figure} |
| |
| \paragraph{Threaded interrupts} |
| are used to address a significant source of degraded real-time latencies, |
| namely long-running interrupt handlers, |
| as shown in |
| Figure~\ref{fig:rt:Non-Threaded Interrupt Handler}. |
| These latencies can be especially problematic for devices that can |
| deliver a large number of events with a single interrupt, which means |
| that the interrupt handler will run for an extended period of time |
| processing all of these events. |
| Worse yet are devices that can deliver new events to a still-running |
| interrupt handler, as such an interrupt handler might well run |
| indefinitely, thus indefinitely degrading real-time latencies. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3.0in}{!}{\includegraphics{rt/threaded-irq}} |
| \caption{Threaded Interrupt Handler} |
| \label{fig:rt:Threaded Interrupt Handler} |
| \end{figure} |
| |
| One way of addressing this problem is the use of threaded interrupts shown in |
| Figure~\ref{fig:rt:Threaded Interrupt Handler}. |
| Interrupt handlers run in the context of a preemptible IRQ thread, |
| which runs at a configurable priority. |
| The device interrupt handler then runs for only a short time, just |
| long enough to make the IRQ thread aware of the new event. |
| As shown in the figure, threaded interrupts can greatly improve |
| real-time latencies, in part because interrupt handlers running in |
| the context of the IRQ thread may be preempted by high-priority real-time |
| threads. |
| |
| However, there is no such thing as a free lunch, and there are downsides |
| to threaded interrupts. |
| One downside is increased interrupt latency. |
| Instead of immediately running the interrupt handler, the handler's execution |
| is deferred until the IRQ thread gets around to running it. |
| Of course, this is not a problem unless the device generating the interrupt |
| is on the real-time application's critical path. |
| |
| Another downside is that poorly written high-priority real-time code |
| might starve the interrupt handler, for example, preventing networking |
| code from running, in turn making it very difficult to debug the problem. |
| Developers must therefore take great care when writing high-priority |
| real-time code. |
| This has been dubbed the \emph{Spiderman principle}: With great power |
| comes great responsibility. |
| |
| \paragraph{Priority inheritance} is used to handle priority inversion, |
| which can be caused by, among other things, locks acquired by |
| preemptible interrupt handlers~\cite{LuiSha1990PriorityInheritance}. |
| Suppose that a low-priority thread holds a lock, but is preempted by |
| a group of medium-priority threads, at least one such thread per CPU. |
| If an interrupt occurs, a high-priority IRQ thread will preempt one |
| of the medium-priority threads, but only until it decides to acquire |
| the lock held by the low-priority thread. |
| Unfortunately, the low-priority thread cannot release the lock until |
| it starts running, which the medium-priority threads prevent it from |
| doing. |
| So the high-priority IRQ thread cannot acquire the lock until after one |
| of the medium-priority threads releases its CPU. |
| In short, the medium-priority threads are indirectly blocking the |
| high-priority IRQ threads, a classic case of priority inversion. |
| |
| Note that this priority inversion could not happen with non-threaded |
| interrupts because the low-priority thread would have to disable interrupts |
| while holding the lock, which would prevent the medium-priority |
| threads from preempting it. |
| |
| In the priority-inheritance solution, the high-priority thread attempting |
to acquire the lock donates its priority to the low-priority thread holding
| the lock until such time as the lock is released, thus preventing long-term |
| priority inversion. |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3.4in}{!}{\includegraphics{cartoons/Priority_Boost_2}} |
| \caption{Priority Inversion and User Input} |
| \ContributedBy{Figure}{fig:rt:Priority Inversion and User Input}{Melissa Broussard} |
| \end{figure} |
| |
| Of course, priority inheritance does have its limitations. |
| For example, if you can design your application to avoid priority |
| inversion entirely, you will likely obtain somewhat better |
| latencies~\cite{VictorYodaiken2004a}. |
| This should be no surprise, given that priority inheritance adds |
| a pair of context switches to the worst-case latency. |
| That said, priority inheritance can convert indefinite postponement |
| into a limited increase in latency, and the software-engineering |
| benefits of priority inheritance may outweigh its latency costs in |
| many applications. |
| |
| Another limitation is that it addresses only lock-based priority |
| inversions within the context of a given operating system. |
| One priority-inversion scenario that it cannot address is a high-priority |
| thread waiting on a network socket for a message that is to be written |
| by a low-priority process that is preempted by a set of CPU-bound |
| medium-priority processes. |
| In addition, a potential disadvantage of applying priority inheritance |
| to user input is fancifully depicted in |
| Figure~\ref{fig:rt:Priority Inversion and User Input}. |
| |
| A final limitation involves reader-writer locking. |
| Suppose that we have a very large number of low-priority threads, perhaps |
| even thousands of them, each |
| of which read-holds a particular reader-writer lock. |
| Suppose that all of these threads are preempted by a set of medium-priority |
| threads, with at least one medium-priority thread per CPU. |
| Finally, suppose that a high-priority thread awakens and attempts to |
| write-acquire this same reader-writer lock. |
| No matter how vigorously we boost the priority of the threads read-holding |
| this lock, it could well be a good long time before the high-priority |
| thread can complete its write-acquisition. |
| |
| There are a number of possible solutions to this reader-writer lock |
| priority-inversion conundrum: |
| |
| \begin{enumerate} |
| \item Only allow one read-acquisition of a given reader-writer lock |
| at a time. (This is the approach traditionally taken by |
| the Linux kernel's -rt patchset.) |
| \item Only allow $N$ read-acquisitions of a given reader-writer lock |
| at a time, where $N$ is the number of CPUs. |
| \item Only allow $N$ read-acquisitions of a given reader-writer lock |
| at a time, where $N$ is a number specified somehow by the |
| developer. |
| There is a good chance that the Linux kernel's -rt patchset |
| will someday take this approach. |
| %@@@ Check with Steven and Clark on this |
| \item Prohibit high-priority threads from write-acquiring reader-writer |
| locks that are ever read-acquired by threads running at lower |
| priorities. |
| (This is a variant of the \emph{priority ceiling} |
| protocol~\cite{LuiSha1990PriorityInheritance}.) |
| \end{enumerate} |
| |
| \QuickQuiz{} |
| But if you only allow one reader at a time to read-acquire |
| a reader-writer lock, isn't that the same as an exclusive |
| lock??? |
| \QuickQuizAnswer{ |
| Indeed it is, other than the API. |
| And the API is important because it allows the Linux kernel |
| to offer real-time capabilities without having the -rt patchset |
| grow to ridiculous sizes. |
| |
| However, this approach clearly and severely limits read-side |
| scalability. |
| The Linux kernel's -rt patchset has been able to live with this |
| limitation for several reasons: (1)~Real-time systems have |
| traditionally been relatively small, (2)~Real-time systems |
| have generally focused on process control, thus being unaffected |
| by scalability limitations in the I/O subsystems, and |
| (3)~Many of the Linux kernel's reader-writer locks have been |
| converted to RCU. |
| |
| All that aside, it is quite possible that the Linux kernel |
| will some day permit limited read-side parallelism for |
| reader-writer locks subject to priority boosting. |
| } \QuickQuizEnd |
| |
| In some cases, reader-writer lock priority inversion can be avoided by |
| converting the reader-writer lock to RCU, as briefly discussed in the |
| next section. |
| |
| \paragraph{Preemptible RCU} |
| can sometimes be used as a replacement for reader-writer |
| locking~\cite{PaulEMcKenney2007WhatIsRCUFundamentally,PaulMcKenney2012RCUUsage,PaulEMcKenney2014RCUAPI}, |
| as was discussed in Section~\ref{sec:defer:Read-Copy Update (RCU)}. |
| Where it can be used, it permits readers and updaters to run concurrently, |
| which prevents low-priority readers from inflicting any sort of |
| priority-inversion scenario on high-priority updaters. |
| However, for this to be useful, it is necessary to be able to preempt |
| long-running RCU read-side critical |
| sections~\cite{DinakarGuniguntala2008IBMSysJ}. |
| Otherwise, long RCU read-side critical sections would result in |
| excessive real-time latencies. |
| |
| { \scriptsize |
| \begin{verbbox} |
| 1 void __rcu_read_lock(void) |
| 2 { |
| 3 current->rcu_read_lock_nesting++; |
| 4 barrier(); |
| 5 } |
| 6 |
| 7 void __rcu_read_unlock(void) |
| 8 { |
| 9 struct task_struct *t = current; |
| 10 |
| 11 if (t->rcu_read_lock_nesting != 1) { |
| 12 --t->rcu_read_lock_nesting; |
| 13 } else { |
| 14 barrier(); |
| 15 t->rcu_read_lock_nesting = INT_MIN; |
| 16 barrier(); |
| 17 if (ACCESS_ONCE(t->rcu_read_unlock_special.s)) |
| 18 rcu_read_unlock_special(t); |
| 19 barrier(); |
| 20 t->rcu_read_lock_nesting = 0; |
| 21 } |
| 22 } |
| \end{verbbox} |
| } |
| \begin{figure}[tb] |
| \centering |
| \theverbbox |
| \caption{Preemptible Linux-Kernel RCU} |
| \label{fig:rt:Preemptible Linux-Kernel RCU} |
| \end{figure} |
| |
| A preemptible RCU implementation was therefore added to the Linux kernel. |
| This implementation avoids the need to individually track the state of |
| each and every task in the kernel by keeping lists of tasks that have |
| been preempted within their current RCU read-side critical sections. |
| A grace period is permitted to end: (1)~Once all CPUs have completed any |
| RCU read-side critical sections that were in effect before the start |
| of the current grace period and |
| (2)~Once all tasks that were preempted |
| while in one of those pre-existing critical sections have removed |
| themselves from their lists. |
| A simplified version of this implementation is shown in |
| Figure~\ref{fig:rt:Preemptible Linux-Kernel RCU}. |
| The \co{__rcu_read_lock()} function spans lines~1-5 and |
| the \co{__rcu_read_unlock()} function spans lines~7-22. |
| |
| Line~3 of \co{__rcu_read_lock()} increments a per-task count of the |
| number of nested \co{rcu_read_lock()} calls, and |
| line~4 prevents the compiler from reordering the subsequent code in the |
| RCU read-side critical section to precede the \co{rcu_read_lock()}. |
| |
| Line~11 of \co{__rcu_read_unlock()} checks to see if the nesting level count |
| is one, in other words, if this corresponds to the outermost |
| \co{rcu_read_unlock()} of a nested set. |
| If not, line~12 decrements this count, and control returns to the caller. |
| Otherwise, this is the outermost \co{rcu_read_unlock()}, which requires |
| the end-of-critical-section handling carried out by lines~14-20. |
| |
| Line~14 prevents the compiler from reordering the code in the critical |
| section with the code comprising the \co{rcu_read_unlock()}. |
| Line~15 sets the nesting count to a large negative number in order to prevent |
| destructive races with RCU read-side critical sections contained within |
| interrupt handlers~\cite{PaulEMcKenney2011RCU3.0trainwreck}, |
| and line~16 prevents the compiler from reordering this assignment with |
| line~17's check for special handling. |
| If line~17 determines that special handling is required, line~18 |
| invokes \co{rcu_read_unlock_special()} to carry out that special handling. |
| |
| There are several types of special handling that can be required, but |
| we will focus on that required when the RCU read-side critical section |
| has been preempted. |
| In this case, the task must remove itself from the list that it was |
| added to when it was first preempted within its |
| RCU read-side critical section. |
| However, it is important to note that these lists are protected by locks, |
| which means that \co{rcu_read_unlock()} is no longer lockless. |
That said, the highest-priority threads will not be preempted, and therefore,
| for those highest-priority threads, \co{rcu_read_unlock()} will never |
| attempt to acquire any locks. |
| In addition, if implemented carefully, locking can be used to synchronize |
| real-time software~\cite{BjoernBrandenburgPhD}. |
| |
| Whether or not special handling is required, line~19 prevents the compiler |
| from reordering the check on line~17 with the zeroing of the nesting |
| count on line~20. |
| |
| \QuickQuiz{} |
| Suppose that preemption occurs just after the load from |
| \co{t->rcu_read_unlock_special.s} on line~17 of |
| Figure~\ref{fig:rt:Preemptible Linux-Kernel RCU}. |
| Mightn't that result in the task failing to invoke |
| \co{rcu_read_unlock_special()}, thus failing to remove itself |
| from the list of tasks blocking the current grace period, |
| in turn causing that grace period to extend indefinitely? |
| \QuickQuizAnswer{ |
| That is a real problem, and it is solved in RCU's scheduler hook. |
| If that scheduler hook sees that the value of |
| \co{t->rcu_read_lock_nesting} is negative, it invokes |
| \co{rcu_read_unlock_special()} if needed before allowing |
| the context switch to complete. |
| } \QuickQuizEnd |
| |
| This preemptible RCU implementation enables real-time response for |
| read-mostly data structures without the delays inherent to priority |
| boosting of large numbers of readers. |
| |
| \paragraph{Preemptible spinlocks} |
| are an important part of the -rt patchset due to the long-duration |
| spinlock-based critical sections in the Linux kernel. |
| This functionality has not yet reached mainline: Although preemptible
| spinlocks are conceptually a simple substitution of sleeplocks for
| spinlocks, they have proven relatively controversial.\footnote{
| In addition, development of the -rt patchset has slowed in recent |
| years, perhaps because the real-time functionality that is already |
| in the mainline Linux kernel suffices for a great many use |
| cases~\cite{JakeEdge2013Future-rtLinux,JakeEdge2014Future-rtLinux}. |
| However, OSADL (\url{http://osadl.org/}) is working to raise funds |
| to move the remaining code from the -rt patchset to mainline.} |
| However, they are quite necessary to the task of achieving real-time |
| latencies down in the tens of microseconds. |
| |
| There are of course any number of other Linux-kernel components that are
| critically important to achieving world-class real-time latencies,
| most recently deadline scheduling.
| However, those listed in this section give a good feeling for the workings
| of the Linux kernel augmented by the -rt patchset.
| |
| \subsubsection{Polling-Loop Real-Time Support} |
| \label{sec:rt:Polling-Loop Real-Time Support} |
| |
| At first glance, use of a polling loop might seem to avoid all possible |
| operating-system interference problems. |
| After all, if a given CPU never enters the kernel, the kernel is |
| completely out of the picture. |
| And the traditional approach to keeping the kernel out of the way is |
| simply not to have a kernel, and many real-time applications do |
| indeed run on bare metal, particularly those running on eight-bit |
| microcontrollers. |
| |
| One might hope to get bare-metal performance on a modern operating-system |
| kernel simply by running a single CPU-bound user-mode thread on a |
| given CPU, avoiding all causes of interference. |
| Although the reality is of course more complex, it is becoming |
| possible to do just that, |
| courtesy of the \co{NO_HZ_FULL} implementation led by |
| Frederic Weisbecker~\cite{JonCorbet2013NO-HZ-FULL} that has been |
| accepted into version 3.10 of the Linux kernel. |
| Nevertheless, considerable care is required to properly set up such |
| an environment, as it is necessary to control a number of possible |
| sources of OS jitter. |
| The discussion below covers the control of several sources of OS |
| jitter, including device interrupts, kernel threads and daemons, |
| scheduler real-time throttling (this is a feature, not a bug!), |
| timers, non-real-time device drivers, in-kernel global synchronization, |
| scheduling-clock interrupts, page faults, and finally, non-real-time |
| hardware and firmware. |
| |
| Interrupts are an excellent source of large amounts of OS jitter. |
| Unfortunately, in most cases interrupts are absolutely required in order |
| for the system to communicate with the outside world. |
| One way of resolving this conflict between OS jitter and maintaining |
| contact with the outside world is to reserve a small number of |
| housekeeping CPUs, and to force all interrupts to these CPUs. |
| The \path{Documentation/IRQ-affinity.txt} file in the Linux source tree |
| describes how to direct device interrupts to a specified CPU,
| which as of early 2015 involves something like the following: |
| |
| \begin{quote} |
| \scriptsize |
| \co{echo 0f > /proc/irq/44/smp_affinity} |
| \end{quote} |
| |
| This command would confine interrupt \#44 to CPUs~0-3. |
| Note that scheduling-clock interrupts require special handling, and are |
| discussed later in this section. |
| |
| A second source of OS jitter is due to kernel threads and daemons. |
| Individual kernel threads, such as RCU's grace-period kthreads |
| (\co{rcu_bh}, \co{rcu_preempt}, and \co{rcu_sched}), may be forced |
| onto any desired CPUs using the \co{taskset} command, the |
| \co{sched_setaffinity()} system call, or \co{cgroups}. |
| |
| Per-CPU kthreads are often more challenging, sometimes constraining |
| hardware configuration and workload layout. |
| Preventing OS jitter from these kthreads requires either that certain |
| types of hardware |
| not be attached to real-time systems, that all interrupts and I/O |
| initiation take place on housekeeping CPUs, that special kernel |
| Kconfig or boot parameters be selected in order to direct work away from |
| the worker CPUs, or that worker CPUs never enter the kernel. |
| Specific per-kthread advice may be found in the Linux kernel source |
| \path{Documentation} directory at \path{kernel-per-CPU-kthreads.txt}. |
| |
| A third source of OS jitter in the Linux kernel for CPU-bound threads |
| running at real-time priority is the scheduler itself. |
| This is an intentional debugging feature, designed to ensure that |
| important non-real-time work is allotted at least 50 milliseconds
| out of each second, even if there is an infinite-loop bug in |
| your real-time application. |
| However, when you are running a polling-loop-style real-time application, |
| you will need to disable this debugging feature. |
| This can be done as follows: |
| |
| \begin{quote}
| \scriptsize
| \co{echo -1 > /proc/sys/kernel/sched_rt_runtime_us}
| \end{quote}
| |
| You will of course need to be running as root to execute this command, |
| and you will also need to carefully consider the Spiderman principle. |
| One way to minimize the risks is to offload interrupts and |
| kernel threads/daemons from all CPUs running CPU-bound real-time |
| threads, as described in the paragraphs above. |
| In addition, you should carefully read the material in the |
| \path{Documentation/scheduler} directory. |
| The material in the \path{sched-rt-group.txt} file is particularly |
| important, especially if you are using the \co{cgroups} real-time features |
| enabled by the \co{CONFIG_RT_GROUP_SCHED} Kconfig parameter, in which |
| case you should also read the material in the |
| \path{Documentation/cgroups} directory. |
| |
| A fourth source of OS jitter comes from timers. |
| In most cases, keeping a given CPU out of the kernel will prevent |
| timers from being scheduled on that CPU. |
| One important exception is recurring timers, where a given timer
| handler posts a later occurrence of that same timer.
| If such a timer gets started on a given CPU for any reason, that |
| timer will continue to run periodically on that CPU, inflicting |
| OS jitter indefinitely. |
| One crude but effective way to offload recurring timers is to |
| use CPU hotplug to offline all worker CPUs that are to run CPU-bound |
| real-time application threads, online these same CPUs, then start |
| your real-time application. |
| |
| A fifth source of OS jitter is provided by device drivers that were |
| not intended for real-time use. |
| For an old canonical example, in 2005, the VGA driver would blank |
| the screen by zeroing the frame buffer with interrupts disabled, |
| which resulted in tens of milliseconds of OS jitter. |
| One way of avoiding device-driver-induced OS jitter is to carefully |
| select devices that have been used heavily in real-time systems, |
| and which have therefore had their real-time bugs fixed. |
| Another way is to confine the device's interrupts and all code using
| that device to designated housekeeping CPUs.
| A third way is to test the device's ability to support real-time |
| workloads and fix any real-time bugs.\footnote{ |
| If you take this approach, please submit your fixes upstream |
| so that others can benefit. |
| Keep in mind that when you need to port your application to |
| a later version of the Linux kernel, \emph{you} will be one of those |
| ``others''.} |
| |
| A sixth source of OS jitter is provided by some in-kernel |
| full-system synchronization algorithms, perhaps most notably |
| the global TLB-flush algorithm. |
| This can be avoided by avoiding memory-unmapping operations, and especially
| avoiding unmapping operations within the kernel.
| As of early 2015, the way to avoid in-kernel |
| unmapping operations is to avoid unloading kernel modules. |
| |
| A seventh source of OS jitter is provided by |
| scheduling-clock interrupts and RCU callback invocation.
| These may be avoided by building your kernel with the |
| \co{NO_HZ_FULL} Kconfig parameter enabled, and then booting |
| with the \co{nohz_full=} parameter specifying the list of |
| worker CPUs that are to run real-time threads. |
| For example, \co{nohz_full=2-7} would designate CPUs~2, 3, 4, 5, 6, and~7 |
| as worker CPUs, thus leaving CPUs~0 and~1 as housekeeping CPUs. |
| The worker CPUs would not incur scheduling-clock interrupts as long |
| as there is no more than one runnable task on each worker CPU, |
| and each worker CPU's RCU callbacks would be invoked on one of the |
| housekeeping CPUs. |
| A CPU that has suppressed scheduling-clock interrupts due to there |
| only being one runnable task on that CPU is said to be in |
| \emph{adaptive ticks mode}. |
| |
| As an alternative to the \co{nohz_full=} boot parameter, you can build |
| your kernel with \co{NO_HZ_FULL_ALL}, which will designate CPU~0 as
| a housekeeping CPU and all other CPUs as worker CPUs.
| Either way, it is important to ensure that you have designated enough |
| housekeeping CPUs to handle the housekeeping load imposed by the |
| rest of the system, which requires careful benchmarking and tuning. |
| |
| Of course, there is no free lunch, and \co{NO_HZ_FULL} is no exception. |
| As noted earlier, |
| \co{NO_HZ_FULL} makes kernel/user transitions more expensive due to the |
| need for delta process accounting and the need to inform kernel subsystems |
| (such as RCU) of the transitions. |
| It also prevents CPUs running processes with POSIX CPU timers enabled |
| from entering adaptive-ticks mode. |
| Additional limitations, tradeoffs, and configuration advice may be |
| found in \path{Documentation/timers/NO_HZ.txt}. |
| |
| An eighth source of OS jitter is page faults. |
| Because most Linux implementations use an MMU for memory protection, |
| real-time applications running on these systems can be subject |
| to page faults. |
| Use the \co{mlock()} and \co{mlockall()} system calls to pin your |
| application's pages into memory, thus avoiding major page faults. |
| Of course, the Spiderman principle applies, because locking down |
| too much memory may prevent the system from getting other work done. |
| |
| A ninth source of OS jitter is unfortunately the hardware and firmware. |
| It is therefore important to use systems that have been designed for |
| real-time use. |
| OSADL runs long-term tests of systems, so referring to their |
| website (\url{http://osadl.org/}) can be helpful. |
| |
| { \scriptsize |
| \begin{verbbox} |
| 1 cd /sys/kernel/debug/tracing |
| 2 echo 1 > max_graph_depth |
| 3 echo function_graph > current_tracer |
| 4 # run workload |
| 5 cat per_cpu/cpuN/trace |
| \end{verbbox} |
| } |
| \begin{figure}[tb] |
| \centering |
| \theverbbox |
| \caption{Locating Sources of OS Jitter} |
| \label{fig:rt:Locating Sources of OS Jitter} |
| \end{figure} |
| |
| Unfortunately, this list of OS-jitter sources can never be complete, |
| as it will change with each new version of the kernel. |
| This makes it necessary to be able to track down additional sources |
| of OS jitter. |
| Given a CPU $N$ running a CPU-bound usermode thread, the |
| commands shown in |
| Figure~\ref{fig:rt:Locating Sources of OS Jitter} |
| will produce a list of all the times that this CPU entered the kernel. |
| Of course, the \co{N} on line~5 must be replaced with the |
| number of the CPU in question, and the \co{1} on line~2 may be increased |
| to show additional levels of function call within the kernel. |
| The resulting trace can help track down the source of the OS jitter. |
| |
| As you can see, obtaining bare-metal performance when running |
| CPU-bound real-time threads on a general-purpose OS such as Linux |
| requires painstaking attention to detail. |
| Automation would of course help, and some automation has been applied, |
| but given the relatively small number of users, automation can be |
| expected to appear relatively slowly. |
| Nevertheless, the ability to gain near-bare-metal performance while |
| running a general-purpose operating system promises to ease construction |
| of some types of real-time systems. |
| |
| \subsection{Implementing Parallel Real-Time Applications} |
| \label{sec:rt:Implementing Parallel Real-Time Applications} |
| |
| Developing real-time applications is a wide-ranging topic, and this |
| section can only touch on a few aspects. |
| To this end, |
| Section~\ref{sec:rt:Real-Time Components} |
| looks at a few software components commonly used in real-time applications, |
| Section~\ref{sec:rt:Polling-Loop Applications} |
| provides a brief overview of how polling-loop-based applications may |
| be implemented, |
| Section~\ref{sec:rt:Streaming Applications} |
| gives a similar overview of streaming applications, and |
| Section~\ref{sec:rt:Event-Driven Applications} |
| briefly covers event-based applications. |
| |
| \subsubsection{Real-Time Components} |
| \label{sec:rt:Real-Time Components} |
| |
| As in all areas of engineering, a robust set of components is essential |
| to productivity and reliability. |
| This section is not a full catalog of real-time software components---such |
| a catalog would fill an entire book---but rather a brief overview of the |
| types of components available. |
| |
| A natural place to look for real-time software components would be |
| algorithms offering wait-free |
| synchronization~\cite{Herlihy91}, and in fact lockless |
| algorithms are very important to real-time computing. |
| However, wait-free synchronization only guarantees forward progress in |
| finite time, and real-time computing requires algorithms that meet the |
| far more stringent guarantee of forward progress in bounded time. |
| After all, a century is finite, but unhelpful when your deadlines are |
| measured in milliseconds. |
| |
| Nevertheless, there are some important wait-free algorithms that do |
| provide bounded response time, including atomic test and set, |
| atomic exchange, |
| atomic fetch-and-add, |
| single-producer/single-consumer FIFO queues based on circular arrays, |
| and numerous per-thread partitioned algorithms. |
| In addition, recent research has confirmed the observation that |
| algorithms with lock-free guarantees\footnote{ |
| Wait-free algorithms guarantee that all threads make progress in |
| finite time, while lock-free algorithms only guarantee that at |
| least one thread will make progress in finite time.} |
| provide the same latencies in practice assuming a stochastically |
| fair scheduler and freedom from fail-stop |
| bugs~\cite{DanAlitarh2013PracticalProgress}.\footnote{ |
| This paper also introduces the notion of \emph{bounded minimal |
| progress}, which is a welcome step on the part of theory |
| towards real-time practice.} |
| This means that lock-free stacks and queues are appropriate |
| for real-time use. |
| |
| \QuickQuiz{} |
| But isn't correct operation despite fail-stop bugs |
| a valuable fault-tolerance property? |
| \QuickQuizAnswer{ |
| Yes and no. |
| |
| Yes in that non-blocking algorithms can provide fault tolerance |
| in the face of fail-stop bugs, but no in that this is grossly |
| insufficient for practical fault tolerance. |
| For example, suppose you had a wait-free queue, and further |
| suppose that a thread has just dequeued an element. |
| If that thread now succumbs to a fail-stop bug, the element |
| it has just dequeued is effectively lost. |
| True fault tolerance requires way more than mere non-blocking |
| properties, and is beyond the scope of this book. |
| } \QuickQuizEnd |
| |
| In practice, locking is often used in real-time programs, theoretical |
| concerns notwithstanding. |
| However, under more severe constraints, lock-based |
| algorithms can also provide bounded latencies~\cite{BjoernBrandenburgPhD}. |
| These constraints include: |
| |
| \begin{enumerate} |
| \item Fair scheduler. |
| In the common case of a fixed-priority scheduler, the bounded |
| latencies are provided only to the highest-priority threads. |
| \item Sufficient bandwidth to support the workload. |
| An implementation rule supporting this constraint might be |
| ``There will be at least 50\% idle time on all CPUs |
| during normal operation,'' |
| or, more formally, ``The offered load will be sufficiently low |
| to allow the workload to be schedulable at all times.'' |
| \item No fail-stop bugs. |
| \item FIFO locking primitives with bounded acquisition, handoff, |
| and release latencies. |
| Again, in the common case of a locking primitive that is FIFO |
| within priorities, the bounded latencies are provided only |
| to the highest-priority threads. |
| \item Some way of preventing unbounded priority inversion. |
| The priority-ceiling and priority-inheritance disciplines |
| mentioned earlier in this chapter suffice. |
| \item Bounded nesting of lock acquisitions. |
| We can have an unbounded number of locks, but only as long as a |
| given thread never acquires more than a few of them (ideally only |
| one of them) at a time. |
| \item Bounded number of threads. |
| In combination with the earlier constraints, this constraint means |
| that there will be a bounded number of threads waiting on any |
| given lock. |
| \item Bounded time spent in any given critical section. |
| Given a bounded number of threads waiting on any given lock and |
| a bounded critical-section duration, the wait time will be bounded. |
| \end{enumerate} |
| |
| \QuickQuiz{} |
| I couldn't help but spot the word ``includes'' before this list. |
| Are there other constraints? |
| \QuickQuizAnswer{ |
| Indeed there are, and lots of them. |
| However, they tend to be specific to a given situation, |
| and many of them can be thought of as refinements of some of |
| the constraints listed above. |
| For example, the many constraints on choices of data structure
| will help meet the ``Bounded time spent in any given critical
| section'' constraint.
| } \QuickQuizEnd |
| |
| This result opens a vast cornucopia of algorithms and data structures |
| for use in real-time software---and validates long-standing real-time practice. |
| |
| Of course, a careful and simple application design is also extremely |
| important. |
| The best real-time components in the world cannot make up for a |
| poorly thought-out design. |
| For parallel real-time applications, synchronization overheads clearly |
| must be a key component of the design. |
| |
| \subsubsection{Polling-Loop Applications} |
| \label{sec:rt:Polling-Loop Applications} |
| |
| Many real-time applications consist of a single CPU-bound loop that |
| reads sensor data, computes a control law, and writes control output. |
| If the hardware registers providing sensor data and taking control |
| output are mapped into the application's address space, this loop |
| might be completely free of system calls. |
| But beware of the Spiderman principle: With great power comes great |
| responsibility, in this case the responsibility to avoid bricking the |
| hardware by making inappropriate references to the hardware registers. |
| |
| This arrangement is often run on bare metal, without the benefits of |
| (or the interference from) an operating system. |
| However, increasing hardware capability and increasing levels of
| automation motivate increasing software functionality, for example,
| user interfaces, logging, and reporting, all of which can benefit from
| an operating system.
| |
| One way of gaining much of the benefit of running on bare metal while |
| still having access to the full features and functions of a |
| general-purpose operating system is to use the Linux kernel's |
| \co{NO_HZ_FULL} capability, described in |
| Section~\ref{sec:rt:Polling-Loop Real-Time Support}. |
| This support first became available in version 3.10 of the Linux kernel. |
| |
| \subsubsection{Streaming Applications} |
| \label{sec:rt:Streaming Applications} |
| |
| A popular sort of big-data real-time application takes input from numerous |
| sources, processes it internally, and outputs alerts and summaries. |
| These \emph{streaming applications} are often highly parallel, processing |
| different information sources concurrently. |
| |
| One approach for implementing streaming applications is to use |
| dense-array circular FIFOs to connect different processing |
| steps~\cite{AdrianSutton2013LCA:Disruptor}. |
| Each such FIFO has only a single thread producing into it and a |
| (presumably different) single thread consuming from it. |
| Fan-in and fan-out points use threads rather than data structures, |
| so if the output of several FIFOs needed to be merged, a separate |
| thread would input from them and output to another FIFO for which |
| this separate thread was the sole producer. |
| Similarly, if the output of a given FIFO needed to be split, a separate |
| thread would input from this FIFO and output to several FIFOs as needed. |
| |
| This discipline might seem restrictive, but it allows communication |
| among threads with minimal synchronization overhead, and minimal |
| synchronization overhead is important when attempting to meet |
| tight latency constraints. |
| This is especially true when the amount of processing for each step |
| is small, so that the synchronization overhead is significant compared |
| to the processing overhead. |
| |
| The individual threads might be CPU-bound, in which case the advice in |
| Section~\ref{sec:rt:Polling-Loop Applications} applies. |
| On the other hand, if the individual threads block waiting for |
| data from their input FIFOs, the advice of the next section applies. |
| |
| \subsubsection{Event-Driven Applications} |
| \label{sec:rt:Event-Driven Applications} |
| |
| We will use fuel injection into a mid-sized industrial engine as a |
| fanciful example for event-driven applications. |
| Under normal operating conditions, this engine requires that the fuel |
| be injected within a one-degree interval surrounding top dead center. |
| If we assume a 1,500-RPM rotation rate, we have 25 rotations per second, |
| or about 9,000 degrees of rotation per second, which translates to |
| 111 microseconds per degree. |
| We therefore need to schedule the fuel injection to within a time |
| interval of about 100 microseconds. |
| |
| { \scriptsize |
| \begin{verbbox} |
| 1 if (clock_gettime(CLOCK_REALTIME, &timestart) != 0) {
| 2 perror("clock_gettime 1"); |
| 3 exit(-1); |
| 4 } |
| 5 if (nanosleep(&timewait, NULL) != 0) { |
| 6 perror("nanosleep"); |
| 7 exit(-1); |
| 8 } |
| 9 if (clock_gettime(CLOCK_REALTIME, &timeend) != 0) { |
| 10 perror("clock_gettime 2"); |
| 11 exit(-1); |
| 12 } |
| \end{verbbox} |
| } |
| \begin{figure}[tb] |
| \centering |
| \theverbbox |
| \caption{Timed-Wait Test Program} |
| \label{fig:rt:Timed-Wait Test Program} |
| \end{figure} |
| |
| Suppose that a timed wait was to be used to initiate fuel injection, |
| although if you are building an engine, I hope you supply a rotation |
| sensor. |
| We need to test the timed-wait functionality, perhaps using the test program |
| shown in |
| Figure~\ref{fig:rt:Timed-Wait Test Program}. |
| Unfortunately, if we run this program, we can get unacceptable timer |
| jitter, even in a -rt kernel. |
| |
| One problem is that POSIX \co{CLOCK_REALTIME} is, oddly enough, not intended |
| for real-time use. |
| Instead, it means ``realtime'' as opposed to the amount of CPU time |
| consumed by a process or thread. |
| For real-time use, you should instead use \co{CLOCK_MONOTONIC}. |
| However, even with this change, results are still unacceptable. |
| |
| Another problem is that the thread must be raised to a real-time |
| priority by using the \co{sched_setscheduler()} system call. |
| But even this change is insufficient, because we can still see |
| page faults. |
| We also need to use the \co{mlockall()} system call to pin the |
| application's memory, preventing page faults. |
| With all of these changes, results might finally be acceptable. |
| |
| In other situations, further adjustments might be needed. |
| It might be necessary to affinity time-critical threads onto their |
| own CPUs, and it might also be necessary to affinity interrupts |
| away from those CPUs. |
| It might be necessary to carefully select hardware and drivers, |
| and it will very likely be necessary to carefully select kernel |
| configuration. |
| |
| As can be seen from this example, real-time computing can be quite |
| unforgiving. |
| |
| \subsection{The Role of RCU} |
| \label{sec:rt:The Role of RCU} |
| |
| Suppose that you are writing a parallel real-time program that needs |
| to access |
| data that is subject to gradual change, perhaps due to changes in |
| temperature, humidity, and barometric pressure. |
| The real-time response constraints on this program are so severe that |
| it is not permissible to spin or block, thus ruling out locking, |
| nor is it permissible to use a retry loop, thus ruling out sequence locks |
| and hazard pointers. |
| Fortunately, the temperature and pressure are normally controlled, |
| so that a default hard-coded set of data is usually sufficient. |
| |
| However, the temperature, humidity, and pressure occasionally deviate too far |
| from the defaults, and in such situations it is necessary to provide |
| data that replaces the defaults. |
| Because the temperature, humidity, and pressure change gradually, |
| providing the updated values is not a matter of urgency, though |
| it must happen within a few minutes. |
| The program is to use a global pointer imaginatively named \co{cur_cal} |
| that normally references \co{default_cal}, which is a statically allocated |
| and initialized structure that contains the default calibration values |
| in fields imaginatively named \co{a}, \co{b}, and \co{c}. |
| Otherwise, \co{cur_cal} points to a dynamically allocated |
| structure providing the current calibration values. |
| |
| { \scriptsize |
| \begin{verbbox} |
| 1 struct calibration { |
| 2 short a; |
| 3 short b; |
| 4 short c; |
| 5 }; |
| 6 struct calibration default_cal = { 62, 33, 88 }; |
| 7 struct calibration *cur_cal = &default_cal;
| 8 |
| 9 short calc_control(short t, short h, short press) |
| 10 { |
| 11 struct calibration *p; |
| 12 |
| 13 p = rcu_dereference(cur_cal); |
| 14 return do_control(t, h, press, p->a, p->b, p->c); |
| 15 } |
| 16 |
| 17 bool update_cal(short a, short b, short c) |
| 18 { |
| 19 struct calibration *p; |
| 20 struct calibration *old_p; |
| 21 |
| 22 old_p = rcu_dereference(cur_cal); |
| 23 p = malloc(sizeof(*p));
| 24 if (!p) |
| 25 return false; |
| 26 p->a = a; |
| 27 p->b = b; |
| 28 p->c = c; |
| 29 rcu_assign_pointer(cur_cal, p); |
| 30 if (old_p == &default_cal) |
| 31 return true; |
| 32 synchronize_rcu(); |
| 33 free(old_p);
| 34 return true; |
| 35 } |
| \end{verbbox} |
| } |
| \begin{figure}[tb] |
| \centering |
| \theverbbox |
| \caption{Real-Time Calibration Using RCU} |
| \label{fig:rt:Real-Time Calibration Using RCU} |
| \end{figure} |
| |
| Figure~\ref{fig:rt:Real-Time Calibration Using RCU} |
| shows how RCU can be used to solve this problem. |
| Lookups are deterministic, as shown in \co{calc_control()} |
| on lines~9-15, consistent with real-time requirements. |
| Updates are more complex, as shown by \co{update_cal()} |
| on lines~17-35. |
| |
| \QuickQuiz{} |
| Given that real-time systems are often used for safety-critical |
| applications, and given that runtime memory allocation is |
| forbidden in many safety-critical situations, what is with |
| the call to \co{malloc()}??? |
| \QuickQuizAnswer{ |
| In early 2016, the situations that forbid runtime memory
| allocation were also unenthusiastic about multithreaded
| computing, so runtime memory allocation is not an additional
| obstacle to safety criticality.
| } \QuickQuizEnd |
| |
| \QuickQuiz{} |
| Don't you need some kind of synchronization to protect |
| \co{update_cal()}? |
| \QuickQuizAnswer{ |
| Indeed you do, and you could use any of a number of techniques |
| discussed earlier in this book. |
| } \QuickQuizEnd |
| |
| This example shows how RCU can provide deterministic read-side |
| data-structure access to real-time programs. |
| |
| \section{Real Time vs. Real Fast: How to Choose?} |
| \label{sec:rt:Real Time vs. Real Fast: How to Choose?} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3.2in}{!}{\includegraphics{cartoons/RealTimeNotRealFast}} |
| \caption{The Dark Side of Real-Time Computing} |
| \ContributedBy{Figure}{fig:rt:The Dark Side of Real-Time Computing}{Sarah McKenney} |
| \end{figure} |
| |
| \begin{figure}[tb] |
| \centering |
| \resizebox{3.2in}{!}{\includegraphics{cartoons/RealFastNotRealTime}} |
| \caption{The Dark Side of Real-Fast Computing} |
| \ContributedBy{Figure}{fig:rt:The Dark Side of Real-Fast Computing}{Sarah McKenney} |
| \end{figure} |
| |
| The choice between real-time and real-fast computing can be a difficult one. |
| Because real-time systems often inflict a throughput penalty on non-real-time |
| computing, using real-time when it is not required can cause problems, |
| as fancifully depicted by |
| Figure~\ref{fig:rt:The Dark Side of Real-Time Computing}. |
| On the other hand, failing to use real-time when it \emph{is} required |
| can also cause problems, as fancifully depicted by |
| Figure~\ref{fig:rt:The Dark Side of Real-Fast Computing}. |
| It is almost enough to make you feel sorry for the boss! |
| |
| One rule of thumb uses the following four questions to help you choose: |
| |
| \begin{enumerate} |
| \item Is average long-term throughput the only goal? |
| \item Is it permissible for heavy loads to degrade response times? |
| \item Is there high memory pressure, ruling out use of |
| the \co{mlockall()} system call? |
| \item Does the basic work item of your application take more than |
| 100 milliseconds to complete? |
| \end{enumerate} |
| |
| If the answer to any of these questions is ``yes'', you should choose
| real-fast over real-time.
| Otherwise, real-time might be for you.
| |
| Choose wisely, and if you do choose real-time, make sure that your |
| hardware, firmware, and operating system are up to the job! |