| .\" Copyright (C) 2014 Michael Kerrisk <mtk.manpages@gmail.com> |
| .\" and Copyright (C) 2014 Peter Zijlstra <peterz@infradead.org> |
| .\" and Copyright (C) 2014 Juri Lelli <juri.lelli@gmail.com> |
| .\" Various pieces from the old sched_setscheduler(2) page |
| .\" Copyright (C) Tom Bjorkholm, Markus Kuhn & David A. Wheeler 1996-1999 |
| .\" and Copyright (C) 2007 Carsten Emde <Carsten.Emde@osadl.org> |
| .\" and Copyright (C) 2008 Michael Kerrisk <mtk.manpages@gmail.com> |
| .\" |
| .\" %%%LICENSE_START(GPLv2+_DOC_FULL) |
| .\" This is free documentation; you can redistribute it and/or |
| .\" modify it under the terms of the GNU General Public License as |
| .\" published by the Free Software Foundation; either version 2 of |
| .\" the License, or (at your option) any later version. |
| .\" |
| .\" The GNU General Public License's references to "object code" |
| .\" and "executables" are to be interpreted as the output of any |
| .\" document formatting or typesetting system, including |
| .\" intermediate and printed output. |
| .\" |
| .\" This manual is distributed in the hope that it will be useful, |
| .\" but WITHOUT ANY WARRANTY; without even the implied warranty of |
| .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
| .\" GNU General Public License for more details. |
| .\" |
| .\" You should have received a copy of the GNU General Public |
| .\" License along with this manual; if not, see |
| .\" <http://www.gnu.org/licenses/>. |
| .\" %%%LICENSE_END |
| .\" |
| .\" Worth looking at: http://rt.wiki.kernel.org/index.php |
| .\" |
| .TH SCHED 7 2021-03-22 "Linux" "Linux Programmer's Manual" |
| .SH NAME |
| sched \- overview of CPU scheduling |
| .SH DESCRIPTION |
| Since Linux 2.6.23, the default scheduler is CFS, |
| the "Completely Fair Scheduler". |
| The CFS scheduler replaced the earlier "O(1)" scheduler. |
| .\" |
| .SS API summary |
| Linux provides the following system calls for controlling |
| the CPU scheduling behavior, policy, and priority of processes |
| (or, more precisely, threads). |
| .TP |
| .BR nice (2) |
| Set a new nice value for the calling thread, |
| and return the new nice value. |
| .TP |
| .BR getpriority (2) |
| Return the nice value of a thread, a process group, |
| or the set of threads owned by a specified user. |
| .TP |
| .BR setpriority (2) |
| Set the nice value of a thread, a process group, |
| or the set of threads owned by a specified user. |
| .TP |
| .BR sched_setscheduler (2) |
| Set the scheduling policy and parameters of a specified thread. |
| .TP |
| .BR sched_getscheduler (2) |
| Return the scheduling policy of a specified thread. |
| .TP |
| .BR sched_setparam (2) |
| Set the scheduling parameters of a specified thread. |
| .TP |
| .BR sched_getparam (2) |
| Fetch the scheduling parameters of a specified thread. |
| .TP |
| .BR sched_get_priority_max (2) |
| Return the maximum priority available in a specified scheduling policy. |
| .TP |
| .BR sched_get_priority_min (2) |
| Return the minimum priority available in a specified scheduling policy. |
| .TP |
| .BR sched_rr_get_interval (2) |
| Fetch the quantum used for threads that are scheduled under |
| the "round-robin" scheduling policy. |
| .TP |
| .BR sched_yield (2) |
| Cause the caller to relinquish the CPU, |
| so that some other thread can be executed. |
| .TP |
| .BR sched_setaffinity (2) |
| (Linux-specific) |
| Set the CPU affinity of a specified thread. |
| .TP |
| .BR sched_getaffinity (2) |
| (Linux-specific) |
| Get the CPU affinity of a specified thread. |
| .TP |
| .BR sched_setattr (2) |
| Set the scheduling policy and parameters of a specified thread. |
| This (Linux-specific) system call provides a superset of the functionality of |
| .BR sched_setscheduler (2) |
| and |
| .BR sched_setparam (2). |
| .TP |
| .BR sched_getattr (2) |
| Fetch the scheduling policy and parameters of a specified thread. |
| This (Linux-specific) system call provides a superset of the functionality of |
| .BR sched_getscheduler (2) |
| and |
| .BR sched_getparam (2). |
| .\" |
| .SS Scheduling policies |
| The scheduler is the kernel component that decides which runnable thread |
| will be executed by the CPU next. |
| Each thread has an associated scheduling policy and a \fIstatic\fP |
| scheduling priority, |
| .IR sched_priority . |
| The scheduler makes its decisions based on knowledge of the scheduling |
| policy and static priority of all threads on the system. |
| .PP |
| For threads scheduled under one of the normal scheduling policies |
| (\fBSCHED_OTHER\fP, \fBSCHED_IDLE\fP, \fBSCHED_BATCH\fP), |
| \fIsched_priority\fP is not used in scheduling |
| decisions (it must be specified as 0). |
| .PP |
| Threads scheduled under one of the real-time policies |
| (\fBSCHED_FIFO\fP, \fBSCHED_RR\fP) have a |
| \fIsched_priority\fP value in the range 1 (low) to 99 (high). |
| (As the numbers imply, real-time threads always have higher priority |
| than normal threads.) |
| Note well: POSIX.1 requires an implementation to support only a |
| minimum of 32 distinct priority levels for the real-time policies, |
| and some systems supply just this minimum. |
| Portable programs should use |
| .BR sched_get_priority_min (2) |
| and |
| .BR sched_get_priority_max (2) |
| to find the range of priorities supported for a particular policy. |
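| .PP |
| A minimal sketch of this advice follows. |
| The helper name |
| .I rt_priority_range |
| is an illustrative assumption, not an existing API; |
| it simply wraps the two calls named above so that a program |
| does not hard-code the Linux range of 1 to 99: |

```c
#define _GNU_SOURCE
#include <sched.h>

/* Hypothetical helper: store the lowest and highest static priority
   for 'policy' in *lo and *hi. Returns 0 on success, -1 on error
   (errno is set by the failing call). */
int
rt_priority_range(int policy, int *lo, int *hi)
{
    int min = sched_get_priority_min(policy);
    int max = sched_get_priority_max(policy);

    if (min == -1 || max == -1)
        return -1;

    *lo = min;
    *hi = max;
    return 0;
}
```

| .PP |
| A caller would typically pass |
| .B SCHED_FIFO |
| or |
| .B SCHED_RR |
| and clamp its requested priority to the returned range. |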
| .PP |
| Conceptually, the scheduler maintains a list of runnable |
| threads for each possible \fIsched_priority\fP value. |
| In order to determine which thread runs next, the scheduler looks for |
| the nonempty list with the highest static priority and selects the |
| thread at the head of this list. |
| .PP |
| A thread's scheduling policy determines |
| where it will be inserted into the list of threads |
| with equal static priority and how it will move inside this list. |
| .PP |
| All scheduling is preemptive: if a thread with a higher static |
| priority becomes ready to run, the currently running thread |
| will be preempted and |
| returned to the wait list for its static priority level. |
| The scheduling policy determines the |
| ordering only within the list of runnable threads with equal static |
| priority. |
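| .PP |
| The selection rule can be pictured with the following toy model. |
| It is a conceptual sketch only: the structure |
| .I runlists |
| and the functions |
| .I enqueue_tail |
| and |
| .I pick_next |
| are hypothetical names, and the kernel's real data structures differ. |

```c
#include <stddef.h>

#define NPRIO 100          /* static priorities 0..99 */
#define MAXQ  16           /* capacity of each toy list */

/* One FIFO list of thread IDs per static priority (toy model only). */
struct runlists {
    int ids[NPRIO][MAXQ];
    size_t len[NPRIO];
};

/* Append a thread to the tail of the list for its priority.
   Returns 0 on success, -1 if the priority is out of range
   or the list is full. */
int
enqueue_tail(struct runlists *rl, int prio, int id)
{
    if (prio < 0 || prio >= NPRIO || rl->len[prio] == MAXQ)
        return -1;
    rl->ids[prio][rl->len[prio]++] = id;
    return 0;
}

/* Return the thread at the head of the highest-priority nonempty
   list, or -1 if nothing is runnable. */
int
pick_next(const struct runlists *rl)
{
    for (int prio = NPRIO - 1; prio >= 0; prio--)
        if (rl->len[prio] > 0)
            return rl->ids[prio][0];
    return -1;
}
```

| .PP |
| In this model, a policy decides only where |
| .I enqueue_tail |
| (or a head-insertion counterpart) places a thread within one list; |
| .I pick_next |
| is the same for every policy. |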
| .SS SCHED_FIFO: First in-first out scheduling |
| \fBSCHED_FIFO\fP can be used only with static priorities higher than |
| 0, which means that when a \fBSCHED_FIFO\fP thread becomes runnable, |
| it will always immediately preempt any currently running |
| \fBSCHED_OTHER\fP, \fBSCHED_BATCH\fP, or \fBSCHED_IDLE\fP thread. |
| \fBSCHED_FIFO\fP is a simple scheduling |
| algorithm without time slicing. |
| For threads scheduled under the |
| \fBSCHED_FIFO\fP policy, the following rules apply: |
| .IP 1) 3 |
| A running \fBSCHED_FIFO\fP thread that has been preempted by another thread of |
| higher priority will stay at the head of the list for its priority and |
| will resume execution as soon as all threads of higher priority are |
| blocked again. |
| .IP 2) |
| When a blocked \fBSCHED_FIFO\fP thread becomes runnable, it |
| will be inserted at the end of the list for its priority. |
| .IP 3) |
| If a call to |
| .BR sched_setscheduler (2), |
| .BR sched_setparam (2), |
| .BR sched_setattr (2), |
| .BR pthread_setschedparam (3), |
| or |
| .BR pthread_setschedprio (3) |
| changes the priority of the running or runnable |
| .B SCHED_FIFO |
| thread identified by |
| .IR pid , |
| the effect on the thread's position in the list depends on |
| the direction of the change to the thread's priority: |
| .RS |
| .IP \(bu 3 |
| If the thread's priority is raised, |
| it is placed at the end of the list for its new priority. |
| As a consequence, |
| it may preempt a currently running thread with the same priority. |
| .IP \(bu |
| If the thread's priority is unchanged, |
| its position in the run list is unchanged. |
| .IP \(bu |
| If the thread's priority is lowered, |
| it is placed at the front of the list for its new priority. |
| .RE |
| .IP |
| According to POSIX.1-2008, |
| changes to a thread's priority (or policy) using any mechanism other than |
| .BR pthread_setschedprio (3) |
| should result in the thread being placed at the end of |
| the list for its priority. |
| .\" In 2.2.x and 2.4.x, the thread is placed at the front of the queue |
| .\" In 2.0.x, the Right Thing happened: the thread went to the back -- MTK |
| .IP 4) |
| A thread calling |
| .BR sched_yield (2) |
| will be put at the end of the list. |
| .PP |
| No other events will move a thread |
| scheduled under the \fBSCHED_FIFO\fP policy in the wait list of |
| runnable threads with equal static priority. |
| .PP |
| A \fBSCHED_FIFO\fP |
| thread runs until either it is blocked by an I/O request, it is |
| preempted by a higher priority thread, or it calls |
| .BR sched_yield (2). |
| .SS SCHED_RR: Round-robin scheduling |
| \fBSCHED_RR\fP is a simple enhancement of \fBSCHED_FIFO\fP. |
| Everything |
| described above for \fBSCHED_FIFO\fP also applies to \fBSCHED_RR\fP, |
| except that each thread is allowed to run only for a maximum time |
| quantum. |
| If a \fBSCHED_RR\fP thread has been running for a time |
| period equal to or longer than the time quantum, it will be put at the |
| end of the list for its priority. |
| A \fBSCHED_RR\fP thread that has |
| been preempted by a higher priority thread and subsequently resumes |
| execution as a running thread will complete the unexpired portion of |
| its round-robin time quantum. |
| The length of the time quantum can be |
| retrieved using |
| .BR sched_rr_get_interval (2). |
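| .PP |
| As a hedged sketch, the quantum can be fetched and converted to |
| nanoseconds as follows. |
| The wrapper name |
| .I rr_quantum_ns |
| is hypothetical; only |
| .BR sched_rr_get_interval (2) |
| itself is a real interface. |

```c
#define _GNU_SOURCE
#include <sched.h>
#include <sys/types.h>
#include <time.h>

/* Hypothetical helper: fetch the round-robin quantum of thread 'pid'
   (0 means the caller) in nanoseconds. Returns 0 on success, -1 on
   error with errno set by sched_rr_get_interval(). */
int
rr_quantum_ns(pid_t pid, long long *ns)
{
    struct timespec ts;

    if (sched_rr_get_interval(pid, &ts) == -1)
        return -1;

    *ns = (long long) ts.tv_sec * 1000000000LL + ts.tv_nsec;
    return 0;
}
```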
| .\" On Linux 2.4, the length of the RR interval is influenced |
| .\" by the process nice value -- MTK |
| .\" |
| .SS SCHED_DEADLINE: Sporadic task model deadline scheduling |
| Since version 3.14, Linux provides a deadline scheduling policy |
| .RB ( SCHED_DEADLINE ). |
| This policy is currently implemented using |
| GEDF (Global Earliest Deadline First) |
| in conjunction with CBS (Constant Bandwidth Server). |
| To set and fetch this policy and associated attributes, |
| one must use the Linux-specific |
| .BR sched_setattr (2) |
| and |
| .BR sched_getattr (2) |
| system calls. |
| .PP |
| A sporadic task is one that has a sequence of jobs, where each |
| job is activated at most once per period. |
| Each job also has a |
| .IR "relative deadline" , |
| before which it should finish execution, and a |
| .IR "computation time" , |
| which is the CPU time necessary for executing the job. |
| The moment when a task wakes up |
| because a new job has to be executed is called the |
| .IR "arrival time" |
| (also referred to as the request time or release time). |
| The |
| .IR "start time" |
| is the time at which a task starts its execution. |
| The |
| .I "absolute deadline" |
| is thus obtained by adding the relative deadline to the arrival time. |
| .PP |
| The following diagram clarifies these terms: |
| .PP |
| .in +4n |
| .EX |
| arrival/wakeup absolute deadline |
| | start time | |
| | | | |
| v v v |
| -----x--------xooooooooooooooooo--------x--------x--- |
| |<- comp. time ->| |
| |<------- relative deadline ------>| |
| |<-------------- period ------------------->| |
| .EE |
| .in |
| .PP |
| When setting a |
| .B SCHED_DEADLINE |
| policy for a thread using |
| .BR sched_setattr (2), |
| one can specify three parameters: |
| .IR Runtime , |
| .IR Deadline , |
| and |
| .IR Period . |
| These parameters do not necessarily correspond to the aforementioned terms: |
| usual practice is to set Runtime to something bigger than the average |
| computation time (or worst-case execution time for hard real-time tasks), |
| Deadline to the relative deadline, and Period to the period of the task. |
| Thus, for |
| .BR SCHED_DEADLINE |
| scheduling, we have: |
| .PP |
| .in +4n |
| .EX |
| arrival/wakeup absolute deadline |
| | start time | |
| | | | |
| v v v |
| -----x--------xooooooooooooooooo--------x--------x--- |
| |<-- Runtime ------->| |
| |<----------- Deadline ----------->| |
| |<-------------- Period ------------------->| |
| .EE |
| .in |
| .PP |
| The three deadline-scheduling parameters correspond to the |
| .IR sched_runtime , |
| .IR sched_deadline , |
| and |
| .IR sched_period |
| fields of the |
| .I sched_attr |
| structure; see |
| .BR sched_setattr (2). |
| These fields express values in nanoseconds. |
| .\" FIXME It looks as though specifying sched_period as 0 means |
| .\" "make sched_period the same as sched_deadline". |
| .\" This needs to be documented. |
| If |
| .IR sched_period |
| is specified as 0, then it is made the same as |
| .IR sched_deadline . |
| .PP |
| The kernel requires that: |
| .PP |
| sched_runtime <= sched_deadline <= sched_period |
| .PP |
| .\" See __checkparam_dl in kernel/sched/core.c |
| In addition, under the current implementation, |
| all of the parameter values must be at least 1024 |
| (i.e., just over one microsecond, |
| which is the resolution of the implementation), and less than 2^63. |
| If any of these checks fails, |
| .BR sched_setattr (2) |
| fails with the error |
| .BR EINVAL . |
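| .PP |
| The checks described above can be sketched as a pure function. |
| This is an illustration of the documented rules, not the kernel's code; |
| the name |
| .I dl_params_plausible |
| and the two macros are assumptions made for the example. |

```c
#include <stdbool.h>
#include <stdint.h>

#define DL_MIN_NS 1024ULL                 /* lower bound stated above */
#define DL_LIMIT  (UINT64_C(1) << 63)     /* values must be below 2^63 */

/* Returns true if the three values (in nanoseconds) satisfy the
   constraints described in the text. A period of 0 is treated as
   "same as the deadline". This mirrors the documented rules only;
   it does not perform the kernel's admission test. */
bool
dl_params_plausible(uint64_t runtime, uint64_t deadline, uint64_t period)
{
    if (period == 0)
        period = deadline;

    if (runtime < DL_MIN_NS || deadline < DL_MIN_NS || period < DL_MIN_NS)
        return false;
    if (runtime >= DL_LIMIT || deadline >= DL_LIMIT || period >= DL_LIMIT)
        return false;

    return runtime <= deadline && deadline <= period;
}
```

| .PP |
| For instance, a Runtime of 10 ms, a Deadline of 30 ms, and a Period of |
| 100 ms pass these checks, whereas a Runtime of 1000 ns does not. |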
| .PP |
| The CBS guarantees non-interference between tasks, by throttling |
| threads that attempt to over-run their specified Runtime. |
| .PP |
| To ensure deadline scheduling guarantees, |
| the kernel must prevent situations where the set of |
| .B SCHED_DEADLINE |
| threads is not feasible (schedulable) within the given constraints. |
| The kernel thus performs an admission test when setting or changing |
| .B SCHED_DEADLINE |
| policy and attributes. |
| This admission test calculates whether the change is feasible; |
| if it is not, |
| .BR sched_setattr (2) |
| fails with the error |
| .BR EBUSY . |
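| .PP |
| The following sketch shows one way a thread might request the |
| .B SCHED_DEADLINE |
| policy and observe these errors. |
| It invokes the system call through |
| .BR syscall (2), |
| on the assumption that the C library may not provide a wrapper. |
| The structure name |
| .I dl_sched_attr |
| and the function name |
| .I set_deadline_policy |
| are hypothetical; the field layout is assumed to match the first |
| version of the kernel's |
| .I sched_attr |
| structure. |

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sched.h>
#include <stdint.h>
#include <string.h>
#include <sys/syscall.h>
#include <unistd.h>

#ifndef SCHED_DEADLINE
#define SCHED_DEADLINE 6    /* value from the kernel UAPI headers */
#endif

/* Local copy of the first-version layout of the kernel's sched_attr
   structure, under a different (hypothetical) name to avoid clashing
   with C libraries that declare struct sched_attr themselves. */
struct dl_sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;     /* nanoseconds */
    uint64_t sched_deadline;    /* nanoseconds */
    uint64_t sched_period;      /* nanoseconds */
};

/* Ask the kernel to schedule the calling thread under SCHED_DEADLINE.
   Returns 0 on success; on failure returns -1 with errno set, for
   example EPERM (caller not privileged), EINVAL (bad parameters),
   or EBUSY (admission test failed). */
int
set_deadline_policy(uint64_t runtime_ns, uint64_t deadline_ns,
                    uint64_t period_ns)
{
    struct dl_sched_attr attr;

    memset(&attr, 0, sizeof(attr));
    attr.size = sizeof(attr);
    attr.sched_policy = SCHED_DEADLINE;
    attr.sched_runtime = runtime_ns;
    attr.sched_deadline = deadline_ns;
    attr.sched_period = period_ns;

    /* Arguments: pid 0 (calling thread), attributes, flags 0. */
    return (int) syscall(SYS_sched_setattr, 0, &attr, 0);
}
```

| .PP |
| A caller that receives |
| .B EBUSY |
| could retry with a smaller Runtime or a longer Period. |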
| .PP |
| For example, it is necessary (but not necessarily sufficient) for |
| the total utilization to be less than or equal to the total number of |
| CPUs available, where, since each thread can maximally run for |
| Runtime per Period, that thread's utilization is its |
| Runtime divided by its Period. |
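| .PP |
| This necessary condition can be written out as a short calculation. |
| The sketch below is illustrative only; the structure |
| .I dl_task |
| and both function names are assumptions, and the kernel's actual |
| admission test may apply further limits. |

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

struct dl_task {
    uint64_t runtime_ns;
    uint64_t period_ns;    /* must be nonzero */
};

/* Sum of Runtime/Period over all tasks. */
double
total_utilization(const struct dl_task *tasks, size_t n)
{
    double sum = 0.0;

    for (size_t i = 0; i < n; i++)
        sum += (double) tasks[i].runtime_ns / (double) tasks[i].period_ns;
    return sum;
}

/* Necessary (not sufficient) condition from the text: total
   utilization must not exceed the number of available CPUs. */
bool
passes_utilization_bound(const struct dl_task *tasks, size_t n, int ncpus)
{
    return total_utilization(tasks, n) <= (double) ncpus;
}
```

| .PP |
| For example, three tasks with utilizations of 0.1, 0.5, and 0.5 |
| total 1.1, which exceeds the bound on one CPU but not on two. |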
| .PP |
| In order to fulfill the guarantees that are made when |
| a thread is admitted to the |
| .BR SCHED_DEADLINE |
| policy, |
| .BR SCHED_DEADLINE |
| threads are the highest priority (user controllable) threads in the |
| system; if any |
| .BR SCHED_DEADLINE |
| thread is runnable, |
| it will preempt any thread scheduled under one of the other policies. |
| .PP |
| A call to |
| .BR fork (2) |
| by a thread scheduled under the |
| .B SCHED_DEADLINE |
| policy fails with the error |
| .BR EAGAIN , |
| unless the thread has its reset-on-fork flag set (see below). |
| .PP |
| A |
| .B SCHED_DEADLINE |
| thread that calls |
| .BR sched_yield (2) |
| will yield the current job and wait for a new period to begin. |
| .\" |
| .\" FIXME Calling sched_getparam() on a SCHED_DEADLINE thread |
| .\" fails with EINVAL, but sched_getscheduler() succeeds. |
| .\" Is that intended? (Why?) |
| .\" |
| .SS SCHED_OTHER: Default Linux time-sharing scheduling |
| \fBSCHED_OTHER\fP can be used only at static priority 0 |
| (i.e., threads under real-time policies always have priority over |
| .B SCHED_OTHER |
| processes). |
| \fBSCHED_OTHER\fP is the standard Linux time-sharing scheduler that is |
| intended for all threads that do not require the special |
| real-time mechanisms. |
| .PP |
| The thread to run is chosen from the static |
| priority 0 list based on a \fIdynamic\fP priority that is determined only |
| inside this list. |
| The dynamic priority is based on the nice value (see below) |
| and is increased for each time quantum in which the thread is ready to run |
| but is denied the CPU by the scheduler. |
| This ensures fair progress among all \fBSCHED_OTHER\fP threads. |
| .PP |
| In the Linux kernel source code, the |
| .B SCHED_OTHER |
| policy is actually named |
| .BR SCHED_NORMAL . |
| .\" |
| .SS The nice value |
| The nice value is an attribute |
| that can be used to influence the CPU scheduler to |
| favor or disfavor a process in scheduling decisions. |
| It affects the scheduling of |
| .BR SCHED_OTHER |
| and |
| .BR SCHED_BATCH |
| (see below) processes. |
| The nice value can be modified using |
| .BR nice (2), |
| .BR setpriority (2), |
| or |
| .BR sched_setattr (2). |
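| .PP |
| Reading the nice value back needs some care, because |
| .BR getpriority (2) |
| can legitimately return \-1. |
| The following is a minimal sketch; the helper name |
| .I read_nice |
| is hypothetical. |

```c
#define _GNU_SOURCE
#include <errno.h>
#include <sys/resource.h>
#include <sys/types.h>

/* Hypothetical helper: read the nice value of thread/process 'pid'
   (0 means the caller). getpriority() can legitimately return -1,
   so errno is cleared first and checked afterwards. */
int
read_nice(pid_t pid, int *nice_out)
{
    errno = 0;
    int val = getpriority(PRIO_PROCESS, (id_t) pid);

    if (val == -1 && errno != 0)
        return -1;

    *nice_out = val;
    return 0;
}
```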
| .PP |
| According to POSIX.1, the nice value is a per-process attribute; |
| that is, the threads in a process should share a nice value. |
| However, on Linux, the nice value is a per-thread attribute: |
| different threads in the same process may have different nice values. |
| .PP |
| The range of the nice value |
| varies across UNIX systems. |
| On modern Linux, the range is \-20 (high priority) to +19 (low priority). |
| On some other systems, the range is \-20..20. |
| Very early Linux kernels (before Linux 2.0) had the range \-infinity..15. |
| .\" Linux before 1.3.36 had \-infinity..15. |
| .\" Since kernel 1.3.43, Linux has the range \-20..19. |
| .PP |
| The degree to which the nice value affects the relative scheduling of |
| .BR SCHED_OTHER |
| processes likewise varies across UNIX systems and |
| across Linux kernel versions. |
| .PP |
| With the advent of the CFS scheduler in kernel 2.6.23, |
| Linux adopted an algorithm that causes |
| relative differences in nice values to have a much stronger effect. |
| In the current implementation, each unit of difference in the |
| nice values of two processes results in a factor of 1.25 |
| in the degree to which the scheduler favors the higher priority process. |
| This causes low-priority nice values (+19) to truly provide little CPU |
| to a process whenever there is any other |
| higher priority load on the system, |
| and makes high-priority nice values (\-20) deliver most of the CPU to applications |
| that require it (e.g., some audio applications). |
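| .PP |
| The effect of the 1.25 factor can be estimated with a small model. |
| This is a simplified sketch that assumes exactly two CPU-bound |
| processes competing for one CPU; the function names |
| .I nice_weight |
| and |
| .I cpu_share |
| are hypothetical, and the kernel uses a precomputed integer table |
| rather than floating point. |

```c
/* Relative weight implied by a nice value under the rule described
   above: each step of 1 toward -20 multiplies the weight by 1.25.
   Nice 0 is given weight 1.0. Illustrative model only. */
double
nice_weight(int nice)
{
    double w = 1.0;

    for (int i = 0; i < nice; i++)      /* positive nice: lighter */
        w /= 1.25;
    for (int i = 0; i > nice; i--)      /* negative nice: heavier */
        w *= 1.25;
    return w;
}

/* Fraction of one CPU that process A would get when competing only
   with process B, both CPU-bound, under this simplified model. */
double
cpu_share(int nice_a, int nice_b)
{
    double wa = nice_weight(nice_a);
    double wb = nice_weight(nice_b);

    return wa / (wa + wb);
}
```

| .PP |
| Under this model, a difference of one nice level gives the favored |
| process roughly 55.6% of the CPU, and a difference of 39 levels |
| (\-20 versus +19) gives it more than 99%. |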
| .PP |
| On Linux, the |
| .BR RLIMIT_NICE |
| resource limit can be used to define a limit to which |
| an unprivileged process's nice value can be raised; see |
| .BR setrlimit (2) |
| for details. |
| .PP |
| For further details on the nice value, see the subsections on |
| the autogroup feature and group scheduling, below. |
| .\" |
| .SS SCHED_BATCH: Scheduling batch processes |
| (Since Linux 2.6.16.) |
| \fBSCHED_BATCH\fP can be used only at static priority 0. |
| This policy is similar to \fBSCHED_OTHER\fP in that it schedules |
| the thread according to its dynamic priority |
| (based on the nice value). |
| The difference is that this policy |
| will cause the scheduler to always assume |
| that the thread is CPU-intensive. |
| Consequently, the scheduler will apply a small scheduling |
| penalty with respect to wakeup behavior, |
| so that this thread is mildly disfavored in scheduling decisions. |
| .PP |
| .\" The following paragraph is drawn largely from the text that |
| .\" accompanied Ingo Molnar's patch for the implementation of |
| .\" SCHED_BATCH. |
| .\" commit b0a9499c3dd50d333e2aedb7e894873c58da3785 |
| This policy is useful for workloads that are noninteractive, |
| but do not want to lower their nice value, |
| and for workloads that want a deterministic scheduling policy without |
| interactivity causing extra preemptions (between the workload's tasks). |
| .\" |
| .SS SCHED_IDLE: Scheduling very low priority jobs |
| (Since Linux 2.6.23.) |
| \fBSCHED_IDLE\fP can be used only at static priority 0; |
| the process nice value has no influence for this policy. |
| .PP |
| This policy is intended for running jobs at extremely low |
| priority (lower even than a +19 nice value with the |
| .B SCHED_OTHER |
| or |
| .B SCHED_BATCH |
| policies). |
| .\" |
| .SS Resetting scheduling policy for child processes |
| Each thread has a reset-on-fork scheduling flag. |
| When this flag is set, children created by |
| .BR fork (2) |
| do not inherit privileged scheduling policies. |
| The reset-on-fork flag can be set by either: |
| .IP * 3 |
| ORing the |
| .B SCHED_RESET_ON_FORK |
| flag into the |
| .I policy |
| argument when calling |
| .BR sched_setscheduler (2) |
| (since Linux 2.6.32); |
| or |
| .IP * |
| specifying the |
| .B SCHED_FLAG_RESET_ON_FORK |
| flag in |
| .IR attr.sched_flags |
| when calling |
| .BR sched_setattr (2). |
| .PP |
| Note that the constants used with these two APIs have different names. |
| The state of the reset-on-fork flag can analogously be retrieved using |
| .BR sched_getscheduler (2) |
| and |
| .BR sched_getattr (2). |
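| .PP |
| A minimal sketch of the first method follows. |
| The function names |
| .I enable_reset_on_fork |
| and |
| .I has_reset_on_fork |
| are hypothetical, and the fallback definition of |
| .B SCHED_RESET_ON_FORK |
| is an assumption for systems whose headers do not expose the constant. |

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdbool.h>

#ifndef SCHED_RESET_ON_FORK
#define SCHED_RESET_ON_FORK 0x40000000  /* value from kernel headers */
#endif

/* Keep the caller's current policy and priority, but OR in the
   reset-on-fork flag. Returns 0 on success, -1 on error. */
int
enable_reset_on_fork(void)
{
    struct sched_param param;
    int policy = sched_getscheduler(0);

    if (policy == -1 || sched_getparam(0, &param) == -1)
        return -1;

    return sched_setscheduler(0, policy | SCHED_RESET_ON_FORK, &param);
}

/* True if the value returned by sched_getscheduler() carries the flag. */
bool
has_reset_on_fork(int policy_value)
{
    return (policy_value & SCHED_RESET_ON_FORK) != 0;
}
```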
| .PP |
| The reset-on-fork feature is intended for media-playback applications, |
| and can be used to prevent applications from evading the |
| .BR RLIMIT_RTTIME |
| resource limit (see |
| .BR getrlimit (2)) |
| by creating multiple child processes. |
| .PP |
| More precisely, if the reset-on-fork flag is set, |
| the following rules apply for subsequently created children: |
| .IP * 3 |
| If the calling thread has a scheduling policy of |
| .B SCHED_FIFO |
| or |
| .BR SCHED_RR , |
| the policy is reset to |
| .BR SCHED_OTHER |
| in child processes. |
| .IP * |
| If the calling process has a negative nice value, |
| the nice value is reset to zero in child processes. |
| .PP |
| After the reset-on-fork flag has been enabled, |
| it can be reset only if the thread has the |
| .BR CAP_SYS_NICE |
| capability. |
| This flag is disabled in child processes created by |
| .BR fork (2). |
| .\" |
| .SS Privileges and resource limits |
| In Linux kernels before 2.6.12, only privileged |
| .RB ( CAP_SYS_NICE ) |
| threads can set a nonzero static priority (i.e., set a real-time |
| scheduling policy). |
| The only change that an unprivileged thread can make is to set the |
| .B SCHED_OTHER |
| policy, and this can be done only if the effective user ID of the caller |
| matches the real or effective user ID of the target thread |
| (i.e., the thread specified by |
| .IR pid ) |
| whose policy is being changed. |
| .PP |
| A thread must be privileged |
| .RB ( CAP_SYS_NICE ) |
| in order to set or modify a |
| .BR SCHED_DEADLINE |
| policy. |
| .PP |
| Since Linux 2.6.12, the |
| .B RLIMIT_RTPRIO |
| resource limit defines a ceiling on an unprivileged thread's |
| static priority for the |
| .B SCHED_RR |
| and |
| .B SCHED_FIFO |
| policies. |
| The rules for changing scheduling policy and priority are as follows: |
| .IP * 3 |
| If an unprivileged thread has a nonzero |
| .B RLIMIT_RTPRIO |
| soft limit, then it can change its scheduling policy and priority, |
| subject to the restriction that the priority cannot be set to a |
| value higher than the maximum of its current priority and its |
| .B RLIMIT_RTPRIO |
| soft limit. |
| .IP * |
| If the |
| .B RLIMIT_RTPRIO |
| soft limit is 0, then the only permitted changes are to lower the priority, |
| or to switch to a non-real-time policy. |
| .IP * |
| Subject to the same rules, |
| another unprivileged thread can also make these changes, |
| as long as the effective user ID of the thread making the change |
| matches the real or effective user ID of the target thread. |
| .IP * |
| Special rules apply for the |
| .BR SCHED_IDLE |
| policy. |
| In Linux kernels before 2.6.39, |
| an unprivileged thread operating under this policy cannot |
| change its policy, regardless of the value of its |
| .BR RLIMIT_RTPRIO |
| resource limit. |
| In Linux kernels since 2.6.39, |
| .\" commit c02aa73b1d18e43cfd79c2f193b225e84ca497c8 |
| an unprivileged thread can switch to either the |
| .BR SCHED_BATCH |
| or the |
| .BR SCHED_OTHER |
| policy so long as its nice value falls within the range permitted by its |
| .BR RLIMIT_NICE |
| resource limit (see |
| .BR getrlimit (2)). |
| .PP |
| Privileged |
| .RB ( CAP_SYS_NICE ) |
| threads ignore the |
| .B RLIMIT_RTPRIO |
| limit; as with older kernels, |
| they can make arbitrary changes to scheduling policy and priority. |
| See |
| .BR getrlimit (2) |
| for further information on |
| .BR RLIMIT_RTPRIO . |
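| .PP |
| The priority ceiling that applies to unprivileged threads, |
| as given in the list above, can be summarized by the following sketch. |
| It is a simplified model that covers only the priority value, |
| not policy switches or user ID checks; the function name is hypothetical. |

```c
#include <stdbool.h>

/* Simplified model of the rules above for an unprivileged thread
   changing a SCHED_FIFO or SCHED_RR priority: the new priority may
   not exceed the larger of the current priority and the
   RLIMIT_RTPRIO soft limit. With a soft limit of 0 this reduces to
   "the priority may not be raised". */
bool
unprivileged_may_set_rt_priority(int cur_prio, int rtprio_soft_limit,
                                 int new_prio)
{
    int ceiling = cur_prio > rtprio_soft_limit ? cur_prio
                                               : rtprio_soft_limit;

    return new_prio <= ceiling;
}
```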
| .SS Limiting the CPU usage of real-time and deadline processes |
| A nonblocking infinite loop in a thread scheduled under the |
| .BR SCHED_FIFO , |
| .BR SCHED_RR , |
| or |
| .BR SCHED_DEADLINE |
| policy can potentially block all other threads from accessing |
| the CPU forever. |
| Prior to Linux 2.6.25, the only way of preventing a runaway real-time |
| process from freezing the system was to run (at the console) |
| a shell scheduled under a higher static priority than the tested application. |
| This allows an emergency kill of tested |
| real-time applications that do not block or terminate as expected. |
| .PP |
| Since Linux 2.6.25, there are other techniques for dealing with runaway |
| real-time and deadline processes. |
| One of these is to use the |
| .BR RLIMIT_RTTIME |
| resource limit to set a ceiling on the CPU time that |
| a real-time process may consume. |
| See |
| .BR getrlimit (2) |
| for details. |
| .PP |
| Since version 2.6.25, Linux also provides two |
| .I /proc |
| files that can be used to reserve a certain amount of CPU time |
| to be used by non-real-time processes. |
| Reserving CPU time in this fashion allows some CPU time to be |
| allocated to (say) a root shell that can be used to kill a runaway process. |
| Both of these files specify time values in microseconds: |
| .TP |
| .IR /proc/sys/kernel/sched_rt_period_us |
| This file specifies a scheduling period that is equivalent to |
| 100% CPU bandwidth. |
| The value in this file can range from 1 to |
| .BR INT_MAX , |
| giving an operating range of 1 microsecond to around 35 minutes. |
| The default value in this file is 1,000,000 (1 second). |
| .TP |
| .IR /proc/sys/kernel/sched_rt_runtime_us |
| The value in this file specifies how much of the "period" time |
| can be used by all real-time and deadline scheduled processes |
| on the system. |
| The value in this file can range from \-1 to |
| .BR INT_MAX \-1. |
| Specifying \-1 makes the run time the same as the period; |
| that is, no CPU time is set aside for non-real-time processes |
| (which was the Linux behavior before kernel 2.6.25). |
| The default value in this file is 950,000 (0.95 seconds), |
| meaning that 5% of the CPU time is reserved for processes that |
| don't run under a real-time or deadline scheduling policy. |
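| .PP |
| The relationship between the two files can be sketched as follows. |
| Both function names are hypothetical; the calculation simply restates |
| the description above. |

```c
#include <stdio.h>

/* Fraction of CPU time left for non-real-time processes, given the
   values of sched_rt_period_us and sched_rt_runtime_us. A runtime of
   -1 means "same as the period", so nothing is reserved. */
double
non_rt_reserved_fraction(long period_us, long runtime_us)
{
    if (runtime_us == -1 || runtime_us >= period_us)
        return 0.0;
    return 1.0 - (double) runtime_us / (double) period_us;
}

/* Read a single integer from a /proc file. Returns 0 on success,
   -1 if the file cannot be opened or parsed. */
int
read_proc_long(const char *path, long *value)
{
    FILE *fp = fopen(path, "r");
    int ok;

    if (fp == NULL)
        return -1;
    ok = (fscanf(fp, "%ld", value) == 1);
    fclose(fp);
    return ok ? 0 : -1;
}
```

| .PP |
| With the default values of 1,000,000 and 950,000, |
| the computed fraction is 0.05, matching the 5% stated above. |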
| .SS Response time |
| A blocked high priority thread waiting for I/O has a certain |
| response time before it is scheduled again. |
| The device driver writer |
| can greatly reduce this response time by using a "slow interrupt" |
| interrupt handler. |
| .\" as described in |
| .\" .BR request_irq (9). |
| .SS Miscellaneous |
| Child processes inherit the scheduling policy and parameters across a |
| .BR fork (2). |
| The scheduling policy and parameters are preserved across |
| .BR execve (2). |
| .PP |
| Memory locking is usually needed for real-time processes to avoid |
| paging delays; this can be done with |
| .BR mlock (2) |
| or |
| .BR mlockall (2). |
| .\" |
| .SS The autogroup feature |
| .\" commit 5091faa449ee0b7d73bc296a93bca9540fc51d0a |
| Since Linux 2.6.38, |
| the kernel provides a feature known as autogrouping to improve interactive |
| desktop performance in the face of multiprocess, CPU-intensive |
| workloads such as building the Linux kernel with large numbers of |
| parallel build processes (i.e., the |
| .BR make (1) |
| .BR \-j |
| flag). |
| .PP |
| This feature operates in conjunction with the |
| CFS scheduler and requires a kernel that is configured with |
| .BR CONFIG_SCHED_AUTOGROUP . |
| On a running system, this feature is enabled or disabled via the file |
| .IR /proc/sys/kernel/sched_autogroup_enabled ; |
| a value of 0 disables the feature, while a value of 1 enables it. |
| The default value in this file is 1, unless the kernel was booted with the |
| .IR noautogroup |
| parameter. |
| .PP |
| A new autogroup is created when a new session is created via |
| .BR setsid (2); |
| this happens, for example, when a new terminal window is started. |
| A new process created by |
| .BR fork (2) |
| inherits its parent's autogroup membership. |
| Thus, all of the processes in a session are members of the same autogroup. |
| An autogroup is automatically destroyed when the last process |
| in the group terminates. |
| .PP |
| When autogrouping is enabled, all of the members of an autogroup |
| are placed in the same kernel scheduler "task group". |
| The CFS scheduler employs an algorithm that equalizes the |
| distribution of CPU cycles across task groups. |
| The benefits of this for interactive desktop performance |
| can be described via the following example. |
| .PP |
| Suppose that there are two autogroups competing for the same CPU |
| (i.e., presume either a single CPU system or the use of |
| .BR taskset (1) |
| to confine all the processes to the same CPU on an SMP system). |
| The first group contains ten CPU-bound processes from |
| a kernel build started with |
| .IR "make\ \-j10" . |
| The other contains a single CPU-bound process: a video player. |
| The effect of autogrouping is that the two groups will |
| each receive half of the CPU cycles. |
| That is, the video player will receive 50% of the CPU cycles, |
| rather than just 9% of the cycles, |
| which would likely lead to degraded video playback. |
| The situation on an SMP system is more complex, |
| .\" Mike Galbraith, 25 Nov 2016: |
| .\" I'd say something more wishy-washy here, like cycles are |
| .\" distributed fairly across groups and leave it at that, as your |
| .\" detailed example is incorrect due to SMP fairness (which I don't |
| .\" like much because [very unlikely] worst case scenario |
| .\" renders a box sized group incapable of utilizing more that |
| .\" a single CPU total). For example, if a group of NR_CPUS |
| .\" size competes with a singleton, load balancing will try to give |
| .\" the singleton a full CPU of its very own. If groups intersect for |
| .\" whatever reason on say my quad lappy, distribution is 80/20 in |
| .\" favor of the singleton. |
| but the general effect is the same: |
| the scheduler distributes CPU cycles across task groups such that |
| an autogroup that contains a large number of CPU-bound processes |
| does not end up hogging CPU cycles at the expense of the other |
| jobs on the system. |
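| .PP |
| The figures in the single-CPU example can be reproduced with the |
| following arithmetic sketch. |
| It assumes that all processes are CPU-bound and that all processes |
| and autogroups have equal nice values; the function names are hypothetical. |

```c
/* Share of a single CPU received by one CPU-bound process, in the
   simplified single-CPU model used in the example above. */

/* Without autogrouping: every process competes equally. */
double
share_without_autogroup(int total_procs)
{
    return 1.0 / (double) total_procs;
}

/* With autogrouping: groups split the CPU equally, and a group's
   share is then divided among its own processes. */
double
share_with_autogroup(int ngroups, int procs_in_my_group)
{
    return 1.0 / (double) ngroups / (double) procs_in_my_group;
}
```

| .PP |
| With eleven processes in total, the video player's share is |
| 1/11 (about 9%) without autogrouping and 1/2 with two autogroups, |
| while each of the ten build processes receives 1/20. |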
| .PP |
| A process's autogroup (task group) membership can be viewed via the file |
| .IR /proc/[pid]/autogroup : |
| .PP |
| .in +4n |
| .EX |
| $ \fBcat /proc/1/autogroup\fP |
| /autogroup\-1 nice 0 |
| .EE |
| .in |
| .PP |
| This file can also be used to modify the CPU bandwidth allocated |
| to an autogroup. |
| This is done by writing a number in the "nice" range to the file |
| to set the autogroup's nice value. |
| The allowed range is from +19 (low priority) to \-20 (high priority). |
| (Writing values outside of this range causes |
| .BR write (2) |
| to fail with the error |
| .BR EINVAL .) |
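| .PP |
| A program that reads this file might parse its single line as sketched |
| below. |
| The function name is hypothetical, and the format string is an |
| assumption based on the sample output shown above. |

```c
#include <stdio.h>

/* Parse one line in the format shown above, for example
   "/autogroup-1 nice 0". Returns 0 on success, -1 if the line does
   not match. */
int
parse_autogroup_line(const char *line, long *group_id, int *nice_value)
{
    if (sscanf(line, "/autogroup-%ld nice %d", group_id, nice_value) != 2)
        return -1;
    return 0;
}
```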
| .\" FIXME . |
| .\" Because of a bug introduced in Linux 4.7 |
| .\" (commit 2159197d66770ec01f75c93fb11dc66df81fd45b made changes |
| .\" that exposed the fact that autogroup didn't call scale_load()), |
| .\" it happened that *all* values in this range caused a task group |
| .\" to be further disfavored by the scheduler, with \-20 resulting |
| .\" in the scheduler mildly disfavoring the task group and +19 greatly |
| .\" disfavoring it. |
| .\" |
| .\" A patch was posted on 23 Nov 2016 |
| .\" ("sched/autogroup: Fix 64bit kernel nice adjustment"; |
| .\" check later to see in which kernel version it lands. |
| .PP |
| The autogroup nice setting has the same meaning as the process nice value, |
| but applies to distribution of CPU cycles to the autogroup as a whole, |
| based on the relative nice values of other autogroups. |
| For a process inside an autogroup, the CPU cycles that it receives |
| will be a product of the autogroup's nice value |
| (compared to other autogroups) |
| and the process's nice value |
| (compared to other processes in the same autogroup). |
| .PP |
| The use of the |
| .BR cgroups (7) |
| CPU controller to place processes in cgroups other than the |
| root CPU cgroup overrides the effect of autogrouping. |
| .PP |
| The autogroup feature groups only processes scheduled under |
| non-real-time policies |
| .RB ( SCHED_OTHER , |
| .BR SCHED_BATCH , |
| and |
| .BR SCHED_IDLE ). |
| It does not group processes scheduled under real-time and |
| deadline policies. |
| Those processes are scheduled according to the rules described earlier. |
| .\" |
| .SS The nice value and group scheduling |
| When scheduling non-real-time processes (i.e., those scheduled under the |
| .BR SCHED_OTHER , |
| .BR SCHED_BATCH , |
| and |
| .BR SCHED_IDLE |
| policies), the CFS scheduler employs a technique known as "group scheduling", |
| if the kernel was configured with the |
| .BR CONFIG_FAIR_GROUP_SCHED |
| option (which is typical). |
| .PP |
| Under group scheduling, threads are scheduled in "task groups". |
| Task groups have a hierarchical relationship, |
| rooted under the initial task group on the system, |
| known as the "root task group". |
| Task groups are formed in the following circumstances: |
| .IP * 3 |
| All of the threads in a CPU cgroup form a task group. |
| The parent of this task group is the task group of the |
| corresponding parent cgroup. |
| .IP * |
| If autogrouping is enabled, |
| then all of the threads that are (implicitly) placed in an autogroup |
| (i.e., the same session, as created by |
| .BR setsid (2)) |
| form a task group. |
| Each new autogroup is thus a separate task group. |
| The root task group is the parent of all such autogroups. |
| .IP * |
| If autogrouping is enabled, then the root task group consists of |
| all processes in the root CPU cgroup that were not |
| otherwise implicitly placed into a new autogroup. |
| .IP * |
| If autogrouping is disabled, then the root task group consists of |
| all processes in the root CPU cgroup. |
| .IP * |
| If group scheduling was disabled (i.e., the kernel was configured without |
| .BR CONFIG_FAIR_GROUP_SCHED ), |
| then all of the processes on the system are notionally placed |
| in a single task group. |
| .PP |
| Under group scheduling, |
| a thread's nice value has an effect for scheduling decisions |
| .IR "only relative to other threads in the same task group" . |
| This has some surprising consequences in terms of the traditional semantics |
| of the nice value on UNIX systems. |
| In particular, if autogrouping |
| is enabled (which is the default in various distributions), then employing |
| .BR setpriority (2) |
| or |
| .BR nice (1) |
| on a process has an effect only for scheduling relative |
| to other processes executed in the same session |
| (typically: the same terminal window). |
| .PP |
| Conversely, for two processes that are (for example) |
| the sole CPU-bound processes in different sessions |
| (e.g., different terminal windows, |
| each of whose jobs is tied to a different autogroup), |
| .IR "modifying the nice value of the process in one of the sessions" |
| .IR "has no effect" |
| in terms of the scheduler's decisions relative to the |
| process in the other session. |
| .\" More succinctly: the nice(1) command is in many cases a no-op since |
| .\" Linux 2.6.38. |
| .\" |
| A possibly useful workaround here is to use a command such as |
| the following to modify the autogroup nice value for |
| .I all |
| of the processes in a terminal session: |
| .PP |
| .in +4n |
| .EX |
| $ \fBecho 10 > /proc/self/autogroup\fP |
| .EE |
| .in |
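| .PP |
| The same write can be wrapped in a small shell function that checks |
| the allowed range before writing. |
| This is only a sketch: the name |
| .I set_autogroup_nice |
| and its optional file argument are hypothetical; |
| the file argument exists only so that the check can be tried |
| against an ordinary file. |
| .PP |
| .in +4n |
| .EX |
```shell
# set_autogroup_nice VALUE [FILE]: hypothetical helper.
# FILE defaults to /proc/self/autogroup.
# Usage: set_autogroup_nice 10
set_autogroup_nice() {
    # Accept only a short decimal integer with an optional leading
    # minus sign (at most three characters, enough for -20..19).
    case $1 in
        ''|-|????*|*[!0-9-]*|?*-*)
            echo "not a valid integer: $1" >&2
            return 1 ;;
    esac
    # Reject values outside the allowed nice range.
    if [ "$1" -lt -20 ] || [ "$1" -gt 19 ]; then
        echo "out of range: $1" >&2
        return 1
    fi
    echo "$1" > "${2:-/proc/self/autogroup}"
}
```
| .EE |
| .in |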
| .SS Real-time features in the mainline Linux kernel |
| .\" FIXME . Probably this text will need some minor tweaking |
| .\" ask Carsten Emde about this. |
| Since kernel version 2.6.18, Linux has gradually been |
| equipped with real-time capabilities, |
| most of which are derived from the former |
| .I realtime\-preempt |
| patch set. |
| Until the patches have been completely merged into the |
| mainline kernel, |
| they must be installed to achieve the best real-time performance. |
| These patches are named: |
| .PP |
| .in +4n |
| .EX |
| patch\-\fIkernelversion\fP\-rt\fIpatchversion\fP |
| .EE |
| .in |
| .PP |
| and can be downloaded from |
| .UR http://www.kernel.org\:/pub\:/linux\:/kernel\:/projects\:/rt/ |
| .UE . |
| .PP |
| Without the patches and prior to their full inclusion into the mainline |
| kernel, the kernel configuration offers only the three preemption classes |
| .BR CONFIG_PREEMPT_NONE , |
| .BR CONFIG_PREEMPT_VOLUNTARY , |
| and |
| .B CONFIG_PREEMPT |
| which respectively provide no, some, and considerable |
| reduction of the worst-case scheduling latency. |
| .PP |
| With the patches applied or after their full inclusion into the mainline |
| kernel, the additional configuration item |
| .B CONFIG_PREEMPT_RT |
| becomes available. |
| If this is selected, Linux is transformed into a regular |
| real-time operating system. |
| The FIFO and RR scheduling policies are then used to run a thread |
| with true real-time priority and a minimum worst-case scheduling latency. |
| .SH NOTES |
| The |
| .BR cgroups (7) |
| CPU controller can be used to limit the CPU consumption of |
| groups of processes. |
| .PP |
| Originally, standard Linux was intended as a general-purpose operating |
| system able to handle background processes, interactive |
| applications, and less demanding real-time applications (applications that |
| usually need to meet timing deadlines). |
| Although Linux kernel 2.6 |
| allowed for kernel preemption, and the then newly introduced O(1) scheduler |
| ensured that the time needed to schedule was fixed and deterministic |
| irrespective of the number of active tasks, true real-time computing |
| was not possible up to kernel version 2.6.17. |
| .SH SEE ALSO |
| .ad l |
| .nh |
| .BR chcpu (1), |
| .BR chrt (1), |
| .BR lscpu (1), |
| .BR ps (1), |
| .BR taskset (1), |
| .BR top (1), |
| .BR getpriority (2), |
| .BR mlock (2), |
| .BR mlockall (2), |
| .BR munlock (2), |
| .BR munlockall (2), |
| .BR nice (2), |
| .BR sched_get_priority_max (2), |
| .BR sched_get_priority_min (2), |
| .BR sched_getaffinity (2), |
| .BR sched_getparam (2), |
| .BR sched_getscheduler (2), |
| .BR sched_rr_get_interval (2), |
| .BR sched_setaffinity (2), |
| .BR sched_setparam (2), |
| .BR sched_setscheduler (2), |
| .BR sched_yield (2), |
| .BR setpriority (2), |
| .BR pthread_getaffinity_np (3), |
| .BR pthread_getschedparam (3), |
| .BR pthread_setaffinity_np (3), |
| .BR sched_getcpu (3), |
| .BR capabilities (7), |
| .BR cpuset (7) |
| .ad |
| .PP |
| .I Programming for the real world \- POSIX.4 |
| by Bill O.\& Gallmeister, O'Reilly & Associates, Inc., ISBN 1-56592-074-0. |
| .PP |
| The Linux kernel source files |
| .IR Documentation/scheduler/sched\-deadline.txt , |
| .IR Documentation/scheduler/sched\-rt\-group.txt , |
| .IR Documentation/scheduler/sched\-design\-CFS.txt , |
| and |
| .IR Documentation/scheduler/sched\-nice\-design.txt |