| CPU Controller |
| -------------- |
| |
| The CPU controller is responsible for grouping tasks together so that the |
| scheduler views them as a single unit. The CFS scheduler first divides CPU |
| time equally between all entities at the same level, and then does the same |
| at the next level down. Basic use cases are described in the main cgroup |
| documentation file, cgroups.txt. |
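
As a rough illustration of the level-by-level division described above, the
following Python sketch computes the fraction of CPU time each leaf would
receive. It assumes that every entity is always runnable and that all entities
at a level carry equal weight. The nested-dict representation, the function
name leaf_fractions and the group names are hypothetical and are not part of
any kernel interface.

```python
from fractions import Fraction


def leaf_fractions(tree, total=Fraction(1)):
    """Return {path: fraction of CPU time} for every leaf of a nested dict.

    `tree` maps a (hypothetical) group name either to a nested dict, for a
    group with children, or to None, for a leaf entity.
    """
    result = {}
    # Equal split among all entities at this level.
    share = total / len(tree)
    for name, children in tree.items():
        if children:
            # Repeat the same division one level down, within this share.
            for path, frac in leaf_fractions(children, share).items():
                result[name + "/" + path] = frac
        else:
            result[name] = share
    return result


# Two top-level entities; "a" has two children, "b" is a leaf.
hierarchy = {"a": {"a1": None, "a2": None}, "b": None}
for path, frac in leaf_fractions(hierarchy).items():
    print(path, frac)
# a/a1 1/4
# a/a2 1/4
# b 1/2
```

Note that "b" gets as much CPU time as "a1" and "a2" combined, because the
split happens per level rather than per leaf.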
| |
| Users of this functionality should be aware that deep hierarchies impose |
| scheduler overhead, since the scheduler has to take extra steps and look up |
| additional data structures to make its final decision. |
| |
| Through the CPU controller, the scheduler is also able to cap the CPU |
| utilization of a particular group. This is particularly useful in environments |
| in which CPU is paid for by the hour, and one values predictability over |
| performance. |
| |
| CPU Accounting |
| -------------- |
| |
| The CPU cgroup will also provide additional files under the prefix "cpuacct". |
| Those files provide accounting statistics and were previously provided by the |
| separate cpuacct controller. Although the cpuacct controller will still be kept |
| around for compatibility reasons, its usage is discouraged. If both the CPU and |
| cpuacct controllers are present in the system, distributors are encouraged to |
| always mount them together. |
| |
| Files |
| ----- |
| |
| The CPU controller exposes the following files to the user: |
| |
| - cpu.shares: The weight of the group relative to the other groups at the |
| same level of the hierarchy, which determines the amount of CPU it is |
| expected to get. Upon cgroup creation, each group is assigned a default of |
| 1024. The fraction of CPU assigned to the cgroup is its shares value divided |
| by the sum of the shares of all cgroups at the same level. |
| |
| - cpu.cfs_period_us: The duration in microseconds of each scheduler period, for |
| bandwidth decisions. This defaults to 100000us (100ms). Larger periods will |
| improve throughput at the expense of latency, since the scheduler will be able |
| to sustain a cpu-bound workload for longer. The opposite is true for smaller |
| periods. Note that this only affects non-RT tasks that are scheduled by the |
| CFS scheduler. |
| |
| - cpu.cfs_quota_us: The maximum time in microseconds during each cfs_period_us |
| for which the current group is allowed to run. For instance, if it is set to |
| half of cfs_period_us, the cgroup will be able to run for at most 50% of one |
| CPU's time. Note that this represents aggregate time over all CPUs in the |
| system. Therefore, in order to allow full usage of two CPUs, for instance, |
| one should set this value to twice the value of cfs_period_us. |
| |
| - cpu.stat: Statistics about the bandwidth controls. No data will be presented |
| if cpu.cfs_quota_us is not set. The file presents three numbers: |
| nr_periods: number of full periods that have elapsed. |
| nr_throttled: number of times the group exhausted its allowed bandwidth. |
| throttled_time: total time the tasks were not run due to being over quota. |
| |
| - cpu.rt_runtime_us and cpu.rt_period_us: These files are the RT-task |
| analogues of the CFS files cfs_quota_us and cfs_period_us. One important |
| difference, though, is that while the CFS quotas are upper bounds that |
| won't necessarily be met, the RT runtimes form a stricter guarantee. |
| Therefore, no overlap is allowed. This implies that, given a hierarchy with |
| multiple children, the sum of the children's rt_runtime_us may not exceed |
| the runtime of the parent. Also, an rt_runtime_us of 0 means that no RT tasks |
| can ever be run in this cgroup. For more information about RT task runtime |
| assignments, see scheduler/sched-rt-group.txt. |
| |
| - cpu.stat_percpu: Various scheduler statistics for the current group. The |
| information provided in this file is akin to that displayed in /proc/stat, |
| except that it is cgroup-aware. The file format consists of a one-line |
| header that describes the fields being listed. No guarantee is given that |
| the fields will be kept the same between kernel releases, so readers should |
| always parse the header to determine which fields are present. |
| |
| Each of the following lines shows the respective field values for each of |
| the possible cpus in the system. All time values are shown in nanoseconds. |
| One example output for this file is: |
| |
| cpu   user       nice  system    irq  softirq  guest  guest_nice  wait      nr_switches  nr_running |
| cpu0  471000000  0     15000000  0    0        0      0           1996534   7205         1 |
| cpu1  588000000  0     17000000  0    0        0      0           2848680   6510         1 |
| cpu2  505000000  0     14000000  0    0        0      0           2350771   6183         1 |
| cpu3  472000000  0     16000000  0    0        0      0           19766345  6277         2 |
| |
| |
| - cpuacct.usage: The aggregate CPU time, in nanoseconds, consumed by all tasks |
| in this group. |
| |
| - cpuacct.usage_percpu: The CPU time, in nanoseconds, consumed by all tasks in |
| this group, separated by CPU. The format is a space-separated array of time |
| values, one for each present CPU. |
| |
| - cpuacct.stat: aggregate user and system time consumed by tasks in this group. |
| The format is |
| user: x |
| system: y |
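
To make the cpu.shares rule above concrete, here is a minimal Python sketch
that turns the shares values of sibling groups into expected CPU percentages.
It assumes all siblings are busy at the same time. The group names and the
function name cpu_percentages are hypothetical.

```python
def cpu_percentages(shares):
    """Map each sibling group to its expected CPU percentage.

    `shares` maps a (hypothetical) group name to its cpu.shares value.
    Each group gets its own shares divided by the sum over all siblings.
    """
    total = sum(shares.values())
    return {name: 100.0 * value / total for name, value in shares.items()}


# One group keeps the default of 1024, the other two are set to half of it.
siblings = {"web": 1024, "batch": 512, "backup": 512}
print(cpu_percentages(siblings))
# {'web': 50.0, 'batch': 25.0, 'backup': 25.0}
```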
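
The relationship between cpu.cfs_quota_us and cpu.cfs_period_us described
above is simple arithmetic, sketched here in Python. The helper names
quota_for_cpus and cpus_allowed are hypothetical; only the default period of
100000us is taken from the text.

```python
DEFAULT_PERIOD_US = 100000  # default cpu.cfs_period_us, as stated above


def quota_for_cpus(cpus, period_us=DEFAULT_PERIOD_US):
    """Return the cpu.cfs_quota_us value allowing `cpus` CPUs' worth of time.

    The quota is aggregate time over all CPUs, so two full CPUs need a
    quota of twice the period.
    """
    return int(round(cpus * period_us))


def cpus_allowed(quota_us, period_us=DEFAULT_PERIOD_US):
    """Return how many CPUs' worth of time a given quota permits."""
    return quota_us / period_us


print(quota_for_cpus(0.5))   # 50000: at most half of one CPU
print(quota_for_cpus(2))     # 200000: full usage of two CPUs
print(cpus_allowed(150000))  # 1.5
```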
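
Since the cpu.stat_percpu fields may change between kernel releases, a reader
is better off keying values by the header than by column position. The
following Python sketch shows one way to do that, using the example output
above as input. It assumes whitespace-separated columns and integer values.
The function name parse_stat_percpu is hypothetical, and the sketch works on
text already read from the file, since the mount point is not specified here.

```python
SAMPLE = """\
cpu user nice system irq softirq guest guest_nice wait nr_switches nr_running
cpu0 471000000 0 15000000 0 0 0 0 1996534 7205 1
cpu1 588000000 0 17000000 0 0 0 0 2848680 6510 1
cpu2 505000000 0 14000000 0 0 0 0 2350771 6183 1
cpu3 472000000 0 16000000 0 0 0 0 19766345 6277 2
"""


def parse_stat_percpu(text):
    """Return {cpu_name: {field_name: value}} using the header line."""
    lines = [line.split() for line in text.splitlines() if line.strip()]
    header, rows = lines[0], lines[1:]
    # The first header column only labels the cpu name column.
    fields = header[1:]
    stats = {}
    for row in rows:
        stats[row[0]] = dict(zip(fields, (int(v) for v in row[1:])))
    return stats


stats = parse_stat_percpu(SAMPLE)
# Look fields up by name, so added or reordered columns do not matter.
total_user_ns = sum(cpu["user"] for cpu in stats.values())
print(total_user_ns)                # 2036000000
print(stats["cpu3"]["nr_running"])  # 2
```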