|      CPU frequency and voltage scaling code in the Linux(TM) kernel | 
 |  | 
 |  | 
 | 		         L i n u x    C P U F r e q | 
 |  | 
 | 		      C P U F r e q   G o v e r n o r s | 
 |  | 
 | 		   - information for users and developers - | 
 |  | 
 |  | 
 | 		    Dominik Brodowski  <linux@brodo.de> | 
 |             some additions and corrections by Nico Golde <nico@ngolde.de> | 
 | 		Rafael J. Wysocki <rafael.j.wysocki@intel.com> | 
 | 		   Viresh Kumar <viresh.kumar@linaro.org> | 
 |  | 
 |  | 
 |  | 
 |    Clock scaling allows you to change the clock speed of the CPUs on the | 
 |     fly. This is a nice method to save battery power, because the lower | 
 |             the clock speed, the less power the CPU consumes. | 
 |  | 
 |  | 
 | Contents: | 
 | --------- | 
 | 1.   What is a CPUFreq Governor? | 
 |  | 
 | 2.   Governors In the Linux Kernel | 
 | 2.1  Performance | 
 | 2.2  Powersave | 
 | 2.3  Userspace | 
 | 2.4  Ondemand | 
 | 2.5  Conservative | 
 | 2.6  Schedutil | 
 |  | 
 | 3.   The Governor Interface in the CPUfreq Core | 
 |  | 
 | 4.   References | 
 |  | 
 |  | 
 | 1. What Is A CPUFreq Governor? | 
 | ============================== | 
 |  | 
 | Most cpufreq drivers (except the intel_pstate and longrun) or even most | 
 | cpu frequency scaling algorithms only allow the CPU frequency to be set | 
 | to predefined fixed values.  In order to offer dynamic frequency | 
 | scaling, the cpufreq core must be able to tell these drivers of a | 
 | "target frequency". So these specific drivers will be transformed to | 
 | offer a "->target/target_index/fast_switch()" call instead of the | 
 | "->setpolicy()" call. For set_policy drivers, all stays the same, | 
 | though. | 
 |  | 
 | How to decide what frequency within the CPUfreq policy should be used? | 
 | That's done using "cpufreq governors". | 
 |  | 
 | Basically, it's the following flow graph: | 
 |  | 
 | CPU can be set to switch independently	 |	   CPU can only be set | 
 |       within specific "limits"		 |       to specific frequencies | 
 |  | 
 |                                  "CPUfreq policy" | 
 | 		consists of frequency limits (policy->{min,max}) | 
 |   		     and CPUfreq governor to be used | 
 | 			 /		      \ | 
 | 			/		       \ | 
 | 		       /		       the cpufreq governor decides | 
 | 		      /			       (dynamically or statically) | 
 | 		     /			       what target_freq to set within | 
 | 		    /			       the limits of policy->{min,max} | 
 | 		   /			            \ | 
 | 		  /				     \ | 
 | 	Using the ->setpolicy call,		 Using the ->target/target_index/fast_switch call, | 
 | 	    the limits and the			  the frequency closest | 
 | 	     "policy" is set.			  to target_freq is set. | 
 | 						  It is assured that it | 
 | 						  is within policy->{min,max} | 
 |  | 
 |  | 
 | 2. Governors In the Linux Kernel | 
 | ================================ | 
 |  | 
 | 2.1 Performance | 
 | --------------- | 
 |  | 
 | The CPUfreq governor "performance" sets the CPU statically to the | 
 | highest frequency within the borders of scaling_min_freq and | 
 | scaling_max_freq. | 
 |  | 
 |  | 
 | 2.2 Powersave | 
 | ------------- | 
 |  | 
 | The CPUfreq governor "powersave" sets the CPU statically to the | 
 | lowest frequency within the borders of scaling_min_freq and | 
 | scaling_max_freq. | 
 |  | 
 |  | 
 | 2.3 Userspace | 
 | ------------- | 
 |  | 
 | The CPUfreq governor "userspace" allows the user, or any userspace | 
 | program running with UID "root", to set the CPU to a specific frequency | 
 | by making a sysfs file "scaling_setspeed" available in the CPU-device | 
 | directory. | 
 |  | 
 |  | 
 | 2.4 Ondemand | 
 | ------------ | 
 |  | 
 | The CPUfreq governor "ondemand" sets the CPU frequency depending on the | 
 | current system load. Load estimation is triggered by the scheduler | 
 | through the update_util_data->func hook; when triggered, cpufreq checks | 
 | the CPU-usage statistics over the last period and the governor sets the | 
 | CPU accordingly.  The CPU must have the capability to switch the | 
 | frequency very quickly. | 
 |  | 
 | Sysfs files: | 
 |  | 
 | * sampling_rate: | 
 |  | 
 |   Measured in uS (10^-6 seconds), this is how often you want the kernel | 
 |   to look at the CPU usage and to make decisions on what to do about the | 
 |   frequency.  Typically this is set to values of around '10000' or more. | 
 |   It's default value is (cmp. with users-guide.txt): transition_latency | 
 |   * 1000.  Be aware that transition latency is in ns and sampling_rate | 
 |   is in us, so you get the same sysfs value by default.  Sampling rate | 
 |   should always get adjusted considering the transition latency to set | 
 |   the sampling rate 750 times as high as the transition latency in the | 
 |   bash (as said, 1000 is default), do: | 
 |  | 
 |   $ echo `$(($(cat cpuinfo_transition_latency) * 750 / 1000)) > ondemand/sampling_rate | 
 |  | 
 | * sampling_rate_min: | 
 |  | 
 |   The sampling rate is limited by the HW transition latency: | 
 |   transition_latency * 100 | 
 |  | 
 |   Or by kernel restrictions: | 
 |   - If CONFIG_NO_HZ_COMMON is set, the limit is 10ms fixed. | 
 |   - If CONFIG_NO_HZ_COMMON is not set or nohz=off boot parameter is | 
 |     used, the limits depend on the CONFIG_HZ option: | 
 |     HZ=1000: min=20000us  (20ms) | 
 |     HZ=250:  min=80000us  (80ms) | 
 |     HZ=100:  min=200000us (200ms) | 
 |  | 
 |   The highest value of kernel and HW latency restrictions is shown and | 
 |   used as the minimum sampling rate. | 
 |  | 
 | * up_threshold: | 
 |  | 
 |   This defines what the average CPU usage between the samplings of | 
 |   'sampling_rate' needs to be for the kernel to make a decision on | 
 |   whether it should increase the frequency.  For example when it is set | 
 |   to its default value of '95' it means that between the checking | 
 |   intervals the CPU needs to be on average more than 95% in use to then | 
 |   decide that the CPU frequency needs to be increased. | 
 |  | 
 | * ignore_nice_load: | 
 |  | 
 |   This parameter takes a value of '0' or '1'. When set to '0' (its | 
 |   default), all processes are counted towards the 'cpu utilisation' | 
 |   value.  When set to '1', the processes that are run with a 'nice' | 
 |   value will not count (and thus be ignored) in the overall usage | 
 |   calculation.  This is useful if you are running a CPU intensive | 
 |   calculation on your laptop that you do not care how long it takes to | 
 |   complete as you can 'nice' it and prevent it from taking part in the | 
 |   deciding process of whether to increase your CPU frequency. | 
 |  | 
 | * sampling_down_factor: | 
 |  | 
 |   This parameter controls the rate at which the kernel makes a decision | 
 |   on when to decrease the frequency while running at top speed. When set | 
 |   to 1 (the default) decisions to reevaluate load are made at the same | 
 |   interval regardless of current clock speed. But when set to greater | 
 |   than 1 (e.g. 100) it acts as a multiplier for the scheduling interval | 
 |   for reevaluating load when the CPU is at its top speed due to high | 
 |   load. This improves performance by reducing the overhead of load | 
 |   evaluation and helping the CPU stay at its top speed when truly busy, | 
 |   rather than shifting back and forth in speed. This tunable has no | 
 |   effect on behavior at lower speeds/lower CPU loads. | 
 |  | 
 | * powersave_bias: | 
 |  | 
 |   This parameter takes a value between 0 to 1000. It defines the | 
 |   percentage (times 10) value of the target frequency that will be | 
 |   shaved off of the target. For example, when set to 100 -- 10%, when | 
 |   ondemand governor would have targeted 1000 MHz, it will target | 
 |   1000 MHz - (10% of 1000 MHz) = 900 MHz instead. This is set to 0 | 
 |   (disabled) by default. | 
 |  | 
 |   When AMD frequency sensitivity powersave bias driver -- | 
 |   drivers/cpufreq/amd_freq_sensitivity.c is loaded, this parameter | 
 |   defines the workload frequency sensitivity threshold in which a lower | 
 |   frequency is chosen instead of ondemand governor's original target. | 
 |   The frequency sensitivity is a hardware reported (on AMD Family 16h | 
 |   Processors and above) value between 0 to 100% that tells software how | 
 |   the performance of the workload running on a CPU will change when | 
 |   frequency changes. A workload with sensitivity of 0% (memory/IO-bound) | 
 |   will not perform any better on higher core frequency, whereas a | 
 |   workload with sensitivity of 100% (CPU-bound) will perform better | 
 |   higher the frequency. When the driver is loaded, this is set to 400 by | 
 |   default -- for CPUs running workloads with sensitivity value below | 
 |   40%, a lower frequency is chosen. Unloading the driver or writing 0 | 
 |   will disable this feature. | 
 |  | 
 |  | 
 | 2.5 Conservative | 
 | ---------------- | 
 |  | 
 | The CPUfreq governor "conservative", much like the "ondemand" | 
 | governor, sets the CPU frequency depending on the current usage.  It | 
 | differs in behaviour in that it gracefully increases and decreases the | 
 | CPU speed rather than jumping to max speed the moment there is any load | 
 | on the CPU. This behaviour is more suitable in a battery powered | 
 | environment.  The governor is tweaked in the same manner as the | 
 | "ondemand" governor through sysfs with the addition of: | 
 |  | 
 | * freq_step: | 
 |  | 
 |   This describes what percentage steps the cpu freq should be increased | 
 |   and decreased smoothly by.  By default the cpu frequency will increase | 
 |   in 5% chunks of your maximum cpu frequency.  You can change this value | 
 |   to anywhere between 0 and 100 where '0' will effectively lock your CPU | 
 |   at a speed regardless of its load whilst '100' will, in theory, make | 
 |   it behave identically to the "ondemand" governor. | 
 |  | 
 | * down_threshold: | 
 |  | 
 |   Same as the 'up_threshold' found for the "ondemand" governor but for | 
 |   the opposite direction.  For example when set to its default value of | 
 |   '20' it means that if the CPU usage needs to be below 20% between | 
 |   samples to have the frequency decreased. | 
 |  | 
 | * sampling_down_factor: | 
 |  | 
 |   Similar functionality as in "ondemand" governor.  But in | 
 |   "conservative", it controls the rate at which the kernel makes a | 
 |   decision on when to decrease the frequency while running in any speed. | 
 |   Load for frequency increase is still evaluated every sampling rate. | 
 |  | 
 |  | 
 | 2.6 Schedutil | 
 | ------------- | 
 |  | 
 | The "schedutil" governor aims at better integration with the Linux | 
 | kernel scheduler.  Load estimation is achieved through the scheduler's | 
 | Per-Entity Load Tracking (PELT) mechanism, which also provides | 
 | information about the recent load [1].  This governor currently does | 
 | load based DVFS only for tasks managed by CFS. RT and DL scheduler tasks | 
 | are always run at the highest frequency.  Unlike all the other | 
 | governors, the code is located under the kernel/sched/ directory. | 
 |  | 
 | Sysfs files: | 
 |  | 
 | * rate_limit_us: | 
 |  | 
 |   This contains a value in microseconds. The governor waits for | 
 |   rate_limit_us time before reevaluating the load again, after it has | 
 |   evaluated the load once. | 
 |  | 
 | For an in-depth comparison with the other governors refer to [2]. | 
 |  | 
 |  | 
 | 3. The Governor Interface in the CPUfreq Core | 
 | ============================================= | 
 |  | 
 | A new governor must register itself with the CPUfreq core using | 
 | "cpufreq_register_governor". The struct cpufreq_governor, which has to | 
 | be passed to that function, must contain the following values: | 
 |  | 
 | governor->name - A unique name for this governor. | 
 | governor->owner - .THIS_MODULE for the governor module (if appropriate). | 
 |  | 
 | plus a set of hooks to the functions implementing the governor's logic. | 
 |  | 
 | The CPUfreq governor may call the CPU processor driver using one of | 
 | these two functions: | 
 |  | 
 | int cpufreq_driver_target(struct cpufreq_policy *policy, | 
 |                                  unsigned int target_freq, | 
 |                                  unsigned int relation); | 
 |  | 
 | int __cpufreq_driver_target(struct cpufreq_policy *policy, | 
 |                                    unsigned int target_freq, | 
 |                                    unsigned int relation); | 
 |  | 
 | target_freq must be within policy->min and policy->max, of course. | 
 | What's the difference between these two functions? When your governor is | 
 | in a direct code path of a call to governor callbacks, like | 
 | governor->start(), the policy->rwsem is still held in the cpufreq core, | 
 | and there's no need to lock it again (in fact, this would cause a | 
 | deadlock). So use __cpufreq_driver_target only in these cases. In all | 
 | other cases (for example, when there's a "daemonized" function that | 
 | wakes up every second), use cpufreq_driver_target to take policy->rwsem | 
 | before the command is passed to the cpufreq driver. | 
 |  | 
 | 4. References | 
 | ============= | 
 |  | 
 | [1] Per-entity load tracking: https://lwn.net/Articles/531853/ | 
 | [2] Improvements in CPU frequency management: https://lwn.net/Articles/682391/ | 
 |  |