     CPU frequency and voltage scaling code in the Linux(TM) kernel


		         L i n u x    C P U F r e q

		      C P U F r e q   G o v e r n o r s

		   - information for users and developers -


		    Dominik Brodowski  <linux@brodo.de>
            some additions and corrections by Nico Golde <nico@ngolde.de>



   Clock scaling allows you to change the clock speed of the CPUs on the
    fly. This is a nice method to save battery power, because the lower
            the clock speed, the less power the CPU consumes.


Contents:
---------
1.   What is a CPUFreq Governor?

2.   Governors In the Linux Kernel
2.1  Performance
2.2  Powersave
2.3  Userspace
2.4  Ondemand
2.5  Conservative

3.   The Governor Interface in the CPUfreq Core



1. What Is A CPUFreq Governor?
==============================

Most cpufreq drivers (in fact, all except one, longrun) or even most
cpu frequency scaling algorithms only allow the CPU to be set to one
frequency. In order to offer dynamic frequency scaling, the cpufreq
core must be able to tell these drivers of a "target frequency". So
these specific drivers will be transformed to offer a "->target"
call instead of the existing "->setpolicy" call. For "longrun", all
stays the same, though.

How to decide what frequency within the CPUfreq policy should be used?
That's done using "cpufreq governors". Two are already in this patch
-- they're the already existing "powersave" and "performance" which
set the frequency statically to the lowest or highest frequency,
respectively. At least two more such governors will be ready for
addition in the near future, and likely many more will follow, as there
are various different theories and models about dynamic frequency
scaling around. Because cpufreq offers a generic interface to scaling
governors, these can be tested extensively, and the best one can be
selected for each specific use.

Basically, it's the following flow graph:

CPU can be set to switch independently	 |	   CPU can only be set
      within specific "limits"		 |       to specific frequencies

                                 "CPUfreq policy"
		consists of frequency limits (policy->{min,max})
  		     and CPUfreq governor to be used
			 /		      \
			/		       \
		       /		       the cpufreq governor decides
		      /			       (dynamically or statically)
		     /			       what target_freq to set within
		    /			       the limits of policy->{min,max}
		   /			            \
		  /				     \
	Using the ->setpolicy call,		 Using the ->target call,
	    the limits and the			  the frequency closest
	     "policy" is set.			  to target_freq is set.
						  It is assured that it
						  is within policy->{min,max}

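As a rough illustration of the "CPUfreq policy" shown above, the
following shell sketch prints the governor and the frequency limits of
one policy as exported through sysfs. It assumes the usual directory
/sys/devices/system/cpu/cpu0/cpufreq and the files scaling_governor,
scaling_min_freq and scaling_max_freq; the helper name show_policy is
made up for this example.

```shell
# show_policy DIR: print the governor and the limits stored in a cpufreq
# sysfs directory (hypothetical helper, not part of any existing tool).
show_policy() {
    dir=$1
    printf 'governor=%s min=%s max=%s\n' \
        "$(cat "$dir/scaling_governor")" \
        "$(cat "$dir/scaling_min_freq")" \
        "$(cat "$dir/scaling_max_freq")"
}

# Assumed location on a running system; skipped if it does not exist.
CPUFREQ_DIR=/sys/devices/system/cpu/cpu0/cpufreq
if [ -r "$CPUFREQ_DIR/scaling_governor" ]; then
    show_policy "$CPUFREQ_DIR"
fi
```

The directory is passed as an argument so that the helper can be pointed
at any CPU, or at a scratch directory when trying it out.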

2. Governors In the Linux Kernel
================================

2.1 Performance
---------------

The CPUfreq governor "performance" sets the CPU statically to the
highest frequency within the borders of scaling_min_freq and
scaling_max_freq.


2.2 Powersave
-------------

The CPUfreq governor "powersave" sets the CPU statically to the
lowest frequency within the borders of scaling_min_freq and
scaling_max_freq.

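Selecting one of these static governors could look like the sketch
below. It assumes the sysfs files scaling_available_governors and
scaling_governor in the cpufreq directory of a CPU; the helper name
set_governor is made up for this example, and writing to the real files
needs root.

```shell
# set_governor DIR NAME: select governor NAME for the policy in DIR, but
# only if the kernel lists it as available (hypothetical helper).
set_governor() {
    dir=$1
    name=$2
    for g in $(cat "$dir/scaling_available_governors"); do
        if [ "$g" = "$name" ]; then
            echo "$name" > "$dir/scaling_governor"
            return 0
        fi
    done
    echo "governor '$name' is not available" >&2
    return 1
}

# Possible use, as root (path is an assumption):
#   set_governor /sys/devices/system/cpu/cpu0/cpufreq performance
```

Checking the list first avoids a failed write when the requested
governor is not built in or its module is not loaded.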

2.3 Userspace
-------------

The CPUfreq governor "userspace" allows the user, or any userspace
program running with UID "root", to set the CPU to a specific frequency
by making a sysfs file "scaling_setspeed" available in the CPU-device
directory.

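A minimal sketch of how a script might use "scaling_setspeed" follows.
It assumes that the file sits next to scaling_governor in the cpufreq
directory and that frequencies are written in kHz; the helper name
set_speed is made up for this example.

```shell
# set_speed DIR KHZ: request frequency KHZ (in kHz) through
# scaling_setspeed; refuses unless the "userspace" governor is active.
# Hypothetical helper; writing to the real file needs root.
set_speed() {
    dir=$1
    khz=$2
    if [ "$(cat "$dir/scaling_governor")" != "userspace" ]; then
        echo "userspace governor is not active" >&2
        return 1
    fi
    echo "$khz" > "$dir/scaling_setspeed"
}

# Possible use, as root (path and frequency are assumptions):
#   set_speed /sys/devices/system/cpu/cpu0/cpufreq 1200000
```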

2.4 Ondemand
------------

The CPUfreq governor "ondemand" sets the CPU depending on the
current usage. To do this the CPU must have the capability to
switch the frequency very quickly.  There are a number of parameters
accessible through sysfs files:

sampling_rate: measured in us (10^-6 seconds), this is how often you
want the kernel to look at the CPU usage and to make decisions on
what to do about the frequency.  Typically this is set to values of
around '10000' or more. Its default value is (cmp. with user-guide.txt):
transition_latency * 1000
The lowest value you can set is:
transition_latency * 100 or it may get restricted to a value where it
makes no sense for the kernel anymore to poll that often, which depends
on your HZ config variable (HZ=1000: min=20000us, HZ=250: min=5000us).
Be aware that transition latency is in ns and sampling_rate is in us, so you
get the same sysfs value by default.
Sampling rate should always get adjusted considering the transition latency.
To set the sampling rate 750 times as high as the transition latency
in the bash (as said, 1000 is default), do:
echo $(($(cat cpuinfo_transition_latency) * 750 / 1000)) \
    >ondemand/sampling_rate

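The unit conversion behind that command can be sketched as a small shell
function. The helper name sampling_rate_us and the latency of 10000 ns
are assumptions made for this example only.

```shell
# sampling_rate_us LATENCY_NS FACTOR: sampling rate in us for a
# transition latency given in ns (hypothetical helper).
# FACTOR 1000 is the default, 100 the lowest factor named in the text.
sampling_rate_us() {
    latency_ns=$1
    factor=$2
    # Dividing by 1000 converts from ns to us.
    echo $(( latency_ns * factor / 1000 ))
}

# Example with an assumed transition latency of 10000 ns:
sampling_rate_us 10000 1000
sampling_rate_us 10000 750
```

With the default factor of 1000 the result equals the latency value
itself, which is why both sysfs files show the same number by default.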
show_sampling_rate_(min|max): THIS INTERFACE IS DEPRECATED, DON'T USE IT.
You can use wider ranges now and the general
cpuinfo_transition_latency variable (cmp. with user-guide.txt) can be
used to obtain exactly the same info:
show_sampling_rate_min = transition_latency * 500    / 1000
show_sampling_rate_max = transition_latency * 500000 / 1000
(divided by 1000 is to illustrate that sampling rate is in us and
transition latency is exported in ns).

up_threshold: defines what the average CPU usage between the samplings
of 'sampling_rate' needs to be for the kernel to make a decision on
whether it should increase the frequency.  For example when it is set
to its default value of '80' it means that between the checking
intervals the CPU needs to be on average more than 80% in use to then
decide that the CPU frequency needs to be increased.

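The comparison described for up_threshold can be sketched as follows.
This is only an illustration of the rule in the paragraph above, not the
kernel's code; the helper name should_increase is made up.

```shell
# should_increase LOAD UP_THRESHOLD: succeed (exit status 0) when the
# average load in percent is strictly above the threshold
# (hypothetical helper).
should_increase() {
    [ "$1" -gt "$2" ]
}

# Example with an assumed average load of 85% and the default of 80:
if should_increase 85 80; then
    echo "raise frequency"
else
    echo "keep frequency"
fi
```

A load of exactly 80 does not trigger an increase here, since the text
asks for "more than 80%".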
ignore_nice_load: this parameter takes a value of '0' or '1'. When
set to '0' (its default), all processes are counted towards the
'cpu utilisation' value.  When set to '1', the processes that are
run with a 'nice' value will not count (and thus be ignored) in the
overall usage calculation.  This is useful if you are running a CPU
intensive calculation on your laptop and do not care how long it
takes to complete, as you can 'nice' it and prevent it from taking part
in the deciding process of whether to increase your CPU frequency.

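The effect of ignore_nice_load on the usage value can be sketched with a
simplified model. The function below is an assumption made for
illustration, not the kernel's exact calculation: it takes the time spent
in user, nice, system and idle state during one sampling interval and
leaves the nice time out of the busy time when asked to.

```shell
# cpu_load USER NICE SYSTEM IDLE IGNORE_NICE: load in percent for one
# sampling interval (hypothetical helper, simplified model).
cpu_load() {
    user=$1; nice_t=$2; system=$3; idle=$4; ignore_nice=$5
    total=$(( user + nice_t + system + idle ))
    busy=$(( user + system ))
    # Niced time only counts as load when ignore_nice_load is 0.
    if [ "$ignore_nice" -eq 0 ]; then
        busy=$(( busy + nice_t ))
    fi
    if [ "$total" -eq 0 ]; then
        echo 0
    else
        echo $(( busy * 100 / total ))
    fi
}

# Same interval, dominated by a niced job, with and without the switch:
cpu_load 10 60 10 20 0
cpu_load 10 60 10 20 1
```

In this model the first call reports 80 and the second 20, so with an
up_threshold of 80 the niced job alone would not push the frequency up
once ignore_nice_load is set to '1'.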

2.5 Conservative
----------------

The CPUfreq governor "conservative", much like the "ondemand"
governor, sets the CPU depending on the current usage.  It differs in
behaviour in that it gracefully increases and decreases the CPU speed
rather than jumping to max speed the moment there is any load on the
CPU.  This behaviour is more suitable in a battery powered environment.
The governor is tweaked in the same manner as the "ondemand" governor
through sysfs with the addition of:

freq_step: this describes what percentage steps the cpu freq should be
increased and decreased smoothly by.  By default the cpu frequency will
increase in 5% chunks of your maximum cpu frequency.  You can change this
value to anywhere between 0 and 100 where '0' will effectively lock your
CPU at a speed regardless of its load whilst '100' will, in theory, make
it behave identically to the "ondemand" governor.

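The arithmetic behind freq_step can be sketched like this. The helper
name next_freq_up and the frequencies are made up for the example; the
kernel's own rounding and table lookup are not modelled.

```shell
# next_freq_up CUR MAX STEP_PCT: frequency after one upward step of
# STEP_PCT percent of MAX, capped at MAX (hypothetical helper).
next_freq_up() {
    cur=$1; max=$2; step_pct=$3
    step=$(( max * step_pct / 100 ))
    next=$(( cur + step ))
    if [ "$next" -gt "$max" ]; then
        next=$max
    fi
    echo "$next"
}

# Assumed values: currently 1000000 kHz, maximum 2000000 kHz.
next_freq_up 1000000 2000000 5
next_freq_up 1000000 2000000 0
next_freq_up 1000000 2000000 100
```

The second and third calls show the two extremes named above: a step of
'0' leaves the frequency where it is, a step of '100' goes straight to
the maximum.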
down_threshold: same as the 'up_threshold' found for the "ondemand"
governor but for the opposite direction.  For example when set to its
default value of '20' it means that the CPU usage needs to be below
20% between samples to have the frequency decreased.

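Taken together, the two thresholds and freq_step suggest a decision rule
along the following lines. This is a sketch under the assumptions stated
in the comments, not the governor's actual implementation; the helper
name conservative_next is made up.

```shell
# conservative_next LOAD CUR MIN MAX STEP_PCT UP DOWN: next frequency
# for an average LOAD in percent (hypothetical helper).
# Steps are STEP_PCT percent of MAX; the result stays within MIN..MAX.
conservative_next() {
    load=$1; cur=$2; min=$3; max=$4; step_pct=$5; up=$6; down=$7
    step=$(( max * step_pct / 100 ))
    next=$cur
    if [ "$load" -gt "$up" ]; then
        next=$(( cur + step ))
    elif [ "$load" -lt "$down" ]; then
        next=$(( cur - step ))
    fi
    # Keep the result inside the policy limits.
    if [ "$next" -gt "$max" ]; then next=$max; fi
    if [ "$next" -lt "$min" ]; then next=$min; fi
    echo "$next"
}

# Assumed policy: 800000..2000000 kHz, step 5%, thresholds 80 and 20.
conservative_next 90 1000000 800000 2000000 5 80 20
conservative_next 10 1000000 800000 2000000 5 80 20
conservative_next 50 1000000 800000 2000000 5 80 20
```

Loads between the two thresholds leave the frequency unchanged, which is
what keeps the governor from oscillating on moderate load.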

3. The Governor Interface in the CPUfreq Core
=============================================

A new governor must register itself with the CPUfreq core using
"cpufreq_register_governor". The struct cpufreq_governor, which has to
be passed to that function, must contain the following values:

governor->name -	    A unique name for this governor
governor->governor -	    The governor callback function
governor->owner	-	    THIS_MODULE for the governor module (if
			    appropriate)

The governor->governor callback is called with the current (or to-be-set)
cpufreq_policy struct for that CPU, and an unsigned int event. The
following events are currently defined:

CPUFREQ_GOV_START:   This governor shall start its duty for the CPU
		     policy->cpu
CPUFREQ_GOV_STOP:    This governor shall end its duty for the CPU
		     policy->cpu
CPUFREQ_GOV_LIMITS:  The limits for CPU policy->cpu have changed to
		     policy->min and policy->max.

If you need other "events" outside of your driver, _only_ use the
cpufreq_governor_l(unsigned int cpu, unsigned int event) call to the
CPUfreq core to ensure proper locking.


The CPUfreq governor may call the CPU processor driver using one of
these two functions:

int cpufreq_driver_target(struct cpufreq_policy *policy,
                          unsigned int target_freq,
                          unsigned int relation);

int __cpufreq_driver_target(struct cpufreq_policy *policy,
                            unsigned int target_freq,
                            unsigned int relation);

target_freq must be within policy->min and policy->max, of course.
What's the difference between these two functions? When your governor
still is in a direct code path of a call to governor->governor, the
per-CPU cpufreq lock is still held in the cpufreq core, and there's
no need to lock it again (in fact, this would cause a deadlock). So
use __cpufreq_driver_target only in these cases. In all other cases
(for example, when there's a "daemonized" function that wakes up
every second), use cpufreq_driver_target to lock the cpufreq per-CPU
lock before the command is passed to the cpufreq processor driver.