Documentation for /proc/sys/vm/*	kernel version 2.2.10
(c) 1998, 1999, Rik van Riel <riel@nl.linux.org>

For general info and legal blurb, please look in README.

==============================================================

This file contains the documentation for the sysctl files in
/proc/sys/vm and is valid for Linux kernel version 2.2.

The files in this directory can be used to tune the operation
of the virtual memory (VM) subsystem of the Linux kernel and
the writeout of dirty data to disk.

Default values and initialization routines for most of these
files can be found in mm/swap.c.

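Each tunable is exposed as a plain text file under /proc/sys/vm. As a
minimal sketch (not part of the kernel sources; overcommit_memory is used
only as an example, and writing any of these files requires root), a
tunable can be read and changed from C like this:

	/* Sketch: read a vm sysctl, then write a value back (root only). */
	#include <stdio.h>

	int main(void)
	{
		long value = 0;
		FILE *f = fopen("/proc/sys/vm/overcommit_memory", "r");

		if (f) {
			fscanf(f, "%ld", &value);
			fclose(f);
			printf("overcommit_memory is %ld\n", value);
		}

		f = fopen("/proc/sys/vm/overcommit_memory", "w");
		if (f) {
			fprintf(f, "%ld\n", value);	/* writes the same value back */
			fclose(f);
		}
		return 0;
	}
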
Currently, these files are in /proc/sys/vm:
- overcommit_memory
- overcommit_ratio
- page-cluster
- dirty_ratio
- dirty_background_ratio
- dirty_expire_centisecs
- dirty_writeback_centisecs
- highmem_is_dirtyable   (only if CONFIG_HIGHMEM set)
- max_map_count
- min_free_kbytes
- percpu_pagelist_fraction
- laptop_mode
- block_dump
- drop-caches
- zone_reclaim_mode
- min_unmapped_ratio
- min_slab_ratio
- panic_on_oom
- oom_dump_tasks
- oom_kill_allocating_task
- mmap_min_addr
- numa_zonelist_order
- nr_hugepages
- nr_overcommit_hugepages

==============================================================

dirty_ratio, dirty_background_ratio, dirty_expire_centisecs,
dirty_writeback_centisecs, highmem_is_dirtyable,
vfs_cache_pressure, laptop_mode, block_dump, swap_token_timeout,
drop-caches, hugepages_treat_as_movable:

See Documentation/filesystems/proc.txt

==============================================================

overcommit_memory:

This value contains a flag that enables memory overcommitment.

When this flag is 0, the kernel attempts to estimate the amount
of free memory left when userspace requests more memory.

When this flag is 1, the kernel pretends there is always enough
memory until it actually runs out.

When this flag is 2, the kernel uses a "never overcommit"
policy that attempts to prevent any overcommit of memory.

This feature can be very useful because there are a lot of
programs that malloc() huge amounts of memory "just-in-case"
and don't use much of it.

The default value is 0.

See Documentation/vm/overcommit-accounting and
security/commoncap.c::cap_vm_enough_memory() for more information.

==============================================================

overcommit_ratio:

When overcommit_memory is set to 2, the committed address
space is not permitted to exceed swap plus this percentage
of physical RAM.  See above.

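As a rough worked example (a sketch only; the kernel's real accounting,
e.g. the CommitLimit reported in /proc/meminfo, also takes hugetlb pages
and other reservations into account), the mode-2 limit amounts to:

	/*
	 * Sketch of the overcommit_memory == 2 limit: swap plus
	 * overcommit_ratio percent of physical RAM.  The numbers below
	 * are made-up example values, not taken from any real system.
	 */
	#include <stdio.h>

	int main(void)
	{
		unsigned long swap_kb = 2 * 1024 * 1024;	/* 2 GiB swap */
		unsigned long ram_kb  = 4 * 1024 * 1024;	/* 4 GiB RAM  */
		unsigned long ratio   = 50;			/* overcommit_ratio */
		unsigned long limit   = swap_kb + ram_kb * ratio / 100;

		printf("commit limit: %lu kB\n", limit);	/* 4 GiB here */
		return 0;
	}
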
==============================================================

page-cluster:

The Linux VM subsystem avoids excessive disk seeks by reading
multiple pages on a page fault. The number of pages it reads
is dependent on the amount of memory in your machine.

The number of pages the kernel reads in at once is equal to
2 ^ page-cluster. Values above 2 ^ 5 don't make much sense
for swap because we only cluster swap data in 32-page groups.

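For example (a sketch; the 4 KiB page size is an assumption here, use
sysconf(_SC_PAGESIZE) for the real value):

	/* Sketch: how many pages one 2^page-cluster read-ahead covers. */
	#include <stdio.h>

	int main(void)
	{
		int cluster = 3;	/* assumed example value, often the default */
		FILE *f = fopen("/proc/sys/vm/page-cluster", "r");

		if (f) {
			fscanf(f, "%d", &cluster);
			fclose(f);
		}
		printf("%d pages (%d kB at 4 KiB pages) per read\n",
		       1 << cluster, (1 << cluster) * 4);
		return 0;
	}
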
==============================================================

max_map_count:

This file contains the maximum number of memory map areas a process
may have. Memory map areas are used as a side-effect of calling
malloc, directly by mmap and mprotect, and also when loading shared
libraries.

While most applications need less than a thousand maps, certain
programs, particularly malloc debuggers, may consume lots of them,
e.g., up to one or two maps per allocation.

The default value is 65536.

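To see how close a given process is to this limit, one rough approach
(a sketch, not an exact accounting; each line of /proc/<pid>/maps
corresponds to one map area) is:

	/* Sketch: count this process's map areas and the configured limit. */
	#include <stdio.h>

	int main(void)
	{
		long nmaps = 0, limit = 0;
		int c;
		FILE *maps = fopen("/proc/self/maps", "r");
		FILE *max  = fopen("/proc/sys/vm/max_map_count", "r");

		if (maps) {
			while ((c = fgetc(maps)) != EOF)
				if (c == '\n')
					nmaps++;
			fclose(maps);
		}
		if (max) {
			fscanf(max, "%ld", &limit);
			fclose(max);
		}
		printf("%ld map areas in use, limit %ld\n", nmaps, limit);
		return 0;
	}
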
==============================================================

min_free_kbytes:

This is used to force the Linux VM to keep a minimum number
of kilobytes free.  The VM uses this number to compute a pages_min
value for each lowmem zone in the system.  Each lowmem zone gets
a number of reserved free pages based proportionally on its size.

Some minimal amount of memory is needed to satisfy PF_MEMALLOC
allocations; if you set this to lower than 1024KB, your system will
become subtly broken, and prone to deadlock under high loads.

Setting this too high will OOM your machine instantly.

==============================================================

percpu_pagelist_fraction

This is the fraction of pages in each zone that may, at most, be kept on
each per cpu page list (the high mark, pcp->high).  The minimum value is 8,
which means that no more than 1/8th of the pages in each zone may be held
in any single per_cpu_pagelist.  This entry only changes the value of the
hot per cpu pagelists.  A user can specify a number like 100 to allocate
1/100th of each zone to each per cpu page list.

The batch value of each per cpu pagelist is also updated as a result.  It is
set to pcp->high/4.  The upper limit of batch is (PAGE_SHIFT * 8).

The initial value is zero.  The kernel does not use this value at boot time
to set the high water marks for each per cpu page list.

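As a worked illustration (a sketch of the arithmetic described above, not
kernel code; the additional cap on batch is omitted and the zone size is a
made-up example):

	/* Sketch: derive pcp->high and batch from a zone size and fraction. */
	#include <stdio.h>

	int main(void)
	{
		unsigned long zone_pages = 262144;	/* 1 GiB of 4 KiB pages */
		unsigned long fraction   = 8;		/* the minimum allowed  */
		unsigned long high  = zone_pages / fraction;
		unsigned long batch = high / 4;		/* upper limit ignored here */

		printf("pcp->high = %lu pages, batch = %lu pages\n", high, batch);
		return 0;
	}
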
===============================================================

zone_reclaim_mode:

zone_reclaim_mode allows an administrator to set more or less aggressive
approaches to reclaim memory when a zone runs out of memory. If it is set
to zero then no zone reclaim occurs. Allocations will be satisfied from
other zones / nodes in the system.

This value is a bitmask; the following values may be ORed together:

1	= Zone reclaim on
2	= Zone reclaim writes dirty pages out
4	= Zone reclaim swaps pages

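For example (a sketch; writing the sysctl requires root, and the value 3
below is simply the OR of the first two flags):

	/* Sketch: turn zone reclaim on and allow it to write out dirty pages. */
	#include <stdio.h>

	int main(void)
	{
		int mode = 1 | 2;	/* reclaim on | write dirty pages = 3 */
		FILE *f = fopen("/proc/sys/vm/zone_reclaim_mode", "w");

		if (!f) {
			perror("zone_reclaim_mode");
			return 1;
		}
		fprintf(f, "%d\n", mode);
		fclose(f);
		return 0;
	}
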
zone_reclaim_mode is set during bootup to 1 if it is determined that pages
from remote zones will cause a measurable performance reduction. The
page allocator will then reclaim easily reusable pages (those page
cache pages that are currently not used) before allocating off node pages.

It may be beneficial to switch off zone reclaim if the system is
used for a file server and all of memory should be used for caching files
from disk. In that case the caching effect is more important than
data locality.

Allowing zone reclaim to write out pages stops processes that are
writing large amounts of data from dirtying pages on other nodes. Zone
reclaim will write out dirty pages if a zone fills up, and so effectively
throttles the process. This may decrease the performance of a single process
since it cannot use all of system memory to buffer the outgoing writes
anymore, but it preserves the memory on other nodes so that the performance
of other processes running on other nodes will not be affected.

Allowing regular swap effectively restricts allocations to the local
node unless explicitly overridden by memory policies or cpuset
configurations.

=============================================================

min_unmapped_ratio:

This is available only on NUMA kernels.

A percentage of the total pages in each zone.  Zone reclaim will only
occur if more than this percentage of pages are file backed and unmapped.
This is to ensure that a minimal amount of local pages is still available
for file I/O even if the node is overallocated.

The default is 1 percent.

=============================================================

min_slab_ratio:

This is available only on NUMA kernels.

A percentage of the total pages in each zone.  During zone reclaim
(i.e. when fallback from the local zone occurs), slabs will be reclaimed
if more than this percentage of pages in a zone are reclaimable slab pages.
This ensures that slab growth stays under control even in NUMA
systems that rarely perform global reclaim.

The default is 5 percent.

Note that slab reclaim is triggered in a per zone / node fashion.
The process of reclaiming slab memory is currently not node specific
and may not be fast.

=============================================================

panic_on_oom

This enables or disables the panic-on-out-of-memory feature.

If this is set to 0, the kernel will kill some rogue process via the
OOM killer.  Usually, the OOM killer is able to kill a rogue process
and the system will survive.

If this is set to 1, the kernel panics when out-of-memory happens.
However, if a process limits its allocations to certain nodes by using
mempolicies or cpusets, and those nodes reach memory exhaustion, one
process may be killed by the OOM killer.  No panic occurs in this case,
because memory on other nodes may still be free and the system as a
whole may not yet be in a fatal state.

If this is set to 2, the kernel always panics when an out-of-memory
condition occurs, even in the situation described above.

The default value is 0.
Values 1 and 2 are intended for failover in clustered setups; choose
the one that matches your failover policy.

=============================================================

oom_dump_tasks

Enables a system-wide task dump (excluding kernel threads) to be
produced when the kernel performs an OOM-killing and includes such
information as pid, uid, tgid, vm size, rss, cpu, oom_adj score, and
name.  This is helpful to determine why the OOM killer was invoked
and to identify the rogue task that caused it.

If this is set to zero, this information is suppressed.  On very
large systems with thousands of tasks it may not be feasible to dump
the memory state information for each one.  Such systems should not
be forced to incur a performance penalty in OOM conditions when the
information may not be desired.

If this is set to non-zero, this information is shown whenever the
OOM killer actually kills a memory-hogging task.

The default value is 0.

=============================================================

oom_kill_allocating_task

This enables or disables killing the OOM-triggering task in
out-of-memory situations.

If this is set to zero, the OOM killer will scan through the entire
tasklist and select a task based on heuristics to kill.  This normally
selects a rogue memory-hogging task that frees up a large amount of
memory when killed.

If this is set to non-zero, the OOM killer simply kills the task that
triggered the out-of-memory condition.  This avoids the expensive
tasklist scan.

If panic_on_oom is selected, it takes precedence over whatever value
is used in oom_kill_allocating_task.

The default value is 0.

==============================================================

mmap_min_addr

This file indicates the amount of address space which a user process will
be restricted from mmapping.  Since kernel null dereference bugs could
accidentally operate based on the information in the first couple of pages
of memory, userspace processes should not be allowed to write to them.  By
default this value is set to 0 and no protections will be enforced by the
security module.  Setting this value to something like 64k will allow the
vast majority of applications to work correctly and provide defense in depth
against future potential kernel bugs.

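As a sketch of the effect (not from the kernel sources; with mmap_min_addr
raised, e.g. to 65536, an unprivileged attempt to map page zero is expected
to be refused, though the exact errno depends on the security module in use):

	/* Sketch: try to map page zero; expected to fail when protected. */
	#include <stdio.h>
	#include <sys/mman.h>

	int main(void)
	{
		void *p = mmap((void *)0, 4096, PROT_READ | PROT_WRITE,
			       MAP_PRIVATE | MAP_ANONYMOUS | MAP_FIXED, -1, 0);

		if (p == MAP_FAILED)
			perror("mmap of page zero");	/* expected when protected */
		else
			printf("page zero mapped at %p\n", p);
		return 0;
	}
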
==============================================================

numa_zonelist_order

This sysctl is only for NUMA.
Where memory is allocated from is controlled by zonelists.
(For simplicity, this explanation ignores ZONE_HIGHMEM/ZONE_DMA32;
you can read ZONE_DMA as ZONE_DMA32 where appropriate.)

In the non-NUMA case, the zonelist for GFP_KERNEL is ordered as follows:
ZONE_NORMAL -> ZONE_DMA
This means that a memory allocation request for GFP_KERNEL will
get memory from ZONE_DMA only when ZONE_NORMAL is not available.

In the NUMA case, you can think of the following two types of order.
Assume a 2-node NUMA system; the zonelist for Node(0)'s GFP_KERNEL
allocations could be either:

(A) Node(0) ZONE_NORMAL -> Node(0) ZONE_DMA -> Node(1) ZONE_NORMAL
(B) Node(0) ZONE_NORMAL -> Node(1) ZONE_NORMAL -> Node(0) ZONE_DMA

Type (A) offers the best locality for processes on Node(0), but ZONE_DMA
will be used before ZONE_NORMAL exhaustion. This increases the possibility
of out-of-memory (OOM) in ZONE_DMA, because ZONE_DMA tends to be small.

Type (B) cannot offer the best locality but is more robust against OOM of
the DMA zone.

Type (A) is called "Node" order. Type (B) is "Zone" order.

"Node" order orders the zonelists by node, then by zone within each node.
Specify "[Nn]ode" for node order.

"Zone" order orders the zonelists by zone type, then by node within each
zone.  Specify "[Zz]one" for zone order.

Specify "[Dd]efault" to request automatic configuration.  Autoconfiguration
will select "node" order in the following cases:
(1) if the DMA zone does not exist, or
(2) if the DMA zone comprises greater than 50% of the available memory, or
(3) if any node's DMA zone comprises greater than 60% of its local memory and
the amount of local memory is big enough.

Otherwise, "zone" order will be selected. The default order is recommended
unless this is causing problems for your system/application.

==============================================================

nr_hugepages

Change the minimum size of the hugepage pool.

See Documentation/vm/hugetlbpage.txt

==============================================================

nr_overcommit_hugepages

Change the maximum size of the hugepage pool. The maximum is
nr_hugepages + nr_overcommit_hugepages.

See Documentation/vm/hugetlbpage.txt