				CPUSETS
				-------

Copyright (C) 2004 BULL SA.
Written by Simon.Derr@bull.net

Portions Copyright (c) 2004-2006 Silicon Graphics, Inc.
Modified by Paul Jackson <pj@sgi.com>
Modified by Christoph Lameter <clameter@sgi.com>
Modified by Paul Menage <menage@google.com>
Modified by Hidetoshi Seto <seto.hidetoshi@jp.fujitsu.com>

CONTENTS:
=========

1. Cpusets
  1.1 What are cpusets ?
  1.2 Why are cpusets needed ?
  1.3 How are cpusets implemented ?
  1.4 What are exclusive cpusets ?
  1.5 What is memory_pressure ?
  1.6 What is memory spread ?
  1.7 What is sched_load_balance ?
  1.8 What is sched_relax_domain_level ?
  1.9 How do I use cpusets ?
2. Usage Examples and Syntax
  2.1 Basic Usage
  2.2 Adding/removing cpus
  2.3 Setting flags
  2.4 Attaching processes
3. Questions
4. Contact

1. Cpusets
==========

1.1 What are cpusets ?
----------------------

Cpusets provide a mechanism for assigning a set of CPUs and Memory
Nodes to a set of tasks.  In this document "Memory Node" refers to
an on-line node that contains memory.

Cpusets constrain the CPU and Memory placement of tasks to only
the resources within a task's current cpuset.  They form a nested
hierarchy visible in a virtual file system.  These are the essential
hooks, beyond what is already present, required to manage dynamic
job placement on large systems.

Cpusets use the generic cgroup subsystem described in
Documentation/cgroups/cgroups.txt.

Requests by a task, using the sched_setaffinity(2) system call to
include CPUs in its CPU affinity mask, and using the mbind(2) and
set_mempolicy(2) system calls to include Memory Nodes in its memory
policy, are both filtered through that task's cpuset, removing any
CPUs or Memory Nodes not in that cpuset.  The scheduler will not
schedule a task on a CPU that is not allowed in its cpus_allowed
vector, and the kernel page allocator will not allocate a page on a
node that is not allowed in the requesting task's mems_allowed vector.
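
For example, the effect of this filtering can be seen from the shell
with the util-linux taskset utility.  This is an illustrative sketch,
assuming a shell already confined to a cpuset that allows only CPUs
2-3 (the pid and output shown are made up):

  taskset -pc 0-7 $$
  # the request named CPUs 0-7, but the kernel filtered the mask
  # through the cpuset; taskset would report something like:
  #   pid 1234's new affinity list: 2,3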

User level code may create and destroy cpusets by name in the cgroup
virtual file system, manage the attributes and permissions of these
cpusets and which CPUs and Memory Nodes are assigned to each cpuset,
specify and query to which cpuset a task is assigned, and list the
task pids assigned to a cpuset.


1.2 Why are cpusets needed ?
----------------------------

The management of large computer systems, with many processors (CPUs),
complex memory cache hierarchies and multiple Memory Nodes having
non-uniform access times (NUMA) presents additional challenges for
the efficient scheduling and memory placement of processes.

Frequently, more modest sized systems can be operated with adequate
efficiency just by letting the operating system automatically share
the available CPU and Memory resources amongst the requesting tasks.

But larger systems, which benefit more from careful processor and
memory placement to reduce memory access times and contention,
and which typically represent a larger investment for the customer,
can benefit from explicitly placing jobs on properly sized subsets of
the system.

This can be especially valuable on:

    * Web Servers running multiple instances of the same web application,
    * Servers running different applications (for instance, a web server
      and a database), or
    * NUMA systems running large HPC applications with demanding
      performance characteristics.

These subsets, or "soft partitions", must be able to be dynamically
adjusted, as the job mix changes, without impacting other concurrently
executing jobs.  The pages of running jobs may also be relocated when
the jobs' allowed memory locations are changed.

The kernel cpuset patch provides the minimum essential kernel
mechanisms required to efficiently implement such subsets.  It
leverages existing CPU and Memory Placement facilities in the Linux
kernel to avoid any additional impact on the critical scheduler or
memory allocator code.


1.3 How are cpusets implemented ?
---------------------------------

Cpusets provide a Linux kernel mechanism to constrain which CPUs and
Memory Nodes are used by a process or set of processes.

The Linux kernel already has a pair of mechanisms to specify on which
CPUs a task may be scheduled (sched_setaffinity) and on which Memory
Nodes it may obtain memory (mbind, set_mempolicy).

Cpusets extend these two mechanisms as follows:

 - Cpusets are sets of allowed CPUs and Memory Nodes, known to the
   kernel.
 - Each task in the system is attached to a cpuset, via a pointer
   in the task structure to a reference counted cgroup structure.
 - Calls to sched_setaffinity are filtered to just those CPUs
   allowed in that task's cpuset.
 - Calls to mbind and set_mempolicy are filtered to just
   those Memory Nodes allowed in that task's cpuset.
 - The root cpuset contains all the system's CPUs and Memory
   Nodes.
 - For any cpuset, one can define child cpusets containing a subset
   of the parent's CPU and Memory Node resources.
 - The hierarchy of cpusets can be mounted at /dev/cpuset, for
   browsing and manipulation from user space.
 - A cpuset may be marked exclusive, which ensures that no other
   cpuset (except direct ancestors and descendants) may contain
   any overlapping CPUs or Memory Nodes.
 - You can list all the tasks (by pid) attached to any cpuset.

The implementation of cpusets requires a few simple hooks
into the rest of the kernel, none in performance critical paths:

 - in init/main.c, to initialize the root cpuset at system boot.
 - in fork and exit, to attach and detach a task from its cpuset.
 - in sched_setaffinity, to mask the requested CPUs by what's
   allowed in that task's cpuset.
 - in sched.c migrate_live_tasks(), to keep migrating tasks within
   the CPUs allowed by their cpuset, if possible.
 - in the mbind and set_mempolicy system calls, to mask the requested
   Memory Nodes by what's allowed in that task's cpuset.
 - in page_alloc.c, to restrict memory to allowed nodes.
 - in vmscan.c, to restrict page reclaim to the current cpuset.

You should mount the "cgroup" filesystem type in order to enable
browsing and modifying the cpusets presently known to the kernel.  No
new system calls are added for cpusets - all support for querying and
modifying cpusets is via this cpuset file system.

The /proc/<pid>/status file for each task has four added lines,
displaying the task's cpus_allowed (on which CPUs it may be scheduled)
and mems_allowed (on which Memory Nodes it may obtain memory),
in the two formats seen in the following example:

  Cpus_allowed:   ffffffff,ffffffff,ffffffff,ffffffff
  Cpus_allowed_list:      0-127
  Mems_allowed:   ffffffff,ffffffff
  Mems_allowed_list:      0-63
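
One convenient way to view these four fields for the current shell is
(the grep pattern is merely illustrative):

  grep -E 'Cpus_allowed|Mems_allowed' /proc/$$/status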

Each cpuset is represented by a directory in the cgroup file system
containing (on top of the standard cgroup files) the following
files describing that cpuset:

 - cpus: list of CPUs in that cpuset
 - mems: list of Memory Nodes in that cpuset
 - memory_migrate flag: if set, move pages to the cpuset's nodes
 - cpu_exclusive flag: is cpu placement exclusive?
 - mem_exclusive flag: is memory placement exclusive?
 - mem_hardwall flag:  is memory allocation hardwalled?
 - memory_pressure: measure of how much paging pressure in cpuset
 - memory_spread_page flag: if set, spread page cache evenly on allowed nodes
 - memory_spread_slab flag: if set, spread slab cache evenly on allowed nodes
 - sched_load_balance flag: if set, load balance among the CPUs in that cpuset
 - sched_relax_domain_level: the searching range when migrating tasks

In addition, only the root cpuset has the following file:
 - memory_pressure_enabled flag: compute memory_pressure?

New cpusets are created using the mkdir system call or shell
command.  The properties of a cpuset, such as its flags, allowed
CPUs and Memory Nodes, and attached tasks, are modified by writing
to the appropriate file in that cpuset's directory, as listed above.

The named hierarchical structure of nested cpusets allows partitioning
a large system into nested, dynamically changeable, "soft-partitions".

Each task's attachment to a cpuset, automatically inherited at fork
by any children of that task, allows organizing the work load on a
system into related sets of tasks such that each set is constrained
to using the CPUs and Memory Nodes of a particular cpuset.  A task
may be re-attached to any other cpuset, if allowed by the permissions
on the necessary cpuset file system directories.

Such management of a system "in the large" integrates smoothly with
the detailed placement done on individual tasks and memory regions
using the sched_setaffinity, mbind and set_mempolicy system calls.

The following rules apply to each cpuset:

 - Its CPUs and Memory Nodes must be a subset of its parent's.
 - It can't be marked exclusive unless its parent is.
 - If its cpu or memory is exclusive, they may not overlap any sibling.

These rules, and the natural hierarchy of cpusets, enable efficient
enforcement of the exclusive guarantee, without having to scan all
cpusets every time any of them change to ensure nothing overlaps an
exclusive cpuset.  Also, the use of a Linux virtual file system (vfs)
to represent the cpuset hierarchy provides for a familiar permission
and name space for cpusets, with a minimum of additional kernel code.

The cpus and mems files in the root (top_cpuset) cpuset are
read-only.  The cpus file automatically tracks the value of
cpu_online_map using a CPU hotplug notifier, and the mems file
automatically tracks the value of node_states[N_HIGH_MEMORY]--i.e.,
nodes with memory--using the cpuset_track_online_nodes() hook.
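
For example, after mounting the hierarchy as described in section 1.9,
the root cpuset simply mirrors the online resources (the values shown
are illustrative and depend on the machine):

  cat /dev/cpuset/cpus
  # 0-127
  cat /dev/cpuset/mems
  # 0-63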


1.4 What are exclusive cpusets ?
--------------------------------

If a cpuset is cpu or mem exclusive, no other cpuset, other than
a direct ancestor or descendant, may share any of the same CPUs or
Memory Nodes.

A cpuset that is mem_exclusive *or* mem_hardwall is "hardwalled",
i.e. it restricts kernel allocations for page, buffer and other data
commonly shared by the kernel across multiple users.  All cpusets,
whether hardwalled or not, restrict allocations of memory for user
space.  This enables configuring a system so that several independent
jobs can share common kernel data, such as file system pages, while
isolating each job's user allocation in its own cpuset.  To do this,
construct a large mem_exclusive cpuset to hold all the jobs, and
construct child, non-mem_exclusive cpusets for each individual job.
Only a small amount of typical kernel memory, such as requests from
interrupt handlers, is allowed to be taken outside even a
mem_exclusive cpuset.
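
A minimal sketch of that arrangement, using the mount point from
section 1.9 (the cpuset names and resource values are hypothetical):

  cd /dev/cpuset
  mkdir jobs
  /bin/echo 0-63 > jobs/cpus
  /bin/echo 0-3 > jobs/mems
  /bin/echo 1 > jobs/mem_exclusive
  # each job gets a non-mem_exclusive child; kernel data such as
  # file system pages may still be shared across the jobs
  mkdir jobs/job1
  /bin/echo 0-31 > jobs/job1/cpus
  /bin/echo 0-1 > jobs/job1/mems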


1.5 What is memory_pressure ?
-----------------------------
The memory_pressure of a cpuset provides a simple per-cpuset metric
of the rate that the tasks in a cpuset are attempting to free up
in-use memory on the nodes of the cpuset to satisfy additional memory
requests.

This enables batch managers monitoring jobs running in dedicated
cpusets to efficiently detect what level of memory pressure that job
is causing.

This is useful both on tightly managed systems running a wide mix of
submitted jobs, which may choose to terminate or re-prioritize jobs that
are trying to use more memory than allowed on the nodes assigned to them,
and for tightly coupled, long running, massively parallel scientific
computing jobs that will dramatically fail to meet required performance
goals if they start to use more memory than allowed to them.

This mechanism provides a very economical way for the batch manager
to monitor a cpuset for signs of memory pressure.  It's up to the
batch manager or other user code to decide what to do about it and
take action.

==> Unless this feature is enabled by writing "1" to the special file
    /dev/cpuset/memory_pressure_enabled, the hook in the rebalance
    code of __alloc_pages() for this metric reduces to simply noticing
    that the cpuset_memory_pressure_enabled flag is zero.  So only
    systems that enable this feature will compute the metric.
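
For example, a batch manager might enable and poll the metric like
this (a sketch; 'Charlie' is the cpuset created in the example in
section 1.9):

  /bin/echo 1 > /dev/cpuset/memory_pressure_enabled
  cat /dev/cpuset/Charlie/memory_pressure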

Why a per-cpuset, running average:

    Because this meter is per-cpuset, rather than per-task or mm,
    the system load imposed by a batch scheduler monitoring this
    metric is sharply reduced on large systems, because a scan of
    the tasklist can be avoided on each set of queries.

    Because this meter is a running average, instead of an accumulating
    counter, a batch scheduler can detect memory pressure with a
    single read, instead of having to read and accumulate results
    for a period of time.

    Because this meter is per-cpuset rather than per-task or mm,
    the batch scheduler can obtain the key information, memory
    pressure in a cpuset, with a single read, rather than having to
    query and accumulate results over all the (dynamically changing)
    set of tasks in the cpuset.

A per-cpuset simple digital filter (requires a spinlock and 3 words
of data per cpuset) is kept, and updated by any task attached to that
cpuset, if it enters the synchronous (direct) page reclaim code.

A per-cpuset file provides an integer number representing the recent
(half-life of 10 seconds) rate of direct page reclaims caused by
the tasks in the cpuset, in units of reclaims attempted per second,
times 1000.
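
So, to give an illustrative reading: a memory_pressure value of 2500
would mean that the tasks in that cpuset have recently been attempting
direct page reclaim at a rate of about 2.5 reclaims per second.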


1.6 What is memory spread ?
---------------------------
There are two boolean flag files per cpuset that control where the
kernel allocates pages for the file system buffers and related
in-kernel data structures.  They are called 'memory_spread_page' and
'memory_spread_slab'.

If the per-cpuset boolean flag file 'memory_spread_page' is set, then
the kernel will spread the file system buffers (page cache) evenly
over all the nodes that the faulting task is allowed to use, instead
of preferring to put those pages on the node where the task is running.

If the per-cpuset boolean flag file 'memory_spread_slab' is set,
then the kernel will spread some file system related slab caches,
such as those for inodes and dentries, evenly over all the nodes that
the faulting task is allowed to use, instead of preferring to put those
pages on the node where the task is running.

The setting of these flags does not affect the anonymous data segment
or stack segment pages of a task.

By default, both kinds of memory spreading are off, and memory
pages are allocated on the node local to where the task is running,
except perhaps as modified by the task's NUMA mempolicy or cpuset
configuration, so long as sufficient free memory pages are available.

When new cpusets are created, they inherit the memory spread settings
of their parent.

Setting memory spreading causes allocations for the affected page
or slab caches to ignore the task's NUMA mempolicy and be spread
instead.  Tasks using mbind() or set_mempolicy() calls to set NUMA
mempolicies will not notice any change in these calls as a result of
their containing cpuset's memory spread settings.  If memory spreading
is turned off, then the currently specified NUMA mempolicy once again
applies to memory page allocations.

Both 'memory_spread_page' and 'memory_spread_slab' are boolean flag
files.  By default they contain "0", meaning that the feature is off
for that cpuset.  If a "1" is written to that file, then that turns
the named feature on.
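
For example, to turn both kinds of spreading on for the cpuset
'Charlie' from the example in section 1.9:

  /bin/echo 1 > /dev/cpuset/Charlie/memory_spread_page
  /bin/echo 1 > /dev/cpuset/Charlie/memory_spread_slab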

The implementation is simple.

Setting the flag 'memory_spread_page' turns on a per-process flag
PF_SPREAD_PAGE for each task that is in that cpuset or subsequently
joins that cpuset.  The page allocation calls for the page cache
are modified to perform an inline check for this PF_SPREAD_PAGE task
flag, and if set, a call to a new routine cpuset_mem_spread_node()
returns the node to prefer for the allocation.

Similarly, setting 'memory_spread_slab' turns on the flag
PF_SPREAD_SLAB, and appropriately marked slab caches will allocate
pages from the node returned by cpuset_mem_spread_node().

The cpuset_mem_spread_node() routine is also simple.  It uses the
value of a per-task rotor cpuset_mem_spread_rotor to select the next
node in the current task's mems_allowed to prefer for the allocation.

This memory placement policy is also known (in other contexts) as
round-robin or interleave.

This policy can provide substantial improvements for jobs that need
to place thread local data on the corresponding node, but that must
access large file system data sets that have to be spread across the
several nodes in the job's cpuset in order to fit.  Without this
policy, especially for jobs that might have one thread reading in the
data set, the memory allocation across the nodes in the job's cpuset
can become very uneven.

1.7 What is sched_load_balance ?
--------------------------------

The kernel scheduler (kernel/sched.c) automatically load balances
tasks.  If one CPU is underutilized, kernel code running on that
CPU will look for tasks on other more overloaded CPUs and move those
tasks to itself, within the constraints of such placement mechanisms
as cpusets and sched_setaffinity.

The algorithmic cost of load balancing and its impact on key shared
kernel data structures such as the task list increases more than
linearly with the number of CPUs being balanced.  So the scheduler
has support to partition the system's CPUs into a number of sched
domains such that it only load balances within each sched domain.
Each sched domain covers some subset of the CPUs in the system;
no two sched domains overlap; some CPUs might not be in any sched
domain and hence won't be load balanced.

Put simply, it costs less to balance between two smaller sched domains
than one big one, but doing so means that overloads in one of the
two domains won't be load balanced to the other one.

By default, there is one sched domain covering all CPUs, except those
marked isolated using the kernel boot time "isolcpus=" argument.

This default load balancing across all CPUs is not well suited for
the following two situations:
 1) On large systems, load balancing across many CPUs is expensive.
    If the system is managed using cpusets to place independent jobs
    on separate sets of CPUs, full load balancing is unnecessary.
 2) Systems supporting realtime on some CPUs need to minimize
    system overhead on those CPUs, including avoiding task load
    balancing if that is not needed.

When the per-cpuset flag "sched_load_balance" is enabled (the default
setting), it requests that all the CPUs in that cpuset's allowed 'cpus'
be contained in a single sched domain, ensuring that load balancing
can move a task (not otherwise pinned, as by sched_setaffinity)
from any CPU in that cpuset to any other.

When the per-cpuset flag "sched_load_balance" is disabled, then the
scheduler will avoid load balancing across the CPUs in that cpuset,
--except-- in so far as is necessary because some overlapping cpuset
has "sched_load_balance" enabled.

So, for example, if the top cpuset has the flag "sched_load_balance"
enabled, then the scheduler will have one sched domain covering all
CPUs, and the setting of the "sched_load_balance" flag in any other
cpusets won't matter, as we're already fully load balancing.

Therefore in the above two situations, the top cpuset flag
"sched_load_balance" should be disabled, and only some of the smaller,
child cpusets should have this flag enabled.
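
A minimal sketch of that configuration, using the mount point from
section 1.9 ('batch1' is a hypothetical child cpuset whose 'cpus'
have already been set):

  # turn off load balancing across the whole machine
  /bin/echo 0 > /dev/cpuset/sched_load_balance
  # balance only within this job's CPUs
  /bin/echo 1 > /dev/cpuset/batch1/sched_load_balance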

When doing this, you don't usually want to leave any unpinned tasks in
the top cpuset that might use non-trivial amounts of CPU, as such tasks
may be artificially constrained to some subset of CPUs, depending on
the particulars of this flag setting in descendant cpusets.  Even if
such a task could use spare CPU cycles in some other CPUs, the kernel
scheduler might not consider the possibility of load balancing that
task to that underused CPU.

Of course, tasks pinned to a particular CPU can be left in a cpuset
that disables "sched_load_balance" as those tasks aren't going anywhere
else anyway.

There is an impedance mismatch here, between cpusets and sched domains.
Cpusets are hierarchical and nest.  Sched domains are flat; they don't
overlap and each CPU is in at most one sched domain.

It is necessary for sched domains to be flat because load balancing
across partially overlapping sets of CPUs would risk unstable dynamics
that would be beyond our understanding.  So if each of two partially
overlapping cpusets enables the flag 'sched_load_balance', then we
form a single sched domain that is a superset of both.  We won't move
a task to a CPU outside its cpuset, but the scheduler load balancing
code might waste some compute cycles considering that possibility.

This mismatch is why there is not a simple one-to-one relation
between which cpusets have the flag "sched_load_balance" enabled,
and the sched domain configuration.  If a cpuset enables the flag, it
will get balancing across all its CPUs, but if it disables the flag,
it will only be assured of no load balancing if no other overlapping
cpuset enables the flag.

If two cpusets have partially overlapping 'cpus' allowed, and only
one of them has this flag enabled, then the other may find its
tasks only partially load balanced, just on the overlapping CPUs.
This is just the general case of the top_cpuset example given a few
paragraphs above.  In the general case, as in the top cpuset case,
don't leave tasks that might use non-trivial amounts of CPU in
such partially load balanced cpusets, as they may be artificially
constrained to some subset of the CPUs allowed to them, for lack of
load balancing to the other CPUs.

1.7.1 sched_load_balance implementation details.
------------------------------------------------

The per-cpuset flag 'sched_load_balance' defaults to enabled (contrary
to most cpuset flags).  When enabled for a cpuset, the kernel will
ensure that it can load balance across all the CPUs in that cpuset
(it makes sure that all the CPUs in the cpus_allowed of that cpuset
are in the same sched domain).

If two overlapping cpusets both have 'sched_load_balance' enabled,
then they will be (must be) both in the same sched domain.

If, as is the default, the top cpuset has 'sched_load_balance' enabled,
then by the above that means there is a single sched domain covering
the whole system, regardless of any other cpuset settings.

The kernel commits to user space that it will avoid load balancing
where it can.  It will pick as fine-grained a partition of sched
domains as it can while still providing load balancing for any set
of CPUs allowed to a cpuset having 'sched_load_balance' enabled.

The internal kernel cpuset to scheduler interface passes from the
cpuset code to the scheduler code a partition of the load balanced
CPUs in the system. This partition is a set of subsets (represented
as an array of struct cpumask) of CPUs, pairwise disjoint, that cover
all the CPUs that must be load balanced.

The cpuset code builds a new such partition and passes it to the
scheduler sched domain setup code, to have the sched domains rebuilt
as necessary, whenever:
 - the 'sched_load_balance' flag of a cpuset with non-empty CPUs changes,
 - or CPUs come or go from a cpuset with this flag enabled,
 - or the 'sched_relax_domain_level' value of a cpuset with non-empty CPUs
   and with this flag enabled changes,
 - or a cpuset with non-empty CPUs and with this flag enabled is removed,
 - or a cpu is offlined/onlined.

This partition exactly defines what sched domains the scheduler should
set up - one sched domain for each element (struct cpumask) in the
partition.

The scheduler remembers the currently active sched domain partitions.
When the scheduler routine partition_sched_domains() is invoked from
the cpuset code to update these sched domains, it compares the new
partition requested with the current, and updates its sched domains,
removing the old and adding the new, for each change.


1.8 What is sched_relax_domain_level ?
--------------------------------------

Within a sched domain, the scheduler migrates tasks in two ways:
periodic load balancing on the tick, and at the time of certain
scheduling events.

When a task is woken up, the scheduler tries to move it to an idle CPU.
For example, if a task A running on CPU X activates another task B
on the same CPU X, and if CPU Y is X's sibling and idle, then the
scheduler migrates task B to CPU Y so that task B can start on
CPU Y without waiting for task A on CPU X.

And when a CPU runs out of tasks in its runqueue, it tries to pull
extra tasks from other busy CPUs to help them before it goes idle.

Of course it takes some search cost to find movable tasks and/or
idle CPUs, so the scheduler might not search all CPUs in the domain
every time.  In fact, on some architectures, the search ranges on
these events are limited to the same socket or node where the CPU is
located, while the load balancing on the tick searches all CPUs.

For example, assume CPU Z is relatively far from CPU X.  Even if CPU Z
is idle while CPU X and its siblings are busy, the scheduler can't
migrate the woken task B from X to Z since Z is out of its search range.
As a result, task B on CPU X needs to wait for task A or for load
balancing on the next tick.  For some applications in special
situations, waiting one tick may be too long.

The 'sched_relax_domain_level' file allows you to request a change to
this search range.  The file takes an integer value which indicates
the size of the search range in levels, ideally as follows; otherwise
it holds the initial value -1, which indicates the cpuset has no request.

  -1  : no request. use system default or follow request of others.
   0  : no search.
   1  : search siblings (hyperthreads in a core).
   2  : search cores in a package.
   3  : search cpus in a node [= system wide on non-NUMA system]
 ( 4  : search nodes in a chunk of node [on NUMA system] )
 ( 5  : search system wide [on NUMA system] )
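
For example, to request that wakeup-time searching extend across all
cores in a package (level 2 in the table above; 'Charlie' as in the
example in section 1.9):

  /bin/echo 2 > /dev/cpuset/Charlie/sched_relax_domain_level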

The system default is architecture dependent.  The system default
can be changed using the relax_domain_level= boot parameter.

This file is per-cpuset and affects the sched domain to which the
cpuset belongs.  Therefore if the flag 'sched_load_balance' of a cpuset
is disabled, then 'sched_relax_domain_level' has no effect since
there is no sched domain belonging to the cpuset.

If multiple cpusets are overlapping and hence they form a single sched
domain, the largest value among them is used.  Be careful: if one
requests 0 and the others are -1, then 0 is used.

Note that modifying this file will have both good and bad effects,
and whether it is acceptable or not depends on your situation.
Don't modify this file if you are not sure.

If your situation is:
 - The migration costs between each CPU can be assumed to be quite
   small (for you) due to your special application's behavior or
   special hardware support for CPU caches etc.
 - The search cost doesn't have an impact (for you), or you can make
   the search cost small enough, e.g. by managing cpusets to be compact.
 - Low latency is required even if it sacrifices cache hit rate etc.
then increasing 'sched_relax_domain_level' would benefit you.


1.9 How do I use cpusets ?
--------------------------

In order to minimize the impact of cpusets on critical kernel
code, such as the scheduler, and due to the fact that the kernel
does not support one task updating the memory placement of another
task directly, the impact on a task of changing its cpuset CPU
or Memory Node placement, or of changing to which cpuset a task
is attached, is subtle.

If a cpuset has its Memory Nodes modified, then for each task attached
to that cpuset, the next time that the kernel attempts to allocate
a page of memory for that task, the kernel will notice the change
in the task's cpuset, and update its per-task memory placement to
remain within the new cpuset's memory placement.  If the task was using
mempolicy MPOL_BIND, and the nodes to which it was bound overlap with
its new cpuset, then the task will continue to use whatever subset
of MPOL_BIND nodes are still allowed in the new cpuset.  If the task
was using MPOL_BIND and now none of its MPOL_BIND nodes are allowed
in the new cpuset, then the task will be essentially treated as if it
was MPOL_BIND bound to the new cpuset (even though its NUMA placement,
as queried by get_mempolicy(), doesn't change).  If a task is moved
from one cpuset to another, then the kernel will adjust the task's
memory placement, as above, the next time that the kernel attempts
to allocate a page of memory for that task.

If a cpuset has its 'cpus' modified, then each task in that cpuset
will have its allowed CPU placement changed immediately.  Similarly,
if a task's pid is written to another cpuset's 'tasks' file, then its
allowed CPU placement is changed immediately.  If such a task had been
bound to some subset of its cpuset using the sched_setaffinity() call,
the task will be allowed to run on any CPU allowed in its new cpuset,
negating the effect of the prior sched_setaffinity() call.
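
That negation can be observed with the util-linux taskset utility
(a sketch; 'Charlie' is the cpuset from the example further below):

  taskset -pc 2 $$                      # pin the shell to CPU 2
  /bin/echo $$ > /dev/cpuset/Charlie/tasks
  taskset -pc $$                        # now reports all of Charlie's cpus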

In summary, the memory placement of a task whose cpuset is changed is
updated by the kernel, on the next allocation of a page for that task,
and the processor placement is updated immediately.

Normally, once a page is allocated (given a physical page
of main memory) then that page stays on whatever node it
was allocated, so long as it remains allocated, even if the
cpuset's memory placement policy 'mems' subsequently changes.
If the cpuset flag file 'memory_migrate' is set true, then when
tasks are attached to that cpuset, any pages that task had
allocated to it on nodes in its previous cpuset are migrated
to the task's new cpuset. The relative placement of the page within
the cpuset is preserved during these migration operations if possible.
For example if the page was on the second valid node of the prior cpuset
then the page will be placed on the second valid node of the new cpuset.

Also if 'memory_migrate' is set true, then if that cpuset's
'mems' file is modified, pages allocated to tasks in that
cpuset, that were on nodes in the previous setting of 'mems',
will be moved to nodes in the new setting of 'mems.'
Pages that were not in the task's prior cpuset, or in the cpuset's
prior 'mems' setting, will not be moved.
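
A minimal sketch of both behaviors ('Charlie' as in the example below;
the node numbers are illustrative):

  /bin/echo 1 > /dev/cpuset/Charlie/memory_migrate
  # tasks attached from now on have their pages migrated to Charlie's mems
  /bin/echo $$ > /dev/cpuset/Charlie/tasks
  # changing mems now also moves the pages of already-attached tasks
  /bin/echo 2 > /dev/cpuset/Charlie/mems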

There is an exception to the above.  If hotplug functionality is used
to remove all the CPUs that are currently assigned to a cpuset,
then all the tasks in that cpuset will be moved to the nearest ancestor
with non-empty cpus.  But the moving of some (or all) tasks might fail
if the cpuset is bound to another cgroup subsystem which has
restrictions on task attaching.  In this failing case, those tasks will
stay in the original cpuset, and the kernel will automatically update
their cpus_allowed to allow all online CPUs.  When memory hotplug
functionality for removing Memory Nodes is available, a similar exception
is expected to apply there as well.  In general, the kernel prefers to
violate cpuset placement, over starving a task that has had all
its allowed CPUs or Memory Nodes taken offline.

There is a second exception to the above.  GFP_ATOMIC requests are
kernel internal allocations that must be satisfied, immediately.
The kernel may drop some request, in rare cases even panic, if a
GFP_ATOMIC alloc fails.  If the request cannot be satisfied within
the current task's cpuset, then we relax the cpuset, and look for
memory anywhere we can find it.  It's better to violate the cpuset
than stress the kernel.

To start a new job that is to be contained within a cpuset, the steps are:

 1) mkdir /dev/cpuset
 2) mount -t cgroup -ocpuset cpuset /dev/cpuset
 3) Create the new cpuset by doing mkdir's and write's (or echo's) in
    the /dev/cpuset virtual file system.
 4) Start a task that will be the "founding father" of the new job.
 5) Attach that task to the new cpuset by writing its pid to the
    /dev/cpuset tasks file for that cpuset.
 6) fork, exec or clone the job tasks from this founding father task.

For example, the following sequence of commands will set up a cpuset
named "Charlie", containing just CPUs 2 and 3, and Memory Node 1,
and then start a subshell 'sh' in that cpuset:

  mount -t cgroup -ocpuset cpuset /dev/cpuset
  cd /dev/cpuset
  mkdir Charlie
  cd Charlie
  /bin/echo 2-3 > cpus
  /bin/echo 1 > mems
  /bin/echo $$ > tasks
  sh
  # The subshell 'sh' is now running in cpuset Charlie
  # The next line should display '/Charlie'
  cat /proc/self/cpuset

There are several ways to query or modify cpusets:
 - via the cpuset file system directly, using the various cd, mkdir, echo,
   cat, rmdir commands from the shell, or their equivalent from C.
 - via the C library libcpuset.
 - via the C library libcgroup.
   (http://sourceforge.net/projects/libcg/)
 - via the python application cset.
   (http://developer.novell.com/wiki/index.php/Cpuset)

The sched_setaffinity calls can also be done at the shell prompt using
SGI's runon or Robert Love's taskset.  The mbind and set_mempolicy
calls can be done at the shell prompt using the numactl command
(part of Andi Kleen's numa package).

2. Usage Examples and Syntax
============================

2.1 Basic Usage
---------------

Creating, modifying and using cpusets can be done through the cpuset
virtual filesystem.

To mount it, type:
# mount -t cgroup -o cpuset cpuset /dev/cpuset

Then under /dev/cpuset you can find a tree that corresponds to the
tree of the cpusets in the system. For instance, /dev/cpuset
is the cpuset that holds the whole system.

If you want to create a new cpuset under /dev/cpuset:
# cd /dev/cpuset
# mkdir my_cpuset

Now you want to do something with this cpuset.
# cd my_cpuset

In this directory you can find several files:
# ls
cpu_exclusive  memory_migrate      mems                      tasks
cpus           memory_pressure     notify_on_release
mem_exclusive  memory_spread_page  sched_load_balance
mem_hardwall   memory_spread_slab  sched_relax_domain_level

Reading them will give you information about the state of this cpuset:
the CPUs and Memory Nodes it can use, the processes that are using
it, its properties.  By writing to these files you can manipulate
the cpuset.

Set some flags:
# /bin/echo 1 > cpu_exclusive

Add some cpus:
# /bin/echo 0-7 > cpus

Add some mems:
# /bin/echo 0-7 > mems

Now attach your shell to this cpuset:
# /bin/echo $$ > tasks

You can also create cpusets inside your cpuset by using mkdir in this
directory.
# mkdir my_sub_cs

To remove a cpuset, just use rmdir:
# rmdir my_sub_cs
This will fail if the cpuset is in use (has cpusets inside, or has
processes attached).

Note that for legacy reasons, the "cpuset" filesystem exists as a
wrapper around the cgroup filesystem.

The command

mount -t cpuset X /dev/cpuset

is equivalent to

mount -t cgroup -ocpuset,noprefix X /dev/cpuset
echo "/sbin/cpuset_release_agent" > /dev/cpuset/release_agent

2.2 Adding/removing cpus
------------------------

This is the syntax to use when writing in the cpus or mems files
in cpuset directories:

# /bin/echo 1-4 > cpus		-> set cpus list to cpus 1,2,3,4
# /bin/echo 1,2,3,4 > cpus	-> set cpus list to cpus 1,2,3,4

To add a CPU to a cpuset, write the new list of CPUs including the
CPU to be added. To add 6 to the above cpuset:

# /bin/echo 1-4,6 > cpus	-> set cpus list to cpus 1,2,3,4,6

Similarly to remove a CPU from a cpuset, write the new list of CPUs
without the CPU to be removed.

To remove all the CPUs:

# /bin/echo "" > cpus		-> clear cpus list

2.3 Setting flags
-----------------

The syntax is very simple:

# /bin/echo 1 > cpu_exclusive 	-> set flag 'cpu_exclusive'
# /bin/echo 0 > cpu_exclusive 	-> unset flag 'cpu_exclusive'

2.4 Attaching processes
-----------------------

# /bin/echo PID > tasks

Note that it is PID, not PIDs. You can only attach ONE task at a time.
If you have several tasks to attach, you have to do it one after another:

# /bin/echo PID1 > tasks
# /bin/echo PID2 > tasks
	...
# /bin/echo PIDn > tasks


3. Questions
============

Q: what's up with this '/bin/echo' ?
A: bash's builtin 'echo' command does not check calls to write() against
   errors. If you use it in the cpuset file system, you won't be
   able to tell whether a command succeeded or failed.

Q: When I attach processes, only the first of the line gets really attached !
A: We can only return one error code per call to write(). So you should also
   put only ONE pid.

4. Contact
==========

Web: http://www.bullopensource.org/cpuset