Each CPU has a "base" scheduling domain (struct sched_domain). The domain
hierarchy is built from these base domains via the ->parent pointer. The
->parent chain MUST be NULL terminated, and domain structures should be
per-CPU as they are locklessly updated.

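As a rough, self-contained illustration of that shape (simplified field names
and a flat bitmask instead of cpumask_t; the real struct lives in
include/linux/sched.h):

  /* Simplified illustration only -- not the kernel's struct sched_domain. */
  struct example_domain {
      struct example_domain *parent;  /* NULL-terminated chain up the hierarchy */
      unsigned long span;             /* CPUs covered; cpumask_t in the kernel  */
      /* ... groups pointer, balancing intervals, statistics ...                */
  };

  /* One base domain per CPU, so lockless updates only touch CPU-local data. */
  static struct example_domain base_domain[64];   /* sized NR_CPUS in practice */
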
Each scheduling domain spans a number of CPUs (stored in the ->span field).
A domain's span MUST be a superset of its child's span (this restriction could
be relaxed if the need arises), and a base domain for CPU i MUST span at least
i. The top domain for each CPU will generally span all CPUs in the system,
although strictly it doesn't have to; if it doesn't, some CPUs may never be
given tasks to run unless their allowed-CPUs mask is explicitly set. A sched
domain's span means "balance process load among these CPUs".

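These invariants can be stated as a small self-contained check; the flat
bitmask here is just a stand-in for the kernel's cpumask type:

  #include <assert.h>

  /* One bit per CPU; illustration only. */
  typedef unsigned long cpumask_ex;

  static void check_span_invariants(cpumask_ex span, cpumask_ex child_span, int cpu)
  {
      /* A domain's span MUST be a superset of its child's span. */
      assert((span & child_span) == child_span);

      /* A base domain for CPU i MUST span at least i. */
      assert(span & (1UL << cpu));
  }
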
Each scheduling domain must have one or more CPU groups (struct sched_group)
which are organised as a circular, one-way linked list from the ->groups
pointer. The union of cpumasks of these groups MUST be the same as the
domain's span. The intersection of cpumasks from any two of these groups
MUST be the empty set. The group pointed to by the ->groups pointer MUST
contain the CPU to which the domain belongs. Groups may be shared among
CPUs as they contain read-only data once they have been set up.

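The same rules can be checked by walking a hypothetical circular group list;
struct example_group and the flat bitmask stand in for struct sched_group and
cpumask_t:

  #include <assert.h>

  struct example_group {
      struct example_group *next;   /* circular, one-way linked list */
      unsigned long cpumask;        /* CPUs belonging to this group  */
  };

  static void check_groups(struct example_group *groups, unsigned long span, int this_cpu)
  {
      unsigned long seen = 0;
      struct example_group *g = groups;

      /* The first group MUST contain the CPU the domain belongs to. */
      assert(groups->cpumask & (1UL << this_cpu));

      do {
          /* Any two groups MUST have disjoint cpumasks. */
          assert((seen & g->cpumask) == 0);
          seen |= g->cpumask;
          g = g->next;
      } while (g != groups);

      /* The union of all group cpumasks MUST equal the domain's span. */
      assert(seen == span);
  }
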
Balancing within a sched domain occurs between groups. That is, each group
is treated as one entity. The load of a group is defined as the sum of the
load of each of its member CPUs, and only when the load of a group becomes
out of balance are tasks moved between groups.

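In the simplest terms, a group's load could be computed along these lines
(illustration only; the kernel additionally weights and scales these values):

  /* Sum the load of every CPU in the group. */
  static unsigned long group_load(const unsigned long *cpu_load,
                                  unsigned long group_mask, int nr_cpus)
  {
      unsigned long load = 0;
      int cpu;

      for (cpu = 0; cpu < nr_cpus; cpu++)
          if (group_mask & (1UL << cpu))
              load += cpu_load[cpu];

      return load;
  }
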
In kernel/sched.c, trigger_load_balance() is run periodically on each CPU
through scheduler_tick(). It raises a softirq after the next regularly
scheduled rebalancing event for the current runqueue has arrived. The actual
load balancing workhorse, run_rebalance_domains()->rebalance_domains(), is
then run in softirq context (SCHED_SOFTIRQ).

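The core of that trigger looks roughly like this (nohz idle-balancing details
omitted; rq->next_balance holds the jiffy at which the next rebalance is due):

  /* Roughly what scheduler_tick() ends up doing on every CPU. */
  static inline void trigger_load_balance_sketch(struct rq *rq)
  {
      if (time_after_eq(jiffies, rq->next_balance))
          raise_softirq(SCHED_SOFTIRQ);   /* runs run_rebalance_domains() */
  }
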
The latter function takes two arguments: the current CPU and whether it was
idle at the time the scheduler_tick() happened, and iterates over all sched
domains our CPU is on, starting from its base domain and going up the ->parent
chain. While doing that, it checks to see if the current domain has exhausted
its rebalance interval. If so, it runs load_balance() on that domain. It then
checks the parent sched_domain (if it exists), the parent of the parent, and
so forth.

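Stripped of cost accounting and serialization, that walk up the hierarchy has
roughly the following shape; for_each_domain() starts at the CPU's base domain
and follows the ->parent chain:

  /* Rough sketch of the loop at the heart of rebalance_domains(). */
  static void rebalance_domains_sketch(int cpu, struct rq *rq, enum cpu_idle_type idle)
  {
      struct sched_domain *sd;
      unsigned long interval;
      int balance = 1;

      for_each_domain(cpu, sd) {
          interval = sd->balance_interval;    /* scaled while busy in the real code */

          if (time_after_eq(jiffies, sd->last_balance + interval)) {
              if (load_balance(cpu, rq, sd, idle, &balance))
                  idle = CPU_NOT_IDLE;        /* tasks were pulled, no longer idle */
              sd->last_balance = jiffies;
          }
      }
  }
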
Initially, load_balance() finds the busiest group in the current sched domain.
If it succeeds, it looks for the busiest runqueue of all the CPUs' runqueues in
that group. If it manages to find such a runqueue, it locks both our initial
CPU's runqueue and the newly found busiest one and starts moving tasks from it
to our runqueue. The exact number of tasks moved corresponds to the imbalance
previously computed while iterating over this sched domain's groups.

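An outline of those steps, with argument lists trimmed down; the *_ex()
helpers are simplified stand-ins for the real find_busiest_group(),
find_busiest_queue() and move_tasks():

  /* Simplified outline only -- real signatures and error handling differ. */
  static int load_balance_outline(int this_cpu, struct rq *this_rq,
                                  struct sched_domain *sd, enum cpu_idle_type idle)
  {
      struct sched_group *group;
      struct rq *busiest;
      unsigned long imbalance;
      int moved = 0;

      /* 1. Find the busiest group in this domain (also computes the imbalance). */
      group = find_busiest_group_ex(sd, this_cpu, &imbalance, idle);
      if (!group)
          return 0;

      /* 2. Find the busiest runqueue among that group's CPUs. */
      busiest = find_busiest_queue_ex(group, idle, imbalance);
      if (!busiest || busiest == this_rq)
          return 0;

      /* 3. Lock both runqueues and pull up to 'imbalance' worth of load over. */
      double_rq_lock(this_rq, busiest);
      moved = move_tasks_ex(this_rq, this_cpu, busiest, imbalance, sd, idle);
      double_rq_unlock(this_rq, busiest);

      return moved;
  }
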
*** Implementing sched domains ***
The "base" domain will "span" the first level of the hierarchy. In the case
of SMT, you'll span all siblings of the physical CPU, with each group being
a single virtual CPU.

In SMP, the parent of the base domain will span all physical CPUs in the
node, with each group being a single physical CPU. Then with NUMA, the parent
of the SMP domain will span the entire machine, with each group having the
cpumask of a node. Or you could do multi-level NUMA: Opteron, for example,
might have just one domain covering its one NUMA level.

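For a hypothetical 8-CPU box (2 NUMA nodes, 2 physical packages per node,
2 SMT siblings per package), CPU 0's view of the hierarchy would look like
this, with the spans written as plain bitmasks for brevity:

  /* Base (SMT) domain: CPU 0 and its sibling, one group per virtual CPU.      */
  unsigned long smt_span  = 0x03;   /* CPUs 0-1 */

  /* Parent (SMP) domain: every physical CPU in node 0, one group per package. */
  unsigned long smp_span  = 0x0f;   /* CPUs 0-3 */

  /* Grandparent (NUMA) domain: the whole machine, one group per node.         */
  unsigned long numa_span = 0xff;   /* CPUs 0-7 */
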
The implementor should read comments in include/linux/sched.h:
struct sched_domain fields, SD_FLAG_*, SD_*_INIT to get an idea of
the specifics and what to tune.

For SMT, the architecture must define CONFIG_SCHED_SMT and provide a
cpumask_t cpu_sibling_map[NR_CPUS], where cpu_sibling_map[i] is the mask of
all "i"'s siblings as well as "i" itself.

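A sketch for a hypothetical 4-CPU machine where CPUs {0,1} and {2,3} are
sibling pairs, using the old-style cpumask operations:

  cpumask_t cpu_sibling_map[NR_CPUS];

  static void __init init_sibling_map_example(void)
  {
      int i;

      for (i = 0; i < 4; i++) {
          cpus_clear(cpu_sibling_map[i]);
          cpu_set(i, cpu_sibling_map[i]);       /* "i" itself      */
          cpu_set(i ^ 1, cpu_sibling_map[i]);   /* its SMT sibling */
      }
  }
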
Architectures may override the default SD_*_INIT flags while still using the
generic domain builder in kernel/sched.c if they wish to retain the
traditional SMT->SMP->NUMA topology (or some subset of that). This can be
done by #define'ing ARCH_HAS_SCHED_TUNE.

Alternatively, the architecture may completely override the generic domain
builder by #define'ing ARCH_HAS_SCHED_DOMAIN and exporting your
arch_init_sched_domains function. This function will attach domains to all
CPUs using cpu_attach_domain.

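A sketch of that override; my_arch_domain is a hypothetical per-CPU variable,
and the exact argument list of cpu_attach_domain() has varied between kernel
versions:

  #define ARCH_HAS_SCHED_DOMAIN

  static DEFINE_PER_CPU(struct sched_domain, my_arch_domain);

  void __init arch_init_sched_domains(void)
  {
      int cpu;

      for_each_online_cpu(cpu) {
          struct sched_domain *sd = &per_cpu(my_arch_domain, cpu);

          /* Fill in sd->span, sd->parent, sd->groups and the tunables here,
           * then hand the finished hierarchy to the scheduler core.        */
          cpu_attach_domain(sd, cpu);
      }
  }
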
The sched-domains debugging infrastructure can be enabled by turning on
CONFIG_SCHED_DEBUG. This enables an error-checking parse of the sched domains
which should catch most possible errors (described above). It also prints out
the domain structure in a visual format.