| Steven Rostedt | a6537be | 2006-06-27 02:54:54 -0700 | [diff] [blame] | 1 | # | 
|  | 2 | # Copyright (c) 2006 Steven Rostedt | 
|  | 3 | # Licensed under the GNU Free Documentation License, Version 1.2 | 
|  | 4 | # | 
|  | 5 |  | 
|  | 6 | RT-mutex implementation design | 
|  | 7 | ------------------------------ | 
|  | 8 |  | 
|  | 9 | This document tries to describe the design of the rtmutex.c implementation. | 
|  | 10 | It doesn't describe the reasons why rtmutex.c exists. For that please see | 
Documentation/rt-mutex.txt.  This document does explain some of the problems
that happen without this code, but only as background for understanding
what the code actually is doing.
|  | 14 |  | 
|  | 15 | The goal of this document is to help others understand the priority | 
|  | 16 | inheritance (PI) algorithm that is used, as well as reasons for the | 
|  | 17 | decisions that were made to implement PI in the manner that was done. | 
|  | 18 |  | 
|  | 19 |  | 
|  | 20 | Unbounded Priority Inversion | 
|  | 21 | ---------------------------- | 
|  | 22 |  | 
|  | 23 | Priority inversion is when a lower priority process executes while a higher | 
|  | 24 | priority process wants to run.  This happens for several reasons, and | 
|  | 25 | most of the time it can't be helped.  Anytime a high priority process wants | 
|  | 26 | to use a resource that a lower priority process has (a mutex for example), | 
|  | 27 | the high priority process must wait until the lower priority process is done | 
|  | 28 | with the resource.  This is a priority inversion.  What we want to prevent | 
|  | 29 | is something called unbounded priority inversion.  That is when the high | 
|  | 30 | priority process is prevented from running by a lower priority process for | 
|  | 31 | an undetermined amount of time. | 
|  | 32 |  | 
The classic example of unbounded priority inversion is where you have three
|  | 34 | processes, let's call them processes A, B, and C, where A is the highest | 
|  | 35 | priority process, C is the lowest, and B is in between. A tries to grab a lock | 
|  | 36 | that C owns and must wait and lets C run to release the lock. But in the | 
|  | 37 | meantime, B executes, and since B is of a higher priority than C, it preempts C, | 
|  | 38 | but by doing so, it is in fact preempting A which is a higher priority process. | 
|  | 39 | Now there's no way of knowing how long A will be sleeping waiting for C | 
|  | 40 | to release the lock, because for all we know, B is a CPU hog and will | 
|  | 41 | never give C a chance to release the lock.  This is called unbounded priority | 
|  | 42 | inversion. | 
|  | 43 |  | 
|  | 44 | Here's a little ASCII art to show the problem. | 
|  | 45 |  | 
   grab lock L1 (owned by C)
     |
A ---+
        C preempted by B
          |
C    +----+

B         +-------->
                B now keeps A from running.
|  | 55 |  | 
|  | 56 |  | 
|  | 57 | Priority Inheritance (PI) | 
|  | 58 | ------------------------- | 
|  | 59 |  | 
|  | 60 | There are several ways to solve this issue, but other ways are out of scope | 
|  | 61 | for this document.  Here we only discuss PI. | 
|  | 62 |  | 
|  | 63 | PI is where a process inherits the priority of another process if the other | 
|  | 64 | process blocks on a lock owned by the current process.  To make this easier | 
|  | 65 | to understand, let's use the previous example, with processes A, B, and C again. | 
|  | 66 |  | 
|  | 67 | This time, when A blocks on the lock owned by C, C would inherit the priority | 
|  | 68 | of A.  So now if B becomes runnable, it would not preempt C, since C now has | 
|  | 69 | the high priority of A.  As soon as C releases the lock, it loses its | 
|  | 70 | inherited priority, and A then can continue with the resource that C had. | 
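
The A, B, C example can be sketched in C as follows.  This is only a toy
illustration and not the rtmutex.c code: struct toy_task, effective_prio()
and preempts() are names made up for this sketch, and it assumes that a
larger number means a higher priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical toy task: a larger "prio" means a higher priority here. */
struct toy_task {
	const char *name;
	int prio;				/* the task's own priority */
	const struct toy_task *blocked_waiter;	/* highest priority task blocked
						 * on a lock we own, or NULL */
};

/* With PI, an owner runs at the higher of its own and its waiter's priority. */
static int effective_prio(const struct toy_task *t)
{
	assert(t != NULL);
	if (t->blocked_waiter && t->blocked_waiter->prio > t->prio)
		return t->blocked_waiter->prio;
	return t->prio;
}

/* Returns nonzero if "b" would preempt "c" once inheritance is applied. */
static int preempts(const struct toy_task *b, const struct toy_task *c)
{
	return effective_prio(b) > effective_prio(c);
}
```

While A is recorded as blocked on C's lock, B no longer preempts C.  Once
the waiter is cleared (C released the lock), C is back to its own priority.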
|  | 71 |  | 
|  | 72 | Terminology | 
|  | 73 | ----------- | 
|  | 74 |  | 
|  | 75 | Here I explain some terminology that is used in this document to help describe | 
|  | 76 | the design that is used to implement PI. | 
|  | 77 |  | 
|  | 78 | PI chain - The PI chain is an ordered series of locks and processes that cause | 
|  | 79 | processes to inherit priorities from a previous process that is | 
|  | 80 | blocked on one of its locks.  This is described in more detail | 
|  | 81 | later in this document. | 
|  | 82 |  | 
mutex    - In this document, to differentiate between locks that implement
PI and spin locks that are used in the PI code, from now on
the PI locks will be called a mutex.
|  | 86 |  | 
|  | 87 | lock     - In this document from now on, I will use the term lock when | 
|  | 88 | referring to spin locks that are used to protect parts of the PI | 
algorithm.  These locks disable preemption on UP (when
CONFIG_PREEMPT is enabled) and on SMP prevent multiple CPUs from
entering critical sections simultaneously.
|  | 92 |  | 
|  | 93 | spin lock - Same as lock above. | 
|  | 94 |  | 
|  | 95 | waiter   - A waiter is a struct that is stored on the stack of a blocked | 
|  | 96 | process.  Since the scope of the waiter is within the code for | 
|  | 97 | a process being blocked on the mutex, it is fine to allocate | 
|  | 98 | the waiter on the process's stack (local variable).  This | 
|  | 99 | structure holds a pointer to the task, as well as the mutex that | 
|  | 100 | the task is blocked on.  It also has the plist node structures to | 
place the task in the wait_list of a mutex as well as the
|  | 102 | pi_list of a mutex owner task (described below). | 
|  | 103 |  | 
|  | 104 | waiter is sometimes used in reference to the task that is waiting | 
|  | 105 | on a mutex. This is the same as waiter->task. | 
|  | 106 |  | 
|  | 107 | waiters  - A list of processes that are blocked on a mutex. | 
|  | 108 |  | 
|  | 109 | top waiter - The highest priority process waiting on a specific mutex. | 
|  | 110 |  | 
|  | 111 | top pi waiter - The highest priority process waiting on one of the mutexes | 
|  | 112 | that a specific process owns. | 
|  | 113 |  | 
|  | 114 | Note:  task and process are used interchangeably in this document, mostly to | 
|  | 115 | differentiate between two processes that are being described together. | 
|  | 116 |  | 
|  | 117 |  | 
|  | 118 | PI chain | 
|  | 119 | -------- | 
|  | 120 |  | 
|  | 121 | The PI chain is a list of processes and mutexes that may cause priority | 
|  | 122 | inheritance to take place.  Multiple chains may converge, but a chain | 
|  | 123 | would never diverge, since a process can't be blocked on more than one | 
|  | 124 | mutex at a time. | 
|  | 125 |  | 
|  | 126 | Example: | 
|  | 127 |  | 
|  | 128 | Process:  A, B, C, D, E | 
|  | 129 | Mutexes:  L1, L2, L3, L4 | 
|  | 130 |  | 
|  | 131 | A owns: L1 | 
|  | 132 | B blocked on L1 | 
|  | 133 | B owns L2 | 
|  | 134 | C blocked on L2 | 
|  | 135 | C owns L3 | 
|  | 136 | D blocked on L3 | 
|  | 137 | D owns L4 | 
|  | 138 | E blocked on L4 | 
|  | 139 |  | 
|  | 140 | The chain would be: | 
|  | 141 |  | 
|  | 142 | E->L4->D->L3->C->L2->B->L1->A | 
|  | 143 |  | 
|  | 144 | To show where two chains merge, we could add another process F and | 
|  | 145 | another mutex L5 where B owns L5 and F is blocked on mutex L5. | 
|  | 146 |  | 
|  | 147 | The chain for F would be: | 
|  | 148 |  | 
|  | 149 | F->L5->B->L1->A | 
|  | 150 |  | 
|  | 151 | Since a process may own more than one mutex, but never be blocked on more than | 
|  | 152 | one, the chains merge. | 
|  | 153 |  | 
|  | 154 | Here we show both chains: | 
|  | 155 |  | 
E->L4->D->L3->C->L2-+
                    |
                    +->B->L1->A
                    |
              F->L5-+
|  | 161 |  | 
|  | 162 | For PI to work, the processes at the right end of these chains (or we may | 
|  | 163 | also call it the Top of the chain) must be equal to or higher in priority | 
|  | 164 | than the processes to the left or below in the chain. | 
|  | 165 |  | 
|  | 166 | Also since a mutex may have more than one process blocked on it, we can | 
|  | 167 | have multiple chains merge at mutexes.  If we add another process G that is | 
|  | 168 | blocked on mutex L2: | 
|  | 169 |  | 
|  | 170 | G->L2->B->L1->A | 
|  | 171 |  | 
|  | 172 | And once again, to show how this can grow I will show the merging chains | 
|  | 173 | again. | 
|  | 174 |  | 
E->L4->D->L3->C-+
                +->L2-+
                |     |
              G-+     +->B->L1->A
                      |
                F->L5-+
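
To make the chain idea concrete, here is a minimal sketch in C of walking
from a blocked process to the top of its chain.  The types toy_task and
toy_mutex and the function chain_top() are made up for this example, and
the sketch assumes there is no deadlock (no cycle) in the chain.

```c
#include <assert.h>
#include <stddef.h>

struct toy_mutex;

/* Hypothetical simplified task: blocked on at most one mutex. */
struct toy_task {
	char name;
	struct toy_mutex *blocked_on;	/* NULL when the task is not blocked */
};

/* Hypothetical simplified mutex: has at most one owner. */
struct toy_mutex {
	struct toy_task *owner;		/* NULL when the mutex is free */
};

/* Follow task -> mutex -> owner links until a task that is not blocked. */
static struct toy_task *chain_top(struct toy_task *task, int *depth)
{
	int n = 1;	/* count the starting task */

	assert(task != NULL);
	while (task->blocked_on && task->blocked_on->owner) {
		task = task->blocked_on->owner;
		n++;
	}
	if (depth)
		*depth = n;	/* number of processes in the chain */
	return task;
}
```

Because each task has a single blocked_on pointer, the walk can never
branch, which is why chains may merge but never diverge.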
|  | 181 |  | 
|  | 182 |  | 
|  | 183 | Plist | 
|  | 184 | ----- | 
|  | 185 |  | 
|  | 186 | Before I go further and talk about how the PI chain is stored through lists | 
|  | 187 | on both mutexes and processes, I'll explain the plist.  This is similar to | 
|  | 188 | the struct list_head functionality that is already in the kernel. | 
|  | 189 | The implementation of plist is out of scope for this document, but it is | 
|  | 190 | very important to understand what it does. | 
|  | 191 |  | 
|  | 192 | There are a few differences between plist and list, the most important one | 
|  | 193 | being that plist is a priority sorted linked list.  This means that the | 
|  | 194 | priorities of the plist are sorted, such that it takes O(1) to retrieve the | 
|  | 195 | highest priority item in the list.  Obviously this is useful to store processes | 
|  | 196 | based on their priorities. | 
|  | 197 |  | 
|  | 198 | Another difference, which is important for implementation, is that, unlike | 
|  | 199 | list, the head of the list is a different element than the nodes of a list. | 
|  | 200 | So the head of the list is declared as struct plist_head and nodes that will | 
|  | 201 | be added to the list are declared as struct plist_node. | 
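
The following is a minimal sketch of the idea of a priority sorted list
with a separate head type.  It is not the kernel's plist implementation:
toy_head, toy_node, toy_add() and toy_first() are hypothetical names, the
list is singly linked for brevity, and a lower number is assumed to mean
a higher priority.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical node and head types; the real plist API differs. */
struct toy_node {
	int prio;			/* lower number = higher priority */
	struct toy_node *next;
};

struct toy_head {
	struct toy_node *first;		/* always the highest priority node */
};

/* Insert in priority order; equal priorities stay in FIFO order.  O(n). */
static void toy_add(struct toy_node *node, struct toy_head *head)
{
	struct toy_node **pp = &head->first;

	assert(node != NULL && head != NULL);
	while (*pp && (*pp)->prio <= node->prio)
		pp = &(*pp)->next;
	node->next = *pp;
	*pp = node;
}

/* Retrieve the highest priority node in O(1); NULL if the list is empty. */
static struct toy_node *toy_first(const struct toy_head *head)
{
	return head->first;
}
```

The cost of sorting is paid at insertion time, so looking up the top
element never has to scan the list.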
|  | 202 |  | 
|  | 203 |  | 
|  | 204 | Mutex Waiter List | 
|  | 205 | ----------------- | 
|  | 206 |  | 
|  | 207 | Every mutex keeps track of all the waiters that are blocked on itself. The mutex | 
|  | 208 | has a plist to store these waiters by priority.  This list is protected by | 
|  | 209 | a spin lock that is located in the struct of the mutex. This lock is called | 
|  | 210 | wait_lock.  Since the modification of the waiter list is never done in | 
|  | 211 | interrupt context, the wait_lock can be taken without disabling interrupts. | 
|  | 212 |  | 
|  | 213 |  | 
|  | 214 | Task PI List | 
|  | 215 | ------------ | 
|  | 216 |  | 
|  | 217 | To keep track of the PI chains, each process has its own PI list.  This is | 
|  | 218 | a list of all top waiters of the mutexes that are owned by the process. | 
|  | 219 | Note that this list only holds the top waiters and not all waiters that are | 
|  | 220 | blocked on mutexes owned by the process. | 
|  | 221 |  | 
|  | 222 | The top of the task's PI list is always the highest priority task that | 
|  | 223 | is waiting on a mutex that is owned by the task.  So if the task has | 
|  | 224 | inherited a priority, it will always be the priority of the task that is | 
|  | 225 | at the top of this list. | 
|  | 226 |  | 
|  | 227 | This list is stored in the task structure of a process as a plist called | 
|  | 228 | pi_list.  This list is protected by a spin lock also in the task structure, | 
|  | 229 | called pi_lock.  This lock may also be taken in interrupt context, so when | 
|  | 230 | locking the pi_lock, interrupts must be disabled. | 
|  | 231 |  | 
|  | 232 |  | 
|  | 233 | Depth of the PI Chain | 
|  | 234 | --------------------- | 
|  | 235 |  | 
|  | 236 | The maximum depth of the PI chain is not dynamic, and could actually be | 
defined.  But it is very complex to figure it out, since it depends on all
|  | 238 | the nesting of mutexes.  Let's look at the example where we have 3 mutexes, | 
|  | 239 | L1, L2, and L3, and four separate functions func1, func2, func3 and func4. | 
|  | 240 | The following shows a locking order of L1->L2->L3, but may not actually | 
|  | 241 | be directly nested that way. | 
|  | 242 |  | 
void func1(void)
{
	mutex_lock(L1);

	/* do anything */

	mutex_unlock(L1);
}

void func2(void)
{
	mutex_lock(L1);
	mutex_lock(L2);

	/* do something */

	mutex_unlock(L2);
	mutex_unlock(L1);
}

void func3(void)
{
	mutex_lock(L2);
	mutex_lock(L3);

	/* do something else */

	mutex_unlock(L3);
	mutex_unlock(L2);
}

void func4(void)
{
	mutex_lock(L3);

	/* do something again */

	mutex_unlock(L3);
}
|  | 282 |  | 
|  | 283 | Now we add 4 processes that run each of these functions separately. | 
|  | 284 | Processes A, B, C, and D which run functions func1, func2, func3 and func4 | 
|  | 285 | respectively, and such that D runs first and A last.  With D being preempted | 
|  | 286 | in func4 in the "do something again" area, we have a locking that follows: | 
|  | 287 |  | 
|  | 288 | D owns L3 | 
|  | 289 | C blocked on L3 | 
|  | 290 | C owns L2 | 
|  | 291 | B blocked on L2 | 
|  | 292 | B owns L1 | 
|  | 293 | A blocked on L1 | 
|  | 294 |  | 
|  | 295 | And thus we have the chain A->L1->B->L2->C->L3->D. | 
|  | 296 |  | 
|  | 297 | This gives us a PI depth of 4 (four processes), but looking at any of the | 
|  | 298 | functions individually, it seems as though they only have at most a locking | 
|  | 299 | depth of two.  So, although the locking depth is defined at compile time, | 
|  | 300 | it still is very difficult to find the possibilities of that depth. | 
|  | 301 |  | 
Now since mutexes can be defined by user-land applications, we don't want a DoS
|  | 303 | type of application that nests large amounts of mutexes to create a large | 
|  | 304 | PI chain, and have the code holding spin locks while looking at a large | 
|  | 305 | amount of data.  So to prevent this, the implementation not only implements | 
|  | 306 | a maximum lock depth, but also only holds at most two different locks at a | 
|  | 307 | time, as it walks the PI chain.  More about this below. | 
|  | 308 |  | 
|  | 309 |  | 
|  | 310 | Mutex owner and flags | 
|  | 311 | --------------------- | 
|  | 312 |  | 
|  | 313 | The mutex structure contains a pointer to the owner of the mutex.  If the | 
|  | 314 | mutex is not owned, this owner is set to NULL.  Since all architectures | 
|  | 315 | have the task structure on at least a four byte alignment (and if this is | 
|  | 316 | not true, the rtmutex.c code will be broken!), this allows for the two | 
|  | 317 | least significant bits to be used as flags.  This part is also described | 
|  | 318 | in Documentation/rt-mutex.txt, but will also be briefly described here. | 
|  | 319 |  | 
|  | 320 | Bit 0 is used as the "Pending Owner" flag.  This is described later. | 
Bit 1 is used as the "Has Waiters" flag.  This is also described later
|  | 322 | in more detail, but is set whenever there are waiters on a mutex. | 
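
As an illustration of storing flags in the low bits of an aligned pointer,
here is a small sketch.  The macro and function names (TOY_*, toy_pack(),
toy_owner(), toy_flags()) are invented for this example and are not the
names used in rtmutex.c; only the bit positions come from the text above.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Flag names are illustrative; bit positions follow the text above. */
#define TOY_PENDING_OWNER	((uintptr_t)1)	/* bit 0 */
#define TOY_HAS_WAITERS		((uintptr_t)2)	/* bit 1 */
#define TOY_FLAG_MASK		((uintptr_t)3)

/* Stand-in for the task structure; "long" gives 4 byte or larger alignment
 * on common ABIs, and toy_pack() asserts that this really holds. */
struct toy_task {
	long dummy;
};

/* Combine an owner pointer and flags into one word. */
static uintptr_t toy_pack(const struct toy_task *owner, uintptr_t flags)
{
	uintptr_t val = (uintptr_t)owner;

	assert((val & TOY_FLAG_MASK) == 0);	/* needs 4 byte alignment */
	assert((flags & ~TOY_FLAG_MASK) == 0);	/* only the two flag bits */
	return val | flags;
}

/* Strip the flag bits to recover the owner pointer. */
static struct toy_task *toy_owner(uintptr_t word)
{
	return (struct toy_task *)(word & ~TOY_FLAG_MASK);
}

/* Extract just the flag bits. */
static uintptr_t toy_flags(uintptr_t word)
{
	return word & TOY_FLAG_MASK;
}
```

Keeping the owner and the flags in a single word is what lets both be
compared and updated in one atomic operation.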
|  | 323 |  | 
|  | 324 |  | 
|  | 325 | cmpxchg Tricks | 
|  | 326 | -------------- | 
|  | 327 |  | 
|  | 328 | Some architectures implement an atomic cmpxchg (Compare and Exchange).  This | 
|  | 329 | is used (when applicable) to keep the fast path of grabbing and releasing | 
|  | 330 | mutexes short. | 
|  | 331 |  | 
|  | 332 | cmpxchg is basically the following function performed atomically: | 
|  | 333 |  | 
unsigned long _cmpxchg(unsigned long *A, unsigned long *B, unsigned long *C)
{
	unsigned long T = *A;
	if (*A == *B) {
		*A = *C;
	}
	return T;
}
#define cmpxchg(a,b,c) _cmpxchg(&a,&b,&c)
|  | 343 |  | 
|  | 344 | This is really nice to have, since it allows you to only update a variable | 
|  | 345 | if the variable is what you expect it to be.  You know if it succeeded if | 
|  | 346 | the return value (the old value of A) is equal to B. | 
|  | 347 |  | 
|  | 348 | The macro rt_mutex_cmpxchg is used to try to lock and unlock mutexes. If | 
|  | 349 | the architecture does not support CMPXCHG, then this macro is simply set | 
to fail every time.  But if CMPXCHG is supported, then this will
help greatly to keep the fast path short.
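
A rough user-space sketch of such a fast path is shown here.  It uses the
C11 <stdatomic.h> compare-and-exchange as a stand-in for the kernel's
cmpxchg, and the names toy_mutex, toy_cmpxchg(), toy_fast_lock() and
toy_fast_unlock() are assumptions made for this example, not the macros
in rtmutex.c.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical mutex with only an owner word; 0 means "no owner". */
struct toy_mutex {
	_Atomic uintptr_t owner;
};

/* True if the owner word was "expected" and is now "desired". */
static bool toy_cmpxchg(struct toy_mutex *m, uintptr_t expected,
			uintptr_t desired)
{
	return atomic_compare_exchange_strong(&m->owner, &expected, desired);
}

/* Fast path lock: succeeds only if there is no owner and no flag set. */
static bool toy_fast_lock(struct toy_mutex *m, uintptr_t me)
{
	assert(me != 0);
	return toy_cmpxchg(m, 0, me);
}

/* Fast path unlock: succeeds only if the owner word is exactly "me". */
static bool toy_fast_unlock(struct toy_mutex *m, uintptr_t me)
{
	return toy_cmpxchg(m, me, 0);
}
```

Since the comparison covers the whole owner word, any flag bit that is set
in it makes the fast path fail, and the caller has to fall back to a slow
path.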
|  | 352 |  | 
The use of rt_mutex_cmpxchg with the flags in the owner field helps optimize
|  | 354 | the system for architectures that support it.  This will also be explained | 
|  | 355 | later in this document. | 
|  | 356 |  | 
|  | 357 |  | 
|  | 358 | Priority adjustments | 
|  | 359 | -------------------- | 
|  | 360 |  | 
|  | 361 | The implementation of the PI code in rtmutex.c has several places that a | 
|  | 362 | process must adjust its priority.  With the help of the pi_list of a | 
|  | 363 | process this is rather easy to know what needs to be adjusted. | 
|  | 364 |  | 
|  | 365 | The functions implementing the task adjustments are rt_mutex_adjust_prio, | 
|  | 366 | __rt_mutex_adjust_prio (same as the former, but expects the task pi_lock | 
to already be taken), rt_mutex_getprio, and rt_mutex_setprio.
|  | 368 |  | 
|  | 369 | rt_mutex_getprio and rt_mutex_setprio are only used in __rt_mutex_adjust_prio. | 
|  | 370 |  | 
|  | 371 | rt_mutex_getprio returns the priority that the task should have.  Either the | 
|  | 372 | task's own normal priority, or if a process of a higher priority is waiting on | 
|  | 373 | a mutex owned by the task, then that higher priority should be returned. | 
Since the pi_list of a task holds a priority-ordered list of all the top
|  | 375 | waiters of all the mutexes that the task owns, rt_mutex_getprio simply needs | 
|  | 376 | to compare the top pi waiter to its own normal priority, and return the higher | 
|  | 377 | priority back. | 
|  | 378 |  | 
|  | 379 | (Note:  if looking at the code, you will notice that the lower number of | 
|  | 380 | prio is returned.  This is because the prio field in the task structure | 
|  | 381 | is an inverse order of the actual priority.  So a "prio" of 5 is | 
|  | 382 | of higher priority than a "prio" of 10.) | 
|  | 383 |  | 
|  | 384 | __rt_mutex_adjust_prio examines the result of rt_mutex_getprio, and if the | 
|  | 385 | result does not equal the task's current priority, then rt_mutex_setprio | 
|  | 386 | is called to adjust the priority of the task to the new priority. | 
|  | 387 | Note that rt_mutex_setprio is defined in kernel/sched.c to implement the | 
|  | 388 | actual change in priority. | 
|  | 389 |  | 
|  | 390 | It is interesting to note that __rt_mutex_adjust_prio can either increase | 
|  | 391 | or decrease the priority of the task.  In the case that a higher priority | 
|  | 392 | process has just blocked on a mutex owned by the task, __rt_mutex_adjust_prio | 
|  | 393 | would increase/boost the task's priority.  But if a higher priority task | 
|  | 394 | were for some reason to leave the mutex (timeout or signal), this same function | 
|  | 395 | would decrease/unboost the priority of the task.  That is because the pi_list | 
|  | 396 | always contains the highest priority task that is waiting on a mutex owned | 
|  | 397 | by the task, so we only need to compare the priority of that top pi waiter | 
|  | 398 | to the normal priority of the given task. | 
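
The boost and unboost logic described above can be sketched like this.
It is a simplification, not the kernel code: toy_task, toy_getprio() and
toy_adjust_prio() are hypothetical, the whole pi_list is reduced to a
pointer to the top pi waiter's prio, and a plain assignment stands in for
rt_mutex_setprio.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task: lower "prio" number = higher priority. */
struct toy_task {
	int normal_prio;		/* priority without any inheritance */
	int prio;			/* current, possibly boosted, priority */
	const int *top_pi_waiter_prio;	/* prio of the top pi waiter, or NULL */
};

/* Priority the task should have: the higher of normal and top pi waiter. */
static int toy_getprio(const struct toy_task *t)
{
	assert(t != NULL);
	if (t->top_pi_waiter_prio && *t->top_pi_waiter_prio < t->normal_prio)
		return *t->top_pi_waiter_prio;
	return t->normal_prio;
}

/* Boost or unboost as needed; returns nonzero if the priority changed. */
static int toy_adjust_prio(struct toy_task *t)
{
	int prio = toy_getprio(t);

	if (prio == t->prio)
		return 0;
	t->prio = prio;		/* stands in for rt_mutex_setprio */
	return 1;
}
```

The same function covers both directions: a new, higher priority top pi
waiter boosts the task, and losing that waiter drops the task back to its
normal priority.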
|  | 399 |  | 
|  | 400 |  | 
|  | 401 | High level overview of the PI chain walk | 
|  | 402 | ---------------------------------------- | 
|  | 403 |  | 
|  | 404 | The PI chain walk is implemented by the function rt_mutex_adjust_prio_chain. | 
|  | 405 |  | 
|  | 406 | The implementation has gone through several iterations, and has ended up | 
|  | 407 | with what we believe is the best.  It walks the PI chain by only grabbing | 
|  | 408 | at most two locks at a time, and is very efficient. | 
|  | 409 |  | 
|  | 410 | The rt_mutex_adjust_prio_chain can be used either to boost or lower process | 
|  | 411 | priorities. | 
|  | 412 |  | 
|  | 413 | rt_mutex_adjust_prio_chain is called with a task to be checked for PI | 
|  | 414 | (de)boosting (the owner of a mutex that a process is blocking on), a flag to | 
|  | 415 | check for deadlocking, the mutex that the task owns, and a pointer to a waiter | 
|  | 416 | that is the process's waiter struct that is blocked on the mutex (although this | 
|  | 417 | parameter may be NULL for deboosting). | 
|  | 418 |  | 
|  | 419 | For this explanation, I will not mention deadlock detection. This explanation | 
|  | 420 | will try to stay at a high level. | 
|  | 421 |  | 
|  | 422 | When this function is called, there are no locks held.  That also means | 
that the state of the owner and lock can change after this function is entered.
|  | 424 |  | 
|  | 425 | Before this function is called, the task has already had rt_mutex_adjust_prio | 
|  | 426 | performed on it.  This means that the task is set to the priority that it | 
|  | 427 | should be at, but the plist nodes of the task's waiter have not been updated | 
|  | 428 | with the new priorities, and that this task may not be in the proper locations | 
|  | 429 | in the pi_lists and wait_lists that the task is blocked on.  This function | 
|  | 430 | solves all that. | 
|  | 431 |  | 
|  | 432 | A loop is entered, where task is the owner to be checked for PI changes that | 
|  | 433 | was passed by parameter (for the first iteration).  The pi_lock of this task is | 
|  | 434 | taken to prevent any more changes to the pi_list of the task.  This also | 
|  | 435 | prevents new tasks from completing the blocking on a mutex that is owned by this | 
|  | 436 | task. | 
|  | 437 |  | 
|  | 438 | If the task is not blocked on a mutex then the loop is exited.  We are at | 
|  | 439 | the top of the PI chain. | 
|  | 440 |  | 
|  | 441 | A check is now done to see if the original waiter (the process that is blocked | 
|  | 442 | on the current mutex) is the top pi waiter of the task.  That is, is this | 
|  | 443 | waiter on the top of the task's pi_list.  If it is not, it either means that | 
|  | 444 | there is another process higher in priority that is blocked on one of the | 
|  | 445 | mutexes that the task owns, or that the waiter has just woken up via a signal | 
|  | 446 | or timeout and has left the PI chain.  In either case, the loop is exited, since | 
|  | 447 | we don't need to do any more changes to the priority of the current task, or any | 
|  | 448 | task that owns a mutex that this current task is waiting on.  A priority chain | 
|  | 449 | walk is only needed when a new top pi waiter is made to a task. | 
|  | 450 |  | 
|  | 451 | The next check sees if the task's waiter plist node has the priority equal to | 
|  | 452 | the priority the task is set at.  If they are equal, then we are done with | 
|  | 453 | the loop.  Remember that the function started with the priority of the | 
task adjusted, but the plist nodes that hold the task in other processes'
pi_lists have not been adjusted.
|  | 456 |  | 
|  | 457 | Next, we look at the mutex that the task is blocked on. The mutex's wait_lock | 
|  | 458 | is taken.  This is done by a spin_trylock, because the locking order of the | 
|  | 459 | pi_lock and wait_lock goes in the opposite direction. If we fail to grab the | 
|  | 460 | lock, the pi_lock is released, and we restart the loop. | 
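
The trylock-and-back-off pattern can be sketched in user space as follows.
This is only an illustration under assumptions: the toy_lock type is built
on a C11 atomic_flag rather than a kernel spinlock, interrupt disabling is
ignored, and all toy_* names are invented for this example.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical spin lock built on a C11 atomic_flag. */
struct toy_lock {
	atomic_flag held;
};

static void toy_spin_lock(struct toy_lock *l)
{
	while (atomic_flag_test_and_set(&l->held))
		;	/* spin until the flag was previously clear */
}

static bool toy_spin_trylock(struct toy_lock *l)
{
	return !atomic_flag_test_and_set(&l->held);
}

static void toy_spin_unlock(struct toy_lock *l)
{
	atomic_flag_clear(&l->held);
}

/*
 * Take pi_lock, then try wait_lock against the normal order.  On failure,
 * drop pi_lock and report it so that the caller can restart its loop.
 */
static bool toy_lock_both(struct toy_lock *pi_lock, struct toy_lock *wait_lock)
{
	assert(pi_lock != wait_lock);
	toy_spin_lock(pi_lock);
	if (!toy_spin_trylock(wait_lock)) {
		toy_spin_unlock(pi_lock);
		return false;
	}
	return true;
}
```

Never spinning on the second lock while holding the first is what keeps
the reversed lock order from deadlocking against code that takes the
wait_lock first.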
|  | 461 |  | 
|  | 462 | Now that we have both the pi_lock of the task as well as the wait_lock of | 
|  | 463 | the mutex the task is blocked on, we update the task's waiter's plist node | 
|  | 464 | that is located on the mutex's wait_list. | 
|  | 465 |  | 
|  | 466 | Now we release the pi_lock of the task. | 
|  | 467 |  | 
|  | 468 | Next the owner of the mutex has its pi_lock taken, so we can update the | 
|  | 469 | task's entry in the owner's pi_list.  If the task is the highest priority | 
|  | 470 | process on the mutex's wait_list, then we remove the previous top waiter | 
|  | 471 | from the owner's pi_list, and replace it with the task. | 
|  | 472 |  | 
|  | 473 | Note: It is possible that the task was the current top waiter on the mutex, | 
|  | 474 | in which case the task is not yet on the pi_list of the waiter.  This | 
|  | 475 | is OK, since plist_del does nothing if the plist node is not on any | 
|  | 476 | list. | 
|  | 477 |  | 
|  | 478 | If the task was not the top waiter of the mutex, but it was before we | 
|  | 479 | did the priority updates, that means we are deboosting/lowering the | 
|  | 480 | task.  In this case, the task is removed from the pi_list of the owner, | 
|  | 481 | and the new top waiter is added. | 
|  | 482 |  | 
|  | 483 | Lastly, we unlock both the pi_lock of the task, as well as the mutex's | 
|  | 484 | wait_lock, and continue the loop again.  On the next iteration of the | 
|  | 485 | loop, the previous owner of the mutex will be the task that will be | 
|  | 486 | processed. | 
|  | 487 |  | 
|  | 488 | Note: One might think that the owner of this mutex might have changed | 
|  | 489 | since we just grab the mutex's wait_lock. And one could be right. | 
|  | 490 | The important thing to remember is that the owner could not have | 
|  | 491 | become the task that is being processed in the PI chain, since | 
|  | 492 | we have taken that task's pi_lock at the beginning of the loop. | 
|  | 493 | So as long as there is an owner of this mutex that is not the same | 
process as the task being worked on, we are OK.
|  | 495 |  | 
|  | 496 | Looking closely at the code, one might be confused.  The check for the | 
|  | 497 | end of the PI chain is when the task isn't blocked on anything or the | 
|  | 498 | task's waiter structure "task" element is NULL.  This check is | 
|  | 499 | protected only by the task's pi_lock.  But the code to unlock the mutex | 
|  | 500 | sets the task's waiter structure "task" element to NULL with only | 
|  | 501 | the protection of the mutex's wait_lock, which was not taken yet. | 
|  | 502 | Isn't this a race condition if the task becomes the new owner? | 
|  | 503 |  | 
|  | 504 | The answer is No!  The trick is the spin_trylock of the mutex's | 
|  | 505 | wait_lock.  If we fail that lock, we release the pi_lock of the | 
|  | 506 | task and continue the loop, doing the end of PI chain check again. | 
|  | 507 |  | 
|  | 508 | In the code to release the lock, the wait_lock of the mutex is held | 
|  | 509 | the entire time, and it is not let go when we grab the pi_lock of the | 
|  | 510 | new owner of the mutex.  So if the switch of a new owner were to happen | 
|  | 511 | after the check for end of the PI chain and the grabbing of the | 
|  | 512 | wait_lock, the unlocking code would spin on the new owner's pi_lock | 
|  | 513 | but never give up the wait_lock.  So the PI chain loop is guaranteed to | 
|  | 514 | fail the spin_trylock on the wait_lock, release the pi_lock, and | 
|  | 515 | try again. | 
|  | 516 |  | 
|  | 517 | If you don't quite understand the above, that's OK. You don't have to, | 
|  | 518 | unless you really want to make a proof out of it ;) | 
|  | 519 |  | 
|  | 520 |  | 
|  | 521 | Pending Owners and Lock stealing | 
|  | 522 | -------------------------------- | 
|  | 523 |  | 
|  | 524 | One of the flags in the owner field of the mutex structure is "Pending Owner". | 
|  | 525 | What this means is that an owner was chosen by the process releasing the | 
|  | 526 | mutex, but that owner has yet to wake up and actually take the mutex. | 
|  | 527 |  | 
|  | 528 | Why is this important?  Why can't we just give the mutex to another process | 
|  | 529 | and be done with it? | 
|  | 530 |  | 
|  | 531 | The PI code is to help with real-time processes, and to let the highest | 
|  | 532 | priority process run as long as possible with little latencies and delays. | 
|  | 533 | If a high priority process owns a mutex that a lower priority process is | 
|  | 534 | blocked on, when the mutex is released it would be given to the lower priority | 
process.  What if the higher priority process wants to take that mutex again?
|  | 536 | The high priority process would fail to take that mutex that it just gave up | 
|  | 537 | and it would need to boost the lower priority process to run with full | 
|  | 538 | latency of that critical section (since the low priority process just entered | 
|  | 539 | it). | 
|  | 540 |  | 
|  | 541 | There's no reason a high priority process that gives up a mutex should be | 
|  | 542 | penalized if it tries to take that mutex again.  If the new owner of the | 
|  | 543 | mutex has not woken up yet, there's no reason that the higher priority process | 
|  | 544 | could not take that mutex away. | 
|  | 545 |  | 
|  | 546 | To solve this, we introduced Pending Ownership and Lock Stealing.  When a | 
|  | 547 | new process is given a mutex that it was blocked on, it is only given | 
|  | 548 | pending ownership.  This means that it's the new owner, unless a higher | 
|  | 549 | priority process comes in and tries to grab that mutex.  If a higher priority | 
|  | 550 | process does come along and wants that mutex, we let the higher priority | 
|  | 551 | process "steal" the mutex from the pending owner (only if it is still pending) | 
|  | 552 | and continue with the mutex. | 
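
The decision of whether a mutex can be taken or stolen can be sketched as
below.  This is a simplified model and not the rtmutex.c code: the
"Pending Owner" flag is kept as a separate bool instead of a bit in the
owner field, no locking is shown, and toy_task, toy_mutex and
toy_try_to_take() are hypothetical names.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical types; lower "prio" number = higher priority. */
struct toy_task {
	int prio;
};

struct toy_mutex {
	struct toy_task *owner;	/* NULL when the mutex is free */
	bool pending;		/* owner was chosen but has not woken up yet */
};

/* Try to take the mutex for "task"; returns true on success. */
static bool toy_try_to_take(struct toy_mutex *m, struct toy_task *task)
{
	assert(m != NULL && task != NULL);

	if (m->owner == NULL)
		goto take;		/* free: just take it */
	if (!m->pending)
		return false;		/* real owner: cannot steal */
	if (m->owner == task)
		goto take;		/* pending owner claims its mutex */
	if (task->prio >= m->owner->prio)
		return false;		/* not higher priority: no steal */
take:
	m->owner = task;
	m->pending = false;
	return true;
}
```

Stealing is only possible while the owner is still pending.  Once the
pending owner has woken up and claimed the mutex, it is a real owner and
a higher priority task has to block on it like on any other owner.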
|  | 553 |  | 
|  | 554 |  | 
|  | 555 | Taking of a mutex (The walk through) | 
|  | 556 | ------------------------------------ | 
|  | 557 |  | 
|  | 558 | OK, now let's take a look at the detailed walk through of what happens when | 
|  | 559 | taking a mutex. | 
|  | 560 |  | 
|  | 561 | The first thing that is tried is the fast taking of the mutex.  This is | 
|  | 562 | done when we have CMPXCHG enabled (otherwise the fast taking automatically | 
|  | 563 | fails).  Only when the owner field of the mutex is NULL can the lock be | 
|  | 564 | taken with the CMPXCHG and nothing else needs to be done. | 
|  | 565 |  | 
If there is contention on the lock, whether it has an owner or a pending
owner, we go about the slow path (rt_mutex_slowlock).
|  | 568 |  | 
|  | 569 | The slow path function is where the task's waiter structure is created on | 
|  | 570 | the stack.  This is because the waiter structure is only needed for the | 
|  | 571 | scope of this function.  The waiter structure holds the nodes to store | 
|  | 572 | the task on the wait_list of the mutex, and if need be, the pi_list of | 
|  | 573 | the owner. | 
|  | 574 |  | 
|  | 575 | The wait_lock of the mutex is taken since the slow path of unlocking the | 
|  | 576 | mutex also takes this lock. | 
|  | 577 |  | 
|  | 578 | We then call try_to_take_rt_mutex.  This is where the architecture that | 
|  | 579 | does not implement CMPXCHG would always grab the lock (if there's no | 
|  | 580 | contention). | 
|  | 581 |  | 
|  | 582 | try_to_take_rt_mutex is used every time the task tries to grab a mutex in the | 
|  | 583 | slow path.  The first thing that is done here is an atomic setting of | 
|  | 584 | the "Has Waiters" flag of the mutex's owner field.  Yes, this could really | 
be false, because if the mutex has no owner, there are no waiters and
the current task also won't have any waiters.  But we don't have the lock
|  | 587 | yet, so we assume we are going to be a waiter.  The reason for this is to | 
|  | 588 | play nice for those architectures that do have CMPXCHG.  By setting this flag | 
|  | 589 | now, the owner of the mutex can't release the mutex without going into the | 
|  | 590 | slow unlock path, and it would then need to grab the wait_lock, which this | 
|  | 591 | code currently holds.  So setting the "Has Waiters" flag forces the owner | 
|  | 592 | to synchronize with this code. | 
|  | 593 |  | 
|  | 594 | Now that we know that we can't have any races with the owner releasing the | 
|  | 595 | mutex, we check to see if we can take the ownership.  This is done if the | 
|  | 596 | mutex doesn't have an owner, or if we can steal the mutex from a pending | 
|  | 597 | owner.  Let's look at the situations we have here. | 
|  | 598 |  | 
|  | 599 | 1) Has owner that is pending | 
|  | 600 | ---------------------------- | 
|  | 601 |  | 
|  | 602 | The mutex has an owner, but it hasn't woken up and the mutex flag | 
|  | 603 | "Pending Owner" is set.  The first check is to see if the owner isn't the | 
|  | 604 | current task.  This is because this function is also used for the pending | 
|  | 605 | owner to grab the mutex.  When a pending owner wakes up, it checks to see | 
|  | 606 | if it can take the mutex, and this is done if the owner is already set to | 
|  | 607 | itself.  If so, we succeed and leave the function, clearing the "Pending | 
|  | 608 | Owner" bit. | 
|  | 609 |  | 
|  | 610 | If the pending owner is not current, we check to see if the current task's | 
|  | 611 | priority is higher than the pending owner's.  If not, we fail the function and return. | 
|  | 612 |  | 
|  | 613 | There's also something special about a pending owner.  That is, a pending owner | 
|  | 614 | is never blocked on a mutex.  So there is no PI chain to worry about.  It also | 
|  | 615 | means that if the mutex doesn't have any waiters, there's no accounting needed | 
|  | 616 | to update the pending owner's pi_list, since we only worry about processes | 
|  | 617 | blocked on the current mutex. | 
|  | 618 |  | 
|  | 619 | If there are waiters on this mutex, and we just stole the ownership, we need | 
|  | 620 | to take the top waiter, remove it from the pi_list of the pending owner, and | 
|  | 621 | add it to the current pi_list.  Note that at this moment, the pending owner | 
|  | 622 | is no longer on the list of waiters.  This is fine, since the pending owner | 
|  | 623 | would add itself back when it realizes that it had the ownership stolen | 
|  | 624 | from itself.  When the pending owner tries to grab the mutex, it will fail | 
|  | 625 | in try_to_take_rt_mutex if the owner field points to another process. | 
|  | 626 |  | 
|  | 627 | 2) No owner | 
|  | 628 | ----------- | 
|  | 629 |  | 
|  | 630 | If there is no owner (or we successfully stole the lock), we set the owner | 
|  | 631 | of the mutex to current, and set the flag of "Has Waiters" if the current | 
|  | 632 | mutex actually has waiters, or we clear the flag if it doesn't.  See, it was | 
|  | 633 | OK that we set that flag early, since now it is cleared. | 
|  | 634 |  | 
|  | 635 | 3) Failed to grab ownership | 
|  | 636 | --------------------------- | 
|  | 637 |  | 
|  | 638 | The most interesting case is when we fail to take ownership. This means that | 
|  | 639 | there exists an owner, or there's a pending owner with equal or higher | 
|  | 640 | priority than the current task. | 
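
The three situations above can be summarized in a small sketch. This is only an illustration under several assumptions: the flag bit positions (SKETCH_PENDING, SKETCH_HAS_WAITERS), the nr_waiters counter standing in for the wait_list, and the convention that a lower prio value means a higher priority are all chosen for this example. The wait_lock is assumed to be held by the caller, and the pi_list accounting that goes with stealing is left out.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag bits kept in the low bits of the owner field. */
#define SKETCH_PENDING     ((uintptr_t)1)
#define SKETCH_HAS_WAITERS ((uintptr_t)2)
#define SKETCH_FLAGS       (SKETCH_PENDING | SKETCH_HAS_WAITERS)

struct task {
    int prio;            /* lower value = higher priority (assumed) */
};

struct sketch_mutex {
    uintptr_t owner;     /* task pointer plus flag bits */
    int nr_waiters;      /* stand-in for a non-empty wait_list */
};

static struct task *sketch_owner(const struct sketch_mutex *m)
{
    return (struct task *)(m->owner & ~SKETCH_FLAGS);
}

/* Caller is assumed to hold the mutex's wait_lock. */
static bool sketch_try_to_take(struct sketch_mutex *m, struct task *self)
{
    struct task *owner;

    /* Set early so the owner cannot use the fast unlock path. */
    m->owner |= SKETCH_HAS_WAITERS;
    owner = sketch_owner(m);

    if (owner) {
        /* A real (non-pending) owner can never be stolen from. */
        if (!(m->owner & SKETCH_PENDING))
            return false;
        /* A pending owner other than us needs strictly lower priority. */
        if (owner != self && self->prio >= owner->prio)
            return false;
    }

    /* No owner, our own pending ownership, or a successful steal. */
    m->owner = (uintptr_t)self |
               (m->nr_waiters ? SKETCH_HAS_WAITERS : 0);
    return true;
}
```

Writing the owner field at the end also clears the "Pending Owner" bit and either keeps or drops the early "Has Waiters" bit, matching case 2 above.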
|  | 641 |  | 
|  | 642 | We'll continue on the failed case. | 
|  | 643 |  | 
|  | 644 | If the mutex has a timeout, we set up a timer to go off to break us out | 
|  | 645 | of this mutex if we failed to get it after a specified amount of time. | 
|  | 646 |  | 
|  | 647 | Now we enter a loop that will continue to try to take ownership of the mutex, or | 
|  | 648 | fail from a timeout or signal. | 
|  | 649 |  | 
|  | 650 | Once again we try to take the mutex.  This will usually fail the first time | 
|  | 651 | in the loop, since it had just failed to get the mutex.  But the second time | 
|  | 652 | in the loop, this would likely succeed, since the task would likely be | 
|  | 653 | the pending owner. | 
|  | 654 |  | 
|  | 655 | If the task is waiting in the TASK_INTERRUPTIBLE state, a check for signals and | 
|  | 656 | timeout is done here. | 
|  | 657 |  | 
|  | 658 | The waiter structure has a "task" field that points to the task that is blocked | 
|  | 659 | on the mutex.  This field can be NULL the first time it goes through the loop | 
|  | 660 | or if the task is a pending owner and had its mutex stolen.  If the "task" | 
|  | 661 | field is NULL then we need to set up the accounting for it. | 
|  | 662 |  | 
|  | 663 | Task blocks on mutex | 
|  | 664 | -------------------- | 
|  | 665 |  | 
|  | 666 | The accounting of a mutex and process is done with the waiter structure of | 
|  | 667 | the process.  The "task" field is set to the process, and the "lock" field | 
|  | 668 | to the mutex.  The plist nodes are initialized to the process's current | 
|  | 669 | priority. | 
|  | 670 |  | 
|  | 671 | Since the wait_lock was taken at the entry of the slow lock, we can safely | 
|  | 672 | add the waiter to the wait_list.  If the current process is the highest | 
|  | 673 | priority process currently waiting on this mutex, then we remove the | 
|  | 674 | previous top waiter process (if it exists) from the pi_list of the owner, | 
|  | 675 | and add the current process to that list.  Since the pi_list of the owner | 
|  | 676 | has changed, we call rt_mutex_adjust_prio on the owner to see if the owner | 
|  | 677 | should adjust its priority accordingly. | 
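
The enqueue step can be pictured with the following simplified sketch. It assumes a single mutex per owner, replaces the plists with a small array, and treats a lower prio value as a higher priority. All of the names (sketch_block_on, sketch_adjust_prio, pi_top and so on) are hypothetical stand-ins and do not mirror the real rtmutex.c code.

```c
#include <assert.h>
#include <stddef.h>

#define SKETCH_MAX_WAITERS 8

struct task {
    int normal_prio;   /* priority without inheritance; lower = higher */
    int prio;          /* effective (possibly boosted) priority */
};

struct sketch_waiter {
    struct task *task; /* task blocked on the mutex */
    int prio;          /* snapshot of the task's priority at enqueue time */
};

struct sketch_mutex {
    struct task *owner;
    struct sketch_waiter *wait_list[SKETCH_MAX_WAITERS];
    int nr_waiters;
    struct sketch_waiter *pi_top; /* this mutex's entry on the owner's pi_list */
};

/* Highest priority waiter; the earliest one wins among equal priorities. */
static struct sketch_waiter *sketch_top_waiter(const struct sketch_mutex *m)
{
    struct sketch_waiter *top = NULL;

    for (int i = 0; i < m->nr_waiters; i++)
        if (!top || m->wait_list[i]->prio < top->prio)
            top = m->wait_list[i];
    return top;
}

/* Stand-in for rt_mutex_adjust_prio, assuming the owner holds one mutex. */
static void sketch_adjust_prio(struct task *owner,
                               const struct sketch_waiter *top)
{
    owner->prio = owner->normal_prio;
    if (top && top->prio < owner->prio)
        owner->prio = top->prio;
}

/* Caller is assumed to hold the mutex's wait_lock. */
static void sketch_block_on(struct sketch_mutex *m, struct sketch_waiter *w,
                            struct task *self)
{
    assert(m->nr_waiters < SKETCH_MAX_WAITERS);

    w->task = self;
    w->prio = self->prio;
    m->wait_list[m->nr_waiters++] = w;

    /* Only a new top waiter changes the owner's pi_list and priority. */
    if (sketch_top_waiter(m) == w) {
        m->pi_top = w;
        sketch_adjust_prio(m->owner, w);
    }
}
```

A waiter that is not the new top waiter leaves the owner untouched, which is why only the top waiter of each mutex needs to be on the owner's pi_list.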
|  | 678 |  | 
|  | 679 | If the owner is also blocked on a lock, and had its pi_list changed | 
|  | 680 | (or deadlock checking is on), we unlock the wait_lock of the mutex and go ahead | 
|  | 681 | and run rt_mutex_adjust_prio_chain on the owner, as described earlier. | 
|  | 682 |  | 
|  | 683 | Now all locks are released, and if the current process is still blocked on a | 
|  | 684 | mutex (waiter "task" field is not NULL), then we go to sleep (call schedule). | 
|  | 685 |  | 
|  | 686 | Waking up in the loop | 
|  | 687 | --------------------- | 
|  | 688 |  | 
|  | 689 | The task can then wake up from schedule for a few reasons: | 
|  | 690 | 1) we were given pending ownership of the mutex. | 
|  | 691 | 2) we received a signal and were TASK_INTERRUPTIBLE | 
|  | 692 | 3) we had a timeout and were TASK_INTERRUPTIBLE | 
|  | 693 |  | 
|  | 694 | In any of these cases, we continue the loop and once again try to grab the | 
|  | 695 | ownership of the mutex.  If we succeed, we exit the loop.  Otherwise, on a signal | 
|  | 696 | or timeout we exit the loop, and if we had the mutex stolen we simply add | 
|  | 697 | ourselves back on the lists and go back to sleep. | 
|  | 698 |  | 
|  | 699 | Note: Because of timeouts and signals, the steal mutex | 
|  | 700 | algorithm needs to be careful. This is because the current process is | 
|  | 701 | still on the wait_list. And because priorities change dynamically, | 
|  | 702 | especially on SCHED_OTHER tasks, the current process can be the | 
|  | 703 | highest priority task on the wait_list. | 
|  | 704 |  | 
|  | 705 | Failed to get mutex on Timeout or Signal | 
|  | 706 | ---------------------------------------- | 
|  | 707 |  | 
|  | 708 | If a timeout or signal occurred, the waiter's "task" field would not be | 
|  | 709 | NULL and the task needs to be taken off the wait_list of the mutex and perhaps | 
|  | 710 | the pi_list of the owner.  If this process was a high priority process, then | 
|  | 711 | the rt_mutex_adjust_prio_chain needs to be executed again on the owner, | 
|  | 712 | but this time it will be lowering the priorities. | 
|  | 713 |  | 
|  | 714 |  | 
|  | 715 | Unlocking the Mutex | 
|  | 716 | ------------------- | 
|  | 717 |  | 
|  | 718 | The unlocking of a mutex also has a fast path for those architectures with | 
|  | 719 | CMPXCHG.  Since the taking of a mutex on contention always sets the | 
|  | 720 | "Has Waiters" flag of the mutex's owner, we use this to know if we need to | 
|  | 721 | take the slow path when unlocking the mutex.  If the mutex doesn't have any | 
|  | 722 | waiters, the owner field of the mutex would equal the current process and | 
|  | 723 | the mutex can be unlocked by just replacing the owner field with NULL. | 
|  | 724 |  | 
|  | 725 | If the owner field has the "Has Waiters" bit set (or CMPXCHG is not available), | 
|  | 726 | the slow unlock path is taken. | 
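
The interaction between the "Has Waiters" bit and the fast unlock path can be sketched in user space with C11 atomics. The bit position SKETCH_HAS_WAITERS and the function names below are assumptions made for this example only.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumed flag bit kept in the low bits of the owner field. */
#define SKETCH_HAS_WAITERS ((uintptr_t)2)

struct task {
    int prio;
};

struct sketch_mutex {
    atomic_uintptr_t owner;   /* task pointer plus flag bits */
};

/* What a contending task does before it tries to take the mutex. */
static void sketch_mark_waiters(struct sketch_mutex *m)
{
    atomic_fetch_or(&m->owner, SKETCH_HAS_WAITERS);
}

/*
 * Fast unlock: succeeds only if the owner field is exactly the current
 * task with no flag bits set.  Otherwise the slow path must be taken.
 */
static bool sketch_fast_unlock(struct sketch_mutex *m, struct task *self)
{
    uintptr_t expected = (uintptr_t)self;

    return atomic_compare_exchange_strong(&m->owner, &expected,
                                          (uintptr_t)0);
}
```

Because the flag lives in the same word as the owner pointer, a single compare-and-exchange both checks for waiters and releases the mutex.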
|  | 727 |  | 
|  | 728 | The first thing done in the slow unlock path is to take the wait_lock of the | 
|  | 729 | mutex.  This synchronizes the locking and unlocking of the mutex. | 
|  | 730 |  | 
|  | 731 | A check is made to see if the mutex has waiters or not.  On architectures that | 
|  | 732 | do not have CMPXCHG, this is the location that the owner of the mutex will | 
|  | 733 | determine if a waiter needs to be awoken or not.  On architectures that | 
|  | 734 | do have CMPXCHG, that check is done in the fast path, but it is still needed | 
|  | 735 | in the slow path too.  If a waiter of a mutex woke up because of a signal | 
|  | 736 | or timeout between the time the owner failed the fast path CMPXCHG check and | 
|  | 737 | the grabbing of the wait_lock, the mutex may not have any waiters, thus the | 
| Jan Altenberg | 9ba0bdf | 2006-09-30 23:28:08 -0700 | [diff] [blame] | 738 | owner still needs to make this check. If there are no waiters then the mutex | 
| Steven Rostedt | a6537be | 2006-06-27 02:54:54 -0700 | [diff] [blame] | 739 | owner field is set to NULL, the wait_lock is released and nothing more is | 
|  | 740 | needed. | 
|  | 741 |  | 
|  | 742 | If there are waiters, then we need to wake one up and give that waiter | 
|  | 743 | pending ownership. | 
|  | 744 |  | 
|  | 745 | In the wake up code, the pi_lock of the current owner is taken.  The top | 
|  | 746 | waiter of the lock is found and removed from the wait_list of the mutex | 
|  | 747 | as well as the pi_list of the current owner.  The task field of the new | 
|  | 748 | pending owner's waiter structure is set to NULL, and the owner field of the | 
|  | 749 | mutex is set to the new owner with the "Pending Owner" bit set, as well | 
|  | 750 | as the "Has Waiters" bit if there still are other processes blocked on the | 
|  | 751 | mutex. | 
|  | 752 |  | 
|  | 753 | The pi_lock of the previous owner is released, and the new pending owner's | 
|  | 754 | pi_lock is taken.  Remember that this is the trick to prevent the race | 
|  | 755 | condition in rt_mutex_adjust_prio_chain from adding itself as a waiter | 
|  | 756 | on the mutex. | 
|  | 757 |  | 
|  | 758 | We now clear the "pi_blocked_on" field of the new pending owner, and if | 
|  | 759 | the mutex still has waiters pending, we add the new top waiter to the pi_list | 
|  | 760 | of the pending owner. | 
|  | 761 |  | 
|  | 762 | Finally we unlock the pi_lock of the pending owner and wake it up. | 
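
The hand-off to a pending owner can be outlined with the sketch below. It is a simplification under stated assumptions: the flag bit values, the array used in place of the plist, and the convention that a lower prio value means a higher priority are all chosen for illustration, the wait_lock is assumed to be held, and the pi_lock and pi_list steps described above are omitted.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Assumed flag bits kept in the low bits of the owner field. */
#define SKETCH_PENDING     ((uintptr_t)1)
#define SKETCH_HAS_WAITERS ((uintptr_t)2)
#define SKETCH_MAX_WAITERS 8

struct task {
    int prio;              /* lower value = higher priority (assumed) */
};

struct sketch_waiter {
    struct task *task;     /* NULL once the task is no longer blocked */
    int prio;
};

struct sketch_mutex {
    uintptr_t owner;       /* task pointer plus flag bits */
    struct sketch_waiter *wait_list[SKETCH_MAX_WAITERS];
    int nr_waiters;
};

/*
 * Returns the task that should be woken, or NULL if nobody is waiting.
 * Caller is assumed to hold the mutex's wait_lock.
 */
static struct task *sketch_slow_unlock(struct sketch_mutex *m)
{
    struct task *next;
    int top = 0;

    if (m->nr_waiters == 0) {
        m->owner = 0;      /* waiters may have left due to signal/timeout */
        return NULL;
    }

    /* Find the top waiter; the earliest wins among equal priorities. */
    for (int i = 1; i < m->nr_waiters; i++)
        if (m->wait_list[i]->prio < m->wait_list[top]->prio)
            top = i;

    next = m->wait_list[top]->task;
    m->wait_list[top]->task = NULL;

    /* Remove the top waiter from the wait list, keeping the order. */
    for (int i = top; i < m->nr_waiters - 1; i++)
        m->wait_list[i] = m->wait_list[i + 1];
    m->nr_waiters--;

    /* Hand over pending ownership; keep "Has Waiters" if others remain. */
    m->owner = (uintptr_t)next | SKETCH_PENDING |
               (m->nr_waiters ? SKETCH_HAS_WAITERS : 0);
    return next;
}
```

The returned task is only a pending owner. It still has to run try_to_take_rt_mutex after waking up, and until then a higher priority task may steal the mutex.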
|  | 763 |  | 
|  | 764 |  | 
|  | 765 | Contact | 
|  | 766 | ------- | 
|  | 767 |  | 
|  | 768 | For updates on this document, please email Steven Rostedt <rostedt@goodmis.org> | 
|  | 769 |  | 
|  | 770 |  | 
|  | 771 | Credits | 
|  | 772 | ------- | 
|  | 773 |  | 
|  | 774 | Author:  Steven Rostedt <rostedt@goodmis.org> | 
|  | 775 |  | 
|  | 776 | Reviewers:  Ingo Molnar, Thomas Gleixner, Thomas Duetsch, and Randy Dunlap | 
|  | 777 |  | 
|  | 778 | Updates | 
|  | 779 | ------- | 
|  | 780 |  | 
|  | 781 | This document was originally written for 2.6.17-rc3-mm1 |