| Darren Hart | b30505c | 2009-05-07 15:40:14 -0700 | [diff] [blame] | 1 | Futex Requeue PI | 
|  | 2 | ---------------- | 
|  | 3 |  | 
|  | 4 | Requeueing of tasks from a non-PI futex to a PI futex requires | 
|  | 5 | special handling in order to ensure the underlying rt_mutex is never | 
|  | 6 | left without an owner if it has waiters; doing so would break the PI | 
|  | 7 | boosting logic [see rt-mutex-desgin.txt] For the purposes of | 
|  | 8 | brevity, this action will be referred to as "requeue_pi" throughout | 
|  | 9 | this document.  Priority inheritance is abbreviated throughout as | 
|  | 10 | "PI". | 
|  | 11 |  | 
|  | 12 | Motivation | 
|  | 13 | ---------- | 
|  | 14 |  | 
|  | 15 | Without requeue_pi, the glibc implementation of | 
|  | 16 | pthread_cond_broadcast() must resort to waking all the tasks waiting | 
|  | 17 | on a pthread_condvar and letting them try to sort out which task | 
|  | 18 | gets to run first in classic thundering-herd formation.  An ideal | 
|  | 19 | implementation would wake the highest-priority waiter, and leave the | 
|  | 20 | rest to the natural wakeup inherent in unlocking the mutex | 
|  | 21 | associated with the condvar. | 
|  | 22 |  | 
|  | 23 | Consider the simplified glibc calls: | 
|  | 24 |  | 
|  | 25 | /* caller must lock mutex */ | 
|  | 26 | pthread_cond_wait(cond, mutex) | 
|  | 27 | { | 
|  | 28 | lock(cond->__data.__lock); | 
|  | 29 | unlock(mutex); | 
|  | 30 | do { | 
|  | 31 | unlock(cond->__data.__lock); | 
|  | 32 | futex_wait(cond->__data.__futex); | 
|  | 33 | lock(cond->__data.__lock); | 
|  | 34 | } while(...) | 
|  | 35 | unlock(cond->__data.__lock); | 
|  | 36 | lock(mutex); | 
|  | 37 | } | 
|  | 38 |  | 
|  | 39 | pthread_cond_broadcast(cond) | 
|  | 40 | { | 
|  | 41 | lock(cond->__data.__lock); | 
|  | 42 | unlock(cond->__data.__lock); | 
|  | 43 | futex_requeue(cond->data.__futex, cond->mutex); | 
|  | 44 | } | 
|  | 45 |  | 
|  | 46 | Once pthread_cond_broadcast() requeues the tasks, the cond->mutex | 
|  | 47 | has waiters. Note that pthread_cond_wait() attempts to lock the | 
|  | 48 | mutex only after it has returned to user space.  This will leave the | 
|  | 49 | underlying rt_mutex with waiters, and no owner, breaking the | 
|  | 50 | previously mentioned PI-boosting algorithms. | 
|  | 51 |  | 
|  | 52 | In order to support PI-aware pthread_condvar's, the kernel needs to | 
|  | 53 | be able to requeue tasks to PI futexes.  This support implies that | 
|  | 54 | upon a successful futex_wait system call, the caller would return to | 
|  | 55 | user space already holding the PI futex.  The glibc implementation | 
|  | 56 | would be modified as follows: | 
|  | 57 |  | 
|  | 58 |  | 
|  | 59 | /* caller must lock mutex */ | 
|  | 60 | pthread_cond_wait_pi(cond, mutex) | 
|  | 61 | { | 
|  | 62 | lock(cond->__data.__lock); | 
|  | 63 | unlock(mutex); | 
|  | 64 | do { | 
|  | 65 | unlock(cond->__data.__lock); | 
|  | 66 | futex_wait_requeue_pi(cond->__data.__futex); | 
|  | 67 | lock(cond->__data.__lock); | 
|  | 68 | } while(...) | 
|  | 69 | unlock(cond->__data.__lock); | 
|  | 70 | /* the kernel acquired the the mutex for us */ | 
|  | 71 | } | 
|  | 72 |  | 
|  | 73 | pthread_cond_broadcast_pi(cond) | 
|  | 74 | { | 
|  | 75 | lock(cond->__data.__lock); | 
|  | 76 | unlock(cond->__data.__lock); | 
|  | 77 | futex_requeue_pi(cond->data.__futex, cond->mutex); | 
|  | 78 | } | 
|  | 79 |  | 
|  | 80 | The actual glibc implementation will likely test for PI and make the | 
|  | 81 | necessary changes inside the existing calls rather than creating new | 
|  | 82 | calls for the PI cases.  Similar changes are needed for | 
|  | 83 | pthread_cond_timedwait() and pthread_cond_signal(). | 
|  | 84 |  | 
|  | 85 | Implementation | 
|  | 86 | -------------- | 
|  | 87 |  | 
|  | 88 | In order to ensure the rt_mutex has an owner if it has waiters, it | 
|  | 89 | is necessary for both the requeue code, as well as the waiting code, | 
|  | 90 | to be able to acquire the rt_mutex before returning to user space. | 
|  | 91 | The requeue code cannot simply wake the waiter and leave it to | 
|  | 92 | acquire the rt_mutex as it would open a race window between the | 
|  | 93 | requeue call returning to user space and the waiter waking and | 
|  | 94 | starting to run.  This is especially true in the uncontended case. | 
|  | 95 |  | 
|  | 96 | The solution involves two new rt_mutex helper routines, | 
|  | 97 | rt_mutex_start_proxy_lock() and rt_mutex_finish_proxy_lock(), which | 
|  | 98 | allow the requeue code to acquire an uncontended rt_mutex on behalf | 
|  | 99 | of the waiter and to enqueue the waiter on a contended rt_mutex. | 
|  | 100 | Two new system calls provide the kernel<->user interface to | 
|  | 101 | requeue_pi: FUTEX_WAIT_REQUEUE_PI and FUTEX_REQUEUE_CMP_PI. | 
|  | 102 |  | 
|  | 103 | FUTEX_WAIT_REQUEUE_PI is called by the waiter (pthread_cond_wait() | 
|  | 104 | and pthread_cond_timedwait()) to block on the initial futex and wait | 
|  | 105 | to be requeued to a PI-aware futex.  The implementation is the | 
|  | 106 | result of a high-speed collision between futex_wait() and | 
|  | 107 | futex_lock_pi(), with some extra logic to check for the additional | 
|  | 108 | wake-up scenarios. | 
|  | 109 |  | 
|  | 110 | FUTEX_REQUEUE_CMP_PI is called by the waker | 
|  | 111 | (pthread_cond_broadcast() and pthread_cond_signal()) to requeue and | 
|  | 112 | possibly wake the waiting tasks. Internally, this system call is | 
|  | 113 | still handled by futex_requeue (by passing requeue_pi=1).  Before | 
|  | 114 | requeueing, futex_requeue() attempts to acquire the requeue target | 
|  | 115 | PI futex on behalf of the top waiter.  If it can, this waiter is | 
|  | 116 | woken.  futex_requeue() then proceeds to requeue the remaining | 
|  | 117 | nr_wake+nr_requeue tasks to the PI futex, calling | 
|  | 118 | rt_mutex_start_proxy_lock() prior to each requeue to prepare the | 
|  | 119 | task as a waiter on the underlying rt_mutex.  It is possible that | 
|  | 120 | the lock can be acquired at this stage as well, if so, the next | 
|  | 121 | waiter is woken to finish the acquisition of the lock. | 
|  | 122 |  | 
|  | 123 | FUTEX_REQUEUE_PI accepts nr_wake and nr_requeue as arguments, but | 
|  | 124 | their sum is all that really matters.  futex_requeue() will wake or | 
|  | 125 | requeue up to nr_wake + nr_requeue tasks.  It will wake only as many | 
|  | 126 | tasks as it can acquire the lock for, which in the majority of cases | 
|  | 127 | should be 0 as good programming practice dictates that the caller of | 
|  | 128 | either pthread_cond_broadcast() or pthread_cond_signal() acquire the | 
|  | 129 | mutex prior to making the call. FUTEX_REQUEUE_PI requires that | 
|  | 130 | nr_wake=1.  nr_requeue should be INT_MAX for broadcast and 0 for | 
|  | 131 | signal. |