| Thomas Gleixner | df78488 | 2006-01-09 20:52:33 -0800 | [diff] [blame] | 1 |  | 
 | 2 | hrtimers - subsystem for high-resolution kernel timers | 
 | 3 | ---------------------------------------------------- | 
 | 4 |  | 
 | 5 | This patch introduces a new subsystem for high-resolution kernel timers. | 
 | 6 |  | 
 | 7 | One might ask the question: we already have a timer subsystem | 
 | 8 | (kernel/timers.c), why do we need two timer subsystems? After a lot of | 
 | 9 | back and forth trying to integrate high-resolution and high-precision | 
 | 10 | features into the existing timer framework, and after testing various | 
 | 11 | such high-resolution timer implementations in practice, we came to the | 
 | 12 | conclusion that the timer wheel code is fundamentally not suitable for | 
| Matt LaPlante | fff9289 | 2006-10-03 22:47:42 +0200 | [diff] [blame] | 13 | such an approach. We initially didn't believe this ('there must be a way | 
| Thomas Gleixner | df78488 | 2006-01-09 20:52:33 -0800 | [diff] [blame] | 14 | to solve this'), and spent a considerable effort trying to integrate | 
 | 15 | things into the timer wheel, but we failed. In hindsight, there are | 
 | 16 | several reasons why such integration is hard/impossible: | 
 | 17 |  | 
 | 18 | - the forced handling of low-resolution and high-resolution timers in | 
 | 19 |   the same way leads to a lot of compromises, macro magic and #ifdef | 
 | 20 |   mess. The timers.c code is very "tightly coded" around jiffies and | 
 | 21 |   32-bitness assumptions, and has been honed and micro-optimized for a | 
 | 22 |   relatively narrow use case (jiffies in a relatively narrow HZ range) | 
 | 23 |   for many years - and thus even small extensions to it easily break | 
 | 24 |   the wheel concept, leading to even worse compromises. The timer wheel | 
 | 25 |   code is very good and tight code, there's zero problems with it in its | 
 | 26 |   current usage - but it is simply not suitable to be extended for | 
 | 27 |   high-res timers. | 
 | 28 |  | 
 | 29 | - the unpredictable [O(N)] overhead of cascading leads to delays which | 
| Matt LaPlante | 992caac | 2006-10-03 22:52:05 +0200 | [diff] [blame] | 30 |   necessitate a more complex handling of high resolution timers, which | 
| Thomas Gleixner | df78488 | 2006-01-09 20:52:33 -0800 | [diff] [blame] | 31 |   in turn decreases robustness. Such a design still led to rather large | 
 | 32 |   timing inaccuracies. Cascading is a fundamental property of the timer | 
 | 33 |   wheel concept, it cannot be 'designed out' without unevitably | 
 | 34 |   degrading other portions of the timers.c code in an unacceptable way. | 
 | 35 |  | 
 | 36 | - the implementation of the current posix-timer subsystem on top of | 
 | 37 |   the timer wheel has already introduced a quite complex handling of | 
 | 38 |   the required readjusting of absolute CLOCK_REALTIME timers at | 
 | 39 |   settimeofday or NTP time - further underlying our experience by | 
 | 40 |   example: that the timer wheel data structure is too rigid for high-res | 
 | 41 |   timers. | 
 | 42 |  | 
 | 43 | - the timer wheel code is most optimal for use cases which can be | 
 | 44 |   identified as "timeouts". Such timeouts are usually set up to cover | 
 | 45 |   error conditions in various I/O paths, such as networking and block | 
 | 46 |   I/O. The vast majority of those timers never expire and are rarely | 
 | 47 |   recascaded because the expected correct event arrives in time so they | 
 | 48 |   can be removed from the timer wheel before any further processing of | 
 | 49 |   them becomes necessary. Thus the users of these timeouts can accept | 
 | 50 |   the granularity and precision tradeoffs of the timer wheel, and | 
 | 51 |   largely expect the timer subsystem to have near-zero overhead. | 
 | 52 |   Accurate timing for them is not a core purpose - in fact most of the | 
 | 53 |   timeout values used are ad-hoc. For them it is at most a necessary | 
 | 54 |   evil to guarantee the processing of actual timeout completions | 
 | 55 |   (because most of the timeouts are deleted before completion), which | 
 | 56 |   should thus be as cheap and unintrusive as possible. | 
 | 57 |  | 
 | 58 | The primary users of precision timers are user-space applications that | 
 | 59 | utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel | 
 | 60 | users like drivers and subsystems which require precise timed events | 
| Matt LaPlante | 53cb472 | 2006-10-03 22:55:17 +0200 | [diff] [blame] | 61 | (e.g. multimedia) can benefit from the availability of a separate | 
| Thomas Gleixner | df78488 | 2006-01-09 20:52:33 -0800 | [diff] [blame] | 62 | high-resolution timer subsystem as well. | 
 | 63 |  | 
 | 64 | While this subsystem does not offer high-resolution clock sources just | 
 | 65 | yet, the hrtimer subsystem can be easily extended with high-resolution | 
 | 66 | clock capabilities, and patches for that exist and are maturing quickly. | 
 | 67 | The increasing demand for realtime and multimedia applications along | 
 | 68 | with other potential users for precise timers gives another reason to | 
 | 69 | separate the "timeout" and "precise timer" subsystems. | 
 | 70 |  | 
| Matt LaPlante | 53cb472 | 2006-10-03 22:55:17 +0200 | [diff] [blame] | 71 | Another potential benefit is that such a separation allows even more | 
| Thomas Gleixner | df78488 | 2006-01-09 20:52:33 -0800 | [diff] [blame] | 72 | special-purpose optimization of the existing timer wheel for the low | 
 | 73 | resolution and low precision use cases - once the precision-sensitive | 
 | 74 | APIs are separated from the timer wheel and are migrated over to | 
 | 75 | hrtimers. E.g. we could decrease the frequency of the timeout subsystem | 
 | 76 | from 250 Hz to 100 HZ (or even smaller). | 
 | 77 |  | 
 | 78 | hrtimer subsystem implementation details | 
 | 79 | ---------------------------------------- | 
 | 80 |  | 
 | 81 | the basic design considerations were: | 
 | 82 |  | 
 | 83 | - simplicity | 
 | 84 |  | 
 | 85 | - data structure not bound to jiffies or any other granularity. All the | 
 | 86 |   kernel logic works at 64-bit nanoseconds resolution - no compromises. | 
 | 87 |  | 
 | 88 | - simplification of existing, timing related kernel code | 
 | 89 |  | 
 | 90 | another basic requirement was the immediate enqueueing and ordering of | 
 | 91 | timers at activation time. After looking at several possible solutions | 
 | 92 | such as radix trees and hashes, we chose the red black tree as the basic | 
 | 93 | data structure. Rbtrees are available as a library in the kernel and are | 
 | 94 | used in various performance-critical areas of e.g. memory management and | 
 | 95 | file systems. The rbtree is solely used for time sorted ordering, while | 
 | 96 | a separate list is used to give the expiry code fast access to the | 
 | 97 | queued timers, without having to walk the rbtree. | 
 | 98 |  | 
| Matt LaPlante | 53cb472 | 2006-10-03 22:55:17 +0200 | [diff] [blame] | 99 | (This separate list is also useful for later when we'll introduce | 
 | 100 | high-resolution clocks, where we need separate pending and expired | 
| Thomas Gleixner | df78488 | 2006-01-09 20:52:33 -0800 | [diff] [blame] | 101 | queues while keeping the time-order intact.) | 
 | 102 |  | 
 | 103 | Time-ordered enqueueing is not purely for the purposes of | 
 | 104 | high-resolution clocks though, it also simplifies the handling of | 
 | 105 | absolute timers based on a low-resolution CLOCK_REALTIME. The existing | 
 | 106 | implementation needed to keep an extra list of all armed absolute | 
 | 107 | CLOCK_REALTIME timers along with complex locking. In case of | 
 | 108 | settimeofday and NTP, all the timers (!) had to be dequeued, the | 
 | 109 | time-changing code had to fix them up one by one, and all of them had to | 
 | 110 | be enqueued again. The time-ordered enqueueing and the storage of the | 
 | 111 | expiry time in absolute time units removes all this complex and poorly | 
 | 112 | scaling code from the posix-timer implementation - the clock can simply | 
 | 113 | be set without having to touch the rbtree. This also makes the handling | 
 | 114 | of posix-timers simpler in general. | 
 | 115 |  | 
 | 116 | The locking and per-CPU behavior of hrtimers was mostly taken from the | 
 | 117 | existing timer wheel code, as it is mature and well suited. Sharing code | 
 | 118 | was not really a win, due to the different data structures. Also, the | 
 | 119 | hrtimer functions now have clearer behavior and clearer names - such as | 
 | 120 | hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly | 
 | 121 | equivalent to del_timer() and del_timer_sync()] - so there's no direct | 
 | 122 | 1:1 mapping between them on the algorithmical level, and thus no real | 
 | 123 | potential for code sharing either. | 
 | 124 |  | 
 | 125 | Basic data types: every time value, absolute or relative, is in a | 
 | 126 | special nanosecond-resolution type: ktime_t. The kernel-internal | 
 | 127 | representation of ktime_t values and operations is implemented via | 
 | 128 | macros and inline functions, and can be switched between a "hybrid | 
 | 129 | union" type and a plain "scalar" 64bit nanoseconds representation (at | 
 | 130 | compile time). The hybrid union type optimizes time conversions on 32bit | 
 | 131 | CPUs. This build-time-selectable ktime_t storage format was implemented | 
 | 132 | to avoid the performance impact of 64-bit multiplications and divisions | 
 | 133 | on 32bit CPUs. Such operations are frequently necessary to convert | 
 | 134 | between the storage formats provided by kernel and userspace interfaces | 
 | 135 | and the internal time format. (See include/linux/ktime.h for further | 
 | 136 | details.) | 
 | 137 |  | 
 | 138 | hrtimers - rounding of timer values | 
 | 139 | ----------------------------------- | 
 | 140 |  | 
 | 141 | the hrtimer code will round timer events to lower-resolution clocks | 
 | 142 | because it has to. Otherwise it will do no artificial rounding at all. | 
 | 143 |  | 
 | 144 | one question is, what resolution value should be returned to the user by | 
 | 145 | the clock_getres() interface. This will return whatever real resolution | 
 | 146 | a given clock has - be it low-res, high-res, or artificially-low-res. | 
 | 147 |  | 
 | 148 | hrtimers - testing and verification | 
 | 149 | ---------------------------------- | 
 | 150 |  | 
 | 151 | We used the high-resolution clock subsystem ontop of hrtimers to verify | 
 | 152 | the hrtimer implementation details in praxis, and we also ran the posix | 
 | 153 | timer tests in order to ensure specification compliance. We also ran | 
 | 154 | tests on low-resolution clocks. | 
 | 155 |  | 
 | 156 | The hrtimer patch converts the following kernel functionality to use | 
 | 157 | hrtimers: | 
 | 158 |  | 
 | 159 |  - nanosleep | 
 | 160 |  - itimers | 
 | 161 |  - posix-timers | 
 | 162 |  | 
 | 163 | The conversion of nanosleep and posix-timers enabled the unification of | 
 | 164 | nanosleep and clock_nanosleep. | 
 | 165 |  | 
 | 166 | The code was successfully compiled for the following platforms: | 
 | 167 |  | 
 | 168 |  i386, x86_64, ARM, PPC, PPC64, IA64 | 
 | 169 |  | 
 | 170 | The code was run-tested on the following platforms: | 
 | 171 |  | 
 | 172 |  i386(UP/SMP), x86_64(UP/SMP), ARM, PPC | 
 | 173 |  | 
 | 174 | hrtimers were also integrated into the -rt tree, along with a | 
 | 175 | hrtimers-based high-resolution clock implementation, so the hrtimers | 
 | 176 | code got a healthy amount of testing and use in practice. | 
 | 177 |  | 
 | 178 | 	Thomas Gleixner, Ingo Molnar |