Blame - Documentation/hrtimers.txt - android_kernel_oneplus_msm8996

Thomas Gleixner	df78488	2006-01-09 20:52:33 -0800	[diff] [blame]	1
				2	hrtimers - subsystem for high-resolution kernel timers
				3	----------------------------------------------------
				4
				5	This patch introduces a new subsystem for high-resolution kernel timers.
				6
				7	One might ask the question: we already have a timer subsystem
				8	(kernel/timers.c), why do we need two timer subsystems? After a lot of
				9	back and forth trying to integrate high-resolution and high-precision
				10	features into the existing timer framework, and after testing various
				11	such high-resolution timer implementations in practice, we came to the
				12	conclusion that the timer wheel code is fundamentally not suitable for
				13	such an approach. We initially didn't believe this ('there must be a way
				14	to solve this'), and spent a considerable effort trying to integrate
				15	things into the timer wheel, but we failed. In hindsight, there are
				16	several reasons why such integration is hard/impossible:
				17
				18	- the forced handling of low-resolution and high-resolution timers in
				19	the same way leads to a lot of compromises, macro magic and #ifdef
				20	mess. The timers.c code is very "tightly coded" around jiffies and
				21	32-bitness assumptions, and has been honed and micro-optimized for a
				22	relatively narrow use case (jiffies in a relatively narrow HZ range)
				23	for many years - and thus even small extensions to it easily break
				24	the wheel concept, leading to even worse compromises. The timer wheel
				25	code is very good and tight code, there are zero problems with it in its
				26	current usage - but it is simply not suitable to be extended for
				27	high-res timers.
				28
				29	- the unpredictable [O(N)] overhead of cascading leads to delays which
				30	necessitate a more complex handling of high resolution timers, which
				31	in turn decreases robustness. Such a design still led to rather large
				32	timing inaccuracies. Cascading is a fundamental property of the timer
				33	wheel concept, it cannot be 'designed out' without inevitably
				34	degrading other portions of the timers.c code in an unacceptable way.
				35
				36	- the implementation of the current posix-timer subsystem on top of
				37	the timer wheel has already introduced a quite complex handling of
				38	the required readjusting of absolute CLOCK_REALTIME timers at
				39	settimeofday or NTP time - a concrete example confirming our
				40	experience that the timer wheel data structure is too rigid for
				41	high-res timers.
				42
				43	- the timer wheel code is best suited to use cases which can be
				44	identified as "timeouts". Such timeouts are usually set up to cover
				45	error conditions in various I/O paths, such as networking and block
				46	I/O. The vast majority of those timers never expire and are rarely
				47	recascaded because the expected correct event arrives in time so they
				48	can be removed from the timer wheel before any further processing of
				49	them becomes necessary. Thus the users of these timeouts can accept
				50	the granularity and precision tradeoffs of the timer wheel, and
				51	largely expect the timer subsystem to have near-zero overhead.
				52	Accurate timing for them is not a core purpose - in fact most of the
				53	timeout values used are ad-hoc. For them it is at most a necessary
				54	evil to guarantee the processing of actual timeout completions
				55	(because most of the timeouts are deleted before completion), which
				56	should thus be as cheap and unintrusive as possible.
				57
				58	The primary users of precision timers are user-space applications that
				59	utilize nanosleep, posix-timers and itimer interfaces. Also, in-kernel
				60	users like drivers and subsystems which require precise timed events
				61	(e.g. multimedia) can benefit from the availability of a separate
				62	high-resolution timer subsystem as well.
				63
				64	While this subsystem does not offer high-resolution clock sources just
				65	yet, the hrtimer subsystem can be easily extended with high-resolution
				66	clock capabilities, and patches for that exist and are maturing quickly.
				67	The increasing demand for realtime and multimedia applications along
				68	with other potential users for precise timers gives another reason to
				69	separate the "timeout" and "precise timer" subsystems.
				70
				71	Another potential benefit is that such a separation allows even more
				72	special-purpose optimization of the existing timer wheel for the low
				73	resolution and low precision use cases - once the precision-sensitive
				74	APIs are separated from the timer wheel and are migrated over to
				75	hrtimers. E.g. we could decrease the frequency of the timeout subsystem
				76	from 250 Hz to 100 Hz (or even lower).
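
To make the granularity trade-off concrete, here is a minimal user-space
sketch (not kernel code) of the arithmetic behind a tick-driven timeout
facility. The helper names tick_period_ns() and timeout_in_ticks() are
hypothetical and exist only for this illustration, and the sketch assumes
that a timeout is simply rounded up to whole ticks.

```c
#include <assert.h>
#include <stdint.h>

#define DEMO_NSEC_PER_SEC 1000000000LL

/* Length of one tick in nanoseconds for a given tick rate in Hz. */
int64_t tick_period_ns(int hz)
{
	assert(hz > 0);
	return DEMO_NSEC_PER_SEC / hz;
}

/*
 * Round a relative timeout up to whole ticks. A tick-driven facility
 * can only fire on a tick boundary, so it never fires early.
 */
int64_t timeout_in_ticks(int64_t timeout_ns, int hz)
{
	int64_t period = tick_period_ns(hz);

	assert(timeout_ns >= 0);
	return (timeout_ns + period - 1) / period;
}
```

At 250 Hz a tick is 4 ms, so a 10 ms timeout is rounded up to 3 ticks
(12 ms); at 100 Hz a tick is 10 ms and the same timeout is exactly 1 tick.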
				77
				78	hrtimer subsystem implementation details
				79	----------------------------------------
				80
				81	The basic design considerations were:
				82
				83	- simplicity
				84
				85	- data structure not bound to jiffies or any other granularity. All the
				86	kernel logic works at 64-bit nanoseconds resolution - no compromises.
				87
				88	- simplification of existing, timing related kernel code
				89
				90	Another basic requirement was the immediate enqueueing and ordering of
				91	timers at activation time. After looking at several possible solutions
				92	such as radix trees and hashes, we chose the red-black tree as the basic
				93	data structure. Rbtrees are available as a library in the kernel and are
				94	used in various performance-critical areas of e.g. memory management and
				95	file systems. The rbtree is solely used for time sorted ordering, while
				96	a separate list is used to give the expiry code fast access to the
				97	queued timers, without having to walk the rbtree.
				98
				99	(This separate list is also useful for later when we'll introduce
				100	high-resolution clocks, where we need separate pending and expired
				101	queues while keeping the time-order intact.)
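
The idea of ordering timers at activation time can be sketched in user
space as follows. This is only an illustration under stated assumptions:
struct demo_timer, demo_enqueue() and demo_pop_expired() are hypothetical
names, and a sorted singly linked list stands in for the rbtree, so
insertion here is O(N) rather than O(log N).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for a queued timer; not the kernel's struct hrtimer. */
struct demo_timer {
	int64_t expires_ns;		/* absolute expiry time in nanoseconds */
	struct demo_timer *next;	/* next timer in expiry order */
};

/* Insert at activation time so the queue is always sorted by expiry. */
void demo_enqueue(struct demo_timer **head, struct demo_timer *t)
{
	struct demo_timer **pos = head;

	assert(head != NULL && t != NULL);
	/* Skip timers that expire earlier or at the same time (FIFO for ties). */
	while (*pos && (*pos)->expires_ns <= t->expires_ns)
		pos = &(*pos)->next;
	t->next = *pos;
	*pos = t;
}

/* Detach and return the earliest timer if it is due at now_ns, else NULL. */
struct demo_timer *demo_pop_expired(struct demo_timer **head, int64_t now_ns)
{
	struct demo_timer *t;

	assert(head != NULL);
	t = *head;
	if (!t || t->expires_ns > now_ns)
		return NULL;
	*head = t->next;
	t->next = NULL;
	return t;
}
```

Because the queue is kept sorted on insertion, the expiry path only ever
looks at the head of the queue and stops at the first timer that is still
in the future.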
				102
				103	Time-ordered enqueueing is not purely for the purposes of
				104	high-resolution clocks though, it also simplifies the handling of
				105	absolute timers based on a low-resolution CLOCK_REALTIME. The existing
				106	implementation needed to keep an extra list of all armed absolute
				107	CLOCK_REALTIME timers along with complex locking. In case of
				108	settimeofday and NTP, all the timers (!) had to be dequeued, the
				109	time-changing code had to fix them up one by one, and all of them had to
				110	be enqueued again. The time-ordered enqueueing and the storage of the
				111	expiry time in absolute time units removes all this complex and poorly
				112	scaling code from the posix-timer implementation - the clock can simply
				113	be set without having to touch the rbtree. This also makes the handling
				114	of posix-timers simpler in general.
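
A minimal user-space sketch of why storing absolute expiry times helps
when the clock is set: the queued values never change, only the clock
reading they are compared against. demo_count_due() is a hypothetical
helper, and the time-sorted array is an assumption standing in for the
time-ordered queue.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Count how many timers in a time-sorted array of absolute expiry
 * times are due at now_ns. The array is const: a clock change only
 * alters now_ns, the queued timers are never rewritten or requeued.
 */
size_t demo_count_due(const int64_t *expires_ns, size_t n, int64_t now_ns)
{
	size_t due = 0;

	assert(expires_ns != NULL || n == 0);
	/* Sorted input: stop at the first timer still in the future. */
	while (due < n && expires_ns[due] <= now_ns)
		due++;
	return due;
}
```

If the clock is stepped forward, more timers are due on the next check;
if it is stepped back, fewer are. In both cases the queue itself is left
untouched.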
				115
				116	The locking and per-CPU behavior of hrtimers was mostly taken from the
				117	existing timer wheel code, as it is mature and well suited. Sharing code
				118	was not really a win, due to the different data structures. Also, the
				119	hrtimer functions now have clearer behavior and clearer names - such as
				120	hrtimer_try_to_cancel() and hrtimer_cancel() [which are roughly
				121	equivalent to del_timer() and del_timer_sync()] - so there's no direct
				122	1:1 mapping between them on the algorithmic level, and thus no real
				123	potential for code sharing either.
				124
				125	Basic data types: every time value, absolute or relative, is in a
				126	special nanosecond-resolution type: ktime_t. The kernel-internal
				127	representation of ktime_t values and operations is implemented via
				128	macros and inline functions, and can be switched between a "hybrid
				129	union" type and a plain "scalar" 64-bit nanoseconds representation (at
				130	compile time). The hybrid union type optimizes time conversions on 32-bit
				131	CPUs. This build-time-selectable ktime_t storage format was implemented
				132	to avoid the performance impact of 64-bit multiplications and divisions
				133	on 32-bit CPUs. Such operations are frequently necessary to convert
				134	between the storage formats provided by kernel and userspace interfaces
				135	and the internal time format. (See include/linux/ktime.h for further
				136	details.)
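
The plain scalar representation can be illustrated with a small
user-space sketch. The type demo_ktime_t and the two conversion helpers
are hypothetical names for this example, not the kernel's ktime_t API,
and only non-negative time values are handled.

```c
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define DEMO_NSEC_PER_SEC 1000000000LL

/* Scalar time value: signed 64-bit count of nanoseconds. */
typedef int64_t demo_ktime_t;

/* seconds/nanoseconds pair -> scalar nanoseconds (one 64-bit multiply) */
demo_ktime_t demo_ktime_from_timespec(struct timespec ts)
{
	assert(ts.tv_sec >= 0);
	assert(ts.tv_nsec >= 0 && ts.tv_nsec < DEMO_NSEC_PER_SEC);
	return (demo_ktime_t)ts.tv_sec * DEMO_NSEC_PER_SEC + ts.tv_nsec;
}

/* scalar nanoseconds -> seconds/nanoseconds pair (64-bit divide and modulo) */
struct timespec demo_ktime_to_timespec(demo_ktime_t kt)
{
	struct timespec ts;

	assert(kt >= 0);
	ts.tv_sec = (time_t)(kt / DEMO_NSEC_PER_SEC);
	ts.tv_nsec = (long)(kt % DEMO_NSEC_PER_SEC);
	return ts;
}
```

With the scalar form, adding and comparing times are ordinary integer
operations. The multiply, divide and modulo in the conversions are the
64-bit operations whose cost on 32-bit CPUs the text above refers to.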
				137
				138	hrtimers - rounding of timer values
				139	-----------------------------------
				140
				141	The hrtimer code will round timer events to lower-resolution clocks
				142	because it has to. Otherwise it will do no artificial rounding at all.
				143
				144	One question is what resolution value should be returned to the user by
				145	the clock_getres() interface. This will return whatever real resolution
				146	a given clock has - be it low-res, high-res, or artificially-low-res.
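
From user space, the resolution of a clock can be queried with the POSIX
clock_getres() call. The wrapper below is a minimal sketch;
demo_clock_resolution_ns() is a hypothetical helper name, and the value
it reports depends on the kernel and hardware it runs on.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <stdint.h>
#include <time.h>

/*
 * Store the resolution of clock clk in nanoseconds in *out_ns.
 * Returns 0 on success and -1 if clock_getres() fails.
 */
int demo_clock_resolution_ns(clockid_t clk, int64_t *out_ns)
{
	struct timespec res;

	assert(out_ns != NULL);
	if (clock_getres(clk, &res) != 0)
		return -1;
	*out_ns = (int64_t)res.tv_sec * 1000000000LL + res.tv_nsec;
	return 0;
}
```

Calling it with CLOCK_REALTIME or CLOCK_MONOTONIC reports whatever
resolution the running system provides for that clock.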
				147
				148	hrtimers - testing and verification
				149	-----------------------------------
				150
				151	We used the high-resolution clock subsystem on top of hrtimers to verify
				152	the hrtimer implementation details in practice, and we also ran the posix
				153	timer tests in order to ensure specification compliance. We also ran
				154	tests on low-resolution clocks.
				155
				156	The hrtimer patch converts the following kernel functionality to use
				157	hrtimers:
				158
				159	- nanosleep
				160	- itimers
				161	- posix-timers
				162
				163	The conversion of nanosleep and posix-timers enabled the unification of
				164	nanosleep and clock_nanosleep.
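
The user-visible side of these interfaces can be exercised with a short
user-space sketch. It shows a relative sleep expressed as an absolute
clock_nanosleep() on CLOCK_MONOTONIC; it does not show the in-kernel
unification itself. demo_sleep_until() and demo_sleep_for_ns() are
hypothetical helper names.

```c
#define _POSIX_C_SOURCE 200809L
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <time.h>

/* Sleep until an absolute CLOCK_MONOTONIC deadline, restarting after signals. */
int demo_sleep_until(const struct timespec *deadline)
{
	int err;

	assert(deadline != NULL);
	do {
		/* clock_nanosleep() returns the error number directly, not via errno. */
		err = clock_nanosleep(CLOCK_MONOTONIC, TIMER_ABSTIME, deadline, NULL);
	} while (err == EINTR);
	return err;
}

/* Sleep for delay_ns by converting the delay into an absolute deadline. */
int demo_sleep_for_ns(int64_t delay_ns)
{
	struct timespec deadline;

	assert(delay_ns >= 0);
	if (clock_gettime(CLOCK_MONOTONIC, &deadline) != 0)
		return errno;
	deadline.tv_sec += (time_t)(delay_ns / 1000000000LL);
	deadline.tv_nsec += (long)(delay_ns % 1000000000LL);
	if (deadline.tv_nsec >= 1000000000L) {
		/* Carry overflowing nanoseconds into the seconds field. */
		deadline.tv_sec += 1;
		deadline.tv_nsec -= 1000000000L;
	}
	return demo_sleep_until(&deadline);
}
```

Using an absolute deadline means that a sleep interrupted by a signal can
simply be restarted with the same arguments, without recomputing the
remaining time.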
				165
				166	The code was successfully compiled for the following platforms:
				167
				168	i386, x86_64, ARM, PPC, PPC64, IA64
				169
				170	The code was run-tested on the following platforms:
				171
				172	i386(UP/SMP), x86_64(UP/SMP), ARM, PPC
				173
				174	hrtimers were also integrated into the -rt tree, along with a
				175	hrtimers-based high-resolution clock implementation, so the hrtimers
				176	code got a healthy amount of testing and use in practice.
				177
				178	Thomas Gleixner, Ingo Molnar