Blame - Documentation/DocBook/kernel-locking.tmpl - android_kernel_oneplus_msm8996

blob: 2e9d6b41f034594b3b1af87c97ab1da81f5caccd [file] [log] [blame]

Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1	<?xml version="1.0" encoding="UTF-8"?>
				2	<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
				3	"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>
				4
				5	<book id="LKLockingGuide">
				6	<bookinfo>
				7	<title>Unreliable Guide To Locking</title>
				8
				9	<authorgroup>
				10	<author>
				11	<firstname>Rusty</firstname>
				12	<surname>Russell</surname>
				13	<affiliation>
				14	<address>
				15	<email>rusty@rustcorp.com.au</email>
				16	</address>
				17	</affiliation>
				18	</author>
				19	</authorgroup>
				20
				21	<copyright>
				22	<year>2003</year>
				23	<holder>Rusty Russell</holder>
				24	</copyright>
				25
				26	<legalnotice>
				27	<para>
				28	This documentation is free software; you can redistribute
				29	it and/or modify it under the terms of the GNU General Public
				30	License as published by the Free Software Foundation; either
				31	version 2 of the License, or (at your option) any later
				32	version.
				33	</para>
				34
				35	<para>
				36	This program is distributed in the hope that it will be
				37	useful, but WITHOUT ANY WARRANTY; without even the implied
				38	warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
				39	See the GNU General Public License for more details.
				40	</para>
				41
				42	<para>
				43	You should have received a copy of the GNU General Public
				44	License along with this program; if not, write to the Free
				45	Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
				46	MA 02111-1307 USA
				47	</para>
				48
				49	<para>
				50	For more details see the file COPYING in the source
				51	distribution of Linux.
				52	</para>
				53	</legalnotice>
				54	</bookinfo>
				55
				56	<toc></toc>
				57	<chapter id="intro">
				58	<title>Introduction</title>
				59	<para>
				60	Welcome, to Rusty's Remarkably Unreliable Guide to Kernel
				61	Locking issues. This document describes the locking systems in
				62	the Linux Kernel in 2.6.
				63	</para>
				64	<para>
				65	With the wide availability of HyperThreading, and <firstterm
				66	linkend="gloss-preemption">preemption </firstterm> in the Linux
				67	Kernel, everyone hacking on the kernel needs to know the
				68	fundamentals of concurrency and locking for
				69	<firstterm linkend="gloss-smp"><acronym>SMP</acronym></firstterm>.
				70	</para>
				71	</chapter>
				72
				73	<chapter id="races">
				74	<title>The Problem With Concurrency</title>
				75	<para>
				76	(Skip this if you know what a Race Condition is).
				77	</para>
				78	<para>
				79	In a normal program, you can increment a counter like so:
				80	</para>
				81	<programlisting>
				82	very_important_count++;
				83	</programlisting>
				84
				85	<para>
				86	This is what they would expect to happen:
				87	</para>
				88
				89	<table>
				90	<title>Expected Results</title>
				91
				92	<tgroup cols="2" align="left">
				93
				94	<thead>
				95	<row>
				96	<entry>Instance 1</entry>
				97	<entry>Instance 2</entry>
				98	</row>
				99	</thead>
				100
				101	<tbody>
				102	<row>
				103	<entry>read very_important_count (5)</entry>
				104	<entry></entry>
				105	</row>
				106	<row>
				107	<entry>add 1 (6)</entry>
				108	<entry></entry>
				109	</row>
				110	<row>
				111	<entry>write very_important_count (6)</entry>
				112	<entry></entry>
				113	</row>
				114	<row>
				115	<entry></entry>
				116	<entry>read very_important_count (6)</entry>
				117	</row>
				118	<row>
				119	<entry></entry>
				120	<entry>add 1 (7)</entry>
				121	</row>
				122	<row>
				123	<entry></entry>
				124	<entry>write very_important_count (7)</entry>
				125	</row>
				126	</tbody>
				127
				128	</tgroup>
				129	</table>
				130
				131	<para>
				132	This is what might happen:
				133	</para>
				134
				135	<table>
				136	<title>Possible Results</title>
				137
				138	<tgroup cols="2" align="left">
				139	<thead>
				140	<row>
				141	<entry>Instance 1</entry>
				142	<entry>Instance 2</entry>
				143	</row>
				144	</thead>
				145
				146	<tbody>
				147	<row>
				148	<entry>read very_important_count (5)</entry>
				149	<entry></entry>
				150	</row>
				151	<row>
				152	<entry></entry>
				153	<entry>read very_important_count (5)</entry>
				154	</row>
				155	<row>
				156	<entry>add 1 (6)</entry>
				157	<entry></entry>
				158	</row>
				159	<row>
				160	<entry></entry>
				161	<entry>add 1 (6)</entry>
				162	</row>
				163	<row>
				164	<entry>write very_important_count (6)</entry>
				165	<entry></entry>
				166	</row>
				167	<row>
				168	<entry></entry>
				169	<entry>write very_important_count (6)</entry>
				170	</row>
				171	</tbody>
				172	</tgroup>
				173	</table>
				174
				175	<sect1 id="race-condition">
				176	<title>Race Conditions and Critical Regions</title>
				177	<para>
				178	This overlap, where the result depends on the
				179	relative timing of multiple tasks, is called a <firstterm>race condition</firstterm>.
				180	The piece of code containing the concurrency issue is called a
				181	<firstterm>critical region</firstterm>. And especially since Linux starting running
				182	on SMP machines, they became one of the major issues in kernel
				183	design and implementation.
				184	</para>
				185	<para>
				186	Preemption can have the same effect, even if there is only one
				187	CPU: by preempting one task during the critical region, we have
				188	exactly the same race condition. In this case the thread which
				189	preempts might run the critical region itself.
				190	</para>
				191	<para>
				192	The solution is to recognize when these simultaneous accesses
				193	occur, and use locks to make sure that only one instance can
				194	enter the critical region at any time. There are many
				195	friendly primitives in the Linux kernel to help you do this.
				196	And then there are the unfriendly primitives, but I'll pretend
				197	they don't exist.
				198	</para>
				199	</sect1>
				200	</chapter>
				201
				202	<chapter id="locks">
				203	<title>Locking in the Linux Kernel</title>
				204
				205	<para>
				206	If I could give you one piece of advice: never sleep with anyone
				207	crazier than yourself. But if I had to give you advice on
				208	locking: <emphasis>keep it simple</emphasis>.
				209	</para>
				210
				211	<para>
				212	Be reluctant to introduce new locks.
				213	</para>
				214
				215	<para>
				216	Strangely enough, this last one is the exact reverse of my advice when
				217	you <emphasis>have</emphasis> slept with someone crazier than yourself.
				218	And you should think about getting a big dog.
				219	</para>
				220
				221	<sect1 id="lock-intro">
Linus Nilsson	e320226	2007-07-21 17:25:34 +0200	[diff] [blame]	222	<title>Three Main Types of Kernel Locks: Spinlocks, Mutexes and Semaphores</title>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	223
				224	<para>
Ingo Molnar	f3f54ff	2006-01-09 15:59:20 -0800	[diff] [blame]	225	There are three main types of kernel locks. The fundamental type
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	226	is the spinlock
				227	(<filename class="headerfile">include/asm/spinlock.h</filename>),
				228	which is a very simple single-holder lock: if you can't get the
				229	spinlock, you keep trying (spinning) until you can. Spinlocks are
				230	very small and fast, and can be used anywhere.
				231	</para>
				232	<para>
Ingo Molnar	f3f54ff	2006-01-09 15:59:20 -0800	[diff] [blame]	233	The second type is a mutex
				234	(<filename class="headerfile">include/linux/mutex.h</filename>): it
				235	is like a spinlock, but you may block holding a mutex.
				236	If you can't lock a mutex, your task will suspend itself, and be woken
				237	up when the mutex is released. This means the CPU can do something
				238	else while you are waiting. There are many cases when you simply
				239	can't sleep (see <xref linkend="sleeping-things"/>), and so have to
				240	use a spinlock instead.
				241	</para>
				242	<para>
				243	The third type is a semaphore
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	244	(<filename class="headerfile">include/asm/semaphore.h</filename>): it
				245	can have more than one holder at any time (the number decided at
				246	initialization time), although it is most commonly used as a
Ingo Molnar	f3f54ff	2006-01-09 15:59:20 -0800	[diff] [blame]	247	single-holder lock (a mutex). If you can't get a semaphore, your
				248	task will be suspended and later on woken up - just like for mutexes.
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	249	</para>
				250	<para>
				251	Neither type of lock is recursive: see
				252	<xref linkend="deadlock"/>.
				253	</para>
				254	</sect1>
				255
				256	<sect1 id="uniprocessor">
				257	<title>Locks and Uniprocessor Kernels</title>
				258
				259	<para>
				260	For kernels compiled without <symbol>CONFIG_SMP</symbol>, and
				261	without <symbol>CONFIG_PREEMPT</symbol> spinlocks do not exist at
				262	all. This is an excellent design decision: when no-one else can
				263	run at the same time, there is no reason to have a lock.
				264	</para>
				265
				266	<para>
				267	If the kernel is compiled without <symbol>CONFIG_SMP</symbol>,
				268	but <symbol>CONFIG_PREEMPT</symbol> is set, then spinlocks
				269	simply disable preemption, which is sufficient to prevent any
				270	races. For most purposes, we can think of preemption as
				271	equivalent to SMP, and not worry about it separately.
				272	</para>
				273
				274	<para>
				275	You should always test your locking code with <symbol>CONFIG_SMP</symbol>
				276	and <symbol>CONFIG_PREEMPT</symbol> enabled, even if you don't have an SMP test box, because it
				277	will still catch some kinds of locking bugs.
				278	</para>
				279
				280	<para>
				281	Semaphores still exist, because they are required for
				282	synchronization between <firstterm linkend="gloss-usercontext">user
				283	contexts</firstterm>, as we will see below.
				284	</para>
				285	</sect1>
				286
				287	<sect1 id="usercontextlocking">
				288	<title>Locking Only In User Context</title>
				289
				290	<para>
				291	If you have a data structure which is only ever accessed from
				292	user context, then you can use a simple semaphore
				293	(<filename>linux/asm/semaphore.h</filename>) to protect it. This
				294	is the most trivial case: you initialize the semaphore to the number
				295	of resources available (usually 1), and call
				296	<function>down_interruptible()</function> to grab the semaphore, and
				297	<function>up()</function> to release it. There is also a
				298	<function>down()</function>, which should be avoided, because it
				299	will not return if a signal is received.
				300	</para>
				301
				302	<para>
				303	Example: <filename>linux/net/core/netfilter.c</filename> allows
				304	registration of new <function>setsockopt()</function> and
				305	<function>getsockopt()</function> calls, with
				306	<function>nf_register_sockopt()</function>. Registration and
				307	de-registration are only done on module load and unload (and boot
				308	time, where there is no concurrency), and the list of registrations
				309	is only consulted for an unknown <function>setsockopt()</function>
				310	or <function>getsockopt()</function> system call. The
				311	<varname>nf_sockopt_mutex</varname> is perfect to protect this,
				312	especially since the setsockopt and getsockopt calls may well
				313	sleep.
				314	</para>
				315	</sect1>
				316
				317	<sect1 id="lock-user-bh">
				318	<title>Locking Between User Context and Softirqs</title>
				319
				320	<para>
				321	If a <firstterm linkend="gloss-softirq">softirq</firstterm> shares
				322	data with user context, you have two problems. Firstly, the current
				323	user context can be interrupted by a softirq, and secondly, the
				324	critical region could be entered from another CPU. This is where
				325	<function>spin_lock_bh()</function>
				326	(<filename class="headerfile">include/linux/spinlock.h</filename>) is
				327	used. It disables softirqs on that CPU, then grabs the lock.
				328	<function>spin_unlock_bh()</function> does the reverse. (The
				329	'_bh' suffix is a historical reference to "Bottom Halves", the
				330	old name for software interrupts. It should really be
				331	called spin_lock_softirq()' in a perfect world).
				332	</para>
				333
				334	<para>
				335	Note that you can also use <function>spin_lock_irq()</function>
				336	or <function>spin_lock_irqsave()</function> here, which stop
				337	hardware interrupts as well: see <xref linkend="hardirq-context"/>.
				338	</para>
				339
				340	<para>
				341	This works perfectly for <firstterm linkend="gloss-up"><acronym>UP
				342	</acronym></firstterm> as well: the spin lock vanishes, and this macro
				343	simply becomes <function>local_bh_disable()</function>
				344	(<filename class="headerfile">include/linux/interrupt.h</filename>), which
				345	protects you from the softirq being run.
				346	</para>
				347	</sect1>
				348
				349	<sect1 id="lock-user-tasklet">
				350	<title>Locking Between User Context and Tasklets</title>
				351
				352	<para>
				353	This is exactly the same as above, because <firstterm
				354	linkend="gloss-tasklet">tasklets</firstterm> are actually run
				355	from a softirq.
				356	</para>
				357	</sect1>
				358
				359	<sect1 id="lock-user-timers">
				360	<title>Locking Between User Context and Timers</title>
				361
				362	<para>
				363	This, too, is exactly the same as above, because <firstterm
				364	linkend="gloss-timers">timers</firstterm> are actually run from
				365	a softirq. From a locking point of view, tasklets and timers
				366	are identical.
				367	</para>
				368	</sect1>
				369
				370	<sect1 id="lock-tasklets">
				371	<title>Locking Between Tasklets/Timers</title>
				372
				373	<para>
				374	Sometimes a tasklet or timer might want to share data with
				375	another tasklet or timer.
				376	</para>
				377
				378	<sect2 id="lock-tasklets-same">
				379	<title>The Same Tasklet/Timer</title>
				380	<para>
				381	Since a tasklet is never run on two CPUs at once, you don't
				382	need to worry about your tasklet being reentrant (running
				383	twice at once), even on SMP.
				384	</para>
				385	</sect2>
				386
				387	<sect2 id="lock-tasklets-different">
				388	<title>Different Tasklets/Timers</title>
				389	<para>
				390	If another tasklet/timer wants
				391	to share data with your tasklet or timer , you will both need to use
				392	<function>spin_lock()</function> and
				393	<function>spin_unlock()</function> calls.
				394	<function>spin_lock_bh()</function> is
				395	unnecessary here, as you are already in a tasklet, and
				396	none will be run on the same CPU.
				397	</para>
				398	</sect2>
				399	</sect1>
				400
				401	<sect1 id="lock-softirqs">
				402	<title>Locking Between Softirqs</title>
				403
				404	<para>
				405	Often a softirq might
				406	want to share data with itself or a tasklet/timer.
				407	</para>
				408
				409	<sect2 id="lock-softirqs-same">
				410	<title>The Same Softirq</title>
				411
				412	<para>
				413	The same softirq can run on the other CPUs: you can use a
				414	per-CPU array (see <xref linkend="per-cpu"/>) for better
				415	performance. If you're going so far as to use a softirq,
				416	you probably care about scalable performance enough
				417	to justify the extra complexity.
				418	</para>
				419
				420	<para>
				421	You'll need to use <function>spin_lock()</function> and
				422	<function>spin_unlock()</function> for shared data.
				423	</para>
				424	</sect2>
				425
				426	<sect2 id="lock-softirqs-different">
				427	<title>Different Softirqs</title>
				428
				429	<para>
				430	You'll need to use <function>spin_lock()</function> and
				431	<function>spin_unlock()</function> for shared data, whether it
				432	be a timer, tasklet, different softirq or the same or another
				433	softirq: any of them could be running on a different CPU.
				434	</para>
				435	</sect2>
				436	</sect1>
				437	</chapter>
				438
				439	<chapter id="hardirq-context">
				440	<title>Hard IRQ Context</title>
				441
				442	<para>
				443	Hardware interrupts usually communicate with a
				444	tasklet or softirq. Frequently this involves putting work in a
				445	queue, which the softirq will take out.
				446	</para>
				447
				448	<sect1 id="hardirq-softirq">
				449	<title>Locking Between Hard IRQ and Softirqs/Tasklets</title>
				450
				451	<para>
				452	If a hardware irq handler shares data with a softirq, you have
				453	two concerns. Firstly, the softirq processing can be
				454	interrupted by a hardware interrupt, and secondly, the
				455	critical region could be entered by a hardware interrupt on
				456	another CPU. This is where <function>spin_lock_irq()</function> is
				457	used. It is defined to disable interrupts on that cpu, then grab
				458	the lock. <function>spin_unlock_irq()</function> does the reverse.
				459	</para>
				460
				461	<para>
				462	The irq handler does not to use
				463	<function>spin_lock_irq()</function>, because the softirq cannot
				464	run while the irq handler is running: it can use
				465	<function>spin_lock()</function>, which is slightly faster. The
				466	only exception would be if a different hardware irq handler uses
				467	the same lock: <function>spin_lock_irq()</function> will stop
				468	that from interrupting us.
				469	</para>
				470
				471	<para>
				472	This works perfectly for UP as well: the spin lock vanishes,
				473	and this macro simply becomes <function>local_irq_disable()</function>
				474	(<filename class="headerfile">include/asm/smp.h</filename>), which
				475	protects you from the softirq/tasklet/BH being run.
				476	</para>
				477
				478	<para>
				479	<function>spin_lock_irqsave()</function>
				480	(<filename>include/linux/spinlock.h</filename>) is a variant
				481	which saves whether interrupts were on or off in a flags word,
				482	which is passed to <function>spin_unlock_irqrestore()</function>. This
				483	means that the same code can be used inside an hard irq handler (where
				484	interrupts are already off) and in softirqs (where the irq
				485	disabling is required).
				486	</para>
				487
				488	<para>
				489	Note that softirqs (and hence tasklets and timers) are run on
				490	return from hardware interrupts, so
				491	<function>spin_lock_irq()</function> also stops these. In that
				492	sense, <function>spin_lock_irqsave()</function> is the most
				493	general and powerful locking function.
				494	</para>
				495
				496	</sect1>
				497	<sect1 id="hardirq-hardirq">
				498	<title>Locking Between Two Hard IRQ Handlers</title>
				499	<para>
				500	It is rare to have to share data between two IRQ handlers, but
				501	if you do, <function>spin_lock_irqsave()</function> should be
				502	used: it is architecture-specific whether all interrupts are
				503	disabled inside irq handlers themselves.
				504	</para>
				505	</sect1>
				506
				507	</chapter>
				508
				509	<chapter id="cheatsheet">
				510	<title>Cheat Sheet For Locking</title>
				511	<para>
				512	Pete Zaitcev gives the following summary:
				513	</para>
				514	<itemizedlist>
				515	<listitem>
				516	<para>
				517	If you are in a process context (any syscall) and want to
				518	lock other process out, use a semaphore. You can take a semaphore
				519	and sleep (<function>copy_from_user*(</function> or
				520	<function>kmalloc(x,GFP_KERNEL)</function>).
				521	</para>
				522	</listitem>
				523	<listitem>
				524	<para>
				525	Otherwise (== data can be touched in an interrupt), use
				526	<function>spin_lock_irqsave()</function> and
				527	<function>spin_unlock_irqrestore()</function>.
				528	</para>
				529	</listitem>
				530	<listitem>
				531	<para>
				532	Avoid holding spinlock for more than 5 lines of code and
				533	across any function call (except accessors like
				534	<function>readb</function>).
				535	</para>
				536	</listitem>
				537	</itemizedlist>
				538
				539	<sect1 id="minimum-lock-reqirements">
				540	<title>Table of Minimum Requirements</title>
				541
				542	<para> The following table lists the <emphasis>minimum</emphasis>
				543	locking requirements between various contexts. In some cases,
				544	the same context can only be running on one CPU at a time, so
				545	no locking is required for that context (eg. a particular
				546	thread can only run on one CPU at a time, but if it needs
				547	shares data with another thread, locking is required).
				548	</para>
				549	<para>
				550	Remember the advice above: you can always use
				551	<function>spin_lock_irqsave()</function>, which is a superset
				552	of all other spinlock primitives.
				553	</para>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	554
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	555	<table>
				556	<title>Table of Locking Requirements</title>
				557	<tgroup cols="11">
				558	<tbody>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	559
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	560	<row>
				561	<entry></entry>
				562	<entry>IRQ Handler A</entry>
				563	<entry>IRQ Handler B</entry>
				564	<entry>Softirq A</entry>
				565	<entry>Softirq B</entry>
				566	<entry>Tasklet A</entry>
				567	<entry>Tasklet B</entry>
				568	<entry>Timer A</entry>
				569	<entry>Timer B</entry>
				570	<entry>User Context A</entry>
				571	<entry>User Context B</entry>
				572	</row>
				573
				574	<row>
				575	<entry>IRQ Handler A</entry>
				576	<entry>None</entry>
				577	</row>
				578
				579	<row>
				580	<entry>IRQ Handler B</entry>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	581	<entry>SLIS</entry>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	582	<entry>None</entry>
				583	</row>
				584
				585	<row>
				586	<entry>Softirq A</entry>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	587	<entry>SLI</entry>
				588	<entry>SLI</entry>
				589	<entry>SL</entry>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	590	</row>
				591
				592	<row>
				593	<entry>Softirq B</entry>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	594	<entry>SLI</entry>
				595	<entry>SLI</entry>
				596	<entry>SL</entry>
				597	<entry>SL</entry>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	598	</row>
				599
				600	<row>
				601	<entry>Tasklet A</entry>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	602	<entry>SLI</entry>
				603	<entry>SLI</entry>
				604	<entry>SL</entry>
				605	<entry>SL</entry>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	606	<entry>None</entry>
				607	</row>
				608
				609	<row>
				610	<entry>Tasklet B</entry>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	611	<entry>SLI</entry>
				612	<entry>SLI</entry>
				613	<entry>SL</entry>
				614	<entry>SL</entry>
				615	<entry>SL</entry>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	616	<entry>None</entry>
				617	</row>
				618
				619	<row>
				620	<entry>Timer A</entry>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	621	<entry>SLI</entry>
				622	<entry>SLI</entry>
				623	<entry>SL</entry>
				624	<entry>SL</entry>
				625	<entry>SL</entry>
				626	<entry>SL</entry>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	627	<entry>None</entry>
				628	</row>
				629
				630	<row>
				631	<entry>Timer B</entry>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	632	<entry>SLI</entry>
				633	<entry>SLI</entry>
				634	<entry>SL</entry>
				635	<entry>SL</entry>
				636	<entry>SL</entry>
				637	<entry>SL</entry>
				638	<entry>SL</entry>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	639	<entry>None</entry>
				640	</row>
				641
				642	<row>
				643	<entry>User Context A</entry>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	644	<entry>SLI</entry>
				645	<entry>SLI</entry>
				646	<entry>SLBH</entry>
				647	<entry>SLBH</entry>
				648	<entry>SLBH</entry>
				649	<entry>SLBH</entry>
				650	<entry>SLBH</entry>
				651	<entry>SLBH</entry>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	652	<entry>None</entry>
				653	</row>
				654
				655	<row>
				656	<entry>User Context B</entry>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	657	<entry>SLI</entry>
				658	<entry>SLI</entry>
				659	<entry>SLBH</entry>
				660	<entry>SLBH</entry>
				661	<entry>SLBH</entry>
				662	<entry>SLBH</entry>
				663	<entry>SLBH</entry>
				664	<entry>SLBH</entry>
				665	<entry>DI</entry>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	666	<entry>None</entry>
				667	</row>
				668
				669	</tbody>
				670	</tgroup>
				671	</table>
Randy Dunlap	621e59a	2007-05-16 22:11:12 -0700	[diff] [blame]	672
				673	<table>
				674	<title>Legend for Locking Requirements Table</title>
				675	<tgroup cols="2">
				676	<tbody>
				677
				678	<row>
				679	<entry>SLIS</entry>
				680	<entry>spin_lock_irqsave</entry>
				681	</row>
				682	<row>
				683	<entry>SLI</entry>
				684	<entry>spin_lock_irq</entry>
				685	</row>
				686	<row>
				687	<entry>SL</entry>
				688	<entry>spin_lock</entry>
				689	</row>
				690	<row>
				691	<entry>SLBH</entry>
				692	<entry>spin_lock_bh</entry>
				693	</row>
				694	<row>
				695	<entry>DI</entry>
				696	<entry>down_interruptible</entry>
				697	</row>
				698
				699	</tbody>
				700	</tgroup>
				701	</table>
				702
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	703	</sect1>
				704	</chapter>
				705
				706	<chapter id="Examples">
				707	<title>Common Examples</title>
				708	<para>
				709	Let's step through a simple example: a cache of number to name
				710	mappings. The cache keeps a count of how often each of the objects is
				711	used, and when it gets full, throws out the least used one.
				712
				713	</para>
				714
				715	<sect1 id="examples-usercontext">
				716	<title>All In User Context</title>
				717	<para>
				718	For our first example, we assume that all operations are in user
				719	context (ie. from system calls), so we can sleep. This means we can
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	720	use a mutex to protect the cache and all the objects within
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	721	it. Here's the code:
				722	</para>
				723
				724	<programlisting>
				725	#include <linux/list.h>
				726	#include <linux/slab.h>
				727	#include <linux/string.h>
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	728	#include <linux/mutex.h>
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	729	#include <asm/errno.h>
				730
				731	struct object
				732	{
				733	struct list_head list;
				734	int id;
				735	char name[32];
				736	int popularity;
				737	};
				738
				739	/* Protects the cache, cache_num, and the objects within it */
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	740	static DEFINE_MUTEX(cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	741	static LIST_HEAD(cache);
				742	static unsigned int cache_num = 0;
				743	#define MAX_CACHE_SIZE 10
				744
				745	/* Must be holding cache_lock */
				746	static struct object *__cache_find(int id)
				747	{
				748	struct object *i;
				749
				750	list_for_each_entry(i, &cache, list)
				751	if (i->id == id) {
				752	i->popularity++;
				753	return i;
				754	}
				755	return NULL;
				756	}
				757
				758	/* Must be holding cache_lock */
				759	static void __cache_delete(struct object *obj)
				760	{
				761	BUG_ON(!obj);
				762	list_del(&obj->list);
				763	kfree(obj);
				764	cache_num--;
				765	}
				766
				767	/* Must be holding cache_lock */
				768	static void __cache_add(struct object *obj)
				769	{
				770	list_add(&obj->list, &cache);
				771	if (++cache_num > MAX_CACHE_SIZE) {
				772	struct object i, outcast = NULL;
				773	list_for_each_entry(i, &cache, list) {
				774	if (!outcast \|\| i->popularity < outcast->popularity)
				775	outcast = i;
				776	}
				777	__cache_delete(outcast);
				778	}
				779	}
				780
				781	int cache_add(int id, const char *name)
				782	{
				783	struct object *obj;
				784
				785	if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
				786	return -ENOMEM;
				787
				788	strlcpy(obj->name, name, sizeof(obj->name));
				789	obj->id = id;
				790	obj->popularity = 0;
				791
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	792	mutex_lock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	793	__cache_add(obj);
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	794	mutex_unlock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	795	return 0;
				796	}
				797
				798	void cache_delete(int id)
				799	{
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	800	mutex_lock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	801	__cache_delete(__cache_find(id));
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	802	mutex_unlock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	803	}
				804
				805	int cache_find(int id, char *name)
				806	{
				807	struct object *obj;
				808	int ret = -ENOENT;
				809
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	810	mutex_lock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	811	obj = __cache_find(id);
				812	if (obj) {
				813	ret = 0;
				814	strcpy(name, obj->name);
				815	}
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	816	mutex_unlock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	817	return ret;
				818	}
				819	</programlisting>
				820
				821	<para>
				822	Note that we always make sure we have the cache_lock when we add,
				823	delete, or look up the cache: both the cache infrastructure itself and
				824	the contents of the objects are protected by the lock. In this case
				825	it's easy, since we copy the data for the user, and never let them
				826	access the objects directly.
				827	</para>
				828	<para>
				829	There is a slight (and common) optimization here: in
				830	<function>cache_add</function> we set up the fields of the object
				831	before grabbing the lock. This is safe, as no-one else can access it
				832	until we put it in cache.
				833	</para>
				834	</sect1>
				835
				836	<sect1 id="examples-interrupt">
				837	<title>Accessing From Interrupt Context</title>
				838	<para>
				839	Now consider the case where <function>cache_find</function> can be
				840	called from interrupt context: either a hardware interrupt or a
				841	softirq. An example would be a timer which deletes object from the
				842	cache.
				843	</para>
				844	<para>
				845	The change is shown below, in standard patch format: the
				846	<symbol>-</symbol> are lines which are taken away, and the
				847	<symbol>+</symbol> are lines which are added.
				848	</para>
				849	<programlisting>
				850	--- cache.c.usercontext 2003-12-09 13:58:54.000000000 +1100
				851	+++ cache.c.interrupt 2003-12-09 14:07:49.000000000 +1100
				852	@@ -12,7 +12,7 @@
				853	int popularity;
				854	};
				855
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	856	-static DEFINE_MUTEX(cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	857	+static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED;
				858	static LIST_HEAD(cache);
				859	static unsigned int cache_num = 0;
				860	#define MAX_CACHE_SIZE 10
				861	@@ -55,6 +55,7 @@
				862	int cache_add(int id, const char *name)
				863	{
				864	struct object *obj;
				865	+ unsigned long flags;
				866
				867	if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
				868	return -ENOMEM;
				869	@@ -63,30 +64,33 @@
				870	obj->id = id;
				871	obj->popularity = 0;
				872
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	873	- mutex_lock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	874	+ spin_lock_irqsave(&cache_lock, flags);
				875	__cache_add(obj);
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	876	- mutex_unlock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	877	+ spin_unlock_irqrestore(&cache_lock, flags);
				878	return 0;
				879	}
				880
				881	void cache_delete(int id)
				882	{
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	883	- mutex_lock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	884	+ unsigned long flags;
				885	+
				886	+ spin_lock_irqsave(&cache_lock, flags);
				887	__cache_delete(__cache_find(id));
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	888	- mutex_unlock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	889	+ spin_unlock_irqrestore(&cache_lock, flags);
				890	}
				891
				892	int cache_find(int id, char *name)
				893	{
				894	struct object *obj;
				895	int ret = -ENOENT;
				896	+ unsigned long flags;
				897
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	898	- mutex_lock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	899	+ spin_lock_irqsave(&cache_lock, flags);
				900	obj = __cache_find(id);
				901	if (obj) {
				902	ret = 0;
				903	strcpy(name, obj->name);
				904	}
Daniel Walker	66656eb	2008-02-06 01:37:39 -0800	[diff] [blame]	905	- mutex_unlock(&cache_lock);
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	906	+ spin_unlock_irqrestore(&cache_lock, flags);
				907	return ret;
				908	}
				909	</programlisting>
				910
				911	<para>
				912	Note that the <function>spin_lock_irqsave</function> will turn off
				913	interrupts if they are on, otherwise does nothing (if we are already
				914	in an interrupt handler), hence these functions are safe to call from
				915	any context.
				916	</para>
				917	<para>
				918	Unfortunately, <function>cache_add</function> calls
				919	<function>kmalloc</function> with the <symbol>GFP_KERNEL</symbol>
				920	flag, which is only legal in user context. I have assumed that
				921	<function>cache_add</function> is still only called in user context,
				922	otherwise this should become a parameter to
				923	<function>cache_add</function>.
				924	</para>
				925	</sect1>
				926	<sect1 id="examples-refcnt">
				927	<title>Exposing Objects Outside This File</title>
				928	<para>
				929	If our objects contained more information, it might not be sufficient
				930	to copy the information in and out: other parts of the code might want
				931	to keep pointers to these objects, for example, rather than looking up
				932	the id every time. This produces two problems.
				933	</para>
				934	<para>
				935	The first problem is that we use the <symbol>cache_lock</symbol> to
				936	protect objects: we'd need to make this non-static so the rest of the
				937	code can use it. This makes locking trickier, as it is no longer all
				938	in one place.
				939	</para>
				940	<para>
				941	The second problem is the lifetime problem: if another structure keeps
				942	a pointer to an object, it presumably expects that pointer to remain
				943	valid. Unfortunately, this is only guaranteed while you hold the
				944	lock, otherwise someone might call <function>cache_delete</function>
				945	and even worse, add another object, re-using the same address.
				946	</para>
				947	<para>
				948	As there is only one lock, you can't hold it forever: no-one else would
				949	get any work done.
				950	</para>
				951	<para>
				952	The solution to this problem is to use a reference count: everyone who
				953	has a pointer to the object increases it when they first get the
				954	object, and drops the reference count when they're finished with it.
				955	Whoever drops it to zero knows it is unused, and can actually delete it.
				956	</para>
				957	<para>
				958	Here is the code:
				959	</para>
				960
				961	<programlisting>
				962	--- cache.c.interrupt 2003-12-09 14:25:43.000000000 +1100
				963	+++ cache.c.refcnt 2003-12-09 14:33:05.000000000 +1100
				964	@@ -7,6 +7,7 @@
				965	struct object
				966	{
				967	struct list_head list;
				968	+ unsigned int refcnt;
				969	int id;
				970	char name[32];
				971	int popularity;
				972	@@ -17,6 +18,35 @@
				973	static unsigned int cache_num = 0;
				974	#define MAX_CACHE_SIZE 10
				975
				976	+static void __object_put(struct object *obj)
				977	+{
				978	+ if (--obj->refcnt == 0)
				979	+ kfree(obj);
				980	+}
				981	+
				982	+static void __object_get(struct object *obj)
				983	+{
				984	+ obj->refcnt++;
				985	+}
				986	+
				987	+void object_put(struct object *obj)
				988	+{
				989	+ unsigned long flags;
				990	+
				991	+ spin_lock_irqsave(&cache_lock, flags);
				992	+ __object_put(obj);
				993	+ spin_unlock_irqrestore(&cache_lock, flags);
				994	+}
				995	+
				996	+void object_get(struct object *obj)
				997	+{
				998	+ unsigned long flags;
				999	+
				1000	+ spin_lock_irqsave(&cache_lock, flags);
				1001	+ __object_get(obj);
				1002	+ spin_unlock_irqrestore(&cache_lock, flags);
				1003	+}
				1004	+
				1005	/* Must be holding cache_lock */
				1006	static struct object *__cache_find(int id)
				1007	{
				1008	@@ -35,6 +65,7 @@
				1009	{
				1010	BUG_ON(!obj);
				1011	list_del(&obj->list);
				1012	+ __object_put(obj);
				1013	cache_num--;
				1014	}
				1015
				1016	@@ -63,6 +94,7 @@
				1017	strlcpy(obj->name, name, sizeof(obj->name));
				1018	obj->id = id;
				1019	obj->popularity = 0;
				1020	+ obj->refcnt = 1; /* The cache holds a reference */
				1021
				1022	spin_lock_irqsave(&cache_lock, flags);
				1023	__cache_add(obj);
				1024	@@ -79,18 +111,15 @@
				1025	spin_unlock_irqrestore(&cache_lock, flags);
				1026	}
				1027
				1028	-int cache_find(int id, char *name)
				1029	+struct object *cache_find(int id)
				1030	{
				1031	struct object *obj;
				1032	- int ret = -ENOENT;
				1033	unsigned long flags;
				1034
				1035	spin_lock_irqsave(&cache_lock, flags);
				1036	obj = __cache_find(id);
				1037	- if (obj) {
				1038	- ret = 0;
				1039	- strcpy(name, obj->name);
				1040	- }
				1041	+ if (obj)
				1042	+ __object_get(obj);
				1043	spin_unlock_irqrestore(&cache_lock, flags);
				1044	- return ret;
				1045	+ return obj;
				1046	}
				1047	</programlisting>
				1048
				1049	<para>
				1050	We encapsulate the reference counting in the standard 'get' and 'put'
				1051	functions. Now we can return the object itself from
				1052	<function>cache_find</function> which has the advantage that the user
				1053	can now sleep holding the object (eg. to
				1054	<function>copy_to_user</function> to name to userspace).
				1055	</para>
				1056	<para>
				1057	The other point to note is that I said a reference should be held for
				1058	every pointer to the object: thus the reference count is 1 when first
				1059	inserted into the cache. In some versions the framework does not hold
				1060	a reference count, but they are more complicated.
				1061	</para>
				1062
				1063	<sect2 id="examples-refcnt-atomic">
				1064	<title>Using Atomic Operations For The Reference Count</title>
				1065	<para>
				1066	In practice, <type>atomic_t</type> would usually be used for
				1067	<structfield>refcnt</structfield>. There are a number of atomic
				1068	operations defined in
				1069
				1070	<filename class="headerfile">include/asm/atomic.h</filename>: these are
				1071	guaranteed to be seen atomically from all CPUs in the system, so no
				1072	lock is required. In this case, it is simpler than using spinlocks,
				1073	although for anything non-trivial using spinlocks is clearer. The
				1074	<function>atomic_inc</function> and
				1075	<function>atomic_dec_and_test</function> are used instead of the
				1076	standard increment and decrement operators, and the lock is no longer
				1077	used to protect the reference count itself.
				1078	</para>
				1079
				1080	<programlisting>
				1081	--- cache.c.refcnt 2003-12-09 15:00:35.000000000 +1100
				1082	+++ cache.c.refcnt-atomic 2003-12-11 15:49:42.000000000 +1100
				1083	@@ -7,7 +7,7 @@
				1084	struct object
				1085	{
				1086	struct list_head list;
				1087	- unsigned int refcnt;
				1088	+ atomic_t refcnt;
				1089	int id;
				1090	char name[32];
				1091	int popularity;
				1092	@@ -18,33 +18,15 @@
				1093	static unsigned int cache_num = 0;
				1094	#define MAX_CACHE_SIZE 10
				1095
				1096	-static void __object_put(struct object *obj)
				1097	-{
				1098	- if (--obj->refcnt == 0)
				1099	- kfree(obj);
				1100	-}
				1101	-
				1102	-static void __object_get(struct object *obj)
				1103	-{
				1104	- obj->refcnt++;
				1105	-}
				1106	-
				1107	void object_put(struct object *obj)
				1108	{
				1109	- unsigned long flags;
				1110	-
				1111	- spin_lock_irqsave(&cache_lock, flags);
				1112	- __object_put(obj);
				1113	- spin_unlock_irqrestore(&cache_lock, flags);
				1114	+ if (atomic_dec_and_test(&obj->refcnt))
				1115	+ kfree(obj);
				1116	}
				1117
				1118	void object_get(struct object *obj)
				1119	{
				1120	- unsigned long flags;
				1121	-
				1122	- spin_lock_irqsave(&cache_lock, flags);
				1123	- __object_get(obj);
				1124	- spin_unlock_irqrestore(&cache_lock, flags);
				1125	+ atomic_inc(&obj->refcnt);
				1126	}
				1127
				1128	/* Must be holding cache_lock */
				1129	@@ -65,7 +47,7 @@
				1130	{
				1131	BUG_ON(!obj);
				1132	list_del(&obj->list);
				1133	- __object_put(obj);
				1134	+ object_put(obj);
				1135	cache_num--;
				1136	}
				1137
				1138	@@ -94,7 +76,7 @@
				1139	strlcpy(obj->name, name, sizeof(obj->name));
				1140	obj->id = id;
				1141	obj->popularity = 0;
				1142	- obj->refcnt = 1; /* The cache holds a reference */
				1143	+ atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
				1144
				1145	spin_lock_irqsave(&cache_lock, flags);
				1146	__cache_add(obj);
				1147	@@ -119,7 +101,7 @@
				1148	spin_lock_irqsave(&cache_lock, flags);
				1149	obj = __cache_find(id);
				1150	if (obj)
				1151	- __object_get(obj);
				1152	+ object_get(obj);
				1153	spin_unlock_irqrestore(&cache_lock, flags);
				1154	return obj;
				1155	}
				1156	</programlisting>
				1157	</sect2>
				1158	</sect1>
				1159
				1160	<sect1 id="examples-lock-per-obj">
				1161	<title>Protecting The Objects Themselves</title>
				1162	<para>
				1163	In these examples, we assumed that the objects (except the reference
				1164	counts) never changed once they are created. If we wanted to allow
				1165	the name to change, there are three possibilities:
				1166	</para>
				1167	<itemizedlist>
				1168	<listitem>
				1169	<para>
				1170	You can make <symbol>cache_lock</symbol> non-static, and tell people
				1171	to grab that lock before changing the name in any object.
				1172	</para>
				1173	</listitem>
				1174	<listitem>
				1175	<para>
				1176	You can provide a <function>cache_obj_rename</function> which grabs
				1177	this lock and changes the name for the caller, and tell everyone to
				1178	use that function.
				1179	</para>
				1180	</listitem>
				1181	<listitem>
				1182	<para>
				1183	You can make the <symbol>cache_lock</symbol> protect only the cache
				1184	itself, and use another lock to protect the name.
				1185	</para>
				1186	</listitem>
				1187	</itemizedlist>
				1188
				1189	<para>
				1190	Theoretically, you can make the locks as fine-grained as one lock for
				1191	every field, for every object. In practice, the most common variants
				1192	are:
				1193	</para>
				1194	<itemizedlist>
				1195	<listitem>
				1196	<para>
				1197	One lock which protects the infrastructure (the <symbol>cache</symbol>
				1198	list in this example) and all the objects. This is what we have done
				1199	so far.
				1200	</para>
				1201	</listitem>
				1202	<listitem>
				1203	<para>
				1204	One lock which protects the infrastructure (including the list
				1205	pointers inside the objects), and one lock inside the object which
				1206	protects the rest of that object.
				1207	</para>
				1208	</listitem>
				1209	<listitem>
				1210	<para>
				1211	Multiple locks to protect the infrastructure (eg. one lock per hash
				1212	chain), possibly with a separate per-object lock.
				1213	</para>
				1214	</listitem>
				1215	</itemizedlist>
				1216
				1217	<para>
				1218	Here is the "lock-per-object" implementation:
				1219	</para>
				1220	<programlisting>
				1221	--- cache.c.refcnt-atomic 2003-12-11 15:50:54.000000000 +1100
				1222	+++ cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100
				1223	@@ -6,11 +6,17 @@
				1224
				1225	struct object
				1226	{
				1227	+ /* These two protected by cache_lock. */
				1228	struct list_head list;
				1229	+ int popularity;
				1230	+
				1231	atomic_t refcnt;
				1232	+
				1233	+ /* Doesn't change once created. */
				1234	int id;
				1235	+
				1236	+ spinlock_t lock; /* Protects the name */
				1237	char name[32];
				1238	- int popularity;
				1239	};
				1240
				1241	static spinlock_t cache_lock = SPIN_LOCK_UNLOCKED;
				1242	@@ -77,6 +84,7 @@
				1243	obj->id = id;
				1244	obj->popularity = 0;
				1245	atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
				1246	+ spin_lock_init(&obj->lock);
				1247
				1248	spin_lock_irqsave(&cache_lock, flags);
				1249	__cache_add(obj);
				1250	</programlisting>
				1251
				1252	<para>
				1253	Note that I decide that the <structfield>popularity</structfield>
				1254	count should be protected by the <symbol>cache_lock</symbol> rather
				1255	than the per-object lock: this is because it (like the
				1256	<structname>struct list_head</structname> inside the object) is
				1257	logically part of the infrastructure. This way, I don't need to grab
				1258	the lock of every object in <function>__cache_add</function> when
				1259	seeking the least popular.
				1260	</para>
				1261
				1262	<para>
				1263	I also decided that the <structfield>id</structfield> member is
				1264	unchangeable, so I don't need to grab each object lock in
				1265	<function>__cache_find()</function> to examine the
				1266	<structfield>id</structfield>: the object lock is only used by a
				1267	caller who wants to read or write the <structfield>name</structfield>
				1268	field.
				1269	</para>
				1270
				1271	<para>
				1272	Note also that I added a comment describing what data was protected by
				1273	which locks. This is extremely important, as it describes the runtime
				1274	behavior of the code, and can be hard to gain from just reading. And
				1275	as Alan Cox says, <quote>Lock data, not code</quote>.
				1276	</para>
				1277	</sect1>
				1278	</chapter>
				1279
				1280	<chapter id="common-problems">
				1281	<title>Common Problems</title>
				1282	<sect1 id="deadlock">
				1283	<title>Deadlock: Simple and Advanced</title>
				1284
				1285	<para>
				1286	There is a coding bug where a piece of code tries to grab a
				1287	spinlock twice: it will spin forever, waiting for the lock to
				1288	be released (spinlocks, rwlocks and semaphores are not
				1289	recursive in Linux). This is trivial to diagnose: not a
				1290	stay-up-five-nights-talk-to-fluffy-code-bunnies kind of
				1291	problem.
				1292	</para>
				1293
				1294	<para>
				1295	For a slightly more complex case, imagine you have a region
				1296	shared by a softirq and user context. If you use a
				1297	<function>spin_lock()</function> call to protect it, it is
				1298	possible that the user context will be interrupted by the softirq
				1299	while it holds the lock, and the softirq will then spin
				1300	forever trying to get the same lock.
				1301	</para>
				1302
				1303	<para>
				1304	Both of these are called deadlock, and as shown above, it can
				1305	occur even with a single CPU (although not on UP compiles,
				1306	since spinlocks vanish on kernel compiles with
				1307	<symbol>CONFIG_SMP</symbol>=n. You'll still get data corruption
				1308	in the second example).
				1309	</para>
				1310
				1311	<para>
				1312	This complete lockup is easy to diagnose: on SMP boxes the
				1313	watchdog timer or compiling with <symbol>DEBUG_SPINLOCKS</symbol> set
				1314	(<filename>include/linux/spinlock.h</filename>) will show this up
				1315	immediately when it happens.
				1316	</para>
				1317
				1318	<para>
				1319	A more complex problem is the so-called 'deadly embrace',
				1320	involving two or more locks. Say you have a hash table: each
				1321	entry in the table is a spinlock, and a chain of hashed
				1322	objects. Inside a softirq handler, you sometimes want to
				1323	alter an object from one place in the hash to another: you
				1324	grab the spinlock of the old hash chain and the spinlock of
				1325	the new hash chain, and delete the object from the old one,
				1326	and insert it in the new one.
				1327	</para>
				1328
				1329	<para>
				1330	There are two problems here. First, if your code ever
				1331	tries to move the object to the same chain, it will deadlock
				1332	with itself as it tries to lock it twice. Secondly, if the
				1333	same softirq on another CPU is trying to move another object
				1334	in the reverse direction, the following could happen:
				1335	</para>
				1336
				1337	<table>
				1338	<title>Consequences</title>
				1339
				1340	<tgroup cols="2" align="left">
				1341
				1342	<thead>
				1343	<row>
				1344	<entry>CPU 1</entry>
				1345	<entry>CPU 2</entry>
				1346	</row>
				1347	</thead>
				1348
				1349	<tbody>
				1350	<row>
				1351	<entry>Grab lock A -> OK</entry>
				1352	<entry>Grab lock B -> OK</entry>
				1353	</row>
				1354	<row>
				1355	<entry>Grab lock B -> spin</entry>
				1356	<entry>Grab lock A -> spin</entry>
				1357	</row>
				1358	</tbody>
				1359	</tgroup>
				1360	</table>
				1361
				1362	<para>
				1363	The two CPUs will spin forever, waiting for the other to give up
				1364	their lock. It will look, smell, and feel like a crash.
				1365	</para>
				1366	</sect1>
				1367
				1368	<sect1 id="techs-deadlock-prevent">
				1369	<title>Preventing Deadlock</title>
				1370
				1371	<para>
				1372	Textbooks will tell you that if you always lock in the same
				1373	order, you will never get this kind of deadlock. Practice
				1374	will tell you that this approach doesn't scale: when I
				1375	create a new lock, I don't understand enough of the kernel
				1376	to figure out where in the 5000 lock hierarchy it will fit.
				1377	</para>
				1378
				1379	<para>
				1380	The best locks are encapsulated: they never get exposed in
				1381	headers, and are never held around calls to non-trivial
				1382	functions outside the same file. You can read through this
				1383	code and see that it will never deadlock, because it never
				1384	tries to grab another lock while it has that one. People
				1385	using your code don't even need to know you are using a
				1386	lock.
				1387	</para>
				1388
				1389	<para>
				1390	A classic problem here is when you provide callbacks or
				1391	hooks: if you call these with the lock held, you risk simple
				1392	deadlock, or a deadly embrace (who knows what the callback
				1393	will do?). Remember, the other programmers are out to get
				1394	you, so don't do this.
				1395	</para>
				1396
				1397	<sect2 id="techs-deadlock-overprevent">
				1398	<title>Overzealous Prevention Of Deadlocks</title>
				1399
				1400	<para>
				1401	Deadlocks are problematic, but not as bad as data
				1402	corruption. Code which grabs a read lock, searches a list,
				1403	fails to find what it wants, drops the read lock, grabs a
				1404	write lock and inserts the object has a race condition.
				1405	</para>
				1406
				1407	<para>
				1408	If you don't see why, please stay the fuck away from my code.
				1409	</para>
				1410	</sect2>
				1411	</sect1>
				1412
				1413	<sect1 id="racing-timers">
				1414	<title>Racing Timers: A Kernel Pastime</title>
				1415
				1416	<para>
				1417	Timers can produce their own special problems with races.
				1418	Consider a collection of objects (list, hash, etc) where each
				1419	object has a timer which is due to destroy it.
				1420	</para>
				1421
				1422	<para>
				1423	If you want to destroy the entire collection (say on module
				1424	removal), you might do the following:
				1425	</para>
				1426
				1427	<programlisting>
				1428	/* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE
				1429	HUNGARIAN NOTATION */
				1430	spin_lock_bh(&list_lock);
				1431
				1432	while (list) {
				1433	struct foo *next = list->next;
				1434	del_timer(&list->timer);
				1435	kfree(list);
				1436	list = next;
				1437	}
				1438
				1439	spin_unlock_bh(&list_lock);
				1440	</programlisting>
				1441
				1442	<para>
				1443	Sooner or later, this will crash on SMP, because a timer can
				1444	have just gone off before the <function>spin_lock_bh()</function>,
				1445	and it will only get the lock after we
				1446	<function>spin_unlock_bh()</function>, and then try to free
				1447	the element (which has already been freed!).
				1448	</para>
				1449
				1450	<para>
				1451	This can be avoided by checking the result of
				1452	<function>del_timer()</function>: if it returns
				1453	<returnvalue>1</returnvalue>, the timer has been deleted.
				1454	If <returnvalue>0</returnvalue>, it means (in this
				1455	case) that it is currently running, so we can do:
				1456	</para>
				1457
				1458	<programlisting>
				1459	retry:
				1460	spin_lock_bh(&list_lock);
				1461
				1462	while (list) {
				1463	struct foo *next = list->next;
				1464	if (!del_timer(&list->timer)) {
				1465	/* Give timer a chance to delete this */
				1466	spin_unlock_bh(&list_lock);
				1467	goto retry;
				1468	}
				1469	kfree(list);
				1470	list = next;
				1471	}
				1472
				1473	spin_unlock_bh(&list_lock);
				1474	</programlisting>
				1475
				1476	<para>
				1477	Another common problem is deleting timers which restart
				1478	themselves (by calling <function>add_timer()</function> at the end
				1479	of their timer function). Because this is a fairly common case
				1480	which is prone to races, you should use <function>del_timer_sync()</function>
				1481	(<filename class="headerfile">include/linux/timer.h</filename>)
				1482	to handle this case. It returns the number of times the timer
				1483	had to be deleted before we finally stopped it from adding itself back
				1484	in.
				1485	</para>
				1486	</sect1>
				1487
				1488	</chapter>
				1489
				1490	<chapter id="Efficiency">
				1491	<title>Locking Speed</title>
				1492
				1493	<para>
				1494	There are three main things to worry about when considering speed of
				1495	some code which does locking. First is concurrency: how many things
				1496	are going to be waiting while someone else is holding a lock. Second
				1497	is the time taken to actually acquire and release an uncontended lock.
				1498	Third is using fewer, or smarter locks. I'm assuming that the lock is
				1499	used fairly often: otherwise, you wouldn't be concerned about
				1500	efficiency.
				1501	</para>
				1502	<para>
				1503	Concurrency depends on how long the lock is usually held: you should
				1504	hold the lock for as long as needed, but no longer. In the cache
				1505	example, we always create the object without the lock held, and then
				1506	grab the lock only when we are ready to insert it in the list.
				1507	</para>
				1508	<para>
				1509	Acquisition times depend on how much damage the lock operations do to
				1510	the pipeline (pipeline stalls) and how likely it is that this CPU was
				1511	the last one to grab the lock (ie. is the lock cache-hot for this
				1512	CPU): on a machine with more CPUs, this likelihood drops fast.
				1513	Consider a 700MHz Intel Pentium III: an instruction takes about 0.7ns,
				1514	an atomic increment takes about 58ns, a lock which is cache-hot on
				1515	this CPU takes 160ns, and a cacheline transfer from another CPU takes
				1516	an additional 170 to 360ns. (These figures from Paul McKenney's
				1517	<ulink url="http://www.linuxjournal.com/article.php?sid=6993"> Linux
				1518	Journal RCU article</ulink>).
				1519	</para>
				1520	<para>
				1521	These two aims conflict: holding a lock for a short time might be done
				1522	by splitting locks into parts (such as in our final per-object-lock
				1523	example), but this increases the number of lock acquisitions, and the
				1524	results are often slower than having a single lock. This is another
				1525	reason to advocate locking simplicity.
				1526	</para>
				1527	<para>
				1528	The third concern is addressed below: there are some methods to reduce
				1529	the amount of locking which needs to be done.
				1530	</para>
				1531
				1532	<sect1 id="efficiency-rwlocks">
				1533	<title>Read/Write Lock Variants</title>
				1534
				1535	<para>
				1536	Both spinlocks and semaphores have read/write variants:
				1537	<type>rwlock_t</type> and <structname>struct rw_semaphore</structname>.
				1538	These divide users into two classes: the readers and the writers. If
				1539	you are only reading the data, you can get a read lock, but to write to
				1540	the data you need the write lock. Many people can hold a read lock,
				1541	but a writer must be sole holder.
				1542	</para>
				1543
				1544	<para>
				1545	If your code divides neatly along reader/writer lines (as our
				1546	cache code does), and the lock is held by readers for
				1547	significant lengths of time, using these locks can help. They
				1548	are slightly slower than the normal locks though, so in practice
				1549	<type>rwlock_t</type> is not usually worthwhile.
				1550	</para>
				1551	</sect1>
				1552
				1553	<sect1 id="efficiency-read-copy-update">
				1554	<title>Avoiding Locks: Read Copy Update</title>
				1555
				1556	<para>
				1557	There is a special method of read/write locking called Read Copy
				1558	Update. Using RCU, the readers can avoid taking a lock
				1559	altogether: as we expect our cache to be read more often than
				1560	updated (otherwise the cache is a waste of time), it is a
				1561	candidate for this optimization.
				1562	</para>
				1563
				1564	<para>
				1565	How do we get rid of read locks? Getting rid of read locks
				1566	means that writers may be changing the list underneath the
				1567	readers. That is actually quite simple: we can read a linked
				1568	list while an element is being added if the writer adds the
				1569	element very carefully. For example, adding
				1570	<symbol>new</symbol> to a single linked list called
				1571	<symbol>list</symbol>:
				1572	</para>
				1573
				1574	<programlisting>
				1575	new->next = list->next;
				1576	wmb();
				1577	list->next = new;
				1578	</programlisting>
				1579
				1580	<para>
				1581	The <function>wmb()</function> is a write memory barrier. It
				1582	ensures that the first operation (setting the new element's
				1583	<symbol>next</symbol> pointer) is complete and will be seen by
				1584	all CPUs, before the second operation is (putting the new
				1585	element into the list). This is important, since modern
				1586	compilers and modern CPUs can both reorder instructions unless
				1587	told otherwise: we want a reader to either not see the new
				1588	element at all, or see the new element with the
				1589	<symbol>next</symbol> pointer correctly pointing at the rest of
				1590	the list.
				1591	</para>
				1592	<para>
				1593	Fortunately, there is a function to do this for standard
				1594	<structname>struct list_head</structname> lists:
				1595	<function>list_add_rcu()</function>
				1596	(<filename>include/linux/list.h</filename>).
				1597	</para>
				1598	<para>
				1599	Removing an element from the list is even simpler: we replace
				1600	the pointer to the old element with a pointer to its successor,
				1601	and readers will either see it, or skip over it.
				1602	</para>
				1603	<programlisting>
				1604	list->next = old->next;
				1605	</programlisting>
				1606	<para>
				1607	There is <function>list_del_rcu()</function>
				1608	(<filename>include/linux/list.h</filename>) which does this (the
				1609	normal version poisons the old object, which we don't want).
				1610	</para>
				1611	<para>
				1612	The reader must also be careful: some CPUs can look through the
				1613	<symbol>next</symbol> pointer to start reading the contents of
				1614	the next element early, but don't realize that the pre-fetched
				1615	contents is wrong when the <symbol>next</symbol> pointer changes
				1616	underneath them. Once again, there is a
				1617	<function>list_for_each_entry_rcu()</function>
				1618	(<filename>include/linux/list.h</filename>) to help you. Of
				1619	course, writers can just use
				1620	<function>list_for_each_entry()</function>, since there cannot
				1621	be two simultaneous writers.
				1622	</para>
				1623	<para>
				1624	Our final dilemma is this: when can we actually destroy the
				1625	removed element? Remember, a reader might be stepping through
olecom@mail.ru	2e2d0dc	2006-06-26 19:05:40 +0200	[diff] [blame]	1626	this element in the list right now: if we free this element and
Linus Torvalds	1da177e	2005-04-16 15:20:36 -0700	[diff] [blame]	1627	the <symbol>next</symbol> pointer changes, the reader will jump
				1628	off into garbage and crash. We need to wait until we know that
				1629	all the readers who were traversing the list when we deleted the
				1630	element are finished. We use <function>call_rcu()</function> to
				1631	register a callback which will actually destroy the object once
				1632	the readers are finished.
				1633	</para>
				1634	<para>
				1635	But how does Read Copy Update know when the readers are
				1636	finished? The method is this: firstly, the readers always
				1637	traverse the list inside
				1638	<function>rcu_read_lock()</function>/<function>rcu_read_unlock()</function>
				1639	pairs: these simply disable preemption so the reader won't go to
				1640	sleep while reading the list.
				1641	</para>
				1642	<para>
				1643	RCU then waits until every other CPU has slept at least once:
				1644	since readers cannot sleep, we know that any readers which were
				1645	traversing the list during the deletion are finished, and the
				1646	callback is triggered. The real Read Copy Update code is a
				1647	little more optimized than this, but this is the fundamental
				1648	idea.
				1649	</para>
				1650
				1651	<programlisting>
				1652	--- cache.c.perobjectlock 2003-12-11 17:15:03.000000000 +1100
				1653	+++ cache.c.rcupdate 2003-12-11 17:55:14.000000000 +1100
				1654	@@ -1,15 +1,18 @@
				1655	#include <linux/list.h>
				1656	#include <linux/slab.h>
				1657	#include <linux/string.h>
				1658	+#include <linux/rcupdate.h>
				1659	#include <asm/semaphore.h>
				1660	#include <asm/errno.h>
				1661
				1662	struct object
				1663	{
				1664	- /* These two protected by cache_lock. */
				1665	+ /* This is protected by RCU */
				1666	struct list_head list;
				1667	int popularity;
				1668
				1669	+ struct rcu_head rcu;
				1670	+
				1671	atomic_t refcnt;
				1672
				1673	/* Doesn't change once created. */
				1674	@@ -40,7 +43,7 @@
				1675	{
				1676	struct object *i;
				1677
				1678	- list_for_each_entry(i, &cache, list) {
				1679	+ list_for_each_entry_rcu(i, &cache, list) {
				1680	if (i->id == id) {
				1681	i->popularity++;
				1682	return i;
				1683	@@ -49,19 +52,25 @@
				1684	return NULL;
				1685	}
				1686
				1687	+/* Final discard done once we know no readers are looking. */
				1688	+static void cache_delete_rcu(void *arg)
				1689	+{
				1690	+ object_put(arg);
				1691	+}
				1692	+
				1693	/* Must be holding cache_lock */
				1694	static void __cache_delete(struct object *obj)
				1695	{
				1696	BUG_ON(!obj);
				1697	- list_del(&obj->list);
				1698	- object_put(obj);
				1699	+ list_del_rcu(&obj->list);
				1700	cache_num--;
				1701	+ call_rcu(&obj->rcu, cache_delete_rcu, obj);
				1702	}
				1703
				1704	/* Must be holding cache_lock */
				1705	static void __cache_add(struct object *obj)
				1706	{
				1707	- list_add(&obj->list, &cache);
				1708	+ list_add_rcu(&obj->list, &cache);
				1709	if (++cache_num > MAX_CACHE_SIZE) {
				1710	struct object i, outcast = NULL;
				1711	list_for_each_entry(i, &cache, list) {
				1712	@@ -85,6 +94,7 @@
				1713	obj->popularity = 0;
				1714	atomic_set(&obj->refcnt, 1); /* The cache holds a reference */
				1715	spin_lock_init(&obj->lock);
				1716	+ INIT_RCU_HEAD(&obj->rcu);
				1717
				1718	spin_lock_irqsave(&cache_lock, flags);
				1719	__cache_add(obj);
				1720	@@ -104,12 +114,11 @@
				1721	struct object *cache_find(int id)
				1722	{
				1723	struct object *obj;
				1724	- unsigned long flags;
				1725
				1726	- spin_lock_irqsave(&cache_lock, flags);
				1727	+ rcu_read_lock();
				1728	obj = __cache_find(id);
				1729	if (obj)
				1730	object_get(obj);
				1731	- spin_unlock_irqrestore(&cache_lock, flags);
				1732	+ rcu_read_unlock();
				1733	return obj;
				1734	}
				1735	</programlisting>
				1736
				1737	<para>
				1738	Note that the reader will alter the
				1739	<structfield>popularity</structfield> member in
				1740	<function>__cache_find()</function>, and now it doesn't hold a lock.
				1741	One solution would be to make it an <type>atomic_t</type>, but for
				1742	this usage, we don't really care about races: an approximate result is
				1743	good enough, so I didn't change it.
				1744	</para>
				1745
				1746	<para>
				1747	The result is that <function>cache_find()</function> requires no
				1748	synchronization with any other functions, so is almost as fast on SMP
				1749	as it would be on UP.
				1750	</para>
				1751
				1752	<para>
				1753	There is a furthur optimization possible here: remember our original
				1754	cache code, where there were no reference counts and the caller simply
				1755	held the lock whenever using the object? This is still possible: if
				1756	you hold the lock, noone can delete the object, so you don't need to
				1757	get and put the reference count.
				1758	</para>
				1759
				1760	<para>
				1761	Now, because the 'read lock' in RCU is simply disabling preemption, a
				1762	caller which always has preemption disabled between calling
				1763	<function>cache_find()</function> and
				1764	<function>object_put()</function> does not need to actually get and
				1765	put the reference count: we could expose
				1766	<function>__cache_find()</function> by making it non-static, and
				1767	such callers could simply call that.
				1768	</para>
				1769	<para>
				1770	The benefit here is that the reference count is not written to: the
				1771	object is not altered in any way, which is much faster on SMP
				1772	machines due to caching.
				1773	</para>
				1774	</sect1>
				1775
				1776	<sect1 id="per-cpu">
				1777	<title>Per-CPU Data</title>
				1778
				1779	<para>
				1780	Another technique for avoiding locking which is used fairly
				1781	widely is to duplicate information for each CPU. For example,
				1782	if you wanted to keep a count of a common condition, you could
				1783	use a spin lock and a single counter. Nice and simple.
				1784	</para>
				1785
				1786	<para>
				1787	If that was too slow (it's usually not, but if you've got a
				1788	really big machine to test on and can show that it is), you
				1789	could instead use a counter for each CPU, then none of them need
				1790	an exclusive lock. See <function>DEFINE_PER_CPU()</function>,
				1791	<function>get_cpu_var()</function> and
				1792	<function>put_cpu_var()</function>
				1793	(<filename class="headerfile">include/linux/percpu.h</filename>).
				1794	</para>
				1795
				1796	<para>
				1797	Of particular use for simple per-cpu counters is the
				1798	<type>local_t</type> type, and the
				1799	<function>cpu_local_inc()</function> and related functions,
				1800	which are more efficient than simple code on some architectures
				1801	(<filename class="headerfile">include/asm/local.h</filename>).
				1802	</para>
				1803
				1804	<para>
				1805	Note that there is no simple, reliable way of getting an exact
				1806	value of such a counter, without introducing more locks. This
				1807	is not a problem for some uses.
				1808	</para>
				1809	</sect1>
				1810
				1811	<sect1 id="mostly-hardirq">
				1812	<title>Data Which Mostly Used By An IRQ Handler</title>
				1813
				1814	<para>
				1815	If data is always accessed from within the same IRQ handler, you
				1816	don't need a lock at all: the kernel already guarantees that the
				1817	irq handler will not run simultaneously on multiple CPUs.
				1818	</para>
				1819	<para>
				1820	Manfred Spraul points out that you can still do this, even if
				1821	the data is very occasionally accessed in user context or
				1822	softirqs/tasklets. The irq handler doesn't use a lock, and
				1823	all other accesses are done as so:
				1824	</para>
				1825
				1826	<programlisting>
				1827	spin_lock(&lock);
				1828	disable_irq(irq);
				1829	...
				1830	enable_irq(irq);
				1831	spin_unlock(&lock);
				1832	</programlisting>
				1833	<para>
				1834	The <function>disable_irq()</function> prevents the irq handler
				1835	from running (and waits for it to finish if it's currently
				1836	running on other CPUs). The spinlock prevents any other
				1837	accesses happening at the same time. Naturally, this is slower
				1838	than just a <function>spin_lock_irq()</function> call, so it
				1839	only makes sense if this type of access happens extremely
				1840	rarely.
				1841	</para>
				1842	</sect1>
				1843	</chapter>
				1844
				1845	<chapter id="sleeping-things">
				1846	<title>What Functions Are Safe To Call From Interrupts?</title>
				1847
				1848	<para>
				1849	Many functions in the kernel sleep (ie. call schedule())
				1850	directly or indirectly: you can never call them while holding a
				1851	spinlock, or with preemption disabled. This also means you need
				1852	to be in user context: calling them from an interrupt is illegal.
				1853	</para>
				1854
				1855	<sect1 id="sleeping">
				1856	<title>Some Functions Which Sleep</title>
				1857
				1858	<para>
				1859	The most common ones are listed below, but you usually have to
				1860	read the code to find out if other calls are safe. If everyone
				1861	else who calls it can sleep, you probably need to be able to
				1862	sleep, too. In particular, registration and deregistration
				1863	functions usually expect to be called from user context, and can
				1864	sleep.
				1865	</para>
				1866
				1867	<itemizedlist>
				1868	<listitem>
				1869	<para>
				1870	Accesses to
				1871	<firstterm linkend="gloss-userspace">userspace</firstterm>:
				1872	</para>
				1873	<itemizedlist>
				1874	<listitem>
				1875	<para>
				1876	<function>copy_from_user()</function>
				1877	</para>
				1878	</listitem>
				1879	<listitem>
				1880	<para>
				1881	<function>copy_to_user()</function>
				1882	</para>
				1883	</listitem>
				1884	<listitem>
				1885	<para>
				1886	<function>get_user()</function>
				1887	</para>
				1888	</listitem>
				1889	<listitem>
				1890	<para>
				1891	<function> put_user()</function>
				1892	</para>
				1893	</listitem>
				1894	</itemizedlist>
				1895	</listitem>
				1896
				1897	<listitem>
				1898	<para>
				1899	<function>kmalloc(GFP_KERNEL)</function>
				1900	</para>
				1901	</listitem>
				1902
				1903	<listitem>
				1904	<para>
				1905	<function>down_interruptible()</function> and
				1906	<function>down()</function>
				1907	</para>
				1908	<para>
				1909	There is a <function>down_trylock()</function> which can be
				1910	used inside interrupt context, as it will not sleep.
				1911	<function>up()</function> will also never sleep.
				1912	</para>
				1913	</listitem>
				1914	</itemizedlist>
				1915	</sect1>
				1916
				1917	<sect1 id="dont-sleep">
				1918	<title>Some Functions Which Don't Sleep</title>
				1919
				1920	<para>
				1921	Some functions are safe to call from any context, or holding
				1922	almost any lock.
				1923	</para>
				1924
				1925	<itemizedlist>
				1926	<listitem>
				1927	<para>
				1928	<function>printk()</function>
				1929	</para>
				1930	</listitem>
				1931	<listitem>
				1932	<para>
				1933	<function>kfree()</function>
				1934	</para>
				1935	</listitem>
				1936	<listitem>
				1937	<para>
				1938	<function>add_timer()</function> and <function>del_timer()</function>
				1939	</para>
				1940	</listitem>
				1941	</itemizedlist>
				1942	</sect1>
				1943	</chapter>
				1944
				1945	<chapter id="references">
				1946	<title>Further reading</title>
				1947
				1948	<itemizedlist>
				1949	<listitem>
				1950	<para>
				1951	<filename>Documentation/spinlocks.txt</filename>:
				1952	Linus Torvalds' spinlocking tutorial in the kernel sources.
				1953	</para>
				1954	</listitem>
				1955
				1956	<listitem>
				1957	<para>
				1958	Unix Systems for Modern Architectures: Symmetric
				1959	Multiprocessing and Caching for Kernel Programmers:
				1960	</para>
				1961
				1962	<para>
				1963	Curt Schimmel's very good introduction to kernel level
				1964	locking (not written for Linux, but nearly everything
				1965	applies). The book is expensive, but really worth every
				1966	penny to understand SMP locking. [ISBN: 0201633388]
				1967	</para>
				1968	</listitem>
				1969	</itemizedlist>
				1970	</chapter>
				1971
				1972	<chapter id="thanks">
				1973	<title>Thanks</title>
				1974
				1975	<para>
				1976	Thanks to Telsa Gwynne for DocBooking, neatening and adding
				1977	style.
				1978	</para>
				1979
				1980	<para>
				1981	Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul
				1982	Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim
				1983	Waugh, Pete Zaitcev, James Morris, Robert Love, Paul McKenney,
				1984	John Ashby for proofreading, correcting, flaming, commenting.
				1985	</para>
				1986
				1987	<para>
				1988	Thanks to the cabal for having no influence on this document.
				1989	</para>
				1990	</chapter>
				1991
				1992	<glossary id="glossary">
				1993	<title>Glossary</title>
				1994
				1995	<glossentry id="gloss-preemption">
				1996	<glossterm>preemption</glossterm>
				1997	<glossdef>
				1998	<para>
				1999	Prior to 2.5, or when <symbol>CONFIG_PREEMPT</symbol> is
				2000	unset, processes in user context inside the kernel would not
				2001	preempt each other (ie. you had that CPU until you have it up,
				2002	except for interrupts). With the addition of
				2003	<symbol>CONFIG_PREEMPT</symbol> in 2.5.4, this changed: when
				2004	in user context, higher priority tasks can "cut in": spinlocks
				2005	were changed to disable preemption, even on UP.
				2006	</para>
				2007	</glossdef>
				2008	</glossentry>
				2009
				2010	<glossentry id="gloss-bh">
				2011	<glossterm>bh</glossterm>
				2012	<glossdef>
				2013	<para>
				2014	Bottom Half: for historical reasons, functions with
				2015	'_bh' in them often now refer to any software interrupt, e.g.
				2016	<function>spin_lock_bh()</function> blocks any software interrupt
				2017	on the current CPU. Bottom halves are deprecated, and will
				2018	eventually be replaced by tasklets. Only one bottom half will be
				2019	running at any time.
				2020	</para>
				2021	</glossdef>
				2022	</glossentry>
				2023
				2024	<glossentry id="gloss-hwinterrupt">
				2025	<glossterm>Hardware Interrupt / Hardware IRQ</glossterm>
				2026	<glossdef>
				2027	<para>
				2028	Hardware interrupt request. <function>in_irq()</function> returns
				2029	<returnvalue>true</returnvalue> in a hardware interrupt handler.
				2030	</para>
				2031	</glossdef>
				2032	</glossentry>
				2033
				2034	<glossentry id="gloss-interruptcontext">
				2035	<glossterm>Interrupt Context</glossterm>
				2036	<glossdef>
				2037	<para>
				2038	Not user context: processing a hardware irq or software irq.
				2039	Indicated by the <function>in_interrupt()</function> macro
				2040	returning <returnvalue>true</returnvalue>.
				2041	</para>
				2042	</glossdef>
				2043	</glossentry>
				2044
				2045	<glossentry id="gloss-smp">
				2046	<glossterm><acronym>SMP</acronym></glossterm>
				2047	<glossdef>
				2048	<para>
				2049	Symmetric Multi-Processor: kernels compiled for multiple-CPU
				2050	machines. (CONFIG_SMP=y).
				2051	</para>
				2052	</glossdef>
				2053	</glossentry>
				2054
				2055	<glossentry id="gloss-softirq">
				2056	<glossterm>Software Interrupt / softirq</glossterm>
				2057	<glossdef>
				2058	<para>
				2059	Software interrupt handler. <function>in_irq()</function> returns
				2060	<returnvalue>false</returnvalue>; <function>in_softirq()</function>
				2061	returns <returnvalue>true</returnvalue>. Tasklets and softirqs
				2062	both fall into the category of 'software interrupts'.
				2063	</para>
				2064	<para>
				2065	Strictly speaking a softirq is one of up to 32 enumerated software
				2066	interrupts which can run on multiple CPUs at once.
				2067	Sometimes used to refer to tasklets as
				2068	well (ie. all software interrupts).
				2069	</para>
				2070	</glossdef>
				2071	</glossentry>
				2072
				2073	<glossentry id="gloss-tasklet">
				2074	<glossterm>tasklet</glossterm>
				2075	<glossdef>
				2076	<para>
				2077	A dynamically-registrable software interrupt,
				2078	which is guaranteed to only run on one CPU at a time.
				2079	</para>
				2080	</glossdef>
				2081	</glossentry>
				2082
				2083	<glossentry id="gloss-timers">
				2084	<glossterm>timer</glossterm>
				2085	<glossdef>
				2086	<para>
				2087	A dynamically-registrable software interrupt, which is run at
				2088	(or close to) a given time. When running, it is just like a
				2089	tasklet (in fact, they are called from the TIMER_SOFTIRQ).
				2090	</para>
				2091	</glossdef>
				2092	</glossentry>
				2093
				2094	<glossentry id="gloss-up">
				2095	<glossterm><acronym>UP</acronym></glossterm>
				2096	<glossdef>
				2097	<para>
				2098	Uni-Processor: Non-SMP. (CONFIG_SMP=n).
				2099	</para>
				2100	</glossdef>
				2101	</glossentry>
				2102
				2103	<glossentry id="gloss-usercontext">
				2104	<glossterm>User Context</glossterm>
				2105	<glossdef>
				2106	<para>
				2107	The kernel executing on behalf of a particular process (ie. a
				2108	system call or trap) or kernel thread. You can tell which
				2109	process with the <symbol>current</symbol> macro.) Not to
				2110	be confused with userspace. Can be interrupted by software or
				2111	hardware interrupts.
				2112	</para>
				2113	</glossdef>
				2114	</glossentry>
				2115
				2116	<glossentry id="gloss-userspace">
				2117	<glossterm>Userspace</glossterm>
				2118	<glossdef>
				2119	<para>
				2120	A process executing its own code outside the kernel.
				2121	</para>
				2122	</glossdef>
				2123	</glossentry>
				2124
				2125	</glossary>
				2126	</book>
				2127