<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE book PUBLIC "-//OASIS//DTD DocBook XML V4.1.2//EN"
	"http://www.oasis-open.org/docbook/xml/4.1.2/docbookx.dtd" []>

<book id="LKLockingGuide">
 <bookinfo>
  <title>Unreliable Guide To Locking</title>

  <authorgroup>
   <author>
    <firstname>Rusty</firstname>
    <surname>Russell</surname>
    <affiliation>
     <address>
      <email>rusty@rustcorp.com.au</email>
     </address>
    </affiliation>
   </author>
  </authorgroup>

  <copyright>
   <year>2003</year>
   <holder>Rusty Russell</holder>
  </copyright>

  <legalnotice>
   <para>
     This documentation is free software; you can redistribute
     it and/or modify it under the terms of the GNU General Public
     License as published by the Free Software Foundation; either
     version 2 of the License, or (at your option) any later
     version.
   </para>

   <para>
     This program is distributed in the hope that it will be
     useful, but WITHOUT ANY WARRANTY; without even the implied
     warranty of MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.
     See the GNU General Public License for more details.
   </para>

   <para>
     You should have received a copy of the GNU General Public
     License along with this program; if not, write to the Free
     Software Foundation, Inc., 59 Temple Place, Suite 330, Boston,
     MA 02111-1307 USA
   </para>

   <para>
     For more details see the file COPYING in the source
     distribution of Linux.
   </para>
  </legalnotice>
 </bookinfo>

 <toc></toc>
  <chapter id="intro">
   <title>Introduction</title>
   <para>
     Welcome, to Rusty's Remarkably Unreliable Guide to Kernel
     Locking issues.  This document describes the locking systems in
     the 2.6 Linux Kernel.
   </para>
   <para>
     With the wide availability of HyperThreading, and <firstterm
     linkend="gloss-preemption">preemption</firstterm> in the Linux
     Kernel, everyone hacking on the kernel needs to know the
     fundamentals of concurrency and locking for
     <firstterm linkend="gloss-smp"><acronym>SMP</acronym></firstterm>.
   </para>
  </chapter>

   <chapter id="races">
    <title>The Problem With Concurrency</title>
    <para>
      (Skip this if you know what a Race Condition is).
    </para>
    <para>
      In a normal program, you can increment a counter like so:
    </para>
    <programlisting>
      very_important_count++;
    </programlisting>

    <para>
      This is what you would expect to happen:
    </para>

    <table>
     <title>Expected Results</title>

     <tgroup cols="2" align="left">

      <thead>
       <row>
        <entry>Instance 1</entry>
        <entry>Instance 2</entry>
       </row>
      </thead>

      <tbody>
       <row>
        <entry>read very_important_count (5)</entry>
        <entry></entry>
       </row>
       <row>
        <entry>add 1 (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry>write very_important_count (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>read very_important_count (6)</entry>
       </row>
       <row>
        <entry></entry>
        <entry>add 1 (7)</entry>
       </row>
       <row>
        <entry></entry>
        <entry>write very_important_count (7)</entry>
       </row>
      </tbody>

     </tgroup>
    </table>

    <para>
     This is what might happen instead:
    </para>

    <table>
     <title>Possible Results</title>

     <tgroup cols="2" align="left">
      <thead>
       <row>
        <entry>Instance 1</entry>
        <entry>Instance 2</entry>
       </row>
      </thead>

      <tbody>
       <row>
        <entry>read very_important_count (5)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>read very_important_count (5)</entry>
       </row>
       <row>
        <entry>add 1 (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>add 1 (6)</entry>
       </row>
       <row>
        <entry>write very_important_count (6)</entry>
        <entry></entry>
       </row>
       <row>
        <entry></entry>
        <entry>write very_important_count (6)</entry>
       </row>
      </tbody>
     </tgroup>
    </table>

    <sect1 id="race-condition">
    <title>Race Conditions and Critical Regions</title>
    <para>
      This overlap, where the result depends on the
      relative timing of multiple tasks, is called a <firstterm>race condition</firstterm>.
      The piece of code containing the concurrency issue is called a
      <firstterm>critical region</firstterm>.  Especially since Linux started running
      on SMP machines, race conditions have become one of the major issues in kernel
      design and implementation.
    </para>
    <para>
      Preemption can have the same effect, even if there is only one
      CPU: by preempting one task during the critical region, we have
      exactly the same race condition.  In this case the thread which
      preempts might run the critical region itself.
    </para>
    <para>
      The solution is to recognize when these simultaneous accesses
      occur, and use locks to make sure that only one instance can
      enter the critical region at any time.  There are many
      friendly primitives in the Linux kernel to help you do this.
      And then there are the unfriendly primitives, but I'll pretend
      they don't exist.
    </para>
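    <para>
      As a taste of things to come, here is a minimal sketch (not
      taken from the kernel source) of the counter above protected by
      a spinlock, so that the read-modify-write cycle becomes atomic
      with respect to anyone else taking the same lock:
    </para>
    <programlisting>
static DEFINE_SPINLOCK(counter_lock); /* protects very_important_count */
static int very_important_count;

static void increment_counter(void)
{
        spin_lock(&counter_lock);
        very_important_count++; /* no-one can interleave with us here */
        spin_unlock(&counter_lock);
}
    </programlisting>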
    </sect1>
  </chapter>

  <chapter id="locks">
   <title>Locking in the Linux Kernel</title>

   <para>
     If I could give you one piece of advice: never sleep with anyone
     crazier than yourself.  But if I had to give you advice on
     locking: <emphasis>keep it simple</emphasis>.
   </para>

   <para>
     Be reluctant to introduce new locks.
   </para>

   <para>
     Strangely enough, this last one is the exact reverse of my advice when
     you <emphasis>have</emphasis> slept with someone crazier than yourself.
     And you should think about getting a big dog.
   </para>

   <sect1 id="lock-intro">
   <title>Two Main Types of Kernel Locks: Spinlocks and Mutexes</title>

   <para>
     There are two main types of kernel locks.  The fundamental type
     is the spinlock
     (<filename class="headerfile">include/asm/spinlock.h</filename>),
     which is a very simple single-holder lock: if you can't get the
     spinlock, you keep trying (spinning) until you can.  Spinlocks are
     very small and fast, and can be used anywhere.
   </para>
   <para>
     The second type is a mutex
     (<filename class="headerfile">include/linux/mutex.h</filename>): it
     is like a spinlock, but you may block holding a mutex.
     If you can't lock a mutex, your task will suspend itself, and be woken
     up when the mutex is released.  This means the CPU can do something
     else while you are waiting.  There are many cases when you simply
     can't sleep (see <xref linkend="sleeping-things"/>), and so have to
     use a spinlock instead.
   </para>
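   <para>
     As a quick taste of both (a minimal sketch with made-up data to
     protect, not code from the kernel source):
   </para>
   <programlisting>
static DEFINE_SPINLOCK(counter_lock);  /* for contexts which cannot sleep */
static DEFINE_MUTEX(table_mutex);      /* for user context, may sleep */

static void bump_counter(void)
{
        spin_lock(&counter_lock);
        /* Critical region: keep it short, and never sleep here. */
        spin_unlock(&counter_lock);
}

static void update_table(void)
{
        mutex_lock(&table_mutex);
        /* Critical region: sleeping (eg. kmalloc(GFP_KERNEL)) is fine. */
        mutex_unlock(&table_mutex);
}
   </programlisting>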
   <para>
     Neither type of lock is recursive: see
     <xref linkend="deadlock"/>.
   </para>
   </sect1>

   <sect1 id="uniprocessor">
    <title>Locks and Uniprocessor Kernels</title>

    <para>
      For kernels compiled without <symbol>CONFIG_SMP</symbol>, and
      without <symbol>CONFIG_PREEMPT</symbol>, spinlocks do not exist at
      all.  This is an excellent design decision: when no-one else can
      run at the same time, there is no reason to have a lock.
    </para>

    <para>
      If the kernel is compiled without <symbol>CONFIG_SMP</symbol>,
      but <symbol>CONFIG_PREEMPT</symbol> is set, then spinlocks
      simply disable preemption, which is sufficient to prevent any
      races.  For most purposes, we can think of preemption as
      equivalent to SMP, and not worry about it separately.
    </para>

    <para>
      You should always test your locking code with <symbol>CONFIG_SMP</symbol>
      and <symbol>CONFIG_PREEMPT</symbol> enabled, even if you don't have an SMP test box, because it
      will still catch some kinds of locking bugs.
    </para>

    <para>
      Mutexes still exist, because they are required for
      synchronization between <firstterm linkend="gloss-usercontext">user
      contexts</firstterm>, as we will see below.
    </para>
   </sect1>

    <sect1 id="usercontextlocking">
     <title>Locking Only In User Context</title>

     <para>
       If you have a data structure which is only ever accessed from
       user context, then you can use a simple mutex
       (<filename>include/linux/mutex.h</filename>) to protect it.  This
       is the most trivial case: you initialize the mutex.  Then you can
       call <function>mutex_lock_interruptible()</function> to grab the mutex,
       and <function>mutex_unlock()</function> to release it.  There is also a
       <function>mutex_lock()</function>, which should be avoided, because it
       will not return if a signal is received.
     </para>
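     <para>
       A minimal sketch of that pattern (the structure and names here
       are made up for illustration):
     </para>
     <programlisting>
struct my_config { int debug_level; };

static DEFINE_MUTEX(config_lock);      /* protects current_config */
static struct my_config current_config;

int set_config(const struct my_config *new)
{
        int err = mutex_lock_interruptible(&config_lock);
        if (err)
                return err;            /* a signal arrived while sleeping */
        current_config = *new;
        mutex_unlock(&config_lock);
        return 0;
}
     </programlisting>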

     <para>
       Example: <filename>net/netfilter/nf_sockopt.c</filename> allows
       registration of new <function>setsockopt()</function> and
       <function>getsockopt()</function> calls, with
       <function>nf_register_sockopt()</function>.  Registration and
       de-registration are only done on module load and unload (and boot
       time, where there is no concurrency), and the list of registrations
       is only consulted for an unknown <function>setsockopt()</function>
       or <function>getsockopt()</function> system call.  The
       <varname>nf_sockopt_mutex</varname> is perfect to protect this,
       especially since the setsockopt and getsockopt calls may well
       sleep.
     </para>
   </sect1>

   <sect1 id="lock-user-bh">
    <title>Locking Between User Context and Softirqs</title>

    <para>
      If a <firstterm linkend="gloss-softirq">softirq</firstterm> shares
      data with user context, you have two problems.  Firstly, the current
      user context can be interrupted by a softirq, and secondly, the
      critical region could be entered from another CPU.  This is where
      <function>spin_lock_bh()</function>
      (<filename class="headerfile">include/linux/spinlock.h</filename>) is
      used.  It disables softirqs on that CPU, then grabs the lock.
      <function>spin_unlock_bh()</function> does the reverse.  (The
      '_bh' suffix is a historical reference to "Bottom Halves", the
      old name for software interrupts.  In a perfect world it would
      really be called 'spin_lock_softirq()').
    </para>
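    <para>
      A minimal sketch of this pattern (hypothetical names, not from
      the kernel source): user context adds entries to a list which a
      softirq also walks, so the user-context side must take the lock
      with the _bh variant:
    </para>
    <programlisting>
struct pending {
        struct list_head list;
        /* ... */
};

static DEFINE_SPINLOCK(pending_lock);  /* protects pending_list */
static LIST_HEAD(pending_list);

/* Called from user context (eg. a system call). */
void queue_pending(struct pending *p)
{
        spin_lock_bh(&pending_lock);   /* keeps softirqs off this CPU */
        list_add(&p->list, &pending_list);
        spin_unlock_bh(&pending_lock);
}
    </programlisting>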

    <para>
      Note that you can also use <function>spin_lock_irq()</function>
      or <function>spin_lock_irqsave()</function> here, which stop
      hardware interrupts as well: see <xref linkend="hardirq-context"/>.
    </para>

    <para>
      This works perfectly for <firstterm linkend="gloss-up"><acronym>UP</acronym></firstterm>
      as well: the spin lock vanishes, and this macro
      simply becomes <function>local_bh_disable()</function>
      (<filename class="headerfile">include/linux/interrupt.h</filename>), which
      protects you from the softirq being run.
    </para>
   </sect1>

   <sect1 id="lock-user-tasklet">
    <title>Locking Between User Context and Tasklets</title>

    <para>
      This is exactly the same as above, because <firstterm
      linkend="gloss-tasklet">tasklets</firstterm> are actually run
      from a softirq.
    </para>
   </sect1>

   <sect1 id="lock-user-timers">
    <title>Locking Between User Context and Timers</title>

    <para>
      This, too, is exactly the same as above, because <firstterm
      linkend="gloss-timers">timers</firstterm> are actually run from
      a softirq.  From a locking point of view, tasklets and timers
      are identical.
    </para>
   </sect1>

   <sect1 id="lock-tasklets">
    <title>Locking Between Tasklets/Timers</title>

    <para>
      Sometimes a tasklet or timer might want to share data with
      another tasklet or timer.
    </para>

    <sect2 id="lock-tasklets-same">
     <title>The Same Tasklet/Timer</title>
     <para>
       Since a tasklet is never run on two CPUs at once, you don't
       need to worry about your tasklet being reentrant (running
       twice at once), even on SMP.
     </para>
    </sect2>

    <sect2 id="lock-tasklets-different">
     <title>Different Tasklets/Timers</title>
     <para>
       If another tasklet/timer wants
       to share data with your tasklet or timer, you will both need to use
       <function>spin_lock()</function> and
       <function>spin_unlock()</function> calls.
       <function>spin_lock_bh()</function> is
       unnecessary here, as you are already in a tasklet, and
       no other one will be run on the same CPU.
     </para>
    </sect2>
   </sect1>

   <sect1 id="lock-softirqs">
    <title>Locking Between Softirqs</title>

    <para>
      Often a softirq might
      want to share data with itself or a tasklet/timer.
    </para>

    <sect2 id="lock-softirqs-same">
     <title>The Same Softirq</title>

     <para>
       The same softirq can run on the other CPUs: you can use a
       per-CPU array (see <xref linkend="per-cpu"/>) for better
       performance.  If you're going so far as to use a softirq,
       you probably care about scalable performance enough
       to justify the extra complexity.
     </para>

     <para>
       You'll need to use <function>spin_lock()</function> and
       <function>spin_unlock()</function> for shared data.
     </para>
    </sect2>

    <sect2 id="lock-softirqs-different">
     <title>Different Softirqs</title>

     <para>
       You'll need to use <function>spin_lock()</function> and
       <function>spin_unlock()</function> for shared data, whether the
       other user is a timer, a tasklet, the same softirq or a
       different one: any of them could be running on a different CPU.
     </para>
    </sect2>
   </sect1>
  </chapter>

  <chapter id="hardirq-context">
   <title>Hard IRQ Context</title>

   <para>
     Hardware interrupts usually communicate with a
     tasklet or softirq.  Frequently this involves putting work in a
     queue, which the softirq will take out.
   </para>

   <sect1 id="hardirq-softirq">
    <title>Locking Between Hard IRQ and Softirqs/Tasklets</title>

    <para>
      If a hardware irq handler shares data with a softirq, you have
      two concerns.  Firstly, the softirq processing can be
      interrupted by a hardware interrupt, and secondly, the
      critical region could be entered by a hardware interrupt on
      another CPU.  This is where <function>spin_lock_irq()</function> is
      used.  It is defined to disable interrupts on that CPU, then grab
      the lock. <function>spin_unlock_irq()</function> does the reverse.
    </para>

    <para>
      The irq handler does not need to use
      <function>spin_lock_irq()</function>, because the softirq cannot
      run while the irq handler is running: it can use
      <function>spin_lock()</function>, which is slightly faster.  The
      only exception would be if a different hardware irq handler uses
      the same lock: <function>spin_lock_irq()</function> will stop
      that from interrupting us.
    </para>

    <para>
      This works perfectly for UP as well: the spin lock vanishes,
      and this macro simply becomes <function>local_irq_disable()</function>
      (<filename class="headerfile">include/asm/smp.h</filename>), which
      protects you from the softirq/tasklet/BH being run.
    </para>

    <para>
      <function>spin_lock_irqsave()</function>
      (<filename>include/linux/spinlock.h</filename>) is a variant
      which saves whether interrupts were on or off in a flags word,
      which is passed to <function>spin_unlock_irqrestore()</function>.  This
      means that the same code can be used inside a hard irq handler (where
      interrupts are already off) and in softirqs (where the irq
      disabling is required).
    </para>
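    <para>
      A minimal sketch of the irqsave variant (hypothetical names),
      showing the flags word being saved and restored:
    </para>
    <programlisting>
static DEFINE_SPINLOCK(stats_lock);    /* shared with a hard irq handler */
static unsigned long packet_count;

void bump_packet_count(void)
{
        unsigned long flags;

        spin_lock_irqsave(&stats_lock, flags); /* irqs off, old state saved */
        packet_count++;
        spin_unlock_irqrestore(&stats_lock, flags); /* old state restored */
}
    </programlisting>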

    <para>
      Note that softirqs (and hence tasklets and timers) are run on
      return from hardware interrupts, so
      <function>spin_lock_irq()</function> also stops these.  In that
      sense, <function>spin_lock_irqsave()</function> is the most
      general and powerful locking function.
    </para>

   </sect1>
   <sect1 id="hardirq-hardirq">
    <title>Locking Between Two Hard IRQ Handlers</title>
    <para>
      It is rare to have to share data between two IRQ handlers, but
      if you do, <function>spin_lock_irqsave()</function> should be
      used: it is architecture-specific whether all interrupts are
      disabled inside irq handlers themselves.
    </para>
   </sect1>

  </chapter>

  <chapter id="cheatsheet">
   <title>Cheat Sheet For Locking</title>
   <para>
     Pete Zaitcev gives the following summary:
   </para>
   <itemizedlist>
      <listitem>
	<para>
	If you are in a process context (any syscall) and want to
	lock other processes out, use a mutex.  You can take a mutex
	and sleep (<function>copy_from_user*()</function> or
	<function>kmalloc(x,GFP_KERNEL)</function>).
	</para>
      </listitem>
      <listitem>
	<para>
	Otherwise (== data can be touched in an interrupt), use
	<function>spin_lock_irqsave()</function> and
	<function>spin_unlock_irqrestore()</function>.
	</para>
      </listitem>
      <listitem>
	<para>
	Avoid holding a spinlock for more than 5 lines of code and
	across any function call (except accessors like
	<function>readb</function>).
	</para>
      </listitem>
    </itemizedlist>

   <sect1 id="minimum-lock-reqirements">
   <title>Table of Minimum Requirements</title>

   <para> The following table lists the <emphasis>minimum</emphasis>
	locking requirements between various contexts.  In some cases,
	the same context can only be running on one CPU at a time, so
	no locking is required for that context (eg. a particular
	thread can only run on one CPU at a time, but if it needs to
	share data with another thread, locking is required).
   </para>
   <para>
	Remember the advice above: you can always use
	<function>spin_lock_irqsave()</function>, which is a superset
	of all other spinlock primitives.
   </para>

   <table>
<title>Table of Locking Requirements</title>
<tgroup cols="11">
<tbody>

<row>
<entry></entry>
<entry>IRQ Handler A</entry>
<entry>IRQ Handler B</entry>
<entry>Softirq A</entry>
<entry>Softirq B</entry>
<entry>Tasklet A</entry>
<entry>Tasklet B</entry>
<entry>Timer A</entry>
<entry>Timer B</entry>
<entry>User Context A</entry>
<entry>User Context B</entry>
</row>

<row>
<entry>IRQ Handler A</entry>
<entry>None</entry>
</row>

<row>
<entry>IRQ Handler B</entry>
<entry>SLIS</entry>
<entry>None</entry>
</row>

<row>
<entry>Softirq A</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
</row>

<row>
<entry>Softirq B</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
<entry>SL</entry>
</row>

<row>
<entry>Tasklet A</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>None</entry>
</row>

<row>
<entry>Tasklet B</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>None</entry>
</row>

<row>
<entry>Timer A</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>None</entry>
</row>

<row>
<entry>Timer B</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>SL</entry>
<entry>None</entry>
</row>

<row>
<entry>User Context A</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>None</entry>
</row>

<row>
<entry>User Context B</entry>
<entry>SLI</entry>
<entry>SLI</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>SLBH</entry>
<entry>MLI</entry>
<entry>None</entry>
</row>

</tbody>
</tgroup>
</table>

   <table>
<title>Legend for Locking Requirements Table</title>
<tgroup cols="2">
<tbody>

<row>
<entry>SLIS</entry>
<entry>spin_lock_irqsave</entry>
</row>
<row>
<entry>SLI</entry>
<entry>spin_lock_irq</entry>
</row>
<row>
<entry>SL</entry>
<entry>spin_lock</entry>
</row>
<row>
<entry>SLBH</entry>
<entry>spin_lock_bh</entry>
</row>
<row>
<entry>MLI</entry>
<entry>mutex_lock_interruptible</entry>
</row>

</tbody>
</tgroup>
</table>

</sect1>
</chapter>

<chapter id="trylock-functions">
 <title>The trylock Functions</title>
  <para>
   There are functions that try to acquire a lock only once and immediately
   return a value indicating success or failure to acquire the lock.
   They can be used if you need no access to the data protected by the lock
   while some other thread holds it; you can acquire the lock later if you
   then need access to the data.
  </para>

  <para>
    <function>spin_trylock()</function> does not spin but returns non-zero if
    it acquires the spinlock on the first try or 0 if not. This function can
    be used in all contexts like <function>spin_lock()</function>: you must
    have disabled the contexts that might interrupt you and try to acquire
    the spinlock.
  </para>

  <para>
    <function>mutex_trylock()</function> does not suspend your task
    but returns non-zero if it could lock the mutex on the first try
    or 0 if not. This function cannot be safely used in hardware or software
    interrupt contexts despite not sleeping.
  </para>
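  <para>
    A minimal sketch of the trylock pattern (hypothetical names): do an
    optional piece of work only if the lock happens to be free, rather
    than waiting for it:
  </para>
<programlisting>
static DEFINE_SPINLOCK(stats_lock);
static unsigned long handled;

void try_account(void)
{
        if (!spin_trylock(&stats_lock))
                return;         /* someone else holds it: skip the update */
        handled++;
        spin_unlock(&stats_lock);
}
</programlisting>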
</chapter>

  <chapter id="Examples">
   <title>Common Examples</title>
    <para>
Let's step through a simple example: a cache of number-to-name
mappings.  The cache keeps a count of how often each of the objects is
used, and when it gets full, throws out the least used one.
    </para>

   <sect1 id="examples-usercontext">
    <title>All In User Context</title>
    <para>
For our first example, we assume that all operations are in user
context (ie. from system calls), so we can sleep.  This means we can
use a mutex to protect the cache and all the objects within
it.  Here's the code:
    </para>

    <programlisting>
#include <linux/list.h>
#include <linux/slab.h>
#include <linux/string.h>
#include <linux/mutex.h>
#include <asm/errno.h>

struct object
{
        struct list_head list;
        int id;
        char name[32];
        int popularity;
};

/* Protects the cache, cache_num, and the objects within it */
static DEFINE_MUTEX(cache_lock);
static LIST_HEAD(cache);
static unsigned int cache_num = 0;
#define MAX_CACHE_SIZE 10

/* Must be holding cache_lock */
static struct object *__cache_find(int id)
{
        struct object *i;

        list_for_each_entry(i, &cache, list)
                if (i->id == id) {
                        i->popularity++;
                        return i;
                }
        return NULL;
}

/* Must be holding cache_lock */
static void __cache_delete(struct object *obj)
{
        BUG_ON(!obj);
        list_del(&obj->list);
        kfree(obj);
        cache_num--;
}

/* Must be holding cache_lock */
static void __cache_add(struct object *obj)
{
        list_add(&obj->list, &cache);
        if (++cache_num > MAX_CACHE_SIZE) {
                struct object *i, *outcast = NULL;
                list_for_each_entry(i, &cache, list) {
                        if (!outcast || i->popularity < outcast->popularity)
                                outcast = i;
                }
                __cache_delete(outcast);
        }
}

int cache_add(int id, const char *name)
{
        struct object *obj;

        if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
                return -ENOMEM;

        strlcpy(obj->name, name, sizeof(obj->name));
        obj->id = id;
        obj->popularity = 0;

        mutex_lock(&cache_lock);
        __cache_add(obj);
        mutex_unlock(&cache_lock);
        return 0;
}

void cache_delete(int id)
{
        mutex_lock(&cache_lock);
        __cache_delete(__cache_find(id));
        mutex_unlock(&cache_lock);
}

int cache_find(int id, char *name)
{
        struct object *obj;
        int ret = -ENOENT;

        mutex_lock(&cache_lock);
        obj = __cache_find(id);
        if (obj) {
                ret = 0;
                strcpy(name, obj->name);
        }
        mutex_unlock(&cache_lock);
        return ret;
}
</programlisting>

    <para>
Note that we always make sure we have the cache_lock when we add,
delete, or look up the cache: both the cache infrastructure itself and
the contents of the objects are protected by the lock.  In this case
it's easy, since we copy the data for the user, and never let them
access the objects directly.
    </para>
    <para>
There is a slight (and common) optimization here: in
<function>cache_add</function> we set up the fields of the object
before grabbing the lock.  This is safe, as no-one else can access it
until we put it in the cache.
    </para>
    </sect1>

   <sect1 id="examples-interrupt">
    <title>Accessing From Interrupt Context</title>
    <para>
Now consider the case where <function>cache_find</function> can be
called from interrupt context: either a hardware interrupt or a
softirq.  An example would be a timer which deletes objects from the
cache.
    </para>
    <para>
The change is shown below, in standard patch format: the
<symbol>-</symbol> are lines which are taken away, and the
<symbol>+</symbol> are lines which are added.
    </para>
<programlisting>
--- cache.c.usercontext	2003-12-09 13:58:54.000000000 +1100
+++ cache.c.interrupt	2003-12-09 14:07:49.000000000 +1100
@@ -12,7 +12,7 @@
         int popularity;
 };

-static DEFINE_MUTEX(cache_lock);
+static DEFINE_SPINLOCK(cache_lock);
 static LIST_HEAD(cache);
 static unsigned int cache_num = 0;
 #define MAX_CACHE_SIZE 10
@@ -55,6 +55,7 @@
 int cache_add(int id, const char *name)
 {
         struct object *obj;
+        unsigned long flags;

         if ((obj = kmalloc(sizeof(*obj), GFP_KERNEL)) == NULL)
                 return -ENOMEM;
@@ -63,30 +64,33 @@
         obj->id = id;
         obj->popularity = 0;

-        mutex_lock(&cache_lock);
+        spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);
-        mutex_unlock(&cache_lock);
+        spin_unlock_irqrestore(&cache_lock, flags);
         return 0;
 }

 void cache_delete(int id)
 {
-        mutex_lock(&cache_lock);
+        unsigned long flags;
+
+        spin_lock_irqsave(&cache_lock, flags);
         __cache_delete(__cache_find(id));
-        mutex_unlock(&cache_lock);
+        spin_unlock_irqrestore(&cache_lock, flags);
 }

 int cache_find(int id, char *name)
 {
         struct object *obj;
         int ret = -ENOENT;
+        unsigned long flags;

-        mutex_lock(&cache_lock);
+        spin_lock_irqsave(&cache_lock, flags);
         obj = __cache_find(id);
         if (obj) {
                 ret = 0;
                 strcpy(name, obj->name);
         }
-        mutex_unlock(&cache_lock);
+        spin_unlock_irqrestore(&cache_lock, flags);
         return ret;
 }
</programlisting>

    <para>
Note that <function>spin_lock_irqsave</function> will turn off
interrupts if they are on; otherwise (if we are already in an
interrupt handler, say) it does nothing.  Hence these functions are
safe to call from any context.
    </para>
    <para>
Unfortunately, <function>cache_add</function> calls
<function>kmalloc</function> with the <symbol>GFP_KERNEL</symbol>
flag, which is only legal in user context.  I have assumed that
<function>cache_add</function> is still only called in user context,
otherwise this should become a parameter to
<function>cache_add</function>.
    </para>
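    <para>
A hedged sketch of what that parameter might look like (this variant
is not part of the example code above):
    </para>
<programlisting>
int cache_add(int id, const char *name, gfp_t gfp)
{
        struct object *obj;

        /* Callers pass GFP_KERNEL from user context, GFP_ATOMIC from
           interrupt context. */
        if ((obj = kmalloc(sizeof(*obj), gfp)) == NULL)
                return -ENOMEM;
        /* ... set up obj and add it under the lock, as before ... */
        return 0;
}
</programlisting>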
  </sect1>
   <sect1 id="examples-refcnt">
    <title>Exposing Objects Outside This File</title>
    <para>
If our objects contained more information, it might not be sufficient
to copy the information in and out: other parts of the code might want
to keep pointers to these objects, for example, rather than looking up
the id every time.  This produces two problems.
    </para>
    <para>
The first problem is that we use the <symbol>cache_lock</symbol> to
protect objects: we'd need to make this non-static so the rest of the
code can use it.  This makes locking trickier, as it is no longer all
in one place.
    </para>
    <para>
The second problem is the lifetime problem: if another structure keeps
a pointer to an object, it presumably expects that pointer to remain
valid.  Unfortunately, this is only guaranteed while you hold the
lock, otherwise someone might call <function>cache_delete</function>
and, even worse, add another object, re-using the same address.
    </para>
    <para>
As there is only one lock, you can't hold it forever: no-one else would
get any work done.
    </para>
    <para>
The solution to this problem is to use a reference count: everyone who
has a pointer to the object increases it when they first get the
object, and drops the reference count when they're finished with it.
Whoever drops it to zero knows it is unused, and can actually delete it.
    </para>
    <para>
Here is the code:
    </para>

<programlisting>
--- cache.c.interrupt	2003-12-09 14:25:43.000000000 +1100
+++ cache.c.refcnt	2003-12-09 14:33:05.000000000 +1100
@@ -7,6 +7,7 @@
 struct object
 {
         struct list_head list;
+        unsigned int refcnt;
         int id;
         char name[32];
         int popularity;
@@ -17,6 +18,35 @@
 static unsigned int cache_num = 0;
 #define MAX_CACHE_SIZE 10

+static void __object_put(struct object *obj)
+{
+        if (--obj->refcnt == 0)
+                kfree(obj);
+}
+
+static void __object_get(struct object *obj)
+{
+        obj->refcnt++;
+}
+
+void object_put(struct object *obj)
+{
+        unsigned long flags;
+
+        spin_lock_irqsave(&cache_lock, flags);
+        __object_put(obj);
+        spin_unlock_irqrestore(&cache_lock, flags);
+}
+
+void object_get(struct object *obj)
+{
+        unsigned long flags;
+
+        spin_lock_irqsave(&cache_lock, flags);
+        __object_get(obj);
+        spin_unlock_irqrestore(&cache_lock, flags);
+}
+
 /* Must be holding cache_lock */
 static struct object *__cache_find(int id)
 {
@@ -35,6 +65,7 @@
 {
         BUG_ON(!obj);
         list_del(&obj->list);
+        __object_put(obj);
         cache_num--;
 }

@@ -63,6 +94,7 @@
         strlcpy(obj->name, name, sizeof(obj->name));
         obj->id = id;
         obj->popularity = 0;
+        obj->refcnt = 1; /* The cache holds a reference */

         spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);
@@ -79,18 +111,15 @@
         spin_unlock_irqrestore(&cache_lock, flags);
 }

-int cache_find(int id, char *name)
+struct object *cache_find(int id)
 {
         struct object *obj;
-        int ret = -ENOENT;
         unsigned long flags;

         spin_lock_irqsave(&cache_lock, flags);
         obj = __cache_find(id);
-        if (obj) {
-                ret = 0;
-                strcpy(name, obj->name);
-        }
+        if (obj)
+                __object_get(obj);
         spin_unlock_irqrestore(&cache_lock, flags);
-        return ret;
+        return obj;
 }
</programlisting>

<para>
We encapsulate the reference counting in the standard 'get' and 'put'
functions.  Now we can return the object itself from
<function>cache_find</function>, which has the advantage that the user
can now sleep holding the object (eg. to
<function>copy_to_user</function> the name to userspace).
</para>
<para>
The other point to note is that I said a reference should be held for
every pointer to the object: thus the reference count is 1 when first
inserted into the cache.  In some versions the framework does not hold
a reference count, but they are more complicated.
</para>

   <sect2 id="examples-refcnt-atomic">
    <title>Using Atomic Operations For The Reference Count</title>
<para>
In practice, <type>atomic_t</type> would usually be used for
<structfield>refcnt</structfield>.  There are a number of atomic
operations defined in
<filename class="headerfile">include/asm/atomic.h</filename>: these are
guaranteed to be seen atomically from all CPUs in the system, so no
lock is required.  In this case, it is simpler than using spinlocks,
although for anything non-trivial using spinlocks is clearer.
<function>atomic_inc</function> and
<function>atomic_dec_and_test</function> are used instead of the
standard increment and decrement operators, and the lock is no longer
used to protect the reference count itself.
</para>

<programlisting>
--- cache.c.refcnt	2003-12-09 15:00:35.000000000 +1100
+++ cache.c.refcnt-atomic	2003-12-11 15:49:42.000000000 +1100
@@ -7,7 +7,7 @@
 struct object
 {
         struct list_head list;
-        unsigned int refcnt;
+        atomic_t refcnt;
         int id;
         char name[32];
         int popularity;
@@ -18,33 +18,15 @@
 static unsigned int cache_num = 0;
 #define MAX_CACHE_SIZE 10

-static void __object_put(struct object *obj)
-{
-        if (--obj->refcnt == 0)
-                kfree(obj);
-}
-
-static void __object_get(struct object *obj)
-{
-        obj->refcnt++;
-}
-
 void object_put(struct object *obj)
 {
-        unsigned long flags;
-
-        spin_lock_irqsave(&cache_lock, flags);
-        __object_put(obj);
-        spin_unlock_irqrestore(&cache_lock, flags);
+        if (atomic_dec_and_test(&obj->refcnt))
+                kfree(obj);
 }

 void object_get(struct object *obj)
 {
-        unsigned long flags;
-
-        spin_lock_irqsave(&cache_lock, flags);
-        __object_get(obj);
-        spin_unlock_irqrestore(&cache_lock, flags);
+        atomic_inc(&obj->refcnt);
 }

 /* Must be holding cache_lock */
@@ -65,7 +47,7 @@
 {
         BUG_ON(!obj);
         list_del(&obj->list);
-        __object_put(obj);
+        object_put(obj);
         cache_num--;
 }

@@ -94,7 +76,7 @@
         strlcpy(obj->name, name, sizeof(obj->name));
         obj->id = id;
         obj->popularity = 0;
-        obj->refcnt = 1; /* The cache holds a reference */
+        atomic_set(&obj->refcnt, 1); /* The cache holds a reference */

         spin_lock_irqsave(&cache_lock, flags);
         __cache_add(obj);
@@ -119,7 +101,7 @@
         spin_lock_irqsave(&cache_lock, flags);
         obj = __cache_find(id);
         if (obj)
-                __object_get(obj);
+                object_get(obj);
         spin_unlock_irqrestore(&cache_lock, flags);
         return obj;
 }
</programlisting>
</sect2>
</sect1>

   <sect1 id="examples-lock-per-obj">
    <title>Protecting The Objects Themselves</title>
    <para>
In these examples, we assumed that the objects (except the reference
counts) never change once they are created.  If we wanted to allow
the name to change, there are three possibilities:
    </para>
    <itemizedlist>
      <listitem>
	<para>
You can make <symbol>cache_lock</symbol> non-static, and tell people
to grab that lock before changing the name in any object.
        </para>
      </listitem>
      <listitem>
        <para>
You can provide a <function>cache_obj_rename</function> which grabs
this lock and changes the name for the caller, and tell everyone to
use that function (a sketch follows this list).
        </para>
      </listitem>
      <listitem>
        <para>
You can make the <symbol>cache_lock</symbol> protect only the cache
itself, and use another lock to protect the name.
        </para>
      </listitem>
    </itemizedlist>
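<para>
Here is a sketch of the second option; the name
<function>cache_obj_rename</function> comes from the list above, and
the body is my illustration rather than code from the kernel source:
</para>
<programlisting>
void cache_obj_rename(struct object *obj, const char *name)
{
        unsigned long flags;

        spin_lock_irqsave(&cache_lock, flags);
        strlcpy(obj->name, name, sizeof(obj->name));
        spin_unlock_irqrestore(&cache_lock, flags);
}
</programlisting>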
 | 1204 |  | 
 | 1205 |       <para> | 
 | 1206 | Theoretically, you can make the locks as fine-grained as one lock for | 
 | 1207 | every field, for every object.  In practice, the most common variants | 
 | 1208 | are: | 
 | 1209 | </para> | 
 | 1210 |     <itemizedlist> | 
 | 1211 |       <listitem> | 
 | 1212 | 	<para> | 
 | 1213 | One lock which protects the infrastructure (the <symbol>cache</symbol> | 
 | 1214 | list in this example) and all the objects.  This is what we have done | 
 | 1215 | so far. | 
 | 1216 | 	</para> | 
 | 1217 |       </listitem> | 
 | 1218 |       <listitem> | 
 | 1219 |         <para> | 
 | 1220 | One lock which protects the infrastructure (including the list | 
 | 1221 | pointers inside the objects), and one lock inside the object which | 
 | 1222 | protects the rest of that object. | 
 | 1223 |         </para> | 
 | 1224 |       </listitem> | 
 | 1225 |       <listitem> | 
 | 1226 |         <para> | 
 | 1227 | Multiple locks to protect the infrastructure (eg. one lock per hash | 
 | 1228 | chain), possibly with a separate per-object lock. | 
 | 1229 |         </para> | 
 | 1230 |       </listitem> | 
 | 1231 |     </itemizedlist> | 
 | 1232 |  | 
 | 1233 | <para> | 
 | 1234 | Here is the "lock-per-object" implementation: | 
 | 1235 | </para> | 
 | 1236 | <programlisting> | 
 | 1237 | --- cache.c.refcnt-atomic	2003-12-11 15:50:54.000000000 +1100 | 
 | 1238 | +++ cache.c.perobjectlock	2003-12-11 17:15:03.000000000 +1100 | 
 | 1239 | @@ -6,11 +6,17 @@ | 
 | 1240 |  | 
 | 1241 |  struct object | 
 | 1242 |  { | 
 | 1243 | +        /* These two protected by cache_lock. */ | 
 | 1244 |          struct list_head list; | 
 | 1245 | +        int popularity; | 
 | 1246 | + | 
 | 1247 |          atomic_t refcnt; | 
 | 1248 | + | 
 | 1249 | +        /* Doesn't change once created. */ | 
 | 1250 |          int id; | 
 | 1251 | + | 
 | 1252 | +        spinlock_t lock; /* Protects the name */ | 
 | 1253 |          char name[32]; | 
 | 1254 | -        int popularity; | 
 | 1255 |  }; | 
 | 1256 |  | 
 static DEFINE_SPINLOCK(cache_lock);
@@ -77,6 +84,7 @@
 | 1259 |          obj->id = id; | 
 | 1260 |          obj->popularity = 0; | 
 | 1261 |          atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ | 
 | 1262 | +        spin_lock_init(&obj->lock); | 
 | 1263 |  | 
 | 1264 |          spin_lock_irqsave(&cache_lock, flags); | 
 | 1265 |          __cache_add(obj); | 
 | 1266 | </programlisting> | 
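<para>
With this layout, a rename only needs the per-object lock.  A hedged
sketch, assuming the name is only touched from user context (otherwise
the irq-disabling variants would be needed):
</para>

<programlisting>
void cache_obj_rename(struct object *obj, const char *name)
{
        spin_lock(&obj->lock);
        strlcpy(obj->name, name, sizeof(obj->name));
        spin_unlock(&obj->lock);
}
</programlisting>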
 | 1267 |  | 
 | 1268 | <para> | 
Note that I decided that the <structfield>popularity</structfield>
count should be protected by the <symbol>cache_lock</symbol> rather
than the per-object lock: this is because it (like the
<structname>struct list_head</structname> inside the object) is
logically part of the infrastructure.  This way, I don't need to grab
the lock of every object in <function>__cache_add()</function> when
seeking the least popular.
 | 1276 | </para> | 
 | 1277 |  | 
 | 1278 | <para> | 
 | 1279 | I also decided that the <structfield>id</structfield> member is | 
 | 1280 | unchangeable, so I don't need to grab each object lock in | 
 | 1281 | <function>__cache_find()</function> to examine the | 
 | 1282 | <structfield>id</structfield>: the object lock is only used by a | 
 | 1283 | caller who wants to read or write the <structfield>name</structfield> | 
 | 1284 | field. | 
 | 1285 | </para> | 
 | 1286 |  | 
 | 1287 | <para> | 
Note also that I added a comment describing which data is protected by
which locks.  This is extremely important, as it describes the runtime
behavior of the code, and can be hard to deduce from just reading the
code.  And as Alan Cox says, <quote>Lock data, not code</quote>.
 | 1292 | </para> | 
 | 1293 | </sect1> | 
 | 1294 | </chapter> | 
 | 1295 |  | 
 | 1296 |    <chapter id="common-problems"> | 
 | 1297 |     <title>Common Problems</title> | 
 | 1298 |     <sect1 id="deadlock"> | 
 | 1299 |     <title>Deadlock: Simple and Advanced</title> | 
 | 1300 |  | 
 | 1301 |     <para> | 
 | 1302 |       There is a coding bug where a piece of code tries to grab a | 
 | 1303 |       spinlock twice: it will spin forever, waiting for the lock to | 
      be released (spinlocks, rwlocks and mutexes are not
      recursive in Linux).  This is trivial to diagnose: not a
 | 1306 |       stay-up-five-nights-talk-to-fluffy-code-bunnies kind of | 
 | 1307 |       problem. | 
 | 1308 |     </para> | 
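    <para>
      A minimal sketch of that bug (the lock and helper names are
      invented for illustration):
    </para>

    <programlisting>
        spin_lock(&my_lock);
        do_something();
        spin_lock(&my_lock);    /* deadlock: spins forever on this CPU */
    </programlisting>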
 | 1309 |  | 
 | 1310 |     <para> | 
 | 1311 |       For a slightly more complex case, imagine you have a region | 
 | 1312 |       shared by a softirq and user context.  If you use a | 
 | 1313 |       <function>spin_lock()</function> call to protect it, it is  | 
 | 1314 |       possible that the user context will be interrupted by the softirq | 
 | 1315 |       while it holds the lock, and the softirq will then spin | 
 | 1316 |       forever trying to get the same lock. | 
 | 1317 |     </para> | 
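    <para>
      As an aside, a hedged sketch of the usual fix for that case: the
      user-context side takes the <function>_bh</function> variant, so
      the softirq cannot run on this CPU while the lock is held (the
      names are invented):
    </para>

    <programlisting>
        spin_lock_bh(&shared_lock);     /* user context: softirqs blocked */
        update_shared_data();
        spin_unlock_bh(&shared_lock);
    </programlisting>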
 | 1318 |  | 
 | 1319 |     <para> | 
      Both of these are called deadlock, and as shown above they can
      occur even with a single CPU (although not on UP compiles,
      since spinlocks vanish on kernel compiles with
      <symbol>CONFIG_SMP</symbol>=n; you'll still get data corruption
      in the second example).
 | 1325 |     </para> | 
 | 1326 |  | 
 | 1327 |     <para> | 
 | 1328 |       This complete lockup is easy to diagnose: on SMP boxes the | 
      watchdog timer or compiling with <symbol>DEBUG_SPINLOCK</symbol> set
      (<filename>include/linux/spinlock.h</filename>) will show this up
 | 1331 |       immediately when it happens. | 
 | 1332 |     </para> | 
 | 1333 |  | 
 | 1334 |     <para> | 
 | 1335 |       A more complex problem is the so-called 'deadly embrace', | 
 | 1336 |       involving two or more locks.  Say you have a hash table: each | 
 | 1337 |       entry in the table is a spinlock, and a chain of hashed | 
 | 1338 |       objects.  Inside a softirq handler, you sometimes want to | 
 | 1339 |       alter an object from one place in the hash to another: you | 
 | 1340 |       grab the spinlock of the old hash chain and the spinlock of | 
 | 1341 |       the new hash chain, and delete the object from the old one, | 
 | 1342 |       and insert it in the new one. | 
 | 1343 |     </para> | 
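    <para>
      In code, that move might look like this minimal sketch (the
      <structname>struct bucket</structname> and the function are
      invented for illustration):
    </para>

    <programlisting>
        struct bucket {
                spinlock_t lock;
                struct list_head chain;
        };

        /* BAD: deadlocks as described below. */
        static void move_object(struct bucket *old, struct bucket *new,
                                struct object *obj)
        {
                spin_lock(&old->lock);
                spin_lock(&new->lock);
                list_move(&obj->list, &new->chain);
                spin_unlock(&new->lock);
                spin_unlock(&old->lock);
        }
    </programlisting>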
 | 1344 |  | 
 | 1345 |     <para> | 
 | 1346 |       There are two problems here.  First, if your code ever | 
 | 1347 |       tries to move the object to the same chain, it will deadlock | 
 | 1348 |       with itself as it tries to lock it twice.  Secondly, if the | 
 | 1349 |       same softirq on another CPU is trying to move another object | 
 | 1350 |       in the reverse direction, the following could happen: | 
 | 1351 |     </para> | 
 | 1352 |  | 
 | 1353 |     <table> | 
 | 1354 |      <title>Consequences</title> | 
 | 1355 |  | 
 | 1356 |      <tgroup cols="2" align="left"> | 
 | 1357 |  | 
 | 1358 |       <thead> | 
 | 1359 |        <row> | 
 | 1360 |         <entry>CPU 1</entry> | 
 | 1361 |         <entry>CPU 2</entry> | 
 | 1362 |        </row> | 
 | 1363 |       </thead> | 
 | 1364 |  | 
 | 1365 |       <tbody> | 
 | 1366 |        <row> | 
 | 1367 |         <entry>Grab lock A -> OK</entry> | 
 | 1368 |         <entry>Grab lock B -> OK</entry> | 
 | 1369 |        </row> | 
 | 1370 |        <row> | 
 | 1371 |         <entry>Grab lock B -> spin</entry> | 
 | 1372 |         <entry>Grab lock A -> spin</entry> | 
 | 1373 |        </row> | 
 | 1374 |       </tbody> | 
 | 1375 |      </tgroup> | 
 | 1376 |     </table> | 
 | 1377 |  | 
 | 1378 |     <para> | 
 | 1379 |       The two CPUs will spin forever, waiting for the other to give up | 
 | 1380 |       their lock.  It will look, smell, and feel like a crash. | 
 | 1381 |     </para> | 
 | 1382 |     </sect1> | 
 | 1383 |  | 
 | 1384 |     <sect1 id="techs-deadlock-prevent"> | 
 | 1385 |      <title>Preventing Deadlock</title> | 
 | 1386 |  | 
 | 1387 |      <para> | 
 | 1388 |        Textbooks will tell you that if you always lock in the same | 
 | 1389 |        order, you will never get this kind of deadlock.  Practice | 
       will tell you that this approach doesn't scale: when I
       create a new lock, I don't understand enough of the kernel
       to figure out where in the 5000-lock hierarchy it will fit.
 | 1393 |      </para> | 
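     <para>
       If you really must take two locks of the same type, the textbook
       trick is to impose an artificial order, for example by address.
       A hedged sketch, reusing the invented buckets from the previous
       section (the old == new case still needs its own check):
     </para>

    <programlisting>
        if (old < new) {
                spin_lock(&old->lock);
                spin_lock(&new->lock);
        } else {
                spin_lock(&new->lock);
                spin_lock(&old->lock);
        }
    </programlisting>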
 | 1394 |  | 
 | 1395 |      <para> | 
 | 1396 |        The best locks are encapsulated: they never get exposed in | 
 | 1397 |        headers, and are never held around calls to non-trivial | 
 | 1398 |        functions outside the same file.  You can read through this | 
 | 1399 |        code and see that it will never deadlock, because it never | 
 | 1400 |        tries to grab another lock while it has that one.  People | 
 | 1401 |        using your code don't even need to know you are using a | 
 | 1402 |        lock. | 
 | 1403 |      </para> | 
 | 1404 |  | 
 | 1405 |      <para> | 
 | 1406 |        A classic problem here is when you provide callbacks or | 
 | 1407 |        hooks: if you call these with the lock held, you risk simple | 
 | 1408 |        deadlock, or a deadly embrace (who knows what the callback | 
 | 1409 |        will do?).  Remember, the other programmers are out to get | 
 | 1410 |        you, so don't do this. | 
 | 1411 |      </para> | 
 | 1412 |  | 
 | 1413 |     <sect2 id="techs-deadlock-overprevent"> | 
 | 1414 |      <title>Overzealous Prevention Of Deadlocks</title> | 
 | 1415 |  | 
 | 1416 |      <para> | 
 | 1417 |        Deadlocks are problematic, but not as bad as data | 
 | 1418 |        corruption.  Code which grabs a read lock, searches a list, | 
 | 1419 |        fails to find what it wants, drops the read lock, grabs a | 
 | 1420 |        write lock and inserts the object has a race condition. | 
 | 1421 |      </para> | 
 | 1422 |  | 
 | 1423 |      <para> | 
       If you don't see why, please stay away from my code.
 | 1425 |      </para> | 
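     <para>
       For those still reading: a hedged sketch of the correct pattern
       simply repeats the search once the write lock is held (the
       helper names are invented):
     </para>

    <programlisting>
        write_lock(&list_lock);
        if (!__list_find(id))           /* re-check: someone may have won */
                __list_insert(new_obj);
        write_unlock(&list_lock);
    </programlisting>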
 | 1426 |     </sect2> | 
 | 1427 |     </sect1> | 
 | 1428 |  | 
 | 1429 |    <sect1 id="racing-timers"> | 
 | 1430 |     <title>Racing Timers: A Kernel Pastime</title> | 
 | 1431 |  | 
 | 1432 |     <para> | 
 | 1433 |       Timers can produce their own special problems with races. | 
 | 1434 |       Consider a collection of objects (list, hash, etc) where each | 
 | 1435 |       object has a timer which is due to destroy it. | 
 | 1436 |     </para> | 
 | 1437 |  | 
 | 1438 |     <para> | 
 | 1439 |       If you want to destroy the entire collection (say on module | 
 | 1440 |       removal), you might do the following: | 
 | 1441 |     </para> | 
 | 1442 |  | 
 | 1443 |     <programlisting> | 
 | 1444 |         /* THIS CODE BAD BAD BAD BAD: IF IT WAS ANY WORSE IT WOULD USE | 
 | 1445 |            HUNGARIAN NOTATION */ | 
 | 1446 |         spin_lock_bh(&list_lock); | 
 | 1447 |  | 
 | 1448 |         while (list) { | 
 | 1449 |                 struct foo *next = list->next; | 
 | 1450 |                 del_timer(&list->timer); | 
 | 1451 |                 kfree(list); | 
 | 1452 |                 list = next; | 
 | 1453 |         } | 
 | 1454 |  | 
 | 1455 |         spin_unlock_bh(&list_lock); | 
 | 1456 |     </programlisting> | 
 | 1457 |  | 
 | 1458 |     <para> | 
 | 1459 |       Sooner or later, this will crash on SMP, because a timer can | 
 | 1460 |       have just gone off before the <function>spin_lock_bh()</function>, | 
 | 1461 |       and it will only get the lock after we | 
 | 1462 |       <function>spin_unlock_bh()</function>, and then try to free | 
 | 1463 |       the element (which has already been freed!). | 
 | 1464 |     </para> | 
 | 1465 |  | 
 | 1466 |     <para> | 
 | 1467 |       This can be avoided by checking the result of | 
 | 1468 |       <function>del_timer()</function>: if it returns | 
 | 1469 |       <returnvalue>1</returnvalue>, the timer has been deleted. | 
 | 1470 |       If <returnvalue>0</returnvalue>, it means (in this | 
 | 1471 |       case) that it is currently running, so we can do: | 
 | 1472 |     </para> | 
 | 1473 |  | 
 | 1474 |     <programlisting> | 
 | 1475 |         retry: | 
 | 1476 |                 spin_lock_bh(&list_lock); | 
 | 1477 |  | 
 | 1478 |                 while (list) { | 
 | 1479 |                         struct foo *next = list->next; | 
 | 1480 |                         if (!del_timer(&list->timer)) { | 
 | 1481 |                                 /* Give timer a chance to delete this */ | 
 | 1482 |                                 spin_unlock_bh(&list_lock); | 
 | 1483 |                                 goto retry; | 
 | 1484 |                         } | 
 | 1485 |                         kfree(list); | 
 | 1486 |                         list = next; | 
 | 1487 |                 } | 
 | 1488 |  | 
 | 1489 |                 spin_unlock_bh(&list_lock); | 
 | 1490 |     </programlisting> | 
 | 1491 |  | 
 | 1492 |     <para> | 
 | 1493 |       Another common problem is deleting timers which restart | 
 | 1494 |       themselves (by calling <function>add_timer()</function> at the end | 
 | 1495 |       of their timer function).  Because this is a fairly common case | 
 | 1496 |       which is prone to races, you should use <function>del_timer_sync()</function> | 
 | 1497 |       (<filename class="headerfile">include/linux/timer.h</filename>) | 
 | 1498 |       to handle this case.  It returns the number of times the timer | 
 | 1499 |       had to be deleted before we finally stopped it from adding itself back | 
 | 1500 |       in. | 
 | 1501 |     </para> | 
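    <para>
      A hedged sketch of that case; note that you must not be holding a
      lock which the timer function itself takes, or you trade one
      deadlock for another:
    </para>

    <programlisting>
        del_timer_sync(&obj->timer);    /* returns once the timer is dead */
        kfree(obj);
    </programlisting>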
 | 1502 |    </sect1> | 
 | 1503 |  | 
 | 1504 |   </chapter> | 
 | 1505 |  | 
 | 1506 |  <chapter id="Efficiency"> | 
 | 1507 |     <title>Locking Speed</title> | 
 | 1508 |  | 
 | 1509 |     <para> | 
There are three main things to worry about when considering the speed
of some code which does locking.  First is concurrency: how many
things are going to be waiting while someone else is holding a lock.
Second is the time taken to actually acquire and release an
uncontended lock.  Third is whether you can use fewer, or smarter,
locks.  I'm assuming that the lock is used fairly often: otherwise,
you wouldn't be concerned about efficiency.
 | 1517 | </para> | 
 | 1518 |     <para> | 
 | 1519 | Concurrency depends on how long the lock is usually held: you should | 
 | 1520 | hold the lock for as long as needed, but no longer.  In the cache | 
 | 1521 | example, we always create the object without the lock held, and then | 
 | 1522 | grab the lock only when we are ready to insert it in the list. | 
 | 1523 | </para> | 
 | 1524 |     <para> | 
 | 1525 | Acquisition times depend on how much damage the lock operations do to | 
 | 1526 | the pipeline (pipeline stalls) and how likely it is that this CPU was | 
 | 1527 | the last one to grab the lock (ie. is the lock cache-hot for this | 
 | 1528 | CPU): on a machine with more CPUs, this likelihood drops fast. | 
 | 1529 | Consider a 700MHz Intel Pentium III: an instruction takes about 0.7ns, | 
 | 1530 | an atomic increment takes about 58ns, a lock which is cache-hot on | 
 | 1531 | this CPU takes 160ns, and a cacheline transfer from another CPU takes | 
 | 1532 | an additional 170 to 360ns.  (These figures from Paul McKenney's | 
 | 1533 | <ulink url="http://www.linuxjournal.com/article.php?sid=6993"> Linux | 
 | 1534 | Journal RCU article</ulink>). | 
 | 1535 | </para> | 
 | 1536 |     <para> | 
These two aims conflict: holding a lock for a shorter time can be
achieved by splitting locks into parts (such as in our final
per-object-lock example), but this increases the number of lock
acquisitions, and the result is often slower than having a single
lock.  This is another reason to advocate locking simplicity.
 | 1542 | </para> | 
 | 1543 |     <para> | 
 | 1544 | The third concern is addressed below: there are some methods to reduce | 
 | 1545 | the amount of locking which needs to be done. | 
 | 1546 | </para> | 
 | 1547 |  | 
 | 1548 |   <sect1 id="efficiency-rwlocks"> | 
 | 1549 |    <title>Read/Write Lock Variants</title> | 
 | 1550 |  | 
 | 1551 |    <para> | 
      Both spinlocks and mutexes have read/write variants:
      <type>rwlock_t</type> and <structname>struct rw_semaphore</structname>.
      These divide users into two classes: the readers and the writers.  If
      you are only reading the data, you can get a read lock, but to write to
      the data you need the write lock.  Many people can hold a read lock,
      but a writer must be the sole holder.
 | 1558 |     </para> | 
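   <para>
      A hedged sketch of the <type>rwlock_t</type> half of that
      interface (the lock and the work are invented):
   </para>

    <programlisting>
        static DEFINE_RWLOCK(map_lock);

        read_lock(&map_lock);           /* many readers may hold this */
        /* ... search the data ... */
        read_unlock(&map_lock);

        write_lock(&map_lock);          /* the writer is the sole holder */
        /* ... modify the data ... */
        write_unlock(&map_lock);
    </programlisting>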
 | 1559 |  | 
 | 1560 |    <para> | 
 | 1561 |       If your code divides neatly along reader/writer lines (as our | 
 | 1562 |       cache code does), and the lock is held by readers for | 
 | 1563 |       significant lengths of time, using these locks can help.  They | 
 | 1564 |       are slightly slower than the normal locks though, so in practice | 
 | 1565 |       <type>rwlock_t</type> is not usually worthwhile. | 
 | 1566 |     </para> | 
 | 1567 |    </sect1> | 
 | 1568 |  | 
 | 1569 |    <sect1 id="efficiency-read-copy-update"> | 
 | 1570 |     <title>Avoiding Locks: Read Copy Update</title> | 
 | 1571 |  | 
 | 1572 |     <para> | 
 | 1573 |       There is a special method of read/write locking called Read Copy | 
 | 1574 |       Update.  Using RCU, the readers can avoid taking a lock | 
 | 1575 |       altogether: as we expect our cache to be read more often than | 
 | 1576 |       updated (otherwise the cache is a waste of time), it is a | 
 | 1577 |       candidate for this optimization. | 
 | 1578 |     </para> | 
 | 1579 |  | 
 | 1580 |     <para> | 
 | 1581 |       How do we get rid of read locks?  Getting rid of read locks | 
 | 1582 |       means that writers may be changing the list underneath the | 
 | 1583 |       readers.  That is actually quite simple: we can read a linked | 
 | 1584 |       list while an element is being added if the writer adds the | 
 | 1585 |       element very carefully.  For example, adding | 
 | 1586 |       <symbol>new</symbol> to a single linked list called | 
 | 1587 |       <symbol>list</symbol>: | 
 | 1588 |     </para> | 
 | 1589 |  | 
 | 1590 |     <programlisting> | 
 | 1591 |         new->next = list->next; | 
 | 1592 |         wmb(); | 
 | 1593 |         list->next = new; | 
 | 1594 |     </programlisting> | 
 | 1595 |  | 
 | 1596 |     <para> | 
      The <function>wmb()</function> is a write memory barrier.  It
      ensures that the first operation (setting the new element's
      <symbol>next</symbol> pointer) is complete and will be seen by
      all CPUs before the second operation (putting the new
      element into the list) happens.  This is important, since modern
 | 1602 |       compilers and modern CPUs can both reorder instructions unless | 
 | 1603 |       told otherwise: we want a reader to either not see the new | 
 | 1604 |       element at all, or see the new element with the | 
 | 1605 |       <symbol>next</symbol> pointer correctly pointing at the rest of | 
 | 1606 |       the list. | 
 | 1607 |     </para> | 
 | 1608 |     <para> | 
 | 1609 |       Fortunately, there is a function to do this for standard | 
 | 1610 |       <structname>struct list_head</structname> lists: | 
 | 1611 |       <function>list_add_rcu()</function> | 
 | 1612 |       (<filename>include/linux/list.h</filename>). | 
 | 1613 |     </para> | 
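    <para>
      So the careful open-coded insertion above collapses to a
      one-liner (using the same <symbol>new</symbol> and
      <symbol>cache</symbol>):
    </para>
    <programlisting>
        list_add_rcu(&new->list, &cache);  /* barrier included */
    </programlisting>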
 | 1614 |     <para> | 
 | 1615 |       Removing an element from the list is even simpler: we replace | 
 | 1616 |       the pointer to the old element with a pointer to its successor, | 
 | 1617 |       and readers will either see it, or skip over it. | 
 | 1618 |     </para> | 
 | 1619 |     <programlisting> | 
 | 1620 |         list->next = old->next; | 
 | 1621 |     </programlisting> | 
 | 1622 |     <para> | 
 | 1623 |       There is <function>list_del_rcu()</function> | 
 | 1624 |       (<filename>include/linux/list.h</filename>) which does this (the | 
 | 1625 |       normal version poisons the old object, which we don't want). | 
 | 1626 |     </para> | 
 | 1627 |     <para> | 
      The reader must also be careful: some CPUs can look through the
      <symbol>next</symbol> pointer to start reading the contents of
      the next element early, but don't realize that the pre-fetched
      contents are wrong when the <symbol>next</symbol> pointer changes
      underneath them.  Once again, there is a
 | 1633 |       <function>list_for_each_entry_rcu()</function> | 
 | 1634 |       (<filename>include/linux/list.h</filename>) to help you.  Of | 
 | 1635 |       course, writers can just use | 
 | 1636 |       <function>list_for_each_entry()</function>, since there cannot | 
 | 1637 |       be two simultaneous writers. | 
 | 1638 |     </para> | 
 | 1639 |     <para> | 
 | 1640 |       Our final dilemma is this: when can we actually destroy the | 
 | 1641 |       removed element?  Remember, a reader might be stepping through | 
      this element in the list right now: if we free this element and
      the <symbol>next</symbol> pointer changes, the reader will jump
 | 1644 |       off into garbage and crash.  We need to wait until we know that | 
 | 1645 |       all the readers who were traversing the list when we deleted the | 
 | 1646 |       element are finished.  We use <function>call_rcu()</function> to | 
 | 1647 |       register a callback which will actually destroy the object once | 
 | 1648 |       the readers are finished. | 
 | 1649 |     </para> | 
 | 1650 |     <para> | 
 | 1651 |       But how does Read Copy Update know when the readers are | 
 | 1652 |       finished?  The method is this: firstly, the readers always | 
 | 1653 |       traverse the list inside | 
 | 1654 |       <function>rcu_read_lock()</function>/<function>rcu_read_unlock()</function> | 
 | 1655 |       pairs: these simply disable preemption so the reader won't go to | 
 | 1656 |       sleep while reading the list. | 
 | 1657 |     </para> | 
 | 1658 |     <para> | 
 | 1659 |       RCU then waits until every other CPU has slept at least once: | 
 | 1660 |       since readers cannot sleep, we know that any readers which were | 
 | 1661 |       traversing the list during the deletion are finished, and the | 
 | 1662 |       callback is triggered.  The real Read Copy Update code is a | 
 | 1663 |       little more optimized than this, but this is the fundamental | 
 | 1664 |       idea. | 
 | 1665 |     </para> | 
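    <para>
      Distilled, the reader side looks like this minimal sketch; the
      full patch below integrates it into the cache:
    </para>
    <programlisting>
        struct object *i;

        rcu_read_lock();
        list_for_each_entry_rcu(i, &cache, list) {
                if (i->id == id)
                        i->popularity++;
        }
        rcu_read_unlock();
    </programlisting>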
 | 1666 |  | 
 | 1667 | <programlisting> | 
 | 1668 | --- cache.c.perobjectlock	2003-12-11 17:15:03.000000000 +1100 | 
 | 1669 | +++ cache.c.rcupdate	2003-12-11 17:55:14.000000000 +1100 | 
 | 1670 | @@ -1,15 +1,18 @@ | 
 | 1671 |  #include <linux/list.h> | 
 | 1672 |  #include <linux/slab.h> | 
 | 1673 |  #include <linux/string.h> | 
 | 1674 | +#include <linux/rcupdate.h> | 
 #include <linux/mutex.h>
 #include <asm/errno.h>
 | 1677 |  | 
 | 1678 |  struct object | 
 | 1679 |  { | 
 | 1680 | -        /* These two protected by cache_lock. */ | 
 | 1681 | +        /* This is protected by RCU */ | 
 | 1682 |          struct list_head list; | 
 | 1683 |          int popularity; | 
 | 1684 |  | 
 | 1685 | +        struct rcu_head rcu; | 
 | 1686 | + | 
 | 1687 |          atomic_t refcnt; | 
 | 1688 |  | 
 | 1689 |          /* Doesn't change once created. */ | 
 | 1690 | @@ -40,7 +43,7 @@ | 
 | 1691 |  { | 
 | 1692 |          struct object *i; | 
 | 1693 |  | 
 | 1694 | -        list_for_each_entry(i, &cache, list) { | 
 | 1695 | +        list_for_each_entry_rcu(i, &cache, list) { | 
 | 1696 |                  if (i->id == id) { | 
 | 1697 |                          i->popularity++; | 
 | 1698 |                          return i; | 
 | 1699 | @@ -49,19 +52,25 @@ | 
 | 1700 |          return NULL; | 
 | 1701 |  } | 
 | 1702 |  | 
 | 1703 | +/* Final discard done once we know no readers are looking. */ | 
 | 1704 | +static void cache_delete_rcu(void *arg) | 
 | 1705 | +{ | 
 | 1706 | +        object_put(arg); | 
 | 1707 | +} | 
 | 1708 | + | 
 | 1709 |  /* Must be holding cache_lock */ | 
 | 1710 |  static void __cache_delete(struct object *obj) | 
 | 1711 |  { | 
 | 1712 |          BUG_ON(!obj); | 
 | 1713 | -        list_del(&obj->list); | 
 | 1714 | -        object_put(obj); | 
 | 1715 | +        list_del_rcu(&obj->list); | 
 | 1716 |          cache_num--; | 
 | 1717 | +        call_rcu(&obj->rcu, cache_delete_rcu, obj); | 
 | 1718 |  } | 
 | 1719 |  | 
 | 1720 |  /* Must be holding cache_lock */ | 
 | 1721 |  static void __cache_add(struct object *obj) | 
 | 1722 |  { | 
 | 1723 | -        list_add(&obj->list, &cache); | 
 | 1724 | +        list_add_rcu(&obj->list, &cache); | 
 | 1725 |          if (++cache_num > MAX_CACHE_SIZE) { | 
 | 1726 |                  struct object *i, *outcast = NULL; | 
 | 1727 |                  list_for_each_entry(i, &cache, list) { | 
 | 1728 | @@ -85,6 +94,7 @@ | 
 | 1729 |          obj->popularity = 0; | 
 | 1730 |          atomic_set(&obj->refcnt, 1); /* The cache holds a reference */ | 
 | 1731 |          spin_lock_init(&obj->lock); | 
 | 1732 | +        INIT_RCU_HEAD(&obj->rcu); | 
 | 1733 |  | 
 | 1734 |          spin_lock_irqsave(&cache_lock, flags); | 
 | 1735 |          __cache_add(obj); | 
 | 1736 | @@ -104,12 +114,11 @@ | 
 | 1737 |  struct object *cache_find(int id) | 
 | 1738 |  { | 
 | 1739 |          struct object *obj; | 
 | 1740 | -        unsigned long flags; | 
 | 1741 |  | 
 | 1742 | -        spin_lock_irqsave(&cache_lock, flags); | 
 | 1743 | +        rcu_read_lock(); | 
 | 1744 |          obj = __cache_find(id); | 
 | 1745 |          if (obj) | 
 | 1746 |                  object_get(obj); | 
 | 1747 | -        spin_unlock_irqrestore(&cache_lock, flags); | 
 | 1748 | +        rcu_read_unlock(); | 
 | 1749 |          return obj; | 
 | 1750 |  } | 
 | 1751 | </programlisting> | 
 | 1752 |  | 
 | 1753 | <para> | 
Note that the reader will alter the
<structfield>popularity</structfield> member in
<function>__cache_find()</function>, and now it does so without
holding a lock.  One solution would be to make it an
<type>atomic_t</type>, but for this usage, we don't really care about
races: an approximate result is good enough, so I didn't change it.
 | 1760 | </para> | 
 | 1761 |  | 
 | 1762 | <para> | 
 | 1763 | The result is that <function>cache_find()</function> requires no | 
 | 1764 | synchronization with any other functions, so is almost as fast on SMP | 
 | 1765 | as it would be on UP. | 
 | 1766 | </para> | 
 | 1767 |  | 
 | 1768 | <para> | 
There is a further optimization possible here: remember our original
cache code, where there were no reference counts and the caller simply
held the lock whenever using the object?  This is still possible: if
you hold the lock, no one can delete the object, so you don't need to
get and put the reference count.
 | 1774 | </para> | 
 | 1775 |  | 
 | 1776 | <para> | 
 | 1777 | Now, because the 'read lock' in RCU is simply disabling preemption, a | 
 | 1778 | caller which always has preemption disabled between calling | 
 | 1779 | <function>cache_find()</function> and | 
 | 1780 | <function>object_put()</function> does not need to actually get and | 
 | 1781 | put the reference count: we could expose | 
 | 1782 | <function>__cache_find()</function> by making it non-static, and | 
 | 1783 | such callers could simply call that. | 
 | 1784 | </para> | 
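<para>
A hedged sketch of such a caller, with <function>__cache_find()</function>
made non-static as suggested (<function>examine()</function> is an
invented stand-in for the real work):
</para>

<programlisting>
        rcu_read_lock();                /* also disables preemption */
        obj = __cache_find(id);
        if (obj)
                examine(obj);
        rcu_read_unlock();              /* obj may be freed after this */
</programlisting>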
 | 1785 | <para> | 
 | 1786 | The benefit here is that the reference count is not written to: the | 
 | 1787 | object is not altered in any way, which is much faster on SMP | 
 | 1788 | machines due to caching. | 
 | 1789 | </para> | 
 | 1790 |   </sect1> | 
 | 1791 |  | 
 | 1792 |    <sect1 id="per-cpu"> | 
 | 1793 |     <title>Per-CPU Data</title> | 
 | 1794 |  | 
 | 1795 |     <para> | 
 | 1796 |       Another technique for avoiding locking which is used fairly | 
 | 1797 |       widely is to duplicate information for each CPU.  For example, | 
 | 1798 |       if you wanted to keep a count of a common condition, you could | 
 | 1799 |       use a spin lock and a single counter.  Nice and simple. | 
 | 1800 |     </para> | 
 | 1801 |  | 
 | 1802 |     <para> | 
 | 1803 |       If that was too slow (it's usually not, but if you've got a | 
 | 1804 |       really big machine to test on and can show that it is), you | 
 | 1805 |       could instead use a counter for each CPU, then none of them need | 
 | 1806 |       an exclusive lock.  See <function>DEFINE_PER_CPU()</function>, | 
 | 1807 |       <function>get_cpu_var()</function> and | 
 | 1808 |       <function>put_cpu_var()</function> | 
 | 1809 |       (<filename class="headerfile">include/linux/percpu.h</filename>). | 
 | 1810 |     </para> | 
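    <para>
      A minimal sketch of such a per-CPU counter using those
      interfaces (the names are invented):
    </para>

    <programlisting>
        #include <linux/percpu.h>

        static DEFINE_PER_CPU(unsigned long, nr_hits);

        static void count_hit(void)
        {
                get_cpu_var(nr_hits)++;     /* disables preemption */
                put_cpu_var(nr_hits);       /* re-enables preemption */
        }
    </programlisting>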
 | 1811 |  | 
 | 1812 |     <para> | 
 | 1813 |       Of particular use for simple per-cpu counters is the | 
 | 1814 |       <type>local_t</type> type, and the | 
 | 1815 |       <function>cpu_local_inc()</function> and related functions, | 
 | 1816 |       which are more efficient than simple code on some architectures | 
 | 1817 |       (<filename class="headerfile">include/asm/local.h</filename>). | 
 | 1818 |     </para> | 
 | 1819 |  | 
 | 1820 |     <para> | 
 | 1821 |       Note that there is no simple, reliable way of getting an exact | 
 | 1822 |       value of such a counter, without introducing more locks.  This | 
 | 1823 |       is not a problem for some uses. | 
 | 1824 |     </para> | 
 | 1825 |    </sect1> | 
 | 1826 |  | 
 | 1827 |    <sect1 id="mostly-hardirq"> | 
    <title>Data Which Is Mostly Used By An IRQ Handler</title>
 | 1829 |  | 
 | 1830 |     <para> | 
 | 1831 |       If data is always accessed from within the same IRQ handler, you | 
 | 1832 |       don't need a lock at all: the kernel already guarantees that the | 
 | 1833 |       irq handler will not run simultaneously on multiple CPUs. | 
 | 1834 |     </para> | 
 | 1835 |     <para> | 
      Manfred Spraul points out that you can still do this, even if
      the data is very occasionally accessed in user context or
      softirqs/tasklets.  The irq handler doesn't use a lock, and
      all other accesses are done like so:
 | 1840 |     </para> | 
 | 1841 |  | 
 | 1842 | <programlisting> | 
 | 1843 | 	spin_lock(&lock); | 
 | 1844 | 	disable_irq(irq); | 
 | 1845 | 	... | 
 | 1846 | 	enable_irq(irq); | 
 | 1847 | 	spin_unlock(&lock); | 
 | 1848 | </programlisting> | 
 | 1849 |     <para> | 
 | 1850 |       The <function>disable_irq()</function> prevents the irq handler | 
 | 1851 |       from running (and waits for it to finish if it's currently | 
 | 1852 |       running on other CPUs).  The spinlock prevents any other | 
 | 1853 |       accesses happening at the same time.  Naturally, this is slower | 
 | 1854 |       than just a <function>spin_lock_irq()</function> call, so it | 
 | 1855 |       only makes sense if this type of access happens extremely | 
 | 1856 |       rarely. | 
 | 1857 |     </para> | 
 | 1858 |    </sect1> | 
 | 1859 |   </chapter> | 
 | 1860 |  | 
 | 1861 |  <chapter id="sleeping-things"> | 
 | 1862 |     <title>What Functions Are Safe To Call From Interrupts?</title> | 
 | 1863 |  | 
 | 1864 |     <para> | 
 | 1865 |       Many functions in the kernel sleep (ie. call schedule()) | 
 | 1866 |       directly or indirectly: you can never call them while holding a | 
 | 1867 |       spinlock, or with preemption disabled.  This also means you need | 
 | 1868 |       to be in user context: calling them from an interrupt is illegal. | 
 | 1869 |     </para> | 
 | 1870 |  | 
 | 1871 |    <sect1 id="sleeping"> | 
 | 1872 |     <title>Some Functions Which Sleep</title> | 
 | 1873 |  | 
 | 1874 |     <para> | 
 | 1875 |       The most common ones are listed below, but you usually have to | 
 | 1876 |       read the code to find out if other calls are safe.  If everyone | 
 | 1877 |       else who calls it can sleep, you probably need to be able to | 
 | 1878 |       sleep, too.  In particular, registration and deregistration | 
 | 1879 |       functions usually expect to be called from user context, and can | 
 | 1880 |       sleep. | 
 | 1881 |     </para> | 
 | 1882 |  | 
 | 1883 |     <itemizedlist> | 
 | 1884 |      <listitem> | 
 | 1885 |       <para> | 
 | 1886 |         Accesses to  | 
 | 1887 |         <firstterm linkend="gloss-userspace">userspace</firstterm>: | 
 | 1888 |       </para> | 
 | 1889 |       <itemizedlist> | 
 | 1890 |        <listitem> | 
 | 1891 |         <para> | 
 | 1892 |           <function>copy_from_user()</function> | 
 | 1893 |         </para> | 
 | 1894 |        </listitem> | 
 | 1895 |        <listitem> | 
 | 1896 |         <para> | 
 | 1897 |           <function>copy_to_user()</function> | 
 | 1898 |         </para> | 
 | 1899 |        </listitem> | 
 | 1900 |        <listitem> | 
 | 1901 |         <para> | 
 | 1902 |           <function>get_user()</function> | 
 | 1903 |         </para> | 
 | 1904 |        </listitem> | 
 | 1905 |        <listitem> | 
 | 1906 |         <para> | 
          <function>put_user()</function>
        </para>
 | 1909 |        </listitem> | 
 | 1910 |       </itemizedlist> | 
 | 1911 |      </listitem> | 
 | 1912 |  | 
 | 1913 |      <listitem> | 
 | 1914 |       <para> | 
 | 1915 |         <function>kmalloc(GFP_KERNEL)</function> | 
 | 1916 |       </para> | 
 | 1917 |      </listitem> | 
 | 1918 |  | 
 | 1919 |      <listitem> | 
 | 1920 |       <para> | 
      <function>mutex_lock_interruptible()</function> and
      <function>mutex_lock()</function>
      </para>
      <para>
       There is a <function>mutex_trylock()</function> which does not
       sleep.  Still, it must not be used in interrupt context, since
       its implementation is not safe for that.
       <function>mutex_unlock()</function> will also never sleep.
      </para>
 | 1929 |      </listitem> | 
 | 1930 |     </itemizedlist> | 
 | 1931 |    </sect1> | 
 | 1932 |  | 
 | 1933 |    <sect1 id="dont-sleep"> | 
 | 1934 |     <title>Some Functions Which Don't Sleep</title> | 
 | 1935 |  | 
 | 1936 |     <para> | 
 | 1937 |      Some functions are safe to call from any context, or holding | 
 | 1938 |      almost any lock. | 
 | 1939 |     </para> | 
 | 1940 |  | 
 | 1941 |     <itemizedlist> | 
 | 1942 |      <listitem> | 
 | 1943 |       <para> | 
 | 1944 | 	<function>printk()</function> | 
 | 1945 |       </para> | 
 | 1946 |      </listitem> | 
 | 1947 |      <listitem> | 
 | 1948 |       <para> | 
 | 1949 |         <function>kfree()</function> | 
 | 1950 |       </para> | 
 | 1951 |      </listitem> | 
 | 1952 |      <listitem> | 
 | 1953 |       <para> | 
 | 1954 | 	<function>add_timer()</function> and <function>del_timer()</function> | 
 | 1955 |       </para> | 
 | 1956 |      </listitem> | 
 | 1957 |     </itemizedlist> | 
 | 1958 |    </sect1> | 
 | 1959 |   </chapter> | 
 | 1960 |  | 
 | 1961 |   <chapter id="references"> | 
 | 1962 |    <title>Further reading</title> | 
 | 1963 |  | 
 | 1964 |    <itemizedlist> | 
 | 1965 |     <listitem> | 
 | 1966 |      <para> | 
 | 1967 |        <filename>Documentation/spinlocks.txt</filename>:  | 
 | 1968 |        Linus Torvalds' spinlocking tutorial in the kernel sources. | 
 | 1969 |      </para> | 
 | 1970 |     </listitem> | 
 | 1971 |  | 
 | 1972 |     <listitem> | 
 | 1973 |      <para> | 
 | 1974 |        Unix Systems for Modern Architectures: Symmetric | 
 | 1975 |        Multiprocessing and Caching for Kernel Programmers: | 
 | 1976 |      </para> | 
 | 1977 |  | 
 | 1978 |      <para> | 
 | 1979 |        Curt Schimmel's very good introduction to kernel level | 
 | 1980 |        locking (not written for Linux, but nearly everything | 
 | 1981 |        applies).  The book is expensive, but really worth every | 
 | 1982 |        penny to understand SMP locking. [ISBN: 0201633388] | 
 | 1983 |      </para> | 
 | 1984 |     </listitem> | 
 | 1985 |    </itemizedlist> | 
 | 1986 |   </chapter> | 
 | 1987 |  | 
 | 1988 |   <chapter id="thanks"> | 
 | 1989 |     <title>Thanks</title> | 
 | 1990 |  | 
 | 1991 |     <para> | 
 | 1992 |       Thanks to Telsa Gwynne for DocBooking, neatening and adding | 
 | 1993 |       style. | 
 | 1994 |     </para> | 
 | 1995 |  | 
 | 1996 |     <para> | 
 | 1997 |       Thanks to Martin Pool, Philipp Rumpf, Stephen Rothwell, Paul | 
 | 1998 |       Mackerras, Ruedi Aschwanden, Alan Cox, Manfred Spraul, Tim | 
 | 1999 |       Waugh, Pete Zaitcev, James Morris, Robert Love, Paul McKenney, | 
 | 2000 |       John Ashby for proofreading, correcting, flaming, commenting. | 
 | 2001 |     </para> | 
 | 2002 |  | 
 | 2003 |     <para> | 
 | 2004 |       Thanks to the cabal for having no influence on this document. | 
 | 2005 |     </para> | 
 | 2006 |   </chapter> | 
 | 2007 |  | 
 | 2008 |   <glossary id="glossary"> | 
 | 2009 |    <title>Glossary</title> | 
 | 2010 |  | 
 | 2011 |    <glossentry id="gloss-preemption"> | 
 | 2012 |     <glossterm>preemption</glossterm> | 
 | 2013 |      <glossdef> | 
 | 2014 |       <para> | 
        Prior to 2.5, or when <symbol>CONFIG_PREEMPT</symbol> is
        unset, processes in user context inside the kernel would not
        preempt each other (ie. you had that CPU until you gave it up,
        except for interrupts).  With the addition of
        <symbol>CONFIG_PREEMPT</symbol> in 2.5.4, this changed: when
        in user context, higher priority tasks can "cut in", and
        spinlocks were changed to disable preemption, even on UP.
 | 2022 |      </para> | 
 | 2023 |     </glossdef> | 
 | 2024 |    </glossentry> | 
 | 2025 |  | 
 | 2026 |    <glossentry id="gloss-bh"> | 
 | 2027 |     <glossterm>bh</glossterm> | 
 | 2028 |      <glossdef> | 
 | 2029 |       <para> | 
 | 2030 |         Bottom Half: for historical reasons, functions with | 
 | 2031 |         '_bh' in them often now refer to any software interrupt, e.g. | 
 | 2032 |         <function>spin_lock_bh()</function> blocks any software interrupt  | 
 | 2033 |         on the current CPU.  Bottom halves are deprecated, and will  | 
 | 2034 |         eventually be replaced by tasklets.  Only one bottom half will be  | 
 | 2035 |         running at any time. | 
 | 2036 |      </para> | 
 | 2037 |     </glossdef> | 
 | 2038 |    </glossentry> | 
 | 2039 |  | 
 | 2040 |    <glossentry id="gloss-hwinterrupt"> | 
 | 2041 |     <glossterm>Hardware Interrupt / Hardware IRQ</glossterm> | 
 | 2042 |     <glossdef> | 
 | 2043 |      <para> | 
 | 2044 |        Hardware interrupt request.  <function>in_irq()</function> returns  | 
 | 2045 |        <returnvalue>true</returnvalue> in a hardware interrupt handler. | 
 | 2046 |      </para> | 
 | 2047 |     </glossdef> | 
 | 2048 |    </glossentry> | 
 | 2049 |  | 
 | 2050 |    <glossentry id="gloss-interruptcontext"> | 
 | 2051 |     <glossterm>Interrupt Context</glossterm> | 
 | 2052 |     <glossdef> | 
 | 2053 |      <para> | 
 | 2054 |        Not user context: processing a hardware irq or software irq. | 
 | 2055 |        Indicated by the <function>in_interrupt()</function> macro  | 
 | 2056 |        returning <returnvalue>true</returnvalue>. | 
 | 2057 |      </para> | 
 | 2058 |     </glossdef> | 
 | 2059 |    </glossentry> | 
 | 2060 |  | 
 | 2061 |    <glossentry id="gloss-smp"> | 
 | 2062 |     <glossterm><acronym>SMP</acronym></glossterm> | 
 | 2063 |     <glossdef> | 
 | 2064 |      <para> | 
 | 2065 |        Symmetric Multi-Processor: kernels compiled for multiple-CPU | 
 | 2066 |        machines.  (CONFIG_SMP=y). | 
 | 2067 |      </para> | 
 | 2068 |     </glossdef> | 
 | 2069 |    </glossentry> | 
 | 2070 |  | 
 | 2071 |    <glossentry id="gloss-softirq"> | 
 | 2072 |     <glossterm>Software Interrupt / softirq</glossterm> | 
 | 2073 |     <glossdef> | 
 | 2074 |      <para> | 
 | 2075 |        Software interrupt handler.  <function>in_irq()</function> returns | 
 | 2076 |        <returnvalue>false</returnvalue>; <function>in_softirq()</function> | 
 | 2077 |        returns <returnvalue>true</returnvalue>.  Tasklets and softirqs | 
       both fall into the category of 'software interrupts'.
 | 2079 |      </para> | 
 | 2080 |      <para> | 
 | 2081 |        Strictly speaking a softirq is one of up to 32 enumerated software | 
 | 2082 |        interrupts which can run on multiple CPUs at once. | 
 | 2083 |        Sometimes used to refer to tasklets as | 
 | 2084 |        well (ie. all software interrupts). | 
 | 2085 |      </para> | 
 | 2086 |     </glossdef> | 
 | 2087 |    </glossentry> | 
 | 2088 |  | 
 | 2089 |    <glossentry id="gloss-tasklet"> | 
 | 2090 |     <glossterm>tasklet</glossterm> | 
 | 2091 |     <glossdef> | 
 | 2092 |      <para> | 
 | 2093 |        A dynamically-registrable software interrupt, | 
 | 2094 |        which is guaranteed to only run on one CPU at a time. | 
 | 2095 |      </para> | 
 | 2096 |     </glossdef> | 
 | 2097 |    </glossentry> | 
 | 2098 |  | 
 | 2099 |    <glossentry id="gloss-timers"> | 
 | 2100 |     <glossterm>timer</glossterm> | 
 | 2101 |     <glossdef> | 
 | 2102 |      <para> | 
 | 2103 |        A dynamically-registrable software interrupt, which is run at | 
 | 2104 |        (or close to) a given time.  When running, it is just like a | 
 | 2105 |        tasklet (in fact, they are called from the TIMER_SOFTIRQ). | 
 | 2106 |      </para> | 
 | 2107 |     </glossdef> | 
 | 2108 |    </glossentry> | 
 | 2109 |  | 
 | 2110 |    <glossentry id="gloss-up"> | 
 | 2111 |     <glossterm><acronym>UP</acronym></glossterm> | 
 | 2112 |     <glossdef> | 
 | 2113 |      <para> | 
 | 2114 |        Uni-Processor: Non-SMP.  (CONFIG_SMP=n). | 
 | 2115 |      </para> | 
 | 2116 |     </glossdef> | 
 | 2117 |    </glossentry> | 
 | 2118 |  | 
 | 2119 |    <glossentry id="gloss-usercontext"> | 
 | 2120 |     <glossterm>User Context</glossterm> | 
 | 2121 |     <glossdef> | 
 | 2122 |      <para> | 
       The kernel executing on behalf of a particular process (ie. a
       system call or trap) or kernel thread.  You can tell which
       process with the <symbol>current</symbol> macro.  Not to
       be confused with userspace.  Can be interrupted by software or
       hardware interrupts.
 | 2128 |      </para> | 
 | 2129 |     </glossdef> | 
 | 2130 |    </glossentry> | 
 | 2131 |  | 
 | 2132 |    <glossentry id="gloss-userspace"> | 
 | 2133 |     <glossterm>Userspace</glossterm> | 
 | 2134 |     <glossdef> | 
 | 2135 |      <para> | 
 | 2136 |        A process executing its own code outside the kernel. | 
 | 2137 |      </para> | 
 | 2138 |     </glossdef> | 
 | 2139 |    </glossentry>       | 
 | 2140 |  | 
 | 2141 |   </glossary> | 
 | 2142 | </book> | 
 | 2143 |  |