Introduction
============

This document describes a collection of device-mapper targets that
between them implement thin-provisioning and snapshots.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume.  This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.

Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...).  The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth).  This new
implementation uses a single data structure to avoid this degradation
with depth.  Fragmentation may still be an issue, however, in some
scenarios.

Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:

- Improve metadata resilience by storing metadata on a mirrored volume
  but data on a non-mirrored one.

- Improve performance by storing the metadata on SSD.

Status
======

These targets are very much still in the EXPERIMENTAL state.  Please
do not yet rely on them in production.  But do experiment and offer us
feedback.  Different use cases will have different performance
characteristics, for example due to fragmentation of the data volume.

If you find this software is not performing as expected please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.

Userspace tools for checking and repairing the metadata are under
development.

Cookbook
========

This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly.  End users will be advised to use a higher-level volume
manager such as LVM2 once support has been added.

Pool device
-----------

The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace which control the creation of new
  virtual devices amongst other things.

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid metadata device, and a
data device.  If you do not have an existing metadata device you can
make one by zeroing the first 4k to indicate empty metadata.

    dd if=/dev/zero of=$metadata_dev bs=4096 count=1

The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots).  If you have
less sharing than average you'll need a larger-than-average metadata device.

As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size but round it up
to 2MB if the answer is smaller.  If you're creating large numbers of
snapshots which are recording large amounts of change, you may find you
need to increase this.
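
For example, an illustrative calculation (assuming $data_dev_size and
$data_block_size are both expressed in 512-byte sectors, as in the pool
table shown later):

    data device:      20971520 sectors (10GB)
    data block size:  128 sectors (64KB)
    number of blocks: 20971520 / 128 = 163840
    metadata size:    163840 * 48 = 7864320 bytes (roughly 7.5MB,
                      comfortably above the 2MB minimum)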

The largest size supported is 16GB; if the device is larger,
a warning will be issued and the excess space will not be used.

Reloading a pool table
----------------------

You may reload a pool's table; indeed, this is how the pool is resized
if it runs out of space.  (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)
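
For example, assuming the underlying data device has already been
extended, the pool might be grown like this (the new length of
41943040 sectors is purely illustrative):

    dmsetup suspend pool
    dmsetup reload pool --table "0 41943040 thin-pool $metadata_dev $data_dev \
                                 $data_block_size $low_water_mark"
    dmsetup resume pool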

Using an existing pool device
-----------------------------

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev \
                 $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time, expressed in units of 512-byte sectors.  People
primarily interested in thin provisioning may want to use a value such
as 1024 (512KB).  People doing lots of snapshotting may want a smaller value
such as 128 (64KB).  If you are not zeroing newly-allocated data,
a larger $data_block_size in the region of 256000 (roughly 128MB) is suggested.
$data_block_size must be the same for the lifetime of the
metadata device.

$low_water_mark is expressed in blocks of size $data_block_size.  If
free space on the data device drops below this level then a dm event
will be triggered, which a userspace daemon should catch, allowing it to
extend the pool device.  Only one such event will be sent.
Resuming a device with a new table itself triggers an event, so the
userspace daemon can use this to detect a situation where a new table
already exceeds the threshold.
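
A minimal sketch of such a userspace loop, using only dmsetup (for
illustration; a real deployment would normally rely on a volume
manager's monitoring daemon instead):

    while true; do
        # note the pool's current event number
        ev=$(dmsetup info -c --noheadings -o events pool)
        # block until a new event (e.g. low water mark crossed) arrives
        dmsetup wait pool "$ev"
        # inspect <used data blocks>/<total data blocks> in the status line
        dmsetup status pool
        # ...extend the data device and reload the pool table here if needed
    done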

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

  To create a new thinly-provisioned volume you must send a message to an
  active pool device, /dev/mapper/pool in this example.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

  Here '0' is an identifier for the volume, a 24-bit number.  It's up
  to the caller to allocate and manage these identifiers.  If the
  identifier is already in use, the message will fail with -EEXIST.

ii) Using a thinly-provisioned volume.

  Thinly-provisioned volumes are activated using the 'thin' target:

    dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"

  The last parameter is the identifier for the thinp device.
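
  The thin device can then be used like any ordinary block device.  For
  example (the filesystem type and mount point are just illustrative):

    mkfs.ext4 /dev/mapper/thin
    mount /dev/mapper/thin /mnt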

Internal snapshots
------------------

i) Creating an internal snapshot.

  Snapshots are created with another message to the pool.

  N.B.  If the origin device that you wish to snapshot is active, you
  must suspend it before creating the snapshot to avoid corruption.
  This is NOT enforced at the moment, so please be careful!

    dmsetup suspend /dev/mapper/thin
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/thin

  Here '1' is the identifier for the volume, a 24-bit number.  '0' is the
  identifier for the origin device.

ii) Using an internal snapshot.

  Once created, the user doesn't have to worry about any connection
  between the origin and the snapshot.  Indeed the snapshot is no
  different from any other thinly-provisioned device and can be
  snapshotted itself via the same method.  It's perfectly legal to
  have only one of them active, and there's no ordering requirement on
  activating or removing them both.  (This differs from conventional
  device-mapper snapshots.)

  Activate it exactly the same way as any other thinly-provisioned volume:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
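
  For example, because a snapshot is just another thin device, it can be
  snapshotted in turn (here '2' is an arbitrary new identifier and '1' is
  the snapshot created above):

    dmsetup suspend /dev/mapper/snap
    dmsetup message /dev/mapper/pool 0 "create_snap 2 1"
    dmsetup resume /dev/mapper/snap
    dmsetup create snap2 --table "0 2097152 thin /dev/mapper/pool 2"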

External snapshots
------------------

You can use an external _read only_ device as an origin for a
thinly-provisioned volume.  Any read to an unprovisioned area of the
thin device will be passed through to the origin.  Writes trigger
the allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

You must not write to the origin device if you use this technique!
Of course, you may write to the thin device and take internal snapshots
of the thin volume.

i) Creating a snapshot of an external device.

  This is the same as creating a thin device.
  You don't mention the origin at this stage.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

ii) Using a snapshot of an external device.

  Append an extra parameter to the thin target specifying the origin:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"

  N.B. All descendants (internal snapshots) of this snapshot require the
  same extra origin parameter.
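
  For example, an internal snapshot of the device above is created with
  the usual message ('1' is an arbitrary new identifier) but must also be
  activated with the origin appended:

    dmsetup suspend /dev/mapper/snap
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/snap
    dmsetup create snap2 --table "0 2097152 thin /dev/mapper/pool 1 /dev/image"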

Deactivation
------------

All devices using a pool must be deactivated before the pool itself
can be.

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

    thin-pool <metadata dev> <data dev> <data block size (sectors)> \
              <low water mark (blocks)> [<number of feature args> [<arg>]*]

    Optional feature arguments:

      skip_block_zeroing: Skip the zeroing of newly-provisioned blocks.

      ignore_discard: Disable discard support.

      no_discard_passdown: Don't pass discards down to the underlying
                           data device, but just remove the mapping.

      read_only: Don't allow any changes to be made to the pool
                 metadata.

    Data block size must be between 64KB (128 sectors) and 1GB
    (2097152 sectors) inclusive.
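
    For example, a pool table line requesting 64KB data blocks, a low
    water mark of 32768 blocks, and two feature arguments might look like
    this (sizes are illustrative):

        0 20971520 thin-pool $metadata_dev $data_dev 128 32768 2 skip_block_zeroing no_discard_passdown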


ii) Status

    <transaction id> <used metadata blocks>/<total metadata blocks>
    <used data blocks>/<total data blocks> <held metadata root>
    [no_]discard_passdown ro|rw

    transaction id:
        A 64-bit number used by userspace to help synchronise with metadata
        from volume managers.

    used data blocks / total data blocks:
        If the number of free blocks drops below the pool's low water mark a
        dm event will be sent to userspace.  This event is edge-triggered and
        it will occur only once after each resume so volume manager writers
        should register for the event and then check the target's status.

    held metadata root:
        The location, in sectors, of the metadata root that has been
        'held' for userspace read access.  '-' indicates there is no
        held root.

    discard_passdown|no_discard_passdown
        Whether or not discards are actually being passed down to the
        underlying device.  Even if enabled when the table is loaded,
        this may be disabled if the underlying device does not support it.

    ro|rw
        If the pool encounters certain types of device failures it will
        drop into a read-only metadata mode in which no changes to
        the pool metadata (like allocating new blocks) are permitted.

        In serious cases where even a read-only mode is deemed unsafe
        no further I/O will be permitted and the status will just
        contain the string 'Fail'.  The userspace recovery tools
        should then be used.

iii) Messages

    create_thin <dev id>

        Create a new thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.

    create_snap <dev id> <origin id>

        Create a new snapshot of another thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.
        <origin id> is the identifier of the thinly-provisioned device
        of which the new device will be a snapshot.

    delete <dev id>

        Deletes a thin device.  Irreversible.
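
        For example, to delete the thin device with identifier 1:

            dmsetup message /dev/mapper/pool 0 "delete 1"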

    set_transaction_id <current id> <new id>

        Userland volume managers, such as LVM, need a way to
        synchronise their external metadata with the internal metadata of the
        pool target.  The thin-pool target offers to store an
        arbitrary 64-bit transaction id and return it on the target's
        status line.  To avoid races you must provide what you think
        the current transaction id is when you change it with this
        compare-and-swap message.
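
        For example, to move from transaction id 0 to 1:

            dmsetup message /dev/mapper/pool 0 "set_transaction_id 0 1"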

    reserve_metadata_snap

        Reserve a copy of the data mapping btree for use by userland.
        This allows userland to inspect the mappings as they were when
        this message was executed.  Use the pool's status command to
        get the root block associated with the metadata snapshot.

    release_metadata_snap

        Release a previously reserved copy of the data mapping btree.
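
        For example (the held metadata root appears in the pool's status
        output until it is released):

            dmsetup message /dev/mapper/pool 0 "reserve_metadata_snap"
            dmsetup status pool
            dmsetup message /dev/mapper/pool 0 "release_metadata_snap"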

'thin' target
-------------

i) Constructor

    thin <pool dev> <dev id> [<external origin dev>]

    pool dev:
        the thin-pool device, e.g. /dev/mapper/my_pool or 253:0

    dev id:
        the internal device identifier of the device to be
        activated.

    external origin dev:
        an optional block device outside the pool to be treated as a
        read-only snapshot origin: reads to unprovisioned areas of the
        thin target will be mapped to this device.

The pool doesn't store any size against the thin devices.  If you
load a thin target that is smaller than you've been using previously,
then you'll have no access to blocks mapped beyond the end.  If you
load a target that is bigger than before, then extra blocks will be
provisioned as and when needed.
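
For example, to grow the 1GB thin device used earlier (2097152 sectors)
to 2GB, simply load a bigger table and resume (sizes in 512-byte sectors):

    dmsetup suspend thin
    dmsetup reload thin --table "0 4194304 thin /dev/mapper/pool 0"
    dmsetup resume thin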

If you wish to reduce the size of your thin device and potentially
regain some space then send the 'trim' message to the pool.

ii) Status

     <nr mapped sectors> <highest mapped sector>

        If the pool has encountered device errors and failed, the status
        will just contain the string 'Fail'.  The userspace recovery
        tools should then be used.