Introduction
============

This document describes a collection of device-mapper targets that
between them implement thin-provisioning and snapshots.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume.  This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.

Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...).  The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth).  This new
implementation uses a single data structure to avoid this degradation
with depth.  Fragmentation may still be an issue, however, in some
scenarios.

Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:

- Improve metadata resilience by storing metadata on a mirrored volume
  but data on a non-mirrored one.

- Improve performance by storing the metadata on SSD.

Status
======

These targets are very much still in the EXPERIMENTAL state.  Please
do not yet rely on them in production.  But do experiment and offer us
feedback.  Different use cases will have different performance
characteristics, for example due to fragmentation of the data volume.

If you find this software is not performing as expected please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.

Userspace tools for checking and repairing the metadata are under
development.

Cookbook
========

This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly.  End users will be advised to use a higher-level volume
manager such as LVM2 once support has been added.

Pool device
-----------

The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace which control the creation of new
  virtual devices amongst other things.

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid metadata device, and a
data device.  If you do not have an existing metadata device you can
make one by zeroing the first 4k to indicate empty metadata.

    dd if=/dev/zero of=$metadata_dev bs=4096 count=1

The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots).  If you have
less sharing than average you'll need a larger-than-average metadata device.

As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size but round it up
to 2MB if the answer is smaller.  If you're creating large numbers of
snapshots which are recording large amounts of change, you may find you
need to increase this.
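
For example, an illustrative calculation (assuming $data_dev_size and
$data_block_size are both expressed in 512-byte sectors, as in the pool
table shown later):

    data device:      20971520 sectors (10GB)
    data block size:  128 sectors (64KB)
    number of blocks: 20971520 / 128 = 163840
    metadata size:    163840 * 48 = 7864320 bytes (roughly 7.5MB,
                      comfortably above the 2MB minimum)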

The largest size supported is 16GB; if the device is larger,
a warning will be issued and the excess space will not be used.

Reloading a pool table
----------------------

You may reload a pool's table; indeed, this is how the pool is resized
if it runs out of space.  (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)
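
For example, assuming the underlying data device has already been
extended, the pool might be grown like this (the new length of
41943040 sectors is purely illustrative):

    dmsetup suspend pool
    dmsetup reload pool --table "0 41943040 thin-pool $metadata_dev $data_dev \
                                 $data_block_size $low_water_mark"
    dmsetup resume pool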

Using an existing pool device
-----------------------------

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev \
                 $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time, expressed in units of 512-byte sectors.  People
primarily interested in thin provisioning may want to use a value such
as 1024 (512KB).  People doing lots of snapshotting may want a smaller value
such as 128 (64KB).  If you are not zeroing newly-allocated data,
a larger $data_block_size in the region of 256000 (roughly 128MB) is suggested.
$data_block_size must be the same for the lifetime of the
metadata device.

$low_water_mark is expressed in blocks of size $data_block_size.  If
free space on the data device drops below this level then a dm event
will be triggered, which a userspace daemon should catch, allowing it to
extend the pool device.  Only one such event will be sent.
Resuming a device with a new table itself triggers an event, so the
userspace daemon can use this to detect a situation where a new table
already exceeds the threshold.
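
A minimal sketch of such a userspace loop, using only dmsetup (for
illustration; a real deployment would normally rely on a volume
manager's monitoring daemon instead):

    while true; do
        # note the pool's current event number
        ev=$(dmsetup info -c --noheadings -o events pool)
        # block until a new event (e.g. low water mark crossed) arrives
        dmsetup wait pool "$ev"
        # inspect <used data blocks>/<total data blocks> in the status line
        dmsetup status pool
        # ...extend the data device and reload the pool table here if needed
    done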

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

  To create a new thinly-provisioned volume you must send a message to an
  active pool device, /dev/mapper/pool in this example.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

  Here '0' is an identifier for the volume, a 24-bit number.  It's up
  to the caller to allocate and manage these identifiers.  If the
  identifier is already in use, the message will fail with -EEXIST.

ii) Using a thinly-provisioned volume.

  Thinly-provisioned volumes are activated using the 'thin' target:

    dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"

  The last parameter is the identifier for the thinp device.
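
  The thin device can then be used like any ordinary block device.  For
  example (the filesystem type and mount point are just illustrative):

    mkfs.ext4 /dev/mapper/thin
    mount /dev/mapper/thin /mnt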

Internal snapshots
------------------

i) Creating an internal snapshot.

  Snapshots are created with another message to the pool.

  N.B.  If the origin device that you wish to snapshot is active, you
  must suspend it before creating the snapshot to avoid corruption.
  This is NOT enforced at the moment, so please be careful!

    dmsetup suspend /dev/mapper/thin
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/thin

  Here '1' is the identifier for the volume, a 24-bit number.  '0' is the
  identifier for the origin device.

ii) Using an internal snapshot.

  Once created, the user doesn't have to worry about any connection
  between the origin and the snapshot.  Indeed the snapshot is no
  different from any other thinly-provisioned device and can be
  snapshotted itself via the same method.  It's perfectly legal to
  have only one of them active, and there's no ordering requirement on
  activating or removing them both.  (This differs from conventional
  device-mapper snapshots.)

  Activate it exactly the same way as any other thinly-provisioned volume:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"
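
  For example, because a snapshot is just another thin device, it can be
  snapshotted in turn (here '2' is an arbitrary new identifier and '1' is
  the snapshot created above):

    dmsetup suspend /dev/mapper/snap
    dmsetup message /dev/mapper/pool 0 "create_snap 2 1"
    dmsetup resume /dev/mapper/snap
    dmsetup create snap2 --table "0 2097152 thin /dev/mapper/pool 2"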

External snapshots
------------------

You can use an external _read only_ device as an origin for a
thinly-provisioned volume.  Any read to an unprovisioned area of the
thin device will be passed through to the origin.  Writes trigger
the allocation of new blocks as usual.

One use case for this is VM hosts that want to run guests on
thinly-provisioned volumes but have the base image on another device
(possibly shared between many VMs).

You must not write to the origin device if you use this technique!
Of course, you may write to the thin device and take internal snapshots
of the thin volume.

i) Creating a snapshot of an external device.

  This is the same as creating a thin device.
  You don't mention the origin at this stage.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

ii) Using a snapshot of an external device.

  Append an extra parameter to the thin target specifying the origin:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 0 /dev/image"

  N.B. All descendants (internal snapshots) of this snapshot require the
  same extra origin parameter.
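
  For example, an internal snapshot of the device above is created with
  the usual message ('1' is an arbitrary new identifier) but must also be
  activated with the origin appended:

    dmsetup suspend /dev/mapper/snap
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/snap
    dmsetup create snap2 --table "0 2097152 thin /dev/mapper/pool 1 /dev/image"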

Deactivation
------------

All devices using a pool must be deactivated before the pool itself
can be.

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

    thin-pool <metadata dev> <data dev> <data block size (sectors)> \
              <low water mark (blocks)> [<number of feature args> [<arg>]*]

    Optional feature arguments:

      skip_block_zeroing: Skip the zeroing of newly-provisioned blocks.

      ignore_discard: Disable discard support.

      no_discard_passdown: Don't pass discards down to the underlying
                           data device, but just remove the mapping.

      read_only: Don't allow any changes to be made to the pool
                 metadata.

    Data block size must be between 64KB (128 sectors) and 1GB
    (2097152 sectors) inclusive.
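
    For example, a pool table line requesting 64KB data blocks, a low
    water mark of 32768 blocks, and two feature arguments might look like
    this (sizes are illustrative):

        0 20971520 thin-pool $metadata_dev $data_dev 128 32768 2 skip_block_zeroing no_discard_passdown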


ii) Status

    <transaction id> <used metadata blocks>/<total metadata blocks>
    <used data blocks>/<total data blocks> <held metadata root>
    [no_]discard_passdown ro|rw

    transaction id:
        A 64-bit number used by userspace to help synchronise with metadata
        from volume managers.

    used data blocks / total data blocks:
        If the number of free blocks drops below the pool's low water mark a
        dm event will be sent to userspace.  This event is edge-triggered and
        it will occur only once after each resume so volume manager writers
        should register for the event and then check the target's status.

    held metadata root:
        The location, in sectors, of the metadata root that has been
        'held' for userspace read access.  '-' indicates there is no
        held root.

    discard_passdown|no_discard_passdown
        Whether or not discards are actually being passed down to the
        underlying device.  Even if enabled when the table is loaded,
        this may be disabled if the underlying device does not support it.

    ro|rw
        If the pool encounters certain types of device failures it will
        drop into a read-only metadata mode in which no changes to
        the pool metadata (like allocating new blocks) are permitted.

        In serious cases where even a read-only mode is deemed unsafe
        no further I/O will be permitted and the status will just
        contain the string 'Fail'.  The userspace recovery tools
        should then be used.

iii) Messages

    create_thin <dev id>

        Create a new thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.

    create_snap <dev id> <origin id>

        Create a new snapshot of another thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.
        <origin id> is the identifier of the thinly-provisioned device
        of which the new device will be a snapshot.

    delete <dev id>

        Deletes a thin device.  Irreversible.
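
        For example, to delete the thin device with identifier 1:

            dmsetup message /dev/mapper/pool 0 "delete 1"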

    set_transaction_id <current id> <new id>

        Userland volume managers, such as LVM, need a way to
        synchronise their external metadata with the internal metadata of the
        pool target.  The thin-pool target offers to store an
        arbitrary 64-bit transaction id and return it on the target's
        status line.  To avoid races you must provide what you think
        the current transaction id is when you change it with this
        compare-and-swap message.
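
        For example, to move from transaction id 0 to 1:

            dmsetup message /dev/mapper/pool 0 "set_transaction_id 0 1"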

    reserve_metadata_snap

        Reserve a copy of the data mapping btree for use by userland.
        This allows userland to inspect the mappings as they were when
        this message was executed.  Use the pool's status command to
        get the root block associated with the metadata snapshot.

    release_metadata_snap

        Release a previously reserved copy of the data mapping btree.
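
        For example (the held metadata root appears in the pool's status
        output until it is released):

            dmsetup message /dev/mapper/pool 0 "reserve_metadata_snap"
            dmsetup status pool
            dmsetup message /dev/mapper/pool 0 "release_metadata_snap"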

'thin' target
-------------

i) Constructor

    thin <pool dev> <dev id> [<external origin dev>]

    pool dev:
        the thin-pool device, e.g. /dev/mapper/my_pool or 253:0

    dev id:
        the internal device identifier of the device to be
        activated.

    external origin dev:
        an optional block device outside the pool to be treated as a
        read-only snapshot origin: reads to unprovisioned areas of the
        thin target will be mapped to this device.

The pool doesn't store any size against the thin devices.  If you
load a thin target that is smaller than you've been using previously,
then you'll have no access to blocks mapped beyond the end.  If you
load a target that is bigger than before, then extra blocks will be
provisioned as and when needed.
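
For example, to grow the 1GB thin device used earlier (2097152 sectors)
to 2GB, simply load a bigger table and resume (sizes in 512-byte sectors):

    dmsetup suspend thin
    dmsetup reload thin --table "0 4194304 thin /dev/mapper/pool 0"
    dmsetup resume thin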

If you wish to reduce the size of your thin device and potentially
regain some space then send the 'trim' message to the pool.

ii) Status

     <nr mapped sectors> <highest mapped sector>

        If the pool has encountered device errors and failed, the status
        will just contain the string 'Fail'.  The userspace recovery
        tools should then be used.