Introduction
============

This document describes a collection of device-mapper targets that
between them implement thin provisioning and snapshots.

The main highlight of this implementation, compared to the previous
implementation of snapshots, is that it allows many virtual devices to
be stored on the same data volume.  This simplifies administration and
allows the sharing of data between volumes, thus reducing disk usage.

Another significant feature is support for an arbitrary depth of
recursive snapshots (snapshots of snapshots of snapshots ...).  The
previous implementation of snapshots did this by chaining together
lookup tables, and so performance was O(depth).  This new
implementation uses a single data structure to avoid this degradation
with depth.  Fragmentation may still be an issue, however, in some
scenarios.

Metadata is stored on a separate device from data, giving the
administrator some freedom, for example to:

- Improve metadata resilience by storing metadata on a mirrored volume
  but data on a non-mirrored one.

- Improve performance by storing the metadata on SSD.
Status
======

These targets are very much still in the EXPERIMENTAL state.  Please
do not yet rely on them in production.  But do experiment and offer us
feedback.  Different use cases will have different performance
characteristics, for example due to fragmentation of the data volume.

If you find this software is not performing as expected, please mail
dm-devel@redhat.com with details and we'll try our best to improve
things for you.

Userspace tools for checking and repairing the metadata are under
development.

Cookbook
========

This section describes some quick recipes for using thin provisioning.
They use the dmsetup program to control the device-mapper driver
directly.  End users will be advised to use a higher-level volume
manager such as LVM2 once support has been added.

Pool device
-----------

The pool device ties together the metadata volume and the data volume.
It maps I/O linearly to the data volume and updates the metadata via
two mechanisms:

- Function calls from the thin targets

- Device-mapper 'messages' from userspace which control the creation of
  new virtual devices amongst other things.

Setting up a fresh pool device
------------------------------

Setting up a pool device requires a valid metadata device and a
data device.  If you do not have an existing metadata device you can
make one by zeroing the first 4k to indicate empty metadata.

    dd if=/dev/zero of=$metadata_dev bs=4096 count=1
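
The $metadata_dev and $data_dev names used throughout this cookbook are
placeholders.  A minimal sketch with illustrative device names
(substitute your own):

    # Illustrative only: a small SSD partition for metadata and a
    # larger volume for data.
    metadata_dev=/dev/sdb1
    data_dev=/dev/sdc1
    dd if=/dev/zero of=$metadata_dev bs=4096 count=1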

The amount of metadata you need will vary according to how many blocks
are shared between thin devices (i.e. through snapshots).  If you have
less sharing than average you'll need a larger-than-average metadata device.

As a guide, we suggest you calculate the number of bytes to use in the
metadata device as 48 * $data_dev_size / $data_block_size but round it up
to 2MB if the answer is smaller.  The largest size supported is 16GB.
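
A rough worked example, with illustrative figures: a 100GB data device
is 209715200 512-byte sectors; with a $data_block_size of 1024 sectors
that is 204800 blocks, suggesting 48 * 204800 = 9830400 bytes (roughly
10MB) of metadata, comfortably above the 2MB minimum.

    # Sizes are in 512-byte sectors; figures are illustrative only.
    data_dev_size=209715200    # 100GB data device
    data_block_size=1024       # 512KB blocks
    echo $(( 48 * data_dev_size / data_block_size ))   # => 9830400 bytes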

If you're creating large numbers of snapshots which are recording large
amounts of change, you may find you need to increase this.

Reloading a pool table
----------------------

You may reload a pool's table; indeed, this is how the pool is resized
if it runs out of space.  (N.B. While specifying a different metadata
device when reloading is not forbidden at the moment, things will go
wrong if it does not route I/O to exactly the same on-disk location as
previously.)
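
For example, after extending the underlying data device, the pool might
be grown by reloading a larger table.  This is only a sketch, assuming
the same metadata and data devices and a new, larger $data_dev_size in
sectors:

    dmsetup suspend pool
    dmsetup reload pool --table "0 $data_dev_size thin-pool \
        $metadata_dev $data_dev $data_block_size $low_water_mark"
    dmsetup resume pool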

Using an existing pool device
-----------------------------

    dmsetup create pool \
        --table "0 20971520 thin-pool $metadata_dev $data_dev \
        $data_block_size $low_water_mark"

$data_block_size gives the smallest unit of disk space that can be
allocated at a time, expressed in units of 512-byte sectors.  People
primarily interested in thin provisioning may want to use a value such
as 1024 (512KB).  People doing lots of snapshotting may want a smaller value
such as 128 (64KB).  If you are not zeroing newly-allocated data,
a larger $data_block_size in the region of 256000 (128MB) is suggested.
$data_block_size must be the same for the lifetime of the
metadata device.

$low_water_mark is expressed in blocks of size $data_block_size.  If
free space on the data device drops below this level then a dm event
will be triggered which a userspace daemon should catch, allowing it to
extend the pool device.  Only one such event will be sent.
Resuming a device with a new table itself triggers an event, so the
userspace daemon can use this to detect a situation where a new table
already exceeds the threshold.
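
A sketch of how a (hypothetical) userspace daemon might consume these
events, using only standard dmsetup commands:

    # Block until the pool device reports its next event, then inspect
    # the pool's status line to decide whether the data device needs
    # extending.
    dmsetup wait pool
    dmsetup status pool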

Thin provisioning
-----------------

i) Creating a new thinly-provisioned volume.

To create a new thinly-provisioned volume you must send a message to an
active pool device, /dev/mapper/pool in this example.

    dmsetup message /dev/mapper/pool 0 "create_thin 0"

Here '0' is an identifier for the volume, a 24-bit number.  It's up
to the caller to allocate and manage these identifiers.  If the
identifier is already in use, the message will fail with -EEXIST.

ii) Using a thinly-provisioned volume.

Thinly-provisioned volumes are activated using the 'thin' target:

    dmsetup create thin --table "0 2097152 thin /dev/mapper/pool 0"

The last parameter is the identifier for the thinp device.
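
The 2097152 in the table line is the virtual size of the device in
512-byte sectors (1GB here); it need not match the space actually
available in the pool.  Once activated, the volume behaves like any
other block device.  For example (a sketch, assuming you want an ext4
filesystem on it):

    mkfs.ext4 /dev/mapper/thin
    mount /dev/mapper/thin /mnt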

Internal snapshots
------------------

i) Creating an internal snapshot.

Snapshots are created with another message to the pool.

N.B.  If the origin device that you wish to snapshot is active, you
must suspend it before creating the snapshot to avoid corruption.
This is NOT enforced at the moment, so please be careful!

    dmsetup suspend /dev/mapper/thin
    dmsetup message /dev/mapper/pool 0 "create_snap 1 0"
    dmsetup resume /dev/mapper/thin

Here '1' is the identifier for the volume, a 24-bit number.  '0' is the
identifier for the origin device.

ii) Using an internal snapshot.

Once created, the user doesn't have to worry about any connection
between the origin and the snapshot.  Indeed the snapshot is no
different from any other thinly-provisioned device and can be
snapshotted itself via the same method.  It's perfectly legal to
have only one of them active, and there's no ordering requirement on
activating or removing them both.  (This differs from conventional
device-mapper snapshots.)

Activate it exactly the same way as any other thinly-provisioned volume:

    dmsetup create snap --table "0 2097152 thin /dev/mapper/pool 1"

Deactivation
------------

All devices using a pool must be deactivated before the pool itself
can be.

    dmsetup remove thin
    dmsetup remove snap
    dmsetup remove pool

Reference
=========

'thin-pool' target
------------------

i) Constructor

    thin-pool <metadata dev> <data dev> <data block size (sectors)> \
              <low water mark (blocks)> [<number of feature args> [<arg>]*]

    Optional feature arguments:
    - 'skip_block_zeroing': skips the zeroing of newly-provisioned blocks.

    Data block size must be between 64KB (128 sectors) and 1GB
    (2097152 sectors) inclusive.
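
For example, a pool table line that skips block zeroing (sizes are
illustrative only) might look like:

    0 20971520 thin-pool $metadata_dev $data_dev 128 32768 1 skip_block_zeroing

Here 128 is the data block size in sectors (64KB), 32768 is the low
water mark in blocks, and the trailing '1 skip_block_zeroing' says one
feature argument follows.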


ii) Status

    <transaction id> <used metadata blocks>/<total metadata blocks>
    <used data blocks>/<total data blocks> <held metadata root>

    transaction id:
        A 64-bit number used by userspace to help synchronise with metadata
        from volume managers.

    used data blocks / total data blocks:
        If the number of free blocks drops below the pool's low water mark a
        dm event will be sent to userspace.  This event is edge-triggered and
        it will occur only once after each resume so volume manager writers
        should register for the event and then check the target's status.

    held metadata root:
        The location, in sectors, of the metadata root that has been
        'held' for userspace read access.  '-' indicates there is no
        held root.  This feature is not yet implemented so '-' is
        always returned.
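
A hypothetical status line, as reported by 'dmsetup status pool' (the
leading "0 20971520 thin-pool" is the standard start/length/target-type
prefix that dmsetup prints before the target's own fields):

    0 20971520 thin-pool 0 128/4096 10240/163840 -

This would mean transaction id 0, 128 of 4096 metadata blocks used,
10240 of 163840 data blocks used, and no held metadata root.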

iii) Messages

    create_thin <dev id>

        Create a new thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.

    create_snap <dev id> <origin id>

        Create a new snapshot of another thinly-provisioned device.
        <dev id> is an arbitrary unique 24-bit identifier chosen by
        the caller.
        <origin id> is the identifier of the thinly-provisioned device
        of which the new device will be a snapshot.

    delete <dev id>

        Deletes a thin device.  Irreversible.

    trim <dev id> <new size in sectors>

        Delete mappings from the end of a thin device.  Irreversible.
        You might want to use this if you're reducing the size of
        your thinly-provisioned device.  In many cases, due to the
        sharing of blocks between devices, it is not possible to
        determine in advance how much space 'trim' will release.  (In
        future a userspace tool might be able to perform this
        calculation.)
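
        For example (illustrative figures: device id 0 trimmed back to
        1GB, i.e. 2097152 sectors):

            dmsetup message /dev/mapper/pool 0 "trim 0 2097152"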

    set_transaction_id <current id> <new id>

        Userland volume managers, such as LVM, need a way to
        synchronise their external metadata with the internal metadata of the
        pool target.  The thin-pool target offers to store an
        arbitrary 64-bit transaction id and return it on the target's
        status line.  To avoid races you must provide what you think
        the current transaction id is when you change it with this
        compare-and-swap message.
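
        For example, moving the transaction id from 0 to 1 (the message
        is expected to fail if the current id is not actually 0):

            dmsetup message /dev/mapper/pool 0 "set_transaction_id 0 1"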

'thin' target
-------------

i) Constructor

    thin <pool dev> <dev id>

    pool dev:
        the thin-pool device, e.g. /dev/mapper/my_pool or 253:0

    dev id:
        the internal device identifier of the device to be
        activated.

The pool doesn't store any size against the thin devices.  If you
load a thin target that is smaller than you've been using previously,
then you'll have no access to blocks mapped beyond the end.  If you
load a target that is bigger than before, then extra blocks will be
provisioned as and when needed.

If you wish to reduce the size of your thin device and potentially
regain some space then send the 'trim' message to the pool.
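
One possible sequence for shrinking a thin device, sketched here for
device id 0 activated as 'thin' and reduced to 1GB (2097152 sectors),
on the assumption that nothing above it still expects the old size:

    dmsetup suspend /dev/mapper/thin
    dmsetup reload thin --table "0 2097152 thin /dev/mapper/pool 0"
    dmsetup message /dev/mapper/pool 0 "trim 0 2097152"
    dmsetup resume /dev/mapper/thin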

ii) Status

    <nr mapped sectors> <highest mapped sector>
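
A hypothetical example, as reported by 'dmsetup status thin' (again
with the standard start/length/target-type prefix):

    0 2097152 thin 1048576 1048575

i.e. 1048576 sectors are mapped and the highest mapped sector is
1048575.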