 | 1 | ramfs, rootfs and initramfs | 
 | 2 | October 17, 2005 | 
 | 3 | Rob Landley <rob@landley.net> | 
 | 4 | ============================= | 
 | 5 |  | 
 | 6 | What is ramfs? | 
 | 7 | -------------- | 
 | 8 |  | 
 | 9 | Ramfs is a very simple filesystem that exports Linux's disk caching | 
 | 10 | mechanisms (the page cache and dentry cache) as a dynamically resizable | 
 | 11 | ram-based filesystem. | 
 | 12 |  | 
 | 13 | Normally all files are cached in memory by Linux.  Pages of data read from | 
 | 14 | backing store (usually the block device the filesystem is mounted on) are kept | 
 | 15 | around in case they're needed again, but marked as clean (freeable) in case the | 
 | 16 | Virtual Memory system needs the memory for something else.  Similarly, data | 
 | 17 | written to files is marked clean as soon as it has been written to backing | 
 | 18 | store, but kept around for caching purposes until the VM reallocates the | 
 | 19 | memory.  A similar mechanism (the dentry cache) greatly speeds up access to | 
 | 20 | directories. | 
 | 21 |  | 
 | 22 | With ramfs, there is no backing store.  Files written into ramfs allocate | 
 | 23 | dentries and page cache as usual, but there's nowhere to write them to. | 
 | 24 | This means the pages are never marked clean, so they can't be freed by the | 
 | 25 | VM when it's looking to recycle memory. | 
 | 26 |  | 
 | 27 | The amount of code required to implement ramfs is tiny, because all the | 
 | 28 | work is done by the existing Linux caching infrastructure.  Basically, | 
 | 29 | you're mounting the disk cache as a filesystem.  Because of this, ramfs is not | 
 | 30 | an optional component removable via menuconfig, since there would be negligible | 
 | 31 | space savings. | 
 | 32 |  | 
 | 33 | ramfs and ramdisk: | 
 | 34 | ------------------ | 
 | 35 |  | 
 | 36 | The older "ram disk" mechanism created a synthetic block device out of | 
 | 37 | an area of ram and used it as backing store for a filesystem.  This block | 
 | 38 | device was of fixed size, so the filesystem mounted on it was of fixed | 
 | 39 | size.  Using a ram disk also required unnecessarily copying memory from the | 
 | 40 | fake block device into the page cache (and copying changes back out), as well | 
 | 41 | as creating and destroying dentries.  Plus it needed a filesystem driver | 
 | 42 | (such as ext2) to format and interpret this data. | 
 | 43 |  | 
 | 44 | Compared to ramfs, this wastes memory (and memory bus bandwidth), creates | 
 | 45 | unnecessary work for the CPU, and pollutes the CPU caches.  (There are tricks | 
 | 46 | to avoid this copying by playing with the page tables, but they're unpleasantly | 
 | 47 | complicated and turn out to be about as expensive as the copying anyway.) | 
 | 48 | More to the point, all the work ramfs is doing has to happen _anyway_, | 
 | 49 | since all file access goes through the page and dentry caches.  The ram | 
 | 50 | disk is simply unnecessary; ramfs is internally much simpler. | 
 | 51 |  | 
 | 52 | Another reason ramdisks are semi-obsolete is that the introduction of | 
 | 53 | loopback devices offered a more flexible and convenient way to create | 
 | 54 | synthetic block devices, now from files instead of from chunks of memory. | 
 | 55 | See losetup(8) for details. | 
 | 56 |  | 
 | 57 | ramfs and tmpfs: | 
 | 58 | ---------------- | 
 | 59 |  | 
 | 60 | One downside of ramfs is you can keep writing data into it until you fill | 
 | 61 | up all memory, and the VM can't free it because the VM thinks that files | 
 | 62 | should get written to backing store (rather than swap space), but ramfs hasn't | 
 | 63 | got any backing store.  Because of this, only root (or a trusted user) should | 
 | 64 | be allowed write access to a ramfs mount. | 
 | 65 |  | 
 | 66 | A ramfs derivative called tmpfs was created to add size limits, and the ability | 
 | 67 | to write the data to swap space.  Normal users can be allowed write access to | 
 | 68 | tmpfs mounts.  See Documentation/filesystems/tmpfs.txt for more information. | 
 | 69 |  | 
 | 70 | What is rootfs? | 
 | 71 | --------------- | 
 | 72 |  | 
 | 73 | Rootfs is a special instance of ramfs, which is always present in 2.6 systems. | 
 | 74 | (It's used internally as the starting and stopping point for searches of the | 
 | 75 | kernel's doubly-linked list of mount points.) | 
 | 76 |  | 
 | 77 | Most systems just mount another filesystem over it and ignore it.  The | 
 | 78 | amount of space an empty instance of ramfs takes up is tiny. | 
 | 79 |  | 
 | 80 | What is initramfs? | 
 | 81 | ------------------ | 
 | 82 |  | 
 | 83 | All 2.6 Linux kernels contain a gzipped "cpio" format archive, which is | 
 | 84 | extracted into rootfs when the kernel boots up.  After extracting, the kernel | 
 | 85 | checks to see if rootfs contains a file "init", and if so it executes it as PID | 
 | 86 | 1.  If found, this init process is responsible for bringing the system the | 
 | 87 | rest of the way up, including locating and mounting the real root device (if | 
 | 88 | any).  If rootfs does not contain an init program after the embedded cpio | 
 | 89 | archive is extracted into it, the kernel will fall through to the older code | 
 | 90 | to locate and mount a root partition, then exec some variant of /sbin/init | 
 | 91 | out of that. | 
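
The decision described above can be modelled in a few lines.  This is only a
sketch of the control flow, not the kernel's code: choose_init() and its
arguments are hypothetical names, and the list of fallback candidates is an
assumption about what "some variant of /sbin/init" means in practice.

```python
# Hypothetical model of the boot-time choice of PID 1 described above.
# The candidate list is an assumption made for illustration.
LEGACY_INIT_CANDIDATES = ["/sbin/init", "/etc/init", "/bin/init", "/bin/sh"]


def choose_init(rootfs_files, real_root_files):
    """Return (where, path) for the program that would become PID 1.

    rootfs_files:    paths present in rootfs after the cpio archive is unpacked
    real_root_files: paths present on the root partition the legacy code mounts
    """
    # New-style boot: an /init in rootfs wins and takes over from here.
    if "/init" in rootfs_files:
        return ("rootfs", "/init")
    # Legacy boot: mount the real root, then look for an init there.
    for candidate in LEGACY_INIT_CANDIDATES:
        if candidate in real_root_files:
            return ("real root", candidate)
    return (None, None)


print(choose_init({"/init", "/bin/busybox"}, {"/sbin/init"}))
print(choose_init(set(), {"/sbin/init", "/bin/sh"}))
```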
 | 92 |  | 
 | 93 | All this differs from the old initrd in several ways: | 
 | 94 |  | 
 | 95 |   - The old initrd was a separate file, while the initramfs archive is linked | 
 | 96 |     into the linux kernel image.  (The directory linux-*/usr is devoted to | 
 | 97 |     generating this archive during the build.) | 
 | 98 |  | 
 | 99 |   - The old initrd file was a gzipped filesystem image (in some file format, | 
 | 100 |     such as ext2, that had to be built into the kernel), while the new | 
 | 101 |     initramfs archive is a gzipped cpio archive (like tar only simpler, | 
 | 102 |     see cpio(1) and Documentation/early-userspace/buffer-format.txt). | 
 | 103 |  | 
 | 104 |   - The program run by the old initrd (which was called /linuxrc, not /init) did | 
 | 105 |     some setup and then returned to the kernel, while the init program from | 
 | 106 |     initramfs is not expected to return to the kernel.  (If /init needs to hand | 
 | 107 |     off control it can overmount / with a new root device and exec another init | 
 | 108 |     program.  See the switch_root utility, below.) | 
 | 109 |  | 
 | 110 |   - When switching to another root device, initrd would pivot_root and then | 
 | 111 |     umount the ramdisk.  But initramfs is rootfs: you can neither pivot_root | 
 | 112 |     rootfs, nor unmount it.  Instead delete everything out of rootfs to | 
 | 113 |     free up the space (find / -xdev -exec rm '{}' ';'), overmount rootfs | 
 | 114 |     with the new root (cd /newmount; mount --move . /; chroot .), attach | 
 | 115 |     stdin/stdout/stderr to the new /dev/console, and exec the new init. | 
 | 116 |  | 
 | 117 |     Since this is a remarkably persnickety process (and involves deleting | 
 | 118 |     commands before you can run them), the klibc package introduced a helper | 
 | 119 |     program (utils/run_init.c) to do all this for you.  Most other packages | 
 | 120 |     (such as busybox) have named this command "switch_root". | 
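
The "delete everything out of rootfs" step is the delicate part, because it
must not descend into other filesystems mounted underneath (the new root among
them).  Below is a minimal sketch of that one step, assuming a hypothetical
helper named clear_one_filesystem(); it is not how run_init or switch_root are
written, and the demo runs against a scratch directory only.

```python
import os
import stat
import tempfile


def clear_one_filesystem(path, dev=None, is_top=True):
    """Delete everything under path that lives on the same device as path.

    Entries on a different device (mount points) are skipped entirely,
    which is the behaviour find's -xdev option provides.
    """
    st = os.lstat(path)          # lstat: never follow symlinks
    if dev is None:
        dev = st.st_dev
    if st.st_dev != dev:
        return                   # another filesystem: leave it alone
    if stat.S_ISDIR(st.st_mode):
        for name in os.listdir(path):
            clear_one_filesystem(os.path.join(path, name), dev, False)
        if not is_top:
            try:
                os.rmdir(path)
            except OSError:
                pass             # still holds a mount point; keep it
    else:
        os.unlink(path)


# Demo on a throwaway directory.  Never point this at "/" on a real system.
scratch = tempfile.mkdtemp()
os.makedirs(os.path.join(scratch, "bin"))
with open(os.path.join(scratch, "bin", "busybox"), "w") as f:
    f.write("placeholder")
os.symlink("busybox", os.path.join(scratch, "bin", "sh"))
clear_one_filesystem(scratch)
print(os.listdir(scratch))
os.rmdir(scratch)
```

A real implementation also has to cope with the fact that it is deleting the
very binaries and libraries it runs from, which is why the helper is normally
a single statically linked program.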
 | 121 |  | 
 | 122 | Populating initramfs: | 
 | 123 | --------------------- | 
 | 124 |  | 
 | 125 | The 2.6 kernel build process always creates a gzipped cpio format initramfs | 
 | 126 | archive and links it into the resulting kernel binary.  By default, this | 
 | 127 | archive is empty (consuming 134 bytes on x86).  The config option | 
 | 128 | CONFIG_INITRAMFS_SOURCE (for some reason buried under devices->block devices | 
 | 129 | in menuconfig, and living in usr/Kconfig) can be used to specify a source for | 
 | 130 | the initramfs archive, which will automatically be incorporated into the | 
 | 131 | resulting binary.  This option can point to an existing gzipped cpio archive, a | 
 | 132 | directory containing files to be archived, or a text file specification such | 
 | 133 | as the following example: | 
 | 134 |  | 
 | 135 |   dir /dev 755 0 0 | 
 | 136 |   nod /dev/console 644 0 0 c 5 1 | 
 | 137 |   nod /dev/loop0 644 0 0 b 7 0 | 
 | 138 |   dir /bin 755 1000 1000 | 
 | 139 |   slink /bin/sh busybox 777 0 0 | 
 | 140 |   file /bin/busybox initramfs/busybox 755 0 0 | 
 | 141 |   dir /proc 755 0 0 | 
 | 142 |   dir /sys 755 0 0 | 
 | 143 |   dir /mnt 755 0 0 | 
 | 144 |   file /init initramfs/init.sh 755 0 0 | 
 | 145 |  | 
 | 146 | Run "usr/gen_init_cpio" (after the kernel build) to get a usage message | 
 | 147 | documenting the above file format. | 
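
For readers who want to sanity-check a specification file before handing it to
the build, here is a small sketch of a parser for the four entry types used in
the example above.  The field names and the parse_spec() helper are
illustrative assumptions inferred from that example; the usage message printed
by usr/gen_init_cpio remains the authoritative description.

```python
# Field layout per entry type, inferred from the example above (assumption).
SPEC_FIELDS = {
    "file":  ["name", "location", "mode", "uid", "gid"],
    "dir":   ["name", "mode", "uid", "gid"],
    "nod":   ["name", "mode", "uid", "gid", "dev_type", "maj", "min"],
    "slink": ["name", "target", "mode", "uid", "gid"],
}


def parse_spec(text):
    """Parse a spec file into a list of dicts, one per entry."""
    entries = []
    for lineno, line in enumerate(text.splitlines(), 1):
        line = line.strip()
        if not line or line.startswith("#"):
            continue                      # skip blank lines and comments
        kind, *values = line.split()
        fields = SPEC_FIELDS.get(kind)
        if fields is None:
            raise ValueError(f"line {lineno}: unknown entry type {kind!r}")
        if len(values) != len(fields):
            raise ValueError(f"line {lineno}: {kind} takes {len(fields)} fields")
        entry = dict(zip(fields, values), type=kind)
        entry["mode"] = int(entry["mode"], 8)     # modes are octal
        for key in ("uid", "gid", "maj", "min"):
            if key in entry:
                entry[key] = int(entry[key])
        entries.append(entry)
    return entries


example = """\
dir /dev 755 0 0
nod /dev/console 644 0 0 c 5 1
slink /bin/sh busybox 777 0 0
file /init initramfs/init.sh 755 0 0
"""
for entry in parse_spec(example):
    print(entry["type"], entry["name"], oct(entry["mode"]))
```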
 | 148 |  | 
 | 149 | One advantage of the text file is that root access is not required to | 
 | 150 | set permissions or create device nodes in the new archive.  (Note that those | 
 | 151 | two example "file" entries expect to find files named "init.sh" and "busybox" in | 
 | 152 | a directory called "initramfs", under the linux-2.6.* directory.  See | 
 | 153 | Documentation/early-userspace/README for more details.) | 
 | 154 |  | 
 | 155 | The kernel does not depend on external cpio tools.  gen_init_cpio is built | 
 | 156 | from usr/gen_init_cpio.c, which is entirely self-contained, and the kernel's | 
 | 157 | boot-time extractor is also (obviously) self-contained.  However, if you _do_ | 
 | 158 | happen to have cpio installed, the following command line can extract the | 
 | 159 | generated cpio image back into its component files: | 
 | 160 |  | 
 | 161 |   cpio -i -d -H newc -F initramfs_data.cpio --no-absolute-filenames | 
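
If cpio is not installed, the "newc" layout is simple enough to walk by hand.
The sketch below assumes the usual description of that layout: a 110-byte
ASCII header (the magic "070701" followed by thirteen 8-digit hex fields), a
NUL-terminated name, then the file data, with name and data each padded to a
4-byte boundary, and a final entry named "TRAILER!!!".  The helper names are
made up for illustration, and the archive is assumed to be uncompressed; check
buffer-format.txt before relying on any detail.

```python
MAGIC = b"070701"
FIELDS = ("ino", "mode", "uid", "gid", "nlink", "mtime", "filesize",
          "devmajor", "devminor", "rdevmajor", "rdevminor", "namesize", "check")
HEADER_LEN = len(MAGIC) + 8 * len(FIELDS)   # 110 bytes


def pad4(n):
    """Number of NUL bytes needed to round n up to a multiple of 4."""
    return (4 - n % 4) % 4


def pack_entry(name, data=b"", mode=0o100644, ino=0):
    """Build one archive member: header, name, padding, data, padding."""
    name_bytes = name.encode() + b"\0"
    values = dict.fromkeys(FIELDS, 0)
    values.update(ino=ino, mode=mode, nlink=1,
                  filesize=len(data), namesize=len(name_bytes))
    out = MAGIC + b"".join(b"%08X" % values[f] for f in FIELDS) + name_bytes
    out += b"\0" * pad4(len(out))
    return out + data + b"\0" * pad4(len(data))


def pack_archive(files):
    """files maps member name -> bytes; returns an uncompressed archive."""
    body = b"".join(pack_entry(name, data, ino=i)
                    for i, (name, data) in enumerate(files.items(), 1))
    return body + pack_entry("TRAILER!!!", mode=0)


def list_archive(blob):
    """Return [(name, mode, data), ...] for every member before the trailer."""
    entries, pos = [], 0
    while True:
        header = blob[pos:pos + HEADER_LEN]
        if header[:6] != MAGIC:
            raise ValueError(f"bad magic at offset {pos}")
        values = {f: int(header[6 + 8 * i:14 + 8 * i], 16)
                  for i, f in enumerate(FIELDS)}
        pos += HEADER_LEN
        name = blob[pos:pos + values["namesize"] - 1].decode()
        pos += values["namesize"]
        pos += pad4(pos)
        if name == "TRAILER!!!":
            return entries
        data = blob[pos:pos + values["filesize"]]
        pos += values["filesize"]
        pos += pad4(pos)
        entries.append((name, values["mode"], data))


blob = pack_archive({"init": b"#!/bin/sh\n", "etc/motd": b"hello\n"})
for name, mode, data in list_archive(blob):
    print(name, oct(mode), len(data))
```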
 | 162 |  | 
 | 163 | Contents of initramfs: | 
 | 164 | ---------------------- | 
 | 165 |  | 
 | 166 | If you don't already understand what shared libraries, devices, and paths | 
 | 167 | you need to get a minimal root filesystem up and running, here are some | 
 | 168 | references: | 
 | 169 | http://www.tldp.org/HOWTO/Bootdisk-HOWTO/ | 
 | 170 | http://www.tldp.org/HOWTO/From-PowerUp-To-Bash-Prompt-HOWTO.html | 
 | 171 | http://www.linuxfromscratch.org/lfs/view/stable/ | 
 | 172 |  | 
 | 173 | The "klibc" package (http://www.kernel.org/pub/linux/libs/klibc) is | 
 | 174 | designed to be a tiny C library to statically link early userspace | 
 | 175 | code against, along with some related utilities.  It is BSD licensed. | 
 | 176 |  | 
 | 177 | I use uClibc (http://www.uclibc.org) and busybox (http://www.busybox.net) | 
 | 178 | myself.  These are LGPL and GPL, respectively.  (A self-contained initramfs | 
 | 179 | package is planned for the busybox 1.2 release.) | 
 | 180 |  | 
 | 181 | In theory you could use glibc, but that's not well suited for small embedded | 
 | 182 | uses like this.  (A "hello world" program statically linked against glibc is | 
 | 183 | over 400k.  With uClibc it's 7k.  Also note that glibc dlopens libnss to do | 
 | 184 | name lookups, even when otherwise statically linked.) | 
 | 185 |  | 
 | 186 | Why cpio rather than tar? | 
 | 187 | ------------------------- | 
 | 188 |  | 
 | 189 | This decision was made back in December, 2001.  The discussion started here: | 
 | 190 |  | 
 | 191 |   http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1538.html | 
 | 192 |  | 
 | 193 | And spawned a second thread (specifically on tar vs cpio), starting here: | 
 | 194 |  | 
 | 195 |   http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1587.html | 
 | 196 |  | 
 | 197 | The quick and dirty summary version (which is no substitute for reading | 
 | 198 | the above threads) is: | 
 | 199 |  | 
 | 200 | 1) cpio is a standard.  It's decades old (from the AT&T days), and already | 
 | 201 |    widely used on Linux (inside RPM, Red Hat's device driver disks).  Here's | 
 | 202 |    a Linux Journal article about it from 1996: | 
 | 203 |  | 
 | 204 |       http://www.linuxjournal.com/article/1213 | 
 | 205 |  | 
 | 206 |    It's not as popular as tar because the traditional cpio command line tools | 
 | 207 |    require _truly_hideous_ command line arguments.  But that says nothing | 
 | 208 |    either way about the archive format, and there are alternative tools, | 
 | 209 |    such as: | 
 | 210 |  | 
 | 211 |      http://freshmeat.net/projects/afio/ | 
 | 212 |  | 
 | 213 | 2) The cpio archive format chosen by the kernel is simpler and cleaner (and | 
 | 214 |    thus easier to create and parse) than any of the (literally dozens of) | 
 | 215 |    various tar archive formats.  The complete initramfs archive format is | 
 | 216 |    explained in buffer-format.txt, created in usr/gen_init_cpio.c, and | 
 | 217 |    extracted in init/initramfs.c.  All three together come to less than 26k | 
 | 218 |    total of human-readable text. | 
 | 219 |  | 
 | 220 | 3) The GNU project standardizing on tar is approximately as relevant as | 
 | 221 |    Windows standardizing on zip.  Linux is not part of either, and is free | 
 | 222 |    to make its own technical decisions. | 
 | 223 |  | 
 | 224 | 4) Since this is a kernel internal format, it could easily have been | 
 | 225 |    something brand new.  The kernel provides its own tools to create and | 
 | 226 |    extract this format anyway.  Using an existing standard was preferable, | 
 | 227 |    but not essential. | 
 | 228 |  | 
 | 229 | 5) Al Viro made the decision (quote: "tar is ugly as hell and not going to be | 
 | 230 |    supported on the kernel side"): | 
 | 231 |  | 
 | 232 |       http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1540.html | 
 | 233 |  | 
 | 234 |    explained his reasoning: | 
 | 235 |  | 
 | 236 |       http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1550.html | 
 | 237 |       http://www.uwsg.iu.edu/hypermail/linux/kernel/0112.2/1638.html | 
 | 238 |  | 
 | 239 |    and, most importantly, designed and implemented the initramfs code. | 
 | 240 |  | 
 | 241 | Future directions: | 
 | 242 | ------------------ | 
 | 243 |  | 
 | 244 | Today (2.6.14), initramfs is always compiled in, but not always used.  The | 
 | 245 | kernel falls back to legacy boot code that is reached only if initramfs does | 
 | 246 | not contain an /init program.  The fallback is there to ensure a smooth | 
 | 247 | transition and to allow early boot functionality to move gradually to | 
 | 248 | "early userspace" (i.e. initramfs). | 
 | 249 |  | 
 | 250 | The move to early userspace is necessary because finding and mounting the real | 
 | 251 | root device is complex.  Root partitions can span multiple devices (raid or | 
 | 252 | separate journal).  They can be out on the network (requiring dhcp, setting a | 
 | 253 | specific mac address, logging into a server, etc).  They can live on removable | 
 | 254 | media, with dynamically allocated major/minor numbers and persistent naming | 
 | 255 | issues requiring a full udev implementation to sort out.  They can be | 
 | 256 | compressed, encrypted, copy-on-write, loopback mounted, strangely partitioned, | 
 | 257 | and so on. | 
 | 258 |  | 
 | 259 | This kind of complexity (which inevitably includes policy) is rightly handled | 
 | 260 | in userspace.  Both klibc and busybox/uClibc are working on simple initramfs | 
 | 261 | packages to drop into a kernel build, and when standard solutions are ready | 
 | 262 | and widely deployed, the kernel's legacy early boot code will become obsolete | 
 | 263 | and a candidate for the feature removal schedule. | 
 | 264 |  | 
 | 265 | But that's a while off yet. |