| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 1 | SQUASHFS 4.0 FILESYSTEM | 
|  | 2 | ======================= | 
|  | 3 |  | 
|  | 4 | Squashfs is a compressed read-only filesystem for Linux. | 
| Phillip Lougher | 812753d | 2011-07-22 02:26:52 +0100 | [diff] [blame] | 5 | It uses zlib/lzo/xz compression to compress files, inodes and directories. | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 6 | Inodes in the system are very small and all blocks are packed to minimise | 
|  | 7 | data overhead. Block sizes greater than 4K are supported up to a maximum | 
|  | 8 | of 1Mbytes (default block size 128K). | 
|  | 9 |  | 
|  | 10 | Squashfs is intended for general read-only filesystem use, for archival | 
|  | 11 | use (i.e. in cases where a .tar.gz file may be used), and in constrained | 
|  | 12 | block device/memory systems (e.g. embedded systems) where low overhead is | 
|  | 13 | needed. | 
|  | 14 |  | 
|  | 15 | Mailing list: squashfs-devel@lists.sourceforge.net | 
|  | 16 | Web site: www.squashfs.org | 
|  | 17 |  | 
|  | 18 | 1. FILESYSTEM FEATURES | 
|  | 19 | ---------------------- | 
|  | 20 |  | 
|  | 21 | Squashfs filesystem features versus Cramfs: | 
|  | 22 |  | 
|  | 23 | Squashfs		Cramfs | 
|  | 24 |  | 
| Phillip Lougher | edf2e28 | 2009-03-05 00:40:13 +0000 | [diff] [blame] | 25 | Max filesystem size:		2^64			256 MiB | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 26 | Max file size:			~ 2 TiB			16 MiB | 
|  | 27 | Max files:			unlimited		unlimited | 
|  | 28 | Max directories:		unlimited		unlimited | 
|  | 29 | Max entries per directory:	unlimited		unlimited | 
|  | 30 | Max block size:			1 MiB			4 KiB | 
|  | 31 | Metadata compression:		yes			no | 
|  | 32 | Directory indexes:		yes			no | 
|  | 33 | Sparse file support:		yes			no | 
|  | 34 | Tail-end packing (fragments):	yes			no | 
|  | 35 | Exportable (NFS etc.):		yes			no | 
|  | 36 | Hard link support:		yes			no | 
|  | 37 | "." and ".." in readdir:	yes			no | 
|  | 38 | Real inode numbers:		yes			no | 
|  | 39 | 32-bit uids/gids:		yes			no | 
|  | 40 | File creation time:		yes			no | 
| Phillip Lougher | 899f453 | 2010-05-25 02:47:00 +0100 | [diff] [blame] | 41 | Xattr support:			yes			no | 
|  | 42 | ACL support:			no			no | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 43 |  | 
|  | 44 | Squashfs compresses data, inodes and directories.  In addition, inode and | 
|  | 45 | directory data are highly compacted, and packed on byte boundaries.  Each | 
|  | 46 | compressed inode is on average 8 bytes in length (the exact length varies on | 
|  | 47 | file type, i.e. regular file, directory, symbolic link, and block/char device | 
|  | 48 | inodes have different sizes). | 
|  | 49 |  | 
|  | 50 | 2. USING SQUASHFS | 
|  | 51 | ----------------- | 
|  | 52 |  | 
|  | 53 | As squashfs is a read-only filesystem, the mksquashfs program must be used to | 
|  | 54 | create populated squashfs filesystems.  This and other squashfs utilities | 
|  | 55 | can be obtained from http://www.squashfs.org.  Usage instructions can be | 
|  | 56 | obtained from this site also. | 
|  | 57 |  | 
| Phillip Lougher | 812753d | 2011-07-22 02:26:52 +0100 | [diff] [blame] | 58 | The squashfs-tools development tree is now located on kernel.org | 
|  | 59 | git://git.kernel.org/pub/scm/fs/squashfs/squashfs-tools.git | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 60 |  | 
|  | 61 | 3. SQUASHFS FILESYSTEM DESIGN | 
|  | 62 | ----------------------------- | 
|  | 63 |  | 
| Phillip Lougher | 4c1d204 | 2011-02-28 16:32:39 +0000 | [diff] [blame] | 64 | A squashfs filesystem consists of a maximum of nine parts, packed together on a | 
|  | 65 | byte alignment: | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 66 |  | 
|  | 67 | --------------- | 
|  | 68 | |  superblock 	| | 
|  | 69 | |---------------| | 
| Phillip Lougher | 4c1d204 | 2011-02-28 16:32:39 +0000 | [diff] [blame] | 70 | |  compression  | | 
|  | 71 | |    options    | | 
|  | 72 | |---------------| | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 73 | |  datablocks   | | 
|  | 74 | |  & fragments  | | 
|  | 75 | |---------------| | 
|  | 76 | |  inode table	| | 
|  | 77 | |---------------| | 
|  | 78 | |   directory	| | 
|  | 79 | |     table     | | 
|  | 80 | |---------------| | 
|  | 81 | |   fragment	| | 
|  | 82 | |    table      | | 
|  | 83 | |---------------| | 
|  | 84 | |    export     | | 
|  | 85 | |    table      | | 
|  | 86 | |---------------| | 
|  | 87 | |    uid/gid	| | 
|  | 88 | |  lookup table	| | 
| Phillip Lougher | 899f453 | 2010-05-25 02:47:00 +0100 | [diff] [blame] | 89 | |---------------| | 
|  | 90 | |     xattr     | | 
|  | 91 | |     table	| | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 92 | --------------- | 
|  | 93 |  | 
|  | 94 | Compressed data blocks are written to the filesystem as files are read from | 
|  | 95 | the source directory, and checked for duplicates.  Once all file data has been | 
|  | 96 | written the completed inode, directory, fragment, export and uid/gid lookup | 
|  | 97 | tables are written. | 
|  | 98 |  | 
| Phillip Lougher | 4c1d204 | 2011-02-28 16:32:39 +0000 | [diff] [blame] | 99 | 3.1 Compression options | 
|  | 100 | ----------------------- | 
|  | 101 |  | 
|  | 102 | Compressors can optionally support compression specific options (e.g. | 
|  | 103 | dictionary size).  If non-default compression options have been used, then | 
|  | 104 | these are stored here. | 
|  | 105 |  | 
|  | 106 | 3.2 Inodes | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 107 | ---------- | 
|  | 108 |  | 
|  | 109 | Metadata (inodes and directories) are compressed in 8Kbyte blocks.  Each | 
|  | 110 | compressed block is prefixed by a two byte length, the top bit is set if the | 
|  | 111 | block is uncompressed.  A block will be uncompressed if the -noI option is set, | 
|  | 112 | or if the compressed block was larger than the uncompressed block. | 
|  | 113 |  | 
|  | 114 | Inodes are packed into the metadata blocks, and are not aligned to block | 
|  | 115 | boundaries, therefore inodes overlap compressed blocks.  Inodes are identified | 
|  | 116 | by a 48-bit number which encodes the location of the compressed metadata block | 
|  | 117 | containing the inode, and the byte offset into that block where the inode is | 
|  | 118 | placed (<block, offset>). | 
|  | 119 |  | 
|  | 120 | To maximise compression there are different inodes for each file type | 
|  | 121 | (regular file, directory, device, etc.), the inode contents and length | 
|  | 122 | varying with the type. | 
|  | 123 |  | 
|  | 124 | To further maximise compression, two types of regular file inode and | 
|  | 125 | directory inode are defined: inodes optimised for frequently occurring | 
|  | 126 | regular files and directories, and extended types where extra | 
|  | 127 | information has to be stored. | 
|  | 128 |  | 
| Phillip Lougher | 4c1d204 | 2011-02-28 16:32:39 +0000 | [diff] [blame] | 129 | 3.3 Directories | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 130 | --------------- | 
|  | 131 |  | 
|  | 132 | Like inodes, directories are packed into compressed metadata blocks, stored | 
|  | 133 | in a directory table.  Directories are accessed using the start address of | 
|  | 134 | the metablock containing the directory and the offset into the | 
|  | 135 | decompressed block (<block, offset>). | 
|  | 136 |  | 
|  | 137 | Directories are organised in a slightly complex way, and are not simply | 
|  | 138 | a list of file names.  The organisation takes advantage of the | 
|  | 139 | fact that (in most cases) the inodes of the files will be in the same | 
|  | 140 | compressed metadata block, and therefore, can share the start block. | 
|  | 141 | Directories are therefore organised in a two level list, a directory | 
|  | 142 | header containing the shared start block value, and a sequence of directory | 
|  | 143 | entries, each of which share the shared start block.  A new directory header | 
|  | 144 | is written once/if the inode start block changes.  The directory | 
|  | 145 | header/directory entry list is repeated as many times as necessary. | 
|  | 146 |  | 
|  | 147 | Directories are sorted, and can contain a directory index to speed up | 
|  | 148 | file lookup.  Directory indexes store one entry per metablock, each entry | 
|  | 149 | storing the index/filename mapping to the first directory header | 
|  | 150 | in each metadata block.  Directories are sorted in alphabetical order, | 
|  | 151 | and at lookup the index is scanned linearly looking for the first filename | 
|  | 152 | alphabetically larger than the filename being looked up.  At this point the | 
|  | 153 | location of the metadata block the filename is in has been found. | 
|  | 154 | The general idea of the index is ensure only one metadata block needs to be | 
|  | 155 | decompressed to do a lookup irrespective of the length of the directory. | 
|  | 156 | This scheme has the advantage that it doesn't require extra memory overhead | 
|  | 157 | and doesn't require much extra storage on disk. | 
|  | 158 |  | 
| Phillip Lougher | 4c1d204 | 2011-02-28 16:32:39 +0000 | [diff] [blame] | 159 | 3.4 File data | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 160 | ------------- | 
|  | 161 |  | 
|  | 162 | Regular files consist of a sequence of contiguous compressed blocks, and/or a | 
|  | 163 | compressed fragment block (tail-end packed block).   The compressed size | 
|  | 164 | of each datablock is stored in a block list contained within the | 
|  | 165 | file inode. | 
|  | 166 |  | 
|  | 167 | To speed up access to datablocks when reading 'large' files (256 Mbytes or | 
|  | 168 | larger), the code implements an index cache that caches the mapping from | 
|  | 169 | block index to datablock location on disk. | 
|  | 170 |  | 
|  | 171 | The index cache allows Squashfs to handle large files (up to 1.75 TiB) while | 
|  | 172 | retaining a simple and space-efficient block list on disk.  The cache | 
|  | 173 | is split into slots, caching up to eight 224 GiB files (128 KiB blocks). | 
|  | 174 | Larger files use multiple slots, with 1.75 TiB files using all 8 slots. | 
|  | 175 | The index cache is designed to be memory efficient, and by default uses | 
|  | 176 | 16 KiB. | 
|  | 177 |  | 
| Phillip Lougher | 4c1d204 | 2011-02-28 16:32:39 +0000 | [diff] [blame] | 178 | 3.5 Fragment lookup table | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 179 | ------------------------- | 
|  | 180 |  | 
|  | 181 | Regular files can contain a fragment index which is mapped to a fragment | 
|  | 182 | location on disk and compressed size using a fragment lookup table.  This | 
|  | 183 | fragment lookup table is itself stored compressed into metadata blocks. | 
|  | 184 | A second index table is used to locate these.  This second index table for | 
|  | 185 | speed of access (and because it is small) is read at mount time and cached | 
|  | 186 | in memory. | 
|  | 187 |  | 
| Phillip Lougher | 4c1d204 | 2011-02-28 16:32:39 +0000 | [diff] [blame] | 188 | 3.6 Uid/gid lookup table | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 189 | ------------------------ | 
|  | 190 |  | 
|  | 191 | For space efficiency regular files store uid and gid indexes, which are | 
|  | 192 | converted to 32-bit uids/gids using an id look up table.  This table is | 
|  | 193 | stored compressed into metadata blocks.  A second index table is used to | 
|  | 194 | locate these.  This second index table for speed of access (and because it | 
|  | 195 | is small) is read at mount time and cached in memory. | 
|  | 196 |  | 
| Phillip Lougher | 4c1d204 | 2011-02-28 16:32:39 +0000 | [diff] [blame] | 197 | 3.7 Export table | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 198 | ---------------- | 
|  | 199 |  | 
|  | 200 | To enable Squashfs filesystems to be exportable (via NFS etc.) filesystems | 
|  | 201 | can optionally (disabled with the -no-exports Mksquashfs option) contain | 
|  | 202 | an inode number to inode disk location lookup table.  This is required to | 
|  | 203 | enable Squashfs to map inode numbers passed in filehandles to the inode | 
|  | 204 | location on disk, which is necessary when the export code reinstantiates | 
|  | 205 | expired/flushed inodes. | 
|  | 206 |  | 
|  | 207 | This table is stored compressed into metadata blocks.  A second index table is | 
|  | 208 | used to locate these.  This second index table for speed of access (and because | 
|  | 209 | it is small) is read at mount time and cached in memory. | 
|  | 210 |  | 
| Phillip Lougher | 4c1d204 | 2011-02-28 16:32:39 +0000 | [diff] [blame] | 211 | 3.8 Xattr table | 
| Phillip Lougher | 899f453 | 2010-05-25 02:47:00 +0100 | [diff] [blame] | 212 | --------------- | 
|  | 213 |  | 
|  | 214 | The xattr table contains extended attributes for each inode.  The xattrs | 
|  | 215 | for each inode are stored in a list, each list entry containing a type, | 
|  | 216 | name and value field.  The type field encodes the xattr prefix | 
|  | 217 | ("user.", "trusted." etc) and it also encodes how the name/value fields | 
|  | 218 | should be interpreted.  Currently the type indicates whether the value | 
|  | 219 | is stored inline (in which case the value field contains the xattr value), | 
|  | 220 | or if it is stored out of line (in which case the value field stores a | 
|  | 221 | reference to where the actual value is stored).  This allows large values | 
|  | 222 | to be stored out of line improving scanning and lookup performance and it | 
|  | 223 | also allows values to be de-duplicated, the value being stored once, and | 
| Lucas De Marchi | 25985ed | 2011-03-30 22:57:33 -0300 | [diff] [blame] | 224 | all other occurrences holding an out of line reference to that value. | 
| Phillip Lougher | 899f453 | 2010-05-25 02:47:00 +0100 | [diff] [blame] | 225 |  | 
|  | 226 | The xattr lists are packed into compressed 8K metadata blocks. | 
|  | 227 | To reduce overhead in inodes, rather than storing the on-disk | 
|  | 228 | location of the xattr list inside each inode, a 32-bit xattr id | 
|  | 229 | is stored.  This xattr id is mapped into the location of the xattr | 
|  | 230 | list using a second xattr id lookup table. | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 231 |  | 
|  | 232 | 4. TODOS AND OUTSTANDING ISSUES | 
|  | 233 | ------------------------------- | 
|  | 234 |  | 
|  | 235 | 4.1 Todo list | 
|  | 236 | ------------- | 
|  | 237 |  | 
| Phillip Lougher | 899f453 | 2010-05-25 02:47:00 +0100 | [diff] [blame] | 238 | Implement ACL support. | 
| Phillip Lougher | 9eb425c | 2009-01-05 08:46:29 +0000 | [diff] [blame] | 239 |  | 
|  | 240 | 4.2 Squashfs internal cache | 
|  | 241 | --------------------------- | 
|  | 242 |  | 
|  | 243 | Blocks in Squashfs are compressed.  To avoid repeatedly decompressing | 
|  | 244 | recently accessed data Squashfs uses two small metadata and fragment caches. | 
|  | 245 |  | 
|  | 246 | The cache is not used for file datablocks, these are decompressed and cached in | 
|  | 247 | the page-cache in the normal way.  The cache is used to temporarily cache | 
|  | 248 | fragment and metadata blocks which have been read as a result of a metadata | 
|  | 249 | (i.e. inode or directory) or fragment access.  Because metadata and fragments | 
|  | 250 | are packed together into blocks (to gain greater compression) the read of a | 
|  | 251 | particular piece of metadata or fragment will retrieve other metadata/fragments | 
|  | 252 | which have been packed with it, these because of locality-of-reference may be | 
|  | 253 | read in the near future. Temporarily caching them ensures they are available | 
|  | 254 | for near future access without requiring an additional read and decompress. | 
|  | 255 |  | 
|  | 256 | In the future this internal cache may be replaced with an implementation which | 
|  | 257 | uses the kernel page cache.  Because the page cache operates on page sized | 
|  | 258 | units this may introduce additional complexity in terms of locking and | 
|  | 259 | associated race conditions. |