| Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 1 |  | 
|  | 2 | Making Filesystems Exportable | 
|  | 3 | ============================= | 
|  | 4 |  | 
|  | 5 | Most filesystem operations require a dentry (or two) as a starting | 
|  | 6 | point.  Local applications have a reference-counted hold on suitable | 
|  | 7 | dentries via open file descriptors or cwd/root.  However, remote | 
|  | 8 | applications that access a filesystem via a remote filesystem protocol | 
|  | 9 | such as NFS may not be able to hold such a reference, and so need a | 
|  | 10 | different way to refer to a particular dentry.  As the alternative | 
|  | 11 | form of reference needs to be stable across renames, truncates, and | 
|  | 12 | server-reboot (among other things, though these tend to be the most | 
|  | 13 | problematic), there is no simple answer like 'filename'. | 
|  | 14 |  | 
|  | 15 | The mechanism discussed here allows each filesystem implementation to | 
|  | 16 | specify how to generate an opaque (outside of the filesystem) byte | 
|  | 17 | string for any dentry, and how to find an appropriate dentry for any | 
|  | 18 | given opaque byte string. | 
|  | 19 | This byte string will be called a "filehandle fragment" as it | 
|  | 20 | corresponds to part of an NFS filehandle. | 
|  | 21 |  | 
|  | 22 | A filesystem which supports the mapping between filehandle fragments | 
|  | 23 | and dentries will be termed "exportable". | 
|  | 24 |  | 
|  | 25 |  | 
|  | 26 |  | 
|  | 27 | Dcache Issues | 
|  | 28 | ------------- | 
|  | 29 |  | 
|  | 30 | The dcache normally contains a proper prefix of any given filesystem | 
|  | 31 | tree.  This means that if any filesystem object is in the dcache, then | 
|  | 32 | all of the ancestors of that filesystem object are also in the dcache. | 
|  | 33 | As normal access is by filename this prefix is created naturally and | 
|  | 34 | maintained easily (by each object maintaining a reference count on | 
|  | 35 | its parent). | 
|  | 36 |  | 
|  | 37 | However, when objects are included into the dcache by interpreting a | 
|  | 38 | filehandle fragment, there is no automatic creation of a path prefix | 
|  | 39 | for the object.  This leads to two related but distinct features of | 
|  | 40 | the dcache that are not needed for normal filesystem access. | 
|  | 41 |  | 
|  | 42 | 1/ The dcache must sometimes contain objects that are not part of the | 
|  | 43 | proper prefix, i.e. that are not connected to the root. | 
|  | 44 | 2/ The dcache must be prepared for a newly found (via ->lookup) directory | 
|  | 45 | to already have a (non-connected) dentry, and must be able to move | 
|  | 46 | that dentry into place (based on the parent and name in the | 
|  | 47 | ->lookup).   This is particularly needed for directories as | 
|  | 48 | it is a dcache invariant that directories only have one dentry. | 
|  | 49 |  | 
|  | 50 | To implement these features, the dcache has: | 
|  | 51 |  | 
|  | 52 | a/ A dentry flag DCACHE_DISCONNECTED which is set on | 
|  | 53 | any dentry that might not be part of the proper prefix. | 
|  | 54 | This is set when anonymous dentries are created, and cleared when a | 
|  | 55 | dentry is noticed to be a child of a dentry which is in the proper | 
|  | 56 | prefix. | 
|  | 57 |  | 
|  | 58 | b/ A per-superblock list "s_anon" of dentries which are the roots of | 
|  | 59 | subtrees that are not in the proper prefix.  These dentries, as | 
|  | 60 | well as the proper prefix, need to be released at unmount time.  As | 
|  | 61 | these dentries will not be hashed, they are linked together on the | 
|  | 62 | d_hash list_head. | 
|  | 63 |  | 
|  | 64 | c/ Helper routines to allocate anonymous dentries, and to help attach | 
|  | 65 | loose directory dentries at lookup time. They are: | 
|  | 66 | d_alloc_anon(inode) will return a dentry for the given inode. | 
|  | 67 | If the inode already has a dentry, one of those is returned. | 
|  | 68 | If it doesn't, a new anonymous (IS_ROOT and | 
|  | 69 | DCACHE_DISCONNECTED) dentry is allocated and attached. | 
|  | 70 | In the case of a directory, care is taken that only one dentry | 
|  | 71 | can ever be attached. | 
|  | 72 | d_splice_alias(inode, dentry) will make sure that there is a | 
|  | 73 | dentry with the same name and parent as the given dentry, and | 
|  | 74 | which refers to the given inode. | 
|  | 75 | If the inode is a directory and already has a dentry, then that | 
|  | 76 | dentry is d_moved over the given dentry. | 
|  | 77 | If the passed dentry gets attached, care is taken that this is | 
|  | 78 | mutually exclusive to a d_alloc_anon operation. | 
|  | 79 | If the passed dentry is used, NULL is returned, else the used | 
|  | 80 | dentry is returned.  This corresponds to the calling pattern of | 
|  | 81 | ->lookup. | 
|  | 82 |  | 
|  | 83 |  | 
|  | 84 | Filesystem Issues | 
|  | 85 | ----------------- | 
|  | 86 |  | 
|  | 87 | For a filesystem to be exportable it must: | 
|  | 88 |  | 
|  | 89 | 1/ provide the filehandle fragment routines described below. | 
|  | 90 | 2/ make sure that d_splice_alias is used rather than d_add | 
|  | 91 | when ->lookup finds an inode for a given parent and name. | 
|  | 92 | Typically the ->lookup routine will end: | 
|  | 93 |     if (inode) | 
|  | 94 |         return d_splice_alias(inode, dentry); | 
|  | 95 |     d_add(dentry, inode); | 
|  | 96 |     return NULL; | 
|  | 97 | } | 
|  | 98 |  | 
|  | 99 |  | 
|  | 100 |  | 
|  | 101 | A file system implementation declares that instances of the filesystem | 
|  | 102 | are exportable by setting the s_export_op field in the struct | 
|  | 103 | super_block.  This field must point to a "struct export_operations" | 
|  | 104 | struct which could potentially be full of NULLs, though normally at | 
|  | 105 | least get_parent will be set. | 
|  | 106 |  | 
|  | 107 | The primary operations are decode_fh and encode_fh. | 
|  | 108 | decode_fh takes a filehandle fragment and tries to find or create a | 
|  | 109 | dentry for the object referred to by the filehandle. | 
|  | 110 | encode_fh takes a dentry and creates a filehandle fragment which can | 
|  | 111 | later be used to find/create a dentry for the same object. | 
|  | 112 |  | 
|  | 113 | decode_fh will probably make use of "find_exported_dentry". | 
|  | 114 | This function lives in the "exportfs" module which a filesystem does | 
|  | 115 | not need unless it is being exported.  So rather than calling | 
|  | 116 | find_exported_dentry directly, each filesystem should call it through | 
|  | 117 | the find_exported_dentry pointer in its export_operations table. | 
|  | 118 | This field is set correctly by the exporting agent (e.g. nfsd) when a | 
|  | 119 | filesystem is exported, and before any export operations are called. | 
|  | 120 |  | 
|  | 121 | find_exported_dentry needs three support functions from the | 
|  | 122 | filesystem: | 
|  | 123 | get_name.  When given a parent dentry and a child dentry, this | 
|  | 124 | should find a name in the directory identified by the parent | 
|  | 125 | dentry, which leads to the object identified by the child dentry. | 
|  | 126 | If no get_name function is supplied, a default implementation is | 
|  | 127 | provided which uses vfs_readdir to find potential names, and | 
|  | 128 | matches inode numbers to find the correct match. | 
|  | 129 |  | 
|  | 130 | get_parent.  When given a dentry for a directory, this should return | 
|  | 131 | a dentry for the parent.  Quite possibly the parent dentry will | 
|  | 132 | have been allocated by d_alloc_anon. | 
|  | 133 | The default get_parent function just returns an error so any | 
|  | 134 | filehandle lookup that requires finding a parent will fail. | 
|  | 135 | ->lookup("..") is *not* used as a default as it can leave ".." | 
|  | 136 | entries in the dcache which are too messy to work with. | 
|  | 137 |  | 
|  | 138 | get_dentry.  When given an opaque datum, this should find the | 
|  | 139 | implied object and create a dentry for it (possibly with | 
|  | 140 | d_alloc_anon). | 
|  | 141 | The opaque datum is whatever is passed down by the decode_fh | 
|  | 142 | function, and is often simply a fragment of the filehandle | 
|  | 143 | fragment. | 
|  | 144 | decode_fh passes two datums through find_exported_dentry.  One that | 
|  | 145 | should be used to identify the target object, and one that can be | 
|  | 146 | used to identify the object's parent, should that be necessary. | 
|  | 147 | The default get_dentry function assumes that the datum contains an | 
|  | 148 | inode number and a generation number, and it attempts to get the | 
|  | 149 | inode using "iget" and check its validity by matching the | 
|  | 150 | generation number.  A filesystem should only depend on the default | 
|  | 151 | if iget can safely be used this way. | 
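The generation check described for the default get_dentry can be sketched in userspace C.  This is only a toy model under stated assumptions: struct toy_inode, struct toy_datum and toy_get_inode are hypothetical names invented for illustration, and a plain array stands in for whatever "iget" would consult.  It shows why a datum carrying a reused inode number with an old generation number must be rejected.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy stand-in for an on-disk inode; not a kernel structure. */
struct toy_inode {
    uint32_t ino;        /* inode number; 0 marks a free slot */
    uint32_t generation; /* bumped each time the number is reused */
};

/* Hypothetical opaque datum: inode number plus generation number. */
struct toy_datum {
    uint32_t ino;
    uint32_t generation;
};

/*
 * Find the inode named by the datum.  Returns NULL when the inode
 * number is unknown, or when the generation does not match, which
 * means the filehandle is stale because the number was reused.
 */
static const struct toy_inode *
toy_get_inode(const struct toy_inode *table, size_t n,
              const struct toy_datum *d)
{
    assert(table != NULL && d != NULL);
    if (d->ino == 0)
        return NULL;
    for (size_t i = 0; i < n; i++) {
        if (table[i].ino != d->ino)
            continue;
        /* Same number, so the generation decides validity. */
        return table[i].generation == d->generation ? &table[i] : NULL;
    }
    return NULL;
}
```

A real filesystem would go on to obtain a dentry for the inode (possibly with d_alloc_anon); the sketch stops at the validity check.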
|  | 152 |  | 
|  | 153 | If decode_fh and/or encode_fh are left as NULL, then default | 
|  | 154 | implementations are used.  These defaults are suitable for ext2 and | 
|  | 155 | extremely similar filesystems (like ext3). | 
|  | 156 |  | 
|  | 157 | The default encode_fh creates a filehandle fragment from the inode | 
|  | 158 | number and generation number of the target together with the inode | 
|  | 159 | number and generation number of the parent (if the parent is | 
|  | 160 | required). | 
|  | 161 |  | 
|  | 162 | The default decode_fh extracts the target and parent datums from the | 
|  | 163 | filehandle assuming the format used by the default encode_fh and | 
|  | 164 | passes them to find_exported_dentry. | 
|  | 165 |  | 
|  | 166 |  | 
|  | 167 | A filehandle fragment consists of an array of 1 or more 4-byte words, | 
|  | 168 | together with a one-byte "type". | 
|  | 169 | The decode_fh routine should not depend on the stated size that is | 
|  | 170 | passed to it.  This size may be larger than the original filehandle | 
|  | 171 | generated by encode_fh, in which case it will have been padded with | 
|  | 172 | nuls.  Rather, the encode_fh routine should choose a "type" which | 
|  | 173 | indicates to decode_fh how much of the filehandle is valid, and how | 
|  | 174 | it should be interpreted. | 
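As an illustration of a fragment built from inode and generation numbers, and of letting the "type" rather than the stated size decide how many words are valid, here is a minimal userspace sketch.  The names (toy_encode_fh, toy_decode_fh, struct toy_id) and the type values 1, 2 and 255 are assumptions made for this example, not the kernel's definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative type values; the numbering is an assumption. */
enum {
    TOY_FH_INO_GEN        = 1,   /* words 0-1: inode number, generation */
    TOY_FH_INO_GEN_PARENT = 2,   /* words 2-3 add the parent's pair */
    TOY_FH_INVALID        = 255
};

struct toy_id {
    uint32_t ino;
    uint32_t generation;
};

/*
 * Encode target (and optionally parent) into fh[].  *len holds the
 * space available in words on entry and the words used on return.
 * Returns the type, or TOY_FH_INVALID if fh[] is too small.
 */
static int toy_encode_fh(const struct toy_id *target,
                         const struct toy_id *parent,
                         uint32_t *fh, int *len)
{
    int need = parent ? 4 : 2;

    assert(target != NULL && fh != NULL && len != NULL);
    if (*len < need)
        return TOY_FH_INVALID;
    fh[0] = target->ino;
    fh[1] = target->generation;
    if (parent) {
        fh[2] = parent->ino;
        fh[3] = parent->generation;
    }
    *len = need;
    return parent ? TOY_FH_INO_GEN_PARENT : TOY_FH_INO_GEN;
}

/*
 * Decode, using the type to decide how many words are meaningful.
 * len serves only as a bounds check: it may be larger than what
 * encode produced because of trailing padding.  Returns 0 on
 * success and -1 on error; *has_parent reports whether *parent
 * was filled in.
 */
static int toy_decode_fh(const uint32_t *fh, int len, int type,
                         struct toy_id *target, struct toy_id *parent,
                         int *has_parent)
{
    int need;

    assert(fh != NULL && target != NULL);
    assert(parent != NULL && has_parent != NULL);
    if (type == TOY_FH_INO_GEN)
        need = 2;
    else if (type == TOY_FH_INO_GEN_PARENT)
        need = 4;
    else
        return -1;
    if (len < need)
        return -1;
    target->ino = fh[0];
    target->generation = fh[1];
    *has_parent = (need == 4);
    if (*has_parent) {
        parent->ino = fh[2];
        parent->generation = fh[3];
    }
    return 0;
}
```

Passing a padded length of 8 words to toy_decode_fh for a fragment that was encoded into 2 or 4 words gives the same result as passing the exact length, because only the type is consulted for layout.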
|  | 175 |  | 
|  | 176 |  |