| Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 1 | Locking scheme used for directory operations is based on two | 
| Josef 'Jeff' Sipek | c2b3898 | 2007-05-24 12:21:43 -0400 | [diff] [blame] | 2 | kinds of locks - per-inode (->i_mutex) and per-filesystem | 
|  | 3 | (->s_vfs_rename_mutex). | 
| Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 4 |  | 
|  | 5 | For our purposes all operations fall in 5 classes: | 
|  | 6 |  | 
|  | 7 | 1) read access.  Locking rules: caller locks directory we are accessing. | 
|  | 8 |  | 
|  | 9 | 2) object creation.  Locking rules: same as above. | 
|  | 10 |  | 
|  | 11 | 3) object removal.  Locking rules: caller locks parent, finds victim, | 
|  | 12 | locks victim and calls the method. | 
|  | 13 |  | 
|  | 14 | 4) rename() that is _not_ cross-directory.  Locking rules: caller locks | 
|  | 15 | the parent, finds source and target, if target already exists - locks it | 
|  | 16 | and then calls the method. | 
|  | 17 |  | 
|  | 18 | 5) link creation.  Locking rules: | 
|  | 19 | * lock parent | 
|  | 20 | * check that source is not a directory | 
|  | 21 | * lock source | 
|  | 22 | * call the method. | 
|  | 23 |  | 
|  | 24 | 6) cross-directory rename.  The trickiest in the whole bunch.  Locking | 
|  | 25 | rules: | 
|  | 26 | * lock the filesystem | 
|  | 27 | * lock parents in "ancestors first" order. | 
|  | 28 | * find source and target. | 
|  | 29 | * if old parent is equal to or is a descendent of target | 
|  | 30 | fail with -ENOTEMPTY | 
|  | 31 | * if new parent is equal to or is a descendent of source | 
|  | 32 | fail with -ELOOP | 
|  | 33 | * if target exists - lock it. | 
|  | 34 | * call the method. | 
|  | 35 |  | 
|  | 36 |  | 
|  | 37 | The rules above obviously guarantee that all directories that are going to be | 
|  | 38 | read, modified or removed by method will be locked by caller. | 
|  | 39 |  | 
|  | 40 |  | 
|  | 41 | If no directory is its own ancestor, the scheme above is deadlock-free. | 
|  | 42 | Proof: | 
|  | 43 |  | 
|  | 44 | First of all, at any moment we have a partial ordering of the | 
|  | 45 | objects - A < B iff A is an ancestor of B. | 
|  | 46 |  | 
|  | 47 | That ordering can change.  However, the following is true: | 
|  | 48 |  | 
|  | 49 | (1) if object removal or non-cross-directory rename holds lock on A and | 
|  | 50 | attempts to acquire lock on B, A will remain the parent of B until we | 
|  | 51 | acquire the lock on B.  (Proof: only cross-directory rename can change | 
|  | 52 | the parent of object and it would have to lock the parent). | 
|  | 53 |  | 
|  | 54 | (2) if cross-directory rename holds the lock on filesystem, order will not | 
|  | 55 | change until rename acquires all locks.  (Proof: other cross-directory | 
|  | 56 | renames will be blocked on filesystem lock and we don't start changing | 
|  | 57 | the order until we had acquired all locks). | 
|  | 58 |  | 
|  | 59 | (3) any operation holds at most one lock on non-directory object and | 
|  | 60 | that lock is acquired after all other locks.  (Proof: see descriptions | 
|  | 61 | of operations). | 
|  | 62 |  | 
|  | 63 | Now consider the minimal deadlock.  Each process is blocked on | 
|  | 64 | attempt to acquire some lock and already holds at least one lock.  Let's | 
|  | 65 | consider the set of contended locks.  First of all, filesystem lock is | 
|  | 66 | not contended, since any process blocked on it is not holding any locks. | 
| Josef 'Jeff' Sipek | c2b3898 | 2007-05-24 12:21:43 -0400 | [diff] [blame] | 67 | Thus all processes are blocked on ->i_mutex. | 
| Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 68 |  | 
|  | 69 | Non-directory objects are not contended due to (3).  Thus link | 
|  | 70 | creation can't be a part of deadlock - it can't be blocked on source | 
|  | 71 | and it means that it doesn't hold any locks. | 
|  | 72 |  | 
|  | 73 | Any contended object is either held by cross-directory rename or | 
|  | 74 | has a child that is also contended.  Indeed, suppose that it is held by | 
|  | 75 | operation other than cross-directory rename.  Then the lock this operation | 
|  | 76 | is blocked on belongs to child of that object due to (1). | 
|  | 77 |  | 
|  | 78 | It means that one of the operations is cross-directory rename. | 
|  | 79 | Otherwise the set of contended objects would be infinite - each of them | 
|  | 80 | would have a contended child and we had assumed that no object is its | 
|  | 81 | own descendent.  Moreover, there is exactly one cross-directory rename | 
|  | 82 | (see above). | 
|  | 83 |  | 
|  | 84 | Consider the object blocking the cross-directory rename.  One | 
|  | 85 | of its descendents is locked by cross-directory rename (otherwise we | 
| Paolo Ornati | 670e9f3 | 2006-10-03 22:57:56 +0200 | [diff] [blame] | 86 | would again have an infinite set of contended objects).  But that | 
| Linus Torvalds | 1da177e | 2005-04-16 15:20:36 -0700 | [diff] [blame] | 87 | means that cross-directory rename is taking locks out of order.  Due | 
|  | 88 | to (2) the order hadn't changed since we had acquired filesystem lock. | 
|  | 89 | But locking rules for cross-directory rename guarantee that we do not | 
|  | 90 | try to acquire lock on descendent before the lock on ancestor. | 
|  | 91 | Contradiction.  I.e.  deadlock is impossible.  Q.E.D. | 
|  | 92 |  | 
|  | 93 |  | 
|  | 94 | These operations are guaranteed to avoid loop creation.  Indeed, | 
|  | 95 | the only operation that could introduce loops is cross-directory rename. | 
|  | 96 | Since the only new (parent, child) pair added by rename() is (new parent, | 
|  | 97 | source), such loop would have to contain these objects and the rest of it | 
|  | 98 | would have to exist before rename().  I.e. at the moment of loop creation | 
|  | 99 | rename() responsible for that would be holding filesystem lock and new parent | 
|  | 100 | would have to be equal to or a descendent of source.  But that means that | 
|  | 101 | new parent had been equal to or a descendent of source since the moment when | 
|  | 102 | we had acquired filesystem lock and rename() would fail with -ELOOP in that | 
|  | 103 | case. | 
|  | 104 |  | 
|  | 105 | While this locking scheme works for arbitrary DAGs, it relies on | 
|  | 106 | ability to check that directory is a descendent of another object.  Current | 
|  | 107 | implementation assumes that directory graph is a tree.  This assumption is | 
|  | 108 | also preserved by all operations (cross-directory rename on a tree that would | 
|  | 109 | not introduce a cycle will leave it a tree and link() fails for directories). | 
|  | 110 |  | 
|  | 111 | Notice that "directory" in the above == "anything that might have | 
|  | 112 | children", so if we are going to introduce hybrid objects we will need | 
|  | 113 | either to make sure that link(2) doesn't work for them or to make changes | 
|  | 114 | in is_subdir() that would make it work even in presence of such beasts. |