|  | 1 | Mandatory File Locking For The Linux Operating System |
|  | 2 |  | 
|  | 3 | Andy Walker <andy@lysaker.kvaerner.no> | 
|  | 4 |  | 
|  | 5 | 15 April 1996 | 
|  | 6 | (Updated September 2007) |
|  | 7 |  |
|  | 8 | 0. Why you should avoid mandatory locking |
|  | 9 | ----------------------------------------- | 
|  | 10 |  | 
|  | 11 | The Linux implementation is prey to a number of difficult-to-fix race | 
|  | 12 | conditions which in practice make it not dependable: | 
|  | 13 |  | 
|  | 14 | - The write system call checks for a mandatory lock only once | 
|  | 15 | at its start.  It is therefore possible for a lock request to | 
|  | 16 | be granted after this check but before the data is modified. | 
|  | 17 | A process may then see file data change even while it holds a |
|  | 18 | mandatory lock. |
|  | 19 | - Similarly, an exclusive lock may be granted on a file after | 
|  | 20 | the kernel has decided to proceed with a read, but before the | 
|  | 21 | read has actually completed, and the reading process may see | 
|  | 22 | the file data in a state which should not have been visible | 
|  | 23 | to it. | 
|  | 24 | - Comparable races make the claimed mutual exclusion between |
|  | 25 | locks and mmap() equally unreliable. |
|  | 26 |  |
|  | 27 | 1. What is mandatory locking? |
|  | 28 | ----------------------------- |
|  | 29 |  | 
|  | 30 | Mandatory locking is kernel-enforced file locking, as opposed to the more usual |
|  | 31 | cooperative file locking used to guarantee sequential access to files among | 
|  | 32 | processes. File locks are applied using the flock() and fcntl() system calls | 
|  | 33 | (and the lockf() library routine, which is a wrapper around fcntl()). It is |
|  | 34 | normally a process's responsibility to check for locks on a file it wishes to |
|  | 35 | update, before applying its own lock, updating the file and unlocking it again. | 
|  | 36 | The most commonly used example of this (and in the case of sendmail, the most | 
|  | 37 | troublesome) is access to a user's mailbox. The mail user agent and the mail | 
|  | 38 | transfer agent must guard against updating the mailbox at the same time, and | 
|  | 39 | prevent reading the mailbox while it is being updated. | 
|  | 40 |  | 
|  | 41 | In a perfect world all processes would use and honour a cooperative, or | 
|  | 42 | "advisory" locking scheme. However, the world isn't perfect, and there's | 
|  | 43 | a lot of poorly written code out there. | 
|  | 44 |  | 
|  | 45 | In trying to address this problem, the designers of System V UNIX came up | 
|  | 46 | with a "mandatory" locking scheme, whereby the operating system kernel would | 
|  | 47 | block attempts by a process to write to a file that another process holds a |
|  | 48 | "read" (or "shared") lock on, and block attempts to both read and write to a |
|  | 49 | file that a process holds a "write" (or "exclusive") lock on. |
|  | 50 |  | 
|  | 51 | The System V mandatory locking scheme was intended to have as little impact as | 
|  | 52 | possible on existing user code. The scheme is based on marking individual files | 
|  | 53 | as candidates for mandatory locking, and using the existing fcntl()/lockf() | 
|  | 54 | interface for applying locks just as if they were normal, advisory locks. | 
|  | 55 |  | 
|  | 56 | Note 1: In saying "file" in the paragraphs above I am actually not telling | 
|  | 57 | the whole truth. System V locking is based on fcntl(). The granularity of | 
|  | 58 | fcntl() is such that it allows the locking of byte ranges in files, in addition | 
|  | 59 | to entire files, so the mandatory locking rules also have byte level | 
|  | 60 | granularity. | 
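
As a rough sketch of the byte-range interface that Note 1 refers to, the
helper below locks or unlocks a range of a file through fcntl(). The name
lock_range() is made up for this illustration. The call is the same whether
the resulting lock ends up being advisory or mandatory.

```c
#include <fcntl.h>
#include <unistd.h>

/*
 * Hypothetical helper: lock or unlock bytes [start, start + len) of fd.
 * type is F_RDLCK, F_WRLCK or F_UNLCK. A len of 0 means "from start to
 * the end of the file, however large it grows".
 * Returns 0 on success, -1 with errno set if the lock cannot be granted.
 */
int lock_range(int fd, short type, off_t start, off_t len)
{
	struct flock fl = {0};

	fl.l_type = type;
	fl.l_whence = SEEK_SET;		/* start is an absolute offset */
	fl.l_start = start;
	fl.l_len = len;

	/* F_SETLK fails at once on conflict; F_SETLKW would wait instead. */
	return fcntl(fd, F_SETLK, &fl);
}
```

Calling lock_range(fd, F_WRLCK, 0, 0) would request a write lock on the
whole file, and lock_range(fd, F_UNLCK, 0, 0) would drop it again.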
|  | 61 |  | 
|  | 62 | Note 2: POSIX.1 does not specify any scheme for mandatory locking, despite | 
|  | 63 | borrowing the fcntl() locking scheme from System V. The mandatory locking | 
|  | 64 | scheme is defined by the System V Interface Definition (SVID) Version 3. | 
|  | 65 |  | 
|  | 66 | 2. Marking a file for mandatory locking | 
|  | 67 | --------------------------------------- | 
|  | 68 |  | 
|  | 69 | A file is marked as a candidate for mandatory locking by setting the group-id | 
|  | 70 | bit in its file mode but removing the group-execute bit. This is an otherwise | 
|  | 71 | meaningless combination, and was chosen by the System V implementors so as not | 
|  | 72 | to break existing user programs. | 
|  | 73 |  | 
|  | 74 | Note that the group-id bit is usually automatically cleared by the kernel when | 
|  | 75 | a setgid file is written to. This is a security measure. The kernel has been | 
|  | 76 | modified to recognize the special case of a mandatory lock candidate and to | 
|  | 77 | refrain from clearing this bit. Similarly the kernel has been modified not | 
|  | 78 | to run mandatory lock candidates with setgid privileges. | 
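
The mode manipulation described above might look roughly like the following
in C. The function names are invented for this sketch, and it assumes the
caller is allowed to change the mode of the file.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Return mode with the group-id bit set and group-execute removed. */
mode_t mandatory_mode(mode_t mode)
{
	return (mode | S_ISGID) & ~(mode_t)S_IXGRP;
}

/* Nonzero if mode has the setgid-without-group-execute combination. */
int is_mandatory_candidate(mode_t mode)
{
	return (mode & S_ISGID) && !(mode & S_IXGRP);
}

/*
 * Hypothetical helper: mark an existing file as a mandatory lock
 * candidate. Returns 0 on success, -1 with errno set.
 */
int mark_mandatory(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -1;
	/* st_mode also carries the file type; chmod() wants only 07777. */
	return chmod(path, mandatory_mode(st.st_mode) & 07777);
}
```

From the shell, "chmod g+s,g-x file" has the same effect.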
|  | 79 |  | 
|  | 80 | 3. Available implementations | 
|  | 81 | ---------------------------- | 
|  | 82 |  | 
|  | 83 | I have considered the implementations of mandatory locking available with | 
|  | 84 | SunOS 4.1.x, Solaris 2.x and HP-UX 9.x. | 
|  | 85 |  | 
|  | 86 | Generally I have tried to make the most sense out of the behaviour exhibited | 
|  | 87 | by these three reference systems. There are many anomalies. | 
|  | 88 |  | 
|  | 89 | All the reference systems reject all calls to open() for a file on which | 
|  | 90 | another process has outstanding mandatory locks. This is in direct | 
|  | 91 | contravention of SVID 3, which states that only calls to open() with the | 
|  | 92 | O_TRUNC flag set should be rejected. The Linux implementation follows the SVID | 
|  | 93 | definition, which is the "Right Thing", since only calls with O_TRUNC can | 
|  | 94 | modify the contents of the file. | 
|  | 95 |  | 
|  | 96 | HP-UX even disallows open() with O_TRUNC for a file with advisory locks, not | 
|  | 97 | just mandatory locks. That would appear to contravene POSIX.1. | 
|  | 98 |  | 
|  | 99 | mmap() is another interesting case. All the operating systems mentioned | 
|  | 100 | prevent mandatory locks from being applied to an mmap()'ed file, but HP-UX |
|  | 101 | also disallows advisory locks for such a file. SVID actually specifies the | 
|  | 102 | paranoid HP-UX behaviour. | 
|  | 103 |  | 
|  | 104 | In my opinion only MAP_SHARED mappings should be immune from locking, and then | 
|  | 105 | only from mandatory locks - that is what is currently implemented. | 
|  | 106 |  | 
|  | 107 | SunOS is so hopeless that it doesn't even honour the O_NONBLOCK flag for | 
|  | 108 | mandatory locks, so reads and writes to locked files always block when they | 
|  | 109 | should return EAGAIN. | 
|  | 110 |  | 
|  | 111 | I'm afraid that this is such an esoteric area that the semantics described | 
|  | 112 | below are just as valid as any others, so long as the main points seem to | 
|  | 113 | agree. | 
|  | 114 |  | 
|  | 115 | 4. Semantics | 
|  | 116 | ------------ | 
|  | 117 |  | 
|  | 118 | 1. Mandatory locks can only be applied via the fcntl()/lockf() locking | 
|  | 119 | interface - in other words the System V/POSIX interface. BSD style | 
|  | 120 | locks using flock() never result in a mandatory lock. | 
|  | 121 |  | 
|  | 122 | 2. If a process has locked a region of a file with a mandatory read lock, then | 
|  | 123 | other processes are permitted to read from that region. If any of these | 
|  | 124 | processes attempts to write to the region it will block until the lock is | 
|  | 125 | released, unless the process has opened the file with the O_NONBLOCK | 
|  | 126 | flag in which case the system call will return immediately with the error | 
|  | 127 | status EAGAIN. | 
|  | 128 |  | 
|  | 129 | 3. If a process has locked a region of a file with a mandatory write lock, all | 
|  | 130 | attempts to read or write to that region block until the lock is released, | 
|  | 131 | unless a process has opened the file with the O_NONBLOCK flag in which case | 
|  | 132 | the system call will return immediately with the error status EAGAIN. | 
|  | 133 |  | 
|  | 134 | 4. Calls to open() with O_TRUNC, or to creat(), on an existing file that has |
|  | 135 | any mandatory locks owned by other processes will be rejected with the | 
|  | 136 | error status EAGAIN. | 
|  | 137 |  | 
|  | 138 | 5. Attempts to apply a mandatory lock to a file that is memory mapped and | 
|  | 139 | shared (via mmap() with MAP_SHARED) will be rejected with the error status | 
|  | 140 | EAGAIN. | 
|  | 141 |  | 
|  | 142 | 6. Attempts to create a shared memory map of a file (via mmap() with MAP_SHARED) | 
|  | 143 | that has any mandatory locks in effect will be rejected with the error status | 
|  | 144 | EAGAIN. | 
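
Rules 2 and 3 can be summarised as a small decision function. This is only a
model of the semantics listed above, not kernel code; the enum and the
function name are made up for the illustration.

```c
#include <fcntl.h>

enum outcome { PROCEED, BLOCK, FAIL_EAGAIN };

/*
 * Model of rules 2 and 3 for one read or write that touches a region
 * on which another process holds a mandatory lock.
 *
 * lock_type:  F_RDLCK or F_WRLCK, the lock held by the other process.
 * is_write:   nonzero for a write access, zero for a read.
 * open_flags: flags the accessing process opened the file with.
 */
enum outcome mandatory_outcome(short lock_type, int is_write, int open_flags)
{
	/* A read lock only excludes writers; a write lock excludes all. */
	int conflict = (lock_type == F_WRLCK) ||
		       (lock_type == F_RDLCK && is_write);

	if (!conflict)
		return PROCEED;
	return (open_flags & O_NONBLOCK) ? FAIL_EAGAIN : BLOCK;
}
```

A reader facing a read lock proceeds, a writer facing the same lock blocks,
and with O_NONBLOCK set the writer gets EAGAIN instead of blocking.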
|  | 145 |  | 
|  | 146 | 5. Which system calls are affected? | 
|  | 147 | ----------------------------------- | 
|  | 148 |  | 
|  | 149 | Those which read or modify a file's contents, not just the inode. That gives read(), |
|  | 150 | write(), readv(), writev(), open(), creat(), mmap(), truncate() and | 
|  | 151 | ftruncate(). truncate() and ftruncate() are considered to be "write" actions | 
|  | 152 | for the purposes of mandatory locking. | 
|  | 153 |  | 
|  | 154 | The affected region is usually defined as stretching from the current position | 
|  | 155 | for the total number of bytes read or written. For the truncate calls it is | 
|  | 156 | defined as the bytes of a file removed or added (we must also consider bytes | 
|  | 157 | added, as a lock can specify just "the whole file", rather than a specific | 
|  | 158 | range of bytes.) | 
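
The affected regions described above can be written down as follows. This is
a sketch of the definition only, with invented names, and it borrows the
struct flock convention that a lock length of 0 means "to the end of the
file"; it is not taken from the kernel sources.

```c
#include <sys/types.h>

struct region {
	off_t start;
	off_t len;	/* for a lock, 0 means "to end of file" */
};

/* Region touched by a read or write of count bytes at position pos. */
struct region io_region(off_t pos, size_t count)
{
	struct region r = { pos, (off_t)count };

	return r;
}

/* Region touched by a truncate: the bytes removed or added. */
struct region truncate_region(off_t old_size, off_t new_size)
{
	struct region r;

	r.start = old_size < new_size ? old_size : new_size;
	r.len = old_size < new_size ? new_size - old_size
				    : old_size - new_size;
	return r;
}

/* Nonzero if the access region overlaps the locked region. */
int regions_overlap(struct region lock, struct region access)
{
	if (access.len == 0)
		return 0;	/* nothing is touched */
	if (access.start + access.len <= lock.start)
		return 0;	/* access ends before the lock starts */
	if (lock.len != 0 && lock.start + lock.len <= access.start)
		return 0;	/* lock ends before the access starts */
	return 1;
}
```

Extending a 40 byte file to 100 bytes touches bytes 40 to 99, which no lock
on existing data covers, yet a whole-file lock (start 0, length 0) still
overlaps it. That is why bytes added have to be considered too.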
|  | 159 |  | 
|  | 160 | Note 3: I may have overlooked some system calls that need mandatory lock | 
|  | 161 | checking in my eagerness to get this code out the door. Please let me know, or | 
|  | 162 | better still fix the system calls yourself and submit a patch to me or Linus. | 
|  | 163 |  | 
|  | 164 | 6. Warning! | 
|  | 165 | ----------- | 
|  | 166 |  | 
|  | 167 | Not even root can override a mandatory lock, so runaway processes can wreak | 
|  | 168 | havoc if they lock crucial files. The way around it is to change the file | 
|  | 169 | permissions (remove the setgid bit) before trying to read or write to it. | 
|  | 170 | Of course, that might be a bit tricky if the system is hung :-( | 
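
A minimal sketch of that workaround, assuming the caller owns the file or is
root; the helper names are hypothetical.

```c
#include <sys/stat.h>
#include <sys/types.h>

/* Return mode with the group-id bit removed. */
mode_t without_setgid(mode_t mode)
{
	return mode & ~(mode_t)S_ISGID;
}

/*
 * Hypothetical helper: stop a file from being a mandatory lock
 * candidate by clearing its setgid bit.
 * Returns 0 on success, -1 with errno set.
 */
int clear_mandatory(const char *path)
{
	struct stat st;

	if (stat(path, &st) < 0)
		return -1;
	/* st_mode also carries the file type; chmod() wants only 07777. */
	return chmod(path, without_setgid(st.st_mode) & 07777);
}
```

From the shell, "chmod g-s file" does the same thing.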
|  | 171 |  |