 | 2 | unshare system call: | 
 | 3 | -------------------- | 
 | 4 | This document describes the new system call, unshare. The document | 
 | 5 | provides an overview of the feature, why it is needed, how it can | 
 | 6 | be used, its interface specification, design, implementation and | 
 | 7 | how it can be tested. | 
 | 8 |  | 
 | 9 | Change Log: | 
 | 10 | ----------- | 
 | 11 | version 0.1  Initial document, Janak Desai (janak@us.ibm.com), Jan 11, 2006 | 
 | 12 |  | 
 | 13 | Contents: | 
 | 14 | --------- | 
 | 15 | 	1) Overview | 
 | 16 | 	2) Benefits | 
 | 17 | 	3) Cost | 
 | 18 | 	4) Requirements | 
 | 19 | 	5) Functional Specification | 
 | 20 | 	6) High Level Design | 
 | 21 | 	7) Low Level Design | 
 | 22 | 	8) Test Specification | 
 | 23 | 	9) Future Work | 
 | 24 |  | 
 | 25 | 1) Overview | 
 | 26 | ----------- | 
 | 27 | Most legacy operating system kernels support an abstraction of threads | 
 | 28 | as multiple execution contexts within a process. These kernels provide | 
 | 29 | special resources and mechanisms to maintain these "threads". The Linux | 
 | 30 | kernel, in a clever and simple manner, makes no distinction | 
 | 31 | between processes and "threads". The kernel allows processes to share | 
 | 32 | resources and thus they can achieve legacy "threads" behavior without | 
 | 33 | requiring additional data structures and mechanisms in the kernel. The | 
 | 34 | power of implementing threads in this manner comes not only from | 
 | 35 | its simplicity but also from allowing application programmers to work | 
 | 36 | outside the confinement of all-or-nothing shared resources of legacy | 
 | 37 | threads. On Linux, at the time of thread creation using the clone system | 
 | 38 | call, applications can selectively choose which resources to share | 
 | 39 | between threads. | 
 | 40 |  | 
 | 41 | The unshare system call adds a primitive to the Linux thread model that | 
 | 42 | allows threads to selectively 'unshare' any resources that were being | 
 | 43 | shared at the time of their creation. unshare was conceptualized by | 
 | 44 | Al Viro in August of 2000, on the Linux-Kernel mailing list, as part | 
 | 45 | of the discussion on POSIX threads on Linux.  unshare augments the | 
 | 46 | usefulness of Linux threads for applications that would like to control | 
 | 47 | shared resources without creating a new process. unshare is a natural | 
 | 48 | addition to the set of available primitives on Linux that implement | 
 | 49 | the concept of process/thread as a virtual machine. | 
 | 50 |  | 
 | 51 | 2) Benefits | 
 | 52 | ----------- | 
 | 53 | unshare would be useful to large application frameworks such as PAM | 
 | 54 | where creating a new process to control sharing/unsharing of process | 
 | 55 | resources is not possible. Since namespaces are shared by default | 
 | 56 | when creating a new process using fork or clone, unshare can benefit | 
 | 57 | even non-threaded applications if they need to disassociate | 
 | 58 | from the default shared namespace. The following are two use cases | 
 | 59 | where unshare can be used. | 
 | 60 |  | 
 | 61 | 2.1 Per-security context namespaces | 
 | 62 | ----------------------------------- | 
 | 63 | unshare can be used to implement polyinstantiated directories using | 
 | 64 | the kernel's per-process namespace mechanism. Polyinstantiated directories, | 
 | 65 | such as per-user and/or per-security context instance of /tmp, /var/tmp or | 
 | 66 | per-security context instance of a user's home directory, isolate user | 
 | 67 | processes when working with these directories. Using unshare, a PAM | 
 | 68 | module can easily set up a private namespace for a user at login. | 
 | 69 | Polyinstantiated directories are required for Common Criteria certification | 
 | 70 | with Labeled System Protection Profile; however, with the availability | 
 | 71 | of the shared-tree feature in the Linux kernel, even regular Linux systems | 
 | 72 | can benefit from setting up private namespaces at login and | 
 | 73 | polyinstantiating /tmp, /var/tmp and other directories deemed | 
 | 74 | appropriate by system administrators. | 
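
The following is a minimal sketch of what such a login-time helper might
look like. The function name polyinstantiate_tmp and the instance
directory argument are hypothetical, and the MS_PRIVATE remount is an
assumption added so that the bind mount does not propagate back to the
original namespace on systems that use shared subtrees. The caller needs
CAP_SYS_ADMIN.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <sys/mount.h>

/*
 * Hypothetical helper: give the calling process a private namespace
 * and bind a per-user directory over /tmp.
 * Returns 0 on success or a negative errno value on failure.
 */
int polyinstantiate_tmp(const char *instance_dir)
{
	assert(instance_dir != NULL);

	/* Detach from the namespace inherited from the parent. */
	if (unshare(CLONE_NEWNS) == -1)
		return -errno;

	/* Assumption: keep later mounts from propagating to other namespaces. */
	if (mount("none", "/", NULL, MS_REC | MS_PRIVATE, NULL) == -1)
		return -errno;

	/* Only this process and its descendants see the new /tmp. */
	if (mount(instance_dir, "/tmp", NULL, MS_BIND, NULL) == -1)
		return -errno;

	return 0;
}
```

A PAM module would call such a helper once per session, before the
user's shell is started, so that every process in the session inherits
the private view.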
 | 75 |  | 
 | 76 | 2.2 unsharing of virtual memory and/or open files | 
 | 77 | ------------------------------------------------- | 
 | 78 | Consider a client/server application where the server is processing | 
 | 79 | client requests by creating processes that share resources such as | 
 | 80 | virtual memory and open files. Without unshare, the server has to | 
 | 81 | decide what needs to be shared at the time of creating the process | 
 | 82 | which services the request. unshare gives the server the ability to | 
 | 83 | disassociate parts of the context during the servicing of the | 
 | 84 | request. For large and complex middleware application frameworks, this | 
 | 85 | ability to unshare after the process was created can be very | 
 | 86 | useful. | 
 | 87 |  | 
 | 88 | 3) Cost | 
 | 89 | ------- | 
 | 90 | In order to not duplicate code and to handle the fact that unshare | 
 | 91 | works on an active task (as opposed to clone/fork working on a newly | 
 | 92 | allocated inactive task), unshare had to make minor reorganizational | 
 | 93 | changes to the copy_* functions used by the clone/fork system calls. | 
 | 94 | There is a cost associated with altering existing, well tested and | 
 | 95 | stable code to implement a new feature that may not get exercised | 
 | 96 | extensively in the beginning. However, with proper design and code | 
 | 97 | review of the changes and creation of an unshare test for the LTP | 
 | 98 | the benefits of this new feature can exceed its cost. | 
 | 99 |  | 
 | 100 | 4) Requirements | 
 | 101 | --------------- | 
 | 102 | unshare reverses sharing that was done using the clone(2) system call, | 
 | 103 | so unshare should have an interface similar to that of clone(2). That is, | 
 | 104 | since flags in clone(int flags, void *stack) specify what should | 
 | 105 | be shared, similar flags in unshare(int flags) should specify | 
 | 106 | what should be unshared. Unfortunately, this may appear to invert | 
 | 107 | the meaning of the flags from the way they are used in clone(2). | 
 | 108 | However, there was no easy solution that was less confusing and that | 
 | 109 | allowed incremental context unsharing in the future without an ABI change. | 
 | 110 |  | 
 | 111 | The unshare interface should accommodate possible future addition of | 
 | 112 | new context flags without requiring a rebuild of old applications. | 
 | 113 | If and when new context flags are added, the unshare design should allow | 
 | 114 | incremental unsharing of those resources on an as-needed basis. | 
 | 115 |  | 
 | 116 | 5) Functional Specification | 
 | 117 | --------------------------- | 
 | 118 | NAME | 
 | 119 | 	unshare - disassociate parts of the process execution context | 
 | 120 |  | 
 | 121 | SYNOPSIS | 
 | 122 | 	#include <sched.h> | 
 | 123 |  | 
 | 124 | 	int unshare(int flags); | 
 | 125 |  | 
 | 126 | DESCRIPTION | 
 | 127 | 	unshare allows a process to disassociate parts of its execution | 
 | 128 | 	context that are currently being shared with other processes. Part | 
 | 129 | 	of the execution context, such as the namespace, is shared by default | 
 | 130 | 	when a new process is created using fork(2), while other parts, | 
 | 131 | 	such as the virtual memory, open file descriptors, etc, may be | 
 | 132 | 	shared by explicit request to share them when creating a process | 
 | 133 | 	using clone(2). | 
 | 134 |  | 
 | 135 | 	The main use of unshare is to allow a process to control its | 
 | 136 | 	shared execution context without creating a new process. | 
 | 137 |  | 
 | 138 | 	The flags argument specifies one, or the bitwise OR of several, of | 
 | 139 | 	the following constants. | 
 | 140 |  | 
 | 141 | 	CLONE_FS | 
 | 142 | 		If CLONE_FS is set, file system information of the caller | 
 | 143 | 		is disassociated from the shared file system information. | 
 | 144 |  | 
 | 145 | 	CLONE_FILES | 
 | 146 | 		If CLONE_FILES is set, the file descriptor table of the | 
 | 147 | 		caller is disassociated from the shared file descriptor | 
 | 148 | 		table. | 
 | 149 |  | 
 | 150 | 	CLONE_NEWNS | 
 | 151 | 		If CLONE_NEWNS is set, the namespace of the caller is | 
 | 152 | 		disassociated from the shared namespace. | 
 | 153 |  | 
 | 154 | 	CLONE_VM | 
 | 155 | 		If CLONE_VM is set, the virtual memory of the caller is | 
 | 156 | 		disassociated from the shared virtual memory. | 
 | 157 |  | 
 | 158 | RETURN VALUE | 
 | 159 | 	On success, zero is returned. On failure, -1 is returned and errno is set to indicate the error. | 
 | 160 |  | 
 | 161 | ERRORS | 
 | 162 | 	EPERM	CLONE_NEWNS was specified by a non-root process (process | 
 | 163 | 		without CAP_SYS_ADMIN). | 
 | 164 |  | 
 | 165 | 	ENOMEM	Cannot allocate sufficient memory to copy parts of caller's | 
 | 166 | 		context that need to be unshared. | 
 | 167 |  | 
 | 168 | 	EINVAL	Invalid flag was specified as an argument. | 
 | 169 |  | 
 | 170 | CONFORMING TO | 
 | 171 | 	The unshare() call is Linux-specific and should not be used | 
 | 172 | 	in programs intended to be portable. | 
 | 173 |  | 
 | 174 | SEE ALSO | 
 | 175 | 	clone(2), fork(2) | 
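
As an illustration of the interface above, the sketch below wraps the
call and maps the three documented errno values to short messages. The
names try_unshare and unshare_error_text are hypothetical and are not
part of any library.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <string.h>

/* Hypothetical helper: describe the errors listed under ERRORS. */
const char *unshare_error_text(int err)
{
	assert(err > 0);	/* expects a positive errno value */

	switch (err) {
	case EPERM:
		return "CLONE_NEWNS requested without CAP_SYS_ADMIN";
	case ENOMEM:
		return "not enough memory to copy the context";
	case EINVAL:
		return "invalid flag";
	default:
		return strerror(err);
	}
}

/* Hypothetical helper: returns 0 on success or a negative errno value. */
int try_unshare(int flags)
{
	if (unshare(flags) == -1)
		return -errno;
	return 0;
}
```

A caller could then write, for example,
try_unshare(CLONE_FS | CLONE_FILES) and pass the negated return value
to unshare_error_text() when it is non-zero.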
 | 176 |  | 
 | 177 | 6) High Level Design | 
 | 178 | -------------------- | 
 | 179 | Depending on the flags argument, the unshare system call allocates | 
 | 180 | appropriate process context structures, populates it with values from | 
 | 181 | the current shared version, associates newly duplicated structures | 
 | 182 | with the current task structure and releases corresponding shared | 
 | 183 | versions. Helper functions of clone (copy_*) could not be used | 
 | 184 | directly by unshare for the following two reasons. | 
 | 185 |   1) clone operates on a newly allocated not-yet-active task | 
 | 186 |      structure, whereas unshare operates on the current active | 
 | 187 |      task. Therefore unshare has to take the appropriate task_lock() | 
 | 188 |      before associating newly duplicated context structures. | 
 | 189 |   2) unshare has to allocate and duplicate all context structures | 
 | 190 |      that are being unshared, before associating them with the | 
 | 191 |      current task and releasing older shared structures. Failure | 
 | 192 |      to do so will create race conditions and/or oops when trying | 
 | 193 |      to back out due to an error. Consider the case of unsharing | 
 | 194 |      both virtual memory and namespace. After successfully unsharing | 
 | 195 |      vm, if the system call encounters an error while allocating | 
 | 196 |      new namespace structure, the error return code will have to | 
 | 197 |      reverse the unsharing of vm. As part of the reversal the | 
 | 198 |      system call will have to go back to older, shared, vm | 
 | 199 |      structure, which may not exist anymore. | 
 | 200 |  | 
 | 201 | Therefore code from copy_* functions that allocated and duplicated | 
 | 202 | current context structure was moved into new dup_* functions. Now, | 
 | 203 | copy_* functions call dup_* functions to allocate and duplicate | 
 | 204 | appropriate context structures and then associate them with the | 
 | 205 | task structure that is being constructed. unshare system call on | 
 | 206 | the other hand performs the following: | 
 | 207 |   1) Check flags to force missing, but implied, flags | 
 | 208 |   2) For each context structure, call the corresponding unshare | 
 | 209 |      helper function to allocate and duplicate a new context | 
 | 210 |      structure, if the appropriate bit is set in the flags argument. | 
 | 211 |   3) If there is no error in allocation and duplication and there | 
 | 212 |      are new context structures then lock the current task structure, | 
 | 213 |      associate new context structures with the current task structure, | 
 | 214 |      and release the lock on the current task structure. | 
 | 215 |   4) Appropriately release older, shared, context structures. | 
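
The ordering described above can be modelled in user space. The sketch
below is only a toy: struct ctx, struct task, the TOY_* flags and the
fail_files parameter (which simulates an allocation failure) are all
hypothetical stand-ins, and no real locking is done. It shows that when
the second duplication fails, the task still points at its original
shared structures.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

/* Toy stand-ins for kernel context structures (hypothetical). */
struct ctx { int refs; int data; };
struct task { struct ctx *fs; struct ctx *files; };

#define TOY_FS    0x1
#define TOY_FILES 0x2

static struct ctx *dup_ctx(const struct ctx *old, int fail)
{
	struct ctx *c = fail ? NULL : malloc(sizeof(*c));

	if (c) {
		c->refs = 1;
		c->data = old->data;
	}
	return c;
}

static void put_ctx(struct ctx *c)
{
	if (!c)
		return;
	assert(c->refs > 0);
	if (--c->refs == 0)
		free(c);
}

int toy_unshare(struct task *t, int flags, int fail_files)
{
	struct ctx *new_fs = NULL, *new_files = NULL;
	struct ctx *old_fs = NULL, *old_files = NULL;

	/* Duplicate everything first; the task is not touched yet. */
	if (flags & TOY_FS) {
		new_fs = dup_ctx(t->fs, 0);
		if (!new_fs)
			return -ENOMEM;
	}
	if (flags & TOY_FILES) {
		new_files = dup_ctx(t->files, fail_files);
		if (!new_files) {
			put_ctx(new_fs);	/* backing out only frees copies */
			return -ENOMEM;
		}
	}

	/* Associate new structures; the kernel holds task_lock() here. */
	if (new_fs) { old_fs = t->fs; t->fs = new_fs; }
	if (new_files) { old_files = t->files; t->files = new_files; }

	/* Release the older, shared, structures. */
	put_ctx(old_fs);
	put_ctx(old_files);
	return 0;
}
```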
 | 216 |  | 
 | 217 | 7) Low Level Design | 
 | 218 | ------------------- | 
 | 219 | Implementation of unshare can be grouped in the following 4 different | 
 | 220 | items: | 
 | 221 |   a) Reorganization of existing copy_* functions | 
 | 222 |   b) unshare system call service function | 
 | 223 |   c) unshare helper functions for each different process context | 
 | 224 |   d) Registration of system call number for different architectures | 
 | 225 |  | 
 | 226 |   7.1) Reorganization of copy_* functions | 
 | 227 |        Each copy function such as copy_mm, copy_namespace, copy_files, | 
 | 228 |        etc, had roughly two components. The first component allocated | 
 | 229 |        and duplicated the appropriate structure and the second component | 
 | 230 |        linked it to the task structure passed in as an argument to the copy | 
 | 231 |        function. The first component was split into its own function. | 
 | 232 |        These dup_* functions allocated and duplicated the appropriate | 
 | 233 |        context structure. The reorganized copy_* functions invoked | 
 | 234 |        their corresponding dup_* functions and then linked the newly | 
 | 235 |        duplicated structures to the task structure with which the | 
 | 236 |        copy function was called. | 
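
       A user-space sketch of this split, for a single toy structure, is
       given below. struct toy_fs, struct toy_task and TOY_SHARE_FS are
       hypothetical stand-ins for the kernel's own types and for CLONE_FS;
       the real functions differ in detail.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

struct toy_fs { int refs; int umask; };
struct toy_task { struct toy_fs *fs; };

#define TOY_SHARE_FS 0x1	/* plays the role of CLONE_FS */

/* First component: allocate and duplicate; no task is touched. */
struct toy_fs *dup_toy_fs(const struct toy_fs *old)
{
	struct toy_fs *fs;

	assert(old != NULL);
	fs = malloc(sizeof(*fs));
	if (fs) {
		fs->refs = 1;
		fs->umask = old->umask;
	}
	return fs;
}

/* Second component: link a structure to the task being constructed. */
int copy_toy_fs(int clone_flags, const struct toy_task *parent,
		struct toy_task *child)
{
	if (clone_flags & TOY_SHARE_FS) {
		/* Sharing requested: take a reference, no duplication. */
		parent->fs->refs++;
		child->fs = parent->fs;
		return 0;
	}

	child->fs = dup_toy_fs(parent->fs);
	return child->fs ? 0 : -ENOMEM;
}
```

       Because dup_toy_fs() never looks at a task, an unshare-style caller
       can reuse it on the current task without going through the copy
       function.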
 | 237 |  | 
 | 238 |   7.2) unshare system call service function | 
 | 239 |        * Check flags | 
 | 240 | 	 Force implied flags. If CLONE_THREAD is set force CLONE_VM. | 
 | 241 | 	 If CLONE_VM is set, force CLONE_SIGHAND. If CLONE_SIGHAND is | 
 | 242 | 	 set and signals are also being shared, force CLONE_THREAD. If | 
 | 243 | 	 CLONE_NEWNS is set, force CLONE_FS. | 
 | 244 |        * For each context flag, invoke the corresponding unshare_* | 
 | 245 | 	 helper routine with flags passed into the system call and a | 
 | 246 | 	 reference to the pointer to the new unshared structure. | 
 | 247 |        * If any new structures are created by unshare_* helper | 
 | 248 | 	 functions, take the task_lock() on the current task, | 
 | 249 | 	 modify appropriate context pointers, and release the | 
 | 250 |          task lock. | 
 | 251 |        * For all newly unshared structures, release the corresponding | 
 | 252 |          older, shared, structures. | 
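
       The flag check in the first step can be sketched as a pure function.
       The name force_implied_flags and the signals_shared parameter are
       hypothetical; in the kernel the second input would come from the
       current task rather than from an argument.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <sched.h>

/*
 * Hypothetical helper: add the flags implied by those already set.
 * signals_shared is non-zero if the caller shares signals with
 * another task.
 */
unsigned long force_implied_flags(unsigned long flags, int signals_shared)
{
	/* Unsharing the thread group implies unsharing vm. */
	if (flags & CLONE_THREAD)
		flags |= CLONE_VM;

	/* Unsharing vm implies unsharing signal handlers. */
	if (flags & CLONE_VM)
		flags |= CLONE_SIGHAND;

	/* Unsharing handlers while signals are shared implies CLONE_THREAD. */
	if ((flags & CLONE_SIGHAND) && signals_shared)
		flags |= CLONE_THREAD;

	/* Unsharing the namespace implies unsharing fs information. */
	if (flags & CLONE_NEWNS)
		flags |= CLONE_FS;

	assert(!(flags & CLONE_NEWNS) || (flags & CLONE_FS));
	return flags;
}
```

       The order of the checks matters: CLONE_VM has to be forced before
       the CLONE_SIGHAND test so that a request for CLONE_THREAD alone
       picks up both implied flags.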
 | 253 |  | 
 | 254 |   7.3) unshare_* helper functions | 
 | 255 |        For unshare_* helpers corresponding to CLONE_SYSVSEM, CLONE_SIGHAND, | 
 | 256 |        and CLONE_THREAD, return -EINVAL since they are not implemented yet. | 
 | 257 |        For others, check the flag value to see if the unsharing is | 
 | 258 |        required for that structure. If it is, invoke the corresponding | 
 | 259 |        dup_* function to allocate and duplicate the structure and return | 
 | 260 |        a pointer to it. | 
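
       The two kinds of helper might look roughly like the toy functions
       below. struct toy_fs and both function names are hypothetical, and
       the sketch assumes that an unimplemented helper rejects the call
       only when its flag is actually requested.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <sched.h>
#include <stdlib.h>

struct toy_fs { int umask; };

/* Unimplemented context: refuse if the caller asks for it. */
int toy_unshare_sighand(unsigned long flags)
{
	return (flags & CLONE_SIGHAND) ? -EINVAL : 0;
}

/*
 * Implemented context: duplicate only when the flag is set and hand
 * the copy back through new_fsp. *new_fsp stays NULL otherwise.
 */
int toy_unshare_fs(unsigned long flags, const struct toy_fs *cur,
		   struct toy_fs **new_fsp)
{
	assert(cur != NULL && new_fsp != NULL);

	*new_fsp = NULL;
	if (!(flags & CLONE_FS))
		return 0;

	*new_fsp = malloc(sizeof(**new_fsp));
	if (!*new_fsp)
		return -ENOMEM;

	**new_fsp = *cur;	/* stands in for the dup_* call */
	return 0;
}
```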
 | 261 |  | 
 | 262 |   7.4) Appropriately modify architecture specific code to register the | 
 | 263 |        new system call. | 
 | 264 |  | 
 | 265 | 8) Test Specification | 
 | 266 | --------------------- | 
 | 267 | The test for unshare should test the following: | 
 | 268 |   1) Valid flags: Test to check that clone flags for signal and | 
 | 269 | 	signal handlers, for which unsharing is not implemented | 
 | 270 | 	yet, return -EINVAL. | 
 | 271 |   2) Missing/implied flags: Test to make sure that unsharing the | 
 | 272 | 	namespace without specifying unsharing of filesystem correctly | 
 | 273 | 	unshares both namespace and filesystem information. | 
 | 274 |   3) For each of the four (namespace, filesystem, files and vm) | 
 | 275 | 	supported unsharing, verify that the system call correctly | 
 | 276 | 	unshares the appropriate structure. Verify that unsharing | 
 | 277 | 	them individually as well as in combination with each | 
 | 278 | 	other works as expected. | 
 | 279 |   4) Concurrent execution: Use shared memory segments and futex on | 
 | 280 | 	an address in the shm segment to synchronize execution of | 
 | 281 | 	about 10 threads. Have a couple of threads execute execve, | 
 | 282 | 	a couple _exit and the rest unshare with different combinations | 
 | 283 | 	of flags. Verify that unsharing is performed as expected and | 
 | 284 | 	that there are no oops or hangs. | 
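
As one possible starting point for item 3, the sketch below checks
unsharing of the file descriptor table. It is not the LTP test; the
function name check_unshare_files, the stack size and the use of
/dev/null are assumptions made for illustration. A child created with
clone(CLONE_FILES) shares the parent's descriptor table, calls
unshare(CLONE_FILES), and then closes a descriptor. If unsharing
worked, the parent's descriptor must still be open.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <fcntl.h>
#include <sched.h>
#include <signal.h>
#include <stdlib.h>
#include <sys/wait.h>
#include <unistd.h>

#define CHILD_STACK_SIZE (256 * 1024)

/* Child: detach from the shared descriptor table, then close the fd. */
static int child_fn(void *arg)
{
	int fd;

	assert(arg != NULL);
	fd = *(int *)arg;

	if (unshare(CLONE_FILES) != 0)
		return 2;	/* unshare unavailable: leave the fd alone */
	close(fd);		/* should affect only the private table */
	return 0;
}

/*
 * Returns 1 if the parent's fd survived the child's close(), 0 if it
 * did not, and -1 if the check could not be carried out.
 */
int check_unshare_files(void)
{
	int fd, status, result = -1;
	char *stack;
	pid_t pid;

	fd = open("/dev/null", O_RDONLY);
	if (fd < 0)
		return -1;

	stack = malloc(CHILD_STACK_SIZE);
	if (!stack) {
		close(fd);
		return -1;
	}

	/* Share only the descriptor table; pass the top of the stack. */
	pid = clone(child_fn, stack + CHILD_STACK_SIZE,
		    CLONE_FILES | SIGCHLD, &fd);
	if (pid > 0 && waitpid(pid, &status, 0) == pid &&
	    WIFEXITED(status) && WEXITSTATUS(status) == 0)
		result = (fcntl(fd, F_GETFD) != -1);

	free(stack);
	close(fd);
	return result;
}
```

Similar checks for the other contexts would follow the same pattern:
share at creation, unshare in the child, change something in the child,
and verify that the parent does not observe the change.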
 | 285 |  | 
 | 286 | 9) Future Work | 
 | 287 | -------------- | 
 | 288 | The current implementation of unshare does not allow unsharing of | 
 | 289 | signals and signal handlers. Signals are complex to begin with and | 
 | 290 | to unshare signals and/or signal handlers of a currently running | 
 | 291 | process is even more complex. If in the future there is a specific | 
 | 292 | need to allow unsharing of signals and/or signal handlers, it can | 
 | 293 | be incrementally added to unshare without affecting legacy | 
 | 294 | applications using unshare. | 
 | 295 |  |