Blame - Documentation/filesystems/fuse.txt - android_kernel_oneplus_msm8996

Miklos Szeredi	334f485	2005-09-09 13:10:27 -0700	[diff] [blame]	1	Definitions
				2	~~~~~~~~~~~
				3
				4	Userspace filesystem:
				5
				6	A filesystem in which data and metadata are provided by an ordinary
				7	userspace process. The filesystem can be accessed normally through
				8	the kernel interface.
				9
				10	Filesystem daemon:
				11
				12	The process(es) providing the data and metadata of the filesystem.
				13
				14	Non-privileged mount (or user mount):
				15
				16	A userspace filesystem mounted by a non-privileged (non-root) user.
				17	The filesystem daemon is running with the privileges of the mounting
				18	user. NOTE: this is not the same as mounts allowed with the "user"
				19	option in /etc/fstab, which is not discussed here.
				20
				21	Mount owner:
				22
				23	The user who does the mounting.
				24
				25	User:
				26
				27	The user who is performing filesystem operations.
				28
				29	What is FUSE?
				30	~~~~~~~~~~~~~
				31
				32	FUSE is a userspace filesystem framework. It consists of a kernel
				33	module (fuse.ko), a userspace library (libfuse.*) and a mount utility
				34	(fusermount).
				35
				36	One of the most important features of FUSE is allowing secure,
				37	non-privileged mounts. This opens up new possibilities for the use of
				38	filesystems. A good example is sshfs: a secure network filesystem
				39	using the sftp protocol.
				40
				41	The userspace library and utilities are available from the FUSE
				42	homepage:
				43
				44	http://fuse.sourceforge.net/
				45
				46	Mount options
				47	~~~~~~~~~~~~~
				48
				49	'fd=N'
				50
				51	The file descriptor to use for communication between the userspace
				52	filesystem and the kernel. The file descriptor must have been
				53	obtained by opening the FUSE device ('/dev/fuse').
				54
				55	'rootmode=M'
				56
				57	The file mode of the filesystem's root in octal representation.
				58
				59	'user_id=N'
				60
				61	The numeric user id of the mount owner.
				62
				63	'group_id=N'
				64
				65	The numeric group id of the mount owner.
				66
				67	'default_permissions'
				68
				69	By default FUSE doesn't check file access permissions; the
				70	filesystem is free to implement its access policy or leave it to
				71	the underlying file access mechanism (e.g. in case of network
				72	filesystems). This option enables permission checking, restricting
				73	access based on file mode. This option is usually useful
				74	together with the 'allow_other' mount option.
				75
				76	'allow_other'
				77
				78	This option overrides the security measure restricting file access
				79	to the user mounting the filesystem. This option is by default only
				80	allowed to root, but this restriction can be removed with a
				81	(userspace) configuration option.
				82
Miklos Szeredi	334f485	2005-09-09 13:10:27 -0700	[diff] [blame]	83	'max_read=N'
				84
				85	With this option the maximum size of read operations can be set.
				86	The default is infinite. Note that the size of read requests is
				87	limited anyway to 32 pages (which is 128kbyte on i386).
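
As an illustration of how the options above fit together, the
following is a minimal sketch that assembles the option string and
computes the read size cap quoted in the 'max_read' description. The
function names and the constant name are hypothetical and chosen for
this example only; the sketch does not open '/dev/fuse' or call
mount().

```python
import stat

# Assumption for illustration: the 32-page request limit quoted above.
MAX_PAGES_PER_REQUEST = 32


def fuse_mount_data(fd, rootmode, user_id, group_id, extra=()):
    """Build the comma-separated option string from the documented options."""
    opts = [
        "fd=%d" % fd,
        "rootmode=%o" % rootmode,      # octal representation, as documented
        "user_id=%d" % user_id,
        "group_id=%d" % group_id,
    ]
    opts.extend(extra)                 # e.g. "default_permissions", "allow_other"
    return ",".join(opts)


def effective_max_read(page_size, max_read=None):
    """Largest read size actually possible given the per-request page limit."""
    cap = MAX_PAGES_PER_REQUEST * page_size
    return cap if max_read is None else min(max_read, cap)


if __name__ == "__main__":
    print(fuse_mount_data(3, stat.S_IFDIR, 1000, 1000, ["default_permissions"]))
    print(effective_max_read(4096))
```

With 4096-byte pages the cap works out to 32 * 4096 = 131072 bytes,
which matches the 128kbyte figure given for i386.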
				88
Miklos Szeredi	bacac38	2006-01-16 22:14:47 -0800	[diff] [blame]	89	Sysfs
				90	~~~~~
				91
				92	FUSE sets up the following hierarchy in sysfs:
				93
				94	/sys/fs/fuse/connections/N/
				95
				96	where N is an increasing number allocated to each new connection.
				97
				98	For each connection the following attributes are defined:
				99
				100	'waiting'
				101
				102	The number of requests which are waiting to be transferred to
				103	userspace or being processed by the filesystem daemon. If there is
				104	no filesystem activity and 'waiting' is non-zero, then the
				105	filesystem is hung or deadlocked.
				106
				107	'abort'
				108
				109	Writing anything into this file will abort the filesystem
				110	connection. This means that all waiting requests will be aborted and an
				111	error returned for all aborted and new requests.
				112
				113	Only a privileged user may read or write these attributes.
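
The two attributes can be used from a small script. The sketch below
assumes the hierarchy exactly as given above and takes the base
directory as a parameter so that it can be exercised against a test
directory; the function names are hypothetical. On a real system it
has to run as a privileged user.

```python
import os

# Hierarchy as named in the text above; treat the path as an assumption.
SYSFS_BASE = "/sys/fs/fuse/connections"


def read_waiting(conn, base=SYSFS_BASE):
    """Return the number of requests waiting on connection number 'conn'."""
    with open(os.path.join(base, str(conn), "waiting")) as f:
        return int(f.read().strip())


def abort_connection(conn, base=SYSFS_BASE):
    """Abort connection 'conn'; writing anything into 'abort' is sufficient."""
    with open(os.path.join(base, str(conn), "abort"), "w") as f:
        f.write("1\n")
```

A non-zero result from read_waiting() on an otherwise idle filesystem
is the hang indicator described above, and abort_connection() is the
corresponding remedy.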
				114
				115	Aborting a filesystem connection
				116	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				117
				118	It is possible to get into certain situations where the filesystem is
				119	not responding. Reasons for this may be:
				120
				121	a) Broken userspace filesystem implementation
				122
				123	b) Network connection down
				124
				125	c) Accidental deadlock
				126
				127	d) Malicious deadlock
				128
				129	(For more on c) and d) see later sections)
				130
				131	In any of these cases it may be useful to abort the connection to
				132	the filesystem. There are several ways to do this:
				133
				134	- Kill the filesystem daemon. Works in case of a) and b)
				135
				136	- Kill the filesystem daemon and all users of the filesystem. Works
				137	in all cases except some malicious deadlocks
				138
				139	- Use forced umount (umount -f). Works in all cases but only if
				140	filesystem is still attached (it hasn't been lazy unmounted)
				141
				142	- Abort filesystem through the sysfs interface. Most powerful
				143	method, always works.
				144
Miklos Szeredi	334f485	2005-09-09 13:10:27 -0700	[diff] [blame]	145	How do non-privileged mounts work?
				146	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				147
				148	Since the mount() system call is a privileged operation, a helper
				149	program (fusermount) is needed, which is installed setuid root.
				150
				151	The implication of providing non-privileged mounts is that the mount
				152	owner must not be able to use this capability to compromise the
				153	system. Obvious requirements arising from this are:
				154
				155	A) mount owner should not be able to get elevated privileges with the
				156	help of the mounted filesystem
				157
				158	B) mount owner should not get illegitimate access to information from
				159	other users' and the super user's processes
				160
				161	C) mount owner should not be able to induce undesired behavior in
				162	other users' or the super user's processes
				163
				164	How are requirements fulfilled?
				165	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				166
				167	A) The mount owner could gain elevated privileges by either:
				168
				169	1) creating a filesystem containing a device file, then opening
				170	this device
				171
				172	2) creating a filesystem containing a suid or sgid application,
				173	then executing this application
				174
				175	The solution is not to allow opening device files and ignore
				176	setuid and setgid bits when executing programs. To ensure this
				177	fusermount always adds "nosuid" and "nodev" to the mount options
				178	for non-privileged mounts.
				179
				180	B) If another user is accessing files or directories in the
				181	filesystem, the filesystem daemon serving requests can record the
				182	exact sequence and timing of operations performed. This
				183	information is otherwise inaccessible to the mount owner, so this
				184	counts as an information leak.
				185
				186	The solution to this problem will be presented in point 2) of C).
				187
				188	C) There are several ways in which the mount owner can induce
				189	undesired behavior in other users' processes, such as:
				190
				191	1) mounting a filesystem over a file or directory which the mount
				192	owner would otherwise not be able to modify (or could only
				193	make limited modifications).
				194
				195	This is solved in fusermount, by checking the access
				196	permissions on the mountpoint and only allowing the mount if
				197	the mount owner can do unlimited modification (has write
				198	access to the mountpoint, and mountpoint is not a "sticky"
				199	directory)
				200
				201	2) Even if 1) is solved the mount owner can change the behavior
				202	of other users' processes.
				203
				204	i) It can slow down or indefinitely delay the execution of a
				205	filesystem operation creating a DoS against the user or the
				206	whole system. For example a suid application locking a
				207	system file, and then accessing a file on the mount owner's
				208	filesystem could be stopped, thus causing the system
				209	file to be locked forever.
				210
				211	ii) It can present files or directories of unlimited length, or
				212	directory structures of unlimited depth, possibly causing a
				213	system process to eat up diskspace, memory or other
				214	resources, again causing DoS.
				215
				216	The solution to this as well as B) is to deny access to the
				217	filesystem for any process which the mount owner could not
				218	otherwise monitor or manipulate. Since a mount owner who can
				219	ptrace a process can do all of the above without using a FUSE
				220	mount, the same criteria as used in ptrace can be used to
				221	check if a process is allowed to access the filesystem or
				222	not.
				223
				224	Note that the ptrace check is not strictly necessary to
				225	prevent C/2/i; it is enough to check if the mount owner has enough
				226	privilege to send a signal to the process accessing the
				227	filesystem, since SIGSTOP can be used to get a similar effect.
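
The two measures attributed to fusermount under A) and C) 1) can be
sketched as follows. This is an illustration of the rules as stated
in this document, not fusermount's actual code; both function names
are hypothetical.

```python
import os
import stat


def add_safe_flags(options, privileged):
    """Force "nosuid" and "nodev" onto the options of a non-privileged mount."""
    opts = [o for o in options.split(",") if o]
    if not privileged:
        for flag in ("nosuid", "nodev"):
            if flag not in opts:
                opts.append(flag)
    return ",".join(opts)


def may_mount_on(mountpoint):
    """Allow the mount only where the caller can do unlimited modification.

    Per the text: the caller needs write access to the mountpoint, and the
    mountpoint must not be a "sticky" directory.
    """
    st = os.stat(mountpoint)
    sticky = stat.S_ISDIR(st.st_mode) and bool(st.st_mode & stat.S_ISVTX)
    return os.access(mountpoint, os.W_OK) and not sticky
```

For example, a world-writable sticky directory such as a typical /tmp
is rejected by may_mount_on(), while a directory created by the user
in their own home directory is accepted.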
				228
				229	I think these limitations are unacceptable?
				230	~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				231
				232	If a sysadmin trusts the users enough, or can ensure through other
				233	measures that system processes will never enter non-privileged
				234	mounts, they can relax the last limitation with a "user_allow_other"
				235	config option. If this config option is set, the mounting user can
				236	add the "allow_other" mount option which disables the check for other
				237	users' processes.
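
The check that "allow_other" disables can be modelled roughly as
below. This is a simplified sketch: it assumes that the ptrace-like
criterion reduces to comparing all of the accessing process's user
and group ids with those of the mount owner, which may not match the
kernel's exact rules. The record types and the function name are
hypothetical.

```python
from collections import namedtuple

# Hypothetical records for illustration; these are not kernel structures.
Creds = namedtuple("Creds", "uid euid suid gid egid sgid")
Mount = namedtuple("Mount", "user_id group_id allow_other")


def may_access(mount, creds):
    """Decide whether a process with 'creds' may enter the mount."""
    if mount.allow_other:
        return True                    # check disabled by the mount option
    # Assumed criterion: every id of the process matches the mount owner.
    return (
        creds.uid == creds.euid == creds.suid == mount.user_id
        and creds.gid == creds.egid == creds.sgid == mount.group_id
    )
```

Under this model a suid root program started by the mount owner is
refused, because its effective uid no longer matches the mount owner.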
				238
				239	Kernel - userspace interface
				240	~~~~~~~~~~~~~~~~~~~~~~~~~~~~
				241
				242	The following diagram shows how a filesystem operation (in this
				243	example unlink) is performed in FUSE.
				244
				245	NOTE: everything in this description is greatly simplified
				246
				247	|  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
				248	|                                    |
				249	|                                    |  >sys_read()
				250	|                                    |    >fuse_dev_read()
				251	|                                    |      >request_wait()
				252	|                                    |        [sleep on fc->waitq]
				253	|                                    |
				254	|  >sys_unlink()                     |
				255	|    >fuse_unlink()                  |
				256	|      [get request from             |
				257	|       fc->unused_list]             |
				258	|      >request_send()               |
				259	|        [queue req on fc->pending]  |
				260	|        [wake up fc->waitq]         |        [woken up]
				261	|        >request_wait_answer()      |
				262	|          [sleep on req->waitq]     |
				263	|                                    |      <request_wait()
				264	|                                    |      [remove req from fc->pending]
				265	|                                    |      [copy req to read buffer]
				266	|                                    |      [add req to fc->processing]
				267	|                                    |    <fuse_dev_read()
				268	|                                    |  <sys_read()
				269	|                                    |
				270	|                                    |  [perform unlink]
				271	|                                    |
				272	|                                    |  >sys_write()
				273	|                                    |    >fuse_dev_write()
				274	|                                    |      [look up req in fc->processing]
				275	|                                    |      [remove from fc->processing]
				276	|                                    |      [copy write buffer to req]
				277	|          [woken up]                |      [wake up req->waitq]
				278	|                                    |    <fuse_dev_write()
				279	|                                    |  <sys_write()
				280	|        <request_wait_answer()      |
				281	|      <request_send()               |
				282	|      [add request to               |
				283	|       fc->unused_list]             |
				284	|    <fuse_unlink()                  |
				285	|  <sys_unlink()                     |
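
The queue handling in the diagram can be mimicked in userspace with
threads. The sketch below is a toy model only: the class and function
names are hypothetical, 'fc->unused_list' is left out, and Python
threading primitives stand in for the kernel wait queues.

```python
import itertools
import threading
from collections import deque


class Request:
    def __init__(self, unique, opcode, arg):
        self.unique, self.opcode, self.arg = unique, opcode, arg
        self.reply = None
        self.done = threading.Event()          # stands in for req->waitq


class Connection:
    def __init__(self):
        self.lock = threading.Lock()
        self.waitq = threading.Condition(self.lock)   # models fc->waitq
        self.pending = deque()                        # models fc->pending
        self.processing = {}                          # models fc->processing
        self.ids = itertools.count(1)

    def request_send(self, opcode, arg):
        """Caller side: queue the request, wake the daemon, wait for a reply."""
        with self.lock:
            req = Request(next(self.ids), opcode, arg)
            self.pending.append(req)
            self.waitq.notify()
        req.done.wait()                        # like request_wait_answer()
        return req.reply

    def dev_read(self):
        """Daemon side: sleep until a request is pending, then take it."""
        with self.lock:
            while not self.pending:
                self.waitq.wait()
            req = self.pending.popleft()
            self.processing[req.unique] = req
            return req.unique, req.opcode, req.arg

    def dev_write(self, unique, reply):
        """Daemon side: match the reply to its request and wake the caller."""
        with self.lock:
            req = self.processing.pop(unique)
        req.reply = reply
        req.done.set()


def daemon(conn, files):
    """Toy filesystem daemon that serves a single request."""
    unique, opcode, name = conn.dev_read()
    if opcode == "UNLINK" and name in files:
        files.remove(name)                     # [perform unlink]
        conn.dev_write(unique, 0)
    else:
        conn.dev_write(unique, -2)             # -ENOENT


if __name__ == "__main__":
    conn, files = Connection(), {"file"}
    worker = threading.Thread(target=daemon, args=(conn, files))
    worker.start()
    print(conn.request_send("UNLINK", "file"))
    worker.join()
```

The caller blocks in request_send() until the daemon thread has read
the request and written back a reply, which is the same ordering the
diagram shows.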
				286
				287	There are a couple of ways in which to deadlock a FUSE filesystem.
				288	Since we are talking about unprivileged userspace programs,
				289	something must be done about these.
				290
				291	Scenario 1 - Simple deadlock
				292	-----------------------------
				293
				294	|  "rm /mnt/fuse/file"               |  FUSE filesystem daemon
				295	|                                    |
				296	|  >sys_unlink("/mnt/fuse/file")     |
				297	|    [acquire inode semaphore        |
				298	|     for "file"]                    |
				299	|    >fuse_unlink()                  |
				300	|      [sleep on req->waitq]         |
				301	|                                    |  <sys_read()
				302	|                                    |  >sys_unlink("/mnt/fuse/file")
				303	|                                    |    [acquire inode semaphore
				304	|                                    |     for "file"]
				305	|                                    |    DEADLOCK
				306
				307	The solution for this is to allow requests to be interrupted while
				308	they are in userspace:
				309
				310	|      [interrupted by signal]       |
				311	|    <fuse_unlink()                  |
				312	|    [release semaphore]             |    [semaphore acquired]
				313	|  <sys_unlink()                     |
				314	|                                    |    >fuse_unlink()
				315	|                                    |      [queue req on fc->pending]
				316	|                                    |      [wake up fc->waitq]
				317	|                                    |      [sleep on req->waitq]
				318
				319	If the filesystem daemon is single-threaded, this will stop here,
				320	since there's no other thread to dequeue and execute the request.
				321	In this case the solution is to kill the FUSE daemon as well. If
				322	there are multiple serving threads, you just have to kill them as
				323	long as any remain.
				324
				325	Moral: a filesystem which deadlocks can soon find itself dead.
				326
				327	Scenario 2 - Tricky deadlock
				328	----------------------------
				329
				330	This one needs a carefully crafted filesystem. It's a variation on
				331	the above, only the call back to the filesystem is not explicit,
				332	but is caused by a pagefault.
				333
				334	|  Kamikaze filesystem thread 1      |  Kamikaze filesystem thread 2
				335	|                                    |
				336	|  [fd = open("/mnt/fuse/file")]     |  [request served normally]
				337	|  [mmap fd to 'addr']               |
				338	|  [close fd]                        |  [FLUSH triggers 'magic' flag]
				339	|  [read a byte from addr]           |
				340	|    >do_page_fault()                |
				341	|      [find or create page]         |
				342	|      [lock page]                   |
				343	|      >fuse_readpage()              |
				344	|        [queue READ request]        |
				345	|        [sleep on req->waitq]       |
				346	|                                    |  [read request to buffer]
				347	|                                    |  [create reply header before addr]
				348	|                                    |  >sys_write(addr - headerlength)
				349	|                                    |    >fuse_dev_write()
				350	|                                    |      [look up req in fc->processing]
				351	|                                    |      [remove from fc->processing]
				352	|                                    |      [copy write buffer to req]
				353	|                                    |        >do_page_fault()
				354	|                                    |          [find or create page]
				355	|                                    |          [lock page]
				356	|                                    |          * DEADLOCK *
				357
				358	The solution is again to let the request be interrupted (not
				359	elaborated further).
				360
				361	An additional problem is that while the write buffer is being
				362	copied to the request, the request must not be interrupted. This
				363	is because the destination address of the copy may not be valid
				364	after the request is interrupted.
				365
				366	This is solved by doing the copy atomically, and allowing
				367	interruption while the page(s) belonging to the write buffer are
				368	faulted with get_user_pages(). The 'req->locked' flag indicates
				369	when the copy is taking place, and interruption is delayed until
				370	this flag is unset.
				371
Miklos Szeredi	bacac38	2006-01-16 22:14:47 -0800	[diff] [blame]	372	Scenario 3 - Tricky deadlock with asynchronous read
				373	---------------------------------------------------
				374
				375	The same situation as above, except thread-1 will wait on page lock
				376	and hence it will be uninterruptible as well. The solution is to
				377	abort the connection with forced umount (if mount is attached) or
				378	through the abort attribute in sysfs.