linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	tidy up around finish_automount()	Al Viro	2011-01-17	1	-28/+18
\| \| \| \| \| \| \| \|	do_add_mount() and mnt_clear_expiry() are not needed outside of namespace.c anymore, now that namei has finish_automount() to use. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	don't drop newmnt on error in do_add_mount()	Al Viro	2011-01-17	1	-11/+5
\| \| \| \| \| \| \| \|	That gets rid of the kludge in finish_automount() - we need to keep refcount on the vfsmount as-is until we evict it from expiry list. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Take the completion of automount into new helper	Al Viro	2011-01-17	1	-0/+33
\| \| \| \| \| \|	... and shift it from namei.c to namespace.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	VFS: Fix UP compile error in fs/namespace.c	Al Viro	2011-01-16	1	-7/+24
\| \| \| \| \| \| \| \|	mnt_longterm is there only on SMP Reported-and-tested-by: Joachim Eastwood <manabian@gmail.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	sanitize vfsmount refcounting changes	Al Viro	2011-01-16	1	-73/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of splitting refcount between (per-cpu) mnt_count and (SMP-only) mnt_longrefs, make all references contribute to mnt_count again and keep track of how many are longterm ones. Accounting rules for longterm count: * 1 for each fs_struct.root.mnt * 1 for each fs_struct.pwd.mnt * 1 for having non-NULL ->mnt_ns * decrement to 0 happens only under vfsmount lock exclusive That allows nice common case for mntput() - since we can't drop the final reference until after mnt_longterm has reached 0 due to the rules above, mntput() can grab vfsmount lock shared and check mnt_longterm. If it turns out to be non-zero (which is the common case), we know that this is not the final mntput() and can just blindly decrement percpu mnt_count. Otherwise we grab vfsmount lock exclusive and do usual decrement-and-check of percpu mnt_count. For fs_struct.c we have mnt_make_longterm() and mnt_make_shortterm(); namespace.c uses the latter in places where we don't already hold vfsmount lock exclusive and opencodes a few remaining spots where we need to manipulate mnt_longterm. Note that we mostly revert the code outside of fs/namespace.c back to what we used to have; in particular, normal code doesn't need to care about two kinds of references, etc. And we get to keep the optimization Nick's variant had bought us... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	fix old umount_tree() breakage	Al Viro	2011-01-16	1	-3/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Expiry-related code calls umount_tree() several times with the same list to collect vfsmounts to. Which is fine, except that umount_tree() implicitly assumed that the list would be empty on each call - it moves the victims over there and then iterates through the list kicking them out. It's almost idempotent, so everything nearly worked. However, mnt->ghosts handling (and thus expirability checks) had been broken - that part was not idempotent... The fix is trivial - use local temporary list, splice it to the the collector list when we are through. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Unexport do_add_mount() and add in follow_automount(), not ->d_automount()	David Howells	2011-01-16	1	-8/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Unexport do_add_mount() and make ->d_automount() return the vfsmount to be added rather than calling do_add_mount() itself. follow_automount() will then do the addition. This slightly complicates things as ->d_automount() normally wants to add the new vfsmount to an expiration list and start an expiration timer. The problem with that is that the vfsmount will be deleted if it has a refcount of 1 and the timer will not repeat if the expiration list is empty. To this end, we require the vfsmount to be returned from d_automount() with a refcount of (at least) 2. One of these refs will be dropped unconditionally. In addition, follow_automount() must get a 3rd ref around the call to do_add_mount() lest it eat a ref and return an error, leaving the mount we have open to being expired as we would otherwise have only 1 ref on it. d_automount() should also add the the vfsmount to the expiration list (by calling mnt_set_expiry()) and start the expiration timer before returning, if this mechanism is to be used. The vfsmount will be unlinked from the expiration list by follow_automount() if do_add_mount() fails. This patch also fixes the call to do_add_mount() for AFS to propagate the mount flags from the parent vfsmount. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Add a dentry op to allow processes to be held during pathwalk transit	David Howells	2011-01-16	1	-6/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a dentry op (d_manage) to permit a filesystem to hold a process and make it sleep when it tries to transit away from one of that filesystem's directories during a pathwalk. The operation is keyed off a new dentry flag (DCACHE_MANAGE_TRANSIT). The filesystem is allowed to be selective about which processes it holds and which it permits to continue on or prohibits from transiting from each flagged directory. This will allow autofs to hold up client processes whilst letting its userspace daemon through to maintain the directory or the stuff behind it or mounted upon it. The ->d_manage() dentry operation: int (d_manage)(struct path path, bool mounting_here); takes a pointer to the directory about to be transited away from and a flag indicating whether the transit is undertaken by do_add_mount() or do_move_mount() skipping through a pile of filesystems mounted on a mountpoint. It should return 0 if successful and to let the process continue on its way; -EISDIR to prohibit the caller from skipping to overmounted filesystems or automounting, and to use this directory; or some other error code to return to the user. ->d_manage() is called with namespace_sem writelocked if mounting_here is true and no other locks held, so it may sleep. However, if mounting_here is true, it may not initiate or wait for a mount or unmount upon the parameter directory, even if the act is actually performed by userspace. Within fs/namei.c, follow_managed() is extended to check with d_manage() first on each managed directory, before transiting away from it or attempting to automount upon it. follow_down() is renamed follow_down_one() and should only be used where the filesystem deliberately intends to avoid management steps (e.g. autofs). A new follow_down() is added that incorporates the loop done by all other callers of follow_down() (do_add/move_mount(), autofs and NFSD; whilst AFS, NFS and CIFS do use it, their use is removed by converting them to use d_automount()). The new follow_down() calls d_manage() as appropriate. It also takes an extra parameter to indicate if it is being called from mount code (with namespace_sem writelocked) which it passes to d_manage(). follow_down() ignores automount points so that it can be used to mount on them. __follow_mount_rcu() is made to abort rcu-walk mode if it hits a directory with DCACHE_MANAGE_TRANSIT set on the basis that we're probably going to have to sleep. It would be possible to enter d_manage() in rcu-walk mode too, and have that determine whether to abort or not itself. That would allow the autofs daemon to continue on in rcu-walk mode. Note that DCACHE_MANAGE_TRANSIT on a directory should be cleared when it isn't required as every tranist from that directory will cause d_manage() to be invoked. It can always be set again when necessary. ========================== WHAT THIS MEANS FOR AUTOFS ========================== Autofs currently uses the lookup() inode op and the d_revalidate() dentry op to trigger the automounting of indirect mounts, and both of these can be called with i_mutex held. autofs knows that the i_mutex will be held by the caller in lookup(), and so can drop it before invoking the daemon - but this isn't so for d_revalidate(), since the lock is only held on _some_ of the code paths that call it. This means that autofs can't risk dropping i_mutex from its d_revalidate() function before it calls the daemon. The bug could manifest itself as, for example, a process that's trying to validate an automount dentry that gets made to wait because that dentry is expired and needs cleaning up: mkdir S ffffffff8014e05a 0 32580 24956 Call Trace: [<ffffffff885371fd>] :autofs4:autofs4_wait+0x674/0x897 [<ffffffff80127f7d>] avc_has_perm+0x46/0x58 [<ffffffff8009fdcf>] autoremove_wake_function+0x0/0x2e [<ffffffff88537be6>] :autofs4:autofs4_expire_wait+0x41/0x6b [<ffffffff88535cfc>] :autofs4:autofs4_revalidate+0x91/0x149 [<ffffffff80036d96>] __lookup_hash+0xa0/0x12f [<ffffffff80057a2f>] lookup_create+0x46/0x80 [<ffffffff800e6e31>] sys_mkdirat+0x56/0xe4 versus the automount daemon which wants to remove that dentry, but can't because the normal process is holding the i_mutex lock: automount D ffffffff8014e05a 0 32581 1 32561 Call Trace: [<ffffffff80063c3f>] __mutex_lock_slowpath+0x60/0x9b [<ffffffff8000ccf1>] do_path_lookup+0x2ca/0x2f1 [<ffffffff80063c89>] .text.lock.mutex+0xf/0x14 [<ffffffff800e6d55>] do_rmdir+0x77/0xde [<ffffffff8005d229>] tracesys+0x71/0xe0 [<ffffffff8005d28d>] tracesys+0xd5/0xe0 which means that the system is deadlocked. This patch allows autofs to hold up normal processes whilst the daemon goes ahead and does things to the dentry tree behind the automouter point without risking a deadlock as almost no locks are held in d_manage() and none in d_automount(). Signed-off-by: David Howells <dhowells@redhat.com> Was-Acked-by: Ian Kent <raven@themaw.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	fs: scale mntget/mntput	Nick Piggin	2011-01-07	1	-43/+199
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The problem that this patch aims to fix is vfsmount refcounting scalability. We need to take a reference on the vfsmount for every successful path lookup, which often go to the same mount point. The fundamental difficulty is that a "simple" reference count can never be made scalable, because any time a reference is dropped, we must check whether that was the last reference. To do that requires communication with all other CPUs that may have taken a reference count. We can make refcounts more scalable in a couple of ways, involving keeping distributed counters, and checking for the global-zero condition less frequently. - check the global sum once every interval (this will delay zero detection for some interval, so it's probably a showstopper for vfsmounts). - keep a local count and only taking the global sum when local reaches 0 (this is difficult for vfsmounts, because we can't hold preempt off for the life of a reference, so a counter would need to be per-thread or tied strongly to a particular CPU which requires more locking). - keep a local difference of increments and decrements, which allows us to sum the total difference and hence find the refcount when summing all CPUs. Then, keep a single integer "long" refcount for slow and long lasting references, and only take the global sum of local counters when the long refcount is 0. This last scheme is what I implemented here. Attached mounts and process root and working directory references are "long" references, and everything else is a short reference. This allows scalable vfsmount references during path walking over mounted subtrees and unattached (lazy umounted) mounts with processes still running in them. This results in one fewer atomic op in the fastpath: mntget is now just a per-CPU inc, rather than an atomic inc; and mntput just requires a spinlock and non-atomic decrement in the common case. However code is otherwise bigger and heavier, so single threaded performance is basically a wash. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
*	fs: rename vfsmount counter helpers	Nick Piggin	2011-01-07	1	-11/+11
\| \| \| \| \| \| \| \|	Suggested by Andreas, mnt_ prefix is clearer namespace, follows kernel conventions better, and is easier for tab complete. I introduced these names so I'll admit they were not good choices. Signed-off-by: Nick Piggin <npiggin@kernel.dk>
*	fs: dcache remove d_mounted	Nick Piggin	2011-01-07	1	-3/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Rather than keep a d_mounted count in the dentry, set a dentry flag instead. The flag can be cleared by checking the hash table to see if there are any mounts left, which is not time critical because it is performed at detach time. The mounted state of a dentry is only used to speculatively take a look in the mount hash table if it is set -- before following the mount, vfsmount lock is taken and mount re-checked without races. This saves 4 bytes on 32-bit, nothing on 64-bit but it does provide a hole I might use later (and some configs have larger than 32-bit spinlocks which might make use of the hole). Autofs4 conversion and changelog by Ian Kent <raven@themaw.net>: In autofs4, when expring direct (or offset) mounts we need to ensure that we block user path walks into the autofs mount, which is covered by another mount. To do this we clear the mounted status so that follows stop before walking into the mount and are essentially blocked until the expire is completed. The automount daemon still finds the correct dentry for the umount due to the follow mount logic in fs/autofs4/root.c:autofs4_follow_link(), which is set as an inode operation for direct and offset mounts only and is called following the lookup that stopped at the covered mount. At the end of the expire the covering mount probably has gone away so the mounted status need not be restored. But we need to check this and only restore the mounted status if the expire failed. XXX: autofs may not work right if we have other mounts go over the top of it? Signed-off-by: Nick Piggin <npiggin@kernel.dk>
*	BKL: remove extraneous #include <smp_lock.h>	Arnd Bergmann	2010-11-17	1	-1/+0
\| \| \| \| \| \| \| \| \| \|	The big kernel lock has been removed from all these files at some point, leaving only the #include. Remove this too as a cleanup. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	vfs: fix infinite loop caused by clone_mnt race	Miklos Szeredi	2010-10-26	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \|	If clone_mnt() happens while mnt_make_readonly() is running, the cloned mount might have MNT_WRITE_HOLD flag set, which results in mnt_want_write() spinning forever on this mount. Needs CAP_SYS_ADMIN to trigger deliberately and unlikely to happen accidentally. But if it does happen it can hang the machine. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	BKL: Remove BKL from do_new_mount()	Jan Blunck	2010-10-04	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	After pushing down the BKL to the get_sb/fill_super operations of the filesystems that still make usage of the BKL it is safe to remove it from do_new_mount(). I've read through all the code formerly covered by the BKL inside do_kern_mount() and have satisfied myself that it doesn't need the BKL any more. Signed-off-by: Jan Blunck <jblunck@infradead.org> Cc: Matthew Wilcox <matthew@wil.cx> Signed-off-by: Arnd Bergmann <arnd@arndb.de>
*	VFS: Sanity check mount flags passed to change_mnt_propagation()	Valerie Aurora	2010-09-07	1	-1/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Sanity check the flags passed to change_mnt_propagation(). Exactly one flag should be set. Return EINVAL otherwise. Userspace can pass in arbitrary combinations of MS_* flags to mount(). do_change_type() is called if any of MS_SHARED, MS_PRIVATE, MS_SLAVE, or MS_UNBINDABLE is set. do_change_type() clears MS_REC and then calls change_mnt_propagation() with the rest of the user-supplied flags. change_mnt_propagation() clearly assumes only one flag is set but do_change_type() does not check that this is true. For example, mount() with flags MS_SHARED \| MS_RDONLY does not actually make the mount shared or read-only but does clear MNT_UNBINDABLE. Signed-off-by: Valerie Aurora <vaurora@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fs: brlock vfsmount_lock	Nick Piggin	2010-08-18	1	-66/+111
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	fs: brlock vfsmount_lock Use a brlock for the vfsmount lock. It must be taken for write whenever modifying the mount hash or associated fields, and may be taken for read when performing mount hash lookups. A new lock is added for the mnt-id allocator, so it doesn't need to take the heavy vfsmount write-lock. The number of atomics should remain the same for fastpath rlock cases, though code would be slightly slower due to per-cpu access. Scalability is not not be much improved in common cases yet, due to other locks (ie. dcache_lock) getting in the way. However path lookups crossing mountpoints should be one case where scalability is improved (currently requiring the global lock). The slowpath is slower due to use of brlock. On a 64 core, 64 socket, 32 node Altix system (high latency to remote nodes), a simple umount microbenchmark (mount --bind mnt mnt2 ; umount mnt2 loop 1000 times), before this patch it took 6.8s, afterwards took 7.1s, about 5% slower. Cc: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Nick Piggin <npiggin@kernel.dk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	vfs: remove unused MNT_STRICTATIME	Miklos Szeredi	2010-08-11	1	-1/+0
\| \| \| \| \| \| \| \| \|	Commit d0adde574b8487ef30f69e2d08bba769e4be513f added MNT_STRICTATIME but it isn't actually used (MS_STRICTATIME clears MNT_RELATIME and MNT_NOATIME rather than setting any mount flag). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	vfs: add helpers to get root and pwd	Miklos Szeredi	2010-08-11	1	-4/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Add three helpers that retrieve a refcounted copy of the root and cwd from the supplied fs_struct. get_fs_root() get_fs_pwd() get_fs_root_and_pwd() Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Merge branch 'for-linus' of git://git.infradead.org/users/eparis/notify	Linus Torvalds	2010-08-10	1	-0/+5
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	* 'for-linus' of git://git.infradead.org/users/eparis/notify: (132 commits) fanotify: use both marks when possible fsnotify: pass both the vfsmount mark and inode mark fsnotify: walk the inode and vfsmount lists simultaneously fsnotify: rework ignored mark flushing fsnotify: remove global fsnotify groups lists fsnotify: remove group->mask fsnotify: remove the global masks fsnotify: cleanup should_send_event fanotify: use the mark in handler functions audit: use the mark in handler functions dnotify: use the mark in handler functions inotify: use the mark in handler functions fsnotify: send fsnotify_mark to groups in event handling functions fsnotify: Exchange list heads instead of moving elements fsnotify: srcu to protect read side of inode and vfsmount locks fsnotify: use an explicit flag to indicate fsnotify_destroy_mark has been called fsnotify: use _rcu functions for mark list traversal fsnotify: place marks on object in order of group memory address vfs/fsnotify: fsnotify_close can delay the final work in fput fsnotify: store struct file not struct path ... Fix up trivial delete/modify conflict in fs/notify/inotify/inotify.c.
\| *	fsnotify: Infrastructure for per-mount watches	Andreas Gruenbacher	2010-07-28	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Per-mount watches allow groups to listen to fsnotify events on an entire mount. This patch simply adds and initializes the fields needed in the vfsmount struct to make this happen. Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Eric Paris <eparis@redhat.com>
\| *	fsnotify/vfsmount: add fsnotify fields to struct vfsmount	Andreas Gruenbacher	2010-07-28	1	-0/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch adds the list and mask fields needed to support vfsmount marks. These are the same fields fsnotify needs on an inode. They are not used, just declared and we note where the cleanup hook should be (the function is not yet defined) Signed-off-by: Andreas Gruenbacher <agruen@suse.de> Signed-off-by: Eric Paris <eparis@redhat.com>
* \|	Fix sget() race with failing mount	Al Viro	2010-08-09	1	-1/+1
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If sget() finds a matching superblock being set up, it'll grab an active reference to it and grab s_umount. That's fine - we'll wait for completion of foofs_get_sb() that way. However, if said foofs_get_sb() fails we'll end up holding the halfway-created superblock. deactivate_locked_super() called by foofs_get_sb() will just unlock the sucker since we are holding another active reference to it. What we need is a way to tell if superblock has been successfully set up. Unfortunately, neither ->s_root nor the check for MS_ACTIVE quite fit. Cheap and easy way, suitable for backport: new flag set by the (only) caller of ->get_sb(). If that flag isn't present by the time sget() grabbed s_umount on preexisting superblock it has found, it's seeing a stillborn and should just bury it with deactivate_locked_super() (and repeat the search). Longer term we want to set that flag in ->get_sb() instances (and check for it to distinguish between "sget() found us a live sb" and "sget() has allocated an sb, we need to set it up" in there, instead of checking ->s_root as we do now). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Cc: stable@kernel.org
*	Merge branch 'next' into for-linus	James Morris	2010-05-18	1	-13/+0
\|\
\| *	security: remove dead hook sb_post_pivotroot	Eric Paris	2010-04-12	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
\| *	security: remove dead hook sb_post_addmount	Eric Paris	2010-04-12	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
\| *	security: remove dead hook sb_post_remount	Eric Paris	2010-04-12	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
\| *	security: remove dead hook sb_umount_busy	Eric Paris	2010-04-12	1	-2/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
\| *	security: remove dead hook sb_umount_close	Eric Paris	2010-04-12	1	-1/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Unused hook. Remove. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
\| *	security: remove sb_check_sb hooks	Eric Paris	2010-04-12	1	-5/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Unused hook. Remove it. Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: James Morris <jmorris@namei.org>
* \|	Fix the regression created by "set S_DEAD on unlink()..." commit	Al Viro	2010-05-15	1	-3/+3
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	1) i_flags simply doesn't work for mount/unlink race prevention; we may have many links to file and rm on one of those obviously shouldn't prevent bind on top of another later on. To fix it right way we need to mark _dentry_ as unsuitable for mounting upon; new flag (DCACHE_CANT_MOUNT) is protected by d_flags and i_mutex on the inode in question. Set it (with dont_mount(dentry)) in unlink/rmdir/etc., check (with cant_mount(dentry)) in places in namespace.c that used to check for S_DEAD. Setting S_DEAD is still needed in places where we used to set it (for directories getting killed), since we rely on it for readdir/rmdir race prevention. 2) rename()/mount() protection has another bogosity - we unhash the target before we'd checked that it's not a mountpoint. Fixed. 3) ancient bogosity in pivot_root() - we locked i_mutex on the right directory, but checked S_DEAD on the different (and wrong) one. Noticed and fixed. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	vfs: add NOFOLLOW flag to umount(2)	Miklos Szeredi	2010-03-03	1	-1/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a new UMOUNT_NOFOLLOW flag to umount(2). This is needed to prevent symlink attacks in unprivileged unmounts (fuse, samba, ncpfs). Additionally, return -EINVAL if an unknown flag is used (and specify an explicitly unused flag: UMOUNT_UNUSED). This makes it possible for the caller to determine if a flag is supported or not. CC: Eugene Teo <eugene@redhat.com> CC: Michael Kerrisk <mtk.manpages@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Mirror MS_KERNMOUNT in ->mnt_flags	Al Viro	2010-03-03	1	-1/+1
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	get rid of useless vfsmount_lock use in put_mnt_ns()	Al Viro	2010-03-03	1	-6/+2
\| \| \| \| \| \| \| \|	It hadn't been needed since we'd sanitized the logics in mark_mounts_for_expiry() (which, in turn, used to be a rudiment of bad old times when namespace_sem was per-ns). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	take check for new events in namespace (guts of mounts_poll()) to namespace.c	Al Viro	2010-03-03	1	-0/+15
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	new helper: iterate_mounts()	Al Viro	2010-03-03	1	-0/+15
\| \| \| \| \| \| \|	apply function to vfsmounts in set returned by collect_mounts(), stop if it returns non-zero. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	VFS: Clean up shared mount flag propagation	Valerie Aurora	2010-03-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The handling of mount flags in set_mnt_shared() got a little tangled up during previous cleanups, with the following problems: * MNT_PNODE_MASK is defined as a literal constant when it should be a bitwise xor of other MNT_* flags * set_mnt_shared() clears and then sets MNT_SHARED (part of MNT_PNODE_MASK) * MNT_PNODE_MASK could use a comment in mount.h * MNT_PNODE_MASK is a terrible name, change to MNT_SHARED_MASK This patch fixes these problems. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Kill CL_PROPAGATION, sanitize fs/pnode.c:get_source()	Al Viro	2010-03-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	First of all, get_source() never results in CL_PROPAGATION alone. We either get CL_MAKE_SHARED (for the continuation of peer group) or CL_SLAVE (slave that is not shared) or both (beginning of peer group among slaves). Massage the code to make that explicit, kill CL_PROPAGATION test in clone_mnt() (nothing sets CL_MAKE_SHARED without CL_PROPAGATION and in clone_mnt() we are checking CL_PROPAGATION after we'd found that there's no CL_SLAVE, so the check for CL_MAKE_SHARED would do just as well). Fix comments, while we are at it... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	do_add_mount() should sanitize mnt_flags	Al Viro	2010-01-16	1	-0/+2
\| \| \| \| \| \| \| \|	MNT_WRITE_HOLD shouldn't leak into new vfsmount and neither should MNT_SHARED (the latter will be set properly, along with the rest of shared-subtree data structures) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	mnt_flags fixes in do_remount()	Al Viro	2010-01-16	1	-1/+5
\| \| \| \| \| \| \|	* need vfsmount_lock over modifying it * need to preserve MNT_SHARED/MNT_UNBINDABLE Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	attach_recursive_mnt() needs to hold vfsmount_lock over set_mnt_shared()	Al Viro	2010-01-16	1	-2/+2
\| \| \| \| \| \|	race in mnt_flags update Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	may_umount() needs namespace_sem	Al Viro	2010-01-16	1	-0/+2
\| \| \| \| \| \|	otherwise it races with clone_mnt() changing mnt_share/mnt_slaves Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	Revert "fix mismerge with Trond's stuff (create_mnt_ns() export is gone now)"	Linus Torvalds	2009-12-17	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This reverts commit e9496ff46a20a8592fdc7bdaaf41b45eb808d310. Quoth Al: "it's dependent on a lot of other stuff not currently in mainline and badly broken with current fs/namespace.c. Sorry, badly out-of-order cherry-pick from old queue. PS: there's a large pending series reworking the refcounting and lifetime rules for vfsmounts that will, among other things, allow to rip a subtree away _without_ dissolving connections in it, to be garbage-collected when all active references are gone. It's considerably saner wrt "is the subtree busy" logics, but it's nowhere near being ready for merge at the moment; this changeset is one of the things becoming possible with that sucker, but it certainly shouldn't have been picked during this cycle. My apologies..." Noticed-by: Eric Paris <eparis@redhat.com> Requested-by: Al Viro <viro@ZenIV.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	fix mismerge with Trond's stuff (create_mnt_ns() export is gone now)	Al Viro	2009-12-16	1	-2/+1
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	LSM: Pass original mount flags to security_sb_mount().	Tetsuo Handa	2009-10-12	1	-10/+10
\| \| \| \| \| \| \| \| \| \| \| \|	This patch allows LSM modules to determine based on original mount flags passed to mount(). A LSM module can get masked mount flags (if needed) by flags &= ~(MS_NOSUID \| MS_NOEXEC \| MS_NODEV \| MS_ACTIVE \| MS_NOATIME \| MS_NODIRATIME \| MS_RELATIME\| MS_KERNMOUNT \| MS_STRICTATIME); Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: James Morris <jmorris@namei.org>
*	fs: fix overflow in sys_mount() for in-kernel calls	Vegard Nossum	2009-09-24	1	-30/+47
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	sys_mount() reads/copies a whole page for its "type" parameter. When do_mount_root() passes a kernel address that points to an object which is smaller than a whole page, copy_mount_options() will happily go past this memory object, possibly dereferencing "wild" pointers that could be in any state (hence the kmemcheck warning, which shows that parts of the next page are not even allocated). (The likelihood of something going wrong here is pretty low -- first of all this only applies to kernel calls to sys_mount(), which are mostly found in the boot code. Secondly, I guess if the page was not mapped, exact_copy_from_user() _would_ in fact handle it correctly because of its access_ok(), etc. checks.) But it is much nicer to avoid the dubious reads altogether, by stopping as soon as we find a NUL byte. Is there a good reason why we can't do something like this, using the already existing strndup_from_user()? [akpm@linux-foundation.org: make copy_mount_string() static] [AV: fix compat mount breakage, which involves undoing akpm's change above] Reported-by: Ingo Molnar <mingo@elte.hu> Signed-off-by: Vegard Nossum <vegard.nossum@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Pekka Enberg <penberg@cs.helsinki.fi> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: al <al@dizzy.pdmi.ras.ru>
*	vfs: mnt_want_write_file(): fix special file handling	OGAWA Hirofumi	2009-08-07	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \|	I suspect that mnt_want_write_file() may have wrong assumption. I think mnt_want_write_file() is assuming it increments ->mnt_writers if (file->f_mode & FMODE_WRITE). But, if it's special_file(), it is false? Signed-off-by: OGAWA Hirofumi <hirofumi@mail.parknet.co.jp> Acked-by: Dave Hansen <dave@linux.vnet.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Nick Piggin <nickpiggin@yahoo.com.au> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	headers: mnt_namespace.h redux	Alexey Dobriyan	2009-07-08	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	Fix various silly problems wrt mnt_namespace.h: - exit_mnt_ns() isn't used, remove it - done that, sched.h and nsproxy.h inclusions aren't needed - mount.h inclusion was need for vfsmount_lock, but no longer - remove mnt_namespace.h inclusion from files which don't use anything from mnt_namespace.h Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
*	... and the same for vfsmount id/mount group id	Al Viro	2009-06-24	1	-4/+22
\| \| \| \|	Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	VFS: Switch init_mount_tree() to use the new create_mnt_ns() helper	Trond Myklebust	2009-06-24	1	-9/+2
\| \| \| \| \| \| \|	Eliminates some duplicated code... Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
*	VFS: Add VFS helper functions for setting up private namespaces	Trond Myklebust	2009-06-23	1	-8/+37
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The purpose of this patch is to improve the remote mount path lookup support for distributed filesystems such as the NFSv4 client. When given a mount command of the form "mount server:/foo/bar /mnt", the NFSv4 client is required to look up the filehandle for "server:/", and then look up each component of the remote mount path "foo/bar" in order to find the directory that is actually going to be mounted on /mnt. Following that remote mount path may involve following symlinks, crossing server-side mount points and even following referrals to filesystem volumes on other servers. Since the standard VFS path lookup code already supports walking paths that contain all these features (using in-kernel automounts for following referrals) we would like to be able to reuse that rather than duplicate the full path traversal functionality in the NFSv4 client code. This patch therefore defines a VFS helper function create_mnt_ns(), that sets up a temporary filesystem namespace and attaches a root filesystem to it. It exports the create_mnt_ns() and put_mnt_ns() function for use by filesystem modules. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>