summaryrefslogtreecommitdiffstats
path: root/fs/namei.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* vfs: rename: move d_move() upMiklos Szeredi2014-04-011-3/+2
| | | | | | | | | | | | | Move the d_move() in vfs_rename_dir() up, similarly to how it's done in vfs_rename_other(). The next patch will consolidate these two functions and this is the only structural difference between them. I'm not sure if doing the d_move() after the dput is even valid. But there may be a logical explanation for that. But moving the d_move() before the dput() (and the mutex_unlock()) should definitely not hurt. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: J. Bruce Fields <bfields@redhat.com>
* vfs: add d_is_dir()Miklos Szeredi2014-04-011-12/+11
| | | | | | | | | Add d_is_dir(dentry) helper which is analogous to S_ISDIR(). To avoid confusion, rename d_is_directory() to d_can_lookup(). Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Reviewed-by: J. Bruce Fields <bfields@redhat.com>
* rcuwalk: recheck mount_lock after mountpoint crossing attemptsAl Viro2014-03-231-16/+13
| | | | | | | | | | | | | | | | | | | | We can get false negative from __lookup_mnt() if an unrelated vfsmount gets moved. In that case legitimize_mnt() is guaranteed to fail, and we will fall back to non-RCU walk... unless we end up running into a hard error on a filesystem object we wouldn't have reached if not for that false negative. IOW, delaying that check until the end of pathname resolution is wrong - we should recheck right after we attempt to cross the mountpoint. We don't need to recheck unless we see d_mountpoint() being true - in that case even if we have just raced with mount/umount, we can simply go on as if we'd come at the moment when the sucker wasn't a mountpoint; if we run into a hard error as the result, it was a legitimate outcome. __lookup_mnt() returning NULL is different in that respect, since it might've happened due to operation on completely unrelated mountpoint. Cc: stable@vger.kernel.org Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: atomic f_pos accesses as per POSIXLinus Torvalds2014-03-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Our write() system call has always been atomic in the sense that you get the expected thread-safe contiguous write, but we haven't actually guaranteed that concurrent writes are serialized wrt f_pos accesses, so threads (or processes) that share a file descriptor and use "write()" concurrently would quite likely overwrite each others data. This violates POSIX.1-2008/SUSv4 Section XSI 2.9.7 that says: "2.9.7 Thread Interactions with Regular File Operations All of the following functions shall be atomic with respect to each other in the effects specified in POSIX.1-2008 when they operate on regular files or symbolic links: [...]" and one of the effects is the file position update. This unprotected file position behavior is not new behavior, and nobody has ever cared. Until now. Yongzhi Pan reported unexpected behavior to Michael Kerrisk that was due to this. This resolves the issue with a f_pos-specific lock that is taken by read/write/lseek on file descriptors that may be shared across threads or processes. Reported-by: Yongzhi Pan <panyongzhi@gmail.com> Reported-by: Michael Kerrisk <mtk.manpages@gmail.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* execve: use 'struct filename *' for executable name passingLinus Torvalds2014-02-051-0/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This changes 'do_execve()' to get the executable name as a 'struct filename', and to free it when it is done. This is what the normal users want, and it simplifies and streamlines their error handling. The controlled lifetime of the executable name also fixes a use-after-free problem with the trace_sched_process_exec tracepoint: the lifetime of the passed-in string for kernel users was not at all obvious, and the user-mode helper code used UMH_WAIT_EXEC to serialize the pathname allocation lifetime with the execve() having finished, which in turn meant that the trace point that happened after mm_release() of the old process VM ended up using already free'd memory. To solve the kernel string lifetime issue, this simply introduces "getname_kernel()" that works like the normal user-space getname() function, except with the source coming from kernel memory. As Oleg points out, this also means that we could drop the tcomm[] array from 'struct linux_binprm', since the pathname lifetime now covers setup_new_exec(). That would be a separate cleanup. Reported-by: Igor Zhbanov <i.zhbanov@samsung.com> Tested-by: Steven Rostedt <rostedt@goodmis.org> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* Fix mountpoint reference leakage in linkatOleg Drokin2014-01-311-1/+4
| | | | | | | | | | | | | | | | Recent changes to retry on ESTALE in linkat (commit 442e31ca5a49e398351b2954b51f578353fdf210) introduced a mountpoint reference leak and a small memory leak in case a filesystem link operation returns ESTALE which is pretty normal for distributed filesystems like lustre, nfs and so on. Free old_path in such a case. [AV: there was another missing path_put() nearby - on the previous goto retry] Signed-off-by: Oleg Drokin: <green@linuxhacker.ru> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* vfs: unexport the getname() symbolJeff Layton2014-01-311-1/+0
| | | | | | | Leaving getname() exported when putname() isn't is a bad idea. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* fs: add get_acl helperChristoph Hellwig2014-01-261-21/+3
| | | | | | | | | Factor out the code to get an ACL either from the inode or disk from check_acl, so that it can be used elsewhere later on. Signed-off-by: Christoph Hellwig <hch@lst.de> Reviewed-by: Jan Kara <jack@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* dcache: allow word-at-a-time name hashing with big-endian CPUsWill Deacon2013-12-121-6/+1
| | | | | | | | | | | | | | | | When explicitly hashing the end of a string with the word-at-a-time interface, we have to be careful which end of the word we pick up. On big-endian CPUs, the upper-bits will contain the data we're after, so ensure we generate our masks accordingly (and avoid hashing whatever random junk may have been sitting after the string). This patch adds a new dcache helper, bytemask_from_count, which creates a mask appropriate for the CPU endianness. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Will Deacon <will.deacon@arm.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fix bogus path_put() of nd->root after some unlazy_walk() failuresAl Viro2013-11-291-2/+1
| | | | | | | | | | Failure to grab reference to parent dentry should go through the same cleanup as nd->seq mismatch. As it is, we might end up with caller thinking it needs to path_put() nd->root, with obvious nasty results once we'd hit that bug enough times to drive the refcount of root dentry all the way to zero... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge git://git.infradead.org/users/eparis/auditLinus Torvalds2013-11-221-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull audit updates from Eric Paris: "Nothing amazing. Formatting, small bug fixes, couple of fixes where we didn't get records due to some old VFS changes, and a change to how we collect execve info..." Fixed conflict in fs/exec.c as per Eric and linux-next. * git://git.infradead.org/users/eparis/audit: (28 commits) audit: fix type of sessionid in audit_set_loginuid() audit: call audit_bprm() only once to add AUDIT_EXECVE information audit: move audit_aux_data_execve contents into audit_context union audit: remove unused envc member of audit_aux_data_execve audit: Kill the unused struct audit_aux_data_capset audit: do not reject all AUDIT_INODE filter types audit: suppress stock memalloc failure warnings since already managed audit: log the audit_names record type audit: add child record before the create to handle case where create fails audit: use given values in tty_audit enable api audit: use nlmsg_len() to get message payload length audit: use memset instead of trying to initialize field by field audit: fix info leak in AUDIT_GET requests audit: update AUDIT_INODE filter rule to comparator function audit: audit feature to set loginuid immutable audit: audit feature to only allow unsetting the loginuid audit: allow unsetting the loginuid (with priv) audit: remove CONFIG_AUDIT_LOGINUID_IMMUTABLE audit: loginuid functions coding style selinux: apply selinux checks on new audit message types ...
| * audit: add child record before the create to handle case where create failsJeff Layton2013-11-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Historically, when a syscall that creates a dentry fails, you get an audit record that looks something like this (when trying to create a file named "new" in "/tmp/tmp.SxiLnCcv63"): type=PATH msg=audit(1366128956.279:965): item=0 name="/tmp/tmp.SxiLnCcv63/new" inode=2138308 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023 This record makes no sense since it's associating the inode information for "/tmp/tmp.SxiLnCcv63" with the path "/tmp/tmp.SxiLnCcv63/new". The recent patch I posted to fix the audit_inode call in do_last fixes this, by making it look more like this: type=PATH msg=audit(1366128765.989:13875): item=0 name="/tmp/tmp.DJ1O8V3e4f/" inode=141 dev=fd:02 mode=040700 ouid=0 ogid=0 rdev=00:00 obj=staff_u:object_r:user_tmp_t:s15:c0.c1023 While this is more correct, if the creation of the file fails, then we have no record of the filename that the user tried to create. This patch adds a call to audit_inode_child to may_create. This creates an AUDIT_TYPE_CHILD_CREATE record that will sit in place until the create succeeds. When and if the create does succeed, then this record will be updated with the correct inode info from the create. This fixes what was broken in commit bfcec708. Commit 79f6530c should also be backported to stable v3.7+. Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com> Signed-off-by: Richard Guy Briggs <rgb@redhat.com> Signed-off-by: Eric Paris <eparis@redhat.com>
* | Merge branch 'for-linus' of ↵Linus Torvalds2013-11-131-126/+196
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "All kinds of stuff this time around; some more notable parts: - RCU'd vfsmounts handling - new primitives for coredump handling - files_lock is gone - Bruce's delegations handling series - exportfs fixes plus misc stuff all over the place" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (101 commits) ecryptfs: ->f_op is never NULL locks: break delegations on any attribute modification locks: break delegations on link locks: break delegations on rename locks: helper functions for delegation breaking locks: break delegations on unlink namei: minor vfs_unlink cleanup locks: implement delegations locks: introduce new FL_DELEG lock flag vfs: take i_mutex on renamed file vfs: rename I_MUTEX_QUOTA now that it's not used for quotas vfs: don't use PARENT/CHILD lock classes for non-directories vfs: pull ext4's double-i_mutex-locking into common code exportfs: fix quadratic behavior in filehandle lookup exportfs: better variable name exportfs: move most of reconnect_path to helper function exportfs: eliminate unused "noprogress" counter exportfs: stop retrying once we race with rename/remove exportfs: clear DISCONNECTED on all parents sooner exportfs: more detailed comment for path_reconnect ...
| * | locks: break delegations on linkJ. Bruce Fields2013-11-091-4/+32
| | | | | | | | | | | | | | | | | | | | | | | | Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Dustin Kirkland <dustin.kirkland@gazzang.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | locks: break delegations on renameJ. Bruce Fields2013-11-091-4/+43
| | | | | | | | | | | | | | | | | | | | | Cc: David Howells <dhowells@redhat.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | locks: helper functions for delegation breakingJ. Bruce Fields2013-11-091-10/+3
| | | | | | | | | | | | | | | | | | | | | | | | We'll need the same logic for rename and link. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | locks: break delegations on unlinkJ. Bruce Fields2013-11-091-3/+39
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to break delegations on any operation that changes the set of links pointing to an inode. Start with unlink. Such operations also hold the i_mutex on a parent directory. Breaking a delegation may require waiting for a timeout (by default 90 seconds) in the case of a unresponsive NFS client. To avoid blocking all directory operations, we therefore drop locks before waiting for the delegation. The logic then looks like: acquire locks ... test for delegation; if found: take reference on inode release locks wait for delegation break drop reference on inode retry It is possible this could never terminate. (Even if we take precautions to prevent another delegation being acquired on the same inode, we could get a different inode on each retry.) But this seems very unlikely. The initial test for a delegation happens after the lock on the target inode is acquired, but the directory inode may have been acquired further up the call stack. We therefore add a "struct inode **" argument to any intervening functions, which we use to pass the inode back up to the caller in the case it needs a delegation synchronously broken. Cc: David Howells <dhowells@redhat.com> Cc: Tyler Hicks <tyhicks@canonical.com> Cc: Dustin Kirkland <dustin.kirkland@gazzang.com> Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | namei: minor vfs_unlink cleanupJ. Bruce Fields2013-11-091-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | We'll be using dentry->d_inode in one more place. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | vfs: take i_mutex on renamed fileJ. Bruce Fields2013-11-091-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A read delegation is used by NFSv4 as a guarantee that a client can perform local read opens without informing the server. The open operation takes the last component of the pathname as an argument, thus is also a lookup operation, and giving the client the above guarantee means informing the client before we allow anything that would change the set of names pointing to the inode. Therefore, we need to break delegations on rename, link, and unlink. We also need to prevent new delegations from being acquired while one of these operations is in progress. We could add some completely new locking for that purpose, but it's simpler to use the i_mutex, since that's already taken by all the operations we care about. The single exception is rename. So, modify rename to take the i_mutex on the file that is being renamed. Also fix up lockdep and Documentation/filesystems/directory-locking to reflect the change. Acked-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | dcache: fix outdated DCACHE_NEED_LOOKUP commentJ. Bruce Fields2013-11-091-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The DCACHE_NEED_LOOKUP case referred to here was removed with 39e3c9553f34381a1b664c27b0c696a266a5735e "vfs: remove DCACHE_NEED_LOOKUP". There are only four real_lookup() callers and all of them pass in an unhashed dentry just returned from d_alloc. Signed-off-by: J. Bruce Fields <bfields@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | VFS: Put a small type field into struct dentry::d_flagsDavid Howells2013-11-091-57/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Put a type field into struct dentry::d_flags to indicate if the dentry is one of the following types that relate particularly to pathwalk: Miss (negative dentry) Directory "Automount" directory (defective - no i_op->lookup()) Symlink Other (regular, socket, fifo, device) The type field is set to one of the first five types on a dentry by calls to __d_instantiate() and d_obtain_alias() from information in the inode (if one is given). The type is cleared by dentry_unlink_inode() when it reconstitutes an existing dentry as a negative dentry. Accessors provided are: d_set_type(dentry, type) d_is_directory(dentry) d_is_autodir(dentry) d_is_symlink(dentry) d_is_file(dentry) d_is_negative(dentry) d_is_positive(dentry) A bunch of checks in pathname resolution switched to those. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | get rid of {lock,unlock}_rcu_walk()Al Viro2013-11-091-24/+14
| | | | | | | | | | | | | | | | | | those have become aliases for rcu_read_{lock,unlock}() Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | RCU'd vfsmountsAl Viro2013-11-091-24/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * RCU-delayed freeing of vfsmounts * vfsmount_lock replaced with a seqlock (mount_lock) * sequence number from mount_lock is stored in nameidata->m_seq and used when we exit RCU mode * new vfsmount flag - MNT_SYNC_UMOUNT. Set by umount_tree() when its caller knows that vfsmount will have no surviving references. * synchronize_rcu() done between unlocking namespace_sem in namespace_unlock() and doing pending mntput(). * new helper: legitimize_mnt(mnt, seq). Checks the mount_lock sequence number against seq, then grabs reference to mnt. Then it rechecks mount_lock again to close the race and either returns success or drops the reference it has acquired. The subtle point is that in case of MNT_SYNC_UMOUNT we can simply decrement the refcount and sod off - aforementioned synchronize_rcu() makes sure that final mntput() won't come until we leave RCU mode. We need that, since we don't want to end up with some lazy pathwalk racing with umount() and stealing the final mntput() from it - caller of umount() may expect it to return only once the fs is shut down and we don't want to break that. In other cases (i.e. with MNT_SYNC_UMOUNT absent) we have to do full-blown mntput() in case of mount_lock sequence number mismatch happening just as we'd grabbed the reference, but in those cases we won't be stealing the final mntput() from anything that would care. * mntput_no_expire() doesn't lock anything on the fast path now. Incidentally, SMP and UP cases are handled the same way - no ifdefs there. * normal pathname resolution does *not* do any writes to mount_lock. It does, of course, bump the refcounts of vfsmount and dentry in the very end, but that's it. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | split __lookup_mnt() in two functionsAl Viro2013-10-251-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | Instead of passing the direction as argument (and checking it on every step through the hash chain), just have separate __lookup_mnt() and __lookup_mnt_last(). And use the standard iterators... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | fs/namei.c: fix new kernel-doc warningRandy Dunlap2013-10-221-1/+2
|/ / | | | | | | | | | | | | | | | | | | Add @path parameter to fix kernel-doc warning. Also fix a spello/typo. Warning(fs/namei.c:2304): No description found for parameter 'path' Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | atomic_open: take care of EEXIST in no-open case with O_CREAT|O_EXCL in ↵Al Viro2013-09-171-14/+19
| | | | | | | | | | | | fs/namei.c Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | vfs: don't set FILE_CREATED before calling ->atomic_open()Miklos Szeredi2013-09-171-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If O_CREAT|O_EXCL are passed to open, then we know that either - the file is successfully created, or - the operation fails in some way. So previously we set FILE_CREATED before calling ->atomic_open() so the filesystem doesn't have to. This, however, led to bugs in the implementation that went unnoticed when the filesystem didn't check for existence, yet returned success. To prevent this kind of bug, require filesystems to always explicitly set FILE_CREATED on O_CREAT|O_EXCL and verify this in the VFS. Also added a couple more verifications for the result of atomic_open(): - Warn if filesystem set FILE_CREATED despite the lack of O_CREAT. - Warn if filesystem set FILE_CREATED but gave a negative dentry. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Add missing unlocks to error paths of mountpoint_last.Dave Jones2013-09-101-1/+4
| | | | | | | | | | Signed-off-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | ... and fold the renamed __vfs_follow_link() into its only callerAl Viro2013-09-101-24/+14
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | fs: remove vfs_follow_linkChristoph Hellwig2013-09-101-8/+2
| | | | | | | | | | | | | | | | | | | | | | | | For a long time no filesystem has been using vfs_follow_link, and as seen by recent filesystem submissions any new use is accidental as well. Remove vfs_follow_link, document the replacement in Documentation/filesystems/porting and also rename __vfs_follow_link to match its only caller better. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | Merge branch 'for-linus' of ↵Linus Torvalds2013-09-101-53/+59
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 3 (of many) from Al Viro: "Waiman's conversion of d_path() and bits related to it, kern_path_mountpoint(), several cleanups and fixes (exportfs one is -stable fodder, IMO). There definitely will be more... ;-/" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: split read_seqretry_or_unlock(), convert d_walk() to resulting primitives dcache: Translating dentry into pathname without taking rename_lock autofs4 - fix device ioctl mount lookup introduce kern_path_mountpoint() rename user_path_umountat() to user_path_mountpoint_at() take unlazy_walk() into umount_lookup_last() Kill indirect include of file.h from eventfd.h, use fdget() in cgroup.c prune_super(): sb->s_op is never NULL exportfs: don't assume that ->iterate() won't feed us too long entries afs: get rid of redundant ->d_name.len checks
| * | introduce kern_path_mountpoint()Al Viro2013-09-091-11/+24
| | | | | | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | rename user_path_umountat() to user_path_mountpoint_at()Al Viro2013-09-091-12/+11
| | | | | | | | | | | | | | | | | | | | | ... and move the extern from linux/namei.h to fs/internal.h, along with that of vfs_path_lookup(). Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * | take unlazy_walk() into umount_lookup_last()Al Viro2013-09-091-33/+27
| | | | | | | | | | | | | | | | | | ... and massage it a bit to reduce nesting Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | vfs: make sure we don't have a stale root path if unlazy_walk() failsLinus Torvalds2013-09-101-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I moved the RCU walk termination into unlazy_walk(), I didn't copy quite all of it: for the successful RCU termination we properly add the necessary reference counts to our temporary copy of the root path, but for the failure case we need to make sure that any temporary root path information is cleared out (since it does _not_ have the proper reference counts from the RCU lookup). We could clean up this mess by just always dropping the temporary root information, but Al points out that that would mean that a single lookup through symlinks could see multiple different root entries if it races with another thread doing chroot. Not that I think we should really care (we had that before too, back before we had a copy of the root path in the nameidata). Al says he has a cunning plan. In the meantime, this is the minimal fix for the problem, even if it's not all that pretty. Reported-by: Mace Moneta <moneta.mace@gmail.com> Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | vfs: fix dentry RCU to refcounting possibly sleeping dput()Linus Torvalds2013-09-091-53/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the fix that the last two commits indirectly led up to - making sure that we don't call dput() in a bad context on the dentries we've looked up in RCU mode after the sequence count validation fails. This basically expands d_rcu_to_refcount() into the callers, and then fixes the callers to delay the dput() in the failure case until _after_ we've dropped all locks and are no longer in an RCU-locked region. The case of 'complete_walk()' was trivial, since its failure case did the unlock_rcu_walk() directly after the call to d_rcu_to_refcount(), and as such that is just a pure expansion of the function with a trivial movement of the resulting dput() to after 'unlock_rcu_walk()'. In contrast, the unlazy_walk() case was much more complicated, because not only does convert two different dentries from RCU to be reference counted, but it used to not call unlock_rcu_walk() at all, and instead just returned an error and let the caller clean everything up in "terminate_walk()". Happily, one of the dentries in question (called "parent" inside unlazy_walk()) is the dentry of "nd->path", which terminate_walk() wants a refcount to anyway for the non-RCU case. So what the new and improved unlazy_walk() does is to first turn that dentry into a refcounted one, and once that is set up, the error cases can continue to use the terminate_walk() helper for cleanup, but for the non-RCU case. Which makes it possible to drop out of RCU mode if we actually hit the sequence number failure case. Acked-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | vfs: use lockred "dead" flag to mark unrecoverably dead dentriesLinus Torvalds2013-09-081-18/+5
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This simplifies the RCU to refcounting code in particular. I was originally intending to leave this for later, but walking through all the dput() logic (see previous commit), I realized that the dput() "might_sleep()" check was misleadingly weak. And I removed it as misleading, both for performance profiling and for debugging. However, the might_sleep() debugging case is actually true: the final dput() can indeed sleep, if the inode of the dentry that you are releasing ends up sleeping at iput time (see dentry_iput()). So the problem with the might_sleep() in dput() wasn't that it wasn't true, it was that it wasn't actually testing and triggering on the interesting case. In particular, just about *any* dput() can indeed sleep, if you happen to race with another thread deleting the file in question, and you then lose the race to the be the last dput() for that file. But because it's a very rare race, the debugging code would never trigger it in practice. Why is this problematic? The new d_rcu_to_refcount() (see commit 15570086b590: "vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock()") does a dput() for the failure case, and it does it under the RCU lock. So potentially sleeping really is a bug. But there's no way I'm going to fix this with the previous complicated "lockref_get_or_lock()" interface. And rather than revert to the old and crufty nested dentry locking code (which did get this right by delaying the reference count updates until they were verified to be safe), let's make forward progress. Cc: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus' of ↵Linus Torvalds2013-09-051-0/+182
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs pile 1 from Al Viro: "Unfortunately, this merge window it'll have a be a lot of small piles - my fault, actually, for not keeping #for-next in anything that would resemble a sane shape ;-/ This pile: assorted fixes (the first 3 are -stable fodder, IMO) and cleanups + %pd/%pD formats (dentry/file pathname, up to 4 last components) + several long-standing patches from various folks. There definitely will be a lot more (starting with Miklos' check_submount_and_drop() series)" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (26 commits) direct-io: Handle O_(D)SYNC AIO direct-io: Implement generic deferred AIO completions add formats for dentry/file pathnames kvm eventfd: switch to fdget powerpc kvm: use fdget switch fchmod() to fdget switch epoll_ctl() to fdget switch copy_module_from_fd() to fdget git simplify nilfs check for busy subtree ibmasmfs: don't bother passing superblock when not needed don't pass superblock to hypfs_{mkdir,create*} don't pass superblock to hypfs_diag_create_files don't pass superblock to hypfs_vm_create_files() oprofile: get rid of pointless forward declarations of struct super_block oprofilefs_create_...() do not need superblock argument oprofilefs_mkdir() doesn't need superblock argument don't bother with passing superblock to oprofile_create_stats_files() oprofile: don't bother with passing superblock to ->create_files() don't bother passing sb to oprofile_create_files() coh901318: don't open-code simple_read_from_buffer() ...
| * | vfs: allow umount to handle mountpoints without revalidating themJeff Layton2013-09-041-0/+182
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Christopher reported a regression where he was unable to unmount a NFS filesystem where the root had gone stale. The problem is that d_revalidate handles the root of the filesystem differently from other dentries, but d_weak_revalidate does not. We could simply fix this by making d_weak_revalidate return success on IS_ROOT dentries, but there are cases where we do want to revalidate the root of the fs. A umount is really a special case. We generally aren't interested in anything but the dentry and vfsmount that's attached at that point. If the inode turns out to be stale we just don't care since the intent is to stop using it anyway. Try to handle this situation better by treating umount as a special case in the lookup code. Have it resolve the parent using normal means, and then do a lookup of the final dentry without revalidating it. In most cases, the final lookup will come out of the dcache, but the case where there's a trailing symlink or !LAST_NORM entry on the end complicates things a bit. Cc: Neil Brown <neilb@suse.de> Reported-by: Christopher T Vogan <cvogan@us.ibm.com> Signed-off-by: Jeff Layton <jlayton@redhat.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | vfs: reimplement d_rcu_to_refcount() using lockref_get_or_lock()Linus Torvalds2013-09-021-26/+64
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This moves __d_rcu_to_refcount() from <linux/dcache.h> into fs/namei.c and re-implements it using the lockref infrastructure instead. It also adds a lot of comments about what is actually going on, because turning a dentry that was looked up using RCU into a long-lived reference counted entry is one of the more subtle parts of the rcu walk. We also used to be _particularly_ subtle in unlazy_walk() where we re-validate both the dentry and its parent using the same sequence count. We used to do it by nesting the locks and then verifying the sequence count just once. That was silly, because nested locking is expensive, but the sequence count check is not. So this just re-validates the dentry and the parent separately, avoiding the nested locking, and making the lockref lookup possible. Acked-by: Waiman Long <waiman.long@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | vfs: make the dentry cache use the lockref infrastructureWaiman Long2013-08-291-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This just replaces the dentry count/lock combination with the lockref structure that contains both a count and a spinlock, and does the mechanical conversion to use the lockref infrastructure. There are no semantic changes here, it's purely syntactic. The reference lockref implementation uses the spinlock exactly the same way that the old dcache code did, and the bulk of this patch is just expanding the internal "d_count" use in the dcache code to use "d_lockref.count" instead. This is purely preparation for the real change to make the reference count updates be lockless during the 3.12 merge window. [ As with the previous commit, this is a rewritten version of a concept originally from Waiman, so credit goes to him, blame for any errors goes to me. Waiman's patch had some semantic differences for taking advantage of the lockless update in dget_parent(), while this patch is intentionally a pure search-and-replace change with no semantic changes. - Linus ] Signed-off-by: Waiman Long <Waiman.Long@hp.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Revert "fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flink"Linus Torvalds2013-08-281-3/+7
|/ | | | | | | | | | | | | | | | | | | | | This reverts commit bb2314b47996491bbc5add73633905c3120b6268. It wasn't necessarily wrong per se, but we're still busily discussing the exact details of this all, so I'm going to revert it for now. It's true that you can already do flink() through /proc and that flink() isn't new. But as Brad Spengler points out, some secure environments do not mount proc, and flink adds a new interface that can avoid path lookup of the source for those kinds of environments. We may re-do this (and even mark it for stable backporting back in 3.11 and possibly earlier) once the whole discussion about the interface is done. Cc: Andy Lutomirski <luto@amacapital.net> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Oleg Nesterov <oleg@redhat.com> Cc: Brad Spengler <spender@grsecurity.net> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* fs: Allow unprivileged linkat(..., AT_EMPTY_PATH) aka flinkAndy Lutomirski2013-08-051-7/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | Every now and then someone proposes a new flink syscall, and this spawns a long discussion of whether it would be a security problem. I think that this is missing the point: flink is *already* allowed without privilege as long as /proc is mounted -- it's called AT_SYMLINK_FOLLOW. Now that O_TMPFILE is here, the ability to create a file with O_TMPFILE, write it, and link it in is very convenient. The only problem is that it requires that /proc be mounted so that you can do: linkat(AT_FDCWD, "/proc/self/fd/<tmpfd>", dfd, path, AT_SYMLINK_NOFOLLOW) This sucks -- it's much nicer to do: linkat(tmpfd, "", dfd, path, AT_EMPTY_PATH) Let's allow it. If this turns out to be excessively scary, it we could instead require that the inode in question be I_LINKABLE, but this seems pointless given the /proc situation Signed-off-by: Andy Lutomirski <luto@amacapital.net> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Safer ABI for O_TMPFILEAl Viro2013-07-131-1/+1
| | | | | | | | [suggested by Rasmus Villemoes] make O_DIRECTORY | O_RDWR part of O_TMPFILE; that will fail on old kernels in a lot more cases than what I came up with. And make sure O_CREAT doesn't get there... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Don't pass inode to ->d_hash() and ->d_compare()Linus Torvalds2013-06-291-4/+3
| | | | | | | | | | | | Instances either don't look at it at all (the majority of cases) or only want it to find the superblock (which can be had as dentry->d_sb). A few cases that want more are actually safe with dentry->d_inode - the only precaution needed is the check that it hadn't been replaced with NULL by rmdir() or by overwriting rename(), which case should be simply treated as cache miss. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* allow the temp files created by open() to be linked toAl Viro2013-06-291-2/+14
| | | | | | | | O_TMPFILE | O_CREAT => linkat() with AT_SYMLINK_FOLLOW and /proc/self/fd/<n> as oldpath (i.e. flink()) will create a link O_TMPFILE | O_CREAT | O_EXCL => ENOENT on attempt to link those guys Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* [O_TMPFILE] it's still short a few helpers, but infrastructure should be OK ↵Al Viro2013-06-291-0/+60
| | | | | | now... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* allow build_open_flags() to return an errorAl Viro2013-06-291-4/+4
| | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* do_last(): fix missing checks for LAST_BIND caseAl Viro2013-06-291-21/+3
| | | | | | | /proc/self/cwd with O_CREAT should fail with EISDIR. /proc/self/exe, OTOH, should fail with ENOTDIR when opened with O_DIRECTORY. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* use can_lookup() instead of direct checks of ->i_op->lookupAl Viro2013-06-151-2/+2
| | | | | | a couple of places got missed back when Linus has introduced that one... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>