path: root/fs
Commit message  Author  Age  Files  Lines
* Merge tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client  Linus Torvalds  2016-12-16  8  -211/+540
|\
| |   Pull ceph updates from Ilya Dryomov:
| |    "A varied set of changes:
| |
| |      - a large rework of cephx auth code to cope with CONFIG_VMAP_STACK
| |        (myself). Also fixed a deadlock caused by a bogus allocation on
| |        the writeback path and authorize reply verification.
| |
| |      - a fix for long stalls during fsync (Jeff Layton). The client now
| |        has a way to force the MDS log flush, leading to ~100x speedups
| |        in some synthetic tests.
| |
| |      - a new [no]require_active_mds mount option (Zheng Yan). On mount,
| |        we will now check whether any of the MDSes are available and bail
| |        rather than block if none are. This check can be avoided by
| |        specifying the "no" option.
| |
| |      - a couple of MDS cap handling fixes and a few assorted patches
| |        throughout"
| |
| |   * tag 'ceph-for-4.10-rc1' of git://github.com/ceph/ceph-client: (32 commits)
| |     libceph: remove now unused finish_request() wrapper
| |     libceph: always signal completion when done
| |     ceph: avoid creating orphan object when checking pool permission
| |     ceph: properly set issue_seq for cap release
| |     ceph: add flags parameter to send_cap_msg
| |     ceph: update cap message struct version to 10
| |     ceph: define new argument structure for send_cap_msg
| |     ceph: move xattr initialzation before the encoding past the ceph_mds_caps
| |     ceph: fix minor typo in unsafe_request_wait
| |     ceph: record truncate size/seq for snap data writeback
| |     ceph: check availability of mds cluster on mount
| |     ceph: fix splice read for no Fc capability case
| |     ceph: try getting buffer capability for readahead/fadvise
| |     ceph: fix scheduler warning due to nested blocking
| |     ceph: fix printing wrong return variable in ceph_direct_read_write()
| |     crush: include mapper.h in mapper.c
| |     rbd: silence bogus -Wmaybe-uninitialized warning
| |     libceph: no need to drop con->mutex for ->get_authorizer()
| |     libceph: drop len argument of *verify_authorizer_reply()
| |     libceph: verify authorize reply on connect
| |     ...
| * libceph: always signal completion when done  Ilya Dryomov  2016-12-14  1  -1/+1
| |   r_safe_completion is currently, and has always been, signaled only if
| |   on-disk ack was requested. It's there for fsync and syncfs, which wait
| |   for in-flight writes to flush - all data write requests set ONDISK.
| |
| |   However, the pool perm check code introduced in 4.2 sends a write
| |   request with only ACK set. An unfortunately timed syncfs can then hang
| |   forever: r_safe_completion won't be signaled because only an unsafe
| |   reply was requested.
| |
| |   We could patch ceph_osdc_sync() to skip !ONDISK write requests, but
| |   that is somewhat incomplete and yet another special case. Instead,
| |   rename this completion to r_done_completion and always signal it when
| |   the OSD client is done with the request, whether unsafe, safe, or
| |   error. This is a bit cleaner and helps with the cancellation code.
| |
| |   Reported-by: Yan, Zheng <zyan@redhat.com>
| |   Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * ceph: avoid creating orphan object when checking pool permission  Yan, Zheng  2016-12-14  1  -0/+9
| |   The pool permission check needs to write to the first object. For a
| |   snapshot, however, the head of the first object may already have been
| |   deleted. Skip the check for snapshot inodes to avoid creating an
| |   orphan object.
| |
| |   Link: http://tracker.ceph.com/issues/18211
| |   Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: properly set issue_seq for cap release  Yan, Zheng  2016-12-12  1  -0/+1
| |   Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: add flags parameter to send_cap_msg  Jeff Layton  2016-12-12  1  -10/+15
| |   Add a flags parameter to send_cap_msg, so we can request expedited
| |   service from the MDS when we know we'll be waiting on the result.
| |
| |   Set that flag in the case of try_flush_caps. The callers of that
| |   function generally wait synchronously on the result, so it's
| |   beneficial to ask the server to expedite it.
| |
| |   Signed-off-by: Jeff Layton <jlayton@redhat.com>
| |   Reviewed-by: Yan, Zheng <zyan@redhat.com>
| * ceph: update cap message struct version to 10  Jeff Layton  2016-12-12  1  -6/+33
| |   The userland ceph has MClientCaps at struct version 10. This brings
| |   the kernel up to the same version.
| |
| |   For now, all of the new stuff is set to default values, including the
| |   flags field, which will be conditionally set in a later patch.
| |
| |   Note that we don't need to set the change_attr and btime to anything
| |   since we aren't currently setting the feature flag. The MDS should
| |   ignore those values.
| |
| |   Signed-off-by: Jeff Layton <jlayton@redhat.com>
| |   Reviewed-by: Yan, Zheng <zyan@redhat.com>
| * ceph: define new argument structure for send_cap_msg  Jeff Layton  2016-12-12  1  -99/+126
| |   When we get to this many arguments, it's hard to work with positional
| |   parameters. send_cap_msg is already at 25 arguments, with more needed.
| |
| |   Define a new args structure and pass a pointer to it to send_cap_msg.
| |   Eventually it might make sense to embed one of these inside
| |   ceph_cap_snap instead of tracking individual fields.
| |
| |   Signed-off-by: Jeff Layton <jlayton@redhat.com>
| |   Reviewed-by: Yan, Zheng <zyan@redhat.com>
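
The argument-structure pattern described in the commit above can be sketched in userland C. This is only an illustration under stated assumptions: `struct demo_cap_msg_args` and `demo_send_cap_msg()` are hypothetical names with a handful of made-up fields, not the structure or function in fs/ceph/caps.c, and the "send" merely formats a summary string.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>

/* Hypothetical, heavily reduced argument bundle (assumption: the real
 * kernel structure has many more fields and different names). */
struct demo_cap_msg_args {
    uint64_t ino;
    int      op;
    uint32_t seq;
    uint32_t flags;  /* a later field can be added without touching callers */
};

/* Takes one pointer instead of a long positional list.  Formats a summary
 * into buf rather than sending anything; returns snprintf()'s result. */
static int demo_send_cap_msg(const struct demo_cap_msg_args *arg,
                             char *buf, size_t len)
{
    assert(arg != NULL && buf != NULL);
    return snprintf(buf, len, "ino=%llu op=%d seq=%u flags=%u",
                    (unsigned long long)arg->ino, arg->op,
                    (unsigned)arg->seq, (unsigned)arg->flags);
}
```

A caller fills in only the fields it cares about with designated initializers, for example `struct demo_cap_msg_args a = { .ino = 42, .op = 3, .seq = 7 };`, and every field it does not name is zero-initialized.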
| * ceph: move xattr initialzation before the encoding past the ceph_mds_caps  Jeff Layton  2016-12-12  1  -7/+7
| |   Just for clarity. This part is inside the header, so it makes sense to
| |   group it with the rest of the stuff in the header.
| |
| |   Signed-off-by: Jeff Layton <jlayton@redhat.com>
| |   Reviewed-by: Yan, Zheng <zyan@redhat.com>
| * ceph: fix minor typo in unsafe_request_wait  Jeff Layton  2016-12-12  1  -1/+1
| |   Signed-off-by: Jeff Layton <jlayton@redhat.com>
| |   Reviewed-by: Yan, Zheng <zyan@redhat.com>
| * ceph: record truncate size/seq for snap data writeback  Yan, Zheng  2016-12-12  3  -13/+22
| |   Dirty snapshot data needs to be flushed unconditionally. If it was
| |   created before truncation, writeback should use the old truncate
| |   size/seq.
| |
| |   Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: check availability of mds cluster on mount  Yan, Zheng  2016-12-12  4  -11/+182
| |   Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: fix splice read for no Fc capability case  Yan, Zheng  2016-12-12  1  -54/+66
| |   When the iov_iter type is ITER_PIPE, copy_page_to_iter() increases the
| |   page's reference count and adds the page to a pipe_buffer. It also
| |   sets the pipe_buffer's ops to page_cache_pipe_buf_ops. The confirm
| |   callback in page_cache_pipe_buf_ops expects the page to be from the
| |   page cache and uptodate; otherwise it returns an error.
| |
| |   In the ceph_sync_read() case, pages are not from the page cache, so we
| |   can't call copy_page_to_iter() when the iov_iter type is ITER_PIPE.
| |   The fix is to use iov_iter_get_pages_alloc() to allocate pages for the
| |   pipe (the code is similar to default_file_splice_read).
| |
| |   Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: try getting buffer capability for readahead/fadvise  Yan, Zheng  2016-12-12  4  -11/+73
| |   For the readahead/fadvise cases, the caller of ceph_readpages does not
| |   hold the buffer capability. Pages can be added to the page cache while
| |   there is no buffer capability, which can cause data integrity issues.
| |
| |   Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: fix scheduler warning due to nested blocking  Nikolay Borisov  2016-12-12  1  -3/+9
| |   try_get_cap_refs can be used as a condition in wait_event* calls.
| |   This is all fine until it has to call __ceph_do_pending_vmtruncate,
| |   which in turn acquires the i_truncate_mutex. This leads to a situation
| |   in which a task's state is !TASK_RUNNING and at the same time it's
| |   trying to acquire a sleeping primitive. In essence, nested sleeping
| |   primitives are being used. This causes the following warning:
| |
| |   WARNING: CPU: 22 PID: 11064 at kernel/sched/core.c:7631 __might_sleep+0x9f/0xb0()
| |   do not call blocking ops when !TASK_RUNNING; state=1 set at [<ffffffff8109447d>] prepare_to_wait_event+0x5d/0x110
| |   ipmi_msghandler tcp_scalable ib_qib dca ib_mad ib_core ib_addr ipv6
| |   CPU: 22 PID: 11064 Comm: fs_checker.pl Tainted: G O 4.4.20-clouder2 #6
| |   Hardware name: Supermicro X10DRi/X10DRi, BIOS 1.1a 10/16/2015
| |    0000000000000000 ffff8838b416fa88 ffffffff812f4409 ffff8838b416fad0
| |    ffffffff81a034f2 ffff8838b416fac0 ffffffff81052b46 ffffffff81a0432c
| |    0000000000000061 0000000000000000 0000000000000000 ffff88167bda54a0
| |   Call Trace:
| |    [<ffffffff812f4409>] dump_stack+0x67/0x9e
| |    [<ffffffff81052b46>] warn_slowpath_common+0x86/0xc0
| |    [<ffffffff81052bcc>] warn_slowpath_fmt+0x4c/0x50
| |    [<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
| |    [<ffffffff8109447d>] ? prepare_to_wait_event+0x5d/0x110
| |    [<ffffffff8107767f>] __might_sleep+0x9f/0xb0
| |    [<ffffffff81612d30>] mutex_lock+0x20/0x40
| |    [<ffffffffa04eea14>] __ceph_do_pending_vmtruncate+0x44/0x1a0 [ceph]
| |    [<ffffffffa04fa692>] try_get_cap_refs+0xa2/0x320 [ceph]
| |    [<ffffffffa04fd6f5>] ceph_get_caps+0x255/0x2b0 [ceph]
| |    [<ffffffff81094370>] ? wait_woken+0xb0/0xb0
| |    [<ffffffffa04f2c11>] ceph_write_iter+0x2b1/0xde0 [ceph]
| |    [<ffffffff81613f22>] ? schedule_timeout+0x202/0x260
| |    [<ffffffff8117f01a>] ? kmem_cache_free+0x1ea/0x200
| |    [<ffffffff811b46ce>] ? iput+0x9e/0x230
| |    [<ffffffff81077632>] ? __might_sleep+0x52/0xb0
| |    [<ffffffff81156147>] ? __might_fault+0x37/0x40
| |    [<ffffffff8119e123>] ? cp_new_stat+0x153/0x170
| |    [<ffffffff81198cfa>] __vfs_write+0xaa/0xe0
| |    [<ffffffff81199369>] vfs_write+0xa9/0x190
| |    [<ffffffff811b6d01>] ? set_close_on_exec+0x31/0x70
| |    [<ffffffff8119a056>] SyS_write+0x46/0xa0
| |
| |   This happens because wait_event_interruptible can interfere with the
| |   mutex locking code, since they both fiddle with the task state.
| |
| |   Fix the issue by using the newly-added nested blocking infrastructure
| |   in 61ada528dea0 ("sched/wait: Provide infrastructure to deal with
| |   nested blocking").
| |
| |   Link: https://lwn.net/Articles/628628/
| |   Signed-off-by: Nikolay Borisov <kernel@kyup.com>
| |   Signed-off-by: Yan, Zheng <zyan@redhat.com>
| * ceph: fix printing wrong return variable in ceph_direct_read_write()  Zhi Zhang  2016-12-12  1  -1/+1
| |   Fix printing of the wrong return variable for
| |   invalidate_inode_pages2_range in ceph_direct_read_write().
| |
| |   Signed-off-by: Zhi Zhang <zhang.david2011@gmail.com>
| |   Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| * libceph: drop len argument of *verify_authorizer_reply()  Ilya Dryomov  2016-12-12  1  -2/+2
| |   The length of the reply is protocol-dependent - for cephx it's
| |   ceph_x_authorize_reply. Nothing sensible can be passed from the
| |   messenger layer anyway.
| |
| |   Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
| |   Reviewed-by: Sage Weil <sage@redhat.com>
* | Merge branch 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs  Linus Torvalds  2016-12-16  14  -805/+1101
|\ \
| | |   Pull overlayfs updates from Miklos Szeredi:
| | |    "This update contains:
| | |
| | |      - try to clone on copy-up
| | |
| | |      - allow renaming a directory
| | |
| | |      - split source into manageable chunks
| | |
| | |      - misc cleanups and fixes
| | |
| | |     It does not contain the read-only fd data inconsistency fix, which
| | |     Al didn't like. I'll leave that to the next year..."
| | |
| | |   * 'overlayfs-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/vfs: (36 commits)
| | |     ovl: fix reStructuredText syntax errors in documentation
| | |     ovl: fix return value of ovl_fill_super
| | |     ovl: clean up kstat usage
| | |     ovl: fold ovl_copy_up_truncate() into ovl_copy_up()
| | |     ovl: create directories inside merged parent opaque
| | |     ovl: opaque cleanup
| | |     ovl: show redirect_dir mount option
| | |     ovl: allow setting max size of redirect
| | |     ovl: allow redirect_dir to default to "on"
| | |     ovl: check for emptiness of redirect dir
| | |     ovl: redirect on rename-dir
| | |     ovl: lookup redirects
| | |     ovl: consolidate lookup for underlying layers
| | |     ovl: fix nested overlayfs mount
| | |     ovl: check namelen
| | |     ovl: split super.c
| | |     ovl: use d_is_dir()
| | |     ovl: simplify lookup
| | |     ovl: check lower existence of rename target
| | |     ovl: rename: simplify handling of lower/merged directory
| | |     ...
| * | ovl: fix return value of ovl_fill_super  Geliang Tang  2016-12-16  1  -0/+2
| | |   If kcalloc() fails, the return value of ovl_fill_super() is -EINVAL,
| | |   not -ENOMEM. This patch sets the value to -ENOMEM before calling
| | |   kcalloc(), and sets it back to -EINVAL afterwards.
| | |
| | |   Signed-off-by: Geliang Tang <geliangtang@gmail.com>
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
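
The error-code handling described in the ovl_fill_super fix can be sketched in userland C. This is a minimal sketch under stated assumptions, not the kernel function: `demo_fill_super()`, its `config_ok` parameter, and the injectable `alloc` callback are all hypothetical, with `calloc()` standing in for `kcalloc()`.

```c
#include <errno.h>
#include <stdlib.h>

/* Hypothetical stand-in for a mount-time setup function.  `alloc` lets a
 * caller inject a failing allocator, since plain calloc() rarely fails. */
static int demo_fill_super(size_t nlayers,
                           void *(*alloc)(size_t, size_t), int config_ok)
{
    int err;
    void *stack;

    err = -ENOMEM;           /* returned if the allocation below fails */
    stack = alloc(nlayers, sizeof(void *));
    if (!stack)
        goto out;

    err = -EINVAL;           /* restore the default for later validity checks */
    if (!config_ok)
        goto out_free;

    err = 0;
out_free:
    free(stack);
out:
    return err;
}

/* Allocator that always fails, used to exercise the -ENOMEM path. */
static void *demo_failing_alloc(size_t n, size_t size)
{
    (void)n;
    (void)size;
    return NULL;
}
```

With this shape, a failed allocation reports -ENOMEM while every later "goto out" that relies on the default still reports -EINVAL.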
| * | ovl: clean up kstat usage  Al Viro  2016-12-16  4  -39/+45
| | |   FWIW, there's a bit of abuse of struct kstat in overlayfs object
| | |   creation paths - for one thing, it ends up with a very small subset of
| | |   struct kstat (mode + rdev), for another it also needs link in case of
| | |   symlinks and ends up passing it separately.
| | |
| | |   IMO it would be better to introduce a separate object for that.
| | |
| | |   In principle, we might even lift that thing into general API and
| | |   switch ->mkdir()/->mknod()/->symlink() to identical calling
| | |   conventions. Hell knows, perhaps ->create() as well...
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: fold ovl_copy_up_truncate() into ovl_copy_up()  Amir Goldstein  2016-12-16  3  -37/+13
| | |   This removes code duplication.
| | |
| | |   Signed-off-by: Amir Goldstein <amir73il@gmail.com>
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: create directories inside merged parent opaque  Amir Goldstein  2016-12-16  1  -2/+12
| | |   The benefit of making directories opaque on creation is that lookups
| | |   can stop short when they reach the originally created directory,
| | |   instead of continuing down the entire depth of the parent directory
| | |   stack.
| | |
| | |   The best case is an overlay with N layers performing a lookup for a
| | |   first-level directory that exists only in upper. In that case, there
| | |   will be only one lookup instead of N.
| | |
| | |   Signed-off-by: Amir Goldstein <amir73il@gmail.com>
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
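
The "one lookup instead of N" claim above can be illustrated with a toy model in userland C. The sketch assumes a simplified view in which each layer either has the directory or not and a lookup walks from upper (index 0) downwards; `struct demo_layer_entry` and `demo_count_lookups()` are hypothetical names, not overlayfs code.

```c
#include <stddef.h>

/* Hypothetical per-layer view of one directory name. */
struct demo_layer_entry {
    int exists;  /* the directory is present in this layer */
    int opaque;  /* the directory hides everything below it */
};

/* Returns how many per-layer lookups are performed, walking from the
 * upper layer (index 0) down through n layers. */
static size_t demo_count_lookups(const struct demo_layer_entry *layers,
                                 size_t n)
{
    size_t i, lookups = 0;

    for (i = 0; i < n; i++) {
        lookups++;
        if (layers[i].exists && layers[i].opaque)
            break;  /* nothing below can be merged in, so stop here */
    }
    return lookups;
}
```

For a directory that exists only in upper of a four-layer overlay, the model performs four lookups when the directory is not opaque and one when it is.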
| * | ovl: opaque cleanup  Miklos Szeredi  2016-12-16  4  -31/+25
| | |   oe->opaque is set for
| | |
| | |    a) whiteouts
| | |    b) directories having the "trusted.overlay.opaque" xattr
| | |
| | |   Case b can be simplified, since setting the xattr always implies
| | |   setting oe->opaque. Also once set, the opaque flag is never cleared.
| | |
| | |   Don't need to set opaque flag for non-directories.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: show redirect_dir mount option  Amir Goldstein  2016-12-16  1  -0/+3
| | |   Show the value of redirect_dir in /proc/mounts.
| | |
| | |   Signed-off-by: Amir Goldstein <amir73il@gmail.com>
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: allow setting max size of redirect  Miklos Szeredi  2016-12-16  1  -3/+7
| | |   Add a module option to allow tuning the max size of absolute
| | |   redirects. Default is 256.
| | |
| | |   Size of relative redirects is naturally limited by the underlying
| | |   filesystem's max filename length (usually 255).
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: allow redirect_dir to default to "on"  Miklos Szeredi  2016-12-16  2  -0/+19
| | |   This patch introduces a kernel config option and a module param. Both
| | |   can be used independently to turn the default value of redirect_dir on
| | |   or off.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: check for emptiness of redirect dir  Amir Goldstein  2016-12-16  1  -9/+22
| | |   Before the redirect_dir feature was introduced, the condition
| | |   !ovl_lower_positive(dentry) for a directory implied that it is a pure
| | |   upper directory, which may be removed if empty.
| | |
| | |   Now that a directory can be a redirect, it is possible that upper does
| | |   not cover any lower (i.e. !ovl_lower_positive(dentry)), but the
| | |   directory is a merge (with a redirected path) and may be non-empty.
| | |
| | |   Check for this case in ovl_remove_upper().
| | |
| | |   This change fixes the following test case from rename-pop-dir.py of
| | |   unionmount-testsuite:
| | |
| | |       """Remove dir and rename old name"""
| | |       d = ctx.non_empty_dir()
| | |       d2 = ctx.no_dir()
| | |
| | |       ctx.rmdir(d, err=ENOTEMPTY)
| | |       ctx.rename(d, d2)
| | |       ctx.rmdir(d, err=ENOENT)
| | |       ctx.rmdir(d2, err=ENOTEMPTY)
| | |
| | |       ./run --ov rename-pop-dir
| | |       /mnt/a/no_dir103: Expected error (Directory not empty) was not produced
| | |
| | |   Signed-off-by: Amir Goldstein <amir73il@gmail.com>
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: redirect on rename-dir  Miklos Szeredi  2016-12-16  6  -28/+176
| | |   Current code returns EXDEV when a directory would need to be copied up
| | |   to move. We could copy up the directory tree in this case, but there's
| | |   another, simpler solution: point to the old lower directory from the
| | |   moved upper directory.
| | |
| | |   This is achieved with a "trusted.overlay.redirect" xattr storing the
| | |   path relative to the root of the overlay. After such an attribute has
| | |   been set, the directory can be moved without further actions required.
| | |
| | |   This is a backward incompatible feature: old kernels won't be able to
| | |   correctly mount an overlay containing redirected directories.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: lookup redirects  Miklos Szeredi  2016-12-16  4  -2/+123
| | |   If a directory has the "trusted.overlay.redirect" xattr, it means that
| | |   the value of the xattr should be used to find the underlying directory
| | |   on the next lower layer.
| | |
| | |   The redirect may be relative or absolute. Absolute redirects begin
| | |   with a slash.
| | |
| | |   A relative redirect means: instead of the current dentry's name use
| | |   the value of the redirect to find the directory in the next lower
| | |   layer. Relative redirects must not contain a slash.
| | |
| | |   An absolute redirect means: look up the directory relative to the root
| | |   of the overlay using the value of the redirect in the next lower
| | |   layer.
| | |
| | |   Redirects work on lower layers as well.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
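
The relative/absolute redirect rules in the commit above can be sketched as a path computation in userland C. This is a sketch under stated assumptions: `demo_next_lower_path()` is a hypothetical helper that works on plain strings, whereas the kernel operates on dentries and xattrs; only the rules themselves (absolute redirects start with a slash, relative ones replace the name and must not contain a slash) come from the commit message.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Compute the path to look up in the next lower layer.
 *   parent   - path of the parent directory in the lower layer, e.g. "/a/b"
 *   name     - the current dentry's name
 *   redirect - redirect xattr value, or NULL if the directory has none
 * Returns 0 on success, -1 on a malformed redirect or if out is too small.
 */
static int demo_next_lower_path(const char *parent, const char *name,
                                const char *redirect, char *out, size_t len)
{
    int n;

    assert(parent && name && out);

    if (redirect && redirect[0] == '/') {
        /* Absolute: relative to the overlay root, so the parent is ignored. */
        n = snprintf(out, len, "%s", redirect);
    } else {
        const char *leaf = redirect ? redirect : name;

        /* Relative: replaces the name only and must not contain a slash. */
        if (redirect && (redirect[0] == '\0' || strchr(redirect, '/')))
            return -1;
        n = snprintf(out, len, "%s/%s",
                     strcmp(parent, "/") ? parent : "", leaf);
    }
    return (n < 0 || (size_t)n >= len) ? -1 : 0;
}
```

For a directory named "new" under "/a/b", a relative redirect of "old" yields "/a/b/old", an absolute redirect of "/x/old" yields "/x/old", and no redirect yields "/a/b/new".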
| * | ovl: consolidate lookup for underlying layers  Miklos Szeredi  2016-12-16  1  -70/+86
| | |   Use a common helper for lookup of upper and lower layers. This paves
| | |   the way for looking up directory redirects.
| | |
| | |   No functional change.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: fix nested overlayfs mount  Amir Goldstein  2016-12-16  1  -2/+2
| | |   When the upper overlayfs checks "trusted.overlay.*" xattr on the
| | |   underlying overlayfs mount, it gets -EPERM, which confuses the upper
| | |   overlayfs.
| | |
| | |   Fix this by returning -EOPNOTSUPP instead of -EPERM from
| | |   ovl_own_xattr_get() and ovl_own_xattr_set(). This behavior is
| | |   consistent with the behavior of ovl_listxattr(), which filters out the
| | |   private overlayfs xattrs.
| | |
| | |   Note: nested overlays are deprecated. But this change makes sense
| | |   regardless: these xattrs are private to the overlay and should always
| | |   be hidden. Hence getting and setting them should indicate this.
| | |
| | |   [SzMi: Use EOPNOTSUPP instead of ENODATA and use it for both getting
| | |   and setting "trusted.overlay." xattrs. This is a perfectly valid error
| | |   code for "we don't support this prefix", which is the case here.]
| | |
| | |   Signed-off-by: Amir Goldstein <amir73il@gmail.com>
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: check namelen  Miklos Szeredi  2016-12-16  3  -21/+35
| | |   We already calculate f_namelen in statfs as the maximum of the name
| | |   lengths provided by the filesystems taking part in the overlay.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: split super.c  Miklos Szeredi  2016-12-16  6  -546/+572
| | |   fs/overlayfs/super.c is the biggest of the overlayfs source files and
| | |   it contains various utility functions as well as the rather
| | |   complicated lookup code. Split these parts out to separate files.
| | |
| | |   Before:
| | |
| | |    1446 fs/overlayfs/super.c
| | |
| | |   After:
| | |
| | |     919 fs/overlayfs/super.c
| | |     267 fs/overlayfs/namei.c
| | |     235 fs/overlayfs/util.c
| | |      51 fs/overlayfs/ovl_entry.h
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: use d_is_dir()  Miklos Szeredi  2016-12-16  1  -2/+2
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: simplify lookup  Miklos Szeredi  2016-12-16  1  -29/+25
| | |   If encountering a non-directory, then stop looking at lower layers.
| | |
| | |   In this case the oe->opaque flag is not set anymore, which doesn't
| | |   matter since the existence of a lower file is now checked at
| | |   remove/rename time.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: check lower existence of rename target  Miklos Szeredi  2016-12-16  1  -52/+11
| | |   Check if something exists on the lower layer(s) under the target of
| | |   rename to decide if the directory needs to be marked "opaque".
| | |
| | |   Marking opaque is done before the rename, and on failure the marking
| | |   was undone. Also the opaque xattr was removed if the target didn't
| | |   cover anything.
| | |
| | |   This patch changes behavior so that removal of "opaque" is not done in
| | |   either of the above cases. This means that a directory may have the
| | |   opaque flag even if it doesn't cover anything. However this shouldn't
| | |   affect the performance or semantics of the overlay, while simplifying
| | |   the code.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: rename: simplify handling of lower/merged directory  Miklos Szeredi  2016-12-16  2  -21/+12
| | |   d_is_dir() is safe to call on a negative dentry. Use this fact to
| | |   simplify handling of the lower or merged directories.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: get rid of PURE type  Miklos Szeredi  2016-12-16  3  -13/+6
| | |   The remaining uses of __OVL_PATH_PURE can be replaced by
| | |   ovl_dentry_is_opaque().
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: check lower existence when removing  Miklos Szeredi  2016-12-16  3  -3/+56
| | |   Currently ovl_lookup() checks existence of lower file even if there's
| | |   a non-directory on upper (which is always opaque). This is done so
| | |   that remove can decide whether a whiteout is needed or not.
| | |
| | |   It would be better to defer this check to unlink, since most of the
| | |   time the gathered information about opaqueness will be unused.
| | |
| | |   This adds a helper ovl_lower_positive() that checks if there's
| | |   anything on the lower layer(s).
| | |
| | |   The following patches also introduce changes to how the "opaque"
| | |   attribute is updated on directories: this attribute is added when the
| | |   directory is created or moved over a whiteout or object covering
| | |   something on the lower layer. However following changes will allow the
| | |   attribute to remain on the directory after being moved, even if the
| | |   new location doesn't cover anything. Because of this, we need to check
| | |   lower layers even for opaque directories, so that whiteout is only
| | |   created when necessary.
| | |
| | |   This function will later be also used to decide about marking a
| | |   directory opaque, so deal with negative dentries as well. When dealing
| | |   with negative, it's enough to check for being a whiteout.
| | |
| | |   If the dentry is positive but not upper then it also obviously needs
| | |   whiteout/opaque.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: add ovl_dentry_is_whiteout()  Miklos Szeredi  2016-12-16  3  -3/+9
| | |   And use it instead of ovl_dentry_is_opaque() where appropriate.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: don't check sticky  Miklos Szeredi  2016-12-16  1  -24/+0
| | |   Since commit 07a2daab49c5 ("ovl: Copy up underlying inode's ->i_mode
| | |   to overlay inode") sticky checking on overlay inode is performed by
| | |   the vfs, so checking against sticky on underlying inode is not needed.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: don't check rename to self  Miklos Szeredi  2016-12-16  1  -12/+3
| | |   This is redundant, the vfs already performed this check (and was
| | |   broken, see commit 9409e22acdfc ("vfs: rename: check backing inode
| | |   being equal")).
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: treat special files like a regular fs  Miklos Szeredi  2016-12-16  4  -19/+16
| | |   No sense in opening special files on the underlying layers, they work
| | |   just as well if opened on the overlay.
| | |
| | |   Side effect is that it's no longer possible to connect one side of a
| | |   pipe opened on overlayfs with the other side opened on the underlying
| | |   layer.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: rename ovl_rename2() to ovl_rename()  Miklos Szeredi  2016-12-16  1  -4/+4
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | ovl: use vfs_clone_file_range() for copy up if possible  Amir Goldstein  2016-12-16  1  -1/+8
| | |   When copying up within the same fs, try to use vfs_clone_file_range().
| | |   This is very efficient when lower and upper are on the same fs with
| | |   file reflink support. If vfs_clone_file_range() fails for any reason,
| | |   copy up falls back to the regular data copy code.
| | |
| | |   Tested correct behavior when lower and upper are on:
| | |
| | |    1. same ext4 (copy)
| | |    2. same xfs + reflink patches + mkfs.xfs (copy)
| | |    3. same xfs + reflink patches + mkfs.xfs -m reflink=1 (reflink)
| | |    4. different xfs + reflink patches + mkfs.xfs -m reflink=1 (copy)
| | |
| | |   For comparison, on my laptop, xfstest overlay/001 (copy up of large
| | |   sparse files) takes less than 1 second in the xfs reflink setup vs. 25
| | |   seconds on the rest of the setups.
| | |
| | |   Signed-off-by: Amir Goldstein <amir73il@gmail.com>
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
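
The "try to clone, fall back to a regular copy" strategy above has a rough userland analogue, sketched below in C. It is not the overlayfs code: `demo_copy_up_data()` is a hypothetical helper, and it uses the FICLONE ioctl (available on Linux where `<linux/fs.h>` defines it) in place of the in-kernel vfs_clone_file_range() call. Where FICLONE is not defined, the sketch always copies.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>
#ifdef __linux__
#include <linux/fs.h>  /* FICLONE, where the kernel headers provide it */
#endif

/* Copy the whole of src_fd into dst_fd.
 * Returns 1 if the data was cloned, 0 if it was copied, -1 on error. */
static int demo_copy_up_data(int src_fd, int dst_fd)
{
    char buf[65536];
    ssize_t nr;

#ifdef FICLONE
    if (ioctl(dst_fd, FICLONE, src_fd) == 0)
        return 1;
    /* Any failure (different fs, no reflink support, ...) falls through. */
#endif
    if (lseek(src_fd, 0, SEEK_SET) < 0 || lseek(dst_fd, 0, SEEK_SET) < 0)
        return -1;

    while ((nr = read(src_fd, buf, sizeof(buf))) > 0) {
        char *p = buf;

        while (nr > 0) {
            ssize_t nw = write(dst_fd, p, (size_t)nr);

            if (nw < 0) {
                if (errno == EINTR)
                    continue;  /* interrupted before writing: retry */
                return -1;
            }
            p += nw;
            nr -= nw;
        }
    }
    return nr < 0 ? -1 : 0;
}
```

The caller does not need to know in advance whether the two files share a reflink-capable filesystem; the return value only reports which path was taken.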
| * | Revert "ovl: get_write_access() in truncate"  Miklos Szeredi  2016-12-16  1  -21/+0
| | |   This reverts commit 03bea60409328de54e4ff7ec41672e12a9cb0908.
| | |
| | |   Commit 4d0c5ba2ff79 ("vfs: do get_write_access() on upper layer of
| | |   overlayfs") makes the writecount checks inside overlayfs superfluous,
| | |   the file is already copied up and write access acquired on the upper
| | |   inode when ovl_setattr is called with ATTR_SIZE.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | vfs: fix vfs_clone_file_range() for overlayfs files  Amir Goldstein  2016-12-16  1  -5/+5
| | |   With overlayfs, it is wrong to compare file_inode(inode)->i_sb of
| | |   regular files with those of non-regular files, because the former
| | |   reference the real (upper/lower) sb and the latter reference the
| | |   overlayfs sb.
| | |
| | |   Move the test for same super block after the sanity tests for clone
| | |   range of directory and non-regular file.
| | |
| | |   This change fixes xfstest generic/157, which returned EXDEV instead of
| | |   EISDIR/EINVAL in the following test cases over overlayfs:
| | |
| | |       echo "Try to reflink a dir"
| | |       _reflink_range $testdir1/dir1 0 $testdir1/file2 0 $blksz
| | |
| | |       echo "Try to reflink a device"
| | |       _reflink_range $testdir1/dev1 0 $testdir1/file2 0 $blksz
| | |
| | |   Signed-off-by: Amir Goldstein <amir73il@gmail.com>
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | vfs: call vfs_clone_file_range() under freeze protection  Amir Goldstein  2016-12-16  3  -6/+2
| | |   Move sb_start_write()/sb_end_write() out of the vfs helper and up into
| | |   the ioctl handler.
| | |
| | |   Signed-off-by: Amir Goldstein <amir73il@gmail.com>
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | vfs: allow vfs_clone_file_range() across mount points  Amir Goldstein  2016-12-16  2  -2/+10
| | |   FICLONE/FICLONERANGE ioctls return -EXDEV if src and dest files are
| | |   not on the same mount point. Practically, clone only requires that src
| | |   and dest files are on the same file system.
| | |
| | |   Move the check for same mount point to ioctl handler and keep only the
| | |   check for same super block in the vfs helper.
| | |
| | |   A following patch is going to use the vfs_clone_file_range() helper in
| | |   overlayfs to copy up between lower and upper mount points on the same
| | |   file system.
| | |
| | |   Signed-off-by: Amir Goldstein <amir73il@gmail.com>
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | vfs: no mnt_want_write_file() in vfs_{copy,clone}_file_range()  Miklos Szeredi  2016-12-16  1  -8/+4
| | |   We've checked for file_out being opened for write. This ensures that
| | |   we already have mnt_want_write() on target.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>
| * | Revert "vfs: rename: check backing inode being equal"  Miklos Szeredi  2016-12-16  1  -5/+1
| | |   This reverts commit 9409e22acdfc9153f88d9b1ed2bd2a5b34d2d3ca.
| | |
| | |   Since commit 51f7e52dc943 ("ovl: share inode for hard link") there's
| | |   no need to call d_real_inode() to check two overlay inodes for
| | |   equality.
| | |
| | |   Signed-off-by: Miklos Szeredi <mszeredi@redhat.com>