summaryrefslogtreecommitdiffstats
path: root/fs/fuse/file.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge branch 'for-linus' of ↵Linus Torvalds2014-04-121-4/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "The first vfs pile, with deep apologies for being very late in this window. Assorted cleanups and fixes, plus a large preparatory part of iov_iter work. There's a lot more of that, but it'll probably go into the next merge window - it *does* shape up nicely, removes a lot of boilerplate, gets rid of locking inconsistencie between aio_write and splice_write and I hope to get Kent's direct-io rewrite merged into the same queue, but some of the stuff after this point is having (mostly trivial) conflicts with the things already merged into mainline and with some I want more testing. This one passes LTP and xfstests without regressions, in addition to usual beating. BTW, readahead02 in ltp syscalls testsuite has started giving failures since "mm/readahead.c: fix readahead failure for memoryless NUMA nodes and limit readahead pages" - might be a false positive, might be a real regression..." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) missing bits of "splice: fix racy pipe->buffers uses" cifs: fix the race in cifs_writev() ceph_sync_{,direct_}write: fix an oops on ceph_osdc_new_request() failure kill generic_file_buffered_write() ocfs2_file_aio_write(): switch to generic_perform_write() ceph_aio_write(): switch to generic_perform_write() xfs_file_buffered_aio_write(): switch to generic_perform_write() export generic_perform_write(), start getting rid of generic_file_buffer_write() generic_file_direct_write(): get rid of ppos argument btrfs_file_aio_write(): get rid of ppos kill the 5th argument of generic_file_buffered_write() kill the 4th argument of __generic_file_aio_write() lustre: don't open-code kernel_recvmsg() ocfs2: don't open-code kernel_recvmsg() drbd: don't open-code kernel_recvmsg() constify blk_rq_map_user_iov() and friends lustre: switch to kernel_sendmsg() ocfs2: don't open-code kernel_sendmsg() take iov_iter stuff to mm/iov_iter.c process_vm_access: tidy up a bit ...
| * generic_file_direct_write(): get rid of ppos argumentAl Viro2014-04-021-2/+1
| | | | | | | | | | | | always equal to &iocb->ki_pos. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * callers of iov_copy_from_user_atomic() don't need pagecache_disable()Al Viro2014-04-021-2/+0
| | | | | | | | | | | | ... it does that itself (via kmap_atomic()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | mm: implement ->map_pages for page cacheKirill A. Shutemov2014-04-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | filemap_map_pages() is generic implementation of ->map_pages() for filesystems who uses page cache. It should be safe to use filemap_map_pages() for ->map_pages() if filesystem use filemap_fault() for ->fault(). Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Acked-by: Linus Torvalds <torvalds@linux-foundation.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Rik van Riel <riel@redhat.com> Cc: Andi Kleen <ak@linux.intel.com> Cc: Matthew Wilcox <matthew.r.wilcox@intel.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Cc: Dave Chinner <david@fromorbit.com> Cc: Ning Qu <quning@gmail.com> Cc: Hugh Dickins <hughd@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | fuse: fix "uninitialized variable" warningRajat Jain2014-04-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following warning: In file included from include/linux/fs.h:16:0, from fs/fuse/fuse_i.h:13, from fs/fuse/file.c:9: fs/fuse/file.c: In function 'fuse_file_poll': include/linux/rbtree.h:82:28: warning: 'parent' may be used uninitialized in this function [-Wmaybe-uninitialized] fs/fuse/file.c:2592:27: note: 'parent' was declared here Signed-off-by: Rajat Jain <rajatxjain@gmail.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | fuse: Turn writeback cache onPavel Emelyanov2014-04-021-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | Introduce a bit kernel and userspace exchange between each-other on the init stage and turn writeback on if the userspace want this and mount option 'allow_wbcache' is present (controlled by fusermount). Also add each writable file into per-inode write list and call the generic_file_aio_write to make use of the Linux page cache engine. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | fuse: Fix O_DIRECT operations vs cached writeback misorderPavel Emelyanov2014-04-021-6/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The problem is: 1. write cached data to a file 2. read directly from the same file (via another fd) The 2nd operation may read stale data, i.e. the one that was in a file before the 1st op. Problem is in how fuse manages writeback. When direct op occurs the core kernel code calls filemap_write_and_wait to flush all the cached ops in flight. But fuse acks the writeback right after the ->writepages callback exits w/o waiting for the real write to happen. Thus the subsequent direct op proceeds while the real writeback is still in flight. This is a problem for backends that reorder operation. Fix this by making the fuse direct IO callback explicitly wait on the in-flight writeback to finish. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | fuse: fuse_flush() should wait on writebackMaxim Patlasov2014-04-021-15/+23
| | | | | | | | | | | | | | | | | | | | | | The aim of .flush fop is to hint file-system that flushing its state or caches or any other important data to reliable storage would be desirable now. fuse_flush() passes this hint by sending FUSE_FLUSH request to userspace. However, dirty pages and pages under writeback may be not visible to userspace yet if we won't ensure it explicitly. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | fuse: Implement write_begin/write_end callbacksPavel Emelyanov2014-04-021-0/+73
| | | | | | | | | | | | | | | | | | The .write_begin and .write_end are requiered to use generic routines (generic_file_aio_write --> ... --> generic_perform_write) for buffered writes. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | fuse: restructure fuse_readpage()Maxim Patlasov2014-04-021-7/+16
| | | | | | | | | | | | | | | | | | | | Move the code filling and sending read request to a separate function. Future patches will use it for .write_begin -- partial modification of a page requires reading the page from the storage very similarly to what fuse_readpage does. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | fuse: Flush files on wb closePavel Emelyanov2014-04-021-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Any write request requires a file handle to report to the userspace. Thus when we close a file (and free the fuse_file with this info) we have to flush all the outstanding dirty pages. filemap_write_and_wait() is enough because every page under fuse writeback is accounted in ff->count. This delays actual close until all fuse wb is completed. In case of "write cache" turned off, the flush is ensured by fuse_vma_close(). Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | fuse: Trust kernel i_mtime onlyMaxim Patlasov2014-04-021-4/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Let the kernel maintain i_mtime locally: - clear S_NOCMTIME - implement i_op->update_time() - flush mtime on fsync and last close - update i_mtime explicitly on truncate and fallocate Fuse inode flag FUSE_I_MTIME_DIRTY serves as indication that local i_mtime should be flushed to the server eventually. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | fuse: Trust kernel i_size onlyPavel Emelyanov2014-04-021-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make fuse think that when writeback is on the inode's i_size is always up-to-date and not update it with the value received from the userspace. This is done because the page cache code may update i_size without letting the FS know. This assumption implies fixing the previously introduced short-read helper -- when a short read occurs the 'hole' is filled with zeroes. fuse_file_fallocate() is also fixed because now we should keep i_size up to date, so it must be updated if FUSE_FALLOCATE request succeeded. Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | fuse: Prepare to handle short readsPavel Emelyanov2014-04-021-8/+13
| | | | | | | | | | | | | | | | | | A helper which gets called when read reports less bytes than was requested. See patch "trust kernel i_size only" for details. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | fuse: Linking file to inode helperPavel Emelyanov2014-04-021-14/+19
|/ | | | | | | | | When writeback is ON every writeable file should be in per-inode write list, not only mmap-ed ones. Thus introduce a helper for this linkage. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* Merge branch 'for-linus' of ↵Linus Torvalds2014-01-281-0/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull vfs updates from Al Viro: "Assorted stuff; the biggest pile here is Christoph's ACL series. Plus assorted cleanups and fixes all over the place... There will be another pile later this week" * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (43 commits) __dentry_path() fixes vfs: Remove second variable named error in __dentry_path vfs: Is mounted should be testing mnt_ns for NULL or error. Fix race when checking i_size on direct i/o read hfsplus: remove can_set_xattr nfsd: use get_acl and ->set_acl fs: remove generic_acl nfs: use generic posix ACL infrastructure for v3 Posix ACLs gfs2: use generic posix ACL infrastructure jfs: use generic posix ACL infrastructure xfs: use generic posix ACL infrastructure reiserfs: use generic posix ACL infrastructure ocfs2: use generic posix ACL infrastructure jffs2: use generic posix ACL infrastructure hfsplus: use generic posix ACL infrastructure f2fs: use generic posix ACL infrastructure ext2/3/4: use generic posix ACL infrastructure btrfs: use generic posix ACL infrastructure fs: make posix_acl_create more useful fs: make posix_acl_chmod more useful ...
| * Fix race when checking i_size on direct i/o readSteven Whitehouse2014-01-261-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | So far I've had one ACK for this, and no other comments. So I think it is probably time to send this via some suitable tree. I'm guessing that the vfs tree would be the most appropriate route, but not sure that there is one at the moment (don't see anything recent at kernel.org) so in that case I think -mm is the "back up plan". Al, please let me know if you will take this? Steve. --------------------- Following on from the "Re: [PATCH v3] vfs: fix a bug when we do some dio reads with append dio writes" thread on linux-fsdevel, this patch is my current version of the fix proposed as option (b) in that thread. Removing the i_size test from the direct i/o read path at vfs level means that filesystems now have to deal with requests which are beyond i_size themselves. These I've divided into three sets: a) Those with "no op" ->direct_IO (9p, cifs, ceph) These are obviously not going to be an issue b) Those with "home brew" ->direct_IO (nfs, fuse) I've been told that NFS should not have any problem with the larger i_size, however I've added an extra test to FUSE to duplicate the original behaviour just to be on the safe side. c) Those using __blockdev_direct_IO() These call through to ->get_block() which should deal with the EOF condition correctly. I've verified that with GFS2 and I believe that Zheng has verified it for ext4. I've also run the test on XFS and it passes both before and after this change. The part of the patch in filemap.c looks a lot larger than it really is - there are only two lines of real change. The rest is just indentation of the contained code. There remains a test of i_size though, which was added for btrfs. It doesn't cause the other filesystems a problem as the test is performed after ->direct_IO has been called. It is possible that there is a race that does matter to btrfs, however this patch doesn't change that, so its still an overall improvement. Signed-off-by: Steven Whitehouse <swhiteho@redhat.com> Reported-by: Zheng Liu <gnehzuil.liu@gmail.com> Cc: Jan Kara <jack@suse.cz> Cc: Dave Chinner <david@fromorbit.com> Acked-by: Miklos Szeredi <miklos@szeredi.hu> Cc: Chris Mason <clm@fb.com> Cc: Josef Bacik <jbacik@fb.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Alexander Viro <viro@zeniv.linux.org.uk> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | fuse: support clients that don't implement 'open'Andrew Gallagher2014-01-221-10/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | open/release operations require userspace transitions to keep track of the open count and to perform any FS-specific setup. However, for some purely read-only FSs which don't need to perform any setup at open/release time, we can avoid the performance overhead of calling into userspace for open/release calls. This patch adds the necessary support to the fuse kernel modules to prevent open/release operations from hitting in userspace. When the client returns ENOSYS, we avoid sending the subsequent release to userspace, and also remember this so that future opens also don't trigger a userspace operation. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* | fuse: don't invalidate attrs when not using atimeAndrew Gallagher2014-01-221-2/+2
|/ | | | | | | | | | Various read operations (e.g. readlink, readdir) invalidate the cached attrs for atime changes. This patch adds a new function 'fuse_invalidate_atime', which checks for a read-only super block and avoids the attr invalidation in that case. Signed-off-by: Andrew Gallagher <andrewjcg@fb.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: writepages: protect secondary requests from fuse file releaseMaxim Patlasov2013-11-051-0/+1
| | | | | | | | | | All async fuse requests must be supplied with extra reference to a fuse file. This is necessary to ensure that the fuse file is not released until all in-flight requests are completed. Fuse secondary writeback requests must obey this rule as well. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: writepages: update bdi writeout when deleting secondary requestMaxim Patlasov2013-11-051-1/+4
| | | | | | | | | | | | | | BDI_WRITTEN counter is used to estimate bdi bandwidth. It must be incremented every time as bdi ends page writeback. No matter whether it was fulfilled by actual write or by discarding the request (e.g. due to shrunk i_size). Note that even before writepages patches, the case "Got truncated off completely" was handled in fuse_send_writepage() by calling fuse_writepage_finish() which updated BDI_WRITTEN unconditionally. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: writepages: crop secondary requestsMaxim Patlasov2013-11-051-5/+31
| | | | | | | | | | | | | | | | | | | | | If writeback happens while fuse is in FUSE_NOWRITE condition, the request will be queued but not processed immediately (see fuse_flush_writepages()). Until FUSE_NOWRITE becomes relaxed, more writebacks can happen. They will be queued as "secondary" requests to that first ("primary") request. Existing implementation crops only primary request. This is not correct because a subsequent extending write(2) may increase i_size and then secondary requests won't be cropped properly. The result would be stale data written to the server to a file offset where zeros must be. Similar problem may happen if secondary requests are attached to an in-flight request that was already cropped. The patch solves the issue by cropping all secondary requests in fuse_writepage_end(). Thanks to Miklos for idea. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: writepages: roll back changes if request not foundMaxim Patlasov2013-11-051-2/+4
| | | | | | | | | | | fuse_writepage_in_flight() returns false if it fails to find request with given index in fi->writepages. Then the caller proceeds with populating data->orig_pages[] and incrementing req->num_pages. Hence, fuse_writepage_in_flight() must revert changes it made in request before returning false. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: writepage: skip already in flightMiklos Szeredi2013-10-011-0/+12
| | | | | | | | | | If ->writepage() tries to write back a page whose copy is still in flight, then just skip by calling redirty_page_for_writepage(). This is OK, since now ->writepage() should never be called for data integrity sync. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: writepages: handle same page rewritesMiklos Szeredi2013-10-011-10/+90
| | | | | | | | | | | | | | | | | | | | | | | | | | | As Maxim Patlasov pointed out, it's possible to get a dirty page while it's copy is still under writeback, despite fuse_page_mkwrite() doing its thing (direct IO). This could result in two concurrent write request for the same offset, with data corruption if they get mixed up. To prevent this, fuse needs to check and delay such writes. This implementation does this by: 1. check if page is still under writeout, if so create a new, single page secondary request for it 2. chain this secondary request onto the in-flight request 2/a. if a seconday request for the same offset was already chained to the in-flight request, then just copy the contents of the page and discard the new secondary request. This makes sure that for each page will have at most two requests associated with it 3. when the in-flight request finished, send off all secondary requests chained onto it Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: writepages: fix aggregationMiklos Szeredi2013-10-011-1/+1
| | | | | | | Checking against tmp-page indexes is not very useful, and results in one (or rarely two) page requests. Which is not much of an improvement... Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: fix race in fuse_writepages()Maxim Patlasov2013-10-011-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | The patch fixes a race between ftruncate(2), mmap-ed write and write(2): 1) An user makes a page dirty via mmap-ed write. 2) The user performs shrinking truncate(2) intended to purge the page. 3) Before fuse_do_setattr calls truncate_pagecache, the page goes to writeback. fuse_writepages_fill attaches a new page to FUSE_WRITE request, then releases the original page by end_page_writeback and unlock it. 4) fuse_do_setattr completes and successfully returns. Since now, i_mutex is free. 5) Ordinary write(2) extends i_size back to cover the page. Note that fuse_send_write_pages do wait for fuse writeback, but for another page->index. 6) fuse_writepages_fill attaches more pages to the request (if any), then fuse_writepages_send is eventually called. It is supposed to crop inarg->size of the request, but it doesn't because i_size has already been extended back. Moving end_page_writeback behind fuse_writepages_send guarantees that __fuse_release_nowrite (called from fuse_do_setattr) will crop inarg->size of the request before write(2) gets the chance to extend i_size. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: Implement writepages callbackPavel Emelyanov2013-10-011-3/+147
| | | | | | | | | | | The .writepages one is required to make each writeback request carry more than one page on it. The patch enables optimized behaviour unconditionally, i.e. mmap-ed writes will benefit from the patch even if fc->writeback_cache=0. [SzM: simplify, add comments] Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: don't BUG on no write fileMiklos Szeredi2013-10-011-5/+12
| | | | | | | Don't bug if there's no writable files found for page writeback. If ever this is triggered, a WARN_ON helps debugging it much better then a BUG_ON. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: lock page in mkwriteMiklos Szeredi2013-10-011-6/+9
| | | | | | | | Lock the page in fuse_page_mkwrite() to protect against a race with fuse_writepage() where the page is redirtied before the actual writeback begins. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: Prepare to handle multiple pages in writebackPavel Emelyanov2013-10-011-8/+16
| | | | | | | | The .writepages callback will issue writeback requests with more than one page aboard. Make existing end/check code be aware of this. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: Getting file for writeback helperPavel Emelyanov2013-10-011-8/+16
| | | | | | | | | There will be a .writepageS callback implementation which will need to get a fuse_file out of a fuse_inode, thus make a helper for this. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: fix fallocate vs. ftruncate raceMaxim Patlasov2013-09-181-0/+7
| | | | | | | | | | | | | | | | | | | | | A former patch introducing FUSE_I_SIZE_UNSTABLE flag provided detailed description of races between ftruncate and anyone who can extend i_size: > 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size > changed (due to our own fuse_do_setattr()) and is going to call > truncate_pagecache() for some 'new_size' it believes valid right now. But by > the time that particular truncate_pagecache() is called ... > 2. fuse_do_setattr() returns (either having called truncate_pagecache() or > not -- it doesn't matter). > 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2). > 4. mmap-ed write makes a page in the extended region dirty. This patch adds necessary bits to fuse_file_fallocate() to protect from that race. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org
* fuse: wait for writeback in fuse_file_fallocate()Maxim Patlasov2013-09-181-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch fixes a race between mmap-ed write and fallocate(PUNCH_HOLE): 1) An user makes a page dirty via mmap-ed write. 2) The user performs fallocate(2) with mode == PUNCH_HOLE|KEEP_SIZE and <offset, size> covering the page. 3) Before truncate_pagecache_range call from fuse_file_fallocate, the page goes to write-back. The page is fully processed by fuse_writepage (including end_page_writeback on the page), but fuse_flush_writepages did nothing because fi->writectr < 0. 4) truncate_pagecache_range is called and fuse_file_fallocate is finishing by calling fuse_release_nowrite. The latter triggers processing queued write-back request which will write stale data to the hole soon. Changed in v2 (thanks to Brian for suggestion): - Do not truncate page cache until FUSE_FALLOCATE succeeded. Otherwise, we can end up in returning -ENOTSUPP while user data is already punched from page cache. Use filemap_write_and_wait_range() instead. Changed in v3 (thanks to Miklos for suggestion): - fuse_wait_on_writeback() is prone to livelocks; use fuse_set_nowrite() instead. So far as we need a dirty-page barrier only, fuse_sync_writes() should be enough. - rebased to for-linus branch of fuse.git Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org
* fuse: hotfix truncate_pagecache() issueMaxim Patlasov2013-09-031-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The way how fuse calls truncate_pagecache() from fuse_change_attributes() is completely wrong. Because, w/o i_mutex held, we never sure whether 'oldsize' and 'attr->size' are valid by the time of execution of truncate_pagecache(inode, oldsize, attr->size). In fact, as soon as we released fc->lock in the middle of fuse_change_attributes(), we completely loose control of actions which may happen with given inode until we reach truncate_pagecache. The list of potentially dangerous actions includes mmap-ed reads and writes, ftruncate(2) and write(2) extending file size. The typical outcome of doing truncate_pagecache() with outdated arguments is data corruption from user point of view. This is (in some sense) acceptable in cases when the issue is triggered by a change of the file on the server (i.e. externally wrt fuse operation), but it is absolutely intolerable in scenarios when a single fuse client modifies a file without any external intervention. A real life case I discovered by fsx-linux looked like this: 1. Shrinking ftruncate(2) comes to fuse_do_setattr(). The latter sends FUSE_SETATTR to the server synchronously, but before getting fc->lock ... 2. fuse_dentry_revalidate() is asynchronously called. It sends FUSE_LOOKUP to the server synchronously, then calls fuse_change_attributes(). The latter updates i_size, releases fc->lock, but before comparing oldsize vs attr->size.. 3. fuse_do_setattr() from the first step proceeds by acquiring fc->lock and updating attributes and i_size, but now oldsize is equal to outarg.attr.size because i_size has just been updated (step 2). Hence, fuse_do_setattr() returns w/o calling truncate_pagecache(). 4. As soon as ftruncate(2) completes, the user extends file size by write(2) making a hole in the middle of file, then reads data from the hole either by read(2) or mmap-ed read. The user expects to get zero data from the hole, but gets stale data because truncate_pagecache() is not executed yet. The scenario above illustrates one side of the problem: not truncating the page cache even though we should. Another side corresponds to truncating page cache too late, when the state of inode changed significantly. Theoretically, the following is possible: 1. As in the previous scenario fuse_dentry_revalidate() discovered that i_size changed (due to our own fuse_do_setattr()) and is going to call truncate_pagecache() for some 'new_size' it believes valid right now. But by the time that particular truncate_pagecache() is called ... 2. fuse_do_setattr() returns (either having called truncate_pagecache() or not -- it doesn't matter). 3. The file is extended either by write(2) or ftruncate(2) or fallocate(2). 4. mmap-ed write makes a page in the extended region dirty. The result will be the lost of data user wrote on the fourth step. The patch is a hotfix resolving the issue in a simplistic way: let's skip dangerous i_size update and truncate_pagecache if an operation changing file size is in progress. This simplistic approach looks correct for the cases w/o external changes. And to handle them properly, more sophisticated and intrusive techniques (e.g. NFS-like one) would be required. I'd like to postpone it until the issue is well discussed on the mailing list(s). Changed in v2: - improved patch description to cover both sides of the issue. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org
* fuse: postpone end_page_writeback() in fuse_writepage_locked()Maxim Patlasov2013-09-031-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | The patch fixes a race between ftruncate(2), mmap-ed write and write(2): 1) An user makes a page dirty via mmap-ed write. 2) The user performs shrinking truncate(2) intended to purge the page. 3) Before fuse_do_setattr calls truncate_pagecache, the page goes to writeback. fuse_writepage_locked fills FUSE_WRITE request and releases the original page by end_page_writeback. 4) fuse_do_setattr() completes and successfully returns. Since now, i_mutex is free. 5) Ordinary write(2) extends i_size back to cover the page. Note that fuse_send_write_pages do wait for fuse writeback, but for another page->index. 6) fuse_writepage_locked proceeds by queueing FUSE_WRITE request. fuse_send_writepage is supposed to crop inarg->size of the request, but it doesn't because i_size has already been extended back. Moving end_page_writeback to the end of fuse_writepage_locked fixes the race because now the fact that truncate_pagecache is successfully returned infers that fuse_writepage_locked has already called end_page_writeback. And this, in turn, infers that fuse_flush_writepages has already called fuse_send_writepage, and the latter used valid (shrunk) i_size. write(2) could not extend it because of i_mutex held by ftruncate(2). Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz> Cc: stable@vger.kernel.org
* Merge branch 'for-linus' of ↵Linus Torvalds2013-07-031-2/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull second set of VFS changes from Al Viro: "Assorted f_pos race fixes, making do_splice_direct() safe to call with i_mutex on parent, O_TMPFILE support, Jeff's locks.c series, ->d_hash/->d_compare calling conventions changes from Linus, misc stuff all over the place." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (63 commits) Document ->tmpfile() ext4: ->tmpfile() support vfs: export lseek_execute() to modules lseek_execute() doesn't need an inode passed to it block_dev: switch to fixed_size_llseek() cpqphp_sysfs: switch to fixed_size_llseek() tile-srom: switch to fixed_size_llseek() proc_powerpc: switch to fixed_size_llseek() ubi/cdev: switch to fixed_size_llseek() pci/proc: switch to fixed_size_llseek() isapnp: switch to fixed_size_llseek() lpfc: switch to fixed_size_llseek() locks: give the blocked_hash its own spinlock locks: add a new "lm_owner_key" lock operation locks: turn the blocked_list into a hashtable locks: convert fl_link to a hlist_node locks: avoid taking global lock if possible when waking up blocked waiters locks: protect most of the file_lock handling with i_lock locks: encapsulate the fl_link list handling locks: make "added" in __posix_lock_file a bool ...
| * fuse: another open-coded file_inode()Al Viro2013-06-291-2/+1
| | | | | | | | Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | fuse: hold i_mutex in fuse_file_fallocate()Maxim Patlasov2013-06-181-4/+8
|/ | | | | | | | | | | | | | | | | | | | | | | | Changing size of a file on server and local update (fuse_write_update_size) should be always protected by inode->i_mutex. Otherwise a race like this is possible: 1. Process 'A' calls fallocate(2) to extend file (~FALLOC_FL_KEEP_SIZE). fuse_file_fallocate() sends FUSE_FALLOCATE request to the server. 2. Process 'B' calls ftruncate(2) shrinking the file. fuse_do_setattr() sends shrinking FUSE_SETATTR request to the server and updates local i_size by i_size_write(inode, outarg.attr.size). 3. Process 'A' resumes execution of fuse_file_fallocate() and calls fuse_write_update_size(inode, offset + length). But 'offset + length' was obsoleted by ftruncate from previous step. Changed in v2 (thanks Brian and Anand for suggestions): - made relation between mutex_lock() and fuse_set_nowrite(inode) more explicit and clear. - updated patch description to use ftruncate(2) in example Signed-off-by: Maxim V. Patlasov <MPatlasov@parallels.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: fix alignment in short read optimization for async_dioMaxim Patlasov2013-06-031-4/+10
| | | | | | | | | | | | | | | | | | | | | | | | | The bug was introduced with async_dio feature: trying to optimize short reads, we cut number-of-bytes-to-read to i_size boundary. Hence the following example: truncate --size=300 /mnt/file dd if=/mnt/file of=/dev/null iflag=direct led to FUSE_READ request of 300 bytes size. This turned out to be problem for userspace fuse implementations who rely on assumption that kernel fuse does not change alignment of request from client FS. The patch turns off the optimization if async_dio is disabled. And, if it's enabled, the patch fixes adjustment of number-of-bytes-to-read to preserve alignment. Note, that we cannot throw out short read optimization entirely because otherwise a direct read of a huge size issued on a tiny file would generate a huge amount of fuse requests and most of them would be ACKed by userspace with zero bytes read. Signed-off-by: Maxim Patlasov <MPatlasov@parallels.com> Reviewed-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: return -EIOCBQUEUED from fuse_direct_IO() for all async requestsBrian Foster2013-06-031-1/+1
| | | | | | | | | | | | | | | If request submission fails for an async request (i.e., get_user_pages() returns -ERESTARTSYS), we currently skip the -EIOCBQUEUED return and drop into wait_for_sync_kiocb() forever. Avoid this by always returning -EIOCBQUEUED for async requests. If an error occurs, the error is passed into fuse_aio_complete(), returned via aio_complete() and thus propagated to userspace via io_getevents(). Signed-off-by: Brian Foster <bfoster@redhat.com> Reviewed-by: Maxim Patlasov <MPatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: update inode size and invalidate attributes on fallocateBrian Foster2013-05-201-3/+12
| | | | | | | | | | | An fallocate request without FALLOC_FL_KEEP_SIZE set can extend the size of a file. Update the inode size after a successful fallocate. Also invalidate the inode attributes after a successful fallocate to ensure we pick up the latest attribute values (i.e., i_blocks). Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: truncate pagecache range on hole punchBrian Foster2013-05-201-2/+20
| | | | | | | | | | | | | | | | | | | | fuse supports hole punch via the fallocate() FALLOC_FL_PUNCH_HOLE interface. When a hole punch is passed through, the page cache is not cleared and thus allows reading stale data from the cache. This is easily demonstrable (using FOPEN_KEEP_CACHE) by reading a smallish random data file into cache, punching a hole and creating a copy of the file. Drop caches or remount and observe that the original file no longer matches the file copied after the hole punch. The original file contains a zeroed range and the latter file contains stale data. Protect against writepage requests in progress and punch out the associated page cache range after a successful client fs hole punch. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* fuse: allocate for_background dio requests based on io->async stateBrian Foster2013-05-141-2/+9
| | | | | | | | | | | | | | | | Commit 8b41e671 introduced explicit background checking for fuse_req structures with BUG_ON() checks for the appropriate type of request in in the associated send functions. Commit bcba24cc introduced the ability to send dio requests as background requests but does not update the request allocation based on the type of I/O request. As a result, a BUG_ON() triggers in the fuse_request_send_background() background path if an async I/O is sent. Allocate a request based on the async state of the fuse_io_priv to avoid the BUG. Signed-off-by: Brian Foster <bfoster@redhat.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
* Merge branch 'akpm' (incoming from Andrew)Linus Torvalds2013-05-081-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merge more incoming from Andrew Morton: - Various fixes which were stalled or which I picked up recently - A large rotorooting of the AIO code. Allegedly to improve performance but I don't really have good performance numbers (I might have lost the email) and I can't raise Kent today. I held this out of 3.9 and we could give it another cycle if it's all too late/scary. I ended up taking only the first two thirds of the AIO rotorooting. I left the percpu parts and the batch completion for later. - Linus * emailed patches from Andrew Morton <akpm@linux-foundation.org>: (33 commits) aio: don't include aio.h in sched.h aio: kill ki_retry aio: kill ki_key aio: give shared kioctx fields their own cachelines aio: kill struct aio_ring_info aio: kill batch allocation aio: change reqs_active to include unreaped completions aio: use cancellation list lazily aio: use flush_dcache_page() aio: make aio_read_evt() more efficient, convert to hrtimers wait: add wait_event_hrtimeout() aio: refcounting cleanup aio: make aio_put_req() lockless aio: do fget() after aio_get_req() aio: dprintk() -> pr_debug() aio: move private stuff out of aio.h aio: add kiocb_cancel() aio: kill return value of aio_complete() char: add aio_{read,write} to /dev/{null,zero} aio: remove retry-based AIO ...
| * aio: don't include aio.h in sched.hKent Overstreet2013-05-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Faster kernel compiles by way of fewer unnecessary includes. [akpm@linux-foundation.org: fix fallout] [akpm@linux-foundation.org: fix build] Signed-off-by: Kent Overstreet <koverstreet@google.com> Cc: Zach Brown <zab@redhat.com> Cc: Felipe Balbi <balbi@ti.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Cc: Mark Fasheh <mfasheh@suse.com> Cc: Joel Becker <jlbec@evilplan.org> Cc: Rusty Russell <rusty@rustcorp.com.au> Cc: Jens Axboe <axboe@kernel.dk> Cc: Asai Thambi S P <asamymuthupa@micron.com> Cc: Selvan Mani <smani@micron.com> Cc: Sam Bradshaw <sbradshaw@micron.com> Cc: Jeff Moyer <jmoyer@redhat.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Benjamin LaHaise <bcrl@kvack.org> Reviewed-by: "Theodore Ts'o" <tytso@mit.edu> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge branch 'for-linus' of ↵Linus Torvalds2013-05-071-38/+234
|\ \ | |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse Pull fuse updates from Miklos Szeredi: "This contains two patchsets from Maxim Patlasov. The first reworks the request throttling so that only async requests are throttled. Wakeup of waiting async requests is also optimized. The second series adds support for async processing of direct IO which optimizes direct IO and enables the use of the AIO userspace interface." * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/mszeredi/fuse: fuse: add flag to turn on async direct IO fuse: truncate file if async dio failed fuse: optimize short direct reads fuse: enable asynchronous processing direct IO fuse: make fuse_direct_io() aware about AIO fuse: add support of async IO fuse: move fuse_release_user_pages() up fuse: optimize wake_up fuse: implement exclusive wakeup for blocked_waitq fuse: skip blocking on allocations of synchronous requests fuse: add flag fc->initialized fuse: make request allocations for background processing explicit
| * fuse: add flag to turn on async direct IOMiklos Szeredi2013-05-011-4/+4
| | | | | | | | | | | | | | | | | | | | Without async DIO write requests to a single file were always serialized. With async DIO that's no longer the case. So don't turn on async DIO by default for fear of breaking backward compatibility. Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: truncate file if async dio failedMaxim Patlasov2013-04-181-2/+20
| | | | | | | | | | | | | | | | | | | | | | The patch improves error handling in fuse_direct_IO(): if we successfully submitted several fuse requests on behalf of synchronous direct write extending file and some of them failed, let's try to do our best to clean-up. Changed in v2: reuse fuse_do_setattr(). Thanks to Brian for suggestion. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>
| * fuse: optimize short direct readsMaxim Patlasov2013-04-171-6/+13
| | | | | | | | | | | | | | | | | | | | | | If user requested direct read beyond EOF, we can skip sending fuse requests for positions beyond EOF because userspace would ACK them with zero bytes read anyway. We can trust to i_size in fuse_direct_IO for such cases because it's called from fuse_file_aio_read() and the latter updates fuse attributes including i_size. Signed-off-by: Maxim Patlasov <mpatlasov@parallels.com> Signed-off-by: Miklos Szeredi <mszeredi@suse.cz>