path: root/fs
* ufs_getfrag_block(): get rid of macro jungles (Al Viro, 2015-07-06; 1 file, -29/+22)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs_inode_get{frag,block}(): consolidate success exits (Al Viro, 2015-07-06; 1 file, -28/+22)
    These calling conventions are rudiments of pre-2.3 times; they really
    need to be sanitized.  This is the first step; next will be _always_
    returning a block number, instead of this "return a pointer to
    buffer_head, except when we get to the actual data" crap.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: use the branch depth in ufs_getfrag_block() (Al Viro, 2015-07-06; 1 file, -6/+4)
    We'd already calculated it...
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: move calculation of offsets into ufs_getfrag_block() (Al Viro, 2015-07-06; 1 file, -8/+9)
    ... and massage ufs_frag_map() to take those instead of a fragment
    number.  As it is, we duplicate the damn thing on the write side,
    open-coded and bloody hard to follow.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs_inode_get{frag,block}(): get rid of retries (Al Viro, 2015-07-06; 1 file, -35/+8)
    We are holding ->truncate_mutex, so nobody else can alter our block
    pointers.  Rechecks/retries were needed back when we only held BKL
    there, and had to cope with write_begin/writepage and
    writepage/truncate races.  Can't happen anymore...
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* __ufs_truncate_blocks(): avoid excessive dirtying of indirect blocks (Al Viro, 2015-07-06; 1 file, -3/+1)
    There's a case when an indirect block gets dirtied for no good reason -
    when there's a hole starting in the middle of the area covered by it
    and spanning past its end, and truncate() is done precisely to the
    beginning of the hole.  The block is obviously not modified at all -
    all removals happen beyond it.  However, the existing code ends up
    dirtying it just in case.  It's trivial to fix, and while it's not a
    real bug by any stretch of the imagination, it makes the damn thing
    harder to follow.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* free_full_branch(): don't bother modifying the block we are going to free (Al Viro, 2015-07-06; 1 file, -12/+2)
    Note that it's already been made unreachable from the inode, so we
    don't have to worry about ufs_frag_map() walking into something
    already freed.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* move marking inode dirty to the end of __ufs_truncate_blocks() (Al Viro, 2015-07-06; 1 file, -6/+1)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* free_full_branch(): saner calling conventions (Al Viro, 2015-07-06; 1 file, -49/+51)
    Have the caller fetch the block number *and* remove it from wherever
    it was.  Pass the block number instead.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs_trunc_branch(): kill recursion (Al Viro, 2015-07-06; 1 file, -26/+26)
    Turn recursion into a pair of loops.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs_trunc_branch(): massage towards killing recursion (Al Viro, 2015-07-06; 1 file, -5/+5)
    We always have 0 < depth2 <= depth in there, so
        if (--depth) {
            if (--depth2)
                A
            B
        } else {
            C // not using depth2
        }
        D // not using depth2
    is equivalent to
        if (--depth2)
            A with s/depth/depth - 1/
        if (--depth)
            B
        else
            C
        D
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* split ufs_truncate_branch() into full- and partial-branch variants (Al Viro, 2015-07-06; 1 file, -16/+58)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: unify the logics for collecting adjacent data blocks to free (Al Viro, 2015-07-06; 1 file, -34/+22)
    Open-coded in several places...
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
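
    A minimal sketch of the pattern being unified here - walk a pointer
    array and hand maximal runs of adjacent blocks to the free routine in
    one call.  free_run() is a hypothetical stand-in for ufs_free_blocks();
    this is an illustration, not the fs/ufs code itself:

        static void free_adjacent_runs(const unsigned long *ptrs,
                                       unsigned int count,
                                       void (*free_run)(unsigned long start,
                                                        unsigned int len))
        {
                unsigned long start = 0;
                unsigned int len = 0, i;

                for (i = 0; i < count; i++) {
                        unsigned long blk = ptrs[i];

                        if (len && blk == start + len) {
                                len++;          /* extends the current run */
                                continue;
                        }
                        if (len)
                                free_run(start, len);   /* flush previous run */
                        start = blk;
                        len = blk ? 1 : 0;      /* a hole (0) starts no run */
                }
                if (len)
                        free_run(start, len);   /* flush the final run */
        }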
* ufs_trunc_branch(): separate the calls with non-NULL offsets (Al Viro, 2015-07-06; 1 file, -4/+7)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs_trunc_branch(): never call with offsets != NULL && depth2 == 0 (Al Viro, 2015-07-06; 1 file, -3/+6)
    For calls in __ufs_truncate_blocks() it's just a matter of not
    incrementing offsets[0] and not making that call - the immediately
    following loop will be executed one extra time and we'll be just fine.
    For the recursive call in ufs_trunc_branch() itself, just assign NULL
    to offsets if we would be about to make such a call.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* __ufs_trunc_blocks(): turn the part after switch into a loop (Al Viro, 2015-07-06; 1 file, -25/+10)
    ... and turn the switch into if (), since all cases with depth != 1
    have just become identical.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* __ufs_truncate_blocks(): unify freeing the full branches (Al Viro, 2015-07-06; 1 file, -15/+14)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* unify ufs_trunc_..indirect() (Al Viro, 2015-07-06; 1 file, -138/+60)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs_trunc_..indirect(): more massage towards unifying (Al Viro, 2015-07-06; 1 file, -17/+26)
    Instead of manually checking that the array contains only zeroes, find
    the position of the last non-zero (in __ufs_truncate(), where we can
    conveniently do that) and use that to tell if there's any non-zero in
    the array tail passed to ufs_trunc_...indirect().  The goal of all
    that clumsiness is to fold these functions together.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs_trunc_...indirect(): pass the array of indices instead of offsets (Al Viro, 2015-07-06; 1 file, -28/+22)
    Rather than bitslicing the offset just formed as a sum of shifted
    indices, pass the array of those indices itself.  NULL is used as the
    equivalent of "all zeroes" (== free the entire branch).
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* __ufs_truncate(): find cutoff distances into branches by offsets[] array (Al Viro, 2015-07-06; 1 file, -2/+6)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs_trunc_dindirect(): pass the number of blocks to keep (Al Viro, 2015-07-06; 1 file, -31/+26)
    Same as the previous two.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs_trunc_indirect(): pass the index of the first pointer to free (Al Viro, 2015-07-06; 1 file, -33/+23)
    ... instead of the file offset.  Same cleanups as in the tindirect
    conversion in the previous commit.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs_trunc_tindirect(): pass the number of blocks to keep (Al Viro, 2015-07-06; 1 file, -17/+11)
    IOW, the distance of the cutoff from the beginning of the branch (in
    blocks).  That (and the fact that the block just prior to the cutoff
    is guaranteed to be present) allows us to tell whether to free the
    triple indirect block just by looking at the offset.  While we are at
    it, using u64 for the index in the block is wrong - those should be
    unsigned int.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: beginning of __ufs_truncate_block() massage (Al Viro, 2015-07-06; 1 file, -4/+12)
    Use ufs_block_to_path() to find the cutoff path in the block pointers'
    tree.  For now just use the information about the depth (to bypass the
    fully preserved subtrees); subsequent commits will use the information
    about the actual path.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
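
    ufs_block_to_path() follows the classic ext2-style translation of a
    logical block number into a chain of array indices, one per tree
    level.  A self-contained sketch of that translation, with illustrative
    constants rather than the real UFS on-disk values:

        #include <stdio.h>

        #define NDIR    12      /* direct pointers in the inode (illustrative) */
        #define PTRS    256     /* pointers per indirect block (illustrative) */

        static int block_to_path(unsigned long block, unsigned int offsets[4])
        {
                int n = 0;

                if (block < NDIR) {
                        offsets[n++] = block;
                } else if ((block -= NDIR) < PTRS) {
                        offsets[n++] = NDIR;            /* single indirect */
                        offsets[n++] = block;
                } else if ((block -= PTRS) < (unsigned long)PTRS * PTRS) {
                        offsets[n++] = NDIR + 1;        /* double indirect */
                        offsets[n++] = block / PTRS;
                        offsets[n++] = block % PTRS;
                } else if ((block -= (unsigned long)PTRS * PTRS) <
                           (unsigned long)PTRS * PTRS * PTRS) {
                        offsets[n++] = NDIR + 2;        /* triple indirect */
                        offsets[n++] = block / (PTRS * PTRS);
                        offsets[n++] = (block / PTRS) % PTRS;
                        offsets[n++] = block % PTRS;
                }
                return n;       /* depth of the path; 0 => out of range */
        }

        int main(void)
        {
                unsigned int offsets[4];
                int i, depth = block_to_path(5000, offsets);

                printf("depth %d:", depth);
                for (i = 0; i < depth; i++)
                        printf(" %u", offsets[i]);
                putchar('\n');
                return 0;
        }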
* ufs: the offsets ufs_block_to_path() puts into array are not sector_t (Al Viro, 2015-07-06; 1 file, -3/+3)
    The type makes no sense - those are indices in block number arrays,
    not block numbers.  And no, UFS is not likely to grow indirect blocks
    with 4G pointers in them...
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: move truncate code into inode.c (Al Viro, 2015-07-06; 4 files, -533/+470)
    It is closely tied to block pointers handling there, can benefit from
    existing helpers, etc. - no point keeping them apart.  Trimmed the
    trailing whitespace in inode.c at the same time.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: no retries are needed on truncate (Al Viro, 2015-07-06; 1 file, -40/+17)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: ufs_trunc_...() has exclusion with everything that might cause allocations (Al Viro, 2015-07-06; 1 file, -12/+0)
    Currently that's on lock_ufs(); eventually it will be on a per-inode
    mutex.  lock_ufs() used to be mere BKL, which is much weaker, so it
    needed those rechecks.  BKL doesn't provide any exclusion once we lose
    the CPU; its blind replacement, OTOH, _does_.  Making that
    per-filesystem was an atrocity, but at least we can simplify life
    here.  And yes, we certainly need to make that sucker per-inode -
    these days the inode.c and truncate.c uses are needed only to protect
    the block pointers.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: ufs_trunc_direct() always returns 0 (Al Viro, 2015-07-06; 1 file, -6/+3)
    Make it return void.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: kill lock_ufs() (Al Viro, 2015-07-06; 2 files, -37/+2)
    There were 3 remaining users; in two of them we took ->s_lock
    immediately after lock_ufs() and held it until just before
    unlock_ufs(); the third one (statfs) could not be called from itself
    or from the other two (remount and sync_fs).  Just use ->s_lock in
    statfs and don't bother with lock_ufs at all.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: don't use lock_ufs() for block pointers tree protection (Al Viro, 2015-07-06; 5 files, -47/+121)
    * stores to block pointers are under a per-inode seqlock (meta_lock)
      and mutex (truncate_mutex)
    * fetches of block pointers are either under truncate_mutex, or
      wrapped into a seqretry loop on meta_lock
    * all changes of ->i_size are under truncate_mutex and i_mutex
    * all changes of ->i_lastfrag are under truncate_mutex
    It's similar to what ext2 is doing; the main difference is that,
    unlike ext2, we can't rely upon the atomicity of stores into block
    pointers - on UFS2 they are 64bit.  So we can't cut the corner when
    switching a pointer from NULL to non-NULL as we could in
    ext2_splice_branch(), and need to use meta_lock on all modifications.
    We use seqlock where ext2 uses rwlock; ext2 could probably also
    benefit from such a change...
    Another non-trivial difference is that with UFS we *cannot* have the
    reader grab truncate_mutex in case of a race - it has to keep
    retrying.  That might be possible to change, but not until we lift
    tail unpacking several levels up in the call chain.
    After this commit we do *NOT* hold fs-wide serialization on accesses
    to block pointers anymore.  Moreover, lock_ufs() can become a normal
    mutex now - it's only used on statfs, remount and sync_fs, and none
    of those uses are recursive.  As a matter of fact, *now* it can be
    collapsed with ->s_lock and eventually replaced with saner
    per-cylinder-group spinlocks, but that's a separate story.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
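
    A minimal sketch of the resulting access rules (illustrative, not the
    actual fs/ufs code): writers serialize on truncate_mutex and bracket
    each 64-bit pointer store with meta_lock, while lockless readers spin
    on the seqlock until they observe a consistent value:

        #include <linux/seqlock.h>
        #include <linux/mutex.h>

        struct demo_inode_info {
                seqlock_t       meta_lock;
                struct mutex    truncate_mutex;
                u64             blkptr;  /* stands in for the pointer tree */
        };

        static void demo_store_blkptr(struct demo_inode_info *ui, u64 blk)
        {
                mutex_lock(&ui->truncate_mutex);  /* writer vs writer */
                write_seqlock(&ui->meta_lock);    /* make the 64-bit store
                                                     appear atomic to readers */
                ui->blkptr = blk;
                write_sequnlock(&ui->meta_lock);
                mutex_unlock(&ui->truncate_mutex);
        }

        static u64 demo_fetch_blkptr(struct demo_inode_info *ui)
        {
                unsigned int seq;
                u64 blk;

                do {    /* reader can't fall back to the mutex; keep retrying */
                        seq = read_seqbegin(&ui->meta_lock);
                        blk = ui->blkptr;
                } while (read_seqretry(&ui->meta_lock, seq));
                return blk;
        }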
* ufs: bforget() indirect blocks before freeing them (Al Viro, 2015-07-06; 1 file, -3/+3)
    Right now it doesn't matter (lock_ufs() serializes everything), but
    when we switch to per-inode locking, it will be needed.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: move lock_ufs() down into __ufs_truncate_blocks() (Al Viro, 2015-07-06; 1 file, -7/+2)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: move truncate_setsize() down into ufs_truncate() (Al Viro, 2015-07-06; 1 file, -16/+11)
    ... just prior to __ufs_truncate_blocks(), with a matching change of
    calling conventions.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: free excessive blocks upon ->write_begin() failure/short copy (Al Viro, 2015-07-06; 1 file, -2/+16)
    Broken in "[PATCH] ufs: truncate should allocate block for last byte",
    all the way back in 2006.  ufs_setattr() hadn't been the only user of
    vmtruncate(), and eliminating the ->truncate() method required
    corrections in a bunch of places.  Eventually those places migrated
    into the ->write_begin() failure exit and ->write_end() after a short
    copy...
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
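
    The usual shape of such a fix (a sketch along ext2's lines; not
    guaranteed to match this patch exactly): a helper that trims both the
    page cache and the blocks allocated past i_size, called from the
    ->write_begin() error path and after a short copy in ->write_end():

        static void ufs_write_failed(struct address_space *mapping, loff_t to)
        {
                struct inode *inode = mapping->host;

                /* blocks past i_size were allocated for a write
                 * that never made it to the page cache */
                if (to > inode->i_size) {
                        truncate_pagecache(inode, inode->i_size);
                        ufs_truncate_blocks(inode);
                }
        }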
* ufs: switch ufs_evict_inode() to trimmed-down variant of ufs_truncate() (Al Viro, 2015-07-06; 3 files, -27/+44)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* ufs: kill more lock_ufs() calls (Al Viro, 2015-07-06; 2 files, -13/+4)
    a) move it inside ufs_truncate()
    b) ufs_free_inode() doesn't need it - it's serialized on ->s_lock
    c) ufs_write_inode() doesn't need it either (and can be called
       without it anyway).
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs (Linus Torvalds, 2015-07-05; 69 files, -467/+657)
    Pull more vfs updates from Al Viro:
      "Assorted VFS fixes and related cleanups (IMO the most interesting
       in that part are f_path-related things and Eric's descriptor-related
       stuff).  UFS regression fixes (it got broken last cycle).  9P fixes.
       fs-cache series, DAX patches, Jan's file_remove_suid() work"
    [ I'd say this is much more than "fixes and related cleanups".  The
      file_table locking rule change by Eric Dumazet is a rather big and
      fundamental update even if the patch isn't huge.  - Linus ]
    * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
        9p: cope with bogus responses from server in p9_client_{read,write}
        p9_client_write(): avoid double p9_free_req()
        9p: forgetting to cancel request on interrupted zero-copy RPC
        dax: bdev_direct_access() may sleep
        block: Add support for DAX reads/writes to block devices
        dax: Use copy_from_iter_nocache
        dax: Add block size note to documentation
        fs/file.c: __fget() and dup2() atomicity rules
        fs/file.c: don't acquire files->file_lock in fd_install()
        fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
        vfs: avoid creation of inode number 0 in get_next_ino
        namei: make set_root_rcu() return void
        make simple_positive() public
        ufs: use dir_pages instead of ufs_dir_pages()
        pagemap.h: move dir_pages() over there
        remove the pointless include of lglock.h
        fs: cleanup slight list_entry abuse
        xfs: Correctly lock inode when removing suid and file capabilities
        fs: Call security_ops->inode_killpriv on truncate
        fs: Provide function telling whether file_remove_privs() will do anything
        ...
| * dax: bdev_direct_access() may sleep (Matthew Wilcox, 2015-07-04; 1 file, -0/+6)
    The brd driver is the only in-tree driver that may sleep currently.
    After some discussion on linux-fsdevel, we decided that any driver may
    choose to sleep in its ->direct_access method.  To ensure that all
    callers of bdev_direct_access() are prepared for this, add a call to
    might_sleep().
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
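
    The change itself is small; roughly (an abridged sketch, with the
    range and alignment checks elided):

        long bdev_direct_access(struct block_device *bdev, sector_t sector,
                                void **addr, unsigned long *pfn, long size)
        {
                const struct block_device_operations *ops =
                                bdev->bd_disk->fops;

                /* ->direct_access is now allowed to sleep; trip up
                 * atomic-context callers even on drivers that don't */
                might_sleep();

                if (size < 0)
                        return size;
                if (!ops->direct_access)
                        return -EOPNOTSUPP;
                /* ... range and alignment checks elided ... */
                return ops->direct_access(bdev, sector, addr, pfn, size);
        }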
| * block: Add support for DAX reads/writes to block devices (Matthew Wilcox, 2015-07-04; 2 files, -2/+8)
    If a block device supports the ->direct_access methods, bypass the
    normal DIO path and use DAX to go straight to memcpy() instead of
    allocating a DIO and a BIO.  Includes support for the
    DIO_SKIP_DIO_COUNT flag in DAX, as is done in do_blockdev_direct_IO().
    Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
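
    The dispatch boils down to something like this (a sketch of the idea;
    not necessarily the exact patch):

        static ssize_t
        blkdev_direct_IO(struct kiocb *iocb, struct iov_iter *iter,
                         loff_t offset)
        {
                struct file *file = iocb->ki_filp;
                struct inode *inode = file->f_mapping->host;

                if (IS_DAX(inode))      /* straight to the device's memory */
                        return dax_do_io(iocb, inode, iter, offset,
                                         blkdev_get_block, NULL,
                                         DIO_SKIP_DIO_COUNT);
                return __blockdev_direct_IO(iocb, inode, I_BDEV(inode), iter,
                                            offset, blkdev_get_block, NULL,
                                            NULL, DIO_SKIP_DIO_COUNT);
        }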
| * dax: Use copy_from_iter_nocache (Matthew Wilcox, 2015-07-04; 1 file, -1/+1)
    When userspace does a write, there's no need for the written data to
    pollute the CPU cache.  This matches the original XIP code.
    Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * fs/file.c: __fget() and dup2() atomicity rules (Eric Dumazet, 2015-07-01; 1 file, -2/+8)
    __fget() does a lockless fetch of the pointer from the descriptor
    table, attempts to grab a reference and treats "it was already zero"
    as "it's already gone from the table, we just hadn't seen the store,
    let's fail".  Unfortunately, that breaks the atomicity of dup2() -
    __fget() might see the old pointer, notice that it's been already
    dropped and treat that as "it's closed".  What we should be getting is
    either the old file or the new one, depending on whether we come
    before or after dup2().
    Dmitry had the following test failing sometimes:

        int fd;
        void *Thread(void *x)
        {
                char buf;
                int n = read(fd, &buf, 1);
                if (n != 1)
                        exit(printf("read failed: n=%d errno=%d\n", n, errno));
                return 0;
        }
        int main()
        {
                fd = open("/dev/urandom", O_RDONLY);
                int fd2 = open("/dev/urandom", O_RDONLY);
                if (fd == -1 || fd2 == -1)
                        exit(printf("open failed\n"));
                pthread_t th;
                pthread_create(&th, 0, Thread, 0);
                if (dup2(fd2, fd) == -1)
                        exit(printf("dup2 failed\n"));
                pthread_join(th, 0);
                if (close(fd) == -1)
                        exit(printf("close failed\n"));
                if (close(fd2) == -1)
                        exit(printf("close failed\n"));
                printf("DONE\n");
                return 0;
        }

    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Dmitry Vyukov <dvyukov@google.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
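
    The fix makes __fget() retry instead of failing when it loses the race
    for the reference - a sketch close to the resulting code:

        static struct file *__fget(unsigned int fd, fmode_t mask)
        {
                struct files_struct *files = current->files;
                struct file *file;

                rcu_read_lock();
        loop:
                file = fcheck_files(files, fd);
                if (file) {
                        /* The ref couldn't be taken: the file is on its
                         * way out.  dup2() atomicity is why we loop to
                         * catch the new file (or NULL) instead of failing. */
                        if (file->f_mode & mask)
                                file = NULL;
                        else if (!get_file_rcu(file))
                                goto loop;
                }
                rcu_read_unlock();

                return file;
        }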
| * fs/file.c: don't acquire files->file_lock in fd_install() (Eric Dumazet, 2015-07-01; 1 file, -19/+48)
    Mateusz Guzik reported:
      Currently obtaining a new file descriptor results in locking fdtable
      twice - once in order to reserve a slot and a second time to fill it.
    Holding the spinlock in __fd_install() is needed in case a resize is
    done, or to prevent a resize.  Mateusz provided an RFC patch and a
    micro benchmark: http://people.redhat.com/~mguzik/pipebench.c
    A resize is an unlikely operation in a process lifetime, as table size
    is at least doubled at every resize.  We can use RCU instead of the
    spinlock:
    - __fd_install() must wait if a resize is in progress.
    - The resize must block new __fd_install() callers from starting, and
      wait until ongoing installs are finished (synchronize_sched()).
    - Resize should be attempted by a single thread to not waste resources.
    The rcu_sched variant is used, as __fd_install() and expand_fdtable()
    run from process context.  It gives us a ~30% speedup using pipebench
    on a dual Intel(R) Xeon(R) CPU E5-2696 v2 @ 2.50GHz.
    Signed-off-by: Eric Dumazet <edumazet@google.com>
    Reported-by: Mateusz Guzik <mguzik@redhat.com>
    Acked-by: Mateusz Guzik <mguzik@redhat.com>
    Tested-by: Mateusz Guzik <mguzik@redhat.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
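
    The resulting install path, roughly (a sketch; the slow path waits for
    a resizer to finish before publishing the pointer):

        void __fd_install(struct files_struct *files, unsigned int fd,
                          struct file *file)
        {
                struct fdtable *fdt;

                might_sleep();
                rcu_read_lock_sched();

                while (unlikely(files->resize_in_progress)) {
                        rcu_read_unlock_sched();
                        wait_event(files->resize_wait,
                                   !files->resize_in_progress);
                        rcu_read_lock_sched();
                }
                /* coupled with smp_wmb() in expand_fdtable() */
                smp_rmb();
                fdt = rcu_dereference_sched(files->fdt);
                BUG_ON(fdt->fd[fd] != NULL);
                rcu_assign_pointer(fdt->fd[fd], file);
                rcu_read_unlock_sched();
        }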
| * fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation (Wang YanQing, 2015-07-01; 1 file, -1/+1)
    Concurrent execution of get_anon_bdev(), as well as kernel preemption,
    can produce a race: it isn't enough to check dev against its upper
    limit with an equality test alone.  This patch fixes it.
    Signed-off-by: Wang YanQing <udknight@gmail.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
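
    With preemption or a concurrent caller, the allocated id can jump past
    the limit rather than land exactly on it, so the one-line fix amounts
    to hardening the comparison in get_anon_bdev() (roughly; the exact
    context lines are an assumption):

        -       if (dev == (1 << MINORBITS)) {
        +       if (dev >= (1 << MINORBITS)) {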
| * vfs: avoid creation of inode number 0 in get_next_ino (Carlos Maiolino, 2015-07-01; 1 file, -1/+5)
    Currently, get_next_ino() is able to create inodes with inode number
    0.  This has a bad impact on the filesystems relying on this function
    to generate inode numbers.  While there is no problem at all in having
    inodes with number 0, userspace tools which handle file management
    tasks can have problems handling these files, like, for example, the
    impossibility for users to delete these files, since glibc will ignore
    them.  So I believe the best way is for the kernel to avoid creating
    them.
    This problem has been raised previously, but the old thread didn't
    have any update for a year+, and I've seen too many users hitting the
    same issue regarding the impossibility to delete files while using
    filesystems relying on this function.  So I'm starting the thread
    again, with the same patch that I believe is enough to address this
    problem.
    Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
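
    The guard itself is tiny - a sketch of the shape of the fix in
    get_next_ino() (exact placement in fs/inode.c may differ):

                res++;
                /* never hand out inode number 0 */
                if (unlikely(!res))
                        res++;
                *p = res;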
| * namei: make set_root_rcu() return void (Al Viro, 2015-06-29; 1 file, -3/+3)
    The only caller that cares about its return value can just as easily
    pick it from nd->root_seq itself.  We used to just calculate it and
    return it to the caller, but these days we are storing it in
    nd->root_seq in all cases.
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * make simple_positive() public (Al Viro, 2015-06-24; 7 files, -29/+9)
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * ufs: use dir_pages instead of ufs_dir_pages() (Fabian Frederick, 2015-06-24; 1 file, -9/+4)
    dir_pages was declared in a lot of filesystems.  Use the new
    dir_pages() from pagemap.h.
    Signed-off-by: Fabian Frederick <fabf@skynet.be>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
| * pagemap.h: move dir_pages() over there (Fabian Frederick, 2015-06-24; 7 files, -38/+0)
    That function was declared in a lot of filesystems to calculate
    directory pages.
    Signed-off-by: Fabian Frederick <fabf@skynet.be>
    Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
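
    The helper is a one-liner; roughly what now lives in pagemap.h (modulo
    the exact cast and constants of that kernel version):

        static inline unsigned long dir_pages(struct inode *inode)
        {
                return (unsigned long)(inode->i_size + PAGE_CACHE_SIZE - 1) >>
                       PAGE_CACHE_SHIFT;
        }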