| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
|
| |
These calling conventions are rudiments of pre-2.3 times; they
really need to be sanitized. This is the first step; next
will be _always_ returning a block number, instead of this
"return a pointer to buffer_head, except when we get to the
actual data" crap.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
| |
we'd already calculated it...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
| |
... and massage ufs_frag_map() to take those instead of fragment number.
As it is, we duplicate the damn thing on the write side, open-coded and
bloody hard to follow.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
| |
We are holding ->truncate_mutex, so nobody else can alter our
block pointers. Rechecks/retries were needed back when we
only held BKL there, and had to cope with write_begin/writepage
and writepage/truncate races. Can't happen anymore...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
There's a case when an indirect block gets dirtied for no good
reason - when there's a hole starting in the middle of area
covered by it and spanning past its end, and truncate() is done
precisely to the beginning of the hole.
The block is obviously not modified at all - all removals happen
beyond it. However, existing code ends up dirtying it just in
case. It's trivial to fix and while it's not a real bug by any
stretch of imagination, it makes the damn thing harder to follow.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
| |
Note that it's already made unreachable from the inode, so we don't have
to worry about ufs_frag_map() walking into something already freed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
| |
Have caller fetch the block number *and* remove it from wherever
it was. Pass the block number instead.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
| |
turn recursion into a pair of loops
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We always have 0 < depth2 <= depth in there, so
if (--depth) {
if (--depth2)
A
B
} else {
C // not using depth2
}
D // not using depth2
is equivalent to
if (--depth2)
A with s/depth/depth - 1/
if (--depth)
B
else
C
D
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
| |
open-coded in several places...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
|
| |
For calls in __ufs_truncate_blocks() it's just a matter of not
incrementing offsets[0] and not making that call - immediately
following loop will be executed one extra time and we'll be just
fine. For recursive call in ufs_trunc_branch() itself, just
assing NULL to offsets if we would be about to make such call.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
| |
... and turn the switch into if (), since all cases with
depth != 1 have just become identical.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Instead of manually checking that the array contains only zeroes,
find the position of the last non-zero (in __ufs_truncate(), where
we can conveniently do that) and use that to tell if there's
any non-zero in the array tail passed to ufs_trunc_...indirect().
The goal of all that clumsiness is to get fold these functions
together.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
| |
rather than bitslicing the offset just formed as sum of shifted indices,
pass the array of those indices itself. NULL is used as equivalent
of "all zeroes" (== free the entire branch).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
| |
same as the previous two.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
| |
... instead of file offset. Same cleanups as in the tindirect
conversion in previous commit.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
IOW, the distance of cutoff from the begining of the branch
(in blocks).
That (and the fact that block just prior to cutoff is guaranteed to
be present) allows to tell whether to free triple indirect block
just by looking at the offset.
While we are at it, using u64 for index in the block is wrong -
those should be unsigned int.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
| |
Use ufs_block_to_path() to find the cutoff path in the block pointers' tree.
For now just use the information about the depth (to bypass the fully
preserved subtrees); subsequent commits will use the information about actual
path.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
| |
type makes no sense - those are indices in block number arrays, not
block numbers. And no, UFS is not likely to grow indirect blocks with
4Gpointers in them...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
| |
It is closely tied to block pointers handling there, can benefit
from existing helpers, etc. - no point keeping them apart.
Trimmed the trailing whitespaces in inode.c at the same time.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently - on lock_ufs(), eventually - on per-inode mutex.
lock_ufs() used to be mere BKL, which is much weaker, so it needed
those rechecks. BKL doesn't provide any exclusion once we lose CPU;
its blind replacement, OTOH, _does_. Making that per-filesystem was
an atrocity, but at least we can simplify life here. And yes, we
certainly need to make that sucker per-inode - these days inode.c and
truncate.c uses are needed only to protect the block pointers.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
| |
make it return void
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
|
| |
There were 3 remaining users; in two of them we took ->s_lock immediately
after lock_ufs() and held it until just before unlock_ufs(); the third
one (statfs) could not be called from itself or from other two (remount
and sync_fs). Just use ->s_lock in statfs and don't bother with lock_ufs
at all.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
* stores to block pointers are under per-inode seqlock (meta_lock) and
mutex (truncate_mutex)
* fetches of block pointers are either under truncate_mutex, or wrapped
into seqretry loop on meta_lock
* all changes of ->i_size are under truncate_mutex and i_mutex
* all changes of ->i_lastfrag are under truncate_mutex
It's similar to what ext2 is doing; the main difference is that unlike
ext2 we can't rely upon the atomicity of stores into block pointers -
on UFS2 they are 64bit. So we can't cut the corner when switching
a pointer from NULL to non-NULL as we could in ext2_splice_branch()
and need to use meta_lock on all modifications.
We use seqlock where ext2 uses rwlock; ext2 could probably also benefit
from such change...
Another non-trivial difference is that with UFS we *cannot* have reader
grab truncate_mutex in case of race - it has to keep retrying. That
might be possible to change, but not until we lift tail unpacking
several levels up in call chain.
After that commit we do *NOT* hold fs-wide serialization on accesses
to block pointers anymore. Moreover, lock_ufs() can become a normal
mutex now - it's only used on statfs, remount and sync_fs and none
of those uses are recursive. As the matter of fact, *now* it can be
collapsed with ->s_lock, and be eventually replaced with saner
per-cylinder-group spinlocks, but that's a separate story.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
| |
right now it doesn't matter (lock_ufs() serializes everything),
but when we switch to per-inode locking, it will be needed.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
| |
just prior to __ufs_truncate_blocks(), with matching change of calling
conventions
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
|
| |
Broken in "[PATCH] ufs: truncate should allocate block for last byte";
all way back in 2006. ufs_setattr() hadn't been the only user of
vmtruncate() and eliminating ->truncate() method required corrections
in a bunch of places. Eventually those places had migrated into
->write_begin() failure exit and ->write_end() after short copy...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
| |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|
|
|
|
|
|
|
|
| |
a) move it inside ufs_truncate()
b) ufs_free_inode() doesn't need it - it's serialized on ->s_lock
c) ufs_write_inode() doesn't need it either (and can be called without
it anyway).
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull more vfs updates from Al Viro:
"Assorted VFS fixes and related cleanups (IMO the most interesting in
that part are f_path-related things and Eric's descriptor-related
stuff). UFS regression fixes (it got broken last cycle). 9P fixes.
fs-cache series, DAX patches, Jan's file_remove_suid() work"
[ I'd say this is much more than "fixes and related cleanups". The
file_table locking rule change by Eric Dumazet is a rather big and
fundamental update even if the patch isn't huge. - Linus ]
* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: (49 commits)
9p: cope with bogus responses from server in p9_client_{read,write}
p9_client_write(): avoid double p9_free_req()
9p: forgetting to cancel request on interrupted zero-copy RPC
dax: bdev_direct_access() may sleep
block: Add support for DAX reads/writes to block devices
dax: Use copy_from_iter_nocache
dax: Add block size note to documentation
fs/file.c: __fget() and dup2() atomicity rules
fs/file.c: don't acquire files->file_lock in fd_install()
fs:super:get_anon_bdev: fix race condition could cause dev exceed its upper limitation
vfs: avoid creation of inode number 0 in get_next_ino
namei: make set_root_rcu() return void
make simple_positive() public
ufs: use dir_pages instead of ufs_dir_pages()
pagemap.h: move dir_pages() over there
remove the pointless include of lglock.h
fs: cleanup slight list_entry abuse
xfs: Correctly lock inode when removing suid and file capabilities
fs: Call security_ops->inode_killpriv on truncate
fs: Provide function telling whether file_remove_privs() will do anything
...
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The brd driver is the only in-tree driver that may sleep currently.
After some discussion on linux-fsdevel, we decided that any driver
may choose to sleep in its ->direct_access method. To ensure that all
callers of bdev_direct_access() are prepared for this, add a call
to might_sleep().
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If a block device supports the ->direct_access methods, bypass the normal
DIO path and use DAX to go straight to memcpy() instead of allocating
a DIO and a BIO.
Includes support for the DIO_SKIP_DIO_COUNT flag in DAX, as is done in
do_blockdev_direct_IO().
Signed-off-by: Matthew Wilcox <matthew.r.wilcox@intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |
| |
| |
| |
| |
| |
| |
| | |
When userspace does a write, there's no need for the written data to
pollute the CPU cache. This matches the original XIP code.
Signed-off-by: Matthew Wilcox <willy@linux.intel.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
__fget() does lockless fetch of pointer from the descriptor
table, attempts to grab a reference and treats "it was already
zero" as "it's already gone from the table, we just hadn't
seen the store, let's fail". Unfortunately, that breaks the
atomicity of dup2() - __fget() might see the old pointer,
notice that it's been already dropped and treat that as
"it's closed". What we should be getting is either the
old file or new one, depending whether we come before or after
dup2().
Dmitry had following test failing sometimes :
int fd;
void *Thread(void *x) {
char buf;
int n = read(fd, &buf, 1);
if (n != 1)
exit(printf("read failed: n=%d errno=%d\n", n, errno));
return 0;
}
int main()
{
fd = open("/dev/urandom", O_RDONLY);
int fd2 = open("/dev/urandom", O_RDONLY);
if (fd == -1 || fd2 == -1)
exit(printf("open failed\n"));
pthread_t th;
pthread_create(&th, 0, Thread, 0);
if (dup2(fd2, fd) == -1)
exit(printf("dup2 failed\n"));
pthread_join(th, 0);
if (close(fd) == -1)
exit(printf("close failed\n"));
if (close(fd2) == -1)
exit(printf("close failed\n"));
printf("DONE\n");
return 0;
}
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Dmitry Vyukov <dvyukov@google.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Mateusz Guzik reported :
Currently obtaining a new file descriptor results in locking fdtable
twice - once in order to reserve a slot and second time to fill it.
Holding the spinlock in __fd_install() is needed in case a resize is
done, or to prevent a resize.
Mateusz provided an RFC patch and a micro benchmark :
http://people.redhat.com/~mguzik/pipebench.c
A resize is an unlikely operation in a process lifetime,
as table size is at least doubled at every resize.
We can use RCU instead of the spinlock.
__fd_install() must wait if a resize is in progress.
The resize must block new __fd_install() callers from starting,
and wait that ongoing install are finished (synchronize_sched())
resize should be attempted by a single thread to not waste resources.
rcu_sched variant is used, as __fd_install() and expand_fdtable() run
from process context.
It gives us a ~30% speedup using pipebench on a dual Intel(R) Xeon(R)
CPU E5-2696 v2 @ 2.50GHz
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: Mateusz Guzik <mguzik@redhat.com>
Acked-by: Mateusz Guzik <mguzik@redhat.com>
Tested-by: Mateusz Guzik <mguzik@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
limitation
Execution of get_anon_bdev concurrently and preemptive kernel all
could bring race condition, it isn't enough to check dev against
its upper limitation with equality operator only.
This patch fix it.
Signed-off-by: Wang YanQing <udknight@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
currently, get_next_ino() is able to create inodes with inode number = 0.
This have a bad impact in the filesystems relying in this function to generate
inode numbers.
While there is no problem at all in having inodes with number 0, userspace tools
which handle file management tasks can have problems handling these files, like
for example, the impossiblity of users to delete these files, since glibc will
ignore them. So, I believe the best way is kernel to avoid creating them.
This problem has been raised previously, but the old thread didn't have any
other update for a year+, and I've seen too many users hitting the same issue
regarding the impossibility to delete files while using filesystems relying on
this function. So, I'm starting the thread again, with the same patch
that I believe is enough to address this problem.
Signed-off-by: Carlos Maiolino <cmaiolino@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The only caller that cares about its return value can just
as easily pick it from nd->root_seq itself. We used to just
calculate it and return to caller, but these days we are
storing it in nd->root_seq in all cases.
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |
| |
| |
| | |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |
| |
| |
| |
| |
| |
| |
| | |
dir_pages was declared in a lot of filesystems.
Use newly dir_pages() from pagemap.h
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |
| |
| |
| |
| |
| |
| |
| | |
That function was declared in a lot of filesystems to calculate
directory pages.
Signed-off-by: Fabian Frederick <fabf@skynet.be>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|