| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm
Pull filesystem-dax updates from Dan Williams:
- Fix handling of PMD-sized entries in the Xarray that lead to a crash
scenario
- Miscellaneous cleanups and small fixes
* tag 'fsdax-for-5.1' of git://git.kernel.org/pub/scm/linux/kernel/git/nvdimm/nvdimm:
dax: Flush partial PMDs correctly
fs/dax: NIT fix comment regarding start/end vs range
fs/dax: Convert to use vmf_error()
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The radix tree would rewind the index in an iterator to the lowest index
of a multi-slot entry. The XArray iterators instead leave the index
unchanged, but I overlooked that when converting DAX from the radix tree
to the XArray. Adjust the index that we use for flushing to the start
of the PMD range.
Fixes: c1901cd33cf4 ("page cache: Convert find_get_entries_tag to XArray")
Cc: <stable@vger.kernel.org>
Reported-by: Piotr Balcer <piotr.balcer@intel.com>
Tested-by: Dan Williams <dan.j.williams@intel.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Matthew Wilcox <willy@infradead.org>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
| |
| |
| |
| |
| |
| | |
Fixes: ac46d4f3c432 ("mm/mmu_notifier: use structure for invalidate_range_start/end calls v2")
Signed-off-by: Ira Weiny <ira.weiny@intel.com>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
| |
| |
| |
| |
| |
| |
| |
| | |
This code is converted to use vmf_error().
Signed-off-by: Souptick Joarder <jrdr.linux@gmail.com>
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dan Williams <dan.j.williams@intel.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBI and UBIFS updates from Richard Weinberger:
- A new interface for UBI to deal better with read disturb
- Reject unsupported ioctl flags in UBIFS (xfstests found it)
* tag 'upstream-5.1-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubi: wl: Silence uninitialized variable warning
ubifs: Reject unsupported ioctl flags explicitly
ubi: Expose the bitrot interface
ubi: Introduce in_pq()
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Reject unsupported ioctl flags explicitly, so the following command
on a regular ubifs file will fail:
chattr +d ubifs_file
And xfstests generic/424 will pass.
Signed-off-by: Hou Tao <houtao1@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull NFS server updates from Bruce Fields:
"Miscellaneous NFS server fixes.
Probably the most visible bug is one that could artificially limit
NFSv4.1 performance by limiting the number of oustanding rpcs from a
single client.
Neil Brown also gets a special mention for fixing a 14.5-year-old
memory-corruption bug in the encoding of NFSv3 readdir responses"
* tag 'nfsd-5.1' of git://linux-nfs.org/~bfields/linux:
nfsd: allow nfsv3 readdir request to be larger.
nfsd: fix wrong check in write_v4_end_grace()
nfsd: fix memory corruption caused by readdir
nfsd: fix performance-limiting session calculation
svcrpc: fix UDP on servers with lots of threads
svcrdma: Remove syslog warnings in work completion handlers
svcrdma: Squelch compiler warning when SUNRPC_DEBUG is disabled
svcrdma: Use struct_size() in kmalloc()
svcrpc: fix unlikely races preventing queueing of sockets
svcrpc: svc_xprt_has_something_to_do seems a little long
SUNRPC: Don't allow compiler optimisation of svc_xprt_release_slot()
nfsd: fix an IS_ERR() vs NULL check
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
nfsd currently reports the NFSv3 dtpref FSINFO parameter
to be PAGE_SIZE, so NFS clients will typically ask for one
page of directory entries at a time. This is needlessly restrictive
as nfsd can handle larger replies easily.
Also, a READDIR request (but not a READDIRPLUS request) has the count
size clipped to PAGE_SIE, again unnecessary.
This patch lifts these limits so that larger readdir requests can be
used.
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Commit 62a063b8e7d1 "nfsd4: fix crash on writing v4_end_grace before
nfsd startup" is trying to fix a NULL dereference issue, but it
mistakenly checks if the nfsd server is started. So fix it.
Fixes: 62a063b8e7d1 "nfsd4: fix crash on writing v4_end_grace before nfsd startup"
Cc: stable@vger.kernel.org
Reviewed-by: Joseph Qi <joseph.qi@linux.alibaba.com>
Signed-off-by: Yihao Wu <wuyihao@linux.alibaba.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
If the result of an NFSv3 readdir{,plus} request results in the
"offset" on one entry having to be split across 2 pages, and is sized
so that the next directory entry doesn't fit in the requested size,
then memory corruption can happen.
When encode_entry() is called after encoding the last entry that fits,
it notices that ->offset and ->offset1 are set, and so stores the
offset value in the two pages as required. It clears ->offset1 but
*does not* clear ->offset.
Normally this omission doesn't matter as encode_entry_baggage() will
be called, and will set ->offset to a suitable value (not on a page
boundary).
But in the case where cd->buflen < elen and nfserr_toosmall is
returned, ->offset is not reset.
This means that nfsd3proc_readdirplus will see ->offset with a value 4
bytes before the end of a page, and ->offset1 set to NULL.
It will try to write 8bytes to ->offset.
If we are lucky, the next page will be read-only, and the system will
BUG: unable to handle kernel paging request at...
If we are unlucky, some innocent page will have the first 4 bytes
corrupted.
nfsd3proc_readdir() doesn't even check for ->offset1, it just blindly
writes 8 bytes to the offset wherever it is.
Fix this by clearing ->offset after it is used, and copying the
->offset handling code from nfsd3_proc_readdirplus into
nfsd3_proc_readdir.
(Note that the commit hash in the Fixes tag is from the 'history'
tree - this bug predates git).
Fixes: 0b1d57cf7654 ("[PATCH] kNFSd: Fix nfs3 dentry encoding")
Fixes-URL: https://git.kernel.org/pub/scm/linux/kernel/git/history/history.git/commit/?id=0b1d57cf7654
Cc: stable@vger.kernel.org (v2.6.12+)
Signed-off-by: NeilBrown <neilb@suse.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We're unintentionally limiting the number of slots per nfsv4.1 session
to 10. Often more than 10 simultaneous RPCs are needed for the best
performance.
This calculation was meant to prevent any one client from using up more
than a third of the limit we set for total memory use across all clients
and sessions. Instead, it's limiting the client to a third of the
maximum for a single session.
Fix this.
Reported-by: Chris Tracy <ctracy@engr.scu.edu>
Cc: stable@vger.kernel.org
Fixes: de766e570413 "nfsd: give out fewer session slots as limit approaches"
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The get_backchannel_cred() used to return error pointers on error but
now it returns NULL pointers.
Fixes: 97f68c6b02e0 ("SUNRPC: add 'struct cred *' to auth_cred and rpc_cre")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: J. Bruce Fields <bfields@redhat.com>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 updates from Ted Ts'o:
"A large number of bug fixes and cleanups.
One new feature to allow users to more easily find the jbd2 journal
thread for a particular ext4 file system"
* tag 'ext4_for_linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4: (25 commits)
jbd2: jbd2_get_transaction does not need to return a value
jbd2: fix invalid descriptor block checksum
ext4: fix bigalloc cluster freeing when hole punching under load
ext4: add sysfs attr /sys/fs/ext4/<disk>/journal_task
ext4: Change debugging support help prefix from EXT4 to Ext4
ext4: fix compile error when using BUFFER_TRACE
jbd2: fix compile warning when using JBUFFER_TRACE
ext4: fix some error pointer dereferences
ext4: annotate more implicit fall throughs
ext4: annotate implicit fall throughs
ext4: don't update s_rev_level if not required
jbd2: fold jbd2_superblock_csum_{verify,set} into their callers
jbd2: fix race when writing superblock
ext4: fix crash during online resizing
ext4: disallow files with EXT4_JOURNAL_DATA_FL from EXT4_IOC_SWAP_BOOT
ext4: add mask of ext4 flags to swap
ext4: update quota information while swapping boot loader inode
ext4: cleanup pagecache before swap i_data
ext4: fix check of inode in swap_inode_boot_loader
ext4: unlock unused_pages timely when doing writeback
...
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
In jbd2_get_transaction, a new transaction is initialized,
and set to the j_running_transaction. No need for a return
value, so remove it.
Also, adjust some comments to match the actual operation
of this function.
Signed-off-by: Liu Song <liu.song11@zte.com.cn>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
In jbd2_journal_commit_transaction(), if we are in abort mode,
we may flush the buffer without setting descriptor block checksum
by goto start_journal_io. Then fs is mounted,
jbd2_descriptor_block_csum_verify() failed.
[ 271.379811] EXT4-fs (vdd): shut down requested (2)
[ 271.381827] Aborting journal on device vdd-8.
[ 271.597136] JBD2: Invalid checksum recovering block 22199 in log
[ 271.598023] JBD2: recovery failed
[ 271.598484] EXT4-fs (vdd): error loading journal
Fix this problem by keep setting descriptor block checksum if the
descriptor buffer is not NULL.
This checksum problem can be reproduced by xfstests generic/388.
Signed-off-by: luojiajun <luojiajun3@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Ext4 may not free clusters correctly when punching holes in bigalloc
file systems under high load conditions. If it's not possible to
extend and restart the journal in ext4_ext_rm_leaf() when preparing to
remove blocks from a punched region, a retry of the entire punch
operation is triggered in ext4_ext_remove_space(). This causes a
partial cluster to be set to the first cluster in the extent found to
the right of the punched region. However, if the punch operation
prior to the retry had made enough progress to delete one or more
extents and a partial cluster candidate for freeing had already been
recorded, the retry would overwrite the partial cluster. The loss of
this information makes it impossible to correctly free the original
partial cluster in all cases.
This bug can cause generic/476 to fail when run as part of
xfstests-bld's bigalloc and bigalloc_1k test cases. The failure is
reported when e2fsck detects bad iblocks counts greater than expected
in units of whole clusters and also detects a number of negative block
bitmap differences equal to the iblocks discrepancy in cluster units.
Signed-off-by: Eric Whitney <enwlinux@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
This is useful for moving journal thread into cgroup or
for tracing it with ftrace/perf/blktrace.
For now the only way is `pgrep jbd2/$DISK` but this is not reliable:
name may be longer than "comm" limit and any task could mock it.
Attribute shows pid in current pid-namespace or 0 if task is unreachable.
Signed-off-by: Konstantin Khlebnikov <khlebnikov@yandex-team.ru>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
All other configuration options for the ext* family of file systems use
"Ext%u" instead of "EXT%u".
Fixes: 6ba495e9259cd9a0 ("ext4: Add configurable run-time mballoc debugging")
Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Fix compile error below when using BUFFER_TRACE.
fs/ext4/inode.c: In function ‘ext4_expand_extra_isize’:
fs/ext4/inode.c:5979:19: error: request for member ‘bh’ in something not a structure or union
BUFFER_TRACE(iloc.bh, "get_write_access");
Fixes: c03b45b853f58 ("ext4, project: expand inode extra size if possible")
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The jh pointer may be used uninitialized in the two cases below and the
compiler complain about it when enabling JBUFFER_TRACE macro, fix them.
In file included from fs/jbd2/transaction.c:19:0:
fs/jbd2/transaction.c: In function ‘jbd2_journal_get_undo_access’:
./include/linux/jbd2.h:1637:38: warning: ‘jh’ is used uninitialized in this function [-Wuninitialized]
#define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
^
fs/jbd2/transaction.c:1219:23: note: ‘jh’ was declared here
struct journal_head *jh;
^
In file included from fs/jbd2/transaction.c:19:0:
fs/jbd2/transaction.c: In function ‘jbd2_journal_dirty_metadata’:
./include/linux/jbd2.h:1637:38: warning: ‘jh’ may be used uninitialized in this function [-Wmaybe-uninitialized]
#define JBUFFER_TRACE(jh, info) do { printk("%s: %d\n", __func__, jh->b_jcount);} while (0)
^
fs/jbd2/transaction.c:1332:23: note: ‘jh’ was declared here
struct journal_head *jh;
^
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
Reviewed-by: Jan Kara <jack@suse.cz>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
We can't pass error pointers to brelse().
Fixes: fb265c9cb49e ("ext4: add ext4_sb_bread() to disambiguate ENOMEM cases")
Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
There is a plan to build the kernel with -Wimplicit-fallthrough and
these places in the code produced warnings (W=1). Fix them up.
This commit remove the following warnings:
fs/ext4/indirect.c:1182:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1188:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1432:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/indirect.c:1440:6: warning: this statement may fall through [-Wimplicit-fallthrough=]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
There is a plan to build the kernel with -Wimplicit-fallthrough and
these places in the code produced warnings (W=1). Fix them up.
This commit remove the following warnings:
fs/ext4/hash.c:233:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
fs/ext4/hash.c:246:15: warning: this statement may fall through [-Wimplicit-fallthrough=]
Signed-off-by: Mathieu Malaterre <malat@debian.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Don't update the superblock s_rev_level during mount if it isn't
actually necessary, only if superblock features are being set by
the kernel. This was originally added for ext3 since it always
set the INCOMPAT_RECOVER and HAS_JOURNAL features during mount,
but this is not needed since no journal mode was added to ext4.
That will allow Geert to mount his 20-year-old ext2 rev 0.0 m68k
filesystem, as a testament of the backward compatibility of ext4.
Fixes: 0390131ba84f ("ext4: Allow ext4 to run without a journal")
Signed-off-by: Andreas Dilger <adilger@dilger.ca>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The functions jbd2_superblock_csum_verify() and
jbd2_superblock_csum_set() only get called from one location, so to
simplify things, fold them into their callers.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The jbd2 superblock is lockless now, so there is probably a race
condition between writing it so disk and modifing contents of it, which
may lead to checksum error. The following race is the one case that we
have captured.
jbd2 fsstress
jbd2_journal_commit_transaction
jbd2_journal_update_sb_log_tail
jbd2_write_superblock
jbd2_superblock_csum_set jbd2_journal_revoke
jbd2_journal_set_features(revork)
modify superblock
submit_bh(checksum incorrect)
Fix this by locking the buffer head before modifing it. We always
write the jbd2 superblock after we modify it, so this just means
calling the lock_buffer() a little earlier.
This checksum corruption problem can be reproduced by xfstests
generic/475.
Reported-by: zhangyi (F) <yi.zhang@huawei.com>
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When computing maximum size of filesystem possible with given number of
group descriptor blocks, we forget to include s_first_data_block into
the number of blocks. Thus for filesystems with non-zero
s_first_data_block it can happen that computed maximum filesystem size
is actually lower than current filesystem size which confuses the code
and eventually leads to a BUG_ON in ext4_alloc_group_tables() hitting on
flex_gd->count == 0. The problem can be reproduced like:
truncate -s 100g /tmp/image
mkfs.ext4 -b 1024 -E resize=262144 /tmp/image 32768
mount -t ext4 -o loop /tmp/image /mnt
resize2fs /dev/loop0 262145
resize2fs /dev/loop0 300000
Fix the problem by properly including s_first_data_block into the
computed number of filesystem blocks.
Fixes: 1c6bd7173d66 "ext4: convert file system to meta_bg if needed..."
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
A malicious/clueless root user can use EXT4_IOC_SWAP_BOOT to force a
corner casew which can lead to the file system getting corrupted.
There's no usefulness to allowing this, so just prohibit this case.
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The reason is that while swapping two inode, we swap the flags too.
Some flags such as EXT4_JOURNAL_DATA_FL can really confuse the things
since we're not resetting the address operations structure. The
simplest way to keep things sane is to restrict the flags that can be
swapped.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@vger.kernel.org
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
While do swap between two inode, they swap i_data without update
quota information. Also, swap_inode_boot_loader can do "revert"
somtimes, so update the quota while all operations has been finished.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
While do swap, we should make sure there has no new dirty page since we
should swap i_data between two inode:
1.We should lock i_mmap_sem with write to avoid new pagecache from mmap
read/write;
2.Change filemap_flush to filemap_write_and_wait and move them to the
space protected by inode lock to avoid new pagecache from buffer read/write.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Before really do swap between inode and boot inode, something need to
check to avoid invalid or not permitted operation, like does this inode
has inline data. But the condition check should be protected by inode
lock to avoid change while swapping. Also some other condition will not
change between swapping, but there has no problem to do this under inode
lock.
Signed-off-by: yangerkun <yangerkun@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
In mpage_add_bh_to_extent(), when accumulated extents length is greater
than MAX_WRITEPAGES_EXTENT_LEN or buffer head's b_stat is not equal, we
will not continue to search unmapped area for this page, but note this
page is locked, and will only be unlocked in mpage_release_unused_pages()
after ext4_io_submit, if io also is throttled by blk-throttle or similar
io qos, we will hold this page locked for a while, it's unnecessary.
I think the best fix is to refactor mpage_add_bh_to_extent() to let it
return some hints whether to unlock this page, but given that we will
improve dioread_nolock later, we can let it done later, so currently
the simple fix would just call mpage_release_unused_pages() before
ext4_io_submit().
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Now, we have already handle all cases of forgetting buffer in
jbd2_journal_forget(), the buffer should not be mapped to blockdevice
when reallocating it. So this patch remove all clean_bdev_aliases() and
clean_bdev_bh_alias() calls which were invoked by ext4 explicitly.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
We do not unmap and clear dirty flag when forgetting a buffer without
journal or does not belongs to any transaction, so the invalid dirty
data may still be written to the disk later. It's fine if the
corresponding block is never used before the next mount, and it's also
fine that we invoke clean_bdev_aliases() related functions to unmap
the block device mapping when re-allocating such freed block as data
block. But this logic is somewhat fragile and risky that may lead to
data corruption if we forget to clean bdev aliases. So, It's better to
discard dirty data during forget time.
We have been already handled all the cases of forgetting journalled
buffer, this patch deal with the remaining two cases.
- buffer is not journalled yet,
- buffer is journalled but doesn't belongs to any transaction.
We invoke __bforget() instead of __brelese() when forgetting an
un-journalled buffer in jbd2_journal_forget(). After this patch we can
remove all clean_bdev_aliases() related calls in ext4.
Suggested-by: Jan Kara <jack@suse.cz>
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Now, we capture a data corruption problem on ext4 while we're truncating
an extent index block. Imaging that if we are revoking a buffer which
has been journaled by the committing transaction, the buffer's jbddirty
flag will not be cleared in jbd2_journal_forget(), so the commit code
will set the buffer dirty flag again after refile the buffer.
fsx kjournald2
jbd2_journal_commit_transaction
jbd2_journal_revoke commit phase 1~5...
jbd2_journal_forget
belongs to older transaction commit phase 6
jbddirty not clear __jbd2_journal_refile_buffer
__jbd2_journal_unfile_buffer
test_clear_buffer_jbddirty
mark_buffer_dirty
Finally, if the freed extent index block was allocated again as data
block by some other files, it may corrupt the file data after writing
cached pages later, such as during unmount time. (In general,
clean_bdev_aliases() related helpers should be invoked after
re-allocation to prevent the above corruption, but unfortunately we
missed it when zeroout the head of extra extent blocks in
ext4_ext_handle_unwritten_extents()).
This patch mark buffer as freed and set j_next_transaction to the new
transaction when it already belongs to the committing transaction in
jbd2_journal_forget(), so that commit code knows it should clear dirty
bits when it is done with the buffer.
This problem can be reproduced by xfstests generic/455 easily with
seeds (3246 3247 3248 3249).
Signed-off-by: zhangyi (F) <yi.zhang@huawei.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
Cc: stable@vger.kernel.org
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
There is a function which clearly conveys the objective of checking
i_writecount. Additionally the usage in ext4_mb_initialize_context was
wrong, since a node would have wrongfully been reported as writable if
i_writecount had a negative value (MMAP_DENY_WRITE).
Signed-off-by: Nikolay Borisov <nborisov@suse.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Jan Kara <jack@suse.cz>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
This issue was found when I tried to put checkpoint work in a separate thread,
the deadlock below happened:
Thread1 | Thread2
__jbd2_log_wait_for_space |
jbd2_log_do_checkpoint (hold j_checkpoint_mutex)|
if (jh->b_transaction != NULL) |
... |
jbd2_log_start_commit(journal, tid); |jbd2_update_log_tail
| will lock j_checkpoint_mutex,
| but will be blocked here.
|
jbd2_log_wait_commit(journal, tid); |
wait_event(journal->j_wait_done_commit, |
!tid_gt(tid, journal->j_commit_sequence)); |
... |wake_up(j_wait_done_commit)
} |
then deadlock occurs, Thread1 will never be waken up.
To fix this issue, drop j_checkpoint_mutex in jbd2_log_do_checkpoint()
when we are going to wait for transaction commit.
Reviewed-by: Jan Kara <jack@suse.cz>
Signed-off-by: Xiaoguang Wang <xiaoguang.wang@linux.alibaba.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Pull ceph updates from Ilya Dryomov:
"The highlights are:
- rbd will now ignore discards that aren't aligned and big enough to
actually free up some space (myself). This is controlled by the new
alloc_size map option and can be disabled if needed.
- support for rbd deep-flatten feature (myself). Deep-flatten allows
"rbd flatten" to fully disconnect the clone image and its snapshots
from the parent and make the parent snapshot removable.
- a new round of cap handling improvements (Zheng Yan). The kernel
client should now be much more prompt about releasing its caps and
it is possible to put a limit on the number of caps held.
- support for getting ceph.dir.pin extended attribute (Zheng Yan)"
* tag 'ceph-for-5.1-rc1' of git://github.com/ceph/ceph-client: (26 commits)
Documentation: modern versions of ceph are not backed by btrfs
rbd: advertise support for RBD_FEATURE_DEEP_FLATTEN
rbd: whole-object write and zeroout should copyup when snapshots exist
rbd: copyup with an empty snapshot context (aka deep-copyup)
rbd: introduce rbd_obj_issue_copyup_ops()
rbd: stop copying num_osd_ops in rbd_obj_issue_copyup()
rbd: factor out __rbd_osd_req_create()
rbd: clear ->xferred on error from rbd_obj_issue_copyup()
rbd: remove experimental designation from kernel layering
ceph: add mount option to limit caps count
ceph: periodically trim stale dentries
ceph: delete stale dentry when last reference is dropped
ceph: remove dentry_lru file from debugfs
ceph: touch existing cap when handling reply
ceph: pass inclusive lend parameter to filemap_write_and_wait_range()
rbd: round off and ignore discards that are too small
rbd: handle DISCARD and WRITE_ZEROES separately
rbd: get rid of obj_req->obj_request_count
libceph: use struct_size() for kmalloc() in crush_decode()
ceph: send cap releases more aggressively
...
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
If number of caps exceed the limit, ceph_trim_dentires() also trim
dentries with valid leases. Trimming dentry releases references to
associated inode, which may evict inode and release caps.
By default, there is no limit for caps count.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Previous commit make VFS delete stale dentry when last reference is
dropped. Lease also can become invalid when corresponding dentry has
no reference. This patch make cephfs periodically scan lease list,
delete corresponding dentry if lease is invalid.
There are two types of lease, dentry lease and dir lease. dentry lease
has life time and applies to singe dentry. Dentry lease is added to tail
of a list when it's updated, leases at front of the list will expire
first. Dir lease is CEPH_CAP_FILE_SHARED on directory inode, it applies
to all dentries in the directory. Dentries have dir leases are added to
another list. Dentries in the list are periodically checked in a round
robin manner.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
introduce ceph_d_delete(), which checks if dentry has valid lease.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
The file shows all dentries in cephfs mount. It's not very useful.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Move cap to tail of session->s_caps list. So ceph_trim_caps() will
trim older caps first.
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
The 'lend' parameter of filemap_write_and_wait_range is required to be
inclusive, so follow the rule. Same for invalidate_inode_pages2_range.
Signed-off-by: zhengbin <zhengbin13@huawei.com>
Reviewed-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
When pending cap releases fill up one message, start a work to send
cap release message. (old way is sending cap releases every 5 seconds)
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Reviewed-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Link: http://tracker.ceph.com/issues/37576
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
In versioned reply, inodestat, dirstat and lease are encoded with
version, compat_version and struct_len.
Based on a patch from Jos Collin <jcollin@redhat.com>.
Link: http://tracker.ceph.com/issues/26936
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
ceph_getattr() return zero dev ID for head inodes and set dev ID to
snapid directly for snaphost inodes. This is not good because userspace
utilities may consider device ID of 0 as invalid, snapid may conflict
with other device's ID.
This patch introduces "snapids to anonymous bdev IDs" map. we create a
new mapping when we see a snapid for the first time. we trim unused
mapping after it is ilde for 5 minutes.
Link: http://tracker.ceph.com/issues/22353
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Acked-by: Jeff Layton <jlayton@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Signed-off-by: "Yan, Zheng" <zyan@redhat.com>
Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
|