| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux
Pull btrfs fixes from David Sterba:
"Notable highlights:
- fixes for some long-standing bugs in fsync that were quite hard to
catch but now finaly fixed
- some fixups to error handling paths that did not properly clean up
(locking, memory)
- fix to space reservation for inheriting properties"
* tag 'for-5.2-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux:
Btrfs: tree-checker: detect file extent items with overlapping ranges
Btrfs: fix race between ranged fsync and writeback of adjacent ranges
Btrfs: avoid fallback to transaction commit during fsync of files with holes
btrfs: extent-tree: Fix a bug that btrfs is unable to add pinned bytes
btrfs: sysfs: don't leak memory when failing add fsid
btrfs: sysfs: Fix error path kobject memory leak
Btrfs: do not abort transaction at btrfs_update_root() after failure to COW path
btrfs: use the existing reserved items for our first prop for inheritance
btrfs: don't double unlock on error in btrfs_punch_hole
btrfs: Check the compression level before getting a workspace
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Having file extent items with ranges that overlap each other is a
serious issue that leads to all sorts of corruptions and crashes (like a
BUG_ON() during the course of __btrfs_drop_extents() when it traims file
extent items). Therefore teach the tree checker to detect such cases.
This is motivated by a recently fixed bug (race between ranged full
fsync and writeback or adjacent ranges).
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When we do a full fsync (the bit BTRFS_INODE_NEEDS_FULL_SYNC is set in the
inode) that happens to be ranged, which happens during a msync() or writes
for files opened with O_SYNC for example, we can end up with a corrupt log,
due to different file extent items representing ranges that overlap with
each other, or hit some assertion failures.
When doing a ranged fsync we only flush delalloc and wait for ordered
exents within that range. If while we are logging items from our inode
ordered extents for adjacent ranges complete, we end up in a race that can
make us insert the file extent items that overlap with others we logged
previously and the assertion failures.
For example, if tree-log.c:copy_items() receives a leaf that has the
following file extents items, all with a length of 4K and therefore there
is an implicit hole in the range 68K to 72K - 1:
(257 EXTENT_ITEM 64K), (257 EXTENT_ITEM 72K), (257 EXTENT_ITEM 76K), ...
It copies them to the log tree. However due to the need to detect implicit
holes, it may release the path, in order to look at the previous leaf to
detect an implicit hole, and then later it will search again in the tree
for the first file extent item key, with the goal of locking again the
leaf (which might have changed due to concurrent changes to other inodes).
However when it locks again the leaf containing the first key, the key
corresponding to the extent at offset 72K may not be there anymore since
there is an ordered extent for that range that is finishing (that is,
somewhere in the middle of btrfs_finish_ordered_io()), and it just
removed the file extent item but has not yet replaced it with a new file
extent item, so the part of copy_items() that does hole detection will
decide that there is a hole in the range starting from 68K to 76K - 1,
and therefore insert a file extent item to represent that hole, having
a key offset of 68K. After that we now have a log tree with 2 different
extent items that have overlapping ranges:
1) The file extent item copied before copy_items() released the path,
which has a key offset of 72K and a length of 4K, representing the
file range 72K to 76K - 1.
2) And a file extent item representing a hole that has a key offset of
68K and a length of 8K, representing the range 68K to 76K - 1. This
item was inserted after releasing the path, and overlaps with the
extent item inserted before.
The overlapping extent items can cause all sorts of unpredictable and
incorrect behaviour, either when replayed or if a fast (non full) fsync
happens later, which can trigger a BUG_ON() when calling
btrfs_set_item_key_safe() through __btrfs_drop_extents(), producing a
trace like the following:
[61666.783269] ------------[ cut here ]------------
[61666.783943] kernel BUG at fs/btrfs/ctree.c:3182!
[61666.784644] invalid opcode: 0000 [#1] PREEMPT SMP
(...)
[61666.786253] task: ffff880117b88c40 task.stack: ffffc90008168000
[61666.786253] RIP: 0010:btrfs_set_item_key_safe+0x7c/0xd2 [btrfs]
[61666.786253] RSP: 0018:ffffc9000816b958 EFLAGS: 00010246
[61666.786253] RAX: 0000000000000000 RBX: 000000000000000f RCX: 0000000000030000
[61666.786253] RDX: 0000000000000000 RSI: ffffc9000816ba4f RDI: ffffc9000816b937
[61666.786253] RBP: ffffc9000816b998 R08: ffff88011dae2428 R09: 0000000000001000
[61666.786253] R10: 0000160000000000 R11: 6db6db6db6db6db7 R12: ffff88011dae2418
[61666.786253] R13: ffffc9000816ba4f R14: ffff8801e10c4118 R15: ffff8801e715c000
[61666.786253] FS: 00007f6060a18700(0000) GS:ffff88023f5c0000(0000) knlGS:0000000000000000
[61666.786253] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[61666.786253] CR2: 00007f6060a28000 CR3: 0000000213e69000 CR4: 00000000000006e0
[61666.786253] Call Trace:
[61666.786253] __btrfs_drop_extents+0x5e3/0xaad [btrfs]
[61666.786253] ? time_hardirqs_on+0x9/0x14
[61666.786253] btrfs_log_changed_extents+0x294/0x4e0 [btrfs]
[61666.786253] ? release_extent_buffer+0x38/0xb4 [btrfs]
[61666.786253] btrfs_log_inode+0xb6e/0xcdc [btrfs]
[61666.786253] ? lock_acquire+0x131/0x1c5
[61666.786253] ? btrfs_log_inode_parent+0xee/0x659 [btrfs]
[61666.786253] ? arch_local_irq_save+0x9/0xc
[61666.786253] ? btrfs_log_inode_parent+0x1f5/0x659 [btrfs]
[61666.786253] btrfs_log_inode_parent+0x223/0x659 [btrfs]
[61666.786253] ? arch_local_irq_save+0x9/0xc
[61666.786253] ? lockref_get_not_zero+0x2c/0x34
[61666.786253] ? rcu_read_unlock+0x3e/0x5d
[61666.786253] btrfs_log_dentry_safe+0x60/0x7b [btrfs]
[61666.786253] btrfs_sync_file+0x317/0x42c [btrfs]
[61666.786253] vfs_fsync_range+0x8c/0x9e
[61666.786253] SyS_msync+0x13c/0x1c9
[61666.786253] entry_SYSCALL_64_fastpath+0x18/0xad
A sample of a corrupt log tree leaf with overlapping extents I got from
running btrfs/072:
item 14 key (295 108 200704) itemoff 2599 itemsize 53
extent data disk bytenr 0 nr 0
extent data offset 0 nr 458752 ram 458752
item 15 key (295 108 659456) itemoff 2546 itemsize 53
extent data disk bytenr 4343541760 nr 770048
extent data offset 606208 nr 163840 ram 770048
item 16 key (295 108 663552) itemoff 2493 itemsize 53
extent data disk bytenr 4343541760 nr 770048
extent data offset 610304 nr 155648 ram 770048
item 17 key (295 108 819200) itemoff 2440 itemsize 53
extent data disk bytenr 4334788608 nr 4096
extent data offset 0 nr 4096 ram 4096
The file extent item at offset 659456 (item 15) ends at offset 823296
(659456 + 163840) while the next file extent item (item 16) starts at
offset 663552.
Another different problem that the race can trigger is a failure in the
assertions at tree-log.c:copy_items(), which expect that the first file
extent item key we found before releasing the path exists after we have
released path and that the last key we found before releasing the path
also exists after releasing the path:
$ cat -n fs/btrfs/tree-log.c
4080 if (need_find_last_extent) {
4081 /* btrfs_prev_leaf could return 1 without releasing the path */
4082 btrfs_release_path(src_path);
4083 ret = btrfs_search_slot(NULL, inode->root, &first_key,
4084 src_path, 0, 0);
4085 if (ret < 0)
4086 return ret;
4087 ASSERT(ret == 0);
(...)
4103 if (i >= btrfs_header_nritems(src_path->nodes[0])) {
4104 ret = btrfs_next_leaf(inode->root, src_path);
4105 if (ret < 0)
4106 return ret;
4107 ASSERT(ret == 0);
4108 src = src_path->nodes[0];
4109 i = 0;
4110 need_find_last_extent = true;
4111 }
(...)
The second assertion implicitly expects that the last key before the path
release still exists, because the surrounding while loop only stops after
we have found that key. When this assertion fails it produces a stack like
this:
[139590.037075] assertion failed: ret == 0, file: fs/btrfs/tree-log.c, line: 4107
[139590.037406] ------------[ cut here ]------------
[139590.037707] kernel BUG at fs/btrfs/ctree.h:3546!
[139590.038034] invalid opcode: 0000 [#1] SMP DEBUG_PAGEALLOC PTI
[139590.038340] CPU: 1 PID: 31841 Comm: fsstress Tainted: G W 5.0.0-btrfs-next-46 #1
(...)
[139590.039354] RIP: 0010:assfail.constprop.24+0x18/0x1a [btrfs]
(...)
[139590.040397] RSP: 0018:ffffa27f48f2b9b0 EFLAGS: 00010282
[139590.040730] RAX: 0000000000000041 RBX: ffff897c635d92c8 RCX: 0000000000000000
[139590.041105] RDX: 0000000000000000 RSI: ffff897d36a96868 RDI: ffff897d36a96868
[139590.041470] RBP: ffff897d1b9a0708 R08: 0000000000000000 R09: 0000000000000000
[139590.041815] R10: 0000000000000008 R11: 0000000000000000 R12: 0000000000000013
[139590.042159] R13: 0000000000000227 R14: ffff897cffcbba88 R15: 0000000000000001
[139590.042501] FS: 00007f2efc8dee80(0000) GS:ffff897d36a80000(0000) knlGS:0000000000000000
[139590.042847] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[139590.043199] CR2: 00007f8c064935e0 CR3: 0000000232252002 CR4: 00000000003606e0
[139590.043547] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[139590.043899] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[139590.044250] Call Trace:
[139590.044631] copy_items+0xa3f/0x1000 [btrfs]
[139590.045009] ? generic_bin_search.constprop.32+0x61/0x200 [btrfs]
[139590.045396] btrfs_log_inode+0x7b3/0xd70 [btrfs]
[139590.045773] btrfs_log_inode_parent+0x2b3/0xce0 [btrfs]
[139590.046143] ? do_raw_spin_unlock+0x49/0xc0
[139590.046510] btrfs_log_dentry_safe+0x4a/0x70 [btrfs]
[139590.046872] btrfs_sync_file+0x3b6/0x440 [btrfs]
[139590.047243] btrfs_file_write_iter+0x45b/0x5c0 [btrfs]
[139590.047592] __vfs_write+0x129/0x1c0
[139590.047932] vfs_write+0xc2/0x1b0
[139590.048270] ksys_write+0x55/0xc0
[139590.048608] do_syscall_64+0x60/0x1b0
[139590.048946] entry_SYSCALL_64_after_hwframe+0x49/0xbe
[139590.049287] RIP: 0033:0x7f2efc4be190
(...)
[139590.050342] RSP: 002b:00007ffe743243a8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
[139590.050701] RAX: ffffffffffffffda RBX: 0000000000008d58 RCX: 00007f2efc4be190
[139590.051067] RDX: 0000000000008d58 RSI: 00005567eca0f370 RDI: 0000000000000003
[139590.051459] RBP: 0000000000000024 R08: 0000000000000003 R09: 0000000000008d60
[139590.051863] R10: 0000000000000078 R11: 0000000000000246 R12: 0000000000000003
[139590.052252] R13: 00000000003d3507 R14: 00005567eca0f370 R15: 0000000000000000
(...)
[139590.055128] ---[ end trace 193f35d0215cdeeb ]---
So fix this race between a full ranged fsync and writeback of adjacent
ranges by flushing all delalloc and waiting for all ordered extents to
complete before logging the inode. This is the simplest way to solve the
problem because currently the full fsync path does not deal with ranges
at all (it assumes a full range from 0 to LLONG_MAX) and it always needs
to look at adjacent ranges for hole detection. For use cases of ranged
fsyncs this can make a few fsyncs slower but on the other hand it can
make some following fsyncs to other ranges do less work or no need to do
anything at all. A full fsync is rare anyway and happens only once after
loading/creating an inode and once after less common operations such as a
shrinking truncate.
This is an issue that exists for a long time, and was often triggered by
generic/127, because it does mmap'ed writes and msync (which triggers a
ranged fsync). Adding support for the tree checker to detect overlapping
extents (next patch in the series) and trigger a WARN() when such cases
are found, and then calling btrfs_check_leaf_full() at the end of
btrfs_insert_file_extent() made the issue much easier to detect. Running
btrfs/072 with that change to the tree checker and making fsstress open
files always with O_SYNC made it much easier to trigger the issue (as
triggering it with generic/127 is very rare).
CC: stable@vger.kernel.org # 3.16+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When we are doing a full fsync (bit BTRFS_INODE_NEEDS_FULL_SYNC set) of a
file that has holes and has file extent items spanning two or more leafs,
we can end up falling to back to a full transaction commit due to a logic
bug that leads to failure to insert a duplicate file extent item that is
meant to represent a hole between the last file extent item of a leaf and
the first file extent item in the next leaf. The failure (EEXIST error)
leads to a transaction commit (as most errors when logging an inode do).
For example, we have the two following leafs:
Leaf N:
-----------------------------------------------
| ..., ..., ..., (257, FILE_EXTENT_ITEM, 64K) |
-----------------------------------------------
The file extent item at the end of leaf N has a length of 4Kb,
representing the file range from 64K to 68K - 1.
Leaf N + 1:
-----------------------------------------------
| (257, FILE_EXTENT_ITEM, 72K), ..., ..., ... |
-----------------------------------------------
The file extent item at the first slot of leaf N + 1 has a length of
4Kb too, representing the file range from 72K to 76K - 1.
During the full fsync path, when we are at tree-log.c:copy_items() with
leaf N as a parameter, after processing the last file extent item, that
represents the extent at offset 64K, we take a look at the first file
extent item at the next leaf (leaf N + 1), and notice there's a 4K hole
between the two extents, and therefore we insert a file extent item
representing that hole, starting at file offset 68K and ending at offset
72K - 1. However we don't update the value of *last_extent, which is used
to represent the end offset (plus 1, non-inclusive end) of the last file
extent item inserted in the log, so it stays with a value of 68K and not
with a value of 72K.
Then, when copy_items() is called for leaf N + 1, because the value of
*last_extent is smaller then the offset of the first extent item in the
leaf (68K < 72K), we look at the last file extent item in the previous
leaf (leaf N) and see it there's a 4K gap between it and our first file
extent item (again, 68K < 72K), so we decide to insert a file extent item
representing the hole, starting at file offset 68K and ending at offset
72K - 1, this insertion will fail with -EEXIST being returned from
btrfs_insert_file_extent() because we already inserted a file extent item
representing a hole for this offset (68K) in the previous call to
copy_items(), when processing leaf N.
The -EEXIST error gets propagated to the fsync callback, btrfs_sync_file(),
which falls back to a full transaction commit.
Fix this by adjusting *last_extent after inserting a hole when we had to
look at the next leaf.
Fixes: 4ee3fad34a9c ("Btrfs: fix fsync after hole punching when using no-holes feature")
Cc: stable@vger.kernel.org # 4.14+
Reviewed-by: Josef Bacik <josef@toxicpanda.com>
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit ddf30cf03fb5 ("btrfs: extent-tree: Use btrfs_ref to refactor
add_pinned_bytes()") refactored add_pinned_bytes(), but during that
refactor, there are two callers which add the pinned bytes instead
of subtracting.
That refactor misses those two caller, causing incorrect pinned bytes
calculation and resulting unexpected ENOSPC error.
Fix it by adding a new parameter @sign to restore the original behavior.
Reported-by: kernel test robot <rong.a.chen@intel.com>
Fixes: ddf30cf03fb5 ("btrfs: extent-tree: Use btrfs_ref to refactor add_pinned_bytes()")
Signed-off-by: Qu Wenruo <wqu@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
A failed call to kobject_init_and_add() must be followed by a call to
kobject_put(). Currently in the error path when adding fs_devices we
are missing this call. This could be fixed by calling
btrfs_sysfs_remove_fsid() if btrfs_sysfs_add_fsid() returns an error or
by adding a call to kobject_put() directly in btrfs_sysfs_add_fsid().
Here we choose the second option because it prevents the slightly
unusual error path handling requirements of kobject from leaking out
into btrfs functions.
Add a call to kobject_put() in the error path of kobject_add_and_init().
This causes the release method to be called if kobject_init_and_add()
fails. open_tree() is the function that calls btrfs_sysfs_add_fsid()
and the error code in this function is already written with the
assumption that the release method is called during the error path of
open_tree() (as seen by the call to btrfs_sysfs_remove_fsid() under the
fail_fsdev_sysfs label).
Cc: stable@vger.kernel.org # v4.4+
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If a call to kobject_init_and_add() fails we must call kobject_put()
otherwise we leak memory.
Calling kobject_put() when kobject_init_and_add() fails drops the
refcount back to 0 and calls the ktype release method (which in turn
calls the percpu destroy and kfree).
Add call to kobject_put() in the error path of call to
kobject_init_and_add().
Cc: stable@vger.kernel.org # v4.4+
Reviewed-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
Signed-off-by: Tobin C. Harding <tobin@kernel.org>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently when we fail to COW a path at btrfs_update_root() we end up
always aborting the transaction. However all the current callers of
btrfs_update_root() are able to deal with errors returned from it, many do
end up aborting the transaction themselves (directly or not, such as the
transaction commit path), other BUG_ON() or just gracefully cancel whatever
they were doing.
When syncing the fsync log, we call btrfs_update_root() through
tree-log.c:update_log_root(), and if it returns an -ENOSPC error, the log
sync code does not abort the transaction, instead it gracefully handles
the error and returns -EAGAIN to the fsync handler, so that it falls back
to a transaction commit. Any other error different from -ENOSPC, makes the
log sync code abort the transaction.
So remove the transaction abort from btrfs_update_log() when we fail to
COW a path to update the root item, so that if an -ENOSPC failure happens
we avoid aborting the current transaction and have a chance of the fsync
succeeding after falling back to a transaction commit.
Bugzilla: https://bugzilla.kernel.org/show_bug.cgi?id=203413
Fixes: 79787eaab46121 ("btrfs: replace many BUG_ONs with proper error handling")
Cc: stable@vger.kernel.org # 4.4+
Signed-off-by: Filipe Manana <fdmanana@suse.com>
Reviewed-by: Anand Jain <anand.jain@oracle.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
We're now reserving an extra items worth of space for property
inheritance. We only have one property at the moment so this covers us,
but if we add more in the future this will allow us to not get bitten by
the extra space reservation. If we do add more properties in the future
we should re-visit how we calculate the space reservation needs by the
callers.
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
[ refreshed on top of prop/xattr cleanups ]
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If we have an error writing out a delalloc range in
btrfs_punch_hole_lock_range we'll unlock the inode and then goto
out_only_mutex, where we will again unlock the inode. This is bad,
don't do this.
Fixes: f27451f22996 ("Btrfs: add support for fallocate's zero range operation")
CC: stable@vger.kernel.org # 4.19+
Reviewed-by: Filipe Manana <fdmanana@suse.com>
Signed-off-by: Josef Bacik <josef@toxicpanda.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When a file's compression property is set as zlib or zstd but leave
the compression mount option not be set, that means btrfs will try
to compress the file with default compression level. But in
btrfs_compress_pages(), it calls get_workspace() with level = 0.
This will return a workspace with a wrong compression level.
For zlib, the compression level in the workspace will be 0
(that means "store only"). And for zstd, the compression in the
workspace will be 1, not the default level 3.
How to reproduce:
mkfs -t btrfs /dev/sdb
mount /dev/sdb /mnt/
mkdir /mnt/zlib
btrfs property set /mnt/zlib/ compression zlib
dd if=/dev/zero of=/mnt/zlib/compression-friendly-file-10M bs=1M count=10
sync
btrfs-debugfs -f /mnt/zlib/compression-friendly-file-10M
btrfs-debugfs output:
* before:
...
(258 9961472): ram 524288 disk 1106247680 disk_size 524288
file: ... extents 20 disk size 10485760 logical size 10485760 ratio 1.00
* after:
...
(258 10354688): ram 131072 disk 14217216 disk_size 4096
file: ... extents 80 disk size 327680 logical size 10485760 ratio 32.00
The steps for zstd are similar, but need to put a debugging message to
show the level of the return workspace in zstd_get_workspace().
This commit adds a check of the compression level before getting a
workspace by set_level().
CC: stable@vger.kernel.org # 5.1+
Signed-off-by: Johnny Chang <johnnyc@synology.com>
Reviewed-by: David Sterba <dsterba@suse.com>
Signed-off-by: David Sterba <dsterba@suse.com>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs
Pull UBIFS fixes from Richard Weinberger:
- build errors wrt xattrs
- mismerge which lead to a wrong Kconfig ifdef
- missing endianness conversion
* tag 'upstream-5.2-rc2' of git://git.kernel.org/pub/scm/linux/kernel/git/rw/ubifs:
ubifs: Convert xattr inum to host order
ubifs: Use correct config name for encryption
ubifs: Fix build error without CONFIG_UBIFS_FS_XATTR
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
UBIFS stores inode numbers as LE64 integers.
We have to convert them to host oder, otherwise
BE hosts won't be able to use the integer correctly.
Reported-by: kbuild test robot <lkp@intel.com>
Fixes: 9ca2d7326444 ("ubifs: Limit number of xattrs per inode")
Signed-off-by: Richard Weinberger <richard@nod.at>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
CONFIG_UBIFS_FS_ENCRYPTION is gone, fscrypt is now
controlled via CONFIG_FS_ENCRYPTION.
This problem slipped into the tree because of a mis-merge on
my side.
Reported-by: Eric Biggers <ebiggers@kernel.org>
Fixes: eea2c05d927b ("ubifs: Remove #ifdef around CONFIG_FS_ENCRYPTION")
Signed-off-by: Richard Weinberger <richard@nod.at>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Fix gcc build error while CONFIG_UBIFS_FS_XATTR
is not set
fs/ubifs/dir.o: In function `ubifs_unlink':
dir.c:(.text+0x260): undefined reference to `ubifs_purge_xattrs'
fs/ubifs/dir.o: In function `do_rename':
dir.c:(.text+0x1edc): undefined reference to `ubifs_purge_xattrs'
fs/ubifs/dir.o: In function `ubifs_rmdir':
dir.c:(.text+0x2638): undefined reference to `ubifs_purge_xattrs'
Reported-by: Hulk Robot <hulkci@huawei.com>
Fixes: 9ca2d7326444 ("ubifs: Limit number of xattrs per inode")
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Richard Weinberger <richard@nod.at>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Merge yet more updates from Andrew Morton:
"A few final bits:
- large changes to vmalloc, yielding large performance benefits
- tweak the console-flush-on-panic code
- a few fixes"
* emailed patches from Andrew Morton <akpm@linux-foundation.org>:
panic: add an option to replay all the printk message in buffer
initramfs: don't free a non-existent initrd
fs/writeback.c: use rcu_barrier() to wait for inflight wb switches going into workqueue when umount
mm/compaction.c: correct zone boundary handling when isolating pages from a pageblock
mm/vmap: add DEBUG_AUGMENT_LOWEST_MATCH_CHECK macro
mm/vmap: add DEBUG_AUGMENT_PROPAGATE_CHECK macro
mm/vmalloc.c: keep track of free blocks for vmap allocation
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
into workqueue when umount
synchronize_rcu() didn't wait for call_rcu() callbacks, so inode wb
switch may not go to the workqueue after synchronize_rcu(). Thus
previous scheduled switches was not finished even flushing the
workqueue, which will cause a NULL pointer dereferenced followed below.
VFS: Busy inodes after unmount of vdd. Self-destruct in 5 seconds. Have a nice day...
BUG: unable to handle kernel NULL pointer dereference at 0000000000000278
evict+0xb3/0x180
iput+0x1b0/0x230
inode_switch_wbs_work_fn+0x3c0/0x6a0
worker_thread+0x4e/0x490
? process_one_work+0x410/0x410
kthread+0xe6/0x100
ret_from_fork+0x39/0x50
Replace the synchronize_rcu() call with a rcu_barrier() to wait for all
pending callbacks to finish. And inc isw_nr_in_flight after call_rcu()
in inode_switch_wbs() to make more sense.
Link: http://lkml.kernel.org/r/20190429024108.54150-1-jiufei.xue@linux.alibaba.com
Signed-off-by: Jiufei Xue <jiufei.xue@linux.alibaba.com>
Acked-by: Tejun Heo <tj@kernel.org>
Suggested-by: Tejun Heo <tj@kernel.org>
Cc: <stable@vger.kernel.org>
Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild
Pull more Kbuild updates from Masahiro Yamada:
- remove unneeded use of cc-option, cc-disable-warning, cc-ldoption
- exclude tracked files from .gitignore
- re-enable -Wint-in-bool-context warning
- refactor samples/Makefile
- stop building immediately if syncconfig fails
- do not sprinkle error messages when $(CC) does not exist
- move arch/alpha/defconfig to the configs subdirectory
- remove crappy header search path manipulation
- add comment lines to .config to clarify the end of menu blocks
- check uniqueness of module names (adding new warnings intentionally)
* tag 'kbuild-v5.2-2' of git://git.kernel.org/pub/scm/linux/kernel/git/masahiroy/linux-kbuild: (24 commits)
kconfig: use 'else ifneq' for Makefile to improve readability
kbuild: check uniqueness of module names
kconfig: Terminate menu blocks with a comment in the generated config
kbuild: add LICENSES to KBUILD_ALLDIRS
kbuild: remove 'addtree' and 'flags' magic for header search paths
treewide: prefix header search paths with $(srctree)/
media: prefix header search paths with $(srctree)/
media: remove unneeded header search paths
alpha: move arch/alpha/defconfig to arch/alpha/configs/defconfig
kbuild: terminate Kconfig when $(CC) or $(LD) is missing
kbuild: turn auto.conf.cmd into a mandatory include file
.gitignore: exclude .get_maintainer.ignore and .gitattributes
kbuild: add all Clang-specific flags unconditionally
kbuild: Don't try to add '-fcatch-undefined-behavior' flag
kbuild: add some extra warning flags unconditionally
kbuild: add -Wvla flag unconditionally
arch: remove dangling asm-generic wrappers
samples: guard sub-directories with CONFIG options
kbuild: re-enable int-in-bool-context warning
MAINTAINERS: kbuild: Add pattern for scripts/*vmlinux*
...
|
| |/ / /
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Currently, the Kbuild core manipulates header search paths in a crazy
way [1].
To fix this mess, I want all Makefiles to add explicit $(srctree)/ to
the search paths in the srctree. Some Makefiles are already written in
that way, but not all. The goal of this work is to make the notation
consistent, and finally get rid of the gross hacks.
Having whitespaces after -I does not matter since commit 48f6e3cf5bc6
("kbuild: do not drop -I without parameter").
[1]: https://patchwork.kernel.org/patch/9632347/
Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4
Pull ext4 fixes from Ted Ts'o:
"Some bug fixes, and an update to the URL's for the final version of
Unicode 12.1.0"
* tag 'ext4_for_linus_stable' of git://git.kernel.org/pub/scm/linux/kernel/git/tytso/ext4:
ext4: avoid panic during forced reboot due to aborted journal
ext4: fix block validity checks for journal inodes using indirect blocks
unicode: update to Unicode 12.1.0 final
unicode: add missing check for an error return from utf8lookup()
ext4: fix miscellaneous sparse warnings
ext4: unsigned int compared against zero
ext4: fix use-after-free in dx_release()
ext4: fix data corruption caused by overlapping unaligned and aligned IO
jbd2: fix potential double free
ext4: zero out the unused memory region in the extent tree block
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Handling of aborted journal is a special code path different from
standard ext4_error() one and it can call panic() as well. Commit
1dc1097ff60e ("ext4: avoid panic during forced reboot") forgot to update
this path so fix that omission.
Fixes: 1dc1097ff60e ("ext4: avoid panic during forced reboot")
Signed-off-by: Jan Kara <jack@suse.cz>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org # 5.1
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Commit 345c0dbf3a30 ("ext4: protect journal inode's blocks using
block_validity") failed to add an exception for the journal inode in
ext4_check_blockref(), which is the function used by ext4_get_branch()
for indirect blocks. This caused attempts to read from the ext3-style
journals to fail with:
[ 848.968550] EXT4-fs error (device sdb7): ext4_get_branch:171: inode #8: block 30343695: comm jbd2/sdb7-8: invalid block
Fix this by adding the missing exception check.
Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity")
Reported-by: Arthur Marsh <arthur.marsh@internode.on.net>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: Gabriel Krisman Bertazi <krisman@collabora.com>
|
| | | | |
| | | | |
| | | | |
| | | | | |
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
There are two cases where u32 variables n and err are being checked
for less than zero error values, the checks is always false because
the variables are not signed. Fix this by making the variables ints.
Addresses-Coverity: ("Unsigned compared against 0")
Fixes: 345c0dbf3a30 ("ext4: protect journal inode's blocks using block_validity")
Signed-off-by: Colin Ian King <colin.king@canonical.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
The buffer_head (frames[0].bh) and it's corresping page can be
potentially free'd once brelse() is done inside the for loop
but before the for loop exits in dx_release(). It can be free'd
in another context, when the page cache is flushed via
drop_caches_sysctl_handler(). This results into below data abort
when accessing info->indirect_levels in dx_release().
Unable to handle kernel paging request at virtual address ffffffc17ac3e01e
Call trace:
dx_release+0x70/0x90
ext4_htree_fill_tree+0x2d4/0x300
ext4_readdir+0x244/0x6f8
iterate_dir+0xbc/0x160
SyS_getdents64+0x94/0x174
Signed-off-by: Sahitya Tummala <stummala@codeaurora.org>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Reviewed-by: Andreas Dilger <adilger@dilger.ca>
Cc: stable@kernel.org
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Unaligned AIO must be serialized because the zeroing of partial blocks
of unaligned AIO can result in data corruption in case it's overlapping
another in flight IO.
Currently we wait for all unwritten extents before we submit unaligned
AIO which protects data in case of unaligned AIO is following overlapping
IO. However if a unaligned AIO is followed by overlapping aligned AIO we
can still end up corrupting data.
To fix this, we must make sure that the unaligned AIO is the only IO in
flight by waiting for unwritten extents conversion not just before the
IO submission, but right after it as well.
This problem can be reproduced by xfstest generic/538
Signed-off-by: Lukas Czerner <lczerner@redhat.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When failing from creating cache jbd2_inode_cache, we will destroy the
previously created cache jbd2_handle_cache twice. This patch fixes
this by moving each cache initialization/destruction to its own
separate, individual function.
Signed-off-by: Chengguang Xu <cgxu519@gmail.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
This commit zeroes out the unused memory region in the buffer_head
corresponding to the extent metablock after writing the extent header
and the corresponding extent node entries.
This is done to prevent random uninitialized data from getting into
the filesystem when the extent block is synced.
This fixes CVE-2019-11833.
Signed-off-by: Sriram Rajagopalan <sriramr@arista.com>
Signed-off-by: Theodore Ts'o <tytso@mit.edu>
Cc: stable@kernel.org
|
|\ \ \ \ \
| |_|/ / /
|/| | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Pull cifs fixes from Steve French:
"Minor cleanup and fixes, one for stable, four rdma (smbdirect)
related. Also adds SEEK_HOLE support"
* tag '5.2-rc-smb3-fixes' of git://git.samba.org/sfrench/cifs-2.6:
cifs: add support for SEEK_DATA and SEEK_HOLE
Fixed https://bugzilla.kernel.org/show_bug.cgi?id=202935 allow write on the same file
cifs: Allocate memory for all iovs in smb2_ioctl
cifs: Don't match port on SMBDirect transport
cifs:smbd Use the correct DMA direction when sending data
cifs:smbd When reconnecting to server, call smbd_destroy() after all MIDs have been called
cifs: use the right include for signal_pending()
smb3: trivial cleanup to smb2ops.c
cifs: cleanup smb2ops.c and normalize strings
smb3: display session id in debug data
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Add llseek op for SEEK_DATA and SEEK_HOLE.
Improves xfstests/285,286,436,445,448 and 490
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
same file
Copychunk allows source and target to be on the same file.
For details on restrictions see MS-SMB2 3.3.5.15.6
Signed-off-by: Kovtunenko Oleksandr <alexander198961@gmail.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
An IOCTL uses up to 2 iovs. The 1st iov is the command itself, the 2nd iov is
optional data for that command. The 1st iov is always allocated on the heap
but the 2nd iov may point to a variable on the stack. This will trigger an
error when passing the 2nd iov for RDMA I/O.
Fix this by allocating a buffer for the 2nd iov.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
Reviewed-by: Ronnie sahlberg <lsahlber@redhat.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
SMBDirect manages its own ports in the transport layer, there is no need to
check the port to find a connection.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Ronnie sahlberg <lsahlber@redhat.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When sending data, use the DMA_TO_DEVICE to map buffers. Also log the number
of requests in a compounding request from upper layer.
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
Acked-by: Ronnie Sahlberg <lsahlber@redhat.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
have been called
commit 214bab448476 ("cifs: Call MID callback before destroying transport")
assumes that the MID callback should not take srv_mutex, this may not always
be true. SMB Direct requires the MID callback completed before calling
transport so all pending memory registration can be freed. So restore the
original calling sequence so TCP transport will use the same code, but moving
smbd_destroy() after all MID has been called.
fixes: 214bab448476 ("cifs: Call MID callback before destroying transport")
Signed-off-by: Long Li <longli@microsoft.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
This header is actually where signal_pending is defined
although either would work.
Signed-off-by: Ronnie Sahlberg <lsahlber@redhat.com>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Minor cleanup - e.g. missing \n at end of debug statement.
Reported-by: Christoph Probst <kernel@probst.it>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Fix checkpatch warnings/errors in smb2ops.c except "LONG_LINE". Add missing
linebreaks, indentings, __func__. Remove void-returns, unneeded braces.
Address warnings spotted by checkpatch.
Add SPDX License Header.
Add missing "\n" and capitalize first letter in some cifs_dbg() strings.
Signed-off-by: Christoph Probst <kernel@probst.it>
Signed-off-by: Steve French <stfrench@microsoft.com>
Acked-by: Pavel Shilovsky <pshilov@microsoft.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Displaying the session id in /proc/fs/cifs/DebugData
is needed in order to correlate Linux client information
with network and server traces for many common support
scenarios. Turned out to be very important for debugging.
Signed-off-by: Steve French <stfrench@microsoft.com>
CC: Stable <stable@vger.kernel.org>
Reviewed-by: Pavel Shilovsky <pshilov@microsoft.com>
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Pull more vfs mount updates from Al Viro:
"Propagation of new syscalls to other architectures + cosmetic change
from Christian (fscontext didn't follow the convention for anon inode
names)"
* 'fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
uapi: Wire up the mount API syscalls on non-x86 arches [ver #2]
uapi, x86: Fix the syscall numbering of the mount API syscalls [ver #2]
uapi, fsopen: use square brackets around "fscontext" [ver #2]
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Make the name of the anon inode fd "[fscontext]" instead of "fscontext".
This is minor but most core-kernel anon inode fds already carry square
brackets around their name:
[eventfd]
[eventpoll]
[fanotify]
[io_uring]
[pidfd]
[signalfd]
[timerfd]
[userfaultfd]
For the sake of consistency lets do the same for the fscontext anon inode
fd that comes with the new mount api.
Signed-off-by: Christian Brauner <christian@brauner.io>
Signed-off-by: David Howells <dhowells@redhat.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|\ \ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Pull io_uring fixes from Jens Axboe:
"A small set of fixes for io_uring.
This contains:
- smp_rmb() cleanup for io_cqring_events() (Jackie)
- io_cqring_wait() simplification (Jackie)
- removal of dead 'ev_flags' passing (me)
- SQ poll CPU affinity verification fix (me)
- SQ poll wait fix (Roman)
- SQE command prep cleanup and fix (Stefan)"
* tag 'for-linus-20190516' of git://git.kernel.dk/linux-block:
io_uring: use wait_event_interruptible for cq_wait conditional wait
io_uring: adjust smp_rmb inside io_cqring_events
io_uring: fix infinite wait in khread_park() on io_finish_async()
io_uring: remove 'ev_flags' argument
io_uring: fix failure to verify SQ_AFF cpu
io_uring: fix race condition reading SQE data
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
The previous patch has ensured that io_cqring_events contain
smp_rmb memory barriers, Now we can use wait_event_interruptible
to keep the code simple.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Whenever smp_rmb is required to use io_cqring_events,
keep smp_rmb inside the function io_cqring_events.
Signed-off-by: Jackie Liu <liuyun01@kylinos.cn>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
This fixes couple of races which lead to infinite wait of park completion
with the following backtraces:
[20801.303319] Call Trace:
[20801.303321] ? __schedule+0x284/0x650
[20801.303323] schedule+0x33/0xc0
[20801.303324] schedule_timeout+0x1bc/0x210
[20801.303326] ? schedule+0x3d/0xc0
[20801.303327] ? schedule_timeout+0x1bc/0x210
[20801.303329] ? preempt_count_add+0x79/0xb0
[20801.303330] wait_for_completion+0xa5/0x120
[20801.303331] ? wake_up_q+0x70/0x70
[20801.303333] kthread_park+0x48/0x80
[20801.303335] io_finish_async+0x2c/0x70
[20801.303336] io_ring_ctx_wait_and_kill+0x95/0x180
[20801.303338] io_uring_release+0x1c/0x20
[20801.303339] __fput+0xad/0x210
[20801.303341] task_work_run+0x8f/0xb0
[20801.303342] exit_to_usermode_loop+0xa0/0xb0
[20801.303343] do_syscall_64+0xe0/0x100
[20801.303349] entry_SYSCALL_64_after_hwframe+0x44/0xa9
[20801.303380] Call Trace:
[20801.303383] ? __schedule+0x284/0x650
[20801.303384] schedule+0x33/0xc0
[20801.303386] io_sq_thread+0x38a/0x410
[20801.303388] ? __switch_to_asm+0x40/0x70
[20801.303390] ? wait_woken+0x80/0x80
[20801.303392] ? _raw_spin_lock_irqsave+0x17/0x40
[20801.303394] ? io_submit_sqes+0x120/0x120
[20801.303395] kthread+0x112/0x130
[20801.303396] ? kthread_create_on_node+0x60/0x60
[20801.303398] ret_from_fork+0x35/0x40
o kthread_park() waits for park completion, so io_sq_thread() loop
should check kthread_should_park() along with khread_should_stop(),
otherwise if kthread_park() is called before prepare_to_wait()
the following schedule() never returns:
CPU#0 CPU#1
io_sq_thread_stop(): io_sq_thread():
while(!kthread_should_stop() && !ctx->sqo_stop) {
ctx->sqo_stop = 1;
kthread_park()
prepare_to_wait();
if (kthread_should_stop() {
}
schedule(); <<< nobody checks park flag,
<<< so schedule and never return
o if the flag ctx->sqo_stop is observed by the io_sq_thread() loop
it is quite possible, that kthread_should_park() check and the
following kthread_parkme() is never called, because kthread_park()
has not been yet called, but few moments later is is called and
waits there for park completion, which never happens, because
kthread has already exited:
CPU#0 CPU#1
io_sq_thread_stop(): io_sq_thread():
ctx->sqo_stop = 1;
while(!kthread_should_stop() && !ctx->sqo_stop) {
<<< observe sqo_stop and exit the loop
}
if (kthread_should_park())
kthread_parkme(); <<< never called, since was
<<< never parked
kthread_park() <<< waits forever for park completion
In the current patch we quit the loop by only kthread_should_park()
check (kthread_park() is synchronous, so kthread_should_stop() is
never observed), and we abandon ->sqo_stop flag, since it is racy.
At the end of the io_sq_thread() we unconditionally call parmke(),
since we've exited the loop by the park flag.
Signed-off-by: Roman Penyaev <rpenyaev@suse.de>
Cc: Jens Axboe <axboe@kernel.dk>
Cc: linux-block@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
We always pass in 0 for the cqe flags argument, since the support for
"this read hit page cache" hint was dropped.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
The test case we have is rightfully failing with the current kernel:
io_uring_setup(1, 0x7ffe2cafebe0), flags: IORING_SETUP_SQPOLL|IORING_SETUP_SQ_AFF, resv: 0x00000000 0x00000000 0x00000000 0x00000000 0x00000000, sq_thread_cpu: 4
expected -1, got 3
This is in a vm, and CPU3 is the last valid one, hence asking for 4
should fail the setup with -EINVAL, not succeed. The problem is that
we're using array_index_nospec() with nr_cpu_ids as the index, hence we
wrap and end up using CPU0 instead of CPU4. This makes the setup
succeed where it should be failing.
We don't need to use array_index_nospec() as we're not indexing any
array with this. Instead just compare with nr_cpu_ids directly. This
is fine as we're checking with cpu_online() afterwards.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
When punting to workers the SQE gets copied after the initial try.
There is a race condition between reading SQE data for the initial try
and copying it for punting it to the workers.
For example io_rw_done calls kiocb->ki_complete even if it was prepared
for IORING_OP_FSYNC (and would be NULL).
The easiest solution for now is to alway prepare again in the worker.
req->file is safe to prepare though as long as it is checked before use.
Signed-off-by: Stefan Bühler <source@stbuehler.de>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|