| Commit message (Collapse) | Author | Age | Files | Lines |
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace
Pull more tracing updates from Steven Rostedt:
- Rename the staging files to give them some meaning. Just
stage1,stag2,etc, does not show what they are for
- Check for NULL from allocation in bootconfig
- Hold event mutex for dyn_event call in user events
- Mark user events to broken (to work on the API)
- Remove eBPF updates from user events
- Remove user events from uapi header to keep it from being installed.
- Move ftrace_graph_is_dead() into inline as it is called from hot
paths and also convert it into a static branch.
* tag 'trace-v5.18-2' of git://git.kernel.org/pub/scm/linux/kernel/git/rostedt/linux-trace:
tracing: Move user_events.h temporarily out of include/uapi
ftrace: Make ftrace_graph_is_dead() a static branch
tracing: Set user_events to BROKEN
tracing/user_events: Remove eBPF interfaces
tracing/user_events: Hold event_mutex during dyn_event_add
proc: bootconfig: Add null pointer check
tracing: Rename the staging files for trace_events
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
kzalloc is a memory allocation function which can return NULL when some
internal memory errors happen. It is safer to add null pointer check.
Link: https://lkml.kernel.org/r/20220329104004.2376879-1-lv.ruyi@zte.com.cn
Cc: stable@vger.kernel.org
Fixes: c1a3c36017d4 ("proc: bootconfig: Add /proc/bootconfig to show boot config list")
Acked-by: Masami Hiramatsu <mhiramat@kernel.org>
Reported-by: Zeal Robot <zealci@zte.com.cn>
Signed-off-by: Lv Ruyi <lv.ruyi@zte.com.cn>
Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull vfs updates from Al Viro:
"Assorted bits and pieces"
* 'work.misc' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
aio: drop needless assignment in aio_read()
clean overflow checks in count_mounts() a bit
seq_file: fix NULL pointer arithmetic warning
uml/x86: use x86 load_unaligned_zeropad()
asm/user.h: killed unused macros
constify struct path argument of finish_automount()/do_add_mount()
fs: Remove FIXME comment in generic_write_checks()
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit 84c4e1f89fef ("aio: simplify - and fix - fget/fput for io_submit()")
refactored aio_read() and some error cases into early return, which made
some intermediate assignment of the return variable needless.
Drop this needless assignment in aio_read().
No functional change. No change in resulting object code.
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Wraparound checks in there are redundant (x + y < x and
x + y < y are equivalent when x and y are both unsigned int).
IMO more straightforward code would be better here...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Implement conditional logic in order to replace NULL pointer arithmetic.
The use of NULL pointer arithmetic was pointed out by clang with the
following warning:
fs/kernfs/file.c:128:15: warning: performing pointer arithmetic on a
null pointer has undefined behavior [-Wnull-pointer-arithmetic]
return NULL + !*ppos;
~~~~ ^
fs/seq_file.c:559:14: warning: performing pointer arithmetic on a
null pointer has undefined behavior [-Wnull-pointer-arithmetic]
return NULL + (*pos == 0);
Signed-off-by: Maíra Canal <maira.canal@usp.br>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| | |
| | |
| | |
| | | |
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This patch removes an unnecessary comment that had to do with block special
files from `generic_write_checks()`.
The comment, originally added in Linux v2.4.14.9, was to clarify that we only
set `pos` to the file size when the file was opened with `O_APPEND` if the file
wasn't a block special file. Prior to Linux v2.4, block special files had a
different `write()` function which was unified into a generic `write()` function
in Linux v2.4. This generic `write()` function called `generic_write_checks()`.
For more details, see this earlier conversation:
https://lore.kernel.org/linux-fsdevel/Yc4Czk5A+p5p2Y4W@mit.edu/
Currently, block special devices have their own `write_iter()` function and no
longer share the same `generic_write_checks()`, therefore rendering the comment
irrelevant.
Signed-off-by: Tal Zussman <tz2294@columbia.edu>
Co-authored-by: Xijiao Li <xl2950@columbia.edu>
Co-authored-by: Hans Montero <hjm2133@columbia.edu>
Suggested-by: Theodore Ts'o <tytso@mit.edu>
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
|\ \ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull vfs fix from Darrick Wong:
"The erofs developers felt that FIEMAP should handle ranged requests
starting at s_maxbytes by returning EFBIG instead of passing the
filesystem implementation a nonsense 0-byte request.
Not sure why they keep tagging this 'iomap', but the VFS shouldn't be
asking for information about ranges of a file that the filesystem
already declared that it does not support.
- Fix a potential infinite loop in FIEMAP by fixing an off by one
error when comparing the requested range against s_maxbytes"
* tag 'vfs-5.18-merge-1' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
fs: fix an infinite loop in iomap_fiemap
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
when get fiemap starting from MAX_LFS_FILESIZE, (maxbytes - *len) < start
will always true , then *len set zero. because of start offset is beyond
file size, for erofs filesystem it will always return iomap.length with
zero,iomap iterate will enter infinite loop. it is necessary cover this
corner case to avoid this situation.
------------[ cut here ]------------
WARNING: CPU: 7 PID: 905 at fs/iomap/iter.c:35 iomap_iter+0x97f/0xc70
Modules linked in: xfs erofs
CPU: 7 PID: 905 Comm: iomap Tainted: G W 5.17.0-rc8 #27
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:iomap_iter+0x97f/0xc70
Code: 85 a1 fc ff ff e8 71 be 9c ff 0f 1f 44 00 00 e9 92 fc ff ff e8 62 be 9c ff 0f 0b b8 fb ff ff ff e9 fc f8 ff ff e8 51 be 9c ff <0f> 0b e9 2b fc ff ff e8 45 be 9c ff 0f 0b e9 e1 fb ff ff e8 39 be
RSP: 0018:ffff888060a37ab0 EFLAGS: 00010293
RAX: 0000000000000000 RBX: ffff888060a37bb0 RCX: 0000000000000000
RDX: ffff88807e19a900 RSI: ffffffff81a7da7f RDI: ffff888060a37be0
RBP: 7fffffffffffffff R08: 0000000000000000 R09: ffff888060a37c20
R10: ffff888060a37c67 R11: ffffed100c146f8c R12: 7fffffffffffffff
R13: 0000000000000000 R14: ffff888060a37bd8 R15: ffff888060a37c20
FS: 00007fd3cca01540(0000) GS:ffff888108780000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020010820 CR3: 0000000054b92000 CR4: 00000000000006e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
iomap_fiemap+0x1c9/0x2f0
erofs_fiemap+0x64/0x90 [erofs]
do_vfs_ioctl+0x40d/0x12e0
__x64_sys_ioctl+0xaa/0x1c0
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
</TASK>
---[ end trace 0000000000000000 ]---
watchdog: BUG: soft lockup - CPU#7 stuck for 26s! [iomap:905]
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: Guo Xuenan <guoxuenan@huawei.com>
Reviewed-by: Christoph Hellwig <hch@lst.de>
[djwong: fix some typos]
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Pull xfs fixes from Darrick Wong:
"This fixes multiple problems in the reserve pool sizing functions: an
incorrect free space calculation, a pointless infinite loop, and even
more braindamage that could result in the pool being overfilled. The
pile of patches from Dave fix myriad races and UAF bugs in the log
recovery code that much to our mutual surprise nobody's tripped over.
Dave also fixed a performance optimization that had turned into a
regression.
Dave Chinner is taking over as XFS maintainer starting Sunday and
lasting until 5.19-rc1 is tagged so that I can focus on starting a
massive design review for the (feature complete after five years)
online repair feature. From then on, he and I will be moving XFS to a
co-maintainership model by trading duties every other release.
NOTE: I hope very strongly that the other pieces of the (X)FS
ecosystem (fstests and xfsprogs) will make similar changes to spread
their maintenance load.
Summary:
- Fix an incorrect free space calculation in xfs_reserve_blocks that
could lead to a request for free blocks that will never succeed.
- Fix a hang in xfs_reserve_blocks caused by an infinite loop and the
incorrect free space calculation.
- Fix yet a third problem in xfs_reserve_blocks where multiple racing
threads can overfill the reserve pool.
- Fix an accounting error that lead to us reporting reserved space as
"available".
- Fix a race condition during abnormal fs shutdown that could cause
UAF problems when memory reclaim and log shutdown try to clean up
inodes.
- Fix a bug where log shutdown can race with unmount to tear down the
log, thereby causing UAF errors.
- Disentangle log and filesystem shutdown to reduce confusion.
- Fix some confusion in xfs_trans_commit such that a race between
transaction commit and filesystem shutdown can cause unlogged dirty
inode metadata to be committed, thereby corrupting the filesystem.
- Remove a performance optimization in the log as it was discovered
that certain storage hardware handle async log flushes so poorly as
to cause serious performance regressions. Recent restructuring of
other parts of the logging code mean that no performance benefit is
seen on hardware that handle it well"
* tag 'xfs-5.18-merge-4' of git://git.kernel.org/pub/scm/fs/xfs/xfs-linux:
xfs: drop async cache flushes from CIL commits.
xfs: shutdown during log recovery needs to mark the log shutdown
xfs: xfs_trans_commit() path must check for log shutdown
xfs: xfs_do_force_shutdown needs to block racing shutdowns
xfs: log shutdown triggers should only shut down the log
xfs: run callbacks before waking waiters in xlog_state_shutdown_callbacks
xfs: shutdown in intent recovery has non-intent items in the AIL
xfs: aborting inodes on shutdown may need buffer lock
xfs: don't report reserved bnobt space as available
xfs: fix overfilling of reserve pool
xfs: always succeed at setting the reserve pool size
xfs: remove infinite loop when reserving free block pool
xfs: don't include bnobt blocks when reserving free block pool
xfs: document the XFS_ALLOC_AGFL_RESERVE constant
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Jan Kara reported a performance regression in dbench that he
bisected down to commit bad77c375e8d ("xfs: CIL checkpoint
flushes caches unconditionally").
Whilst developing the journal flush/fua optimisations this cache was
part of, it appeared to made a significant difference to
performance. However, now that this patchset has settled and all the
correctness issues fixed, there does not appear to be any
significant performance benefit to asynchronous cache flushes.
In fact, the opposite is true on some storage types and workloads,
where additional cache flushes that can occur from fsync heavy
workloads have measurable and significant impact on overall
throughput.
Local dbench testing shows little difference on dbench runs with
sync vs async cache flushes on either fast or slow SSD storage, and
no difference in streaming concurrent async transaction workloads
like fs-mark.
Fast NVME storage.
From `dbench -t 30`, CIL scale:
clients async sync
BW Latency BW Latency
1 935.18 0.855 915.64 0.903
8 2404.51 6.873 2341.77 6.511
16 3003.42 6.460 2931.57 6.529
32 3697.23 7.939 3596.28 7.894
128 7237.43 15.495 7217.74 11.588
512 5079.24 90.587 5167.08 95.822
fsmark, 32 threads, create w/ 64 byte xattr w/32k logbsize
create chown unlink
async 1m41s 1m16s 2m03s
sync 1m40s 1m19s 1m54s
Slower SATA SSD storage:
From `dbench -t 30`, CIL scale:
clients async sync
BW Latency BW Latency
1 78.59 15.792 83.78 10.729
8 367.88 92.067 404.63 59.943
16 564.51 72.524 602.71 76.089
32 831.66 105.984 870.26 110.482
128 1659.76 102.969 1624.73 91.356
512 2135.91 223.054 2603.07 161.160
fsmark, 16 threads, create w/32k logbsize
create unlink
async 5m06s 4m15s
sync 5m00s 4m22s
And on Jan's test machine:
5.18-rc8-vanilla 5.18-rc8-patched
Amean 1 71.22 ( 0.00%) 64.94 * 8.81%*
Amean 2 93.03 ( 0.00%) 84.80 * 8.85%*
Amean 4 150.54 ( 0.00%) 137.51 * 8.66%*
Amean 8 252.53 ( 0.00%) 242.24 * 4.08%*
Amean 16 454.13 ( 0.00%) 439.08 * 3.31%*
Amean 32 835.24 ( 0.00%) 829.74 * 0.66%*
Amean 64 1740.59 ( 0.00%) 1686.73 * 3.09%*
Performance and cache flush behaviour is restored to pre-regression
levels.
As such, we can now consider the async cache flush mechanism an
unnecessary exercise in premature optimisation and hence we can
now remove it and the infrastructure it requires completely.
Fixes: bad77c375e8d ("xfs: CIL checkpoint flushes caches unconditionally")
Reported-and-tested-by: Jan Kara <jack@suse.cz>
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When a checkpoint writeback is run by log recovery, corruption
propagated from the log can result in writeback verifiers failing
and calling xfs_force_shutdown() from
xfs_buf_delwri_submit_buffers().
This results in the mount being marked as shutdown, but the log does
not get marked as shut down because:
/*
* If this happens during log recovery then we aren't using the runtime
* log mechanisms yet so there's nothing to shut down.
*/
if (!log || xlog_in_recovery(log))
return false;
If there are other buffers that then fail (say due to detecting the
mount shutdown), they will now hang in xfs_do_force_shutdown()
waiting for the log to shut down like this:
__schedule+0x30d/0x9e0
schedule+0x55/0xd0
xfs_do_force_shutdown+0x1cd/0x200
? init_wait_var_entry+0x50/0x50
xfs_buf_ioend+0x47e/0x530
__xfs_buf_submit+0xb0/0x240
xfs_buf_delwri_submit_buffers+0xfe/0x270
xfs_buf_delwri_submit+0x3a/0xc0
xlog_do_recovery_pass+0x474/0x7b0
? do_raw_spin_unlock+0x30/0xb0
xlog_do_log_recovery+0x91/0x140
xlog_do_recover+0x38/0x1e0
xlog_recover+0xdd/0x170
xfs_log_mount+0x17e/0x2e0
xfs_mountfs+0x457/0x930
xfs_fs_fill_super+0x476/0x830
xlog_force_shutdown() always needs to mark the log as shut down,
regardless of whether recovery is in progress or not, so that
multiple calls to xfs_force_shutdown() during recovery don't end
up waiting for the log to be shut down like this.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
If a shut races with xfs_trans_commit() and we have shut down the
filesystem but not the log, we will still cancel the transaction.
This can result in aborting dirty log items instead of committing and
pinning them whilst the log is still running. Hence we can end up
with dirty, unlogged metadata that isn't in the AIL in memory that
can be flushed to disk via writeback clustering.
This was discovered from a g/388 trace where an inode log item was
having IO completed on it and it wasn't in the AIL, hence tripping
asserts xfs_ail_check(). Inode cluster writeback started long after
the filesystem shutdown started, and long after the transaction
containing the dirty inode was aborted and the log item marked
XFS_LI_ABORTED. The inode was seen as dirty and unpinned, so it
was flushed. IO completion tried to remove the inode from the AIL,
at which point stuff went bad:
XFS (pmem1): Log I/O Error (0x6) detected at xfs_fs_goingdown+0xa3/0xf0 (fs/xfs/xfs_fsops.c:500). Shutting down filesystem.
XFS: Assertion failed: in_ail, file: fs/xfs/xfs_trans_ail.c, line: 67
XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
Workqueue: xfs-buf/pmem1 xfs_buf_ioend_work
RIP: 0010:assfail+0x27/0x2d
Call Trace:
<TASK>
xfs_ail_check+0xa8/0x180
xfs_ail_delete_one+0x3b/0xf0
xfs_buf_inode_iodone+0x329/0x3f0
xfs_buf_ioend+0x1f8/0x530
xfs_buf_ioend_work+0x15/0x20
process_one_work+0x1ac/0x390
worker_thread+0x56/0x3c0
kthread+0xf6/0x120
ret_from_fork+0x1f/0x30
</TASK>
xfs_trans_commit() needs to check log state for shutdown, not mount
state. It cannot abort dirty log items while the log is still
running as dirty items must remained pinned in memory until they are
either committed to the journal or the log has shut down and they
can be safely tossed away. Hence if the log has not shut down, the
xfs_trans_commit() path must allow completed transactions to commit
to the CIL and pin the dirty items even if a mount shutdown has
started.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
When we call xfs_forced_shutdown(), the caller often expects the
filesystem to be completely shut down when it returns. However,
if we have racing xfs_forced_shutdown() calls, the first caller sets
the mount shutdown flag then goes to shutdown the log. The second
caller sees the mount shutdown flag and returns immediately - it
does not wait for the log to be shut down.
Unfortunately, xfs_forced_shutdown() is used in some places that
expect it to completely shut down the filesystem before it returns
(e.g. xfs_trans_log_inode()). As such, returning before the log has
been shut down leaves us in a place where the transaction failed to
complete correctly but we still call xfs_trans_commit(). This
situation arises because xfs_trans_log_inode() does not return an
error and instead calls xfs_force_shutdown() to ensure that the
transaction being committed is aborted.
Unfortunately, we have a race condition where xfs_trans_commit()
needs to check xlog_is_shutdown() because it can't abort log items
before the log is shut down, but it needs to use xfs_is_shutdown()
because xfs_forced_shutdown() does not block waiting for the log to
shut down.
To fix this conundrum, first we make all calls to
xfs_forced_shutdown() block until the log is also shut down. This
means we can then safely use xfs_forced_shutdown() as a mechanism
that ensures the currently running transaction will be aborted by
xfs_trans_commit() regardless of the shutdown check it uses.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
We've got a mess on our hands.
1. xfs_trans_commit() cannot cancel transactions because the mount is
shut down - that causes dirty, aborted, unlogged log items to sit
unpinned in memory and potentially get written to disk before the
log is shut down. Hence xfs_trans_commit() can only abort
transactions when xlog_is_shutdown() is true.
2. xfs_force_shutdown() is used in places to cause the current
modification to be aborted via xfs_trans_commit() because it may be
impractical or impossible to cancel the transaction directly, and
hence xfs_trans_commit() must cancel transactions when
xfs_is_shutdown() is true in this situation. But we can't do that
because of #1.
3. Log IO errors cause log shutdowns by calling xfs_force_shutdown()
to shut down the mount and then the log from log IO completion.
4. xfs_force_shutdown() can result in a log force being issued,
which has to wait for log IO completion before it will mark the log
as shut down. If #3 races with some other shutdown trigger that runs
a log force, we rely on xfs_force_shutdown() silently ignoring #3
and avoiding shutting down the log until the failed log force
completes.
5. To ensure #2 always works, we have to ensure that
xfs_force_shutdown() does not return until the the log is shut down.
But in the case of #4, this will result in a deadlock because the
log Io completion will block waiting for a log force to complete
which is blocked waiting for log IO to complete....
So the very first thing we have to do here to untangle this mess is
dissociate log shutdown triggers from mount shutdowns. We already
have xlog_forced_shutdown, which will atomically transistion to the
log a shutdown state. Due to internal asserts it cannot be called
multiple times, but was done simply because the only place that
could call it was xfs_do_force_shutdown() (i.e. the mount shutdown!)
and that could only call it once and once only. So the first thing
we do is remove the asserts.
We then convert all the internal log shutdown triggers to call
xlog_force_shutdown() directly instead of xfs_force_shutdown(). This
allows the log shutdown triggers to shut down the log without
needing to care about mount based shutdown constraints. This means
we shut down the log independently of the mount and the mount may
not notice this until it's next attempt to read or modify metadata.
At that point (e.g. xfs_trans_commit()) it will see that the log is
shutdown, error out and shutdown the mount.
To ensure that all the unmount behaviours and asserts track
correctly as a result of a log shutdown, propagate the shutdown up
to the mount if it is not already set. This keeps the mount and log
state in sync, and saves a huge amount of hassle where code fails
because of a log shutdown but only checks for mount shutdowns and
hence ends up doing the wrong thing. Cleaning up that mess is
an exercise for another day.
This enables us to address the other problems noted above in
followup patches.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Brian reported a null pointer dereference failure during unmount in
xfs/006. He tracked the problem down to the AIL being torn down
before a log shutdown had completed and removed all the items from
the AIL. The failure occurred in this path while unmount was
proceeding in another task:
xfs_trans_ail_delete+0x102/0x130 [xfs]
xfs_buf_item_done+0x22/0x30 [xfs]
xfs_buf_ioend+0x73/0x4d0 [xfs]
xfs_trans_committed_bulk+0x17e/0x2f0 [xfs]
xlog_cil_committed+0x2a9/0x300 [xfs]
xlog_cil_process_committed+0x69/0x80 [xfs]
xlog_state_shutdown_callbacks+0xce/0xf0 [xfs]
xlog_force_shutdown+0xdf/0x150 [xfs]
xfs_do_force_shutdown+0x5f/0x150 [xfs]
xlog_ioend_work+0x71/0x80 [xfs]
process_one_work+0x1c5/0x390
worker_thread+0x30/0x350
kthread+0xd7/0x100
ret_from_fork+0x1f/0x30
This is processing an EIO error to a log write, and it's
triggering a force shutdown. This causes the log to be shut down,
and then it is running attached iclog callbacks from the shutdown
context. That means the fs and log has already been marked as
xfs_is_shutdown/xlog_is_shutdown and so high level code will abort
(e.g. xfs_trans_commit(), xfs_log_force(), etc) with an error
because of shutdown.
The umount would have been blocked waiting for a log force
completion inside xfs_log_cover() -> xfs_sync_sb(). The first thing
for this situation to occur is for xfs_sync_sb() to exit without
waiting for the iclog buffer to be comitted to disk. The
above trace is the completion routine for the iclog buffer, and
it is shutting down the filesystem.
xlog_state_shutdown_callbacks() does this:
{
struct xlog_in_core *iclog;
LIST_HEAD(cb_list);
spin_lock(&log->l_icloglock);
iclog = log->l_iclog;
do {
if (atomic_read(&iclog->ic_refcnt)) {
/* Reference holder will re-run iclog callbacks. */
continue;
}
list_splice_init(&iclog->ic_callbacks, &cb_list);
>>>>>> wake_up_all(&iclog->ic_write_wait);
>>>>>> wake_up_all(&iclog->ic_force_wait);
} while ((iclog = iclog->ic_next) != log->l_iclog);
wake_up_all(&log->l_flush_wait);
spin_unlock(&log->l_icloglock);
>>>>>> xlog_cil_process_committed(&cb_list);
}
This wakes any thread waiting on IO completion of the iclog (in this
case the umount log force) before shutdown processes all the pending
callbacks. That means the xfs_sync_sb() waiting on a sync
transaction in xfs_log_force() on iclog->ic_force_wait will get
woken before the callbacks attached to that iclog are run. This
results in xfs_sync_sb() returning an error, and so unmount unblocks
and continues to run whilst the log shutdown is still in progress.
Normally this is just fine because the force waiter has nothing to
do with AIL operations. But in the case of this unmount path, the
log force waiter goes on to tear down the AIL because the log is now
shut down and so nothing ever blocks it again from the wait point in
xfs_log_cover().
Hence it's a race to see who gets to the AIL first - the unmount
code or xlog_cil_process_committed() killing the superblock buffer.
To fix this, we just have to change the order of processing in
xlog_state_shutdown_callbacks() to run the callbacks before it wakes
any task waiting on completion of the iclog.
Reported-by: Brian Foster <bfoster@redhat.com>
Fixes: aad7272a9208 ("xfs: separate out log shutdown callback processing")
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
generic/388 triggered a failure in RUI recovery due to a corrupted
btree record and the system then locked up hard due to a subsequent
assert failure while holding a spinlock cancelling intents:
XFS (pmem1): Corruption of in-memory data (0x8) detected at xfs_do_force_shutdown+0x1a/0x20 (fs/xfs/xfs_trans.c:964). Shutting down filesystem.
XFS (pmem1): Please unmount the filesystem and rectify the problem(s)
XFS: Assertion failed: !xlog_item_is_intent(lip), file: fs/xfs/xfs_log_recover.c, line: 2632
Call Trace:
<TASK>
xlog_recover_cancel_intents.isra.0+0xd1/0x120
xlog_recover_finish+0xb9/0x110
xfs_log_mount_finish+0x15a/0x1e0
xfs_mountfs+0x540/0x910
xfs_fs_fill_super+0x476/0x830
get_tree_bdev+0x171/0x270
? xfs_init_fs_context+0x1e0/0x1e0
xfs_fs_get_tree+0x15/0x20
vfs_get_tree+0x24/0xc0
path_mount+0x304/0xba0
? putname+0x55/0x60
__x64_sys_mount+0x108/0x140
do_syscall_64+0x35/0x80
entry_SYSCALL_64_after_hwframe+0x44/0xae
Essentially, there's dirty metadata in the AIL from intent recovery
transactions, so when we go to cancel the remaining intents we assume
that all objects after the first non-intent log item in the AIL are
not intents.
This is not true. Intent recovery can log new intents to continue
the operations the original intent could not complete in a single
transaction. The new intents are committed before they are deferred,
which means if the CIL commits in the background they will get
inserted into the AIL at the head.
Hence if we shut down the filesystem while processing intent
recovery, the AIL may have new intents active at the current head.
Hence this check:
/*
* We're done when we see something other than an intent.
* There should be no intents left in the AIL now.
*/
if (!xlog_item_is_intent(lip)) {
#ifdef DEBUG
for (; lip; lip = xfs_trans_ail_cursor_next(ailp, &cur))
ASSERT(!xlog_item_is_intent(lip));
#endif
break;
}
in both xlog_recover_process_intents() and
log_recover_cancel_intents() is simply not valid. It was valid back
when we only had EFI/EFD intents and didn't chain intents, but it
hasn't been valid ever since intent recovery could create and commit
new intents.
Given that crashing the mount task like this pretty much prevents
diagnosing what went wrong that lead to the initial failure that
triggered intent cancellation, just remove the checks altogether.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Most buffer io list operations are run with the bp->b_lock held, but
xfs_iflush_abort() can be called without the buffer lock being held
resulting in inodes being removed from the buffer list while other
list operations are occurring. This causes problems with corrupted
bp->b_io_list inode lists during filesystem shutdown, leading to
traversals that never end, double removals from the AIL, etc.
Fix this by passing the buffer to xfs_iflush_abort() if we have
it locked. If the inode is attached to the buffer, we're going to
have to remove it from the buffer list and we'd have to get the
buffer off the inode log item to do that anyway.
If we don't have a buffer passed in (e.g. from xfs_reclaim_inode())
then we can determine if the inode has a log item and if it is
attached to a buffer before we do anything else. If it does have an
attached buffer, we can lock it safely (because the inode has a
reference to it) and then perform the inode abort.
Signed-off-by: Dave Chinner <dchinner@redhat.com>
Reviewed-by: Darrick J. Wong <djwong@kernel.org>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
On a modern filesystem, we don't allow userspace to allocate blocks for
data storage from the per-AG space reservations, the user-controlled
reservation pool that prevents ENOSPC in the middle of internal
operations, or the internal per-AG set-aside that prevents unwanted
filesystem shutdowns due to ENOSPC during a bmap btree split.
Since we now consider freespace btree blocks as unavailable for
allocation for data storage, we shouldn't report those blocks via statfs
either. This makes the numbers that we return via the statfs f_bavail
and f_bfree fields a more conservative estimate of actual free space.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Due to cycling of m_sb_lock, it's possible for multiple callers of
xfs_reserve_blocks to race at changing the pool size, subtracting blocks
from fdblocks, and actually putting it in the pool. The result of all
this is that we can overfill the reserve pool to hilarious levels.
xfs_mod_fdblocks, when called with a positive value, already knows how
to take freed blocks and either fill the reserve until it's full, or put
them in fdblocks. Use that instead of setting m_resblks_avail directly.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Nowadays, xfs_mod_fdblocks will always choose to fill the reserve pool
with freed blocks before adding to fdblocks. Therefore, we can change
the behavior of xfs_reserve_blocks slightly -- setting the target size
of the pool should always succeed, since a deficiency will eventually
be made up as blocks get freed.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Infinite loops in kernel code are scary. Calls to xfs_reserve_blocks
should be rare (people should just use the defaults!) so we really don't
need to try so hard. Simplify the logic here by removing the infinite
loop.
Cc: Brian Foster <bfoster@redhat.com>
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
xfs_reserve_blocks controls the size of the user-visible free space
reserve pool. Given the difference between the current and requested
pool sizes, it will try to reserve free space from fdblocks. However,
the amount requested from fdblocks is also constrained by the amount of
space that we think xfs_mod_fdblocks will give us. If we forget to
subtract m_allocbt_blks before calling xfs_mod_fdblocks, it will will
return ENOSPC and we'll hang the kernel at mount due to the infinite
loop.
In commit fd43cf600cf6, we decided that xfs_mod_fdblocks should not hand
out the "free space" used by the free space btrees, because some portion
of the free space btrees hold in reserve space for future btree
expansion. Unfortunately, xfs_reserve_blocks' estimation of the number
of blocks that it could request from xfs_mod_fdblocks was not updated to
include m_allocbt_blks, so if space is extremely low, the caller hangs.
Fix this by creating a function to estimate the number of blocks that
can be reserved from fdblocks, which needs to exclude the set-aside and
m_allocbt_blks.
Found by running xfs/306 (which formats a single-AG 20MB filesystem)
with an fstests configuration that specifies a 1k blocksize and a
specially crafted log size that will consume 7/8 of the space (17920
blocks, specifically) in that AG.
Cc: Brian Foster <bfoster@redhat.com>
Fixes: fd43cf600cf6 ("xfs: set aside allocation btree blocks from block reservation")
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Brian Foster <bfoster@redhat.com>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Currently, we use this undocumented macro to encode the minimum number
of blocks needed to replenish a completely empty AGFL when an AG is
nearly full. This has lead to confusion on the part of the maintainers,
so let's document what the value actually means, and move it to
xfs_alloc.c since it's not used outside of that module.
Signed-off-by: Darrick J. Wong <djwong@kernel.org>
Reviewed-by: Dave Chinner <dchinner@redhat.com>
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Pull io_uring fixes from Jens Axboe:
"A little bit all over the map, some regression fixes for this merge
window, and some general fixes that are stable bound. In detail:
- Fix an SQPOLL memory ordering issue (Almog)
- Accept fixes (Dylan)
- Poll fixes (me)
- Fixes for provided buffers and recycling (me)
- Tweak to IORING_OP_MSG_RING command added in this merge window (me)
- Memory leak fix (Pavel)
- Misc fixes and tweaks (Pavel, me)"
* tag 'for-5.18/io_uring-2022-04-01' of git://git.kernel.dk/linux-block:
io_uring: defer msg-ring file validity check until command issue
io_uring: fail links if msg-ring doesn't succeeed
io_uring: fix memory leak of uid in files registration
io_uring: fix put_kbuf without proper locking
io_uring: fix invalid flags for io_put_kbuf()
io_uring: improve req fields comments
io_uring: enable EPOLLEXCLUSIVE for accept poll
io_uring: improve task work cache utilization
io_uring: fix async accept on O_NONBLOCK sockets
io_uring: remove IORING_CQE_F_MSG
io_uring: add flag for disabling provided buffer recycling
io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly
io_uring: don't recycle provided buffer if punted to async worker
io_uring: fix assuming triggered poll waitqueue is the single poll
io_uring: bump poll refs to full 31-bits
io_uring: remove poll entry from list when canceling all
io_uring: fix memory ordering when SQPOLL thread goes to sleep
io_uring: ensure that fsnotify is always called
io_uring: recycle provided before arming poll
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
In preparation for not using the file at prep time, defer checking if this
file refers to a valid io_uring instance until issue time.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
We must always call req_set_fail() if the request is failed, otherwise
we won't sever links for dependent chains correctly.
Fixes: 4f57f06ce218 ("io_uring: add support for IORING_OP_MSG_RING command")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
When there are no files for __io_sqe_files_scm() to process in the
range, it'll free everything and return. However, it forgets to put uid.
Fixes: 08a451739a9b5 ("io_uring: allow sparse fixed file sets")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/accee442376f33ce8aaebb099d04967533efde92.1648226048.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
io_put_kbuf_comp() should only be called while holding
->completion_lock, however there is no such assumption in io_clean_op()
and thus it can corrupt ->io_buffer_comp. Take the lock there, and
workaround the only user of io_clean_op() calling it with locks. Not
the prettiest solution, but it's easier to refactor it for-next.
Fixes: cc3cec8367cba ("io_uring: speedup provided buffer handling")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/743e2130b73ec6d48c4c5dd15db896c433431e6d.1648212967.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
io_req_complete_failed() doesn't require callers to hold ->uring_lock,
use IO_URING_F_UNLOCKED version of io_put_kbuf(). The only affected
place is the fail path of io_apoll_task_func(). Also add a lockdep
annotation to catch such bugs in the future.
Fixes: 3b2b78a8eb7cc ("io_uring: extend provided buf return to fails")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ccf602dbf8df3b6a8552a262d8ee0a13a086fbc7.1648212967.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Move a misplaced comment about req->creds and add a line with
assumptions about req->link.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1e51d1e6b1f3708c2d4127b4e371f9daa4c5f859.1648209006.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
When polling sockets for accept, use EPOLLEXCLUSIVE. This is helpful
when multiple accept SQEs are submitted.
For O_NONBLOCK sockets multiple queued SQEs would previously have all
completed at once, but most with -EAGAIN as the result. Now only one
wakes up and completes.
For sockets without O_NONBLOCK there is no user facing change, but
internally the extra requests would previously be queued onto a worker
thread as they would wake up with no connection waiting, and be
punted. Now they do not wake up unnecessarily.
Co-developed-by: Jens Axboe <axboe@kernel.dk>
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220325093755.4123343-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
While profiling task_work intensive workloads, I noticed that most of
the time in tctx_task_work() is spending stalled on loading 'req'. This
is one of the unfortunate side effects of using linked lists,
particularly when they end up being passe around.
Prefetch the next request, if there is one. There's a sufficient amount
of work in between that this makes it available for the next loop.
While fiddling with the cache layout, move the link outside of the
hot completion cacheline. It's rarely used in hot workloads, so better
to bring in kbuf which is used for networked loads with provided buffers.
This reduces tctx_task_work() overhead from ~3% to 1-1.5% in my testing.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Do not set REQ_F_NOWAIT if the socket is non blocking. When enabled this
causes the accept to immediately post a CQE with EAGAIN, which means you
cannot perform an accept SQE on a NONBLOCK socket asynchronously.
By removing the flag if there is no pending accept then poll is armed as
usual and when a connection comes in the CQE is posted.
Signed-off-by: Dylan Yudaken <dylany@fb.com>
Link: https://lore.kernel.org/r/20220324143435.2875844-1-dylany@fb.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
This was introduced with the message ring opcode, but isn't strictly
required for the request itself. The sender can encode what is needed
in user_data, which is passed to the receiver. It's unclear if having
a separate flag that essentially says "This CQE did not originate from
an SQE on this ring" provides any real utility to applications. While
we can always re-introduce a flag to provide this information, we cannot
take it away at a later point in time.
Remove the flag while we still can, before it's in a released kernel.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
If we need to continue doing this IO, then we don't want a potentially
selected buffer recycled. Add a flag for that.
Set this for recv/recvmsg if they do partial IO.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
We currently don't attempt to get the full asked for length even if
MSG_WAITALL is set, if we get a partial receive. If we do see a partial
receive, then just note how many bytes we did and return -EAGAIN to
get it retried.
The iov is advanced appropriately for the vector based case, and we
manually bump the buffer and remainder for the non-vector case.
Cc: stable@vger.kernel.org
Reported-by: Constantine Gavrilov <constantine.gavrilov@gmail.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
We only really need to recycle the buffer when going async for a file
type that has an indefinite reponse time (eg non-file/bdev). And for
files that to arm poll, the async worker will arm poll anyway and the
buffer will get recycled there.
In that latter case, we're not holding ctx->uring_lock. Ensure we take
the issue_flags into account and acquire it if we need to.
Fixes: b1c62645758e ("io_uring: recycle provided buffers if request goes async")
Reported-by: Stefan Roesch <shr@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
syzbot reports a recent regression:
BUG: KASAN: use-after-free in __wake_up_common+0x637/0x650 kernel/sched/wait.c:101
Read of size 8 at addr ffff888011e8a130 by task syz-executor413/3618
CPU: 0 PID: 3618 Comm: syz-executor413 Tainted: G W 5.17.0-syzkaller-01402-g8565d64430f8 #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0xcd/0x134 lib/dump_stack.c:106
print_address_description.constprop.0.cold+0x8d/0x303 mm/kasan/report.c:255
__kasan_report mm/kasan/report.c:442 [inline]
kasan_report.cold+0x83/0xdf mm/kasan/report.c:459
__wake_up_common+0x637/0x650 kernel/sched/wait.c:101
__wake_up_common_lock+0xd0/0x130 kernel/sched/wait.c:138
tty_release+0x657/0x1200 drivers/tty/tty_io.c:1781
__fput+0x286/0x9f0 fs/file_table.c:317
task_work_run+0xdd/0x1a0 kernel/task_work.c:164
exit_task_work include/linux/task_work.h:32 [inline]
do_exit+0xaff/0x29d0 kernel/exit.c:806
do_group_exit+0xd2/0x2f0 kernel/exit.c:936
__do_sys_exit_group kernel/exit.c:947 [inline]
__se_sys_exit_group kernel/exit.c:945 [inline]
__x64_sys_exit_group+0x3a/0x50 kernel/exit.c:945
do_syscall_x64 arch/x86/entry/common.c:50 [inline]
do_syscall_64+0x35/0xb0 arch/x86/entry/common.c:80
entry_SYSCALL_64_after_hwframe+0x44/0xae
RIP: 0033:0x7f439a1fac69
which is due to leaving the request on the waitqueue mistakenly. The
reproducer is using a tty device, which means we end up arming the same
poll queue twice (it uses the same poll waitqueue for both), but in
io_poll_wake() we always just clear REQ_F_SINGLE_POLL regardless of which
entry triggered. This leaves one waitqueue potentially armed after we're
done, which then blows up in tty when the waitqueue is attempted removed.
We have no room to store this information, so simply encode it in the
wait_queue_entry->private where we store the io_kiocb request pointer.
Fixes: 91eac1c69c20 ("io_uring: cache poll/double-poll state with a request flag")
Reported-by: syzbot+09ad4050dd3a120bfccd@syzkaller.appspotmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
The previous commit:
1bc84c40088 ("io_uring: remove poll entry from list when canceling all")
removed a potential overflow condition for the poll references. They
are currently limited to 20-bits, even if we have 31-bits available. The
upper bit is used to mark for cancelation.
Bump the poll ref space to 31-bits, making that kind of situation much
harder to trigger in general. We'll separately add overflow checking
and handling.
Fixes: aa43477b0402 ("io_uring: poll rework")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
When the ring is exiting, as part of the shutdown, poll requests are
removed. But io_poll_remove_all() does not remove entries when finding
them, and since completions are done out-of-band, we can find and remove
the same entry multiple times.
We do guard the poll execution by poll ownership, but that does not
exclude us from reissuing a new one once the previous removal ownership
goes away.
This can race with poll execution as well, where we then end up seeing
req->apoll be NULL because a previous task_work requeue finished the
request.
Remove the poll entry when we find it and get ownership of it. This
prevents multiple invocations from finding it.
Fixes: aa43477b0402 ("io_uring: poll rework")
Reported-by: Dylan Yudaken <dylany@fb.com>
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Without a full memory barrier between the store to the flags and the
load of the SQ tail the two operations can be reordered and this can
lead to a situation where the SQPOLL thread goes to sleep while the
application writes to the SQ tail and doesn't see the wakeup flag.
This memory barrier pairs with a full memory barrier in the application
between its store to the SQ tail and its load of the flags.
Signed-off-by: Almog Khaikin <almogkh@gmail.com>
Link: https://lore.kernel.org/r/20220321090059.46313-1-almogkh@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
Ensure that we call fsnotify_modify() if we write a file, and that we
do fsnotify_access() if we read it. This enables anyone using inotify
on the file to get notified.
Ditto for fallocate, ensure that fsnotify_modify() is called.
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
We currently have a race where we recycle the selected buffer if poll
returns IO_APOLL_OK. But that's too late, as the poll could already be
triggering or have triggered. If that race happens, then we're putting a
buffer that's already being used.
Fix this by recycling before we arm poll. This does mean that we'll
sometimes almost instantly re-select the buffer, but it's rare enough in
testing that it should not pose a performance issue.
Fixes: b1c62645758e ("io_uring: recycle provided buffers if request goes async")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|\ \ \ \ \ \
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Pull ksmbd updates from Steve French:
- three cleanup fixes
- shorten module load warning
- two documentation fixes
* tag '5.18-rc-ksmbd-server-fixes' of git://git.samba.org/ksmbd:
ksmbd: replace usage of found with dedicated list iterator variable
ksmbd: Remove a redundant zeroing of memory
MAINTAINERS: ksmbd: switch Sergey to reviewer
ksmbd: shorten experimental warning on loading the module
ksmbd: use netif_is_bridge_port
Documentation: ksmbd: update Feature Status table
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
To move the list iterator variable into the list_for_each_entry_*()
macro in the future it should be avoided to use the list iterator
variable after the loop body.
To *never* use the list iterator variable after the loop it was
concluded to use a separate iterator variable instead of a
found boolean [1].
This removes the need to use a found variable and simply checking if
the variable was set, can determine if the break/goto was hit.
Link: https://lore.kernel.org/all/CAHk-=wgRr_D8CB-D9Kg-c=EHreAsk5SqXPwr9Y7k9sA6cWXJ6w@mail.gmail.com/
Signed-off-by: Jakob Koschel <jakobkoschel@gmail.com>
Reviewed-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
fill_transform_hdr() has only one caller that already clears tr_buf (it is
kzalloc'ed).
So there is no need to clear it another time here.
Remove the superfluous memset() and add a comment to remind that the caller
must clear the buffer.
Signed-off-by: Christophe JAILLET <christophe.jaillet@wanadoo.fr>
Acked-by: Hyunchul Lee <hyc.lee@gmail.com>
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
ksmbd is continuing to improve. Shorten the warning message
logged the first time it is loaded to:
"The ksmbd server is experimental"
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Steve French <stfrench@microsoft.com>
|
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | |
| | | | | | | |
Use netif_is_bridge_port defined in <linux/netdevice.h> instead of
open-coding it.
Acked-by: Namjae Jeon <linkinjeon@kernel.org>
Signed-off-by: Tobias Klauser <tklauser@distanz.ch>
Signed-off-by: Steve French <stfrench@microsoft.com>
|