| Commit message (Collapse) | Author | Age | Files | Lines |
... | |
| |\ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
* for-6.2/io_uring: (41 commits)
io_uring: keep unlock_post inlined in hot path
io_uring: don't use complete_post in kbuf
io_uring: spelling fix
io_uring: remove io_req_complete_post_tw
io_uring: allow multishot polled reqs to defer completion
io_uring: remove overflow param from io_post_aux_cqe
io_uring: add lockdep assertion in io_fill_cqe_aux
io_uring: make io_fill_cqe_aux static
io_uring: add io_aux_cqe which allows deferred completion
io_uring: allow defer completion for aux posted cqes
io_uring: defer all io_req_complete_failed
io_uring: always lock in io_apoll_task_func
io_uring: remove iopoll spinlock
io_uring: iopoll protect complete_post
io_uring: inline __io_req_complete_put()
io_uring: remove io_req_tw_post_queue
io_uring: use io_req_task_complete() in timeout
io_uring: hold locks for io_req_complete_failed
io_uring: add completion locking for iopoll
io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()
...
|
|\ \ \ \
| | |/ /
| |/| |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pull io_uring updates from Jens Axboe:
- Always ensure proper ordering in case of CQ ring overflow, which then
means we can remove some work-arounds for that (Dylan)
- Support completion batching for multishot, greatly increasing the
efficiency for those (Dylan)
- Flag epoll/eventfd wakeups done from io_uring, so that we can easily
tell if we're recursing into io_uring again.
Previously, this would have resulted in repeated multishot
notifications if we had a dependency there. That could happen if an
eventfd was registered as the ring eventfd, and we multishot polled
for events on it. Or if an io_uring fd was added to epoll, and
io_uring had a multishot request for the epoll fd.
Test cases here:
https://git.kernel.dk/cgit/liburing/commit/?id=919755a7d0096fda08fb6d65ac54ad8d0fe027cd
Previously these got terminated when the CQ ring eventually
overflowed, now it's handled gracefully (me).
- Tightening of the IOPOLL based completions (Pavel)
- Optimizations of the networking zero-copy paths (Pavel)
- Various tweaks and fixes (Dylan, Pavel)
* tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux: (41 commits)
io_uring: keep unlock_post inlined in hot path
io_uring: don't use complete_post in kbuf
io_uring: spelling fix
io_uring: remove io_req_complete_post_tw
io_uring: allow multishot polled reqs to defer completion
io_uring: remove overflow param from io_post_aux_cqe
io_uring: add lockdep assertion in io_fill_cqe_aux
io_uring: make io_fill_cqe_aux static
io_uring: add io_aux_cqe which allows deferred completion
io_uring: allow defer completion for aux posted cqes
io_uring: defer all io_req_complete_failed
io_uring: always lock in io_apoll_task_func
io_uring: remove iopoll spinlock
io_uring: iopoll protect complete_post
io_uring: inline __io_req_complete_put()
io_uring: remove io_req_tw_post_queue
io_uring: use io_req_task_complete() in timeout
io_uring: hold locks for io_req_complete_failed
io_uring: add completion locking for iopoll
io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()
...
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This partially reverts
6c16fe3c16bdc ("io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()")
The redundancy of __io_cq_unlock_post() was always to keep it inlined
into __io_submit_flush_completions(). Inline it back and rename with
hope of clarifying the intention behind it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/372a16c485fca44c069be2e92fc5e7332a1d7fd7.1669310258.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Now we're handling IOPOLL completions more generically, get rid uses of
_post() and send requests through the normal path. It may have some
extra mertis performance wise, but we don't care much as there is a
better interface for selected buffers.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/4deded706587f55b006dc33adf0c13cfc3b2319f.1669310258.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
s/pushs/pushes/
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221125103412.1425305-3-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
It's only used in one place. Inline it.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221125103412.1425305-2-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Until now there was no reason for multishot polled requests to defer
completions as there was no functional difference. However now this will
actually defer the completions, for a performance win.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-10-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
The only call sites which would not allow overflow are also call sites
which would use the io_aux_cqe as they care about ordering.
So remove this parameter from io_post_aux_cqe.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-9-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Add an assertion for the completion lock to io_fill_cqe_aux
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-8-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This is only used in io_uring.c
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-7-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Use the just introduced deferred post cqe completion state when possible
in io_aux_cqe. If not possible fallback to io_post_aux_cqe.
This introduces a complication because of allow_overflow. For deferred
completions we cannot know without locking the completion_lock if it will
overflow (and even if we locked it, another post could sneak in and cause
this cqe to be in overflow).
However since overflow protection is mostly a best effort defence in depth
to prevent infinite loops of CQEs for poll, just checking the overflow bit
is going to be good enough and will result in at most 16 (array size of
deferred cqes) overflows.
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-6-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Multishot ops cannot use the compl_reqs list as the request must stay in
the poll list, but that means they need to run each completion without
benefiting from batching.
Here introduce batching infrastructure for only small (ie 16 byte)
CQEs. This restriction is ok because there are no use cases posting 32
byte CQEs.
In the ring keep a batch of up to 16 posted results, and flush in the same
way as compl_reqs.
16 was chosen through experimentation on a microbenchmark ([1]), as well
as trying not to increase the size of the ring too much. This increases
the size to 1472 bytes from 1216.
[1]: https://github.com/DylanZA/liburing/commit/9ac66b36bcf4477bfafeff1c5f107896b7ae31cf
Run with $ make -j && ./benchmark/reg.b -s 1 -t 2000 -r 10
Gives results:
baseline 8309 k/s
8 18807 k/s
16 19338 k/s
32 20134 k/s
Suggested-by: Pavel Begunkov <asml.silence@gmail.com>
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-5-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
All failures happen under lock now, and can be deferred. To be consistent
when the failure has happened after some multishot cqe has been
deferred (and keep ordering), always defer failures.
To make this obvious at the caller (and to help prevent a future bug)
rename io_req_complete_failed to io_req_defer_failed.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-4-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This is required for the failure case (io_req_complete_failed) and is
missing.
The alternative would be to only lock in the failure path, however all of
the non-error paths in io_poll_check_events that do not do not return
IOU_POLL_NO_ACTION end up locking anyway. The only extraneous lock would
be for the multishot poll overflowing the CQE ring, however multishot poll
would probably benefit from being locked as it will allow completions to
be batched.
So it seems reasonable to lock always.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221124093559.3780686-3-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This reverts commit 2ccc92f4effcfa1c51c4fcf1e34d769099d3cad4
io_req_complete_post() should now behave well even in case of IOPOLL, we
can remove completion_lock locking.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7e171c8b530656b14a671c59100ca260e46e7f2a.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
io_req_complete_post() may be used by iopoll enabled rings, grab locks
in this case. That requires to pass issue_flags to propagate the locking
state.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/cc6d854065c57c838ca8e8806f707a226b70fd2d.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Inline __io_req_complete_put() into io_req_complete_post(), there are no
other users.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1923a4dfe80fa877f859a22ed3df2d5fc8ecf02b.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Remove io_req_tw_post() and io_req_tw_post_queue(), we can use
io_req_task_complete() instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/b9b73c08022c7f1457023ac841f35c0100e70345.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Use a more generic io_req_task_complete() in timeout completion
task_work instead of io_req_complete_post().
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/bda1710b58c07bf06107421c2a65c529ea9cdcac.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
A preparation patch, make sure we always hold uring_lock around
io_req_complete_failed(). The only place deviating from the rule
is io_cancel_defer_files(), queue a tw instead.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/70760344eadaecf2939287084b9d4ba5c05a6984.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
There are pieces of code that may allow iopoll to race filling cqes,
temporarily add spinlocking around posting events.
Cc: stable@vger.kernel.org
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/84d86b5c117feda075471c5c9e65208e0dccf5d0.1669203009.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
__io_cq_unlock_post() is identical to io_cq_unlock_post(), and
io_cqring_ev_posted() has a single caller so migth as well just inline
it there.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This reverts commit 7fdbc5f014c3f71bc44673a2d6c5bb2d12d45f25.
This patch dealt with a subset of the real problem, which is a potential
circular dependency on the wakup path for io_uring itself. Outside of
io_uring, eventfd can also trigger this (see details in 03e02acda8e2)
and so can epoll (see details in caf1aeaffc3b). Now that we have a
generic solution to this problem, get rid of the io_uring specific
work-around.
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Pass in EPOLL_URING_WAKE when signaling eventfd or doing poll related
wakups, so that we can check for a circular event dependency between
eventfd and epoll. If this flag is set when our wakeup handlers are
called, then we know we have a dependency that needs to terminate
multishot requests.
eventfd and epoll are the only such possible dependencies.
Cc: stable@vger.kernel.org # 6.0
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
There is only one user of __io_req_complete_post(), inline it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/ef4c9059950a3da5cf68df00f977f1fd13bd9306.1668597569.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
When the target process is dying and so task_work_add() is not allowed
we push all task_work item to the fallback workqueue. Move the part
responsible for moving tw items out of __io_req_task_work_add() into
a separate function. Makes it a bit cleaner and gives the compiler a bit
of extra info.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/e503dab9d7af95470ca6b214c6de17715ae4e748.1668162751.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
__io_req_task_work_add() is huge but marked inline, that makes compilers
to generate lots of garbage. Inline the wrapper caller
io_req_task_work_add() instead.
before and after:
text data bss dec hex filename
47347 16248 8 63603 f873 io_uring/io_uring.o
text data bss dec hex filename
45303 16248 8 61559 f077 io_uring/io_uring.o
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/26dc8c28ca0160e3269ef3e55c5a8b917c4d4450.1668162751.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Previous commit ebc11b6c6b87 ("io_uring: clean io-wq callbacks") rename
io_free_work() into io_wq_free_work() for consistency. This patch also
updates relevant comment to avoid misunderstanding.
Fixes: ebc11b6c6b87 ("io_uring: clean io-wq callbacks")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/20221110122103.20120-1-linma@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Previous commit 13a99017ff19 ("io_uring: remove events caching
atavisms") entirely removes the events caching optimization introduced
by commit 81459350d581 ("io_uring: cache req->apoll->events in
req->cflags"). Hence the related comment should also be removed to avoid
misunderstanding.
Fixes: 13a99017ff19 ("io_uring: remove events caching atavisms")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/20221110060313.16303-1-linma@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
With commit aa1df3a360a0 ("io_uring: fix CQE reordering"), there are
stronger guarantees for overflow ordering. Specifically ensuring that
userspace will not receive out of order receive CQEs. Therefore this is
not needed any more for recv/recvmsg.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221107125236.260132-4-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
This is no longer needed after commit aa1df3a360a0 ("io_uring: fix CQE
reordering"), since all reordering is now taken care of.
This reverts commit cbd25748545c ("io_uring: fix multishot accept
ordering").
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221107125236.260132-2-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Running task work when not needed can unnecessarily delay
operations. Specifically IORING_SETUP_DEFER_TASKRUN tries to avoid running
task work until the user requests it. Therefore do not run it in
io_uring_register any more.
The one catch is that io_rsrc_ref_quiesce expects it to have run in order
to process all outstanding references, and so reorder it's loop to do this.
Signed-off-by: Dylan Yudaken <dylany@meta.com>
Link: https://lore.kernel.org/r/20221107123349.4106213-1-dylany@meta.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Fixes two errors:
"ERROR: do not use assignment in if condition
130: FILE: io_uring/net.c:130:
+ if (!(issue_flags & IO_URING_F_UNLOCKED) &&
ERROR: do not use assignment in if condition
599: FILE: io_uring/poll.c:599:
+ } else if (!(issue_flags & IO_URING_F_UNLOCKED) &&"
reported by checkpatch.pl in net.c and poll.c .
Signed-off-by: Xinghui Li <korantli@tencent.com>
Reported-by: kernel test robot <lkp@intel.com>
Link: https://lore.kernel.org/r/20221102082503.32236-1-korantwork@gmail.com
[axboe: style tweaks]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We can also move mm accounting to the extended callbacks. It removes a
few cycles from the hot path including skipping one function call and
setting io_req_task_complete as a callback directly. For user backed I/O
it shouldn't make any difference taking into considering atomic mm
accounting and page pinning.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/1062f270273ad11c1b7b45ec59a6a317533d5e64.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Add custom tw and notif callbacks on top of usual bits also handling zc
reporting. That moves it from the hot path.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/40de4a6409042478e1f35adc4912e23226cb1b5c.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
io_notif_flush() is pretty simple, we can inline it.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/332359e7bd124138dfe51340bbec829c9b265c18.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Just a simple renaming patch, io_uring_tx_zerocopy_callback() is too
bulky and doesn't follow usual naming style.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/24d78325403ca6dcb1ec4bced1e33cacc9b832a5.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
We're going to have multiple notification tw functions. In preparation
for future changes default the tw callback in advance so later we can
replace it with other versions.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/7acdbea5e20eadd844513320cd454af14ba50f64.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
io_send_zc_prep() sets up notification's rsrc_node when needed, don't
unconditionally install it on notif alloc.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/dbe4875ac33e180b9799d8537a5e27935e82aac4.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
There are multiple users of io_req_task_complete() including zc
notifications, but only read requests use selected buffers. As we
already have an rw specific tw function, move io_put_kbuf() in there.
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/94374c7649aaefc3a17808dc4701f25ccd457e25.1667557923.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
It might be useful for applications to detect if a zero copy transfer with
SEND[MSG]_ZC was actually possible or not. The application can fallback to
plain SEND[MSG] in order to avoid the overhead of two cqes per request. Or
it can generate a log message that could indicate to an administrator that
no zero copy was possible and could explain degraded performance.
Cc: stable@vger.kernel.org # 6.1
Link: https://lore.kernel.org/io-uring/fb6a7599-8a9b-15e5-9b64-6cd9d01c6ff4@gmail.com/T/#m2b0d9df94ce43b0e69e6c089bdff0ce6babbdfaa
Signed-off-by: Stefan Metzmacher <metze@samba.org>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/8945b01756d902f5d5b0667f20b957ad3f742e5e.1666895626.git.metze@samba.org
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
|\ \ \ \
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping
Pull idmapping updates from Christian Brauner:
"Last cycle we've already made the interaction with idmapped mounts
more robust and type safe by introducing the vfs{g,u}id_t type. This
cycle we concluded the conversion and removed the legacy helpers.
Currently we still pass around the plain namespace that was attached
to a mount. This is in general pretty convenient but it makes it easy
to conflate namespaces that are relevant on the filesystem - with
namespaces that are relevent on the mount level. Especially for
filesystem developers without detailed knowledge in this area this can
be a potential source for bugs.
Instead of passing the plain namespace we introduce a dedicated type
struct mnt_idmap and replace the pointer with a pointer to a struct
mnt_idmap. There are no semantic or size changes for the mount struct
caused by this.
We then start converting all places aware of idmapped mounts to rely
on struct mnt_idmap. Once the conversion is done all helpers down to
the really low-level make_vfs{g,u}id() and from_vfs{g,u}id() will take
a struct mnt_idmap argument instead of two namespace arguments. This
way it becomes impossible to conflate the two removing and thus
eliminating the possibility of any bugs. Fwiw, I fixed some issues in
that area a while ago in ntfs3 and ksmbd in the past. Afterwards only
low-level code can ultimately use the associated namespace for any
permission checks. Even most of the vfs can be completely obivious
about this ultimately and filesystems will never interact with it in
any form in the future.
A struct mnt_idmap currently encompasses a simple refcount and pointer
to the relevant namespace the mount is idmapped to. If a mount isn't
idmapped then it will point to a static nop_mnt_idmap and if it
doesn't that it is idmapped. As usual there are no allocations or
anything happening for non-idmapped mounts. Everthing is carefully
written to be a nop for non-idmapped mounts as has always been the
case.
If an idmapped mount is created a struct mnt_idmap is allocated and a
reference taken on the relevant namespace. Each mount that gets
idmapped or inherits the idmap simply bumps the reference count on
struct mnt_idmap. Just a reminder that we only allow a mount to change
it's idmapping a single time and only if it hasn't already been
attached to the filesystems and has no active writers.
The actual changes are fairly straightforward but this will have huge
benefits for maintenance and security in the long run even if it
causes some churn.
Note that this also makes it possible to extend struct mount_idmap in
the future. For example, it would be possible to place the namespace
pointer in an anonymous union together with an idmapping struct. This
would allow us to expose an api to userspace that would let it specify
idmappings directly instead of having to go through the detour of
setting up namespaces at all"
* tag 'fs.idmapped.mnt_idmap.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping:
acl: conver higher-level helpers to rely on mnt_idmap
fs: introduce dedicated idmap type for mounts
|
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
Convert an initial portion to rely on struct mnt_idmap by converting the
high level xattr helpers.
Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org>
Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
|
|\ \ \ \ \
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | |
| | | | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs
Pull iov_iter updates from Al Viro:
"iov_iter work; most of that is about getting rid of direction
misannotations and (hopefully) preventing more of the same for the
future"
* tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
use less confusing names for iov_iter direction initializers
iov_iter: saner checks for attempt to copy to/from iterator
[xen] fix "direction" argument of iov_iter_kvec()
[vhost] fix 'direction' argument of iov_iter_{init,bvec}()
[target] fix iov_iter_bvec() "direction" argument
[s390] memcpy_real(): WRITE is "data source", not destination...
[s390] zcore: WRITE is "data source", not destination...
[infiniband] READ is "data destination", not source...
[fsi] WRITE is "data source", not destination...
[s390] copy_oldmem_kernel() - WRITE is "data source", not destination
csum_and_copy_to_iter(): handle ITER_DISCARD
get rid of unlikely() on page_copy_sane() calls
|
| | |/ / /
| |/| | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | |
| | | | | |
READ/WRITE proved to be actively confusing - the meanings are
"data destination, as used with read(2)" and "data source, as
used with write(2)", but people keep interpreting those as
"we read data from it" and "we write data to it", i.e. exactly
the wrong way.
Call them ITER_DEST and ITER_SOURCE - at least that is harder
to misinterpret...
Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
|
| |_|/ /
|/| | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
Syzkaller reports a NULL deref bug as follows:
BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3
Read of size 4 at addr 0000000000000138 by task file1/1955
CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014
Call Trace:
<TASK>
dump_stack_lvl+0xcd/0x134
? io_tctx_exit_cb+0x53/0xd3
kasan_report+0xbb/0x1f0
? io_tctx_exit_cb+0x53/0xd3
kasan_check_range+0x140/0x190
io_tctx_exit_cb+0x53/0xd3
task_work_run+0x164/0x250
? task_work_cancel+0x30/0x30
get_signal+0x1c3/0x2440
? lock_downgrade+0x6e0/0x6e0
? lock_downgrade+0x6e0/0x6e0
? exit_signals+0x8b0/0x8b0
? do_raw_read_unlock+0x3b/0x70
? do_raw_spin_unlock+0x50/0x230
arch_do_signal_or_restart+0x82/0x2470
? kmem_cache_free+0x260/0x4b0
? putname+0xfe/0x140
? get_sigframe_size+0x10/0x10
? do_execveat_common.isra.0+0x226/0x710
? lockdep_hardirqs_on+0x79/0x100
? putname+0xfe/0x140
? do_execveat_common.isra.0+0x238/0x710
exit_to_user_mode_prepare+0x15f/0x250
syscall_exit_to_user_mode+0x19/0x50
do_syscall_64+0x42/0xb0
entry_SYSCALL_64_after_hwframe+0x63/0xcd
RIP: 0023:0x0
Code: Unable to access opcode bytes at 0xffffffffffffffd6.
RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000
RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
</TASK>
Kernel panic - not syncing: panic_on_warn set ...
This happens because the adding of task_work from io_ring_exit_work()
isn't synchronized with canceling all work items from eg exec. The
execution of the two are ordered in that they are both run by the task
itself, but if io_tctx_exit_cb() is queued while we're canceling all
work items off exec AND gets executed when the task exits to userspace
rather than in the main loop in io_uring_cancel_generic(), then we can
find current->io_uring == NULL and hit the above crash.
It's safe to add this NULL check here, because the execution of the two
paths are done by the task itself.
Cc: stable@vger.kernel.org
Fixes: d56d938b4bef ("io_uring: do ctx initiated file note removal")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com>
Link: https://lore.kernel.org/r/20221206093833.3812138-1-harshit.m.mogalapalli@oracle.com
[axboe: add code comment and also put an explanation in the commit msg]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
With how task_work is added and signaled, we can have TIF_NOTIFY_SIGNAL
set and no task_work pending as it got run in a previous loop. Treat
TIF_NOTIFY_SIGNAL like get_signal(), always clear it if set regardless
of whether or not task_work is pending to run.
Cc: stable@vger.kernel.org
Fixes: 46a525e199e4 ("io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL")
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
There is an interesting race condition of poll_refs which could result
in a NULL pointer dereference. The crash trace is like:
KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f]
CPU: 0 PID: 30781 Comm: syz-executor.2 Not tainted 6.0.0-g493ffd6605b2 #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS
1.13.0-1ubuntu1.1 04/01/2014
RIP: 0010:io_poll_remove_entry io_uring/poll.c:154 [inline]
RIP: 0010:io_poll_remove_entries+0x171/0x5b4 io_uring/poll.c:190
Code: ...
RSP: 0018:ffff88810dfefba0 EFLAGS: 00010202
RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000040000
RDX: ffffc900030c4000 RSI: 000000000003ffff RDI: 0000000000040000
RBP: 0000000000000008 R08: ffffffff9764d3dd R09: fffffbfff3836781
R10: fffffbfff3836781 R11: 0000000000000000 R12: 1ffff11003422d60
R13: ffff88801a116b04 R14: ffff88801a116ac0 R15: dffffc0000000000
FS: 00007f9c07497700(0000) GS:ffff88811a600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007ffb5c00ea98 CR3: 0000000105680005 CR4: 0000000000770ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
io_apoll_task_func+0x3f/0xa0 io_uring/poll.c:299
handle_tw_list io_uring/io_uring.c:1037 [inline]
tctx_task_work+0x37e/0x4f0 io_uring/io_uring.c:1090
task_work_run+0x13a/0x1b0 kernel/task_work.c:177
get_signal+0x2402/0x25a0 kernel/signal.c:2635
arch_do_signal_or_restart+0x3b/0x660 arch/x86/kernel/signal.c:869
exit_to_user_mode_loop kernel/entry/common.c:166 [inline]
exit_to_user_mode_prepare+0xc2/0x160 kernel/entry/common.c:201
__syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline]
syscall_exit_to_user_mode+0x58/0x160 kernel/entry/common.c:294
entry_SYSCALL_64_after_hwframe+0x63/0xcd
The root cause for this is a tiny overlooking in
io_poll_check_events() when cocurrently run with poll cancel routine
io_poll_cancel_req().
The interleaving to trigger use-after-free:
CPU0 | CPU1
|
io_apoll_task_func() | io_poll_cancel_req()
io_poll_check_events() |
// do while first loop |
v = atomic_read(...) |
// v = poll_refs = 1 |
... | io_poll_mark_cancelled()
| atomic_or()
| // poll_refs =
IO_POLL_CANCEL_FLAG | 1
|
atomic_sub_return(...) |
// poll_refs = IO_POLL_CANCEL_FLAG |
// loop continue |
|
| io_poll_execute()
| io_poll_get_ownership()
| // poll_refs =
IO_POLL_CANCEL_FLAG | 1
| // gets the ownership
v = atomic_read(...) |
// poll_refs not change |
|
if (v & IO_POLL_CANCEL_FLAG) |
return -ECANCELED; |
// io_poll_check_events return |
// will go into |
// io_req_complete_failed() free req |
|
| io_apoll_task_func()
| // also go into
io_req_complete_failed()
And the interleaving to trigger the kernel WARNING:
CPU0 | CPU1
|
io_apoll_task_func() | io_poll_cancel_req()
io_poll_check_events() |
// do while first loop |
v = atomic_read(...) |
// v = poll_refs = 1 |
... | io_poll_mark_cancelled()
| atomic_or()
| // poll_refs =
IO_POLL_CANCEL_FLAG | 1
|
atomic_sub_return(...) |
// poll_refs = IO_POLL_CANCEL_FLAG |
// loop continue |
|
v = atomic_read(...) |
// v = IO_POLL_CANCEL_FLAG |
| io_poll_execute()
| io_poll_get_ownership()
| // poll_refs =
IO_POLL_CANCEL_FLAG | 1
| // gets the ownership
|
WARN_ON_ONCE(!(v & IO_POLL_REF_MASK))) |
// v & IO_POLL_REF_MASK = 0 WARN |
|
| io_apoll_task_func()
| // also go into
io_req_complete_failed()
By looking up the source code and communicating with Pavel, the
implementation of this atomic poll refs should continue the loop of
io_poll_check_events() just to avoid somewhere else to grab the
ownership. Therefore, this patch simply adds another AND operation to
make sure the loop will stop if it finds the poll_refs is exactly equal
to IO_POLL_CANCEL_FLAG. Since io_poll_cancel_req() grabs ownership and
will finally make its way to io_req_complete_failed(), the req will
be reclaimed as expected.
Fixes: aa43477b0402 ("io_uring: poll rework")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Pavel Begunkov <asml.silence@gmail.com>
[axboe: tweak description and code style]
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
There is an interesting reference bug when -ENOMEM occurs in calling of
io_install_fixed_file(). KASan report like below:
[ 14.057131] ==================================================================
[ 14.059161] BUG: KASAN: use-after-free in unix_get_socket+0x10/0x90
[ 14.060975] Read of size 8 at addr ffff88800b09cf20 by task kworker/u8:2/45
[ 14.062684]
[ 14.062768] CPU: 2 PID: 45 Comm: kworker/u8:2 Not tainted 6.1.0-rc4 #1
[ 14.063099] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014
[ 14.063666] Workqueue: events_unbound io_ring_exit_work
[ 14.063936] Call Trace:
[ 14.064065] <TASK>
[ 14.064175] dump_stack_lvl+0x34/0x48
[ 14.064360] print_report+0x172/0x475
[ 14.064547] ? _raw_spin_lock_irq+0x83/0xe0
[ 14.064758] ? __virt_addr_valid+0xef/0x170
[ 14.064975] ? unix_get_socket+0x10/0x90
[ 14.065167] kasan_report+0xad/0x130
[ 14.065353] ? unix_get_socket+0x10/0x90
[ 14.065553] unix_get_socket+0x10/0x90
[ 14.065744] __io_sqe_files_unregister+0x87/0x1e0
[ 14.065989] ? io_rsrc_refs_drop+0x1c/0xd0
[ 14.066199] io_ring_exit_work+0x388/0x6a5
[ 14.066410] ? io_uring_try_cancel_requests+0x5bf/0x5bf
[ 14.066674] ? try_to_wake_up+0xdb/0x910
[ 14.066873] ? virt_to_head_page+0xbe/0xbe
[ 14.067080] ? __schedule+0x574/0xd20
[ 14.067273] ? read_word_at_a_time+0xe/0x20
[ 14.067492] ? strscpy+0xb5/0x190
[ 14.067665] process_one_work+0x423/0x710
[ 14.067879] worker_thread+0x2a2/0x6f0
[ 14.068073] ? process_one_work+0x710/0x710
[ 14.068284] kthread+0x163/0x1a0
[ 14.068454] ? kthread_complete_and_exit+0x20/0x20
[ 14.068697] ret_from_fork+0x22/0x30
[ 14.068886] </TASK>
[ 14.069000]
[ 14.069088] Allocated by task 289:
[ 14.069269] kasan_save_stack+0x1e/0x40
[ 14.069463] kasan_set_track+0x21/0x30
[ 14.069652] __kasan_slab_alloc+0x58/0x70
[ 14.069899] kmem_cache_alloc+0xc5/0x200
[ 14.070100] __alloc_file+0x20/0x160
[ 14.070283] alloc_empty_file+0x3b/0xc0
[ 14.070479] path_openat+0xc3/0x1770
[ 14.070689] do_filp_open+0x150/0x270
[ 14.070888] do_sys_openat2+0x113/0x270
[ 14.071081] __x64_sys_openat+0xc8/0x140
[ 14.071283] do_syscall_64+0x3b/0x90
[ 14.071466] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 14.071791]
[ 14.071874] Freed by task 0:
[ 14.072027] kasan_save_stack+0x1e/0x40
[ 14.072224] kasan_set_track+0x21/0x30
[ 14.072415] kasan_save_free_info+0x2a/0x50
[ 14.072627] __kasan_slab_free+0x106/0x190
[ 14.072858] kmem_cache_free+0x98/0x340
[ 14.073075] rcu_core+0x427/0xe50
[ 14.073249] __do_softirq+0x110/0x3cd
[ 14.073440]
[ 14.073523] Last potentially related work creation:
[ 14.073801] kasan_save_stack+0x1e/0x40
[ 14.074017] __kasan_record_aux_stack+0x97/0xb0
[ 14.074264] call_rcu+0x41/0x550
[ 14.074436] task_work_run+0xf4/0x170
[ 14.074619] exit_to_user_mode_prepare+0x113/0x120
[ 14.074858] syscall_exit_to_user_mode+0x1d/0x40
[ 14.075092] do_syscall_64+0x48/0x90
[ 14.075272] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 14.075529]
[ 14.075612] Second to last potentially related work creation:
[ 14.075900] kasan_save_stack+0x1e/0x40
[ 14.076098] __kasan_record_aux_stack+0x97/0xb0
[ 14.076325] task_work_add+0x72/0x1b0
[ 14.076512] fput+0x65/0xc0
[ 14.076657] filp_close+0x8e/0xa0
[ 14.076825] __x64_sys_close+0x15/0x50
[ 14.077019] do_syscall_64+0x3b/0x90
[ 14.077199] entry_SYSCALL_64_after_hwframe+0x63/0xcd
[ 14.077448]
[ 14.077530] The buggy address belongs to the object at ffff88800b09cf00
[ 14.077530] which belongs to the cache filp of size 232
[ 14.078105] The buggy address is located 32 bytes inside of
[ 14.078105] 232-byte region [ffff88800b09cf00, ffff88800b09cfe8)
[ 14.078685]
[ 14.078771] The buggy address belongs to the physical page:
[ 14.079046] page:000000001bd520e7 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88800b09de00 pfn:0xb09c
[ 14.079575] head:000000001bd520e7 order:1 compound_mapcount:0 compound_pincount:0
[ 14.079946] flags: 0x100000000010200(slab|head|node=0|zone=1)
[ 14.080244] raw: 0100000000010200 0000000000000000 dead000000000001 ffff88800493cc80
[ 14.080629] raw: ffff88800b09de00 0000000080190018 00000001ffffffff 0000000000000000
[ 14.081016] page dumped because: kasan: bad access detected
[ 14.081293]
[ 14.081376] Memory state around the buggy address:
[ 14.081618] ffff88800b09ce00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00
[ 14.081974] ffff88800b09ce80: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc
[ 14.082336] >ffff88800b09cf00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
[ 14.082690] ^
[ 14.082909] ffff88800b09cf80: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc
[ 14.083266] ffff88800b09d000: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb
[ 14.083622] ==================================================================
The actual tracing of this bug is shown below:
commit 8c71fe750215 ("io_uring: ensure fput() called correspondingly
when direct install fails") adds an additional fput() in
io_fixed_fd_install() when io_file_bitmap_get() returns error values. In
that case, the routine will never make it to io_install_fixed_file() due
to an early return.
static int io_fixed_fd_install(...)
{
if (alloc_slot) {
...
ret = io_file_bitmap_get(ctx);
if (unlikely(ret < 0)) {
io_ring_submit_unlock(ctx, issue_flags);
fput(file);
return ret;
}
...
}
...
ret = io_install_fixed_file(req, file, issue_flags, file_slot);
...
}
In the above scenario, the reference is okay as io_fixed_fd_install()
ensures the fput() is called when something bad happens, either via
bitmap or via inner io_install_fixed_file().
However, the commit 61c1b44a21d7 ("io_uring: fix deadlock on iowq file
slot alloc") breaks the balance because it places fput() into the common
path for both io_file_bitmap_get() and io_install_fixed_file(). Since
io_install_fixed_file() handles the fput() itself, the reference
underflow come across then.
There are some extra commits make the current code into
io_fixed_fd_install() -> __io_fixed_fd_install() ->
io_install_fixed_file()
However, the fact that there is an extra fput() is called if
io_install_fixed_file() calls fput(). Traversing through the code, I
find that the existing two callers to __io_fixed_fd_install():
io_fixed_fd_install() and io_msg_send_fd() have fput() when handling
error return, this patch simply removes the fput() in
io_install_fixed_file() to fix the bug.
Fixes: 61c1b44a21d7 ("io_uring: fix deadlock on iowq file slot alloc")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Link: https://lore.kernel.org/r/be4ba4b.5d44.184a0a406a4.Coremail.linma@zju.edu.cn
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
poll_refs carry two functions, the first is ownership over the request.
The second is notifying the io_poll_check_events() that there was an
event but wake up couldn't grab the ownership, so io_poll_check_events()
should retry.
We want to make poll_refs more robust against overflows. Instead of
always incrementing it, which covers two purposes with one atomic, check
if poll_refs is elevated enough and if so set a retry flag without
attempts to grab ownership. The gap between the bias check and following
atomics may seem racy, but we don't need it to be strict. Moreover there
might only be maximum 4 parallel updates: by the first and the second
poll entries, __io_arm_poll_handler() and cancellation. From those four,
only poll wake ups may be executed multiple times, but they're protected
by a spin.
Cc: stable@vger.kernel.org
Reported-by: Lin Ma <linma@zju.edu.cn>
Fixes: aa43477b04025 ("io_uring: poll rework")
Signed-off-by: Pavel Begunkov <asml.silence@gmail.com>
Link: https://lore.kernel.org/r/c762bc31f8683b3270f3587691348a7119ef9c9d.1668963050.git.asml.silence@gmail.com
Signed-off-by: Jens Axboe <axboe@kernel.dk>
|