summaryrefslogtreecommitdiffstats
path: root/io_uring (follow)
Commit message (Collapse)AuthorAgeFilesLines
...
| * | Merge branch 'for-6.2/io_uring' into for-6.2/io_uring-nextJens Axboe2022-11-2912-183/+291
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * for-6.2/io_uring: (41 commits) io_uring: keep unlock_post inlined in hot path io_uring: don't use complete_post in kbuf io_uring: spelling fix io_uring: remove io_req_complete_post_tw io_uring: allow multishot polled reqs to defer completion io_uring: remove overflow param from io_post_aux_cqe io_uring: add lockdep assertion in io_fill_cqe_aux io_uring: make io_fill_cqe_aux static io_uring: add io_aux_cqe which allows deferred completion io_uring: allow defer completion for aux posted cqes io_uring: defer all io_req_complete_failed io_uring: always lock in io_apoll_task_func io_uring: remove iopoll spinlock io_uring: iopoll protect complete_post io_uring: inline __io_req_complete_put() io_uring: remove io_req_tw_post_queue io_uring: use io_req_task_complete() in timeout io_uring: hold locks for io_req_complete_failed io_uring: add completion locking for iopoll io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post() ...
* | \ \ Merge tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linuxLinus Torvalds2022-12-1312-183/+291
|\ \ \ \ | | |/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring updates from Jens Axboe: - Always ensure proper ordering in case of CQ ring overflow, which then means we can remove some work-arounds for that (Dylan) - Support completion batching for multishot, greatly increasing the efficiency for those (Dylan) - Flag epoll/eventfd wakeups done from io_uring, so that we can easily tell if we're recursing into io_uring again. Previously, this would have resulted in repeated multishot notifications if we had a dependency there. That could happen if an eventfd was registered as the ring eventfd, and we multishot polled for events on it. Or if an io_uring fd was added to epoll, and io_uring had a multishot request for the epoll fd. Test cases here: https://git.kernel.dk/cgit/liburing/commit/?id=919755a7d0096fda08fb6d65ac54ad8d0fe027cd Previously these got terminated when the CQ ring eventually overflowed, now it's handled gracefully (me). - Tightening of the IOPOLL based completions (Pavel) - Optimizations of the networking zero-copy paths (Pavel) - Various tweaks and fixes (Dylan, Pavel) * tag 'for-6.2/io_uring-2022-12-08' of git://git.kernel.dk/linux: (41 commits) io_uring: keep unlock_post inlined in hot path io_uring: don't use complete_post in kbuf io_uring: spelling fix io_uring: remove io_req_complete_post_tw io_uring: allow multishot polled reqs to defer completion io_uring: remove overflow param from io_post_aux_cqe io_uring: add lockdep assertion in io_fill_cqe_aux io_uring: make io_fill_cqe_aux static io_uring: add io_aux_cqe which allows deferred completion io_uring: allow defer completion for aux posted cqes io_uring: defer all io_req_complete_failed io_uring: always lock in io_apoll_task_func io_uring: remove iopoll spinlock io_uring: iopoll protect complete_post io_uring: inline __io_req_complete_put() io_uring: remove io_req_tw_post_queue io_uring: use io_req_task_complete() in timeout io_uring: hold locks for io_req_complete_failed io_uring: add completion locking for iopoll io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post() ...
| * | | io_uring: keep unlock_post inlined in hot pathPavel Begunkov2022-11-251-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This partially reverts 6c16fe3c16bdc ("io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()") The redundancy of __io_cq_unlock_post() was always to keep it inlined into __io_submit_flush_completions(). Inline it back and rename with hope of clarifying the intention behind it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/372a16c485fca44c069be2e92fc5e7332a1d7fd7.1669310258.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: don't use complete_post in kbufPavel Begunkov2022-11-251-9/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now we're handling IOPOLL completions more generically, get rid uses of _post() and send requests through the normal path. It may have some extra mertis performance wise, but we don't care much as there is a better interface for selected buffers. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/4deded706587f55b006dc33adf0c13cfc3b2319f.1669310258.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: spelling fixDylan Yudaken2022-11-251-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | s/pushs/pushes/ Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221125103412.1425305-3-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: remove io_req_complete_post_twDylan Yudaken2022-11-252-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's only used in one place. Inline it. Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221125103412.1425305-2-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: allow multishot polled reqs to defer completionDylan Yudaken2022-11-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Until now there was no reason for multishot polled requests to defer completions as there was no functional difference. However now this will actually defer the completions, for a performance win. Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221124093559.3780686-10-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: remove overflow param from io_post_aux_cqeDylan Yudaken2022-11-254-10/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The only call sites which would not allow overflow are also call sites which would use the io_aux_cqe as they care about ordering. So remove this parameter from io_post_aux_cqe. Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221124093559.3780686-9-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: add lockdep assertion in io_fill_cqe_auxDylan Yudaken2022-11-251-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an assertion for the completion lock to io_fill_cqe_aux Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221124093559.3780686-8-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: make io_fill_cqe_aux staticDylan Yudaken2022-11-252-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is only used in io_uring.c Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221124093559.3780686-7-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: add io_aux_cqe which allows deferred completionDylan Yudaken2022-11-254-5/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the just introduced deferred post cqe completion state when possible in io_aux_cqe. If not possible fallback to io_post_aux_cqe. This introduces a complication because of allow_overflow. For deferred completions we cannot know without locking the completion_lock if it will overflow (and even if we locked it, another post could sneak in and cause this cqe to be in overflow). However since overflow protection is mostly a best effort defence in depth to prevent infinite loops of CQEs for poll, just checking the overflow bit is going to be good enough and will result in at most 16 (array size of deferred cqes) overflows. Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221124093559.3780686-6-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: allow defer completion for aux posted cqesDylan Yudaken2022-11-251-3/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Multishot ops cannot use the compl_reqs list as the request must stay in the poll list, but that means they need to run each completion without benefiting from batching. Here introduce batching infrastructure for only small (ie 16 byte) CQEs. This restriction is ok because there are no use cases posting 32 byte CQEs. In the ring keep a batch of up to 16 posted results, and flush in the same way as compl_reqs. 16 was chosen through experimentation on a microbenchmark ([1]), as well as trying not to increase the size of the ring too much. This increases the size to 1472 bytes from 1216. [1]: https://github.com/DylanZA/liburing/commit/9ac66b36bcf4477bfafeff1c5f107896b7ae31cf Run with $ make -j && ./benchmark/reg.b -s 1 -t 2000 -r 10 Gives results: baseline 8309 k/s 8 18807 k/s 16 19338 k/s 32 20134 k/s Suggested-by: Pavel Begunkov <asml.silence@gmail.com> Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221124093559.3780686-5-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: defer all io_req_complete_failedDylan Yudaken2022-11-253-11/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All failures happen under lock now, and can be deferred. To be consistent when the failure has happened after some multishot cqe has been deferred (and keep ordering), always defer failures. To make this obvious at the caller (and to help prevent a future bug) rename io_req_complete_failed to io_req_defer_failed. Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221124093559.3780686-4-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: always lock in io_apoll_task_funcDylan Yudaken2022-11-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is required for the failure case (io_req_complete_failed) and is missing. The alternative would be to only lock in the failure path, however all of the non-error paths in io_poll_check_events that do not do not return IOU_POLL_NO_ACTION end up locking anyway. The only extraneous lock would be for the multishot poll overflowing the CQE ring, however multishot poll would probably benefit from being locked as it will allow completions to be batched. So it seems reasonable to lock always. Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221124093559.3780686-3-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: remove iopoll spinlockPavel Begunkov2022-11-231-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 2ccc92f4effcfa1c51c4fcf1e34d769099d3cad4 io_req_complete_post() should now behave well even in case of IOPOLL, we can remove completion_lock locking. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7e171c8b530656b14a671c59100ca260e46e7f2a.1669203009.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: iopoll protect complete_postPavel Begunkov2022-11-235-12/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_req_complete_post() may be used by iopoll enabled rings, grab locks in this case. That requires to pass issue_flags to propagate the locking state. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/cc6d854065c57c838ca8e8806f707a226b70fd2d.1669203009.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: inline __io_req_complete_put()Pavel Begunkov2022-11-231-13/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inline __io_req_complete_put() into io_req_complete_post(), there are no other users. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1923a4dfe80fa877f859a22ed3df2d5fc8ecf02b.1669203009.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: remove io_req_tw_post_queuePavel Begunkov2022-11-233-16/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove io_req_tw_post() and io_req_tw_post_queue(), we can use io_req_task_complete() instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/b9b73c08022c7f1457023ac841f35c0100e70345.1669203009.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: use io_req_task_complete() in timeoutPavel Begunkov2022-11-231-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use a more generic io_req_task_complete() in timeout completion task_work instead of io_req_complete_post(). Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/bda1710b58c07bf06107421c2a65c529ea9cdcac.1669203009.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: hold locks for io_req_complete_failedPavel Begunkov2022-11-231-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A preparation patch, make sure we always hold uring_lock around io_req_complete_failed(). The only place deviating from the rule is io_cancel_defer_files(), queue a tw instead. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/70760344eadaecf2939287084b9d4ba5c05a6984.1669203009.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: add completion locking for iopollPavel Begunkov2022-11-231-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are pieces of code that may allow iopoll to race filling cqes, temporarily add spinlocking around posting events. Cc: stable@vger.kernel.org Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/84d86b5c117feda075471c5c9e65208e0dccf5d0.1669203009.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: kill io_cqring_ev_posted() and __io_cq_unlock_post()Jens Axboe2022-11-221-13/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __io_cq_unlock_post() is identical to io_cq_unlock_post(), and io_cqring_ev_posted() has a single caller so migth as well just inline it there. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | Revert "io_uring: disallow self-propelled ring polling"Jens Axboe2022-11-221-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 7fdbc5f014c3f71bc44673a2d6c5bb2d12d45f25. This patch dealt with a subset of the real problem, which is a potential circular dependency on the wakup path for io_uring itself. Outside of io_uring, eventfd can also trigger this (see details in 03e02acda8e2) and so can epoll (see details in caf1aeaffc3b). Now that we have a generic solution to this problem, get rid of the io_uring specific work-around. Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: pass in EPOLL_URING_WAKE for eventfd signaling and wakeupsJens Axboe2022-11-223-6/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass in EPOLL_URING_WAKE when signaling eventfd or doing poll related wakups, so that we can check for a circular event dependency between eventfd and epoll. If this flag is set when our wakeup handlers are called, then we know we have a dependency that needs to terminate multishot requests. eventfd and epoll are the only such possible dependencies. Cc: stable@vger.kernel.org # 6.0 Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: inline __io_req_complete_post()Pavel Begunkov2022-11-212-9/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is only one user of __io_req_complete_post(), inline it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/ef4c9059950a3da5cf68df00f977f1fd13bd9306.1668597569.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: split tw fallback into a functionPavel Begunkov2022-11-211-10/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When the target process is dying and so task_work_add() is not allowed we push all task_work item to the fallback workqueue. Move the part responsible for moving tw items out of __io_req_task_work_add() into a separate function. Makes it a bit cleaner and gives the compiler a bit of extra info. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/e503dab9d7af95470ca6b214c6de17715ae4e748.1668162751.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: inline io_req_task_work_add()Pavel Begunkov2022-11-212-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __io_req_task_work_add() is huge but marked inline, that makes compilers to generate lots of garbage. Inline the wrapper caller io_req_task_work_add() instead. before and after: text data bss dec hex filename 47347 16248 8 63603 f873 io_uring/io_uring.o text data bss dec hex filename 45303 16248 8 61559 f077 io_uring/io_uring.o Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/26dc8c28ca0160e3269ef3e55c5a8b917c4d4450.1668162751.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: update outdated comment of callbacksLin Ma2022-11-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previous commit ebc11b6c6b87 ("io_uring: clean io-wq callbacks") rename io_free_work() into io_wq_free_work() for consistency. This patch also updates relevant comment to avoid misunderstanding. Fixes: ebc11b6c6b87 ("io_uring: clean io-wq callbacks") Signed-off-by: Lin Ma <linma@zju.edu.cn> Link: https://lore.kernel.org/r/20221110122103.20120-1-linma@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/poll: remove outdated comments of cachingLin Ma2022-11-211-6/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previous commit 13a99017ff19 ("io_uring: remove events caching atavisms") entirely removes the events caching optimization introduced by commit 81459350d581 ("io_uring: cache req->apoll->events in req->cflags"). Hence the related comment should also be removed to avoid misunderstanding. Fixes: 13a99017ff19 ("io_uring: remove events caching atavisms") Signed-off-by: Lin Ma <linma@zju.edu.cn> Link: https://lore.kernel.org/r/20221110060313.16303-1-linma@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: allow multishot recv CQEs to overflowDylan Yudaken2022-11-211-6/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With commit aa1df3a360a0 ("io_uring: fix CQE reordering"), there are stronger guarantees for overflow ordering. Specifically ensuring that userspace will not receive out of order receive CQEs. Therefore this is not needed any more for recv/recvmsg. Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221107125236.260132-4-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: revert "io_uring fix multishot accept ordering"Dylan Yudaken2022-11-211-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is no longer needed after commit aa1df3a360a0 ("io_uring: fix CQE reordering"), since all reordering is now taken care of. This reverts commit cbd25748545c ("io_uring: fix multishot accept ordering"). Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221107125236.260132-2-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: do not always force run task_work in io_uring_registerDylan Yudaken2022-11-212-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Running task work when not needed can unnecessarily delay operations. Specifically IORING_SETUP_DEFER_TASKRUN tries to avoid running task work until the user requests it. Therefore do not run it in io_uring_register any more. The one catch is that io_rsrc_ref_quiesce expects it to have run in order to process all outstanding references, and so reorder it's loop to do this. Signed-off-by: Dylan Yudaken <dylany@meta.com> Link: https://lore.kernel.org/r/20221107123349.4106213-1-dylany@meta.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: fix two assignments in if conditionsXinghui Li2022-11-212-9/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes two errors: "ERROR: do not use assignment in if condition 130: FILE: io_uring/net.c:130: + if (!(issue_flags & IO_URING_F_UNLOCKED) && ERROR: do not use assignment in if condition 599: FILE: io_uring/poll.c:599: + } else if (!(issue_flags & IO_URING_F_UNLOCKED) &&" reported by checkpatch.pl in net.c and poll.c . Signed-off-by: Xinghui Li <korantli@tencent.com> Reported-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/r/20221102082503.32236-1-korantwork@gmail.com [axboe: style tweaks] Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: move mm accounting to a slower pathPavel Begunkov2022-11-212-18/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can also move mm accounting to the extended callbacks. It removes a few cycles from the hot path including skipping one function call and setting io_req_task_complete as a callback directly. For user backed I/O it shouldn't make any difference taking into considering atomic mm accounting and page pinning. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/1062f270273ad11c1b7b45ec59a6a317533d5e64.1667557923.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: move zc reporting from the hot pathPavel Begunkov2022-11-213-12/+42
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add custom tw and notif callbacks on top of usual bits also handling zc reporting. That moves it from the hot path. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/40de4a6409042478e1f35adc4912e23226cb1b5c.1667557923.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: inline io_notif_flush()Pavel Begunkov2022-11-212-11/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_notif_flush() is pretty simple, we can inline it. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/332359e7bd124138dfe51340bbec829c9b265c18.1667557923.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: rename io_uring_tx_zerocopy_callbackPavel Begunkov2022-11-211-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Just a simple renaming patch, io_uring_tx_zerocopy_callback() is too bulky and doesn't follow usual naming style. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/24d78325403ca6dcb1ec4bced1e33cacc9b832a5.1667557923.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: preset notif tw handlerPavel Begunkov2022-11-211-6/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're going to have multiple notification tw functions. In preparation for future changes default the tw callback in advance so later we can replace it with other versions. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/7acdbea5e20eadd844513320cd454af14ba50f64.1667557923.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: remove extra notif rsrc setupPavel Begunkov2022-11-211-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_send_zc_prep() sets up notification's rsrc_node when needed, don't unconditionally install it on notif alloc. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/dbe4875ac33e180b9799d8537a5e27935e82aac4.1667557923.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring: move kbuf put out of generic tw completePavel Begunkov2022-11-212-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are multiple users of io_req_task_complete() including zc notifications, but only read requests use selected buffers. As we already have an rw specific tw function, move io_put_kbuf() in there. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/94374c7649aaefc3a17808dc4701f25ccd457e25.1667557923.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * | | io_uring/net: introduce IORING_SEND_ZC_REPORT_USAGE flagStefan Metzmacher2022-11-213-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It might be useful for applications to detect if a zero copy transfer with SEND[MSG]_ZC was actually possible or not. The application can fallback to plain SEND[MSG] in order to avoid the overhead of two cqes per request. Or it can generate a log message that could indicate to an administrator that no zero copy was possible and could explain degraded performance. Cc: stable@vger.kernel.org # 6.1 Link: https://lore.kernel.org/io-uring/fb6a7599-8a9b-15e5-9b64-6cd9d01c6ff4@gmail.com/T/#m2b0d9df94ce43b0e69e6c089bdff0ce6babbdfaa Signed-off-by: Stefan Metzmacher <metze@samba.org> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/8945b01756d902f5d5b0667f20b957ad3f742e5e.1666895626.git.metze@samba.org Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | Merge tag 'fs.idmapped.mnt_idmap.v6.2' of ↵Linus Torvalds2022-12-131-5/+3
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping Pull idmapping updates from Christian Brauner: "Last cycle we've already made the interaction with idmapped mounts more robust and type safe by introducing the vfs{g,u}id_t type. This cycle we concluded the conversion and removed the legacy helpers. Currently we still pass around the plain namespace that was attached to a mount. This is in general pretty convenient but it makes it easy to conflate namespaces that are relevant on the filesystem - with namespaces that are relevent on the mount level. Especially for filesystem developers without detailed knowledge in this area this can be a potential source for bugs. Instead of passing the plain namespace we introduce a dedicated type struct mnt_idmap and replace the pointer with a pointer to a struct mnt_idmap. There are no semantic or size changes for the mount struct caused by this. We then start converting all places aware of idmapped mounts to rely on struct mnt_idmap. Once the conversion is done all helpers down to the really low-level make_vfs{g,u}id() and from_vfs{g,u}id() will take a struct mnt_idmap argument instead of two namespace arguments. This way it becomes impossible to conflate the two removing and thus eliminating the possibility of any bugs. Fwiw, I fixed some issues in that area a while ago in ntfs3 and ksmbd in the past. Afterwards only low-level code can ultimately use the associated namespace for any permission checks. Even most of the vfs can be completely obivious about this ultimately and filesystems will never interact with it in any form in the future. A struct mnt_idmap currently encompasses a simple refcount and pointer to the relevant namespace the mount is idmapped to. If a mount isn't idmapped then it will point to a static nop_mnt_idmap and if it doesn't that it is idmapped. As usual there are no allocations or anything happening for non-idmapped mounts. Everthing is carefully written to be a nop for non-idmapped mounts as has always been the case. If an idmapped mount is created a struct mnt_idmap is allocated and a reference taken on the relevant namespace. Each mount that gets idmapped or inherits the idmap simply bumps the reference count on struct mnt_idmap. Just a reminder that we only allow a mount to change it's idmapping a single time and only if it hasn't already been attached to the filesystems and has no active writers. The actual changes are fairly straightforward but this will have huge benefits for maintenance and security in the long run even if it causes some churn. Note that this also makes it possible to extend struct mount_idmap in the future. For example, it would be possible to place the namespace pointer in an anonymous union together with an idmapping struct. This would allow us to expose an api to userspace that would let it specify idmappings directly instead of having to go through the detour of setting up namespaces at all" * tag 'fs.idmapped.mnt_idmap.v6.2' of git://git.kernel.org/pub/scm/linux/kernel/git/vfs/idmapping: acl: conver higher-level helpers to rely on mnt_idmap fs: introduce dedicated idmap type for mounts
| * | | | acl: conver higher-level helpers to rely on mnt_idmapChristian Brauner2022-10-311-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert an initial portion to rely on struct mnt_idmap by converting the high level xattr helpers. Reviewed-by: Seth Forshee (DigitalOcean) <sforshee@kernel.org> Signed-off-by: Christian Brauner (Microsoft) <brauner@kernel.org>
* | | | | Merge tag 'pull-iov_iter' of ↵Linus Torvalds2022-12-132-12/+12
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull iov_iter updates from Al Viro: "iov_iter work; most of that is about getting rid of direction misannotations and (hopefully) preventing more of the same for the future" * tag 'pull-iov_iter' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: use less confusing names for iov_iter direction initializers iov_iter: saner checks for attempt to copy to/from iterator [xen] fix "direction" argument of iov_iter_kvec() [vhost] fix 'direction' argument of iov_iter_{init,bvec}() [target] fix iov_iter_bvec() "direction" argument [s390] memcpy_real(): WRITE is "data source", not destination... [s390] zcore: WRITE is "data source", not destination... [infiniband] READ is "data destination", not source... [fsi] WRITE is "data source", not destination... [s390] copy_oldmem_kernel() - WRITE is "data source", not destination csum_and_copy_to_iter(): handle ITER_DISCARD get rid of unlikely() on page_copy_sane() calls
| * | | | | use less confusing names for iov_iter direction initializersAl Viro2022-11-252-12/+12
| | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | io_uring: Fix a null-ptr-deref in io_tctx_exit_cb()Harshit Mogalapalli2022-12-071-1/+3
| |_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzkaller reports a NULL deref bug as follows: BUG: KASAN: null-ptr-deref in io_tctx_exit_cb+0x53/0xd3 Read of size 4 at addr 0000000000000138 by task file1/1955 CPU: 1 PID: 1955 Comm: file1 Not tainted 6.1.0-rc7-00103-gef4d3ea40565 #75 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.11.0-2.el7 04/01/2014 Call Trace: <TASK> dump_stack_lvl+0xcd/0x134 ? io_tctx_exit_cb+0x53/0xd3 kasan_report+0xbb/0x1f0 ? io_tctx_exit_cb+0x53/0xd3 kasan_check_range+0x140/0x190 io_tctx_exit_cb+0x53/0xd3 task_work_run+0x164/0x250 ? task_work_cancel+0x30/0x30 get_signal+0x1c3/0x2440 ? lock_downgrade+0x6e0/0x6e0 ? lock_downgrade+0x6e0/0x6e0 ? exit_signals+0x8b0/0x8b0 ? do_raw_read_unlock+0x3b/0x70 ? do_raw_spin_unlock+0x50/0x230 arch_do_signal_or_restart+0x82/0x2470 ? kmem_cache_free+0x260/0x4b0 ? putname+0xfe/0x140 ? get_sigframe_size+0x10/0x10 ? do_execveat_common.isra.0+0x226/0x710 ? lockdep_hardirqs_on+0x79/0x100 ? putname+0xfe/0x140 ? do_execveat_common.isra.0+0x238/0x710 exit_to_user_mode_prepare+0x15f/0x250 syscall_exit_to_user_mode+0x19/0x50 do_syscall_64+0x42/0xb0 entry_SYSCALL_64_after_hwframe+0x63/0xcd RIP: 0023:0x0 Code: Unable to access opcode bytes at 0xffffffffffffffd6. RSP: 002b:00000000fffb7790 EFLAGS: 00000200 ORIG_RAX: 000000000000000b RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: 0000000000000000 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000 R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000 </TASK> Kernel panic - not syncing: panic_on_warn set ... This happens because the adding of task_work from io_ring_exit_work() isn't synchronized with canceling all work items from eg exec. The execution of the two are ordered in that they are both run by the task itself, but if io_tctx_exit_cb() is queued while we're canceling all work items off exec AND gets executed when the task exits to userspace rather than in the main loop in io_uring_cancel_generic(), then we can find current->io_uring == NULL and hit the above crash. It's safe to add this NULL check here, because the execution of the two paths are done by the task itself. Cc: stable@vger.kernel.org Fixes: d56d938b4bef ("io_uring: do ctx initiated file note removal") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Harshit Mogalapalli <harshit.m.mogalapalli@oracle.com> Link: https://lore.kernel.org/r/20221206093833.3812138-1-harshit.m.mogalapalli@oracle.com [axboe: add code comment and also put an explanation in the commit msg] Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | io_uring: clear TIF_NOTIFY_SIGNAL if set and task_work not availableJens Axboe2022-11-251-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With how task_work is added and signaled, we can have TIF_NOTIFY_SIGNAL set and no task_work pending as it got run in a previous loop. Treat TIF_NOTIFY_SIGNAL like get_signal(), always clear it if set regardless of whether or not task_work is pending to run. Cc: stable@vger.kernel.org Fixes: 46a525e199e4 ("io_uring: don't gate task_work run on TIF_NOTIFY_SIGNAL") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | io_uring/poll: fix poll_refs race with cancelationLin Ma2022-11-251-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is an interesting race condition of poll_refs which could result in a NULL pointer dereference. The crash trace is like: KASAN: null-ptr-deref in range [0x0000000000000008-0x000000000000000f] CPU: 0 PID: 30781 Comm: syz-executor.2 Not tainted 6.0.0-g493ffd6605b2 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-1ubuntu1.1 04/01/2014 RIP: 0010:io_poll_remove_entry io_uring/poll.c:154 [inline] RIP: 0010:io_poll_remove_entries+0x171/0x5b4 io_uring/poll.c:190 Code: ... RSP: 0018:ffff88810dfefba0 EFLAGS: 00010202 RAX: 0000000000000001 RBX: 0000000000000000 RCX: 0000000000040000 RDX: ffffc900030c4000 RSI: 000000000003ffff RDI: 0000000000040000 RBP: 0000000000000008 R08: ffffffff9764d3dd R09: fffffbfff3836781 R10: fffffbfff3836781 R11: 0000000000000000 R12: 1ffff11003422d60 R13: ffff88801a116b04 R14: ffff88801a116ac0 R15: dffffc0000000000 FS: 00007f9c07497700(0000) GS:ffff88811a600000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffb5c00ea98 CR3: 0000000105680005 CR4: 0000000000770ef0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: <TASK> io_apoll_task_func+0x3f/0xa0 io_uring/poll.c:299 handle_tw_list io_uring/io_uring.c:1037 [inline] tctx_task_work+0x37e/0x4f0 io_uring/io_uring.c:1090 task_work_run+0x13a/0x1b0 kernel/task_work.c:177 get_signal+0x2402/0x25a0 kernel/signal.c:2635 arch_do_signal_or_restart+0x3b/0x660 arch/x86/kernel/signal.c:869 exit_to_user_mode_loop kernel/entry/common.c:166 [inline] exit_to_user_mode_prepare+0xc2/0x160 kernel/entry/common.c:201 __syscall_exit_to_user_mode_work kernel/entry/common.c:283 [inline] syscall_exit_to_user_mode+0x58/0x160 kernel/entry/common.c:294 entry_SYSCALL_64_after_hwframe+0x63/0xcd The root cause for this is a tiny overlooking in io_poll_check_events() when cocurrently run with poll cancel routine io_poll_cancel_req(). The interleaving to trigger use-after-free: CPU0 | CPU1 | io_apoll_task_func() | io_poll_cancel_req() io_poll_check_events() | // do while first loop | v = atomic_read(...) | // v = poll_refs = 1 | ... | io_poll_mark_cancelled() | atomic_or() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | atomic_sub_return(...) | // poll_refs = IO_POLL_CANCEL_FLAG | // loop continue | | | io_poll_execute() | io_poll_get_ownership() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | // gets the ownership v = atomic_read(...) | // poll_refs not change | | if (v & IO_POLL_CANCEL_FLAG) | return -ECANCELED; | // io_poll_check_events return | // will go into | // io_req_complete_failed() free req | | | io_apoll_task_func() | // also go into io_req_complete_failed() And the interleaving to trigger the kernel WARNING: CPU0 | CPU1 | io_apoll_task_func() | io_poll_cancel_req() io_poll_check_events() | // do while first loop | v = atomic_read(...) | // v = poll_refs = 1 | ... | io_poll_mark_cancelled() | atomic_or() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | atomic_sub_return(...) | // poll_refs = IO_POLL_CANCEL_FLAG | // loop continue | | v = atomic_read(...) | // v = IO_POLL_CANCEL_FLAG | | io_poll_execute() | io_poll_get_ownership() | // poll_refs = IO_POLL_CANCEL_FLAG | 1 | // gets the ownership | WARN_ON_ONCE(!(v & IO_POLL_REF_MASK))) | // v & IO_POLL_REF_MASK = 0 WARN | | | io_apoll_task_func() | // also go into io_req_complete_failed() By looking up the source code and communicating with Pavel, the implementation of this atomic poll refs should continue the loop of io_poll_check_events() just to avoid somewhere else to grab the ownership. Therefore, this patch simply adds another AND operation to make sure the loop will stop if it finds the poll_refs is exactly equal to IO_POLL_CANCEL_FLAG. Since io_poll_cancel_req() grabs ownership and will finally make its way to io_req_complete_failed(), the req will be reclaimed as expected. Fixes: aa43477b0402 ("io_uring: poll rework") Signed-off-by: Lin Ma <linma@zju.edu.cn> Reviewed-by: Pavel Begunkov <asml.silence@gmail.com> [axboe: tweak description and code style] Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | io_uring/filetable: fix file reference underflowLin Ma2022-11-251-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is an interesting reference bug when -ENOMEM occurs in calling of io_install_fixed_file(). KASan report like below: [ 14.057131] ================================================================== [ 14.059161] BUG: KASAN: use-after-free in unix_get_socket+0x10/0x90 [ 14.060975] Read of size 8 at addr ffff88800b09cf20 by task kworker/u8:2/45 [ 14.062684] [ 14.062768] CPU: 2 PID: 45 Comm: kworker/u8:2 Not tainted 6.1.0-rc4 #1 [ 14.063099] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.16.0-0-gd239552ce722-prebuilt.qemu.org 04/01/2014 [ 14.063666] Workqueue: events_unbound io_ring_exit_work [ 14.063936] Call Trace: [ 14.064065] <TASK> [ 14.064175] dump_stack_lvl+0x34/0x48 [ 14.064360] print_report+0x172/0x475 [ 14.064547] ? _raw_spin_lock_irq+0x83/0xe0 [ 14.064758] ? __virt_addr_valid+0xef/0x170 [ 14.064975] ? unix_get_socket+0x10/0x90 [ 14.065167] kasan_report+0xad/0x130 [ 14.065353] ? unix_get_socket+0x10/0x90 [ 14.065553] unix_get_socket+0x10/0x90 [ 14.065744] __io_sqe_files_unregister+0x87/0x1e0 [ 14.065989] ? io_rsrc_refs_drop+0x1c/0xd0 [ 14.066199] io_ring_exit_work+0x388/0x6a5 [ 14.066410] ? io_uring_try_cancel_requests+0x5bf/0x5bf [ 14.066674] ? try_to_wake_up+0xdb/0x910 [ 14.066873] ? virt_to_head_page+0xbe/0xbe [ 14.067080] ? __schedule+0x574/0xd20 [ 14.067273] ? read_word_at_a_time+0xe/0x20 [ 14.067492] ? strscpy+0xb5/0x190 [ 14.067665] process_one_work+0x423/0x710 [ 14.067879] worker_thread+0x2a2/0x6f0 [ 14.068073] ? process_one_work+0x710/0x710 [ 14.068284] kthread+0x163/0x1a0 [ 14.068454] ? kthread_complete_and_exit+0x20/0x20 [ 14.068697] ret_from_fork+0x22/0x30 [ 14.068886] </TASK> [ 14.069000] [ 14.069088] Allocated by task 289: [ 14.069269] kasan_save_stack+0x1e/0x40 [ 14.069463] kasan_set_track+0x21/0x30 [ 14.069652] __kasan_slab_alloc+0x58/0x70 [ 14.069899] kmem_cache_alloc+0xc5/0x200 [ 14.070100] __alloc_file+0x20/0x160 [ 14.070283] alloc_empty_file+0x3b/0xc0 [ 14.070479] path_openat+0xc3/0x1770 [ 14.070689] do_filp_open+0x150/0x270 [ 14.070888] do_sys_openat2+0x113/0x270 [ 14.071081] __x64_sys_openat+0xc8/0x140 [ 14.071283] do_syscall_64+0x3b/0x90 [ 14.071466] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 14.071791] [ 14.071874] Freed by task 0: [ 14.072027] kasan_save_stack+0x1e/0x40 [ 14.072224] kasan_set_track+0x21/0x30 [ 14.072415] kasan_save_free_info+0x2a/0x50 [ 14.072627] __kasan_slab_free+0x106/0x190 [ 14.072858] kmem_cache_free+0x98/0x340 [ 14.073075] rcu_core+0x427/0xe50 [ 14.073249] __do_softirq+0x110/0x3cd [ 14.073440] [ 14.073523] Last potentially related work creation: [ 14.073801] kasan_save_stack+0x1e/0x40 [ 14.074017] __kasan_record_aux_stack+0x97/0xb0 [ 14.074264] call_rcu+0x41/0x550 [ 14.074436] task_work_run+0xf4/0x170 [ 14.074619] exit_to_user_mode_prepare+0x113/0x120 [ 14.074858] syscall_exit_to_user_mode+0x1d/0x40 [ 14.075092] do_syscall_64+0x48/0x90 [ 14.075272] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 14.075529] [ 14.075612] Second to last potentially related work creation: [ 14.075900] kasan_save_stack+0x1e/0x40 [ 14.076098] __kasan_record_aux_stack+0x97/0xb0 [ 14.076325] task_work_add+0x72/0x1b0 [ 14.076512] fput+0x65/0xc0 [ 14.076657] filp_close+0x8e/0xa0 [ 14.076825] __x64_sys_close+0x15/0x50 [ 14.077019] do_syscall_64+0x3b/0x90 [ 14.077199] entry_SYSCALL_64_after_hwframe+0x63/0xcd [ 14.077448] [ 14.077530] The buggy address belongs to the object at ffff88800b09cf00 [ 14.077530] which belongs to the cache filp of size 232 [ 14.078105] The buggy address is located 32 bytes inside of [ 14.078105] 232-byte region [ffff88800b09cf00, ffff88800b09cfe8) [ 14.078685] [ 14.078771] The buggy address belongs to the physical page: [ 14.079046] page:000000001bd520e7 refcount:1 mapcount:0 mapping:0000000000000000 index:0xffff88800b09de00 pfn:0xb09c [ 14.079575] head:000000001bd520e7 order:1 compound_mapcount:0 compound_pincount:0 [ 14.079946] flags: 0x100000000010200(slab|head|node=0|zone=1) [ 14.080244] raw: 0100000000010200 0000000000000000 dead000000000001 ffff88800493cc80 [ 14.080629] raw: ffff88800b09de00 0000000080190018 00000001ffffffff 0000000000000000 [ 14.081016] page dumped because: kasan: bad access detected [ 14.081293] [ 14.081376] Memory state around the buggy address: [ 14.081618] ffff88800b09ce00: 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 [ 14.081974] ffff88800b09ce80: 00 00 00 00 00 fc fc fc fc fc fc fc fc fc fc fc [ 14.082336] >ffff88800b09cf00: fa fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb [ 14.082690] ^ [ 14.082909] ffff88800b09cf80: fb fb fb fb fb fb fb fb fb fb fb fb fb fc fc fc [ 14.083266] ffff88800b09d000: fc fc fc fc fc fc fc fc fa fb fb fb fb fb fb fb [ 14.083622] ================================================================== The actual tracing of this bug is shown below: commit 8c71fe750215 ("io_uring: ensure fput() called correspondingly when direct install fails") adds an additional fput() in io_fixed_fd_install() when io_file_bitmap_get() returns error values. In that case, the routine will never make it to io_install_fixed_file() due to an early return. static int io_fixed_fd_install(...) { if (alloc_slot) { ... ret = io_file_bitmap_get(ctx); if (unlikely(ret < 0)) { io_ring_submit_unlock(ctx, issue_flags); fput(file); return ret; } ... } ... ret = io_install_fixed_file(req, file, issue_flags, file_slot); ... } In the above scenario, the reference is okay as io_fixed_fd_install() ensures the fput() is called when something bad happens, either via bitmap or via inner io_install_fixed_file(). However, the commit 61c1b44a21d7 ("io_uring: fix deadlock on iowq file slot alloc") breaks the balance because it places fput() into the common path for both io_file_bitmap_get() and io_install_fixed_file(). Since io_install_fixed_file() handles the fput() itself, the reference underflow come across then. There are some extra commits make the current code into io_fixed_fd_install() -> __io_fixed_fd_install() -> io_install_fixed_file() However, the fact that there is an extra fput() is called if io_install_fixed_file() calls fput(). Traversing through the code, I find that the existing two callers to __io_fixed_fd_install(): io_fixed_fd_install() and io_msg_send_fd() have fput() when handling error return, this patch simply removes the fput() in io_install_fixed_file() to fix the bug. Fixes: 61c1b44a21d7 ("io_uring: fix deadlock on iowq file slot alloc") Signed-off-by: Lin Ma <linma@zju.edu.cn> Link: https://lore.kernel.org/r/be4ba4b.5d44.184a0a406a4.Coremail.linma@zju.edu.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | | | io_uring: make poll refs more robustPavel Begunkov2022-11-251-1/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | poll_refs carry two functions, the first is ownership over the request. The second is notifying the io_poll_check_events() that there was an event but wake up couldn't grab the ownership, so io_poll_check_events() should retry. We want to make poll_refs more robust against overflows. Instead of always incrementing it, which covers two purposes with one atomic, check if poll_refs is elevated enough and if so set a retry flag without attempts to grab ownership. The gap between the bias check and following atomics may seem racy, but we don't need it to be strict. Moreover there might only be maximum 4 parallel updates: by the first and the second poll entries, __io_arm_poll_handler() and cancellation. From those four, only poll wake ups may be executed multiple times, but they're protected by a spin. Cc: stable@vger.kernel.org Reported-by: Lin Ma <linma@zju.edu.cn> Fixes: aa43477b04025 ("io_uring: poll rework") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/c762bc31f8683b3270f3587691348a7119ef9c9d.1668963050.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>