summaryrefslogtreecommitdiffstats
path: root/io_uring (follow)
Commit message (Collapse)AuthorAgeFilesLines
* io_uring/net: improve the usercopy for sendmsg/recvmsgJens Axboe2024-02-271-7/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We're spending a considerable amount of the sendmsg/recvmsg time just copying in the message header. And for provided buffers, the known single entry iovec. Be a bit smarter about it and enable/disable user access around our copying. In a test case that does both sendmsg and recvmsg, the runtime before this change (averaged over multiple runs, very stable times however): Kernel Time Diff ==================================== -git 4720 usec -git+commit 4311 usec -8.7% and looking at a profile diff, we see the following: 0.25% +9.33% [kernel.kallsyms] [k] _copy_from_user 4.47% -3.32% [kernel.kallsyms] [k] __io_msg_copy_hdr.constprop.0 where we drop more than 9% of _copy_from_user() time, and consequently add time to __io_msg_copy_hdr() where the copies are now attributed to, but with a net win of 6%. In comparison, the same test case with send/recv runs in 3745 usec, which is (expectedly) still quite a bit faster. But at least sendmsg/recvmsg is now only ~13% slower, where it was ~21% slower before. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/net: move receive multishot out of the generic msghdr pathJens Axboe2024-02-271-70/+91
| | | | | | | | Move the actual user_msghdr / compat_msghdr into the send and receive sides, respectively, so we can move the uaddr receive handling into its own handler, and ditto the multishot with buffer selection logic. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/net: unify how recvmsg and sendmsg copy in the msghdrJens Axboe2024-02-271-129/+142
| | | | | | | | | For recvmsg, we roll our own since we support buffer selections. This isn't the case for sendmsg right now, but in preparation for doing so, make the recvmsg copy helpers generic so we can call them from the sendmsg side as well. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/napi: enable even with a timeout of 0Jens Axboe2024-02-151-4/+5
| | | | | | | | | | | | | | 1 usec is not as short as it used to be, and it makes sense to allow 0 for a busy poll timeout - this means just do one loop to check if we have anything available. Add a separate ->napi_enabled to check if napi has been enabled or not. While at it, move the writing of the ctx napi values after we've copied the old values back to userspace. This ensures that if the call fails, we'll be in the same state as we were before, rather than some indeterminate state. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: kill stale comment for io_cqring_overflow_kill()Jens Axboe2024-02-151-1/+0
| | | | | | | | | This function now deals only with discarding overflow entries on ring free and exit, and it no longer returns whether we successfully flushed all entries as there's no CQE posting involved anymore. Kill the outdated comment. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/sqpoll: use the correct check for pending task_workJens Axboe2024-02-141-1/+8
| | | | | | | | | | | A previous commit moved to using just the private task_work list for SQPOLL, but it neglected to update the check for whether we have pending task_work. Normally this is fine as we'll attempt to run it unconditionally, but if we race with going to sleep AND task_work being added, then we certainly need the right check here. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: wake SQPOLL task when task_work is added to an empty queueJens Axboe2024-02-141-1/+6
| | | | | | | | | | If there's no current work on the list, we still need to potentially wake the SQPOLL task if it is sleeping. This is ordered with the wait queue addition in sqpoll, which adds to the wait queue before checking for pending work items. Fixes: af5d68f8892f ("io_uring/sqpoll: manage task_work privately") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/napi: ensure napi polling is aborted when work is availableJens Axboe2024-02-143-12/+12
| | | | | | | | | | | | | | While testing io_uring NAPI with DEFER_TASKRUN, I ran into slowdowns and stalls in packet delivery. Turns out that while io_napi_busy_loop_should_end() aborts appropriately on regular task_work, it does not abort if we have local task_work pending. Move io_has_work() into the private io_uring.h header, and gate whether we should continue polling on that as well. This makes NAPI polling on send/receive work as designed with IORING_SETUP_DEFER_TASKRUN as well. Fixes: 8d0c12a80cde ("io-uring: add napi busy poll support") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: Don't include af_unix.h.Kuniyuki Iwashima2024-02-134-3/+2
| | | | | | | | | | | Changes to AF_UNIX trigger rebuild of io_uring, but io_uring does not use AF_UNIX anymore. Let's not include af_unix.h and instead include necessary headers. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://lore.kernel.org/r/20240212234236.63714-1-kuniyu@amazon.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: add register/unregister napi functionStefan Roesch2024-02-093-0/+76
| | | | | | | | | | | | This adds an api to register and unregister the napi for io-uring. If the arg value is specified when unregistering, the current napi setting for the busy poll timeout is copied into the user structure. If this is not required, NULL can be passed as the arg value. Signed-off-by: Stefan Roesch <shr@devkernel.io> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230608163839.2891748-7-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io-uring: add sqpoll support for napi busy pollStefan Roesch2024-02-093-1/+33
| | | | | | | | | This adds the sqpoll support to the io-uring napi. Signed-off-by: Stefan Roesch <shr@devkernel.io> Suggested-by: Olivier Langlois <olivier@trillion01.com> Link: https://lore.kernel.org/r/20230608163839.2891748-6-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io-uring: add napi busy poll supportStefan Roesch2024-02-096-0/+359
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds the napi busy polling support in io_uring.c. It adds a new napi_list to the io_ring_ctx structure. This list contains the list of napi_id's that are currently enabled for busy polling. The list is synchronized by the new napi_lock spin lock. The current default napi busy polling time is stored in napi_busy_poll_to. If napi busy polling is not enabled, the value is 0. In addition there is also a hash table. The hash table store the napi id and the pointer to the above list nodes. The hash table is used to speed up the lookup to the list elements. The hash table is synchronized with rcu. The NAPI_TIMEOUT is stored as a timeout to make sure that the time a napi entry is stored in the napi list is limited. The busy poll timeout is also stored as part of the io_wait_queue. This is necessary as for sq polling the poll interval needs to be adjusted and the napi callback allows only to pass in one value. This has been tested with two simple programs from the liburing library repository: the napi client and the napi server program. The client sends a request, which has a timestamp in its payload and the server replies with the same payload. The client calculates the roundtrip time and stores it to calculate the results. The client is running on host1 and the server is running on host 2 (in the same rack). The measured times below are roundtrip times. They are average times over 5 runs each. Each run measures 1 million roundtrips. no rx coal rx coal: frames=88,usecs=33 Default 57us 56us client_poll=100us 47us 46us server_poll=100us 51us 46us client_poll=100us+ 40us 40us server_poll=100us client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on client client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on server client_poll=100us+ 41us 39us server_poll=100us+ prefer napi busy poll on client + server Signed-off-by: Stefan Roesch <shr@devkernel.io> Suggested-by: Olivier Langlois <olivier@trillion01.com> Acked-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20230608163839.2891748-5-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io-uring: move io_wait_queue definition to header fileStefan Roesch2024-02-092-21/+22
| | | | | | | | | This moves the definition of the io_wait_queue structure to the header file so it can be also used from other files. Signed-off-by: Stefan Roesch <shr@devkernel.io> Link: https://lore.kernel.org/r/20230608163839.2891748-4-shr@devkernel.io Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: add support for ftruncateTony Solomonik2024-02-094-1/+63
| | | | | | | | | | Adds support for doing truncate through io_uring, eliminating the need for applications to roll their own thread pool or offload mechanism to be able to do non-blocking truncates. Signed-off-by: Tony Solomonik <tony.solomonik@gmail.com> Link: https://lore.kernel.org/r/20240202121724.17461-3-tony.solomonik@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: Simplify the allocation of slab cachesKunwu Chan2024-02-081-3/+2
| | | | | | | | | | | commit 0a31bd5f2bbb ("KMEM_CACHE(): simplify slab cache creation") introduces a new macro. Use the new KMEM_CACHE() macro instead of direct kmem_cache_create to simplify the creation of SLAB caches. Signed-off-by: Kunwu Chan <chentao@kylinos.cn> Link: https://lore.kernel.org/r/20240130100247.81460-1-chentao@kylinos.cn Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/sqpoll: manage task_work privatelyJens Axboe2024-02-083-17/+82
| | | | | | | | | | | | | | | | Decouple from task_work running, and cap the number of entries we process at the time. If we exceed that number, push remaining entries to a retry list that we'll process first next time. We cap the number of entries to process at 8, which is fairly random. We just want to get enough per-ctx batching here, while not processing endlessly. Since we manually run PF_IO_WORKER related task_work anyway as the task never exits to userspace, with this we no longer need to add an actual task_work item to the per-process list. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: pass in counter to handle_tw_list() rather than return itJens Axboe2024-02-081-5/+3
| | | | | | | No functional changes in this patch, just in preparation for returning something other than count from this helper. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: cleanup handle_tw_list() calling conventionJens Axboe2024-02-081-16/+13
| | | | | | | | | Now that we don't loop around task_work anymore, there's no point in maintaining the ring and locked state outside of handle_tw_list(). Get rid of passing in those pointers (and pointers to pointers) and just do the management internally in handle_tw_list(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/poll: improve readability of poll reference decrementingJens Axboe2024-02-081-2/+2
| | | | | | | | This overly long line is hard to read. Break it up by AND'ing the ref mask first, then perform the atomic_sub_return() with the value itself. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: remove unconditional looping in local task_work handlingJens Axboe2024-02-081-15/+29
| | | | | | | | | | If we have a ton of notifications coming in, we can be looping in here for a long time. This can be problematic for various reasons, mostly because we can starve userspace. If the application is waiting on N events, then only re-run if we need more events. Fixes: c0e0d6ba25f1 ("io_uring: add IORING_SETUP_DEFER_TASKRUN") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: remove next io_kiocb fetch in task_work runningJens Axboe2024-02-081-3/+0
| | | | | | | | We just reversed the task_work list and that will have touched requests as well, just get rid of this optimization as it should not make a difference anymore. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: handle traditional task_work in FIFO orderJens Axboe2024-02-081-1/+1
| | | | | | | | | | | | For local task_work, which is used if IORING_SETUP_DEFER_TASKRUN is set, we reverse the order of the lockless list before processing the work. This is done to process items in the order in which they were queued, as the llist always adds to the head. Do the same for traditional task_work, so we have the same behavior for both types. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: remove 'loops' argument from trace_io_uring_task_work_run()Jens Axboe2024-02-081-1/+1
| | | | | | | We no longer loop in task_work handling, hence delete the argument from the tracepoint as it's always 1 and hence not very informative. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: remove looping around handling traditional task_workJens Axboe2024-02-081-38/+7
| | | | | | | | | | | | | | | | | A previous commit added looping around handling traditional task_work as an optimization, and while that may seem like a good idea, it's also possible to run into application starvation doing so. If the task_work generation is bursty, we can get very deep task_work queues, and we can end up looping in here for a very long time. One immediately observable problem with that is handling network traffic using provided buffers, where flooding incoming traffic and looping task_work handling will very quickly lead to buffer starvation as we keep running task_work rather than returning to the application so it can handle the associated CQEs and also provide buffers back. Fixes: 3a0c037b0e16 ("io_uring: batch task_work") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/kbuf: cleanup passing back cflagsJens Axboe2024-02-082-24/+31
| | | | | | | | | | | | | | | We have various functions calculating the CQE cflags we need to pass back, but it's all the same everywhere. Make a number of the putting functions void, and just have the two main helps for this, io_put_kbuf() and io_put_kbuf_comp() calculate the actual mask and pass it back. While at it, cleanup how we put REQ_F_BUFFER_RING buffers. Before this change, we would call into __io_put_kbuf() only to go right back in to the header defined functions. As clearing this type of buffer is just re-assigning the buf_index and incrementing the head, this is very wasteful. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/rw: remove dead file == NULL checkJens Axboe2024-02-081-1/+1
| | | | | | | | | Any read/write opcode has needs_file == true, which means that we would've failed the request long before reaching the issue stage if we didn't successfully assign a file. This check has been dead forever, and is really a leftover from generic code. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: cleanup io_req_complete_post()Jens Axboe2024-02-081-4/+4
| | | | | | | Move the ctx declaration and assignment up to be generally available in the function, as we use req->ctx at the top anyway. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: mark the need to lock/unlock the ring as unlikelyJens Axboe2024-02-081-2/+2
| | | | | | | | | Any of the fast paths will already have this locked, this helper only exists to deal with io-wq invoking request issue where we do not have the ctx->uring_lock held already. This means that any common or fast path will already have this locked, mark it as such. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: add io_file_can_poll() helperJens Axboe2024-02-085-6/+18
| | | | | | | | | | This adds a flag to avoid dipping dereferencing file and then f_op to figure out if the file has a poll handler defined or not. We generally call this at least twice for networked workloads, and if using ring provided buffers, we do it on every buffer selection. Particularly the latter is troublesome, as it's otherwise a very fast operation. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/cancel: don't default to setting req->work.cancel_seqJens Axboe2024-02-084-8/+12
| | | | | | | | | | Just leave it unset by default, avoiding dipping into the last cacheline (which is otherwise untouched) for the fast path of using poll to drive networked traffic. Add a flag that tells us if the sequence is valid or not, and then we can defer actually assigning the flag and sequence until someone runs cancelations. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: expand main struct io_kiocb flags to 64-bitsJens Axboe2024-02-082-5/+6
| | | | | | | | | | | | | | | | We're out of space here, and none of the flags are easily reclaimable. Bump it to 64-bits and re-arrange the struct a bit to avoid gaps. Add a specific bitwise type for the request flags, io_request_flags_t. This will help catch violations of casting this value to a smaller type on 32-bit archs, like unsigned int. This creates a hole in the io_kiocb, so move nr_tw up and rsrc_node down to retain needing only cacheline 0 and 1 for non-polled opcodes. No functional changes intended in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: use file_mnt_idmap helperAlexander Mikhalitsyn2024-02-071-1/+1
| | | | | | | | | | | | | | | | Let's use file_mnt_idmap() as we do that across the tree. No functional impact. Cc: Christian Brauner <brauner@kernel.org> Cc: Jens Axboe <axboe@kernel.dk> Cc: Pavel Begunkov <asml.silence@gmail.com> Cc: io-uring@vger.kernel.org Cc: linux-fsdevel@vger.kernel.org Cc: linux-kernel@vger.kernel.org Signed-off-by: Alexander Mikhalitsyn <aleksandr.mikhalitsyn@canonical.com> Link: https://lore.kernel.org/r/20240129180024.219766-2-aleksandr.mikhalitsyn@canonical.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/net: fix sr->len for IORING_OP_RECV with MSG_WAITALL and buffersJens Axboe2024-02-011-0/+1
| | | | | | | | | | | | | | | | If we use IORING_OP_RECV with provided buffers and pass in '0' as the length of the request, the length is retrieved from the selected buffer. If MSG_WAITALL is also set and we get a short receive, then we may hit the retry path which decrements sr->len and increments the buffer for a retry. However, the length is still zero at this point, which means that sr->len now becomes huge and import_ubuf() will cap it to MAX_RW_COUNT and subsequently return -EFAULT for the range as a whole. Fix this by always assigning sr->len once the buffer has been selected. Cc: stable@vger.kernel.org Fixes: 7ba89d2af17a ("io_uring: ensure recv and recvmsg handle MSG_WAITALL correctly") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/net: limit inline multishot retriesJens Axboe2024-01-291-3/+20
| | | | | | | | | | | | | | | | | | | | | | If we have multiple clients and some/all are flooding the receives to such an extent that we can retry a LOT handling multishot receives, then we can be starving some clients and hence serving traffic in an imbalanced fashion. Limit multishot retry attempts to some arbitrary value, whose only purpose serves to ensure that we don't keep serving a single connection for way too long. We default to 32 retries, which should be more than enough to provide fairness, yet not so small that we'll spend too much time requeuing rather than handling traffic. Cc: stable@vger.kernel.org Depends-on: 704ea888d646 ("io_uring/poll: add requeue return code from poll multishot handling") Depends-on: 1e5d765a82f ("io_uring/net: un-indent mshot retry path in io_recv_finish()") Depends-on: e84b01a880f6 ("io_uring/poll: move poll execution helpers higher up") Fixes: b3fdea6ecb55 ("io_uring: multishot recv") Fixes: 9bb66906f23e ("io_uring: support multishot in recvmsg") Link: https://github.com/axboe/liburing/issues/1043 Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/poll: add requeue return code from poll multishot handlingJens Axboe2024-01-292-2/+15
| | | | | | | | | | | | | | | Since our poll handling is edge triggered, multishot handlers retry internally until they know that no more data is available. In preparation for limiting these retries, add an internal return code, IOU_REQUEUE, which can be used to inform the poll backend about the handler wanting to retry, but that this should happen through a normal task_work requeue rather than keep hammering on the issue side for this one request. No functional changes in this patch, nobody is using this return code just yet. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/net: un-indent mshot retry path in io_recv_finish()Jens Axboe2024-01-291-16/+20
| | | | | | | | | | In preparation for putting some retry logic in there, have the done path just skip straight to the end rather than have too much nesting in here. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/poll: move poll execution helpers higher upJens Axboe2024-01-291-20/+20
| | | | | | | | | In preparation for calling __io_poll_execute() higher up, move the functions to avoid forward declarations. No functional changes in this patch. Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring/rw: ensure poll based multishot read retries appropriatelyJens Axboe2024-01-292-1/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | io_read_mshot() always relies on poll triggering retries, and this works fine as long as we do a retry per size of the buffer being read. The buffer size is given by the size of the buffer(s) in the given buffer group ID. But if we're reading less than what is available, then we don't always get to read everything that is available. For example, if the buffers available are 32 bytes and we have 64 bytes to read, then we'll correctly read the first 32 bytes and then wait for another poll trigger before we attempt the next read. This next poll trigger may never happen, in which case we just sit forever and never make progress, or it may trigger at some point in the future, and now we're just delivering the available data much later than we should have. io_read_mshot() could do retries itself, but that is wasteful as we'll be going through all of __io_read() again, and most likely in vain. Rather than do that, bump our poll reference count and have io_poll_check_events() do one more loop and check with vfs_poll() if we have more data to read. If we do, io_read_mshot() will get invoked again directly and we'll read the next chunk. io_poll_multishot_retry() must only get called from inside io_poll_issue(), which is our multishot retry handler, as we know we already "own" the request at this point. Cc: stable@vger.kernel.org Link: https://github.com/axboe/liburing/issues/1041 Fixes: fc68fcda0491 ("io_uring/rw: add support for IORING_OP_READ_MULTISHOT") Signed-off-by: Jens Axboe <axboe@kernel.dk>
* io_uring: enable audit and restrict cred override for IORING_OP_FIXED_FD_INSTALLPaul Moore2024-01-232-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to correct some aspects of the IORING_OP_FIXED_FD_INSTALL command to take into account the security implications of making an io_uring-private file descriptor generally accessible to a userspace task. The first change in this patch is to enable auditing of the FD_INSTALL operation as installing a file descriptor into a task's file descriptor table is a security relevant operation and something that admins/users may want to audit. The second change is to disable the io_uring credential override functionality, also known as io_uring "personalities", in the FD_INSTALL command. The credential override in FD_INSTALL is particularly problematic as it affects the credentials used in the security_file_receive() LSM hook. If a task were to request a credential override via REQ_F_CREDS on a FD_INSTALL operation, the LSM would incorrectly check to see if the overridden credentials of the io_uring were able to "receive" the file as opposed to the task's credentials. After discussions upstream, it's difficult to imagine a use case where we would want to allow a credential override on a FD_INSTALL operation so we are simply going to block REQ_F_CREDS on IORING_OP_FIXED_FD_INSTALL operations. Fixes: dc18b89ab113 ("io_uring/openclose: add support for IORING_OP_FIXED_FD_INSTALL") Signed-off-by: Paul Moore <paul@paul-moore.com> Link: https://lore.kernel.org/r/20240123215501.289566-2-paul@paul-moore.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'for-6.8/io_uring-2024-01-18' of git://git.kernel.dk/linuxLinus Torvalds2024-01-194-47/+86
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull io_uring fixes from Jens Axboe: "Nothing major in here, just a few fixes and cleanups that arrived after the initial merge window pull request got finalized, as well as a fix for a patch that got merged earlier" * tag 'for-6.8/io_uring-2024-01-18' of git://git.kernel.dk/linux: io_uring: combine cq_wait_nr checks io_uring: clean *local_work_add var naming io_uring: clean up local tw add-wait sync io_uring: adjust defer tw counting io_uring/register: guard compat syscall with CONFIG_COMPAT io_uring/rsrc: improve code generation for fixed file assignment io_uring/rw: cleanup io_rw_done()
| * io_uring: combine cq_wait_nr checksPavel Begunkov2024-01-171-7/+27
| | | | | | | | | | | | | | | | | | | | | | Instead of explicitly checking ->cq_wait_nr for whether there are waiting, which is currently represented by 0, we can store there a large value and the nr_tw will automatically filter out those cases. Add a named constant for that and for the wake up bias value. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/38def30282654d980673976cd42fde9bab19b297.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: clean *local_work_add var namingPavel Begunkov2024-01-171-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | if (!first) { ... } While it reads as do something if it's not the first entry, it does exactly the opposite because "first" here is a pointer to the first entry. Remove the confusion by renaming it into "head". Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3b8be483b52f58a524185bb88694b8a268e7e85d.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: clean up local tw add-wait syncPavel Begunkov2024-01-171-2/+8
| | | | | | | | | | | | | | | | | | | | Kill a smp_mb__after_atomic() right before wake_up, it's useless, and add a comment explaining implicit barriers from cmpxchg and synchronsation around ->cq_wait_nr with the waiter. Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/3007f3c2d53c72b61de56919ef56b53158b8276f.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring: adjust defer tw countingPavel Begunkov2024-01-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The UINT_MAX work item counting bias in io_req_local_work_add() in case of !IOU_F_TWQ_LAZY_WAKE works in a sense that we will not miss a wake up, however it's still eerie. In particular, if we add a lazy work item after a non-lazy one, we'll increment it and get nr_tw==0, and subsequent adds may try to unnecessarily wake up the task, which is though not so likely to happen in real workloads. Half the bias, it's still large enough to be larger than any valid ->cq_wait_nr, which is limited by IORING_MAX_CQ_ENTRIES, but further have a good enough of space before it overflows. Fixes: 8751d15426a31 ("io_uring: reduce scheduling due to tw") Signed-off-by: Pavel Begunkov <asml.silence@gmail.com> Link: https://lore.kernel.org/r/108b971e958deaf7048342930c341ba90f75d806.1705438669.git.asml.silence@gmail.com Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/register: guard compat syscall with CONFIG_COMPATJens Axboe2024-01-171-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add compat.h include to avoid a potential build issue: io_uring/register.c:281:6: error: call to undeclared function 'in_compat_syscall'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration] if (in_compat_syscall()) { ^ 1 warning generated. io_uring/register.c:282:9: error: call to undeclared function 'compat_get_bitmap'; ISO C99 and later do not support implicit function declarations [-Werror,-Wimplicit-function-declaration] ret = compat_get_bitmap(cpumask_bits(new_mask), ^ Fixes: c43203154d8a ("io_uring/register: move io_uring_register(2) related code to register.c") Reported-by: Manu Bretelle <chantra@meta.com> Reviewed-by: Jeff Moyer <jmoyer@redhat.com> Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/rsrc: improve code generation for fixed file assignmentJens Axboe2024-01-112-7/+12
| | | | | | | | | | | | | | | | | | | | | | | | For the normal read/write path, we have already locked the ring submission side when assigning the file. This causes branch mispredictions when we then check and try and lock again in io_req_set_rsrc_node(). As this is a very hot path, this matters. Add a basic helper that already assumes we already have it locked, and use that in io_file_get_fixed(). Signed-off-by: Jens Axboe <axboe@kernel.dk>
| * io_uring/rw: cleanup io_rw_done()Jens Axboe2024-01-101-21/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This originally came from the aio side, and it's laid out rather oddly. The common case here is that we either get -EIOCBQUEUED from submitting an async request, or that we complete the request correctly with the given number of bytes. Handling the odd internal restart error codes is not a common operation. Lay it out a bit more optimally that better explains the normal flow, and switch to avoiding the indirect call completely as this is our kiocb and we know the completion handler can only be one of two possible variants. While at it, move it to where it belongs in the file, with fellow end IO helpers. Outside of being easier to read, this also reduces the text size of the function by 24 bytes for me on arm64. Reviewed-by: Keith Busch <kbusch@kernel.org> Signed-off-by: Jens Axboe <axboe@kernel.dk>
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2024-01-171-1/+2
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
| * \ Merge tag 'loongarch-kvm-6.8' of ↵Paolo Bonzini2024-01-0212-98/+256
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.8 1. Optimization for memslot hugepage checking. 2. Cleanup and fix some HW/SW timer issues. 3. Add LSX/LASX (128bit/256bit SIMD) support.
| * \ \ Merge branch 'kvm-guestmemfd' into HEADPaolo Bonzini2023-11-141-1/+2
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce several new KVM uAPIs to ultimately create a guest-first memory subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM to provide features, enhancements, and optimizations that are kludgly or outright impossible to implement in a generic memory subsystem. The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which similar to the generic memfd_create(), creates an anonymous file and returns a file descriptor that refers to it. Again like "regular" memfd files, guest_memfd files live in RAM, have volatile storage, and are automatically released when the last reference is dropped. The key differences between memfd files (and every other memory subystem) is that guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to convert a guest memory area between the shared and guest-private states. A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to specify attributes for a given page of guest memory. In the long term, it will likely be extended to allow userspace to specify per-gfn RWX protections, including allowing memory to be writable in the guest without it also being writable in host userspace. The immediate and driving use case for guest_memfd are Confidential (CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM. For such use cases, being able to map memory into KVM guests without requiring said memory to be mapped into the host is a hard requirement. While SEV+ and TDX prevent untrusted software from reading guest private data by encrypting guest memory, pKVM provides confidentiality and integrity *without* relying on memory encryption. In addition, with SEV-SNP and especially TDX, accessing guest private memory can be fatal to the host, i.e. KVM must be prevent host userspace from accessing guest memory irrespective of hardware behavior. Long term, guest_memfd may be useful for use cases beyond CoCo VMs, for example hardening userspace against unintentional accesses to guest memory. As mentioned earlier, KVM's ABI uses userspace VMA protections to define the allow guest protection (with an exception granted to mapping guest memory executable), and similarly KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size. Decoupling the mappings sizes would allow userspace to precisely map only what is needed and with the required permissions, without impacting guest performance. A guest-first memory subsystem also provides clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to DMA from or into guest memory). guest_memfd is the result of 3+ years of development and exploration; taking on memory management responsibilities in KVM was not the first, second, or even third choice for supporting CoCo VMs. But after many failed attempts to avoid KVM-specific backing memory, and looking at where things ended up, it is quite clear that of all approaches tried, guest_memfd is the simplest, most robust, and most extensible, and the right thing to do for KVM and the kernel at-large. The "development cycle" for this version is going to be very short; ideally, next week I will merge it as is in kvm/next, taking this through the KVM tree for 6.8 immediately after the end of the merge window. The series is still based on 6.6 (plus KVM changes for 6.7) so it will require a small fixup for changes to get_file_rcu() introduced in 6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU"). The fixup will be done as part of the merge commit, and most of the text above will become the commit message for the merge. Pending post-merge work includes: - hugepage support - looking into using the restrictedmem framework for guest memory - introducing a testing mechanism to poison memory, possibly using the same memory attributes introduced here - SNP and TDX support There are two non-KVM patches buried in the middle of this series: fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure() mm: Add AS_UNMOVABLE to mark mapping as completely unmovable The first is small and mostly suggested-by Christian Brauner; the second a bit less so but it was written by an mm person (Vlastimil Babka).