summaryrefslogtreecommitdiffstats
path: root/virt (follow)
Commit message (Collapse)AuthorAgeFilesLines
* sched/fair: Fix external p->on_rq usersPeter Zijlstra2024-10-141-1/+1
| | | | | | | | | | | | | | | | | | | | | Sean noted that ever since commit 152e11f6df29 ("sched/fair: Implement delayed dequeue") KVM's preemption notifiers have started mis-classifying preemption vs blocking. Notably p->on_rq is no longer sufficient to determine if a task is runnable or blocked -- the aforementioned commit introduces tasks that remain on the runqueue even through they will not run again, and should be considered blocked for many cases. Add the task_is_runnable() helper to classify things and audit all external users of the p->on_rq state. Also add a few comments. Fixes: 152e11f6df29 ("sched/fair: Implement delayed dequeue") Reported-by: Sean Christopherson <seanjc@google.com> Tested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lkml.kernel.org/r/20241010091843.GK33184@noisy.programming.kicks-ass.net
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2024-09-282-160/+152
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull x86 kvm updates from Paolo Bonzini: "x86: - KVM currently invalidates the entirety of the page tables, not just those for the memslot being touched, when a memslot is moved or deleted. This does not traditionally have particularly noticeable overhead, but Intel's TDX will require the guest to re-accept private pages if they are dropped from the secure EPT, which is a non starter. Actually, the only reason why this is not already being done is a bug which was never fully investigated and caused VM instability with assigned GeForce GPUs, so allow userspace to opt into the new behavior. - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon) - Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work) - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value at the ICR offset - Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit - Fix a WARN caused by RSM entering a nested guest from SMM with invalid guest state, by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the nested guest) - Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for adding multi generation LRU support in KVM - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled, i.e. when the CPU has already flushed the RSB - Trace the per-CPU host save area as a VMCB pointer to improve readability and cleanup the retrieval of the SEV-ES host save area - Remove unnecessary accounting of temporary nested VMCB related allocations - Set FINAL/PAGE in the page fault error code for EPT violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2 - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures - Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible) - Minor SGX fix and a cleanup - Misc cleanups Generic: - Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed - Enable virtualization when KVM is loaded, not right before the first VM is created Together with the previous change, this simplifies a lot the logic of the callbacks, because their very existence implies virtualization is enabled - Fix a bug that results in KVM prematurely exiting to userspace for coalesced MMIO/PIO in many cases, clean up the related code, and add a testcase - Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_ the gpa+len crosses a page boundary, which thankfully is guaranteed to not happen in the current code base. Add WARNs in more helpers that read/write guest memory to detect similar bugs Selftests: - Fix a goof that caused some Hyper-V tests to be skipped when run on bare metal, i.e. NOT in a VM - Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES guest - Explicitly include one-off assets in .gitignore. Past Sean was completely wrong about not being able to detect missing .gitignore entries - Verify userspace single-stepping works when KVM happens to handle a VM-Exit in its fastpath - Misc cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) Documentation: KVM: fix warning in "make htmldocs" s390: Enable KVM_S390_UCONTROL config in debug_defconfig selftests: kvm: s390: Add VM run test case KVM: SVM: let alternatives handle the cases when RSB filling is required KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() KVM: x86: Update retry protection fields when forcing retry on emulation failure KVM: x86: Apply retry protection to "unprotect on failure" path KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn ...
| * Merge tag 'kvm-x86-generic-6.12' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2024-09-172-24/+18
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVK generic changes for 6.12: - Fix a bug that results in KVM prematurely exiting to userspace for coalesced MMIO/PIO in many cases, clean up the related code, and add a testcase. - Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_ the gpa+len crosses a page boundary, which thankfully is guaranteed to not happen in the current code base. Add WARNs in more helpers that read/write guest memory to detect similar bugs.
| | * KVM: Harden guest memory APIs against out-of-bounds accessesSean Christopherson2024-09-101-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When reading or writing a guest page, WARN and bail if offset+len would result in a read to a different page so that KVM bugs are more likely to be detected, and so that any such bugs are less likely to escalate to an out-of-bounds access. E.g. if userspace isn't using guard pages and the target page is at the end of a memslot. Note, KVM already hardens itself in similar APIs, e.g. in the "cached" variants, it's just the vanilla APIs that are playing with fire. Link: https://lore.kernel.org/r/20240829191413.900740-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * KVM: Write the per-page "segment" when clearing (part of) a guest pageSean Christopherson2024-09-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pass "seg" instead of "len" when writing guest memory in kvm_clear_guest(), as "seg" holds the number of bytes to write for the current page, while "len" holds the total bytes remaining. Luckily, all users of kvm_clear_guest() are guaranteed to not cross a page boundary, and so the bug is unhittable in the current code base. Fixes: 2f5414423ef5 ("KVM: remove kvm_clear_guest_page") Reported-by: zyr_ms@outlook.com Closes: https://bugzilla.kernel.org/show_bug.cgi?id=219104 Link: https://lore.kernel.org/r/20240829191413.900740-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * KVM: Clean up coalesced MMIO ring full checkSean Christopherson2024-08-301-21/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fold coalesced_mmio_has_room() into its sole caller, coalesced_mmio_write(), as it's really just a single line of code, has a goofy return value, and is unnecessarily brittle. E.g. if coalesced_mmio_has_room() were to check ring->last directly, or the caller failed to use READ_ONCE(), KVM would be susceptible to TOCTOU attacks from userspace. Opportunistically add a comment explaining why on earth KVM leaves one entry free, which may not be obvious to readers that aren't familiar with ring buffers. No functional change intended. Reviewed-by: Ilias Stamatis <ilstam@amazon.com> Cc: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240828181446.652474-3-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| | * KVM: Fix coalesced_mmio_has_room() to avoid premature userspace exitIlias Stamatis2024-08-231-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following calculation used in coalesced_mmio_has_room() to check whether the ring buffer is full is wrong and results in premature exits if the start of the valid entries is in the first half of the ring buffer. avail = (ring->first - last - 1) % KVM_COALESCED_MMIO_MAX; if (avail == 0) /* full */ Because negative values are handled using two's complement, and KVM computes the result as an unsigned value, the above will get a false positive if "first < last" and the ring is half-full. The above might have worked as expected in python for example: >>> (-86) % 170 84 However it doesn't work the same way in C. printf("avail: %d\n", (-86) % 170); printf("avail: %u\n", (-86) % 170); printf("avail: %u\n", (-86u) % 170u); Using gcc-11 these print: avail: -86 avail: 4294967210 avail: 0 For illustration purposes, given a 4-bit integer and a ring size of 0xA (unsigned), 0xA == 0x1010 == -6, and thus (-6u % 0xA) == 0. Fix the calculation and allow all but one entries in the buffer to be used as originally intended. Note, KVM's behavior is self-healing to some extent, as KVM will allow the entire buffer to be used if ring->first is beyond the halfway point. In other words, in the unlikely scenario that a use case benefits from being able to coalesce more than 86 entries at once, KVM will still provide such behavior, sometimes. Note #2, the % operator in C is not the modulo operator but the remainder operator. Modulo and remainder operators differ with respect to negative values. But, the relevant values in KVM are all unsigned, so it's a moot point in this case anyway. Note #3, this is almost a pure revert of the buggy commit, plus a READ_ONCE() to provide additional safety. Thue buggy commit justified the change with "it paves the way for making this function lockless", but it's not at all clear what was intended, nor is there any evidence that the buggy code was somehow safer. (a) the fields in question were already accessed locklessly, from the perspective that they could be modified by userspace at any time, and (b) the lock guarding the ring itself was changed, but never dropped, i.e. whatever lockless scheme (SRCU?) was planned never landed. Fixes: 105f8d40a737 ("KVM: Calculate available entries in coalesced mmio ring") Signed-off-by: Ilias Stamatis <ilstam@amazon.com> Reviewed-by: Paul Durrant <paul@xen.org> Link: https://lore.kernel.org/r/20240718193543.624039-2-ilstam@amazon.com [sean: rework changelog to clarify behavior, call out weirdness of buggy commit] Signed-off-by: Sean Christopherson <seanjc@google.com>
| * | KVM: Add arch hooks for enabling/disabling virtualizationSean Christopherson2024-09-041-0/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add arch hooks that are invoked when KVM enables/disable virtualization. x86 will use the hooks to register an "emergency disable" callback, which is essentially an x86-specific shutdown notifier that is used when the kernel is doing an emergency reboot/shutdown/kexec. Add comments for the declarations to help arch code understand exactly when the callbacks are invoked. Alternatively, the APIs themselves could communicate most of the same info, but kvm_arch_pre_enable_virtualization() and kvm_arch_post_disable_virtualization() are a bit cumbersome, and make it a bit less obvious that they are intended to be implemented as a pair. Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-9-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: Add a module param to allow enabling virtualization when KVM is loadedSean Christopherson2024-09-041-0/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add an on-by-default module param, enable_virt_at_load, to let userspace force virtualization to be enabled in hardware when KVM is initialized, i.e. just before /dev/kvm is exposed to userspace. Enabling virtualization during KVM initialization allows userspace to avoid the additional latency when creating/destroying the first/last VM (or more specifically, on the 0=>1 and 1=>0 edges of creation/destruction). Now that KVM uses the cpuhp framework to do per-CPU enabling, the latency could be non-trivial as the cpuhup bringup/teardown is serialized across CPUs, e.g. the latency could be problematic for use case that need to spin up VMs quickly. Prior to commit 10474ae8945c ("KVM: Activate Virtualization On Demand"), KVM _unconditionally_ enabled virtualization during load, i.e. there's no fundamental reason KVM needs to dynamically toggle virtualization. These days, the only known argument for not enabling virtualization is to allow KVM to be autoloaded without blocking other out-of-tree hypervisors, and such use cases can simply change the module param, e.g. via command line. Note, the aforementioned commit also mentioned that enabling SVM (AMD's virtualization extensions) can result in "using invalid TLB entries". It's not clear whether the changelog was referring to a KVM bug, a CPU bug, or something else entirely. Regardless, leaving virtualization off by default is not a robust "fix", as any protection provided is lost the instant userspace creates the first VM. Reviewed-by: Chao Gao <chao.gao@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: Rename arch hooks related to per-CPU virtualization enablingSean Christopherson2024-09-041-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename the per-CPU hooks used to enable virtualization in hardware to align with the KVM-wide helpers in kvm_main.c, and to better capture that the callbacks are invoked on every online CPU. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240830043600.127750-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: Rename symbols related to enabling virtualization hardwareSean Christopherson2024-09-041-21/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename the various functions (and a variable) that enable virtualization to prepare for upcoming changes, and to clean up artifacts of KVM's previous behavior, which required manually juggling locks around kvm_usage_count. Drop the "nolock" qualifier from per-CPU functions now that there are no "nolock" implementations of the "all" variants, i.e. now that calling a non-nolock function from a nolock function isn't confusing (unlike this sentence). Drop "all" from the outer helpers as they no longer manually iterate over all CPUs, and because it might not be obvious what "all" refers to. In lieu of the above qualifiers, append "_cpu" to the end of the functions that are per-CPU helpers for the outer APIs. Opportunistically prepend "kvm" to all functions to help make it clear that they are KVM helpers, but mostly because there's no reason not to. Lastly, use "virtualization" instead of "hardware", because while the functions do enable virtualization in hardware, there are a _lot_ of things that KVM enables in hardware. Defer renaming the arch hooks to future patches, purely to reduce the amount of churn in a single commit. Reviewed-by: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-4-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: Register cpuhp and syscore callbacks when enabling hardwareSean Christopherson2024-09-041-113/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Register KVM's cpuhp and syscore callback when enabling virtualization in hardware instead of registering the callbacks during initialization, and let the CPU up/down framework invoke the inner enable/disable functions. Registering the callbacks during initialization makes things more complex than they need to be, as KVM needs to be very careful about handling races between enabling CPUs being onlined/offlined and hardware being enabled/disabled. Intel TDX support will require KVM to enable virtualization during KVM initialization, i.e. will add another wrinkle to things, at which point sorting out the potential races with kvm_usage_count would become even more complex. Note, using the cpuhp framework has a subtle behavioral change: enabling will be done serially across all CPUs, whereas KVM currently sends an IPI to all CPUs in parallel. While serializing virtualization enabling could create undesirable latency, the issue is limited to the 0=>1 transition of VM creation. And even that can be mitigated, e.g. by letting userspace force virtualization to be enabled when KVM is initialized. Cc: Chao Gao <chao.gao@intel.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: Use dedicated mutex to protect kvm_usage_count to avoid deadlockSean Christopherson2024-09-041-15/+16
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use a dedicated mutex to guard kvm_usage_count to fix a potential deadlock on x86 due to a chain of locks and SRCU synchronizations. Translating the below lockdep splat, CPU1 #6 will wait on CPU0 #1, CPU0 #8 will wait on CPU2 #3, and CPU2 #7 will wait on CPU1 #4 (if there's a writer, due to the fairness of r/w semaphores). CPU0 CPU1 CPU2 1 lock(&kvm->slots_lock); 2 lock(&vcpu->mutex); 3 lock(&kvm->srcu); 4 lock(cpu_hotplug_lock); 5 lock(kvm_lock); 6 lock(&kvm->slots_lock); 7 lock(cpu_hotplug_lock); 8 sync(&kvm->srcu); Note, there are likely more potential deadlocks in KVM x86, e.g. the same pattern of taking cpu_hotplug_lock outside of kvm_lock likely exists with __kvmclock_cpufreq_notifier(): cpuhp_cpufreq_online() | -> cpufreq_online() | -> cpufreq_gov_performance_limits() | -> __cpufreq_driver_target() | -> __target_index() | -> cpufreq_freq_transition_begin() | -> cpufreq_notify_transition() | -> ... __kvmclock_cpufreq_notifier() But, actually triggering such deadlocks is beyond rare due to the combination of dependencies and timings involved. E.g. the cpufreq notifier is only used on older CPUs without a constant TSC, mucking with the NX hugepage mitigation while VMs are running is very uncommon, and doing so while also onlining/offlining a CPU (necessary to generate contention on cpu_hotplug_lock) would be even more unusual. The most robust solution to the general cpu_hotplug_lock issue is likely to switch vm_list to be an RCU-protected list, e.g. so that x86's cpufreq notifier doesn't to take kvm_lock. For now, settle for fixing the most blatant deadlock, as switching to an RCU-protected list is a much more involved change, but add a comment in locking.rst to call out that care needs to be taken when walking holding kvm_lock and walking vm_list. ====================================================== WARNING: possible circular locking dependency detected 6.10.0-smp--c257535a0c9d-pip #330 Tainted: G S O ------------------------------------------------------ tee/35048 is trying to acquire lock: ff6a80eced71e0a8 (&kvm->slots_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x179/0x1e0 [kvm] but task is already holding lock: ffffffffc07abb08 (kvm_lock){+.+.}-{3:3}, at: set_nx_huge_pages+0x14a/0x1e0 [kvm] which lock already depends on the new lock. the existing dependency chain (in reverse order) is: -> #3 (kvm_lock){+.+.}-{3:3}: __mutex_lock+0x6a/0xb40 mutex_lock_nested+0x1f/0x30 kvm_dev_ioctl+0x4fb/0xe50 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #2 (cpu_hotplug_lock){++++}-{0:0}: cpus_read_lock+0x2e/0xb0 static_key_slow_inc+0x16/0x30 kvm_lapic_set_base+0x6a/0x1c0 [kvm] kvm_set_apic_base+0x8f/0xe0 [kvm] kvm_set_msr_common+0x9ae/0xf80 [kvm] vmx_set_msr+0xa54/0xbe0 [kvm_intel] __kvm_set_msr+0xb6/0x1a0 [kvm] kvm_arch_vcpu_ioctl+0xeca/0x10c0 [kvm] kvm_vcpu_ioctl+0x485/0x5b0 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #1 (&kvm->srcu){.+.+}-{0:0}: __synchronize_srcu+0x44/0x1a0 synchronize_srcu_expedited+0x21/0x30 kvm_swap_active_memslots+0x110/0x1c0 [kvm] kvm_set_memslot+0x360/0x620 [kvm] __kvm_set_memory_region+0x27b/0x300 [kvm] kvm_vm_ioctl_set_memory_region+0x43/0x60 [kvm] kvm_vm_ioctl+0x295/0x650 [kvm] __se_sys_ioctl+0x7b/0xd0 __x64_sys_ioctl+0x21/0x30 x64_sys_call+0x15d0/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e -> #0 (&kvm->slots_lock){+.+.}-{3:3}: __lock_acquire+0x15ef/0x2e30 lock_acquire+0xe0/0x260 __mutex_lock+0x6a/0xb40 mutex_lock_nested+0x1f/0x30 set_nx_huge_pages+0x179/0x1e0 [kvm] param_attr_store+0x93/0x100 module_attr_store+0x22/0x40 sysfs_kf_write+0x81/0xb0 kernfs_fop_write_iter+0x133/0x1d0 vfs_write+0x28d/0x380 ksys_write+0x70/0xe0 __x64_sys_write+0x1f/0x30 x64_sys_call+0x281b/0x2e60 do_syscall_64+0x83/0x160 entry_SYSCALL_64_after_hwframe+0x76/0x7e Cc: Chao Gao <chao.gao@intel.com> Fixes: 0bf50497f03b ("KVM: Drop kvm_count_lock and instead protect kvm_usage_count with kvm_lock") Cc: stable@vger.kernel.org Reviewed-by: Kai Huang <kai.huang@intel.com> Acked-by: Kai Huang <kai.huang@intel.com> Tested-by: Farrah Chen <farrah.chen@intel.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240830043600.127750-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | [tree-wide] finally take no_llseek outAl Viro2024-09-271-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | no_llseek had been defined to NULL two years ago, in commit 868941b14441 ("fs: remove no_llseek") To quote that commit, At -rc1 we'll need do a mechanical removal of no_llseek - git grep -l -w no_llseek | grep -v porting.rst | while read i; do sed -i '/\<no_llseek\>/d' $i done would do it. Unfortunately, that hadn't been done. Linus, could you do that now, so that we could finally put that thing to rest? All instances are of the form .llseek = no_llseek, so it's obviously safe. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | Merge tag 'pull-stable-struct_fd' of ↵Linus Torvalds2024-09-232-7/+7
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs Pull 'struct fd' updates from Al Viro: "Just the 'struct fd' layout change, with conversion to accessor helpers" * tag 'pull-stable-struct_fd' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs: add struct fd constructors, get rid of __to_fd() struct fd: representation change introduce fd_file(), convert all accessors to it.
| * | introduce fd_file(), convert all accessors to it.Al Viro2024-08-132-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For any changes of struct fd representation we need to turn existing accesses to fields into calls of wrappers. Accesses to struct fd::flags are very few (3 in linux/file.h, 1 in net/socket.c, 3 in fs/overlayfs/file.c and 3 more in explicit initializers). Those can be dealt with in the commit converting to new layout; accesses to struct fd::file are too many for that. This commit converts (almost) all of f.file to fd_file(f). It's not entirely mechanical ('file' is used as a member name more than just in struct fd) and it does not even attempt to distinguish the uses in pointer context from those in boolean context; the latter will be eventually turned into a separate helper (fd_empty()). NOTE: mass conversion to fd_empty(), tempting as it might be, is a bad idea; better do that piecewise in commit that convert from fdget...() to CLASS(...). [conflicts in fs/fhandle.c, kernel/bpf/syscall.c, mm/memcontrol.c caught by git; fs/stat.c one got caught by git grep] [fs/xattr.c conflict] Reviewed-by: Christian Brauner <brauner@kernel.org> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | KVM: use follow_pfnmap APIPeter Xu2024-09-171-12/+7
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use the new pfnmap API to allow huge MMIO mappings for VMs. The rest work is done perfectly on the other side (host_pfn_mapping_level()). Link: https://lkml.kernel.org/r/20240826204353.2228736-11-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Sean Christopherson <seanjc@google.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alex Williamson <alex.williamson@redhat.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: David Hildenbrand <david@redhat.com> Cc: Gavin Shan <gshan@redhat.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Jason Gunthorpe <jgg@nvidia.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Niklas Schnelle <schnelle@linux.ibm.com> Cc: Ryan Roberts <ryan.roberts@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Zi Yan <ziy@nvidia.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | KVM: x86: Disallow read-only memslots for SEV-ES and SEV-SNP (and TDX)Sean Christopherson2024-08-141-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Disallow read-only memslots for SEV-{ES,SNP} VM types, as KVM can't directly emulate instructions for ES/SNP, and instead the guest must explicitly request emulation. Unless the guest explicitly requests emulation without accessing memory, ES/SNP relies on KVM creating an MMIO SPTE, with the subsequent #NPF being reflected into the guest as a #VC. But for read-only memslots, KVM deliberately doesn't create MMIO SPTEs, because except for ES/SNP, doing so requires setting reserved bits in the SPTE, i.e. the SPTE can't be readable while also generating a #VC on writes. Because KVM never creates MMIO SPTEs and jumps directly to emulation, the guest never gets a #VC. And since KVM simply resumes the guest if ES/SNP guests trigger emulation, KVM effectively puts the vCPU into an infinite #NPF loop if the vCPU attempts to write read-only memory. Disallow read-only memory for all VMs with protected state, i.e. for upcoming TDX VMs as well as ES/SNP VMs. For TDX, it's actually possible to support read-only memory, as TDX uses EPT Violation #VE to reflect the fault into the guest, e.g. KVM could configure read-only SPTEs with RX protections and SUPPRESS_VE=0. But there is no strong use case for supporting read-only memslots on TDX, e.g. the main historical usage is to emulate option ROMs, but TDX disallows executing from shared memory. And if someone comes along with a legitimate, strong use case, the restriction can always be lifted for TDX. Don't bother trying to retroactively apply the restriction to SEV-ES VMs that are created as type KVM_X86_DEFAULT_VM. Read-only memslots can't possibly work for SEV-ES, i.e. disallowing such memslots is really just means reporting an error to userspace instead of silently hanging vCPUs. Trying to deal with the ordering between KVM_SEV_INIT and memslot creation isn't worth the marginal benefit it would provide userspace. Fixes: 26c44aa9e076 ("KVM: SEV: define VM types for SEV and SEV-ES") Fixes: 1dfe571c12cf ("KVM: SEV: Add initial SEV-SNP support") Cc: Peter Gonda <pgonda@google.com> Cc: Michael Roth <michael.roth@amd.com> Cc: Vishal Annapurve <vannapurve@google.com> Cc: Ackerly Tng <ackerleytng@google.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-ID: <20240809190319.1710470-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: eventfd: Use synchronize_srcu_expedited() on shutdownLi RongQing2024-08-131-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When hot-unplug a device which has many queues, and guest CPU will has huge jitter, and unplugging is very slow. It turns out synchronize_srcu() in irqfd_shutdown() caused the guest jitter and unplugging latency, so replace synchronize_srcu() with synchronize_srcu_expedited(), to accelerate the unplugging, and reduce the guest OS jitter, this accelerates the VM reboot too. Signed-off-by: Li RongQing <lirongqing@baidu.com> Message-ID: <20240711121130.38917-1-lirongqing@baidu.com> [Call it just once in irqfd_resampler_shutdown. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: guest_memfd: abstract how prepared folios are recordedPaolo Bonzini2024-07-261-13/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Right now, large folios are not supported in guest_memfd, and therefore the order used by kvm_gmem_populate() is always 0. In this scenario, using the up-to-date bit to track prepared-ness is nice and easy because we have one bit available per page. In the future, however, we might have large pages that are partially populated; for example, in the case of SEV-SNP, if a large page has both shared and private areas inside, it is necessary to populate it at a granularity that is smaller than that of the guest_memfd's backing store. In that case we will have to track preparedness at a 4K level, probably as a bitmap. In preparation for that, do not use explicitly folio_test_uptodate() and folio_mark_uptodate(). Return the state of the page directly from __kvm_gmem_get_pfn(), so that it is expected to apply to 2^N pages with N=*max_order. The function to mark a range as prepared for now takes just a folio, but is expected to take also an index and order (or something like that) when large pages are introduced. Thanks to Michael Roth for pointing out the issue with large pages. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: guest_memfd: let kvm_gmem_populate() operate only on private gfnsPaolo Bonzini2024-07-261-0/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This check is currently performed by sev_gmem_post_populate(), but it applies to all callers of kvm_gmem_populate(): the point of the function is that the memory is being encrypted and some work has to be done on all the gfns in order to encrypt them. Therefore, check the KVM_MEMORY_ATTRIBUTE_PRIVATE attribute prior to invoking the callback, and stop the operation if a shared page is encountered. Because CONFIG_KVM_PRIVATE_MEM in principle does not require attributes, this makes kvm_gmem_populate() depend on CONFIG_KVM_GENERIC_PRIVATE_MEM (which does require them). Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: extend kvm_range_has_memory_attributes() to check subset of attributesPaolo Bonzini2024-07-261-6/+7
| | | | | | | | | | | | | | | | | | | | | | While currently there is no other attribute than KVM_MEMORY_ATTRIBUTE_PRIVATE, KVM code such as kvm_mem_is_private() is written to expect their existence. Allow using kvm_range_has_memory_attributes() as a multi-page version of kvm_mem_is_private(), without it breaking later when more attributes are introduced. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: cleanup and add shortcuts to kvm_range_has_memory_attributes()Paolo Bonzini2024-07-261-22/+20
| | | | | | | | | | | | | | | | | | | | Use a guard to simplify early returns, and add two more easy shortcuts. If the requested attributes are invalid, the attributes xarray will never show them as set. And if testing a single page, kvm_get_memory_attributes() is more efficient. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: guest_memfd: move check for already-populated page to common codePaolo Bonzini2024-07-261-0/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Do not allow populating the same page twice with startup data. In the case of SEV-SNP, for example, the firmware does not allow it anyway, since the launch-update operation is only possible on pages that are still shared in the RMP. Even if it worked, kvm_gmem_populate()'s callback is meant to have side effects such as updating launch measurements, and updating the same page twice is unlikely to have the desired results. Races between calls to the ioctl are not possible because kvm_gmem_populate() holds slots_lock and the VM should not be running. But again, even if this worked on other confidential computing technology, it doesn't matter to guest_memfd.c whether this is something fishy such as missing synchronization in userspace, or rather something intentional. One of the racers wins, and the page is initialized by either kvm_gmem_prepare_folio() or kvm_gmem_populate(). Anyway, out of paranoia, adjust sev_gmem_post_populate() anyway to use the same errno that kvm_gmem_populate() is using. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: remove kvm_arch_gmem_prepare_needed()Paolo Bonzini2024-07-261-10/+3
| | | | | | | | | | | | | | | | | | | | It is enough to return 0 if a guest need not do any preparation. This is in fact how sev_gmem_prepare() works for non-SNP guests, and it extends naturally to Intel hosts: the x86 callback for gmem_prepare is optional and returns 0 if not defined. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: guest_memfd: make kvm_gmem_prepare_folio() operate on a single struct kvmPaolo Bonzini2024-07-261-30/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is now possible because preparation is done by kvm_gmem_get_pfn() instead of fallocate(). In practice this is not a limitation, because even though guest_memfd can be bound to multiple struct kvm, for hardware implementations of confidential computing only one guest (identified by an ASID on SEV-SNP, or an HKID on TDX) will be able to access it. In the case of intra-host migration (not implemented yet for SEV-SNP, but we can use SEV-ES as an idea of how it will work), the new struct kvm inherits the same ASID and preparation need not be repeated. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: guest_memfd: delay kvm_gmem_prepare_folio() until the memory is passed ↵Paolo Bonzini2024-07-261-44/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | to the guest Initializing the contents of the folio on fallocate() is unnecessarily restrictive. It means that the page is registered with the firmware and then it cannot be touched anymore. In particular, this loses the possibility of using fallocate() to pre-allocate the page for SEV-SNP guests, because kvm_arch_gmem_prepare() then fails. It's only when the guest actually accesses the page (and therefore kvm_gmem_get_pfn() is called) that the page must be cleared from any stale host data and registered with the firmware. The up-to-date flag is clear if this has to be done (i.e. it is the first access and kvm_gmem_populate() has not been called). All in all, there are enough differences between kvm_gmem_get_pfn() and kvm_gmem_populate(), that it's better to separate the two flows completely. Extract the bulk of kvm_gmem_get_folio(), which take a folio and end up setting its up-to-date flag, to a new function kvm_gmem_prepare_folio(); these are now done only by the non-__-prefixed kvm_gmem_get_pfn(). As a bonus, __kvm_gmem_get_pfn() loses its ugly "bool prepare" argument. One difference is that fallocate(PUNCH_HOLE) can now race with a page fault. Potentially this causes a page to be prepared and into the filemap even after fallocate(PUNCH_HOLE). This is harmless, as it can be fixed by another hole punching operation, and can be avoided by clearing the private-page attribute prior to invoking fallocate(PUNCH_HOLE). This way, the page fault will cause an exit to user space. The previous semantics, where fallocate() could be used to prepare the pages in advance of running the guest, can be accessed with KVM_PRE_FAULT_MEMORY. For now, accessing a page in one VM will attempt to call kvm_arch_gmem_prepare() in all of those that have bound the guest_memfd. Cleaning this up is left to a separate patch. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: guest_memfd: return locked folio from __kvm_gmem_get_pfnPaolo Bonzini2024-07-261-1/+4
| | | | | | | | | | | | | | | | Allow testing the up-to-date flag in the caller without taking the lock again. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: rename CONFIG_HAVE_KVM_GMEM_* to CONFIG_HAVE_KVM_ARCH_GMEM_*Paolo Bonzini2024-07-262-5/+5
| | | | | | | | | | | | | | | | | | | | Add "ARCH" to the symbols; shortly, the "prepare" phase will include both the arch-independent step to clear out contents left in the page by the host, and the arch-dependent step enabled by CONFIG_HAVE_KVM_GMEM_PREPARE. For consistency do the same for CONFIG_HAVE_KVM_GMEM_INVALIDATE as well. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: guest_memfd: do not go through struct pagePaolo Bonzini2024-07-261-10/+17
| | | | | | | | | | | | | | | | We have a perfectly usable folio, use it to retrieve the pfn and order. All that's needed is a version of folio_file_page that returns a pfn. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: guest_memfd: delay folio_mark_uptodate() until after successful preparationPaolo Bonzini2024-07-261-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The up-to-date flag as is now is not too useful; it tells guest_memfd not to overwrite the contents of a folio, but it doesn't say that the page is ready to be mapped into the guest. For encrypted guests, mapping a private page requires that the "preparation" phase has succeeded, and at the same time the same page cannot be prepared twice. So, ensure that folio_mark_uptodate() is only called on a prepared page. If kvm_gmem_prepare_folio() or the post_populate callback fail, the folio will not be marked up-to-date; it's not a problem to call clear_highpage() again on such a page prior to the next preparation attempt. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | KVM: guest_memfd: return folio from __kvm_gmem_get_pfn()Paolo Bonzini2024-07-261-17/+20
|/ | | | | | | | | | Right now this is simply more consistent and avoids use of pfn_to_page() and put_page(). It will be put to more use in upcoming patches, to ensure that the up-to-date flag is set at the very end of both the kvm_gmem_get_pfn() and kvm_gmem_populate() flows. Reviewed-by: Michael Roth <michael.roth@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Merge tag 'kvm-x86-generic-6.11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2024-07-163-25/+58
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM generic changes for 6.11 - Enable halt poll shrinking by default, as Intel found it to be a clear win. - Setup empty IRQ routing when creating a VM to avoid having to synchronize SRCU when creating a split IRQCHIP on x86. - Rework the sched_in/out() paths to replace kvm_arch_sched_in() with a flag that arch code can use for hooking both sched_in() and sched_out(). - Take the vCPU @id as an "unsigned long" instead of "u32" to avoid truncating a bogus value from userspace, e.g. to help userspace detect bugs. - Mark a vCPU as preempted if and only if it's scheduled out while in the KVM_RUN loop, e.g. to avoid marking it preempted and thus writing guest memory when retrieving guest state during live migration blackout. - A few minor cleanups
| * KVM: Add missing MODULE_DESCRIPTION()Jeff Johnson2024-06-281-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a module description for kvm.ko to fix a 'make W=1' warning: WARNING: modpost: missing MODULE_DESCRIPTION() in arch/x86/kvm/kvm.o Opportunistically update kvm_main.c's comically stale file comment to match the module description. Signed-off-by: Jeff Johnson <quic_jjohnson@quicinc.com> Link: https://lore.kernel.org/r/20240622-md-kvm-v2-1-29a60f7c48b1@quicinc.com [sean: split x86 changes to a separate commit, remove stale VT-x comment] Signed-off-by: Sean Christopherson <seanjc@google.com>
| * KVM: Mark a vCPU as preempted/ready iff it's scheduled out while runningDavid Matlack2024-06-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark a vCPU as preempted/ready if-and-only-if it's scheduled out while running. i.e. Do not mark a vCPU preempted/ready if it's scheduled out during a non-KVM_RUN ioctl() or when userspace is doing KVM_RUN with immediate_exit. Commit 54aa83c90198 ("KVM: x86: do not set st->preempted when going back to user space") stopped marking a vCPU as preempted when returning to userspace, but if userspace then invokes a KVM vCPU ioctl() that gets preempted, the vCPU will be marked preempted/ready. This is arguably incorrect behavior since the vCPU was not actually preempted while the guest was running, it was preempted while doing something on behalf of userspace. Marking a vCPU preempted iff its running also avoids KVM dirtying guest memory after userspace has paused vCPUs, e.g. for live migration, which allows userspace to collect the final dirty bitmap before or in parallel with saving vCPU state, without having to worry about saving vCPU state triggering writes to guest memory. Suggested-by: Sean Christopherson <seanjc@google.com> Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-4-dmatlack@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
| * KVM: Ensure new code that references immediate_exit gets extra scrutinyDavid Matlack2024-06-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that any new KVM code that references immediate_exit gets extra scrutiny by renaming it to immediate_exit__unsafe in kernel code. All fields in struct kvm_run are subject to TOCTOU races since they are mapped into userspace, which may be malicious or buggy. To protect KVM, introduces a new macro that appends __unsafe to select field names in struct kvm_run, hinting to developers and reviewers that accessing such fields must be done carefully. Apply the new macro to immediate_exit, since userspace can make immediate_exit inconsistent with vcpu->wants_to_run, i.e. accessing immediate_exit directly could lead to unexpected bugs in the future. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-3-dmatlack@google.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
| * KVM: Introduce vcpu->wants_to_runDavid Matlack2024-06-181-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce vcpu->wants_to_run to indicate when a vCPU is in its core run loop, i.e. when the vCPU is running the KVM_RUN ioctl and immediate_exit was not set. Replace all references to vcpu->run->immediate_exit with !vcpu->wants_to_run to avoid TOCTOU races with userspace. For example, a malicious userspace could invoked KVM_RUN with immediate_exit=true and then after KVM reads it to set wants_to_run=false, flip it to false. This would result in the vCPU running in KVM_RUN with wants_to_run=false. This wouldn't cause any real bugs today but is a dangerous landmine. Signed-off-by: David Matlack <dmatlack@google.com> Link: https://lore.kernel.org/r/20240503181734.1467938-2-dmatlack@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| * KVM: Reject overly excessive IDs in KVM_CREATE_VCPUMathias Krause2024-06-181-1/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If, on a 64 bit system, a vCPU ID is provided that has the upper 32 bits set to a non-zero value, it may get accepted if the truncated to 32 bits integer value is below KVM_MAX_VCPU_IDS and 'max_vcpus'. This feels very wrong and triggered the reporting logic of PaX's SIZE_OVERFLOW plugin. Instead of silently truncating and accepting such values, pass the full value to kvm_vm_ioctl_create_vcpu() and make the existing limit checks return an error. Even if this is a userland ABI breaking change, no sane userland could have ever relied on that behaviour. Reported-by: PaX's SIZE_OVERFLOW plugin running on grsecurity's syzkaller Fixes: 6aa8b732ca01 ("[PATCH] kvm: userspace interface") Cc: Emese Revfy <re.emese@gmail.com> Cc: PaX Team <pageexec@freemail.hu> Signed-off-by: Mathias Krause <minipli@grsecurity.net> Link: https://lore.kernel.org/r/20240614202859.3597745-2-minipli@grsecurity.net [sean: tweak comment about INT_MAX assertion] Signed-off-by: Sean Christopherson <seanjc@google.com>
| * KVM: Fix a goof where kvm_create_vm() returns 0 instead of -ENOMEMDan Carpenter2024-06-141-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The error path for OOM when allocating buses used to return -ENOMEM using the local variable 'r', where 'r' was initialized at the top of the function. But a new "r = kvm_init_irq_routing(kvm);" was introduced in the middle of the function, so now the error code is not set and it eventually leads to a NULL dereference due to kvm_dev_ioctl_create_vm() thinking kvm_create_vm() succeeded. Set the error code back to -ENOMEM. Opportunistically tweak the logic to pre-set "r = -ENOMEM" immediately before the flows that can fail due to memory allocation failure to make it less likely that the bug recurs in the future. Fixes: fbe4a7e881d4 ("KVM: Setup empty IRQ routing when creating a VM") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Link: https://lore.kernel.org/r/02051e0a-09d8-49a2-917f-7c2f278a1ba1@moroto.mountain [sean: tweak all of the "r = -ENOMEM" sites, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
| * KVM: Delete the now unused kvm_arch_sched_in()Sean Christopherson2024-06-111-1/+0
| | | | | | | | | | | | | | | | | | Delete kvm_arch_sched_in() now that all implementations are nops. Reviewed-by: Bibo Mao <maobibo@loongson.cn> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-5-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| * KVM: Add a flag to track if a loaded vCPU is scheduled outSean Christopherson2024-06-111-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a kvm_vcpu.scheduled_out flag to track if a vCPU is in the process of being scheduled out (vCPU put path), or if the vCPU is being reloaded after being scheduled out (vCPU load path). In the short term, this will allow dropping kvm_arch_sched_in(), as arch code can query scheduled_out during kvm_arch_vcpu_load(). Longer term, scheduled_out opens up other potential optimizations, without creating subtle/brittle dependencies. E.g. it allows KVM to keep guest state (that is managed via kvm_arch_vcpu_{load,put}()) loaded across kvm_sched_{out,in}(), if KVM knows the state isn't accessed by the host kernel. Forcing arch code to coordinate between kvm_arch_sched_{in,out}() and kvm_arch_vcpu_{load,put}() is awkward, not reusable, and relies on the exact ordering of calls into arch code. Adding scheduled_out also obviates the need for a kvm_arch_sched_out() hook, e.g. if arch code needs to do something novel when putting vCPU state. And even if KVM never uses scheduled_out for anything beyond dropping kvm_arch_sched_in(), just being able to remove all of the arch stubs makes it worth adding the flag. Link: https://lore.kernel.org/all/20240430224431.490139-1-seanjc@google.com Cc: Oliver Upton <oliver.upton@linux.dev> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Kai Huang <kai.huang@intel.com> Link: https://lore.kernel.org/r/20240522014013.1672962-2-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| * KVM: Setup empty IRQ routing when creating a VMYi Wang2024-06-112-0/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Setup empty IRQ routing during VM creation so that x86 and s390 don't need to set empty/dummy IRQ routing during KVM_CREATE_IRQCHIP (in future patches). Initializing IRQ routing before there are any potential readers allows KVM to avoid the synchronize_srcu() in kvm_set_irq_routing(), which can introduces 20+ milliseconds of latency in the VM creation path. Ensuring that all VMs have non-NULL IRQ routing also hardens KVM against misbehaving userspace VMMs, e.g. RISC-V dynamically instantiates its interrupt controller, but doesn't override kvm_arch_intc_initialized() or kvm_arch_irqfd_allowed(), and so can likely reach kvm_irq_map_gsi() without fully initialized IRQ routing. Signed-off-by: Yi Wang <foxywang@tencent.com> Acked-by: Christian Borntraeger <borntraeger@linux.ibm.com> Link: https://lore.kernel.org/r/20240506101751.3145407-2-foxywang@tencent.com [sean: init refcount after IRQ routing, fix stub, massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>
| * Revert "KVM: async_pf: avoid recursive flushing of work items"Sean Christopherson2024-06-031-12/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that KVM does NOT gift async #PF workers a "struct kvm" reference, don't bother skipping "done" workers when flushing/canceling queued workers, as the deadlock that was being fudged around can no longer occur. When workers, i.e. async_pf_execute(), were gifted a referenced, it was possible for a worker to put the last reference and trigger VM destruction, i.e. trigger flushing of a workqueue from a worker in said workqueue. Note, there is no actual lock, the deadlock was that a worker will be stuck waiting for itself (the workqueue code simulates a lock/unlock via lock_map_{acquire,release}()). Skipping "done" workers isn't problematic per se, but using work->vcpu as a "done" flag is confusing, e.g. it's not clear that async_pf.lock is acquired to protect the work->vcpu, NOT the processing of async_pf.queue (which is protected by vcpu->mutex). This reverts commit 22583f0d9c85e60c9860bc8a0ebff59fe08be6d7. Suggested-by: Xu Yilun <yilun.xu@linux.intel.com> Link: https://lore.kernel.org/r/20240423191649.2885257-1-seanjc@google.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| * KVM: Enable halt polling shrink parameter by defaultParshuram Sangle2024-06-031-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Default halt_poll_ns_shrink value of 0 always resets polling interval to 0 on an un-successful poll where vcpu wakeup is not received. This is mostly to avoid pointless polling for more number of shorter intervals. But disabled shrink assumes vcpu wakeup is less likely to be received in subsequent shorter polling intervals. Another side effect of 0 shrink value is that, even on a successful poll if total block time was greater than current polling interval, the polling interval starts over from 0 instead of shrinking by a factor. Enabling shrink with value of 2 allows the polling interval to gradually decrement in case of un-successful poll events as well. This gives a fair chance for successful polling events in subsequent polling intervals rather than resetting it to 0 and starting over from grow_start. Below kvm stat log snippet shows interleaved growth and shrinking of polling interval: 87162647182125: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (grow 0) 87162647637763: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000) 87162649627943: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162650892407: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162651540378: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162652276768: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162652515037: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 40000 (grow 20000) 87162653383787: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (shrink 40000) 87162653627670: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (shrink 20000) 87162653796321: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 20000 (grow 10000) 87162656171645: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 10000 (shrink 20000) 87162661607487: kvm_halt_poll_ns: vcpu 0: halt_poll_ns 0 (shrink 10000) Having both grow and shrink enabled creates a balance in polling interval growth and shrink behavior. Tests show improved successful polling attempt ratio which contribute to VM performance. Power penalty is quite negligible as shrunk polling intervals create bursts of very short durations. Performance assessment results show 3-6% improvements in CPU+GPU, Memory and Storage Android VM workloads whereas 5-9% improvement in average FPS of gaming VM workloads. Power penalty is below 1% where host OS is either idle or running a native workload having 2 VMs enabled. CPU/GPU intensive gaming workloads as well do not show any increased power overhead with shrink enabled. Co-developed-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Rajendran Jaishankar <jaishankar.rajendran@intel.com> Signed-off-by: Parshuram Sangle <parshuram.sangle@intel.com> Link: https://lore.kernel.org/r/20231102154628.2120-2-parshuram.sangle@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| * KVM: Unexport kvm_debugfs_dirBorislav Petkov2024-06-031-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | After faf01aef0570 ("KVM: PPC: Merge powerpc's debugfs entry content into generic entry") kvm_debugfs_dir is not used anywhere else outside of kvm_main.c Unexport it and make it static. Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Link: https://lore.kernel.org/r/20240515150804.9354-1-bp@kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com>
* | Merge tag 'kvm-x86-fixes-6.10-11' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2024-07-161-0/+3
|\ \ | | | | | | | | | | | | | | | | | | KVM Xen: Fix a bug where KVM fails to check the validity of an incoming userspace virtual address and tries to activate a gfn_to_pfn_cache with a kernel address.
| * | KVM: Validate hva in kvm_gpc_activate_hva() to fix __kvm_gpc_refresh() WARNPei Li2024-06-281-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Check that the virtual address is "ok" when activating a gfn_to_pfn_cache with a host VA to ensure that KVM never attempts to use a bad address. This fixes a bug where KVM fails to check the incoming address when handling KVM_XEN_VCPU_ATTR_TYPE_VCPU_INFO_HVA in kvm_xen_vcpu_set_attr(). Reported-by: syzbot+fd555292a1da3180fc82@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=fd555292a1da3180fc82 Tested-by: syzbot+fd555292a1da3180fc82@syzkaller.appspotmail.com Signed-off-by: Pei Li <peili.dev@gmail.com> Reviewed-by: Paul Durrant <paul@xen.org> Reviewed-by: David Woodhouse <dwmw@amazon.co.uk> Link: https://lore.kernel.org/r/20240627-bug5-v2-1-2c63f7ee6739@gmail.com [sean: rewrite changelog with --verbose] Signed-off-by: Sean Christopherson <seanjc@google.com>
* | | Merge tag 'loongarch-kvm-6.11' of ↵Paolo Bonzini2024-07-121-3/+5
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.11 1. Add ParaVirt steal time support. 2. Add some VM migration enhancement. 3. Add perf kvm-stat support for loongarch.
| * | Merge tag 'kvm-x86-fixes-6.10-rcN' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2024-06-211-3/+5
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM fixes for 6.10 - Fix a "shift too big" goof in the KVM_SEV_INIT2 selftest. - Compute the max mappable gfn for KVM selftests on x86 using GuestMaxPhyAddr from KVM's supported CPUID (if it's available). - Fix a race in kvm_vcpu_on_spin() by ensuring loads and stores are atomic. - Fix technically benign bug in __kvm_handle_hva_range() where KVM consumes the return from a void-returning function as if it were a boolean.
| | * | KVM: Stop processing *all* memslots when "null" mmu_notifier handler is foundBabu Moger2024-06-181-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bail from outer address space loop, not just the inner memslot loop, when a "null" handler is encountered by __kvm_handle_hva_range(), which is the intended behavior. On x86, which has multiple address spaces thanks to SMM emulation, breaking from just the memslot loop results in undefined behavior due to assigning the non-existent return value from kvm_null_fn() to a bool. In practice, the bug is benign as kvm_mmu_notifier_invalidate_range_end() is the only caller that passes handler=kvm_null_fn, and it doesn't set flush_on_ret, i.e. assigning garbage to r.ret is ultimately ignored. And for most configuration the compiler elides the entire sequence, i.e. there is no undefined behavior at runtime. ------------[ cut here ]------------ UBSAN: invalid-load in arch/x86/kvm/../../../virt/kvm/kvm_main.c:655:10 load of value 160 is not a valid value for type '_Bool' CPU: 370 PID: 8246 Comm: CPU 0/KVM Not tainted 6.8.2-amdsos-build58-ubuntu-22.04+ #1 Hardware name: AMD Corporation Sh54p/Sh54p, BIOS WPC4429N 04/25/2024 Call Trace: <TASK> dump_stack_lvl+0x48/0x60 ubsan_epilogue+0x5/0x30 __ubsan_handle_load_invalid_value+0x79/0x80 kvm_mmu_notifier_invalidate_range_end.cold+0x18/0x4f [kvm] __mmu_notifier_invalidate_range_end+0x63/0xe0 __split_huge_pmd+0x367/0xfc0 do_huge_pmd_wp_page+0x1cc/0x380 __handle_mm_fault+0x8ee/0xe50 handle_mm_fault+0xe4/0x4a0 __get_user_pages+0x190/0x840 get_user_pages_unlocked+0xe0/0x590 hva_to_pfn+0x114/0x550 [kvm] kvm_faultin_pfn+0xed/0x5b0 [kvm] kvm_tdp_page_fault+0x123/0x170 [kvm] kvm_mmu_page_fault+0x244/0xaa0 [kvm] vcpu_enter_guest+0x592/0x1070 [kvm] kvm_arch_vcpu_ioctl_run+0x145/0x8a0 [kvm] kvm_vcpu_ioctl+0x288/0x6d0 [kvm] __x64_sys_ioctl+0x8f/0xd0 do_syscall_64+0x77/0x120 entry_SYSCALL_64_after_hwframe+0x6e/0x76 </TASK> ---[ end trace ]--- Fixes: 071064f14d87 ("KVM: Don't take mmu_lock for range invalidation unless necessary") Signed-off-by: Babu Moger <babu.moger@amd.com> Link: https://lore.kernel.org/r/b8723d39903b64c241c50f5513f804390c7b5eec.1718203311.git.babu.moger@amd.com [sean: massage changelog] Signed-off-by: Sean Christopherson <seanjc@google.com>