summaryrefslogtreecommitdiffstats
path: root/arch/x86/kvm (follow)
Commit message (Collapse)AuthorAgeFilesLines
* treewide: surround Kconfig file paths with double quotesMasahiro Yamada2018-12-211-1/+1
| | | | | | | | | | | | | | | | | | | The Kconfig lexer supports special characters such as '.' and '/' in the parameter context. In my understanding, the reason is just to support bare file paths in the source statement. I do not see a good reason to complicate Kconfig for the room of ambiguity. The majority of code already surrounds file paths with double quotes, and it makes sense since file paths are constant string literals. Make it treewide consistent now. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Wolfram Sang <wsa@the-dreams.de> Acked-by: Geert Uytterhoeven <geert@linux-m68k.org> Acked-by: Ingo Molnar <mingo@kernel.org>
* x86: Clean up 'sizeof x' => 'sizeof(x)'Jordan Borgner2018-10-293-33/+33
| | | | | | | | | | | | | | | | | | "sizeof(x)" is the canonical coding style used in arch/x86 most of the time. Fix the few places that didn't follow the convention. (Also do some whitespace cleanups in a few places while at it.) [ mingo: Rewrote the changelog. ] Signed-off-by: Jordan Borgner <mail@jordan-borgner.de> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/20181028125828.7rgammkgzep2wpam@JordanDesktop Signed-off-by: Ingo Molnar <mingo@kernel.org>
* Merge tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2018-10-2614-1050/+2358
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Radim Krčmář: "ARM: - Improved guest IPA space support (32 to 52 bits) - RAS event delivery for 32bit - PMU fixes - Guest entry hardening - Various cleanups - Port of dirty_log_test selftest PPC: - Nested HV KVM support for radix guests on POWER9. The performance is much better than with PR KVM. Migration and arbitrary level of nesting is supported. - Disable nested HV-KVM on early POWER9 chips that need a particular hardware bug workaround - One VM per core mode to prevent potential data leaks - PCI pass-through optimization - merge ppc-kvm topic branch and kvm-ppc-fixes to get a better base s390: - Initial version of AP crypto virtualization via vfio-mdev - Improvement for vfio-ap - Set the host program identifier - Optimize page table locking x86: - Enable nested virtualization by default - Implement Hyper-V IPI hypercalls - Improve #PF and #DB handling - Allow guests to use Enlightened VMCS - Add migration selftests for VMCS and Enlightened VMCS - Allow coalesced PIO accesses - Add an option to perform nested VMCS host state consistency check through hardware - Automatic tuning of lapic_timer_advance_ns - Many fixes, minor improvements, and cleanups" * tag 'kvm-4.20-1' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (204 commits) KVM/nVMX: Do not validate that posted_intr_desc_addr is page aligned Revert "kvm: x86: optimize dr6 restore" KVM: PPC: Optimize clearing TCEs for sparse tables x86/kvm/nVMX: tweak shadow fields selftests/kvm: add missing executables to .gitignore KVM: arm64: Safety check PSTATE when entering guest and handle IL KVM: PPC: Book3S HV: Don't use streamlined entry path on early POWER9 chips arm/arm64: KVM: Enable 32 bits kvm vcpu events support arm/arm64: KVM: Rename function kvm_arch_dev_ioctl_check_extension() KVM: arm64: Fix caching of host MDCR_EL2 value KVM: VMX: enable nested virtualization by default KVM/x86: Use 32bit xor to clear registers in svm.c kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOAD kvm: vmx: Defer setting of DR6 until #DB delivery kvm: x86: Defer setting of CR2 until #PF delivery kvm: x86: Add payload operands to kvm_multiple_exception kvm: x86: Add exception payload fields to kvm_vcpu_events kvm: x86: Add has_payload and payload to kvm_queued_exception KVM: Documentation: Fix omission in struct kvm_vcpu_events KVM: selftests: add Enlightened VMCS test ...
| * KVM/nVMX: Do not validate that posted_intr_desc_addr is page alignedKarimAllah Ahmed2018-10-241-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The spec only requires the posted interrupt descriptor address to be 64-bytes aligned (i.e. bits[0:5] == 0). Using page_address_valid also forces the address to be page aligned. Only validate that the address does not cross the maximum physical address without enforcing a page alignment. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Radim Krčmář <rkrcmar@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Ingo Molnar <mingo@redhat.com> Cc: Borislav Petkov <bp@alien8.de> Cc: H. Peter Anvin <hpa@zytor.com> Cc: x86@kernel.org Cc: kvm@vger.kernel.org Cc: linux-kernel@vger.kernel.org Fixes: 6de84e581c0 ("nVMX x86: check posted-interrupt descriptor addresss on vmentry of L2") Signed-off-by: KarimAllah Ahmed <karahmed@amazon.de> Reviewed-by: Jim Mattson <jmattson@google.com> Reviewed-by: Krish Sadhuhan <krish.sadhukhan@oracle.com> Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * Revert "kvm: x86: optimize dr6 restore"Radim Krčmář2018-10-231-9/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 0e0a53c551317654e2d7885fdfd23299fee99b6b. As Christian Ehrhardt noted: The most common case is that vcpu->arch.dr6 and the host's %dr6 value are not related at all because ->switch_db_regs is zero. To do this all correctly, we must handle the case where the guest leaves an arbitrary unused value in vcpu->arch.dr6 before disabling breakpoints again. However, this means that vcpu->arch.dr6 is not suitable to detect the need for a %dr6 clear. Signed-off-by: Radim Krčmář <rkrcmar@redhat.com>
| * x86/kvm/nVMX: tweak shadow fieldsVitaly Kuznetsov2018-10-192-9/+6
| | | | | | | | | | | | | | | | | | | | | | | | It seems we have some leftovers from times when 'unrestricted guest' wasn't exposed to L1. Stop shadowing GUEST_CS_{BASE,LIMIT,AR_SELECTOR} and GUEST_ES_BASE, shadow GUEST_SS_AR_BYTES as it was found that some hypervisors (e.g. Hyper-V without Enlightened VMCS) access it pretty often. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: VMX: enable nested virtualization by defaultPaolo Bonzini2018-10-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With live migration support and finally a good solution for exception event injection, nested VMX should be ready for having a stable userspace ABI. The results of syzkaller fuzzing are not perfect but not horrible either (and might be partially due to running on GCE, so that effectively we're testing three-level nesting on a fork of upstream KVM!). Enabling it by default seems like a nice way to conclude the 4.20 pull request. :) Unfortunately, enabling nested SVM in 2009 (commit 4b6e4dca701) was a bit premature. However, until live migration support is in place we can reasonably expect that it does not offer much in terms of ABI guarantees. Therefore we are still in time to break things and conform as much as possible to the interface used for VMX. Suggested-by: Jim Mattson <jmattson@google.com> Suggested-by: Liran Alon <liran.alon@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Celebrated-by: Liran Alon <liran.alon@oracle.com> Celebrated-by: Wanpeng Li <kernellwp@gmail.com> Celebrated-by: Wincy Van <fanwenyi0529@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM/x86: Use 32bit xor to clear registers in svm.cUros Bizjak2018-10-172-16/+18
| | | | | | | | | | | | | | | | | | x86_64 zero-extends 32bit xor operation to a full 64bit register. Also add a comment and remove unnecessary instruction suffix in vmx.c Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: x86: Introduce KVM_CAP_EXCEPTION_PAYLOADJim Mattson2018-10-171-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a per-VM capability which can be enabled by userspace so that the faulting linear address will be included with the information about a pending #PF in L2, and the "new DR6 bits" will be included with the information about a pending #DB in L2. With this capability enabled, the L1 hypervisor can now intercept #PF before CR2 is modified. Under VMX, the L1 hypervisor can now intercept #DB before DR6 and DR7 are modified. When userspace has enabled KVM_CAP_EXCEPTION_PAYLOAD, it should generally provide an appropriate payload when injecting a #PF or #DB exception via KVM_SET_VCPU_EVENTS. However, to support restoring old checkpoints, this payload is not required. Note that bit 16 of the "new DR6 bits" is set to indicate that a debug exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM region while advanced debugging of RTM transactional regions was enabled. This is the reverse of DR6.RTM, which is cleared in this scenario. This capability also enables exception.pending in struct kvm_vcpu_events, which allows userspace to distinguish between pending and injected exceptions. Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: vmx: Defer setting of DR6 until #DB deliveryJim Mattson2018-10-172-31/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When exception payloads are enabled by userspace (which is not yet possible) and a #DB is raised in L2, defer the setting of DR6 until later. Under VMX, this allows the L1 hypervisor to intercept the fault before DR6 is modified. Under SVM, DR6 is modified before L1 can intercept the fault (as has always been the case with DR7). Note that the payload associated with a #DB exception includes only the "new DR6 bits." When the payload is delievered, DR6.B0-B3 will be cleared and DR6.RTM will be set prior to merging in the new DR6 bits. Also note that bit 16 in the "new DR6 bits" is set to indicate that a debug exception (#DB) or a breakpoint exception (#BP) occurred inside an RTM region while advanced debugging of RTM transactional regions was enabled. Though the reverse of DR6.RTM, this makes the #DB payload field compatible with both the pending debug exceptions field under VMX and the exit qualification for #DB exceptions under VMX. Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: x86: Defer setting of CR2 until #PF deliveryJim Mattson2018-10-174-21/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When exception payloads are enabled by userspace (which is not yet possible) and a #PF is raised in L2, defer the setting of CR2 until the #PF is delivered. This allows the L1 hypervisor to intercept the fault before CR2 is modified. For backwards compatibility, when exception payloads are not enabled by userspace, kvm_multiple_exception modifies CR2 when the #PF exception is raised. Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: x86: Add payload operands to kvm_multiple_exceptionJim Mattson2018-10-171-7/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_multiple_exception now takes two additional operands: has_payload and payload, so that updates to CR2 (and DR6 under VMX) can be delayed until the exception is delivered. This is necessary to properly emulate VMX or SVM hardware behavior for nested virtualization. The new behavior is triggered by vcpu->kvm->arch.exception_payload_enabled, which will (later) be set by a new per-VM capability, KVM_CAP_EXCEPTION_PAYLOAD. Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: x86: Add exception payload fields to kvm_vcpu_eventsJim Mattson2018-10-171-16/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The per-VM capability KVM_CAP_EXCEPTION_PAYLOAD (to be introduced in a later commit) adds the following fields to struct kvm_vcpu_events: exception_has_payload, exception_payload, and exception.pending. With this capability set, all of the details of vcpu->arch.exception, including the payload for a pending exception, are reported to userspace in response to KVM_GET_VCPU_EVENTS. With this capability clear, the original ABI is preserved, and the exception.injected field is set for either pending or injected exceptions. When userspace calls KVM_SET_VCPU_EVENTS with KVM_CAP_EXCEPTION_PAYLOAD clear, exception.injected is no longer translated to exception.pending. KVM_SET_VCPU_EVENTS can now only establish a pending exception when KVM_CAP_EXCEPTION_PAYLOAD is set. Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm: x86: Add has_payload and payload to kvm_queued_exceptionJim Mattson2018-10-171-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The payload associated with a #PF exception is the linear address of the fault to be loaded into CR2 when the fault is delivered. The payload associated with a #DB exception is a mask of the DR6 bits to be set (or in the case of DR6.RTM, cleared) when the fault is delivered. Add fields has_payload and payload to kvm_queued_exception to track payloads for pending exceptions. The new fields are introduced here, but for now, they are just cleared. Reported-by: Jim Mattson <jmattson@google.com> Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * x86/kvm/nVMX: nested state migration for Enlightened VMCSVitaly Kuznetsov2018-10-172-20/+64
| | | | | | | | | | | | | | | | | | Add support for get/set of nested state when Enlightened VMCS is in use. A new KVM_STATE_NESTED_EVMCS flag to indicate eVMCS on the vCPU was enabled is added. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * x86/kvm/nVMX: allow bare VMXON state migrationVitaly Kuznetsov2018-10-171-7/+8
| | | | | | | | | | | | | | | | | | | | It is perfectly valid for a guest to do VMXON and not do VMPTRLD. This state needs to be preserved on migration. Cc: stable@vger.kernel.org Fixes: 8fcc4b5923af5de58b80b53a069453b135693304 Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * x86/kvm/lapic: preserve gfn_to_hva_cache len on cache reinitVitaly Kuznetsov2018-10-171-2/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | vcpu->arch.pv_eoi is accessible through both HV_X64_MSR_VP_ASSIST_PAGE and MSR_KVM_PV_EOI_EN so on migration userspace may try to restore them in any order. Values match, however, kvm_lapic_enable_pv_eoi() uses different length: for Hyper-V case it's the whole struct hv_vp_assist_page, for KVM native case it is 8. In case we restore KVM-native MSR last cache will be reinitialized with len=8 so trying to access VP assist page beyond 8 bytes with kvm_read_guest_cached() will fail. Check if we re-initializing cache for the same address and preserve length in case it was greater. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * x86/kvm/hyperv: don't clear VP assist pages on initVitaly Kuznetsov2018-10-171-1/+7
| | | | | | | | | | | | | | | | | | | | | | VP assist pages may hold valuable data which needs to be preserved across migration. Clean PV EOI portion of the data on init, the guest is responsible for making sure there's no garbage in the rest. This will be used for nVMX migration, eVMCS address needs to be preserved. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: nVMX: optimize prepare_vmcs02{,_full} for Enlightened VMCS caseVitaly Kuznetsov2018-10-171-53/+65
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When Enlightened VMCS is in use by L1 hypervisor we can avoid vmwriting VMCS fields which did not change. Our first goal is to achieve minimal impact on traditional VMCS case so we're not wrapping each vmwrite() with an if-changed checker. We also can't utilize static keys as Enlightened VMCS usage is per-guest. This patch implements the simpliest solution: checking fields in groups. We skip single vmwrite() statements as doing the check will cost us something even in non-evmcs case and the win is tiny. Unfortunately, this makes prepare_vmcs02_full{,_full}() code Enlightened VMCS-dependent (and a bit ugly). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: nVMX: implement enlightened VMPTRLD and VMCLEARVitaly Kuznetsov2018-10-171-7/+108
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per Hyper-V TLFS 5.0b: "The L1 hypervisor may choose to use enlightened VMCSs by writing 1 to the corresponding field in the VP assist page (see section 7.8.7). Another field in the VP assist page controls the currently active enlightened VMCS. Each enlightened VMCS is exactly one page (4 KB) in size and must be initially zeroed. No VMPTRLD instruction must be executed to make an enlightened VMCS active or current. After the L1 hypervisor performs a VM entry with an enlightened VMCS, the VMCS is considered active on the processor. An enlightened VMCS can only be active on a single processor at the same time. The L1 hypervisor can execute a VMCLEAR instruction to transition an enlightened VMCS from the active to the non-active state. Any VMREAD or VMWRITE instructions while an enlightened VMCS is active is unsupported and can result in unexpected behavior." Keep Enlightened VMCS structure for the current L2 guest permanently mapped from struct nested_vmx instead of mapping it every time. Suggested-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: nVMX: add enlightened VMCS stateVitaly Kuznetsov2018-10-171-17/+423
| | | | | | | | | | | | | | | | | | | | | | | | Adds hv_evmcs pointer and implement copy_enlightened_to_vmcs12() and copy_enlightened_to_vmcs12(). prepare_vmcs02()/prepare_vmcs02_full() separation is not valid for Enlightened VMCS, do full sync for now. Suggested-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: nVMX: add KVM_CAP_HYPERV_ENLIGHTENED_VMCS capabilityVitaly Kuznetsov2018-10-173-0/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Enlightened VMCS is opt-in. The current version does not contain all fields supported by nested VMX so we must not advertise the corresponding VMX features if enlightened VMCS is enabled. Userspace is given the enlightened VMCS version supported by KVM as part of enabling KVM_CAP_HYPERV_ENLIGHTENED_VMCS. The version is to be advertised to the nested hypervisor, currently done via a cpuid leaf for Hyper-V. Suggested-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: VMX: refactor evmcs_sanitize_exec_ctrls()Vitaly Kuznetsov2018-10-171-61/+47
| | | | | | | | | | | | | | | | Split off EVMCS1_UNSUPPORTED_* macros so we can re-use them when enabling Enlightened VMCS for Hyper-V on KVM. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: hyperv: define VP assist page helpersLadi Prosek2018-10-175-6/+29
| | | | | | | | | | | | | | | | | | | | The state related to the VP assist page is still managed by the LAPIC code in the pv_eoi field. Signed-off-by: Ladi Prosek <lprosek@redhat.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: reintroduce pte_list_remove, but including mmu_spte_clear_track_bitsWei Yang2018-10-171-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | rmap_remove() removes the sptep after locating the correct rmap_head but, in several cases, the caller has already known the correct rmap_head. This patch introduces a new pte_list_remove(); because it is known that the spte is present (or it would not have an rmap_head), it is safe to remove the tracking bits without any previous check. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: rename pte_list_remove to __pte_list_removeWei Yang2018-10-171-8/+8
| | | | | | | | | | | | | | This is a patch preparing for further change. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * kvm/x86 : fix some typoPeng Hao2018-10-171-2/+2
| | | | | | | | | | Signed-off-by: Peng Hao <peng.hao2@zte.com.cn> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM/VMX: Change hv flush logic when ept tables are mismatched.Lan Tianyu2018-10-171-7/+8
| | | | | | | | | | | | | | | | | | If ept table pointers are mismatched, flushing tlb for each vcpus via hv flush interface still helps to reduce vmexits which are triggered by IPI and INEPT emulation. Signed-off-by: Lan Tianyu <Tianyu.Lan@microsoft.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM/x86: Use 32bit xor to clear registerUros Bizjak2018-10-171-2/+2
| | | | | | | | | | | | | | | | x86_64 zero-extends 32bit xor to a full 64bit register. Use %k asm operand modifier to force 32bit register and save 268 bytes in kvm.o Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM/x86: Use assembly instruction mnemonics instead of .byte streamsUros Bizjak2018-10-171-26/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recently the minimum required version of binutils was changed to 2.20, which supports all VMX instruction mnemonics. The patch removes all .byte #defines and uses real instruction mnemonics instead. The compiler is now able to pass memory operand to the instruction, so there is no need for memory clobber anymore. Also, the compiler adds CC register clobber automatically to all extended asm clauses, so the patch also removes explicit CC clobber. The immediate benefit of the patch is removal of many unnecesary register moves, resulting in 1434 saved bytes in vmx.o: text data bss dec hex filename 151257 18246 8500 178003 2b753 vmx.o 152691 18246 8500 179437 2bced vmx-old.o Some examples of improvement include removal of unneeded moves of %rsp to %rax in front of invept and invvpid instructions: a57e: b9 01 00 00 00 mov $0x1,%ecx a583: 48 89 04 24 mov %rax,(%rsp) a587: 48 89 e0 mov %rsp,%rax a58a: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp) a591: 00 00 a593: 66 0f 38 80 08 invept (%rax),%rcx to: a45c: 48 89 04 24 mov %rax,(%rsp) a460: b8 01 00 00 00 mov $0x1,%eax a465: 48 c7 44 24 08 00 00 movq $0x0,0x8(%rsp) a46c: 00 00 a46e: 66 0f 38 80 04 24 invept (%rsp),%rax and the ability to use more optimal registers and memory operands in the instruction: 8faa: 48 8b 44 24 28 mov 0x28(%rsp),%rax 8faf: 4c 89 c2 mov %r8,%rdx 8fb2: 0f 79 d0 vmwrite %rax,%rdx to: 8e7c: 44 0f 79 44 24 28 vmwrite 0x28(%rsp),%r8 Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM/x86: Fix invvpid and invept register operand size in 64-bit modeUros Bizjak2018-10-171-2/+2
| | | | | | | | | | | | | | | | | | Register operand size of invvpid and invept instruction in 64-bit mode has always 64 bits. Adjust inline function argument type to reflect correct size. Signed-off-by: Uros Bizjak <ubizjak@gmail.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * x86/kvm/mmu: check if MMU reconfiguration is needed in init_kvm_nested_mmu()Vitaly Kuznetsov2018-10-171-0/+6
| | | | | | | | | | | | | | | | | | | | We don't use root page role for nested_mmu, however, optimizing out re-initialization in case nothing changed is still valuable as this is done for every nested vmentry. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is neededVitaly Kuznetsov2018-10-171-34/+61
| | | | | | | | | | | | | | | | | | | | | | MMU reconfiguration in init_kvm_tdp_mmu()/kvm_init_shadow_mmu() can be avoided if the source data used to configure it didn't change; enhance MMU extended role with the required fields and consolidate common code in kvm_calc_mmu_role_common(). Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * x86/kvm/nVMX: introduce source data cache for kvm_init_shadow_ept_mmu()Vitaly Kuznetsov2018-10-171-15/+38
| | | | | | | | | | | | | | | | | | | | | | | | MMU re-initialization is expensive, in particular, update_permission_bitmask() and update_pkru_bitmask() are. Cache the data used to setup shadow EPT MMU and avoid full re-init when it is unchanged. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * x86/kvm/mmu: make space for source data caching in struct kvm_mmuVitaly Kuznetsov2018-10-172-7/+21
| | | | | | | | | | | | | | | | | | | | | | In preparation to MMU reconfiguration avoidance we need a space to cache source data. As this partially intersects with kvm_mmu_page_role, create 64bit sized union kvm_mmu_role holding both base and extended data. No functional change. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * x86/kvm/mmu: get rid of redundant kvm_mmu_setup()Paolo Bonzini2018-10-172-13/+1
| | | | | | | | | | | | | | | | | | Just inline the contents into the sole caller, kvm_init_mmu is now public. Suggested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
| * x86/kvm/mmu: introduce guest_mmuVitaly Kuznetsov2018-10-172-18/+37
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When EPT is used for nested guest we need to re-init MMU as shadow EPT MMU (nested_ept_init_mmu_context() does that). When we return back from L2 to L1 kvm_mmu_reset_context() in nested_vmx_load_cr3() resets MMU back to normal TDP mode. Add a special 'guest_mmu' so we can use separate root caches; the improved hit rate is not very important for single vCPU performance, but it avoids contention on the mmu_lock for many vCPUs. On the nested CPUID benchmark, with 16 vCPUs, an L2->L1->L2 vmexit goes from 42k to 26k cycles. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * x86/kvm/mmu.c: add kvm_mmu parameter to kvm_mmu_free_roots()Vitaly Kuznetsov2018-10-172-5/+6
| | | | | | | | | | | | | | | | | | Add an option to specify which MMU root we want to free. This will be used when nested and non-nested MMUs for L1 are split. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
| * x86/kvm/mmu.c: set get_pdptr hook in kvm_init_shadow_ept_mmu()Vitaly Kuznetsov2018-10-172-0/+2
| | | | | | | | | | | | | | | | | | | | | | kvm_init_shadow_ept_mmu() doesn't set get_pdptr() hook and is this not a problem just because MMU context is already initialized and this hook points to kvm_pdptr_read(). As we're intended to use a dedicated MMU for shadow EPT MMU set this hook explicitly. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
| * x86/kvm/mmu: make vcpu->mmu a pointer to the current MMUVitaly Kuznetsov2018-10-177-123/+126
| | | | | | | | | | | | | | | | | | | | As a preparation to full MMU split between L1 and L2 make vcpu->arch.mmu a pointer to the currently used mmu. For now, this is always vcpu->arch.root_mmu. No functional change. Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com>
| * kvm: x86: optimize dr6 restorePaolo Bonzini2018-10-171-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The quote from the comment almost says it all: we are currently zeroing the guest dr6 in kvm_arch_vcpu_put, because do_debug expects it. However, the host %dr6 is either: - zero because the guest hasn't run after kvm_arch_vcpu_load - written from vcpu->arch.dr6 by vcpu_enter_guest - written by the guest and copied to vcpu->arch.dr6 by ->sync_dirty_debug_regs(). Therefore, we can skip the write if vcpu->arch.dr6 is already zero. We may do extra useless writes if vcpu->arch.dr6 is nonzero but the guest hasn't run; however that is less important for performance. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: hyperv: optimize sparse VP set processingVitaly Kuznetsov2018-10-171-98/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rewrite kvm_hv_flush_tlb()/send_ipi_vcpus_mask() making them cleaner and somewhat more optimal. hv_vcpu_in_sparse_set() is converted to sparse_set_to_vcpu_mask() which copies sparse banks u64-at-a-time and then, depending on the num_mismatched_vp_indexes value, returns immediately or does vp index to vcpu index conversion by walking all vCPUs. To support the change and make kvm_hv_send_ipi() look similar to kvm_hv_flush_tlb() send_ipi_vcpus_mask() is introduced. Suggested-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: x86: hyperv: fix 'tlb_lush' typoVitaly Kuznetsov2018-10-171-3/+3
| | | | | | | | | | | | | | | | | | Regardless of whether your TLB is lush or not it still needs flushing. Reported-by: Roman Kagan <rkagan@virtuozzo.com> Reviewed-by: Roman Kagan <rkagan@virtuozzo.com> Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: nVMX: WARN if nested run hits VMFail with early consistency checks enabledSean Christopherson2018-10-171-8/+10
| | | | | | | | | | | | | | | | When early consistency checks are enabled, all VMFail conditions should be caught by nested_vmx_check_vmentry_hw(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: nVMX: add option to perform early consistency checks via H/WSean Christopherson2018-10-171-5/+137
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM defers many VMX consistency checks to the CPU, ostensibly for performance reasons[1], including checks that result in VMFail (as opposed to VMExit). This behavior may be undesirable for some users since this means KVM detects certain classes of VMFail only after it has processed guest state, e.g. emulated MSR load-on-entry. Because there is a strict ordering between checks that cause VMFail and those that cause VMExit, i.e. all VMFail checks are performed before any checks that cause VMExit, we can detect (almost) all VMFail conditions via a dry run of sorts. The almost qualifier exists because some state in vmcs02 comes from L0, e.g. VPID, which means that hardware will never detect an invalid VPID in vmcs12 because it never sees said value. Software must (continue to) explicitly check such fields. After preparing vmcs02 with all state needed to pass the VMFail consistency checks, optionally do a "test" VMEnter with an invalid GUEST_RFLAGS. If the VMEnter results in a VMExit (due to bad guest state), then we can safely say that the nested VMEnter should not VMFail, i.e. any VMFail encountered in nested_vmx_vmexit() must be due to an L0 bug. GUEST_RFLAGS is used to induce VMExit as it is unconditionally loaded on all implementations of VMX, has an invalid value that is writable on a 32-bit system and its consistency check is performed relatively early in all implementations (the exact order of consistency checks is micro-architectural). Unfortunately, since the "passing" case causes a VMExit, KVM must be extra diligent to ensure that host state is restored, e.g. DR7 and RFLAGS are reset on VMExit. Failure to restore RFLAGS.IF is particularly fatal. And of course the extra VMEnter and VMExit impacts performance. The raw overhead of the early consistency checks is ~6% on modern hardware (though this could easily vary based on configuration), while the added latency observed from the L1 VMM is ~10%. The early consistency checks do not occur in a vacuum, e.g. spending more time in L0 can lead to more interrupts being serviced while emulating VMEnter, thereby increasing the latency observed by L1. Add a module param, early_consistency_checks, to provide control over whether or not VMX performs the early consistency checks. In addition to standard on/off behavior, the param accepts a value of -1, which is essentialy an "auto" setting whereby KVM does the early checks only when it thinks it's running on bare metal. When running nested, doing early checks is of dubious value since the resulting behavior is heavily dependent on L0. In the future, the "auto" setting could also be used to default to skipping the early hardware checks for certain configurations/platforms if KVM reaches a state where it has 100% coverage of VMFail conditions. [1] To my knowledge no one has implemented and tested full software emulation of the VMFail consistency checks. Until that happens, one can only speculate about the actual performance overhead of doing all VMFail consistency checks in software. Obviously any code is slower than no code, but in the grand scheme of nested virtualization it's entirely possible the overhead is negligible. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: vmx: write HOST_IA32_EFER in vmx_set_constant_host_state()Sean Christopherson2018-10-171-1/+5
| | | | | | | | | | | | | | | | | | EFER is constant in the host and writing it once during setup means we can skip writing the host value in add_atomic_switch_msr_special(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: nVMX: call kvm_skip_emulated_instruction in nested_vmx_{fail,succeed}Sean Christopherson2018-10-171-154/+91
| | | | | | | | | | | | | | | | | | | | | | | | ... as every invocation of nested_vmx_{fail,succeed} is immediately followed by a call to kvm_skip_emulated_instruction(). This saves a bit of code and eliminates some silly paths, e.g. nested_vmx_run() ended up with a goto label purely used to call and return kvm_skip_emulated_instruction(). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: nVMX: do not call nested_vmx_succeed() for consistency check VMExitSean Christopherson2018-10-171-1/+0
| | | | | | | | | | | | | | | | | | EFLAGS is set to a fixed value on VMExit, calling nested_vmx_succeed() is unnecessary and wrong. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: nVMX: do not skip VMEnter instruction that succeedsSean Christopherson2018-10-171-16/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A successful VMEnter is essentially a fancy indirect branch that pulls the target RIP from the VMCS. Skipping the instruction is unnecessary (RIP will get overwritten by the VMExit handler) and is problematic because it can incorrectly suppress a #DB due to EFLAGS.TF when a VMFail is detected by hardware (happens after we skip the instruction). Now that vmx_nested_run() is not prematurely skipping the instr, use the full kvm_skip_emulated_instruction() in the VMFail path of nested_vmx_vmexit(). We also need to explicitly update the GUEST_INTERRUPTIBILITY_INFO when loading vmcs12 host state. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * KVM: nVMX: do early preparation of vmcs02 before check_vmentry_postreqs()Sean Christopherson2018-10-171-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | In anticipation of using vmcs02 to do early consistency checks, move the early preparation of vmcs02 prior to checking the postreqs. The downside of this approach is that we'll unnecessary load vmcs02 in the case that check_vmentry_postreqs() fails, but that is essentially our slow path anyways (not actually slow, but it's the path we don't really care about optimizing). Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>