summaryrefslogtreecommitdiffstats
path: root/arch/arm64/kvm (follow)
Commit message (Collapse)AuthorAgeFilesLines
* KVM: arm64: Ensure vgic_ready() is ordered against MMIO registrationOliver Upton2024-10-171-2/+11
| | | | | | | | | | | | | | | | | | kvm_vgic_map_resources() prematurely marks the distributor as 'ready', potentially allowing vCPUs to enter the guest before the distributor's MMIO registration has been made visible. Plug the race by marking the distributor as ready only after MMIO registration is completed. Rely on the implied ordering of synchronize_srcu() to ensure the MMIO registration is visible before vgic_dist::ready. This also means that writers to vgic_dist::ready are now serialized by the slots_lock, which was effectively the case already as all writers held the slots_lock in addition to the config_lock. Fixes: 59112e9c390b ("KVM: arm64: vgic: Fix a circular locking issue") Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241017001947.2707312-3-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
* KVM: arm64: vgic: Don't check for vgic_ready() when setting NR_IRQSOliver Upton2024-10-171-1/+6
| | | | | | | | | | | | | | | | KVM commits to a particular sizing of SPIs when the vgic is initialized, which is before the point a vgic becomes ready. On top of that, KVM supplies a default amount of SPIs should userspace not explicitly configure this. As such, the check for vgic_ready() in the handling of KVM_DEV_ARM_VGIC_GRP_NR_IRQS is completely wrong, and testing if nr_spis is nonzero is sufficient for preventing userspace from playing games with us. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241017001947.2707312-2-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
* KVM: arm64: Fix shift-out-of-bounds bugIlkka Koskinen2024-10-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a shift-out-of-bounds bug reported by UBSAN when running VM with MTE enabled host kernel. UBSAN: shift-out-of-bounds in arch/arm64/kvm/sys_regs.c:1988:14 shift exponent 33 is too large for 32-bit type 'int' CPU: 26 UID: 0 PID: 7629 Comm: qemu-kvm Not tainted 6.12.0-rc2 #34 Hardware name: IEI NF5280R7/Mitchell MB, BIOS 00.00. 2024-10-12 09:28:54 10/14/2024 Call trace: dump_backtrace+0xa0/0x128 show_stack+0x20/0x38 dump_stack_lvl+0x74/0x90 dump_stack+0x18/0x28 __ubsan_handle_shift_out_of_bounds+0xf8/0x1e0 reset_clidr+0x10c/0x1c8 kvm_reset_sys_regs+0x50/0x1c8 kvm_reset_vcpu+0xec/0x2b0 __kvm_vcpu_set_target+0x84/0x158 kvm_vcpu_set_target+0x138/0x168 kvm_arch_vcpu_ioctl_vcpu_init+0x40/0x2b0 kvm_arch_vcpu_ioctl+0x28c/0x4b8 kvm_vcpu_ioctl+0x4bc/0x7a8 __arm64_sys_ioctl+0xb4/0x100 invoke_syscall+0x70/0x100 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x3c/0x158 el0t_64_sync_handler+0x120/0x130 el0t_64_sync+0x194/0x198 Fixes: 7af0c2534f4c ("KVM: arm64: Normalize cache configuration") Cc: stable@vger.kernel.org Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: Ilkka Koskinen <ilkka@os.amperecomputing.com> Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Link: https://lore.kernel.org/r/20241017025701.67936-1-ilkka@os.amperecomputing.com Signed-off-by: Marc Zyngier <maz@kernel.org>
* KVM: arm64: Shave a few bytes from the EL2 idmap codeMarc Zyngier2024-10-171-23/+29
| | | | | | | | | | | | | | | Our idmap is becoming too big, to the point where it doesn't fit in a 4kB page anymore. There are some low-hanging fruits though, such as the el2_init_state horror that is expanded 3 times in the kernel. Let's at least limit ourselves to two copies, which makes the kernel link again. At some point, we'll have to have a better way of doing this. Reported-by: Nathan Chancellor <nathan@kernel.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Link: https://lore.kernel.org/r/20241009204903.GA3353168@thelio-3990X
* KVM: arm64: Don't eagerly teardown the vgic on init errorMarc Zyngier2024-10-112-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As there is very little ordering in the KVM API, userspace can instanciate a half-baked GIC (missing its memory map, for example) at almost any time. This means that, with the right timing, a thread running vcpu-0 can enter the kernel without a GIC configured and get a GIC created behind its back by another thread. Amusingly, it will pick up that GIC and start messing with the data structures without the GIC having been fully initialised. Similarly, a thread running vcpu-1 can enter the kernel, and try to init the GIC that was previously created. Since this GIC isn't properly configured (no memory map), it fails to correctly initialise. And that's the point where we decide to teardown the GIC, freeing all its resources. Behind vcpu-0's back. Things stop pretty abruptly, with a variety of symptoms. Clearly, this isn't good, we should be a bit more careful about this. It is obvious that this guest is not viable, as it is missing some important part of its configuration. So instead of trying to tear bits of it down, let's just mark it as *dead*. It means that any further interaction from userspace will result in -EIO. The memory will be released on the "normal" path, when userspace gives up. Cc: stable@vger.kernel.org Reported-by: Alexander Potapenko <glider@google.com> Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241009183603.3221824-1-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
* KVM: arm64: Expose S1PIE to guestsMark Brown2024-10-081-1/+3
| | | | | | | | | | | | | | | Prior to commit 70ed7238297f ("KVM: arm64: Sanitise ID_AA64MMFR3_EL1") we just exposed the santised view of ID_AA64MMFR3_EL1 to guests, meaning that they saw both TCRX and S1PIE if present on the host machine. That commit added VMM control over the contents of the register and exposed S1POE but removed S1PIE, meaning that the extension is no longer visible to guests. Reenable support for S1PIE with VMM control. Fixes: 70ed7238297f ("KVM: arm64: Sanitise ID_AA64MMFR3_EL1") Signed-off-by: Mark Brown <broonie@kernel.org> Reviewed-by: Joey Gouly <joey.gouly@arm.com> Link: https://lore.kernel.org/r/20241005-kvm-arm64-fix-s1pie-v1-1-5901f02de749@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
* KVM: arm64: nv: Clarify safety of allowing TLBI unmaps to rescheduleOliver Upton2024-10-081-0/+27
| | | | | | | | | | | There's been a decent amount of attention around unmaps of nested MMUs, and TLBI handling is no exception to this. Add a comment clarifying why it is safe to reschedule during a TLBI unmap, even without a reference on the MMU in progress. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-5-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
* KVM: arm64: nv: Punt stage-2 recycling to a vCPU requestOliver Upton2024-10-082-2/+28
| | | | | | | | | | | | | | | | | | | Currently, when a nested MMU is repurposed for some other MMU context, KVM unmaps everything during vcpu_load() while holding the MMU lock for write. This is quite a performance bottleneck for large nested VMs, as all vCPU scheduling will spin until the unmap completes. Start punting the MMU cleanup to a vCPU request, where it is then possible to periodically release the MMU lock and CPU in the presence of contention. Ensure that no vCPU winds up using a stale MMU by tracking the pending unmap on the S2 MMU itself and requesting an unmap on every vCPU that finds it. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-4-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
* KVM: arm64: nv: Do not block when unmapping stage-2 if disallowedOliver Upton2024-10-083-13/+14
| | | | | | | | | | | | | | | | Right now the nested code allows unmap operations on a shadow stage-2 to block unconditionally. This is wrong in a couple places, such as a non-blocking MMU notifier or on the back of a sched_in() notifier as part of shadow MMU recycling. Carry through whether or not blocking is allowed to kvm_pgtable_stage2_unmap(). This 'fixes' an issue where stage-2 MMU reclaim would precipitate a stack overflow from a pile of kvm_sched_in() callbacks, all trying to recycle a stage-2 MMU. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-3-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
* KVM: arm64: nv: Keep reference on stage-2 MMU when scheduled outOliver Upton2024-10-081-3/+18
| | | | | | | | | | | | | | | | | | | | | | If a vCPU is scheduling out and not in WFI emulation, it is highly likely it will get scheduled again soon and reuse the MMU it had before. Dropping the MMU at vcpu_put() can have some unfortunate consequences, as the MMU could get reclaimed and used in a different context, forcing another 'cold start' on an otherwise active MMU. Avoid that altogether by keeping a reference on the MMU if the vCPU is scheduling out, ensuring that another vCPU cannot reclaim it while the current vCPU is away. Since there are more MMUs than vCPUs, this does not affect the guarantee that an unused MMU is available at any time. Furthermore, this makes the vcpu->arch.hw_mmu ~stable in preemptible code, at least for where it matters in the stage-2 abort path. Yes, the MMU can change across WFI emulation, but there isn't even a use case where this would matter. Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007233028.2236133-2-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
* KVM: arm64: Unregister redistributor for failed vCPU creationOliver Upton2024-10-081-1/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alex reports that syzkaller has managed to trigger a use-after-free when tearing down a VM: BUG: KASAN: slab-use-after-free in kvm_put_kvm+0x300/0xe68 virt/kvm/kvm_main.c:5769 Read of size 8 at addr ffffff801c6890d0 by task syz.3.2219/10758 CPU: 3 UID: 0 PID: 10758 Comm: syz.3.2219 Not tainted 6.11.0-rc6-dirty #64 Hardware name: linux,dummy-virt (DT) Call trace: dump_backtrace+0x17c/0x1a8 arch/arm64/kernel/stacktrace.c:317 show_stack+0x2c/0x3c arch/arm64/kernel/stacktrace.c:324 __dump_stack lib/dump_stack.c:93 [inline] dump_stack_lvl+0x94/0xc0 lib/dump_stack.c:119 print_report+0x144/0x7a4 mm/kasan/report.c:377 kasan_report+0xcc/0x128 mm/kasan/report.c:601 __asan_report_load8_noabort+0x20/0x2c mm/kasan/report_generic.c:381 kvm_put_kvm+0x300/0xe68 virt/kvm/kvm_main.c:5769 kvm_vm_release+0x4c/0x60 virt/kvm/kvm_main.c:1409 __fput+0x198/0x71c fs/file_table.c:422 ____fput+0x20/0x30 fs/file_table.c:450 task_work_run+0x1cc/0x23c kernel/task_work.c:228 do_notify_resume+0x144/0x1a0 include/linux/resume_user_mode.h:50 el0_svc+0x64/0x68 arch/arm64/kernel/entry-common.c:169 el0t_64_sync_handler+0x90/0xfc arch/arm64/kernel/entry-common.c:730 el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:598 Upon closer inspection, it appears that we do not properly tear down the MMIO registration for a vCPU that fails creation late in the game, e.g. a vCPU w/ the same ID already exists in the VM. It is important to consider the context of commit that introduced this bug by moving the unregistration out of __kvm_vgic_vcpu_destroy(). That change correctly sought to avoid an srcu v. config_lock inversion by breaking up the vCPU teardown into two parts, one guarded by the config_lock. Fix the use-after-free while avoiding lock inversion by adding a special-cased unregistration to __kvm_vgic_vcpu_destroy(). This is safe because failed vCPUs are torn down outside of the config_lock. Cc: stable@vger.kernel.org Fixes: f616506754d3 ("KVM: arm64: vgic: Don't hold config_lock while unregistering redistributors") Reported-by: Alexander Potapenko <glider@google.com> Signed-off-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20241007223909.2157336-1-oliver.upton@linux.dev Signed-off-by: Marc Zyngier <maz@kernel.org>
* Merge branch kvm-arm64/idregs-6.12 into kvmarm/fixesMarc Zyngier2024-10-082-8/+42
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * kvm-arm64/idregs-6.12: : . : Make some fields of ID_AA64DFR0_EL1 and ID_AA64PFR1_EL1 : writable from userspace, so that a VMM can influence the : set of guest-visible features. : : - for ID_AA64DFR0_EL1: DoubleLock, WRPs, PMUVer and DebugVer : are writable (courtesy of Shameer Kolothum) : : - for ID_AA64PFR1_EL1: BT, SSBS, CVS2_frac are writable : (courtesy of Shaoqin Huang) : . KVM: selftests: aarch64: Add writable test for ID_AA64PFR1_EL1 KVM: arm64: Allow userspace to change ID_AA64PFR1_EL1 KVM: arm64: Use kvm_has_feat() to check if FEAT_SSBS is advertised to the guest KVM: arm64: Disable fields that KVM doesn't know how to handle in ID_AA64PFR1_EL1 KVM: arm64: Make the exposed feature bits in AA64DFR0_EL1 writable from userspace Signed-off-by: Marc Zyngier <maz@kernel.org>
| * KVM: arm64: Allow userspace to change ID_AA64PFR1_EL1Shaoqin Huang2024-08-251-1/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow userspace to change the guest-visible value of the register with different way of handling: - Since the RAS and MPAM is not writable in the ID_AA64PFR0_EL1 register, RAS_frac and MPAM_frac are also not writable in the ID_AA64PFR1_EL1 register. - The MTE is controlled by a separate UAPI (KVM_CAP_ARM_MTE) with an internal flag (KVM_ARCH_FLAG_MTE_ENABLED). So it's not writable. - For those fields which KVM doesn't know how to handle, they are not exposed to the guest (being disabled in the register read accessor), those fields value will always be 0. Those fields don't have a known behavior now, so don't advertise them to the userspace. Thus still not writable. Those fields include SME, RNDR_trap, NMI, GCS, THE, DF2, PFAR, MTE_frac, MTEX. - The BT, SSBS, CSV2_frac don't introduce any new registers which KVM doesn't know how to handle, they can be written without ill effect. So let them writable. Besides, we don't do the crosscheck in KVM about the CSV2_frac even if it depends on the value of CSV2, it should be made sure by the VMM instead of KVM. Signed-off-by: Shaoqin Huang <shahuang@redhat.com> Link: https://lore.kernel.org/r/20240723072004.1470688-4-shahuang@redhat.com Signed-off-by: Marc Zyngier <maz@kernel.org>
| * KVM: arm64: Use kvm_has_feat() to check if FEAT_SSBS is advertised to the guestShaoqin Huang2024-08-251-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently KVM use cpus_have_final_cap() to check if FEAT_SSBS is advertised to the guest. But if FEAT_SSBS is writable and isn't advertised to the guest, this is wrong. Update it to use kvm_has_feat() to check if FEAT_SSBS is advertised to the guest, thus the KVM can do the right thing if FEAT_SSBS isn't advertised to the guest. Signed-off-by: Shaoqin Huang <shahuang@redhat.com> Link: https://lore.kernel.org/r/20240723072004.1470688-3-shahuang@redhat.com Signed-off-by: Marc Zyngier <maz@kernel.org>
| * KVM: arm64: Disable fields that KVM doesn't know how to handle in ↵Shaoqin Huang2024-08-251-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ID_AA64PFR1_EL1 For some of the fields in the ID_AA64PFR1_EL1 register, KVM doesn't know how to handle them right now. So explicitly disable them in the register accessor, then those fields value will be masked to 0 even if on the hardware the field value is 1. This is safe because from a UAPI point of view that read_sanitised_ftr_reg() doesn't yet return a nonzero value for any of those fields. This will benifit the migration if the host and VM have different values when restoring a VM. Those fields include RNDR_trap, NMI, MTE_frac, GCS, THE, MTEX, DF2, PFAR. Signed-off-by: Shaoqin Huang <shahuang@redhat.com> Link: https://lore.kernel.org/r/20240723072004.1470688-2-shahuang@redhat.com Signed-off-by: Marc Zyngier <maz@kernel.org>
| * KVM: arm64: Make the exposed feature bits in AA64DFR0_EL1 writable from ↵Shameer Kolothum2024-08-221-1/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | userspace KVM exposes the OS double lock feature bit to Guests but returns RAZ/WI on Guest OSDLR_EL1 access. This breaks Guest migration between systems where this feature differ. Add support to make this feature writable from userspace by setting the mask bit. While at it, set the mask bits for the exposed WRPs(Number of Watchpoints) as well. Also update the selftest to cover these fields. However we still can't make BRPs and CTX_CMPs fields writable, because as per ARM ARM DDI 0487K.a, section D2.8.3 Breakpoint types and linking of breakpoints, highest numbered breakpoints(BRPs) must be context aware breakpoints(CTX_CMPs). KVM does not trap + emulate the breakpoint registers, and as such cannot support a layout that misaligns with the underlying hardware. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Shameer Kolothum <shameerali.kolothum.thodi@huawei.com> Link: https://lore.kernel.org/r/20240816132819.34316-1-shameerali.kolothum.thodi@huawei.com Signed-off-by: Marc Zyngier <maz@kernel.org>
* | KVM: arm64: Constrain the host to the maximum shared SVE VL with pKVMMark Brown2024-10-012-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When pKVM saves and restores the host floating point state on a SVE system, it programs the vector length in ZCR_EL2.LEN to be whatever the maximum VL for the PE is. But it uses a buffer allocated with kvm_host_sve_max_vl, the maximum VL shared by all PEs in the system. This means that if we run on a system where the maximum VLs are not consistent, we will overflow the buffer on PEs which support larger VLs. Since the host will not currently attempt to make use of non-shared VLs, fix this by explicitly setting the EL2 VL to be the maximum shared VL when we save and restore. This will enforce the limit on host VL usage. Should we wish to support asymmetric VLs, this code will need to be updated along with the required changes for the host: https://lore.kernel.org/r/20240730-kvm-arm64-fix-pkvm-sve-vl-v6-0-cae8a2e0bd66@kernel.org Fixes: b5b9955617bc ("KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM") Signed-off-by: Mark Brown <broonie@kernel.org> Tested-by: Fuad Tabba <tabba@google.com> Reviewed-by: Fuad Tabba <tabba@google.com> Link: https://lore.kernel.org/r/20240912-kvm-arm64-limit-guest-vl-v2-1-dd2c29cb2ac9@kernel.org [maz: added punctuation to the commit message] Signed-off-by: Marc Zyngier <maz@kernel.org>
* | KVM: arm64: Fix __pkvm_init_vcpu cptr_el2 error pathVincent Donnefort2024-10-011-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | On an error, hyp_vcpu will be accessed while this memory has already been relinquished to the host and unmapped from the hypervisor. Protect the CPTR assignment with an early return. Fixes: b5b9955617bc ("KVM: arm64: Eagerly restore host fpsimd/sve state in pKVM") Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Signed-off-by: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/r/20240919110500.2345927-1-vdonnefort@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
* | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2024-09-281-3/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull x86 kvm updates from Paolo Bonzini: "x86: - KVM currently invalidates the entirety of the page tables, not just those for the memslot being touched, when a memslot is moved or deleted. This does not traditionally have particularly noticeable overhead, but Intel's TDX will require the guest to re-accept private pages if they are dropped from the secure EPT, which is a non starter. Actually, the only reason why this is not already being done is a bug which was never fully investigated and caused VM instability with assigned GeForce GPUs, so allow userspace to opt into the new behavior. - Advertise AVX10.1 to userspace (effectively prep work for the "real" AVX10 functionality that is on the horizon) - Rework common MSR handling code to suppress errors on userspace accesses to unsupported-but-advertised MSRs This will allow removing (almost?) all of KVM's exemptions for userspace access to MSRs that shouldn't exist based on the vCPU model (the actual cleanup is non-trivial future work) - Rework KVM's handling of x2APIC ICR, again, because AMD (x2AVIC) splits the 64-bit value into the legacy ICR and ICR2 storage, whereas Intel (APICv) stores the entire 64-bit value at the ICR offset - Fix a bug where KVM would fail to exit to userspace if one was triggered by a fastpath exit handler - Add fastpath handling of HLT VM-Exit to expedite re-entering the guest when there's already a pending wake event at the time of the exit - Fix a WARN caused by RSM entering a nested guest from SMM with invalid guest state, by forcing the vCPU out of guest mode prior to signalling SHUTDOWN (the SHUTDOWN hits the VM altogether, not the nested guest) - Overhaul the "unprotect and retry" logic to more precisely identify cases where retrying is actually helpful, and to harden all retry paths against putting the guest into an infinite retry loop - Add support for yielding, e.g. to honor NEED_RESCHED, when zapping rmaps in the shadow MMU - Refactor pieces of the shadow MMU related to aging SPTEs in prepartion for adding multi generation LRU support in KVM - Don't stuff the RSB after VM-Exit when RETPOLINE=y and AutoIBRS is enabled, i.e. when the CPU has already flushed the RSB - Trace the per-CPU host save area as a VMCB pointer to improve readability and cleanup the retrieval of the SEV-ES host save area - Remove unnecessary accounting of temporary nested VMCB related allocations - Set FINAL/PAGE in the page fault error code for EPT violations if and only if the GVA is valid. If the GVA is NOT valid, there is no guest-side page table walk and so stuffing paging related metadata is nonsensical - Fix a bug where KVM would incorrectly synthesize a nested VM-Exit instead of emulating posted interrupt delivery to L2 - Add a lockdep assertion to detect unsafe accesses of vmcs12 structures - Harden eVMCS loading against an impossible NULL pointer deref (really truly should be impossible) - Minor SGX fix and a cleanup - Misc cleanups Generic: - Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed - Enable virtualization when KVM is loaded, not right before the first VM is created Together with the previous change, this simplifies a lot the logic of the callbacks, because their very existence implies virtualization is enabled - Fix a bug that results in KVM prematurely exiting to userspace for coalesced MMIO/PIO in many cases, clean up the related code, and add a testcase - Fix a bug in kvm_clear_guest() where it would trigger a buffer overflow _if_ the gpa+len crosses a page boundary, which thankfully is guaranteed to not happen in the current code base. Add WARNs in more helpers that read/write guest memory to detect similar bugs Selftests: - Fix a goof that caused some Hyper-V tests to be skipped when run on bare metal, i.e. NOT in a VM - Add a regression test for KVM's handling of SHUTDOWN for an SEV-ES guest - Explicitly include one-off assets in .gitignore. Past Sean was completely wrong about not being able to detect missing .gitignore entries - Verify userspace single-stepping works when KVM happens to handle a VM-Exit in its fastpath - Misc cleanups" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (127 commits) Documentation: KVM: fix warning in "make htmldocs" s390: Enable KVM_S390_UCONTROL config in debug_defconfig selftests: kvm: s390: Add VM run test case KVM: SVM: let alternatives handle the cases when RSB filling is required KVM: VMX: Set PFERR_GUEST_{FINAL,PAGE}_MASK if and only if the GVA is valid KVM: x86/mmu: Use KVM_PAGES_PER_HPAGE() instead of an open coded equivalent KVM: x86/mmu: Add KVM_RMAP_MANY to replace open coded '1' and '1ul' literals KVM: x86/mmu: Fold mmu_spte_age() into kvm_rmap_age_gfn_range() KVM: x86/mmu: Morph kvm_handle_gfn_range() into an aging specific helper KVM: x86/mmu: Honor NEED_RESCHED when zapping rmaps and blocking is allowed KVM: x86/mmu: Add a helper to walk and zap rmaps for a memslot KVM: x86/mmu: Plumb a @can_yield parameter into __walk_slot_rmaps() KVM: x86/mmu: Move walk_slot_rmaps() up near for_each_slot_rmap_range() KVM: x86/mmu: WARN on MMIO cache hit when emulating write-protected gfn KVM: x86/mmu: Detect if unprotect will do anything based on invalid_list KVM: x86/mmu: Subsume kvm_mmu_unprotect_page() into the and_retry() version KVM: x86: Rename reexecute_instruction()=>kvm_unprotect_and_retry_on_failure() KVM: x86: Update retry protection fields when forcing retry on emulation failure KVM: x86: Apply retry protection to "unprotect on failure" path KVM: x86: Check EMULTYPE_WRITE_PF_TO_SP before unprotecting gfn ...
| * \ Merge branch 'kvm-redo-enable-virt' into HEADPaolo Bonzini2024-09-171-3/+3
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Register KVM's cpuhp and syscore callbacks when enabling virtualization in hardware, as the sole purpose of said callbacks is to disable and re-enable virtualization as needed. The primary motivation for this series is to simplify dealing with enabling virtualization for Intel's TDX, which needs to enable virtualization when kvm-intel.ko is loaded, i.e. long before the first VM is created. That said, this is a nice cleanup on its own. By registering the callbacks on-demand, the callbacks themselves don't need to check kvm_usage_count, because their very existence implies a non-zero count. Patch 1 (re)adds a dedicated lock for kvm_usage_count. This avoids a lock ordering issue between cpus_read_lock() and kvm_lock. The lock ordering issue still exist in very rare cases, and will be fixed for good by switching vm_list to an (S)RCU-protected list. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| | * | KVM: Rename arch hooks related to per-CPU virtualization enablingSean Christopherson2024-09-041-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Rename the per-CPU hooks used to enable virtualization in hardware to align with the KVM-wide helpers in kvm_main.c, and to better capture that the callbacks are invoked on every online CPU. No functional change intended. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Kai Huang <kai.huang@intel.com> Message-ID: <20240830043600.127750-5-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* | | | Merge tag 'mm-stable-2024-09-20-02-31' of ↵Linus Torvalds2024-09-211-1/+0
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull MM updates from Andrew Morton: "Along with the usual shower of singleton patches, notable patch series in this pull request are: - "Align kvrealloc() with krealloc()" from Danilo Krummrich. Adds consistency to the APIs and behaviour of these two core allocation functions. This also simplifies/enables Rustification. - "Some cleanups for shmem" from Baolin Wang. No functional changes - mode code reuse, better function naming, logic simplifications. - "mm: some small page fault cleanups" from Josef Bacik. No functional changes - code cleanups only. - "Various memory tiering fixes" from Zi Yan. A small fix and a little cleanup. - "mm/swap: remove boilerplate" from Yu Zhao. Code cleanups and simplifications and .text shrinkage. - "Kernel stack usage histogram" from Pasha Tatashin and Shakeel Butt. This is a feature, it adds new feilds to /proc/vmstat such as $ grep kstack /proc/vmstat kstack_1k 3 kstack_2k 188 kstack_4k 11391 kstack_8k 243 kstack_16k 0 which tells us that 11391 processes used 4k of stack while none at all used 16k. Useful for some system tuning things, but partivularly useful for "the dynamic kernel stack project". - "kmemleak: support for percpu memory leak detect" from Pavel Tikhomirov. Teaches kmemleak to detect leaksage of percpu memory. - "mm: memcg: page counters optimizations" from Roman Gushchin. "3 independent small optimizations of page counters". - "mm: split PTE/PMD PT table Kconfig cleanups+clarifications" from David Hildenbrand. Improves PTE/PMD splitlock detection, makes powerpc/8xx work correctly by design rather than by accident. - "mm: remove arch_make_page_accessible()" from David Hildenbrand. Some folio conversions which make arch_make_page_accessible() unneeded. - "mm, memcg: cg2 memory{.swap,}.peak write handlers" fro David Finkel. Cleans up and fixes our handling of the resetting of the cgroup/process peak-memory-use detector. - "Make core VMA operations internal and testable" from Lorenzo Stoakes. Rationalizaion and encapsulation of the VMA manipulation APIs. With a view to better enable testing of the VMA functions, even from a userspace-only harness. - "mm: zswap: fixes for global shrinker" from Takero Funaki. Fix issues in the zswap global shrinker, resulting in improved performance. - "mm: print the promo watermark in zoneinfo" from Kaiyang Zhao. Fill in some missing info in /proc/zoneinfo. - "mm: replace follow_page() by folio_walk" from David Hildenbrand. Code cleanups and rationalizations (conversion to folio_walk()) resulting in the removal of follow_page(). - "improving dynamic zswap shrinker protection scheme" from Nhat Pham. Some tuning to improve zswap's dynamic shrinker. Significant reductions in swapin and improvements in performance are shown. - "mm: Fix several issues with unaccepted memory" from Kirill Shutemov. Improvements to the new unaccepted memory feature, - "mm/mprotect: Fix dax puds" from Peter Xu. Implements mprotect on DAX PUDs. This was missing, although nobody seems to have notied yet. - "Introduce a store type enum for the Maple tree" from Sidhartha Kumar. Cleanups and modest performance improvements for the maple tree library code. - "memcg: further decouple v1 code from v2" from Shakeel Butt. Move more cgroup v1 remnants away from the v2 memcg code. - "memcg: initiate deprecation of v1 features" from Shakeel Butt. Adds various warnings telling users that memcg v1 features are deprecated. - "mm: swap: mTHP swap allocator base on swap cluster order" from Chris Li. Greatly improves the success rate of the mTHP swap allocation. - "mm: introduce numa_memblks" from Mike Rapoport. Moves various disparate per-arch implementations of numa_memblk code into generic code. - "mm: batch free swaps for zap_pte_range()" from Barry Song. Greatly improves the performance of munmap() of swap-filled ptes. - "support large folio swap-out and swap-in for shmem" from Baolin Wang. With this series we no longer split shmem large folios into simgle-page folios when swapping out shmem. - "mm/hugetlb: alloc/free gigantic folios" from Yu Zhao. Nice performance improvements and code reductions for gigantic folios. - "support shmem mTHP collapse" from Baolin Wang. Adds support for khugepaged's collapsing of shmem mTHP folios. - "mm: Optimize mseal checks" from Pedro Falcato. Fixes an mprotect() performance regression due to the addition of mseal(). - "Increase the number of bits available in page_type" from Matthew Wilcox. Increases the number of bits available in page_type! - "Simplify the page flags a little" from Matthew Wilcox. Many legacy page flags are now folio flags, so the page-based flags and their accessors/mutators can be removed. - "mm: store zero pages to be swapped out in a bitmap" from Usama Arif. An optimization which permits us to avoid writing/reading zero-filled zswap pages to backing store. - "Avoid MAP_FIXED gap exposure" from Liam Howlett. Fixes a race window which occurs when a MAP_FIXED operqtion is occurring during an unrelated vma tree walk. - "mm: remove vma_merge()" from Lorenzo Stoakes. Major rotorooting of the vma_merge() functionality, making ot cleaner, more testable and better tested. - "misc fixups for DAMON {self,kunit} tests" from SeongJae Park. Minor fixups of DAMON selftests and kunit tests. - "mm: memory_hotplug: improve do_migrate_range()" from Kefeng Wang. Code cleanups and folio conversions. - "Shmem mTHP controls and stats improvements" from Ryan Roberts. Cleanups for shmem controls and stats. - "mm: count the number of anonymous THPs per size" from Barry Song. Expose additional anon THP stats to userspace for improved tuning. - "mm: finish isolate/putback_lru_page()" from Kefeng Wang: more folio conversions and removal of now-unused page-based APIs. - "replace per-quota region priorities histogram buffer with per-context one" from SeongJae Park. DAMON histogram rationalization. - "Docs/damon: update GitHub repo URLs and maintainer-profile" from SeongJae Park. DAMON documentation updates. - "mm/vdpa: correct misuse of non-direct-reclaim __GFP_NOFAIL and improve related doc and warn" from Jason Wang: fixes usage of page allocator __GFP_NOFAIL and GFP_ATOMIC flags. - "mm: split underused THPs" from Yu Zhao. Improve THP=always policy. This was overprovisioning THPs in sparsely accessed memory areas. - "zram: introduce custom comp backends API" frm Sergey Senozhatsky. Add support for zram run-time compression algorithm tuning. - "mm: Care about shadow stack guard gap when getting an unmapped area" from Mark Brown. Fix up the various arch_get_unmapped_area() implementations to better respect guard areas. - "Improve mem_cgroup_iter()" from Kinsey Ho. Improve the reliability of mem_cgroup_iter() and various code cleanups. - "mm: Support huge pfnmaps" from Peter Xu. Extends the usage of huge pfnmap support. - "resource: Fix region_intersects() vs add_memory_driver_managed()" from Huang Ying. Fix a bug in region_intersects() for systems with CXL memory. - "mm: hwpoison: two more poison recovery" from Kefeng Wang. Teaches a couple more code paths to correctly recover from the encountering of poisoned memry. - "mm: enable large folios swap-in support" from Barry Song. Support the swapin of mTHP memory into appropriately-sized folios, rather than into single-page folios" * tag 'mm-stable-2024-09-20-02-31' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: (416 commits) zram: free secondary algorithms names uprobes: turn xol_area->pages[2] into xol_area->page uprobes: introduce the global struct vm_special_mapping xol_mapping Revert "uprobes: use vm_special_mapping close() functionality" mm: support large folios swap-in for sync io devices mm: add nr argument in mem_cgroup_swapin_uncharge_swap() helper to support large folios mm: fix swap_read_folio_zeromap() for large folios with partial zeromap mm/debug_vm_pgtable: Use pxdp_get() for accessing page table entries set_memory: add __must_check to generic stubs mm/vma: return the exact errno in vms_gather_munmap_vmas() memcg: cleanup with !CONFIG_MEMCG_V1 mm/show_mem.c: report alloc tags in human readable units mm: support poison recovery from copy_present_page() mm: support poison recovery from do_cow_fault() resource, kunit: add test case for region_intersects() resource: make alloc_free_mem_region() works for iomem_resource mm: z3fold: deprecate CONFIG_Z3FOLD vfio/pci: implement huge_fault support mm/arm64: support large pfn mappings mm/x86: support large pfn mappings ...
| * | | | mm: kvmalloc: align kvrealloc() with krealloc()Danilo Krummrich2024-09-021-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Besides the obvious (and desired) difference between krealloc() and kvrealloc(), there is some inconsistency in their function signatures and behavior: - krealloc() frees the memory when the requested size is zero, whereas kvrealloc() simply returns a pointer to the existing allocation. - krealloc() behaves like kmalloc() if a NULL pointer is passed, whereas kvrealloc() does not accept a NULL pointer at all and, if passed, would fault instead. - krealloc() is self-contained, whereas kvrealloc() relies on the caller to provide the size of the previous allocation. Inconsistent behavior throughout allocation APIs is error prone, hence make kvrealloc() behave like krealloc(), which seems superior in all mentioned aspects. Besides that, implementing kvrealloc() by making use of krealloc() and vrealloc() provides oppertunities to grow (and shrink) allocations more efficiently. For instance, vrealloc() can be optimized to allocate and map additional pages to grow the allocation or unmap and free unused pages to shrink the allocation. [dakr@kernel.org: document concurrency restrictions] Link: https://lkml.kernel.org/r/20240725125442.4957-1-dakr@kernel.org [dakr@kernel.org: disable KASAN when switching to vmalloc] Link: https://lkml.kernel.org/r/20240730185049.6244-2-dakr@kernel.org [dakr@kernel.org: properly document __GFP_ZERO behavior] Link: https://lkml.kernel.org/r/20240730185049.6244-5-dakr@kernel.org Link: https://lkml.kernel.org/r/20240722163111.4766-3-dakr@kernel.org Signed-off-by: Danilo Krummrich <dakr@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Acked-by: Vlastimil Babka <vbabka@suse.cz> Cc: Chandan Babu R <chandan.babu@oracle.com> Cc: Christian König <christian.koenig@amd.com> Cc: Christoph Hellwig <hch@infradead.org> Cc: Christoph Lameter <cl@linux.com> Cc: David Rientjes <rientjes@google.com> Cc: Hyeonggon Yoo <42.hyeyoo@gmail.com> Cc: Joonsoo Kim <iamjoonsoo.kim@lge.com> Cc: Kees Cook <kees@kernel.org> Cc: Marc Zyngier <maz@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Miguel Ojeda <ojeda@kernel.org> Cc: Oliver Upton <oliver.upton@linux.dev> Cc: Pekka Enberg <penberg@kernel.org> Cc: Roman Gushchin <roman.gushchin@linux.dev> Cc: Uladzislau Rezki <urezki@gmail.com> Cc: Wedson Almeida Filho <wedsonaf@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* | | | | Merge tag 'for-linus-non-x86' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2024-09-1623-265/+1925
|\ \ \ \ \ | | |/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm updates from Paolo Bonzini: "These are the non-x86 changes (mostly ARM, as is usually the case). The generic and x86 changes will come later" ARM: - New Stage-2 page table dumper, reusing the main ptdump infrastructure - FP8 support - Nested virtualization now supports the address translation (FEAT_ATS1A) family of instructions - Add selftest checks for a bunch of timer emulation corner cases - Fix multiple cases where KVM/arm64 doesn't correctly handle the guest trying to use a GICv3 that wasn't advertised - Remove REG_HIDDEN_USER from the sysreg infrastructure, making things little simpler - Prevent MTE tags being restored by userspace if we are actively logging writes, as that's a recipe for disaster - Correct the refcount on a page that is not considered for MTE tag copying (such as a device) - When walking a page table to split block mappings, synchronize only at the end the walk rather than on every store - Fix boundary check when transfering memory using FFA - Fix pKVM TLB invalidation, only affecting currently out of tree code but worth addressing for peace of mind LoongArch: - Revert qspinlock to test-and-set simple lock on VM. - Add Loongson Binary Translation extension support. - Add PMU support for guest. - Enable paravirt feature control from VMM. - Implement function kvm_para_has_feature(). RISC-V: - Fix sbiret init before forwarding to userspace - Don't zero-out PMU snapshot area before freeing data - Allow legacy PMU access from guest - Fix to allow hpmcounter31 from the guest" * tag 'for-linus-non-x86' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (64 commits) LoongArch: KVM: Implement function kvm_para_has_feature() LoongArch: KVM: Enable paravirt feature control from VMM LoongArch: KVM: Add PMU support for guest KVM: arm64: Get rid of REG_HIDDEN_USER visibility qualifier KVM: arm64: Simplify visibility handling of AArch32 SPSR_* KVM: arm64: Simplify handling of CNTKCTL_EL12 LoongArch: KVM: Add vm migration support for LBT registers LoongArch: KVM: Add Binary Translation extension support LoongArch: KVM: Add VM feature detection function LoongArch: Revert qspinlock to test-and-set simple lock on VM KVM: arm64: Register ptdump with debugfs on guest creation arm64: ptdump: Don't override the level when operating on the stage-2 tables arm64: ptdump: Use the ptdump description from a local context arm64: ptdump: Expose the attribute parsing functionality KVM: arm64: Add memory length checks and remove inline in do_ffa_mem_xfer KVM: arm64: Move pagetable definitions to common header KVM: arm64: nv: Add support for FEAT_ATS1A KVM: arm64: nv: Plumb handling of AT S1* traps from EL2 KVM: arm64: nv: Make AT+PAN instructions aware of FEAT_PAN3 KVM: arm64: nv: Sanitise SCTLR_EL1.EPAN according to VM configuration ...
| * | | | Merge branch kvm-arm64/visibility-cleanups into kvmarm-master/nextMarc Zyngier2024-09-122-44/+22
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * kvm-arm64/visibility-cleanups: : . : Remove REG_HIDDEN_USER from the sysreg infrastructure, making things : a little more simple. From the cover letter: : : "Since 4d4f52052ba8 ("KVM: arm64: nv: Drop EL12 register traps that are : redirected to VNCR") and the admission that KVM would never be supporting : the original FEAT_NV, REG_HIDDEN_USER only had a few users, all of which : could either be replaced by a more ad-hoc mechanism, or removed altogether." : . KVM: arm64: Get rid of REG_HIDDEN_USER visibility qualifier KVM: arm64: Simplify visibility handling of AArch32 SPSR_* KVM: arm64: Simplify handling of CNTKCTL_EL12 Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: Get rid of REG_HIDDEN_USER visibility qualifierMarc Zyngier2024-09-112-26/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that REG_HIDDEN_USER has no direct user anymore, remove it entirely and update all users of sysreg_hidden_user() to call sysreg_hidden() instead. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240904082419.1982402-4-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: Simplify visibility handling of AArch32 SPSR_*Marc Zyngier2024-09-111-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since SPSR_* are not associated with any register in the sysreg array, nor do they have .get_user()/.set_user() helpers, they are invisible to userspace with that encoding. Therefore hidden_user_visibility() serves no purpose here, and can be safely removed. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240904082419.1982402-3-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: Simplify handling of CNTKCTL_EL12Marc Zyngier2024-09-111-10/+13
| | | |_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We go trough a great deal of effort to map CNTKCTL_EL12 to CNTKCTL_EL1 while hidding this mapping from userspace via a special visibility helper. However, it would be far simpler to just provide an accessor doing the mapping job, removing the need for a visibility helper. With that done, we can also remove the EL12_REG() macro which serves no purpose. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240904082419.1982402-2-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
| * | | | Merge branch kvm-arm64/s2-ptdump into kvmarm-master/nextMarc Zyngier2024-09-125-42/+287
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * kvm-arm64/s2-ptdump: : . : Stage-2 page table dumper, reusing the main ptdump infrastructure, : courtesy of Sebastian Ene. From the cover letter: : : "This series extends the ptdump support to allow dumping the guest : stage-2 pagetables. When CONFIG_PTDUMP_STAGE2_DEBUGFS is enabled, ptdump : registers the new following files under debugfs: : - /sys/debug/kvm/<guest_id>/stage2_page_tables : - /sys/debug/kvm/<guest_id>/stage2_levels : - /sys/debug/kvm/<guest_id>/ipa_range : : This allows userspace tools (eg. cat) to dump the stage-2 pagetables by : reading the 'stage2_page_tables' file. : [...]" : . KVM: arm64: Register ptdump with debugfs on guest creation arm64: ptdump: Don't override the level when operating on the stage-2 tables arm64: ptdump: Use the ptdump description from a local context arm64: ptdump: Expose the attribute parsing functionality KVM: arm64: Move pagetable definitions to common header Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: Register ptdump with debugfs on guest creationSebastian Ene2024-09-104-0/+287
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | While arch/*/mem/ptdump handles the kernel pagetable dumping code, introduce KVM/ptdump to show the guest stage-2 pagetables. The separation is necessary because most of the definitions from the stage-2 pagetable reside in the KVM path and we will be invoking functionality specific to KVM. Introduce the PTDUMP_STAGE2_DEBUGFS config. When a guest is created, register a new file entry under the guest debugfs dir which allows userspace to show the contents of the guest stage-2 pagetables when accessed. [maz: moved function prototypes from kvm_host.h to kvm_mmu.h] Signed-off-by: Sebastian Ene <sebastianene@google.com> Reviewed-by: Vincent Donnefort <vdonnefort@google.com> Link: https://lore.kernel.org/r/20240909124721.1672199-6-sebastianene@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: Move pagetable definitions to common headerSebastian Ene2024-09-101-42/+0
| | |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for using the stage-2 definitions in ptdump, move some of these macros in the common header. Signed-off-by: Sebastian Ene <sebastianene@google.com> Link: https://lore.kernel.org/r/20240909124721.1672199-2-sebastianene@google.com Signed-off-by: Marc Zyngier <maz@kernel.org>
| * | | | Merge branch kvm-arm64/nv-at-pan into kvmarm-master/nextMarc Zyngier2024-09-126-27/+1180
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * kvm-arm64/nv-at-pan: : . : Add NV support for the AT family of instructions, which mostly results : in adding a page table walker that deals with most of the complexity : of the architecture. : : From the cover letter: : : "Another task that a hypervisor supporting NV on arm64 has to deal with : is to emulate the AT instruction, because we multiplex all the S1 : translations on a single set of registers, and the guest S2 is never : truly resident on the CPU. : : So given that we lie about page tables, we also have to lie about : translation instructions, hence the emulation. Things are made : complicated by the fact that guest S1 page tables can be swapped out, : and that our shadow S2 is likely to be incomplete. So while using AT : to emulate AT is tempting (and useful), it is not going to always : work, and we thus need a fallback in the shape of a SW S1 walker." : . KVM: arm64: nv: Add support for FEAT_ATS1A KVM: arm64: nv: Plumb handling of AT S1* traps from EL2 KVM: arm64: nv: Make AT+PAN instructions aware of FEAT_PAN3 KVM: arm64: nv: Sanitise SCTLR_EL1.EPAN according to VM configuration KVM: arm64: nv: Add SW walker for AT S1 emulation KVM: arm64: nv: Make ps_to_output_size() generally available KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W} KVM: arm64: nv: Add basic emulation of AT S1E2{R,W} KVM: arm64: nv: Add basic emulation of AT S1E1{R,W}P KVM: arm64: nv: Add basic emulation of AT S1E{0,1}{R,W} KVM: arm64: nv: Honor absence of FEAT_PAN2 KVM: arm64: nv: Turn upper_attr for S2 walk into the full descriptor KVM: arm64: nv: Enforce S2 alignment when contiguous bit is set arm64: Add ESR_ELx_FSC_ADDRSZ_L() helper arm64: Add system register encoding for PSTATE.PAN arm64: Add PAR_EL1 field description arm64: Add missing APTable and TCR_ELx.HPD masks KVM: arm64: Make kvm_at() take an OP_AT_* Signed-off-by: Marc Zyngier <maz@kernel.org> # Conflicts: # arch/arm64/kvm/nested.c
| | * | | | KVM: arm64: nv: Add support for FEAT_ATS1AMarc Zyngier2024-08-303-0/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Handling FEAT_ATS1A (which provides the AT S1E{1,2}A instructions) is pretty easy, as it is just the usual AT without the permission check. This basically amounts to plumbing the instructions in the various dispatch tables, and handling FEAT_ATS1A being disabled in the ID registers. Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Plumb handling of AT S1* traps from EL2Marc Zyngier2024-08-301-0/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hooray, we're done. Plug the AT traps into the system instruction table, and let it rip. Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Make AT+PAN instructions aware of FEAT_PAN3Marc Zyngier2024-08-301-1/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | FEAT_PAN3 added a check for executable permissions to FEAT_PAN2. Add the required SCTLR_ELx.EPAN and descriptor checks to handle this correctly. Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Sanitise SCTLR_EL1.EPAN according to VM configurationMarc Zyngier2024-08-301-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that SCTLR_EL1.EPAN is RES0 when FEAT_PAN3 isn't supported. Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Add SW walker for AT S1 emulationMarc Zyngier2024-08-301-2/+608
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to plug the brokenness of our current AT implementation, we need a SW walker that is going to... err.. walk the S1 tables and tell us what it finds. Of course, it builds on top of our S2 walker, and share similar concepts. The beauty of it is that since it uses kvm_read_guest(), it is able to bring back pages that have been otherwise evicted. This is then plugged in the two AT S1 emulation functions as a "slow path" fallback. I'm not sure it is that slow, but hey. Reviewed-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Make ps_to_output_size() generally availableMarc Zyngier2024-08-301-14/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make this helper visible to at.c, we are going to need it. Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Add emulation of AT S12E{0,1}{R,W}Marc Zyngier2024-08-301-0/+253
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On the face of it, AT S12E{0,1}{R,W} is pretty simple. It is the combination of AT S1E{0,1}{R,W}, followed by an extra S2 walk. However, there is a great deal of complexity coming from combining the S1 and S2 attributes to report something consistent in PAR_EL1. This is an absolute mine field, and I have a splitting headache. Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Add basic emulation of AT S1E2{R,W}Marc Zyngier2024-08-301-0/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to our AT S1E{0,1} emulation, we implement the AT S1E2 handling. This emulation of course suffers from the same problems, but is somehow simpler due to the lack of PAN2 and the fact that we are guaranteed to execute it from the correct context. Co-developed-by: Jintack Lim <jintack.lim@linaro.org> Signed-off-by: Jintack Lim <jintack.lim@linaro.org> Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Add basic emulation of AT S1E1{R,W}PMarc Zyngier2024-08-301-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Building on top of our primitive AT S1E{0,1}{R,W} emulation, add minimal support for the FEAT_PAN2 instructions, momentary context-switching PSTATE.PAN so that it takes effect in the context of the guest. Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Add basic emulation of AT S1E{0,1}{R,W}Marc Zyngier2024-08-302-1/+141
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Emulating AT instructions is one the tasks devolved to the host hypervisor when NV is on. Here, we take the basic approach of emulating AT S1E{0,1}{R,W} using the AT instructions themselves. While this mostly work, it doesn't *always* work: - S1 page tables can be swapped out - shadow S2 can be incomplete and not contain mappings for the S1 page tables We are not trying to handle these case here, and defer it to a later patch. Suitable comments indicate where we are in dire need of better handling. Co-developed-by: Jintack Lim <jintack.lim@linaro.org> Signed-off-by: Jintack Lim <jintack.lim@linaro.org> Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Honor absence of FEAT_PAN2Marc Zyngier2024-08-301-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If our guest has been configured without PAN2, make sure that AT S1E1{R,W}P will generate an UNDEF. Reviewed-by: Anshuman Khandual <anshuman.khandual@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Turn upper_attr for S2 walk into the full descriptorMarc Zyngier2024-08-301-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The upper_attr attribute has been badly named, as it most of the time carries the full "last walked descriptor". Rename it to "desc" and make ti contain the full 64bit descriptor. This will be used by the S1 PTW. Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | KVM: arm64: nv: Enforce S2 alignment when contiguous bit is setMarc Zyngier2024-08-301-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Despite KVM not using the contiguous bit for anything related to TLBs, the spec does require that the alignment defined by the contiguous bit for the page size and the level is enforced. Add the required checks to offset the point where PA and VA merge. Fixes: 61e30b9eef7f ("KVM: arm64: nv: Implement nested Stage-2 page table walk logic") Reported-by: Alexandru Elisei <alexandru.elisei@arm.com> Signed-off-by: Marc Zyngier <maz@kernel.org>
| * | | | | Merge branch kvm-arm64/vgic-sre-traps into kvmarm-master/nextMarc Zyngier2024-09-129-135/+343
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * kvm-arm64/vgic-sre-traps: : . : Fix the multiple of cases where KVM/arm64 doesn't correctly : handle the guest trying to use a GICv3 that isn't advertised. : : From the cover letter: : : "It recently appeared that, when running on a GICv3-equipped platform : (which is what non-ancient arm64 HW has), *not* configuring a GICv3 : for the guest could result in less than desirable outcomes. : : We have multiple issues to fix: : : - for registers that *always* trap (the SGI registers) or that *may* : trap (the SRE register), we need to check whether a GICv3 has been : instantiated before acting upon the trap. : : - for registers that only conditionally trap, we must actively trap : them even in the absence of a GICv3 being instantiated, and handle : those traps accordingly. : : - finally, ID registers must reflect the absence of a GICv3, so that : we are consistent. : : This series goes through all these requirements. The main complexity : here is to apply a GICv3 configuration on the host in the absence of a : GICv3 in the guest. This is pretty hackish, but I don't have a much : better solution so far. : : As part of making wider use of of the trap bits, we fully define the : trap routing as per the architecture, something that we eventually : need for NV anyway." : . KVM: arm64: selftests: Cope with lack of GICv3 in set_id_regs KVM: arm64: Add selftest checking how the absence of GICv3 is handled KVM: arm64: Unify UNDEF injection helpers KVM: arm64: Make most GICv3 accesses UNDEF if they trap KVM: arm64: Honor guest requested traps in GICv3 emulation KVM: arm64: Add trap routing information for ICH_HCR_EL2 KVM: arm64: Add ICH_HCR_EL2 to the vcpu state KVM: arm64: Zero ID_AA64PFR0_EL1.GIC when no GICv3 is presented to the guest KVM: arm64: Add helper for last ditch idreg adjustments KVM: arm64: Force GICv3 trap activation when no irqchip is configured on VHE KVM: arm64: Force SRE traps when SRE access is not enabled KVM: arm64: Move GICv3 trap configuration to kvm_calculate_traps() Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | | KVM: arm64: Unify UNDEF injection helpersMarc Zyngier2024-08-271-81/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently have two helpers (undef_access() and trap_undef()) that do exactly the same thing: inject an UNDEF and return 'false' (as an indication that PC should not be incremented). We definitely could do with one less. Given that undef_access() is used 80ish times, while trap_undef() is only used 30 times, the latter loses the battle and is immediately sacrificed. We also have a large number of instances where undef_access() is open-coded. Let's also convert those. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240827152517.3909653-11-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | | KVM: arm64: Make most GICv3 accesses UNDEF if they trapMarc Zyngier2024-08-273-18/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We don't expect to trap any GICv3 register for host handling, apart from ICC_SRE_EL1 and the SGI registers. If they trap, that's because the guest is playing with us despite being told it doesn't have a GICv3. If it does, UNDEF is what it will get. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240827152517.3909653-10-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | | KVM: arm64: Honor guest requested traps in GICv3 emulationMarc Zyngier2024-08-271-0/+72
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On platforms that require emulation of the CPU interface, we still need to honor the traps requested by the guest (ICH_HCR_EL2 as well as the FGTs for ICC_IGRPEN{0,1}_EL1. Check for these bits early and lail out if any trap applies. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240827152517.3909653-9-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>
| | * | | | | KVM: arm64: Add trap routing information for ICH_HCR_EL2Marc Zyngier2024-08-271-5/+66
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The usual song and dance. Anything that is a trap, any register it traps. Note that we don't handle the registers added by FEAT_NMI for now. Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Link: https://lore.kernel.org/r/20240827152517.3909653-8-maz@kernel.org Signed-off-by: Marc Zyngier <maz@kernel.org>