diff options
author | Linus Torvalds <torvalds@linux-foundation.org> | 2015-11-06 01:26:26 +0100 |
---|---|---|
committer | Linus Torvalds <torvalds@linux-foundation.org> | 2015-11-06 01:26:26 +0100 |
commit | 933425fb0010bd02bd459b41e63082756818ffce (patch) | |
tree | 1cbc6c2035b9dcff8cb265c9ac562cbee7c6bb82 /Documentation/virtual/kvm | |
parent | Merge tag 'scsi-misc' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi (diff) | |
parent | KVM: VMX: Fix commit which broke PML (diff) | |
download | linux-933425fb0010bd02bd459b41e63082756818ffce.tar.xz linux-933425fb0010bd02bd459b41e63082756818ffce.zip |
Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm
Pull KVM updates from Paolo Bonzini:
"First batch of KVM changes for 4.4.
s390:
A bunch of fixes and optimizations for interrupt and time handling.
PPC:
Mostly bug fixes.
ARM:
No big features, but many small fixes and prerequisites including:
- a number of fixes for the arch-timer
- introducing proper level-triggered semantics for the arch-timers
- a series of patches to synchronously halt a guest (prerequisite
for IRQ forwarding)
- some tracepoint improvements
- a tweak for the EL2 panic handlers
- some more VGIC cleanups getting rid of redundant state
x86:
Quite a few changes:
- support for VT-d posted interrupts (i.e. PCI devices can inject
interrupts directly into vCPUs). This introduces a new
component (in virt/lib/) that connects VFIO and KVM together.
The same infrastructure will be used for ARM interrupt
forwarding as well.
- more Hyper-V features, though the main one Hyper-V synthetic
interrupt controller will have to wait for 4.5. These will let
KVM expose Hyper-V devices.
- nested virtualization now supports VPID (same as PCID but for
vCPUs) which makes it quite a bit faster
- for future hardware that supports NVDIMM, there is support for
clflushopt, clwb, pcommit
- support for "split irqchip", i.e. LAPIC in kernel +
IOAPIC/PIC/PIT in userspace, which reduces the attack surface of
the hypervisor
- obligatory smattering of SMM fixes
- on the guest side, stable scheduler clock support was rewritten
to not require help from the hypervisor"
* tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (123 commits)
KVM: VMX: Fix commit which broke PML
KVM: x86: obey KVM_X86_QUIRK_CD_NW_CLEARED in kvm_set_cr0()
KVM: x86: allow RSM from 64-bit mode
KVM: VMX: fix SMEP and SMAP without EPT
KVM: x86: move kvm_set_irq_inatomic to legacy device assignment
KVM: device assignment: remove pointless #ifdefs
KVM: x86: merge kvm_arch_set_irq with kvm_set_msi_inatomic
KVM: x86: zero apic_arb_prio on reset
drivers/hv: share Hyper-V SynIC constants with userspace
KVM: x86: handle SMBASE as physical address in RSM
KVM: x86: add read_phys to x86_emulate_ops
KVM: x86: removing unused variable
KVM: don't pointlessly leave KVM_COMPAT=y in non-KVM configs
KVM: arm/arm64: Merge vgic_set_lr() and vgic_sync_lr_elrsr()
KVM: arm/arm64: Clean up vgic_retire_lr() and surroundings
KVM: arm/arm64: Optimize away redundant LR tracking
KVM: s390: use simple switch statement as multiplexer
KVM: s390: drop useless newline in debugging data
KVM: s390: SCA must not cross page boundaries
KVM: arm: Do not indent the arguments of DECLARE_BITMAP
...
Diffstat (limited to 'Documentation/virtual/kvm')
-rw-r--r-- | Documentation/virtual/kvm/api.txt | 52 | ||||
-rw-r--r-- | Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt | 187 | ||||
-rw-r--r-- | Documentation/virtual/kvm/devices/arm-vgic.txt | 18 | ||||
-rw-r--r-- | Documentation/virtual/kvm/locking.txt | 12 |
4 files changed, 256 insertions, 13 deletions
diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt index 29ece601008e..092ee9fbaf2b 100644 --- a/Documentation/virtual/kvm/api.txt +++ b/Documentation/virtual/kvm/api.txt @@ -401,10 +401,9 @@ Capability: basic Architectures: x86, ppc, mips Type: vcpu ioctl Parameters: struct kvm_interrupt (in) -Returns: 0 on success, -1 on error +Returns: 0 on success, negative on failure. -Queues a hardware interrupt vector to be injected. This is only -useful if in-kernel local APIC or equivalent is not used. +Queues a hardware interrupt vector to be injected. /* for KVM_INTERRUPT */ struct kvm_interrupt { @@ -414,7 +413,14 @@ struct kvm_interrupt { X86: -Note 'irq' is an interrupt vector, not an interrupt pin or line. +Returns: 0 on success, + -EEXIST if an interrupt is already enqueued + -EINVAL the the irq number is invalid + -ENXIO if the PIC is in the kernel + -EFAULT if the pointer is invalid + +Note 'irq' is an interrupt vector, not an interrupt pin or line. This +ioctl is useful if the in-kernel PIC is not used. PPC: @@ -1598,7 +1604,7 @@ provided event instead of triggering an exit. struct kvm_ioeventfd { __u64 datamatch; __u64 addr; /* legal pio/mmio address */ - __u32 len; /* 1, 2, 4, or 8 bytes */ + __u32 len; /* 0, 1, 2, 4, or 8 bytes */ __s32 fd; __u32 flags; __u8 pad[36]; @@ -1621,6 +1627,10 @@ to the registered address is equal to datamatch in struct kvm_ioeventfd. For virtio-ccw devices, addr contains the subchannel id and datamatch the virtqueue index. +With KVM_CAP_IOEVENTFD_ANY_LENGTH, a zero length ioeventfd is allowed, and +the kernel will ignore the length of guest write and may get a faster vmexit. +The speedup may only apply to specific architectures, but the ioeventfd will +work anyway. 4.60 KVM_DIRTY_TLB @@ -3309,6 +3319,18 @@ Valid values for 'type' are: to ignore the request, or to gather VM memory core dump and/or reset/shutdown of the VM. + /* KVM_EXIT_IOAPIC_EOI */ + struct { + __u8 vector; + } eoi; + +Indicates that the VCPU's in-kernel local APIC received an EOI for a +level-triggered IOAPIC interrupt. This exit only triggers when the +IOAPIC is implemented in userspace (i.e. KVM_CAP_SPLIT_IRQCHIP is enabled); +the userspace IOAPIC should process the EOI and retrigger the interrupt if +it is still asserted. Vector is the LAPIC interrupt vector for which the +EOI was received. + /* Fix the size of the union. */ char padding[256]; }; @@ -3627,6 +3649,26 @@ struct { KVM handlers should exit to userspace with rc = -EREMOTE. +7.5 KVM_CAP_SPLIT_IRQCHIP + +Architectures: x86 +Parameters: args[0] - number of routes reserved for userspace IOAPICs +Returns: 0 on success, -1 on error + +Create a local apic for each processor in the kernel. This can be used +instead of KVM_CREATE_IRQCHIP if the userspace VMM wishes to emulate the +IOAPIC and PIC (and also the PIT, even though this has to be enabled +separately). + +This capability also enables in kernel routing of interrupt requests; +when KVM_CAP_SPLIT_IRQCHIP only routes of KVM_IRQ_ROUTING_MSI type are +used in the IRQ routing table. The first args[0] MSI routes are reserved +for the IOAPIC pins. Whenever the LAPIC receives an EOI for these routes, +a KVM_EXIT_IOAPIC_EOI vmexit will be reported to userspace. + +Fails if VCPU has already been created, or if the irqchip is already in the +kernel (i.e. KVM_CREATE_IRQCHIP has already been called). + 8. Other capabilities. ---------------------- diff --git a/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt new file mode 100644 index 000000000000..38bca2835278 --- /dev/null +++ b/Documentation/virtual/kvm/arm/vgic-mapped-irqs.txt @@ -0,0 +1,187 @@ +KVM/ARM VGIC Forwarded Physical Interrupts +========================================== + +The KVM/ARM code implements software support for the ARM Generic +Interrupt Controller's (GIC's) hardware support for virtualization by +allowing software to inject virtual interrupts to a VM, which the guest +OS sees as regular interrupts. The code is famously known as the VGIC. + +Some of these virtual interrupts, however, correspond to physical +interrupts from real physical devices. One example could be the +architected timer, which itself supports virtualization, and therefore +lets a guest OS program the hardware device directly to raise an +interrupt at some point in time. When such an interrupt is raised, the +host OS initially handles the interrupt and must somehow signal this +event as a virtual interrupt to the guest. Another example could be a +passthrough device, where the physical interrupts are initially handled +by the host, but the device driver for the device lives in the guest OS +and KVM must therefore somehow inject a virtual interrupt on behalf of +the physical one to the guest OS. + +These virtual interrupts corresponding to a physical interrupt on the +host are called forwarded physical interrupts, but are also sometimes +referred to as 'virtualized physical interrupts' and 'mapped interrupts'. + +Forwarded physical interrupts are handled slightly differently compared +to virtual interrupts generated purely by a software emulated device. + + +The HW bit +---------- +Virtual interrupts are signalled to the guest by programming the List +Registers (LRs) on the GIC before running a VCPU. The LR is programmed +with the virtual IRQ number and the state of the interrupt (Pending, +Active, or Pending+Active). When the guest ACKs and EOIs a virtual +interrupt, the LR state moves from Pending to Active, and finally to +inactive. + +The LRs include an extra bit, called the HW bit. When this bit is set, +KVM must also program an additional field in the LR, the physical IRQ +number, to link the virtual with the physical IRQ. + +When the HW bit is set, KVM must EITHER set the Pending OR the Active +bit, never both at the same time. + +Setting the HW bit causes the hardware to deactivate the physical +interrupt on the physical distributor when the guest deactivates the +corresponding virtual interrupt. + + +Forwarded Physical Interrupts Life Cycle +---------------------------------------- + +The state of forwarded physical interrupts is managed in the following way: + + - The physical interrupt is acked by the host, and becomes active on + the physical distributor (*). + - KVM sets the LR.Pending bit, because this is the only way the GICV + interface is going to present it to the guest. + - LR.Pending will stay set as long as the guest has not acked the interrupt. + - LR.Pending transitions to LR.Active on the guest read of the IAR, as + expected. + - On guest EOI, the *physical distributor* active bit gets cleared, + but the LR.Active is left untouched (set). + - KVM clears the LR on VM exits when the physical distributor + active state has been cleared. + +(*): The host handling is slightly more complicated. For some forwarded +interrupts (shared), KVM directly sets the active state on the physical +distributor before entering the guest, because the interrupt is never actually +handled on the host (see details on the timer as an example below). For other +forwarded interrupts (non-shared) the host does not deactivate the interrupt +when the host ISR completes, but leaves the interrupt active until the guest +deactivates it. Leaving the interrupt active is allowed, because Linux +configures the physical GIC with EOIMode=1, which causes EOI operations to +perform a priority drop allowing the GIC to receive other interrupts of the +default priority. + + +Forwarded Edge and Level Triggered PPIs and SPIs +------------------------------------------------ +Forwarded physical interrupts injected should always be active on the +physical distributor when injected to a guest. + +Level-triggered interrupts will keep the interrupt line to the GIC +asserted, typically until the guest programs the device to deassert the +line. This means that the interrupt will remain pending on the physical +distributor until the guest has reprogrammed the device. Since we +always run the VM with interrupts enabled on the CPU, a pending +interrupt will exit the guest as soon as we switch into the guest, +preventing the guest from ever making progress as the process repeats +over and over. Therefore, the active state on the physical distributor +must be set when entering the guest, preventing the GIC from forwarding +the pending interrupt to the CPU. As soon as the guest deactivates the +interrupt, the physical line is sampled by the hardware again and the host +takes a new interrupt if and only if the physical line is still asserted. + +Edge-triggered interrupts do not exhibit the same problem with +preventing guest execution that level-triggered interrupts do. One +option is to not use HW bit at all, and inject edge-triggered interrupts +from a physical device as pure virtual interrupts. But that would +potentially slow down handling of the interrupt in the guest, because a +physical interrupt occurring in the middle of the guest ISR would +preempt the guest for the host to handle the interrupt. Additionally, +if you configure the system to handle interrupts on a separate physical +core from that running your VCPU, you still have to interrupt the VCPU +to queue the pending state onto the LR, even though the guest won't use +this information until the guest ISR completes. Therefore, the HW +bit should always be set for forwarded edge-triggered interrupts. With +the HW bit set, the virtual interrupt is injected and additional +physical interrupts occurring before the guest deactivates the interrupt +simply mark the state on the physical distributor as Pending+Active. As +soon as the guest deactivates the interrupt, the host takes another +interrupt if and only if there was a physical interrupt between injecting +the forwarded interrupt to the guest and the guest deactivating the +interrupt. + +Consequently, whenever we schedule a VCPU with one or more LRs with the +HW bit set, the interrupt must also be active on the physical +distributor. + + +Forwarded LPIs +-------------- +LPIs, introduced in GICv3, are always edge-triggered and do not have an +active state. They become pending when a device signal them, and as +soon as they are acked by the CPU, they are inactive again. + +It therefore doesn't make sense, and is not supported, to set the HW bit +for physical LPIs that are forwarded to a VM as virtual interrupts, +typically virtual SPIs. + +For LPIs, there is no other choice than to preempt the VCPU thread if +necessary, and queue the pending state onto the LR. + + +Putting It Together: The Architected Timer +------------------------------------------ +The architected timer is a device that signals interrupts with level +triggered semantics. The timer hardware is directly accessed by VCPUs +which program the timer to fire at some point in time. Each VCPU on a +system programs the timer to fire at different times, and therefore the +hardware is multiplexed between multiple VCPUs. This is implemented by +context-switching the timer state along with each VCPU thread. + +However, this means that a scenario like the following is entirely +possible, and in fact, typical: + +1. KVM runs the VCPU +2. The guest programs the time to fire in T+100 +3. The guest is idle and calls WFI (wait-for-interrupts) +4. The hardware traps to the host +5. KVM stores the timer state to memory and disables the hardware timer +6. KVM schedules a soft timer to fire in T+(100 - time since step 2) +7. KVM puts the VCPU thread to sleep (on a waitqueue) +8. The soft timer fires, waking up the VCPU thread +9. KVM reprograms the timer hardware with the VCPU's values +10. KVM marks the timer interrupt as active on the physical distributor +11. KVM injects a forwarded physical interrupt to the guest +12. KVM runs the VCPU + +Notice that KVM injects a forwarded physical interrupt in step 11 without +the corresponding interrupt having actually fired on the host. That is +exactly why we mark the timer interrupt as active in step 10, because +the active state on the physical distributor is part of the state +belonging to the timer hardware, which is context-switched along with +the VCPU thread. + +If the guest does not idle because it is busy, the flow looks like this +instead: + +1. KVM runs the VCPU +2. The guest programs the time to fire in T+100 +4. At T+100 the timer fires and a physical IRQ causes the VM to exit + (note that this initially only traps to EL2 and does not run the host ISR + until KVM has returned to the host). +5. With interrupts still disabled on the CPU coming back from the guest, KVM + stores the virtual timer state to memory and disables the virtual hw timer. +6. KVM looks at the timer state (in memory) and injects a forwarded physical + interrupt because it concludes the timer has expired. +7. KVM marks the timer interrupt as active on the physical distributor +7. KVM enables the timer, enables interrupts, and runs the VCPU + +Notice that again the forwarded physical interrupt is injected to the +guest without having actually been handled on the host. In this case it +is because the physical interrupt is never actually seen by the host because the +timer is disabled upon guest return, and the virtual forwarded interrupt is +injected on the KVM guest entry path. diff --git a/Documentation/virtual/kvm/devices/arm-vgic.txt b/Documentation/virtual/kvm/devices/arm-vgic.txt index 3fb905429e8a..59541d49e15c 100644 --- a/Documentation/virtual/kvm/devices/arm-vgic.txt +++ b/Documentation/virtual/kvm/devices/arm-vgic.txt @@ -44,28 +44,29 @@ Groups: Attributes: The attr field of kvm_device_attr encodes two values: bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 | - values: | reserved | cpu id | offset | + values: | reserved | vcpu_index | offset | All distributor regs are (rw, 32-bit) The offset is relative to the "Distributor base address" as defined in the GICv2 specs. Getting or setting such a register has the same effect as - reading or writing the register on the actual hardware from the cpu - specified with cpu id field. Note that most distributor fields are not - banked, but return the same value regardless of the cpu id used to access - the register. + reading or writing the register on the actual hardware from the cpu whose + index is specified with the vcpu_index field. Note that most distributor + fields are not banked, but return the same value regardless of the + vcpu_index used to access the register. Limitations: - Priorities are not implemented, and registers are RAZ/WI - Currently only implemented for KVM_DEV_TYPE_ARM_VGIC_V2. Errors: - -ENODEV: Getting or setting this register is not yet supported + -ENXIO: Getting or setting this register is not yet supported -EBUSY: One or more VCPUs are running + -EINVAL: Invalid vcpu_index supplied KVM_DEV_ARM_VGIC_GRP_CPU_REGS Attributes: The attr field of kvm_device_attr encodes two values: bits: | 63 .... 40 | 39 .. 32 | 31 .... 0 | - values: | reserved | cpu id | offset | + values: | reserved | vcpu_index | offset | All CPU interface regs are (rw, 32-bit) @@ -91,8 +92,9 @@ Groups: - Priorities are not implemented, and registers are RAZ/WI - Currently only implemented for KVM_DEV_TYPE_ARM_VGIC_V2. Errors: - -ENODEV: Getting or setting this register is not yet supported + -ENXIO: Getting or setting this register is not yet supported -EBUSY: One or more VCPUs are running + -EINVAL: Invalid vcpu_index supplied KVM_DEV_ARM_VGIC_GRP_NR_IRQS Attributes: diff --git a/Documentation/virtual/kvm/locking.txt b/Documentation/virtual/kvm/locking.txt index d68af4dc3006..19f94a6b9bb0 100644 --- a/Documentation/virtual/kvm/locking.txt +++ b/Documentation/virtual/kvm/locking.txt @@ -166,3 +166,15 @@ Comment: The srcu read lock must be held while accessing memslots (e.g. MMIO/PIO address->device structure mapping (kvm->buses). The srcu index can be stored in kvm_vcpu->srcu_idx per vcpu if it is needed by multiple functions. + +Name: blocked_vcpu_on_cpu_lock +Type: spinlock_t +Arch: x86 +Protects: blocked_vcpu_on_cpu +Comment: This is a per-CPU lock and it is used for VT-d posted-interrupts. + When VT-d posted-interrupts is supported and the VM has assigned + devices, we put the blocked vCPU on the list blocked_vcpu_on_cpu + protected by blocked_vcpu_on_cpu_lock, when VT-d hardware issues + wakeup notification event since external interrupts from the + assigned devices happens, we will find the vCPU on the list to + wakeup. |