summaryrefslogtreecommitdiffstats
path: root/arch/powerpc/kvm/book3s.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* KVM: halt_polling: provide a way to qualify wakeups during pollChristian Borntraeger2016-05-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some wakeups should not be considered a sucessful poll. For example on s390 I/O interrupts are usually floating, which means that _ALL_ CPUs would be considered runnable - letting all vCPUs poll all the time for transactional like workload, even if one vCPU would be enough. This can result in huge CPU usage for large guests. This patch lets architectures provide a way to qualify wakeups if they should be considered a good/bad wakeups in regard to polls. For s390 the implementation will fence of halt polling for anything but known good, single vCPU events. The s390 implementation for floating interrupts does a wakeup for one vCPU, but the interrupt will be delivered by whatever CPU checks first for a pending interrupt. We prefer the woken up CPU by marking the poll of this CPU as "good" poll. This code will also mark several other wakeup reasons like IPI or expired timers as "good". This will of course also mark some events as not sucessful. As KVM on z runs always as a 2nd level hypervisor, we prefer to not poll, unless we are really sure, though. This patch successfully limits the CPU usage for cases like uperf 1byte transactional ping pong workload or wakeup heavy workload like OLTP while still providing a proper speedup. This also introduced a new vcpu stat "halt_poll_no_tuning" that marks wakeups that are considered not good for polling. Signed-off-by: Christian Borntraeger <borntraeger@de.ibm.com> Acked-by: Radim Krčmář <rkrcmar@redhat.com> (for an earlier version) Cc: David Matlack <dmatlack@google.com> Cc: Wanpeng Li <kernellwp@gmail.com> [Rename config symbol. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: PPC: Use RCU for arch.spapr_tce_tablesAlexey Kardashevskiy2016-02-161-1/+1
| | | | | | | | | | | | | | At the moment only spapr_tce_tables updates are protected against races but not lookups. This fixes missing protection by using RCU for the list. As lookups also happen in real mode, this uses list_for_each_entry_lockless() (which is expected not to access any vmalloc'd memory). This converts release_spapr_tce_table() to a RCU scheduled handler. Signed-off-by: Alexey Kardashevskiy <aik@ozlabs.ru> Reviewed-by: David Gibson <david@gibson.dropbear.id.au> Signed-off-by: Paul Mackerras <paulus@samba.org>
* kvm: rename pfn_t to kvm_pfn_tDan Williams2016-01-161-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To date, we have implemented two I/O usage models for persistent memory, PMEM (a persistent "ram disk") and DAX (mmap persistent memory into userspace). This series adds a third, DAX-GUP, that allows DAX mappings to be the target of direct-i/o. It allows userspace to coordinate DMA/RDMA from/to persistent memory. The implementation leverages the ZONE_DEVICE mm-zone that went into 4.3-rc1 (also discussed at kernel summit) to flag pages that are owned and dynamically mapped by a device driver. The pmem driver, after mapping a persistent memory range into the system memmap via devm_memremap_pages(), arranges for DAX to distinguish pfn-only versus page-backed pmem-pfns via flags in the new pfn_t type. The DAX code, upon seeing a PFN_DEV+PFN_MAP flagged pfn, flags the resulting pte(s) inserted into the process page tables with a new _PAGE_DEVMAP flag. Later, when get_user_pages() is walking ptes it keys off _PAGE_DEVMAP to pin the device hosting the page range active. Finally, get_page() and put_page() are modified to take references against the device driver established page mapping. Finally, this need for "struct page" for persistent memory requires memory capacity to store the memmap array. Given the memmap array for a large pool of persistent may exhaust available DRAM introduce a mechanism to allocate the memmap from persistent memory. The new "struct vmem_altmap *" parameter to devm_memremap_pages() enables arch_add_memory() to use reserved pmem capacity rather than the page allocator. This patch (of 18): The core has developed a need for a "pfn_t" type [1]. Move the existing pfn_t in KVM to kvm_pfn_t [2]. [1]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002199.html [2]: https://lists.01.org/pipermail/linux-nvdimm/2015-September/002218.html Signed-off-by: Dan Williams <dan.j.williams@intel.com> Acked-by: Christoffer Dall <christoffer.dall@linaro.org> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* KVM: PPC: Book3S: Take the kvm->srcu lock in kvmppc_h_logical_ci_load/store()Thomas Huth2015-09-211-0/+6
| | | | | | | | | | | | | | Access to the kvm->buses (like with the kvm_io_bus_read() and -write() functions) has to be protected via the kvm->srcu lock. The kvmppc_h_logical_ci_load() and -store() functions are missing this lock so far, so let's add it there, too. This fixes the problem that the kernel reports "suspicious RCU usage" when lock debugging is enabled. Cc: stable@vger.kernel.org # v4.1+ Fixes: 99342cf8044420eebdf9297ca03a14cb6a7085a1 Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Paul Mackerras <paulus@samba.org>
* KVM: add halt_attempted_poll to VCPU statsPaolo Bonzini2015-09-161-0/+1
| | | | | | | | | | | | | | | | | This new statistic can help diagnosing VCPUs that, for any reason, trigger bad behavior of halt_poll_ns autotuning. For example, say halt_poll_ns = 480000, and wakeups are spaced exactly like 479us, 481us, 479us, 481us. Then KVM always fails polling and wastes 10+20+40+80+160+320+480 = 1110 microseconds out of every 479+481+479+481+479+481+479 = 3359 microseconds. The VCPU then is consuming about 30% more CPU than it would use without polling. This would show as an abnormally high number of attempted polling compared to the successful polls. Acked-by: Christian Borntraeger <borntraeger@de.ibm.com< Reviewed-by: David Matlack <dmatlack@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2015-09-111-1/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull more kvm updates from Paolo Bonzini: "ARM: - Full debug support for arm64 - Active state switching for timer interrupts - Lazy FP/SIMD save/restore for arm64 - Generic ARMv8 target PPC: - Book3S: A few bug fixes - Book3S: Allow micro-threading on POWER8 x86: - Compiler warnings Generic: - Adaptive polling for guest halt" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (49 commits) kvm: irqchip: fix memory leak kvm: move new trace event outside #ifdef CONFIG_KVM_ASYNC_PF KVM: trace kvm_halt_poll_ns grow/shrink KVM: dynamic halt-polling KVM: make halt_poll_ns per-vCPU Silence compiler warning in arch/x86/kvm/emulate.c kvm: compile process_smi_save_seg_64() only for x86_64 KVM: x86: avoid uninitialized variable warning KVM: PPC: Book3S: Fix typo in top comment about locking KVM: PPC: Book3S: Fix size of the PSPB register KVM: PPC: Book3S HV: Exit on H_DOORBELL if HOST_IPI is set KVM: PPC: Book3S HV: Fix race in starting secondary threads KVM: PPC: Book3S: correct width in XER handling KVM: PPC: Book3S HV: Fix preempted vcore stolen time calculation KVM: PPC: Book3S HV: Fix preempted vcore list locking KVM: PPC: Book3S HV: Implement H_CLEAR_REF and H_CLEAR_MOD KVM: PPC: Book3S HV: Fix bug in dirty page tracking KVM: PPC: Book3S HV: Fix race in reading change bit when removing HPTE KVM: PPC: Book3S HV: Implement dynamic micro-threading on POWER8 KVM: PPC: Book3S HV: Make use of unused threads when running guests ...
| * KVM: PPC: Fix warnings from sparseThomas Huth2015-08-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | When compiling the KVM code for POWER with "make C=1", sparse complains about functions missing proper prototypes and a 64-bit constant missing the ULL prefix. Let's fix this by making the functions static or by including the proper header with the prototypes, and by appending a ULL prefix to the constant PPC_MPPE_ADDRESS_MASK. Signed-off-by: Thomas Huth <thuth@redhat.com> Signed-off-by: Alexander Graf <agraf@suse.de>
* | treewide: Fix typo compatability -> compatibilityLaurent Pinchart2015-08-071-1/+1
|/ | | | | | | | | | Even though 'compatability' has a dedicated entry in the Wiktionary, it's listed as 'Mispelling of compatibility'. Fix it. Signed-off-by: Laurent Pinchart <laurent.pinchart@ideasonboard.com> Acked-by: David S. Miller <davem@davemloft.net> Acked-by: Daniel Vetter <daniel.vetter@ffwll.ch> for the atomic_helper.c Signed-off-by: Jiri Kosina <jkosina@suse.com>
* KVM: add "new" argument to kvm_arch_commit_memory_regionPaolo Bonzini2015-05-281-2/+3
| | | | | | | | | | | | | | This lets the function access the new memory slot without going through kvm_memslots and id_to_memslot. It will simplify the code when more than one address space will be supported. Unfortunately, the "const"ness of the new argument must be casted away in two places. Fixing KVM to accept const struct kvm_memory_slot pointers would require modifications in pretty much all architectures, and is left for later. Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: const-ify uses of struct kvm_userspace_memory_regionPaolo Bonzini2015-05-261-2/+2
| | | | | | | | | | | | | | Architecture-specific helpers are not supposed to muck with struct kvm_userspace_memory_region contents. Add const to enforce this. In order to eliminate the only write in __kvm_set_memory_region, the cleaning of deleted slots is pulled up from update_memslots to __kvm_set_memory_region. Reviewed-by: Takuya Yoshikawa <yoshikawa_takuya_b1@lab.ntt.co.jp> Reviewed-by: Radim Krcmar <rkrcmar@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* kvmppc: Implement H_LOGICAL_CI_{LOAD,STORE} in KVMDavid Gibson2015-04-211-0/+76
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On POWER, storage caching is usually configured via the MMU - attributes such as cache-inhibited are stored in the TLB and the hashed page table. This makes correctly performing cache inhibited IO accesses awkward when the MMU is turned off (real mode). Some CPU models provide special registers to control the cache attributes of real mode load and stores but this is not at all consistent. This is a problem in particular for SLOF, the firmware used on KVM guests, which runs entirely in real mode, but which needs to do IO to load the kernel. To simplify this qemu implements two special hypercalls, H_LOGICAL_CI_LOAD and H_LOGICAL_CI_STORE which simulate a cache-inhibited load or store to a logical address (aka guest physical address). SLOF uses these for IO. However, because these are implemented within qemu, not the host kernel, these bypass any IO devices emulated within KVM itself. The simplest way to see this problem is to attempt to boot a KVM guest from a virtio-blk device with iothread / dataplane enabled. The iothread code relies on an in kernel implementation of the virtio queue notification, which is not triggered by the IO hcalls, and so the guest will stall in SLOF unable to load the guest OS. This patch addresses this by providing in-kernel implementations of the 2 hypercalls, which correctly scan the KVM IO bus. Any access to an address not handled by the KVM IO bus will cause a VM exit, hitting the qemu implementation as before. Note that a userspace change is also required, in order to enable these new hcall implementations with KVM_CAP_PPC_ENABLE_HCALL. Signed-off-by: David Gibson <david@gibson.dropbear.id.au> [agraf: fix compilation] Signed-off-by: Alexander Graf <agraf@suse.de>
* kvm: add halt_poll_ns module parameterPaolo Bonzini2015-02-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch introduces a new module parameter for the KVM module; when it is present, KVM attempts a bit of polling on every HLT before scheduling itself out via kvm_vcpu_block. This parameter helps a lot for latency-bound workloads---in particular I tested it with O_DSYNC writes with a battery-backed disk in the host. In this case, writes are fast (because the data doesn't have to go all the way to the platters) but they cannot be merged by either the host or the guest. KVM's performance here is usually around 30% of bare metal, or 50% if you use cache=directsync or cache=writethrough (these parameters avoid that the guest sends pointless flush requests, and at the same time they are not slow because of the battery-backed cache). The bad performance happens because on every halt the host CPU decides to halt itself too. When the interrupt comes, the vCPU thread is then migrated to a new physical CPU, and in general the latency is horrible because the vCPU thread has to be scheduled back in. With this patch performance reaches 60-65% of bare metal and, more important, 99% of what you get if you use idle=poll in the guest. This means that the tunable gets rid of this particular bottleneck, and more work can be done to improve performance in the kernel or QEMU. Of course there is some price to pay; every time an otherwise idle vCPUs is interrupted by an interrupt, it will poll unnecessarily and thus impose a little load on the host. The above results were obtained with a mostly random value of the parameter (500000), and the load was around 1.5-2.5% CPU usage on one of the host's core for each idle guest vCPU. The patch also adds a new stat, /sys/kernel/debug/kvm/halt_successful_poll, that can be used to tune the parameter. It counts how many HLT instructions received an interrupt during the polling period; each successful poll avoids that Linux schedules the VCPU thread out and back in, and may also avoid a likely trip to C1 and back for the physical CPU. While the VM is idle, a Linux 4 VCPU VM halts around 10 times per second. Of these halts, almost all are failed polls. During the benchmark, instead, basically all halts end within the polling period, except a more or less constant stream of 50 per second coming from vCPUs that are not running the benchmark. The wasted time is thus very low. Things may be slightly different for Windows VMs, which have a ~10 ms timer tick. The effect is also visible on Marcelo's recently-introduced latency test for the TSC deadline timer. Though of course a non-RT kernel has awful latency bounds, the latency of the timer is around 8000-10000 clock cycles compared to 20000-120000 without setting halt_poll_ns. For the TSC deadline timer, thus, the effect is both a smaller average latency and a smaller variance. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* arch: powerpc: kvm: book3s.c: Remove some unused functionsRickard Strandqvist2014-12-171-8/+0
| | | | | | | | | | Removes some functions that are not used anywhere: kvmppc_core_load_guest_debugstate() kvmppc_core_load_host_debugstate() This was partially found by using a static code analysis program called cppcheck. Signed-off-by: Rickard Strandqvist <rickard_strandqvist@spectrumdigital.se> Signed-off-by: Alexander Graf <agraf@suse.de>
* Merge tag 'signed-kvm-ppc-next' of git://github.com/agraf/linux-2.6 into ↵Paolo Bonzini2014-09-241-111/+47
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm-next Patch queue for ppc - 2014-09-24 New awesome things in this release: - E500: e6500 core support - E500: guest and remote debug support - Book3S: remote sw breakpoint support - Book3S: HV: Minor bugfixes Alexander Graf (1): KVM: PPC: Pass enum to kvmppc_get_last_inst Bharat Bhushan (8): KVM: PPC: BOOKE: allow debug interrupt at "debug level" KVM: PPC: BOOKE : Emulate rfdi instruction KVM: PPC: BOOKE: Allow guest to change MSR_DE KVM: PPC: BOOKE: Clear guest dbsr in userspace exit KVM_EXIT_DEBUG KVM: PPC: BOOKE: Guest and hardware visible debug registers are same KVM: PPC: BOOKE: Add one reg interface for DBSR KVM: PPC: BOOKE: Add one_reg documentation of SPRG9 and DBSR KVM: PPC: BOOKE: Emulate debug registers and exception Madhavan Srinivasan (2): powerpc/kvm: support to handle sw breakpoint powerpc/kvm: common sw breakpoint instr across ppc Michael Neuling (1): KVM: PPC: Book3S HV: Add register name when loading toc Mihai Caraman (10): powerpc/booke: Restrict SPE exception handlers to e200/e500 cores powerpc/booke: Revert SPE/AltiVec common defines for interrupt numbers KVM: PPC: Book3E: Increase FPU laziness KVM: PPC: Book3e: Add AltiVec support KVM: PPC: Make ONE_REG powerpc generic KVM: PPC: Move ONE_REG AltiVec support to powerpc KVM: PPC: Remove the tasklet used by the hrtimer KVM: PPC: Remove shared defines for SPE and AltiVec interrupts KVM: PPC: e500mc: Add support for single threaded vcpus on e6500 core KVM: PPC: Book3E: Enable e6500 core Paul Mackerras (2): KVM: PPC: Book3S HV: Increase timeout for grabbing secondary threads KVM: PPC: Book3S HV: Only accept host PVR value for guest PVR
| * powerpc/kvm: support to handle sw breakpointMadhavan Srinivasan2014-09-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | This patch adds kernel side support for software breakpoint. Design is that, by using an illegal instruction, we trap to hypervisor via Emulation Assistance interrupt, where we check for the illegal instruction and accordingly we return to Host or Guest. Patch also adds support for software breakpoint in PR KVM. Signed-off-by: Madhavan Srinivasan <maddy@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
| * KVM: PPC: Remove the tasklet used by the hrtimerMihai Caraman2014-09-221-3/+1
| | | | | | | | | | | | | | | | | | | | Powerpc timer implementation is a copycat version of s390. Now that they removed the tasklet with commit ea74c0ea1b24a6978a6ebc80ba4dbc7b7848b32d follow this optimization. Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Signed-off-by: Bogdan Purcareata <bogdan.purcareata@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
| * KVM: PPC: Move ONE_REG AltiVec support to powerpcMihai Caraman2014-09-221-42/+0
| | | | | | | | | | | | | | Move ONE_REG AltiVec support to powerpc generic layer. Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
| * KVM: PPC: Make ONE_REG powerpc genericMihai Caraman2014-09-221-71/+50
| | | | | | | | | | | | | | | | | | Make ONE_REG generic for server and embedded architectures by moving kvm_vcpu_ioctl_get_one_reg() and kvm_vcpu_ioctl_set_one_reg() functions to powerpc layer. Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
* | kvm: Fix page ageing bugsAndres Lagar-Cavilla2014-09-241-2/+2
|/ | | | | | | | | | | | | | | | | | | | | 1. We were calling clear_flush_young_notify in unmap_one, but we are within an mmu notifier invalidate range scope. The spte exists no more (due to range_start) and the accessed bit info has already been propagated (due to kvm_pfn_set_accessed). Simply call clear_flush_young. 2. We clear_flush_young on a primary MMU PMD, but this may be mapped as a collection of PTEs by the secondary MMU (e.g. during log-dirty). This required expanding the interface of the clear_flush_young mmu notifier, so a lot of code has been trivially touched. 3. In the absence of shadow_accessed_mask (e.g. EPT A bit), we emulate the access bit by blowing the spte. This requires proper synchronizing with MMU notifier consumers, like every other removal of spte's does. Signed-off-by: Andres Lagar-Cavilla <andreslc@google.com> Acked-by: Rik van Riel <riel@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: PPC: Expose helper functions for data/inst faultsAlexander Graf2014-07-281-0/+17
| | | | | | | | | | We're going to implement guest code interpretation in KVM for some rare corner cases. This code needs to be able to inject data and instruction faults into the guest when it encounters them. Expose generic APIs to do this in a reasonably subarch agnostic fashion. Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Move kvmppc_ld/st to common codeAlexander Graf2014-07-281-81/+0
| | | | | | | | We have enough common infrastructure now to resolve GVA->GPA mappings at runtime. With this we can move our book3s specific helpers to load / store in guest virtual address space to common code as well. Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Implement kvmppc_xlate for all targetsAlexander Graf2014-07-281-4/+8
| | | | | | | | | | | We have a nice API to find the translated GPAs of a GVA including protection flags. So far we only use it on Book3S, but there's no reason the same shouldn't be used on BookE as well. Implement a kvmppc_xlate() version for BookE and clean it up to make it more readable in general. Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Book3S: Make kvmppc_ld return a more accurate error indicationPaul Mackerras2014-07-281-13/+12
| | | | | | | | | | | | | | | | | | | | At present, kvmppc_ld calls kvmppc_xlate, and if kvmppc_xlate returns any error indication, it returns -ENOENT, which is taken to mean an HPTE not found error. However, the error could have been a segment found (no SLB entry) or a permission error. Similarly, kvmppc_pte_to_hva currently does permission checking, but any error from it is taken by kvmppc_ld to mean that the access is an emulated MMIO access. Also, kvmppc_ld does no execute permission checking. This fixes these problems by (a) returning any error from kvmppc_xlate directly, (b) moving the permission check from kvmppc_pte_to_hva into kvmppc_ld, and (c) adding an execute permission check to kvmppc_ld. This is similar to what was done for kvmppc_st() by commit 82ff911317c3 ("KVM: PPC: Deflect page write faults properly in kvmppc_st"). Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Allow kvmppc_get_last_inst() to failMihai Caraman2014-07-281-0/+17
| | | | | | | | | | | | | | | | | On book3e, guest last instruction is read on the exit path using load external pid (lwepx) dedicated instruction. This load operation may fail due to TLB eviction and execute-but-not-read entries. This patch lay down the path for an alternative solution to read the guest last instruction, by allowing kvmppc_get_lat_inst() function to fail. Architecture specific implmentations of kvmppc_load_last_inst() may read last guest instruction and instruct the emulation layer to re-execute the guest in case of failure. Make kvmppc_get_last_inst() definition common between architectures. Signed-off-by: Mihai Caraman <mihai.caraman@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Book3S: Make magic page properly 4k mappableAlexander Graf2014-07-281-6/+6
| | | | | | | | | | | | | | | | | | | | The magic page is defined as a 4k page of per-vCPU data that is shared between the guest and the host to accelerate accesses to privileged registers. However, when the host is using 64k page size granularity we weren't quite as strict about that rule anymore. Instead, we partially treated all of the upper 64k as magic page and mapped only the uppermost 4k with the actual magic contents. This works well enough for Linux which doesn't use any memory in kernel space in the upper 64k, but Mac OS X got upset. So this patch makes magic page actually stay in a 4k range even on 64k page size hosts. This patch fixes magic page usage with Mac OS X (using MOL) on 64k PAGE_SIZE hosts for me. Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Book3S: Add hack for split real modeAlexander Graf2014-07-281-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Today we handle split real mode by mapping both instruction and data faults into a special virtual address space that only exists during the split mode phase. This is good enough to catch 32bit Linux guests that use split real mode for copy_from/to_user. In this case we're always prefixed with 0xc0000000 for our instruction pointer and can map the user space process freely below there. However, that approach fails when we're running KVM inside of KVM. Here the 1st level last_inst reader may well be in the same virtual page as a 2nd level interrupt handler. It also fails when running Mac OS X guests. Here we have a 4G/4G split, so a kernel copy_from/to_user implementation can easily overlap with user space addresses. The architecturally correct way to fix this would be to implement an instruction interpreter in KVM that kicks in whenever we go into split real mode. This interpreter however would not receive a great amount of testing and be a lot of bloat for a reasonably isolated corner case. So I went back to the drawing board and tried to come up with a way to make split real mode work with a single flat address space. And then I realized that we could get away with the same trick that makes it work for Linux: Whenever we see an instruction address during split real mode that may collide, we just move it higher up the virtual address space to a place that hopefully does not collide (keep your fingers crossed!). That approach does work surprisingly well. I am able to successfully run Mac OS X guests with KVM and QEMU (no split real mode hacks like MOL) when I apply a tiny timing probe hack to QEMU. I'd say this is a win over even more broken split real mode :). Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Deflect page write faults properly in kvmppc_stAlexander Graf2014-07-281-2/+4
| | | | | | | | | | | When we have a page that we're not allowed to write to, xlate() will already tell us -EPERM on lookup of that page. With the code as is we change it into a "page missing" error which a guest may get confused about. Instead, just tell the caller about the -EPERM directly. This fixes Mac OS X guests when run with DCBZ32 emulation. Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Book3S: Allow only implemented hcalls to be enabled or disabledPaul Mackerras2014-07-281-0/+5
| | | | | | | | | | | | | This adds code to check that when the KVM_CAP_PPC_ENABLE_HCALL capability is used to enable or disable in-kernel handling of an hcall, that the hcall is actually implemented by the kernel. If not an EINVAL error is returned. This also checks the default-enabled list of hcalls and prints a warning if any hcall there is not actually implemented. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: BOOK3S: PR: Emulate instruction counterAneesh Kumar K.V2014-07-281-0/+6
| | | | | | | Writing to IC is not allowed in the privileged mode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: BOOK3S: PR: Emulate virtual timebase registerAneesh Kumar K.V2014-07-281-0/+6
| | | | | | | | | | virtual time base register is a per VM, per cpu register that needs to be saved and restored on vm exit and entry. Writing to VTB is not allowed in the privileged mode. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [agraf: fix compile error] Signed-off-by: Alexander Graf <agraf@suse.de>
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm into nextLinus Torvalds2014-06-041-36/+70
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM updates from Paolo Bonzini: "At over 200 commits, covering almost all supported architectures, this was a pretty active cycle for KVM. Changes include: - a lot of s390 changes: optimizations, support for migration, GDB support and more - ARM changes are pretty small: support for the PSCI 0.2 hypercall interface on both the guest and the host (the latter acked by Catalin) - initial POWER8 and little-endian host support - support for running u-boot on embedded POWER targets - pretty large changes to MIPS too, completing the userspace interface and improving the handling of virtualized timer hardware - for x86, a larger set of changes is scheduled for 3.17. Still, we have a few emulator bugfixes and support for running nested fully-virtualized Xen guests (para-virtualized Xen guests have always worked). And some optimizations too. The only missing architecture here is ia64. It's not a coincidence that support for KVM on ia64 is scheduled for removal in 3.17" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (203 commits) KVM: add missing cleanup_srcu_struct KVM: PPC: Book3S PR: Rework SLB switching code KVM: PPC: Book3S PR: Use SLB entry 0 KVM: PPC: Book3S HV: Fix machine check delivery to guest KVM: PPC: Book3S HV: Work around POWER8 performance monitor bugs KVM: PPC: Book3S HV: Make sure we don't miss dirty pages KVM: PPC: Book3S HV: Fix dirty map for hugepages KVM: PPC: Book3S HV: Put huge-page HPTEs in rmap chain for base address KVM: PPC: Book3S HV: Fix check for running inside guest in global_invalidates() KVM: PPC: Book3S: Move KVM_REG_PPC_WORT to an unused register number KVM: PPC: Book3S: Add ONE_REG register names that were missed KVM: PPC: Add CAP to indicate hcall fixes KVM: PPC: MPIC: Reset IRQ source private members KVM: PPC: Graciously fail broken LE hypercalls PPC: ePAPR: Fix hypercall on LE guest KVM: PPC: BOOK3S: Remove open coded make_dsisr in alignment handler KVM: PPC: BOOK3S: Always use the saved DAR value PPC: KVM: Make NX bit available with magic page KVM: PPC: Disable NX for old magic page using guests KVM: PPC: BOOK3S: HV: Add mixed page-size support for guest ...
| * KVM: PPC: Book3S PR: Expose EBB registersAlexander Graf2014-05-301-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | POWER8 introduces a new facility called the "Event Based Branch" facility. It contains of a few registers that indicate where a guest should branch to when a defined event occurs and it's in PR mode. We don't want to really enable EBB as it will create a big mess with !PR guest mode while hardware is in PR and we don't really emulate the PMU anyway. So instead, let's just leave it at emulation of all its registers. Signed-off-by: Alexander Graf <agraf@suse.de>
| * KVM: PPC: Book3S PR: Expose TAR facility to guestAlexander Graf2014-05-301-0/+6
| | | | | | | | | | | | | | | | | | POWER8 implements a new register called TAR. This register has to be enabled in FSCR and then from KVM's point of view is mere storage. This patch enables the guest to use TAR. Signed-off-by: Alexander Graf <agraf@suse.de>
| * KVM: PPC: Book3S PR: Handle Facility interrupt and FSCRAlexander Graf2014-05-301-0/+10
| | | | | | | | | | | | | | | | | | | | POWER8 introduced a new interrupt type called "Facility unavailable interrupt" which contains its status message in a new register called FSCR. Handle these exits and try to emulate instructions for unhandled facilities. Follow-on patches enable KVM to expose specific facilities into the guest. Signed-off-by: Alexander Graf <agraf@suse.de>
| * KVM: PPC: Make shared struct aka magic page guest endianAlexander Graf2014-05-301-36/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The shared (magic) page is a data structure that contains often used supervisor privileged SPRs accessible via memory to the user to reduce the number of exits we have to take to read/write them. When we actually share this structure with the guest we have to maintain it in guest endianness, because some of the patch tricks only work with native endian load/store operations. Since we only share the structure with either host or guest in little endian on book3s_64 pr mode, we don't have to worry about booke or book3s hv. For booke, the shared struct stays big endian. For book3s_64 hv we maintain the struct in host native endian, since it never gets shared with the guest. For book3s_64 pr we introduce a variable that tells us which endianness the shared struct is in and route every access to it through helper inline functions that evaluate this variable. Signed-off-by: Alexander Graf <agraf@suse.de>
* | KVM: PPC: Book3S: ifdef on CONFIG_KVM_BOOK3S_32_HANDLER for 32bitAlexander Graf2014-04-281-3/+3
|/ | | | | | | | | | | | The book3s_32 target can get built as module which means we don't see the config define for it in code. Instead, check on the bool define CONFIG_KVM_BOOK3S_32_HANDLER whenever we want to know whether we're building for a book3s_32 host. This fixes running book3s_32 kvm as a module for me. Signed-off-by: Alexander Graf <agraf@suse.de> Reviewed-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com>
* KVM: PPC: Store FP/VSX/VMX state in thread_fp/vr_state structuresPaul Mackerras2014-01-091-8/+30
| | | | | | | | | | | This uses struct thread_fp_state and struct thread_vr_state to store the floating-point, VMX/Altivec and VSX state, rather than flat arrays. This makes transferring the state to/from the thread_struct simpler and allows us to unify the get/set_one_reg implementations for the VSX registers. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Add devname:kvm aliases for modulesAlexander Graf2014-01-091-0/+8
| | | | | | | | | | | Systems that support automatic loading of kernel modules through device aliases should try and automatically load kvm when /dev/kvm gets opened. Add code to support that magic for all PPC kvm targets, even the ones that don't support modules yet. Signed-off-by: Alexander Graf <agraf@suse.de>
* kvm: powerpc: book3s: drop is_hv_enabledAneesh Kumar K.V2013-10-171-3/+3
| | | | | | | drop is_hv_enabled, because that should not be a callback property Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
* kvm: powerpc: book3s: Allow the HV and PR selection per virtual machineAneesh Kumar K.V2013-10-171-29/+60
| | | | | | | | | | | | | This moves the kvmppc_ops callbacks to be a per VM entity. This enables us to select HV and PR mode when creating a VM. We also allow both kvm-hv and kvm-pr kernel module to be loaded. To achieve this we move /dev/kvm ownership to kvm.ko module. Depending on which KVM mode we select during VM creation we take a reference count on respective module Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [agraf: fix coding style] Signed-off-by: Alexander Graf <agraf@suse.de>
* kvm: Add struct kvm arg to memslot APIsAneesh Kumar K.V2013-10-171-2/+2
| | | | | | | We will use that in the later patch to find the kvm ops handler Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
* kvm: powerpc: book3s: Support building HV and PR KVM as moduleAneesh Kumar K.V2013-10-171-1/+11
| | | | | | Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [agraf: squash in compile fix] Signed-off-by: Alexander Graf <agraf@suse.de>
* kvm: powerpc: book3s: Add is_hv_enabled to kvmppc_opsAneesh Kumar K.V2013-10-171-0/+44
| | | | | | | | | | | | | | | | | | | | | | | | This help us to identify whether we are running with hypervisor mode KVM enabled. The change is needed so that we can have both HV and PR kvm enabled in the same kernel. If both HV and PR KVM are included, interrupts come in to the HV version of the kvmppc_interrupt code, which then jumps to the PR handler, renamed to kvmppc_interrupt_pr, if the guest is a PR guest. Allowing both PR and HV in the same kernel required some changes to kvm_dev_ioctl_check_extension(), since the values returned now can't be selected with #ifdefs as much as previously. We look at is_hv_enabled to return the right value when checking for capabilities.For capabilities that are only provided by HV KVM, we return the HV value only if is_hv_enabled is true. For capabilities provided by PR KVM but not HV, we return the PR value only if is_hv_enabled is false. NOTE: in later patch we replace is_hv_enabled with a static inline function comparing kvm_ppc_ops Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> Signed-off-by: Alexander Graf <agraf@suse.de>
* kvm: powerpc: Add kvmppc_ops callbackAneesh Kumar K.V2013-10-171-4/+141
| | | | | | | | | | This patch add a new callback kvmppc_ops. This will help us in enabling both HV and PR KVM together in the same kernel. The actual change to enable them together is done in the later patch in the series. Signed-off-by: Aneesh Kumar K.V <aneesh.kumar@linux.vnet.ibm.com> [agraf: squash in booke changes] Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Book3S PR: Better handling of host-side read-only pagesPaul Mackerras2013-10-171-6/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently we request write access to all pages that get mapped into the guest, even if the guest is only loading from the page. This reduces the effectiveness of KSM because it means that we unshare every page we access. Also, we always set the changed (C) bit in the guest HPTE if it allows writing, even for a guest load. This fixes both these problems. We pass an 'iswrite' flag to the mmu.xlate() functions and to kvmppc_mmu_map_page() to indicate whether the access is a load or a store. The mmu.xlate() functions now only set C for stores. kvmppc_gfn_to_pfn() now calls gfn_to_pfn_prot() instead of gfn_to_pfn() so that it can indicate whether we need write access to the page, and get back a 'writable' flag to indicate whether the page is writable or not. If that 'writable' flag is clear, we then make the host HPTE read-only even if the guest HPTE allowed writing. This means that we can get a protection fault when the guest writes to a page that it has mapped read-write but which is read-only on the host side (perhaps due to KSM having merged the page). Thus we now call kvmppc_handle_pagefault() for protection faults as well as HPTE not found faults. In kvmppc_handle_pagefault(), if the access was allowed by the guest HPTE and we thus need to install a new host HPTE, we then need to remove the old host HPTE if there is one. This is done with a new function, kvmppc_mmu_unmap_page(), which uses kvmppc_mmu_pte_vflush() to find and remove the old host HPTE. Since the memslot-related functions require the KVM SRCU read lock to be held, this adds srcu_read_lock/unlock pairs around the calls to kvmppc_handle_pagefault(). Finally, this changes kvmppc_mmu_book3s_32_xlate_pte() to not ignore guest HPTEs that don't permit access, and to return -EPERM for accesses that are not permitted by the page protections. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Book3S: Add GET/SET_ONE_REG interface for VRSAVEPaul Mackerras2013-10-171-0/+10
| | | | | | | | | | | The VRSAVE register value for a vcpu is accessible through the GET/SET_SREGS interface for Book E processors, but not for Book 3S processors. In order to make this accessible for Book 3S processors, this adds a new register identifier for GET/SET_ONE_REG, and adds the code to implement it. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Book3S: Facilities to save/restore XICS presentation ctrler statePaul Mackerras2013-04-261-0/+19
| | | | | | | | | | | | | This adds the ability for userspace to save and restore the state of the XICS interrupt presentation controllers (ICPs) via the KVM_GET/SET_ONE_REG interface. Since there is one ICP per vcpu, we simply define a new 64-bit register in the ONE_REG space for the ICP state. The state includes the CPU priority setting, the pending IPI priority, and the priority and source number of any pending external interrupt. Signed-off-by: Paul Mackerras <paulus@samba.org> Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: Book3S: Add kernel emulation for the XICS interrupt controllerBenjamin Herrenschmidt2013-04-261-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds in-kernel emulation of the XICS (eXternal Interrupt Controller Specification) interrupt controller specified by PAPR, for both HV and PR KVM guests. The XICS emulation supports up to 1048560 interrupt sources. Interrupt source numbers below 16 are reserved; 0 is used to mean no interrupt and 2 is used for IPIs. Internally these are represented in blocks of 1024, called ICS (interrupt controller source) entities, but that is not visible to userspace. Each vcpu gets one ICP (interrupt controller presentation) entity, used to store the per-vcpu state such as vcpu priority, pending interrupt state, IPI request, etc. This does not include any API or any way to connect vcpus to their ICP state; that will be added in later patches. This is based on an initial implementation by Michael Ellerman <michael@ellerman.id.au> reworked by Benjamin Herrenschmidt and Paul Mackerras. Signed-off-by: Benjamin Herrenschmidt <benh@kernel.crashing.org> Signed-off-by: Paul Mackerras <paulus@samba.org> [agraf: fix typo, add dependency on !KVM_MPIC] Signed-off-by: Alexander Graf <agraf@suse.de>
* KVM: PPC: debug stub interface parameter definedBharat Bhushan2013-04-261-0/+6
| | | | | | | | | | | | | This patch defines the interface parameter for KVM_SET_GUEST_DEBUG ioctl support. Follow up patches will use this for setting up hardware breakpoints, watchpoints and software breakpoints. Also kvm_arch_vcpu_ioctl_set_guest_debug() is brought one level below. This is because I am not sure what is required for book3s. So this ioctl behaviour will not change for book3s. Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>
* Added ONE_REG interface for debug instructionBharat Bhushan2013-04-171-0/+6
| | | | | | | | This patch adds the one_reg interface to get the special instruction to be used for setting software breakpoint from userspace. Signed-off-by: Bharat Bhushan <bharat.bhushan@freescale.com> Signed-off-by: Alexander Graf <agraf@suse.de>