linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	KVM: MMU: fix shrinking page from the empty mmu	Xiao Guangrong	2012-07-03	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Fix: [ 3190.059226] BUG: unable to handle kernel NULL pointer dereference at (null) [ 3190.062224] IP: [<ffffffffa02aac66>] mmu_page_zap_pte+0x10/0xa7 [kvm] [ 3190.063760] PGD 104f50067 PUD 112bea067 PMD 0 [ 3190.065309] Oops: 0000 [#1] SMP DEBUG_PAGEALLOC [ 3190.066860] CPU 1 [ ...... ] [ 3190.109629] Call Trace: [ 3190.111342] [<ffffffffa02aada6>] kvm_mmu_prepare_zap_page+0xa9/0x1fc [kvm] [ 3190.113091] [<ffffffffa02ab2f5>] mmu_shrink+0x11f/0x1f3 [kvm] [ 3190.114844] [<ffffffffa02ab25d>] ? mmu_shrink+0x87/0x1f3 [kvm] [ 3190.116598] [<ffffffff81150c9d>] ? prune_super+0x142/0x154 [ 3190.118333] [<ffffffff8110a4f4>] ? shrink_slab+0x39/0x31e [ 3190.120043] [<ffffffff8110a687>] shrink_slab+0x1cc/0x31e [ 3190.121718] [<ffffffff8110ca1d>] do_try_to_free_pages This is caused by shrinking page from the empty mmu, although we have checked n_used_mmu_pages, it is useless since the check is out of mmu-lock Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
*	KVM: MMU: fix huge page adapted on non-PAE host	Xiao Guangrong	2012-05-28	1	-2/+1
\| \| \| \| \| \| \| \| \|	The huge page size is 4M on non-PAE host, but 2M page size is used in transparent_hugepage_adjust(), so the page we get after adjust the mapping level is not the head page, the BUG_ON() will be triggered Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
*	Merge branch 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm	Linus Torvalds	2012-05-25	13	-408/+649
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull KVM changes from Avi Kivity: "Changes include additional instruction emulation, page-crossing MMIO, faster dirty logging, preventing the watchdog from killing a stopped guest, module autoload, a new MSI ABI, and some minor optimizations and fixes. Outside x86 we have a small s390 and a very large ppc update. Regarding the new (for kvm) rebaseless workflow, some of the patches that were merged before we switch trees had to be rebased, while others are true pulls. In either case the signoffs should be correct now." Fix up trivial conflicts in Documentation/feature-removal-schedule.txt arch/powerpc/kvm/book3s_segment.S and arch/x86/include/asm/kvm_para.h. I suspect the kvm_para.h resolution ends up doing the "do I have cpuid" check effectively twice (it was done differently in two different commits), but better safe than sorry ;) * 'next' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (125 commits) KVM: make asm-generic/kvm_para.h have an ifdef __KERNEL__ block KVM: s390: onereg for timer related registers KVM: s390: epoch difference and TOD programmable field KVM: s390: KVM_GET/SET_ONEREG for s390 KVM: s390: add capability indicating COW support KVM: Fix mmu_reload() clash with nested vmx event injection KVM: MMU: Don't use RCU for lockless shadow walking KVM: VMX: Optimize %ds, %es reload KVM: VMX: Fix %ds/%es clobber KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte() KVM: VMX: unlike vmcs on fail path KVM: PPC: Emulator: clean up SPR reads and writes KVM: PPC: Emulator: clean up instruction parsing kvm/powerpc: Add new ioctl to retreive server MMU infos kvm/book3s: Make kernel emulated H_PUT_TCE available for "PR" KVM KVM: PPC: bookehv: Fix r8/r13 storing in level exception handler KVM: PPC: Book3S: Enable IRQs during exit handling KVM: PPC: Fix PR KVM on POWER7 bare metal KVM: PPC: Fix stbux emulation KVM: PPC: bookehv: Use lwz/stw instead of PPC_LL/PPC_STL for 32-bit fields ...
\| *	KVM: Fix mmu_reload() clash with nested vmx event injection	Avi Kivity	2012-05-16	1	-4/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently the inject_pending_event() call during guest entry happens after kvm_mmu_reload(). This is for historical reasons - we used to inject_pending_event() in atomic context, while kvm_mmu_reload() needs task context. A problem is that nested vmx can cause the mmu context to be reset, if event injection is intercepted and causes a #VMEXIT instead (the #VMEXIT resets CR0/CR3/CR4). If this happens, we end up with invalid root_hpa, and since kvm_mmu_reload() has already run, no one will fix it and we end up entering the guest this way. Fix by reordering event injection to be before kvm_mmu_reload(). Use ->cancel_injection() to undo if kvm_mmu_reload() fails. https://bugzilla.kernel.org/show_bug.cgi?id=42980 Reported-by: Luke-Jr <luke-jr+linuxbugs@utopios.org> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| *	KVM: MMU: Don't use RCU for lockless shadow walking	Avi Kivity	2012-05-16	1	-44/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Using RCU for lockless shadow walking can increase the amount of memory in use by the system, since RCU grace periods are unpredictable. We also have an unconditional write to a shared variable (reader_counter), which isn't good for scaling. Replace that with a scheme similar to x86's get_user_pages_fast(): disable interrupts during lockless shadow walk to force the freer (kvm_mmu_commit_zap_page()) to wait for the TLB flush IPI to find the processor with interrupts enabled. We also add a new vcpu->mode, READING_SHADOW_PAGE_TABLES, to prevent kvm_flush_remote_tlbs() from avoiding the IPI. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| *	KVM: VMX: Optimize %ds, %es reload	Avi Kivity	2012-05-16	1	-5/+31
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	On x86_64, we can defer %ds and %es reload to the heavyweight context switch, since nothing in the lightweight paths uses the host %ds or %es (they are ignored by the processor). Furthermore we can avoid the load if the segments are null, by letting the hardware load the null segments for us. This is the expected case. On i386, we could avoid the reload entirely, since the entry.S paths take care of reload, except for the SYSEXIT path which leaves %ds and %es set to __USER_DS. So we set them to the same values as well. Saves about 70 cycles out of 1600 (around 4%; noisy measurements). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| *	KVM: VMX: Fix %ds/%es clobber	Avi Kivity	2012-05-16	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The vmx exit code unconditionally restores %ds and %es to __USER_DS. This can override the user's values, since %ds and %es are not saved and restored in x86_64 syscalls. In practice, this isn't dangerous since nobody uses segment registers in long mode, least of all programs that use KVM. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| *	KVM: x86 emulator: convert bsf/bsr instructions to emulate_2op_SrcV_nobyte()	Joerg Roedel	2012-05-14	1	-24/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The instruction emulation for bsrw is broken in KVM because the code always uses bsr with 32 or 64 bit operand size for emulation. Fix that by using emulate_2op_SrcV_nobyte() macro to use guest operand size for emulation. Signed-off-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| *	KVM: VMX: unlike vmcs on fail path	Xiao Guangrong	2012-05-14	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	fix: [ 1529.577273] Call Trace: [ 1529.577289] [<ffffffffa060d58f>] kvm_arch_hardware_disable+0x13/0x30 [kvm] [ 1529.577302] [<ffffffffa05fa2d4>] hardware_disable_nolock+0x35/0x39 [kvm] [ 1529.577311] [<ffffffffa05fa29f>] ? cpumask_clear_cpu.constprop.31+0x13/0x13 [kvm] [ 1529.577315] [<ffffffff81096ba8>] on_each_cpu+0x44/0x84 [ 1529.577326] [<ffffffffa05f98b5>] hardware_disable_all_nolock+0x34/0x36 [kvm] [ 1529.577335] [<ffffffffa05f98e2>] hardware_disable_all+0x2b/0x39 [kvm] [ 1529.577349] [<ffffffffa05fafe5>] kvm_put_kvm+0xed/0x10f [kvm] [ 1529.577358] [<ffffffffa05fb3d7>] kvm_vm_release+0x22/0x28 [kvm] Signed-off-by: Xiao Guangrong <xiaoguangrong@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| *	KVM: x86 emulator: Avoid pushing back ModRM byte fetched for group decoding	Takuya Yoshikawa	2012-05-06	1	-5/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Although ModRM byte is fetched for group decoding, it is soon pushed back to make decode_modrm() fetch it later again. Now that ModRM flag can be found in the top level opcode tables, fetch ModRM byte before group decoding to make the code simpler. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
\| *	KVM: x86 emulator: Move ModRM flags for groups to top level opcode tables	Takuya Yoshikawa	2012-05-06	1	-55/+56
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	Needed for the following patch which simplifies ModRM fetching code. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
\| *	KVM: fix cpuid eax for KVM leaf	Michael S. Tsirkin	2012-05-06	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	cpuid eax should return the max leaf so that guests can find out the valid range. This matches Xen et al. Update documentation to match. Tested with -cpu host. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| *	KVM: x86: Run PIT work in own kthread	Jan Kiszka	2012-04-28	2	-14/+24
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We can't run PIT IRQ injection work in the interrupt context of the host timer. This would allow the user to influence the handler complexity by asking for a broadcast to a large number of VCPUs. Therefore, this work was pushed into workqueue context in 9d244caf2e. However, this prevents prioritizing the PIT injection over other task as workqueues share kernel threads. This replaces the workqueue with a kthread worker and gives that thread a name in the format "kvm-pit/<owner-process-pid>". That allows to identify and adjust the kthread priority according to the VM process parameters. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| *	KVM: x86 emulator: fix asm constraint in flush_pending_x87_faults	Avi Kivity	2012-04-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	'bool' wants 8-bit registers. Reported-by: Takuya Yoshikawa <takuya.yoshikawa@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| *	KVM: Introduce bitmask for apic attention reasons	Gleb Natapov	2012-04-24	1	-5/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The patch introduces a bitmap that will hold reasons apic should be checked during vmexit. This is in a preparation for vp eoi patch that will add one more check on vmexit. With the bitmap we can do if(apic_attention) to check everything simultaneously which will add zero overhead on the fast path. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| *	KVM: Introduce direct MSI message injection for in-kernel irqchips	Jan Kiszka	2012-04-24	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, MSI messages can only be injected to in-kernel irqchips by defining a corresponding IRQ route for each message. This is not only unhandy if the MSI messages are generated "on the fly" by user space, IRQ routes are a limited resource that user space has to manage carefully. By providing a direct injection path, we can both avoid using up limited resources and simplify the necessary steps for user land. Signed-off-by: Jan Kiszka <jan.kiszka@siemens.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| *	KVM: Fix page-crossing MMIO	Avi Kivity	2012-04-20	1	-33/+81
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	MMIO that are split across a page boundary are currently broken - the code does not expect to be aborted by the exit to userspace for the first MMIO fragment. This patch fixes the problem by generalizing the current code for handling 16-byte MMIOs to handle a number of "fragments", and changes the MMIO code to create those fragments. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| *	Merge branch 'linus' into queue	Marcelo Tosatti	2012-04-19	2	-10/+13
\| \|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Merge reason: development work has dependency on kvm patches merged upstream. Conflicts: Documentation/feature-removal-schedule.txt Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| * \|	KVM: MMU: use page table level macro	Davidlohr Bueso	2012-04-19	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Its much cleaner to use PT_PAGE_TABLE_LEVEL than its numeric value. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| * \|	KVM: dont clear TMR on EOI	Michael S. Tsirkin	2012-04-17	1	-6/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Intel spec says that TMR needs to be set/cleared when IRR is set, but kvm also clears it on EOI. I did some tests on a real (AMD based) system, and I see same TMR values both before and after EOI, so I think it's a minor bug in kvm. This patch fixes TMR to be set/cleared on IRR set only as per spec. And now that we don't clear TMR, we can save an atomic read of TMR on EOI that's not propagated to ioapic, by checking whether ioapic needs a specific vector first and calculating the mode afterwards. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| * \|	KVM: x86 emulator: implement MMX MOVQ (opcodes 0f 6f, 0f 7f)	Avi Kivity	2012-04-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Needed by some framebuffer drivers. See https://bugzilla.kernel.org/show_bug.cgi?id=42779 Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| * \|	KVM: x86 emulator: MMX support	Avi Kivity	2012-04-17	1	-4/+99
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	General support for the MMX instruction set. Special care is taken to trap pending x87 exceptions so that they are properly reflected to the guest. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| * \|	KVM: x86 emulator: implement movntps	Avi Kivity	2012-04-17	1	-1/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Used to write to framebuffers (by at least Icaros). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| * \|	KVM: x86: emulate movdqa	Stefan Hajnoczi	2012-04-17	1	-8/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	An Ubuntu 9.10 Karmic Koala guest is unable to boot or install due to missing movdqa emulation: kvm_exit: reason EXCEPTION_NMI rip 0x7fef3e025a7b info 7fef3e799000 80000b0e kvm_page_fault: address 7fef3e799000 error_code f kvm_emulate_insn: 0:7fef3e025a7b: 66 0f 7f 07 (prot64) movdqa %xmm0,(%rdi) [avi: mark it explicitly aligned] Signed-off-by: Stefan Hajnoczi <stefanha@linux.vnet.ibm.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| * \|	KVM: x86 emulator: add support for vector alignment	Avi Kivity	2012-04-17	1	-1/+29
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	x86 defines three classes of vector instructions: explicitly aligned (#GP(0) if unaligned, explicitly unaligned, and default (which depends on the encoding: AVX is unaligned, SSE is aligned). Add support for marking an instruction as explicitly aligned or unaligned, and mark MOVDQU as unaligned. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| * \|	KVM: SVM: Auto-load on CPUs with SVM	Josh Triplett	2012-04-17	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Enable x86 feature-based autoloading for the kvm-amd module on CPUs with X86_FEATURE_SVM. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
\| * \|	KVM: MMU: Improve iteration through sptes from rmap	Takuya Yoshikawa	2012-04-08	2	-82/+124
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Iteration using rmap_next(), the actual body is pte_list_next(), is inefficient: every time we call it we start from checking whether rmap holds a single spte or points to a descriptor which links more sptes. In the case of shadow paging, this quadratic total iteration cost is a problem. Even for two dimensional paging, with EPT/NPT on, in which we almost always have a single mapping, the extra checks at the end of the iteration should be eliminated. This patch fixes this by introducing rmap_iterator which keeps the iteration context for the next search. Furthermore the implementation of rmap_next() is splitted into two functions, rmap_get_first() and rmap_get_next(), to avoid repeatedly checking whether the rmap being iterated on has only one spte. Although there seemed to be only a slight change for EPT/NPT, the actual improvement was significant: we observed that GET_DIRTY_LOG for 1GB dirty memory became 15% faster than before. This is probably because the new code is easy to make branch predictions. Note: we just remove pte_list_next() because we can think of parent_ptes as a reverse mapping. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: Make pte_list_desc fit cache lines well	Takuya Yoshikawa	2012-04-08	1	-2/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have PTE_LIST_EXT + 1 pointers in this structure and these 40/20 bytes do not fit cache lines well. Furthermore, some allocators may use 64/32-byte objects for the pte_list_desc cache. This patch solves this problem by changing PTE_LIST_EXT from 4 to 3. For shadow paging, the new size is still large enough to hold both the kernel and process mappings for usual anonymous pages. For file mappings, there may be a slight change in the cache usage. Note: with EPT/NPT we almost always have a single spte in each reverse mapping and we will not see any change by this. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: x86: add paging gcc optimization	Davidlohr Bueso	2012-04-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since most guests will have paging enabled for memory management, add likely() optimization around CR0.PG checks. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: VMX: Auto-load on CPUs with VMX	Josh Triplett	2012-04-08	1	-0/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Enable x86 feature-based autoloading for the kvm-intel module on CPUs with X86_FEATURE_VMX. Signed-off-by: Josh Triplett <josh@joshtriplett.org> Acked-By: Kay Sievers <kay@vrfy.org> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: Switch to srcu-less get_dirty_log()	Takuya Yoshikawa	2012-04-08	1	-73/+43
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We have seen some problems of the current implementation of get_dirty_log() which uses synchronize_srcu_expedited() for updating dirty bitmaps; e.g. it is noticeable that this sometimes gives us ms order of latency when we use VGA displays. Furthermore the recent discussion on the following thread "srcu: Implement call_srcu()" http://lkml.org/lkml/2012/1/31/211 also motivated us to implement get_dirty_log() without SRCU. This patch achieves this goal without sacrificing the performance of both VGA and live migration: in practice the new code is much faster than the old one unless we have too many dirty pages. Implementation: The key part of the implementation is the use of xchg() operation for clearing dirty bits atomically. Since this allows us to update only BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap until every dirty bit is cleared again for the next call. Although some people may worry about the problem of using the atomic memory instruction many times to the concurrently accessible bitmap, it is usually accessed with mmu_lock held and we rarely see concurrent accesses: so what we need to care about is the pure xchg() overheads. Another point to note is that we do not use for_each_set_bit() to check which ones in each BITS_PER_LONG pages are actually dirty. Instead we simply use __ffs() in a loop. This is much faster than repeatedly call find_next_bit(). Performance: The dirty-log-perf unit test showed nice improvements, some times faster than before, except for some extreme cases; for such cases the speed of getting dirty page information is much faster than we process it in the userspace. For real workloads, both VGA and live migration, we have observed pure improvements: when the guest was reading a file during live migration, we originally saw a few ms of latency, but with the new method the latency was less than 200us. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: Avoid checking huge page mappings in get_dirty_log()	Takuya Yoshikawa	2012-04-08	2	-15/+33
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Dropped such mappings when we enabled dirty logging and we will never create new ones until we stop the logging. For this we introduce a new function which can be used to write protect a range of PT level pages: although we do not need to care about a range of pages at this point, the following patch will need this feature to optimize the write protection of many pages. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: MMU: Split the main body of rmap_write_protect() off from others	Takuya Yoshikawa	2012-04-08	1	-26/+27
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We will use this in the following patch to implement another function which needs to write protect pages using the rmap information. Note that there is a small change in debug printing for large pages: we do not differentiate them from others to avoid duplicating code. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: x86: Add ioctl for KVM_KVMCLOCK_CTRL	Eric B Munson	2012-04-08	1	-0/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that we have a flag that will tell the guest it was suspended, create an interface for that communication using a KVM ioctl. Signed-off-by: Eric B Munson <emunson@mgebm.net> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: Factor out kvm_vcpu_kick to arch-generic code	Christoffer Dall	2012-04-08	1	-14/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The kvm_vcpu_kick function performs roughly the same funcitonality on most all architectures, so we shouldn't have separate copies. PowerPC keeps a pointer to interchanging waitqueues on the vcpu_arch structure and to accomodate this special need a __KVM_HAVE_ARCH_VCPU_GET_WQ define and accompanying function kvm_arch_vcpu_wq have been defined. For all other architectures this is a generic inline that just returns &vcpu->wq; Acked-by: Scott Wood <scottwood@freescale.com> Signed-off-by: Christoffer Dall <c.dall@virtualopensystems.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: SVM: count all irq windows exit	Jason Wang	2012-04-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Also count the exits of fast-path. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| * \|	KVM: x86: expose Intel cpu new features (HLE, RTM) to guest	Liu, Jinsong	2012-04-08	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Intel recently release 2 new features, HLE and RTM. Refer to http://software.intel.com/file/41417. This patch expose them to guest. Signed-off-by: Liu, Jinsong <jinsong.liu@intel.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* \| \|	KVM: ensure async PF event wakes up vcpu from halt	Gleb Natapov	2012-05-06	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If vcpu executes hlt instruction while async PF is waiting to be delivered vcpu can block and deliver async PF only after another even wakes it up. This happens because kvm_check_async_pf_completion() will remove completion event from vcpu->async_pf.done before entering kvm_vcpu_block() and this will make kvm_arch_vcpu_runnable() return false. The solution is to make vcpu runnable when processing completion. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* \| \|	kill mm argument of vm_munmap()	Al Viro	2012-04-21	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	it's always current->mm Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* \| \|	VM: add "vm_mmap()" helper function	Linus Torvalds	2012-04-21	1	-3/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This continues the theme started with vm_brk() and vm_munmap(): vm_mmap() does the same thing as do_mmap(), but additionally does the required VM locking. This uninlines (and rewrites it to be clearer) do_mmap(), which sadly duplicates it in mm/mmap.c and mm/nommu.c. But that way we don't have to export our internal do_mmap_pgoff() function. Some day we hopefully don't have to export do_mmap() either, if all modular users can become the simpler vm_mmap() instead. We're actually very close to that already, with the notable exception of the (broken) use in i810, and a couple of stragglers in binfmt_elf. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* \| \|	VM: add "vm_munmap()" helper function	Linus Torvalds	2012-04-21	1	-3/+1
\| \|/ \|/\| \| \| \| \| \| \| \| \| \| \|	Like the vm_brk() function, this is the same as "do_munmap()", except it does the VM locking for the caller. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* \|	KVM: VMX: Fix kvm_set_shared_msr() called in preemptible context	Avi Kivity	2012-04-19	1	-1/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	kvm_set_shared_msr() may not be called in preemptible context, but vmx_set_msr() does so: BUG: using smp_processor_id() in preemptible [00000000] code: qemu-kvm/22713 caller is kvm_set_shared_msr+0x32/0xa0 [kvm] Pid: 22713, comm: qemu-kvm Not tainted 3.4.0-rc3+ #39 Call Trace: [<ffffffff8131fa82>] debug_smp_processor_id+0xe2/0x100 [<ffffffffa0328ae2>] kvm_set_shared_msr+0x32/0xa0 [kvm] [<ffffffffa03a103b>] vmx_set_msr+0x28b/0x2d0 [kvm_intel] ... Making kvm_set_shared_msr() work in preemptible is cleaner, but it's used in the fast path. Making two variants is overkill, so this patch just disables preemption around the call. Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* \|	KVM: PMU emulation: GLOBAL_CTRL MSR should be enabled on reset	Gleb Natapov	2012-04-10	1	-9/+9
\|/ \| \| \| \| \| \|	On reset all MPU counters should be enabled in GLOBAL_CTRL MSR. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
*	KVM: VMX: vmx_set_cr0 expects kvm->srcu locked	Marcelo Tosatti	2012-04-05	1	-0/+2
\| \| \| \| \| \| \| \|	vmx_set_cr0 is called from vcpu run context, therefore it expects kvm->srcu to be held (for setting up the real-mode TSS). Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
*	KVM: PMU: Fix integer constant is too large warning in kvm_pmu_set_msr()	Sasikantha babu	2012-04-05	1	-1/+1
\| \| \| \| \|	Signed-off-by: Sasikantha babu <sasikanth.v19@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
*	Merge branch 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm	Linus Torvalds	2012-03-28	11	-206/+595
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Pull kvm updates from Avi Kivity: "Changes include timekeeping improvements, support for assigning host PCI devices that share interrupt lines, s390 user-controlled guests, a large ppc update, and random fixes." This is with the sign-off's fixed, hopefully next merge window we won't have rebased commits. * 'kvm-updates/3.4' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (130 commits) KVM: Convert intx_mask_lock to spin lock KVM: x86: fix kvm_write_tsc() TSC matching thinko x86: kvmclock: abstract save/restore sched_clock_state KVM: nVMX: Fix erroneous exception bitmap check KVM: Ignore the writes to MSR_K7_HWCR(3) KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask KVM: PMU: add proper support for fixed counter 2 KVM: PMU: Fix raw event check KVM: PMU: warn when pin control is set in eventsel msr KVM: VMX: Fix delayed load of shared MSRs KVM: use correct tlbs dirty type in cmpxchg KVM: Allow host IRQ sharing for assigned PCI 2.3 devices KVM: Ensure all vcpus are consistent with in-kernel irqchip settings KVM: x86 emulator: Allow PM/VM86 switch during task switch KVM: SVM: Fix CPL updates KVM: x86 emulator: VM86 segments must have DPL 3 KVM: x86 emulator: Fix task switch privilege checks arch/powerpc/kvm/book3s_hv.c: included linux/sched.h twice KVM: x86 emulator: correctly mask pmc index bits in RDPMC instruction emulation KVM: mmu_notifier: Flush TLBs before releasing mmu_lock ...
\| *	KVM: x86: fix kvm_write_tsc() TSC matching thinko	Marcelo Tosatti	2012-03-20	1	-9/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	kvm_write_tsc() converts from guest TSC to microseconds, not nanoseconds as intended. The result is that the window for matching is 1000 seconds, not 1 second. Microsecond precision is enough for checking whether the TSC write delta is within the heuristic values, so use it instead of nanoseconds. Noted by Avi Kivity. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| *	KVM: nVMX: Fix erroneous exception bitmap check	Nadav Har'El	2012-03-08	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The code which checks whether to inject a pagefault to L1 or L2 (in nested VMX) was wrong, incorrect in how it checked the PF_VECTOR bit. Thanks to Dan Carpenter for spotting this. Signed-off-by: Nadav Har'El <nyh@il.ibm.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| *	KVM: Ignore the writes to MSR_K7_HWCR(3)	Nicolae Mogoreanu	2012-03-08	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When CPUID Fn8000_0001_EAX reports 0x00100f22 Windows 7 x64 guest tries to set bit 3 in MSRC001_0015 in nt!KiDisableCacheErrataSource and fails. This patch will ignore this step and allow things to move on without having to fake CPUID value. Signed-off-by: Nicolae Mogoreanu <mogoreanu@gmail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
\| *	KVM: MMU: make use of ->root_level in reset_rsvds_bits_mask	Davidlohr Bueso	2012-03-08	1	-16/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The reset_rsvds_bits_mask() function can use the guest walker's root level number instead of using a separate 'level' variable. Signed-off-by: Davidlohr Bueso <dave@gnu.org> Signed-off-by: Avi Kivity <avi@redhat.com>