summaryrefslogtreecommitdiffstats
path: root/arch/x86 (follow)
Commit message (Collapse)AuthorAgeFilesLines
* KVM: emulator: optimize "rep ins" handlingGleb Natapov2012-09-062-6/+31
| | | | | | | | | | Optimize "rep ins" by allowing emulator to write back more than one datum at a time. Introduce new operand type OP_MEM_STR which tells writeback() that dst contains pointer to an array that should be written back as opposite to just one data element. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: emulator: string_addr_inc() cleanupGleb Natapov2012-09-061-7/+4
| | | | | | | | Remove unneeded segment argument. Address structure already has correct segment which was put there during decode. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: emulator: make x86 emulation modes enum instead of definesGleb Natapov2012-09-062-13/+13
| | | | | Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: Provide userspace IO exit completion callbackGleb Natapov2012-09-062-37/+57
| | | | | | | | | | Current code assumes that IO exit was due to instruction emulation and handles execution back to emulator directly. This patch adds new userspace IO exit completion callback that can be set by any other code that caused IO exit to userspace. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: move postcommit flush to x86, as mmio sptes are x86 specificMarcelo Tosatti2012-09-061-0/+8
| | | | | | | | | Other arches do not need this. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> v2: fix incorrect deletion of mmio sptes on gpa move (noticed by Takuya) Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: split kvm_arch_flush_shadowMarcelo Tosatti2012-09-061-1/+7
| | | | | | | | Introducing kvm_arch_flush_shadow_memslot, to invalidate the translations of a single memory slot. Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: SVM: constify lookup tablesMathias Krause2012-09-051-4/+4
| | | | | | | | | | We never modify direct_access_msrs[], msrpm_ranges[], svm_exit_handlers[] or x86_intercept_map[] at runtime. Mark them r/o. Signed-off-by: Mathias Krause <minipli@googlemail.com> Cc: Joerg Roedel <joerg.roedel@amd.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: VMX: constify lookup tablesMathias Krause2012-09-051-7/+7
| | | | | | | | We use vmcs_field_to_offset_table[], kvm_vmx_segment_fields[] and kvm_vmx_exit_handlers[] as lookup tables only -- make them r/o. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: x86: more constificationMathias Krause2012-09-052-2/+2
| | | | | Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: x86: constify read_write_emulator_opsMathias Krause2012-09-051-4/+4
| | | | | | | We never change those, make them r/o. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: x86 emulator: constify emulate_opsMathias Krause2012-09-053-13/+13
| | | | | | | We never change emulate_ops[] at runtime so it should be r/o. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: x86 emulator: mark opcode tables constMathias Krause2012-09-051-20/+20
| | | | | | | The opcode tables never change at runtime, therefor mark them const. Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: x86 emulator: use aligned variants of SSE register opsMathias Krause2012-09-051-32/+32
| | | | | | | | | As the the compiler ensures that the memory operand is always aligned to a 16 byte memory location, use the aligned variant of MOVDQ for read_sse_reg() and write_sse_reg(). Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: x86: minor size optimizationMathias Krause2012-09-051-6/+6
| | | | | | | | | | | | | | Some fields can be constified and/or made static to reduce code and data size. Numbers for a 32 bit build: text data bss dec hex filename before: 3351 80 0 3431 d67 cpuid.o after: 3391 0 0 3391 d3f cpuid.o Signed-off-by: Mathias Krause <minipli@googlemail.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: cleanup pic resetGleb Natapov2012-09-041-41/+11
| | | | | | | | kvm_pic_reset() is not used anywhere. Move reset logic from pic_ioport_write() there. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
* KVM: x86: remove unused variable from kvm_task_switch()Marcelo Tosatti2012-08-301-1/+0
| | | | Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: VMX: Ignore segment G and D bits when considering whether we can virtualizeAvi Kivity2012-08-281-1/+1
| | | | | | | | | We will enter the guest with G and D cleared; as real hardware ignores D in real mode, and G is taken care of by the limit test, we allow more code to run in vm86 mode. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: VMX: Save all segment data in real modeAvi Kivity2012-08-281-0/+1
| | | | | Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: VMX: Preserve segment limit and access rights in real modeAvi Kivity2012-08-281-0/+3
| | | | | | | | | While this is undocumented, real processors do not reload the segment limit and access rights when loading a segment register in real mode. Real programs rely on it so we need to comply with this behaviour. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: VMX: Return real real-mode segment data even if ↵Avi Kivity2012-08-281-2/+1
| | | | | | | | | | emulate_invalid_guest_state=1 emulate_invalid_guest_state=1 doesn't mean we don't munge the segments in the vmcs; we do. So we need to return the real ones (maintained by vmx_set_segment). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: x86 emulator: Fix #GP error code during linearizationAvi Kivity2012-08-281-2/+2
| | | | | | | We want the segment selector, nor segment number. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: x86 emulator: Check segment limits in real mode tooAvi Kivity2012-08-281-3/+4
| | | | | | | Segment limits are verified in real mode, not just protected mode. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: x86 emulator: Leave segment limit and attributs alone in real modeAvi Kivity2012-08-281-6/+2
| | | | | | | | | | When loading a segment in real mode, only the base and selector must be modified. The limit needs to be left alone, otherwise big real mode users will hit a #GP due to limit checking (currently this is suppressed because we don't check limits in real mode). Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: VMX: Allow vm86 virtualization of big real modeAvi Kivity2012-08-281-1/+1
| | | | | | | | | | | | | | Usually, big real mode uses large (4GB) segments. Currently we don't virtualize this; if any segment has a limit other than 0xffff, we emulate. But if we set the vmx-visible limit to 0xffff, we can use vm86 to virtualize real mode; if an access overruns the segment limit, the guest will #GP, which we will trap and forward to the emulator. This results in significantly faster execution, and less risk of hitting an unemulated instruction. If the limit is less than 0xffff, we retain the existing behaviour. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: VMX: Allow real mode emulation using vm86 with dpl=0Avi Kivity2012-08-281-1/+1
| | | | | | | | | Real mode is always entered from protected mode with dpl=0. Since the dpl doesn't affect execution, and we already override it to 3 in the vmcs (as vmx requires), we can allow execution in that state. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: VMX: Retain limit and attributes when entering protected modeAvi Kivity2012-08-281-6/+6
| | | | | | | | Real processors don't change segment limits and attributes while in real mode. Mimic that behaviour. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: VMX: Use kvm_segment to save protected-mode segments when entering realmodeAvi Kivity2012-08-281-61/+24
| | | | | | | | | | | Instead of using struct kvm_save_segment, use struct kvm_segment, which is what the other APIs use. This leads to some simplification. We replace save_rmode_seg() with a call to vmx_save_segment(). Since this depends on rmode.vm86_active, we move the call to before setting the flag. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: VMX: Fix incorrect lookup of segment S flag in fix_pmode_dataseg()Avi Kivity2012-08-281-2/+2
| | | | | | | fix_pmode_dataseg() looks up S in ->base instead of ->ar_bytes. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: VMX: Separate saving pre-realmode state from setting segmentsAvi Kivity2012-08-281-7/+13
| | | | | | | | | | | | | | | Commit b246dd5df139 ("KVM: VMX: Fix KVM_SET_SREGS with big real mode segments") moved fix_rmode_seg() to vmx_set_segment(), so that it is applied not just on transitions to real mode, but also on KVM_SET_SREGS (migration). However fix_rmode_seg() not only munges the vmcs segments, it also sets up the save area for us to restore when returning to protected mode or to return in vmx_get_segment(). Move saving the segment into a new function, save_rmode_seg(), and call it just during the transition. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* KVM: x86 emulator: access GPRs on demandAvi Kivity2012-08-273-144/+220
| | | | | | | | | | Instead of populating the entire register file, read in registers as they are accessed, and write back only the modified ones. This saves a VMREAD and VMWRITE on Intel (for rsp, since it is not usually used during emulation), and a two 128-byte copies for the registers. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
* Merge remote-tracking branch 'upstream/master' into queueMarcelo Tosatti2012-08-2623-293/+385
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Merging critical fixes from upstream required for development. * upstream/master: (809 commits) libata: Add a space to " 2GB ATA Flash Disk" DMA blacklist entry Revert "powerpc: Update g5_defconfig" powerpc/perf: Use pmc_overflow() to detect rolled back events powerpc: Fix VMX in interrupt check in POWER7 copy loops powerpc: POWER7 copy_to_user/copy_from_user patch applied twice powerpc: Fix personality handling in ppc64_personality() powerpc/dma-iommu: Fix IOMMU window check powerpc: Remove unnecessary ifdefs powerpc/kgdb: Restore current_thread_info properly powerpc/kgdb: Bail out of KGDB when we've been triggered powerpc/kgdb: Do not set kgdb_single_step on ppc powerpc/mpic_msgr: Add missing includes powerpc: Fix null pointer deref in perf hardware breakpoints powerpc: Fixup whitespace in xmon powerpc: Fix xmon dl command for new printk implementation xfs: check for possible overflow in xfs_ioc_trim xfs: unlock the AGI buffer when looping in xfs_dialloc xfs: fix uninitialised variable in xfs_rtbuf_get() powerpc/fsl: fix "Failed to mount /dev: No such device" errors powerpc/fsl: update defconfigs ... Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| * Merge tag 'stable/for-linus-3.6-rc3-tag' of ↵Linus Torvalds2012-08-265-113/+113
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen Pull three xen bug-fixes from Konrad Rzeszutek Wilk: - Revert the kexec fix which caused on non-kexec shutdowns a race. - Reuse existing P2M leafs - instead of requiring to allocate a large area of bootup virtual address estate. - Fix a one-off error when adding PFNs for balloon pages. * tag 'stable/for-linus-3.6-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/konrad/xen: xen/setup: Fix one-off error when adding for-balloon PFNs to the P2M. xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID. Revert "xen PVonHVM: move shared_info to MMIO before kexec"
| | * xen/setup: Fix one-off error when adding for-balloon PFNs to the P2M.Konrad Rzeszutek Wilk2012-08-231-1/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we are finished with return PFNs to the hypervisor, then populate it back, and also mark the E820 MMIO and E820 gaps as IDENTITY_FRAMEs, we then call P2M to set areas that can be used for ballooning. We were off by one, and ended up over-writting a P2M entry that most likely was an IDENTITY_FRAME. For example: 1-1 mapping on 40000->40200 1-1 mapping on bc558->bc5ac 1-1 mapping on bc5b4->bc8c5 1-1 mapping on bc8c6->bcb7c 1-1 mapping on bcd00->100000 Released 614 pages of unused memory Set 277889 page(s) to 1-1 mapping Populating 40200-40466 pfn range: 614 pages added => here we set from 40466 up to bc559 P2M tree to be INVALID_P2M_ENTRY. We should have done it up to bc558. The end result is that if anybody is trying to construct a PTE for PFN bc558 they end up with ~PAGE_PRESENT. CC: stable@vger.kernel.org Reported-by-and-Tested-by: Andre Przywara <andre.przywara@amd.com> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * xen/p2m: Reuse existing P2M leafs if they are filled with 1:1 PFNs or INVALID.Konrad Rzeszutek Wilk2012-08-171-3/+92
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If P2M leaf is completly packed with INVALID_P2M_ENTRY or with 1:1 PFNs (so IDENTITY_FRAME type PFNs), we can swap the P2M leaf with either a p2m_missing or p2m_identity respectively. The old page (which was created via extend_brk or was grafted on from the mfn_list) can be re-used for setting new PFNs. This also means we can remove git commit: 5bc6f9888db5739abfa0cae279b4b442e4db8049 xen/p2m: Reserve 8MB of _brk space for P2M leafs when populating back which tried to fix this. and make the amount that is required to be reserved much smaller. CC: stable@vger.kernel.org # for 3.5 only. Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| | * Revert "xen PVonHVM: move shared_info to MMIO before kexec"Konrad Rzeszutek Wilk2012-08-163-109/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 00e37bdb0113a98408de42db85be002f21dbffd3. During shutdown of PVHVM guests with more than 2VCPUs on certain machines we can hit the race where the replaced shared_info is not replaced fast enough and the PV time clock retries reading the same area over and over without any any success and is stuck in an infinite loop. Acked-by: Olaf Hering <olaf@aepfle.de> Signed-off-by: Konrad Rzeszutek Wilk <konrad.wilk@oracle.com>
| * | Merge git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2012-08-263-14/+31
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Marcelo Tosatti. * git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: x86 emulator: use stack size attribute to mask rsp in stack ops KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intended ppc: e500_tlb memset clears nothing KVM: PPC: Add cache flush on page map KVM: PPC: Book3S HV: Fix incorrect branch in H_CEDE code KVM: x86: update KVM_SAVE_MSRS_BEGIN to correct value
| | * | KVM: x86 emulator: use stack size attribute to mask rsp in stack opsAvi Kivity2012-08-221-9/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sub-register used to access the stack (sp, esp, or rsp) is not determined by the address size attribute like other memory references, but by the stack segment's B bit (if not in x86_64 mode). Fix by using the existing stack_mask() to figure out the correct mask. This long-existing bug was exposed by a combination of a27685c33acccce (emulate invalid guest state by default), which causes many more instructions to be emulated, and a seabios change (possibly a bug) which causes the high 16 bits of esp to become polluted across calls to real mode software interrupts. Signed-off-by: Avi Kivity <avi@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com>
| | * | KVM: MMU: Fix mmu_shrink() so that it can free mmu pages as intendedTakuya Yoshikawa2012-08-221-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although the possible race described in commit 85b7059169e128c57a3a8a3e588fb89cb2031da1 KVM: MMU: fix shrinking page from the empty mmu was correct, the real cause of that issue was a more trivial bug of mmu_shrink() introduced by commit 1952639665e92481c34c34c3e2a71bf3e66ba362 KVM: MMU: do not iterate over all VMs in mmu_shrink() Here is the bug: if (kvm->arch.n_used_mmu_pages > 0) { if (!nr_to_scan--) break; continue; } We skip VMs whose n_used_mmu_pages is not zero and try to shrink others: in other words we try to shrink empty ones by mistake. This patch reverses the logic so that mmu_shrink() can free pages from the first VM whose n_used_mmu_pages is not zero. Note that we also add comments explaining the role of nr_to_scan which is not practically important now, hoping this will be improved in the future. Signed-off-by: Takuya Yoshikawa <yoshikawa.takuya@oss.ntt.co.jp> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
| | * | KVM: x86: update KVM_SAVE_MSRS_BEGIN to correct valueGleb Natapov2012-08-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When MSR_KVM_PV_EOI_EN was added to msrs_to_save array KVM_SAVE_MSRS_BEGIN was not updated accordingly. Signed-off-by: Gleb Natapov <gleb@redhat.com> Signed-off-by: Marcelo Tosatti <mtosatti@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com>
| * | | x86, microcode, AMD: Fix broken ucode patch size checkAndreas Herrmann2012-08-231-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This issue was recently observed on an AMD C-50 CPU where a patch of maximum size was applied. Commit be62adb49294 ("x86, microcode, AMD: Simplify ucode verification") added current_size in get_matching_microcode(). This is calculated as size of the ucode patch + 8 (ie. size of the header). Later this is compared against the maximum possible ucode patch size for a CPU family. And of course this fails if the patch has already maximum size. Cc: <stable@vger.kernel.org> [3.3+] Signed-off-by: Andreas Herrmann <andreas.herrmann3@amd.com> Signed-off-by: Borislav Petkov <borislav.petkov@amd.com> Link: http://lkml.kernel.org/r/1344361461-10076-1-git-send-email-bp@amd64.org Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| * | | x86/alternatives: Fix p6 nops on non-modular kernelsAvi Kivity2012-08-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Probably a leftover from the early days of self-patching, p6nops are marked __initconst_or_module, which causes them to be discarded in a non-modular kernel. If something later triggers patching, it will overwrite kernel code with garbage. Reported-by: Tomas Racek <tracek@redhat.com> Signed-off-by: Avi Kivity <avi@redhat.com> Cc: Michael Tokarev <mjt@tls.msk.ru> Cc: Borislav Petkov <borislav.petkov@amd.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: qemu-devel@nongnu.org Cc: Anthony Liguori <anthony@codemonkey.ws> Cc: H. Peter Anvin <hpa@linux.intel.com> Cc: Alan Cox <alan@lxorguk.ukuu.org.uk> Cc: Alan Cox <alan@linux.intel.com> Link: http://lkml.kernel.org/r/5034AE84.90708@redhat.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | x86/fixup_irq: Use cpu_online_mask instead of cpu_all_maskLiu, Chuansheng2012-08-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When one CPU is going down and this CPU is the last one in irq affinity, current code is setting cpu_all_mask as the new affinity for that irq. But for some systems (such as in Medfield Android mobile) the firmware sends the interrupt to each CPU in the irq affinity mask, averaged, and cpu_all_mask includes all potential CPUs, i.e. offline ones as well. So replace cpu_all_mask with cpu_online_mask. Signed-off-by: liu chuansheng <chuansheng.liu@intel.com> Acked-by: Yanmin Zhang <yanmin_zhang@linux.intel.com> Acked-by: Thomas Gleixner <tglx@linutronix.de> Link: http://lkml.kernel.org/r/27240C0AC20F114CBF8149A2696CBE4A137286@SHSMSX101.ccr.corp.intel.com Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | x86/spinlocks: Fix comment in spinlock.hRichard Weinberger2012-08-221-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This comment is no longer true. We support up to 2^16 CPUs because __ticket_t is an u16 if NR_CPUS is larger than 256. Signed-off-by: Richard Weinberger <richard@nod.at> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Ingo Molnar <mingo@kernel.org>
| * | | mm: hugetlbfs: correctly populate shared pmdMichal Hocko2012-08-221-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Each page mapped in a process's address space must be correctly accounted for in _mapcount. Normally the rules for this are straightforward but hugetlbfs page table sharing is different. The page table pages at the PMD level are reference counted while the mapcount remains the same. If this accounting is wrong, it causes bugs like this one reported by Larry Woodman: kernel BUG at mm/filemap.c:135! invalid opcode: 0000 [#1] SMP CPU 22 Modules linked in: bridge stp llc sunrpc binfmt_misc dcdbas microcode pcspkr acpi_pad acpi] Pid: 18001, comm: mpitest Tainted: G W 3.3.0+ #4 Dell Inc. PowerEdge R620/07NDJ2 RIP: 0010:[<ffffffff8112cfed>] [<ffffffff8112cfed>] __delete_from_page_cache+0x15d/0x170 Process mpitest (pid: 18001, threadinfo ffff880428972000, task ffff880428b5cc20) Call Trace: delete_from_page_cache+0x40/0x80 truncate_hugepages+0x115/0x1f0 hugetlbfs_evict_inode+0x18/0x30 evict+0x9f/0x1b0 iput_final+0xe3/0x1e0 iput+0x3e/0x50 d_kill+0xf8/0x110 dput+0xe2/0x1b0 __fput+0x162/0x240 During fork(), copy_hugetlb_page_range() detects if huge_pte_alloc() shared page tables with the check dst_pte == src_pte. The logic is if the PMD page is the same, they must be shared. This assumes that the sharing is between the parent and child. However, if the sharing is with a different process entirely then this check fails as in this diagram: parent | ------------>pmd src_pte----------> data page ^ other--------->pmd--------------------| ^ child-----------| dst_pte For this situation to occur, it must be possible for Parent and Other to have faulted and failed to share page tables with each other. This is possible due to the following style of race. PROC A PROC B copy_hugetlb_page_range copy_hugetlb_page_range src_pte == huge_pte_offset src_pte == huge_pte_offset !src_pte so no sharing !src_pte so no sharing (time passes) hugetlb_fault hugetlb_fault huge_pte_alloc huge_pte_alloc huge_pmd_share huge_pmd_share LOCK(i_mmap_mutex) find nothing, no sharing UNLOCK(i_mmap_mutex) LOCK(i_mmap_mutex) find nothing, no sharing UNLOCK(i_mmap_mutex) pmd_alloc pmd_alloc LOCK(instantiation_mutex) fault UNLOCK(instantiation_mutex) LOCK(instantiation_mutex) fault UNLOCK(instantiation_mutex) These two processes are not poing to the same data page but are not sharing page tables because the opportunity was missed. When either process later forks, the src_pte == dst pte is potentially insufficient. As the check falls through, the wrong PTE information is copied in (harmless but wrong) and the mapcount is bumped for a page mapped by a shared page table leading to the BUG_ON. This patch addresses the issue by moving pmd_alloc into huge_pmd_share which guarantees that the shared pud is populated in the same critical section as pmd. This also means that huge_pte_offset test in huge_pmd_share is serialized correctly now which in turn means that the success of the sharing will be higher as the racing tasks see the pud and pmd populated together. Race identified and changelog written mostly by Mel Gorman. {akpm@linux-foundation.org: attempt to make the huge_pmd_share() comment comprehensible, clean up coding style] Reported-by: Larry Woodman <lwoodman@redhat.com> Tested-by: Larry Woodman <lwoodman@redhat.com> Reviewed-by: Mel Gorman <mgorman@suse.de> Signed-off-by: Michal Hocko <mhocko@suse.cz> Reviewed-by: Rik van Riel <riel@redhat.com> Cc: David Gibson <david@gibson.dropbear.id.au> Cc: Ken Chen <kenchen@google.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Hillf Danton <dhillf@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
| * | | Merge branch 'x86-urgent-for-linus' of ↵Linus Torvalds2012-08-208-14/+52
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Ingo Molnar. A x32 socket ABI fix with a -stable backport tag among other fixes. * 'x86-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x32: Use compat shims for {g,s}etsockopt Revert "x86-64/efi: Use EFI to deal with platform wall clock" x86, apic: fix broken legacy interrupts in the logical apic mode x86, build: Globally set -fno-pic x86, avx: don't use avx instructions with "noxsave" boot param
| | * | | x32: Use compat shims for {g,s}etsockoptMike Frysinger2012-08-181-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some of the arguments to {g,s}etsockopt are passed in userland pointers. If we try to use the 64bit entry point, we end up sometimes failing. For example, dhcpcd doesn't run in x32: # dhcpcd eth0 dhcpcd[1979]: version 5.5.6 starting dhcpcd[1979]: eth0: broadcasting for a lease dhcpcd[1979]: eth0: open_socket: Invalid argument dhcpcd[1979]: eth0: send_raw_packet: Bad file descriptor The code in particular is getting back EINVAL when doing: struct sock_fprog pf; setsockopt(s, SOL_SOCKET, SO_ATTACH_FILTER, &pf, sizeof(pf)); Diving into the kernel code, we can see: include/linux/filter.h: struct sock_fprog { unsigned short len; struct sock_filter __user *filter; }; net/core/sock.c: case SO_ATTACH_FILTER: ret = -EINVAL; if (optlen == sizeof(struct sock_fprog)) { struct sock_fprog fprog; ret = -EFAULT; if (copy_from_user(&fprog, optval, sizeof(fprog))) break; ret = sk_attach_filter(&fprog, sk); } break; arch/x86/syscalls/syscall_64.tbl: 54 common setsockopt sys_setsockopt 55 common getsockopt sys_getsockopt So for x64, sizeof(sock_fprog) is 16 bytes. For x86/x32, it's 8 bytes. This comes down to the pointer being 32bit for x32, which means we need to do structure size translation. But since x32 comes in directly to sys_setsockopt, it doesn't get translated like x86. After changing the syscall table and rebuilding glibc with the new kernel headers, dhcp runs fine in an x32 userland. Oddly, it seems like Linus noted the same thing during the initial port, but I guess that was missed/lost along the way: https://lkml.org/lkml/2011/8/26/452 [ hpa: tagging for -stable since this is an ABI fix. ] Bugzilla: https://bugs.gentoo.org/423649 Reported-by: Mads <mads@ab3.no> Signed-off-by: Mike Frysinger <vapier@gentoo.org> Link: http://lkml.kernel.org/r/1345320697-15713-1-git-send-email-vapier@gentoo.org Cc: H. J. Lu <hjl.tools@gmail.com> Cc: <stable@vger.kernel.org> v3.4..v3.5 Signed-off-by: H. Peter Anvin <hpa@zytor.com>
| | * | | Revert "x86-64/efi: Use EFI to deal with platform wall clock"H. Peter Anvin2012-08-142-10/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit bacef661acdb634170a8faddbc1cf28e8f8b9eee. This commit has been found to cause serious regressions on a number of ASUS machines at the least. We probably need to provide a 1:1 map in addition to the EFI virtual memory map in order for this to work. Signed-off-by: H. Peter Anvin <hpa@zytor.com> Reported-and-bisected-by: Jérôme Carretero <cJ-ko@zougloub.eu> Cc: Jan Beulich <jbeulich@suse.com> Cc: Matt Fleming <matt.fleming@intel.com> Cc: Matthew Garrett <mjg@redhat.com> Cc: Linus Torvalds <torvalds@linux-foundation.org> Cc: Andrew Morton <akpm@linux-foundation.org> Cc: Peter Zijlstra <a.p.zijlstra@chello.nl> Link: http://lkml.kernel.org/r/20120805172903.5f8bb24c@zougloub.eu
| | * | | x86, apic: fix broken legacy interrupts in the logical apic modeSuresh Siddha2012-08-141-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Recent commit 332afa656e76458ee9cf0f0d123016a0658539e4 cleaned up a workaround that updates irq_cfg domain for legacy irq's that are handled by the IO-APIC. This was assuming that the recent changes in assign_irq_vector() were sufficient to remove the workaround. But this broke couple of AMD platforms. One of them seems to be sending interrupts to the offline cpu's, resulting in spurious "No irq handler for vector xx (irq -1)" messages when those cpu's come online. And the other platform seems to always send the interrupt to the last logical CPU (cpu-7). Recent changes had an unintended side effect of using only logical cpu-0 in the IO-APIC RTE (during boot for the legacy interrupts) and this broke the legacy interrupts not getting routed to the cpu-7 on the AMD platform, resulting in a boot hang. For now, reintroduce the removed workaround, (essentially not allowing the vector to change for legacy irq's when io-apic starts to handle the irq. Which also addressed the uninteded sife effect of just specifying cpu-0 in the IO-APIC RTE for those irq's during boot). Reported-and-tested-by: Robert Richter <robert.richter@amd.com> Reported-and-tested-by: Borislav Petkov <bp@amd64.org> Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1344453412.29170.5.camel@sbsiddha-desk.sc.intel.com Signed-off-by: H. Peter Anvin <hpa@zytor.com>
| | * | | x86, build: Globally set -fno-picAndrew Boie2012-08-113-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GCC built with nonstandard options can enable -fpic by default. We never want this for 32-bit kernels and it will break the build. [ hpa: Notably the Android toolchain apparently does this. ] Change-Id: Iaab7d66e598b1c65ac4a4f0229eca2cd3d0d2898 Signed-off-by: Andrew Boie <andrew.p.boie@intel.com> Link: http://lkml.kernel.org/r/1344624546-29691-1-git-send-email-andrew.p.boie@intel.com Signed-off-by: H. Peter Anvin <hpa@linux.intel.com>
| | * | | x86, avx: don't use avx instructions with "noxsave" boot paramSuresh Siddha2012-08-081-0/+2
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Clear AVX, AVX2 features along with clearing XSAVE feature bits, as part of the parsing "noxsave" parameter. Fixes the kernel boot panic with "noxsave" boot parameter. We could have checked cpu_has_osxsave along with cpu_has_avx etc, but Peter mentioned clearing the feature bits will be better for uses like static_cpu_has() etc. Signed-off-by: Suresh Siddha <suresh.b.siddha@intel.com> Link: http://lkml.kernel.org/r/1343755754.2041.2.camel@sbsiddha-desk.sc.intel.com Cc: <stable@vger.kernel.org> # v3.5 Signed-off-by: H. Peter Anvin <hpa@zytor.com>