summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* moduleparam: Save information about built-in modules in separate fileAlexey Gladkov2019-05-078-7/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Problem: When a kernel module is compiled as a separate module, some important information about the kernel module is available via .modinfo section of the module. In contrast, when the kernel module is compiled into the kernel, that information is not available. Information about built-in modules is necessary in the following cases: 1. When it is necessary to find out what additional parameters can be passed to the kernel at boot time. 2. When you need to know which module names and their aliases are in the kernel. This is very useful for creating an initrd image. Proposal: The proposed patch does not remove .modinfo section with module information from the vmlinux at the build time and saves it into a separate file after kernel linking. So, the kernel does not increase in size and no additional information remains in it. Information is stored in the same format as in the separate modules (null-terminated string array). Because the .modinfo section is already exported with a separate modules, we are not creating a new API. It can be easily read in the userspace: $ tr '\0' '\n' < modules.builtin.modinfo ext4.softdep=pre: crc32c ext4.license=GPL ext4.description=Fourth Extended Filesystem ext4.author=Remy Card, Stephen Tweedie, Andrew Morton, Andreas Dilger, Theodore Ts'o and others ext4.alias=fs-ext4 ext4.alias=ext3 ext4.alias=fs-ext3 ext4.alias=ext2 ext4.alias=fs-ext2 md_mod.alias=block-major-9-* md_mod.alias=md md_mod.description=MD RAID framework md_mod.license=GPL md_mod.parmtype=create_on_open:bool md_mod.parmtype=start_dirty_degraded:int ... Co-Developed-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Signed-off-by: Gleb Fotengauer-Malinovskiy <glebfm@altlinux.org> Signed-off-by: Alexey Gladkov <gladkov.alexey@gmail.com> Acked-by: Jessica Yu <jeyu@kernel.org> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* Remove MODULE_ALIAS() calls that take undefined macroMasahiro Yamada2019-05-073-3/+0
| | | | | | | | | | | | | | | | | | | | | | | These files do not define (USBHS_)DRIVER_NAME. Yet, they can be successfully compiled because they are never built as a module by anyone, i.e, the MODULE_ALIAS() calls are always no-op. A problem showed up when a patch "moduleparam: Save information about built-in modules in separate file" was applied. With this new feature, MODULE_*() will be populated even if the callers are built-in. To avoid the build errors, the lines referencing to the undefined macro must be removed. The complete fix is to remove all MODULE_* and #include <linux/module.h> like many "make ... explicitly non-modular" commits did. For now, I am touching only the offending lines. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Daniel Lezcano <daniel.lezcano@linaro.org> Acked-by: Tony Lindgren <tony@atomide.com>
* .gitignore: add leading and trailing slashes to generated directoriesMasahiro Yamada2019-05-071-4/+4
| | | | | | | Clarify these directory paths are relative to the top of the source tree. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* scripts/tags.sh: fix direct execution of scripts/tags.shMasahiro Yamada2019-05-031-1/+1
| | | | | | | | | | | | | | | | | I thought this script was run via "make tags" etc. but some people run it directly. Prior to commit a9a49c2ad9b9 ("kbuild: use $(srctree) instead of KBUILD_SRC to check out-of-tree build"), in such a usecase, "tree" was set empty since KBUILD_SRC is undefined. Now, "tree" is set to "${srctree}/", which is evaluated to "/". Fix it by taking into account the case where "srctree" is unset. Link: https://lkml.org/lkml/2019/4/19/501 Fixes: a9a49c2ad9b9 ("kbuild: use $(srctree) instead of KBUILD_SRC to check out-of-tree build") Reported-by: Sergey Senozhatsky <sergey.senozhatsky.work@gmail.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* scripts: override locale from environment when running recordmcount.plDaniel Dadap2019-05-031-1/+1
| | | | | | | | | | | | | | | | recordmcount.pl uses a set of regular expressions to parse the output of objdump(1). However, if objdump(1) output is localized, it may not match the regular expressions, thereby preventing recordmcount.pl from parsing object files correctly. In order to allow recordmcount.pl to function correctly regardless of the current locale settings, set LANG=C when running objdump(1). LC_ALL is already unset in the top-level Makefile, so it is not necessary to also override that environment variable. Signed-off-by: Daniel Dadap <ddadap@nvidia.com> Reviewed-by: Robert Morell <rmorell@nvidia.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* samples: kobject: allow CONFIG_SAMPLE_KOBJECT to become yMasahiro Yamada2019-05-031-2/+1
| | | | | | This sample works well as builtin. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* samples: seccomp: turn CONFIG_SAMPLE_SECCOMP into a bool optionMasahiro Yamada2019-05-032-3/+3
| | | | | | | | | | The prompt of CONFIG_SAMPLE_SECCOMP claims this is "loadable module only", which is invalid. samples/seccomp/ only contains host programs, so having it tristate is pointless. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* kbuild: move Documentation to vmlinux-alldirsMasahiro Yamada2019-05-031-4/+5
| | | | | | A minor code cleanup. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* kbuild: move samples/ to KBUILD_VMLINUX_OBJSMasahiro Yamada2019-05-032-11/+9
| | | | | | | | | | Handle samples/ like the other top-level directories to simplify the Makefile. Include include/config/auto.conf earlier to evaluate drivers-$(CONFIG_SAMPLES). Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* modpost: make KBUILD_MODPOST_WARN also configurable for external modulesWiebe, Wladislav (Nokia - DE/Ulm)2019-04-111-1/+1
| | | | | | | | | | | | | | | | | | | | | Commit ea837f1c0503 ("kbuild: make modpost processing configurable") was intended to give KBUILD_MODPOST_WARN flexibility to be configurable. Right now KBUILD_MODPOST_WARN gets just ignored when KBUILD_EXTMOD is set which happens per default when building modules out of the tree. This change gives the opportunity to define module build behaving also in case of out of tree builds and default will become exit on error. Errors which can be detected by the build should be trapped out of the box there, unless somebody wants to notice broken stuff later at runtime. As this patch changes the default behaving from warning to error, users can consider to fix it for external module builds by: - providing module symbol table via KBUILD_EXTRA_SYMBOLS for modules which are dependent - OR getting old behaving back by passing KBUILD_MODPOST_WARN to the build Signed-off-by: Wladislav Wiebe <wladislav.wiebe@nokia.com> Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* kbuild: check arch/$(SRCARCH)/include/generated before out-of-tree buildMasahiro Yamada2019-04-091-2/+4
| | | | | | | | | | | | | | | | | | | | | | After cross-compiling the kernel, "make mrproper" should be executed with the proper ARCH= option. Otherwise, stale objects will remain under arch/$(SRCARCH)/. One bad scenario is like this: $ make ARCH=arm defconfig all # cross-compile the kernel for arm $ make mrproper # mrproper for host-arch (i.e. x86) $ make ARCH=arm O=build_dir defconfig all If you miss ARCH= for mrproper and cross-compile the kernel with O= and ARCH= options, Kbuild will happily start to build, but may fail due to stale objects in the srctree. If $(srctree)/arch/$(SRCARCH)/include/generated/ exists, let's stop the out-of-tree build. To detect this, mrproper should clean only arch/$(SRCARCH)/include/generated/. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* kbuild: remove unneeded dependency for include/config/kernel.releaseMasahiro Yamada2019-04-091-1/+1
| | | | | | | This is unneeded since commit 43fee2b23895 ("kbuild: do not redirect the first prerequisite for filechk"). Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* memory: squash drivers/memory/Makefile.asm-offsetsMasahiro Yamada2019-04-082-6/+5
| | | | | | | | | drivers/memory/Makefile.asm-offsets is small enough, and included from a single place. Squash it into drivers/memory/Makefile. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* kbuild: use $(srctree) instead of KBUILD_SRC to check out-of-tree buildMasahiro Yamada2019-04-027-9/+9
| | | | | | | | | | | | | | | | | | | KBUILD_SRC was conventionally used for some different purposes: [1] To remember the source tree path [2] As a flag to check if sub-make is already done [3] As a flag to check if Kbuild runs out of tree For [1], we do not need to remember it because the top Makefile can compute it by $(realpath $(dir $(lastword $(MAKEFILE_LIST)))) [2] has been replaced with self-commenting 'sub_make_done'. For [3], we can distinguish in-tree/out-of-tree by comparing $(srctree) and '.' This commit converts [3] to prepare for the KBUILD_SRC removal. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* kbuild: mkmakefile: generate a simple wrapper of top MakefileMasahiro Yamada2019-04-021-18/+2
| | | | | | | | Now that Kbuild is able to start from any directory, the generated Makefile can simply wrap the top Makefile. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Kieran Bingham <kbingham@kernel.org>
* kbuild: mkmakefile: do not check the generated Makefile markerMasahiro Yamada2019-04-021-6/+0
| | | | | | | | | | | | | | This hunk was added to avoid accidental overwrite of the top Makefile in case O= points to the top of the source tree. As commit 4f1127e20437 ("kbuild: fix infinite make recursion"), it caused some troubles in the past because Kbuild assumes O= as out-of-tree build, while it actually works in the source tree. Now this works more properly; if O= points to the source directory, it is handled as in-tree build. So, this sanity check is unneeded. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* kbuild: allow Kbuild to start from any directoryMasahiro Yamada2019-04-021-37/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Kbuild always runs in the top of the output directory. If Make starts in the source directory with O=, it relocates the working directory to the location specified by O=. Also, users can start build from the output directory by using the Makefile generated by scripts/mkmakefile. With a little more effort, Kbuild will be able to start from any directory path. This commit allows to specify the source directory by using the -f option. For example, you can do: $ cd path/to/output/dir $ make -f path/to/source/dir/Makefile Or, for the equivalent behavior, you can do: $ make O=path/to/output/dir -f path/to/source/dir/Makefile KBUILD_SRC is now deprecated. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Reviewed-by: Kieran Bingham <kbingham@kernel.org>
* kbuild: pass $(MAKECMDGOALS) to sub-make as isMasahiro Yamada2019-04-011-1/+1
| | | | | | | | | | | | | | | | Manipulating $(MAKECMDGOALS) for sub-make seems odd to me. [1] 'make O=foo sub-make' is turned into 'make O=foo', which builds the default targets. It would make sense to terminate the build with: *** No rule to make target 'sub-make'. Stop. [2] 'make O=foo defconfig _all' is turned into 'make O=foo defconfig', which changes the behavior. Let's pass $(MAKECMDGOALS) as is. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* kbuild: fix warning "overriding recipe for target 'Makefile'"Masahiro Yamada2019-04-011-1/+1
| | | | | | | | | | | | | If you do "make Makefile" with GNU Make 3.x, the following warning is displayed: $ make Makefile Makefile:165: warning: overriding recipe for target 'Makefile' Makefile:51: warning: ignoring old recipe for target 'Makefile' make[1]: Nothing to be done for 'Makefile'. make: Nothing to be done for 'Makefile'. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* kbuild: move RETPOLINE flags below config-targetsMasahiro Yamada2019-04-011-9/+9
| | | | | | | | | | When you run a "make *config" target, the retpoline compiler flags are evaluated for nothing because the code is located above the 'ifeq ($(config-targets),1)'. Move it a bit below to avoid unneeded computation in the Kconfig stage. Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com>
* Linux 5.1-rc3v5.1-rc3Linus Torvalds2019-03-311-1/+1
|
* Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2019-03-3160-201/+409
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull KVM fixes from Paolo Bonzini: "A collection of x86 and ARM bugfixes, and some improvements to documentation. On top of this, a cleanup of kvm_para.h headers, which were exported by some architectures even though they not support KVM at all. This is responsible for all the Kbuild changes in the diffstat" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (28 commits) Documentation: kvm: clarify KVM_SET_USER_MEMORY_REGION KVM: doc: Document the life cycle of a VM and its resources KVM: selftests: complete IO before migrating guest state KVM: selftests: disable stack protector for all KVM tests KVM: selftests: explicitly disable PIE for tests KVM: selftests: assert on exit reason in CR4/cpuid sync test KVM: x86: update %rip after emulating IO x86/kvm/hyper-v: avoid spurious pending stimer on vCPU init kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrs KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hosts kvm: don't redefine flags as something else kvm: mmu: Used range based flushing in slot_handle_level_range KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iif KVM is supported KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region() kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP fields KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation) KVM: Reject device ioctls from processes other than the VM's creator KVM: doc: Fix incorrect word ordering regarding supported use of APIs KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size' KVM: nVMX: Do not inherit quadrant and invalid for the root shadow EPT ...
| * Merge tag 'kvmarm-fixes-for-5.1' of ↵Paolo Bonzini2019-03-289-75/+133
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvmarm/kvmarm into kvm-master KVM/ARM fixes for 5.1 - Fix THP handling in the presence of pre-existing PTEs - Honor request for PTE mappings even when THPs are available - GICv4 performance improvement - Take the srcu lock when writing to guest-controlled ITS data structures - Reset the virtual PMU in preemptible context - Various cleanups
| | * KVM: arm/arm64: Comments cleanup in mmu.cZenghui Yu2019-03-281-14/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some comments in virt/kvm/arm/mmu.c are outdated. Update them to reflect the current state of the code. Signed-off-by: Zenghui Yu <yuzenghui@huawei.com> Reviewed-by: Suzuki K Poulose <suzuki.poulose@arm.com> [maz: commit message tidy-up] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: vgic-its: Make attribute accessors staticYueHaibing2019-03-201-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix sparse warnings: arch/arm64/kvm/../../../virt/kvm/arm/vgic/vgic-its.c:1732:5: warning: symbol 'vgic_its_has_attr_regs' was not declared. Should it be static? arch/arm64/kvm/../../../virt/kvm/arm/vgic/vgic-its.c:1753:5: warning: symbol 'vgic_its_attr_regs_access' was not declared. Should it be static? Signed-off-by: YueHaibing <yuehaibing@huawei.com> [maz: fixed subject] Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: Fix handling of stage2 huge mappingsSuzuki K Poulose2019-03-202-16/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We rely on the mmu_notifier call backs to handle the split/merge of huge pages and thus we are guaranteed that, while creating a block mapping, either the entire block is unmapped at stage2 or it is missing permission. However, we miss a case where the block mapping is split for dirty logging case and then could later be made block mapping, if we cancel the dirty logging. This not only creates inconsistent TLB entries for the pages in the the block, but also leakes the table pages for PMD level. Handle this corner case for the huge mappings at stage2 by unmapping the non-huge mapping for the block. This could potentially release the upper level table. So we need to restart the table walk once we unmap the range. Fixes : ad361f093c1e31d ("KVM: ARM: Support hugetlbfs backed huge pages") Reported-by: Zheng Xiang <zhengxiang9@huawei.com> Cc: Zheng Xiang <zhengxiang9@huawei.com> Cc: Zenghui Yu <yuzenghui@huawei.com> Cc: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: Enforce PTE mappings at stage2 when neededSuzuki K Poulose2019-03-191-22/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings") made the checks to skip huge mappings, stricter. However it introduced a bug where we still use huge mappings, ignoring the flag to use PTE mappings, by not reseting the vma_pagesize to PAGE_SIZE. Also, the checks do not cover the PUD huge pages, that was under review during the same period. This patch fixes both the issues. Fixes : 6794ad5443a2118 ("KVM: arm/arm64: Fix unintended stage 2 PMD mappings") Reported-by: Zenghui Yu <yuzenghui@huawei.com> Cc: Zenghui Yu <yuzenghui@huawei.com> Cc: Christoffer Dall <christoffer.dall@arm.com> Signed-off-by: Suzuki K Poulose <suzuki.poulose@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: vgic-its: Take the srcu lock when parsing the memslotsMarc Zyngier2019-03-191-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Calling kvm_is_visible_gfn() implies that we're parsing the memslots, and doing this without the srcu lock is frown upon: [12704.164532] ============================= [12704.164544] WARNING: suspicious RCU usage [12704.164560] 5.1.0-rc1-00008-g600025238f51-dirty #16 Tainted: G W [12704.164573] ----------------------------- [12704.164589] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage! [12704.164602] other info that might help us debug this: [12704.164616] rcu_scheduler_active = 2, debug_locks = 1 [12704.164631] 6 locks held by qemu-system-aar/13968: [12704.164644] #0: 000000007ebdae4f (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0 [12704.164691] #1: 000000007d751022 (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0 [12704.164726] #2: 00000000219d2706 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164761] #3: 00000000a760aecd (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164794] #4: 000000000ef8e31d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164827] #5: 000000007a872093 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [12704.164861] stack backtrace: [12704.164878] CPU: 2 PID: 13968 Comm: qemu-system-aar Tainted: G W 5.1.0-rc1-00008-g600025238f51-dirty #16 [12704.164887] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019 [12704.164896] Call trace: [12704.164910] dump_backtrace+0x0/0x138 [12704.164920] show_stack+0x24/0x30 [12704.164934] dump_stack+0xbc/0x104 [12704.164946] lockdep_rcu_suspicious+0xcc/0x110 [12704.164958] gfn_to_memslot+0x174/0x190 [12704.164969] kvm_is_visible_gfn+0x28/0x70 [12704.164980] vgic_its_check_id.isra.0+0xec/0x1e8 [12704.164991] vgic_its_save_tables_v0+0x1ac/0x330 [12704.165001] vgic_its_set_attr+0x298/0x3a0 [12704.165012] kvm_device_ioctl_attr+0x9c/0xd8 [12704.165022] kvm_device_ioctl+0x8c/0xf8 [12704.165035] do_vfs_ioctl+0xc8/0x960 [12704.165045] ksys_ioctl+0x8c/0xa0 [12704.165055] __arm64_sys_ioctl+0x28/0x38 [12704.165067] el0_svc_common+0xd8/0x138 [12704.165078] el0_svc_handler+0x38/0x78 [12704.165089] el0_svc+0x8/0xc Make sure the lock is taken when doing this. Fixes: bf308242ab98 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock") Reviewed-by: Eric Auger <eric.auger@redhat.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm/arm64: vgic-its: Take the srcu lock when writing to guest memoryMarc Zyngier2019-03-194-6/+28
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When halting a guest, QEMU flushes the virtual ITS caches, which amounts to writing to the various tables that the guest has allocated. When doing this, we fail to take the srcu lock, and the kernel shouts loudly if running a lockdep kernel: [ 69.680416] ============================= [ 69.680819] WARNING: suspicious RCU usage [ 69.681526] 5.1.0-rc1-00008-g600025238f51-dirty #18 Not tainted [ 69.682096] ----------------------------- [ 69.682501] ./include/linux/kvm_host.h:605 suspicious rcu_dereference_check() usage! [ 69.683225] [ 69.683225] other info that might help us debug this: [ 69.683225] [ 69.683975] [ 69.683975] rcu_scheduler_active = 2, debug_locks = 1 [ 69.684598] 6 locks held by qemu-system-aar/4097: [ 69.685059] #0: 0000000034196013 (&kvm->lock){+.+.}, at: vgic_its_set_attr+0x244/0x3a0 [ 69.686087] #1: 00000000f2ed935e (&its->its_lock){+.+.}, at: vgic_its_set_attr+0x250/0x3a0 [ 69.686919] #2: 000000005e71ea54 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.687698] #3: 00000000c17e548d (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.688475] #4: 00000000ba386017 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.689978] #5: 00000000c2c3c335 (&vcpu->mutex){+.+.}, at: lock_all_vcpus+0x64/0xd0 [ 69.690729] [ 69.690729] stack backtrace: [ 69.691151] CPU: 2 PID: 4097 Comm: qemu-system-aar Not tainted 5.1.0-rc1-00008-g600025238f51-dirty #18 [ 69.691984] Hardware name: rockchip evb_rk3399/evb_rk3399, BIOS 2019.04-rc3-00124-g2feec69fb1 03/15/2019 [ 69.692831] Call trace: [ 69.694072] lockdep_rcu_suspicious+0xcc/0x110 [ 69.694490] gfn_to_memslot+0x174/0x190 [ 69.694853] kvm_write_guest+0x50/0xb0 [ 69.695209] vgic_its_save_tables_v0+0x248/0x330 [ 69.695639] vgic_its_set_attr+0x298/0x3a0 [ 69.696024] kvm_device_ioctl_attr+0x9c/0xd8 [ 69.696424] kvm_device_ioctl+0x8c/0xf8 [ 69.696788] do_vfs_ioctl+0xc8/0x960 [ 69.697128] ksys_ioctl+0x8c/0xa0 [ 69.697445] __arm64_sys_ioctl+0x28/0x38 [ 69.697817] el0_svc_common+0xd8/0x138 [ 69.698173] el0_svc_handler+0x38/0x78 [ 69.698528] el0_svc+0x8/0xc The fix is to obviously take the srcu lock, just like we do on the read side of things since bf308242ab98. One wonders why this wasn't fixed at the same time, but hey... Fixes: bf308242ab98 ("KVM: arm/arm64: VGIC/ITS: protect kvm_read_guest() calls with SRCU lock") Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * arm64: KVM: Always set ICH_HCR_EL2.EN if GICv4 is enabledMarc Zyngier2019-03-192-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The normal interrupt flow is not to enable the vgic when no virtual interrupt is to be injected (i.e. the LRs are empty). But when a guest is likely to use GICv4 for LPIs, we absolutely need to switch it on at all times. Otherwise, VLPIs only get delivered when there is something in the LRs, which doesn't happen very often. Reported-by: Nianyao Tang <tangnianyao@huawei.com> Tested-by: Shameerali Kolothum Thodi <shameerali.kolothum.thodi@huawei.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| | * KVM: arm64: Reset the PMU in preemptible contextMarc Zyngier2019-03-191-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've become very cautious to now always reset the vcpu when nothing is loaded on the physical CPU. To do so, we now disable preemption and do a kvm_arch_vcpu_put() to make sure we have all the state in memory (and that it won't be loaded behind out back). This now causes issues with resetting the PMU, which calls into perf. Perf itself uses mutexes, which clashes with the lack of preemption. It is worth realizing that the PMU is fully emulated, and that no PMU state is ever loaded on the physical CPU. This means we can perfectly reset the PMU outside of the non-preemptible section. Fixes: e761a927bc9a ("KVM: arm/arm64: Reset the VCPU without preemption and vcpu state loaded") Reported-by: Julien Grall <julien.grall@arm.com> Tested-by: Julien Grall <julien.grall@arm.com> Signed-off-by: Marc Zyngier <marc.zyngier@arm.com>
| * | Documentation: kvm: clarify KVM_SET_USER_MEMORY_REGIONPaolo Bonzini2019-03-281-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | The documentation does not mention how to delete a slot, add the information. Reported-by: Nathaniel McCallum <npmccallum@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: doc: Document the life cycle of a VM and its resourcesSean Christopherson2019-03-281-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The series to add memcg accounting to KVM allocations[1] states: There are many KVM kernel memory allocations which are tied to the life of the VM process and should be charged to the VM process's cgroup. While it is correct to account KVM kernel allocations to the cgroup of the process that created the VM, it's technically incorrect to state that the KVM kernel memory allocations are tied to the life of the VM process. This is because the VM itself, i.e. struct kvm, is not tied to the life of the process which created it, rather it is tied to the life of its associated file descriptor. In other words, kvm_destroy_vm() is not invoked until fput() decrements its associated file's refcount to zero. A simple example is to fork() in Qemu and have the child sleep indefinitely; kvm_destroy_vm() isn't called until Qemu closes its file descriptor *and* the rogue child is killed. The allocations are guaranteed to be *accounted* to the process which created the VM, but only because KVM's per-{VM,vCPU} ioctls reject the ioctl() with -EIO if kvm->mm != current->mm. I.e. the child can keep the VM "alive" but can't do anything useful with its reference. Note that because 'struct kvm' also holds a reference to the mm_struct of its owner, the above behavior also applies to userspace allocations. Given that mucking with a VM's file descriptor can lead to subtle and undesirable behavior, e.g. memcg charges persisting after a VM is shut down, explicitly document a VM's lifecycle and its impact on the VM's resources. Alternatively, KVM could aggressively free resources when the creating process exits, e.g. via mmu_notifier->release(). However, mmu_notifier isn't guaranteed to be available, and freeing resources when the creator exits is likely to be error prone and fragile as KVM would need to ensure that it only freed resources that are truly out of reach. In practice, the existing behavior shouldn't be problematic as a properly configured system will prevent a child process from being moved out of the appropriate cgroup hierarchy, i.e. prevent hiding the process from the OOM killer, and will prevent an unprivileged user from being able to to hold a reference to struct kvm via another method, e.g. debugfs. [1]https://patchwork.kernel.org/patch/10806707/ Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: selftests: complete IO before migrating guest stateSean Christopherson2019-03-283-2/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Documentation/virtual/kvm/api.txt states: NOTE: For KVM_EXIT_IO, KVM_EXIT_MMIO, KVM_EXIT_OSI, KVM_EXIT_PAPR and KVM_EXIT_EPR the corresponding operations are complete (and guest state is consistent) only after userspace has re-entered the kernel with KVM_RUN. The kernel side will first finish incomplete operations and then check for pending signals. Userspace can re-enter the guest with an unmasked signal pending to complete pending operations. Because guest state may be inconsistent, starting state migration after an IO exit without first completing IO may result in test failures, e.g. a proposed change to KVM's handling of %rip in its fast PIO handling[1] will cause the new VM, i.e. the post-migration VM, to have its %rip set to the IN instruction that triggered KVM_EXIT_IO, leading to a test assertion due to a stage mismatch. For simplicitly, require KVM_CAP_IMMEDIATE_EXIT to complete IO and skip the test if it's not available. The addition of KVM_CAP_IMMEDIATE_EXIT predates the state selftest by more than a year. [1] https://patchwork.kernel.org/patch/10848545/ Fixes: fa3899add1056 ("kvm: selftests: add basic test for state save and restore") Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: selftests: disable stack protector for all KVM testsSean Christopherson2019-03-281-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since 4.8.3, gcc has enabled -fstack-protector by default. This is problematic for the KVM selftests as they do not configure fs or gs segments (the stack canary is pulled from fs:0x28). With the default behavior, gcc will insert a stack canary on any function that creates buffers of 8 bytes or more. As a result, ucall() will hit a triple fault shutdown due to reading a bad fs segment when inserting its stack canary, i.e. every test fails with an unexpected SHUTDOWN. Fixes: 14c47b7530e2d ("kvm: selftests: introduce ucall") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: selftests: explicitly disable PIE for testsSean Christopherson2019-03-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM selftests embed the guest "image" as a function in the test itself and extract the guest code at runtime by manually parsing the elf headers. The parsing is very simple and doesn't supporting fancy things like position independent executables. Recent versions of gcc enable pie by default, which results in triple fault shutdowns in the guest due to the virtual address in the headers not matching up with the virtual address retrieved from the function pointer. Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: selftests: assert on exit reason in CR4/cpuid sync testSean Christopherson2019-03-281-16/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ...so that the test doesn't end up in an infinite loop if it fails for whatever reason, e.g. SHUTDOWN due to gcc inserting stack canary code into ucall() and attempting to derefence a null segment. Fixes: ca359066889f7 ("kvm: selftests: add cr4_cpuid_sync_test") Cc: Wei Huang <wei@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: update %rip after emulating IOSean Christopherson2019-03-282-10/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most (all?) x86 platforms provide a port IO based reset mechanism, e.g. OUT 92h or CF9h. Userspace may emulate said mechanism, i.e. reset a vCPU in response to KVM_EXIT_IO, without explicitly announcing to KVM that it is doing a reset, e.g. Qemu jams vCPU state and resumes running. To avoid corruping %rip after such a reset, commit 0967b7bf1c22 ("KVM: Skip pio instruction when it is emulated, not executed") changed the behavior of PIO handlers, i.e. today's "fast" PIO handling to skip the instruction prior to exiting to userspace. Full emulation doesn't need such tricks becase re-emulating the instruction will naturally handle %rip being changed to point at the reset vector. Updating %rip prior to executing to userspace has several drawbacks: - Userspace sees the wrong %rip on the exit, e.g. if PIO emulation fails it will likely yell about the wrong address. - Single step exits to userspace for are effectively dropped as KVM_EXIT_DEBUG is overwritten with KVM_EXIT_IO. - Behavior of PIO emulation is different depending on whether it goes down the fast path or the slow path. Rather than skip the PIO instruction before exiting to userspace, snapshot the linear %rip and cancel PIO completion if the current value does not match the snapshot. For a 64-bit vCPU, i.e. the most common scenario, the snapshot and comparison has negligible overhead as VMCS.GUEST_RIP will be cached regardless, i.e. there is no extra VMREAD in this case. All other alternatives to snapshotting the linear %rip that don't rely on an explicit reset announcenment suffer from one corner case or another. For example, canceling PIO completion on any write to %rip fails if userspace does a save/restore of %rip, and attempting to avoid that issue by canceling PIO only if %rip changed then fails if PIO collides with the reset %rip. Attempting to zero in on the exact reset vector won't work for APs, which means adding more hooks such as the vCPU's MP_STATE, and so on and so forth. Checking for a linear %rip match technically suffers from corner cases, e.g. userspace could theoretically rewrite the underlying code page and expect a different instruction to execute, or the guest hardcodes a PIO reset at 0xfffffff0, but those are far, far outside of what can be considered normal operation. Fixes: 432baf60eee3 ("KVM: VMX: use kvm_fast_pio_in for handling IN I/O") Cc: <stable@vger.kernel.org> Reported-by: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | x86/kvm/hyper-v: avoid spurious pending stimer on vCPU initVitaly Kuznetsov2019-03-281-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When userspace initializes guest vCPUs it may want to zero all supported MSRs including Hyper-V related ones including HV_X64_MSR_STIMERn_CONFIG/ HV_X64_MSR_STIMERn_COUNT. With commit f3b138c5d89a ("kvm/x86: Update SynIC timers on guest entry only") we began doing stimer_mark_pending() unconditionally on every config change. The issue I'm observing manifests itself as following: - Qemu writes 0 to STIMERn_{CONFIG,COUNT} MSRs and marks all stimers as pending in stimer_pending_bitmap, arms KVM_REQ_HV_STIMER; - kvm_hv_has_stimer_pending() starts returning true; - kvm_vcpu_has_events() starts returning true; - kvm_arch_vcpu_runnable() starts returning true; - when kvm_arch_vcpu_ioctl_run() gets into (vcpu->arch.mp_state == KVM_MP_STATE_UNINITIALIZED) case: - kvm_vcpu_block() gets in 'kvm_vcpu_check_block(vcpu) < 0' and returns immediately, avoiding normal wait path; - -EAGAIN is returned from kvm_arch_vcpu_ioctl_run() immediately forcing userspace to retry. So instead of normal wait path we get a busy loop on all secondary vCPUs before they get INIT signal. This seems to be undesirable, especially given that this happens even when Hyper-V extensions are not used. Generally, it seems to be pointless to mark an stimer as pending in stimer_pending_bitmap and arm KVM_REQ_HV_STIMER as the only thing kvm_hv_process_stimers() will do is clear the corresponding bit. We may just not mark disabled timers as pending instead. Fixes: f3b138c5d89a ("kvm/x86: Update SynIC timers on guest entry only") Signed-off-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | kvm/x86: Move MSR_IA32_ARCH_CAPABILITIES to array emulated_msrsXiaoyao Li2019-03-281-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | Since MSR_IA32_ARCH_CAPABILITIES is emualted unconditionally even if host doesn't suppot it. We should move it to array emulated_msrs from arry msrs_to_save, to report to userspace that guest support this msr. Signed-off-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: Emulate MSR_IA32_ARCH_CAPABILITIES on AMD hostsSean Christopherson2019-03-284-14/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The CPUID flag ARCH_CAPABILITIES is unconditioinally exposed to host userspace for all x86 hosts, i.e. KVM advertises ARCH_CAPABILITIES regardless of hardware support under the pretense that KVM fully emulates MSR_IA32_ARCH_CAPABILITIES. Unfortunately, only VMX hosts handle accesses to MSR_IA32_ARCH_CAPABILITIES (despite KVM_GET_MSRS also reporting MSR_IA32_ARCH_CAPABILITIES for all hosts). Move the MSR_IA32_ARCH_CAPABILITIES handling to common x86 code so that it's emulated on AMD hosts. Fixes: 1eaafe91a0df4 ("kvm: x86: IA32_ARCH_CAPABILITIES is always supported") Cc: stable@vger.kernel.org Reported-by: Xiaoyao Li <xiaoyao.li@linux.intel.com> Cc: Jim Mattson <jmattson@google.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | kvm: don't redefine flags as something elseSebastian Andrzej Siewior2019-03-281-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function irqfd_wakeup() has flags defined as __poll_t and then it has additional flags which is used for irqflags. Redefine the inner flags variable as iflags so it does not shadow the outer flags. Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: kvm@vger.kernel.org Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | kvm: mmu: Used range based flushing in slot_handle_level_rangeBen Gardon2019-03-281-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace kvm_flush_remote_tlbs with kvm_flush_remote_tlbs_with_address in slot_handle_level_range. When range based flushes are not enabled kvm_flush_remote_tlbs_with_address falls back to kvm_flush_remote_tlbs. This changes the behavior of many functions that indirectly use slot_handle_level_range, iff the range based flushes are enabled. The only potential problem I see with this is that kvm->tlbs_dirty will be cleared less often, however the only caller of slot_handle_level_range that checks tlbs_dirty is kvm_mmu_notifier_invalidate_range_start which checks it and does a kvm_flush_remote_tlbs after calling kvm_unmap_hva_range anyway. Tested: Ran all kvm-unit-tests on a Intel Haswell machine with and without this patch. The patch introduced no new failures. Signed-off-by: Ben Gardon <bgardon@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: export <linux/kvm_para.h> and <asm/kvm_para.h> iif KVM is supportedMasahiro Yamada2019-03-2833-20/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I do not see any consistency about headers_install of <linux/kvm_para.h> and <asm/kvm_para.h>. According to my analysis of Linux 5.1-rc1, there are 3 groups: [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported alpha, arm, hexagon, mips, powerpc, s390, sparc, x86 [2] <asm/kvm_para.h> is exported, but <linux/kvm_para.h> is not arc, arm64, c6x, h8300, ia64, m68k, microblaze, nios2, openrisc, parisc, sh, unicore32, xtensa [3] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported csky, nds32, riscv This does not match to the actual KVM support. At least, [2] is half-baked. Nor do arch maintainers look like they care about this. For example, commit 0add53713b1c ("microblaze: Add missing kvm_para.h to Kbuild") exported <asm/kvm_para.h> to user-space in order to fix an in-kernel build error. We have two ways to make this consistent: [A] export both <linux/kvm_para.h> and <asm/kvm_para.h> for all architectures, irrespective of the KVM support [B] Match the header export of <linux/kvm_para.h> and <asm/kvm_para.h> to the KVM support My first attempt was [A] because the code looks cleaner, but Paolo suggested [B]. So, this commit goes with [B]. For most architectures, <asm/kvm_para.h> was moved to the kernel-space. I changed include/uapi/linux/Kbuild so that it checks generated asm/kvm_para.h as well as check-in ones. After this commit, there will be two groups: [1] Both <linux/kvm_para.h> and <asm/kvm_para.h> are exported arm, arm64, mips, powerpc, s390, x86 [2] Neither <linux/kvm_para.h> nor <asm/kvm_para.h> is exported alpha, arc, c6x, csky, h8300, hexagon, ia64, m68k, microblaze, nds32, nios2, openrisc, parisc, riscv, sh, sparc, unicore32, xtensa Signed-off-by: Masahiro Yamada <yamada.masahiro@socionext.com> Acked-by: Cornelia Huck <cohuck@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: remove check on nr_mmu_pages in kvm_arch_commit_memory_region()Wei Yang2019-03-283-8/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * nr_mmu_pages would be non-zero only if kvm->arch.n_requested_mmu_pages is non-zero. * nr_mmu_pages is always non-zero, since kvm_mmu_calculate_mmu_pages() never return zero. Based on these two reasons, we can merge the two *if* clause and use the return value from kvm_mmu_calculate_mmu_pages() directly. This simplify the code and also eliminate the possibility for reader to believe nr_mmu_pages would be zero. Signed-off-by: Wei Yang <richard.weiyang@gmail.com> Reviewed-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | kvm: nVMX: Add a vmentry check for HOST_SYSENTER_ESP and HOST_SYSENTER_EIP ↵Krish Sadhukhan2019-03-281-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fields According to section "Checks on VMX Controls" in Intel SDM vol 3C, the following check is performed on vmentry of L2 guests: On processors that support Intel 64 architecture, the IA32_SYSENTER_ESP field and the IA32_SYSENTER_EIP field must each contain a canonical address. Signed-off-by: Krish Sadhukhan <krish.sadhukhan@oracle.com> Reviewed-by: Mihai Carabas <mihai.carabas@oracle.com> Reviewed-by: Jim Mattson <jmattson@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: SVM: Workaround errata#1096 (insn_len maybe zero on SMAP violation)Singh, Brijesh2019-03-284-3/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Errata#1096: On a nested data page fault when CR.SMAP=1 and the guest data read generates a SMAP violation, GuestInstrBytes field of the VMCB on a VMEXIT will incorrectly return 0h instead the correct guest instruction bytes . Recommend Workaround: To determine what instruction the guest was executing the hypervisor will have to decode the instruction at the instruction pointer. The recommended workaround can not be implemented for the SEV guest because guest memory is encrypted with the guest specific key, and instruction decoder will not be able to decode the instruction bytes. If we hit this errata in the SEV guest then log the message and request a guest shutdown. Reported-by: Venkatesh Srinivas <venkateshs@google.com> Cc: Jim Mattson <jmattson@google.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Joerg Roedel <joro@8bytes.org> Cc: "Radim Krčmář" <rkrcmar@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Brijesh Singh <brijesh.singh@amd.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: Reject device ioctls from processes other than the VM's creatorSean Christopherson2019-03-282-5/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM's API requires thats ioctls must be issued from the same process that created the VM. In other words, userspace can play games with a VM's file descriptors, e.g. fork(), SCM_RIGHTS, etc..., but only the creator can do anything useful. Explicitly reject device ioctls that are issued by a process other than the VM's creator, and update KVM's API documentation to extend its requirements to device ioctls. Fixes: 852b6d57dc7f ("kvm: add device control API") Cc: <stable@vger.kernel.org> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: doc: Fix incorrect word ordering regarding supported use of APIsSean Christopherson2019-03-281-12/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per Paolo[1], instantiating multiple VMs in a single process is legal; but this conflicts with KVM's API documentation, which states: The only supported use is one virtual machine per process, and one vcpu per thread. However, an earlier section in the documentation states: Only run VM ioctls from the same process (address space) that was used to create the VM. and: Only run vcpu ioctls from the same thread that was used to create the vcpu. This suggests that the conflicting documentation is simply an incorrect ordering of of words, i.e. what's really meant is that a virtual machine can't be shared across multiple processes and a vCPU can't be shared across multiple threads. Tweak the blurb on issuing ioctls to use a more assertive tone, and rewrite the "supported use" sentence to reference said blurb instead of poorly restating it in different terms. Opportunistically add missing punctuation. [1] https://lkml.kernel.org/r/f23265d4-528e-3bd4-011f-4d7b8f3281db@redhat.com Fixes: 9c1b96e34717 ("KVM: Document basic API") Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> [Improve notes on asynchronous ioctl] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | KVM: x86: fix handling of role.cr4_pae and rename it to 'gpte_size'Sean Christopherson2019-03-284-22/+35
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The cr4_pae flag is a bit of a misnomer, its purpose is really to track whether the guest PTE that is being shadowed is a 4-byte entry or an 8-byte entry. Prior to supporting nested EPT, the size of the gpte was reflected purely by CR4.PAE. KVM fudged things a bit for direct sptes, but it was mostly harmless since the size of the gpte never mattered. Now that a spte may be tracking an indirect EPT entry, relying on CR4.PAE is wrong and ill-named. For direct shadow pages, force the gpte_size to '1' as they are always 8-byte entries; EPT entries can only be 8-bytes and KVM always uses 8-byte entries for NPT and its identity map (when running with EPT but not unrestricted guest). Likewise, nested EPT entries are always 8-bytes. Nested EPT presents a unique scenario as the size of the entries are not dictated by CR4.PAE, but neither is the shadow page a direct map. To handle this scenario, set cr0_wp=1 and smap_andnot_wp=1, an otherwise impossible combination, to denote a nested EPT shadow page. Use the information to avoid incorrectly zapping an unsync'd indirect page in __kvm_sync_page(). Providing a consistent and accurate gpte_size fixes a bug reported by Vitaly where fast_cr3_switch() always fails when switching from L2 to L1 as kvm_mmu_get_page() would force role.cr4_pae=0 for direct pages, whereas kvm_calc_mmu_role_common() would set it according to CR4.PAE. Fixes: 7dcd575520082 ("x86/kvm/mmu: check if tdp/shadow MMU reconfiguration is needed") Reported-by: Vitaly Kuznetsov <vkuznets@redhat.com> Tested-by: Vitaly Kuznetsov <vkuznets@redhat.com> Signed-off-by: Sean Christopherson <sean.j.christopherson@intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>