summaryrefslogtreecommitdiffstats
path: root/arch (follow)
Commit message (Collapse)AuthorAgeFilesLines
* KVM: SVM: add module param to control the #SMI interceptionMaxim Levitsky2021-07-153-1/+14
| | | | | | | | | | | | | | | | | | | In theory there are no side effects of not intercepting #SMI, because then #SMI becomes transparent to the OS and the KVM. Plus an observation on recent Zen2 CPUs reveals that these CPUs ignore #SMI interception and never deliver #SMI VMexits. This is also useful to test nested KVM to see that L1 handles #SMIs correctly in case when L1 doesn't intercept #SMI. Finally the default remains the same, the SMI are intercepted by default thus this patch doesn't have any effect unless non default module param value is used. Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210707125100.677203-4-mlevitsk@redhat.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: SVM: remove INIT intercept handlerMaxim Levitsky2021-07-151-1/+0
| | | | | | | | | | | | | | | | | | Kernel never sends real INIT even to CPUs, other than on boot. Thus INIT interception is an error which should be caught by a check for an unknown VMexit reason. On top of that, the current INIT VM exit handler skips the current instruction which is wrong. That was added in commit 5ff3a351f687 ("KVM: x86: Move trivial instruction-based exit handlers to common code"). Fixes: 5ff3a351f687 ("KVM: x86: Move trivial instruction-based exit handlers to common code") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210707125100.677203-3-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: SVM: #SMI interception must not skip the instructionMaxim Levitsky2021-07-151-1/+6
| | | | | | | | | | | | | | | | | Commit 5ff3a351f687 ("KVM: x86: Move trivial instruction-based exit handlers to common code"), unfortunately made a mistake of treating nop_on_interception and nop_interception in the same way. Former does truly nothing while the latter skips the instruction. SMI VM exit handler should do nothing. (SMI itself is handled by the host when we do STGI) Fixes: 5ff3a351f687 ("KVM: x86: Move trivial instruction-based exit handlers to common code") Signed-off-by: Maxim Levitsky <mlevitsk@redhat.com> Message-Id: <20210707125100.677203-2-mlevitsk@redhat.com> Cc: stable@vger.kernel.org Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: VMX: Remove vmx_msr_index from vmx.hYu Zhang2021-07-151-2/+0
| | | | | | | | | | | vmx_msr_index was used to record the list of MSRs which can be lazily restored when kvm returns to userspace. It is now reimplemented as kvm_uret_msrs_list, a common x86 list which is only used inside x86.c. So just remove the obsolete declaration in vmx.h. Signed-off-by: Yu Zhang <yu.c.zhang@linux.intel.com> Message-Id: <20210707235702.31595-1-yu.c.zhang@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: X86: Disable hardware breakpoints unconditionally before kvm_x86->run()Lai Jiangshan2021-07-151-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | When the host is using debug registers but the guest is not using them nor is the guest in guest-debug state, the kvm code does not reset the host debug registers before kvm_x86->run(). Rather, it relies on the hardware vmentry instruction to automatically reset the dr7 registers which ensures that the host breakpoints do not affect the guest. This however violates the non-instrumentable nature around VM entry and exit; for example, when a host breakpoint is set on vcpu->arch.cr2, Another issue is consistency. When the guest debug registers are active, the host breakpoints are reset before kvm_x86->run(). But when the guest debug registers are inactive, the host breakpoints are delayed to be disabled. The host tracing tools may see different results depending on what the guest is doing. To fix the problems, we clear %db7 unconditionally before kvm_x86->run() if the host has set any breakpoints, no matter if the guest is using them or not. Signed-off-by: Lai Jiangshan <laijs@linux.alibaba.com> Message-Id: <20210628172632.81029-1-jiangshanlai@gmail.com> Cc: stable@vger.kernel.org [Only clear %db7 instead of reloading all debug registers. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86/pmu: Clear anythread deprecated bit when 0xa leaf is unsupported on ↵Like Xu2021-07-141-1/+2
| | | | | | | | | | | | | | | | | the SVM The AMD platform does not support the functions Ah CPUID leaf. The returned results for this entry should all remain zero just like the native does: AMD host: 0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00000000 (uncanny) AMD guest: 0x0000000a 0x00: eax=0x00000000 ebx=0x00000000 ecx=0x00000000 edx=0x00008000 Fixes: cadbaa039b99 ("perf/x86/intel: Make anythread filter support conditional") Signed-off-by: Like Xu <likexu@tencent.com> Message-Id: <20210628074354.33848-1-likexu@tencent.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: SVM: Revert clearing of C-bit on GPA in #NPF handlerSean Christopherson2021-07-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't clear the C-bit in the #NPF handler, as it is a legal GPA bit for non-SEV guests, and for SEV guests the C-bit is dropped before the GPA hits the NPT in hardware. Clearing the bit for non-SEV guests causes KVM to mishandle #NPFs with that collide with the host's C-bit. Although the APM doesn't explicitly state that the C-bit is not reserved for non-SEV, Tom Lendacky confirmed that the following snippet about the effective reduction due to the C-bit does indeed apply only to SEV guests. Note that because guest physical addresses are always translated through the nested page tables, the size of the guest physical address space is not impacted by any physical address space reduction indicated in CPUID 8000_001F[EBX]. If the C-bit is a physical address bit however, the guest physical address space is effectively reduced by 1 bit. And for SEV guests, the APM clearly states that the bit is dropped before walking the nested page tables. If the C-bit is an address bit, this bit is masked from the guest physical address when it is translated through the nested page tables. Consequently, the hypervisor does not need to be aware of which pages the guest has chosen to mark private. Note, the bogus C-bit clearing was removed from legacy #PF handler in commit 6d1b867d0456 ("KVM: SVM: Don't strip the C-bit from CR2 on #PF interception"). Fixes: 0ede79e13224 ("KVM: SVM: Clear C-bit from the page fault address") Cc: Peter Gonda <pgonda@google.com> Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210625020354.431829-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86/mmu: Do not apply HPA (memory encryption) mask to GPAsSean Christopherson2021-07-144-8/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ignore "dynamic" host adjustments to the physical address mask when generating the masks for guest PTEs, i.e. the guest PA masks. The host physical address space and guest physical address space are two different beasts, e.g. even though SEV's C-bit is the same bit location for both host and guest, disabling SME in the host (which clears shadow_me_mask) does not affect the guest PTE->GPA "translation". For non-SEV guests, not dropping bits is the correct behavior. Assuming KVM and userspace correctly enumerate/configure guest MAXPHYADDR, bits that are lost as collateral damage from memory encryption are treated as reserved bits, i.e. KVM will never get to the point where it attempts to generate a gfn using the affected bits. And if userspace wants to create a bogus vCPU, then userspace gets to deal with the fallout of hardware doing odd things with bad GPAs. For SEV guests, not dropping the C-bit is technically wrong, but it's a moot point because KVM can't read SEV guest's page tables in any case since they're always encrypted. Not to mention that the current KVM code is also broken since sme_me_mask does not have to be non-zero for SEV to be supported by KVM. The proper fix would be to teach all of KVM to correctly handle guest private memory, but that's a task for the future. Fixes: d0ec49d4de90 ("kvm/x86/svm: Support Secure Memory Encryption within KVM") Cc: stable@vger.kernel.org Cc: Brijesh Singh <brijesh.singh@amd.com> Cc: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210623230552.4027702-5-seanjc@google.com> [Use a new header instead of adding header guards to paging_tmpl.h. - Paolo] Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: Use kernel's x86_phys_bits to handle reduced MAXPHYADDRSean Christopherson2021-07-141-7/+20
| | | | | | | | | | | | | | | | | | | | Use boot_cpu_data.x86_phys_bits instead of the raw CPUID information to enumerate the MAXPHYADDR for KVM guests when TDP is disabled (the guest version is only relevant to NPT/TDP). When using shadow paging, any reductions to the host's MAXPHYADDR apply to KVM and its guests as well, i.e. using the raw CPUID info will cause KVM to misreport the number of PA bits available to the guest. Unconditionally zero out the "Physical Address bit reduction" entry. For !TDP, the adjustment is already done, and for TDP enumerating the host's reduction is wrong as the reduction does not apply to GPAs. Fixes: 9af9b94068fb ("x86/cpu/AMD: Handle SME reduction in physical address size") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210623230552.4027702-3-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* KVM: x86: Use guest MAXPHYADDR from CPUID.0x8000_0008 iff TDP is enabledSean Christopherson2021-07-141-1/+7
| | | | | | | | | | | | | | | Ignore the guest MAXPHYADDR reported by CPUID.0x8000_0008 if TDP, i.e. NPT, is disabled, and instead use the host's MAXPHYADDR. Per AMD'S APM: Maximum guest physical address size in bits. This number applies only to guests using nested paging. When this field is zero, refer to the PhysAddrSize field for the maximum guest physical address size. Fixes: 24c82e576b78 ("KVM: Sanitize cpuid") Cc: stable@vger.kernel.org Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210623230552.4027702-2-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Revert "KVM: x86: WARN and reject loading KVM if NX is supported but not ↵Sean Christopherson2021-07-141-3/+0
| | | | | | | | | | | | | | | | | | | | enabled" Let KVM load if EFER.NX=0 even if NX is supported, the analysis and testing (or lack thereof) for the non-PAE host case was garbage. If the kernel won't be using PAE paging, .Ldefault_entry in head_32.S skips over the entire EFER sequence. Hopefully that can be changed in the future to allow KVM to require EFER.NX, but the motivation behind KVM's requirement isn't yet merged. Reverting and revisiting the mess at a later date is by far the safest approach. This reverts commit 8bbed95d2cb6e5de8a342d761a89b0a04faed7be. Fixes: 8bbed95d2cb6 ("KVM: x86: WARN and reject loading KVM if NX is supported but not enabled") Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20210625001853.318148-1-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
* Merge tag 'kvm-s390-master-5.14-1' of ↵Paolo Bonzini2021-07-1478-379/+516
|\ | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kvms390/linux into HEAD KVM: selftests: Fixes - provide memory model for IBM z196 and zEC12 - do not require 64GB of memory
| * Merge tag 'x86_urgent_for_v5.13_rc6' of ↵Linus Torvalds2021-06-205-24/+56
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Borislav Petkov: "A first set of urgent fixes to the FPU/XSTATE handling mess^W code. (There's a lot more in the pipe): - Prevent corruption of the XSTATE buffer in signal handling by validating what is being copied from userspace first. - Invalidate other task's preserved FPU registers on XRSTOR failure (#PF) because latter can still modify some of them. - Restore the proper PKRU value in case userspace modified it - Reset FPU state when signal restoring fails Other: - Map EFI boot services data memory as encrypted in a SEV guest so that the guest can access it and actually boot properly - Two SGX correctness fixes: proper resources freeing and a NUMA fix" * tag 'x86_urgent_for_v5.13_rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/mm: Avoid truncating memblocks for SGX memory x86/sgx: Add missing xa_destroy() when virtual EPC is destroyed x86/fpu: Reset state for all signal restore failures x86/pkru: Write hardware init value to PKRU when xstate is init x86/process: Check PF_KTHREAD and not current->mm for kernel threads x86/fpu: Invalidate FPU state after a failed XRSTOR from a user buffer x86/fpu: Prevent state corruption in __fpu__restore_sig() x86/ioremap: Map EFI-reserved memory as encrypted for SEV
| | * x86/mm: Avoid truncating memblocks for SGX memoryFan Du2021-06-181-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tl;dr: Several SGX users reported seeing the following message on NUMA systems: sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0. This turned out to be the memblock code mistakenly throwing away SGX memory. === Full Changelog === The 'max_pfn' variable represents the highest known RAM address. It can be used, for instance, to quickly determine for which physical addresses there is mem_map[] space allocated. The numa_meminfo code makes an effort to throw out ("trim") all memory blocks which are above 'max_pfn'. SGX memory is not considered RAM (it is marked as "Reserved" in the e820) and is not taken into account by max_pfn. Despite this, SGX memory areas have NUMA affinity and are enumerated in the ACPI SRAT table. The existing SGX code uses the numa_meminfo mechanism to look up the NUMA affinity for its memory areas. In cases where SGX memory was above max_pfn (usually just the one EPC section in the last highest NUMA node), the numa_memblock is truncated at 'max_pfn', which is below the SGX memory. When the SGX code tries to look up the affinity of this memory, it fails and produces an error message: sgx: [Firmware Bug]: Unable to map EPC section to online node. Fallback to the NUMA node 0. and assigns the memory to NUMA node 0. Instead of silently truncating the memory block at 'max_pfn' and dropping the SGX memory, add the truncated portion to 'numa_reserved_meminfo'. This allows the SGX code to later determine the NUMA affinity of its 'Reserved' area. Before, numa_meminfo looked like this (from 'crash'): blk = { start = 0x0, end = 0x2080000000, nid = 0x0 } { start = 0x2080000000, end = 0x4000000000, nid = 0x1 } numa_reserved_meminfo is empty. With this, numa_meminfo looks like this: blk = { start = 0x0, end = 0x2080000000, nid = 0x0 } { start = 0x2080000000, end = 0x4000000000, nid = 0x1 } and numa_reserved_meminfo has an entry for node 1's SGX memory: blk = { start = 0x4000000000, end = 0x4080000000, nid = 0x1 } [ daveh: completely rewrote/reworked changelog ] Fixes: 5d30f92e7631 ("x86/NUMA: Provide a range-to-target_node lookup facility") Reported-by: Reinette Chatre <reinette.chatre@intel.com> Signed-off-by: Fan Du <fan.du@intel.com> Signed-off-by: Dave Hansen <dave.hansen@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Jarkko Sakkinen <jarkko@kernel.org> Reviewed-by: Dan Williams <dan.j.williams@intel.com> Reviewed-by: Dave Hansen <dave.hansen@intel.com> Cc: <stable@vger.kernel.org> Link: https://lkml.kernel.org/r/20210617194657.0A99CB22@viggo.jf.intel.com
| | * x86/sgx: Add missing xa_destroy() when virtual EPC is destroyedKai Huang2021-06-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xa_destroy() needs to be called to destroy a virtual EPC's page array before calling kfree() to free the virtual EPC. Currently it is not called so add the missing xa_destroy(). Fixes: 540745ddbc70 ("x86/sgx: Introduce virtual EPC for use by KVM guests") Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@intel.com> Tested-by: Yang Zhong <yang.zhong@intel.com> Link: https://lkml.kernel.org/r/20210615101639.291929-1-kai.huang@intel.com
| | * x86/fpu: Reset state for all signal restore failuresThomas Gleixner2021-06-101-11/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If access_ok() or fpregs_soft_set() fails in __fpu__restore_sig() then the function just returns but does not clear the FPU state as it does for all other fatal failures. Clear the FPU state for these failures as well. Fixes: 72a671ced66d ("x86, fpu: Unify signal handling code paths for x86 and x86_64 kernels") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/87mtryyhhz.ffs@nanos.tec.linutronix.de
| | * x86/pkru: Write hardware init value to PKRU when xstate is initThomas Gleixner2021-06-091-2/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When user space brings PKRU into init state, then the kernel handling is broken: T1 user space xsave(state) state.header.xfeatures &= ~XFEATURE_MASK_PKRU; xrstor(state) T1 -> kernel schedule() XSAVE(S) -> T1->xsave.header.xfeatures[PKRU] == 0 T1->flags |= TIF_NEED_FPU_LOAD; wrpkru(); schedule() ... pk = get_xsave_addr(&T1->fpu->state.xsave, XFEATURE_PKRU); if (pk) wrpkru(pk->pkru); else wrpkru(DEFAULT_PKRU); Because the xfeatures bit is 0 and therefore the value in the xsave storage is not valid, get_xsave_addr() returns NULL and switch_to() writes the default PKRU. -> FAIL #1! So that wrecks any copy_to/from_user() on the way back to user space which hits memory which is protected by the default PKRU value. Assumed that this does not fail (pure luck) then T1 goes back to user space and because TIF_NEED_FPU_LOAD is set it ends up in switch_fpu_return() __fpregs_load_activate() if (!fpregs_state_valid()) { load_XSTATE_from_task(); } But if nothing touched the FPU between T1 scheduling out and back in, then the fpregs_state is still valid which means switch_fpu_return() does nothing and just clears TIF_NEED_FPU_LOAD. Back to user space with DEFAULT_PKRU loaded. -> FAIL #2! The fix is simple: if get_xsave_addr() returns NULL then set the PKRU value to 0 instead of the restrictive default PKRU value in init_pkru_value. [ bp: Massage in minor nitpicks from folks. ] Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rik van Riel <riel@surriel.com> Tested-by: Babu Moger <babu.moger@amd.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210608144346.045616965@linutronix.de
| | * x86/process: Check PF_KTHREAD and not current->mm for kernel threadsThomas Gleixner2021-06-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | switch_fpu_finish() checks current->mm as indicator for kernel threads. That's wrong because kernel threads can temporarily use a mm of a user process via kthread_use_mm(). Check the task flags for PF_KTHREAD instead. Fixes: 0cecca9d03c9 ("x86/fpu: Eager switch PKRU state") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rik van Riel <riel@surriel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210608144345.912645927@linutronix.de
| | * x86/fpu: Invalidate FPU state after a failed XRSTOR from a user bufferAndy Lutomirski2021-06-091-0/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Both Intel and AMD consider it to be architecturally valid for XRSTOR to fail with #PF but nonetheless change the register state. The actual conditions under which this might occur are unclear [1], but it seems plausible that this might be triggered if one sibling thread unmaps a page and invalidates the shared TLB while another sibling thread is executing XRSTOR on the page in question. __fpu__restore_sig() can execute XRSTOR while the hardware registers are preserved on behalf of a different victim task (using the fpu_fpregs_owner_ctx mechanism), and, in theory, XRSTOR could fail but modify the registers. If this happens, then there is a window in which __fpu__restore_sig() could schedule out and the victim task could schedule back in without reloading its own FPU registers. This would result in part of the FPU state that __fpu__restore_sig() was attempting to load leaking into the victim task's user-visible state. Invalidate preserved FPU registers on XRSTOR failure to prevent this situation from corrupting any state. [1] Frequent readers of the errata lists might imagine "complex microarchitectural conditions". Fixes: 1d731e731c4c ("x86/fpu: Add a fastpath to __fpu__restore_sig()") Signed-off-by: Andy Lutomirski <luto@kernel.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rik van Riel <riel@surriel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210608144345.758116583@linutronix.de
| | * x86/fpu: Prevent state corruption in __fpu__restore_sig()Thomas Gleixner2021-06-091-8/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The non-compacted slowpath uses __copy_from_user() and copies the entire user buffer into the kernel buffer, verbatim. This means that the kernel buffer may now contain entirely invalid state on which XRSTOR will #GP. validate_user_xstate_header() can detect some of that corruption, but that leaves the onus on callers to clear the buffer. Prior to XSAVES support, it was possible just to reinitialize the buffer, completely, but with supervisor states that is not longer possible as the buffer clearing code split got it backwards. Fixing that is possible but not corrupting the state in the first place is more robust. Avoid corruption of the kernel XSAVE buffer by using copy_user_to_xstate() which validates the XSAVE header contents before copying the actual states to the kernel. copy_user_to_xstate() was previously only called for compacted-format kernel buffers, but it works for both compacted and non-compacted forms. Using it for the non-compacted form is slower because of multiple __copy_from_user() operations, but that cost is less important than robust code in an already slow path. [ Changelog polished by Dave Hansen ] Fixes: b860eb8dce59 ("x86/fpu/xstate: Define new functions for clearing fpregs and xstates") Reported-by: syzbot+2067e764dbcd10721e2e@syzkaller.appspotmail.com Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov <bp@suse.de> Reviewed-by: Borislav Petkov <bp@suse.de> Acked-by: Dave Hansen <dave.hansen@linux.intel.com> Acked-by: Rik van Riel <riel@surriel.com> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/20210608144345.611833074@linutronix.de
| | * x86/ioremap: Map EFI-reserved memory as encrypted for SEVTom Lendacky2021-06-081-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some drivers require memory that is marked as EFI boot services data. In order for this memory to not be re-used by the kernel after ExitBootServices(), efi_mem_reserve() is used to preserve it by inserting a new EFI memory descriptor and marking it with the EFI_MEMORY_RUNTIME attribute. Under SEV, memory marked with the EFI_MEMORY_RUNTIME attribute needs to be mapped encrypted by Linux, otherwise the kernel might crash at boot like below: EFI Variables Facility v0.08 2004-May-17 general protection fault, probably for non-canonical address 0x3597688770a868b2: 0000 [#1] SMP NOPTI CPU: 13 PID: 1 Comm: swapper/0 Not tainted 5.12.4-2-default #1 openSUSE Tumbleweed Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/2015 RIP: 0010:efi_mokvar_entry_next [...] Call Trace: efi_mokvar_sysfs_init ? efi_mokvar_table_init do_one_initcall ? __kmalloc kernel_init_freeable ? rest_init kernel_init ret_from_fork Expand the __ioremap_check_other() function to additionally check for this other type of boot data reserved at runtime and indicate that it should be mapped encrypted for an SEV guest. [ bp: Massage commit message. ] Fixes: 58c909022a5a ("efi: Support for MOK variable config table") Reported-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Joerg Roedel <jroedel@suse.de> Signed-off-by: Borislav Petkov <bp@suse.de> Tested-by: Joerg Roedel <jroedel@suse.de> Cc: <stable@vger.kernel.org> # 5.10+ Link: https://lkml.kernel.org/r/20210608095439.12668-2-joro@8bytes.org
| * | Merge tag 'powerpc-5.13-6' of ↵Linus Torvalds2021-06-204-7/+7
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "Fix initrd corruption caused by our recent change to use relative jump labels. Fix a crash using perf record on systems without a hardware PMU backend. Rework our 64-bit signal handling slighty to make it more closely match the old behaviour, after the recent change to use unsafe user accessors. Thanks to Anastasia Kovaleva, Athira Rajeev, Christophe Leroy, Daniel Axtens, Greg Kurz, and Roman Bolshakov" * tag 'powerpc-5.13-6' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/perf: Fix crash in perf_instruction_pointer() when ppmu is not set powerpc: Fix initrd corruption with relative jump labels powerpc/signal64: Copy siginfo before changing regs->nip powerpc/mem: Add back missing header to fix 'no previous prototype' error
| | * | powerpc/perf: Fix crash in perf_instruction_pointer() when ppmu is not setAthira Rajeev2021-06-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On systems without any specific PMU driver support registered, running perf record causes Oops. The relevant portion from call trace: BUG: Kernel NULL pointer dereference on read at 0x00000040 Faulting instruction address: 0xc0021f0c Oops: Kernel access of bad area, sig: 11 [#1] BE PAGE_SIZE=4K PREEMPT CMPCPRO SAF3000 DIE NOTIFICATION CPU: 0 PID: 442 Comm: null_syscall Not tainted 5.13.0-rc6-s3k-dev-01645-g7649ee3d2957 #5164 NIP: c0021f0c LR: c00e8ad8 CTR: c00d8a5c NIP perf_instruction_pointer+0x10/0x60 LR perf_prepare_sample+0x344/0x674 Call Trace: perf_prepare_sample+0x7c/0x674 (unreliable) perf_event_output_forward+0x3c/0x94 __perf_event_overflow+0x74/0x14c perf_swevent_hrtimer+0xf8/0x170 __hrtimer_run_queues.constprop.0+0x160/0x318 hrtimer_interrupt+0x148/0x3b0 timer_interrupt+0xc4/0x22c Decrementer_virt+0xb8/0xbc During perf record session, perf_instruction_pointer() is called to capture the sample IP. This function in core-book3s accesses ppmu->flags. If a platform specific PMU driver is not registered, ppmu is set to NULL and accessing its members results in a crash. Fix this crash by checking if ppmu is set. Fixes: 2ca13a4cc56c ("powerpc/perf: Use regs->nip when SIAR is zero") Cc: stable@vger.kernel.org # v5.11+ Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Athira Rajeev <atrajeev@linux.vnet.ibm.com> Tested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/1623952506-1431-1-git-send-email-atrajeev@linux.vnet.ibm.com
| | * | powerpc: Fix initrd corruption with relative jump labelsMichael Ellerman2021-06-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b0b3b2c78ec0 ("powerpc: Switch to relative jump labels") switched us to using relative jump labels. That involves changing the code, target and key members in struct jump_entry to be relative to the address of the jump_entry, rather than absolute addresses. We have two static inlines that create a struct jump_entry, arch_static_branch() and arch_static_branch_jump(), as well as an asm macro ARCH_STATIC_BRANCH, which is used by the pseries-only hypervisor tracing code. Unfortunately we missed updating the key to be a relative reference in ARCH_STATIC_BRANCH. That causes a pseries kernel to have a handful of jump_entry structs with bad key values. Instead of being a relative reference they instead hold the full address of the key. However the code doesn't expect that, it still adds the key value to the address of the jump_entry (see jump_entry_key()) expecting to get a pointer to a key somewhere in kernel data. The table of jump_entry structs sits in rodata, which comes after the kernel text. In a typical build this will be somewhere around 15MB. The address of the key will be somewhere in data, typically around 20MB. Adding the two values together gets us a pointer somewhere around 45MB. We then call static_key_set_entries() with that bad pointer and modify some members of the struct static_key we think we are pointing at. A pseries kernel is typically ~30MB in size, so writing to ~45MB won't corrupt the kernel itself. However if we're booting with an initrd, depending on the size and exact location of the initrd, we can corrupt the initrd. Depending on how exactly we corrupt the initrd it can either cause the system to not boot, or just corrupt one of the files in the initrd. The fix is simply to make the key value relative to the jump_entry struct in the ARCH_STATIC_BRANCH macro. Fixes: b0b3b2c78ec0 ("powerpc: Switch to relative jump labels") Reported-by: Anastasia Kovaleva <a.kovaleva@yadro.com> Reported-by: Roman Bolshakov <r.bolshakov@yadro.com> Reported-by: Greg Kurz <groug@kaod.org> Reported-by: Daniel Axtens <dja@axtens.net> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Tested-by: Daniel Axtens <dja@axtens.net> Tested-by: Greg Kurz <groug@kaod.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210614131440.312360-1-mpe@ellerman.id.au
| | * | powerpc/signal64: Copy siginfo before changing regs->nipMichael Ellerman2021-06-141-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 96d7a4e06fab ("powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess switches") the 64-bit signal code was rearranged to use user_write_access_begin/end(). As part of that change the call to copy_siginfo_to_user() was moved later in the function, so that it could be done after the user_write_access_end(). In particular it was moved after we modify regs->nip to point to the signal trampoline. That means if copy_siginfo_to_user() fails we exit handle_rt_signal64() with an error but with regs->nip modified, whereas previously we would not modify regs->nip until the copy succeeded. Returning an error from signal delivery but with regs->nip updated leaves the process in a sort of half-delivered state. We do immediately force a SEGV in signal_setup_done(), called from do_signal(), so the process should never run in the half-delivered state. However that SEGV is not delivered until we've gone around to do_notify_resume() again, so it's possible some tracing could observe the half-delivered state. There are other cases where we fail signal delivery with regs partly updated, eg. the write to newsp and SA_SIGINFO, but the latter at least is very unlikely to fail as it reads back from the frame we just wrote to. Looking at other arches they seem to be more careful about leaving regs unchanged until the copy operations have succeeded, and in general that seems like good hygenie. So although the current behaviour is not cleary buggy, it's also not clearly correct. So move the call to copy_siginfo_to_user() up prior to the modification of regs->nip, which is closer to the old behaviour, and easier to reason about. Fixes: 96d7a4e06fab ("powerpc/signal64: Rewrite handle_rt_signal64() to minimise uaccess switches") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/20210608134605.2783677-1-mpe@ellerman.id.au
| | * | powerpc/mem: Add back missing header to fix 'no previous prototype' errorChristophe Leroy2021-06-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b26e8f27253a ("powerpc/mem: Move cache flushing functions into mm/cacheflush.c") removed asm/sparsemem.h which is required when CONFIG_MEMORY_HOTPLUG is selected to get the declaration of create_section_mapping(). Add it back. Fixes: b26e8f27253a ("powerpc/mem: Move cache flushing functions into mm/cacheflush.c") Reported-by: kernel test robot <lkp@intel.com> Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://lore.kernel.org/r/3e5b63bb3daab54a1eb9c20221c2e9528c4db9b3.1622883330.git.christophe.leroy@csgroup.eu
| * | | Merge tag 'riscv-for-linus-5.13-rc7' of ↵Linus Torvalds2021-06-195-10/+10
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: - A build fix to always build modules with the 'medany' code model, as the module loader doesn't support 'medlow'. - A Kconfig warning fix for the SiFive errata. - A pair of fixes that for regressions to the recent memory layout changes. - A fix for the FU740 device tree. * tag 'riscv-for-linus-5.13-rc7' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: dts: fu740: fix cache-controller interrupts riscv: Ensure BPF_JIT_REGION_START aligned with PMD size riscv: kasan: Fix MODULES_VADDR evaluation due to local variables' name riscv: sifive: fix Kconfig errata warning riscv32: Use medany C model for modules
| | * | | riscv: dts: fu740: fix cache-controller interruptsDavid Abdurachmanov2021-06-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The order of interrupt numbers is incorrect. The order for FU740 is: DirError, DataError, DataFail, DirFail From SiFive FU740-C000 Manual: 19 - L2 Cache DirError 20 - L2 Cache DirFail 21 - L2 Cache DataError 22 - L2 Cache DataFail Signed-off-by: David Abdurachmanov <david.abdurachmanov@sifive.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
| | * | | riscv: Ensure BPF_JIT_REGION_START aligned with PMD sizeJisheng Zhang2021-06-192-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Andreas reported commit fc8504765ec5 ("riscv: bpf: Avoid breaking W^X") breaks booting with one kind of defconfig, I reproduced a kernel panic with the defconfig: [ 0.138553] Unable to handle kernel paging request at virtual address ffffffff81201220 [ 0.139159] Oops [#1] [ 0.139303] Modules linked in: [ 0.139601] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 5.13.0-rc5-default+ #1 [ 0.139934] Hardware name: riscv-virtio,qemu (DT) [ 0.140193] epc : __memset+0xc4/0xfc [ 0.140416] ra : skb_flow_dissector_init+0x1e/0x82 [ 0.140609] epc : ffffffff8029806c ra : ffffffff8033be78 sp : ffffffe001647da0 [ 0.140878] gp : ffffffff81134b08 tp : ffffffe001654380 t0 : ffffffff81201158 [ 0.141156] t1 : 0000000000000002 t2 : 0000000000000154 s0 : ffffffe001647dd0 [ 0.141424] s1 : ffffffff80a43250 a0 : ffffffff81201220 a1 : 0000000000000000 [ 0.141654] a2 : 000000000000003c a3 : ffffffff81201258 a4 : 0000000000000064 [ 0.141893] a5 : ffffffff8029806c a6 : 0000000000000040 a7 : ffffffffffffffff [ 0.142126] s2 : ffffffff81201220 s3 : 0000000000000009 s4 : ffffffff81135088 [ 0.142353] s5 : ffffffff81135038 s6 : ffffffff8080ce80 s7 : ffffffff80800438 [ 0.142584] s8 : ffffffff80bc6578 s9 : 0000000000000008 s10: ffffffff806000ac [ 0.142810] s11: 0000000000000000 t3 : fffffffffffffffc t4 : 0000000000000000 [ 0.143042] t5 : 0000000000000155 t6 : 00000000000003ff [ 0.143220] status: 0000000000000120 badaddr: ffffffff81201220 cause: 000000000000000f [ 0.143560] [<ffffffff8029806c>] __memset+0xc4/0xfc [ 0.143859] [<ffffffff8061e984>] init_default_flow_dissectors+0x22/0x60 [ 0.144092] [<ffffffff800010fc>] do_one_initcall+0x3e/0x168 [ 0.144278] [<ffffffff80600df0>] kernel_init_freeable+0x1c8/0x224 [ 0.144479] [<ffffffff804868a8>] kernel_init+0x12/0x110 [ 0.144658] [<ffffffff800022de>] ret_from_exception+0x0/0xc [ 0.145124] ---[ end trace f1e9643daa46d591 ]--- After some investigation, I think I found the root cause: commit 2bfc6cd81bd ("move kernel mapping outside of linear mapping") moves BPF JIT region after the kernel: | #define BPF_JIT_REGION_START PFN_ALIGN((unsigned long)&_end) The &_end is unlikely aligned with PMD size, so the front bpf jit region sits with part of kernel .data section in one PMD size mapping. But kernel is mapped in PMD SIZE, when bpf_jit_binary_lock_ro() is called to make the first bpf jit prog ROX, we will make part of kernel .data section RO too, so when we write to, for example memset the .data section, MMU will trigger a store page fault. To fix the issue, we need to ensure the BPF JIT region is PMD size aligned. This patch acchieve this goal by restoring the BPF JIT region to original position, I.E the 128MB before kernel .text section. The modification to kasan_init.c is inspired by Alexandre. Fixes: fc8504765ec5 ("riscv: bpf: Avoid breaking W^X") Reported-by: Andreas Schwab <schwab@linux-m68k.org> Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
| | * | | riscv: kasan: Fix MODULES_VADDR evaluation due to local variables' nameJisheng Zhang2021-06-191-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 2bfc6cd81bd1 ("riscv: Move kernel mapping outside of linear mapping") makes use of MODULES_VADDR to populate kernel, BPF, modules mapping. Currently, MODULES_VADDR is defined as below for RV64: | #define MODULES_VADDR (PFN_ALIGN((unsigned long)&_end) - SZ_2G) But kasan_init() has two local variables which are also named as _start, _end, so MODULES_VADDR is evaluated with the local variable _end rather than the global "_end" as we expected. Fix this issue by renaming the two local variables. Fixes: 2bfc6cd81bd1 ("riscv: Move kernel mapping outside of linear mapping") Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
| | * | | riscv: sifive: fix Kconfig errata warningRandy Dunlap2021-06-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SOC_SIFIVE Kconfig entry unconditionally selects ERRATA_SIFIVE. However, ERRATA_SIFIVE depends on RISCV_ERRATA_ALTERNATIVE, which is not set, so SOC_SIFIVE should either depend on or select RISCV_ERRATA_ALTERNATIVE. Use 'select' here to quieten the Kconfig warning. WARNING: unmet direct dependencies detected for ERRATA_SIFIVE Depends on [n]: RISCV_ERRATA_ALTERNATIVE [=n] Selected by [y]: - SOC_SIFIVE [=y] Fixes: 1a0e5dbd3723 ("riscv: sifive: Add SiFive alternative ports") Signed-off-by: Randy Dunlap <rdunlap@infradead.org> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: linux-riscv@lists.infradead.org Cc: Vincent Chen <vincent.chen@sifive.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
| | * | | riscv32: Use medany C model for modulesKhem Raj2021-06-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_CMODEL_MEDLOW is used it ends up generating riscv_hi20_rela relocations in modules which are not resolved during runtime and following errors would be seen [ 4.802714] virtio_input: target 00000000c1539090 can not be addressed by the 32-bit offset from PC = 39148b7b [ 4.854800] virtio_input: target 00000000c1539090 can not be addressed by the 32-bit offset from PC = 9774456d Signed-off-by: Khem Raj <raj.khem@gmail.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
| * | | | Merge tag 's390-5.13-4' of ↵Linus Torvalds2021-06-191-2/+2
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux Pull s390 fixes from Vasily Gorbik: - Fix zcrypt ioctl hang due to AP queue msg counter dropping below 0 when pending requests are purged. - Two fixes for the machine check handler in the entry code. * tag 's390-5.13-4' of git://git.kernel.org/pub/scm/linux/kernel/git/s390/linux: s390/ap: Fix hanging ioctl caused by wrong msg counter s390/mcck: fix invalid KVM guest condition check s390/mcck: fix calculation of SIE critical section size
| | * | | | s390/mcck: fix invalid KVM guest condition checkAlexander Gordeev2021-06-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Wrong condition check is used to decide if a machine check hit while in KVM guest. As result of this check the instruction following the SIE critical section might be considered as still in KVM guest and _CIF_MCCK_GUEST CPU flag mistakenly set as result. Fixes: c929500d7a5a ("s390/nmi: s390: New low level handling for machine check happening in guest") Cc: <stable@vger.kernel.org> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
| | * | | | s390/mcck: fix calculation of SIE critical section sizeAlexander Gordeev2021-06-071-1/+1
| | | |_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The size of SIE critical section is calculated wrongly as result of a missed subtraction in commit 0b0ed657fe00 ("s390: remove critical section cleanup from entry.S") Fixes: 0b0ed657fe00 ("s390: remove critical section cleanup from entry.S") Cc: <stable@vger.kernel.org> Signed-off-by: Alexander Gordeev <agordeev@linux.ibm.com> Reviewed-by: Christian Borntraeger <borntraeger@de.ibm.com> Signed-off-by: Heiko Carstens <hca@linux.ibm.com> Signed-off-by: Vasily Gorbik <gor@linux.ibm.com>
| * | | | Merge tag 'pci-v5.13-fixes-2' of ↵Linus Torvalds2021-06-181-0/+44
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci Pull PCI fixes from Bjorn Helgaas: - Clear 64-bit flag for host bridge windows below 4GB to fix a resource allocation regression added in -rc1 (Punit Agrawal) - Fix tegra194 MCFG quirk build regressions added in -rc1 (Jon Hunter) - Avoid secondary bus resets on TI KeyStone C667X devices (Antti Järvinen) - Avoid secondary bus resets on some NVIDIA GPUs (Shanker Donthineni) - Work around FLR erratum on Huawei Intelligent NIC VF (Chiqijun) - Avoid broken ATS on AMD Navi14 GPU (Evan Quan) - Trust Broadcom BCM57414 NIC to isolate functions even though it doesn't advertise ACS support (Sriharsha Basavapatna) - Work around AMD RS690 BIOSes that don't configure DMA above 4GB (Mikel Rychliski) - Fix panic during PIO transfer on Aardvark controller (Pali Rohár) * tag 'pci-v5.13-fixes-2' of git://git.kernel.org/pub/scm/linux/kernel/git/helgaas/pci: PCI: aardvark: Fix kernel panic during PIO transfer PCI: Add AMD RS690 quirk to enable 64-bit DMA PCI: Add ACS quirk for Broadcom BCM57414 NIC PCI: Mark AMD Navi14 GPU ATS as broken PCI: Work around Huawei Intelligent NIC VF FLR erratum PCI: Mark some NVIDIA GPUs to avoid bus reset PCI: Mark TI C667X to avoid bus reset PCI: tegra194: Fix MCFG quirk build regressions PCI: of: Clear 64-bit flag for non-prefetchable memory below 4GB
| | * | | | PCI: Add AMD RS690 quirk to enable 64-bit DMAMikel Rychliski2021-06-181-0/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although the AMD RS690 chipset has 64-bit DMA support, BIOS implementations sometimes fail to configure the memory limit registers correctly. The Acer F690GVM mainboard uses this chipset and a Marvell 88E8056 NIC. The sky2 driver programs the NIC to use 64-bit DMA, which will not work: sky2 0000:02:00.0: error interrupt status=0x8 sky2 0000:02:00.0 eth0: tx timeout sky2 0000:02:00.0 eth0: transmit ring 0 .. 22 report=0 done=0 Other drivers required by this mainboard either don't support 64-bit DMA, or have it disabled using driver specific quirks. For example, the ahci driver has quirks to enable or disable 64-bit DMA depending on the BIOS version (see ahci_sb600_enable_64bit() in ahci.c). This ahci quirk matches against the SB600 SATA controller, but the real issue is almost certainly with the RS690 PCI host that it was commonly attached to. To avoid this issue in all drivers with 64-bit DMA support, fix the configuration of the PCI host. If the kernel is aware of physical memory above 4GB, but the BIOS never configured the PCI host with this information, update the registers with our values. [bhelgaas: drop PCI_DEVICE_ID_ATI_RS690 definition] Link: https://lore.kernel.org/r/20210611214823.4898-1-mikel@mikelr.com Signed-off-by: Mikel Rychliski <mikel@mikelr.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | Merge tag 'arc-5.13-rc7-fixes' of ↵Linus Torvalds2021-06-183-1/+45
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc Pull ARC fixes from Vineet Gupta: - ARCv2 userspace ABI not populating a few registers - Unbork CONFIG_HARDENED_USERCOPY for ARC * tag 'arc-5.13-rc7-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/vgupta/arc: ARC: fix CONFIG_HARDENED_USERCOPY ARCv2: save ABI registers across signal handling
| | * | | | | ARC: fix CONFIG_HARDENED_USERCOPYVineet Gupta2021-06-111-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently enabling this triggers a warning | usercopy: Kernel memory overwrite attempt detected to kernel text (offset 155633, size 11)! | usercopy: BUG: failure at mm/usercopy.c:99/usercopy_abort()! | |gcc generated __builtin_trap |Path: /bin/busybox |CPU: 0 PID: 84 Comm: init Not tainted 5.4.22 | |[ECR ]: 0x00090005 => gcc generated __builtin_trap |[EFA ]: 0x9024fcaa |[BLINK ]: usercopy_abort+0x8a/0x8c |[ERET ]: memfd_fcntl+0x0/0x470 |[STAT32]: 0x80080802 : IE K |... |... |Stack Trace: | memfd_fcntl+0x0/0x470 | usercopy_abort+0x8a/0x8c | __check_object_size+0x10e/0x138 | copy_strings+0x1f4/0x38c | __do_execve_file+0x352/0x848 | EV_Trap+0xcc/0xd0 The issue is triggered by an allocation in "init reclaimed" region. ARC _stext emcompasses the init region (for historical reasons we wanted the init.text to be under .text as well). This however trips up __check_object_size()->check_kernel_text_object() which treats this as object bleeding into kernel text. Fix that by rezoning _stext to start from regular kernel .text and leave out .init altogether. Fixes: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/15 Reported-by: Evgeniy Didin <didin@synopsys.com> Reviewed-by: Kees Cook <keescook@chromium.org> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
| | * | | | | ARCv2: save ABI registers across signal handlingVineet Gupta2021-06-112-0/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ARCv2 has some configuration dependent registers (r30, r58, r59) which could be targetted by the compiler. To keep the ABI stable, these were unconditionally part of the glibc ABI (sysdeps/unix/sysv/linux/arc/sys/ucontext.h:mcontext_t) however we missed populating them (by saving/restoring them across signal handling). This patch fixes the issue by - adding arcv2 ABI regs to kernel struct sigcontext - populating them during signal handling Change to struct sigcontext might seem like a glibc ABI change (although it primarily uses ucontext_t:mcontext_t) but the fact is - it has only been extended (existing fields are not touched) - the old sigcontext was ABI incomplete to begin with anyways Fixes: https://github.com/foss-for-synopsys-dwc-arc-processors/linux/issues/53 Cc: <stable@vger.kernel.org> Tested-by: kernel test robot <lkp@intel.com> Reported-by: Vladimir Isaev <isaev@synopsys.com> Signed-off-by: Vineet Gupta <vgupta@synopsys.com>
| * | | | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2021-06-177-10/+53
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm fixes from Paolo Bonzini: "Miscellaneous bugfixes. The main interesting one is a NULL pointer dereference reported by syzkaller ("KVM: x86: Immediately reset the MMU context when the SMM flag is cleared")" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: KVM: selftests: Fix kvm_check_cap() assertion KVM: x86/mmu: Calculate and check "full" mmu_role for nested MMU KVM: X86: Fix x86_emulator slab cache leak KVM: SVM: Call SEV Guest Decommission if ASID binding fails KVM: x86: Immediately reset the MMU context when the SMM flag is cleared KVM: x86: Fix fall-through warnings for Clang KVM: SVM: fix doc warnings KVM: selftests: Fix compiling errors when initializing the static structure kvm: LAPIC: Restore guard to prevent illegal APIC register access
| * \ \ \ \ \ \ Merge tag 'riscv-for-linus-5.13-rc6' of ↵Linus Torvalds2021-06-126-16/+36
| |\ \ \ \ \ \ \ | | | |_|_|_|/ / | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux Pull RISC-V fixes from Palmer Dabbelt: - A pair of XIP fixes: one to fix alternatives, and one to turn off the rest of the features that require code modification - A fix to a type that was causing some alternatives to break - A build fix for BUILTIN_DTB * tag 'riscv-for-linus-5.13-rc6' of git://git.kernel.org/pub/scm/linux/kernel/git/riscv/linux: riscv: Fix BUILTIN_DTB for sifive and microchip soc riscv: alternative: fix typo in macro name riscv: code patching only works on !XIP_KERNEL riscv: xip: support runtime trap patching
| | * | | | | | riscv: Fix BUILTIN_DTB for sifive and microchip socAlexandre Ghiti2021-06-122-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix BUILTIN_DTB config which resulted in a dtb that was actually not built into the Linux image: in the same manner as Canaan soc does, create an object file from the dtb file that will get linked into the Linux image. Signed-off-by: Alexandre Ghiti <alex@ghiti.fr> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
| | * | | | | | riscv: alternative: fix typo in macro nameVitaly Wool2021-06-111-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | alternative-macros.h defines ALT_NEW_CONTENT in its assembly part and ALT_NEW_CONSTENT in the C part. Most likely it is the latter that is wrong. Fixes: 6f4eea90465ad (riscv: Introduce alternative mechanism to apply errata solution) Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
| | * | | | | | riscv: code patching only works on !XIP_KERNELJisheng Zhang2021-06-111-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some features which need code patching such as KPROBES, DYNAMIC_FTRACE KGDB can only work on !XIP_KERNEL. Add dependencies for these features that rely on code patching. Signed-off-by: Jisheng Zhang <jszhang@kernel.org> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
| | * | | | | | riscv: xip: support runtime trap patchingVitaly Wool2021-06-112-5/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RISCV_ERRATA_ALTERNATIVE patches text at runtime which is currently not possible when the kernel is executed from the flash in XIP mode. Since runtime patching concerns only traps at the moment, let's just have all the traps reside in RAM anyway if RISCV_ERRATA_ALTERNATIVE is set. Thus, these functions will be patch-able even when the .text section is in flash. Signed-off-by: Vitaly Wool <vitaly.wool@konsulko.com> Signed-off-by: Palmer Dabbelt <palmerdabbelt@google.com>
| * | | | | | | Merge tag 'perf-urgent-2021-06-12' of ↵Linus Torvalds2021-06-122-5/+8
| |\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull perf fixes from Ingo Molnar: "Misc fixes: - Fix the NMI watchdog on ancient Intel CPUs - Remove a misguided, NMI-unsafe KASAN callback from the NMI-safe irq_work path used by perf. - Fix uncore events on Ice Lake servers. - Someone booted maxcpus=1 on an SNB-EP, and the uncore driver emitted warnings and was probably buggy. Fix it. - KCSAN found a genuine data race in the core perf code. Somewhat ironically the bug was introduced through a recent race fix. :-/ In our defense, the new race window was much more narrow. Fix it" * tag 'perf-urgent-2021-06-12' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUs irq_work: Make irq_work_queue() NMI-safe again perf/x86/intel/uncore: Fix M2M event umask for Ice Lake server perf/x86/intel/uncore: Fix a kernel WARNING triggered by maxcpus=1 perf: Fix data race between pin_count increment/decrement
| | * | | | | | | x86/nmi_watchdog: Fix old-style NMI watchdog regression on old Intel CPUsCodyYao-oc2021-06-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following commit: 3a4ac121c2ca ("x86/perf: Add hardware performance events support for Zhaoxin CPU.") Got the old-style NMI watchdog logic wrong and broke it for basically every Intel CPU where it was active. Which is only truly old CPUs, so few people noticed. On CPUs with perf events support we turn off the old-style NMI watchdog, so it was pretty pointless to add the logic for X86_VENDOR_ZHAOXIN to begin with ... :-/ Anyway, the fix is to restore the old logic and add a 'break'. [ mingo: Wrote a new changelog. ] Fixes: 3a4ac121c2ca ("x86/perf: Add hardware performance events support for Zhaoxin CPU.") Signed-off-by: CodyYao-oc <CodyYao-oc@zhaoxin.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Link: https://lore.kernel.org/r/20210607025335.9643-1-CodyYao-oc@zhaoxin.com
| | * | | | | | | perf/x86/intel/uncore: Fix M2M event umask for Ice Lake serverKan Liang2021-06-011-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Perf tool errors out with the latest event list for the Ice Lake server. event syntax error: 'unc_m2m_imc_reads.to_pmm' \___ value too big for format, maximum is 255 The same as the Snow Ridge server, the M2M uncore unit in the Ice Lake server has the unit mask extension field as well. Fixes: 2b3b76b5ec67 ("perf/x86/intel/uncore: Add Ice Lake server uncore support") Reported-by: Jin Yao <yao.jin@linux.intel.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: stable@vger.kernel.org Link: https://lkml.kernel.org/r/1622552943-119174-1-git-send-email-kan.liang@linux.intel.com
| | * | | | | | | perf/x86/intel/uncore: Fix a kernel WARNING triggered by maxcpus=1Kan Liang2021-05-311-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A kernel WARNING may be triggered when setting maxcpus=1. The uncore counters are Die-scope. When probing a PCI device, only the BUS information can be retrieved. The uncore driver has to maintain a mapping table used to calculate the logical Die ID from a given BUS#. Before the patch ba9506be4e40, the mapping table stores the mapping information from the BUS# -> a Physical Socket ID. To calculate the logical die ID, perf does, - In snbep_pci2phy_map_init(), retrieve the BUS# -> a Physical Socket ID from the UBOX PCI configure space. - Calculate the mapping information (a BUS# -> a Physical Socket ID) for the other PCI BUS. - In the uncore_pci_probe(), get the physical Socket ID from a given BUS and the mapping table. - Calculate the logical Die ID Since only the logical Die ID is required, with the patch ba9506be4e40, the mapping table stores the mapping information from the BUS# -> a logical Die ID. Now perf does, - In snbep_pci2phy_map_init(), retrieve the BUS# -> a Physical Socket ID from the UBOX PCI configure space. - Calculate the logical Die ID - Calculate the mapping information (a BUS# -> a logical Die ID) for the other PCI BUS. - In the uncore_pci_probe(), get the logical die ID from a given BUS and the mapping table. When calculating the logical Die ID, -1 may be returned, especially when maxcpus=1. Here, -1 means the logical Die ID is not found. But when calculating the mapping information for the other PCI BUS, -1 indicates that it's the other PCI BUS that requires the calculation of the mapping. The driver will mistakenly do the calculation. Uses the -ENODEV to indicate the case which the logical Die ID is not found. The driver will not mess up the mapping table anymore. Fixes: ba9506be4e40 ("perf/x86/intel/uncore: Store the logical die id instead of the physical die id.") Reported-by: John Donnelly <john.p.donnelly@oracle.com> Signed-off-by: Kan Liang <kan.liang@linux.intel.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Acked-by: John Donnelly <john.p.donnelly@oracle.com> Tested-by: John Donnelly <john.p.donnelly@oracle.com> Link: https://lkml.kernel.org/r/1622037527-156028-1-git-send-email-kan.liang@linux.intel.com