summaryrefslogtreecommitdiffstats
path: root/arch/powerpc (follow)
Commit message (Collapse)AuthorAgeFilesLines
* mm/powerpc: replace pXd_is_leaf() with pXd_leaf()Peter Xu2024-03-067-52/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | They're the same macros underneath. Drop pXd_is_leaf(), instead always use pXd_leaf(). At the meantime, instead of renames, drop the pXd_is_leaf() fallback definitions directly in arch/powerpc/include/asm/pgtable.h. because similar fallback macros for pXd_leaf() are already defined in include/linux/pgtable.h. Link: https://lkml.kernel.org/r/20240305043750.93762-3-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Suggested-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Muchun Song <muchun.song@linux.dev> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Cc: Yang Shi <shy828301@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/powerpc: define pXd_large() with pXd_leaf()Peter Xu2024-03-062-15/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "mm/treewide: Replace pXd_large() with pXd_leaf()", v3. These two APIs are mostly always the same. It's confusing to have both of them. Merge them into one. Here I used pXd_leaf() only because pXd_leaf() is a global API which is always defined, while pXd_large() is not. We have yet one more API that is similar which is pXd_huge(), but that's even trickier, so let's do it step by step. Some special cares are taken for ppc and x86, they're done as separate cleanups first. This patch (of 10): The two definitions are the same. The only difference is that pXd_large() is only defined with THP selected, and only on book3s 64bits. Instead of implementing it twice, make pXd_large() a macro to pXd_leaf(). Define it unconditionally just like pXd_leaf(). This helps to prepare merging the two APIs. Link: https://lkml.kernel.org/r/20240305043750.93762-1-peterx@redhat.com Link: https://lkml.kernel.org/r/20240305043750.93762-2-peterx@redhat.com Signed-off-by: Peter Xu <peterx@redhat.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: "Aneesh Kumar K.V" <aneesh.kumar@kernel.org> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Kirill A. Shutemov <kirill@shutemov.name> Cc: Muchun Song <muchun.song@linux.dev> Cc: Yang Shi <shy828301@gmail.com> Cc: Alexander Potapenko <glider@google.com> Cc: Andrey Konovalov <andreyknvl@gmail.com> Cc: Andrey Ryabinin <ryabinin.a.a@gmail.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Dmitry Vyukov <dvyukov@google.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vincenzo Frascino <vincenzo.frascino@arm.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* hugetlb: parallelize 1G hugetlb initializationGang Li2024-03-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Optimizing the initialization speed of 1G huge pages through parallelization. 1G hugetlbs are allocated from bootmem, a process that is already very fast and does not currently require optimization. Therefore, we focus on parallelizing only the initialization phase in `gather_bootmem_prealloc`. Here are some test results: test case no patch(ms) patched(ms) saved ------------------- -------------- ------------- -------- 256c2T(4 node) 1G 4745 2024 57.34% 128c1T(2 node) 1G 3358 1712 49.02% 12T 1G 77000 18300 76.23% [akpm@linux-foundation.org: s/initialied/initialized/, per Alexey] Link: https://lkml.kernel.org/r/20240222140422.393911-9-gang.li@linux.dev Signed-off-by: Gang Li <ligang.bdlg@bytedance.com> Tested-by: David Rientjes <rientjes@google.com> Reviewed-by: Muchun Song <muchun.song@linux.dev> Cc: Alexey Dobriyan <adobriyan@gmail.com> Cc: Daniel Jordan <daniel.m.jordan@oracle.com> Cc: David Hildenbrand <david@redhat.com> Cc: Jane Chu <jane.chu@oracle.com> Cc: Mike Kravetz <mike.kravetz@oracle.com> Cc: Paul E. McKenney <paulmck@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Cc: Steffen Klassert <steffen.klassert@secunet.com> Cc: Tim Chen <tim.c.chen@linux.intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* arch, crash: move arch_crash_save_vmcoreinfo() out to file vmcore_info.cBaoquan He2024-02-243-28/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nathan reported below building error: ===== $ curl -LSso .config https://git.alpinelinux.org/aports/plain/community/linux-edge/config-edge.armv7 $ make -skj"$(nproc)" ARCH=arm CROSS_COMPILE=arm-linux-gnueabi- olddefconfig all .. arm-linux-gnueabi-ld: arch/arm/kernel/machine_kexec.o: in function `arch_crash_save_vmcoreinfo': machine_kexec.c:(.text+0x488): undefined reference to `vmcoreinfo_append_str' ==== On architecutres, like arm, s390, ppc, sh, function arch_crash_save_vmcoreinfo() is located in machine_kexec.c and it can only be compiled in when CONFIG_KEXEC_CORE=y. That's not right because arch_crash_save_vmcoreinfo() is used to export arch specific vmcoreinfo. CONFIG_VMCORE_INFO is supposed to control its compiling in. However, CONFIG_VMVCORE_INFO could be independent of CONFIG_KEXEC_CORE, e.g CONFIG_PROC_KCORE=y will select CONFIG_VMVCORE_INFO. Or CONFIG_KEXEC/CONFIG_KEXEC_FILE is set while CONFIG_CRASH_DUMP is not set, it will report linking error. So, on arm, s390, ppc and sh, move arch_crash_save_vmcoreinfo out to a new file vmcore_info.c. Let CONFIG_VMCORE_INFO decide if compiling in arch_crash_save_vmcoreinfo(). [akpm@linux-foundation.org: remove stray newlines at eof] Link: https://lkml.kernel.org/r/20240129135033.157195-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Reported-by: Nathan Chancellor <nathan@kernel.org> Closes: https://lore.kernel.org/all/20240126045551.GA126645@dev-arch.thelio-3990X/T/#u Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Hari Bathini <hbathini@linux.ibm.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* ppc, crash: enforce KEXEC and KEXEC_FILE to select CRASH_DUMPBaoquan He2024-02-241-0/+6
| | | | | | | | | | | | | | | | | | | | | | In PowerPC, the crash dumping and kexec reboot share code in arch_kexec_locate_mem_hole(), in which struct crash_mem is used. Here enfoce enforce KEXEC and KEXEC_FILE to select CRASH_DUMP for now. [bhe@redhat.com: fix allnoconfig on ppc] Link: https://lkml.kernel.org/r/ZbJwMyCpz4HDySoo@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20240124051254.67105-9-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* crash: remove dependency of FA_DUMP on CRASH_DUMPBaoquan He2024-02-241-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In kdump kernel, /proc/vmcore is an elf file mapping the crashed kernel's old memory content. Its elf header is constructed in 1st kernel and passed to kdump kernel via elfcorehdr_addr. Config CRASH_DUMP enables the code of 1st kernel's old memory accessing in different architectures. Currently, config FA_DUMP has dependency on CRASH_DUMP because fadump needs access global variable 'elfcorehdr_addr' to judge if it's in kdump kernel within function is_kdump_kernel(). In the current kernel/crash_dump.c, variable 'elfcorehdr_addr' is defined, and function setup_elfcorehdr() used to parse kernel parameter to fetch the passed value of elfcorehdr_addr. Only for accessing elfcorehdr_addr, FA_DUMP really doesn't have to depends on CRASH_DUMP. To remove the dependency of FA_DUMP on CRASH_DUMP to avoid confusion, rename kernel/crash_dump.c to kernel/elfcorehdr.c, and build it when CONFIG_VMCORE_INFO is ebabled. With this, FA_DUMP doesn't need to depend on CRASH_DUMP. [bhe@redhat.com: power/fadump: make FA_DUMP select CRASH_DUMP] Link: https://lkml.kernel.org/r/Zb8D1ASrgX0qVm9z@MiWiFi-R3L-srv Link: https://lkml.kernel.org/r/20240124051254.67105-4-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* crash: split vmcoreinfo exporting code out from crash_core.cBaoquan He2024-02-243-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now move the relevant codes into separate files: kernel/crash_reserve.c, include/linux/crash_reserve.h. And add config item CRASH_RESERVE to control its enabling. And also update the old ifdeffery of CONFIG_CRASH_CORE, including of <linux/crash_core.h> and config item dependency on CRASH_CORE accordingly. And also do renaming as follows: - arch/xxx/kernel/{crash_core.c => vmcore_info.c} because they are only related to vmcoreinfo exporting on x86, arm64, riscv. And also Remove config item CRASH_CORE, and rely on CONFIG_KEXEC_CORE to decide if build in crash_core.c. [yang.lee@linux.alibaba.com: remove duplicated include in vmcore_info.c] Link: https://lkml.kernel.org/r/20240126005744.16561-1-yang.lee@linux.alibaba.com Link: https://lkml.kernel.org/r/20240124051254.67105-3-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Signed-off-by: Yang Li <yang.lee@linux.alibaba.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* kexec: split crashkernel reservation code out from crash_core.cBaoquan He2024-02-242-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Patch series "Split crash out from kexec and clean up related config items", v3. Motivation: ============= Previously, LKP reported a building error. When investigating, it can't be resolved reasonablly with the present messy kdump config items. https://lore.kernel.org/oe-kbuild-all/202312182200.Ka7MzifQ-lkp@intel.com/ The kdump (crash dumping) related config items could causes confusions: Firstly, CRASH_CORE enables codes including - crashkernel reservation; - elfcorehdr updating; - vmcoreinfo exporting; - crash hotplug handling; Now fadump of powerpc, kcore dynamic debugging and kdump all selects CRASH_CORE, while fadump - fadump needs crashkernel parsing, vmcoreinfo exporting, and accessing global variable 'elfcorehdr_addr'; - kcore only needs vmcoreinfo exporting; - kdump needs all of the current kernel/crash_core.c. So only enabling PROC_CORE or FA_DUMP will enable CRASH_CORE, this mislead people that we enable crash dumping, actual it's not. Secondly, It's not reasonable to allow KEXEC_CORE select CRASH_CORE. Because KEXEC_CORE enables codes which allocate control pages, copy kexec/kdump segments, and prepare for switching. These codes are shared by both kexec reboot and kdump. We could want kexec reboot, but disable kdump. In that case, CRASH_CORE should not be selected. -------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC=y CONFIG_KEXEC_FILE=y --------------------- Thirdly, It's not reasonable to allow CRASH_DUMP select KEXEC_CORE. That could make KEXEC_CORE, CRASH_DUMP are enabled independently from KEXEC or KEXEC_FILE. However, w/o KEXEC or KEXEC_FILE, the KEXEC_CORE code built in doesn't make any sense because no kernel loading or switching will happen to utilize the KEXEC_CORE code. --------------------- CONFIG_CRASH_CORE=y CONFIG_KEXEC_CORE=y CONFIG_CRASH_DUMP=y --------------------- In this case, what is worse, on arch sh and arm, KEXEC relies on MMU, while CRASH_DUMP can still be enabled when !MMU, then compiling error is seen as the lkp test robot reported in above link. ------arch/sh/Kconfig------ config ARCH_SUPPORTS_KEXEC def_bool MMU config ARCH_SUPPORTS_CRASH_DUMP def_bool BROKEN_ON_SMP --------------------------- Changes: =========== 1, split out crash_reserve.c from crash_core.c; 2, split out vmcore_infoc. from crash_core.c; 3, move crash related codes in kexec_core.c into crash_core.c; 4, remove dependency of FA_DUMP on CRASH_DUMP; 5, clean up kdump related config items; 6, wrap up crash codes in crash related ifdefs on all 8 arch-es which support crash dumping, except of ppc; Achievement: =========== With above changes, I can rearrange the config item logic as below (the right item depends on or is selected by the left item): PROC_KCORE -----------> VMCORE_INFO |----------> VMCORE_INFO FA_DUMP----| |----------> CRASH_RESERVE ---->VMCORE_INFO / |---->CRASH_RESERVE KEXEC --| /| |--> KEXEC_CORE--> CRASH_DUMP-->/-|---->PROC_VMCORE KEXEC_FILE --| \ | \---->CRASH_HOTPLUG KEXEC --| |--> KEXEC_CORE (for kexec reboot only) KEXEC_FILE --| Test ======== On all 8 architectures, including x86_64, arm64, s390x, sh, arm, mips, riscv, loongarch, I did below three cases of config item setting and building all passed. Take configs on x86_64 as exampmle here: (1) Both CONFIG_KEXEC and KEXEC_FILE is unset, then all kexec/kdump items are unset automatically: # Kexec and crash features # CONFIG_KEXEC is not set # CONFIG_KEXEC_FILE is not set # end of Kexec and crash features (2) set CONFIG_KEXEC_FILE and 'make olddefconfig': --------------- # Kexec and crash features CONFIG_CRASH_RESERVE=y CONFIG_VMCORE_INFO=y CONFIG_KEXEC_CORE=y CONFIG_KEXEC_FILE=y CONFIG_CRASH_DUMP=y CONFIG_CRASH_HOTPLUG=y CONFIG_CRASH_MAX_MEMORY_RANGES=8192 # end of Kexec and crash features --------------- (3) unset CONFIG_CRASH_DUMP in case 2 and execute 'make olddefconfig': ------------------------ # Kexec and crash features CONFIG_KEXEC_CORE=y CONFIG_KEXEC_FILE=y # end of Kexec and crash features ------------------------ Note: For ppc, it needs investigation to make clear how to split out crash code in arch folder. Hope Hari and Pingfan can help have a look, see if it's doable. Now, I make it either have both kexec and crash enabled, or disable both of them altogether. This patch (of 14): Both kdump and fa_dump of ppc rely on crashkernel reservation. Move the relevant codes into separate files: crash_reserve.c, include/linux/crash_reserve.h. And also add config item CRASH_RESERVE to control its enabling of the codes. And update config items which has relationship with crashkernel reservation. And also change ifdeffery from CONFIG_CRASH_CORE to CONFIG_CRASH_RESERVE when those scopes are only crashkernel reservation related. And also rename arch/XXX/include/asm/{crash_core.h => crash_reserve.h} on arm64, x86 and risc-v because those architectures' crash_core.h is only related to crashkernel reservation. [akpm@linux-foundation.org: s/CRASH_RESEERVE/CRASH_RESERVE/, per Klara Modin] Link: https://lkml.kernel.org/r/20240124051254.67105-1-bhe@redhat.com Link: https://lkml.kernel.org/r/20240124051254.67105-2-bhe@redhat.com Signed-off-by: Baoquan He <bhe@redhat.com> Acked-by: Hari Bathini <hbathini@linux.ibm.com> Cc: Al Viro <viro@zeniv.linux.org.uk> Cc: Eric W. Biederman <ebiederm@xmission.com> Cc: Pingfan Liu <piliu@redhat.com> Cc: Klara Modin <klarasmodin@gmail.com> Cc: Michael Kelley <mhklinux@outlook.com> Cc: Nathan Chancellor <nathan@kernel.org> Cc: Stephen Rothwell <sfr@canb.auug.org.au> Cc: Yang Li <yang.lee@linux.alibaba.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/mmu_gather: add tlb_remove_tlb_entries()David Hildenbrand2024-02-231-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let's add a helper that lets us batch-process multiple consecutive PTEs. Note that the loop will get optimized out on all architectures except on powerpc. We have to add an early define of __tlb_remove_tlb_entry() on ppc to make the compiler happy (and avoid making tlb_remove_tlb_entries() a macro). [arnd@kernel.org: change __tlb_remove_tlb_entry() to an inline function] Link: https://lkml.kernel.org/r/20240221154549.2026073-1-arnd@kernel.org Link: https://lkml.kernel.org/r/20240214204435.167852-8-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Signed-off-by: Arnd Bergmann <arnd@arndb.de> Reviewed-by: Ryan Roberts <ryan.roberts@arm.com> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Aneesh Kumar K.V <aneesh.kumar@linux.ibm.com> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Michal Hocko <mhocko@suse.com> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Peter Zijlstra (Intel) <peterz@infradead.org> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Cc: Yin Fengwei <fengwei.yin@intel.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm/hugetlb: move page order check inside hugetlb_cma_reserve()Anshuman Khandual2024-02-221-3/+1
| | | | | | | | | | | | | | | | All platforms could benefit from page order check against MAX_PAGE_ORDER before allocating a CMA area for gigantic hugetlb pages. Let's move this check from individual platforms to generic hugetlb. Link: https://lkml.kernel.org/r/20240209054221.1403364-1-anshuman.khandual@arm.com Signed-off-by: Anshuman Khandual <anshuman.khandual@arm.com> Reviewed-by: Jane Chu <jane.chu@oracle.com> Reviewed-by: David Hildenbrand <david@redhat.com> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Will Deacon <will@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* powerpc/mm: use pte_next_pfn() in set_ptes()David Hildenbrand2024-02-221-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let's use our handy new helper. Note that the implementation is slightly different, but shouldn't really make a difference in practice. Link: https://lkml.kernel.org/r/20240129124649.189745-11-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* powerpc/pgtable: define PFN_PTE_SHIFTDavid Hildenbrand2024-02-221-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We want to make use of pte_next_pfn() outside of set_ptes(). Let's simply define PFN_PTE_SHIFT, required by pte_next_pfn(). Link: https://lkml.kernel.org/r/20240129124649.189745-5-david@redhat.com Signed-off-by: David Hildenbrand <david@redhat.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Tested-by: Ryan Roberts <ryan.roberts@arm.com> Reviewed-by: Mike Rapoport (IBM) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Aneesh Kumar K.V <aneesh.kumar@kernel.org> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: David S. Miller <davem@davemloft.net> Cc: Dinh Nguyen <dinguyen@kernel.org> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Naveen N. Rao <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Russell King (Oracle) <linux@armlinux.org.uk> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* mm: ptdump: have ptdump_check_wx() return boolChristophe Leroy2024-02-221-4/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Have ptdump_check_wx() return true when the check is successful or false otherwise. [akpm@linux-foundation.org: fix a couple of build issues (x86_64 allmodconfig)] Link: https://lkml.kernel.org/r/7943149fe955458cb7b57cd483bf41a3aad94684.1706610398.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Phong Tran <tranmanphong@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Price <steven.price@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* powerpc,s390: ptdump: define ptdump_check_wx() regardless of CONFIG_DEBUG_WXChristophe Leroy2024-02-221-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Following patch will use ptdump_check_wx() regardless of CONFIG_DEBUG_WX, so define it at all times on powerpc and s390 just like other architectures. Though keep the WARN_ON_ONCE() only when CONFIG_DEBUG_WX is set. Link: https://lkml.kernel.org/r/07bfb04c7fec58e84413e91d2533581be357a696.1706610398.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Phong Tran <tranmanphong@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Price <steven.price@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* arm64, powerpc, riscv, s390, x86: ptdump: refactor CONFIG_DEBUG_WXChristophe Leroy2024-02-224-13/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All architectures using the core ptdump functionality also implement CONFIG_DEBUG_WX, and they all do it more or less the same way, with a function called debug_checkwx() that is called by mark_rodata_ro(), which is a substitute to ptdump_check_wx() when CONFIG_DEBUG_WX is set and a no-op otherwise. Refactor by centrally defining debug_checkwx() in linux/ptdump.h and call debug_checkwx() immediately after calling mark_rodata_ro() instead of calling it at the end of every mark_rodata_ro(). On x86_32, mark_rodata_ro() first checks __supported_pte_mask has _PAGE_NX before calling debug_checkwx(). Now the check is inside the callee ptdump_walk_pgd_level_checkwx(). On powerpc_64, mark_rodata_ro() bails out early before calling ptdump_check_wx() when the MMU doesn't have KERNEL_RO feature. The check is now also done in ptdump_check_wx() as it is called outside mark_rodata_ro(). Link: https://lkml.kernel.org/r/a59b102d7964261d31ead0316a9f18628e4e7a8e.1706610398.git.christophe.leroy@csgroup.eu Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reviewed-by: Alexandre Ghiti <alexghiti@rivosinc.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Alexander Gordeev <agordeev@linux.ibm.com> Cc: Andy Lutomirski <luto@kernel.org> Cc: "Aneesh Kumar K.V (IBM)" <aneesh.kumar@kernel.org> Cc: Borislav Petkov (AMD) <bp@alien8.de> Cc: Catalin Marinas <catalin.marinas@arm.com> Cc: Christian Borntraeger <borntraeger@linux.ibm.com> Cc: Dave Hansen <dave.hansen@linux.intel.com> Cc: Gerald Schaefer <gerald.schaefer@linux.ibm.com> Cc: Greg KH <greg@kroah.com> Cc: Heiko Carstens <hca@linux.ibm.com> Cc: "H. Peter Anvin" <hpa@zytor.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Mark Rutland <mark.rutland@arm.com> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: "Naveen N. Rao" <naveen.n.rao@linux.ibm.com> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Phong Tran <tranmanphong@gmail.com> Cc: Russell King <linux@armlinux.org.uk> Cc: Steven Price <steven.price@arm.com> Cc: Sven Schnelle <svens@linux.ibm.com> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vasily Gorbik <gor@linux.ibm.com> Cc: Will Deacon <will@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
* Merge tag 'powerpc-6.8-3' of ↵Linus Torvalds2024-02-1823-31/+75
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Michael Ellerman: "This is a bit of a big batch for rc4, but just due to holiday hangover and because I didn't send any fixes last week due to a late revert request. I think next week should be back to normal. - Fix ftrace bug on boot caused by exit text sections with '-fpatchable-function-entry' - Fix accuracy of stolen time on pseries since the switch to VIRT_CPU_ACCOUNTING_GEN - Fix a crash in the IOMMU code when doing DLPAR remove - Set pt_regs->link on scv entry to fix BPF stack unwinding - Add missing PPC_FEATURE_BOOKE on 64-bit e5500/e6500, which broke gdb - Fix boot on some 6xx platforms with STRICT_KERNEL_RWX enabled - Fix build failures with KASAN enabled and 32KB stack size - Some other minor fixes Thanks to Arnd Bergmann, Benjamin Gray, Christophe Leroy, David Engraf, Gaurav Batra, Jason Gunthorpe, Jiangfeng Xiao, Matthias Schiffer, Nathan Lynch, Naveen N Rao, Nicholas Piggin, Nysal Jan K.A, R Nageswara Sastry, Shivaprasad G Bhat, Shrikanth Hegde, Spoorthy, Srikar Dronamraju, and Venkat Rao Bagalkote" * tag 'powerpc-6.8-3' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/iommu: Fix the missing iommu_group_put() during platform domain attach powerpc/pseries: fix accuracy of stolen time powerpc/ftrace: Ignore ftrace locations in exit text sections powerpc/cputable: Add missing PPC_FEATURE_BOOKE on PPC64 Book-E powerpc/kasan: Limit KASAN thread size increase to 32KB Revert "powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add" powerpc: 85xx: mark local functions static powerpc: udbg_memcons: mark functions static powerpc/kasan: Fix addr error caused by page alignment powerpc/6xx: set High BAT Enable flag on G2_LE cores selftests/powerpc/papr_vpd: Check devfd before get_system_loc_code() powerpc/64: Set task pt_regs->link to the LR value on scv entry powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add powerpc/pseries/papr-sysparm: use u8 arrays for payloads
| * powerpc/iommu: Fix the missing iommu_group_put() during platform domain attachShivaprasad G Bhat2024-02-141-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The function spapr_tce_platform_iommu_attach_dev() is missing to call iommu_group_put() when the domain is already set. This refcount leak shows up with BUG_ON() during DLPAR remove operation as: KernelBug: Kernel bug in state 'None': kernel BUG at arch/powerpc/platforms/pseries/iommu.c:100! Oops: Exception in kernel mode, sig: 5 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=8192 NUMA pSeries <snip> Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_016) hv:phyp pSeries NIP: c0000000000ff4d4 LR: c0000000000ff4cc CTR: 0000000000000000 REGS: c0000013aed5f840 TRAP: 0700 Tainted: G I (6.8.0-rc3-autotest-g99bd3cb0d12e) MSR: 8000000000029033 <SF,EE,ME,IR,DR,RI,LE> CR: 44002402 XER: 20040000 CFAR: c000000000a0d170 IRQMASK: 0 ... NIP iommu_reconfig_notifier+0x94/0x200 LR iommu_reconfig_notifier+0x8c/0x200 Call Trace: iommu_reconfig_notifier+0x8c/0x200 (unreliable) notifier_call_chain+0xb8/0x19c blocking_notifier_call_chain+0x64/0x98 of_reconfig_notify+0x44/0xdc of_detach_node+0x78/0xb0 ofdt_write.part.0+0x86c/0xbb8 proc_reg_write+0xf4/0x150 vfs_write+0xf8/0x488 ksys_write+0x84/0x140 system_call_exception+0x138/0x330 system_call_vectored_common+0x15c/0x2ec The patch adds the missing iommu_group_put() call. Fixes: a8ca9fc9134c ("powerpc/iommu: Do not do platform domain attach atctions after probe") Reported-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Closes: https://lore.kernel.org/all/274e0d2b-b5cc-475e-94e6-8427e88e271d@linux.vnet.ibm.com/ Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Tested-by: Venkat Rao Bagalkote <venkat88@linux.vnet.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/170784021983.6249.10039296655906636112.stgit@linux.ibm.com
| * powerpc/pseries: fix accuracy of stolen timeShrikanth Hegde2024-02-141-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | powerVM hypervisor updates the VPA fields with stolen time data. It currently reports enqueue_dispatch_tb and ready_enqueue_tb for this purpose. In linux these two fields are used to report the stolen time. The VPA fields are updated at the TB frequency. On powerPC its mostly set at 512Mhz. Hence this needs a conversion to ns when reporting it back as rest of the kernel timings are in ns. This conversion is already handled in tb_to_ns function. So use that function to report accurate stolen time. Observed this issue and used an Capped Shared Processor LPAR(SPLPAR) to simplify the experiments. In all these cases, 100% VP Load is run using stress-ng workload. Values of stolen time is in percentages as reported by mpstat. With the patch values are close to expected. 6.8.rc1 +Patch 12EC/12VP 0.0 0.0 12EC/24VP 25.7 50.2 12EC/36VP 37.3 69.2 12EC/48VP 38.5 78.3 Fixes: 0e8a63132800 ("powerpc/pseries: Implement CONFIG_PARAVIRT_TIME_ACCOUNTING") Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Reviewed-by: Nicholas Piggin <npiggin@gmail.com> Reviewed-by: Srikar Dronamraju <srikar@linux.vnet.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240213052635.231597-1-sshegde@linux.ibm.com
| * powerpc/ftrace: Ignore ftrace locations in exit text sectionsNaveen N Rao2024-02-145-8/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Michael reported that we are seeing an ftrace bug on bootup when KASAN is enabled and we are using -fpatchable-function-entry: ftrace: allocating 47780 entries in 18 pages ftrace-powerpc: 0xc0000000020b3d5c: No module provided for non-kernel address ------------[ ftrace bug ]------------ ftrace faulted on modifying [<c0000000020b3d5c>] 0xc0000000020b3d5c Initializing ftrace call sites ftrace record flags: 0 (0) expected tramp: c00000000008cef4 ------------[ cut here ]------------ WARNING: CPU: 0 PID: 0 at kernel/trace/ftrace.c:2180 ftrace_bug+0x3c0/0x424 Modules linked in: CPU: 0 PID: 0 Comm: swapper Not tainted 6.5.0-rc3-00120-g0f71dcfb4aef #860 Hardware name: IBM pSeries (emulated by qemu) POWER9 (raw) 0x4e1202 0xf000005 of:SLOF,HEAD hv:linux,kvm pSeries NIP: c0000000003aa81c LR: c0000000003aa818 CTR: 0000000000000000 REGS: c0000000033cfab0 TRAP: 0700 Not tainted (6.5.0-rc3-00120-g0f71dcfb4aef) MSR: 8000000002021033 <SF,VEC,ME,IR,DR,RI,LE> CR: 28028240 XER: 00000000 CFAR: c0000000002781a8 IRQMASK: 3 ... NIP [c0000000003aa81c] ftrace_bug+0x3c0/0x424 LR [c0000000003aa818] ftrace_bug+0x3bc/0x424 Call Trace: ftrace_bug+0x3bc/0x424 (unreliable) ftrace_process_locs+0x5f4/0x8a0 ftrace_init+0xc0/0x1d0 start_kernel+0x1d8/0x484 With CONFIG_FTRACE_MCOUNT_USE_PATCHABLE_FUNCTION_ENTRY=y and CONFIG_KASAN=y, compiler emits nops in functions that it generates for registering and unregistering global variables (unlike with -pg and -mprofile-kernel where calls to _mcount() are not generated in those functions). Those functions then end up in INIT_TEXT and EXIT_TEXT respectively. We don't expect to see any profiled functions in EXIT_TEXT, so ftrace_init_nop() assumes that all addresses that aren't in the core kernel text belongs to a module. Since these functions do not match that criteria, we see the above bug. Address this by having ftrace ignore all locations in the text exit sections of vmlinux. Fixes: 0f71dcfb4aef ("powerpc/ftrace: Add support for -fpatchable-function-entry") Cc: stable@vger.kernel.org # v6.6+ Reported-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Naveen N Rao <naveen@kernel.org> Reviewed-by: Benjamin Gray <bgray@linux.ibm.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240213175410.1091313-1-naveen@kernel.org
| * powerpc/cputable: Add missing PPC_FEATURE_BOOKE on PPC64 Book-EDavid Engraf2024-02-141-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e320a76db4b0 ("powerpc/cputable: Split cpu_specs[] out of cputable.h") moved the cpu_specs to separate header files. Previously PPC_FEATURE_BOOKE was enabled by CONFIG_PPC_BOOK3E_64. The definition in cpu_specs_e500mc.h for PPC64 no longer enables PPC_FEATURE_BOOKE. This breaks user space reading the ELF hwcaps and expect PPC_FEATURE_BOOKE. Debugging an application with gdb is no longer working on e5500/e6500 because the 64-bit detection relies on PPC_FEATURE_BOOKE for Book-E. Fixes: e320a76db4b0 ("powerpc/cputable: Split cpu_specs[] out of cputable.h") Cc: stable@vger.kernel.org # v6.1+ Signed-off-by: David Engraf <david.engraf@sysgo.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240207092758.1058893-1-david.engraf@sysgo.com
| * powerpc/kasan: Limit KASAN thread size increase to 32KBMichael Ellerman2024-02-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KASAN is seen to increase stack usage, to the point that it was reported to lead to stack overflow on some 32-bit machines (see link). To avoid overflows the stack size was doubled for KASAN builds in commit 3e8635fb2e07 ("powerpc/kasan: Force thread size increase with KASAN"). However with a 32KB stack size to begin with, the doubling leads to a 64KB stack, which causes build errors: arch/powerpc/kernel/switch.S:249: Error: operand out of range (0x000000000000fe50 is not between 0xffffffffffff8000 and 0x0000000000007fff) Although the asm could be reworked, in practice a 32KB stack seems sufficient even for KASAN builds - the additional usage seems to be in the 2-3KB range for a 64-bit KASAN build. So only increase the stack for KASAN if the stack size is < 32KB. Fixes: 18f14afe2816 ("powerpc/64s: Increase default stack size to 32KB") Reported-by: Spoorthy <spoorthy@linux.ibm.com> Reported-by: Benjamin Gray <bgray@linux.ibm.com> Reviewed-by: Benjamin Gray <bgray@linux.ibm.com> Link: https://lore.kernel.org/linuxppc-dev/bug-207129-206035@https.bugzilla.kernel.org%2F/ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240212064244.3924505-1-mpe@ellerman.id.au
| * Revert "powerpc/pseries/iommu: Fix iommu initialisation during DLPAR add"Michael Ellerman2024-02-143-23/+5
| | | | | | | | | | | | | | | | | | | | This reverts commit ed8b94f6e0acd652ce69bd69d678a0c769172df8. Gaurav reported that there are still problems with the patch and it should be reverted pending a fuller fix. Link: https://lore.kernel.org/all/4f6fc1ac-7a76-4447-9d0e-f55c0be373f8@linux.ibm.com/ Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
| * powerpc: 85xx: mark local functions staticArnd Bergmann2024-02-067-10/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | These functions are either used in only one file and can just be made static or need an #include statement to avoid a warning: arch/powerpc/platforms/85xx/mpc8536_ds.c:30:13: error: no previous prototype for 'mpc8536_ds_pic_init' [-Werror=missing-prototypes] arch/powerpc/platforms/85xx/p1010rdb.c:27:13: error: no previous prototype for 'p1010_rdb_pic_init' [-Werror=missing-prototypes] arch/powerpc/platforms/85xx/p1022_ds.c:373:6: error: no previous prototype for 'p1022ds_set_pixel_clock' [-Werror=missing-prototypes] arch/powerpc/platforms/85xx/p1022_ds.c:422:1: error: no previous prototype for 'p1022ds_valid_monitor_port' [-Werror=missing-prototypes] arch/powerpc/platforms/85xx/p1022_ds.c:435:13: error: no previous prototype for 'p1022_ds_pic_init' [-Werror=missing-prototypes] arch/powerpc/platforms/85xx/p1022_rdk.c:43:6: error: no previous prototype for 'p1022rdk_set_pixel_clock' [-Werror=missing-prototypes] arch/powerpc/platforms/85xx/p1022_rdk.c:92:1: error: no previous prototype for 'p1022rdk_valid_monitor_port' [-Werror=missing-prototypes] arch/powerpc/platforms/85xx/p1022_rdk.c:99:13: error: no previous prototype for 'p1022_rdk_pic_init' [-Werror=missing-prototypes] arch/powerpc/platforms/85xx/socrates_fpga_pic.c:273:13: error: no previous prototype for 'socrates_fpga_pic_init' [-Werror=missing-prototypes] arch/powerpc/platforms/85xx/xes_mpc85xx.c:40:13: error: no previous prototype for 'xes_mpc85xx_pic_init' [-Werror=missing-prototypes] arch/powerpc/platforms/85xx/mvme2500.c:24:13: error: no previous prototype for 'mvme2500_pic_init' [-Werror=missing-prototypes] Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240123125148.2004648-2-arnd@kernel.org
| * powerpc: udbg_memcons: mark functions staticArnd Bergmann2024-02-061-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ppc64_book3e_allmodconfig has one more driver that triggeres a few missing-prototypes warnings: arch/powerpc/sysdev/udbg_memcons.c:44:6: error: no previous prototype for 'memcons_putc' [-Werror=missing-prototypes] arch/powerpc/sysdev/udbg_memcons.c:57:5: error: no previous prototype for 'memcons_getc_poll' [-Werror=missing-prototypes] arch/powerpc/sysdev/udbg_memcons.c:80:5: error: no previous prototype for 'memcons_getc' [-Werror=missing-prototypes] Mark all these function static as there are no other users. Signed-off-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240123125148.2004648-1-arnd@kernel.org
| * powerpc/kasan: Fix addr error caused by page alignmentJiangfeng Xiao2024-02-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In kasan_init_region, when k_start is not page aligned, at the begin of for loop, k_cur = k_start & PAGE_MASK is less than k_start, and then `va = block + k_cur - k_start` is less than block, the addr va is invalid, because the memory address space from va to block is not alloced by memblock_alloc, which will not be reserved by memblock_reserve later, it will be used by other places. As a result, memory overwriting occurs. for example: int __init __weak kasan_init_region(void *start, size_t size) { [...] /* if say block(dcd97000) k_start(feef7400) k_end(feeff3fe) */ block = memblock_alloc(k_end - k_start, PAGE_SIZE); [...] for (k_cur = k_start & PAGE_MASK; k_cur < k_end; k_cur += PAGE_SIZE) { /* at the begin of for loop * block(dcd97000) va(dcd96c00) k_cur(feef7000) k_start(feef7400) * va(dcd96c00) is less than block(dcd97000), va is invalid */ void *va = block + k_cur - k_start; [...] } [...] } Therefore, page alignment is performed on k_start before memblock_alloc() to ensure the validity of the VA address. Fixes: 663c0c9496a6 ("powerpc/kasan: Fix shadow area set up for modules.") Signed-off-by: Jiangfeng Xiao <xiaojiangfeng@huawei.com> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/1705974359-43790-1-git-send-email-xiaojiangfeng@huawei.com
| * powerpc/6xx: set High BAT Enable flag on G2_LE coresMatthias Schiffer2024-02-052-1/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MMU_FTR_USE_HIGH_BATS is set for G2_LE cores and derivatives like e300cX, but the high BATs need to be enabled in HID2 to work. Add register definitions and add the needed setup to __setup_cpu_603. This fixes boot on CPUs like the MPC5200B with STRICT_KERNEL_RWX enabled on systems where the flag has not been set by the bootloader already. Fixes: e4d6654ebe6e ("powerpc/mm/32s: rework mmu_mapin_ram()") Signed-off-by: Matthias Schiffer <matthias.schiffer@ew.tq-group.com> Reviewed-by: Christophe Leroy <christophe.leroy@csgroup.eu> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240124103838.43675-1-matthias.schiffer@ew.tq-group.com
| * powerpc/64: Set task pt_regs->link to the LR value on scv entryNaveen N Rao2024-02-051-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nysal reported that userspace backtraces are missing in offcputime bcc tool. As an example: $ sudo ./bcc/tools/offcputime.py -uU Tracing off-CPU time (us) of user threads by user stack... Hit Ctrl-C to end. ^C write - python (9107) 8 write - sudo (9105) 9 mmap - python (9107) 16 clock_nanosleep - multipathd (697) 3001604 The offcputime bcc tool attaches a bpf program to a kprobe on finish_task_switch(), which is usually hit on a syscall from userspace. With the switch to system call vectored, we started setting pt_regs->link to zero. This is because system call vectored behaves like a function call with LR pointing to the system call return address, and with no modification to SRR0/SRR1. The LR value does indicate our next instruction, so it is being saved as pt_regs->nip, and pt_regs->link is being set to zero. This is not a problem by itself, but BPF uses perf callchain infrastructure for capturing stack traces, and that stores LR as the second entry in the stack trace. perf has code to cope with the second entry being zero, and skips over it. However, generic userspace unwinders assume that a zero entry indicates end of the stack trace, resulting in a truncated userspace stack trace. Rather than fixing all userspace unwinders to ignore/skip past the second entry, store the real LR value in pt_regs->link so that there continues to be a valid, though duplicate entry in the stack trace. With this change: $ sudo ./bcc/tools/offcputime.py -uU Tracing off-CPU time (us) of user threads by user stack... Hit Ctrl-C to end. ^C write write [unknown] [unknown] [unknown] [unknown] [unknown] PyObject_VectorcallMethod [unknown] [unknown] PyObject_CallOneArg PyFile_WriteObject PyFile_WriteString [unknown] [unknown] PyObject_Vectorcall _PyEval_EvalFrameDefault PyEval_EvalCode [unknown] [unknown] [unknown] _PyRun_SimpleFileObject _PyRun_AnyFileObject Py_RunMain [unknown] Py_BytesMain [unknown] __libc_start_main - python (1293) 7 write write [unknown] sudo_ev_loop_v1 sudo_ev_dispatch_v1 [unknown] [unknown] [unknown] [unknown] __libc_start_main - sudo (1291) 7 syscall syscall bpf_open_perf_buffer_opts [unknown] [unknown] [unknown] [unknown] _PyObject_MakeTpCall PyObject_Vectorcall _PyEval_EvalFrameDefault PyEval_EvalCode [unknown] [unknown] [unknown] _PyRun_SimpleFileObject _PyRun_AnyFileObject Py_RunMain [unknown] Py_BytesMain [unknown] __libc_start_main - python (1293) 11 clock_nanosleep clock_nanosleep nanosleep sleep [unknown] [unknown] __clone - multipathd (698) 3001661 Fixes: 7fa95f9adaee ("powerpc/64s: system call support for scv/rfscv instructions") Cc: stable@vger.kernel.org Reported-by: "Nysal Jan K.A" <nysal@linux.ibm.com> Signed-off-by: Naveen N Rao <naveen@kernel.org> Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240202154316.395276-1-naveen@kernel.org
| * powerpc/pseries/iommu: Fix iommu initialisation during DLPAR addGaurav Batra2024-02-053-5/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a PCI device is dynamically added, the kernel oopses with a NULL pointer dereference: BUG: Kernel NULL pointer dereference on read at 0x00000030 Faulting instruction address: 0xc0000000006bbe5c Oops: Kernel access of bad area, sig: 11 [#1] LE PAGE_SIZE=64K MMU=Radix SMP NR_CPUS=2048 NUMA pSeries Modules linked in: rpadlpar_io rpaphp rpcsec_gss_krb5 auth_rpcgss nfsv4 dns_resolver nfs lockd grace fscache netfs xsk_diag bonding nft_compat nf_tables nfnetlink rfkill binfmt_misc dm_multipath rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_umad ib_iser libiscsi scsi_transport_iscsi ib_ipoib rdma_cm iw_cm ib_cm mlx5_ib ib_uverbs ib_core pseries_rng drm drm_panel_orientation_quirks xfs libcrc32c mlx5_core mlxfw sd_mod t10_pi sg tls ibmvscsi ibmveth scsi_transport_srp vmx_crypto pseries_wdt psample dm_mirror dm_region_hash dm_log dm_mod fuse CPU: 17 PID: 2685 Comm: drmgr Not tainted 6.7.0-203405+ #66 Hardware name: IBM,9080-HEX POWER10 (raw) 0x800200 0xf000006 of:IBM,FW1060.00 (NH1060_008) hv:phyp pSeries NIP: c0000000006bbe5c LR: c000000000a13e68 CTR: c0000000000579f8 REGS: c00000009924f240 TRAP: 0300 Not tainted (6.7.0-203405+) MSR: 8000000000009033 <SF,EE,ME,IR,DR,RI,LE> CR: 24002220 XER: 20040006 CFAR: c000000000a13e64 DAR: 0000000000000030 DSISR: 40000000 IRQMASK: 0 ... NIP sysfs_add_link_to_group+0x34/0x94 LR iommu_device_link+0x5c/0x118 Call Trace: iommu_init_device+0x26c/0x318 (unreliable) iommu_device_link+0x5c/0x118 iommu_init_device+0xa8/0x318 iommu_probe_device+0xc0/0x134 iommu_bus_notifier+0x44/0x104 notifier_call_chain+0xb8/0x19c blocking_notifier_call_chain+0x64/0x98 bus_notify+0x50/0x7c device_add+0x640/0x918 pci_device_add+0x23c/0x298 of_create_pci_dev+0x400/0x884 of_scan_pci_dev+0x124/0x1b0 __of_scan_bus+0x78/0x18c pcibios_scan_phb+0x2a4/0x3b0 init_phb_dynamic+0xb8/0x110 dlpar_add_slot+0x170/0x3b8 [rpadlpar_io] add_slot_store.part.0+0xb4/0x130 [rpadlpar_io] kobj_attr_store+0x2c/0x48 sysfs_kf_write+0x64/0x78 kernfs_fop_write_iter+0x1b0/0x290 vfs_write+0x350/0x4a0 ksys_write+0x84/0x140 system_call_exception+0x124/0x330 system_call_vectored_common+0x15c/0x2ec Commit a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains") broke DLPAR add of PCI devices. The above added iommu_device structure to pci_controller. During system boot, PCI devices are discovered and this newly added iommu_device structure is initialized by a call to iommu_device_register(). During DLPAR add of a PCI device, a new pci_controller structure is allocated but there are no calls made to iommu_device_register() interface. Fix is to register the iommu device during DLPAR add as well. Fixes: a940904443e4 ("powerpc/iommu: Add iommu_ops to report capabilities and allow blocking domains") Signed-off-by: Gaurav Batra <gbatra@linux.ibm.com> [mpe: Trim oops and tweak some change log wording] Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240122222407.39603-1-gbatra@linux.ibm.com
| * powerpc/pseries/papr-sysparm: use u8 arrays for payloadsNathan Lynch2024-02-052-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some PAPR system parameter values are formatted by firmware as nul-terminated strings (e.g. LPAR name, shared processor attributes). But the values returned for other parameters, such as processor module info and TLB block invalidate characteristics, are binary data with parameter-specific layouts. So char[] isn't the appropriate type for the general case. Use u8/__u8. Signed-off-by: Nathan Lynch <nathanl@linux.ibm.com> Fixes: 905b9e48786e ("powerpc/pseries/papr-sysparm: Expose character device to user space") Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Link: https://msgid.link/20240202-papr-sysparm-ioblock-data-use-u8-v1-1-f5c6c89f65ec@linux.ibm.com
* | work around gcc bugs with 'asm goto' with outputsLinus Torvalds2024-02-103-9/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We've had issues with gcc and 'asm goto' before, and we created a 'asm_volatile_goto()' macro for that in the past: see commits 3f0116c3238a ("compiler/gcc4: Add quirk for 'asm goto' miscompilation bug") and a9f180345f53 ("compiler/gcc4: Make quirk for asm_volatile_goto() unconditional"). Then, much later, we ended up removing the workaround in commit 43c249ea0b1e ("compiler-gcc.h: remove ancient workaround for gcc PR 58670") because we no longer supported building the kernel with the affected gcc versions, but we left the macro uses around. Now, Sean Christopherson reports a new version of a very similar problem, which is fixed by re-applying that ancient workaround. But the problem in question is limited to only the 'asm goto with outputs' cases, so instead of re-introducing the old workaround as-is, let's rename and limit the workaround to just that much less common case. It looks like there are at least two separate issues that all hit in this area: (a) some versions of gcc don't mark the asm goto as 'volatile' when it has outputs: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=98619 https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110420 which is easy to work around by just adding the 'volatile' by hand. (b) Internal compiler errors: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=110422 which are worked around by adding the extra empty 'asm' as a barrier, as in the original workaround. but the problem Sean sees may be a third thing since it involves bad code generation (not an ICE) even with the manually added 'volatile'. but the same old workaround works for this case, even if this feels a bit like voodoo programming and may only be hiding the issue. Reported-and-tested-by: Sean Christopherson <seanjc@google.com> Link: https://lore.kernel.org/all/20240208220604.140859-1-seanjc@google.com/ Cc: Nick Desaulniers <ndesaulniers@google.com> Cc: Uros Bizjak <ubizjak@gmail.com> Cc: Jakub Jelinek <jakub@redhat.com> Cc: Andrew Pinski <quic_apinski@quicinc.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | powerpc: iommu: Bring back table group release_ownership() callShivaprasad G Bhat2024-02-021-9/+28
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | The commit 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove set_platform_dma_ops") refactored the code removing the set_platform_dma_ops(). It missed out the table group release_ownership() call which would have got called otherwise during the guest shutdown via vfio_group_detach_container(). On PPC64, this particular call actually sets up the 32-bit TCE table, and enables the 64-bit DMA bypass etc. Now after guest shutdown, the subsequent host driver (e.g megaraid-sas) probe post unbind from vfio-pci fails like, megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0x7fffffffffffffff, table unavailable megaraid_sas 0031:01:00.0: Warning: IOMMU dma not supported: mask 0xffffffff, table unavailable megaraid_sas 0031:01:00.0: Failed to set DMA mask megaraid_sas 0031:01:00.0: Failed from megasas_init_fw 6539 The patch brings back the call to table_group release_ownership() call when switching back to PLATFORM domain from BLOCKED, while also separates the domain_ops for both. Fixes: 2ad56efa80db ("powerpc/iommu: Setup a default domain and remove set_platform_dma_ops") Signed-off-by: Shivaprasad G Bhat <sbhat@linux.ibm.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/170628173462.3742.18330000394415935845.stgit@ltcd48-lp2.aus.stglab.ibm.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
* Merge tag 'powerpc-6.8-2' of ↵Linus Torvalds2024-01-211-0/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux Pull powerpc fixes from Aneesh Kumar: - Increase default stack size to 32KB for Book3S Thanks to Michael Ellerman. * tag 'powerpc-6.8-2' of git://git.kernel.org/pub/scm/linux/kernel/git/powerpc/linux: powerpc/64s: Increase default stack size to 32KB
| * powerpc/64s: Increase default stack size to 32KBMichael Ellerman2024-01-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are reports of kernels crashing due to stack overflow while running OpenShift (Kubernetes). The primary contributor to the stack usage seems to be openvswitch, which is used by OVN-Kubernetes (based on OVN (Open Virtual Network)), but NFS also contributes in some stack traces. There may be some opportunities to reduce stack usage in the openvswitch code, but doing so potentially require tradeoffs vs performance, and also requires testing across architectures. Looking at stack usage across the kernel (using -fstack-usage), shows that ppc64le stack frames are on average 50-100% larger than the equivalent function built for x86-64. Which is not surprising given the minimum stack frame size is 32 bytes on ppc64le vs 16 bytes on x86-64. So increase the default stack size to 32KB for the modern 64-bit Book3S platforms, ie. pseries (virtualised) and powernv (bare metal). That leaves the older systems like G5s, and the AmigaOne (pasemi) with a 16KB stack which should be sufficient on those machines. Signed-off-by: Michael Ellerman <mpe@ellerman.id.au> Signed-off-by: Aneesh Kumar K.V (IBM) <aneesh.kumar@kernel.org> Link: https://msgid.link/20231215124449.317597-1-mpe@ellerman.id.au
* | Merge tag 'tty-6.8-rc1' of ↵Linus Torvalds2024-01-185-21/+27
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty / serial updates from Greg KH: "Here is the big set of tty and serial driver changes for 6.8-rc1. As usual, Jiri has a bunch of refactoring and cleanups for the tty core and drivers in here, along with the usual set of rs485 updates (someday this might work properly...) Along with those, in here are changes for: - sc16is7xx serial driver updates - platform driver removal api updates - amba-pl011 driver updates - tty driver binding updates - other small tty/serial driver updates and changes All of these have been in linux-next for a while with no reported issues" * tag 'tty-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: (197 commits) serial: sc16is7xx: refactor EFR lock serial: sc16is7xx: reorder code to remove prototype declarations serial: sc16is7xx: refactor FIFO access functions to increase commonality serial: sc16is7xx: drop unneeded MODULE_ALIAS serial: sc16is7xx: replace hardcoded divisor value with BIT() macro serial: sc16is7xx: add explicit return for some switch default cases serial: sc16is7xx: add macro for max number of UART ports serial: sc16is7xx: add driver name to struct uart_driver serial: sc16is7xx: use i2c_get_match_data() serial: sc16is7xx: use spi_get_device_match_data() serial: sc16is7xx: use DECLARE_BITMAP for sc16is7xx_lines bitfield serial: sc16is7xx: improve do/while loop in sc16is7xx_irq() serial: sc16is7xx: remove obsolete loop in sc16is7xx_port_irq() serial: sc16is7xx: set safe default SPI clock frequency serial: sc16is7xx: add check for unsupported SPI modes during probe serial: sc16is7xx: fix invalid sc16is7xx_lines bitfield in case of probe error serial: 8250_exar: Set missing rs485_supported flag serial: omap: do not override settings for RS485 support serial: core, imx: do not set RS485 enabled if it is not supported serial: core: make sure RS485 cannot be enabled when it is not supported ...
| * \ Merge 6.7-rc5 into tty-nextGreg Kroah-Hartman2023-12-114-5/+20
| |\ \ | | | | | | | | | | | | | | | | | | | | We need the serial fixes in here as well to build off of. Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | tty: hvc: convert to u8 and size_tJiri Slaby (SUSE)2023-12-085-21/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Switch character types to u8 and sizes to size_t. To conform to characters/sizes in the rest of the tty layer. Signed-off-by: "Jiri Slaby (SUSE)" <jirislaby@kernel.org> Cc: Michael Ellerman <mpe@ellerman.id.au> Cc: Nicholas Piggin <npiggin@gmail.com> Cc: Christophe Leroy <christophe.leroy@csgroup.eu> Cc: Amit Shah <amit@kernel.org> Cc: Arnd Bergmann <arnd@arndb.de> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: linuxppc-dev@lists.ozlabs.org Cc: virtualization@lists.linux.dev Cc: linux-riscv@lists.infradead.org Link: https://lore.kernel.org/r/20231206073712.17776-13-jirislaby@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | Merge tag 'pci-v6.8-changes' of ↵Linus Torvalds2024-01-181-2/+2
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci updates from Bjorn Helgaas: "Enumeration: - Reserve ECAM so we don't assign it to PCI BARs; this works around bugs where BIOS included ECAM in a PNP0A03 host bridge window, didn't reserve it via a PNP0C02 motherboard device, and didn't allocate space for SR-IOV VF BARs (Bjorn Helgaas) - Add MMCONFIG/ECAM debug logging (Bjorn Helgaas) - Rename 'MMCONFIG' to 'ECAM' to match spec usage (Bjorn Helgaas) - Log device type (Root Port, Switch Port, etc) during enumeration (Bjorn Helgaas) - Log bridges before downstream devices so the dmesg order is more logical (Bjorn Helgaas) - Log resource names (BAR 0, VF BAR 0, bridge window, etc) consistently instead of a mix of names and "reg 0x10" (Puranjay Mohan, Bjorn Helgaas) - Fix 64GT/s effective data rate calculation to use 1b/1b encoding rather than the 8b/10b or 128b/130b used by lower rates (Ilpo Järvinen) - Use PCI_HEADER_TYPE_* instead of literals in x86, powerpc, SCSI lpfc (Ilpo Järvinen) - Clean up open-coded PCIBIOS return code mangling (Ilpo Järvinen) Resource management: - Restructure pci_dev_for_each_resource() to avoid computing the address of an out-of-bounds array element (the bounds check was performed later so the element was never actually *read*, but it's nicer to avoid even computing an out-of-bounds address) (Andy Shevchenko) Driver binding: - Convert pci-host-common.c platform .remove() callback to .remove_new() returning 'void' since it's not useful to return error codes here (Uwe Kleine-König) - Convert exynos, keystone, kirin from .remove() to .remove_new(), which returns void instead of int (Uwe Kleine-König) - Drop unused struct pci_driver.node member (Mathias Krause) Virtualization: - Add ACS quirk for more Zhaoxin Root Ports (LeoLiuoc) Error handling: - Log AER errors as "Correctable" (not "Corrected") or "Uncorrectable" to match spec terminology (Bjorn Helgaas) - Decode Requester ID when no error info found instead of printing the raw hex value (Bjorn Helgaas) Endpoint framework: - Use a unique test pattern for each BAR in the pci_endpoint_test to make it easier to debug address translation issues (Niklas Cassel) Broadcom STB PCIe controller driver: - Add DT property "brcm,clkreq-mode" and driver support for different CLKREQ# modes to make ASPM L1.x states possible (Jim Quinlan) Freescale Layerscape PCIe controller driver: - Add suspend/resume support for Layerscape LS1043a and LS1021a, including software-managed PME_Turn_Off and transitions between L0, L2/L3_Ready Link states (Frank Li) MediaTek PCIe controller driver: - Clear MSI interrupt status before handler to avoid missing MSIs that occur after the handler (qizhong cheng) MediaTek PCIe Gen3 controller driver: - Update mediatek-gen3 translation window setup to handle MMIO space that is not a power of two in size (Jianjun Wang) Qualcomm PCIe controller driver: - Increase qcom iommu-map maxItems to accommodate SDX55 (five entries) and SDM845 (sixteen entries) (Krzysztof Kozlowski) - Describe qcom,pcie-sc8180x clocks and resets accurately (Krzysztof Kozlowski) - Describe qcom,pcie-sm8150 clocks and resets accurately (Krzysztof Kozlowski) - Correct the qcom "reset-name" property, previously incorrectly called "reset-names" (Krzysztof Kozlowski) - Document qcom,pcie-sm8650, based on qcom,pcie-sm8550 (Neil Armstrong) Renesas R-Car PCIe controller driver: - Replace of_device.h with explicit of.h include to untangle header usage (Rob Herring) - Add DT and driver support for optional miniPCIe 1.5v and 3.3v regulators on KingFisher (Wolfram Sang) SiFive FU740 PCIe controller driver: - Convert fu740 CONFIG_PCIE_FU740 dependency from SOC_SIFIVE to ARCH_SIFIVE (Conor Dooley) Synopsys DesignWare PCIe controller driver: - Align iATU mapping for endpoint MSI-X (Niklas Cassel) - Drop "host_" prefix from struct dw_pcie_host_ops members (Yoshihiro Shimoda) - Drop "ep_" prefix from struct dw_pcie_ep_ops members (Yoshihiro Shimoda) - Rename struct dw_pcie_ep_ops.func_conf_select() to .get_dbi_offset() to be more descriptive (Yoshihiro Shimoda) - Add Endpoint DBI accessors to encapsulate offset lookups (Yoshihiro Shimoda) TI J721E PCIe driver: - Add j721e DT and driver support for 'num-lanes' for devices that support x1, x2, or x4 Links (Matt Ranostay) - Add j721e DT compatible strings and driver support for j784s4 (Matt Ranostay) - Make TI J721E Kconfig depend on ARCH_K3 since the hardware is specific to those TI SoC parts (Peter Robinson) TI Keystone PCIe controller driver: - Hold power management references to all PHYs while enabling them to avoid a race when one provides clocks to others (Siddharth Vadapalli) Xilinx XDMA PCIe controller driver: - Remove redundant dev_err(), since platform_get_irq() and platform_get_irq_byname() already log errors (Yang Li) - Fix uninitialized symbols in xilinx_pl_dma_pcie_setup_irq() (Krzysztof Wilczyński) - Fix xilinx_pl_dma_pcie_init_irq_domain() error return when irq_domain_add_linear() fails (Harshit Mogalapalli) MicroSemi Switchtec management driver: - Do dma_mrpc cleanup during switchtec_pci_remove() to match its devm ioremapping in switchtec_pci_probe(). Previously the cleanup was done in stdev_release(), which used stale pointers if stdev->cdev happened to be open when the PCI device was removed (Daniel Stodden) Miscellaneous: - Convert interrupt terminology from "legacy" to "INTx" to be more specific and match spec terminology (Damien Le Moal) - In dw-xdata-pcie, pci_endpoint_test, and vmd, replace usage of deprecated ida_simple_*() API with ida_alloc() and ida_free() (Christophe JAILLET)" * tag 'pci-v6.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (97 commits) PCI: Fix kernel-doc issues PCI: brcmstb: Configure HW CLKREQ# mode appropriate for downstream device dt-bindings: PCI: brcmstb: Add property "brcm,clkreq-mode" PCI: mediatek-gen3: Fix translation window size calculation PCI: mediatek: Clear interrupt status before dispatching handler PCI: keystone: Fix race condition when initializing PHYs PCI: xilinx-xdma: Fix error code in xilinx_pl_dma_pcie_init_irq_domain() PCI: xilinx-xdma: Fix uninitialized symbols in xilinx_pl_dma_pcie_setup_irq() PCI: rcar-gen4: Fix -Wvoid-pointer-to-enum-cast error PCI: iproc: Fix -Wvoid-pointer-to-enum-cast warning PCI: dwc: Add dw_pcie_ep_{read,write}_dbi[2] helpers PCI: dwc: Rename .func_conf_select to .get_dbi_offset in struct dw_pcie_ep_ops PCI: dwc: Rename .ep_init to .init in struct dw_pcie_ep_ops PCI: dwc: Drop host prefix from struct dw_pcie_host_ops members misc: pci_endpoint_test: Use a unique test pattern for each BAR PCI: j721e: Make TI J721E depend on ARCH_K3 PCI: j721e: Add TI J784S4 PCIe configuration PCI/AER: Use explicit register sizes for struct members PCI/AER: Decode Requester ID when no error info found PCI/AER: Use 'Correctable' and 'Uncorrectable' spec terms for errors ...
| * | | | powerpc/fsl-pci: Use PCI_HEADER_TYPE_MASK instead of literalIlpo Järvinen2023-12-011-2/+2
| |/ / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace 0x7f literals with PCI_HEADER_TYPE_MASK. Link: https://lore.kernel.org/r/20231124090919.23687-2-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
* | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2024-01-174-20/+8
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
| * \ \ \ Merge tag 'kvm-x86-generic-6.8' of https://github.com/kvm-x86/linux into HEADPaolo Bonzini2024-01-081-1/+0
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Common KVM changes for 6.8: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures.
| | * | | | KVM: move KVM_CAP_DEVICE_CTRL to the generic checkWei Wang2023-11-301-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KVM_CAP_DEVICE_CTRL allows userspace to check if the kvm_device framework (e.g. KVM_CREATE_DEVICE) is supported by KVM. Move KVM_CAP_DEVICE_CTRL to the generic check for the two reasons: 1) it already supports arch agnostic usages (i.e. KVM_DEV_TYPE_VFIO). For example, userspace VFIO implementation may needs to create KVM_DEV_TYPE_VFIO on x86, riscv, or arm etc. It is simpler to have it checked at the generic code than at each arch's code. 2) KVM_CREATE_DEVICE has been added to the generic code. Link: https://lore.kernel.org/all/20221215115207.14784-1-wei.w.wang@intel.com Signed-off-by: Wei Wang <wei.w.wang@intel.com> Reviewed-by: Sean Christopherson <seanjc@google.com> Acked-by: Anup Patel <anup@brainfault.org> (riscv) Reviewed-by: Oliver Upton <oliver.upton@linux.dev> Acked-by: Michael Ellerman <mpe@ellerman.id.au> (powerpc) Link: https://lore.kernel.org/r/20230315101606.10636-1-wei.w.wang@intel.com Signed-off-by: Sean Christopherson <seanjc@google.com>
| * | | | | KVM: introduce CONFIG_KVM_COMMONPaolo Bonzini2024-01-081-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | CONFIG_HAVE_KVM is currently used by some architectures to either enabled the KVM config proper, or to enable host-side code that is not part of the KVM module. However, CONFIG_KVM's "select" statement in virt/kvm/Kconfig corresponds to a third meaning, namely to enable common Kconfigs required by all architectures that support KVM. These three meanings can be replaced respectively by an architecture-specific Kconfig, by IS_ENABLED(CONFIG_KVM), or by a new Kconfig symbol that is in turn selected by the architecture-specific "config KVM". Start by introducing such a new Kconfig symbol, CONFIG_KVM_COMMON. Unlike CONFIG_HAVE_KVM, it is selected by CONFIG_KVM, not by architecture code, and it brings in all dependencies of common KVM code. In particular, INTERVAL_TREE was missing in loongarch and riscv, so that is another thing that is fixed. Fixes: 8132d887a702 ("KVM: remove CONFIG_HAVE_KVM_EVENTFD", 2023-12-08) Reported-by: Randy Dunlap <rdunlap@infradead.org> Closes: https://lore.kernel.org/all/44907c6b-c5bd-4e4a-a921-e4d3825539d8@infradead.org/ Reviewed-by: Andrew Jones <ajones@ventanamicro.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | Merge tag 'loongarch-kvm-6.8' of ↵Paolo Bonzini2024-01-027-13/+66
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/chenhuacai/linux-loongson into HEAD LoongArch KVM changes for v6.8 1. Optimization for memslot hugepage checking. 2. Cleanup and fix some HW/SW timer issues. 3. Add LSX/LASX (128bit/256bit SIMD) support.
| * | | | | | KVM: remove CONFIG_HAVE_KVM_IRQFDPaolo Bonzini2023-12-082-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All platforms with a kernel irqchip have support for irqfd. Unify the two configuration items so that userspace can expect to use irqfd to inject interrupts into the irqchip. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | | KVM: remove CONFIG_HAVE_KVM_EVENTFDPaolo Bonzini2023-12-081-1/+0
| | |/ / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | virt/kvm/eventfd.c is compiled unconditionally, meaning that the ioeventfds member of struct kvm is accessed unconditionally. CONFIG_HAVE_KVM_EVENTFD therefore must be defined for KVM common code to compile successfully, remove it. Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| * | | | | Merge branch 'kvm-guestmemfd' into HEADPaolo Bonzini2023-11-144-13/+6
| |\ \ \ \ \ | | |_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce several new KVM uAPIs to ultimately create a guest-first memory subsystem within KVM, a.k.a. guest_memfd. Guest-first memory allows KVM to provide features, enhancements, and optimizations that are kludgly or outright impossible to implement in a generic memory subsystem. The core KVM ioctl() for guest_memfd is KVM_CREATE_GUEST_MEMFD, which similar to the generic memfd_create(), creates an anonymous file and returns a file descriptor that refers to it. Again like "regular" memfd files, guest_memfd files live in RAM, have volatile storage, and are automatically released when the last reference is dropped. The key differences between memfd files (and every other memory subystem) is that guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to convert a guest memory area between the shared and guest-private states. A second KVM ioctl(), KVM_SET_MEMORY_ATTRIBUTES, allows userspace to specify attributes for a given page of guest memory. In the long term, it will likely be extended to allow userspace to specify per-gfn RWX protections, including allowing memory to be writable in the guest without it also being writable in host userspace. The immediate and driving use case for guest_memfd are Confidential (CoCo) VMs, specifically AMD's SEV-SNP, Intel's TDX, and KVM's own pKVM. For such use cases, being able to map memory into KVM guests without requiring said memory to be mapped into the host is a hard requirement. While SEV+ and TDX prevent untrusted software from reading guest private data by encrypting guest memory, pKVM provides confidentiality and integrity *without* relying on memory encryption. In addition, with SEV-SNP and especially TDX, accessing guest private memory can be fatal to the host, i.e. KVM must be prevent host userspace from accessing guest memory irrespective of hardware behavior. Long term, guest_memfd may be useful for use cases beyond CoCo VMs, for example hardening userspace against unintentional accesses to guest memory. As mentioned earlier, KVM's ABI uses userspace VMA protections to define the allow guest protection (with an exception granted to mapping guest memory executable), and similarly KVM currently requires the guest mapping size to be a strict subset of the host userspace mapping size. Decoupling the mappings sizes would allow userspace to precisely map only what is needed and with the required permissions, without impacting guest performance. A guest-first memory subsystem also provides clearer line of sight to things like a dedicated memory pool (for slice-of-hardware VMs) and elimination of "struct page" (for offload setups where userspace _never_ needs to DMA from or into guest memory). guest_memfd is the result of 3+ years of development and exploration; taking on memory management responsibilities in KVM was not the first, second, or even third choice for supporting CoCo VMs. But after many failed attempts to avoid KVM-specific backing memory, and looking at where things ended up, it is quite clear that of all approaches tried, guest_memfd is the simplest, most robust, and most extensible, and the right thing to do for KVM and the kernel at-large. The "development cycle" for this version is going to be very short; ideally, next week I will merge it as is in kvm/next, taking this through the KVM tree for 6.8 immediately after the end of the merge window. The series is still based on 6.6 (plus KVM changes for 6.7) so it will require a small fixup for changes to get_file_rcu() introduced in 6.7 by commit 0ede61d8589c ("file: convert to SLAB_TYPESAFE_BY_RCU"). The fixup will be done as part of the merge commit, and most of the text above will become the commit message for the merge. Pending post-merge work includes: - hugepage support - looking into using the restrictedmem framework for guest memory - introducing a testing mechanism to poison memory, possibly using the same memory attributes introduced here - SNP and TDX support There are two non-KVM patches buried in the middle of this series: fs: Rename anon_inode_getfile_secure() and anon_inode_getfd_secure() mm: Add AS_UNMOVABLE to mark mapping as completely unmovable The first is small and mostly suggested-by Christian Brauner; the second a bit less so but it was written by an mm person (Vlastimil Babka).
| | * | | | KVM: Allow arch code to track number of memslot address spaces per VMSean Christopherson2023-11-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Let x86 track the number of address spaces on a per-VM basis so that KVM can disallow SMM memslots for confidential VMs. Confidentials VMs are fundamentally incompatible with emulating SMM, which as the name suggests requires being able to read and write guest memory and register state. Disallowing SMM will simplify support for guest private memory, as KVM will not need to worry about tracking memory attributes for multiple address spaces (SMM is the only "non-default" address space across all architectures). Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-23-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| | * | | | KVM: Convert KVM_ARCH_WANT_MMU_NOTIFIER to CONFIG_KVM_GENERIC_MMU_NOTIFIERSean Christopherson2023-11-133-9/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig and select it where appropriate to effectively maintain existing behavior. Using a proper Kconfig will simplify building more functionality on top of KVM's mmu_notifier infrastructure. Add a forward declaration of kvm_gfn_range to kvm_types.h so that including arch/powerpc/include/asm/kvm_ppc.h's with CONFIG_KVM=n doesn't generate warnings due to kvm_gfn_range being undeclared. PPC defines hooks for PR vs. HV without guarding them via #ifdeffery, e.g. bool (*unmap_gfn_range)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*test_age_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); bool (*set_spte_gfn)(struct kvm *kvm, struct kvm_gfn_range *range); Alternatively, PPC could forward declare kvm_gfn_range, but there's no good reason not to define it in common KVM. Acked-by: Anup Patel <anup@brainfault.org> Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Tested-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-8-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| | * | | | KVM: PPC: Return '1' unconditionally for KVM_CAP_SYNC_MMUSean Christopherson2023-11-131-4/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Advertise that KVM's MMU is synchronized with the primary MMU for all flavors of PPC KVM support, i.e. advertise that the MMU is synchronized when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y but the VM is not using hypervisor mode (a.k.a. PR VMs). PR VMs, via kvm_unmap_gfn_range_pr(), do the right thing for mmu_notifier invalidation events, and more tellingly, KVM returns '1' for KVM_CAP_SYNC_MMU when CONFIG_KVM_BOOK3S_HV_POSSIBLE=n and CONFIG_KVM_BOOK3S_PR_POSSIBLE=y, i.e. KVM already advertises a synchronized MMU for PR VMs, just not when CONFIG_KVM_BOOK3S_HV_POSSIBLE=y. Suggested-by: Paolo Bonzini <pbonzini@redhat.com> Signed-off-by: Sean Christopherson <seanjc@google.com> Message-Id: <20231027182217.3615211-7-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>
| | * | | | KVM: PPC: Drop dead code related to KVM_ARCH_WANT_MMU_NOTIFIERSean Christopherson2023-11-131-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Assert that both KVM_ARCH_WANT_MMU_NOTIFIER and CONFIG_MMU_NOTIFIER are defined when KVM is enabled, and return '1' unconditionally for the CONFIG_KVM_BOOK3S_HV_POSSIBLE=n path. All flavors of PPC support for KVM select MMU_NOTIFIER, and KVM_ARCH_WANT_MMU_NOTIFIER is unconditionally defined by arch/powerpc/include/asm/kvm_host.h. Effectively dropping use of KVM_ARCH_WANT_MMU_NOTIFIER will simplify a future cleanup to turn KVM_ARCH_WANT_MMU_NOTIFIER into a Kconfig, i.e. will allow combining all of the #if defined(CONFIG_MMU_NOTIFIER) && defined(KVM_ARCH_WANT_MMU_NOTIFIER) checks into a single #ifdef CONFIG_KVM_GENERIC_MMU_NOTIFIER without having to worry about PPC's "bare" usage of KVM_ARCH_WANT_MMU_NOTIFIER. Signed-off-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Paolo Bonzini <pbonzini@redhat.com> Reviewed-by: Fuad Tabba <tabba@google.com> Message-Id: <20231027182217.3615211-6-seanjc@google.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>