summaryrefslogtreecommitdiffstats
path: root/arch/x86 (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge tag 'rtc-6.8' of ↵Linus Torvalds2024-01-192-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux Pull RTC updates from Alexandre Belloni: "There are three new drivers this cycle. Also the cmos driver is getting fixes for longstanding wakeup issues on AMD. New drivers: - Analog Devices MAX31335 - Nuvoton ma35d1 - Texas Instrument TPS6594 PMIC RTC Drivers: - cmos: use ACPI alarm instead of HPET on recent AMD platforms - nuvoton: add NCT3015Y-R and NCT3018Y-R support - rv8803: proper suspend/resume and wakeup-source support" * tag 'rtc-6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (26 commits) rtc: nuvoton: Compatible with NCT3015Y-R and NCT3018Y-R rtc: da9063: Use dev_err_probe() rtc: da9063: Use device_get_match_data() rtc: da9063: Make IRQ as optional rtc: max31335: Fix comparison in max31335_volatile_reg() rtc: max31335: use regmap_update_bits_check rtc: max31335: remove unecessary locking rtc: max31335: add driver support dt-bindings: rtc: max31335: add max31335 bindings rtc: rv8803: add wakeup-source support rtc: ac100: remove misuses of kernel-doc rtc: class: Remove usage of the deprecated ida_simple_xx() API rtc: MAINTAINERS: drop Alessandro Zummo rtc: ma35d1: remove hardcoded UIE support dt-bindings: rtc: qcom-pm8xxx: fix inconsistent example rtc: rv8803: Add power management support rtc: ds3232: avoid unused-const-variable warning rtc: lpc24xx: add missing dependency rtc: tps6594: Add driver for TPS6594 RTC rtc: Add driver for Nuvoton ma35d1 rtc controller ...
| * rtc: Extend timeout for waiting for UIP to clear to 1sMario Limonciello2023-12-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Specs don't say anything about UIP being cleared within 10ms. They only say that UIP won't occur for another 244uS. If a long NMI occurs while UIP is still updating it might not be possible to get valid data in 10ms. This has been observed in the wild that around s2idle some calls can take up to 480ms before UIP is clear. Adjust callers from outside an interrupt context to wait for up to a 1s instead of 10ms. Cc: <stable@vger.kernel.org> # 6.1.y Fixes: ec5895c0f2d8 ("rtc: mc146818-lib: extract mc146818_avoid_UIP") Reported-by: Carsten Hatger <xmb8dsv4@gmail.com> Closes: https://bugzilla.kernel.org/show_bug.cgi?id=217626 Tested-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Reviewed-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Acked-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Link: https://lore.kernel.org/r/20231128053653.101798-5-mario.limonciello@amd.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
| * rtc: Add support for configuring the UIP timeout for RTC readsMario Limonciello2023-12-172-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The UIP timeout is hardcoded to 10ms for all RTC reads, but in some contexts this might not be enough time. Add a timeout parameter to mc146818_get_time() and mc146818_get_time_callback(). If UIP timeout is configured by caller to be >=100 ms and a call takes this long, log a warning. Make all callers use 10ms to ensure no functional changes. Cc: <stable@vger.kernel.org> # 6.1.y Fixes: ec5895c0f2d8 ("rtc: mc146818-lib: extract mc146818_avoid_UIP") Signed-off-by: Mario Limonciello <mario.limonciello@amd.com> Tested-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Reviewed-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Acked-by: Mateusz Jończyk <mat.jonczyk@o2.pl> Link: https://lore.kernel.org/r/20231128053653.101798-4-mario.limonciello@amd.com Signed-off-by: Alexandre Belloni <alexandre.belloni@bootlin.com>
* | Merge tag 'iommu-updates-v6.8' of ↵Linus Torvalds2024-01-192-2/+3
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu Pull iommu updates from Joerg Roedel: "Core changes: - Fix race conditions in device probe path - Retire IOMMU bus_ops - Support for passing custom allocators to page table drivers - Clean up Kconfig around IOMMU_SVA - Support for sharing SVA domains with all devices bound to a mm - Firmware data parsing cleanup - Tracing improvements for iommu-dma code - Some smaller fixes and cleanups ARM-SMMU drivers: - Device-tree binding updates: - Add additional compatible strings for Qualcomm SoCs - Document Adreno clocks for Qualcomm's SM8350 SoC - SMMUv2: - Implement support for the ->domain_alloc_paging() callback - Ensure Secure context is restored following suspend of Qualcomm SMMU implementation - SMMUv3: - Disable stalling mode for the "quiet" context descriptor - Minor refactoring and driver cleanups Intel VT-d driver: - Cleanup and refactoring AMD IOMMU driver: - Improve IO TLB invalidation logic - Small cleanups and improvements Rockchip IOMMU driver: - DT binding update to add Rockchip RK3588 Apple DART driver: - Apple M1 USB4/Thunderbolt DART support - Cleanups Virtio IOMMU driver: - Add support for iotlb_sync_map - Enable deferred IO TLB flushes" * tag 'iommu-updates-v6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/joro/iommu: (66 commits) iommu: Don't reserve 0-length IOVA region iommu/vt-d: Move inline helpers to header files iommu/vt-d: Remove unused vcmd interfaces iommu/vt-d: Remove unused parameter of intel_pasid_setup_pass_through() iommu/vt-d: Refactor device_to_iommu() to retrieve iommu directly iommu/sva: Fix memory leak in iommu_sva_bind_device() dt-bindings: iommu: rockchip: Add Rockchip RK3588 iommu/dma: Trace bounce buffer usage when mapping buffers iommu/arm-smmu: Convert to domain_alloc_paging() iommu/arm-smmu: Pass arm_smmu_domain to internal functions iommu/arm-smmu: Implement IOMMU_DOMAIN_BLOCKED iommu/arm-smmu: Convert to a global static identity domain iommu/arm-smmu: Reorganize arm_smmu_domain_add_master() iommu/arm-smmu-v3: Remove ARM_SMMU_DOMAIN_NESTED iommu/arm-smmu-v3: Master cannot be NULL in arm_smmu_write_strtab_ent() iommu/arm-smmu-v3: Add a type for the STE iommu/arm-smmu-v3: disable stall for quiet_cd iommu/qcom: restore IOMMU state if needed iommu/arm-smmu-qcom: Add QCM2290 MDSS compatible iommu/arm-smmu-qcom: Add missing GMU entry to match table ...
| | \
| | \
| *-. \ Merge branches 'apple/dart', 'arm/rockchip', 'arm/smmu', 'virtio', ↵Joerg Roedel2024-01-032-2/+3
| |\ \ \ | | | | | | | | | | | | | | | 'x86/vt-d', 'x86/amd' and 'core' into next
| | | * | iommu: Add mm_get_enqcmd_pasid() helper functionTina Zhang2023-12-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mm_get_enqcmd_pasid() should be used by architecture code and closely related to learn the PASID value that the x86 ENQCMD operation should use for the mm. For the moment SMMUv3 uses this without any connection to ENQCMD, it will be cleaned up similar to how the prior patch made VT-d use the PASID argument of set_dev_pasid(). The motivation is to replace mm->pasid with an iommu private data structure that is introduced in a later patch. Reviewed-by: Lu Baolu <baolu.lu@linux.intel.com> Reviewed-by: Jason Gunthorpe <jgg@nvidia.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Tina Zhang <tina.zhang@intel.com> Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20231027000525.1278806-4-tina.zhang@intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
| | | * | iommu: Change kconfig around IOMMU_SVAJason Gunthorpe2023-12-122-1/+2
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Linus suggested that the kconfig here is confusing: https://lore.kernel.org/all/CAHk-=wgUiAtiszwseM1p2fCJ+sC4XWQ+YN4TanFhUgvUqjr9Xw@mail.gmail.com/ Let's break it into three kconfigs controlling distinct things: - CONFIG_IOMMU_MM_DATA controls if the mm_struct has the additional fields for the IOMMU. Currently only PASID, but later patches store a struct iommu_mm_data * - CONFIG_ARCH_HAS_CPU_PASID controls if the arch needs the scheduling bit for keeping track of the ENQCMD instruction. x86 will select this if IOMMU_SVA is enabled - IOMMU_SVA controls if the IOMMU core compiles in the SVA support code for iommu driver use and the IOMMU exported API This way ARM will not enable CONFIG_ARCH_HAS_CPU_PASID Signed-off-by: Jason Gunthorpe <jgg@nvidia.com> Link: https://lore.kernel.org/r/20231027000525.1278806-2-tina.zhang@intel.com Signed-off-by: Joerg Roedel <jroedel@suse.de>
* | | | Merge tag 'x86_tdx_for_6.8' of ↵Linus Torvalds2024-01-1812-5/+1691
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 TDX updates from Dave Hansen: "This contains the initial support for host-side TDX support so that KVM can run TDX-protected guests. This does not include the actual KVM-side support which will come from the KVM folks. The TDX host interactions with kexec also needs to be ironed out before this is ready for prime time, so this code is currently Kconfig'd off when kexec is on. The majority of the code here is the kernel telling the TDX module which memory to protect and handing some additional memory over to it to use to store TDX module metadata. That sounds pretty simple, but the TDX architecture is rather flexible and it takes quite a bit of back-and-forth to say, "just protect all memory, please." There is also some code tacked on near the end of the series to handle a hardware erratum. The erratum can make software bugs such as a kernel write to TDX-protected memory cause a machine check and masquerade as a real hardware failure. The erratum handling watches out for these and tries to provide nicer user errors" * tag 'x86_tdx_for_6.8' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (21 commits) x86/virt/tdx: Make TDX host depend on X86_MCE x86/virt/tdx: Disable TDX host support when kexec is enabled Documentation/x86: Add documentation for TDX host support x86/mce: Differentiate real hardware #MCs from TDX erratum ones x86/cpu: Detect TDX partial write machine check erratum x86/virt/tdx: Handle TDX interaction with sleep and hibernation x86/virt/tdx: Initialize all TDMRs x86/virt/tdx: Configure global KeyID on all packages x86/virt/tdx: Configure TDX module with the TDMRs and global KeyID x86/virt/tdx: Designate reserved areas for all TDMRs x86/virt/tdx: Allocate and set up PAMTs for TDMRs x86/virt/tdx: Fill out TDMRs to cover all TDX memory regions x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regions x86/virt/tdx: Get module global metadata for module initialization x86/virt/tdx: Use all system memory when initializing TDX module as TDX memory x86/virt/tdx: Add skeleton to enable TDX on demand x86/virt/tdx: Add SEAMCALL error printing for module initialization x86/virt/tdx: Handle SEAMCALL no entropy error in common code x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APIC x86/virt/tdx: Define TDX supported page sizes as macros ...
| * | | | x86/virt/tdx: Make TDX host depend on X86_MCEKai Huang2023-12-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A build failure was reported that when INTEL_TDX_HOST is enabled but X86_MCE is not, the tdx_dump_mce_info() function fails to link: ld: vmlinux.o: in function `tdx_dump_mce_info': ...: undefined reference to `mce_is_memory_error' ...: undefined reference to `mce_usable_address' The reason is in such configuration, despite there's no caller of tdx_dump_mce_info() it is still built and there's no implementation for the two "mce_*()" functions. Make INTEL_TDX_HOST depend on X86_MCE to fix. It makes sense to enable MCE support for the TDX host anyway. Because the only way that TDX has to report integrity errors is an MCE, and it is not good to silently ignore such MCE. The TDX spec also suggests the host VMM is expected to implement the MCE handler. Note it also makes sense to make INTEL_TDX_HOST select X86_MCE but this generates "recursive dependency detected!" error in the Kconfig. Closes: https://lore.kernel.org/all/20231212214612.GHZXjUpBFa1IwVMTI7@fat_crate.local/T/ Fixes: 70060463cb2b ("x86/mce: Differentiate real hardware #MCs from TDX erratum ones") Reported-by: Arnd Bergmann <arnd@kernel.org> Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20231212214612.GHZXjUpBFa1IwVMTI7@fat_crate.local/T/#m1a109c29324b2bbd0b3b1d45c218012cd3a13be6 Link: https://lore.kernel.org/all/20231213222825.286809-1-kai.huang%40intel.com
| * | | | x86/virt/tdx: Disable TDX host support when kexec is enabledDave Hansen2023-12-121-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TDX host support currently lacks the ability to handle kexec. Disable TDX when kexec is enabled. Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-20-dave.hansen%40intel.com
| * | | | x86/mce: Differentiate real hardware #MCs from TDX erratum onesKai Huang2023-12-124-0/+133
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The first few generations of TDX hardware have an erratum. Triggering it in Linux requires some kind of kernel bug involving relatively exotic memory writes to TDX private memory and will manifest via spurious-looking machine checks when reading the affected memory. Make an effort to detect these TDX-induced machine checks and spit out a new blurb to dmesg so folks do not think their hardware is failing. == Background == Virtually all kernel memory accesses operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem. This problem is triggered by "partial" writes where a write transaction of less than cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA. == Problem == A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened. To add insult to injury, the Linux machine code will present these as a literal "Hardware error" when they were, in fact, a software-triggered issue. == Solution == In the end, this issue is hard to trigger. Rather than do something rash (and incomplete) like unmap TDX private memory from the direct map, improve the machine check handler. Currently, the #MC handler doesn't distinguish whether the memory is TDX private memory or not but just dump, for instance, below message: [...] mce: [Hardware Error]: CPU 147: Machine Check Exception: f Bank 1: bd80000000100134 [...] mce: [Hardware Error]: RIP 10:<ffffffffadb69870> {__tlb_remove_page_size+0x10/0xa0} ... [...] mce: [Hardware Error]: Run the above through 'mcelog --ascii' [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel [...] Kernel panic - not syncing: Fatal local machine check Which says "Hardware Error" and "Data load in unrecoverable area of kernel". Ideally, it's better for the log to say "software bug around TDX private memory" instead of "Hardware Error". But in reality the real hardware memory error can happen, and sadly such software-triggered #MC cannot be distinguished from the real hardware error. Also, the error message is used by userspace tool 'mcelog' to parse, so changing the output may break userspace. So keep the "Hardware Error". The "Data load in unrecoverable area of kernel" is also helpful, so keep it too. Instead of modifying above error log, improve the error log by printing additional TDX related message to make the log like: ... [...] mce: [Hardware Error]: Machine check: Data load in unrecoverable area of kernel [...] mce: [Hardware Error]: Machine Check: TDX private memory error. Possible kernel bug. Adding this additional message requires determination of whether the memory page is TDX private memory. There is no existing infrastructure to do that. Add an interface to query the TDX module to fill this gap. == Impact == This issue requires some kind of kernel bug to trigger. TDX private memory should never be mapped UC/WC. A partial write originating from these mappings would require *two* bugs, first mapping the wrong page, then writing the wrong memory. It would also be detectable using traditional memory corruption techniques like DEBUG_PAGEALLOC. MOVNTI (and friends) could cause this issue with something like a simple buffer overrun or use-after-free on the direct map. It should also be detectable with normal debug techniques. The one place where this might get nasty would be if the CPU read data then wrote back the same data. That would trigger this problem but would not, for instance, set off mechanisms like slab redzoning because it doesn't actually corrupt data. With an IOMMU at least, the DMA exposure is similar to the UC/WC issue. TDX private memory would first need to be incorrectly mapped into the I/O space and then a later DMA to that mapping would actually cause the poisoning event. [ dhansen: changelog tweaks ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Tony Luck <tony.luck@intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-18-dave.hansen%40intel.com
| * | | | x86/cpu: Detect TDX partial write machine check erratumKai Huang2023-12-122-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TDX memory has integrity and confidentiality protections. Violations of this integrity protection are supposed to only affect TDX operations and are never supposed to affect the host kernel itself. In other words, the host kernel should never, itself, see machine checks induced by the TDX integrity hardware. Alas, the first few generations of TDX hardware have an erratum. A partial write to a TDX private memory cacheline will silently "poison" the line. Subsequent reads will consume the poison and generate a machine check. According to the TDX hardware spec, neither of these things should have happened. Virtually all kernel memory accesses operations happen in full cachelines. In practice, writing a "byte" of memory usually reads a 64 byte cacheline of memory, modifies it, then writes the whole line back. Those operations do not trigger this problem. This problem is triggered by "partial" writes where a write transaction of less than cacheline lands at the memory controller. The CPU does these via non-temporal write instructions (like MOVNTI), or through UC/WC memory mappings. The issue can also be triggered away from the CPU by devices doing partial writes via DMA. With this erratum, there are additional things need to be done. To prepare for those changes, add a CPU bug bit to indicate this erratum. Note this bug reflects the hardware thus it is detected regardless of whether the kernel is built with TDX support or not. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-17-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Handle TDX interaction with sleep and hibernationKai Huang2023-12-081-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TDX is incompatible with hibernation and some ACPI sleep states. Users must disable hibernation to use TDX. Users must also disable TDX if they want to use ACPI S3 sleep. This feels a bit wonky and asymmetric, but it avoids adding any new command-line parameters for now. It can be improved if users hate it too much. Long version: TDX cannot survive from S3 and deeper states. The hardware resets and disables TDX completely when platform goes to S3 and deeper. Both TDX guests and the TDX module get destroyed permanently. The kernel uses S3 to support suspend-to-ram, and S4 or deeper states to support hibernation. The kernel also maintains TDX states to track whether it has been initialized and its metadata resource, etc. After resuming from S3 or hibernation, these TDX states won't be correct anymore. Theoretically, the kernel can do more complicated things like resetting TDX internal states and TDX module metadata before going to S3 or deeper, and re-initialize TDX module after resuming, etc, but there is no way to save/restore TDX guests for now. Until TDX supports full save and restore of TDX guests, there is no big value to handle TDX module in suspend and hibernation alone. To make things simple, just choose to make TDX mutually exclusive with S3 and hibernation. Note the TDX module is initialized at runtime. To avoid having to deal with the fuss of determining TDX state at runtime, just choose TDX vs S3 and hibernation at kernel early boot. It's a bad user experience if the choice of TDX and S3/hibernation is done at runtime anyway, i.e., the user can experience being able to do S3/hibernation but later becoming unable to due to TDX being enabled. Disable TDX in kernel early boot when hibernation support is available. Currently there's no mechanism exposed by the hibernation code to allow other kernel code to disable hibernation once for all. Users that want TDX must disable hibernation, like using hibername=no on the command line. Disable ACPI S3 when TDX is enabled by the BIOS. For now the user needs to disable TDX in the BIOS to use ACPI S3. A new kernel command line can be added in the future if there's a need to let user disable TDX host via kernel command line. Alternatively, the kernel could disable TDX when ACPI S3 is supported and request the user to disable S3 to use TDX. But there's no existing kernel command line to do that, and BIOS doesn't always have an option to disable S3. [ dhansen: subject / changelog tweaks ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-16-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Initialize all TDMRsKai Huang2023-12-082-9/+53
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the global KeyID has been configured on all packages, initialize all TDMRs to make all TDX-usable memory regions that are passed to the TDX module become usable. This is the last step of initializing the TDX module. Initializing TDMRs can be time consuming on large memory systems as it involves initializing all metadata entries for all pages that can be used by TDX guests. Initializing different TDMRs can be parallelized. For now to keep it simple, just initialize all TDMRs one by one. It can be enhanced in the future. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-15-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Configure global KeyID on all packagesKai Huang2023-12-082-2/+132
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the list of TDMRs and the global KeyID are configured to the TDX module, the kernel needs to configure the key of the global KeyID on all packages using TDH.SYS.KEY.CONFIG. This SEAMCALL cannot run parallel on different cpus. Loop all online cpus and use smp_call_on_cpu() to call this SEAMCALL on the first cpu of each package. To keep things simple, this implementation takes no affirmative steps to online cpus to make sure there's at least one cpu for each package. The callers (aka. KVM) can ensure success by ensuring sufficient CPUs are online for this to succeed. Intel hardware doesn't guarantee cache coherency across different KeyIDs. The PAMTs are transitioning from being used by the kernel mapping (KeyId 0) to the TDX module's "global KeyID" mapping. This means that the kernel must flush any dirty KeyID-0 PAMT cachelines before the TDX module uses the global KeyID to access the PAMTs. Otherwise, if those dirty cachelines were written back, they would corrupt the TDX module's metadata. Aside: This corruption would be detected by the memory integrity hardware on the next read of the memory with the global KeyID. The result would likely be fatal to the system but would not impact TDX security. Following the TDX module specification, flush cache before configuring the global KeyID on all packages. Given the PAMT size can be large (~1/256th of system RAM), just use WBINVD on all CPUs to flush. If TDH.SYS.KEY.CONFIG fails, the TDX module may already have "converted" some memory for TDX module use. Convert the memory back so that it can be safely used by the kernel again. Note that this is slower than it should be because of the "partial write machine check" erratum which affects TDX-capable hardware. Also refactor and introduce a new helper: tdmr_do_pamt_func(). This takes a TDMR and runs a function on its PAMT. It looks a _bit_ odd to pass a function pointer around like this, but its use is pretty narrow and it does eliminate what would otherwise be some copying and pasting. [ dhansen: * munge changelog as usual * remove weird (*pamd_func)() syntax ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-14-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Configure TDX module with the TDMRs and global KeyIDKai Huang2023-12-082-1/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The TDX module uses a private KeyID as the "global KeyID" for mapping things like the PAMT and other TDX metadata. This KeyID has already been reserved when detecting TDX during the kernel early boot. Now that the "TD Memory Regions" (TDMRs) are fully built, pass them to the TDX module together with the global KeyID. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-13-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Designate reserved areas for all TDMRsKai Huang2023-12-081-8/+209
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As the last step of constructing TDMRs, populate reserved areas for all TDMRs. Cover all memory holes and PAMTs with a TMDR reserved area. [ dhansen: trim down chagnelog ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-12-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Allocate and set up PAMTs for TDMRsKai Huang2023-12-083-6/+213
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The TDX module uses additional metadata to record things like which guest "owns" a given page of memory. This metadata, referred as Physical Address Metadata Table (PAMT), essentially serves as the 'struct page' for the TDX module. PAMTs are not reserved by hardware up front. They must be allocated by the kernel and then given to the TDX module during module initialization. TDX supports 3 page sizes: 4K, 2M, and 1G. Each "TD Memory Region" (TDMR) has 3 PAMTs to track the 3 supported page sizes. Each PAMT must be a physically contiguous area from a Convertible Memory Region (CMR). However, the PAMTs which track pages in one TDMR do not need to reside within that TDMR but can be anywhere in CMRs. If one PAMT overlaps with any TDMR, the overlapping part must be reported as a reserved area in that particular TDMR. Use alloc_contig_pages() since PAMT must be a physically contiguous area and it may be potentially large (~1/256th of the size of the given TDMR). The downside is alloc_contig_pages() may fail at runtime. One (bad) mitigation is to launch a TDX guest early during system boot to get those PAMTs allocated at early time, but the only way to fix is to add a boot option to allocate or reserve PAMTs during kernel boot. It is imperfect but will be improved on later. TDX only supports a limited number of reserved areas per TDMR to cover both PAMTs and memory holes within the given TDMR. If many PAMTs are allocated within a single TDMR, the reserved areas may not be sufficient to cover all of them. Adopt the following policies when allocating PAMTs for a given TDMR: - Allocate three PAMTs of the TDMR in one contiguous chunk to minimize the total number of reserved areas consumed for PAMTs. - Try to first allocate PAMT from the local node of the TDMR for better NUMA locality. Also dump out how many pages are allocated for PAMTs when the TDX module is initialized successfully. This helps answer the eternal "where did all my memory go?" questions. [ dhansen: merge in error handling cleanup ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-11-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Fill out TDMRs to cover all TDX memory regionsKai Huang2023-12-082-1/+105
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Start to transit out the "multi-steps" to construct a list of "TD Memory Regions" (TDMRs) to cover all TDX-usable memory regions. The kernel configures TDX-usable memory regions by passing a list of TDMRs "TD Memory Regions" (TDMRs) to the TDX module. Each TDMR contains the information of the base/size of a memory region, the base/size of the associated Physical Address Metadata Table (PAMT) and a list of reserved areas in the region. Do the first step to fill out a number of TDMRs to cover all TDX memory regions. To keep it simple, always try to use one TDMR for each memory region. As the first step only set up the base/size for each TDMR. Each TDMR must be 1G aligned and the size must be in 1G granularity. This implies that one TDMR could cover multiple memory regions. If a memory region spans the 1GB boundary and the former part is already covered by the previous TDMR, just use a new TDMR for the remaining part. TDX only supports a limited number of TDMRs. Disable TDX if all TDMRs are consumed but there is more memory region to cover. There are fancier things that could be done like trying to merge adjacent TDMRs. This would allow more pathological memory layouts to be supported. But, current systems are not even close to exhausting the existing TDMR resources in practice. For now, keep it simple. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Yuan Yao <yuan.yao@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-10-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Add placeholder to construct TDMRs to cover all TDX memory regionsKai Huang2023-12-082-3/+123
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the kernel selects all TDX-usable memory regions, the kernel needs to pass those regions to the TDX module via data structure "TD Memory Region" (TDMR). Add a placeholder to construct a list of TDMRs (in multiple steps) to cover all TDX-usable memory regions. === Long Version === TDX provides increased levels of memory confidentiality and integrity. This requires special hardware support for features like memory encryption and storage of memory integrity checksums. Not all memory satisfies these requirements. As a result, TDX introduced the concept of a "Convertible Memory Region" (CMR). During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees. The list of these ranges is available to the kernel by querying the TDX module. The TDX architecture needs additional metadata to record things like which TD guest "owns" a given page of memory. This metadata essentially serves as the 'struct page' for the TDX module. The space for this metadata is not reserved by the hardware up front and must be allocated by the kernel and given to the TDX module. Since this metadata consumes space, the VMM can choose whether or not to allocate it for a given area of convertible memory. If it chooses not to, the memory cannot receive TDX protections and can not be used by TDX guests as private memory. For every memory region that the VMM wants to use as TDX memory, it sets up a "TD Memory Region" (TDMR). Each TDMR represents a physically contiguous convertible range and must also have its own physically contiguous metadata table, referred to as a Physical Address Metadata Table (PAMT), to track status for each page in the TDMR range. Unlike a CMR, each TDMR requires 1G granularity and alignment. To support physical RAM areas that don't meet those strict requirements, each TDMR permits a number of internal "reserved areas" which can be placed over memory holes. If PAMT metadata is placed within a TDMR it must be covered by one of these reserved areas. Let's summarize the concepts: CMR - Firmware-enumerated physical ranges that support TDX. CMRs are 4K aligned. TDMR - Physical address range which is chosen by the kernel to support TDX. 1G granularity and alignment required. Each TDMR has reserved areas where TDX memory holes and overlapping PAMTs can be represented. PAMT - Physically contiguous TDX metadata. One table for each page size per TDMR. Roughly 1/256th of TDMR in size. 256G TDMR = ~1G PAMT. As one step of initializing the TDX module, the kernel configures TDX-usable memory regions by passing a list of TDMRs to the TDX module. Constructing the list of TDMRs consists below steps: 1) Fill out TDMRs to cover all memory regions that the TDX module will use for TD memory. 2) Allocate and set up PAMT for each TDMR. 3) Designate reserved areas for each TDMR. Add a placeholder to construct TDMRs to do the above steps. To keep things simple, just allocate enough space to hold maximum number of TDMRs up front. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-9-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Get module global metadata for module initializationKai Huang2023-12-083-1/+127
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The TDX module global metadata provides system-wide information about the module. TL;DR: Use the TDH.SYS.RD SEAMCALL to tell if the module is good or not. Long Version: 1) Only initialize TDX module with version 1.5 and later TDX module 1.0 has some compatibility issues with the later versions of module, as documented in the "Intel TDX module ABI incompatibilities between TDX1.0 and TDX1.5" spec. Don't bother with module versions that do not have a stable ABI. 2) Get the essential global metadata for module initialization TDX reports a list of "Convertible Memory Region" (CMR) to tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. The kernel does this by constructing a list of "TD Memory Regions" (TDMRs) to cover all these memory regions and passing them to the TDX module. Each TDMR is a TDX architectural data structure containing the memory region that the TDMR covers, plus the information to track (within this TDMR): a) the "Physical Address Metadata Table" (PAMT) to track each TDX memory page's status (such as which TDX guest "owns" a given page, and b) the "reserved areas" to tell memory holes that cannot be used as TDX memory. The kernel needs to get below metadata from the TDX module to build the list of TDMRs: a) the maximum number of supported TDMRs b) the maximum number of supported reserved areas per TDMR and, c) the PAMT entry size for each TDX-supported page size. == Implementation == The TDX module has two modes of fetching the metadata: a one field at a time, or all in one blob. Use the field at a time for now. It is slower, but there just are not enough fields now to justify the complexity of extra unpacking. The err_free_tdxmem=>out_put_tdxmem goto looks wonky by itself. But it is the first of a bunch of error handling that will get stuck at its site. [ dhansen: clean up changelog and add a struct to map between the TDX module fields and 'struct tdx_tdmr_sysinfo' ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-8-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Use all system memory when initializing TDX module as TDX memoryKai Huang2023-12-084-2/+174
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Start to transit out the "multi-steps" to initialize the TDX module. TDX provides increased levels of memory confidentiality and integrity. This requires special hardware support for features like memory encryption and storage of memory integrity checksums. Not all memory satisfies these requirements. As a result, TDX introduced the concept of a "Convertible Memory Region" (CMR). During boot, the firmware builds a list of all of the memory ranges which can provide the TDX security guarantees. The list of these ranges is available to the kernel by querying the TDX module. CMRs tell the kernel which memory is TDX compatible. The kernel needs to build a list of memory regions (out of CMRs) as "TDX-usable" memory and pass them to the TDX module. Once this is done, those "TDX-usable" memory regions are fixed during module's lifetime. To keep things simple, assume that all TDX-protected memory will come from the page allocator. Make sure all pages in the page allocator *are* TDX-usable memory. As TDX-usable memory is a fixed configuration, take a snapshot of the memory configuration from memblocks at the time of module initialization (memblocks are modified on memory hotplug). This snapshot is used to enable TDX support for *this* memory configuration only. Use a memory hotplug notifier to ensure that no other RAM can be added outside of this configuration. This approach requires all memblock memory regions at the time of module initialization to be TDX convertible memory to work, otherwise module initialization will fail in a later SEAMCALL when passing those regions to the module. This approach works when all boot-time "system RAM" is TDX convertible memory and no non-TDX-convertible memory is hot-added to the core-mm before module initialization. For instance, on the first generation of TDX machines, both CXL memory and NVDIMM are not TDX convertible memory. Using kmem driver to hot-add any CXL memory or NVDIMM to the core-mm before module initialization will result in failure to initialize the module. The SEAMCALL error code will be available in the dmesg to help user to understand the failure. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-7-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Add skeleton to enable TDX on demandKai Huang2023-12-083-0/+201
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are essentially two steps to get the TDX module ready: 1) Get each CPU ready to run TDX 2) Set up the shared TDX module data structures Introduce and export (to KVM) the infrastructure to do both of these pieces at runtime. == Per-CPU TDX Initialization == Track the initialization status of each CPU with a per-cpu variable. This avoids failures in the case of KVM module reloads and handles cases where CPUs come online later. Generally, the per-cpu SEAMCALLs happen first. But there's actually one global call that has to happen before _any_ others (TDH_SYS_INIT). It's analogous to the boot CPU having to do a bit of extra work just because it happens to be the first one. Track if _any_ CPU has done this call and then only actually do it during the first per-cpu init. == Shared TDX Initialization == Create the global state function (tdx_enable()) as a simple placeholder. The TODO list will be pared down as functionality is added. Use a state machine protected by mutex to make sure the work in tdx_enable() will only be done once. This avoids failures if the KVM module is reloaded. A CPU must be made ready to run TDX before it can participate in initializing the shared parts of the module. Any caller of tdx_enable() need to ensure that it can never run on a CPU which is not ready to run TDX. It needs to be wary of CPU hotplug, preemption and the VMX enabling state of any CPU on which it might run. == Why runtime instead of boot time? == The TDX module can be initialized only once in its lifetime. Instead of always initializing it at boot time, this implementation chooses an "on demand" approach to initialize TDX until there is a real need (e.g when requested by KVM). This approach has below pros: 1) It avoids consuming the memory that must be allocated by kernel and given to the TDX module as metadata (~1/256th of the TDX-usable memory), and also saves the CPU cycles of initializing the TDX module (and the metadata) when TDX is not used at all. 2) The TDX module design allows it to be updated while the system is running. The update procedure shares quite a few steps with this "on demand" initialization mechanism. The hope is that much of "on demand" mechanism can be shared with a future "update" mechanism. A boot-time TDX module implementation would not be able to share much code with the update mechanism. 3) Making SEAMCALL requires VMX to be enabled. Currently, only the KVM code mucks with VMX enabling. If the TDX module were to be initialized separately from KVM (like at boot), the boot code would need to be taught how to muck with VMX enabling and KVM would need to be taught how to cope with that. Making KVM itself responsible for TDX initialization lets the rest of the kernel stay blissfully unaware of VMX. [ dhansen: completely reorder/rewrite changelog ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Nikolay Borisov <nik.borisov@suse.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-6-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Add SEAMCALL error printing for module initializationKai Huang2023-12-082-0/+45
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SEAMCALLs involved during the TDX module initialization are not expected to fail. In fact, they are not expected to return any non-zero code (except the "running out of entropy error", which can be handled internally already). Add yet another set of SEAMCALL wrappers, which treats all non-zero return code as error, to support printing SEAMCALL error upon failure for module initialization. Note the TDX module initialization doesn't use the _saved_ret() variant thus no wrapper is added for it. SEAMCALL assembly can also return kernel-defined error codes for three special cases: 1) TDX isn't enabled by the BIOS; 2) TDX module isn't loaded; 3) CPU isn't in VMX operation. Whether they can legally happen depends on the caller, so leave to the caller to print error message when desired. Also convert the SEAMCALL error codes to the kernel error codes in the new wrappers so that each SEAMCALL caller doesn't have to repeat the conversion. [ dhansen: Align the register dump with show_regs(). Zero-pad the contents, split on two lines and use consistent spacing. ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-5-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Handle SEAMCALL no entropy error in common codeKai Huang2023-12-081-0/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some SEAMCALLs use the RDRAND hardware and can fail for the same reasons as RDRAND. Use the kernel RDRAND retry logic for them. There are three __seamcall*() variants. Do the SEAMCALL retry in common code and add a wrapper for each of them. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirll.shutemov@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-4-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Make INTEL_TDX_HOST depend on X86_X2APICKai Huang2023-12-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TDX capable platforms are locked to X2APIC mode and cannot fall back to the legacy xAPIC mode when TDX is enabled by the BIOS. TDX host support requires x2APIC. Make INTEL_TDX_HOST depend on X86_X2APIC. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/lkml/ba80b303-31bf-d44a-b05d-5c0f83038798@intel.com/ Link: https://lore.kernel.org/all/20231208170740.53979-3-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Define TDX supported page sizes as macrosKai Huang2023-12-082-3/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TDX supports 4K, 2M and 1G page sizes. The corresponding values are defined by the TDX module spec and used as TDX module ABI. Currently, they are used in try_accept_one() when the TDX guest tries to accept a page. However currently try_accept_one() uses hard-coded magic values. Define TDX supported page sizes as macros and get rid of the hard-coded values in try_accept_one(). TDX host support will need to use them too. Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Link: https://lore.kernel.org/all/20231208170740.53979-2-dave.hansen%40intel.com
| * | | | x86/virt/tdx: Detect TDX during kernel bootKai Huang2023-12-086-1/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Intel Trust Domain Extensions (TDX) protects guest VMs from malicious host and certain physical attacks. A CPU-attested software module called 'the TDX module' runs inside a new isolated memory range as a trusted hypervisor to manage and run protected VMs. Pre-TDX Intel hardware has support for a memory encryption architecture called MKTME. The memory encryption hardware underpinning MKTME is also used for Intel TDX. TDX ends up "stealing" some of the physical address space from the MKTME architecture for crypto-protection to VMs. The BIOS is responsible for partitioning the "KeyID" space between legacy MKTME and TDX. The KeyIDs reserved for TDX are called 'TDX private KeyIDs' or 'TDX KeyIDs' for short. During machine boot, TDX microcode verifies that the BIOS programmed TDX private KeyIDs consistently and correctly programmed across all CPU packages. The MSRs are locked in this state after verification. This is why MSR_IA32_MKTME_KEYID_PARTITIONING gets used for TDX enumeration: it indicates not just that the hardware supports TDX, but that all the boot-time security checks passed. The TDX module is expected to be loaded by the BIOS when it enables TDX, but the kernel needs to properly initialize it before it can be used to create and run any TDX guests. The TDX module will be initialized by the KVM subsystem when KVM wants to use TDX. Detect platform TDX support by detecting TDX private KeyIDs. The TDX module itself requires one TDX KeyID as the 'TDX global KeyID' to protect its metadata. Each TDX guest also needs a TDX KeyID for its own protection. Just use the first TDX KeyID as the global KeyID and leave the rest for TDX guests. If no TDX KeyID is left for TDX guests, disable TDX as initializing the TDX module alone is useless. [ dhansen: add X86_FEATURE, replace helper function ] Signed-off-by: Kai Huang <kai.huang@intel.com> Signed-off-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reviewed-by: Isaku Yamahata <isaku.yamahata@intel.com> Reviewed-by: David Hildenbrand <david@redhat.com> Reviewed-by: Dave Hansen <dave.hansen@linux.intel.com> Reviewed-by: Kuppuswamy Sathyanarayanan <sathyanarayanan.kuppuswamy@linux.intel.com> Link: https://lore.kernel.org/all/20231208170740.53979-1-dave.hansen%40intel.com
* | | | | Merge tag 'driver-core-6.8-rc1' of ↵Linus Torvalds2024-01-183-35/+4
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core updates from Greg KH: "Here are the set of driver core and kernfs changes for 6.8-rc1. Nothing major in here this release cycle, just lots of small cleanups and some tweaks on kernfs that in the very end, got reverted and will come back in a safer way next release cycle. Included in here are: - more driver core 'const' cleanups and fixes - fw_devlink=rpm is now the default behavior - kernfs tiny changes to remove some string functions - cpu handling in the driver core is updated to work better on many systems that add topologies and cpus after booting - other minor changes and cleanups All of the cpu handling patches have been acked by the respective maintainers and are coming in here in one series. Everything has been in linux-next for a while with no reported issues" * tag 'driver-core-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: (51 commits) Revert "kernfs: convert kernfs_idr_lock to an irq safe raw spinlock" kernfs: convert kernfs_idr_lock to an irq safe raw spinlock class: fix use-after-free in class_register() PM: clk: make pm_clk_add_notifier() take a const pointer EDAC: constantify the struct bus_type usage kernfs: fix reference to renamed function driver core: device.h: fix Excess kernel-doc description warning driver core: class: fix Excess kernel-doc description warning driver core: mark remaining local bus_type variables as const driver core: container: make container_subsys const driver core: bus: constantify subsys_register() calls driver core: bus: make bus_sort_breadthfirst() take a const pointer kernfs: d_obtain_alias(NULL) will do the right thing... driver core: Better advertise dev_err_probe() kernfs: Convert kernfs_path_from_node_locked() from strlcpy() to strscpy() kernfs: Convert kernfs_name_locked() from strlcpy() to strscpy() kernfs: Convert kernfs_walk_ns() from strlcpy() to strscpy() initramfs: Expose retained initrd as sysfs file fs/kernfs/dir: obey S_ISGID kernel/cgroup: use kernfs_create_dir_ns() ...
| * | | | | x86/topology: convert to use arch_cpu_is_hotpluggable()Russell King (Oracle)2023-12-061-5/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Convert x86 to use the arch_cpu_is_hotpluggable() helper rather than arch_register_cpu(). Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R3w-00Cszy-6k@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | | x86/topology: use weak version of arch_unregister_cpu()Russell King (Oracle)2023-12-061-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the x86 version of arch_unregister_cpu() is the same as the weak version, drop the x86 specific version. Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R3r-00Cszs-2R@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | | x86/topology: Switch over to GENERIC_CPU_DEVICESJames Morse2023-12-063-27/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that GENERIC_CPU_DEVICES calls arch_register_cpu(), which can be overridden by the arch code, switch over to this to allow common code to choose when the register_cpu() call is made. x86's struct cpus come from struct x86_cpu, which has no other members or users. Remove this and use the version defined by common code. This is an intermediate step to the logic being moved to drivers/acpi, where GENERIC_CPU_DEVICES will do the work when booting with acpi=off. This patch also has the effect of moving the registration of CPUs from subsys to driver core initialisation, prior to any initcalls running. ---- Changes since RFC: * Fixed the second copy of arch_register_cpu() used for non-hotplug Changes since RFC v2: * Remove duplicate of the weak generic arch_register_cpu(), spotted by Jonathan Cameron. Add note about initialisation order change. Changes since RFC v3: * Adapt to removal of EXPORT_SYMBOL()s Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R3l-00Cszm-UA@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | | ACPI: Move ACPI_HOTPLUG_CPU to be disabled on arm64 and riscvJames Morse2023-12-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Neither arm64 nor riscv support physical hotadd of CPUs that were not present at boot. For arm64 much of the platform description is in static tables which do not have update methods. arm64 does support HOTPLUG_CPU, which is backed by a firmware interface to turn CPUs on and off. acpi_processor_hotadd_init() and acpi_processor_remove() are for adding and removing CPUs that were not present at boot. arm64 systems that do this are not supported as there is currently insufficient information in the platform description. (e.g. did the GICR get removed too?) arm64 currently relies on the MADT enabled flag check in map_gicc_mpidr() to prevent CPUs that were not described as present at boot from being added to the system. Similarly, riscv relies on the same check in map_rintc_hartid(). Both architectures also rely on the weak 'always fails' definitions of acpi_map_cpu() and arch_register_cpu(). Subsequent changes will redefine ACPI_HOTPLUG_CPU as making possible CPUs present. Neither arm64 nor riscv support this. Disable ACPI_HOTPLUG_CPU for arm64 and riscv by removing 'default y' and selecting it on the other three ACPI architectures. This allows the weak definitions of some symbols to be removed. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Shaoqin Huang <shahuang@redhat.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R31-00Csyt-Jq@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | | x86/topology: remove arch_*register_cpu() exportsRussell King (Oracle)2023-12-061-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | arch_register_cpu() and arch_unregister_cpu() are not used by anything that can be a module - they are used by drivers/base/cpu.c and drivers/acpi/acpi_processor.c, neither of which can be a module. Remove the exports. Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R2r-00Csyh-7B@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | | | | x86: intel_epb: Don't rely on link orderJames Morse2023-12-061-1/+1
| | |_|/ / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | intel_epb_init() is called as a subsys_initcall() to register cpuhp callbacks. The callbacks make use of get_cpu_device() which will return NULL unless register_cpu() has been called. register_cpu() is called from topology_init(), which is also a subsys_initcall(). This is fragile. Moving the register_cpu() to a different subsys_initcall() leads to a NULL dereference during boot. Make intel_epb_init() a late_initcall(), user-space can't provide a policy before this point anyway. Signed-off-by: James Morse <james.morse@arm.com> Reviewed-by: Gavin Shan <gshan@redhat.com> Signed-off-by: "Russell King (Oracle)" <rmk+kernel@armlinux.org.uk> Acked-by: "Rafael J. Wysocki" <rafael.j.wysocki@intel.com> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/E1r5R2m-00Csyb-2S@rmk-PC.armlinux.org.uk Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | | | | Merge tag 'pci-v6.8-changes' of ↵Linus Torvalds2024-01-187-117/+145
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci Pull pci updates from Bjorn Helgaas: "Enumeration: - Reserve ECAM so we don't assign it to PCI BARs; this works around bugs where BIOS included ECAM in a PNP0A03 host bridge window, didn't reserve it via a PNP0C02 motherboard device, and didn't allocate space for SR-IOV VF BARs (Bjorn Helgaas) - Add MMCONFIG/ECAM debug logging (Bjorn Helgaas) - Rename 'MMCONFIG' to 'ECAM' to match spec usage (Bjorn Helgaas) - Log device type (Root Port, Switch Port, etc) during enumeration (Bjorn Helgaas) - Log bridges before downstream devices so the dmesg order is more logical (Bjorn Helgaas) - Log resource names (BAR 0, VF BAR 0, bridge window, etc) consistently instead of a mix of names and "reg 0x10" (Puranjay Mohan, Bjorn Helgaas) - Fix 64GT/s effective data rate calculation to use 1b/1b encoding rather than the 8b/10b or 128b/130b used by lower rates (Ilpo Järvinen) - Use PCI_HEADER_TYPE_* instead of literals in x86, powerpc, SCSI lpfc (Ilpo Järvinen) - Clean up open-coded PCIBIOS return code mangling (Ilpo Järvinen) Resource management: - Restructure pci_dev_for_each_resource() to avoid computing the address of an out-of-bounds array element (the bounds check was performed later so the element was never actually *read*, but it's nicer to avoid even computing an out-of-bounds address) (Andy Shevchenko) Driver binding: - Convert pci-host-common.c platform .remove() callback to .remove_new() returning 'void' since it's not useful to return error codes here (Uwe Kleine-König) - Convert exynos, keystone, kirin from .remove() to .remove_new(), which returns void instead of int (Uwe Kleine-König) - Drop unused struct pci_driver.node member (Mathias Krause) Virtualization: - Add ACS quirk for more Zhaoxin Root Ports (LeoLiuoc) Error handling: - Log AER errors as "Correctable" (not "Corrected") or "Uncorrectable" to match spec terminology (Bjorn Helgaas) - Decode Requester ID when no error info found instead of printing the raw hex value (Bjorn Helgaas) Endpoint framework: - Use a unique test pattern for each BAR in the pci_endpoint_test to make it easier to debug address translation issues (Niklas Cassel) Broadcom STB PCIe controller driver: - Add DT property "brcm,clkreq-mode" and driver support for different CLKREQ# modes to make ASPM L1.x states possible (Jim Quinlan) Freescale Layerscape PCIe controller driver: - Add suspend/resume support for Layerscape LS1043a and LS1021a, including software-managed PME_Turn_Off and transitions between L0, L2/L3_Ready Link states (Frank Li) MediaTek PCIe controller driver: - Clear MSI interrupt status before handler to avoid missing MSIs that occur after the handler (qizhong cheng) MediaTek PCIe Gen3 controller driver: - Update mediatek-gen3 translation window setup to handle MMIO space that is not a power of two in size (Jianjun Wang) Qualcomm PCIe controller driver: - Increase qcom iommu-map maxItems to accommodate SDX55 (five entries) and SDM845 (sixteen entries) (Krzysztof Kozlowski) - Describe qcom,pcie-sc8180x clocks and resets accurately (Krzysztof Kozlowski) - Describe qcom,pcie-sm8150 clocks and resets accurately (Krzysztof Kozlowski) - Correct the qcom "reset-name" property, previously incorrectly called "reset-names" (Krzysztof Kozlowski) - Document qcom,pcie-sm8650, based on qcom,pcie-sm8550 (Neil Armstrong) Renesas R-Car PCIe controller driver: - Replace of_device.h with explicit of.h include to untangle header usage (Rob Herring) - Add DT and driver support for optional miniPCIe 1.5v and 3.3v regulators on KingFisher (Wolfram Sang) SiFive FU740 PCIe controller driver: - Convert fu740 CONFIG_PCIE_FU740 dependency from SOC_SIFIVE to ARCH_SIFIVE (Conor Dooley) Synopsys DesignWare PCIe controller driver: - Align iATU mapping for endpoint MSI-X (Niklas Cassel) - Drop "host_" prefix from struct dw_pcie_host_ops members (Yoshihiro Shimoda) - Drop "ep_" prefix from struct dw_pcie_ep_ops members (Yoshihiro Shimoda) - Rename struct dw_pcie_ep_ops.func_conf_select() to .get_dbi_offset() to be more descriptive (Yoshihiro Shimoda) - Add Endpoint DBI accessors to encapsulate offset lookups (Yoshihiro Shimoda) TI J721E PCIe driver: - Add j721e DT and driver support for 'num-lanes' for devices that support x1, x2, or x4 Links (Matt Ranostay) - Add j721e DT compatible strings and driver support for j784s4 (Matt Ranostay) - Make TI J721E Kconfig depend on ARCH_K3 since the hardware is specific to those TI SoC parts (Peter Robinson) TI Keystone PCIe controller driver: - Hold power management references to all PHYs while enabling them to avoid a race when one provides clocks to others (Siddharth Vadapalli) Xilinx XDMA PCIe controller driver: - Remove redundant dev_err(), since platform_get_irq() and platform_get_irq_byname() already log errors (Yang Li) - Fix uninitialized symbols in xilinx_pl_dma_pcie_setup_irq() (Krzysztof Wilczyński) - Fix xilinx_pl_dma_pcie_init_irq_domain() error return when irq_domain_add_linear() fails (Harshit Mogalapalli) MicroSemi Switchtec management driver: - Do dma_mrpc cleanup during switchtec_pci_remove() to match its devm ioremapping in switchtec_pci_probe(). Previously the cleanup was done in stdev_release(), which used stale pointers if stdev->cdev happened to be open when the PCI device was removed (Daniel Stodden) Miscellaneous: - Convert interrupt terminology from "legacy" to "INTx" to be more specific and match spec terminology (Damien Le Moal) - In dw-xdata-pcie, pci_endpoint_test, and vmd, replace usage of deprecated ida_simple_*() API with ida_alloc() and ida_free() (Christophe JAILLET)" * tag 'pci-v6.8-changes' of git://git.kernel.org/pub/scm/linux/kernel/git/pci/pci: (97 commits) PCI: Fix kernel-doc issues PCI: brcmstb: Configure HW CLKREQ# mode appropriate for downstream device dt-bindings: PCI: brcmstb: Add property "brcm,clkreq-mode" PCI: mediatek-gen3: Fix translation window size calculation PCI: mediatek: Clear interrupt status before dispatching handler PCI: keystone: Fix race condition when initializing PHYs PCI: xilinx-xdma: Fix error code in xilinx_pl_dma_pcie_init_irq_domain() PCI: xilinx-xdma: Fix uninitialized symbols in xilinx_pl_dma_pcie_setup_irq() PCI: rcar-gen4: Fix -Wvoid-pointer-to-enum-cast error PCI: iproc: Fix -Wvoid-pointer-to-enum-cast warning PCI: dwc: Add dw_pcie_ep_{read,write}_dbi[2] helpers PCI: dwc: Rename .func_conf_select to .get_dbi_offset in struct dw_pcie_ep_ops PCI: dwc: Rename .ep_init to .init in struct dw_pcie_ep_ops PCI: dwc: Drop host prefix from struct dw_pcie_host_ops members misc: pci_endpoint_test: Use a unique test pattern for each BAR PCI: j721e: Make TI J721E depend on ARCH_K3 PCI: j721e: Add TI J784S4 PCIe configuration PCI/AER: Use explicit register sizes for struct members PCI/AER: Decode Requester ID when no error info found PCI/AER: Use 'Correctable' and 'Uncorrectable' spec terms for errors ...
| * \ \ \ \ Merge branch 'pci/enumeration'Bjorn Helgaas2024-01-153-11/+24
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | - Convert pci-host-common.c platform .remove() callback to .remove_new() returning 'void' since it's not useful to return error codes here (Uwe Kleine-König) - Log a message about updating AMD USB controller class code (so dwc3, not xhci, claims it) only when we actually change it (Guilherme G. Piccoli) - Use PCI_HEADER_TYPE_* instead of literals in x86, powerpc, SCSI lpfc (Ilpo Järvinen) - Clean up open-coded PCIBIOS return code mangling (Ilpo Järvinen) - Fix 64GT/s effective data rate calculation to use 1b/1b encoding rather than the 8b/10b or 128b/130b used by lower rates (Ilpo Järvinen) * pci/enumeration: PCI: Fix 64GT/s effective data rate calculation x86/pci: Clean up open-coded PCIBIOS return code mangling scsi: lpfc: Use PCI_HEADER_TYPE_MFD instead of literal powerpc/fsl-pci: Use PCI_HEADER_TYPE_MASK instead of literal x86/pci: Use PCI_HEADER_TYPE_* instead of literals PCI: Only override AMD USB controller if required PCI: host-generic: Convert to platform remove callback returning void
| | * | | | | x86/pci: Clean up open-coded PCIBIOS return code manglingIlpo Järvinen2023-12-011-7/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Per PCI Firmware spec r3.3, sec 2.5.2, 2.6.2, and 2.7, the return code for these PCI BIOS interfaces is in 8 bits of the EAX register. Previously it was extracted by open-coded masks and shifting. Name the return code bits with a #define and add pcibios_get_return_code() to extract the return code to improve code readability. In addition, replace zero test with PCIBIOS_SUCCESSFUL. No functional changes intended. Link: https://lore.kernel.org/r/20231124085924.13830-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| | * | | | | x86/pci: Use PCI_HEADER_TYPE_* instead of literalsIlpo Järvinen2023-12-012-4/+3
| | | |_|_|/ | | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace 0x7f and 0x80 literals with PCI_HEADER_TYPE_* defines. Link: https://lore.kernel.org/r/20231124090919.23687-1-ilpo.jarvinen@linux.intel.com Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@linux.intel.com> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Reorder pci_mmcfg_arch_map() definition before callsBjorn Helgaas2023-12-051-19/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The typical style is to define functions before calling them. Move pci_mmcfg_arch_map() and pci_mmcfg_arch_unmap() earlier so they're defined before they're called. No functional change intended. Link: https://lore.kernel.org/r/20231121183643.249006-10-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Return pci_mmconfig_add() failure earlyBjorn Helgaas2023-12-051-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If pci_mmconfig_alloc() fails, return the failure early so it's obvious that the failure is the exception, and the success is the normal case. No functional change intended. Link: https://lore.kernel.org/r/20231121183643.249006-9-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Comment pci_mmconfig_insert() obscure MCFG dependencyBjorn Helgaas2023-12-051-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In pci_mmconfig_insert(), there's no reference to "addr" between locking pci_mmcfg_lock and testing "addr", so it *looks* like we should move the test before the lock. But 07f9b61c3915 ("x86/PCI: MMCONFIG: Check earlier for MMCONFIG region at address zero") did that, which broke things by returning -EINVAL when "addr" is zero instead of -EEXIST. So 07f9b61c3915 was reverted by 67d470e0e171 ("Revert "x86/PCI: MMCONFIG: Check earlier for MMCONFIG region at address zero""). Add a comment about this issue to prevent it from happening again. Link: https://lore.kernel.org/r/20231121183643.249006-8-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Rename pci_mmcfg_check_reserved() to pci_mmcfg_reserved()Bjorn Helgaas2023-12-051-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "pci_mmcfg_check_reserved()" doesn't give a hint about what the boolean return value means. Rename it to pci_mmcfg_reserved() so testing "if (pci_mmcfg_reserved())" makes sense. Update callers to treat the return value as boolean instead of comparing with 0. Link: https://lore.kernel.org/r/20231121183643.249006-7-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Rename acpi_mcfg_check_entry() to acpi_mcfg_valid_entry()Bjorn Helgaas2023-12-051-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | "acpi_mcfg_check_entry()" doesn't give a hint about what the return value means. Rename it to "acpi_mcfg_valid_entry()", convert the return value to bool, and update the return values and callers to match so testing "if (acpi_mcfg_valid_entry())" makes sense. Link: https://lore.kernel.org/r/20231121183643.249006-6-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Rename 'MMCONFIG' to 'ECAM', use pr_fmtBjorn Helgaas2023-12-053-67/+52
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "MMCONFIG" term is not used in PCI/PCIe specs. Replace it with "ECAM", the term used in PCIe r6.0, sec 7.2.2. Define pr_fmt() instead of repeating PREFIX in every log message. Link: https://lore.kernel.org/r/20231121183643.249006-5-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Add MCFG debug loggingBjorn Helgaas2023-12-052-5/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | MCFG handling is a frequent source of problems. Add more logging to aid in debugging. Enable the logging with CONFIG_DYNAMIC_DEBUG=y and the kernel boot parameter 'dyndbg="file arch/x86/pci +p"'. Link: https://lore.kernel.org/r/20231121183643.249006-4-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Reword ECAM EfiMemoryMappedIO logging to avoid 'reserved'Bjorn Helgaas2023-12-051-8/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fd3a8cff4d4a ("x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space") added the concept of using the EFI memory map to help decide whether ECAM space mentioned in the MCFG table is valid. Unfortunately it described that EfiMemoryMappedIO space as "reserved", but it is actually not *reserved* by the EFI memory map. EfiMemoryMappedIO only means the firmware requested that the OS map this space for use by firmware runtime services. Change the dmesg logging to describe it as simply "EfiMemoryMappedIO", not as "reserved as EfiMemoryMappedIO". A previous commit actually *does* reserve the space if ACPI PNP0C01/02 devices haven't done so: - PCI: ECAM at [mem 0xe0000000-0xefffffff] reserved as EfiMemoryMappedIO + PCI: ECAM at [mem 0xe0000000-0xefffffff] is EfiMemoryMappedIO; assuming valid PCI: ECAM [mem 0xe0000000-0xefffffff] reserved to work around lack of ACPI motherboard _CRS Link: https://lore.kernel.org/r/20231121183643.249006-3-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com>
| * | | | | x86/pci: Reserve ECAM if BIOS didn't include it in PNP0C02 _CRSBjorn Helgaas2023-12-051-1/+12
| |/ / / / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Tomasz, Sebastian, and some Proxmox users reported problems initializing ixgbe NICs. I think the problem is that ECAM space described in the ACPI MCFG table is not reserved via a PNP0C02 _CRS method as required by the PCI Firmware spec (r3.3, sec 4.1.2), but it *is* included in the PNP0A03 host bridge _CRS as part of the MMIO aperture. If we allocate space for a PCI BAR, we're likely to allocate it from that ECAM space, which obviously cannot work. This could happen for any device, but in the ixgbe case it happens because it's an SR-IOV device and the BIOS didn't allocate space for the VF BARs, so Linux reallocated the bridge window leading to ixgbe and put it on top of the ECAM space. From Tomasz' system: PCI: MMCONFIG for domain 0000 [bus 00-ff] at [mem 0x80000000-0x8fffffff] (base 0x80000000) PCI: MMCONFIG at [mem 0x80000000-0x8fffffff] not reserved in ACPI motherboard resources pci_bus 0000:00: root bus resource [mem 0x80000000-0xfbffffff window] pci 0000:00:01.1: PCI bridge to [bus 02-03] pci 0000:00:01.1: bridge window [mem 0xfb900000-0xfbbfffff] pci 0000:02:00.0: [8086:10fb] type 00 class 0x020000 # ixgbe pci 0000:02:00.0: reg 0x10: [mem 0xfba80000-0xfbafffff 64bit] pci 0000:02:00.0: VF(n) BAR0 space: [mem 0x00000000-0x000fffff 64bit] (contains BAR0 for 64 VFs) pci 0000:02:00.0: BAR 7: no space for [mem size 0x00100000 64bit] # VF BAR 0 pci_bus 0000:00: No. 2 try to assign unassigned res pci 0000:00:01.1: resource 14 [mem 0xfb900000-0xfbbfffff] released pci 0000:00:01.1: BAR 14: assigned [mem 0x80000000-0x806fffff] pci 0000:02:00.0: BAR 0: assigned [mem 0x80000000-0x8007ffff 64bit] pci 0000:02:00.0: BAR 7: assigned [mem 0x80204000-0x80303fff 64bit] # VF BAR 0 Fixes: 07eab0901ede ("efi/x86: Remove EfiMemoryMappedIO from E820 map") Fixes: fd3a8cff4d4a ("x86/pci: Treat EfiMemoryMappedIO as reservation of ECAM space") Reported-by: Tomasz Pala <gotar@polanet.pl> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218050 Reported-by: Sebastian Manciulea <manciuleas@protonmail.com> Link: https://bugzilla.kernel.org/show_bug.cgi?id=218107 Link: https://forum.proxmox.com/threads/proxmox-8-kernel-6-2-16-4-pve-ixgbe-driver-fails-to-load-due-to-pci-device-probing-failure.131203/ Link: https://lore.kernel.org/r/20231121183643.249006-2-helgaas@kernel.org Tested-by: Tomasz Pala <gotar@polanet.pl> Signed-off-by: Bjorn Helgaas <bhelgaas@google.com> Cc: stable@vger.kernel.org # v6.2+
* | | | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvmLinus Torvalds2024-01-1752-1049/+1825
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull kvm updates from Paolo Bonzini: "Generic: - Use memdup_array_user() to harden against overflow. - Unconditionally advertise KVM_CAP_DEVICE_CTRL for all architectures. - Clean up Kconfigs that all KVM architectures were selecting - New functionality around "guest_memfd", a new userspace API that creates an anonymous file and returns a file descriptor that refers to it. guest_memfd files are bound to their owning virtual machine, cannot be mapped, read, or written by userspace, and cannot be resized. guest_memfd files do however support PUNCH_HOLE, which can be used to switch a memory area between guest_memfd and regular anonymous memory. - New ioctl KVM_SET_MEMORY_ATTRIBUTES allowing userspace to specify per-page attributes for a given page of guest memory; right now the only attribute is whether the guest expects to access memory via guest_memfd or not, which in Confidential SVMs backed by SEV-SNP, TDX or ARM64 pKVM is checked by firmware or hypervisor that guarantees confidentiality (AMD PSP, Intel TDX module, or EL2 in the case of pKVM). x86: - Support for "software-protected VMs" that can use the new guest_memfd and page attributes infrastructure. This is mostly useful for testing, since there is no pKVM-like infrastructure to provide a meaningfully reduced TCB. - Fix a relatively benign off-by-one error when splitting huge pages during CLEAR_DIRTY_LOG. - Fix a bug where KVM could incorrectly test-and-clear dirty bits in non-leaf TDP MMU SPTEs if a racing thread replaces a huge SPTE with a non-huge SPTE. - Use more generic lockdep assertions in paths that don't actually care about whether the caller is a reader or a writer. - let Xen guests opt out of having PV clock reported as "based on a stable TSC", because some of them don't expect the "TSC stable" bit (added to the pvclock ABI by KVM, but never set by Xen) to be set. - Revert a bogus, made-up nested SVM consistency check for TLB_CONTROL. - Advertise flush-by-ASID support for nSVM unconditionally, as KVM always flushes on nested transitions, i.e. always satisfies flush requests. This allows running bleeding edge versions of VMware Workstation on top of KVM. - Sanity check that the CPU supports flush-by-ASID when enabling SEV support. - On AMD machines with vNMI, always rely on hardware instead of intercepting IRET in some cases to detect unmasking of NMIs - Support for virtualizing Linear Address Masking (LAM) - Fix a variety of vPMU bugs where KVM fail to stop/reset counters and other state prior to refreshing the vPMU model. - Fix a double-overflow PMU bug by tracking emulated counter events using a dedicated field instead of snapshotting the "previous" counter. If the hardware PMC count triggers overflow that is recognized in the same VM-Exit that KVM manually bumps an event count, KVM would pend PMIs for both the hardware-triggered overflow and for KVM-triggered overflow. - Turn off KVM_WERROR by default for all configs so that it's not inadvertantly enabled by non-KVM developers, which can be problematic for subsystems that require no regressions for W=1 builds. - Advertise all of the host-supported CPUID bits that enumerate IA32_SPEC_CTRL "features". - Don't force a masterclock update when a vCPU synchronizes to the current TSC generation, as updating the masterclock can cause kvmclock's time to "jump" unexpectedly, e.g. when userspace hotplugs a pre-created vCPU. - Use RIP-relative address to read kvm_rebooting in the VM-Enter fault paths, partly as a super minor optimization, but mostly to make KVM play nice with position independent executable builds. - Guard KVM-on-HyperV's range-based TLB flush hooks with an #ifdef on CONFIG_HYPERV as a minor optimization, and to self-document the code. - Add CONFIG_KVM_HYPERV to allow disabling KVM support for HyperV "emulation" at build time. ARM64: - LPA2 support, adding 52bit IPA/PA capability for 4kB and 16kB base granule sizes. Branch shared with the arm64 tree. - Large Fine-Grained Trap rework, bringing some sanity to the feature, although there is more to come. This comes with a prefix branch shared with the arm64 tree. - Some additional Nested Virtualization groundwork, mostly introducing the NV2 VNCR support and retargetting the NV support to that version of the architecture. - A small set of vgic fixes and associated cleanups. Loongarch: - Optimization for memslot hugepage checking - Cleanup and fix some HW/SW timer issues - Add LSX/LASX (128bit/256bit SIMD) support RISC-V: - KVM_GET_REG_LIST improvement for vector registers - Generate ISA extension reg_list using macros in get-reg-list selftest - Support for reporting steal time along with selftest s390: - Bugfixes Selftests: - Fix an annoying goof where the NX hugepage test prints out garbage instead of the magic token needed to run the test. - Fix build errors when a header is delete/moved due to a missing flag in the Makefile. - Detect if KVM bugged/killed a selftest's VM and print out a helpful message instead of complaining that a random ioctl() failed. - Annotate the guest printf/assert helpers with __printf(), and fix the various bugs that were lurking due to lack of said annotation" * tag 'for-linus' of git://git.kernel.org/pub/scm/virt/kvm/kvm: (185 commits) x86/kvm: Do not try to disable kvmclock if it was not enabled KVM: x86: add missing "depends on KVM" KVM: fix direction of dependency on MMU notifiers KVM: introduce CONFIG_KVM_COMMON KVM: arm64: Add missing memory barriers when switching to pKVM's hyp pgd KVM: arm64: vgic-its: Avoid potential UAF in LPI translation cache RISC-V: KVM: selftests: Add get-reg-list test for STA registers RISC-V: KVM: selftests: Add steal_time test support RISC-V: KVM: selftests: Add guest_sbi_probe_extension RISC-V: KVM: selftests: Move sbi_ecall to processor.c RISC-V: KVM: Implement SBI STA extension RISC-V: KVM: Add support for SBI STA registers RISC-V: KVM: Add support for SBI extension registers RISC-V: KVM: Add SBI STA info to vcpu_arch RISC-V: KVM: Add steal-update vcpu request RISC-V: KVM: Add SBI STA extension skeleton RISC-V: paravirt: Implement steal-time support RISC-V: Add SBI STA extension definitions RISC-V: paravirt: Add skeleton for pv-time support RISC-V: KVM: Fix indentation in kvm_riscv_vcpu_set_reg_csr() ...
| * | | | | x86/kvm: Do not try to disable kvmclock if it was not enabledKirill A. Shutemov2024-01-081-4/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | kvm_guest_cpu_offline() tries to disable kvmclock regardless if it is present in the VM. It leads to write to a MSR that doesn't exist on some configurations, namely in TDX guest: unchecked MSR access error: WRMSR to 0x12 (tried to write 0x0000000000000000) at rIP: 0xffffffff8110687c (kvmclock_disable+0x1c/0x30) kvmclock enabling is gated by CLOCKSOURCE and CLOCKSOURCE2 KVM paravirt features. Do not disable kvmclock if it was not enabled. Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Fixes: c02027b5742b ("x86/kvm: Disable kvmclock on all CPUs on shutdown") Reviewed-by: Sean Christopherson <seanjc@google.com> Reviewed-by: Vitaly Kuznetsov <vkuznets@redhat.com> Cc: Paolo Bonzini <pbonzini@redhat.com> Cc: Wanpeng Li <wanpengli@tencent.com> Cc: stable@vger.kernel.org Message-Id: <20231205004510.27164-6-kirill.shutemov@linux.intel.com> Signed-off-by: Paolo Bonzini <pbonzini@redhat.com>