linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	dt-bindings: display: samsung,ams495qa01: add missing SPI properties ref	Krzysztof Kozlowski	2024-05-13	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \|	Samsung AMS495QA01 panel is a SPI device, so it should reference spi-peripheral-props.yaml schema to allow and validate the SPI device properties. Fixes: 92be07c65b22 ("dt-bindings: display: panel: Add Samsung AMS495QA01") Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20240509-dt-bindings-dsi-panel-reg-v1-1-8b2443705be0@linaro.org Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	dt-bindings: Use full path to other schemas	Krzysztof Kozlowski	2024-05-08	3	-4/+4
\| \| \| \| \| \| \| \| \| \| \|	When referencing other schema, it is preferred to use an absolute path (/schemas/....), which allows also an seamless move of particular schema out of Linux kernel to dtschema. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by: Miquel Raynal <miquel.raynal@bootlin.com> Link: https://lore.kernel.org/r/20240505084618.135705-1-krzysztof.kozlowski@linaro.org Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	dt-bindings: PCI: qcom,pcie-sm8350: Drop redundant 'oneOf' sub-schema	Rob Herring (Arm)	2024-05-08	1	-22/+0
\| \| \| \| \| \| \| \| \| \| \| \| \|	The first entry in the 'oneOf' schema doesn't work because the top level schema requires exactly 8 interrupt entries. The 2nd entry is just redundant with the top level. Since 1 entry appears to have been a mistake, let's just drop the entire 'oneOf' rather than reworking the top-level to allow 1 entry. Reviewed-by: Manivannan Sadhasivam <manivannan.sadhasivam@linaro.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20240417200431.3173953-1-robh@kernel.org Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	of: module: add buffer overflow check in of_modalias()	Sergey Shtylyov	2024-05-08	1	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \|	In of_modalias(), if the buffer happens to be too small even for the 1st snprintf() call, the len parameter will become negative and str parameter (if not NULL initially) will point beyond the buffer's end. Add the buffer overflow check after the 1st snprintf() call and fix such check after the strlen() call (accounting for the terminating NUL char). Fixes: bc575064d688 ("of/device: use of_property_for_each_string to parse compatible strings") Signed-off-by: Sergey Shtylyov <s.shtylyov@omp.ru> Link: https://lore.kernel.org/r/bbfc6be0-c687-62b6-d015-5141b93f313e@omp.ru Signed-off-by: Rob Herring <robh@kernel.org>
*	dt-bindings: PCI: microchip: increase number of items in ranges property	Valentina Fernandez	2024-05-03	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \|	Increase the number of items in the ranges property to allow up to 3 ranges. For example a prefetchable range, a non-prefetchable range and an IO range, depending on configuration. Signed-off-by: Valentina Fernandez <valentina.fernandezalanis@microchip.com> Signed-off-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20240327-debunk-perky-f5514ca332be@spud Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	dt-bindings: Drop unnecessary quotes on keys	Rob Herring (Arm)	2024-05-03	7	-19/+22
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	The yamllint quoted-strings check wasn't checking keys for quotes, but support for checking keys was added in 1.34 release. Fix all the errors found when enabling the check. Clean-up the xilinx-versal-cpm formatting while we're here. Acked-by: Mark Brown <broonie@kernel.org> Reviewed-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Acked-by: Thierry Reding <treding@nvidia.com> Link: https://lore.kernel.org/r/20240426202239.2837516-1-robh@kernel.org Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	dt-bindings: interrupt-controller: mediatek,mt6577-sysirq: Drop unnecessary ↵	Rob Herring (Arm)	2024-05-03	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \|	quotes Drop unnecessary quotes which aren't needed in yaml. This is checked by yamllint, but this case was excluded due to the comma and yamllint's mishandling of some cases with commas. That's now fixed in yamllint 1.34. Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://lore.kernel.org/r/20240426195404.2771046-1-robh@kernel.org Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	of: property: Use scope based cleanup on port_node	Shresth Prasad	2024-05-03	1	-3/+2
\| \| \| \| \| \| \| \| \| \| \|	Use __free cleanup handler which ensures that the resource is freed when it goes out of scope, thus removing the need to manually clean it up using of_node_put. Suggested-by: Julia Lawall <julia.lawall@inria.fr> Signed-off-by: Shresth Prasad <shresthprasad7@gmail.com> Link: https://lore.kernel.org/r/20240428115226.41345-2-shresthprasad7@gmail.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	of: reserved_mem: Remove the use of phandle from the reserved_mem APIs	Oreoluwa Babatunde	2024-05-03	2	-22/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The __find_rmem() function is the only place that references the phandle field of the reserved_mem struct. __find_rmem() is used to match a device_node object to its corresponding entry in the reserved_mem array using its phandle value. But, there is already a function called of_reserved_mem_lookup() which carries out the same action using the name of the node. Using the of_reserved_mem_lookup() function is more reliable because every node is guaranteed to have a name, but not all nodes will have a phandle. Nodes are only assigned a phandle if they are explicitly defined in the DT using "phandle = <phandle_number>", or if they are referenced by another node in the DT. Hence, If the phandle field is empty, then __find_rmem() will return a false negative. Hence, delete the __find_rmem() function and switch to using the of_reserved_mem_lookup() function to find the corresponding entry of a device_node in the reserved_mem array. Since the phandle field of the reserved_mem struct is now unused, delete that as well. Signed-off-by: Oreoluwa Babatunde <quic_obabatun@quicinc.com> Link: https://lore.kernel.org/r/20240502192403.3307277-1-quic_obabatun@quicinc.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	of: property: fw_devlink: Add support for "power-supplies" binding	Saravana Kannan	2024-04-23	1	-0/+2
\| \| \| \| \| \| \| \| \| \|	Add support for parsing power-supplies binding so that fw_devlink can enforce the dependency. Signed-off-by: Saravana Kannan <saravanak@google.com> Cc: Sebastian Reichel <sre@kernel.org> Link: https://lore.kernel.org/r/20240417200738.1370896-1-saravanak@google.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	dt-bindings: watchdog: aspeed,ast2400-wdt: Convert to DT schema	Andrew Jeffery	2024-04-23	2	-73/+142
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Squash warnings such as: ``` arch/arm/boot/dts/aspeed/aspeed-bmc-facebook-galaxy100.dtb: /ahb/apb@1e600000/watchdog@1e785000: failed to match any schema with compatible: ['aspeed,ast2400-wdt'] ``` The schema binding additionally defines the clocks property over the prose binding to align with use of the node in the DTS files. Signed-off-by: Andrew Jeffery <andrew@codeconstruct.com.au> Reviewed-by: Rob Herring <robh@kernel.org> Link: https://lore.kernel.org/r/20240403020439.418788-1-andrew@codeconstruct.com.au Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	dt-bindings: irq: sun7i-nmi: Add binding for the H616 NMI controller	Chris Morgan	2024-04-23	1	-4/+4
\| \| \| \| \| \| \| \| \|	Add binding for the H616 NMI controller. Signed-off-by: Chris Morgan <macromorgan@hotmail.com> Acked-by: Krzysztof Kozlowski <krzk@kernel.org> Link: https://lore.kernel.org/r/20240418181615.1370179-2-macroalpha82@gmail.com Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	dt-bindings: interrupt-controller: renesas,irqc: Add r8a779g0 support	Geert Uytterhoeven	2024-04-23	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	Document support for the Interrupt Controller for External Devices (INT-EX) in the Renesas R-Car V4M (R8A779H0) SoC. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/264cffccfbb1f92657420f5f869236b06a97d958.1713280616.git.geert+renesas@glider.be Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	dt-bindings: timer: renesas,tmu: Add R-Car V4M support	Geert Uytterhoeven	2024-04-16	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \|	Document support for the Timer Unit (TMU) in the Renesas R-Car V4M (R8A779H0) SoC. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/8a39386b1a33db6e83e852b3b365bc1adeb25242.1712068574.git.geert+renesas@glider.be Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	dt-bindings: timer: renesas,cmt: Add R-Car V4M support	Geert Uytterhoeven	2024-04-16	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \|	Document support for the Compare Match Timer Type0 (CMT0) and Type1 (CMT1) in the Renesas R-Car V4M (R8A779H0) SoC. Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be> Reviewed-by: Wolfram Sang <wsa+renesas@sang-engineering.com> Acked-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/3e8a7a93261d8ad264dec2fa2784fe1bbfc4a23c.1712068536.git.geert+renesas@glider.be Signed-off-by: Rob Herring (Arm) <robh@kernel.org>
*	of: Use scope based of_node_put() cleanups	Rob Herring	2024-04-15	2	-93/+48
\| \| \| \| \| \| \| \| \| \| \| \|	Use the relatively new scope based of_node_put() cleanup to simplify function exit handling. Doing so reduces the chances of forgetting an of_node_put() and simplifies error paths by avoiding the need for goto statements. Reviewed-by: Saravana Kannan <saravanak@google.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240409-dt-cleanup-free-v2-3-5b419a4af38d@kernel.org Signed-off-by: Rob Herring <robh@kernel.org>
*	of: Use scope based kfree() cleanups	Rob Herring	2024-04-15	3	-55/+25
\| \| \| \| \| \| \| \| \| \| \|	Use the relatively new scope based kfree() cleanup to simplify error handling. Doing so reduces the chances of memory leaks and simplifies error paths by avoiding the need for goto statements. Reviewed-by: Saravana Kannan <saravanak@google.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240409-dt-cleanup-free-v2-2-5b419a4af38d@kernel.org Signed-off-by: Rob Herring <robh@kernel.org>
*	of: Add a helper to free property struct	Rob Herring	2024-04-15	4	-31/+19
\| \| \| \| \| \| \| \| \| \| \|	Freeing a property struct is 3 kfree()'s which is duplicated in multiple spots. Add a helper, __of_prop_free(), and replace all the open coded cases in the DT code. Reviewed-by: Saravana Kannan <saravanak@google.com> Reviewed-by: Jonathan Cameron <Jonathan.Cameron@huawei.com> Link: https://lore.kernel.org/r/20240409-dt-cleanup-free-v2-1-5b419a4af38d@kernel.org Signed-off-by: Rob Herring <robh@kernel.org>
*	dt-bindings: serial: fsl-linflexuart: add compatible for S32G3	Wadim Mueller	2024-04-12	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \|	Add a compatible string for the uart binding of NXP S32G3 platforms. Here we use "s32v234-linflexuart" as fallback since the current linflexuart driver can still work on S32G3. Signed-off-by: Wadim Mueller <wafgo01@gmail.com> Acked-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Link: https://lore.kernel.org/r/20240324214329.29988-3-wafgo01@gmail.com Signed-off-by: Rob Herring <robh@kernel.org>
*	dt-bindings: usb: mtk-xhci: add compatible for MT7988	Rafał Miłecki	2024-04-11	1	-0/+1
\| \| \| \| \| \| \| \| \| \|	MT7988 SoC contains two on-SoC XHCI controllers. Add proper binding. Signed-off-by: Rafał Miłecki <rafal@milecki.pl> Acked-by: Conor Dooley <conor.dooley@microchip.com> Reviewed-by: AngeloGioacchino Del Regno <angelogioacchino.delregno@collabora.com> Link: https://lore.kernel.org/r/20240213130044.1976-1-zajec5@gmail.com Signed-off-by: Rob Herring <robh@kernel.org>
*	dt-bindings: display: atmel,lcdc: convert to dtschema	Dharma Balasubiramani	2024-03-25	3	-87/+173
\| \| \| \| \| \| \| \| \|	Convert the atmel,lcdc bindings to DT schema. Changes during conversion: add missing clocks and clock-names properties. Signed-off-by: Dharma Balasubiramani <dharma.b@microchip.com> Link: https://lore.kernel.org/r/20240318-lcdc-fb-v4-1-c533c7c2c706@microchip.com Signed-off-by: Rob Herring <robh@kernel.org>
*	dt-bindings: display: samsung,exynos5-dp: convert to DT Schema	Krzysztof Kozlowski	2024-03-25	2	-112/+163
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Convert Samsung Exynos5250/5420 SoC Display Port Controller bindings to DT schema with a change: add power-domains, already used in DTS. This Display Port controller is actually variant of Analogix Display Port bridge, however new DT Schema does not reference analogix,dp.yaml, because of incompatibilities in the driver. The analogix,dp.yaml expects two ports, input and output, but Linux Exynos DP DRM driver and DTS use only one port: output. Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@linaro.org> Reviewed-by: Conor Dooley <conor.dooley@microchip.com> Link: https://lore.kernel.org/r/20240313182855.14140-1-krzysztof.kozlowski@linaro.org Signed-off-by: Rob Herring <robh@kernel.org>
*	Linux 6.9-rc1v6.9-rc1	Linus Torvalds	2024-03-24	1	-2/+2
\|
*	Merge tag 'efi-fixes-for-v6.9-2' of ↵	Linus Torvalds	2024-03-24	4	-2/+14
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi Pull EFI fixes from Ard Biesheuvel: - Fix logic that is supposed to prevent placement of the kernel image below LOAD_PHYSICAL_ADDR - Use the firmware stack in the EFI stub when running in mixed mode - Clear BSS only once when using mixed mode - Check efi.get_variable() function pointer for NULL before trying to call it * tag 'efi-fixes-for-v6.9-2' of git://git.kernel.org/pub/scm/linux/kernel/git/efi/efi: efi: fix panic in kdump kernel x86/efistub: Don't clear BSS twice in mixed mode x86/efistub: Call mixed mode boot services on the firmware's stack efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or higher address
\| *	efi: fix panic in kdump kernel	Oleksandr Tymoshenko	2024-03-24	1	-0/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Check if get_next_variable() is actually valid pointer before calling it. In kdump kernel this method is set to NULL that causes panic during the kexec-ed kernel boot. Tested with QEMU and OVMF firmware. Fixes: bad267f9e18f ("efi: verify that variable services are supported") Signed-off-by: Oleksandr Tymoshenko <ovt@google.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
\| *	x86/efistub: Don't clear BSS twice in mixed mode	Ard Biesheuvel	2024-03-24	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Clearing BSS should only be done once, at the very beginning. efi_pe_entry() is the entrypoint from the firmware, which may not clear BSS and so it is done explicitly. However, efi_pe_entry() is also used as an entrypoint by the mixed mode startup code, in which case BSS will already have been cleared, and doing it again at this point will corrupt global variables holding the firmware's GDT/IDT and segment selectors. So make the memset() conditional on whether the EFI stub is running in native mode. Fixes: b3810c5a2cc4a666 ("x86/efistub: Clear decompressor BSS in native EFI entrypoint") Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
\| *	x86/efistub: Call mixed mode boot services on the firmware's stack	Ard Biesheuvel	2024-03-24	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Normally, the EFI stub calls into the EFI boot services using the stack that was live when the stub was entered. According to the UEFI spec, this stack needs to be at least 128k in size - this might seem large but all asynchronous processing and event handling in EFI runs from the same stack and so quite a lot of space may be used in practice. In mixed mode, the situation is a bit different: the bootloader calls the 32-bit EFI stub entry point, which calls the decompressor's 32-bit entry point, where the boot stack is set up, using a fixed allocation of 16k. This stack is still in use when the EFI stub is started in 64-bit mode, and so all calls back into the EFI firmware will be using the decompressor's limited boot stack. Due to the placement of the boot stack right after the boot heap, any stack overruns have gone unnoticed. However, commit 5c4feadb0011983b ("x86/decompressor: Move global symbol references to C code") moved the definition of the boot heap into C code, and now the boot stack is placed right at the base of BSS, where any overruns will corrupt the end of the .data section. While it would be possible to work around this by increasing the size of the boot stack, doing so would affect all x86 systems, and mixed mode systems are a tiny (and shrinking) fraction of the x86 installed base. So instead, record the firmware stack pointer value when entering from the 32-bit firmware, and switch to this stack every time a EFI boot service call is made. Cc: <stable@kernel.org> # v6.1+ Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
\| *	efi/libstub: fix efi_random_alloc() to allocate memory at alloc_min or ↵	KONDO KAZUMA(近藤　和真)	2024-03-22	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	higher address Following warning is sometimes observed while booting my servers: [ 3.594838] DMA: preallocated 4096 KiB GFP_KERNEL pool for atomic allocations [ 3.602918] swapper/0: page allocation failure: order:10, mode:0xcc1(GFP_KERNEL\|GFP_DMA), nodemask=(null),cpuset=/,mems_allowed=0-1 ... [ 3.851862] DMA: preallocated 1024 KiB GFP_KERNEL\|GFP_DMA pool for atomic allocation If 'nokaslr' boot option is set, the warning always happens. On x86, ZONE_DMA is small zone at the first 16MB of physical address space. When this problem happens, most of that space seems to be used by decompressed kernel. Thereby, there is not enough space at DMA_ZONE to meet the request of DMA pool allocation. The commit 2f77465b05b1 ("x86/efistub: Avoid placing the kernel below LOAD_PHYSICAL_ADDR") tried to fix this problem by introducing lower bound of allocation. But the fix is not complete. efi_random_alloc() allocates pages by following steps. 1. Count total available slots ('total_slots') 2. Select a slot ('target_slot') to allocate randomly 3. Calculate a starting address ('target') to be included target_slot 4. Allocate pages, which starting address is 'target' In step 1, 'alloc_min' is used to offset the starting address of memory chunk. But in step 3 'alloc_min' is not considered at all. As the result, 'target' can be miscalculated and become lower than 'alloc_min'. When KASLR is disabled, 'target_slot' is always 0 and the problem happens everytime if the EFI memory map of the system meets the condition. Fix this problem by calculating 'target' considering 'alloc_min'. Cc: linux-efi@vger.kernel.org Cc: Tom Englund <tomenglund26@gmail.com> Cc: linux-kernel@vger.kernel.org Fixes: 2f77465b05b1 ("x86/efistub: Avoid placing the kernel below LOAD_PHYSICAL_ADDR") Signed-off-by: Kazuma Kondo <kazuma-kondo@nec.com> Signed-off-by: Ard Biesheuvel <ardb@kernel.org>
* \|	Merge tag 'x86-urgent-2024-03-24' of ↵	Linus Torvalds	2024-03-24	15	-89/+80
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull x86 fixes from Thomas Gleixner: - Ensure that the encryption mask at boot is properly propagated on 5-level page tables, otherwise the PGD entry is incorrectly set to non-encrypted, which causes system crashes during boot. - Undo the deferred 5-level page table setup as it cannot work with memory encryption enabled. - Prevent inconsistent XFD state on CPU hotplug, where the MSR is reset to the default value but the cached variable is not, so subsequent comparisons might yield the wrong result and as a consequence the result prevents updating the MSR. - Register the local APIC address only once in the MPPARSE enumeration to prevent triggering the related WARN_ONs() in the APIC and topology code. - Handle the case where no APIC is found gracefully by registering a fake APIC in the topology code. That makes all related topology functions work correctly and does not affect the actual APIC driver code at all. - Don't evaluate logical IDs during early boot as the local APIC IDs are not yet enumerated and the invoked function returns an error code. Nothing requires the logical IDs before the final CPUID enumeration takes place, which happens after the enumeration. - Cure the fallout of the per CPU rework on UP which misplaced the copying of boot_cpu_data to per CPU data so that the final update to boot_cpu_data got lost which caused inconsistent state and boot crashes. - Use copy_from_kernel_nofault() in the kprobes setup as there is no guarantee that the address can be safely accessed. - Reorder struct members in struct saved_context to work around another kmemleak false positive - Remove the buggy code which tries to update the E820 kexec table for setup_data as that is never passed to the kexec kernel. - Update the resource control documentation to use the proper units. - Fix a Kconfig warning observed with tinyconfig * tag 'x86-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: x86/boot/64: Move 5-level paging global variable assignments back x86/boot/64: Apply encryption mask to 5-level pagetable update x86/cpu: Add model number for another Intel Arrow Lake mobile processor x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD Documentation/x86: Document that resctrl bandwidth control units are MiB x86/mpparse: Register APIC address only once x86/topology: Handle the !APIC case gracefully x86/topology: Don't evaluate logical IDs during early boot x86/cpu: Ensure that CPU info updates are propagated on UP kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address x86/pm: Work around false positive kmemleak report in msr_build_context() x86/kexec: Do not update E820 kexec table for setup_data x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'
\| * \|	x86/boot/64: Move 5-level paging global variable assignments back	Tom Lendacky	2024-03-24	1	-9/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging global variables") moved assignment of 5-level global variables to later in the boot in order to avoid having to use RIP relative addressing in order to set them. However, when running with 5-level paging and SME active (mem_encrypt=on), the variables are needed as part of the page table setup needed to encrypt the kernel (using pgd_none(), p4d_offset(), etc.). Since the variables haven't been set, the page table manipulation is done as if 4-level paging is active, causing the system to crash on boot. While only a subset of the assignments that were moved need to be set early, move all of the assignments back into check_la57_support() so that these assignments aren't spread between two locations. Instead of just reverting the fix, this uses the new RIP_REL_REF() macro when assigning the variables. Fixes: 63bed9660420 ("x86/startup_64: Defer assignment of 5-level paging global variables") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/2ca419f4d0de719926fd82353f6751f717590a86.1711122067.git.thomas.lendacky@amd.com
\| * \|	x86/boot/64: Apply encryption mask to 5-level pagetable update	Tom Lendacky	2024-03-24	1	-1/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When running with 5-level page tables, the kernel mapping PGD entry is updated to point to the P4D table. The assignment uses _PAGE_TABLE_NOENC, which, when SME is active (mem_encrypt=on), results in a page table entry without the encryption mask set, causing the system to crash on boot. Change the assignment to use _PAGE_TABLE instead of _PAGE_TABLE_NOENC so that the encryption mask is set for the PGD entry. Fixes: 533568e06b15 ("x86/boot/64: Use RIP_REL_REF() to access early_top_pgt[]") Signed-off-by: Tom Lendacky <thomas.lendacky@amd.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Ard Biesheuvel <ardb@kernel.org> Link: https://lore.kernel.org/r/8f20345cda7dbba2cf748b286e1bc00816fe649a.1711122067.git.thomas.lendacky@amd.com
\| * \|	x86/cpu: Add model number for another Intel Arrow Lake mobile processor	Tony Luck	2024-03-24	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This one is the regular laptop CPU. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240322161725.195614-1-tony.luck@intel.com
\| * \|	x86/fpu: Keep xfd_state in sync with MSR_IA32_XFD	Adamos Ttofari	2024-03-24	2	-6/+13
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Commit 672365477ae8 ("x86/fpu: Update XFD state where required") and commit 8bf26758ca96 ("x86/fpu: Add XFD state to fpstate") introduced a per CPU variable xfd_state to keep the MSR_IA32_XFD value cached, in order to avoid unnecessary writes to the MSR. On CPU hotplug MSR_IA32_XFD is reset to the init_fpstate.xfd, which wipes out any stale state. But the per CPU cached xfd value is not reset, which brings them out of sync. As a consequence a subsequent xfd_update_state() might fail to update the MSR which in turn can result in XRSTOR raising a #NM in kernel space, which crashes the kernel. To fix this, introduce xfd_set_state() to write xfd_state together with MSR_IA32_XFD, and use it in all places that set MSR_IA32_XFD. Fixes: 672365477ae8 ("x86/fpu: Update XFD state where required") Signed-off-by: Adamos Ttofari <attofari@amazon.de> Signed-off-by: Chang S. Bae <chang.seok.bae@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Reviewed-by: Thomas Gleixner <tglx@linutronix.de> Link: https://lore.kernel.org/r/20240322230439.456571-1-chang.seok.bae@intel.com Closes: https://lore.kernel.org/lkml/20230511152818.13839-1-attofari@amazon.de
\| * \|	Documentation/x86: Document that resctrl bandwidth control units are MiB	Tony Luck	2024-03-24	1	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The memory bandwidth software controller uses 2^20 units rather than 10^6. See mbm_bw_count() which computes bandwidth using the "SZ_1M" Linux define for 0x00100000. Update the documentation to use MiB when describing this feature. It's too late to fix the mount option "mba_MBps" as that is now an established user interface. Signed-off-by: Tony Luck <tony.luck@intel.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240322182016.196544-1-tony.luck@intel.com
\| * \|	x86/mpparse: Register APIC address only once	Thomas Gleixner	2024-03-23	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The APIC address is registered twice. First during the early detection and afterwards when actually scanning the table for APIC IDs. The APIC and topology core warn about the second attempt. Restrict it to the early detection call. Fixes: 81287ad65da5 ("x86/apic: Sanitize APIC address setup") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.297774848@linutronix.de
\| * \|	x86/topology: Handle the !APIC case gracefully	Thomas Gleixner	2024-03-23	1	-0/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	If there is no local APIC enumerated and registered then the topology bitmaps are empty. Therefore, topology_init_possible_cpus() will die with a division by zero exception. Prevent this by registering a fake APIC id to populate the topology bitmap. This also allows to use all topology query interfaces unconditionally. It does not affect the actual APIC code because either the local APIC address was not registered or no local APIC could be detected. Fixes: f1f758a80516 ("x86/topology: Add a mechanism to track topology via APIC IDs") Reported-by: Guenter Roeck <linux@roeck-us.net> Reported-by: Linus Torvalds <torvalds@linux-foundation.org> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.242709302@linutronix.de
\| * \|	x86/topology: Don't evaluate logical IDs during early boot	Thomas Gleixner	2024-03-23	1	-5/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The local APICs have not yet been enumerated so the logical ID evaluation from the topology bitmaps does not work and would return an error code. Skip the evaluation during the early boot CPUID evaluation and only apply it on the final run. Fixes: 380414be78bf ("x86/cpu/topology: Use topology logical mapping mechanism") Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.186943142@linutronix.de
\| * \|	x86/cpu: Ensure that CPU info updates are propagated on UP	Thomas Gleixner	2024-03-23	3	-37/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The boot sequence evaluates CPUID information twice: 1) During early boot 2) When finalizing the early setup right before mitigations are selected and alternatives are patched. In both cases the evaluation is stored in boot_cpu_data, but on UP the copying of boot_cpu_data to the per CPU info of the boot CPU happens between #1 and #2. So any update which happens in #2 is never propagated to the per CPU info instance. Consolidate the whole logic and copy boot_cpu_data right before applying alternatives as that's the point where boot_cpu_data is in it's final state and not supposed to change anymore. This also removes the voodoo mb() from smp_prepare_cpus_common() which had absolutely no purpose. Fixes: 71eb4893cfaf ("x86/percpu: Cure per CPU madness on UP") Reported-by: Guenter Roeck <linux@roeck-us.net> Signed-off-by: Thomas Gleixner <tglx@linutronix.de> Signed-off-by: Borislav Petkov (AMD) <bp@alien8.de> Tested-by: Guenter Roeck <linux@roeck-us.net> Link: https://lore.kernel.org/r/20240322185305.127642785@linutronix.de
\| * \|	kprobes/x86: Use copy_from_kernel_nofault() to read from unsafe address	Masami Hiramatsu (Google)	2024-03-22	1	-1/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Read from an unsafe address with copy_from_kernel_nofault() in arch_adjust_kprobe_addr() because this function is used before checking the address is in text or not. Syzcaller bot found a bug and reported the case if user specifies inaccessible data area, arch_adjust_kprobe_addr() will cause a kernel panic. [ mingo: Clarified the comment. ] Fixes: cc66bb914578 ("x86/ibt,kprobes: Cure sym+0 equals fentry woes") Reported-by: Qiang Zhang <zzqq0103.hey@gmail.com> Tested-by: Jinghao Jia <jinghao7@illinois.edu> Signed-off-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/171042945004.154897.2221804961882915806.stgit@devnote2
\| * \|	x86/pm: Work around false positive kmemleak report in msr_build_context()	Anton Altaparmakov	2024-03-22	1	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Since: 7ee18d677989 ("x86/power: Make restore_processor_context() sane") kmemleak reports this issue: unreferenced object 0xf68241e0 (size 32): comm "swapper/0", pid 1, jiffies 4294668610 (age 68.432s) hex dump (first 32 bytes): 00 cc cc cc 29 10 01 c0 00 00 00 00 00 00 00 00 ....)........... 00 42 82 f6 cc cc cc cc cc cc cc cc cc cc cc cc .B.............. backtrace: [<461c1d50>] __kmem_cache_alloc_node+0x106/0x260 [<ea65e13b>] __kmalloc+0x54/0x160 [<c3858cd2>] msr_build_context.constprop.0+0x35/0x100 [<46635aff>] pm_check_save_msr+0x63/0x80 [<6b6bb938>] do_one_initcall+0x41/0x1f0 [<3f3add60>] kernel_init_freeable+0x199/0x1e8 [<3b538fde>] kernel_init+0x1a/0x110 [<938ae2b2>] ret_from_fork+0x1c/0x28 Which is a false positive. Reproducer: - Run rsync of whole kernel tree (multiple times if needed). - start a kmemleak scan - Note this is just an example: a lot of our internal tests hit these. The root cause is similar to the fix in: b0b592cf0836 x86/pm: Fix false positive kmemleak report in msr_build_context() ie. the alignment within the packed struct saved_context which has everything unaligned as there is only "u16 gs;" at start of struct where in the past there were four u16 there thus aligning everything afterwards. The issue is with the fact that Kmemleak only searches for pointers that are aligned (see how pointers are scanned in kmemleak.c) so when the struct members are not aligned it doesn't see them. Testing: We run a lot of tests with our CI, and after applying this fix we do not see any kmemleak issues any more whilst without it we see hundreds of the above report. From a single, simple test run consisting of 416 individual test cases on kernel 5.10 x86 with kmemleak enabled we got 20 failures due to this, which is quite a lot. With this fix applied we get zero kmemleak related failures. Fixes: 7ee18d677989 ("x86/power: Make restore_processor_context() sane") Signed-off-by: Anton Altaparmakov <anton@tuxera.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Acked-by: "Rafael J. Wysocki" <rafael@kernel.org> Cc: stable@vger.kernel.org Cc: Linus Torvalds <torvalds@linux-foundation.org> Link: https://lore.kernel.org/r/20240314142656.17699-1-anton@tuxera.com
\| * \|	x86/kexec: Do not update E820 kexec table for setup_data	Dave Young	2024-03-22	1	-16/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	crashkernel reservation failed on a Thinkpad t440s laptop recently. Actually the memblock reservation succeeded, but later insert_resource() failed. Test steps: kexec load -> /* make sure add crashkernel param eg. crashkernel=160M */ kexec reboot -> dmesg\|grep "crashkernel reserved"; crashkernel memory range like below reserved successfully: 0x00000000d0000000 - 0x00000000da000000 But no such "Crash kernel" region in /proc/iomem The background story: Currently the E820 code reserves setup_data regions for both the current kernel and the kexec kernel, and it inserts them into the resources list. Before the kexec kernel reboots nobody passes the old setup_data, and kexec only passes fresh SETUP_EFI/SETUP_IMA/SETUP_RNG_SEED if needed. Thus the old setup data memory is not used at all. Due to old kernel updates the kexec e820 table as well so kexec kernel sees them as E820_TYPE_RESERVED_KERN regions, and later the old setup_data regions are inserted into resources list in the kexec kernel by e820__reserve_resources(). Note, due to no setup_data is passed in for those old regions they are not early reserved (by function early_reserve_memory), and the crashkernel memblock reservation will just treat them as usable memory and it could reserve the crashkernel region which overlaps with the old setup_data regions. And just like the bug I noticed here, kdump insert_resource failed because e820__reserve_resources has added the overlapped chunks in /proc/iomem already. Finally, looking at the code, the old setup_data regions are not used at all as no setup_data is passed in by the kexec boot loader. Although something like SETUP_PCI etc could be needed, kexec should pass the info as new setup_data so that kexec kernel can take care of them. This should be taken care of in other separate patches if needed. Thus drop the useless buggy code here. Signed-off-by: Dave Young <dyoung@redhat.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Jiri Bohac <jbohac@suse.cz> Cc: Eric DeVolder <eric.devolder@oracle.com> Cc: Baoquan He <bhe@redhat.com> Cc: Ard Biesheuvel <ardb@kernel.org> Cc: Kees Cook <keescook@chromium.org> Cc: "Kirill A. Shutemov" <kirill.shutemov@linux.intel.com> Link: https://lore.kernel.org/r/Zf0T3HCG-790K-pZ@darkstar.users.ipa.redhat.com
\| * \|	x86/config: Fix warning for 'make ARCH=x86_64 tinyconfig'	Masahiro Yamada	2024-03-21	1	-0/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Kconfig emits a warning for the following command: $ make ARCH=x86_64 tinyconfig ... .config:1380:warning: override: UNWINDER_GUESS changes choice state When X86_64=y, the unwinder is exclusively selected from the following three options: - UNWINDER_ORC - UNWINDER_FRAME_POINTER - UNWINDER_GUESS However, arch/x86/configs/tiny.config only specifies the values of the last two. UNWINDER_ORC must be explicitly disabled. Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Signed-off-by: Ingo Molnar <mingo@kernel.org> Link: https://lore.kernel.org/r/20240320154313.612342-1-masahiroy@kernel.org
* \| \|	Merge tag 'sched-urgent-2024-03-24' of ↵	Linus Torvalds	2024-03-24	1	-0/+3
\|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip Pull scheduler doc clarification from Thomas Gleixner: "A single update for the documentation of the base_slice_ns tunable to clarify that any value which is less than the tick slice has no effect because the scheduler tick is not guaranteed to happen within the set time slice" * tag 'sched-urgent-2024-03-24' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: sched/doc: Update documentation for base_slice_ns and CONFIG_HZ relation
\| * \| \|	sched/doc: Update documentation for base_slice_ns and CONFIG_HZ relation	Mukesh Kumar Chaurasiya	2024-03-21	1	-0/+3
\| \|/ / \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The tunable base_slice_ns is dependent on CONFIG_HZ (i.e. TICK_NSEC) for any significant performance improvement. The reason being the scheduler tick is not frequent enough to force preemption when base_slice expires in case of: base_slice_ns < TICK_NSEC The below data is of stress-ng: Number of CPU: 1 Stressor threads: 4 Time: 30sec On CONFIG_HZ=1000 \| base_slice \| avg-run (msec) \| context-switches \| \| ---------- \| -------------- \| ---------------- \| \| 3ms \| 2.914 \| 10342 \| \| 6ms \| 4.857 \| 6196 \| \| 9ms \| 6.754 \| 4482 \| \| 12ms \| 7.872 \| 3802 \| \| 22ms \| 11.294 \| 2710 \| \| 32ms \| 13.425 \| 2284 \| On CONFIG_HZ=100 \| base_slice \| avg-run (msec) \| context-switches \| \| ---------- \| -------------- \| ---------------- \| \| 3ms \| 9.144 \| 3337 \| \| 6ms \| 9.113 \| 3301 \| \| 9ms \| 8.991 \| 3315 \| \| 12ms \| 12.935 \| 2328 \| \| 22ms \| 16.031 \| 1915 \| \| 32ms \| 18.608 \| 1622 \| base_slice: the value of base_slice in ms avg-run (msec): average time of the stressor threads got on cpu before it got preempted context-switches: number of context switches for the stress-ng process Signed-off-by: Mukesh Kumar Chaurasiya <mchauras@linux.ibm.com> Signed-off-by: Ingo Molnar <mingo@kernel.org> Cc: Randy Dunlap <rdunlap@infradead.org> Link: https://lore.kernel.org/r/20240320173815.927637-2-mchauras@linux.ibm.com
* \| \|	Merge tag 'dma-mapping-6.9-2024-03-24' of ↵	Linus Torvalds	2024-03-24	2	-12/+42
\|\ \ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.infradead.org/users/hch/dma-mapping Pull dma-mapping fixes from Christoph Hellwig: "This has a set of swiotlb alignment fixes for sometimes very long standing bugs from Will. We've been discussion them for a while and they should be solid now" * tag 'dma-mapping-6.9-2024-03-24' of git://git.infradead.org/users/hch/dma-mapping: swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE iommu/dma: Force swiotlb_max_mapping_size on an untrusted device swiotlb: Fix alignment checks when both allocation and DMA masks are present swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc() swiotlb: Enforce page alignment in swiotlb_alloc() swiotlb: Fix double-allocation of slots due to broken alignment handling
\| * \| \|	swiotlb: Reinstate page-alignment for mappings >= PAGE_SIZE	Will Deacon	2024-03-13	1	-7/+11
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	For swiotlb allocations >= PAGE_SIZE, the slab search historically adjusted the stride to avoid checking unaligned slots. This had the side-effect of aligning large mapping requests to PAGE_SIZE, but that was broken by 0eee5ae10256 ("swiotlb: fix slot alignment checks"). Since this alignment could be relied upon drivers, reinstate PAGE_SIZE alignment for swiotlb mappings >= PAGE_SIZE. Reported-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Robin Murphy <robin.murphy@arm.com> Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
\| * \| \|	iommu/dma: Force swiotlb_max_mapping_size on an untrusted device	Nicolin Chen	2024-03-13	1	-0/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The swiotlb does not support a mapping size > swiotlb_max_mapping_size(). On the other hand, with a 64KB PAGE_SIZE configuration, it's observed that an NVME device can map a size between 300KB~512KB, which certainly failed the swiotlb mappings, though the default pool of swiotlb has many slots: systemd[1]: Started Journal Service. => nvme 0000:00:01.0: swiotlb buffer is full (sz: 327680 bytes), total 32768 (slots), used 32 (slots) note: journal-offline[392] exited with irqs disabled note: journal-offline[392] exited with preempt_count 1 Call trace: [ 3.099918] swiotlb_tbl_map_single+0x214/0x240 [ 3.099921] iommu_dma_map_page+0x218/0x328 [ 3.099928] dma_map_page_attrs+0x2e8/0x3a0 [ 3.101985] nvme_prep_rq.part.0+0x408/0x878 [nvme] [ 3.102308] nvme_queue_rqs+0xc0/0x300 [nvme] [ 3.102313] blk_mq_flush_plug_list.part.0+0x57c/0x600 [ 3.102321] blk_add_rq_to_plug+0x180/0x2a0 [ 3.102323] blk_mq_submit_bio+0x4c8/0x6b8 [ 3.103463] __submit_bio+0x44/0x220 [ 3.103468] submit_bio_noacct_nocheck+0x2b8/0x360 [ 3.103470] submit_bio_noacct+0x180/0x6c8 [ 3.103471] submit_bio+0x34/0x130 [ 3.103473] ext4_bio_write_folio+0x5a4/0x8c8 [ 3.104766] mpage_submit_folio+0xa0/0x100 [ 3.104769] mpage_map_and_submit_buffers+0x1a4/0x400 [ 3.104771] ext4_do_writepages+0x6a0/0xd78 [ 3.105615] ext4_writepages+0x80/0x118 [ 3.105616] do_writepages+0x90/0x1e8 [ 3.105619] filemap_fdatawrite_wbc+0x94/0xe0 [ 3.105622] __filemap_fdatawrite_range+0x68/0xb8 [ 3.106656] file_write_and_wait_range+0x84/0x120 [ 3.106658] ext4_sync_file+0x7c/0x4c0 [ 3.106660] vfs_fsync_range+0x3c/0xa8 [ 3.106663] do_fsync+0x44/0xc0 Since untrusted devices might go down the swiotlb pathway with dma-iommu, these devices should not map a size larger than swiotlb_max_mapping_size. To fix this bug, add iommu_dma_max_mapping_size() for untrusted devices to take into account swiotlb_max_mapping_size() v.s. iova_rcache_range() from the iommu_dma_opt_mapping_size(). Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers") Link: https://lore.kernel.org/r/ee51a3a5c32cf885b18f6416171802669f4a718a.1707851466.git.nicolinc@nvidia.com Signed-off-by: Nicolin Chen <nicolinc@nvidia.com> [will: Drop redundant is_swiotlb_active(dev) check] Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Acked-by: Robin Murphy <robin.murphy@arm.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
\| * \| \|	swiotlb: Fix alignment checks when both allocation and DMA masks are present	Will Deacon	2024-03-13	1	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Nicolin reports that swiotlb buffer allocations fail for an NVME device behind an IOMMU using 64KiB pages. This is because we end up with a minimum allocation alignment of 64KiB (for the IOMMU to map the buffer safely) but a minimum DMA alignment mask corresponding to a 4KiB NVME page (i.e. preserving the 4KiB page offset from the original allocation). If the original address is not 4KiB-aligned, the allocation will fail because swiotlb_search_pool_area() erroneously compares these unmasked bits with the 64KiB-aligned candidate allocation. Tweak swiotlb_search_pool_area() so that the DMA alignment mask is reduced based on the required alignment of the allocation. Fixes: 82612d66d51d ("iommu: Allow the dma-iommu api to use bounce buffers") Link: https://lore.kernel.org/r/cover.1707851466.git.nicolinc@nvidia.com Reported-by: Nicolin Chen <nicolinc@nvidia.com> Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
\| * \| \|	swiotlb: Honour dma_alloc_coherent() alignment in swiotlb_alloc()	Will Deacon	2024-03-13	1	-1/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	core-api/dma-api-howto.rst states the following properties of dma_alloc_coherent(): \| The CPU virtual address and the DMA address are both guaranteed to \| be aligned to the smallest PAGE_SIZE order which is greater than or \| equal to the requested size. However, swiotlb_alloc() passes zero for the 'alloc_align_mask' parameter of swiotlb_find_slots() and so this property is not upheld. Instead, allocations larger than a page are aligned to PAGE_SIZE, Calculate the mask corresponding to the page order suitable for holding the allocation and pass that to swiotlb_find_slots(). Fixes: e81e99bacc9f ("swiotlb: Support aligned swiotlb buffers") Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Christoph Hellwig <hch@lst.de>
\| * \| \|	swiotlb: Enforce page alignment in swiotlb_alloc()	Will Deacon	2024-03-13	1	-0/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When allocating pages from a restricted DMA pool in swiotlb_alloc(), the buffer address is blindly converted to a 'struct page *' that is returned to the caller. In the unlikely event of an allocation bug, page-unaligned addresses are not detected and slots can silently be double-allocated. Add a simple check of the buffer alignment in swiotlb_alloc() to make debugging a little easier if something has gone wonky. Signed-off-by: Will Deacon <will@kernel.org> Reviewed-by: Michael Kelley <mhklinux@outlook.com> Reviewed-by: Petr Tesarik <petr.tesarik1@huawei-partners.com> Tested-by: Nicolin Chen <nicolinc@nvidia.com> Tested-by: Michael Kelley <mhklinux@outlook.com> Signed-off-by: Christoph Hellwig <hch@lst.de>