summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* selftests: kvm: s390: Add debug print functionsChristoph Schlameuss2024-08-191-0/+69
| | | | | | | | | | | | | | | | | | | | | | | | | Add functions to simply print some basic state information in selftests. The output can be enabled by setting: #define TH_LOG_ENABLED 1 #define DEBUG 1 * print_psw: current SIE state description and VM run state * print_hex_bytes: print memory with some counting markers * print_hex: PRINT_HEX with 512 bytes * print_run: use print_psw and print_hex to print contents of VM run state and SIE state description * print_regs: print content of general and control registers All prints use pr_debug for the output and can be configured using DEBUG. Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Acked-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20240807154512.316936-6-schlameuss@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20240807154512.316936-6-schlameuss@linux.ibm.com>
* selftests: kvm: s390: Add test fixture and simple VM setup testsChristoph Schlameuss2024-08-191-0/+131
| | | | | | | | | | | | | | | | Add a uc_kvm fixture to create and destroy a ucontrol VM. * uc_sie_assertions asserts basic settings in the SIE as setup by the kernel. * uc_attr_mem_limit asserts the memory limit is max value and cannot be set (not supported). * uc_no_dirty_log asserts dirty log is not supported. Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Link: https://lore.kernel.org/r/20240807154512.316936-5-schlameuss@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20240807154512.316936-5-schlameuss@linux.ibm.com>
* selftests: kvm: s390: Add s390x ucontrol test suite with hpage testChristoph Schlameuss2024-08-194-0/+80
| | | | | | | | | | | | | | | | | | | | | Add test suite to validate the s390x architecture specific ucontrol KVM interface. Make use of the selftest test harness. * uc_cap_hpage testcase verifies that a ucontrol VM cannot be run with hugepages. To allow testing of the ucontrol interface the kernel needs a non-default config containing CONFIG_KVM_S390_UCONTROL. This config needs to be set to built-in (y) as this cannot be built as module. Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20240807154512.316936-4-schlameuss@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20240807154512.316936-4-schlameuss@linux.ibm.com>
* selftests: kvm: s390: Add kvm_s390_sie_block definition for userspace testsChristoph Schlameuss2024-08-192-2/+242
| | | | | | | | | | | | | | | | | | | | | Subsequent tests do require direct manipulation of the SIE control block. This commit introduces the SIE control block definition for use within the selftests. There are already definitions of this within the kernel. This differs in two ways. * This is the first definition of this in userspace. * In the context of the selftests this does not require atomicity for the flags. With the userspace definition of the SIE block layout now being present we can reuse the values in other tests where applicable. Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20240807154512.316936-3-schlameuss@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20240807154512.316936-3-schlameuss@linux.ibm.com>
* selftests: kvm: s390: Define page sizes in shared headerChristoph Schlameuss2024-08-195-14/+17
| | | | | | | | | | | | | | | | Multiple test cases need page size and shift definitions. By moving the definitions to a single architecture specific header we limit the repetition. Make use of PAGE_SIZE, PAGE_SHIFT and PAGE_MASK defines in existing code. Signed-off-by: Christoph Schlameuss <schlameuss@linux.ibm.com> Reviewed-by: Claudio Imbrenda <imbrenda@linux.ibm.com> Reviewed-by: Janosch Frank <frankja@linux.ibm.com> Link: https://lore.kernel.org/r/20240807154512.316936-2-schlameuss@linux.ibm.com Signed-off-by: Janosch Frank <frankja@linux.ibm.com> Message-ID: <20240807154512.316936-2-schlameuss@linux.ibm.com>
* KVM: s390: Fix SORTL and DFLTCC instruction format error in __insn32_queryHariharan Mari2024-08-191-9/+18
| | | | | | | | | | | | | | | | The __insn32_query() function incorrectly uses the RRF instruction format for both the SORTL (RRE format) and DFLTCC (RRF format) instructions. To fix this issue, add separate query functions for SORTL and DFLTCC that use the appropriate instruction formats. Additionally pass the query operand as a pointer to the entire array of 32 elements to slightly optimize performance and readability. Fixes: d668139718a9 ("KVM: s390: provide query function for instructions returning 32 byte") Suggested-by: Heiko Carstens <hca@linux.ibm.com> Reviewed-by: Juergen Christ <jchrist@linux.ibm.com> Signed-off-by: Hariharan Mari <hari55@linux.ibm.com> Signed-off-by: Janosch Frank <frankja@linux.ibm.com>
* Linux 6.11-rc4v6.11-rc4Linus Torvalds2024-08-181-1/+1
|
* Merge tag 'driver-core-6.11-rc4' of ↵Linus Torvalds2024-08-182-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core Pull driver core fixes from Greg KH: "Here are two driver fixes for regressions from 6.11-rc1 due to the driver core change making a structure in a driver core callback const. These were missed by all testing EXCEPT for what Bart happened to be running, so I appreciate the fixes provided here for some odd/not-often-used driver subsystems that nothing else happened to catch. Both of these fixes have been in linux-next all week with no reported issues" * tag 'driver-core-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/driver-core: mips: sgi-ip22: Fix the build ARM: riscpc: ecard: Fix the build
| * mips: sgi-ip22: Fix the buildBart Van Assche2024-08-131-1/+1
| | | | | | | | | | | | | | | | | | Fix a recently introduced build failure. Fixes: d69d80484598 ("driver core: have match() callback in struct bus_type take a const *") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240805232026.65087-3-bvanassche@acm.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * ARM: riscpc: ecard: Fix the buildBart Van Assche2024-08-131-1/+1
| | | | | | | | | | | | | | | | | | | | Fix a recently introduced build failure. Cc: Russell King <rmk+kernel@armlinux.org.uk> Fixes: d69d80484598 ("driver core: have match() callback in struct bus_type take a const *") Signed-off-by: Bart Van Assche <bvanassche@acm.org> Link: https://lore.kernel.org/r/20240805232026.65087-2-bvanassche@acm.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge tag 'char-misc-6.11-rc4' of ↵Linus Torvalds2024-08-183-30/+37
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc Pull char / misc fixes from Greg KH: "Here are some small char/misc fixes for 6.11-rc4 to resolve reported problems. Included in here are: - fastrpc revert of a change that broke userspace - xillybus fixes for reported issues Half of these have been in linux-next this week with no reported problems, I don't know if the last bit of xillybus driver changes made it in, but they are 'obviously correct' so will be safe :)" * tag 'char-misc-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/char-misc: char: xillybus: Check USB endpoints when probing device char: xillybus: Refine workqueue handling Revert "misc: fastrpc: Restrict untrusted app to attach to privileged PD" char: xillybus: Don't destroy workqueue from work item running on it
| * | char: xillybus: Check USB endpoints when probing deviceEli Billauer2024-08-161-2/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure, as the driver probes the device, that all endpoints that the driver may attempt to access exist and are of the correct type. All XillyUSB devices must have a Bulk IN and Bulk OUT endpoint at address 1. This is verified in xillyusb_setup_base_eps(). On top of that, a XillyUSB device may have additional Bulk OUT endpoints. The information about these endpoints' addresses is deduced from a data structure (the IDT) that the driver fetches from the device while probing it. These endpoints are checked in setup_channels(). A XillyUSB device never has more than one IN endpoint, as all data towards the host is multiplexed in this single Bulk IN endpoint. This is why setup_channels() only checks OUT endpoints. Reported-by: syzbot+eac39cba052f2e750dbe@syzkaller.appspotmail.com Cc: stable <stable@kernel.org> Closes: https://lore.kernel.org/all/0000000000001d44a6061f7a54ee@google.com/T/ Fixes: a53d1202aef1 ("char: xillybus: Add driver for XillyUSB (Xillybus variant for USB)"). Signed-off-by: Eli Billauer <eli.billauer@gmail.com> Link: https://lore.kernel.org/r/20240816070200.50695-2-eli.billauer@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | char: xillybus: Refine workqueue handlingEli Billauer2024-08-161-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As the wakeup work item now runs on a separate workqueue, it needs to be flushed separately along with flushing the device's workqueue. Also, move the destroy_workqueue() call to the end of the exit method, so that deinitialization is done in the opposite order of initialization. Fixes: ccbde4b128ef ("char: xillybus: Don't destroy workqueue from work item running on it") Cc: stable <stable@kernel.org> Signed-off-by: Eli Billauer <eli.billauer@gmail.com> Link: https://lore.kernel.org/r/20240816070200.50695-1-eli.billauer@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | Revert "misc: fastrpc: Restrict untrusted app to attach to privileged PD"Griffin Kroah-Hartman2024-08-152-22/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit bab2f5e8fd5d2f759db26b78d9db57412888f187. Joel reported that this commit breaks userspace and stops sensors in SDM845 from working. Also breaks other qcom SoC devices running postmarketOS. Cc: stable <stable@kernel.org> Cc: Ekansh Gupta <quic_ekangupt@quicinc.com> Cc: Dmitry Baryshkov <dmitry.baryshkov@linaro.org> Reported-by: Joel Selvaraj <joelselvaraj.oss@gmail.com> Link: https://lore.kernel.org/r/9a9f5646-a554-4b65-8122-d212bb665c81@umsystem.edu Signed-off-by: Griffin Kroah-Hartman <griffin@kroah.com> Acked-by: Srinivas Kandagatla <srinivas.kandagatla@linaro.org> Fixes: bab2f5e8fd5d ("misc: fastrpc: Restrict untrusted app to attach to privileged PD") Link: https://lore.kernel.org/r/20240815094920.8242-1-griffin@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | char: xillybus: Don't destroy workqueue from work item running on itEli Billauer2024-08-131-5/+11
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Triggered by a kref decrement, destroy_workqueue() may be called from within a work item for destroying its own workqueue. This illegal situation is averted by adding a module-global workqueue for exclusive use of the offending work item. Other work items continue to be queued on per-device workqueues to ensure performance. Reported-by: syzbot+91dbdfecdd3287734d8e@syzkaller.appspotmail.com Cc: stable <stable@kernel.org> Closes: https://lore.kernel.org/lkml/0000000000000ab25a061e1dfe9f@google.com/ Signed-off-by: Eli Billauer <eli.billauer@gmail.com> Link: https://lore.kernel.org/r/20240801121126.60183-1-eli.billauer@gmail.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge tag 'tty-6.11-rc4' of ↵Linus Torvalds2024-08-184-39/+9
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty Pull tty / serial fixes from Greg KH: "Here are some small tty and serial driver fixes for 6.11-rc4 to resolve some reported problems. Included in here are: - conmakehash.c userspace build issues - fsl_lpuart driver fix - 8250_omap revert for reported regression - atmel_serial rts flag fix All of these have been in linux-next this week with no reported issues" * tag 'tty-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/tty: Revert "serial: 8250_omap: Set the console genpd always on if no console suspend" tty: atmel_serial: use the correct RTS flag. tty: vt: conmakehash: remove non-portable code printing comment header tty: serial: fsl_lpuart: mark last busy before uart_add_one_port
| * | Revert "serial: 8250_omap: Set the console genpd always on if no console ↵Griffin Kroah-Hartman2024-08-151-28/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | suspend" This reverts commit 68e6939ea9ec3d6579eadeab16060339cdeaf940. Kevin reported that this causes a crash during suspend on platforms that dont use PM domains. Link: https://lore.kernel.org/r/7ha5hgpchq.fsf@baylibre.com Cc: Thomas Richard <thomas.richard@bootlin.com> Fixes: 68e6939ea9ec ("serial: 8250_omap: Set the console genpd always on if no console suspend") Cc: stable <stable@kernel.org> Reported-by: Kevin Hilman <khilman@kernel.org> Signed-off-by: Griffin Kroah-Hartman <griffin@kroah.com> Link: https://lore.kernel.org/r/20240814111747.82371-1-griffin@kroah.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | tty: atmel_serial: use the correct RTS flag.Mathieu Othacehe2024-08-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In RS485 mode, the RTS pin is driven high by hardware when the transmitter is operating. This behaviour cannot be changed. This means that the driver should claim that it supports SER_RS485_RTS_ON_SEND and not SER_RS485_RTS_AFTER_SEND. Otherwise, when configuring the port with the SER_RS485_RTS_ON_SEND, one get the following warning: kern.warning kernel: atmel_usart_serial atmel_usart_serial.2.auto: ttyS1 (1): invalid RTS setting, using RTS_AFTER_SEND instead which is contradictory with what's really happening. Signed-off-by: Mathieu Othacehe <othacehe@gnu.org> Cc: stable <stable@kernel.org> Tested-by: Alexander Dahl <ada@thorsis.com> Fixes: af47c491e3c7 ("serial: atmel: Fill in rs485_supported") Link: https://lore.kernel.org/r/20240808060637.19886-1-othacehe@gnu.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | tty: vt: conmakehash: remove non-portable code printing comment headerMasahiro Yamada2024-08-131-10/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6e20753da6bc ("tty: vt: conmakehash: cope with abs_srctree no longer in env") included <linux/limits.h>, which invoked another (wrong) patch that tried to address a build error on macOS. According to the specification [1], the correct header to use PATH_MAX is <limits.h>. The minimal fix would be to replace <linux/limits.h> with <limits.h>. However, the following commits seem questionable to me: - 3bd85c6c97b2 ("tty: vt: conmakehash: Don't mention the full path of the input in output") - 6e20753da6bc ("tty: vt: conmakehash: cope with abs_srctree no longer in env") These commits made too many efforts to cope with a comment header in drivers/tty/vt/consolemap_deftbl.c: /* * Do not edit this file; it was automatically generated by * * conmakehash drivers/tty/vt/cp437.uni > [this file] * */ With this commit, the header part of the generate C file will be simplified as follows: /* * Automatically generated file; Do not edit. */ BTW, another series of excessive efforts for a comment header can be seen in the following: - 5ef6dc08cfde ("lib/build_OID_registry: don't mention the full path of the script in output") - 2fe29fe94563 ("lib/build_OID_registry: avoid non-destructive substitution for Perl < 5.13.2 compat") [1]: https://pubs.opengroup.org/onlinepubs/009695399/basedefs/limits.h.html Fixes: 6e20753da6bc ("tty: vt: conmakehash: cope with abs_srctree no longer in env") Cc: stable <stable@kernel.org> Reported-by: Daniel Gomez <da.gomez@samsung.com> Closes: https://lore.kernel.org/all/20240807-macos-build-support-v1-11-4cd1ded85694@samsung.com/ Signed-off-by: Masahiro Yamada <masahiroy@kernel.org> Link: https://lore.kernel.org/r/20240809160853.1269466-1-masahiroy@kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | tty: serial: fsl_lpuart: mark last busy before uart_add_one_portPeng Fan2024-08-131-0/+1
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With "earlycon initcall_debug=1 loglevel=8" in bootargs, kernel sometimes boot hang. It is because normal console still is not ready, but runtime suspend is called, so early console putchar will hang in waiting TRDE set in UARTSTAT. The lpuart driver has auto suspend delay set to 3000ms, but during uart_add_one_port, a child device serial ctrl will added and probed with its pm runtime enabled(see serial_ctrl.c). The runtime suspend call path is: device_add |-> bus_probe_device |->device_initial_probe |->__device_attach |-> pm_runtime_get_sync(dev->parent); |-> pm_request_idle(dev); |-> pm_runtime_put(dev->parent); So in the end, before normal console ready, the lpuart get runtime suspended. And earlycon putchar will hang. To address the issue, mark last busy just after pm_runtime_enable, three seconds is long enough to switch from bootconsole to normal console. Fixes: 43543e6f539b ("tty: serial: fsl_lpuart: Add runtime pm support") Cc: stable <stable@kernel.org> Signed-off-by: Peng Fan <peng.fan@nxp.com> Link: https://lore.kernel.org/r/20240808140325.580105-1-peng.fan@oss.nxp.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
* | Merge tag 'usb-6.11-rc4' of ↵Linus Torvalds2024-08-188-10/+16
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb Pull USB / Thunderbolt driver fixes from Greg KH: "Here are some small USB and Thunderbolt driver fixes for 6.11-rc4 to resolve some reported issues. Included in here are: - thunderbolt driver fixes for reported problems - typec driver fixes - xhci fixes - new device id for ljca usb driver All of these have been in linux-next this week with no reported issues" * tag 'usb-6.11-rc4' of git://git.kernel.org/pub/scm/linux/kernel/git/gregkh/usb: xhci: Fix Panther point NULL pointer deref at full-speed re-enumeration usb: misc: ljca: Add Lunar Lake ljca GPIO HID to ljca_gpio_hids[] Revert "usb: typec: tcpm: clear pd_event queue in PORT_RESET" usb: typec: ucsi: Fix the return value of ucsi_run_command() usb: xhci: fix duplicate stall handling in handle_tx_event() usb: xhci: Check for xhci->interrupters being allocated in xhci_mem_clearup() thunderbolt: Mark XDomain as unplugged when router is removed thunderbolt: Fix memory leaks in {port|retimer}_sb_regs_write()
| * | xhci: Fix Panther point NULL pointer deref at full-speed re-enumerationMathias Nyman2024-08-151-3/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | re-enumerating full-speed devices after a failed address device command can trigger a NULL pointer dereference. Full-speed devices may need to reconfigure the endpoint 0 Max Packet Size value during enumeration. Usb core calls usb_ep0_reinit() in this case, which ends up calling xhci_configure_endpoint(). On Panther point xHC the xhci_configure_endpoint() function will additionally check and reserve bandwidth in software. Other hosts do this in hardware If xHC address device command fails then a new xhci_virt_device structure is allocated as part of re-enabling the slot, but the bandwidth table pointers are not set up properly here. This triggers the NULL pointer dereference the next time usb_ep0_reinit() is called and xhci_configure_endpoint() tries to check and reserve bandwidth [46710.713538] usb 3-1: new full-speed USB device number 5 using xhci_hcd [46710.713699] usb 3-1: Device not responding to setup address. [46710.917684] usb 3-1: Device not responding to setup address. [46711.125536] usb 3-1: device not accepting address 5, error -71 [46711.125594] BUG: kernel NULL pointer dereference, address: 0000000000000008 [46711.125600] #PF: supervisor read access in kernel mode [46711.125603] #PF: error_code(0x0000) - not-present page [46711.125606] PGD 0 P4D 0 [46711.125610] Oops: Oops: 0000 [#1] PREEMPT SMP PTI [46711.125615] CPU: 1 PID: 25760 Comm: kworker/1:2 Not tainted 6.10.3_2 #1 [46711.125620] Hardware name: Gigabyte Technology Co., Ltd. [46711.125623] Workqueue: usb_hub_wq hub_event [usbcore] [46711.125668] RIP: 0010:xhci_reserve_bandwidth (drivers/usb/host/xhci.c Fix this by making sure bandwidth table pointers are set up correctly after a failed address device command, and additionally by avoiding checking for bandwidth in cases like this where no actual endpoints are added or removed, i.e. only context for default control endpoint 0 is evaluated. Reported-by: Karel Balej <balejk@matfyz.cz> Closes: https://lore.kernel.org/linux-usb/D3CKQQAETH47.1MUO22RTCH2O3@matfyz.cz/ Cc: stable@vger.kernel.org Fixes: 651aaf36a7d7 ("usb: xhci: Handle USB transaction error on address command") Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20240815141117.2702314-2-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | usb: misc: ljca: Add Lunar Lake ljca GPIO HID to ljca_gpio_hids[]Hans de Goede2024-08-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add LJCA GPIO support for the Lunar Lake platform. New HID taken from out of tree ivsc-driver git repo. Link: https://github.com/intel/ivsc-driver/commit/47e7c4a446c8ea8c741ff5a32fa7b19f9e6fd47e Cc: stable <stable@kernel.org> Signed-off-by: Hans de Goede <hdegoede@redhat.com> Link: https://lore.kernel.org/r/20240812095038.555837-1-hdegoede@redhat.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | Revert "usb: typec: tcpm: clear pd_event queue in PORT_RESET"Xu Yang2024-08-131-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit bf20c69cf3cf9c6445c4925dd9a8a6ca1b78bfdf. During tcpm_init() stage, if the VBUS is still present after tcpm_reset_port(), then we assume that VBUS will off and goto safe0v after a specific discharge time. Following a TCPM_VBUS_EVENT event if VBUS reach to off state. TCPM_VBUS_EVENT event may be set during PORT_RESET handling stage. If pd_events reset to 0 after TCPM_VBUS_EVENT set, we will lost this VBUS event. Then the port state machine may stuck at one state. Before: [ 2.570172] pending state change PORT_RESET -> PORT_RESET_WAIT_OFF @ 100 ms [rev1 NONE_AMS] [ 2.570179] state change PORT_RESET -> PORT_RESET_WAIT_OFF [delayed 100 ms] [ 2.570182] pending state change PORT_RESET_WAIT_OFF -> SNK_UNATTACHED @ 920 ms [rev1 NONE_AMS] [ 3.490213] state change PORT_RESET_WAIT_OFF -> SNK_UNATTACHED [delayed 920 ms] [ 3.490220] Start toggling [ 3.546050] CC1: 0 -> 0, CC2: 0 -> 2 [state TOGGLING, polarity 0, connected] [ 3.546057] state change TOGGLING -> SRC_ATTACH_WAIT [rev1 NONE_AMS] After revert this patch, we can see VBUS off event and the port will goto expected state. [ 2.441992] pending state change PORT_RESET -> PORT_RESET_WAIT_OFF @ 100 ms [rev1 NONE_AMS] [ 2.441999] state change PORT_RESET -> PORT_RESET_WAIT_OFF [delayed 100 ms] [ 2.442002] pending state change PORT_RESET_WAIT_OFF -> SNK_UNATTACHED @ 920 ms [rev1 NONE_AMS] [ 2.442122] VBUS off [ 2.442125] state change PORT_RESET_WAIT_OFF -> SNK_UNATTACHED [rev1 NONE_AMS] [ 2.442127] VBUS VSAFE0V [ 2.442351] CC1: 0 -> 0, CC2: 0 -> 0 [state SNK_UNATTACHED, polarity 0, disconnected] [ 2.442357] Start toggling [ 2.491850] CC1: 0 -> 0, CC2: 0 -> 2 [state TOGGLING, polarity 0, connected] [ 2.491858] state change TOGGLING -> SRC_ATTACH_WAIT [rev1 NONE_AMS] [ 2.491863] pending state change SRC_ATTACH_WAIT -> SNK_TRY @ 200 ms [rev1 NONE_AMS] [ 2.691905] state change SRC_ATTACH_WAIT -> SNK_TRY [delayed 200 ms] Fixes: bf20c69cf3cf ("usb: typec: tcpm: clear pd_event queue in PORT_RESET") Cc: stable@vger.kernel.org Signed-off-by: Xu Yang <xu.yang_2@nxp.com> Acked-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://lore.kernel.org/r/20240809112901.535072-1-xu.yang_2@nxp.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | usb: typec: ucsi: Fix the return value of ucsi_run_command()Heikki Krogerus2024-08-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The command execution routines need to return the amount of data that was transferred when succesful. This fixes an issue where the alternate modes and the power delivery capabilities are not getting registered. Fixes: 5e9c1662a89b ("usb: typec: ucsi: rework command execution functions") Signed-off-by: Heikki Krogerus <heikki.krogerus@linux.intel.com> Link: https://lore.kernel.org/r/20240809150343.286942-1-heikki.krogerus@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | usb: xhci: fix duplicate stall handling in handle_tx_event()Niklas Neronin2024-08-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stall handling is managed in the 'process_*' functions, which are called right before the 'goto' stall handling code snippet. Thus, there should be a return after the 'process_*' functions. Otherwise, the stall code may run twice. Fixes: 1b349f214ac7 ("usb: xhci: add 'goto' for halted endpoint check in handle_tx_event()") Reported-by: Michal Pecio <michal.pecio@gmail.com> Signed-off-by: Niklas Neronin <niklas.neronin@linux.intel.com> Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20240809124408.505786-3-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | usb: xhci: Check for xhci->interrupters being allocated in xhci_mem_clearup()Marc Zyngier2024-08-131-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If xhci_mem_init() fails, it calls into xhci_mem_cleanup() to mop up the damage. If it fails early enough, before xhci->interrupters is allocated but after xhci->max_interrupters has been set, which happens in most (all?) cases, things get uglier, as xhci_mem_cleanup() unconditionally derefences xhci->interrupters. With prejudice. Gate the interrupt freeing loop with a check on xhci->interrupters being non-NULL. Found while debugging a DMA allocation issue that led the XHCI driver on this exact path. Fixes: c99b38c41234 ("xhci: add support to allocate several interrupters") Cc: Mathias Nyman <mathias.nyman@linux.intel.com> Cc: Wesley Cheng <quic_wcheng@quicinc.com> Cc: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: Marc Zyngier <maz@kernel.org> Cc: stable@vger.kernel.org # 6.8+ Signed-off-by: Mathias Nyman <mathias.nyman@linux.intel.com> Link: https://lore.kernel.org/r/20240809124408.505786-2-mathias.nyman@linux.intel.com Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org>
| * | Merge tag 'thunderbolt-for-v6.11-rc3' of ↵Greg Kroah-Hartman2024-08-132-4/+7
| |\ \ | | |/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt into usb-linus thunderbolt: Fixes for v6.11-rc3 This includes following USB4/Thunderbolt fixes for v6.11-rc3: - Fix memory leak in debugfs sideband register access - Fix hang when host router NVM is upgraded and there is another host connected. Both have been in linux-next with no reported issues. * tag 'thunderbolt-for-v6.11-rc3' of ssh://gitolite.kernel.org/pub/scm/linux/kernel/git/westeri/thunderbolt: thunderbolt: Mark XDomain as unplugged when router is removed thunderbolt: Fix memory leaks in {port|retimer}_sb_regs_write()
| | * thunderbolt: Mark XDomain as unplugged when router is removedMika Westerberg2024-08-061-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I noticed that when we do discrete host router NVM upgrade and it gets hot-removed from the PCIe side as a result of NVM firmware authentication, if there is another host connected with enabled paths we hang in tearing them down. This is due to fact that the Thunderbolt networking driver also tries to cleanup the paths and ends up blocking in tb_disconnect_xdomain_paths() waiting for the domain lock. However, at this point we already cleaned the paths in tb_stop() so there is really no need for tb_disconnect_xdomain_paths() to do that anymore. Furthermore it already checks if the XDomain is unplugged and bails out early so take advantage of that and mark the XDomain as unplugged when we remove the parent router. Cc: stable@vger.kernel.org Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
| | * thunderbolt: Fix memory leaks in {port|retimer}_sb_regs_write()Aapo Vienamo2024-08-021-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add missing free_page() call for the memory allocated by validate_and_copy_from_user(). Fixes: 6d241fa00159 ("thunderbolt: Add sideband register access to debugfs") Signed-off-by: Aapo Vienamo <aapo.vienamo@linux.intel.com> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Mika Westerberg <mika.westerberg@linux.intel.com>
* | | Merge tag 'for-6.11-rc3-tag' of ↵Linus Torvalds2024-08-185-8/+86
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux Pull more btrfs fixes from David Sterba: "A more fixes. We got reports that shrinker added in 6.10 still causes latency spikes and the fixes don't handle all corner cases. Due to summer holidays we're taking a shortcut to disable it for release builds and will fix it in the near future. - only enable extent map shrinker for DEBUG builds, temporary quick fix to avoid latency spikes for regular builds - update target inode's ctime on unlink, mandated by POSIX - properly take lock to read/update block group's zoned variables - add counted_by() annotations" * tag 'for-6.11-rc3-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/kdave/linux: btrfs: only enable extent map shrinker for DEBUG builds btrfs: zoned: properly take lock to read/update block group's zoned variables btrfs: tree-checker: add dev extent item checks btrfs: update target inode's ctime on unlink btrfs: send: annotate struct name_cache_entry with __counted_by()
| * | | btrfs: only enable extent map shrinker for DEBUG buildsQu Wenruo2024-08-161-1/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although there are several patches improving the extent map shrinker, there are still reports of too frequent shrinker behavior, taking too much CPU for the kswapd process. So let's only enable extent shrinker for now, until we got more comprehensive understanding and a better solution. Link: https://lore.kernel.org/linux-btrfs/3df4acd616a07ef4d2dc6bad668701504b412ffc.camel@intelfx.name/ Link: https://lore.kernel.org/linux-btrfs/c30fd6b3-ca7a-4759-8a53-d42878bf84f7@gmail.com/ Fixes: 956a17d9d050 ("btrfs: add a shrinker for extent maps") CC: stable@vger.kernel.org # 6.10+ Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: zoned: properly take lock to read/update block group's zoned variablesNaohiro Aota2024-08-151-6/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | __btrfs_add_free_space_zoned() references and modifies bg's alloc_offset, ro, and zone_unusable, but without taking the lock. It is mostly safe because they monotonically increase (at least for now) and this function is mostly called by a transaction commit, which is serialized by itself. Still, taking the lock is a safer and correct option and I'm going to add a change to reset zone_unusable while a block group is still alive. So, add locking around the operations. Fixes: 169e0da91a21 ("btrfs: zoned: track unusable bytes for zones") CC: stable@vger.kernel.org # 5.15+ Reviewed-by: Johannes Thumshirn <johannes.thumshirn@wdc.com> Signed-off-by: Naohiro Aota <naohiro.aota@wdc.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: tree-checker: add dev extent item checksQu Wenruo2024-08-151-0/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [REPORT] There is a corruption report that btrfs refused to mount a fs that has overlapping dev extents: BTRFS error (device sdc): dev extent devid 4 physical offset 14263979671552 overlap with previous dev extent end 14263980982272 BTRFS error (device sdc): failed to verify dev extents against chunks: -117 BTRFS error (device sdc): open_ctree failed [CAUSE] The direct cause is very obvious, there is a bad dev extent item with incorrect length. With btrfs check reporting two overlapping extents, the second one shows some clue on the cause: ERROR: dev extent devid 4 offset 14263979671552 len 6488064 overlap with previous dev extent end 14263980982272 ERROR: dev extent devid 13 offset 2257707008000 len 6488064 overlap with previous dev extent end 2257707270144 ERROR: errors found in extent allocation tree or chunk allocation The second one looks like a bitflip happened during new chunk allocation: hex(2257707008000) = 0x20da9d30000 hex(2257707270144) = 0x20da9d70000 diff = 0x00000040000 So it looks like a bitflip happened during new dev extent allocation, resulting the second overlap. Currently we only do the dev-extent verification at mount time, but if the corruption is caused by memory bitflip, we really want to catch it before writing the corruption to the storage. Furthermore the dev extent items has the following key definition: (<device id> DEV_EXTENT <physical offset>) Thus we can not just rely on the generic key order check to make sure there is no overlapping. [ENHANCEMENT] Introduce dedicated dev extent checks, including: - Fixed member checks * chunk_tree should always be BTRFS_CHUNK_TREE_OBJECTID (3) * chunk_objectid should always be BTRFS_FIRST_CHUNK_CHUNK_TREE_OBJECTID (256) - Alignment checks * chunk_offset should be aligned to sectorsize * length should be aligned to sectorsize * key.offset should be aligned to sectorsize - Overlap checks If the previous key is also a dev-extent item, with the same device id, make sure we do not overlap with the previous dev extent. Reported: Stefan N <stefannnau@gmail.com> Link: https://lore.kernel.org/linux-btrfs/CA+W5K0rSO3koYTo=nzxxTm1-Pdu1HYgVxEpgJ=aGc7d=E8mGEg@mail.gmail.com/ CC: stable@vger.kernel.org # 5.10+ Reviewed-by: Anand Jain <anand.jain@oracle.com> Signed-off-by: Qu Wenruo <wqu@suse.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: update target inode's ctime on unlinkJeff Layton2024-08-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unlink changes the link count on the target inode. POSIX mandates that the ctime must also change when this occurs. According to https://pubs.opengroup.org/onlinepubs/9699919799/functions/unlink.html: "Upon successful completion, unlink() shall mark for update the last data modification and last file status change timestamps of the parent directory. Also, if the file's link count is not 0, the last file status change timestamp of the file shall be marked for update." Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: David Sterba <dsterba@suse.com> [ add link to the opengroup docs ] Signed-off-by: David Sterba <dsterba@suse.com>
| * | | btrfs: send: annotate struct name_cache_entry with __counted_by()Thorsten Blum2024-08-151-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the __counted_by compiler attribute to the flexible array member name to improve access bounds-checking via CONFIG_UBSAN_BOUNDS and CONFIG_FORTIFY_SOURCE. Signed-off-by: Thorsten Blum <thorsten.blum@toblux.com> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
* | | | fuse: Initialize beyond-EOF page contents before setting uptodateJann Horn2024-08-181-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fuse_notify_store(), unlike fuse_do_readpage(), does not enable page zeroing (because it can be used to change partial page contents). So fuse_notify_store() must be more careful to fully initialize page contents (including parts of the page that are beyond end-of-file) before marking the page uptodate. The current code can leave beyond-EOF page contents uninitialized, which makes these uninitialized page contents visible to userspace via mmap(). This is an information leak, but only affects systems which do not enable init-on-alloc (via CONFIG_INIT_ON_ALLOC_DEFAULT_ON=y or the corresponding kernel command line parameter). Link: https://bugs.chromium.org/p/project-zero/issues/detail?id=2574 Cc: stable@kernel.org Fixes: a1d75f258230 ("fuse: add store request") Signed-off-by: Jann Horn <jannh@google.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* | | | Merge tag 'mm-hotfixes-stable-2024-08-17-19-34' of ↵Linus Torvalds2024-08-1822-182/+201
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm Pull misc fixes from Andrew Morton: "16 hotfixes. All except one are for MM. 10 of these are cc:stable and the others pertain to post-6.10 issues. As usual with these merges, singletons and doubletons all over the place, no identifiable-by-me theme. Please see the lovingly curated changelogs to get the skinny" * tag 'mm-hotfixes-stable-2024-08-17-19-34' of git://git.kernel.org/pub/scm/linux/kernel/git/akpm/mm: mm/migrate: fix deadlock in migrate_pages_batch() on large folios alloc_tag: mark pages reserved during CMA activation as not tagged alloc_tag: introduce clear_page_tag_ref() helper function crash: fix riscv64 crash memory reserve dead loop selftests: memfd_secret: don't build memfd_secret test on unsupported arches mm: fix endless reclaim on machines with unaccepted memory selftests/mm: compaction_test: fix off by one in check_compaction() mm/numa: no task_numa_fault() call if PMD is changed mm/numa: no task_numa_fault() call if PTE is changed mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0 mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpu mm: don't account memmap per-node mm: add system wide stats items category mm: don't account memmap on failure mm/hugetlb: fix hugetlb vs. core-mm PT locking mseal: fix is_madv_discard()
| * | | | mm/migrate: fix deadlock in migrate_pages_batch() on large foliosGao Xiang2024-08-161-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, migrate_pages_batch() can lock multiple locked folios with an arbitrary order. Although folio_trylock() is used to avoid deadlock as commit 2ef7dbb26990 ("migrate_pages: try migrate in batch asynchronously firstly") mentioned, it seems try_split_folio() is still missing. It was found by compaction stress test when I explicitly enable EROFS compressed files to use large folios, which case I cannot reproduce with the same workload if large folio support is off (current mainline). Typically, filesystem reads (with locked file-backed folios) could use another bdev/meta inode to load some other I/Os (e.g. inode extent metadata or caching compressed data), so the locking order will be: file-backed folios (A) bdev/meta folios (B) The following calltrace shows the deadlock: Thread 1 takes (B) lock and tries to take folio (A) lock Thread 2 takes (A) lock and tries to take folio (B) lock [Thread 1] INFO: task stress:1824 blocked for more than 30 seconds. Tainted: G OE 6.10.0-rc7+ #6 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:stress state:D stack:0 pid:1824 tgid:1824 ppid:1822 flags:0x0000000c Call trace: __switch_to+0xec/0x138 __schedule+0x43c/0xcb0 schedule+0x54/0x198 io_schedule+0x44/0x70 folio_wait_bit_common+0x184/0x3f8 <-- folio mapping ffff00036d69cb18 index 996 (**) __folio_lock+0x24/0x38 migrate_pages_batch+0x77c/0xea0 // try_split_folio (mm/migrate.c:1486:2) // migrate_pages_batch (mm/migrate.c:1734:16) <--- LIST_HEAD(unmap_folios) has .. folio mapping 0xffff0000d184f1d8 index 1711; (*) folio mapping 0xffff0000d184f1d8 index 1712; .. migrate_pages+0xb28/0xe90 compact_zone+0xa08/0x10f0 compact_node+0x9c/0x180 sysctl_compaction_handler+0x8c/0x118 proc_sys_call_handler+0x1a8/0x280 proc_sys_write+0x1c/0x30 vfs_write+0x240/0x380 ksys_write+0x78/0x118 __arm64_sys_write+0x24/0x38 invoke_syscall+0x78/0x108 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x3c/0x148 el0t_64_sync_handler+0x100/0x130 el0t_64_sync+0x190/0x198 [Thread 2] INFO: task stress:1825 blocked for more than 30 seconds. Tainted: G OE 6.10.0-rc7+ #6 "echo 0 > /proc/sys/kernel/hung_task_timeout_secs" disables this message. task:stress state:D stack:0 pid:1825 tgid:1825 ppid:1822 flags:0x0000000c Call trace: __switch_to+0xec/0x138 __schedule+0x43c/0xcb0 schedule+0x54/0x198 io_schedule+0x44/0x70 folio_wait_bit_common+0x184/0x3f8 <-- folio = 0xfffffdffc6b503c0 (mapping == 0xffff0000d184f1d8 index == 1711) (*) __folio_lock+0x24/0x38 z_erofs_runqueue+0x384/0x9c0 [erofs] z_erofs_readahead+0x21c/0x350 [erofs] <-- folio mapping 0xffff00036d69cb18 range from [992, 1024] (**) read_pages+0x74/0x328 page_cache_ra_order+0x26c/0x348 ondemand_readahead+0x1c0/0x3a0 page_cache_sync_ra+0x9c/0xc0 filemap_get_pages+0xc4/0x708 filemap_read+0x104/0x3a8 generic_file_read_iter+0x4c/0x150 vfs_read+0x27c/0x330 ksys_pread64+0x84/0xd0 __arm64_sys_pread64+0x28/0x40 invoke_syscall+0x78/0x108 el0_svc_common.constprop.0+0x48/0xf0 do_el0_svc+0x24/0x38 el0_svc+0x3c/0x148 el0t_64_sync_handler+0x100/0x130 el0t_64_sync+0x190/0x198 Link: https://lkml.kernel.org/r/20240729021306.398286-1-hsiangkao@linux.alibaba.com Fixes: 5dfab109d519 ("migrate_pages: batch _unmap and _move") Signed-off-by: Gao Xiang <hsiangkao@linux.alibaba.com> Reviewed-by: "Huang, Ying" <ying.huang@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | alloc_tag: mark pages reserved during CMA activation as not taggedSuren Baghdasaryan2024-08-161-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During CMA activation, pages in CMA area are prepared and then freed without being allocated. This triggers warnings when memory allocation debug config (CONFIG_MEM_ALLOC_PROFILING_DEBUG) is enabled. Fix this by marking these pages not tagged before freeing them. Link: https://lkml.kernel.org/r/20240813150758.855881-2-surenb@google.com Fixes: d224eb0287fb ("codetag: debug: mark codetags for reserved pages as empty") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [6.10] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | alloc_tag: introduce clear_page_tag_ref() helper functionSuren Baghdasaryan2024-08-163-17/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In several cases we are freeing pages which were not allocated using common page allocators. For such cases, in order to keep allocation accounting correct, we should clear the page tag to indicate that the page being freed is expected to not have a valid allocation tag. Introduce clear_page_tag_ref() helper function to be used for this. Link: https://lkml.kernel.org/r/20240813150758.855881-1-surenb@google.com Fixes: d224eb0287fb ("codetag: debug: mark codetags for reserved pages as empty") Signed-off-by: Suren Baghdasaryan <surenb@google.com> Suggested-by: David Hildenbrand <david@redhat.com> Acked-by: David Hildenbrand <david@redhat.com> Reviewed-by: Pasha Tatashin <pasha.tatashin@soleen.com> Cc: Kees Cook <keescook@chromium.org> Cc: Kent Overstreet <kent.overstreet@linux.dev> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [6.10] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | crash: fix riscv64 crash memory reserve dead loopJinjie Ruan2024-08-161-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On RISCV64 Qemu machine with 512MB memory, cmdline "crashkernel=500M,high" will cause system stall as below: Zone ranges: DMA32 [mem 0x0000000080000000-0x000000009fffffff] Normal empty Movable zone start for each node Early memory node ranges node 0: [mem 0x0000000080000000-0x000000008005ffff] node 0: [mem 0x0000000080060000-0x000000009fffffff] Initmem setup node 0 [mem 0x0000000080000000-0x000000009fffffff] (stall here) commit 5d99cadf1568 ("crash: fix x86_32 crash memory reserve dead loop bug") fix this on 32-bit architecture. However, the problem is not completely solved. If `CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX` on 64-bit architecture, for example, when system memory is equal to CRASH_ADDR_LOW_MAX on RISCV64, the following infinite loop will also occur: -> reserve_crashkernel_generic() and high is true -> alloc at [CRASH_ADDR_LOW_MAX, CRASH_ADDR_HIGH_MAX] fail -> alloc at [0, CRASH_ADDR_LOW_MAX] fail and repeatedly (because CRASH_ADDR_LOW_MAX = CRASH_ADDR_HIGH_MAX). As Catalin suggested, do not remove the ",high" reservation fallback to ",low" logic which will change arm64's kdump behavior, but fix it by skipping the above situation similar to commit d2f32f23190b ("crash: fix x86_32 crash memory reserve dead loop"). After this patch, it print: cannot allocate crashkernel (size:0x1f400000) Link: https://lkml.kernel.org/r/20240812062017.2674441-1-ruanjinjie@huawei.com Signed-off-by: Jinjie Ruan <ruanjinjie@huawei.com> Suggested-by: Catalin Marinas <catalin.marinas@arm.com> Reviewed-by: Catalin Marinas <catalin.marinas@arm.com> Acked-by: Baoquan He <bhe@redhat.com> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: Dave Young <dyoung@redhat.com> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: Vivek Goyal <vgoyal@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | selftests: memfd_secret: don't build memfd_secret test on unsupported archesMuhammad Usama Anjum2024-08-162-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | [1] mentions that memfd_secret is only supported on arm64, riscv, x86 and x86_64 for now. It doesn't support other architectures. I found the build error on arm and decided to send the fix as it was creating noise on KernelCI: memfd_secret.c: In function 'memfd_secret': memfd_secret.c:42:24: error: '__NR_memfd_secret' undeclared (first use in this function); did you mean 'memfd_secret'? 42 | return syscall(__NR_memfd_secret, flags); | ^~~~~~~~~~~~~~~~~ | memfd_secret Hence I'm adding condition that memfd_secret should only be compiled on supported architectures. Also check in run_vmtests script if memfd_secret binary is present before executing it. Link: https://lkml.kernel.org/r/20240812061522.1933054-1-usama.anjum@collabora.com Link: https://lore.kernel.org/all/20210518072034.31572-7-rppt@kernel.org/ [1] Link: https://lkml.kernel.org/r/20240809075642.403247-1-usama.anjum@collabora.com Fixes: 76fe17ef588a ("secretmem: test: add basic selftest for memfd_secret(2)") Signed-off-by: Muhammad Usama Anjum <usama.anjum@collabora.com> Reviewed-by: Shuah Khan <skhan@linuxfoundation.org> Acked-by: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Albert Ou <aou@eecs.berkeley.edu> Cc: James Bottomley <James.Bottomley@HansenPartnership.com> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Palmer Dabbelt <palmer@dabbelt.com> Cc: Paul Walmsley <paul.walmsley@sifive.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: fix endless reclaim on machines with unaccepted memoryKirill A. Shutemov2024-08-161-22/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Unaccepted memory is considered unusable free memory, which is not counted as free on the zone watermark check. This causes get_page_from_freelist() to accept more memory to hit the high watermark, but it creates problems in the reclaim path. The reclaim path encounters a failed zone watermark check and attempts to reclaim memory. This is usually successful, but if there is little or no reclaimable memory, it can result in endless reclaim with little to no progress. This can occur early in the boot process, just after start of the init process when the only reclaimable memory is the page cache of the init executable and its libraries. Make unaccepted memory free from watermark check point of view. This way unaccepted memory will never be the trigger of memory reclaim. Accept more memory in the get_page_from_freelist() if needed. Link: https://lkml.kernel.org/r/20240809114854.3745464-2-kirill.shutemov@linux.intel.com Fixes: dcdfdd40fa82 ("mm: Add support for unaccepted memory") Signed-off-by: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Reported-by: Jianxiong Gao <jxgao@google.com> Acked-by: David Hildenbrand <david@redhat.com> Tested-by: Jianxiong Gao <jxgao@google.com> Cc: Borislav Petkov <bp@alien8.de> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Kirill A. Shutemov <kirill.shutemov@linux.intel.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: Mel Gorman <mgorman@suse.de> Cc: Mike Rapoport (Microsoft) <rppt@kernel.org> Cc: Tom Lendacky <thomas.lendacky@amd.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: <stable@vger.kernel.org> [6.5+] Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | selftests/mm: compaction_test: fix off by one in check_compaction()Dan Carpenter2024-08-161-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The "initial_nr_hugepages" variable is unsigned long so it takes up to 20 characters to print, plus 1 more character for the NUL terminator. Unfortunately, this buffer is not quite large enough for the terminator to fit. Also use snprintf() for a belt and suspenders approach. Link: https://lkml.kernel.org/r/87470c06-b45a-4e83-92ff-aac2e7b9c6ba@stanley.mountain Fixes: fb9293b6b015 ("selftests/mm: compaction_test: fix bogus test success and reduce probability of OOM-killer invocation") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Cc: Shuah Khan <shuah@kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/numa: no task_numa_fault() call if PMD is changedZi Yan2024-08-161-16/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When handling a numa page fault, task_numa_fault() should be called by a process that restores the page table of the faulted folio to avoid duplicated stats counting. Commit c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling") restructured do_huge_pmd_numa_page() and did not avoid task_numa_fault() call in the second page table check after a numa migration failure. Fix it by making all !pmd_same() return immediately. This issue can cause task_numa_fault() being called more than necessary and lead to unexpected numa balancing results (It is hard to tell whether the issue will cause positive or negative performance impact due to duplicated numa fault counting). Link: https://lkml.kernel.org/r/20240809145906.1513458-3-ziy@nvidia.com Fixes: c5b5a3dd2c1f ("mm: thp: refactor NUMA fault handling") Reported-by: "Huang, Ying" <ying.huang@intel.com> Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.intel.com/ Signed-off-by: Zi Yan <ziy@nvidia.com> Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/numa: no task_numa_fault() call if PTE is changedZi Yan2024-08-161-17/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When handling a numa page fault, task_numa_fault() should be called by a process that restores the page table of the faulted folio to avoid duplicated stats counting. Commit b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault") restructured do_numa_page() and did not avoid task_numa_fault() call in the second page table check after a numa migration failure. Fix it by making all !pte_same() return immediately. This issue can cause task_numa_fault() being called more than necessary and lead to unexpected numa balancing results (It is hard to tell whether the issue will cause positive or negative performance impact due to duplicated numa fault counting). Link: https://lkml.kernel.org/r/20240809145906.1513458-2-ziy@nvidia.com Fixes: b99a342d4f11 ("NUMA balancing: reduce TLB flush via delaying mapping on hint page fault") Signed-off-by: Zi Yan <ziy@nvidia.com> Reported-by: "Huang, Ying" <ying.huang@intel.com> Closes: https://lore.kernel.org/linux-mm/87zfqfw0yw.fsf@yhuang6-desk2.ccr.corp.intel.com/ Acked-by: David Hildenbrand <david@redhat.com> Cc: Baolin Wang <baolin.wang@linux.alibaba.com> Cc: Kefeng Wang <wangkefeng.wang@huawei.com> Cc: Mel Gorman <mgorman@suse.de> Cc: Yang Shi <shy828301@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order ↵Hailong Liu2024-08-161-9/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | fallback to order 0 The __vmap_pages_range_noflush() assumes its argument pages** contains pages with the same page shift. However, since commit e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations"), if gfp_flags includes __GFP_NOFAIL with high order in vm_area_alloc_pages() and page allocation failed for high order, the pages** may contain two different page shifts (high order and order-0). This could lead __vmap_pages_range_noflush() to perform incorrect mappings, potentially resulting in memory corruption. Users might encounter this as follows (vmap_allow_huge = true, 2M is for PMD_SIZE): kvmalloc(2M, __GFP_NOFAIL|GFP_X) __vmalloc_node_range_noprof(vm_flags=VM_ALLOW_HUGE_VMAP) vm_area_alloc_pages(order=9) ---> order-9 allocation failed and fallback to order-0 vmap_pages_range() vmap_pages_range_noflush() __vmap_pages_range_noflush(page_shift = 21) ----> wrong mapping happens We can remove the fallback code because if a high-order allocation fails, __vmalloc_node_range_noprof() will retry with order-0. Therefore, it is unnecessary to fallback to order-0 here. Therefore, fix this by removing the fallback code. Link: https://lkml.kernel.org/r/20240808122019.3361-1-hailong.liu@oppo.com Fixes: e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations") Signed-off-by: Hailong Liu <hailong.liu@oppo.com> Reported-by: Tangquan Zheng <zhengtangquan@oppo.com> Reviewed-by: Baoquan He <bhe@redhat.com> Reviewed-by: Uladzislau Rezki (Sony) <urezki@gmail.com> Acked-by: Barry Song <baohua@kernel.org> Acked-by: Michal Hocko <mhocko@suse.com> Cc: Matthew Wilcox <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm/memory-failure: use raw_spinlock_t in struct memory_failure_cpuWaiman Long2024-08-161-9/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The memory_failure_cpu structure is a per-cpu structure. Access to its content requires the use of get_cpu_var() to lock in the current CPU and disable preemption. The use of a regular spinlock_t for locking purpose is fine for a non-RT kernel. Since the integration of RT spinlock support into the v5.15 kernel, a spinlock_t in a RT kernel becomes a sleeping lock and taking a sleeping lock in a preemption disabled context is illegal resulting in the following kind of warning. [12135.732244] BUG: sleeping function called from invalid context at kernel/locking/spinlock_rt.c:48 [12135.732248] in_atomic(): 1, irqs_disabled(): 0, non_block: 0, pid: 270076, name: kworker/0:0 [12135.732252] preempt_count: 1, expected: 0 [12135.732255] RCU nest depth: 2, expected: 2 : [12135.732420] Hardware name: Dell Inc. PowerEdge R640/0HG0J8, BIOS 2.10.2 02/24/2021 [12135.732423] Workqueue: kacpi_notify acpi_os_execute_deferred [12135.732433] Call Trace: [12135.732436] <TASK> [12135.732450] dump_stack_lvl+0x57/0x81 [12135.732461] __might_resched.cold+0xf4/0x12f [12135.732479] rt_spin_lock+0x4c/0x100 [12135.732491] memory_failure_queue+0x40/0xe0 [12135.732503] ghes_do_memory_failure+0x53/0x390 [12135.732516] ghes_do_proc.constprop.0+0x229/0x3e0 [12135.732575] ghes_proc+0xf9/0x1a0 [12135.732591] ghes_notify_hed+0x6a/0x150 [12135.732602] notifier_call_chain+0x43/0xb0 [12135.732626] blocking_notifier_call_chain+0x43/0x60 [12135.732637] acpi_ev_notify_dispatch+0x47/0x70 [12135.732648] acpi_os_execute_deferred+0x13/0x20 [12135.732654] process_one_work+0x41f/0x500 [12135.732695] worker_thread+0x192/0x360 [12135.732715] kthread+0x111/0x140 [12135.732733] ret_from_fork+0x29/0x50 [12135.732779] </TASK> Fix it by using a raw_spinlock_t for locking instead. Also move the pr_err() out of the lock critical section and after put_cpu_ptr() to avoid indeterminate latency and the possibility of sleep with this call. [longman@redhat.com: don't hold percpu ref across pr_err(), per Miaohe] Link: https://lkml.kernel.org/r/20240807181130.1122660-1-longman@redhat.com Link: https://lkml.kernel.org/r/20240806164107.1044956-1-longman@redhat.com Fixes: 0f383b6dc96e ("locking/spinlock: Provide RT variant") Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Miaohe Lin <linmiaohe@huawei.com> Cc: "Huang, Ying" <ying.huang@intel.com> Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Len Brown <len.brown@intel.com> Cc: Naoya Horiguchi <nao.horiguchi@gmail.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
| * | | | mm: don't account memmap per-nodePasha Tatashin2024-08-169-60/+41
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix invalid access to pgdat during hot-remove operation: ndctl users reported a GPF when trying to destroy a namespace: $ ndctl destroy-namespace all -r all -f Segmentation fault dmesg: Oops: general protection fault, probably for non-canonical address 0xdffffc0000005650: 0000 [#1] PREEMPT SMP KASAN PTI KASAN: probably user-memory-access in range [0x000000000002b280-0x000000000002b287] CPU: 26 UID: 0 PID: 1868 Comm: ndctl Not tainted 6.11.0-rc1 #1 Hardware name: Dell Inc. PowerEdge R640/08HT8T, BIOS 2.20.1 09/13/2023 RIP: 0010:mod_node_page_state+0x2a/0x110 cxl-test users report a GPF when trying to unload the test module: $ modrpobe -r cxl-test dmesg BUG: unable to handle page fault for address: 0000000000004200 #PF: supervisor read access in kernel mode #PF: error_code(0x0000) - not-present page PGD 0 P4D 0 Oops: Oops: 0000 [#1] PREEMPT SMP PTI CPU: 0 UID: 0 PID: 1076 Comm: modprobe Tainted: G O N 6.11.0-rc1 #197 Tainted: [O]=OOT_MODULE, [N]=TEST Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 0.0.0 02/06/15 RIP: 0010:mod_node_page_state+0x6/0x90 Currently, when memory is hot-plugged or hot-removed the accounting is done based on the assumption that memmap is allocated from the same node as the hot-plugged/hot-removed memory, which is not always the case. In addition, there are challenges with keeping the node id of the memory that is being remove to the time when memmap accounting is actually performed: since this is done after remove_pfn_range_from_zone(), and also after remove_memory_block_devices(). Meaning that we cannot use pgdat nor walking though memblocks to get the nid. Given all of that, account the memmap overhead system wide instead. For this we are going to be using global atomic counters, but given that memmap size is rarely modified, and normally is only modified either during early boot when there is only one CPU, or under a hotplug global mutex lock, therefore there is no need for per-cpu optimizations. Also, while we are here rename nr_memmap to nr_memmap_pages, and nr_memmap_boot to nr_memmap_boot_pages to be self explanatory that the units are in page count. [pasha.tatashin@soleen.com: address a few nits from David Hildenbrand] Link: https://lkml.kernel.org/r/20240809191020.1142142-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240809191020.1142142-4-pasha.tatashin@soleen.com Link: https://lkml.kernel.org/r/20240808213437.682006-4-pasha.tatashin@soleen.com Fixes: 15995a352474 ("mm: report per-page metadata information") Signed-off-by: Pasha Tatashin <pasha.tatashin@soleen.com> Reported-by: Yi Zhang <yi.zhang@redhat.com> Closes: https://lore.kernel.org/linux-cxl/CAHj4cs9Ax1=CoJkgBGP_+sNu6-6=6v=_L-ZBZY0bVLD3wUWZQg@mail.gmail.com Reported-by: Alison Schofield <alison.schofield@intel.com> Closes: https://lore.kernel.org/linux-mm/Zq0tPd2h6alFz8XF@aschofie-mobl2/#t Tested-by: Dan Williams <dan.j.williams@intel.com> Tested-by: Alison Schofield <alison.schofield@intel.com> Acked-by: David Hildenbrand <david@redhat.com> Acked-by: David Rientjes <rientjes@google.com> Tested-by: Yi Zhang <yi.zhang@redhat.com> Cc: Domenico Cerasuolo <cerasuolodomenico@gmail.com> Cc: Fan Ni <fan.ni@samsung.com> Cc: Joel Granados <j.granados@samsung.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Li Zhijian <lizhijian@fujitsu.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: Mike Rapoport <rppt@kernel.org> Cc: Muchun Song <muchun.song@linux.dev> Cc: Nhat Pham <nphamcs@gmail.com> Cc: Sourav Panda <souravpanda@google.com> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Yosry Ahmed <yosryahmed@google.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>