Age | Commit message | Author | Files | Lines
2024-07-08 | nfsd: allow passing in array of thread counts via netlink | Jeff Layton | 2 | -13/+31
Now that nfsd_svc can handle an array of thread counts, fix up the netlink threads interface to construct one from the netlink call and pass it through so we can start a pooled server the same way we would start a normal one. Note that any unspecified values in the array are considered zeroes, so it's possible to shut down a pooled server by passing in a short array that has only zeros, or even an empty array. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 | nfsd: make nfsd_svc take an array of thread counts | Jeff Layton | 3 | -24/+45
Now that the refcounting is fixed, rework nfsd_svc to use the same thread setup as the pool_threads interface. Have it take an array of thread counts instead of just a single value, and pass that from the netlink threads set interface. Since the new netlink interface doesn't have the same restriction as pool_threads, move the guard against shutting down all threads to write_pool_threads. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 | sunrpc: fix up the special handling of sv_nrpools == 1 | Jeff Layton | 2 | -19/+8
Only pooled services take a reference to the svc_pool_map. The sunrpc code has always used the sv_nrpools value to detect whether the service is pooled. The problem there is that nfsd is a pooled service, but when it's running in "global" pool_mode, it doesn't take a reference to the pool map because it has a sv_nrpools value of 1. This means that we have two separate codepaths for starting the server, depending on whether it's pooled or not. Fix this by adding a new flag to the svc_serv, that indicates whether the serv is pooled. With this we can have the nfsd service unconditionally take a reference, regardless of pool_mode. Note that this is a behavior change for /sys/module/sunrpc/parameters/pool_mode. Usually this file does not allow you to change the pool-mode while there are nfsd threads running, but if the pool-mode is "global" it's allowed. My assumption is that this is a bug, since it probably should never have worked this way. This patch changes the behavior such that you get back EBUSY even when nfsd is running in global mode. I think this is more reasonable behavior, and given that most people set this today using the module parameter, it's doubtful anyone will notice. Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 | SUNRPC: Add a trace point in svc_xprt_deferred_close | Chuck Lever | 1 | -0/+1
The trace point in svc_xprt_close() reports only some local close requests. Try to capture more local close requests. Note that "trace-cmd record -T -e sunrpc:svc_xprt_close" will neatly capture the identity of the caller requesting the close. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 | NFSD: Support write delegations in LAYOUTGET | Chuck Lever | 1 | -2/+3
I noticed LAYOUTGET(LAYOUTIOMODE4_RW) returning NFS4ERR_ACCESS unexpectedly. The NFS client had created a file with mode 0444, and the server had returned a write delegation on the OPEN(CREATE). The client was requesting a RW layout using the write delegation stateid so that it could flush file modifications. Creating a read-only file does not seem to be problematic for NFSv4.1 without pNFS, so I began looking at NFSD's implementation of LAYOUTGET. The failure was because fh_verify() was doing a permission check as part of verifying the FH presented during the LAYOUTGET. It uses the loga_iomode value to specify the @accmode argument to fh_verify(). fh_verify(MAY_WRITE) on a file whose mode is 0444 fails with -EACCES. To permit LAYOUT* operations in this case, add OWNER_OVERRIDE when checking the access permission of the incoming file handle for LAYOUTGET and LAYOUTCOMMIT. Cc: Christoph Hellwig <hch@lst.de> Cc: stable@vger.kernel.org # v6.6+ Message-Id: 4E9C0D74-A06D-4DC3-A48A-73034DC40395@oracle.com Reviewed-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Christoph Hellwig <hch@lst.de> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 | lockd: Use *-y instead of *-objs in Makefile | Andy Shevchenko | 1 | -5/+4
*-objs suffix is reserved rather for (user-space) host programs while usually *-y suffix is used for kernel drivers (although *-objs works for that purpose for now). Let's correct the old usages of *-objs in Makefiles. Signed-off-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Reviewed-by: Benjamin Coddington <bcodding@redhat.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 | NFSD: Fix nfsdcld warning | Chuck Lever | 2 | -3/+3
Since CONFIG_NFSD_LEGACY_CLIENT_TRACKING is a new config option, its initial default setting should have been Y (if we are to follow the common practice of "default Y, wait, default N, wait, remove code"). Paul also suggested adding a clearer remedy action to the warning message. Reported-by: Paul Menzel <pmenzel@molgen.mpg.de> Message-Id: <d2ab4ee7-ba0f-44ac-b921-90c8fa5a04d2@molgen.mpg.de> Fixes: 74fd48739d04 ("nfsd: new Kconfig option for legacy client tracking") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 | svcrdma: Handle ADDR_CHANGE CM event properly | Chuck Lever | 1 | -1/+15
Sagi tells me that when a bonded device reports an address change, the consumer must destroy its listener IDs and create new ones. See commit a032e4f6d60d ("nvmet-rdma: fix bonding failover possible NULL deref"). Suggested-by: Sagi Grimberg <sagi@grimberg.me> Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 | svcrdma: Refactor the creation of listener CMA ID | Chuck Lever | 1 | -27/+40
In a moment, I will add a second consumer of CMA ID creation in svcrdma. Refactor so this code can be reused. Reviewed-by: Sagi Grimberg <sagi@grimberg.me> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 | NFSD: remove unused structs 'nfsd3_voidargs' | Dr. David Alan Gilbert | 2 | -4/+0
'nfsd3_voidargs' in nfs[23]acl.c is unused since commit 788f7183fba8 ("NFSD: Add common helpers to decode void args and encode void results"). Remove them. Signed-off-by: Dr. David Alan Gilbert <linux@treblig.org> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
2024-07-08 | NFSD: harden svcxdr_dupstr() and svcxdr_tmpalloc() against integer overflows | Dan Carpenter | 1 | -6/+6
These lengths come from xdr_stream_decode_u32() and so we should be a bit careful with them. Use size_add() and struct_size() to avoid integer overflows. Saving size_add()/struct_size() results to a u32 is unsafe because it truncates away the high bits. Also generally storing sizes in longs is safer. Most systems these days use 64 bit CPUs. It's harder for an addition to overflow 64 bits than it is to overflow 32 bits. Also functions like vmalloc() can successfully allocate UINT_MAX bytes, but nothing can allocate ULONG_MAX bytes. Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
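The overflow concern in the message above can be sketched in user space. This is only an illustration, not the NFSD patch: sat_size_add() is a hypothetical stand-in for the kernel's size_add() (which saturates at SIZE_MAX), wire_alloc_size() is an invented helper, and __builtin_add_overflow() assumes GCC or Clang.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical user-space stand-in for the kernel's size_add():
 * return SIZE_MAX instead of wrapping around when a + b overflows.
 */
static size_t sat_size_add(size_t a, size_t b)
{
	size_t sum;

	if (__builtin_add_overflow(a, b, &sum))
		return SIZE_MAX;
	return sum;
}

/*
 * Hypothetical helper: bytes needed to hold a wire-decoded string of
 * wire_len bytes plus a terminating NUL. The result stays in a size_t;
 * storing it in a 32-bit variable would drop the high bits (or the
 * saturation marker) that make the check meaningful.
 */
static size_t wire_alloc_size(uint32_t wire_len)
{
	return sat_size_add(wire_len, 1);
}
```

An allocator handed SIZE_MAX will fail cleanly, which is the point of saturating rather than wrapping to a small value.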
2024-07-07 | Linux 6.10-rc7 [v6.10-rc7] | Linus Torvalds | 1 | -1/+1
2024-07-06 | selftests/powerpc: Fix build with USERCFLAGS set | Michael Ellerman | 1 | -4/+1
Currently building the powerpc selftests with USERCFLAGS set to anything causes the build to break:

  $ make -C tools/testing/selftests/powerpc V=1 USERCFLAGS=-Wno-error
  ...
  gcc -Wno-error cache_shape.c ...
  cache_shape.c:18:10: fatal error: utils.h: No such file or directory
     18 | #include "utils.h"
        |          ^~~~~~~~~
  compilation terminated.

This happens because the USERCFLAGS are added to CFLAGS in lib.mk, which causes the check of CFLAGS in powerpc/flags.mk to skip setting CFLAGS at all, resulting in none of the usual CFLAGS being passed. That can be seen in the output above, the only flag passed to the compiler is -Wno-error.

Fix it by dropping the conditional setting of CFLAGS in flags.mk. Instead always set CFLAGS, but also append USERCFLAGS if they are set.

Note that appending to CFLAGS (with +=) wouldn't work, because flags.mk is included by multiple Makefiles (to support partial builds), causing CFLAGS to be appended to multiple times. Additionally that would place the USERCFLAGS prior to the standard CFLAGS, meaning the USERCFLAGS couldn't override the standard flags. Being able to override the standard flags is desirable, for example for adding -Wno-error.

With the fix in place, the CFLAGS are set correctly, including the USERCFLAGS:

  $ make -C tools/testing/selftests/powerpc V=1 USERCFLAGS=-Wno-error
  ...
  gcc -std=gnu99 -O2 -Wall -Werror -DGIT_VERSION='"v6.10-rc2-7-gdea17e7e56c3"' -I/home/michael/linux/tools/testing/selftests/powerpc/include -Wno-error cache_shape.c ...

Fixes: 5553a79387e9 ("selftests/powerpc: Add flags.mk to support pmu buildable")
Signed-off-by: Michael Ellerman <mpe@ellerman.id.au>
Link: https://msgid.link/20240706120833.909853-1-mpe@ellerman.id.au
2024-07-05 | gpiolib: of: add polarity quirk for TSC2005 | Dmitry Torokhov | 1 | -0/+8
DTS for Nokia N900 incorrectly specifies "active high" polarity for the reset line, while the chip documentation actually specifies it as "active low". In the past the driver fudged gpiod API and inverted the logic internally, but it was changed in d0d89493bff8. Fixes: d0d89493bff8 ("Input: tsc2004/5 - switch to using generic device properties") Signed-off-by: Dmitry Torokhov <dmitry.torokhov@gmail.com> Acked-by: Linus Walleij <linus.walleij@linaro.org> Link: https://lore.kernel.org/r/ZoWXwYtwgJIxi-hD@google.com Signed-off-by: Bartosz Golaszewski <bartosz.golaszewski@linaro.org>
2024-07-05 | tpm: Address !chip->auth in tpm_buf_append_hmac_session*() | Jarkko Sakkinen | 2 | -124/+130
Unless tpm_chip_bootstrap() was called by the driver, !chip->auth can cause a NULL dereference in tpm_buf_append_hmac_session*(). Thus, address !chip->auth in tpm_buf_append_hmac_session*() and remove the fallback implementation for !TCG_TPM2_HMAC. Cc: stable@vger.kernel.org # v6.9+ Reported-by: Stefan Berger <stefanb@linux.ibm.com> Closes: https://lore.kernel.org/linux-integrity/20240617193408.1234365-1-stefanb@linux.ibm.com/ Fixes: 1085b8276bb4 ("tpm: Add the rest of the session HMAC API") Tested-by: Michael Ellerman <mpe@ellerman.id.au> # ppc Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2024-07-05 | tpm: Address !chip->auth in tpm_buf_append_name() | Jarkko Sakkinen | 3 | -111/+131
Unless tpm_chip_bootstrap() was called by the driver, !chip->auth can cause a NULL dereference in tpm_buf_append_name(). Thus, address !chip->auth in tpm_buf_append_name() and remove the fallback implementation for !TCG_TPM2_HMAC. Cc: stable@vger.kernel.org # v6.10+ Reported-by: Stefan Berger <stefanb@linux.ibm.com> Closes: https://lore.kernel.org/linux-integrity/20240617193408.1234365-1-stefanb@linux.ibm.com/ Fixes: d0a25bb961e6 ("tpm: Add HMAC session name/handle append") Tested-by: Michael Ellerman <mpe@ellerman.id.au> # ppc Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2024-07-05 | tpm: Address !chip->auth in tpm2_*_auth_session() | Jarkko Sakkinen | 1 | -2/+12
Unless tpm_chip_bootstrap() was called by the driver, !chip->auth can cause a NULL dereference in tpm2_*_auth_session(). Thus, address !chip->auth in tpm2_*_auth_session(). Cc: stable@vger.kernel.org # v6.9+ Reported-by: Stefan Berger <stefanb@linux.ibm.com> Closes: https://lore.kernel.org/linux-integrity/20240617193408.1234365-1-stefanb@linux.ibm.com/ Fixes: 699e3efd6c64 ("tpm: Add HMAC session start and end functions") Tested-by: Michael Ellerman <mpe@ellerman.id.au> # ppc Signed-off-by: Jarkko Sakkinen <jarkko@kernel.org>
2024-07-04 | bnxt_en: Fix the resource check condition for RSS contexts | Pavan Chebbi | 1 | -1/+5
While creating a new RSS context, bnxt_rfs_capable() currently makes a strict check to see if the required VNICs are already available. If the current VNICs are not what is required, either too many or not enough, it will call the firmware to reserve the exact number required. There is a bug in the firmware when the driver tries to relinquish some reserved VNICs and RSS contexts. It will cause the default VNIC to lose its RSS configuration and cause receive packets to be placed incorrectly. Work around this problem by skipping the resource reduction. The driver will not reduce the VNIC and RSS context reservations when a context is deleted. The resources will be available for use when new contexts are created later. Potentially, this workaround can cause us to run out of VNIC and RSS contexts if there are a lot of VF functions creating and deleting RSS contexts. In the future, we will conditionally disable this workaround when the firmware fix is available. Fixes: 438ba39b25fe ("bnxt_en: Improve RSS context reservation infrastructure") Reported-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/netdev/20240625010210.2002310-1-kuba@kernel.org/ Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com> Signed-off-by: Pavan Chebbi <pavan.chebbi@broadcom.com> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Reviewed-by: Simon Horman <horms@kernel.org> Link: https://patch.msgid.link/20240703180112.78590-1-michael.chan@broadcom.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04 | mlxsw: core_linecards: Fix double memory deallocation in case of invalid INI file | Aleksandr Mishin | 1 | -0/+1
In case of an invalid INI file, mlxsw_linecard_types_init() deallocates memory but doesn't reset the pointer to NULL and returns 0. If any error occurs after the mlxsw_linecard_types_init() call, mlxsw_linecards_init() calls mlxsw_linecard_types_fini(), which performs the memory deallocation again. Add pointer reset to NULL. Found by Linux Verification Center (linuxtesting.org) with SVACE. Fixes: b217127e5e4e ("mlxsw: core_linecards: Add line card objects and implement provisioning") Signed-off-by: Aleksandr Mishin <amishin@t-argos.ru> Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Michal Kubiak <michal.kubiak@intel.com> Link: https://patch.msgid.link/20240703203251.8871-1-amishin@t-argos.ru Signed-off-by: Jakub Kicinski <kuba@kernel.org>
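The double-free pattern described above, and the pointer reset that avoids it, can be sketched in user space as follows. This is an assumption-laden illustration rather than the mlxsw code: struct demo_types, demo_types_init() and demo_types_fini() are hypothetical names, and malloc()/free() stand in for the kernel allocator.

```c
#include <stdlib.h>

/* Hypothetical container; the real driver keeps similar state elsewhere. */
struct demo_types {
	void *data;
};

/*
 * Hypothetical init path: an invalid INI file is not treated as an
 * error (0 is returned), but the buffer is released. Resetting the
 * pointer keeps a later fini call from freeing it a second time.
 */
static int demo_types_init(struct demo_types *t, int ini_valid)
{
	t->data = malloc(64);
	if (!t->data)
		return -1;

	if (!ini_valid) {
		free(t->data);
		t->data = NULL;	/* the reset the fix adds */
	}
	return 0;
}

/* Hypothetical teardown, also reached from a later error path. */
static void demo_types_fini(struct demo_types *t)
{
	free(t->data);	/* free(NULL) is a no-op */
	t->data = NULL;
}
```

Without the reset in the invalid-file branch, calling demo_types_fini() afterwards would pass an already freed pointer to free().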
2024-07-04 | inet_diag: Initialize pad field in struct inet_diag_req_v2 | Shigeru Yoshida | 1 | -0/+2
KMSAN reported uninit-value access in raw_lookup() [1]. Diag for raw sockets uses the pad field in struct inet_diag_req_v2 for the underlying protocol. This field corresponds to the sdiag_raw_protocol field in struct inet_diag_req_raw.

inet_diag_get_exact_compat() converts inet_diag_req to inet_diag_req_v2, but leaves the pad field uninitialized. So the issue occurs when raw_lookup() accesses the sdiag_raw_protocol field.

Fix this by initializing the pad field in inet_diag_get_exact_compat(). Also, do the same fix in inet_diag_dump_compat() to avoid a similar issue in the future.

[1]
BUG: KMSAN: uninit-value in raw_lookup net/ipv4/raw_diag.c:49 [inline]
BUG: KMSAN: uninit-value in raw_sock_get+0x657/0x800 net/ipv4/raw_diag.c:71
 raw_lookup net/ipv4/raw_diag.c:49 [inline]
 raw_sock_get+0x657/0x800 net/ipv4/raw_diag.c:71
 raw_diag_dump_one+0xa1/0x660 net/ipv4/raw_diag.c:99
 inet_diag_cmd_exact+0x7d9/0x980
 inet_diag_get_exact_compat net/ipv4/inet_diag.c:1404 [inline]
 inet_diag_rcv_msg_compat+0x469/0x530 net/ipv4/inet_diag.c:1426
 sock_diag_rcv_msg+0x23d/0x740 net/core/sock_diag.c:282
 netlink_rcv_skb+0x537/0x670 net/netlink/af_netlink.c:2564
 sock_diag_rcv+0x35/0x40 net/core/sock_diag.c:297
 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
 netlink_unicast+0xe74/0x1240 net/netlink/af_netlink.c:1361
 netlink_sendmsg+0x10c6/0x1260 net/netlink/af_netlink.c:1905
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0x332/0x3d0 net/socket.c:745
 ____sys_sendmsg+0x7f0/0xb70 net/socket.c:2585
 ___sys_sendmsg+0x271/0x3b0 net/socket.c:2639
 __sys_sendmsg net/socket.c:2668 [inline]
 __do_sys_sendmsg net/socket.c:2677 [inline]
 __se_sys_sendmsg net/socket.c:2675 [inline]
 __x64_sys_sendmsg+0x27e/0x4a0 net/socket.c:2675
 x64_sys_call+0x135e/0x3ce0 arch/x86/include/generated/asm/syscalls_64.h:47
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xd9/0x1e0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Uninit was stored to memory at:
 raw_sock_get+0x650/0x800 net/ipv4/raw_diag.c:71
 raw_diag_dump_one+0xa1/0x660 net/ipv4/raw_diag.c:99
 inet_diag_cmd_exact+0x7d9/0x980
 inet_diag_get_exact_compat net/ipv4/inet_diag.c:1404 [inline]
 inet_diag_rcv_msg_compat+0x469/0x530 net/ipv4/inet_diag.c:1426
 sock_diag_rcv_msg+0x23d/0x740 net/core/sock_diag.c:282
 netlink_rcv_skb+0x537/0x670 net/netlink/af_netlink.c:2564
 sock_diag_rcv+0x35/0x40 net/core/sock_diag.c:297
 netlink_unicast_kernel net/netlink/af_netlink.c:1335 [inline]
 netlink_unicast+0xe74/0x1240 net/netlink/af_netlink.c:1361
 netlink_sendmsg+0x10c6/0x1260 net/netlink/af_netlink.c:1905
 sock_sendmsg_nosec net/socket.c:730 [inline]
 __sock_sendmsg+0x332/0x3d0 net/socket.c:745
 ____sys_sendmsg+0x7f0/0xb70 net/socket.c:2585
 ___sys_sendmsg+0x271/0x3b0 net/socket.c:2639
 __sys_sendmsg net/socket.c:2668 [inline]
 __do_sys_sendmsg net/socket.c:2677 [inline]
 __se_sys_sendmsg net/socket.c:2675 [inline]
 __x64_sys_sendmsg+0x27e/0x4a0 net/socket.c:2675
 x64_sys_call+0x135e/0x3ce0 arch/x86/include/generated/asm/syscalls_64.h:47
 do_syscall_x64 arch/x86/entry/common.c:52 [inline]
 do_syscall_64+0xd9/0x1e0 arch/x86/entry/common.c:83
 entry_SYSCALL_64_after_hwframe+0x77/0x7f

Local variable req.i created at:
 inet_diag_get_exact_compat net/ipv4/inet_diag.c:1396 [inline]
 inet_diag_rcv_msg_compat+0x2a6/0x530 net/ipv4/inet_diag.c:1426
 sock_diag_rcv_msg+0x23d/0x740 net/core/sock_diag.c:282

CPU: 1 PID: 8888 Comm: syz-executor.6 Not tainted 6.10.0-rc4-00217-g35bb670d65fc #32
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014

Fixes: 432490f9d455 ("net: ip, diag -- Add diag interface for raw sockets")
Reported-by: syzkaller <syzkaller@googlegroups.com>
Signed-off-by: Shigeru Yoshida <syoshida@redhat.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://patch.msgid.link/20240703091649.111773-1-syoshida@redhat.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
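The conversion problem described above can be illustrated with a reduced user-space sketch. The structures below are simplified, hypothetical stand-ins (struct demo_req_old, struct demo_req_v2, demo_convert()); they are not the real inet_diag layouts. The point is only that a field-by-field conversion into a stack variable must also set the padding field when another code path reads it as data.

```c
#include <stdint.h>

/* Hypothetical legacy request: no protocol and no pad field. */
struct demo_req_old {
	uint8_t family;
	uint8_t ext;
};

/* Hypothetical v2 request: the pad byte is reused by raw-socket diag. */
struct demo_req_v2 {
	uint8_t family;
	uint8_t protocol;
	uint8_t ext;
	uint8_t pad;
};

/*
 * Field-by-field conversion. Leaving out the assignment to pad would
 * leave whatever was on the stack in that byte, and a reader treating
 * it as a protocol number would consume uninitialized memory.
 */
static void demo_convert(const struct demo_req_old *in, uint8_t protocol,
			 struct demo_req_v2 *out)
{
	out->family = in->family;
	out->protocol = protocol;
	out->ext = in->ext;
	out->pad = 0;	/* explicit initialization, as in the fix */
}
```

Zeroing the whole destination first (for example with memset()) would work as well; setting the single field keeps the sketch close to what the commit message describes.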
2024-07-04 | tcp: Don't flag tcp_sk(sk)->rx_opt.saw_unknown for TCP AO. | Kuniyuki Iwashima | 1 | -0/+7
When we process segments with TCP AO, we don't check it in tcp_parse_options(). Thus, opt_rx->saw_unknown is set to 1, which unconditionally triggers the BPF TCP option parser. Let's avoid the unnecessary BPF invocation. Fixes: 0a3a809089eb ("net/tcp: Verify inbound TCP-AO signed segments") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Dmitry Safonov <0x7f454c46@gmail.com> Link: https://patch.msgid.link/20240703033508.6321-1-kuniyu@amazon.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
2024-07-04 | drm/xe/mcr: Avoid clobbering DSS steering | Matt Roper | 1 | -3/+3
A couple copy/paste mistakes in the code that selects steering targets for OADDRM and INSTANCE0 unintentionally clobbered the steering target for DSS ranges in some cases. The OADDRM/INSTANCE0 values were also not assigned as intended, although that mistake wound up being harmless since the desired values for those specific ranges were '0' which the kzalloc of the GT structure should have already taken care of implicitly. Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matt Roper <matthew.d.roper@intel.com> Reviewed-by: Lucas De Marchi <lucas.demarchi@intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240626210536.1620176-2-matthew.d.roper@intel.com (cherry picked from commit 4f82ac6102788112e599a6074d2c1f2afce923df) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2024-07-04 | drm/xe: fix error handling in xe_migrate_update_pgtables | Matthew Auld | 1 | -4/+4
Don't call drm_suballoc_free with sa_bo pointing to PTR_ERR. References: https://gitlab.freedesktop.org/drm/xe/kernel/-/issues/2120 Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs") Signed-off-by: Matthew Auld <matthew.auld@intel.com> Cc: Thomas Hellström <thomas.hellstrom@linux.intel.com> Cc: Matthew Brost <matthew.brost@intel.com> Cc: <stable@vger.kernel.org> # v6.8+ Reviewed-by: Thomas Hellström <thomas.hellstrom@linux.intel.com> Link: https://patchwork.freedesktop.org/patch/msgid/20240620102025.127699-2-matthew.auld@intel.com (cherry picked from commit ce6b63336f79ec5f3996de65f452330e395f99ae) Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
2024-07-04 | drm/ttm: Always take the bo delayed cleanup path for imported bos | Thomas Hellström | 1 | -0/+1
Bos can be put with multiple unrelated dma-resv locks held. But imported bos attempt to grab the bo dma-resv during dma-buf detach that typically happens during cleanup. That leads to lockdep splats similar to the below and a potential ABBA deadlock.

Fix this by always taking the delayed workqueue cleanup path for imported bos.

Requesting stable fixes from when the Xe driver was introduced, since its usage of drm_exec and wide vm dma_resvs appear to be the first reliable trigger of this.

[22982.116427] ============================================
[22982.116428] WARNING: possible recursive locking detected
[22982.116429] 6.10.0-rc2+ #10 Tainted: G U W
[22982.116430] --------------------------------------------
[22982.116430] glxgears:sh0/5785 is trying to acquire lock:
[22982.116431] ffff8c2bafa539a8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: dma_buf_detach+0x3b/0xf0
[22982.116438] but task is already holding lock:
[22982.116438] ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[22982.116442] other info that might help us debug this:
[22982.116442] Possible unsafe locking scenario:
[22982.116443] CPU0
[22982.116444] ----
[22982.116444] lock(reservation_ww_class_mutex);
[22982.116445] lock(reservation_ww_class_mutex);
[22982.116447] *** DEADLOCK ***
[22982.116447] May be due to missing lock nesting notation
[22982.116448] 5 locks held by glxgears:sh0/5785:
[22982.116449] #0: ffff8c2d9aba58c8 (&xef->vm.lock){+.+.}-{3:3}, at: xe_file_close+0xde/0x1c0 [xe]
[22982.116507] #1: ffff8c2e28cc8480 (&vm->lock){++++}-{3:3}, at: xe_vm_close_and_put+0x161/0x9b0 [xe]
[22982.116578] #2: ffff8c2e31982970 (&val->lock){.+.+}-{3:3}, at: xe_validation_ctx_init+0x6d/0x70 [xe]
[22982.116647] #3: ffffacdc469478a8 (reservation_ww_class_acquire){+.+.}-{0:0}, at: xe_vma_destroy_unlocked+0x7f/0xe0 [xe]
[22982.116716] #4: ffff8c2d9aba6da8 (reservation_ww_class_mutex){+.+.}-{3:3}, at: drm_exec_lock_obj+0x49/0x2b0 [drm_exec]
[22982.116719]
stack backtrace:
[22982.116720] CPU: 8 PID: 5785 Comm: glxgears:sh0 Tainted: G U W 6.10.0-rc2+ #10
[22982.116721] Hardware name: ASUS System Product Name/PRIME B560M-A AC, BIOS 2001 02/01/2023
[22982.116723] Call Trace:
[22982.116724] <TASK>
[22982.116725] dump_stack_lvl+0x77/0xb0
[22982.116727] __lock_acquire+0x1232/0x2160
[22982.116730] lock_acquire+0xcb/0x2d0
[22982.116732] ? dma_buf_detach+0x3b/0xf0
[22982.116734] ? __lock_acquire+0x417/0x2160
[22982.116736] __ww_mutex_lock.constprop.0+0xd0/0x13b0
[22982.116738] ? dma_buf_detach+0x3b/0xf0
[22982.116741] ? dma_buf_detach+0x3b/0xf0
[22982.116743] ? ww_mutex_lock+0x2b/0x90
[22982.116745] ww_mutex_lock+0x2b/0x90
[22982.116747] dma_buf_detach+0x3b/0xf0
[22982.116749] drm_prime_gem_destroy+0x2f/0x40 [drm]
[22982.116775] xe_ttm_bo_destroy+0x32/0x220 [xe]
[22982.116818] ? __mutex_unlock_slowpath+0x3a/0x290
[22982.116821] drm_exec_unlock_all+0xa1/0xd0 [drm_exec]
[22982.116823] drm_exec_fini+0x12/0xb0 [drm_exec]
[22982.116824] xe_validation_ctx_fini+0x15/0x40 [xe]
[22982.116892] xe_vma_destroy_unlocked+0xb1/0xe0 [xe]
[22982.116959] xe_vm_close_and_put+0x41a/0x9b0 [xe]
[22982.117025] ? xa_find+0xe3/0x1e0
[22982.117028] xe_file_close+0x10a/0x1c0 [xe]
[22982.117074] drm_file_free+0x22a/0x280 [drm]
[22982.117099] drm_release_noglobal+0x22/0x70 [drm]
[22982.117119] __fput+0xf1/0x2d0
[22982.117122] task_work_run+0x59/0x90
[22982.117125] do_exit+0x330/0xb40
[22982.117127] do_group_exit+0x36/0xa0
[22982.117129] get_signal+0xbd2/0xbe0
[22982.117131] arch_do_signal_or_restart+0x3e/0x240
[22982.117134] syscall_exit_to_user_mode+0x1e7/0x290
[22982.117137] do_syscall_64+0xa1/0x180
[22982.117139] ? lock_acquire+0xcb/0x2d0
[22982.117140] ? __set_task_comm+0x28/0x1e0
[22982.117141] ? find_held_lock+0x2b/0x80
[22982.117144] ? __set_task_comm+0xe1/0x1e0
[22982.117145] ? lock_release+0xca/0x290
[22982.117147] ? __do_sys_prctl+0x245/0xab0
[22982.117149] ? lockdep_hardirqs_on_prepare+0xde/0x190
[22982.117150] ? syscall_exit_to_user_mode+0xb0/0x290
[22982.117152] ? do_syscall_64+0xa1/0x180
[22982.117154] ? __lock_acquire+0x417/0x2160
[22982.117155] ? reacquire_held_locks+0xd1/0x1f0
[22982.117156] ? do_user_addr_fault+0x30c/0x790
[22982.117158] ? lock_acquire+0xcb/0x2d0
[22982.117160] ? find_held_lock+0x2b/0x80
[22982.117162] ? do_user_addr_fault+0x357/0x790
[22982.117163] ? lock_release+0xca/0x290
[22982.117164] ? do_user_addr_fault+0x361/0x790
[22982.117166] ? trace_hardirqs_off+0x4b/0xc0
[22982.117168] ? clear_bhb_loop+0x45/0xa0
[22982.117170] ? clear_bhb_loop+0x45/0xa0
[22982.117172] ? clear_bhb_loop+0x45/0xa0
[22982.117174] entry_SYSCALL_64_after_hwframe+0x76/0x7e
[22982.117176] RIP: 0033:0x7f943d267169
[22982.117192] Code: Unable to access opcode bytes at 0x7f943d26713f.
[22982.117193] RSP: 002b:00007f9430bffc80 EFLAGS: 00000246 ORIG_RAX: 00000000000000ca
[22982.117195] RAX: fffffffffffffe00 RBX: 0000000000000000 RCX: 00007f943d267169
[22982.117196] RDX: 0000000000000000 RSI: 0000000000000189 RDI: 00005622f89579d0
[22982.117197] RBP: 00007f9430bffcb0 R08: 0000000000000000 R09: 00000000ffffffff
[22982.117198] R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000
[22982.117199] R13: 0000000000000000 R14: 0000000000000000 R15: 00005622f89579d0
[22982.117202] </TASK>

Fixes: dd08ebf6c352 ("drm/xe: Introduce a new DRM driver for Intel GPUs")
Cc: Christian König <christian.koenig@amd.com>
Cc: Daniel Vetter <daniel@ffwll.ch>
Cc: dri-devel@lists.freedesktop.org
Cc: intel-xe@lists.freedesktop.org
Cc: <stable@vger.kernel.org> # v6.8+
Signed-off-by: Thomas Hellström <thomas.hellstrom@linux.intel.com>
Reviewed-by: Matthew Brost <matthew.brost@intel.com>
Reviewed-by: Daniel Vetter <daniel.vetter@ffwll.ch>
Reviewed-by: Christian König <christian.koenig@amd.com>
Link: https://patchwork.freedesktop.org/patch/msgid/20240628153848.4989-1-thomas.hellstrom@linux.intel.com
2024-07-04 | selftests: make order checking verbose in msg_zerocopy selftest | Zijian Zhang | 1 | -1/+1
We find that when lock debugging is on, notifications may not come in order. Thus, we have order checking outputs managed by cfg_verbose, to avoid too many outputs in this case. Fixes: 07b65c5b31ce ("test: add msg_zerocopy test") Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240701225349.3395580-3-zijianzhang@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04 | selftests: fix OOM in msg_zerocopy selftest | Zijian Zhang | 1 | -1/+11
In selftests/net/msg_zerocopy.c, there is a while loop that keeps calling sendmsg on a socket with the MSG_ZEROCOPY flag, and it only receives the notifications once the socket is no longer writable. Typically, it starts receiving after around 30+ sendmsgs. However, since the introduction of commit dfa2f0483360 ("tcp: get rid of sysctl_tcp_adv_win_scale"), the sender is always writable and never gets a chance to receive notifications. The selftest always exits with OUT_OF_MEMORY because the memory used by opt_skb exceeds net.core.optmem_max. Meanwhile, net.core.optmem_max could be set to a different value to trigger OOM on older kernels too. Thus, we introduce "cfg_notification_limit" to force the sender to receive notifications after some number of sendmsgs. Fixes: 07b65c5b31ce ("test: add msg_zerocopy test") Signed-off-by: Zijian Zhang <zijianzhang@bytedance.com> Signed-off-by: Xiaochun Lu <xiaochun.lu@bytedance.com> Reviewed-by: Willem de Bruijn <willemb@google.com> Link: https://patch.msgid.link/20240701225349.3395580-2-zijianzhang@bytedance.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04 | ice: use proper macro for testing bit | Petr Oros | 1 | -1/+1
Do not use the _test_bit() macro for testing a bit. The proper macro for this is the one without the leading underscore. _test_bit() is what test_bit() was prior to const-optimization. It directly calls arch_test_bit(), i.e. the arch-specific implementation (or the generic one). It's strictly _internal_ and shouldn't be used anywhere outside the actual test_bit() macro. test_bit() is a wrapper which checks whether the bitmap and the bit number are compile-time constants and if so, it calls the optimized function which evaluates this call to a compile-time constant as well. If either of them is not a compile-time constant, it just calls _test_bit(). test_bit() is the actual function to use anywhere in the kernel. IOW, calling _test_bit() avoids potential compile-time optimizations. The sensors bitmap is not a compile-time constant, thus most probably there are no object code changes before and after the patch. But anyway, we shouldn't call internal wrappers instead of the actual API. Fixes: 4da71a77fc3b ("ice: read internal temperature sensor") Acked-by: Ivan Vecera <ivecera@redhat.com> Reviewed-by: Alexander Lobakin <aleksander.lobakin@intel.com> Signed-off-by: Petr Oros <poros@redhat.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20240702171459.2606611-5-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04 | ice: Reject pin requests with unsupported flags | Jacob Keller | 2 | -16/+23
The driver receives requests for configuring pins via the .enable callback of the PTP clock object. These requests come into the driver with flags which modify the requested behavior from userspace. Current implementation in ice does not reject flags that it doesn't support. This causes the driver to incorrectly apply requests with such flags as PTP_PEROUT_DUTY_CYCLE, or any future flags added by the kernel which it is not yet aware of. Fix this by properly validating flags in both ice_ptp_cfg_perout and ice_ptp_cfg_extts. Ensure that we check by bit-wise negating supported flags rather than just checking and rejecting known un-supported flags. This is preferable, as it ensures better compatibility with future kernels. Fixes: 172db5f91d5f ("ice: add support for auxiliary input/output pins") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20240702171459.2606611-4-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
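The validation strategy described above (reject anything outside the supported set, rather than listing known-bad flags) can be sketched as below. The flag names and demo_cfg_perout() are hypothetical and only mirror the idea; they are not the PTP uAPI constants or the ice functions.

```c
#include <errno.h>

/* Hypothetical request flags, standing in for the real PTP flag bits. */
#define DEMO_FLAG_ONE_SHOT	(1u << 0)
#define DEMO_FLAG_DUTY_CYCLE	(1u << 1)
#define DEMO_FLAG_PHASE		(1u << 2)

/* The only flag this hypothetical driver implements. */
#define DEMO_SUPPORTED_FLAGS	(DEMO_FLAG_PHASE)

/*
 * Reject every bit outside the supported mask. Because the check is
 * built from the negated supported set, flags introduced later are
 * rejected automatically instead of being silently ignored.
 */
static int demo_cfg_perout(unsigned int flags)
{
	if (flags & ~DEMO_SUPPORTED_FLAGS)
		return -EOPNOTSUPP;

	/* ... program the pin here ... */
	return 0;
}
```

A deny-list such as "if (flags & DEMO_FLAG_DUTY_CYCLE)" would have to be updated every time a new flag appears, which is the compatibility problem the message points out.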
2024-07-04 | ice: Don't process extts if PTP is disabled | Jacob Keller | 1 | -0/+4
The ice_ptp_extts_event() function can race with ice_ptp_release() and result in a NULL pointer dereference which leads to a kernel panic. Panic occurs because the ice_ptp_extts_event() function calls ptp_clock_event() with a NULL pointer. The ice driver has already released the PTP clock by the time the interrupt for the next external timestamp event occurs. To fix this, modify the ice_ptp_extts_event() function to check the PTP state and bail early if PTP is not ready. Fixes: 172db5f91d5f ("ice: add support for auxiliary input/output pins") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20240702171459.2606611-3-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04ice: Fix improper extts handlingMilena Olech2-22/+91
Extts events are disabled and enabled by the application ts2phc. However, if the driver is removed while the application is running, a specific extts event remains enabled and can cause a kernel crash. As a side effect, when the driver is reloaded and the application is started again, the remaining extts event for the channel from the previous run will keep firing and the message "extts on unexpected channel" might be printed to the user. To avoid that, extts events shall be disabled when PTP is released. Fixes: 172db5f91d5f ("ice: add support for auxiliary input/output pins") Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com> Co-developed-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Signed-off-by: Milena Olech <milena.olech@intel.com> Signed-off-by: Karol Kolacinski <karol.kolacinski@intel.com> Reviewed-by: Simon Horman <horms@kernel.org> Tested-by: Pucha Himasekhar Reddy <himasekharx.reddy.pucha@intel.com> (A Contingent worker at Intel) Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com> Link: https://patch.msgid.link/20240702171459.2606611-2-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04selftest: af_unix: Add test case for backtrack after finalising SCC.Kuniyuki Iwashima1-2/+23
syzkaller reported a KMSAN splat in __unix_walk_scc() while backtracking edge_stack after finalising SCC. Let's add a test case exercising the path. Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com> Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Link: https://patch.msgid.link/20240702160428.10153-2-syoshida@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
2024-07-04af_unix: Fix uninit-value in __unix_walk_scc()Shigeru Yoshida1-4/+5
KMSAN reported uninit-value access in __unix_walk_scc() [1]. In the list_for_each_entry_reverse() loop, when the vertex's index equals its scc_index, the loop uses the variable vertex as a temporary variable that points to a vertex in scc. And when the loop is finished, the variable vertex points to the list head, in this case scc, which is a local variable on the stack (more precisely, it's not even scc and might underflow the call stack of __unix_walk_scc(): container_of(&scc, struct unix_vertex, scc_entry)). However, the variable vertex is used under the label prev_vertex. So if the edge_stack is not empty and the function jumps to the prev_vertex label, the function will access invalid data on the stack. This causes the uninit-value access issue. Fix this by introducing a new temporary variable for the loop. [1] BUG: KMSAN: uninit-value in __unix_walk_scc net/unix/garbage.c:478 [inline] BUG: KMSAN: uninit-value in unix_walk_scc net/unix/garbage.c:526 [inline] BUG: KMSAN: uninit-value in __unix_gc+0x2589/0x3c20 net/unix/garbage.c:584 __unix_walk_scc net/unix/garbage.c:478 [inline] unix_walk_scc net/unix/garbage.c:526 [inline] __unix_gc+0x2589/0x3c20 net/unix/garbage.c:584 process_one_work kernel/workqueue.c:3231 [inline] process_scheduled_works+0xade/0x1bf0 kernel/workqueue.c:3312 worker_thread+0xeb6/0x15b0 kernel/workqueue.c:3393 kthread+0x3c4/0x530 kernel/kthread.c:389 ret_from_fork+0x6e/0x90 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Uninit was stored to memory at: unix_walk_scc net/unix/garbage.c:526 [inline] __unix_gc+0x2adf/0x3c20 net/unix/garbage.c:584 process_one_work kernel/workqueue.c:3231 [inline] process_scheduled_works+0xade/0x1bf0 kernel/workqueue.c:3312 worker_thread+0xeb6/0x15b0 kernel/workqueue.c:3393 kthread+0x3c4/0x530 kernel/kthread.c:389 ret_from_fork+0x6e/0x90 arch/x86/kernel/process.c:147 ret_from_fork_asm+0x1a/0x30 arch/x86/entry/entry_64.S:244 Local variable entries created at:
ref_tracker_free+0x48/0xf30 lib/ref_tracker.c:222 netdev_tracker_free include/linux/netdevice.h:4058 [inline] netdev_put include/linux/netdevice.h:4075 [inline] dev_put include/linux/netdevice.h:4101 [inline] update_gid_event_work_handler+0xaa/0x1b0 drivers/infiniband/core/roce_gid_mgmt.c:813 CPU: 1 PID: 12763 Comm: kworker/u8:31 Not tainted 6.10.0-rc4-00217-g35bb670d65fc #32 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.16.3-2.fc40 04/01/2014 Workqueue: events_unbound __unix_gc Fixes: 3484f063172d ("af_unix: Detect Strongly Connected Components.") Reported-by: syzkaller <syzkaller@googlegroups.com> Signed-off-by: Shigeru Yoshida <syoshida@redhat.com> Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com> Link: https://patch.msgid.link/20240702160428.10153-1-syoshida@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
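The loop-cursor problem in the af_unix fix above can be sketched in user space. This is a simplified model under stated assumptions: the demo_* names and the tiny list implementation are hypothetical stand-ins for the kernel's list helpers and struct unix_vertex. The point it shows is that the inner walk uses its own temporary, so a pointer held by the caller is never left aimed at the list head.

```c
#include <stddef.h>

struct list_head { struct list_head *next, *prev; };

/* Same idea as the kernel's container_of(), written out for user space. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

struct demo_vertex {
	int index;
	struct list_head entry;
};

static void demo_list_init(struct list_head *h) { h->next = h->prev = h; }

static void demo_list_add_tail(struct list_head *n, struct list_head *h)
{
	n->prev = h->prev;
	n->next = h;
	h->prev->next = n;
	h->prev = n;
}

/*
 * Walk the list in reverse using a dedicated temporary. When the loop
 * ends, pos == scc (the head), which is not embedded in any vertex, so
 * converting it with container_of() would yield a bogus pointer. Keeping
 * the temporary local to the loop means nothing outside can observe it.
 */
static int demo_sum_reverse(struct list_head *scc)
{
	struct list_head *pos;
	int sum = 0;

	for (pos = scc->prev; pos != scc; pos = pos->prev) {
		struct demo_vertex *tmp =
			demo_container_of(pos, struct demo_vertex, entry);
		sum += tmp->index;
	}
	return sum;
}

/* Returns 1 if the caller's cursor still points at a real vertex afterwards. */
static int demo_outer_cursor_survives(void)
{
	struct list_head scc;
	struct demo_vertex a = { .index = 1 }, b = { .index = 2 };
	struct demo_vertex *vertex = &a;	/* the "outer" cursor */
	int sum;

	demo_list_init(&scc);
	demo_list_add_tail(&a.entry, &scc);
	demo_list_add_tail(&b.entry, &scc);
	sum = demo_sum_reverse(&scc);
	return sum == 3 && vertex == &a;
}
```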
2024-07-04bonding: Fix out-of-bounds read in bond_option_arp_ip_targets_set()Sam Sun1-3/+3
In function bond_option_arp_ip_targets_set(), if newval->string is an empty string, newval->string+1 will point to the byte after the string, causing an out-of-bound read. BUG: KASAN: slab-out-of-bounds in strlen+0x7d/0xa0 lib/string.c:418 Read of size 1 at addr ffff8881119c4781 by task syz-executor665/8107 CPU: 1 PID: 8107 Comm: syz-executor665 Not tainted 6.7.0-rc7 #1 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.15.0-1 04/01/2014 Call Trace: <TASK> __dump_stack lib/dump_stack.c:88 [inline] dump_stack_lvl+0xd9/0x150 lib/dump_stack.c:106 print_address_description mm/kasan/report.c:364 [inline] print_report+0xc1/0x5e0 mm/kasan/report.c:475 kasan_report+0xbe/0xf0 mm/kasan/report.c:588 strlen+0x7d/0xa0 lib/string.c:418 __fortify_strlen include/linux/fortify-string.h:210 [inline] in4_pton+0xa3/0x3f0 net/core/utils.c:130 bond_option_arp_ip_targets_set+0xc2/0x910 drivers/net/bonding/bond_options.c:1201 __bond_opt_set+0x2a4/0x1030 drivers/net/bonding/bond_options.c:767 __bond_opt_set_notify+0x48/0x150 drivers/net/bonding/bond_options.c:792 bond_opt_tryset_rtnl+0xda/0x160 drivers/net/bonding/bond_options.c:817 bonding_sysfs_store_option+0xa1/0x120 drivers/net/bonding/bond_sysfs.c:156 dev_attr_store+0x54/0x80 drivers/base/core.c:2366 sysfs_kf_write+0x114/0x170 fs/sysfs/file.c:136 kernfs_fop_write_iter+0x337/0x500 fs/kernfs/file.c:334 call_write_iter include/linux/fs.h:2020 [inline] new_sync_write fs/read_write.c:491 [inline] vfs_write+0x96a/0xd80 fs/read_write.c:584 ksys_write+0x122/0x250 fs/read_write.c:637 do_syscall_x64 arch/x86/entry/common.c:52 [inline] do_syscall_64+0x40/0x110 arch/x86/entry/common.c:83 entry_SYSCALL_64_after_hwframe+0x63/0x6b ---[ end trace ]--- Fix it by adding a check of string length before using it. 
Fixes: f9de11a16594 ("bonding: add ip checks when store ip target") Signed-off-by: Yue Sun <samsun1006219@gmail.com> Signed-off-by: Simon Horman <horms@kernel.org> Acked-by: Jay Vosburgh <jay.vosburgh@canonical.com> Reviewed-by: Hangbin Liu <liuhangbin@gmail.com> Link: https://patch.msgid.link/20240702-bond-oob-v6-1-2dfdba195c19@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
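A minimal sketch of the length check described in the bonding fix above, assuming a string whose first byte is a sign and whose remainder is an address. The helper name demo_target_addr() is hypothetical and the actual kernel check may be written differently.

```c
#include <stddef.h>
#include <string.h>

/*
 * Return a pointer to the address part of a "+addr" / "-addr" string,
 * or NULL if the input is unusable. For an empty string, s + 1 would
 * point one byte past the terminator, so the length is checked before
 * that pointer is ever formed or read.
 */
static const char *demo_target_addr(const char *s)
{
	if (strlen(s) < 1)
		return NULL;
	if (s[0] != '+' && s[0] != '-')
		return NULL;
	return s + 1;
}
```

Note that a lone "+" is still accepted by this sketch: s + 1 then points at the terminating NUL, which is inside the buffer, and it is left to the address parser to reject the empty address.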
2024-07-04net: rswitch: Avoid use-after-free in rswitch_poll()Radu Rendec1-2/+2
The use-after-free is actually in rswitch_tx_free(), which is inlined in rswitch_poll(). Since `skb` and `gq->skbs[gq->dirty]` are in fact the same pointer, the skb is first freed using dev_kfree_skb_any(), then the value in skb->len is used to update the interface statistics. Let's move around the instructions to use skb->len before the skb is freed. This bug is trivial to reproduce using KFENCE. It will trigger a splat every few packets. A simple ARP request or ICMP echo request is enough. Fixes: 271e015b9153 ("net: rswitch: Add unmap_addrs instead of dma address in each desc") Signed-off-by: Radu Rendec <rrendec@redhat.com> Reviewed-by: Yoshihiro Shimoda <yoshihiro.shimoda.uh@renesas.com> Reviewed-by: Niklas Söderlund <niklas.soderlund+renesas@ragnatech.se> Link: https://patch.msgid.link/20240702210838.2703228-1-rrendec@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
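The reordering described in the rswitch fix above, reading the length before the buffer is freed, can be sketched with ordinary heap memory. The demo_* types and functions are hypothetical and only model the ordering, not the driver's data structures.

```c
#include <stdlib.h>

struct demo_pkt { size_t len; };
struct demo_stats { unsigned long tx_packets, tx_bytes; };

/*
 * Account for the packet first, then free it. Swapping the last two
 * statements would read pkt->len from freed memory.
 */
static void demo_tx_free(struct demo_pkt *pkt, struct demo_stats *stats)
{
	stats->tx_packets++;
	stats->tx_bytes += pkt->len;	/* last use of pkt */
	free(pkt);
}

/* Allocate one packet of the given length, complete it, report the bytes counted. */
static unsigned long demo_tx_bytes_after_one(size_t len)
{
	struct demo_stats stats = { 0, 0 };
	struct demo_pkt *pkt = malloc(sizeof(*pkt));

	if (!pkt)
		return 0;
	pkt->len = len;
	demo_tx_free(pkt, &stats);
	return stats.tx_bytes;
}
```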
2024-07-04btrfs: fix folio refcount in __alloc_dummy_extent_buffer()Boris Burkov1-1/+1
Another improper use of __folio_put() in an error path after freshly allocating pages/folios which returns them with the refcount initialized to 1. The refactor from __free_pages() -> __folio_put() (instead of folio_put) removed a refcount decrement found in __free_pages() and folio_put but absent from __folio_put(). Fixes: 13df3775efca ("btrfs: cleanup metadata page pointer usage") CC: stable@vger.kernel.org # 6.8+ Tested-by: Ed Tomlinson <edtoml@gmail.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
2024-07-04btrfs: fix folio refcount in btrfs_do_encoded_write()Boris Burkov1-1/+1
The conversion to folios switched __free_page() to __folio_put() in the error path in btrfs_do_encoded_write(). However, this gets the page refcounting wrong. If we do hit that error path (I reproduced by modifying btrfs_do_encoded_write to pretend to always fail in a way that jumps to out_folios and running the fstests case btrfs/281), then we always hit the following BUG freeing the folio: BUG: Bad page state in process btrfs pfn:40ab0b page: refcount:1 mapcount:0 mapping:0000000000000000 index:0x61be5 pfn:0x40ab0b flags: 0x5ffff0000000000(node=0|zone=2|lastcpupid=0x1ffff) raw: 05ffff0000000000 0000000000000000 dead000000000122 0000000000000000 raw: 0000000000061be5 0000000000000000 00000001ffffffff 0000000000000000 page dumped because: nonzero _refcount Call Trace: <TASK> dump_stack_lvl+0x3d/0xe0 bad_page+0xea/0xf0 free_unref_page+0x8e1/0x900 ? __mem_cgroup_uncharge+0x69/0x90 __folio_put+0xe6/0x190 btrfs_do_encoded_write+0x445/0x780 ? current_time+0x25/0xd0 btrfs_do_write_iter+0x2cc/0x4b0 btrfs_ioctl_encoded_write+0x2b6/0x340 It turns out __free_page() decreases the page reference count while __folio_put() does not. Switch __folio_put() to folio_put() which decreases the folio reference count first. Fixes: 400b172b8cdc ("btrfs: compression: migrate compression/decompression paths to folios") Tested-by: Ed Tomlinson <edtoml@gmail.com> Reviewed-by: Qu Wenruo <wqu@suse.com> Reviewed-by: Filipe Manana <fdmanana@suse.com> Signed-off-by: Boris Burkov <boris@bur.io> Reviewed-by: David Sterba <dsterba@suse.com> Signed-off-by: David Sterba <dsterba@suse.com>
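The difference between the two calls in the btrfs fixes above can be modelled with a toy reference count. This is only a sketch of the semantics the commit message describes (one helper drops a reference before releasing, the other releases without dropping); the demo_* names are hypothetical and nothing here is the mm implementation.

```c
struct demo_folio { int refcount; int freed; };

/*
 * Models a low-level release that expects the count to be zero already.
 * Returning -1 stands in for the "nonzero _refcount" complaint in the
 * report quoted above.
 */
static int demo_release(struct demo_folio *f)
{
	if (f->refcount != 0)
		return -1;
	f->freed = 1;
	return 0;
}

/* Models a put: drop one reference, release on the last one. */
static int demo_put(struct demo_folio *f)
{
	if (--f->refcount == 0)
		return demo_release(f);
	return 0;
}

/* Error path as it was: a freshly allocated object (refcount 1) released directly. */
static int demo_error_path_wrong(void)
{
	struct demo_folio f = { .refcount = 1, .freed = 0 };

	return demo_release(&f);
}

/* Error path as fixed: drop the reference first, which then releases. */
static int demo_error_path_fixed(void)
{
	struct demo_folio f = { .refcount = 1, .freed = 0 };

	return demo_put(&f) == 0 && f.freed;
}
```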
2024-07-04netfilter: nf_tables: unconditionally flush pending work before notifierFlorian Westphal1-2/+1
syzbot reports: KASAN: slab-uaf in nft_ctx_update include/net/netfilter/nf_tables.h:1831 KASAN: slab-uaf in nft_commit_release net/netfilter/nf_tables_api.c:9530 KASAN: slab-uaf in nf_tables_trans_destroy_work+0x152b/0x1750 net/netfilter/nf_tables_api.c:9597 Read of size 2 at addr ffff88802b0051c4 by task kworker/1:1/45 [..] Workqueue: events nf_tables_trans_destroy_work Call Trace: nft_ctx_update include/net/netfilter/nf_tables.h:1831 [inline] nft_commit_release net/netfilter/nf_tables_api.c:9530 [inline] nf_tables_trans_destroy_work+0x152b/0x1750 net/netfilter/nf_tables_api.c:9597 The problem is that the notifier does a conditional flush, but it's possible that the table-to-be-removed is still referenced by transactions being processed by the worker, so we need to flush unconditionally. We could make the flush_work depend on whether we found a table to delete in nf-next to avoid the flush for most cases. AFAICS this problem is only exposed in nf-next, with commit e169285f8c56 ("netfilter: nf_tables: do not store nft_ctx in transaction objects"); with this commit applied there is an unconditional fetch of table->family, which is what's triggering the above splat. Fixes: 2c9f0293280e ("netfilter: nf_tables: flush pending destroy work before netlink notifier") Reported-and-tested-by: syzbot+4fd66a69358fc15ae2ad@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=4fd66a69358fc15ae2ad Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
2024-07-04i2c: pnx: Fix potential deadlock warning from del_timer_sync() call in isrPiotr Wojtaszczyk1-38/+10
When del_timer_sync() is called in an interrupt context it throws a warning because of a potential deadlock. The timer is used only to exit from wait_for_completion() after a timeout, so replacing the call with wait_for_completion_timeout() allows the problematic timer and its related functions to be removed altogether. Fixes: 41561f28e76a ("i2c: New Philips PNX bus driver") Signed-off-by: Piotr Wojtaszczyk <piotr.wojtaszczyk@timesys.com> Signed-off-by: Andi Shyti <andi.shyti@kernel.org>
2024-07-03tracing: Have memmapped ring buffer use ioctl of "R" range 0x20-2FSteven Rostedt (Google)2-1/+2
To prevent conflicts with other ioctl numbers and to allow strace to have an idea of what is happening, move the ioctls for the trace buffer mapping from _IO("T", 0x1) to the "R" range 0x20 - 0x2F. Link: https://lore.kernel.org/linux-trace-kernel/20240630105322.GA17573@altlinux.org/ Link: https://lore.kernel.org/linux-trace-kernel/20240630213626.GA23566@altlinux.org/ Cc: Jonathan Corbet <corbet@lwn.net> Fixes: cf9f0f7c4c5bb ("tracing: Allow user-space mapping of the ring-buffer") Link: https://lore.kernel.org/20240702153354.367861db@rorschach.local.home Reported-by: "Dmitry V. Levin" <ldv@strace.io> Reviewed-by: Mathieu Desnoyers <mathieu.desnoyers@efficios.com> Acked-by: Masami Hiramatsu (Google) <mhiramat@kernel.org> Signed-off-by: Steven Rostedt (Google) <rostedt@goodmis.org>
2024-07-03riscv: kexec: Avoid deadlock in kexec crash pathSong Shuai1-9/+1
If the kexec crash code is called in the interrupt context, the machine_kexec_mask_interrupts() function will trigger a deadlock while trying to acquire the irqdesc spinlock and then deactivate irqchip in irq_set_irqchip_state() function. Unlike arm64, riscv only requires irq_eoi handler to complete EOI and keeping irq_set_irqchip_state() will only leave this possible deadlock without any use. So we simply remove it. Link: https://lore.kernel.org/linux-riscv/20231208111015.173237-1-songshuaishuai@tinylab.org/ Fixes: b17d19a5314a ("riscv: kexec: Fixup irq controller broken in kexec crash path") Signed-off-by: Song Shuai <songshuaishuai@tinylab.org> Reviewed-by: Ryo Takakura <takakura@valinux.co.jp> Link: https://lore.kernel.org/r/20240626023316.539971-1-songshuaishuai@tinylab.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-03riscv: stacktrace: fix usage of ftrace_graph_ret_addr()Puranjay Mohan1-1/+2
ftrace_graph_ret_addr() takes an `idx` integer pointer that is used to optimize the stack unwinding. Pass it a valid pointer to utilize the optimizations that might be available in the future. The commit is making riscv's usage of ftrace_graph_ret_addr() match x86_64. Signed-off-by: Puranjay Mohan <puranjay@kernel.org> Reviewed-by: Steven Rostedt (Google) <rostedt@goodmis.org> Link: https://lore.kernel.org/r/20240618145820.62112-1-puranjay@kernel.org Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-03riscv: selftests: Fix vsetivli args for clangCharlie Jenkins1-1/+1
Clang does not support implicit LMUL in the vset* instruction sequences. Introduce an explicit LMUL in the vsetivli instruction. Signed-off-by: Charlie Jenkins <charlie@rivosinc.com> Fixes: 9d5328eeb185 ("riscv: selftests: Add signal handling vector tests") Link: https://lore.kernel.org/r/20240702-fix_sigreturn_test-v1-1-485f88a80612@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-03perf: RISC-V: Check standard event availabilitySamuel Holland2-3/+41
The RISC-V SBI PMU specification defines several standard hardware and cache events. Currently, all of these events are exposed to userspace, even when not actually implemented. They appear in the `perf list` output, and commands like `perf stat` try to use them. This is more than just a cosmetic issue, because the PMU driver's .add function fails for these events, which causes pmu_groups_sched_in() to prematurely stop scheduling in other (possibly valid) hardware events. Add logic to check which events are supported by the hardware (i.e. can be mapped to some counter), so only usable events are reported to userspace. Since the kernel does not know the mapping between events and possible counters, this check must happen during boot, when no counters are in use. Make the check asynchronous to minimize impact on boot time. Fixes: e9991434596f ("RISC-V: Add perf platform driver based on SBI PMU extension") Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Reviewed-by: Atish Patra <atishp@rivosinc.com> Tested-by: Atish Patra <atishp@rivosinc.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240628-misc_perf_fixes-v4-3-e01cfddcf035@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-03drivers/perf: riscv: Reset the counter to hpmevent mapping while starting cpusSamuel Holland1-1/+1
Currently, we stop all the counters while a new cpu is brought online. However, the hpmevent to counter mappings are not reset. The firmware may have some stale encoding in its mapping structure, which may lead to undesirable results. We have not encountered such a scenario, though. Signed-off-by: Samuel Holland <samuel.holland@sifive.com> Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240628-misc_perf_fixes-v4-2-e01cfddcf035@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-03drivers/perf: riscv: Do not update the event data if uptodateAtish Patra1-1/+1
In case of a counter overflow, the event data may get corrupted if called from an external overflow handler. This happens because we can't update the counter without starting it when the SBI PMU extension is in use. However, the prev_count has already been updated at the first pass while the counter value is still the old one. The solution is simple: we don't need to update it again if it is already updated, which can be detected using the hwc state. The event state in the overflow handler is updated in the following patch. Thus, this fix can't be backported to kernel versions where overflow support was added. Fixes: a8625217a054 ("drivers/perf: riscv: Implement SBI PMU snapshot function") Closes: https://lore.kernel.org/all/CC51D53B-846C-4D81-86FC-FBF969D0A0D6@pku.edu.cn/ Reported-by: garthlei@pku.edu.cn Signed-off-by: Atish Patra <atishp@rivosinc.com> Link: https://lore.kernel.org/r/20240628-misc_perf_fixes-v4-1-e01cfddcf035@rivosinc.com Signed-off-by: Palmer Dabbelt <palmer@rivosinc.com>
2024-07-03nilfs2: fix incorrect inode allocation from reserved inodesRyusuke Konishi4-12/+20
If the bitmap block that manages the inode allocation status is corrupted, nilfs_ifile_create_inode() may allocate a new inode from the reserved inode area where it should not be allocated. A previous fix, commit d325dc6eb763 ("nilfs2: fix use-after-free bug of struct nilfs_root"), addressed the problem that reserved inodes with inode numbers less than NILFS_USER_INO (=11) were incorrectly reallocated due to bitmap corruption. However, the start number of non-reserved inodes is read from the super block and may change, in which case inode allocation may still occur from the extended reserved inode area. If that happens, access to that inode will cause an IO error, causing the file system to degrade to an error state. Fix this potential issue by adding a wraparound option to the common metadata object allocation routine and by modifying nilfs_ifile_create_inode() to disable the option so that it only allocates inodes with inode numbers greater than or equal to the inode number read in "nilfs->ns_first_ino", regardless of the bitmap status of reserved inodes. Link: https://lkml.kernel.org/r/20240623051135.4180-4-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03nilfs2: add missing check for inode numbers on directory entriesRyusuke Konishi2-0/+11
Syzbot reported that mounting and unmounting a specific pattern of corrupted nilfs2 filesystem images causes a use-after-free of metadata file inodes, which triggers a kernel bug in lru_add_fn(). As Jan Kara pointed out, this is because the link count of a metadata file gets corrupted to 0, and nilfs_evict_inode(), which is called from iput(), tries to delete that inode (ifile inode in this case). The inconsistency occurs because directories containing the inode numbers of these metadata files that should not be visible in the namespace are read without checking. Fix this issue by treating the inode numbers of these internal files as errors in the sanity check helper when reading directory folios/pages. Also thanks to Hillf Danton and Matthew Wilcox for their initial mm-layer analysis. Link: https://lkml.kernel.org/r/20240623051135.4180-3-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Reported-by: syzbot+d79afb004be235636ee8@syzkaller.appspotmail.com Closes: https://syzkaller.appspot.com/bug?extid=d79afb004be235636ee8 Reported-by: Jan Kara <jack@suse.cz> Closes: https://lkml.kernel.org/r/20240617075758.wewhukbrjod5fp5o@quack3 Tested-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
2024-07-03nilfs2: fix inode number range checksRyusuke Konishi3-3/+10
Patch series "nilfs2: fix potential issues related to reserved inodes". This series fixes one use-after-free issue reported by syzbot, caused by nilfs2's internal inode being exposed in the namespace on a corrupted filesystem, and a couple of flaws that cause problems if the starting number of non-reserved inodes written in the on-disk super block is intentionally (or corruptly) changed from its default value. This patch (of 3): In the current implementation of nilfs2, "nilfs->ns_first_ino", which gives the first non-reserved inode number, is read from the superblock, but its lower limit is not checked. As a result, if a number that overlaps with the inode number range of reserved inodes such as the root directory or metadata files is set in the super block parameter, the inode number test macros (NILFS_MDT_INODE and NILFS_VALID_INODE) will not function properly. In addition, these test macros use left bit-shift calculations with the inode number as the shift count via the BIT macro, but the result of a shift calculation that exceeds the bit width of an integer is undefined in the C specification, so if "ns_first_ino" is set to a large value other than the default value NILFS_USER_INO (=11), the macros may potentially malfunction depending on the environment. Fix these issues by checking the lower bound of "nilfs->ns_first_ino" and by preventing bit shifts equal to or greater than the NILFS_USER_INO constant in the inode number test macros. Also, change the type of "ns_first_ino" from signed integer to unsigned integer to avoid the need for type casting in comparisons such as the lower bound check introduced this time.
Link: https://lkml.kernel.org/r/20240623051135.4180-1-konishi.ryusuke@gmail.com Link: https://lkml.kernel.org/r/20240623051135.4180-2-konishi.ryusuke@gmail.com Signed-off-by: Ryusuke Konishi <konishi.ryusuke@gmail.com> Cc: Hillf Danton <hdanton@sina.com> Cc: Jan Kara <jack@suse.cz> Cc: Matthew Wilcox (Oracle) <willy@infradead.org> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
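The shift-count guard described in the nilfs2 range-check patch above can be sketched as follows. The constant DEMO_USER_INO mirrors the value 11 given in the commit message, but DEMO_RESERVED_MASK and demo_is_reserved_ino() are hypothetical and do not reproduce the real NILFS_MDT_INODE or NILFS_VALID_INODE macros.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEMO_USER_INO		11u			/* first non-reserved inode number */
#define DEMO_RESERVED_MASK	((1u << 2) | (1u << 3))	/* hypothetical set of reserved inodes */

/*
 * Test membership in a small bitmask of reserved inode numbers. The range
 * check comes first, so the shift count is always below 11; shifting a
 * 32-bit value by 32 or more would be undefined behavior in C.
 */
static bool demo_is_reserved_ino(uint64_t ino)
{
	return ino < DEMO_USER_INO && (DEMO_RESERVED_MASK & (1u << ino)) != 0;
}
```

Because `&&` short-circuits, the shift is never evaluated for out-of-range inode numbers.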
2024-07-03mm: avoid overflows in dirty throttling logicJan Kara1-4/+26
The dirty throttling logic is interspersed with assumptions that dirty limits in PAGE_SIZE units fit into 32 bits (so that various multiplications fit into 64 bits). If limits end up being larger, we will hit overflows, possible divisions by 0, etc. Fix these problems by never allowing such large dirty limits, as they have dubious practical value anyway. For the dirty_bytes / dirty_background_bytes interfaces we can just refuse to set such large limits. For dirty_ratio / dirty_background_ratio it isn't so simple, as the dirty limit is computed from the amount of available memory, which can change due to memory hotplug etc. So when converting dirty limits from ratios to numbers of pages, we just don't allow the result to exceed UINT_MAX. This is a root-only triggerable problem which occurs when the operator sets dirty limits to >16 TB. Link: https://lkml.kernel.org/r/20240621144246.11148-2-jack@suse.cz Signed-off-by: Jan Kara <jack@suse.cz> Reported-by: Zach O'Keefe <zokeefe@google.com> Reviewed-By: Zach O'Keefe <zokeefe@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
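The clamp on the ratio-to-pages conversion described above can be sketched as below. demo_ratio_to_pages() is a hypothetical helper, not the function in mm/page-writeback.c. For scale, assuming 4 KiB pages, 2^32 pages is 16 TiB, which matches the ">16 TB" threshold mentioned in the commit message.

```c
#include <limits.h>
#include <stdint.h>

/*
 * Convert a percentage of available pages into a page count, capped at
 * UINT_MAX so that later 32-bit-by-32-bit multiplications fit in 64 bits.
 * ratio_percent is assumed to be in the range 0..100.
 */
static uint64_t demo_ratio_to_pages(uint64_t available_pages,
				    unsigned int ratio_percent)
{
	/* Divide first so the multiplication cannot overflow 64 bits. */
	uint64_t thresh = available_pages / 100 * ratio_percent;

	if (thresh > UINT_MAX)
		thresh = UINT_MAX;
	return thresh;
}
```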
2024-07-03Revert "mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again"Jan Kara1-1/+1
Patch series "mm: Avoid possible overflows in dirty throttling". Dirty throttling logic assumes dirty limits in page units fit into 32-bits. This patch series makes sure this is true (see patch 2/2 for more details). This patch (of 2): This reverts commit 9319b647902cbd5cc884ac08a8a6d54ce111fc78. The commit is broken in several ways. Firstly, the removed (u64) cast from the multiplication will introduce a multiplication overflow on 32-bit archs if wb_thresh * bg_thresh >= 1<<32 (which is actually common - the default settings with 4GB of RAM will trigger this). Secondly, the div64_u64() is unnecessarily expensive on 32-bit archs. We have div64_ul() in case we want to be safe & cheap. Thirdly, if dirty thresholds are larger than 1<<32 pages, then dirty balancing is going to blow up in many other spectacular ways anyway so trying to fix one possible overflow is just moot. Link: https://lkml.kernel.org/r/20240621144017.30993-1-jack@suse.cz Link: https://lkml.kernel.org/r/20240621144246.11148-1-jack@suse.cz Fixes: 9319b647902c ("mm/writeback: fix possible divide-by-zero in wb_dirty_limits(), again") Signed-off-by: Jan Kara <jack@suse.cz> Reviewed-By: Zach O'Keefe <zokeefe@google.com> Cc: <stable@vger.kernel.org> Signed-off-by: Andrew Morton <akpm@linux-foundation.org>
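The effect of dropping the (u64) cast, the first problem listed in the revert above, can be shown with fixed-width types. This sketch uses uint32_t to stand in for a 32-bit unsigned long; the function names are hypothetical.

```c
#include <stdint.h>

/*
 * Without a widening cast the product is computed in 32 bits and wraps
 * modulo 2^32 before it is converted to the 64-bit return type.
 */
static uint64_t demo_mul_truncating(uint32_t a, uint32_t b)
{
	return a * b;
}

/* Casting one operand first makes the whole multiplication 64-bit. */
static uint64_t demo_mul_widened(uint32_t a, uint32_t b)
{
	return (uint64_t)a * b;
}
```

With both inputs equal to 65536, the true product is exactly 2^32: the truncating version yields 0 and the widened one yields 4294967296.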