summaryrefslogtreecommitdiffstats
path: root/net/sunrpc/svcsock.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2023-08-181-0/+3
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Cross-merge networking fixes after downstream PR. Conflicts: drivers/net/ethernet/sfc/tc.c fa165e194997 ("sfc: don't unregister flow_indr if it was never registered") 3bf969e88ada ("sfc: add MAE table machinery for conntrack table") https://lore.kernel.org/all/20230818112159.7430e9b4@canb.auug.org.au/ No adjacent changes. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * sunrpc: set the bv_offset of first bvec in svc_tcp_sendmsgJeff Layton2023-08-141-0/+3
| | | | | | | | | | | | | | | | | | | | | | svc_tcp_sendmsg used to factor in the xdr->page_base when sending pages, but commit 5df5dd03a8f7 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage") dropped that part of the handling. Fix it by setting the bv_offset of the first bvec. Fixes: 5df5dd03a8f7 ("sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpage") Signed-off-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | SUNRPC: Use new helpers to handle TLS AlertsChuck Lever2023-07-281-22/+25
| | | | | | | | | | | | | | | | | | | | Use the helpers to parse the level and description fields in incoming alerts. "Warning" alerts are discarded, and "fatal" alerts mean the session is no longer valid. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047944747.5241.1974889594004407123.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | SUNRPC: Send TLS Closure alerts before closing a TCP socketChuck Lever2023-07-281-0/+2
| | | | | | | | | | | | | | | | | | | | Before closing a TCP connection, the TLS protocol wants peers to send session close Alert notifications. Add those in both the RPC client and server. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047939404.5241.14392506226409865832.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net/tls: Move TLS protocol elements to a separate headerChuck Lever2023-07-281-0/+1
|/ | | | | | | | | | | | | Kernel TLS consumers will need definitions of various parts of the TLS protocol, but often do not need the function declarations and other infrastructure provided in <net/tls.h>. Break out existing standardized protocol elements into a separate header, and make room for a few more elements in subsequent patches. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Link: https://lore.kernel.org/r/169047931374.5241.7713175865185969309.stgit@oracle-102.nfsv4bat.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* Merge tag 'net-next-6.5' of ↵Linus Torvalds2023-06-291-26/+12
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking changes from Jakub Kicinski: "WiFi 7 and sendpage changes are the biggest pieces of work for this release. The latter will definitely require fixes but I think that we got it to a reasonable point. Core: - Rework the sendpage & splice implementations Instead of feeding data into sockets page by page extend sendmsg handlers to support taking a reference on the data, controlled by a new flag called MSG_SPLICE_PAGES Rework the handling of unexpected-end-of-file to invoke an additional callback instead of trying to predict what the right combination of MORE/NOTLAST flags is Remove the MSG_SENDPAGE_NOTLAST flag completely - Implement SCM_PIDFD, a new type of CMSG type analogous to SCM_CREDENTIALS, but it contains pidfd instead of plain pid - Enable socket busy polling with CONFIG_RT - Improve reliability and efficiency of reporting for ref_tracker - Auto-generate a user space C library for various Netlink families Protocols: - Allow TCP to shrink the advertised window when necessary, prevent sk_rcvbuf auto-tuning from growing the window all the way up to tcp_rmem[2] - Use per-VMA locking for "page-flipping" TCP receive zerocopy - Prepare TCP for device-to-device data transfers, by making sure that payloads are always attached to skbs as page frags - Make the backoff time for the first N TCP SYN retransmissions linear. Exponential backoff is unnecessarily conservative - Create a new MPTCP getsockopt to retrieve all info (MPTCP_FULL_INFO) - Avoid waking up applications using TLS sockets until we have a full record - Allow using kernel memory for protocol ioctl callbacks, paving the way to issuing ioctls over io_uring - Add nolocalbypass option to VxLAN, forcing packets to be fully encapsulated even if they are destined for a local IP address - Make TCPv4 use consistent hash in TIME_WAIT and SYN_RECV. Ensure in-kernel ECMP implementation (e.g. Open vSwitch) select the same link for all packets. Support L4 symmetric hashing in Open vSwitch - PPPoE: make number of hash bits configurable - Allow DNS to be overwritten by DHCPACK in the in-kernel DHCP client (ipconfig) - Add layer 2 miss indication and filtering, allowing higher layers (e.g. ACL filters) to make forwarding decisions based on whether packet matched forwarding state in lower devices (bridge) - Support matching on Connectivity Fault Management (CFM) packets - Hide the "link becomes ready" IPv6 messages by demoting their printk level to debug - HSR: don't enable promiscuous mode if device offloads the proto - Support active scanning in IEEE 802.15.4 - Continue work on Multi-Link Operation for WiFi 7 BPF: - Add precision propagation for subprogs and callbacks. This allows maintaining verification efficiency when subprograms are used, or in fact passing the verifier at all for complex programs, especially those using open-coded iterators - Improve BPF's {g,s}setsockopt() length handling. Previously BPF assumed the length is always equal to the amount of written data. But some protos allow passing a NULL buffer to discover what the output buffer *should* be, without writing anything - Accept dynptr memory as memory arguments passed to helpers - Add routing table ID to bpf_fib_lookup BPF helper - Support O_PATH FDs in BPF_OBJ_PIN and BPF_OBJ_GET commands - Drop bpf_capable() check in BPF_MAP_FREEZE command (used to mark maps as read-only) - Show target_{obj,btf}_id in tracing link fdinfo - Addition of several new kfuncs (most of the names are self-explanatory): - Add a set of new dynptr kfuncs: bpf_dynptr_adjust(), bpf_dynptr_is_null(), bpf_dynptr_is_rdonly(), bpf_dynptr_size() and bpf_dynptr_clone(). - bpf_task_under_cgroup() - bpf_sock_destroy() - force closing sockets - bpf_cpumask_first_and(), rework bpf_cpumask_any*() kfuncs Netfilter: - Relax set/map validation checks in nf_tables. Allow checking presence of an entry in a map without using the value - Increase ip_vs_conn_tab_bits range for 64BIT builds - Allow updating size of a set - Improve NAT tuple selection when connection is closing Driver API: - Integrate netdev with LED subsystem, to allow configuring HW "offloaded" blinking of LEDs based on link state and activity (i.e. packets coming in and out) - Support configuring rate selection pins of SFP modules - Factor Clause 73 auto-negotiation code out of the drivers, provide common helper routines - Add more fool-proof helpers for managing lifetime of MDIO devices associated with the PCS layer - Allow drivers to report advanced statistics related to Time Aware scheduler offload (taprio) - Allow opting out of VF statistics in link dump, to allow more VFs to fit into the message - Split devlink instance and devlink port operations New hardware / drivers: - Ethernet: - Synopsys EMAC4 IP support (stmmac) - Marvell 88E6361 8 port (5x1GE + 3x2.5GE) switches - Marvell 88E6250 7 port switches - Microchip LAN8650/1 Rev.B0 PHYs - MediaTek MT7981/MT7988 built-in 1GE PHY driver - WiFi: - Realtek RTL8192FU, 2.4 GHz, b/g/n mode, 2T2R, 300 Mbps - Realtek RTL8723DS (SDIO variant) - Realtek RTL8851BE - CAN: - Fintek F81604 Drivers: - Ethernet NICs: - Intel (100G, ice): - support dynamic interrupt allocation - use meta data match instead of VF MAC addr on slow-path - nVidia/Mellanox: - extend link aggregation to handle 4, rather than just 2 ports - spawn sub-functions without any features by default - OcteonTX2: - support HTB (Tx scheduling/QoS) offload - make RSS hash generation configurable - support selecting Rx queue using TC filters - Wangxun (ngbe/txgbe): - add basic Tx/Rx packet offloads - add phylink support (SFP/PCS control) - Freescale/NXP (enetc): - report TAPRIO packet statistics - Solarflare/AMD: - support matching on IP ToS and UDP source port of outer header - VxLAN and GENEVE tunnel encapsulation over IPv4 or IPv6 - add devlink dev info support for EF10 - Virtual NICs: - Microsoft vNIC: - size the Rx indirection table based on requested configuration - support VLAN tagging - Amazon vNIC: - try to reuse Rx buffers if not fully consumed, useful for ARM servers running with 16kB pages - Google vNIC: - support TCP segmentation of >64kB frames - Ethernet embedded switches: - Marvell (mv88e6xxx): - enable USXGMII (88E6191X) - Microchip: - lan966x: add support for Egress Stage 0 ACL engine - lan966x: support mapping packet priority to internal switch priority (based on PCP or DSCP) - Ethernet PHYs: - Broadcom PHYs: - support for Wake-on-LAN for BCM54210E/B50212E - report LPI counter - Microsemi PHYs: support RGMII delay configuration (VSC85xx) - Micrel PHYs: receive timestamp in the frame (LAN8841) - Realtek PHYs: support optional external PHY clock - Altera TSE PCS: merge the driver into Lynx PCS which it is a variant of - CAN: Kvaser PCIEcan: - support packet timestamping - WiFi: - Intel (iwlwifi): - major update for new firmware and Multi-Link Operation (MLO) - configuration rework to drop test devices and split the different families - support for segmented PNVM images and power tables - new vendor entries for PPAG (platform antenna gain) feature - Qualcomm 802.11ax (ath11k): - Multiple Basic Service Set Identifier (MBSSID) and Enhanced MBSSID Advertisement (EMA) support in AP mode - support factory test mode - RealTek (rtw89): - add RSSI based antenna diversity - support U-NII-4 channels on 5 GHz band - RealTek (rtl8xxxu): - AP mode support for 8188f - support USB RX aggregation for the newer chips" * tag 'net-next-6.5' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1602 commits) net: scm: introduce and use scm_recv_unix helper af_unix: Skip SCM_PIDFD if scm->pid is NULL. net: lan743x: Simplify comparison netlink: Add __sock_i_ino() for __netlink_diag_dump(). net: dsa: avoid suspicious RCU usage for synced VLAN-aware MAC addresses Revert "af_unix: Call scm_recv() only after scm_set_cred()." phylink: ReST-ify the phylink_pcs_neg_mode() kdoc libceph: Partially revert changes to support MSG_SPLICE_PAGES net: phy: mscc: fix packet loss due to RGMII delays net: mana: use vmalloc_array and vcalloc net: enetc: use vmalloc_array and vcalloc ionic: use vmalloc_array and vcalloc pds_core: use vmalloc_array and vcalloc gve: use vmalloc_array and vcalloc octeon_ep: use vmalloc_array and vcalloc net: usb: qmi_wwan: add u-blox 0x1312 composition perf trace: fix MSG_SPLICE_PAGES build error ipvlan: Fix return value of ipvlan_queue_xmit() netfilter: nf_tables: fix underflow in chain reference counter netfilter: nf_tables: unbind non-anonymous set if rule construction fails ...
| * sunrpc: Use sendmsg(MSG_SPLICE_PAGES) rather then sendpageDavid Howells2023-06-131-26/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When transmitting data, call down into TCP using sendmsg with MSG_SPLICE_PAGES to indicate that content should be spliced rather than performing sendpage calls to transmit header, data pages and trailer. Signed-off-by: David Howells <dhowells@redhat.com> Acked-by: Chuck Lever <chuck.lever@oracle.com> cc: Trond Myklebust <trond.myklebust@hammerspace.com> cc: Anna Schumaker <anna@kernel.org> cc: Jeff Layton <jlayton@kernel.org> cc: Jens Axboe <axboe@kernel.dk> cc: Matthew Wilcox <willy@infradead.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | SUNRPC: Trace struct svc_sock lifetime eventsChuck Lever2023-06-051-1/+3
| | | | | | | | | | | | | | | | | | Capture a timestamp and pointer address during the creation and destruction of struct svc_sock to record its lifetime. This helps to diagnose transport reference counting issues. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | SUNRPC: Improve observability in svc_tcp_accept()Chuck Lever2023-06-051-7/+2
| | | | | | | | | | | | | | | | | | | | | | | | The -ENOMEM arm could fire repeatedly if the system runs low on memory, so remove it. Don't bother to trace -EAGAIN error events, since those fire after a listener is created (with no work done) and once again after an accept has been handled successfully (again, with no work done). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | SUNRPC: Fix an incorrect commentChuck Lever2023-06-051-1/+1
| | | | | | | | | | | | | | The correct function name is svc_tcp_listen_data_ready(). Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | SUNRPC: Fix UAF in svc_tcp_listen_data_ready()Ding Hui2023-06-051-12/+11
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After the listener svc_sock is freed, and before invoking svc_tcp_accept() for the established child sock, there is a window that the newsock retaining a freed listener svc_sock in sk_user_data which cloning from parent. In the race window, if data is received on the newsock, we will observe use-after-free report in svc_tcp_listen_data_ready(). Reproduce by two tasks: 1. while :; do rpc.nfsd 0 ; rpc.nfsd; done 2. while :; do echo "" | ncat -4 127.0.0.1 2049 ; done KASAN report: ================================================================== BUG: KASAN: slab-use-after-free in svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc] Read of size 8 at addr ffff888139d96228 by task nc/102553 CPU: 7 PID: 102553 Comm: nc Not tainted 6.3.0+ #18 Hardware name: VMware, Inc. VMware Virtual Platform/440BX Desktop Reference Platform, BIOS 6.00 11/12/2020 Call Trace: <IRQ> dump_stack_lvl+0x33/0x50 print_address_description.constprop.0+0x27/0x310 print_report+0x3e/0x70 kasan_report+0xae/0xe0 svc_tcp_listen_data_ready+0x1cf/0x1f0 [sunrpc] tcp_data_queue+0x9f4/0x20e0 tcp_rcv_established+0x666/0x1f60 tcp_v4_do_rcv+0x51c/0x850 tcp_v4_rcv+0x23fc/0x2e80 ip_protocol_deliver_rcu+0x62/0x300 ip_local_deliver_finish+0x267/0x350 ip_local_deliver+0x18b/0x2d0 ip_rcv+0x2fb/0x370 __netif_receive_skb_one_core+0x166/0x1b0 process_backlog+0x24c/0x5e0 __napi_poll+0xa2/0x500 net_rx_action+0x854/0xc90 __do_softirq+0x1bb/0x5de do_softirq+0xcb/0x100 </IRQ> <TASK> ... </TASK> Allocated by task 102371: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 __kasan_kmalloc+0x7b/0x90 svc_setup_socket+0x52/0x4f0 [sunrpc] svc_addsock+0x20d/0x400 [sunrpc] __write_ports_addfd+0x209/0x390 [nfsd] write_ports+0x239/0x2c0 [nfsd] nfsctl_transaction_write+0xac/0x110 [nfsd] vfs_write+0x1c3/0xae0 ksys_write+0xed/0x1c0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Freed by task 102551: kasan_save_stack+0x1e/0x40 kasan_set_track+0x21/0x30 kasan_save_free_info+0x2a/0x50 __kasan_slab_free+0x106/0x190 __kmem_cache_free+0x133/0x270 svc_xprt_free+0x1e2/0x350 [sunrpc] svc_xprt_destroy_all+0x25a/0x440 [sunrpc] nfsd_put+0x125/0x240 [nfsd] nfsd_svc+0x2cb/0x3c0 [nfsd] write_threads+0x1ac/0x2a0 [nfsd] nfsctl_transaction_write+0xac/0x110 [nfsd] vfs_write+0x1c3/0xae0 ksys_write+0xed/0x1c0 do_syscall_64+0x38/0x90 entry_SYSCALL_64_after_hwframe+0x72/0xdc Fix the UAF by simply doing nothing in svc_tcp_listen_data_ready() if state != TCP_LISTEN, that will avoid dereferencing svsk for all child socket. Link: https://lore.kernel.org/lkml/20230507091131.23540-1-dinghui@sangfor.com.cn/ Fixes: fa9251afc33c ("SUNRPC: Call the default socket callbacks instead of open coding") Signed-off-by: Ding Hui <dinghui@sangfor.com.cn> Cc: <stable@vger.kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* nfsd: fix double fget() bug in __write_ports_addfd()Dan Carpenter2023-05-311-18/+6
| | | | | | | | | | | | | | | The bug here is that you cannot rely on getting the same socket from multiple calls to fget() because userspace can influence that. This is a kind of double fetch bug. The fix is to delete the svc_alien_sock() function and instead do the checking inside the svc_addsock() function. Fixes: 3064639423c4 ("nfsd: check passed socket's net matches NFSd superblock's one") Signed-off-by: Dan Carpenter <dan.carpenter@linaro.org> Reviewed-by: NeilBrown <neilb@suse.com> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: always free ctxt when freeing deferred requestNeilBrown2023-05-141-14/+16
| | | | | | | | | | | | | | | | | | | | | | | | Since the ->xprt_ctxt pointer was added to svc_deferred_req, it has not been sufficient to use kfree() to free a deferred request. We may need to free the ctxt as well. As freeing the ctxt is all that ->xpo_release_rqst() does, we repurpose it to explicit do that even when the ctxt is not stored in an rqst. So we now have ->xpo_release_ctxt() which is given an xprt and a ctxt, which may have been taken either from an rqst or from a dreq. The caller is now responsible for clearing that pointer after the call to ->xpo_release_ctxt. We also clear dr->xprt_ctxt when the ctxt is moved into a new rqst when revisiting a deferred request. This ensures there is only one pointer to the ctxt, so the risk of double freeing in future is reduced. The new code in svc_xprt_release which releases both the ctxt and any rq_deferred depends on this. Fixes: 773f91b2cf3f ("SUNRPC: Fix NFSD's request deferral on RDMA transports") Signed-off-by: NeilBrown <neilb@suse.de> Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Fix error handling in svc_setup_socket()Chuck Lever2023-05-061-12/+4
| | | | | | | | | | | | | | Dan points out that sock_alloc_file() releases @sock on error, but so do all of svc_setup_socket's callers, resulting in a double- release if sock_alloc_file() returns an error. Rather than allocating a struct file for all new sockets, allocate one only for sockets created during a TCP accept. For the moment, those are the only ones that will ever be used with RPC-with-TLS. Reported-by: Dan Carpenter <dan.carpenter@linaro.org> Fixes: ae0d77708aae ("SUNRPC: Ensure server-side sockets have a sock->file") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Support TLS handshake in the server-side TCP socket codeChuck Lever2023-04-281-3/+98
| | | | | | | | | | | This patch adds opportunitistic RPC-with-TLS to the Linux in-kernel NFS server. If the client requests RPC-with-TLS and the user space handshake agent is running, the server will set up a TLS session. There are no policy settings yet. For example, the server cannot yet require the use of RPC-with-TLS to access its data. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Recognize control messages in server-side TCP socket codeChuck Lever2023-04-281-2/+46
| | | | | | | | To support kTLS, the server-side TCP socket receive path needs to watch for CMSGs. Acked-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Ensure server-side sockets have a sock->fileChuck Lever2023-04-261-7/+18
| | | | | | | | The TLS handshake upcall mechanism requires a non-NULL sock->file on the socket it hands to user space. svc_sock_free() already releases sock->file properly if one exists. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* Merge tag 'nfsd-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds2023-02-221-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Chuck Lever: "Two significant security enhancements are part of this release: - NFSD's RPC header encoding and decoding, including RPCSEC GSS and gssproxy header parsing, has been overhauled to make it more memory-safe. - Support for Kerberos AES-SHA2-based encryption types has been added for both the NFS client and server. This provides a clean path for deprecating and removing insecure encryption types based on DES and SHA-1. AES-SHA2 is also FIPS-140 compliant, so that NFS with Kerberos may now be used on systems with fips enabled. In addition to these, NFSD is now able to handle crossing into an auto-mounted mount point on an exported NFS mount. A number of fixes have been made to NFSD's server-side copy implementation. RPC metrics have been converted to per-CPU variables. This helps reduce unnecessary cross-CPU and cross-node memory bus traffic, and significantly reduces noise when KCSAN is enabled" * tag 'nfsd-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (121 commits) NFSD: Clean up nfsd_symlink() NFSD: copy the whole verifier in nfsd_copy_write_verifier nfsd: don't fsync nfsd_files on last close SUNRPC: Fix occasional warning when destroying gss_krb5_enctypes nfsd: fix courtesy client with deny mode handling in nfs4_upgrade_open NFSD: fix problems with cleanup on errors in nfsd4_copy nfsd: fix race to check ls_layouts nfsd: don't hand out delegation on setuid files being opened for write SUNRPC: Remove ->xpo_secure_port() SUNRPC: Clean up the svc_xprt_flags() macro nfsd: remove fs/nfsd/fault_inject.c NFSD: fix leaked reference count of nfsd4_ssc_umount_item nfsd: clean up potential nfsd_file refcount leaks in COPY codepath nfsd: zero out pointers after putting nfsd_files on COPY setup error SUNRPC: Fix whitespace damage in svcauth_unix.c nfsd: eliminate __nfs4_get_fd nfsd: add some kerneldoc comments for stateid preprocessing functions nfsd: eliminate find_deleg_file_locked nfsd: don't take nfsd4_copy ref for OP_OFFLOAD_STATUS SUNRPC: Add encryption self-tests ...
| * SUNRPC: Remove ->xpo_secure_port()Chuck Lever2023-02-201-2/+2
| | | | | | | | | | | | | | | | | | There's no need for the cost of this extra virtual function call during every RPC transaction: the RQ_SECURE bit can be set properly in ->xpo_recvfrom() instead. Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | Merge tag 'net-next-6.3' of ↵Linus Torvalds2023-02-221-0/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "Core: - Add dedicated kmem_cache for typical/small skb->head, avoid having to access struct page at kfree time, and improve memory use. - Introduce sysctl to set default RPS configuration for new netdevs. - Define Netlink protocol specification format which can be used to describe messages used by each family and auto-generate parsers. Add tools for generating kernel data structures and uAPI headers. - Expose all net/core sysctls inside netns. - Remove 4s sleep in netpoll if carrier is instantly detected on boot. - Add configurable limit of MDB entries per port, and port-vlan. - Continue populating drop reasons throughout the stack. - Retire a handful of legacy Qdiscs and classifiers. Protocols: - Support IPv4 big TCP (TSO frames larger than 64kB). - Add IP_LOCAL_PORT_RANGE socket option, to control local port range on socket by socket basis. - Track and report in procfs number of MPTCP sockets used. - Support mixing IPv4 and IPv6 flows in the in-kernel MPTCP path manager. - IPv6: don't check net.ipv6.route.max_size and rely on garbage collection to free memory (similarly to IPv4). - Support Penultimate Segment Pop (PSP) flavor in SRv6 (RFC8986). - ICMP: add per-rate limit counters. - Add support for user scanning requests in ieee802154. - Remove static WEP support. - Support minimal Wi-Fi 7 Extremely High Throughput (EHT) rate reporting. - WiFi 7 EHT channel puncturing support (client & AP). BPF: - Add a rbtree data structure following the "next-gen data structure" precedent set by recently added linked list, that is, by using kfunc + kptr instead of adding a new BPF map type. - Expose XDP hints via kfuncs with initial support for RX hash and timestamp metadata. - Add BPF_F_NO_TUNNEL_KEY extension to bpf_skb_set_tunnel_key to better support decap on GRE tunnel devices not operating in collect metadata. - Improve x86 JIT's codegen for PROBE_MEM runtime error checks. - Remove the need for trace_printk_lock for bpf_trace_printk and bpf_trace_vprintk helpers. - Extend libbpf's bpf_tracing.h support for tracing arguments of kprobes/uprobes and syscall as a special case. - Significantly reduce the search time for module symbols by livepatch and BPF. - Enable cpumasks to be used as kptrs, which is useful for tracing programs tracking which tasks end up running on which CPUs in different time intervals. - Add support for BPF trampoline on s390x and riscv64. - Add capability to export the XDP features supported by the NIC. - Add __bpf_kfunc tag for marking kernel functions as kfuncs. - Add cgroup.memory=nobpf kernel parameter option to disable BPF memory accounting for container environments. Netfilter: - Remove the CLUSTERIP target. It has been marked as obsolete for years, and we still have WARN splats wrt races of the out-of-band /proc interface installed by this target. - Add 'destroy' commands to nf_tables. They are identical to the existing 'delete' commands, but do not return an error if the referenced object (set, chain, rule...) did not exist. Driver API: - Improve cpumask_local_spread() locality to help NICs set the right IRQ affinity on AMD platforms. - Separate C22 and C45 MDIO bus transactions more clearly. - Introduce new DCB table to control DSCP rewrite on egress. - Support configuration of Physical Layer Collision Avoidance (PLCA) Reconciliation Sublayer (RS) (802.3cg-2019). Modern version of shared medium Ethernet. - Support for MAC Merge layer (IEEE 802.3-2018 clause 99). Allowing preemption of low priority frames by high priority frames. - Add support for controlling MACSec offload using netlink SET. - Rework devlink instance refcounts to allow registration and de-registration under the instance lock. Split the code into multiple files, drop some of the unnecessarily granular locks and factor out common parts of netlink operation handling. - Add TX frame aggregation parameters (for USB drivers). - Add a new attr TCA_EXT_WARN_MSG to report TC (offload) warning messages with notifications for debug. - Allow offloading of UDP NEW connections via act_ct. - Add support for per action HW stats in TC. - Support hardware miss to TC action (continue processing in SW from a specific point in the action chain). - Warn if old Wireless Extension user space interface is used with modern cfg80211/mac80211 drivers. Do not support Wireless Extensions for Wi-Fi 7 devices at all. Everyone should switch to using nl80211 interface instead. - Improve the CAN bit timing configuration. Use extack to return error messages directly to user space, update the SJW handling, including the definition of a new default value that will benefit CAN-FD controllers, by increasing their oscillator tolerance. New hardware / drivers: - Ethernet: - nVidia BlueField-3 support (control traffic driver) - Ethernet support for imx93 SoCs - Motorcomm yt8531 gigabit Ethernet PHY - onsemi NCN26000 10BASE-T1S PHY (with support for PLCA) - Microchip LAN8841 PHY (incl. cable diagnostics and PTP) - Amlogic gxl MDIO mux - WiFi: - RealTek RTL8188EU (rtl8xxxu) - Qualcomm Wi-Fi 7 devices (ath12k) - CAN: - Renesas R-Car V4H Drivers: - Bluetooth: - Set Per Platform Antenna Gain (PPAG) for Intel controllers. - Ethernet NICs: - Intel (1G, igc): - support TSN / Qbv / packet scheduling features of i226 model - Intel (100G, ice): - use GNSS subsystem instead of TTY - multi-buffer XDP support - extend support for GPIO pins to E823 devices - nVidia/Mellanox: - update the shared buffer configuration on PFC commands - implement PTP adjphase function for HW offset control - TC support for Geneve and GRE with VF tunnel offload - more efficient crypto key management method - multi-port eswitch support - Netronome/Corigine: - add DCB IEEE support - support IPsec offloading for NFP3800 - Freescale/NXP (enetc): - support XDP_REDIRECT for XDP non-linear buffers - improve reconfig, avoid link flap and waiting for idle - support MAC Merge layer - Other NICs: - sfc/ef100: add basic devlink support for ef100 - ionic: rx_push mode operation (writing descriptors via MMIO) - bnxt: use the auxiliary bus abstraction for RDMA - r8169: disable ASPM and reset bus in case of tx timeout - cpsw: support QSGMII mode for J721e CPSW9G - cpts: support pulse-per-second output - ngbe: add an mdio bus driver - usbnet: optimize usbnet_bh() by avoiding unnecessary queuing - r8152: handle devices with FW with NCM support - amd-xgbe: support 10Mbps, 2.5GbE speeds and rx-adaptation - virtio-net: support multi buffer XDP - virtio/vsock: replace virtio_vsock_pkt with sk_buff - tsnep: XDP support - Ethernet high-speed switches: - nVidia/Mellanox (mlxsw): - add support for latency TLV (in FW control messages) - Microchip (sparx5): - separate explicit and implicit traffic forwarding rules, make the implicit rules always active - add support for egress DSCP rewrite - IS0 VCAP support (Ingress Classification) - IS2 VCAP filters (protos, L3 addrs, L4 ports, flags, ToS etc.) - ES2 VCAP support (Egress Access Control) - support for Per-Stream Filtering and Policing (802.1Q, 8.6.5.1) - Ethernet embedded switches: - Marvell (mv88e6xxx): - add MAB (port auth) offload support - enable PTP receive for mv88e6390 - NXP (ocelot): - support MAC Merge layer - support for the the vsc7512 internal copper phys - Microchip: - lan9303: convert to PHYLINK - lan966x: support TC flower filter statistics - lan937x: PTP support for KSZ9563/KSZ8563 and LAN937x - lan937x: support Credit Based Shaper configuration - ksz9477: support Energy Efficient Ethernet - other: - qca8k: convert to regmap read/write API, use bulk operations - rswitch: Improve TX timestamp accuracy - Intel WiFi (iwlwifi): - EHT (Wi-Fi 7) rate reporting - STEP equalizer support: transfer some STEP (connection to radio on platforms with integrated wifi) related parameters from the BIOS to the firmware. - Qualcomm 802.11ax WiFi (ath11k): - IPQ5018 support - Fine Timing Measurement (FTM) responder role support - channel 177 support - MediaTek WiFi (mt76): - per-PHY LED support - mt7996: EHT (Wi-Fi 7) support - Wireless Ethernet Dispatch (WED) reset support - switch to using page pool allocator - RealTek WiFi (rtw89): - support new version of Bluetooth co-existance - Mobile: - rmnet: support TX aggregation" * tag 'net-next-6.3' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (1872 commits) page_pool: add a comment explaining the fragment counter usage net: ethtool: fix __ethtool_dev_mm_supported() implementation ethtool: pse-pd: Fix double word in comments xsk: add linux/vmalloc.h to xsk.c sefltests: netdevsim: wait for devlink instance after netns removal selftest: fib_tests: Always cleanup before exit net/mlx5e: Align IPsec ASO result memory to be as required by hardware net/mlx5e: TC, Set CT miss to the specific ct action instance net/mlx5e: Rename CHAIN_TO_REG to MAPPED_OBJ_TO_REG net/mlx5: Refactor tc miss handling to a single function net/mlx5: Kconfig: Make tc offload depend on tc skb extension net/sched: flower: Support hardware miss to tc action net/sched: flower: Move filter handle initialization earlier net/sched: cls_api: Support hardware miss to tc action net/sched: Rename user cookie and act cookie sfc: fix builds without CONFIG_RTC_LIB sfc: clean up some inconsistent indentings net/mlx4_en: Introduce flexible array to silence overflow warning net: lan966x: Fix possible deadlock inside PTP net/ulp: Remove redundant ->clone() test in inet_clone_ulp(). ...
| * | net/sock: Introduce trace_sk_data_ready()Peilin Ye2023-01-231-0/+5
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | As suggested by Cong, introduce a tracepoint for all ->sk_data_ready() callback implementations. For example: <...> iperf-609 [002] ..... 70.660425: sk_data_ready: family=2 protocol=6 func=sock_def_readable iperf-609 [002] ..... 70.660436: sk_data_ready: family=2 protocol=6 func=sock_def_readable <...> Suggested-by: Cong Wang <cong.wang@bytedance.com> Signed-off-by: Peilin Ye <peilin.ye@bytedance.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* / sunrpc: use bvec_set_page to initialize bvecsChristoph Hellwig2023-02-031-5/+2
|/ | | | | | | | | | Use the bvec_set_page helper to initialize bvecs. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Chuck Lever <chuck.lever@oracle.com> Acked-by: Trond Myklebust <trond.myklebust@hammerspace.com> Link: https://lore.kernel.org/r/20230203150634.3199647-22-hch@lst.de Signed-off-by: Jens Axboe <axboe@kernel.dk>
* Merge tag 'nfsd-6.2-3' of ↵Linus Torvalds2023-01-101-4/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux Pull nfsd fixes from Chuck Lever: - Fix a race when creating NFSv4 files - Revert the use of relaxed bitops * tag 'nfsd-6.2-3' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: NFSD: Use set_bit(RQ_DROPME) Revert "SUNRPC: Use RMW bitops in single-threaded hot paths" nfsd: fix handling of cached open files in nfsd4_open codepath
| * Revert "SUNRPC: Use RMW bitops in single-threaded hot paths"Chuck Lever2023-01-061-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | The premise that "Once an svc thread is scheduled and executing an RPC, no other processes will touch svc_rqst::rq_flags" is false. svc_xprt_enqueue() examines the RQ_BUSY flag in scheduled nfsd threads when determining which thread to wake up next. Found via KCSAN. Fixes: 28df0988815f ("SUNRPC: Use RMW bitops in single-threaded hot paths") Reviewed-by: Jeff Layton <jlayton@kernel.org> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | use less confusing names for iov_iter direction initializersAl Viro2022-11-251-2/+2
|/ | | | | | | | | | | | | READ/WRITE proved to be actively confusing - the meanings are "data destination, as used with read(2)" and "data source, as used with write(2)", but people keep interpreting those as "we read data from it" and "we write data to it", i.e. exactly the wrong way. Call them ITER_DEST and ITER_SOURCE - at least that is harder to misinterpret... Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* Merge tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linuxLinus Torvalds2022-05-271-15/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Chuck Lever: "We introduce 'courteous server' in this release. Previously NFSD would purge open and lock state for an unresponsive client after one lease period (typically 90 seconds). Now, after one lease period, another client can open and lock those files and the unresponsive client's lease is purged; otherwise if the unresponsive client's open and lock state is uncontended, the server retains that open and lock state for up to 24 hours, allowing the client's workload to resume after a lengthy network partition. A longstanding issue with NFSv4 file creation is also addressed. Previously a file creation can fail internally, returning an error to the client, but leave the newly created file in place as an artifact. The file creation code path has been reorganized so that internal failures and race conditions are less likely to result in an unwanted file creation. A fault injector has been added to help exercise paths that are run during kernel metadata cache invalidation. These caches contain information maintained by user space about exported filesystems. Many of our test workloads do not trigger cache invalidation. There is one patch that is needed to support PREEMPT_RT and a fix for an ancient 'sleep while spin-locked' splat that seems to have become easier to hit since v5.18-rc3" * tag 'nfsd-5.19' of git://git.kernel.org/pub/scm/linux/kernel/git/cel/linux: (36 commits) NFSD: nfsd_file_put() can sleep NFSD: Add documenting comment for nfsd4_release_lockowner() NFSD: Modernize nfsd4_release_lockowner() NFSD: Fix possible sleep during nfsd4_release_lockowner() nfsd: destroy percpu stats counters after reply cache shutdown nfsd: Fix null-ptr-deref in nfsd_fill_super() nfsd: Unregister the cld notifier when laundry_wq create failed SUNRPC: Use RMW bitops in single-threaded hot paths NFSD: Clean up the show_nf_flags() macro NFSD: Trace filecache opens NFSD: Move documenting comment for nfsd4_process_open2() NFSD: Fix whitespace NFSD: Remove dprintk call sites from tail of nfsd4_open() NFSD: Instantiate a struct file when creating a regular NFSv4 file NFSD: Clean up nfsd_open_verified() NFSD: Remove do_nfsd_create() NFSD: Refactor NFSv4 OPEN(CREATE) NFSD: Refactor NFSv3 CREATE NFSD: Refactor nfsd_create_setattr() NFSD: Avoid calling fh_drop_write() twice in do_nfsd_create() ...
| * SUNRPC: Use RMW bitops in single-threaded hot pathsChuck Lever2022-05-231-4/+4
| | | | | | | | | | | | | | | | | | | | I noticed CPU pipeline stalls while using perf. Once an svc thread is scheduled and executing an RPC, no other processes will touch svc_rqst::rq_flags. Thus bus-locked atomics are not needed outside the svc thread scheduler. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * SUNRPC: Remove svc_rqst::rq_xprt_hlenChuck Lever2022-05-191-2/+0
| | | | | | | | | | Clean up: This field is now always set to zero. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * SUNRPC: Remove dead code in svc_tcp_release_rqst()Chuck Lever2022-05-191-9/+0
| | | | | | | | | | | | Clean up: svc_tcp_sendto() always sets rq_xprt_ctxt to NULL. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netPaolo Abeni2022-04-151-4/+9
|\|
| * SUNRPC: Move the call to xprt_send_pagedata() out of xprt_sock_sendmsg()Trond Myklebust2022-04-071-3/+6
| | | | | | | | | | | | | | | | | | | | The client and server have different requirements for their memory allocation, so move the allocation of the send buffer out of the socket send code that is common to both. Reported-by: NeilBrown <neilb@suse.de> Fixes: b2648015d452 ("SUNRPC: Make the rpciod and xprtiod slab allocation modes consistent") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
| * SUNRPC: svc_tcp_sendmsg() should handle errors from xdr_alloc_bvec()Trond Myklebust2022-04-071-1/+3
| | | | | | | | | | | | | | | | The allocation is done with GFP_KERNEL, but it could still fail in a low memory situation. Fixes: 4a85a6a3320b ("SUNRPC: Handle TCP socket sends with kernel_sendpage() again") Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com>
* | net: remove noblock parameter from recvmsg() entitiesOliver Hartkopp2022-04-121-1/+1
|/ | | | | | | | | | | | | | | | | | | | | | | | | | The internal recvmsg() functions have two parameters 'flags' and 'noblock' that were merged inside skb_recv_datagram(). As a follow up patch to commit f4b41f062c42 ("net: remove noblock parameter from skb_recv_datagram()") this patch removes the separate 'noblock' parameter for recvmsg(). Analogue to the referenced patch for skb_recv_datagram() the 'flags' and 'noblock' parameters are unnecessarily split up with e.g. err = sk->sk_prot->recvmsg(sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); or in err = INDIRECT_CALL_2(sk->sk_prot->recvmsg, tcp_recvmsg, udp_recvmsg, sk, msg, size, flags & MSG_DONTWAIT, flags & ~MSG_DONTWAIT, &addr_len); instead of simply using only flags all the time and check for MSG_DONTWAIT where needed (to preserve for the formerly separated no(n)block condition). Signed-off-by: Oliver Hartkopp <socketcan@hartkopp.net> Link: https://lore.kernel.org/r/20220411124955.154876-1-socketcan@hartkopp.net Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* SUNRPC: fix ternary sign expansion bug in tracingDan Carpenter2021-04-221-1/+1
| | | | | | | | | | | | | | | This code is supposed to pass negative "err" values for tracing but it passes positive values instead. The problem is that the trace_svcsock_tcp_send() function takes a long but "err" is an int and "sent" is a u32. The negative is first type promoted to u32 so it becomes a high positive then it is promoted to long and it stays positive. Fix this by casting "err" directly to long. Fixes: 998024dee197 ("SUNRPC: Add more svcsock tracepoints") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Move svc_xprt_received() call sitesChuck Lever2021-03-221-2/+7
| | | | | | | | | | | | | | | | | | Currently, XPT_BUSY is not cleared until xpo_recvfrom returns. That effectively blocks the receipt and handling of the next RPC message until the current one has been taken off the transport. This strict ordering is a requirement for socket transports. For our kernel RPC/RDMA transport implementation, however, dequeuing an ingress message is nothing more than a list_del(). The transport can safely be marked un-busy as soon as that is done. To keep the changes simpler, this patch just moves the svc_xprt_received() call site from svc_handle_xprt() into the transports, so that the actual optimization can be done in a subsequent patch. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* svcrdma: Add a "deferred close" helperChuck Lever2021-03-221-9/+6
| | | | | | | | | | | | Refactor a bit of commonly used logic so that every site that wants a close deferred to an nfsd thread does all the right things (set_bit(XPT_CLOSE) then enqueue). Also, once XPT_CLOSE is set on a transport, it is never cleared. If XPT_CLOSE is already set, then the close is already being handled and the enqueue can be skipped. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Further clean up svc_tcp_sendmsg()Chuck Lever2021-02-161-8/+7
| | | | | | Clean up: The msghdr is no longer needed in the caller. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Remove redundant socket flags from svc_tcp_sendmsg()Trond Myklebust2021-02-161-9/+3
| | | | | | | | | Now that the caller controls the TCP_CORK socket option, it is redundant to set MSG_MORE and MSG_SENDPAGE_NOTLAST in the calls to kernel_sendpage(). Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Use TCP_CORK to optimise send performance on the serverTrond Myklebust2021-02-161-1/+7
| | | | | | | | | Use a counter to keep track of how many requests are queued behind the xprt->xpt_mutex, and keep TCP_CORK set until the queue is empty. Signed-off-by: Trond Myklebust <trond.myklebust@hammerspace.com> Link: https://lore.kernel.org/linux-nfs/20210213202532.23146-1-trondmy@kernel.org/T/#u Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Fix NFS READs that start at non-page-aligned offsetsChuck Lever2021-02-011-3/+4
| | | | | | | | | | | | | | | | | | | Anj Duvnjak reports that the Kodi.tv NFS client is not able to read video files from a v5.10.11 Linux NFS server. The new sendpage-based TCP sendto logic was not attentive to non- zero page_base values. nfsd_splice_read() sets that field when a READ payload starts in the middle of a page. The Linux NFS client rarely emits an NFS READ that is not page- aligned. All of my testing so far has been with Linux clients, so I missed this one. Reported-by: A. Duvnjak <avian@extremenerds.net> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=211471 Fixes: 4a85a6a3320b ("SUNRPC: Handle TCP socket sends with kernel_sendpage() again") Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Tested-by: A. Duvnjak <avian@extremenerds.net>
* SUNRPC: Handle TCP socket sends with kernel_sendpage() againChuck Lever2020-12-181-1/+85
| | | | | | | | | | | | | | | | | | | | | | Daire Byrne reports a ~50% aggregrate throughput regression on his Linux NFS server after commit da1661b93bf4 ("SUNRPC: Teach server to use xprt_sock_sendmsg for socket sends"), which replaced kernel_send_page() calls in NFSD's socket send path with calls to sock_sendmsg() using iov_iter. Investigation showed that tcp_sendmsg() was not using zero-copy to send the xdr_buf's bvec pages, but instead was relying on memcpy. This means copying every byte of a large NFS READ payload. It looks like TLS sockets do indeed support a ->sendpage method, so it's really not necessary to use xprt_sock_sendmsg() to support TLS fully on the server. A mechanical reversion of da1661b93bf4 is not possible at this point, but we can re-implement the server's TCP socket sendmsg path using kernel_sendpage(). Reported-by: Daire Byrne <daire@dneg.com> BugLink: https://bugzilla.kernel.org/show_bug.cgi?id=209439 Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Rename svc_encode_read_payload()Chuck Lever2020-11-301-4/+4
| | | | | | | Clean up: "result payload" is a less confusing name for these payloads. "READ payload" reflects only the NFS usage. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* SUNRPC: Fix svc_flush_dcache()Chuck Lever2020-09-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On platforms that implement flush_dcache_page(), a large NFS WRITE triggers the WARN_ONCE in bvec_iter_advance(): Sep 20 14:01:05 klimt.1015granger.net kernel: Attempted to advance past end of bvec iter Sep 20 14:01:05 klimt.1015granger.net kernel: WARNING: CPU: 0 PID: 1032 at include/linux/bvec.h:101 bvec_iter_advance.isra.0+0xa7/0x158 [sunrpc] Sep 20 14:01:05 klimt.1015granger.net kernel: Call Trace: Sep 20 14:01:05 klimt.1015granger.net kernel: svc_tcp_recvfrom+0x60c/0x12c7 [sunrpc] Sep 20 14:01:05 klimt.1015granger.net kernel: ? bvec_iter_advance.isra.0+0x158/0x158 [sunrpc] Sep 20 14:01:05 klimt.1015granger.net kernel: ? del_timer_sync+0x4b/0x55 Sep 20 14:01:05 klimt.1015granger.net kernel: ? test_bit+0x1d/0x27 [sunrpc] Sep 20 14:01:05 klimt.1015granger.net kernel: svc_recv+0x1193/0x15e4 [sunrpc] Sep 20 14:01:05 klimt.1015granger.net kernel: ? try_to_freeze.isra.0+0x6f/0x6f [sunrpc] Sep 20 14:01:05 klimt.1015granger.net kernel: ? refcount_sub_and_test.constprop.0+0x13/0x40 [sunrpc] Sep 20 14:01:05 klimt.1015granger.net kernel: ? svc_xprt_put+0x1e/0x29f [sunrpc] Sep 20 14:01:05 klimt.1015granger.net kernel: ? svc_send+0x39f/0x3c1 [sunrpc] Sep 20 14:01:05 klimt.1015granger.net kernel: nfsd+0x282/0x345 [nfsd] Sep 20 14:01:05 klimt.1015granger.net kernel: ? __kthread_parkme+0x74/0xba Sep 20 14:01:05 klimt.1015granger.net kernel: kthread+0x2ad/0x2bc Sep 20 14:01:05 klimt.1015granger.net kernel: ? nfsd_destroy+0x124/0x124 [nfsd] Sep 20 14:01:05 klimt.1015granger.net kernel: ? test_bit+0x1d/0x27 Sep 20 14:01:05 klimt.1015granger.net kernel: ? kthread_mod_delayed_work+0x115/0x115 Sep 20 14:01:05 klimt.1015granger.net kernel: ret_from_fork+0x22/0x30 Reported-by: He Zhe <zhe.he@windriver.com> Fixes: ca07eda33e01 ("SUNRPC: Refactor svc_recvfrom()") Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
* Merge tag 'uninit-macro-v5.9-rc1' of ↵Linus Torvalds2020-08-041-2/+2
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux Pull uninitialized_var() macro removal from Kees Cook: "This is long overdue, and has hidden too many bugs over the years. The series has several "by hand" fixes, and then a trivial treewide replacement. - Clean up non-trivial uses of uninitialized_var() - Update documentation and checkpatch for uninitialized_var() removal - Treewide removal of uninitialized_var()" * tag 'uninit-macro-v5.9-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/kees/linux: compiler: Remove uninitialized_var() macro treewide: Remove uninitialized_var() usage checkpatch: Remove awareness of uninitialized_var() macro mm/debug_vm_pgtable: Remove uninitialized_var() usage f2fs: Eliminate usage of uninitialized_var() macro media: sur40: Remove uninitialized_var() usage KVM: PPC: Book3S PR: Remove uninitialized_var() usage clk: spear: Remove uninitialized_var() usage clk: st: Remove uninitialized_var() usage spi: davinci: Remove uninitialized_var() usage ide: Remove uninitialized_var() usage rtlwifi: rtl8192cu: Remove uninitialized_var() usage b43: Remove uninitialized_var() usage drbd: Remove uninitialized_var() usage x86/mm/numa: Remove uninitialized_var() usage docs: deprecated.rst: Add uninitialized_var()
| * treewide: Remove uninitialized_var() usageKees Cook2020-07-161-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Using uninitialized_var() is dangerous as it papers over real bugs[1] (or can in the future), and suppresses unrelated compiler warnings (e.g. "unused variable"). If the compiler thinks it is uninitialized, either simply initialize the variable or make compiler changes. In preparation for removing[2] the[3] macro[4], remove all remaining needless uses with the following script: git grep '\buninitialized_var\b' | cut -d: -f1 | sort -u | \ xargs perl -pi -e \ 's/\buninitialized_var\(([^\)]+)\)/\1/g; s:\s*/\* (GCC be quiet|to make compiler happy) \*/$::g;' drivers/video/fbdev/riva/riva_hw.c was manually tweaked to avoid pathological white-space. No outstanding warnings were found building allmodconfig with GCC 9.3.0 for x86_64, i386, arm64, arm, powerpc, powerpc64le, s390x, mips, sparc64, alpha, and m68k. [1] https://lore.kernel.org/lkml/20200603174714.192027-1-glider@google.com/ [2] https://lore.kernel.org/lkml/CA+55aFw+Vbj0i=1TGqCR5vQkCzWJ0QxK6CernOU6eedsudAixw@mail.gmail.com/ [3] https://lore.kernel.org/lkml/CA+55aFwgbgqhbp1fkxvRKEpzyR5J8n1vKT1VZdz9knmPuXhOeg@mail.gmail.com/ [4] https://lore.kernel.org/lkml/CA+55aFz2500WfbKXAx8s67wrm9=yVJu65TpLgN_ybYNv0VEOKA@mail.gmail.com/ Reviewed-by: Leon Romanovsky <leonro@mellanox.com> # drivers/infiniband and mlx4/mlx5 Acked-by: Jason Gunthorpe <jgg@mellanox.com> # IB Acked-by: Kalle Valo <kvalo@codeaurora.org> # wireless drivers Reviewed-by: Chao Yu <yuchao0@huawei.com> # erofs Signed-off-by: Kees Cook <keescook@chromium.org>
* | SUNRPC: Add missing definition of ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGEChristophe Leroy2020-06-291-0/+1
|/ | | | | | | | | | | | | | | | | | | | Even if that's only a warning, not including asm/cacheflush.h leads to svc_flush_bvec() being empty allthough powerpc defines ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE. CC net/sunrpc/svcsock.o net/sunrpc/svcsock.c:227:5: warning: "ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE" is not defined [-Wundef] #if ARCH_IMPLEMENTS_FLUSH_DCACHE_PAGE ^ Include linux/highmem.h so that asm/cacheflush.h will be included. Reported-by: Christophe Leroy <christophe.leroy@csgroup.eu> Reported-by: kernel test robot <lkp@intel.com> Fixes: ca07eda33e01 ("SUNRPC: Refactor svc_recvfrom()") Signed-off-by: Christophe Leroy <christophe.leroy@csgroup.eu> Acked-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@redhat.com>
* Merge tag 'nfsd-5.8' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2020-06-111-195/+205
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd updates from Bruce Fields: "Highlights: - Keep nfsd clients from unnecessarily breaking their own delegations. Note this requires a small kthreadd addition. The result is Tejun Heo's suggestion (see link), and he was OK with this going through my tree. - Patch nfsd/clients/ to display filenames, and to fix byte-order when displaying stateid's. - fix a module loading/unloading bug, from Neil Brown. - A big series from Chuck Lever with RPC/RDMA and tracing improvements, and lay some groundwork for RPC-over-TLS" Link: https://lore.kernel.org/r/1588348912-24781-1-git-send-email-bfields@redhat.com * tag 'nfsd-5.8' of git://linux-nfs.org/~bfields/linux: (49 commits) sunrpc: use kmemdup_nul() in gssp_stringify() nfsd: safer handling of corrupted c_type nfsd4: make drc_slab global, not per-net SUNRPC: Remove unreachable error condition in rpcb_getport_async() nfsd: Fix svc_xprt refcnt leak when setup callback client failed sunrpc: clean up properly in gss_mech_unregister() sunrpc: svcauth_gss_register_pseudoflavor must reject duplicate registrations. sunrpc: check that domain table is empty at module unload. NFSD: Fix improperly-formatted Doxygen comments NFSD: Squash an annoying compiler warning SUNRPC: Clean up request deferral tracepoints NFSD: Add tracepoints for monitoring NFSD callbacks NFSD: Add tracepoints to the NFSD state management code NFSD: Add tracepoints to NFSD's duplicate reply cache SUNRPC: svc_show_status() macro should have enum definitions SUNRPC: Restructure svc_udp_recvfrom() SUNRPC: Refactor svc_recvfrom() SUNRPC: Clean up svc_release_skb() functions SUNRPC: Refactor recvfrom path dealing with incomplete TCP receives SUNRPC: Replace dprintk() call sites in TCP receive path ...
| * SUNRPC: Restructure svc_udp_recvfrom()Chuck Lever2020-05-201-25/+35
| | | | | | | | | | | | | | | | | | Clean up. At this point, we are not ready yet to support bio_vecs in the UDP transport implementation. However, we can clean up svc_udp_recvfrom() to match the tracing and straight-lining recently changes made in svc_tcp_recvfrom(). Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * SUNRPC: Refactor svc_recvfrom()Chuck Lever2020-05-201-41/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This function is not currently "generic" so remove the documenting comment and rename it appropriately. Its internals are converted to use bio_vecs for reading from the transport socket. In existing typical sunrpc uses of bio_vecs, the bio_vec array is allocated dynamically. Here, instead, an array of bio_vecs is added to svc_rqst. The lifetime of this array can be greater than one call to xpo_recvfrom(): - Multiple calls to xpo_recvfrom() might be needed to read an RPC message completely. - At some later point, rq_arg.bvecs will point to this array and it will carry the received message into svc_process(). I also expect that a future optimization will remove either the rq_vec or rq_pages array in favor of rq_bvec, thus conserving the size of struct svc_rqst. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>
| * SUNRPC: Clean up svc_release_skb() functionsChuck Lever2020-05-201-8/+15
| | | | | | | | | | | | | | Rename these functions using the convention used for other xpo method entry points. Signed-off-by: Chuck Lever <chuck.lever@oracle.com>