summaryrefslogtreecommitdiffstats
path: root/net (follow)
Commit message (Collapse)AuthorAgeFilesLines
...
* | | | Merge tag 'for-linus-5.18-rc1-tag' of ↵Linus Torvalds2022-03-281-4/+4
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip Pull xen updates from Juergen Gross: - A bunch of minor cleanups - A fix for kexec in Xen dom0 when executed on a high cpu number - A fix for resuming after suspend of a Xen guest with assigned PCI devices - A fix for a crash due to not disabled preemption when resuming as Xen dom0 * tag 'for-linus-5.18-rc1-tag' of git://git.kernel.org/pub/scm/linux/kernel/git/xen/tip: xen: fix is_xen_pmu() xen: don't hang when resuming PCI device arch:x86:xen: Remove unnecessary assignment in xen_apic_read() xen/grant-table: remove readonly parameter from functions xen/grant-table: remove gnttab_*transfer*() functions drivers/xen: use helper macro __ATTR_RW x86/xen: Fix kerneldoc warning xen: delay xen_hvm_init_time_ops() if kdump is boot on vcpu>=32 xen: use time_is_before_eq_jiffies() instead of open coding it
| * | | | xen/grant-table: remove readonly parameter from functionsJuergen Gross2022-03-161-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The gnttab_end_foreign_access() family of functions is taking a "readonly" parameter, which isn't used. Remove it from the function parameters. Signed-off-by: Juergen Gross <jgross@suse.com> Link: https://lore.kernel.org/r/20220311103429.12845-3-jgross@suse.com Reviewed-by: Jan Beulich <jbeulich@suse.com> Acked-by: Christian Schoenebeck <qemu_oss@crudebyte.com> Signed-off-by: Boris Ostrovsky <boris.ostrovsky@oracle.com>
* | | | | Merge tag 'ceph-for-5.18-rc1' of https://github.com/ceph/ceph-clientLinus Torvalds2022-03-251-6/+2
|\ \ \ \ \ | |_|/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull ceph updates from Ilya Dryomov: "The highlights are: - several changes to how snap context and snap realms are tracked (Xiubo Li). In particular, this should resolve a long-standing issue of high kworker CPU usage and various stalls caused by needless iteration over all inodes in the snap realm. - async create fixes to address hangs in some edge cases (Jeff Layton) - support for getvxattr MDS op for querying server-side xattrs, such as file/directory layouts and ephemeral pins (Milind Changire) - average latency is now maintained for all metrics (Venky Shankar) - some tweaks around handling inline data to make it fit better with netfs helper library (David Howells) Also a couple of memory leaks got plugged along with a few assorted fixups. Last but not least, Xiubo has stepped up to serve as a CephFS co-maintainer" * tag 'ceph-for-5.18-rc1' of https://github.com/ceph/ceph-client: (27 commits) ceph: fix memory leak in ceph_readdir when note_last_dentry returns error ceph: uninitialized variable in debug output ceph: use tracked average r/w/m latencies to display metrics in debugfs ceph: include average/stdev r/w/m latency in mds metrics ceph: track average r/w/m latency ceph: use ktime_to_timespec64() rather than jiffies_to_timespec64() ceph: assign the ci only when the inode isn't NULL ceph: fix inode reference leakage in ceph_get_snapdir() ceph: misc fix for code style and logs ceph: allocate capsnap memory outside of ceph_queue_cap_snap() ceph: do not release the global snaprealm until unmounting ceph: remove incorrect and unused CEPH_INO_DOTDOT macro MAINTAINERS: add Xiubo Li as cephfs co-maintainer ceph: eliminate the recursion when rebuilding the snap context ceph: do not update snapshot context when there is no new snapshot ceph: zero the dir_entries memory when allocating it ceph: move to a dedicated slabcache for ceph_cap_snap ceph: add getvxattr op libceph: drop else branches in prepare_read_data{,_cont} ceph: fix comments mentioning i_mutex ...
| * | | | libceph: drop else branches in prepare_read_data{,_cont}Jeff Layton2022-03-011-6/+2
| | |_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | Just call set_in_bvec in the non-conditional part. Signed-off-by: Jeff Layton <jlayton@kernel.org> Reviewed-by: Ilya Dryomov <idryomov@gmail.com> Signed-off-by: Ilya Dryomov <idryomov@gmail.com>
* | | | Merge tag 'net-next-5.18' of ↵Linus Torvalds2022-03-24337-4250/+12623
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next Pull networking updates from Jakub Kicinski: "The sprinkling of SPI drivers is because we added a new one and Mark sent us a SPI driver interface conversion pull request. Core ---- - Introduce XDP multi-buffer support, allowing the use of XDP with jumbo frame MTUs and combination with Rx coalescing offloads (LRO). - Speed up netns dismantling (5x) and lower the memory cost a little. Remove unnecessary per-netns sockets. Scope some lists to a netns. Cut down RCU syncing. Use batch methods. Allow netdev registration to complete out of order. - Support distinguishing timestamp types (ingress vs egress) and maintaining them across packet scrubbing points (e.g. redirect). - Continue the work of annotating packet drop reasons throughout the stack. - Switch netdev error counters from an atomic to dynamically allocated per-CPU counters. - Rework a few preempt_disable(), local_irq_save() and busy waiting sections problematic on PREEMPT_RT. - Extend the ref_tracker to allow catching use-after-free bugs. BPF --- - Introduce "packing allocator" for BPF JIT images. JITed code is marked read only, and used to be allocated at page granularity. Custom allocator allows for more efficient memory use, lower iTLB pressure and prevents identity mapping huge pages from getting split. - Make use of BTF type annotations (e.g. __user, __percpu) to enforce the correct probe read access method, add appropriate helpers. - Convert the BPF preload to use light skeleton and drop the user-mode-driver dependency. - Allow XDP BPF_PROG_RUN test infra to send real packets, enabling its use as a packet generator. - Allow local storage memory to be allocated with GFP_KERNEL if called from a hook allowed to sleep. - Introduce fprobe (multi kprobe) to speed up mass attachment (arch bits to come later). - Add unstable conntrack lookup helpers for BPF by using the BPF kfunc infra. - Allow cgroup BPF progs to return custom errors to user space. - Add support for AF_UNIX iterator batching. - Allow iterator programs to use sleepable helpers. - Support JIT of add, and, or, xor and xchg atomic ops on arm64. - Add BTFGen support to bpftool which allows to use CO-RE in kernels without BTF info. - Large number of libbpf API improvements, cleanups and deprecations. Protocols --------- - Micro-optimize UDPv6 Tx, gaining up to 5% in test on dummy netdev. - Adjust TSO packet sizes based on min_rtt, allowing very low latency links (data centers) to always send full-sized TSO super-frames. - Make IPv6 flow label changes (AKA hash rethink) more configurable, via sysctl and setsockopt. Distinguish between server and client behavior. - VxLAN support to "collect metadata" devices to terminate only configured VNIs. This is similar to VLAN filtering in the bridge. - Support inserting IPv6 IOAM information to a fraction of frames. - Add protocol attribute to IP addresses to allow identifying where given address comes from (kernel-generated, DHCP etc.) - Support setting socket and IPv6 options via cmsg on ping6 sockets. - Reject mis-use of ECN bits in IP headers as part of DSCP/TOS. Define dscp_t and stop taking ECN bits into account in fib-rules. - Add support for locked bridge ports (for 802.1X). - tun: support NAPI for packets received from batched XDP buffs, doubling the performance in some scenarios. - IPv6 extension header handling in Open vSwitch. - Support IPv6 control message load balancing in bonding, prevent neighbor solicitation and advertisement from using the wrong port. Support NS/NA monitor selection similar to existing ARP monitor. - SMC - improve performance with TCP_CORK and sendfile() - support auto-corking - support TCP_NODELAY - MCTP (Management Component Transport Protocol) - add user space tag control interface - I2C binding driver (as specified by DMTF DSP0237) - Multi-BSSID beacon handling in AP mode for WiFi. - Bluetooth: - handle MSFT Monitor Device Event - add MGMT Adv Monitor Device Found/Lost events - Multi-Path TCP: - add support for the SO_SNDTIMEO socket option - lots of selftest cleanups and improvements - Increase the max PDU size in CAN ISOTP to 64 kB. Driver API ---------- - Add HW counters for SW netdevs, a mechanism for devices which offload packet forwarding to report packet statistics back to software interfaces such as tunnels. - Select the default NIC queue count as a fraction of number of physical CPU cores, instead of hard-coding to 8. - Expose devlink instance locks to drivers. Allow device layer of drivers to use that lock directly instead of creating their own which always runs into ordering issues in devlink callbacks. - Add header/data split indication to guide user space enabling of TCP zero-copy Rx. - Allow configuring completion queue event size. - Refactor page_pool to enable fragmenting after allocation. - Add allocation and page reuse statistics to page_pool. - Improve Multiple Spanning Trees support in the bridge to allow reuse of topologies across VLANs, saving HW resources in switches. - DSA (Distributed Switch Architecture): - replay and offload of host VLAN entries - offload of static and local FDB entries on LAG interfaces - FDB isolation and unicast filtering New hardware / drivers ---------------------- - Ethernet: - LAN937x T1 PHYs - Davicom DM9051 SPI NIC driver - Realtek RTL8367S, RTL8367RB-VB switch and MDIO - Microchip ksz8563 switches - Netronome NFP3800 SmartNICs - Fungible SmartNICs - MediaTek MT8195 switches - WiFi: - mt76: MediaTek mt7916 - mt76: MediaTek mt7921u USB adapters - brcmfmac: Broadcom BCM43454/6 - Mobile: - iosm: Intel M.2 7360 WWAN card Drivers ------- - Convert many drivers to the new phylink API built for split PCS designs but also simplifying other cases. - Intel Ethernet NICs: - add TTY for GNSS module for E810T device - improve AF_XDP performance - GTP-C and GTP-U filter offload - QinQ VLAN support - Mellanox Ethernet NICs (mlx5): - support xdp->data_meta - multi-buffer XDP - offload tc push_eth and pop_eth actions - Netronome Ethernet NICs (nfp): - flow-independent tc action hardware offload (police / meter) - AF_XDP - Other Ethernet NICs: - at803x: fiber and SFP support - xgmac: mdio: preamble suppression and custom MDC frequencies - r8169: enable ASPM L1.2 if system vendor flags it as safe - macb/gem: ZynqMP SGMII - hns3: add TX push mode - dpaa2-eth: software TSO - lan743x: multi-queue, mdio, SGMII, PTP - axienet: NAPI and GRO support - Mellanox Ethernet switches (mlxsw): - source and dest IP address rewrites - RJ45 ports - Marvell Ethernet switches (prestera): - basic routing offload - multi-chain TC ACL offload - NXP embedded Ethernet switches (ocelot & felix): - PTP over UDP with the ocelot-8021q DSA tagging protocol - basic QoS classification on Felix DSA switch using dcbnl - port mirroring for ocelot switches - Microchip high-speed industrial Ethernet (sparx5): - offloading of bridge port flooding flags - PTP Hardware Clock - Other embedded switches: - lan966x: PTP Hardward Clock - qca8k: mdio read/write operations via crafted Ethernet packets - Qualcomm 802.11ax WiFi (ath11k): - add LDPC FEC type and 802.11ax High Efficiency data in radiotap - enable RX PPDU stats in monitor co-exist mode - Intel WiFi (iwlwifi): - UHB TAS enablement via BIOS - band disablement via BIOS - channel switch offload - 32 Rx AMPDU sessions in newer devices - MediaTek WiFi (mt76): - background radar detection - thermal management improvements on mt7915 - SAR support for more mt76 platforms - MBSSID and 6 GHz band on mt7915 - RealTek WiFi: - rtw89: AP mode - rtw89: 160 MHz channels and 6 GHz band - rtw89: hardware scan - Bluetooth: - mt7921s: wake on Bluetooth, SCO over I2S, wide-band-speed (WBS) - Microchip CAN (mcp251xfd): - multiple RX-FIFOs and runtime configurable RX/TX rings - internal PLL, runtime PM handling simplification - improve chip detection and error handling after wakeup" * tag 'net-next-5.18' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2521 commits) llc: fix netdevice reference leaks in llc_ui_bind() drivers: ethernet: cpsw: fix panic when interrupt coaleceing is set via ethtool ice: don't allow to run ice_send_event_to_aux() in atomic ctx ice: fix 'scheduling while atomic' on aux critical err interrupt net/sched: fix incorrect vlan_push_eth dest field net: bridge: mst: Restrict info size queries to bridge ports net: marvell: prestera: add missing destroy_workqueue() in prestera_module_init() drivers: net: xgene: Fix regression in CRC stripping net: geneve: add missing netlink policy and size for IFLA_GENEVE_INNER_PROTO_INHERIT net: dsa: fix missing host-filtered multicast addresses net/mlx5e: Fix build warning, detected write beyond size of field iwlwifi: mvm: Don't fail if PPAG isn't supported selftests/bpf: Fix kprobe_multi test. Revert "rethook: x86: Add rethook x86 implementation" Revert "arm64: rethook: Add arm64 rethook implementation" Revert "powerpc: Add rethook support" Revert "ARM: rethook: Add rethook arm implementation" netdevice: add missing dm_private kdoc net: bridge: mst: prevent NULL deref in br_mst_info_size() selftests: forwarding: Use same VRF for port and VLAN upper ...
| * \ \ \ Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2022-03-2315-126/+208
| |\ \ \ \ | | | |_|/ | | |/| | | | | | | | | | | | | | | | | Merge in overtime fixes, no conflicts. Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | | tipc: fix the timer expires after interval 100msHoang Le2022-03-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the timer callback function tipc_sk_timeout(), we're trying to reschedule another timeout to retransmit a setup request if destination link is congested. But we use the incorrect timeout value (msecs_to_jiffies(100)) instead of (jiffies + msecs_to_jiffies(100)), so that the timer expires immediately, it's irrelevant for original description. In this commit we correct the timeout value in sk_reset_timer() Fixes: 6787927475e5 ("tipc: buffer overflow handling in listener socket") Acked-by: Ying Xue <ying.xue@windriver.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Link: https://lore.kernel.org/r/20220321042229.314288-1-hoang.h.le@dektech.com.au Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| | * | | net: dsa: fix panic on shutdown if multi-chip tree failed to probeVladimir Oltean2022-03-221-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DSA probing is atypical because a tree of devices must probe all at once, so out of N switches which call dsa_tree_setup_routing_table() during probe, for (N - 1) of them, "complete" will return false and they will exit probing early. The Nth switch will set up the whole tree on their behalf. The implication is that for (N - 1) switches, the driver binds to the device successfully, without doing anything. When the driver is bound, the ->shutdown() method may run. But if the Nth switch has failed to initialize the tree, there is nothing to do for the (N - 1) driver instances, since the slave devices have not been created, etc. Moreover, dsa_switch_shutdown() expects that the calling @ds has been in fact initialized, so it jumps at dereferencing the various data structures, which is incorrect. Avoid the ensuing NULL pointer dereferences by simply checking whether the Nth switch has previously set "ds->setup = true" for the switch which is currently shutting down. The entire setup is serialized under dsa2_mutex which we already hold. Fixes: 0650bf52b31f ("net: dsa: be compatible with masters which unregister on shutdown") Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Link: https://lore.kernel.org/r/20220318195443.275026-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | | openvswitch: always update flow key after natAaron Conole2022-03-221-59/+59
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During NAT, a tuple collision may occur. When this happens, openvswitch will make a second pass through NAT which will perform additional packet modification. This will update the skb data, but not the flow key that OVS uses. This means that future flow lookups, and packet matches will have incorrect data. This has been supported since 5d50aa83e2c8 ("openvswitch: support asymmetric conntrack"). That commit failed to properly update the sw_flow_key attributes, since it only called the ovs_ct_nat_update_key once, rather than each time ovs_ct_nat_execute was called. As these two operations are linked, the ovs_ct_nat_execute() function should always make sure that the sw_flow_key is updated after a successful call through NAT infrastructure. Fixes: 5d50aa83e2c8 ("openvswitch: support asymmetric conntrack") Cc: Dumitru Ceara <dceara@redhat.com> Cc: Numan Siddique <nusiddiq@redhat.com> Signed-off-by: Aaron Conole <aconole@redhat.com> Acked-by: Eelco Chaudron <echaudro@redhat.com> Link: https://lore.kernel.org/r/20220318124319.3056455-1-aconole@redhat.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | | tcp: ensure PMTU updates are processed during fastopenJakub Kicinski2022-03-211-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tp->rx_opt.mss_clamp is not populated, yet, during TFO send so we rise it to the local MSS. tp->mss_cache is not updated, however: tcp_v6_connect(): tp->rx_opt.mss_clamp = IPV6_MIN_MTU - headers; tcp_connect(): tcp_connect_init(): tp->mss_cache = min(mtu, tp->rx_opt.mss_clamp) tcp_send_syn_data(): tp->rx_opt.mss_clamp = tp->advmss After recent fixes to ICMPv6 PTB handling we started dropping PMTU updates higher than tp->mss_cache. Because of the stale tp->mss_cache value PMTU updates during TFO are always dropped. Thanks to Wei for helping zero in on the problem and the fix! Fixes: c7bb4b89033b ("ipv6: tcp: drop silly ICMPv6 packet too big messages") Reported-by: Andre Nash <alnash@fb.com> Reported-by: Neil Spring <ntspring@fb.com> Reviewed-by: Wei Wang <weiwan@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Reviewed-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20220321165957.1769954-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | | ax25: Fix NULL pointer dereferences in ax25 timersDuoming Zhou2022-03-212-8/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous commit 7ec02f5ac8a5 ("ax25: fix NPD bug in ax25_disconnect") move ax25_disconnect into lock_sock() in order to prevent NPD bugs. But there are race conditions that may lead to null pointer dereferences in ax25_heartbeat_expiry(), ax25_t1timer_expiry(), ax25_t2timer_expiry(), ax25_t3timer_expiry() and ax25_idletimer_expiry(), when we use ax25_kill_by_device() to detach the ax25 device. One of the race conditions that cause null pointer dereferences can be shown as below: (Thread 1) | (Thread 2) ax25_connect() | ax25_std_establish_data_link() | ax25_start_t1timer() | mod_timer(&ax25->t1timer,..) | | ax25_kill_by_device() (wait a time) | ... | s->ax25_dev = NULL; //(1) ax25_t1timer_expiry() | ax25->ax25_dev->values[..] //(2)| ... ... | We set null to ax25_cb->ax25_dev in position (1) and dereference the null pointer in position (2). The corresponding fail log is shown below: =============================================================== BUG: kernel NULL pointer dereference, address: 0000000000000050 CPU: 1 PID: 0 Comm: swapper/1 Not tainted 5.17.0-rc6-00794-g45690b7d0 RIP: 0010:ax25_t1timer_expiry+0x12/0x40 ... Call Trace: call_timer_fn+0x21/0x120 __run_timers.part.0+0x1ca/0x250 run_timer_softirq+0x2c/0x60 __do_softirq+0xef/0x2f3 irq_exit_rcu+0xb6/0x100 sysvec_apic_timer_interrupt+0xa2/0xd0 ... This patch moves ax25_disconnect() before s->ax25_dev = NULL and uses del_timer_sync() to delete timers in ax25_disconnect(). If ax25_disconnect() is called by ax25_kill_by_device() or ax25->ax25_dev is NULL, the reason in ax25_disconnect() will be equal to ENETUNREACH, it will wait all timers to stop before we set null to s->ax25_dev in ax25_kill_by_device(). Fixes: 7ec02f5ac8a5 ("ax25: fix NPD bug in ax25_disconnect") Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | ax25: Fix refcount leaks caused by ax25_cb_del()Duoming Zhou2022-03-211-3/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The previous commit d01ffb9eee4a ("ax25: add refcount in ax25_dev to avoid UAF bugs") and commit feef318c855a ("ax25: fix UAF bugs of net_device caused by rebinding operation") increase the refcounts of ax25_dev and net_device in ax25_bind() and decrease the matching refcounts in ax25_kill_by_device() in order to prevent UAF bugs, but there are reference count leaks. The root cause of refcount leaks is shown below: (Thread 1) | (Thread 2) ax25_bind() | ... | ax25_addr_ax25dev() | ax25_dev_hold() //(1) | ... | dev_hold_track() //(2) | ... | ax25_destroy_socket() | ax25_cb_del() | ... | hlist_del_init() //(3) | | (Thread 3) | ax25_kill_by_device() | ... | ax25_for_each(s, &ax25_list) { | if (s->ax25_dev == ax25_dev) //(4) | ... | Firstly, we use ax25_bind() to increase the refcount of ax25_dev in position (1) and increase the refcount of net_device in position (2). Then, we use ax25_cb_del() invoked by ax25_destroy_socket() to delete ax25_cb in hlist in position (3) before calling ax25_kill_by_device(). Finally, the decrements of refcounts in ax25_kill_by_device() will not be executed, because no s->ax25_dev equals to ax25_dev in position (4). This patch adds decrements of refcounts in ax25_release() and use lock_sock() to do synchronization. If refcounts decrease in ax25_release(), the decrements of refcounts in ax25_kill_by_device() will not be executed and vice versa. Fixes: d01ffb9eee4a ("ax25: add refcount in ax25_dev to avoid UAF bugs") Fixes: 87563a043cef ("ax25: fix reference count leaks of ax25_dev") Fixes: feef318c855a ("ax25: fix UAF bugs of net_device caused by rebinding operation") Reported-by: Thomas Osterried <thomas@osterried.de> Signed-off-by: Duoming Zhou <duoming@zju.edu.cn> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | af_netlink: Fix shift out of bounds in group mask calculationPetr Machata2022-03-191-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a netlink message is received, netlink_recvmsg() fills in the address of the sender. One of the fields is the 32-bit bitfield nl_groups, which carries the multicast group on which the message was received. The least significant bit corresponds to group 1, and therefore the highest group that the field can represent is 32. Above that, the UB sanitizer flags the out-of-bounds shift attempts. Which bits end up being set in such case is implementation defined, but it's either going to be a wrong non-zero value, or zero, which is at least not misleading. Make the latter choice deterministic by always setting to 0 for higher-numbered multicast groups. To get information about membership in groups >= 32, userspace is expected to use nl_pktinfo control messages[0], which are enabled by NETLINK_PKTINFO socket option. [0] https://lwn.net/Articles/147608/ The way to trigger this issue is e.g. through monitoring the BRVLAN group: # bridge monitor vlan & # ip link add name br type bridge Which produces the following citation: UBSAN: shift-out-of-bounds in net/netlink/af_netlink.c:162:19 shift exponent 32 is too large for 32-bit type 'int' Fixes: f7fa9b10edbb ("[NETLINK]: Support dynamic number of multicast groups per netlink family") Signed-off-by: Petr Machata <petrm@nvidia.com> Reviewed-by: Ido Schimmel <idosch@nvidia.com> Link: https://lore.kernel.org/r/2bef6aabf201d1fc16cca139a744700cff9dcb04.1647527635.git.petrm@nvidia.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | | mptcp: Fix crash due to tcp_tsorted_anchor was initialized before release skbYonglong Li2022-03-181-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Got crash when doing pressure test of mptcp: =========================================================================== dst_release: dst:ffffa06ce6e5c058 refcnt:-1 kernel tried to execute NX-protected page - exploit attempt? (uid: 0) BUG: unable to handle kernel paging request at ffffa06ce6e5c058 PGD 190a01067 P4D 190a01067 PUD 43fffb067 PMD 22e403063 PTE 8000000226e5c063 Oops: 0011 [#1] SMP PTI CPU: 7 PID: 7823 Comm: kworker/7:0 Kdump: loaded Tainted: G E Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.2.1 04/01/2014 Call Trace: ? skb_release_head_state+0x68/0x100 ? skb_release_all+0xe/0x30 ? kfree_skb+0x32/0xa0 ? mptcp_sendmsg_frag+0x57e/0x750 ? __mptcp_retrans+0x21b/0x3c0 ? __switch_to_asm+0x35/0x70 ? mptcp_worker+0x25e/0x320 ? process_one_work+0x1a7/0x360 ? worker_thread+0x30/0x390 ? create_worker+0x1a0/0x1a0 ? kthread+0x112/0x130 ? kthread_flush_work_fn+0x10/0x10 ? ret_from_fork+0x35/0x40 =========================================================================== In __mptcp_alloc_tx_skb skb was allocated and skb->tcp_tsorted_anchor will be initialized, in under memory pressure situation sk_wmem_schedule will return false and then kfree_skb. In this case skb->_skb_refdst is not null because_skb_refdst and tcp_tsorted_anchor are stored in the same mem, and kfree_skb will try to release dst and cause crash. Fixes: f70cad1085d1 ("mptcp: stop relying on tcp_tx_skb_cache") Reviewed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Yonglong Li <liyonglong@chinatelecom.cn> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Link: https://lore.kernel.org/r/20220317220953.426024-1-mathew.j.martineau@linux.intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | | ipv4: Fix route lookups when handling ICMP redirects and PMTU updatesGuillaume Nault2022-03-181-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The PMTU update and ICMP redirect helper functions initialise their fl4 variable with either __build_flow_key() or build_sk_flow_key(). These initialisation functions always set ->flowi4_scope with RT_SCOPE_UNIVERSE and might set the ECN bits of ->flowi4_tos. This is not a problem when the route lookup is later done via ip_route_output_key_hash(), which properly clears the ECN bits from ->flowi4_tos and initialises ->flowi4_scope based on the RTO_ONLINK flag. However, some helpers call fib_lookup() directly, without sanitising the tos and scope fields, so the route lookup can fail and, as a result, the ICMP redirect or PMTU update aren't taken into account. Fix this by extracting the ->flowi4_tos and ->flowi4_scope sanitisation code into ip_rt_fix_tos(), then use this function in handlers that call fib_lookup() directly. Note 1: We can't sanitise ->flowi4_tos and ->flowi4_scope in a central place (like __build_flow_key() or flowi4_init_output()), because ip_route_output_key_hash() expects non-sanitised values. When called with sanitised values, it can erroneously overwrite RT_SCOPE_LINK with RT_SCOPE_UNIVERSE in ->flowi4_scope. Therefore we have to be careful to sanitise the values only for those paths that don't call ip_route_output_key_hash(). Note 2: The problem is mostly about sanitising ->flowi4_tos. Having ->flowi4_scope initialised with RT_SCOPE_UNIVERSE instead of RT_SCOPE_LINK probably wasn't really a problem: sockets with the SOCK_LOCALROUTE flag set (those that'd result in RTO_ONLINK being set) normally shouldn't receive ICMP redirects or PMTU updates. Fixes: 4895c771c7f0 ("ipv4: Add FIB nexthop exceptions.") Signed-off-by: Guillaume Nault <gnault@redhat.com> Reviewed-by: David Ahern <dsahern@kernel.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | | Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfJakub Kicinski2022-03-181-19/+50
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2022-03-18 We've added 2 non-merge commits during the last 18 day(s) which contain a total of 2 files changed, 50 insertions(+), 20 deletions(-). The main changes are: 1) Fix a race in XSK socket teardown code that can lead to a NULL pointer dereference, from Magnus. 2) Small MAINTAINERS doc update to remove Lorenz from sockmap, from Lorenz. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf: xsk: Fix race at socket teardown bpf: Remove Lorenz Bauer from L7 BPF maintainers ==================== Link: https://lore.kernel.org/r/20220318152418.28638-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | | * | | xsk: Fix race at socket teardownMagnus Karlsson2022-02-281-19/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix a race in the xsk socket teardown code that can lead to a NULL pointer dereference splat. The current xsk unbind code in xsk_unbind_dev() starts by setting xs->state to XSK_UNBOUND, sets xs->dev to NULL and then waits for any NAPI processing to terminate using synchronize_net(). After that, the release code starts to tear down the socket state and free allocated memory. BUG: kernel NULL pointer dereference, address: 00000000000000c0 PGD 8000000932469067 P4D 8000000932469067 PUD 0 Oops: 0000 [#1] PREEMPT SMP PTI CPU: 25 PID: 69132 Comm: grpcpp_sync_ser Tainted: G I 5.16.0+ #2 Hardware name: Dell Inc. PowerEdge R730/0599V5, BIOS 1.2.10 03/09/2015 RIP: 0010:__xsk_sendmsg+0x2c/0x690 [...] RSP: 0018:ffffa2348bd13d50 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 0000000000000040 RCX: ffff8d5fc632d258 RDX: 0000000000400000 RSI: ffffa2348bd13e10 RDI: ffff8d5fc5489800 RBP: ffffa2348bd13db0 R08: 0000000000000000 R09: 00007ffffffff000 R10: 0000000000000000 R11: 0000000000000000 R12: ffff8d5fc5489800 R13: ffff8d5fcb0f5140 R14: ffff8d5fcb0f5140 R15: 0000000000000000 FS: 00007f991cff9400(0000) GS:ffff8d6f1f700000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c0 CR3: 0000000114888005 CR4: 00000000001706e0 Call Trace: <TASK> ? aa_sk_perm+0x43/0x1b0 xsk_sendmsg+0xf0/0x110 sock_sendmsg+0x65/0x70 __sys_sendto+0x113/0x190 ? debug_smp_processor_id+0x17/0x20 ? fpregs_assert_state_consistent+0x23/0x50 ? exit_to_user_mode_prepare+0xa5/0x1d0 __x64_sys_sendto+0x29/0x30 do_syscall_64+0x3b/0xc0 entry_SYSCALL_64_after_hwframe+0x44/0xae There are two problems with the current code. First, setting xs->dev to NULL before waiting for all users to stop using the socket is not correct. The entry to the data plane functions xsk_poll(), xsk_sendmsg(), and xsk_recvmsg() are all guarded by a test that xs->state is in the state XSK_BOUND and if not, it returns right away. But one process might have passed this test but still have not gotten to the point in which it uses xs->dev in the code. In this interim, a second process executing xsk_unbind_dev() might have set xs->dev to NULL which will lead to a crash for the first process. The solution here is just to get rid of this NULL assignment since it is not used anymore. Before commit 42fddcc7c64b ("xsk: use state member for socket synchronization"), xs->dev was the gatekeeper to admit processes into the data plane functions, but it was replaced with the state variable xs->state in the aforementioned commit. The second problem is that synchronize_net() does not wait for any process in xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() to complete, which means that the state they rely on might be cleaned up prematurely. This can happen when the notifier gets called (at driver unload for example) as it uses xsk_unbind_dev(). Solve this by extending the RCU critical region from just the ndo_xsk_wakeup to the whole functions mentioned above, so that both the test of xs->state == XSK_BOUND and the last use of any member of xs is covered by the RCU critical section. This will guarantee that when synchronize_net() completes, there will be no processes left executing xsk_poll(), xsk_sendmsg(), or xsk_recvmsg() and state can be cleaned up safely. Note that we need to drop the RCU lock for the skb xmit path as it uses functions that might sleep. Due to this, we have to retest the xs->state after we grab the mutex that protects the skb xmit code from, among a number of things, an xsk_unbind_dev() being executed from the notifier at the same time. Fixes: 42fddcc7c64b ("xsk: use state member for socket synchronization") Reported-by: Elza Mathew <elza.mathew@intel.com> Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Björn Töpel <bjorn@kernel.org> Link: https://lore.kernel.org/bpf/20220228094552.10134-1-magnus.karlsson@gmail.com
| | * | | | af_unix: Support POLLPRI for OOB.Kuniyuki Iwashima2022-03-181-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit 314001f0bf92 ("af_unix: Add OOB support") introduced OOB for AF_UNIX, but it lacks some changes for POLLPRI. Let's add the missing piece. In the selftest, normal datagrams are sent followed by OOB data, so this commit replaces `POLLIN | POLLPRI` with just `POLLPRI` in the first test case. Fixes: 314001f0bf92 ("af_unix: Add OOB support") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | af_unix: Fix some data-races around unix_sk(sk)->oob_skb.Kuniyuki Iwashima2022-03-181-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Out-of-band data automatically places a "mark" showing wherein the sequence the out-of-band data would have been. If the out-of-band data implies cancelling everything sent so far, the "mark" is helpful to flush them. When the socket's read pointer reaches the "mark", the ioctl() below sets a non zero value to the arg `atmark`: The out-of-band data is queued in sk->sk_receive_queue as well as ordinary data and also saved in unix_sk(sk)->oob_skb. It can be used to test if the head of the receive queue is the out-of-band data meaning the socket is at the "mark". While testing that, unix_ioctl() reads unix_sk(sk)->oob_skb locklessly. Thus, all accesses to oob_skb need some basic protection to avoid load/store tearing which KCSAN detects when these are called concurrently: - ioctl(fd_a, SIOCATMARK, &atmark, sizeof(atmark)) - send(fd_b_connected_to_a, buf, sizeof(buf), MSG_OOB) BUG: KCSAN: data-race in unix_ioctl / unix_stream_sendmsg write to 0xffff888003d9cff0 of 8 bytes by task 175 on cpu 1: unix_stream_sendmsg (net/unix/af_unix.c:2087 net/unix/af_unix.c:2191) sock_sendmsg (net/socket.c:705 net/socket.c:725) __sys_sendto (net/socket.c:2040) __x64_sys_sendto (net/socket.c:2048) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) read to 0xffff888003d9cff0 of 8 bytes by task 176 on cpu 0: unix_ioctl (net/unix/af_unix.c:3101 (discriminator 1)) sock_do_ioctl (net/socket.c:1128) sock_ioctl (net/socket.c:1242) __x64_sys_ioctl (fs/ioctl.c:52 fs/ioctl.c:874 fs/ioctl.c:860 fs/ioctl.c:860) do_syscall_64 (arch/x86/entry/common.c:50 arch/x86/entry/common.c:80) entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:113) value changed: 0xffff888003da0c00 -> 0xffff888003da0d00 Reported by Kernel Concurrency Sanitizer on: CPU: 0 PID: 176 Comm: unix_race_oob_i Not tainted 5.17.0-rc5-59529-g83dc4c2af682 #12 Hardware name: Red Hat KVM, BIOS 1.11.0-2.amzn2 04/01/2014 Fixes: 314001f0bf92 ("af_unix: Add OOB support") Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nfDavid S. Miller2022-03-184-24/+35
| | |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains Netfilter fixes for net: 1) Fix PPPoE and QinQ with flowtable inet family. 2) Missing register validation in nf_tables. 3) Initialize registers to avoid stack memleak to userspace. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * | | | netfilter: nf_tables: initialize registers in nft_do_chain()Pablo Neira Ayuso2022-03-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Initialize registers to avoid stack leak into userspace. Fixes: 96518518cc41 ("netfilter: add nftables") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | | * | | | netfilter: nf_tables: validate registers coming from userspace.Pablo Neira Ayuso2022-03-171-5/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bail out in case userspace uses unsupported registers. Fixes: 49499c3e6e18 ("netfilter: nf_tables: switch registers to 32 bit addressing") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | | * | | | netfilter: flowtable: Fix QinQ and pppoe support for inet tablePablo Neira Ayuso2022-03-162-18/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nf_flow_offload_inet_hook() does not check for 802.1q and PPPoE. Fetch inner ethertype from these encapsulation protocols. Fixes: 72efd585f714 ("netfilter: flowtable: add pppoe support") Fixes: 4cd91f7c290f ("netfilter: flowtable: add vlan support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | llc: fix netdevice reference leaks in llc_ui_bind()Eric Dumazet2022-03-231-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Whenever llc_ui_bind() and/or llc_ui_autobind() took a reference on a netdevice but subsequently fail, they must properly release their reference or risk the infamous message from unregister_netdevice() at device dismantle. unregister_netdevice: waiting for eth0 to become free. Usage count = 3 Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: 赵子轩 <beraphin@gmail.com> Reported-by: Stoyan Manolov <smanolov@suse.de> Link: https://lore.kernel.org/r/20220323004147.1990845-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | | | net: bridge: mst: Restrict info size queries to bridge portsTobias Waldekranz2022-03-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ensure that no bridge masters are ever considered for MST info dumping. MST states are only supported on bridge ports, not bridge masters - which br_mst_info_size relies on. Fixes: 122c29486e1f ("net: bridge: mst: Support setting and reporting MST port states") Signed-off-by: Tobias Waldekranz <tobias@waldekranz.com> Link: https://lore.kernel.org/r/20220322133001.16181-1-tobias@waldekranz.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | | | net: dsa: fix missing host-filtered multicast addressesVladimir Oltean2022-03-231-10/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | DSA ports are stacked devices, so they use dev_mc_add() to sync their address list to their lower interface (DSA master). But they are also hardware devices, so they program those addresses to hardware using the __dev_mc_add() sync and unsync callbacks. Unfortunately both cannot work at the same time, and it seems that the multicast addresses which are already present on the DSA master, like 33:33:00:00:00:01 (added by addrconf.c as in6addr_linklocal_allnodes) are synced to the master via dev_mc_sync(), but not to hardware by __dev_mc_sync(). This happens because both the dev_mc_sync() -> __hw_addr_sync_one() code path, as well as __dev_mc_sync() -> __hw_addr_sync_dev(), operate on the same variable: ha->sync_cnt, in a way that causes the "sync" method (dsa_slave_sync_mc) to no longer be called. To fix the issue we need to work with the API in the way in which it was intended to be used, and therefore, call dev_uc_add() and friends for each individual hardware address, from the sync and unsync callbacks. Fixes: 5e8a1e03aa4d ("net: dsa: install secondary unicast and multicast addresses as host FDB/MDB") Link: https://lore.kernel.org/netdev/20220321163213.lrn5sk7m6grighbl@skbuf/ Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Florian Fainelli <f.fainelli@gmail.com> Link: https://lore.kernel.org/r/20220322003701.2056895-1-vladimir.oltean@nxp.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | | | Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski2022-03-227-115/+449
| |\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexei Starovoitov says: ==================== pull-request: bpf-next 2022-03-21 v2 We've added 137 non-merge commits during the last 17 day(s) which contain a total of 143 files changed, 7123 insertions(+), 1092 deletions(-). The main changes are: 1) Custom SEC() handling in libbpf, from Andrii. 2) subskeleton support, from Delyan. 3) Use btf_tag to recognize __percpu pointers in the verifier, from Hao. 4) Fix net.core.bpf_jit_harden race, from Hou. 5) Fix bpf_sk_lookup remote_port on big-endian, from Jakub. 6) Introduce fprobe (multi kprobe) _without_ arch bits, from Masami. The arch specific bits will come later. 7) Introduce multi_kprobe bpf programs on top of fprobe, from Jiri. 8) Enable non-atomic allocations in local storage, from Joanne. 9) Various var_off ptr_to_btf_id fixed, from Kumar. 10) bpf_ima_file_hash helper, from Roberto. 11) Add "live packet" mode for XDP in BPF_PROG_RUN, from Toke. * https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next: (137 commits) selftests/bpf: Fix kprobe_multi test. Revert "rethook: x86: Add rethook x86 implementation" Revert "arm64: rethook: Add arm64 rethook implementation" Revert "powerpc: Add rethook support" Revert "ARM: rethook: Add rethook arm implementation" bpftool: Fix a bug in subskeleton code generation bpf: Fix bpf_prog_pack when PMU_SIZE is not defined bpf: Fix bpf_prog_pack for multi-node setup bpf: Fix warning for cast from restricted gfp_t in verifier bpf, arm: Fix various typos in comments libbpf: Close fd in bpf_object__reuse_map bpftool: Fix print error when show bpf map bpf: Fix kprobe_multi return probe backtrace Revert "bpf: Add support to inline bpf_get_func_ip helper on x86" bpf: Simplify check in btf_parse_hdr() selftests/bpf/test_lirc_mode2.sh: Exit with proper code bpf: Check for NULL return from bpf_get_btf_vmlinux selftests/bpf: Test skipping stacktrace bpf: Adjust BPF stack helper functions to accommodate skip > 0 bpf: Select proper size for bpf_prog_pack ... ==================== Link: https://lore.kernel.org/r/20220322050159.5507-1-alexei.starovoitov@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | | | | | bpf: Check for NULL return from bpf_get_btf_vmlinuxKumar Kartikeya Dwivedi2022-03-211-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_DEBUG_INFO_BTF is disabled, bpf_get_btf_vmlinux can return a NULL pointer. Check for it in btf_get_module_btf to prevent a NULL pointer dereference. While kernel test robot only complained about this specific case, let's also check for NULL in other call sites of bpf_get_btf_vmlinux. Fixes: 9492450fd287 ("bpf: Always raise reference in btf_get_module_btf") Reported-by: kernel test robot <oliver.sang@intel.com> Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220320143003.589540-1-memxor@gmail.com
| | * | | | | | bpf: Treat bpf_sk_lookup remote_port as a 2-byte fieldJakub Sitnicki2022-03-211-2/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 9a69e2b385f4 ("bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide") the remote_port field has been split up and re-declared from u32 to be16. However, the accompanying changes to the context access converter have not been well thought through when it comes big-endian platforms. Today 2-byte wide loads from offsetof(struct bpf_sk_lookup, remote_port) are handled as narrow loads from a 4-byte wide field. This by itself is not enough to create a problem, but when we combine 1. 32-bit wide access to ->remote_port backed by a 16-wide wide load, with 2. inherent difference between litte- and big-endian in how narrow loads need have to be handled (see bpf_ctx_narrow_access_offset), we get inconsistent results for a 2-byte loads from &ctx->remote_port on LE and BE architectures. This in turn makes BPF C code for the common case of 2-byte load from ctx->remote_port not portable. To rectify it, inform the context access converter that remote_port is 2-byte wide field, and only 1-byte loads need to be treated as narrow loads. At the same time, we special-case the 4-byte load from &ctx->remote_port to continue handling it the same way as do today, in order to keep the existing BPF programs working. Fixes: 9a69e2b385f4 ("bpf: Make remote_port field in struct bpf_sk_lookup 16-bit wide") Signed-off-by: Jakub Sitnicki <jakub@cloudflare.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220319183356.233666-2-jakub@cloudflare.com
| | * | | | | | bpf: Enable non-atomic allocations in local storageJoanne Koong2022-03-211-9/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, local storage memory can only be allocated atomically (GFP_ATOMIC). This restriction is too strict for sleepable bpf programs. In this patch, the verifier detects whether the program is sleepable, and passes the corresponding GFP_KERNEL or GFP_ATOMIC flag as a 5th argument to bpf_task/sk/inode_storage_get. This flag will propagate down to the local storage functions that allocate memory. Please note that bpf_task/sk/inode_storage_update_elem functions are invoked by userspace applications through syscalls. Preemption is disabled before bpf_task/sk/inode_storage_update_elem is called, which means they will always have to allocate memory atomically. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: KP Singh <kpsingh@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220318045553.3091807-2-joannekoong@fb.com
| | * | | | | | veth: Rework veth_xdp_rcv_skb in order to accept non-linear skbLorenzo Bianconi2022-03-171-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Introduce veth_convert_skb_to_xdp_buff routine in order to convert a non-linear skb into a xdp buffer. If the received skb is cloned or shared, veth_convert_skb_to_xdp_buff will copy it in a new skb composed by order-0 pages for the linear and the fragmented area. Moreover veth_convert_skb_to_xdp_buff guarantees we have enough headroom for xdp. This is a preliminary patch to allow attaching xdp programs with frags support on veth devices. Signed-off-by: Lorenzo Bianconi <lorenzo@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/8d228b106bc1903571afd1d77e797bffe9a5ea7c.1646989407.git.lorenzo@kernel.org
| | * | | | | | bpf, sockmap: Fix double uncharge the mem of sk_msgWang Yufen2022-03-151-4/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If tcp_bpf_sendmsg is running during a tear down operation, psock may be freed. tcp_bpf_sendmsg() tcp_bpf_send_verdict() sk_msg_return() tcp_bpf_sendmsg_redir() unlikely(!psock)) sk_msg_free() The mem of msg has been uncharged in tcp_bpf_send_verdict() by sk_msg_return(), and would be uncharged by sk_msg_free() again. When psock is null, we can simply returning an error code, this would then trigger the sk_msg_free_nocharge in the error path of __SK_REDIRECT and would have the side effect of throwing an error up to user space. This would be a slight change in behavior from user side but would look the same as an error if the redirect on the socket threw an error. This issue can cause the following info: WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260 Call Trace: <TASK> __sk_destruct+0x24/0x1f0 sk_psock_destroy+0x19b/0x1c0 process_one_work+0x1b3/0x3c0 worker_thread+0x30/0x350 ? process_one_work+0x3c0/0x3c0 kthread+0xe6/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x22/0x30 </TASK> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220304081145.2037182-5-wangyufen@huawei.com
| | * | | | | | bpf, sockmap: Fix more uncharged while msg has more_dataWang Yufen2022-03-151-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In tcp_bpf_send_verdict(), if msg has more data after tcp_bpf_sendmsg_redir(): tcp_bpf_send_verdict() tosend = msg->sg.size //msg->sg.size = 22220 case __SK_REDIRECT: sk_msg_return() //uncharged msg->sg.size(22220) sk->sk_forward_alloc tcp_bpf_sendmsg_redir() //after tcp_bpf_sendmsg_redir, msg->sg.size=11000 goto more_data; tosend = msg->sg.size //msg->sg.size = 11000 case __SK_REDIRECT: sk_msg_return() //uncharged msg->sg.size(11000) to sk->sk_forward_alloc The msg->sg.size(11000) has been uncharged twice, to fix we can charge the remaining msg->sg.size before goto more data. This issue can cause the following info: WARNING: CPU: 0 PID: 9860 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0 Call Trace: <TASK> inet_csk_destroy_sock+0x55/0x110 __tcp_close+0x279/0x470 tcp_close+0x1f/0x60 inet_release+0x3f/0x80 __sock_release+0x3d/0xb0 sock_close+0x11/0x20 __fput+0x92/0x250 task_work_run+0x6a/0xa0 do_exit+0x33b/0xb60 do_group_exit+0x2f/0xa0 get_signal+0xb6/0x950 arch_do_signal_or_restart+0xac/0x2a0 ? vfs_write+0x237/0x290 exit_to_user_mode_prepare+0xa9/0x200 syscall_exit_to_user_mode+0x12/0x30 do_syscall_64+0x46/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> WARNING: CPU: 0 PID: 2136 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260 Call Trace: <TASK> __sk_destruct+0x24/0x1f0 sk_psock_destroy+0x19b/0x1c0 process_one_work+0x1b3/0x3c0 worker_thread+0x30/0x350 ? process_one_work+0x3c0/0x3c0 kthread+0xe6/0x110 ? kthread_complete_and_exit+0x20/0x20 ret_from_fork+0x22/0x30 </TASK> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220304081145.2037182-4-wangyufen@huawei.com
| | * | | | | | bpf, sockmap: Fix memleak in tcp_bpf_sendmsg while sk msg is fullWang Yufen2022-03-151-4/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If tcp_bpf_sendmsg() is running while sk msg is full. When sk_msg_alloc() returns -ENOMEM error, tcp_bpf_sendmsg() goes to wait_for_memory. If partial memory has been alloced by sk_msg_alloc(), that is, msg_tx->sg.size is greater than osize after sk_msg_alloc(), memleak occurs. To fix we use sk_msg_trim() to release the allocated memory, then goto wait for memory. Other call paths of sk_msg_alloc() have the similar issue, such as tls_sw_sendmsg(), so handle sk_msg_trim logic inside sk_msg_alloc(), as Cong Wang suggested. This issue can cause the following info: WARNING: CPU: 3 PID: 7950 at net/core/stream.c:208 sk_stream_kill_queues+0xd4/0x1a0 Call Trace: <TASK> inet_csk_destroy_sock+0x55/0x110 __tcp_close+0x279/0x470 tcp_close+0x1f/0x60 inet_release+0x3f/0x80 __sock_release+0x3d/0xb0 sock_close+0x11/0x20 __fput+0x92/0x250 task_work_run+0x6a/0xa0 do_exit+0x33b/0xb60 do_group_exit+0x2f/0xa0 get_signal+0xb6/0x950 arch_do_signal_or_restart+0xac/0x2a0 exit_to_user_mode_prepare+0xa9/0x200 syscall_exit_to_user_mode+0x12/0x30 do_syscall_64+0x46/0x80 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> WARNING: CPU: 3 PID: 2094 at net/ipv4/af_inet.c:155 inet_sock_destruct+0x13c/0x260 Call Trace: <TASK> __sk_destruct+0x24/0x1f0 sk_psock_destroy+0x19b/0x1c0 process_one_work+0x1b3/0x3c0 kthread+0xe6/0x110 ret_from_fork+0x22/0x30 </TASK> Fixes: 604326b41a6f ("bpf, sockmap: convert to generic sk_msg interface") Signed-off-by: Wang Yufen <wangyufen@huawei.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: John Fastabend <john.fastabend@gmail.com> Link: https://lore.kernel.org/bpf/20220304081145.2037182-3-wangyufen@huawei.com
| | * | | | | | bpf, test_run: Fix packet size check for live packet modeToke Høiland-Jørgensen2022-03-111-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The live packet mode uses some extra space at the start of each page to cache data structures so they don't have to be rebuilt at every repetition. This space wasn't correctly accounted for in the size checking of the arguments supplied to userspace. In addition, the definition of the frame size should include the size of the skb_shared_info (as there is other logic that subtracts the size of this). Together, these mistakes resulted in userspace being able to trip the XDP_WARN() in xdp_update_frame_from_buff(), which syzbot discovered in short order. Fix this by changing the frame size define and adding the extra headroom to the bpf_prog_test_run_xdp() function. Also drop the max_len parameter to the page_pool init, since this is related to DMA which is not used for the page pool instance in PROG_TEST_RUN. Fixes: b530e9e1063e ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN") Reported-by: syzbot+0e91362d99386dc5de99@syzkaller.appspotmail.com Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220310225621.53374-1-toke@redhat.com
| | * | | | | | bpf: Remove BPF_SKB_DELIVERY_TIME_NONE and rename s/delivery_time_/tstamp_/Martin KaFai Lau2022-03-101-54/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch is to simplify the uapi bpf.h regarding to the tstamp type and use a similar way as the kernel to describe the value stored in __sk_buff->tstamp. My earlier thought was to avoid describing the semantic and clock base for the rcv timestamp until there is more clarity on the use case, so the __sk_buff->delivery_time_type naming instead of __sk_buff->tstamp_type. With some thoughts, it can reuse the UNSPEC naming. This patch first removes BPF_SKB_DELIVERY_TIME_NONE and also rename BPF_SKB_DELIVERY_TIME_UNSPEC to BPF_SKB_TSTAMP_UNSPEC and BPF_SKB_DELIVERY_TIME_MONO to BPF_SKB_TSTAMP_DELIVERY_MONO. The semantic of BPF_SKB_TSTAMP_DELIVERY_MONO is the same: __sk_buff->tstamp has delivery time in mono clock base. BPF_SKB_TSTAMP_UNSPEC means __sk_buff->tstamp has the (rcv) tstamp at ingress and the delivery time at egress. At egress, the clock base could be found from skb->sk->sk_clockid. __sk_buff->tstamp == 0 naturally means NONE, so NONE is not needed. With BPF_SKB_TSTAMP_UNSPEC for the rcv tstamp at ingress, the __sk_buff->delivery_time_type is also renamed to __sk_buff->tstamp_type which was also suggested in the earlier discussion: https://lore.kernel.org/bpf/b181acbe-caf8-502d-4b7b-7d96b9fc5d55@iogearbox.net/ The above will then make __sk_buff->tstamp and __sk_buff->tstamp_type the same as its kernel skb->tstamp and skb->mono_delivery_time counter part. The internal kernel function bpf_skb_convert_dtime_type_read() is then renamed to bpf_skb_convert_tstamp_type_read() and it can be simplified with the BPF_SKB_DELIVERY_TIME_NONE gone. A BPF_ALU32_IMM(BPF_AND) insn is also saved by using BPF_JMP32_IMM(BPF_JSET). The bpf helper bpf_skb_set_delivery_time() is also renamed to bpf_skb_set_tstamp(). The arg name is changed from dtime to tstamp also. It only allows setting tstamp 0 for BPF_SKB_TSTAMP_UNSPEC and it could be relaxed later if there is use case to change mono delivery time to non mono. prog->delivery_time_access is also renamed to prog->tstamp_type_access. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220309090509.3712315-1-kafai@fb.com
| | * | | | | | bpf: Simplify insn rewrite on BPF_WRITE __sk_buff->tstampMartin KaFai Lau2022-03-101-12/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BPF_JMP32_IMM(BPF_JSET) is used to save a BPF_ALU32_IMM(BPF_AND). The skb->tc_at_ingress and skb->mono_delivery_time are at the same offset, so only one BPF_LDX_MEM(BPF_B) is needed. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220309090502.3711982-1-kafai@fb.com
| | * | | | | | bpf: Simplify insn rewrite on BPF_READ __sk_buff->tstampMartin KaFai Lau2022-03-101-10/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The skb->tc_at_ingress and skb->mono_delivery_time are at the same byte offset. Thus, only one BPF_LDX_MEM(BPF_B) is needed and both bits can be tested together. /* BPF_READ: a = __sk_buff->tstamp */ if (skb->tc_at_ingress && skb->mono_delivery_time) a = 0; else a = skb->tstamp; Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220309090456.3711530-1-kafai@fb.com
| | * | | | | | bpf: net: Remove TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET macroMartin KaFai Lau2022-03-101-7/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch removes the TC_AT_INGRESS_OFFSET and SKB_MONO_DELIVERY_TIME_OFFSET macros. Instead, PKT_VLAN_PRESENT_OFFSET is used because all of them are at the same offset. Comment is added to make it clear that changing the position of tc_at_ingress or mono_delivery_time will require to adjust the defined macros. The earlier discussion can be found here: https://lore.kernel.org/bpf/419d994e-ff61-7c11-0ec7-11fefcb0186e@iogearbox.net/ Signed-off-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220309090450.3710955-1-kafai@fb.com
| | * | | | | | bpf, test_run: Use kvfree() for memory allocated with kvmalloc()Yihao Han2022-03-101-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is allocated with kvmalloc(), the corresponding release function should not be kfree(), use kvfree() instead. Generated by: scripts/coccinelle/api/kfree_mismatch.cocci Fixes: b530e9e1063e ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN") Signed-off-by: Yihao Han <hanyihao@vivo.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Toke Høiland-Jørgensen <toke@redhat.com> Link: https://lore.kernel.org/bpf/20220310092828.13405-1-hanyihao@vivo.com
| | * | | | | | bpf: Initialise retval in bpf_prog_test_run_xdp()Toke Høiland-Jørgensen2022-03-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel test robot pointed out that the newly added bpf_test_run_xdp_live() runner doesn't set the retval in the caller (by design), which means that the variable can be passed unitialised to bpf_test_finish(). Fix this by initialising the variable properly. Fixes: b530e9e1063e ("bpf: Add "live packet" mode for XDP in BPF_PROG_RUN") Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Link: https://lore.kernel.org/bpf/20220310110228.161869-1-toke@redhat.com
| | * | | | | | bpf: Add "live packet" mode for XDP in BPF_PROG_RUNToke Høiland-Jørgensen2022-03-091-14/+320
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds support for running XDP programs through BPF_PROG_RUN in a mode that enables live packet processing of the resulting frames. Previous uses of BPF_PROG_RUN for XDP returned the XDP program return code and the modified packet data to userspace, which is useful for unit testing of XDP programs. The existing BPF_PROG_RUN for XDP allows userspace to set the ingress ifindex and RXQ number as part of the context object being passed to the kernel. This patch reuses that code, but adds a new mode with different semantics, which can be selected with the new BPF_F_TEST_XDP_LIVE_FRAMES flag. When running BPF_PROG_RUN in this mode, the XDP program return codes will be honoured: returning XDP_PASS will result in the frame being injected into the networking stack as if it came from the selected networking interface, while returning XDP_TX and XDP_REDIRECT will result in the frame being transmitted out that interface. XDP_TX is translated into an XDP_REDIRECT operation to the same interface, since the real XDP_TX action is only possible from within the network drivers themselves, not from the process context where BPF_PROG_RUN is executed. Internally, this new mode of operation creates a page pool instance while setting up the test run, and feeds pages from that into the XDP program. The setup cost of this is amortised over the number of repetitions specified by userspace. To support the performance testing use case, we further optimise the setup step so that all pages in the pool are pre-initialised with the packet data, and pre-computed context and xdp_frame objects stored at the start of each page. This makes it possible to entirely avoid touching the page content on each XDP program invocation, and enables sending up to 9 Mpps/core on my test box. Because the data pages are recycled by the page pool, and the test runner doesn't re-initialise them for each run, subsequent invocations of the XDP program will see the packet data in the state it was after the last time it ran on that particular page. This means that an XDP program that modifies the packet before redirecting it has to be careful about which assumptions it makes about the packet content, but that is only an issue for the most naively written programs. Enabling the new flag is only allowed when not setting ctx_out and data_out in the test specification, since using it means frames will be redirected somewhere else, so they can't be returned. Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Martin KaFai Lau <kafai@fb.com> Link: https://lore.kernel.org/bpf/20220309105346.100053-2-toke@redhat.com
| | * | | | | | selftests/bpf: Add tests for kfunc register offset checksKumar Kartikeya Dwivedi2022-03-061-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Include a few verifier selftests that test against the problems being fixed by previous commits, i.e. release kfunc always require PTR_TO_BTF_ID fixed and var_off to be 0, and negative offset is not permitted and returns a helpful error message. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220304224645.3677453-9-memxor@gmail.com
| | * | | | | | bpf: Replace __diag_ignore with unified __diag_ignore_allKumar Kartikeya Dwivedi2022-03-062-4/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, -Wmissing-prototypes warning is ignored for GCC, but not clang. This leads to clang build warning in W=1 mode. Since the flag used by both compilers is same, we can use the unified __diag_ignore_all macro that works for all supported versions and compilers which have __diag macro support (currently GCC >= 8.0, and Clang >= 11.0). Also add nf_conntrack_bpf.h include to prevent missing prototype warning for register_nf_conntrack_bpf. Signed-off-by: Kumar Kartikeya Dwivedi <memxor@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org> Link: https://lore.kernel.org/bpf/20220304224645.3677453-8-memxor@gmail.com
| * | | | | | | net: bridge: mst: prevent NULL deref in br_mst_info_size()Eric Dumazet2022-03-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Call br_mst_info_size() only if vg pointer is not NULL. general protection fault, probably for non-canonical address 0xdffffc0000000058: 0000 [#1] PREEMPT SMP KASAN KASAN: null-ptr-deref in range [0x00000000000002c0-0x00000000000002c7] CPU: 0 PID: 975 Comm: syz-executor.0 Tainted: G W 5.17.0-next-20220321-syzkaller #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:br_mst_info_size+0x97/0x270 net/bridge/br_mst.c:242 Code: 00 00 31 c0 e8 ba 10 53 f9 31 c0 b9 40 00 00 00 4c 8d 6c 24 30 4c 89 ef f3 48 ab 48 8d 83 c0 02 00 00 48 89 04 24 48 c1 e8 03 <80> 3c 28 00 0f 85 ae 01 00 00 48 8b 83 c0 02 00 00 41 bf 04 00 00 RSP: 0018:ffffc900153770a8 EFLAGS: 00010202 RAX: 0000000000000058 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000040000 RSI: ffffffff88259876 RDI: ffffc900153772d8 RBP: dffffc0000000000 R08: 0000000000000000 R09: ffffffff8db68957 R10: ffffffff881f737b R11: 0000000000000000 R12: 0000000000000000 R13: ffffc900153770d8 R14: 00000000000002a0 R15: 00000000ffffffff FS: 00007f18bbb6f700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020001a80 CR3: 000000001a7d9000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 00000000000000d8 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: <TASK> br_get_link_af_size_filtered+0x6e9/0xc00 net/bridge/br_netlink.c:123 rtnl_link_get_af_size net/core/rtnetlink.c:598 [inline] if_nlmsg_size+0x40c/0xa50 net/core/rtnetlink.c:1040 rtnl_calcit.isra.0+0x25f/0x460 net/core/rtnetlink.c:3780 rtnetlink_rcv_msg+0xa65/0xb80 net/core/rtnetlink.c:5937 netlink_rcv_skb+0x153/0x420 net/netlink/af_netlink.c:2496 netlink_unicast_kernel net/netlink/af_netlink.c:1319 [inline] netlink_unicast+0x543/0x7f0 net/netlink/af_netlink.c:1345 netlink_sendmsg+0x904/0xe00 net/netlink/af_netlink.c:1921 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg+0xcf/0x120 net/socket.c:725 ____sys_sendmsg+0x6e8/0x810 net/socket.c:2413 ___sys_sendmsg+0xf3/0x170 net/socket.c:2467 __sys_sendmsg+0xe5/0x1b0 net/socket.c:2496 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x35/0x80 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae RIP: 0033:0x7f18baa89049 Code: ff ff c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 b8 ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007f18bbb6f168 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00007f18bab9bf60 RCX: 00007f18baa89049 RDX: 0000000000000000 RSI: 0000000020001a80 RDI: 0000000000000004 RBP: 00007f18baae308d R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 00007ffeedb2be2f R14: 00007f18bbb6f300 R15: 0000000000022000 </TASK> Modules linked in: ---[ end trace 0000000000000000 ]--- RIP: 0010:br_mst_info_size+0x97/0x270 net/bridge/br_mst.c:242 Code: 00 00 31 c0 e8 ba 10 53 f9 31 c0 b9 40 00 00 00 4c 8d 6c 24 30 4c 89 ef f3 48 ab 48 8d 83 c0 02 00 00 48 89 04 24 48 c1 e8 03 <80> 3c 28 00 0f 85 ae 01 00 00 48 8b 83 c0 02 00 00 41 bf 04 00 00 RSP: 0018:ffffc900153770a8 EFLAGS: 00010202 RAX: 0000000000000058 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000040000 RSI: ffffffff88259876 RDI: ffffc900153772d8 RBP: dffffc0000000000 R08: 0000000000000000 R09: ffffffff8db68957 R10: ffffffff881f737b R11: 0000000000000000 R12: 0000000000000000 R13: ffffc900153770d8 R14: 00000000000002a0 R15: 00000000ffffffff FS: 00007f18bbb6f700(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000001b2ca22000 CR3: 000000001a7d9000 CR4: 00000000003506f0 DR0: 0000000000000000 DR1: 00000000000000d8 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Fixes: 122c29486e1f ("net: bridge: mst: Support setting and reporting MST port states") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Tobias Waldekranz <tobias@waldekranz.com> Cc: Nikolay Aleksandrov <razor@blackwall.org> Reviewed-by: Tobias Waldekranz <tobias@waldekranz.com> Acked-by: Nikolay Aleksandrov <razor@blackwall.org> Link: https://lore.kernel.org/r/20220322012314.795187-1-eric.dumazet@gmail.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| * | | | | | | net/tls: optimize judgement processes in tls_set_device_offload()Ziyang Xuan2022-03-211-31/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It is known that priority setting HW offload when set tls TX/RX offload by setsockopt(). Check netdevice whether support NETIF_F_HW_TLS_TX or not at the later stages in the whole tls_set_device_offload() process, some memory allocations have been done before that. We must release those memory and return error if we judge the netdevice not support NETIF_F_HW_TLS_TX. It is redundant. Move NETIF_F_HW_TLS_TX judgement forward, and move start_marker_record and offload_ctx memory allocation back slightly. Thus, we can get simpler exception handling process. Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | | | | net/tls: remove unnecessary jump instructions in do_tls_setsockopt_conf()Ziyang Xuan2022-03-211-10/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid using "goto" jump instruction unconditionally when we can return directly. Remove unnecessary jump instructions in do_tls_setsockopt_conf(). Signed-off-by: Ziyang Xuan <william.xuanziyang@huawei.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | | | | net: Revert the softirq will run annotation in ____napi_schedule().Sebastian Andrzej Siewior2022-03-211-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The lockdep annotation lockdep_assert_softirq_will_run() expects that either hard or soft interrupts are disabled because both guaranty that the "raised" soft-interrupts will be processed once the context is left. This triggers in flush_smp_call_function_from_idle() but it this case it explicitly calls do_softirq() in case of pending softirqs. Revert the "softirq will run" annotation in ____napi_schedule() and move the check back to __netif_rx() as it was. Keep the IRQ-off assert in ____napi_schedule() because this is always required. Fixes: fbd9a2ceba5c7 ("net: Add lockdep asserts to ____napi_schedule().") Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Reviewed-by: Jason A. Donenfeld <Jason@zx2c4.com> Link: https://lore.kernel.org/r/YjhD3ZKWysyw8rc6@linutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | | | | devlink: hold the instance lock during eswitch_mode callbacksJakub Kicinski2022-03-211-6/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make the devlink core hold the instance lock during eswitch_mode callbacks. Cheat in case of mlx5 (see the cover letter). Reviewed-by: Leon Romanovsky <leonro@nvidia.com> Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | devlink: add explicitly locked flavor of the rate node APIsJakub Kicinski2022-03-211-25/+61
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We'll need an explicitly locked rate node API for netdevsim to switch eswitch mode setting to locked. Reviewed-by: Jiri Pirko <jiri@nvidia.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>