summaryrefslogtreecommitdiffstats
path: root/include/net (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge https://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextJakub Kicinski2022-05-241-0/+6
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2022-05-23 We've added 113 non-merge commits during the last 26 day(s) which contain a total of 121 files changed, 7425 insertions(+), 1586 deletions(-). The main changes are: 1) Speed up symbol resolution for kprobes multi-link attachments, from Jiri Olsa. 2) Add BPF dynamic pointer infrastructure e.g. to allow for dynamically sized ringbuf reservations without extra memory copies, from Joanne Koong. 3) Big batch of libbpf improvements towards libbpf 1.0 release, from Andrii Nakryiko. 4) Add BPF link iterator to traverse links via seq_file ops, from Dmitrii Dolgov. 5) Add source IP address to BPF tunnel key infrastructure, from Kaixi Fan. 6) Refine unprivileged BPF to disable only object-creating commands, from Alan Maguire. 7) Fix JIT blinding of ld_imm64 when they point to subprogs, from Alexei Starovoitov. 8) Add BPF access to mptcp_sock structures and their meta data, from Geliang Tang. 9) Add new BPF helper for access to remote CPU's BPF map elements, from Feng Zhou. 10) Allow attaching 64-bit cookie to BPF link of fentry/fexit/fmod_ret, from Kui-Feng Lee. 11) Follow-ups to typed pointer support in BPF maps, from Kumar Kartikeya Dwivedi. 12) Add busy-poll test cases to the XSK selftest suite, from Magnus Karlsson. 13) Improvements in BPF selftest test_progs subtest output, from Mykola Lysenko. 14) Fill bpf_prog_pack allocator areas with illegal instructions, from Song Liu. 15) Add generic batch operations for BPF map-in-map cases, from Takshak Chahande. 16) Make bpf_jit_enable more user friendly when permanently on 1, from Tiezhu Yang. 17) Fix an array overflow in bpf_trampoline_get_progs(), from Yuntao Wang. ==================== Link: https://lore.kernel.org/r/20220523223805.27931-1-daniel@iogearbox.net Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * bpf: Add bpf_skc_to_mptcp_sock_protoGeliang Tang2022-05-211-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch implements a new struct bpf_func_proto, named bpf_skc_to_mptcp_sock_proto. Define a new bpf_id BTF_SOCK_TYPE_MPTCP, and a new helper bpf_skc_to_mptcp_sock(), which invokes another new helper bpf_mptcp_sock_from_subflow() in net/mptcp/bpf.c to get struct mptcp_sock from a given subflow socket. v2: Emit BTF type, add func_id checks in verifier.c and bpf_trace.c, remove build check for CONFIG_BPF_JIT v5: Drop EXPORT_SYMBOL (Martin) Co-developed-by: Nicolas Rybowski <nicolas.rybowski@tessares.net> Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Nicolas Rybowski <nicolas.rybowski@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Geliang Tang <geliang.tang@suse.com> Signed-off-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Andrii Nakryiko <andrii@kernel.org> Link: https://lore.kernel.org/bpf/20220519233016.105670-2-mathew.j.martineau@linux.intel.com
* | Merge tag 'for-net-next-2022-05-23' of ↵Jakub Kicinski2022-05-232-2/+16
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next Luiz Augusto von Dentz says: ==================== bluetooth-next pull request for net-next: - Add support for Realtek 8761BUV - Add HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk - Add support for RTL8852C - Add a new PID/VID 0489/e0c8 for MT7921 - Add support for Qualcomm WCN785x * tag 'for-net-next-2022-05-23' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next: (26 commits) Bluetooth: hci_sync: use hci_skb_event() helper Bluetooth: eir: Add helpers for managing service data Bluetooth: hci_sync: Fix attempting to suspend with unfiltered passive scan Bluetooth: MGMT: Add conditions for setting HCI_CONN_FLAG_REMOTE_WAKEUP Bluetooth: btmtksdio: fix the reset takes too long Bluetooth: btmtksdio: fix possible FW initialization failure Bluetooth: btmtksdio: fix use-after-free at btmtksdio_recv_event Bluetooth: btbcm: Add entry for BCM4373A0 UART Bluetooth Bluetooth: btusb: Add a new PID/VID 0489/e0c8 for MT7921 Bluetooth: btusb: Add 0x0bda:0x8771 Realtek 8761BUV devices Bluetooth: btusb: Set HCI_QUIRK_BROKEN_ERR_DATA_REPORTING for QCA Bluetooth: core: Fix missing power_on work cancel on HCI close Bluetooth: btusb: add support for Qualcomm WCN785x Bluetooth: protect le accept and resolv lists with hdev->lock Bluetooth: use hdev lock for accept_list and reject_list in conn req Bluetooth: use hdev lock in activate_scan for hci_is_adv_monitoring Bluetooth: btrtl: Add support for RTL8852C Bluetooth: btusb: Set HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN for QCA Bluetooth: Print broken quirks Bluetooth: HCI: Add HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk ... ==================== Link: https://lore.kernel.org/r/20220523204151.3327345-1-luiz.dentz@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | Bluetooth: eir: Add helpers for managing service dataLuiz Augusto von Dentz2022-05-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | This adds helpers for accessing and appending service data (0x16) ad type. Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
| * | Bluetooth: HCI: Add HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirkLuiz Augusto von Dentz2022-05-132-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds HCI_QUIRK_BROKEN_ENHANCED_SETUP_SYNC_CONN quirk which can be used to mark HCI_Enhanced_Setup_Synchronous_Connection as broken even if its support command bit are set since some controller report it as supported but the command don't work properly with some configurations (e.g. BT_VOICE_TRANSPARENT/mSBC). Signed-off-by: Luiz Augusto von Dentz <luiz.von.dentz@intel.com> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
* | | net: wrap the wireless pointers in struct net_device in an ifdefJakub Kicinski2022-05-222-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Most protocol-specific pointers in struct net_device are under a respective ifdef. Wireless is the notable exception. Since there's a sizable number of custom-built kernels for datacenter workloads which don't build wireless it seems reasonable to ifdefy those pointers as well. While at it move IPv4 and IPv6 pointers up, those are special for obvious reasons. Acked-by: Johannes Berg <johannes@sipsolutions.net> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> # ieee802154 Acked-by: Sven Eckelmann <sven@narfation.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: Add a second bind table hashed by port and addressJoanne Koong2022-05-213-1/+84
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently have one tcp bind table (bhash) which hashes by port number only. In the socket bind path, we check for bind conflicts by traversing the specified port's inet_bind2_bucket while holding the bucket's spinlock (see inet_csk_get_port() and inet_csk_bind_conflict()). In instances where there are tons of sockets hashed to the same port at different addresses, checking for a bind conflict is time-intensive and can cause softirq cpu lockups, as well as stops new tcp connections since __inet_inherit_port() also contests for the spinlock. This patch proposes adding a second bind table, bhash2, that hashes by port and ip address. Searching the bhash2 table leads to significantly faster conflict resolution and less time holding the spinlock. Signed-off-by: Joanne Koong <joannelkoong@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Acked-by: Kuniyuki Iwashima <kuniyu@amazon.co.jp> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski2022-05-201-0/+1
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next, misc updates and fallout fixes from recent Florian's code rewritting (from last pull request): 1) Use new flowi4_l3mdev field in ip_route_me_harder(), from Martin Willi. 2) Avoid unnecessary GC with a timestamp in conncount, from William Tu and Yifeng Sun. 3) Remove TCP conntrack debugging, from Florian Westphal. 4) Fix compilation warning in ctnetlink, from Florian. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: ctnetlink: fix up for "netfilter: conntrack: remove unconfirmed list" netfilter: conntrack: remove pr_debug callsites from tcp tracker netfilter: nf_conncount: reduce unnecessary GC netfilter: Use l3mdev flow key when re-routing mangled packets ==================== Link: https://lore.kernel.org/r/20220519220206.722153-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | netfilter: nf_conncount: reduce unnecessary GCWilliam Tu2022-05-161-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently nf_conncount can trigger garbage collection (GC) at multiple places. Each GC process takes a spin_lock_bh to traverse the nf_conncount_list. We found that when testing port scanning use two parallel nmap, because the number of connection increase fast, the nf_conncount_count and its subsequent call to __nf_conncount_add take too much time, causing several CPU lockup. This happens when user set the conntrack limit to +20,000, because the larger the limit, the longer the list that GC has to traverse. The patch mitigate the performance issue by avoiding unnecessary GC with a timestamp. Whenever nf_conncount has done a GC, a timestamp is updated, and beforce the next time GC is triggered, we make sure it's more than a jiffies. By doin this we can greatly reduce the CPU cycles and avoid the softirq lockup. To reproduce it in OVS, $ ovs-appctl dpctl/ct-set-limits zone=1,limit=20000 $ ovs-appctl dpctl/ct-get-limits At another machine, runs two nmap $ nmap -p1- <IP> $ nmap -p1- <IP> Signed-off-by: William Tu <u9012063@gmail.com> Co-authored-by: Yifeng Sun <pkusunyifeng@gmail.com> Reported-by: Greg Rose <gvrose8192@gmail.com> Suggested-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | | Merge tag 'wireless-next-2022-05-19' of ↵Jakub Kicinski2022-05-192-11/+39
|\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next Kalle Valo says: ==================== wireless-next patches for v5.19 Second set of patches for v5.19 and most likely the last one. rtw89 got support for 8852ce devices and mt76 now supports Wireless Ethernet Dispatch. Major changes: cfg80211/mac80211 - support disabling EHT mode rtw89 - add support for Realtek 8852ce devices mt76 - Wireless Ethernet Dispatch support for flow offload - non-standard VHT MCS10-11 support - mt7921 AP mode support - mt7921 ipv6 NS offload support ath11k - enable keepalive during WoWLAN suspend - implement remain-on-channel support * tag 'wireless-next-2022-05-19' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless-next: (135 commits) iwlwifi: mei: fix potential NULL-ptr deref iwlwifi: mei: clear the sap data header before sending iwlwifi: mvm: remove vif_count iwlwifi: mvm: always tell the firmware to accept MCAST frames in BSS iwlwifi: mvm: add OTP info in case of init failure iwlwifi: mvm: fix assert 1F04 upon reconfig iwlwifi: fw: init SAR GEO table only if data is present iwlwifi: mvm: clean up authorized condition iwlwifi: mvm: use NULL instead of ERR_PTR when parsing wowlan status iwlwifi: pcie: simplify MSI-X cause mapping rtw89: pci: only mask out INT indicator register for disable interrupt v1 rtw89: convert rtw89_band to nl80211_band precisely rtw89: 8852c: update txpwr tables to HALRF_027_00_052 rtw89: cfo: check mac_id to avoid out-of-bounds rtw89: 8852c: set TX antenna path rtw89: add ieee80211::sta_rc_update ops wireless: Fix Makefile to be in alphabetical order mac80211: refactor freeing the next_beacon cfg80211: fix kernel-doc for cfg80211_beacon_data mac80211: minstrel_ht: support ieee80211_rate_status ... ==================== Link: https://lore.kernel.org/r/20220519153334.8D051C385AA@smtp.kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | cfg80211: fix kernel-doc for cfg80211_beacon_dataJohannes Berg2022-05-171-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The kernel-doc comment is formatted badly, resulting in a warning: include/net/cfg80211.h:1188: warning: bad line: [...] Fix that. Reported-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * | | | mac80211: extend current rate control tx status APIJonas Jelonek2022-05-161-2/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds the new struct ieee80211_rate_status and replaces 'struct rate_info *rate' in ieee80211_tx_status with pointer and length annotation. The struct ieee80211_rate_status allows to: (1) receive tx power status feedback for transmit power control (TPC) per packet or packet retry (2) dynamic mapping of wifi chip specific multi-rate retry (mrr) chains with different lengths (3) increase the limit of annotatable rate indices to support IEEE802.11ac rate sets and beyond ieee80211_tx_info, control and status buffer, and ieee80211_tx_rate cannot be used to achieve these goals due to fixed size limitations. Our new struct contains a struct rate_info to annotate the rate that was used, retry count of the rate and tx power. It is intended for all information related to RC and TPC that needs to be passed from driver to mac80211 and its RC/TPC algorithms like Minstrel_HT. It corresponds to one stage in an mrr. Multiple subsequent instances of this struct can be included in struct ieee80211_tx_status via a pointer and a length variable. Those instances can be allocated on-stack. The former reference to a single instance of struct rate_info is replaced with our new annotation. An extension is introduced to struct ieee80211_hw. There are two new members called 'tx_power_levels' and 'max_txpwr_levels_idx' acting as a tx power level table. When a wifi device is registered, the driver shall supply all supported power levels in this list. This allows to support several quirks like differing power steps in power level ranges or alike. TPC can use this for algorithm and thus be designed more abstract instead of handling all possible step widths individually. Further mandatory changes in status.c, mt76 and ath11k drivers due to the removal of 'struct rate_info *rate' are also included. status.c already uses the information in ieee80211_tx_status->rate in radiotap, this is now changed to use ieee80211_rate_status->rate_idx. mt76 driver already uses struct rate_info to pass the tx rate to status path. The new members of the ieee80211_tx_status are set to NULL and 0 because the previously passed rate is not relevant to rate control and accurate information is passed via tx_info->status.rates. For ath11k, the txrate can be passed via this struct because ath11k uses firmware RC and thus the information does not interfere with software RC. Compile-Tested: current wireless-next tree with all flags on Tested-on: Xiaomi 4A Gigabit (MediaTek MT7603E, MT7612E) with OpenWrt Linux 5.10.113 Signed-off-by: Jonas Jelonek <jelonek.jonas@gmail.com> Link: https://lore.kernel.org/r/20220509173958.1398201-2-jelonek.jonas@gmail.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * | | | nl80211: Parse NL80211_ATTR_HE_BSS_COLOR as a part of nl80211_parse_beaconRameshkumar Sundaram2022-05-161-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | NL80211_ATTR_HE_BSS_COLOR attribute can be included in both NL80211_CMD_START_AP and NL80211_CMD_SET_BEACON commands. Move he_bss_color from cfg80211_ap_settings to cfg80211_beacon_data and parse NL80211_ATTR_HE_BSS_COLOR as a part of nl80211_parse_beacon() to have bss color settings parsed for both start ap and set beacon commands. Add a new flag he_bss_color_valid to indicate whether NL80211_ATTR_HE_BSS_COLOR attribute is included. Signed-off-by: Rameshkumar Sundaram <quic_ramess@quicinc.com> Link: https://lore.kernel.org/r/1649867295-7204-2-git-send-email-quic_ramess@quicinc.com [fix build ...] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * | | | mac80211: remove stray multi_sta_back_32bit docsJohannes Berg2022-05-161-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This field doesn't exist, remove the docs for it. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * | | | mac80211: fix typo in documentationJohannes Berg2022-05-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is called offload_flags, remove the extra 'a'. Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * | | | cfg80211: support disabling EHT modeMuna Sinada2022-05-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow userspace to disable EHT mode during association. Signed-off-by: Muna Sinada <quic_msinada@quicinc.com> Signed-off-by: Aloka Dixit <quic_alokad@quicinc.com> Link: https://lore.kernel.org/r/20220323224636.20211-1-quic_alokad@quicinc.com Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * | | | cfg80211: remove cfg80211_get_chan_state()Johannes Berg2022-05-041-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We haven't used this function for years, since commit c781944b71f8 ("cfg80211: Remove unused cfg80211_can_use_iftype_chan()") which itself removed a function unused since commit 97dc94f1d933 ("cfg80211: remove channel_switch combination check"), almost eight years ago. Also remove the now unused enum cfg80211_chan_mode and some struct members that were only used for this function. Link: https://lore.kernel.org/r/20220412220958.1a191dca19d7.Ide4448f02d0e2f1ca2992971421ffc1933a5370a@changeid Signed-off-by: Johannes Berg <johannes.berg@intel.com>
* | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2022-05-193-2/+16
|\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | drivers/net/ethernet/mellanox/mlx5/core/main.c b33886971dbc ("net/mlx5: Initialize flow steering during driver probe") 40379a0084c2 ("net/mlx5_fpga: Drop INNOVA TLS support") f2b41b32cde8 ("net/mlx5: Remove ipsec_ops function table") https://lore.kernel.org/all/20220519040345.6yrjromcdistu7vh@sx1/ 16d42d313350 ("net/mlx5: Drain fw_reset when removing device") 8324a02c342a ("net/mlx5: Add exit route when waiting for FW") https://lore.kernel.org/all/20220519114119.060ce014@canb.auug.org.au/ tools/testing/selftests/net/mptcp/mptcp_join.sh e274f7154008 ("selftests: mptcp: add subflow limits test-cases") b6e074e171bc ("selftests: mptcp: add infinite map testcase") 5ac1d2d63451 ("selftests: mptcp: Add tests for userspace PM type") https://lore.kernel.org/all/20220516111918.366d747f@canb.auug.org.au/ net/mptcp/options.c ba2c89e0ea74 ("mptcp: fix checksum byte order") 1e39e5a32ad7 ("mptcp: infinite mapping sending") ea66758c1795 ("tcp: allow MPTCP to update the announced window") https://lore.kernel.org/all/20220519115146.751c3a37@canb.auug.org.au/ net/mptcp/pm.c 95d686517884 ("mptcp: fix subflow accounting on close") 4d25247d3ae4 ("mptcp: bypass in-kernel PM restrictions for non-kernel PMs") https://lore.kernel.org/all/20220516111435.72f35dca@canb.auug.org.au/ net/mptcp/subflow.c ae66fb2ba6c3 ("mptcp: Do TCP fallback on early DSS checksum failure") 0348c690ed37 ("mptcp: add the fallback check") f8d4bcacff3b ("mptcp: infinite mapping receiving") https://lore.kernel.org/all/20220519115837.380bb8d4@canb.auug.org.au/ Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * \ \ \ \ Merge branch 'master' of ↵David S. Miller2022-05-182-1/+14
| |\ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2022-05-18 1) Fix "disable_policy" flag use when arriving from different devices. From Eyal Birger. 2) Fix error handling of pfkey_broadcast in function pfkey_process. From Jiasheng Jiang. 3) Check the encryption module availability consistency in pfkey. From Thomas Bartschies. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | | | | xfrm: fix "disable_policy" flag use when arriving from different devicesEyal Birger2022-05-162-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In IPv4 setting the "disable_policy" flag on a device means no policy should be enforced for traffic originating from the device. This was implemented by seting the DST_NOPOLICY flag in the dst based on the originating device. However, dsts are cached in nexthops regardless of the originating devices, in which case, the DST_NOPOLICY flag value may be incorrect. Consider the following setup: +------------------------------+ | ROUTER | +-------------+ | +-----------------+ | | ipsec src |----|-|ipsec0 | | +-------------+ | |disable_policy=0 | +----+ | | +-----------------+ |eth1|-|----- +-------------+ | +-----------------+ +----+ | | noipsec src |----|-|eth0 | | +-------------+ | |disable_policy=1 | | | +-----------------+ | +------------------------------+ Where ROUTER has a default route towards eth1. dst entries for traffic arriving from eth0 would have DST_NOPOLICY and would be cached and therefore can be reused by traffic originating from ipsec0, skipping policy check. Fix by setting a IPSKB_NOPOLICY flag in IPCB and observing it instead of the DST in IN/FWD IPv4 policy checks. Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2") Reported-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Eyal Birger <eyal.birger@gmail.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | | Revert "tcp/dccp: get rid of inet_twsk_purge()"Eric Dumazet2022-05-131-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commits: 0dad4087a86a2cbe177404dc73f18ada26a2c390 ("tcp/dccp: get rid of inet_twsk_purge()") d507204d3c5cc57d9a8bdf0a477615bb59ea1611 ("tcp/dccp: add tw->tw_bslot") As Leonard pointed out, a newly allocated netns can happen to reuse a freed 'struct net'. While TCP TW timers were covered by my patches, other things were not: 1) Lookups in rx path (INET_MATCH() and INET6_MATCH()), as they look at 4-tuple plus the 'struct net' pointer. 2) /proc/net/tcp[6] and inet_diag, same reason. 3) hashinfo->bhash[], same reason. Fixing all this seems risky, lets instead revert. In the future, we might have a per netns tcp hash table, or a per netns list of timewait sockets... Fixes: 0dad4087a86a ("tcp/dccp: get rid of inet_twsk_purge()") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Leonard Crestez <cdleonard@gmail.com> Tested-by: Leonard Crestez <cdleonard@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | tls: Add opt-in zerocopy mode of sendfile()Boris Pismenny2022-05-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | TLS device offload copies sendfile data to a bounce buffer before transmitting. It allows to maintain the valid MAC on TLS records when the file contents change and a part of TLS record has to be retransmitted on TCP level. In many common use cases (like serving static files over HTTPS) the file contents are not changed on the fly. In many use cases breaking the connection is totally acceptable if the file is changed during transmission, because it would be received corrupted in any case. This commit allows to optimize performance for such use cases to providing a new optional mode of TLS sendfile(), in which the extra copy is skipped. Removing this copy improves performance significantly, as TLS and TCP sendfile perform the same operations, and the only overhead is TLS header/trailer insertion. The new mode can only be enabled with the new socket option named TLS_TX_ZEROCOPY_SENDFILE on per-socket basis. It preserves backwards compatibility with existing applications that rely on the copying behavior. The new mode is safe, meaning that unsolicited modifications of the file being sent can't break integrity of the kernel. The worst thing that can happen is sending a corrupted TLS record, which is in any case not forbidden when using regular TCP sockets. Sockets other than TLS device offload are not affected by the new socket option. The actual status of zerocopy sendfile can be queried with sock_diag. Performance numbers in a single-core test with 24 HTTPS streams on nginx, under 100% CPU load: * non-zerocopy: 33.6 Gbit/s * zerocopy: 79.92 Gbit/s CPU: Intel(R) Xeon(R) Platinum 8380 CPU @ 2.30GHz Signed-off-by: Boris Pismenny <borisp@nvidia.com> Signed-off-by: Tariq Toukan <tariqt@nvidia.com> Signed-off-by: Maxim Mikityanskiy <maximmi@nvidia.com> Reviewed-by: Jakub Kicinski <kuba@kernel.org> Link: https://lore.kernel.org/r/20220518092731.1243494-1-maximmi@nvidia.com Signed-off-by: Paolo Abeni <pabeni@redhat.com>
* | | | | | | inet: rename INET_MATCH()Eric Dumazet2022-05-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is no longer a macro, but an inlined function. INET_MATCH() -> inet_match() Signed-off-by: Eric Dumazet <edumazet@google.com> Suggested-by: Olivier Hartkopp <socketcan@hartkopp.net> Suggested-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | ipv6: add READ_ONCE(sk->sk_bound_dev_if) in INET6_MATCH()Eric Dumazet2022-05-161-9/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | INET6_MATCH() runs without holding a lock on the socket. We probably need to annotate most reads. This patch makes INET6_MATCH() an inline function to ease our changes. v2: inline function only defined if IS_ENABLED(CONFIG_IPV6) Change the name to inet6_match(), this is no longer a macro. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | tcp: sk->sk_bound_dev_if once in inet_request_bound_dev_if()Eric Dumazet2022-05-161-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | inet_request_bound_dev_if() reads sk->sk_bound_dev_if twice while listener socket is not locked. Another cpu could change this field under us. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | net: annotate races around sk->sk_bound_dev_ifEric Dumazet2022-05-162-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | UDP sendmsg() is lockless, and reads sk->sk_bound_dev_if while this field can be changed by another thread. Adds minimal annotations to avoid KCSAN splats for UDP. Following patches will add more annotations to potential lockless readers. BUG: KCSAN: data-race in __ip6_datagram_connect / udpv6_sendmsg write to 0xffff888136d47a94 of 4 bytes by task 7681 on cpu 0: __ip6_datagram_connect+0x6e2/0x930 net/ipv6/datagram.c:221 ip6_datagram_connect+0x2a/0x40 net/ipv6/datagram.c:272 inet_dgram_connect+0x107/0x190 net/ipv4/af_inet.c:576 __sys_connect_file net/socket.c:1900 [inline] __sys_connect+0x197/0x1b0 net/socket.c:1917 __do_sys_connect net/socket.c:1927 [inline] __se_sys_connect net/socket.c:1924 [inline] __x64_sys_connect+0x3d/0x50 net/socket.c:1924 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae read to 0xffff888136d47a94 of 4 bytes by task 7670 on cpu 1: udpv6_sendmsg+0xc60/0x16e0 net/ipv6/udp.c:1436 inet6_sendmsg+0x5f/0x80 net/ipv6/af_inet6.c:652 sock_sendmsg_nosec net/socket.c:705 [inline] sock_sendmsg net/socket.c:725 [inline] ____sys_sendmsg+0x39a/0x510 net/socket.c:2413 ___sys_sendmsg net/socket.c:2467 [inline] __sys_sendmmsg+0x267/0x4c0 net/socket.c:2553 __do_sys_sendmmsg net/socket.c:2582 [inline] __se_sys_sendmmsg net/socket.c:2579 [inline] __x64_sys_sendmmsg+0x53/0x60 net/socket.c:2579 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x50 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x44/0xae value changed: 0x00000000 -> 0xffffff9b Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 7670 Comm: syz-executor.3 Tainted: G W 5.18.0-rc1-syzkaller-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 I chose to not add Fixes: tag because race has minor consequences and stable teams busy enough. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | net: allow gro_max_size to exceed 65536Alexander Duyck2022-05-161-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow the gro_max_size to exceed a value larger than 65536. There weren't really any external limitations that prevented this other than the fact that IPv4 only supports a 16 bit length field. Since we have the option of adding a hop-by-hop header for IPv6 we can allow IPv6 to exceed this value and for IPv4 and non-TCP flows we can cap things at 65536 via a constant rather than relying on gro_max_size. [edumazet] limit GRO_MAX_SIZE to (8 * 65535) to avoid overflows. Signed-off-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | ipv6/gso: remove temporary HBH/jumbo headerEric Dumazet2022-05-161-0/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ipv6 tcp and gro stacks will soon be able to build big TCP packets, with an added temporary Hop By Hop header. If GSO is involved for these large packets, we need to remove the temporary HBH header before segmentation happens. v2: perform HBH removal from ipv6_gso_segment() instead of skb_segment() (Alexander feedback) Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | ipv6: add struct hop_jumbo_hdr definitionEric Dumazet2022-05-161-0/+11
| |_|_|/ / / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Following patches will need to add and remove local IPv6 jumbogram options to enable BIG TCP. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Alexander Duyck <alexanderduyck@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextDavid S. Miller2022-05-168-80/+70
|\ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next This is v2 including deadlock fix in conntrack ecache rework reported by Jakub Kicinski. The following patchset contains Netfilter updates for net-next, mostly updates to conntrack from Florian Westphal. 1) Add a dedicated list for conntrack event redelivery. 2) Include event redelivery list in conntrack dumps of dying type. 3) Remove per-cpu dying list for event redelivery, not used anymore. 4) Add netns .pre_exit to cttimeout to zap timeout objects before synchronize_rcu() call. 5) Remove nf_ct_unconfirmed_destroy. 6) Add generation id for conntrack extensions for conntrack timeout and helpers. 7) Detach timeout policy from conntrack on cttimeout module removal. 8) Remove __nf_ct_unconfirmed_destroy. 9) Remove unconfirmed list. 10) Remove unconditional local_bh_disable in init_conntrack(). 11) Consolidate conntrack iterator nf_ct_iterate_cleanup(). 12) Detect if ctnetlink listeners exist to short-circuit event path early. 13) Un-inline nf_ct_ecache_ext_add(). 14) Add nf_conntrack_events autodetect ctnetlink listener mode and make it default. 15) Add nf_ct_ecache_exist() to check for event cache extension. 16) Extend flowtable reverse route lookup to include source, iif, tos and mark, from Sven Auhagen. 17) Do not verify zero checksum UDP packets in nf_reject, from Kevin Mitchell. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | netfilter: conntrack: skip verification of zero UDP checksumKevin Mitchell2022-05-131-4/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The checksum is optional for UDP packets. However nf_reject would previously require a valid checksum to elicit a response such as ICMP_DEST_UNREACH. Add some logic to nf_reject_verify_csum to determine if a UDP packet has a zero checksum and should therefore not be verified. Signed-off-by: Kevin Mitchell <kevmitch@arista.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | netfilter: prefer extension check to pointer checkFlorian Westphal2022-05-132-17/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The pointer check usually results in a 'false positive': its likely that the ctnetlink module is loaded but no event monitoring is enabled. After recent change to autodetect ctnetlink usage and only allocate the ecache extension if a listener is active, check if the extension is present on a given conntrack. If its not there, there is nothing to report and calls to the notification framework can be elided. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | netfilter: conntrack: un-inline nf_ct_ecache_ext_addFlorian Westphal2022-05-131-25/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only called when new ct is allocated or the extension isn't present. This function will be extended, place this in the conntrack module instead of inlining. The callers already depend on nf_conntrack module. Return value is changed to bool, noone used the returned pointer. Make sure that the core drops the newly allocated conntrack if the extension is requested but can't be added. This makes it necessary to ifdef the section, as the stub always returns false we'd drop every new conntrack if the the ecache extension is disabled in kconfig. Add from data path (xt_CT, nft_ct) is unchanged. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | netfilter: nfnetlink: allow to detect if ctnetlink listeners existFlorian Westphal2022-05-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | At this time, every new conntrack gets the 'event cache extension' enabled for it. This is because the 'net.netfilter.nf_conntrack_events' sysctl defaults to 1. Changing the default to 0 means that commands that rely on the event notification extension, e.g. 'conntrack -E' or conntrackd, stop working. We COULD detect if there is a listener by means of 'nfnetlink_has_listeners()' and only add the extension if this is true. The downside is a dependency from conntrack module to nfnetlink module. This adds a different way: inc/dec a counter whenever a ctnetlink group is being (un)subscribed and toggle a flag in struct net. Next patches will take advantage of this and will only add the event extension if the flag is set. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | netfilter: conntrack: add nf_ct_iter_data object for nf_ct_iterate_cleanup*()Pablo Neira Ayuso2022-05-131-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a structure to collect all the context data that is passed to the cleanup iterator. struct nf_ct_iter_data { struct net *net; void *data; u32 portid; int report; }; There is a netns field that allows to clean up conntrack entries specifically owned by the specified netns. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | netfilter: conntrack: remove unconfirmed listFlorian Westphal2022-05-132-7/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It has no function anymore and can be removed. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | netfilter: extensions: introduce extension genid countFlorian Westphal2022-05-132-16/+25
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Multiple netfilter extensions store pointers to external data in their extension area struct. Examples: 1. Timeout policies 2. Connection tracking helpers. No references are taken for these. When a helper or timeout policy is removed, the conntrack table gets traversed and affected extensions are cleared. Conntrack entries not yet in the hashtable are referenced via a special list, the unconfirmed list. On removal of a policy or connection tracking helper, the unconfirmed list gets traversed an all entries are marked as dying, this prevents them from getting committed to the table at insertion time: core checks for dying bit, if set, the conntrack entry gets destroyed at confirm time. The disadvantage is that each new conntrack has to be added to the percpu unconfirmed list, and each insertion needs to remove it from this list. The list is only ever needed when a policy or helper is removed -- a rare occurrence. Add a generation ID count: Instead of adding to the list and then traversing that list on policy/helper removal, increment a counter that is stored in the extension area. For unconfirmed conntracks, the extension has the genid valid at ct allocation time. Removal of a helper/policy etc. increments the counter. At confirmation time, validate that ext->genid == global_id. If the stored number is not the same, do not allow the conntrack insertion, just like as if a confirmed-list traversal would have flagged the entry as dying. After insertion, the genid is no longer relevant (conntrack entries are now reachable via the conntrack table iterators and is set to 0. This allows removal of the percpu unconfirmed list. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | netfilter: remove nf_ct_unconfirmed_destroy helperFlorian Westphal2022-05-131-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This helper tags connections not yet in the conntrack table as dying. These nf_conn entries will be dropped instead when the core attempts to insert them from the input or postrouting 'confirm' hook. After the previous change, the entries get unlinked from the list earlier, so that by the time the actual exit hook runs, new connections no longer have a timeout policy assigned. Its enough to walk the hashtable instead. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | netfilter: cttimeout: decouple unlink and free on netns destructionFlorian Westphal2022-05-131-8/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make it so netns pre_exit unlinks the objects from the pernet list, so they cannot be found anymore. netns core issues a synchronize_rcu() before calling the exit hooks so any the time the exit hooks run unconfirmed nf_conn entries have been free'd or they have been committed to the hashtable. The exit hook still tags unconfirmed entries as dying, this can now be removed in a followup change. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | netfilter: conntrack: remove the percpu dying listFlorian Westphal2022-05-131-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Its no longer needed. Entries that need event redelivery are placed on the new pernet dying list. The advantage is that there is no need to take additional spinlock on conntrack removal unless event redelivery failed or the conntrack entry was never added to the table in the first place (confirmed bit not set). The IPS_CONFIRMED bit now needs to be set as soon as the entry has been unlinked from the unconfirmed list, else the destroy function may attempt to unlink it a second time. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | netfilter: conntrack: include ecache dying list in dumpsFlorian Westphal2022-05-131-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The new pernet dying list includes conntrack entries that await delivery of the 'destroy' event via ctnetlink. The old percpu dying list will be removed soon. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | | | | | netfilter: ecache: use dedicated list for event redeliveryFlorian Westphal2022-05-132-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This disentangles event redelivery and the percpu dying list. Because entries are now stored on a dedicated list, all entries are in NFCT_ECACHE_DESTROY_FAIL state and all entries still have confirmed bit set -- the reference count is at least 1. The 'struct net' back-pointer can be removed as well. The pcpu dying list will be removed eventually, it has no functionality. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | | | | | Merge branch 'master' of ↵Jakub Kicinski2022-05-131-8/+12
|\ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next Steffen Klassert says: ==================== pull request (net-next): ipsec-next 2022-05-13 1) Cleanups for the code behind the XFRM offload API. This is a preparation for the extension of the API for policy offload. From Leon Romanovsky. * 'master' of git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec-next: xfrm: drop not needed flags variable in XFRM offload struct net/mlx5e: Use XFRM state direction instead of flags netdevsim: rely on XFRM state direction instead of flags ixgbe: propagate XFRM offload state direction instead of flags xfrm: store and rely on direction to construct offload flags xfrm: rename xfrm_state_offload struct to allow reuse xfrm: delete not used number of external headers xfrm: free not used XFRM_ESP_NO_TRAILER flag ==================== Link: https://lore.kernel.org/r/20220513151218.4010119-1-steffen.klassert@secunet.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | | | | | xfrm: drop not needed flags variable in XFRM offload structLeon Romanovsky2022-05-061-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After drivers were converted to rely on direction, the flags is not used anymore and can be removed. Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | | | xfrm: store and rely on direction to construct offload flagsLeon Romanovsky2022-05-061-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | XFRM state doesn't need anything from flags except to understand direction, so store it separately. For future patches, such change will allow us to reuse xfrm_dev_offload for policy offload too, which has three possible directions instead of two. Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | | | xfrm: rename xfrm_state_offload struct to allow reuseLeon Romanovsky2022-05-061-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The struct xfrm_state_offload has all fields needed to hold information for offloaded policies too. In order to do not create new struct with same fields, let's rename existing one and reuse it later. Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | | | xfrm: delete not used number of external headersLeon Romanovsky2022-05-061-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | num_exthdrs is set but never used, so delete it. Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| * | | | | | | xfrm: free not used XFRM_ESP_NO_TRAILER flagLeon Romanovsky2022-05-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After removal of Innova IPsec support from mlx5 driver, the last user of this XFRM_ESP_NO_TRAILER was gone too. This means that we can safely remove it as no other hardware is capable (or need) to remove ESP trailer. Reviewed-by: Raed Salem <raeds@nvidia.com> Signed-off-by: Leon Romanovsky <leonro@nvidia.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | | | | | | | inet: add READ_ONCE(sk->sk_bound_dev_if) in INET_MATCH()Eric Dumazet2022-05-132-22/+16
| |_|_|_|_|/ / |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | INET_MATCH() runs without holding a lock on the socket. We probably need to annotate most reads. This patch makes INET_MATCH() an inline function to ease our changes. v2: We remove the 32bit version of it, as modern compilers should generate the same code really, no need to try to be smarter. Also make 'struct net *net' the first argument. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | net: inet: Retire port only listening_hashMartin KaFai Lau2022-05-132-42/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The listen sk is currently stored in two hash tables, listening_hash (hashed by port) and lhash2 (hashed by port and address). After commit 0ee58dad5b06 ("net: tcp6: prefer listeners bound to an address") and commit d9fbc7f6431f ("net: tcp: prefer listeners bound to an address"), the TCP-SYN lookup fast path does not use listening_hash. The commit 05c0b35709c5 ("tcp: seq_file: Replace listening_hash with lhash2") also moved the seq_file (/proc/net/tcp) iteration usage from listening_hash to lhash2. There are still a few listening_hash usages left. One of them is inet_reuseport_add_sock() which uses the listening_hash to search a listen sk during the listen() system call. This turns out to be very slow on use cases that listen on many different VIPs at a popular port (e.g. 443). [ On top of the slowness in adding to the tail in the IPv6 case ]. The latter patch has a selftest to demonstrate this case. This patch takes this chance to move all remaining listening_hash usages to lhash2 and then retire listening_hash. Since most changes need to be done together, it is hard to cut the listening_hash to lhash2 switch into small patches. The changes in this patch is highlighted here for the review purpose. 1. Because of the listening_hash removal, lhash2 can use the sk->sk_nulls_node instead of the icsk->icsk_listen_portaddr_node. This will also keep the sk_unhashed() check to work as is after stop adding sk to listening_hash. The union is removed from inet_listen_hashbucket because only nulls_head is needed. 2. icsk->icsk_listen_portaddr_node and its helpers are removed. 3. The current lhash2 users needs to iterate with sk_nulls_node instead of icsk_listen_portaddr_node. One case is in the inet[6]_lhash2_lookup(). Another case is the seq_file iterator in tcp_ipv4.c. One thing to note is sk_nulls_next() is needed because the old inet_lhash2_for_each_icsk_continue() does a "next" first before iterating. 4. Move the remaining listening_hash usage to lhash2 inet_reuseport_add_sock() which this series is trying to improve. inet_diag.c and mptcp_diag.c are the final two remaining use cases and is moved to lhash2 now also. Signed-off-by: Martin KaFai Lau <kafai@fb.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>