linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	inet: preserve const qualifier in inet_sk()	Eric Dumazet	2023-03-17	8	-14/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We can change inet_sk() to propagate const qualifier of its argument. This should avoid some potential errors caused by accidental (const -> not_const) promotion. Other helpers like tcp_sk(), udp_sk(), raw_sk() will be handled in separate patch series. v2: use container_of_const() as advised by Jakub and Linus Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/netdev/20230315142841.3a2ac99a@kernel.org/ Link: https://lore.kernel.org/netdev/CAHk-=wiOf12nrYEF2vJMcucKjWPN-Ns_SW9fA7LwST_2Dzp7rw@mail.gmail.com/ Signed-off-by: David S. Miller <davem@davemloft.net>
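The const propagation described above can be sketched in plain C11 as follows. This is only an illustration: `struct demo_sock`, `struct demo_inet_sock`, `demo_container_of()`, `demo_container_of_const()` and `demo_inet_sk()` are hypothetical userspace stand-ins, not the kernel's definitions, and `__typeof__` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-ins for struct sock and struct inet_sock. */
struct demo_sock {
	int sk_err;
};

struct demo_inet_sock {
	struct demo_sock sk;		/* embedded base object */
	unsigned short inet_num;
};

/* Classic container_of(): always yields a non-const pointer. */
#define demo_container_of(ptr, type, member) \
	((type *)((char *)(ptr) - offsetof(type, member)))

/*
 * Const-preserving variant: when ptr points to a const object the result
 * is a pointer to const, otherwise it is a plain pointer.
 */
#define demo_container_of_const(ptr, type, member)			\
	_Generic((ptr),							\
		const __typeof__(*(ptr)) *:				\
			((const type *)demo_container_of(ptr, type, member)), \
		default:						\
			((type *)demo_container_of(ptr, type, member)))

#define demo_inet_sk(ptr) \
	demo_container_of_const(ptr, struct demo_inet_sock, sk)

/* A reader that only holds a const pointer gets a const pointer back. */
static unsigned short demo_read_port(const struct demo_sock *sk)
{
	const struct demo_inet_sock *inet;

	assert(sk != NULL);
	inet = demo_inet_sk(sk);
	return inet->inet_num;
}
```

With this shape, assigning `demo_inet_sk(sk)` to a non-const pointer inside `demo_read_port()` would draw a compiler diagnostic instead of silently dropping the qualifier.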
*	netlink: specs: allow uapi-header in genetlink	Jakub Kicinski	2023-03-17	3	-2/+5
\| \| \| \| \| \| \| \| \|	Chuck wanted to put the UAPI header in linux/net/ which seems reasonable, allow genetlink families to choose the location. It doesn't really matter for non-C-like languages. Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	netlink-specs: add partial specification for devlink	Jakub Kicinski	2023-03-17	1	-0/+198
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Devlink is quite complex but put in the very basics so we can incrementally fill in the commands as needed. $ ./tools/net/ynl/cli.py --spec Documentation/netlink/specs/devlink.yaml \ --dump get [{'bus-name': 'netdevsim', 'dev-name': 'netdevsim1', 'dev-stats': {'reload-stats': {'reload-action-info': {'reload-action': 1, 'reload-action-stats': {'reload-stats-entry': [{'reload-stats-limit': 0, 'reload-stats-value': 0}]}}}, 'remote-reload-stats': {'reload-action-info': {'reload-action': 2, 'reload-action-stats': {'reload-stats-entry': [{'reload-stats-limit': 0, 'reload-stats-value': 0}, {'reload-stats-limit': 1, 'reload-stats-value': 0}]}}}}, 'reload-failed': 0}] Signed-off-by: Jakub Kicinski <kuba@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'net-packet-KCSAN'	David S. Miller	2023-03-17	3	-63/+87
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Eric Dumazet says: ==================== net/packet: KCSAN awareness This series is based on one syzbot report [1] Seven 'flags/booleans' are converted to atomic bit variant. po->xmit and po->tp_tstamp accesses get annotations. [1] BUG: KCSAN: data-race in packet_rcv / packet_setsockopt read-write to 0xffff88813dbe84e4 of 1 bytes by task 12312 on cpu 0: packet_setsockopt+0xb77/0xe60 net/packet/af_packet.c:3900 __sys_setsockopt+0x212/0x2b0 net/socket.c:2252 __do_sys_setsockopt net/socket.c:2263 [inline] __se_sys_setsockopt net/socket.c:2260 [inline] __x64_sys_setsockopt+0x62/0x70 net/socket.c:2260 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffff88813dbe84e4 of 1 bytes by task 1911 on cpu 1: packet_rcv+0x4b1/0xa40 net/packet/af_packet.c:2187 deliver_skb net/core/dev.c:2189 [inline] dev_queue_xmit_nit+0x3a9/0x620 net/core/dev.c:2259 xmit_one+0x71/0x2a0 net/core/dev.c:3586 dev_hard_start_xmit+0x72/0x120 net/core/dev.c:3606 __dev_queue_xmit+0x91c/0x11c0 net/core/dev.c:4256 dev_queue_xmit include/linux/netdevice.h:3008 [inline] neigh_hh_output include/net/neighbour.h:530 [inline] neigh_output include/net/neighbour.h:544 [inline] ip6_finish_output2+0x9e9/0xc30 net/ipv6/ip6_output.c:134 __ip6_finish_output net/ipv6/ip6_output.c:195 [inline] ip6_finish_output+0x395/0x4f0 net/ipv6/ip6_output.c:206 NF_HOOK_COND include/linux/netfilter.h:291 [inline] ip6_output+0x10e/0x210 net/ipv6/ip6_output.c:227 dst_output include/net/dst.h:445 [inline] ip6_local_out+0x60/0x80 net/ipv6/output_core.c:161 ip6tunnel_xmit 
include/net/ip6_tunnel.h:161 [inline] udp_tunnel6_xmit_skb+0x321/0x4a0 net/ipv6/ip6_udp_tunnel.c:109 send6+0x2ed/0x3b0 drivers/net/wireguard/socket.c:152 wg_socket_send_skb_to_peer+0xbb/0x120 drivers/net/wireguard/socket.c:178 wg_packet_create_data_done drivers/net/wireguard/send.c:251 [inline] wg_packet_tx_worker+0x142/0x360 drivers/net/wireguard/send.c:276 process_one_work+0x3d3/0x720 kernel/workqueue.c:2289 worker_thread+0x618/0xa70 kernel/workqueue.c:2436 kthread+0x1a9/0x1e0 kernel/kthread.c:376 ret_from_fork+0x1f/0x30 arch/x86/entry/entry_64.S:306 ==================== Reviewed-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net/packet: convert po->pressure to an atomic flag	Eric Dumazet	2023-03-17	2	-7/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Not only does this remove some READ_ONCE()/WRITE_ONCE() annotations, it also removes one integer. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net/packet: convert po->running to an atomic flag	Eric Dumazet	2023-03-17	3	-12/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Instead of consuming 32 bits for po->running, use one available bit in po->flags. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net/packet: convert po->has_vnet_hdr to an atomic flag	Eric Dumazet	2023-03-17	3	-11/+12
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	po->has_vnet_hdr can be read locklessly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net/packet: convert po->tp_loss to an atomic flag	Eric Dumazet	2023-03-17	3	-6/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tp_loss can be read locklessly. Convert it to an atomic flag to avoid races. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net/packet: convert po->tp_tx_has_off to an atomic flag	Eric Dumazet	2023-03-17	2	-5/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This is to use existing space in po->flags, and reclaim the storage used by the non atomic bit fields. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net/packet: annotate accesses to po->tp_tstamp	Eric Dumazet	2023-03-17	2	-5/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	tp_tstamp is read locklessly. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net/packet: convert po->auxdata to an atomic flag	Eric Dumazet	2023-03-17	3	-8/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	po->auxdata can be read while another thread is changing its value, potentially raising a KCSAN splat. Convert it to PACKET_SOCK_AUXDATA flag. Fixes: 8dc419447415 ("[PACKET]: Add optional checksum computation for recvmsg") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net/packet: convert po->origdev to an atomic flag	Eric Dumazet	2023-03-17	3	-8/+26
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	syzbot/KCSAN reported that po->origdev can be read while another thread is changing its value. We can avoid this splat by converting this field to an actual bit. The following patches will convert the remaining 1-bit fields. Fixes: 80feaacb8a64 ("[AF_PACKET]: Add option to return orig_dev to userspace.") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: David S. Miller <davem@davemloft.net>
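The idea of turning a racy 1-bit field into an atomic flag bit can be sketched in userspace C11 as follows. This is a minimal illustration under stated assumptions: `struct demo_packet_sock`, the `DEMO_SOCK_*` flag numbers and the two helpers are hypothetical names, and `<stdatomic.h>` stands in for the kernel's own bit operations.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Hypothetical flag numbers standing in for the PACKET_SOCK_* bits. */
enum demo_sock_flag {
	DEMO_SOCK_ORIGDEV,
	DEMO_SOCK_AUXDATA,
	DEMO_SOCK_FLAG_MAX,
};

struct demo_packet_sock {
	atomic_ulong flags;	/* one word shared by all boolean options */
};

/* Writer side: set or clear one bit without disturbing its neighbours. */
static void demo_flag_set(struct demo_packet_sock *po,
			  enum demo_sock_flag flag, bool val)
{
	unsigned long mask = 1UL << flag;

	assert(flag < DEMO_SOCK_FLAG_MAX);
	if (val)
		atomic_fetch_or(&po->flags, mask);
	else
		atomic_fetch_and(&po->flags, ~mask);
}

/* Reader side: a single atomic load, usable without holding any lock. */
static bool demo_flag_test(struct demo_packet_sock *po,
			   enum demo_sock_flag flag)
{
	assert(flag < DEMO_SOCK_FLAG_MAX);
	return atomic_load(&po->flags) & (1UL << flag);
}
```

Plain C bit-fields that share a storage unit are updated with a non-atomic read-modify-write of the whole unit, which is why a concurrent reader or writer of a neighbouring field can be flagged as a data race; an atomic word with per-bit masks avoids that.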
\| *	net/packet: annotate accesses to po->xmit	Eric Dumazet	2023-03-17	1	-4/+8
\|/ \| \| \| \| \| \| \| \| \| \| \|	po->xmit can be set from setsockopt(PACKET_QDISC_BYPASS), while read locklessly. Use READ_ONCE()/WRITE_ONCE() to avoid potential load/store tearing issues. Fixes: d346a3fae3ff ("packet: introduce PACKET_QDISC_BYPASS socket option") Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
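The READ_ONCE()/WRITE_ONCE() pattern for a locklessly read function pointer can be sketched as follows. The macros here are simplified userspace analogues built on volatile accesses (the kernel's versions are more elaborate), and `demo_xmit_fn`, `demo_set_bypass()` and `demo_send()` are hypothetical names introduced only for this example; `__typeof__` assumes GCC or Clang.

```c
#include <assert.h>
#include <stddef.h>

/* Simplified analogues: force exactly one load or store of the object. */
#define DEMO_READ_ONCE(x)	(*(const volatile __typeof__(x) *)&(x))
#define DEMO_WRITE_ONCE(x, val)	(*(volatile __typeof__(x) *)&(x) = (val))

enum demo_path { DEMO_PATH_QDISC = 1, DEMO_PATH_DIRECT = 2 };

typedef enum demo_path (*demo_xmit_fn)(void);

/* Hypothetical transmit paths: through the qdisc layer, or bypassing it. */
static enum demo_path demo_xmit_qdisc(void)  { return DEMO_PATH_QDISC; }
static enum demo_path demo_xmit_direct(void) { return DEMO_PATH_DIRECT; }

struct demo_packet_sock {
	demo_xmit_fn xmit;
};

/* setsockopt()-like writer: publish the new pointer with a single store. */
static void demo_set_bypass(struct demo_packet_sock *po, int bypass)
{
	DEMO_WRITE_ONCE(po->xmit, bypass ? demo_xmit_direct : demo_xmit_qdisc);
}

/* Lockless reader: load the pointer once, then only use the local copy. */
static enum demo_path demo_send(struct demo_packet_sock *po)
{
	demo_xmit_fn xmit = DEMO_READ_ONCE(po->xmit);

	assert(xmit != NULL);
	return xmit();
}
```

Taking a local snapshot matters because the reader must not check the pointer and then reload it: a concurrent writer could change it between the two loads.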
*	Merge branch 'gve-xdp-support'	David S. Miller	2023-03-17	10	-138/+1252
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Praveen Kaligineedi says: ==================== gve: Add XDP support for GQI-QPL format Adding support for XDP DROP, PASS, TX, REDIRECT for GQI QPL format. Add AF_XDP zero-copy support. When an XDP program is installed, dedicated TX queues are created to handle XDP traffic. The user needs to ensure that the number of configured TX queues is equal to the number of configured RX queues; and the number of TX/RX queues is less than or equal to half the maximum number of TX/RX queues. The XDP traffic from AF_XDP sockets and from other NICs (arriving via XDP_REDIRECT) will also egress through the dedicated XDP TX queues. Although these changes support AF_XDP socket in zero-copy mode, there is still a copy happening within the driver between XSK buffer pool and QPL bounce buffers in GQI-QPL format. The following example demonstrates how the XDP packets are mapped to TX queues: Example configuration: Max RX queues : 2N, Max TX queues : 2N Configured RX queues : N, Configured TX queues : N TX queue mapping: TX queues with queue id 0,...,N-1 will handle traffic from the stack. TX queues with queue id N,...,2N-1 will handle XDP traffic. 
For the XDP packets transmitted using XDP_TX action: <Egress TX queue id> = N + <Ingress RX queue id> For the XDP packets that arrive from other NICs via XDP_REDIRECT action: <Egress TX queue id> = N + ( smp_processor_id % N ) For AF_XDP zero-copy mode: <Egress TX queue id> = N + <AF_XDP TX queue id> Changes in v2: - Removed gve_close/gve_open when adding XDP dedicated queues. Instead we add and register additional TX queues when the XDP program is installed. If the allocation/registration fails we return error and do not install the XDP program. Added a new patch to enable adding TX queues without gve_close/gve_open - Removed xdp tx spin lock from this patch. It is needed for XDP_REDIRECT support as both XDP_REDIRECT and XDP_TX traffic share the dedicated XDP queues. Moved the code to add xdp tx spinlock to the subsequent patch that adds XDP_REDIRECT support. - Added netdev_err when the user tries to set rx/tx queues to the values not supported when XDP is enabled. - Removed rcu annotation for xdp_prog. We disable the napi prior to adding/removing the xdp_prog and reenable it after the program has been installed for all the queues. - Ring the tx doorbell once for napi instead of every XDP TX packet. - Added a new helper function for freeing the FIFO buffer - Unregister xdp rxq for all the queues when the registration fails during XDP program installation - Register xsk rxq only when XSK buff pool is enabled - Removed code accessing internal xsk_buff_pool fields - Removed sleep driven code when disabling XSK buff pool. Disable napi and re-enable it after disabling XSK pool. - Make sure that we clean up dma mappings on XSK pool disable - Use napi_if_scheduled_mark_missed to avoid unnecessary napi move to the CPU calling ndo_xsk_wakeup() Changes in v3: - Padding bytes are used if the XDP TX packet headers do not fit at tail of TX FIFO. Taking these padding bytes into account while checking if enough space is available in TX FIFO. 
Changes in v4: - Turn on the carrier based on the link status synchronously rather than asynchronously when XDP is installed/uninstalled - Set the supported flags in net_device.xdp_features ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
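The TX queue mapping from the cover letter can be written out as a few small functions. This is a sketch of the arithmetic only, with hypothetical function and parameter names; it is not the driver's code. Here `n` is the number of configured RX (and TX) queues, called N above.

```c
#include <assert.h>
#include <stdbool.h>

/*
 * Configuration rule: TX and RX queue counts are equal, and each is at
 * most half of the corresponding maximum.
 */
static bool demo_xdp_queue_config_ok(unsigned int num_rx, unsigned int num_tx,
				     unsigned int max_rx, unsigned int max_tx)
{
	return num_rx > 0 && num_tx == num_rx &&
	       2 * num_rx <= max_rx && 2 * num_tx <= max_tx;
}

/* XDP_TX: egress queue = N + ingress RX queue id. */
static unsigned int demo_xdp_tx_queue(unsigned int n, unsigned int rx_queue_id)
{
	assert(rx_queue_id < n);
	return n + rx_queue_id;
}

/* XDP_REDIRECT from another NIC: egress queue = N + (cpu % N). */
static unsigned int demo_xdp_redirect_queue(unsigned int n, unsigned int cpu)
{
	assert(n > 0);
	return n + cpu % n;
}

/* AF_XDP zero-copy: egress queue = N + AF_XDP TX queue id. */
static unsigned int demo_xsk_tx_queue(unsigned int n, unsigned int xsk_queue_id)
{
	assert(xsk_queue_id < n);
	return n + xsk_queue_id;
}
```

For example, with N = 4, a packet received on RX queue 1 and returned with XDP_TX leaves on TX queue 5, while stack traffic keeps using TX queues 0 to 3.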
\| *	gve: Add AF_XDP zero-copy support for GQI-QPL format	Praveen Kaligineedi	2023-03-17	5	-9/+274
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Adding AF_XDP zero-copy support. Note: Although these changes support AF_XDP socket in zero-copy mode, there is still a copy happening within the driver between XSK buffer pool and QPL bounce buffers in GQI-QPL format. In GQI-QPL queue format, the driver needs to allocate a fixed size memory, the size specified by vNIC device, for RX/TX and register this memory as a bounce buffer with the vNIC device when a queue is created. The number of pages in the bounce buffer is limited and the pages need to be made available to the vNIC by copying the RX data out to prevent head-of-line blocking. Therefore, we cannot pass the XSK buffer pool to the vNIC. The number of copies on RX path from the bounce buffer to XSK buffer is 2 for AF_XDP copy mode (bounce buffer -> allocated page frag -> XSK buffer) and 1 for AF_XDP zero-copy mode (bounce buffer -> XSK buffer). This patch contains the following changes: 1) Enable and disable XSK buffer pool 2) Copy XDP packets from QPL bounce buffers to XSK buffer on rx 3) Copy XDP packets from XSK buffer to QPL bounce buffers and ring the doorbell as part of XDP TX napi poll 4) ndo_xsk_wakeup callback support Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Jeroen de Borst <jeroendb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	gve: Add XDP REDIRECT support for GQI-QPL format	Praveen Kaligineedi	2023-03-17	5	-17/+138
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This patch contains the following changes: 1) Support for XDP REDIRECT action on rx 2) ndo_xdp_xmit callback support In GQI-QPL queue format, the driver needs to allocate a fixed size memory, the size specified by vNIC device, for RX/TX and register this memory as a bounce buffer with the vNIC device when a queue is created. The number of pages in the bounce buffer is limited and the pages need to be made available to the vNIC by copying the RX data out to prevent head-of-line blocking. The XDP_REDIRECT packets are therefore immediately copied to a newly allocated page. Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Jeroen de Borst <jeroendb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	gve: Add XDP DROP and TX support for GQI-QPL format	Praveen Kaligineedi	2023-03-17	5	-39/+687
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add support for XDP PASS, DROP and TX actions. This patch contains the following changes: 1) Support installing/uninstalling XDP program 2) Add dedicated XDP TX queues 3) Add support for XDP DROP action 4) Add support for XDP TX action Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Jeroen de Borst <jeroendb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	gve: Changes to add new TX queues	Praveen Kaligineedi	2023-03-17	6	-50/+104
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Changes to enable adding and removing TX queues without calling gve_close() and gve_open(). Made the following changes: 1) priv->tx, priv->rx and priv->qpls arrays are allocated based on max tx queues and max rx queues 2) Changed gve_adminq_create_tx_queues(), gve_adminq_destroy_tx_queues(), gve_tx_alloc_rings() and gve_tx_free_rings() functions to add/remove a subset of TX queues rather than all the TX queues. Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Jeroen de Borst <jeroendb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	gve: XDP support GQI-QPL: helper function changes	Praveen Kaligineedi	2023-03-17	8	-44/+70
\|/ \| \| \| \| \| \| \| \|	This patch adds/modifies helper functions needed to add XDP support. Signed-off-by: Praveen Kaligineedi <pkaligineedi@google.com> Reviewed-by: Jeroen de Borst <jeroendb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'net-sk_err-lockless-annotate'	David S. Miller	2023-03-17	20	-56/+63
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Eric Dumazet says: ==================== net: annotate lockless accesses to sk_err[_soft] This patch series is inspired by yet another syzbot report. Most poll() handlers are lockless and read sk->sk_err while other cpus can change it. Add READ_ONCE/WRITE_ONCE() to major/usual offenders. More to come later. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	af_unix: annotate lockless accesses to sk->sk_err	Eric Dumazet	2023-03-17	1	-4/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	unix_poll() and unix_dgram_poll() read sk->sk_err without any lock held. Add relevant READ_ONCE()/WRITE_ONCE() annotations. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	mptcp: annotate lockless accesses to sk->sk_err	Eric Dumazet	2023-03-17	3	-7/+7
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	mptcp_poll() reads sk->sk_err without socket lock held/owned. Add READ_ONCE() and WRITE_ONCE() to avoid load/store tearing. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	tcp: annotate lockless access to sk->sk_err	Eric Dumazet	2023-03-17	6	-14/+15
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	tcp_poll() reads sk->sk_err without socket lock held/owned. We should use READ_ONCE() here, and update writers to use WRITE_ONCE(). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net: annotate lockless accesses to sk->sk_err_soft	Eric Dumazet	2023-03-17	7	-9/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	This field can be read/written without lock synchronization. tcp and dccp have been handled in different patches. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	dccp: annotate lockless accesses to sk->sk_err_soft	Eric Dumazet	2023-03-17	3	-11/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \|	This field can be read/written without lock synchronization. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	tcp: annotate lockless accesses to sk->sk_err_soft	Eric Dumazet	2023-03-17	4	-12/+13
\|/ \| \| \| \| \| \|	This field can be read/written without lock synchronization. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	vlan: partially enable SIOCSHWTSTAMP in container	Vadim Fedorenko	2023-03-17	1	-1/+1
\| \| \| \| \| \| \| \| \| \|	Setting the timestamp filter was explicitly disabled on vlan devices in containers because it might affect other processes on the host. However, it is perfectly legitimate when the real device is in the same namespace. Fixes: 873017af7784 ("vlan: disable SIOCSHWTSTAMP in container") Signed-off-by: Vadim Fedorenko <vadim.fedorenko@linux.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'pcs_get_state-fixes'	David S. Miller	2023-03-17	2	-13/+4
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Russell King (Oracle) says: ==================== Minor fixes for pcs_get_state() implementations This series contains a number of fixes for minor issues with some pcs_get_state() implementations, particularly for the phylink state->an_enabled member. As they are minor, I'm suggesting we queue them in net-next as there is follow-on work for these, and there is no urgency for them to be in -rc. Just like phylib, state->advertising's Autoneg bit is a copy of state->an_enabled, and thus it is my intention to remove state->an_enabled from phylink to simplify things. This series gets rid of state->an_enabled assignments or reporting that should never have been there. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net: pcs: lynx: don't print an_enabled in pcs_get_state()	Russell King (Oracle)	2023-03-17	1	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	an_enabled will be going away, and in any case, pcs_get_state() should not be updating this member. Remove the print. Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Reviewed-by: Steen Hegelund <Steen.Hegelund@microchip.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net: pcs: xpcs: remove double-read of link state when using AN	Russell King (Oracle)	2023-03-17	1	-11/+2
\|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Phylink does not want the current state of the link when reading the PCS link state - it wants the latched state. Don't double-read the MII status register. Phylink will re-read as necessary to capture transient link-down events as of dbae3388ea9c ("net: phylink: Force retrigger in case of latched link-fail indicator"). The above referenced commit is a dependency for this change, and thus this change should not be backported to any kernel that does not contain the above referenced commit. Fixes: fcb26bd2b6ca ("net: phy: Add Synopsys DesignWare XPCS MDIO module") Signed-off-by: Russell King (Oracle) <rmk+kernel@armlinux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
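The difference between the latched and the current link state can be illustrated with a toy model of a latching-low link status bit, as found in the standard MII status register. Everything here is a hypothetical simulation (`struct demo_pcs` and its helpers are invented for the example) and not the xpcs driver's code.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

struct demo_pcs {
	bool link_now;		/* current physical link state */
	bool latched_down;	/* link dropped since the last register read */
};

/* Hardware side: a link drop is remembered until the next read. */
static void demo_pcs_set_link(struct demo_pcs *pcs, bool up)
{
	assert(pcs != NULL);
	if (!up)
		pcs->latched_down = true;
	pcs->link_now = up;
}

/*
 * Register read: reports "down" if the link dropped at any point since
 * the previous read, then clears the latch.
 */
static bool demo_pcs_read_link(struct demo_pcs *pcs)
{
	bool link;

	assert(pcs != NULL);
	link = pcs->link_now && !pcs->latched_down;
	pcs->latched_down = false;
	return link;
}
```

If the link flaps (up, down, up) between polls, the first read returns "down" and only the second returns "up". A get-state routine that always reads twice would report "up" and hide the flap, which is the behaviour the commit removes.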
*	Merge branch 'vxlan-MDB-support'	David S. Miller	2023-03-17	15	-241/+4207
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Ido Schimmel says: ==================== vxlan: Add MDB support tl;dr ===== This patchset implements MDB support in the VXLAN driver, allowing it to selectively forward IP multicast traffic to VTEPs with interested receivers instead of flooding it to all the VTEPs as BUM. The motivating use case is intra and inter subnet multicast forwarding using EVPN [1][2], which means that MDB entries are only installed by the user space control plane and no snooping is implemented, thereby avoiding a lot of unnecessary complexity in the kernel. 
Background ========== Both the bridge and VXLAN drivers have an FDB that allows them to forward Ethernet frames based on their destination MAC addresses and VLAN/VNI. These FDBs are managed using the same PF_BRIDGE/RTM_NEIGH netlink messages and bridge(8) utility. However, only the bridge driver has an MDB that allows it to selectively forward IP multicast packets to bridge ports with interested receivers behind them, based on (S, G) and (*, G) MDB entries. When these packets reach the VXLAN driver they are flooded using the "all-zeros" FDB entry (00:00:00:00:00:00). The entry either includes the list of all the VTEPs in the tenant domain (when ingress replication is used) or the multicast address of the BUM tunnel (when P2MP tunnels are used), to which all the VTEPs join. Networks that make heavy use of multicast in the overlay can benefit from a solution that allows them to selectively forward IP multicast traffic only to VTEPs with interested receivers. Such a solution is described in the next section. Motivation ========== RFC 7432 [3] defines a "MAC/IP Advertisement route" (type 2) [4] that allows VTEPs in the EVPN network to advertise and learn reachability information for unicast MAC addresses. Traffic destined to a unicast MAC address can therefore be selectively forwarded to a single VTEP behind which the MAC is located. The same is not true for IP multicast traffic. Such traffic is simply flooded as BUM to all VTEPs in the broadcast domain (BD) / subnet, regardless of whether a VTEP has interested receivers for the multicast stream or not. This is especially problematic for overlay networks that make heavy use of multicast. The issue is addressed by RFC 9251 [1] that defines a "Selective Multicast Ethernet Tag Route" (type 6) [5] which allows VTEPs in the EVPN network to advertise multicast streams that they are interested in. 
This is done by having each VTEP suppress IGMP/MLD packets from being transmitted to the NVE network and instead communicate the information over BGP to other VTEPs. The draft in [2] further extends RFC 9251 with procedures to allow efficient forwarding of IP multicast traffic not only in a given subnet, but also between different subnets in a tenant domain. The required changes in the bridge driver to support the above were already merged in merge commit 8150f0cfb24f ("Merge branch 'bridge-mcast-extensions-for-evpn'"). However, full support entails MDB support in the VXLAN driver so that it will be able to selectively forward IP multicast traffic only to VTEPs with interested receivers. The implementation of this MDB is described in the next section. Implementation ============== The user interface is extended to allow user space to specify the destination VTEP(s) and related parameters. Example usage: # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 198.51.100.1 # bridge mdb add dev vxlan0 port vxlan0 grp 239.1.1.1 permanent dst 192.0.2.1 $ bridge -d -s mdb show dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 192.0.2.1 0.00 dev vxlan0 port vxlan0 grp 239.1.1.1 permanent filter_mode exclude proto static dst 198.51.100.1 0.00 Since the MDB is fully managed by user space and since snooping is not implemented, only permanent entries can be installed and temporary entries are rejected by the kernel. The netlink interface is extended with a few new attributes in the RTM_NEWMDB / RTM_DELMDB request messages: [ struct nlmsghdr ] [ struct br_port_msg ] [ MDBA_SET_ENTRY ] struct br_mdb_entry [ MDBA_SET_ENTRY_ATTRS ] [ MDBE_ATTR_SOURCE ] struct in_addr / struct in6_addr [ MDBE_ATTR_SRC_LIST ] [ MDBE_SRC_LIST_ENTRY ] [ MDBE_SRCATTR_ADDRESS ] struct in_addr / struct in6_addr [ ...] 
[ MDBE_ATTR_GROUP_MODE ] u8 [ MDBE_ATTR_RTPORT ] u8 [ MDBE_ATTR_DST ] // new struct in_addr / struct in6_addr [ MDBE_ATTR_DST_PORT ] // new u16 [ MDBE_ATTR_VNI ] // new u32 [ MDBE_ATTR_IFINDEX ] // new s32 [ MDBE_ATTR_SRC_VNI ] // new u32 RTM_NEWMDB / RTM_DELMDB responses and notifications are extended with corresponding attributes. One MDB entry that can be installed in the VXLAN MDB, but not in the bridge MDB is the catchall entry (0.0.0.0 / ::). It is used to transmit unregistered multicast traffic that is not link-local and is especially useful when inter-subnet multicast forwarding is required. See patch #12 for a detailed explanation and motivation. It is similar to the "all-zeros" FDB entry that can be installed in the VXLAN FDB, but not the bridge FDB. "added_by_star_ex" entries -------------------------- The bridge driver automatically installs (S, G) MDB port group entries marked as "added_by_star_ex" whenever it detects that an (S, G) entry can prevent traffic from being forwarded via a port associated with an EXCLUDE (*, G) entry. The bridge will add the port to the port group of the (S, G) entry, thereby creating a new port group entry. The complexity associated with these entries is not trivial, but it needs to reside in the bridge driver because it automatically installs MDB entries in response to snooped IGMP / MLD packets. The same is not true for the VXLAN MDB which is entirely managed by user space who is fully capable of forming the correct replication lists on its own. In addition, the complexity associated with the "added_by_star_ex" entries in the VXLAN driver is higher compared to the bridge: Whenever a remote VTEP is added to the catchall entry, it needs to be added to all the existing MDB entries, as such a remote requested all the multicast traffic to be forwarded to it. Similarly, whenever an (*, G) or (S, G) entry is added, all the remotes associated with the catchall entry need to be added to it. 
Given the above, this patchset does not implement support for such entries. One argument against this decision can be that in the future someone might want to populate the VXLAN MDB in response to decapsulated IGMP / MLD packets and not according to EVPN routes. Regardless of my doubts regarding this possibility, it can be implemented using a new VXLAN device knob that will also enable the "added_by_star_ex" functionality. Testing ======= Tested using existing VXLAN and MDB selftests under "net/" and "net/forwarding/". Added a dedicated selftest in the last patch. Patchset overview ================= Patches #1-#3 are small preparations in the bridge driver. I plan to submit them separately together with an MDB dump test case. Patches #4-#6 are additional preparations centered around the extraction of the MDB netlink handlers from the bridge driver to the common rtnetlink code. This allows reusing the existing MDB netlink messages for the configuration of the VXLAN MDB. Patches #7-#9 include more small preparations in the common rtnetlink code and the VXLAN driver. Patch #10 implements the MDB control path in the VXLAN driver, which will allow user space to create, delete, replace and dump MDB entries. Patches #11-#12 implement the MDB data path in the VXLAN driver, allowing it to selectively forward IP multicast traffic according to the matched MDB entry. Patch #13 finally enables MDB support in the VXLAN driver. iproute2 patches can be found here [6]. Note that in order to fully support the specifications in [1] and [2], additional functionality is required from the data path. However, it can be achieved using existing kernel interfaces which is why it is not described here. Changelog ========= Since v1 [7]: Patch #9: Use htons() in 'case' instead of ntohs() in 'switch'. Since RFC [8]: Patch #3: Use NL_ASSERT_DUMP_CTX_FITS(). Patch #3: memset the entire context when moving to the next device. Patch #3: Reset sequence counters when moving to the next device. 
Patch #3: Use NL_SET_ERR_MSG_ATTR() in rtnl_validate_mdb_entry(). Patch #7: Remove restrictions regarding mixing of multicast and unicast remote destination IPs in an MDB entry. While such configuration does not make sense to me, it is not forbidden by the VXLAN FDB code and does not crash the kernel. Patch #7: Fix check regarding all-zeros MDB entry and source. Patch #11: New patch. [1] https://datatracker.ietf.org/doc/html/rfc9251 [2] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast [3] https://datatracker.ietf.org/doc/html/rfc7432 [4] https://datatracker.ietf.org/doc/html/rfc7432#section-7.2 [5] https://datatracker.ietf.org/doc/html/rfc9251#section-9.1 [6] https://github.com/idosch/iproute2/commits/submit/mdb_vxlan_rfc_v1 [7] https://lore.kernel.org/netdev/20230313145349.3557231-1-idosch@nvidia.com/ [8] https://lore.kernel.org/netdev/20230204170801.3897900-1-idosch@nvidia.com/ ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	selftests: net: Add VXLAN MDB test	Ido Schimmel	2023-03-17	3	-0/+2320
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add test cases for VXLAN MDB, testing the control and data paths. Two different sets of namespaces (i.e., ns{1,2}_v4 and ns{1,2}_v6) are used in order to test VXLAN MDB with both IPv4 and IPv6 underlays, respectively. Example truncated output: # ./test_vxlan_mdb.sh [...] Tests passed: 620 Tests failed: 0 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	vxlan: Enable MDB support	Ido Schimmel	2023-03-17	1	-0/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Now that the VXLAN MDB control and data paths are in place we can expose the VXLAN MDB functionality to user space. Set the VXLAN MDB net device operations to the appropriate functions, thereby allowing the rtnetlink code to reach the VXLAN driver. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	vxlan: Add MDB data path support	Ido Schimmel	2023-03-17	3	-0/+135
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Integrate MDB support into the Tx path of the VXLAN driver, allowing it to selectively forward IP multicast traffic according to the matched MDB entry. If MDB entries are configured (i.e., 'VXLAN_F_MDB' is set) and the packet is an IP multicast packet, perform up to three different lookups according to the following priority: 1. For an (S, G) entry, using {Source VNI, Source IP, Destination IP}. 2. For a (*, G) entry, using {Source VNI, Destination IP}. 3. For the catchall MDB entry (0.0.0.0 or ::), using the source VNI. The catchall MDB entry is similar to the catchall FDB entry (00:00:00:00:00:00) that is currently used to transmit BUM (broadcast, unknown unicast and multicast) traffic. However, unlike the catchall FDB entry, this entry is only used to transmit unregistered IP multicast traffic that is not link-local. Therefore, when configured, the catchall FDB entry will only transmit BULL (broadcast, unknown unicast, link-local multicast) traffic. The catchall MDB entry is useful in deployments where inter-subnet multicast forwarding is used and not all the VTEPs in a tenant domain are members in all the broadcast domains. In such deployments it is advantageous to transmit BULL (broadcast, unknown unicast and link-local multicast) and unregistered IP multicast traffic on different tunnels. If the same tunnel was used, a VTEP only interested in IP multicast traffic would also pull all the BULL traffic and drop it as it is not a member in the originating broadcast domain [1]. If the packet did not match an MDB entry (or if the packet is not an IP multicast packet), return it to the Tx path, allowing it to be forwarded according to the FDB. 
If the packet did match an MDB entry, forward it to the associated remote VTEPs. However, if the entry is a (*, G) entry and the associated remote is in INCLUDE mode, then skip over it as the source IP is not in its source list (otherwise the packet would have matched on an (S, G) entry). Similarly, if the associated remote is marked as BLOCKED (can only be set on (S, G) entries), then skip over it as well as the remote is in EXCLUDE mode and the source IP is in its source list. [1] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast#section-2.6 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
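The lookup priority and the remote-skipping rules described above can be sketched in user space as follows. This is only an illustrative model, not the driver code: the `Remote` class, the dictionary keyed by `(vni, source, group)`, the `lookup()` and `targets()` helpers, and the link-local multicast ranges are all assumptions made for the example. The sketch applies the link-local exclusion only to the catchall step, which is the only place the text states it.

```python
import ipaddress
from dataclasses import dataclass

# Assumption for illustration: ranges treated as link-local multicast.
LINK_LOCAL_MCAST = (
    ipaddress.ip_network("224.0.0.0/24"),
    ipaddress.ip_network("ff02::/16"),
)


@dataclass(frozen=True)
class Remote:
    """Hypothetical stand-in for a remote VTEP attached to an MDB entry."""
    vtep: str
    include_mode: bool = False  # remote is in INCLUDE filter mode
    blocked: bool = False       # only meaningful on (S, G) entries


def lookup(mdb, vni, src, dst):
    """Return ((vni, src_or_None, group), remotes) or None if nothing matches."""
    dst_ip = ipaddress.ip_address(dst)
    if not dst_ip.is_multicast:
        return None
    # 1. (S, G), then 2. (*, G); a source of None models the wildcard.
    for key in ((vni, src, dst), (vni, None, dst)):
        if key in mdb:
            return key, mdb[key]
    # 3. Catchall entry, used only for traffic that is not link-local.
    if any(dst_ip in net for net in LINK_LOCAL_MCAST):
        return None
    catchall = (vni, None, "0.0.0.0" if dst_ip.version == 4 else "::")
    if catchall in mdb:
        return catchall, mdb[catchall]
    return None


def targets(mdb, vni, src, dst):
    """Return the VTEPs to replicate to, or None to fall back to the FDB."""
    match = lookup(mdb, vni, src, dst)
    if match is None:
        return None
    (_, entry_src, _), remotes = match
    out = []
    for remote in remotes:
        if entry_src is None and remote.include_mode:
            continue  # (*, G) match: source is not in the remote's source list
        if remote.blocked:
            continue  # EXCLUDE mode and the source is in the source list
        out.append(remote.vtep)
    return out


EXAMPLE_MDB = {
    (10, "192.0.2.1", "239.1.1.1"): [
        Remote("198.51.100.1"),
        Remote("198.51.100.2", blocked=True),
    ],
    (10, None, "239.1.1.1"): [
        Remote("198.51.100.1", include_mode=True),
        Remote("198.51.100.3"),
    ],
    (10, None, "0.0.0.0"): [Remote("198.51.100.9")],
}
```

With `EXAMPLE_MDB`, a packet from 192.0.2.1 to 239.1.1.1 hits the (S, G) entry and skips the BLOCKED remote, while a packet from any other source hits the (*, G) entry and skips the remote in INCLUDE mode. An unregistered group such as 239.2.2.2 falls through to the catchall entry.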
\| *	vxlan: mdb: Add an internal flag to indicate MDB usage	Ido Schimmel	2023-03-17	2	-0/+8
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add an internal flag to indicate whether MDB entries are configured or not. Set the flag after installing the first MDB entry and clear it before deleting the last one. The flag will be consulted by the data path which will only perform an MDB lookup if the flag is set, thereby keeping the MDB overhead to a minimum when the MDB is not used. Another option would have been to use a static key, but it is global and not per-device, unlike the current approach. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
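The ordering described above (set the flag after installing the first entry, clear it before deleting the last one) can be modelled with a small sketch. The `VxlanDevice` class and its method names are hypothetical and exist only for this example; the boolean stands in for the per-device 'VXLAN_F_MDB' flag.

```python
class VxlanDevice:
    """Hypothetical per-device model; not the kernel data structure."""

    def __init__(self):
        self.mdb = {}
        self.mdb_flag = False  # stands in for the per-device VXLAN_F_MDB flag

    def mdb_add(self, key, remotes):
        self.mdb[key] = remotes
        # Set the flag only after the entry is installed.
        self.mdb_flag = True

    def mdb_del(self, key):
        if key not in self.mdb:
            raise KeyError(key)
        if len(self.mdb) == 1:
            # Clear the flag before the last entry goes away.
            self.mdb_flag = False
        del self.mdb[key]

    def needs_mdb_lookup(self):
        # The data path consults only the flag, so devices without MDB
        # entries pay no lookup cost.
        return self.mdb_flag
```

Because the flag lives on the device object, configuring MDB entries on one device does not change the transmit path of another, which is the property the message contrasts with a global static key.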
\| *	vxlan: mdb: Add MDB control path support	Ido Schimmel	2023-03-17	6	-1/+1396
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implement MDB control path support, enabling the creation, deletion, replacement and dumping of MDB entries in a similar fashion to the bridge driver. Unlike the bridge driver, each entry stores a list of remote VTEPs to which matched packets need to be replicated and not a list of bridge ports. The motivating use case is the installation of MDB entries by a user space control plane in response to received EVPN routes. As such, only allow permanent MDB entries to be installed and do not implement snooping functionality, avoiding a lot of unnecessary complexity. Since entries can only be modified by user space under RTNL, use RTNL as the write lock. Use RCU to ensure that MDB entries and remotes are not freed while being accessed from the data path during transmission. In terms of uAPI, reuse the existing MDB netlink interface, but add a few new attributes to request and response messages: * IP address of the destination VXLAN tunnel endpoint where the multicast receivers reside. * UDP destination port number to use to connect to the remote VXLAN tunnel endpoint. * VXLAN VNI Network Identifier to use to connect to the remote VXLAN tunnel endpoint. Required when Ingress Replication (IR) is used and the remote VTEP is not a member of the originating broadcast domain (VLAN/VNI) [1]. * Source VNI Network Identifier the MDB entry belongs to. Used only when the VXLAN device is in external mode. * Interface index of the outgoing interface to reach the remote VXLAN tunnel endpoint. This is required when the underlay destination IP is multicast (P2MP), as the multicast routing tables are not consulted. 
All the new attributes are added under the 'MDBA_SET_ENTRY_ATTRS' nest which is strictly validated by the bridge driver, thereby automatically rejecting the new attributes. [1] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast#section-3.2.2 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	vxlan: Expose vxlan_xmit_one()	Ido Schimmel	2023-03-17	2	-3/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Given a packet and a remote destination, the function will take care of encapsulating the packet and transmitting it to the destination. Expose it so that it could be used in subsequent patches by the MDB code to transmit a packet to the remote destination(s) stored in the MDB entry. It will allow us to keep the MDB code self-contained, not exposing its data structures to the rest of the VXLAN driver. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	vxlan: Move address helpers to private headers	Ido Schimmel	2023-03-17	2	-47/+45
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the helpers out of the core C file to the private header so that they could be used by the upcoming MDB code. While at it, constify the second argument of vxlan_nla_get_addr(). Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	rtnetlink: bridge: mcast: Relax group address validation in common code	Ido Schimmel	2023-03-17	2	-2/+9
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	In the upcoming VXLAN MDB implementation, the 0.0.0.0 and :: MDB entries will act as catchall entries for unregistered IP multicast traffic in a similar fashion to the 00:00:00:00:00:00 VXLAN FDB entry that is used to transmit BUM traffic. In deployments where inter-subnet multicast forwarding is used, not all the VTEPs in a tenant domain are members in all the broadcast domains. It is therefore advantageous to transmit BULL (broadcast, unknown unicast and link-local multicast) and unregistered IP multicast traffic on different tunnels. If the same tunnel was used, a VTEP only interested in IP multicast traffic would also pull all the BULL traffic and drop it as it is not a member in the originating broadcast domain [1]. Prepare for this change by allowing the 0.0.0.0 group address in the common rtnetlink MDB code and forbid it in the bridge driver. A similar change is not needed for IPv6 because the common code only validates that the group address is not the all-nodes address. [1] https://datatracker.ietf.org/doc/html/draft-ietf-bess-evpn-irb-mcast#section-2.6 Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
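The split between the relaxed common check and the stricter bridge check can be illustrated with a short sketch. It models only what the message states: the function names are hypothetical, the assumption that the common IPv4 check otherwise requires a multicast group is mine, and the real validation may perform further checks not shown here.

```python
import ipaddress

IPV4_ANY = ipaddress.ip_address("0.0.0.0")
IPV6_ALL_NODES = ipaddress.ip_address("ff02::1")


def common_validate_group(addr: str) -> bool:
    """Hypothetical model of the relaxed check in the common code."""
    ip = ipaddress.ip_address(addr)
    if ip.version == 4:
        # Assumption: a multicast group is required, with 0.0.0.0 now
        # allowed as the catchall entry.
        return ip.is_multicast or ip == IPV4_ANY
    # IPv6: per the text, only the all-nodes address is rejected.
    return ip != IPV6_ALL_NODES


def bridge_validate_group(addr: str) -> bool:
    """Hypothetical model of the bridge driver, which forbids 0.0.0.0."""
    if ipaddress.ip_address(addr) == IPV4_ANY:
        return False
    return common_validate_group(addr)
```

Under this model `0.0.0.0` passes the common check and is rejected only by the bridge, while `::` already passes the common check, which is why no IPv6 change is needed there.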
\| *	rtnetlink: bridge: mcast: Move MDB handlers out of bridge driver	Ido Schimmel	2023-03-17	5	-318/+244
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Currently, the bridge driver registers handlers for MDB netlink messages, making it impossible for other drivers to implement MDB support. As a preparation for VXLAN MDB support, move the MDB handlers out of the bridge driver to the core rtnetlink code. The rtnetlink code will call into individual drivers by invoking their previously added MDB net device operations. Note that while the diffstat is large, the change is mechanical. It moves code out of the bridge driver to rtnetlink code. Also note that a similar change was made in 2012 with commit 77162022ab26 ("net: add generic PF_BRIDGE:RTM_ FDB hooks") that moved FDB handlers out of the bridge driver to the core rtnetlink code. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	bridge: mcast: Implement MDB net device operations	Ido Schimmel	2023-03-17	3	-0/+152
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Implement the previously added MDB net device operations in the bridge driver so that they could be invoked by core rtnetlink code in the next patch. The operations are identical to the existing br_mdb_{dump,add,del} functions. The '_new' suffix will be removed in the next patch. The functions are re-implemented in this patch to make the conversion in the next patch easier to review. Add dummy implementations when 'CONFIG_BRIDGE_IGMP_SNOOPING' is disabled, so that an error will be returned to user space when it is trying to add or delete an MDB entry. This is consistent with existing behavior where the bridge driver does not even register rtnetlink handlers for RTM_{NEW,DEL,GET}MDB messages when this Kconfig option is disabled. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	net: Add MDB net device operations	Ido Schimmel	2023-03-17	1	-0/+21
\|/ \| \| \| \| \| \| \| \| \|	Add MDB net device operations that will be invoked by rtnetlink code in response to received RTM_{NEW,DEL,GET}MDB messages. Subsequent patches will implement these operations in the bridge and VXLAN drivers. Signed-off-by: Ido Schimmel <idosch@nvidia.com> Reviewed-by: Nikolay Aleksandrov <razor@blackwall.org> Signed-off-by: David S. Miller <davem@davemloft.net>
*	Merge branch 'J784S4-CPSW9G-bindings'	David S. Miller	2023-03-17	1	-3/+7
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Siddharth Vadapalli says: ==================== Add J784S4 CPSW9G NET Bindings This series cleans up the bindings by reordering the compatibles, followed by adding the bindings for CPSW9G instance of CPSW Ethernet Switch on TI's J784S4 SoC. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	dt-bindings: net: ti: k3-am654-cpsw-nuss: Add J784S4 CPSW9G support	Siddharth Vadapalli	2023-03-17	1	-1/+5
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Update bindings for TI K3 J784S4 SoC which contains 9 ports (8 external ports) CPSW9G module and add compatible for it. Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
\| *	dt-bindings: net: ti: k3-am654-cpsw-nuss: Fix compatible order	Siddharth Vadapalli	2023-03-17	1	-2/+2
\|/ \| \| \| \| \| \|	Reorder compatibles to follow alphanumeric order. Signed-off-by: Siddharth Vadapalli <s-vadapalli@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: mana: Add new MANA VF performance counters for easier troubleshooting	Shradha Gupta	2023-03-17	3	-4/+128
\| \| \| \| \| \| \| \| \|	Extended performance counter stats in 'ethtool -S <interface>' output for MANA VF to facilitate troubleshooting. Tested-on: Ubuntu22 Signed-off-by: Shradha Gupta <shradhagupta@linux.microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: wangxun: Implement the ndo change mtu interface	Mengyuan Lou	2023-03-17	7	-5/+31
\| \| \| \| \| \| \|	Add ngbe and txgbe ndo_change_mtu support. Signed-off-by: Mengyuan Lou <mengyuanlou@net-swift.com> Signed-off-by: David S. Miller <davem@davemloft.net>
*	net: dsa: realtek: rtl8365mb: add change_mtu	Luiz Angelo Daros de Luca	2023-03-17	1	-4/+36
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The rtl8365mb was using a fixed MTU size of 1536, which was probably inspired by the rtl8366rb's initial frame size. However, unlike that family, the rtl8365mb family can specify the max frame size in bytes, rather than in fixed steps. DSA calls change_mtu for the CPU port once the max MTU value among the ports changes. As the max frame size is defined globally, the switch is configured only when the call affects the CPU port. The available specifications do not directly define the max supported frame size, but they mention a 16k limit. This driver will use the 0x3FFF limit as it is used in the vendor API code. However, the switch sets the max frame size to 16368 bytes (0x3FF0) after it resets. change_mtu uses MTU size, or ethernet payload size, while the switch works with frame size. The frame size is calculated considering the ethernet header (14 bytes), a possible 802.1Q tag (4 bytes), the payload size (MTU), and the Ethernet FCS (4 bytes). The CPU tag (8 bytes) is consumed before the switch enforces the limit. During setup, the driver will use the default 1500-byte MTU of DSA to set the maximum frame size. The current sum will be VLAN_ETH_HLEN+1500+ETH_FCS_LEN, which results in 1522 bytes. Although it is lower than the previous initial value of 1536 bytes, the driver will increase the frame size for a larger MTU. However, if something requires more space without increasing the MTU, such as QinQ, we would need to add the extra length to the rtl8365mb_port_change_mtu() formula. MTU was tested up to 2018 (with 802.1Q) as that is as far as mt7620 (where rtl8367s is stacked) can go. The register was manually manipulated byte-by-byte to ensure the MTU to frame size conversion was correct. For frames without 802.1Q tag, the frame size limit will be 4 bytes over the required size. 
There is a jumbo register, enabled by default at 6k frame size. However, the jumbo settings do not seem to limit or expand the maximum tested MTU (2018), even when jumbo is disabled. More tests are needed with a device that can handle larger frames. Signed-off-by: Luiz Angelo Daros de Luca <luizluca@gmail.com> Reviewed-by: Alexander Duyck <alexanderduyck@fb.com> Reviewed-by: Alvin Šipraga <alsi@bang-olufsen.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
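The MTU to frame size conversion described above is simple arithmetic and can be checked with a short sketch. The constant values follow the byte counts given in the message (14-byte Ethernet header, 4-byte 802.1Q tag, 4-byte FCS, 0x3FFF limit). The function name `mtu_to_frame_size` is hypothetical, and rejecting oversized values with an exception is an assumption of this sketch, not a statement about the driver.

```python
ETH_HLEN = 14                          # Ethernet header
VLAN_HLEN = 4                          # possible 802.1Q tag
VLAN_ETH_HLEN = ETH_HLEN + VLAN_HLEN   # 18 bytes
ETH_FCS_LEN = 4                        # Ethernet FCS
MAX_FRAME_SIZE = 0x3FFF                # limit taken from the vendor API code


def mtu_to_frame_size(mtu: int) -> int:
    """Return the frame size for a given MTU, per the formula in the text."""
    frame_size = VLAN_ETH_HLEN + mtu + ETH_FCS_LEN
    if frame_size > MAX_FRAME_SIZE:
        # Assumption for this sketch: oversized values are rejected.
        raise ValueError(f"MTU {mtu} exceeds the {MAX_FRAME_SIZE}-byte limit")
    return frame_size


# Largest MTU that still fits under the limit.
MAX_MTU = MAX_FRAME_SIZE - VLAN_ETH_HLEN - ETH_FCS_LEN
```

For the default MTU of 1500 this gives 1522 bytes, matching the figure in the message, and the largest tested MTU of 2018 corresponds to a 2040-byte frame. Because the 802.1Q tag is always counted, an untagged frame of the same MTU is 4 bytes below the configured limit.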
*	Merge branch 'add-ptp-support-for-sama7g5'	Jakub Kicinski	2023-03-17	1	-2/+3
\|\ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Durai Manickam says: ==================== Add PTP support for sama7g5 This patch series is intended to add PTP capability to the GEM and EMAC for sama7g5. ==================== Link: https://lore.kernel.org/r/20230315095053.53969-1-durai.manickamkr@microchip.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| *	net: macb: Add PTP support to EMAC for sama7g5	Durai Manickam KR	2023-03-17	1	-1/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add PTP capability to the Ethernet MAC. Signed-off-by: Durai Manickam KR <durai.manickamkr@microchip.com> Reviewed-by: Claudiu Beznea <claudiu.beznea@microchip.com> Reviewed-by: Michal Swiatkowski <michal.swiatkowski@linux.intel.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>