summaryrefslogtreecommitdiffstats
path: root/net (follow)
Commit message (Collapse)AuthorAgeFilesLines
* rxrpc: Fix congestion managementDavid Howells2022-11-0810-20/+59
| | | | | | | | | | | | | | | | | | | | | | | | | rxrpc has a problem in its congestion management in that it saves the congestion window size (cwnd) from one call to another, but if this is 0 at the time is saved, then the next call may not actually manage to ever transmit anything. To this end: (1) Don't save cwnd between calls, but rather reset back down to the initial cwnd and re-enter slow-start if data transmission is idle for more than an RTT. (2) Preserve ssthresh instead, as that is a handy estimate of pipe capacity. Knowing roughly when to stop slow start and enter congestion avoidance can reduce the tendency to overshoot and drop larger amounts of packets when probing. In future, cwind growth also needs to be constrained when the window isn't being filled due to being application limited. Reported-by: Simon Wilkinson <sxw@auristor.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Remove the rxtx ringDavid Howells2022-11-082-39/+0
| | | | | | | | The Rx/Tx ring is no longer used, so remove it. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Save last ACK's SACK table rather than marking txbufsDavid Howells2022-11-086-116/+181
| | | | | | | | | | | | | | | | | | | | Improve the tracking of which packets need to be transmitted by saving the last ACK packet that we receive that has a populated soft-ACK table rather than marking packets. Then we can step through the soft-ACK table and look at the packets we've transmitted beyond that to determine which packets we might want to retransmit. We also look at the highest serial number that has been acked to try and guess which packets we've transmitted the peer is likely to have seen. If necessary, we send a ping to retrieve that number. One downside that might be a problem is that we can't then compare the previous acked/unacked state so easily in rxrpc_input_soft_acks() - which is a potential problem for the slow-start algorithm. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Remove call->lockDavid Howells2022-11-085-32/+3
| | | | | | | | call->lock is no longer necessary, so remove it. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Don't use a ring buffer for call Tx queueDavid Howells2022-11-0811-423/+309
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change the way the Tx queueing works to make the following ends easier to achieve: (1) The filling of packets, the encryption of packets and the transmission of packets can be handled in parallel by separate threads, rather than rxrpc_sendmsg() allocating, filling, encrypting and transmitting each packet before moving onto the next one. (2) Get rid of the fixed-size ring which sets a hard limit on the number of packets that can be retained in the ring. This allows the number of packets to increase without having to allocate a very large ring or having variable-sized rings. [Note: the downside of this is that it's then less efficient to locate a packet for retransmission as we then have to step through a list and examine each buffer in the list.] (3) Allow the filler/encrypter to run ahead of the transmission window. (4) Make it easier to do zero copy UDP from the packet buffers. (5) Make it easier to do zero copy from userspace to the packet buffers - and thence to UDP (only if for unauthenticated connections). To that end, the following changes are made: (1) Use the new rxrpc_txbuf struct instead of sk_buff for keeping packets to be transmitted in. This allows them to be placed on multiple queues simultaneously. An sk_buff isn't really necessary as it's never passed on to lower-level networking code. (2) Keep the transmissable packets in a linked list on the call struct rather than in a ring. As a consequence, the annotation buffer isn't used either; rather a flag is set on the packet to indicate ackedness. (3) Use the RXRPC_CALL_TX_LAST flag to indicate that the last packet to be transmitted has been queued. Add RXRPC_CALL_TX_ALL_ACKED to indicate that all packets up to and including the last got hard acked. (4) Wire headers are now stored in the txbuf rather than being concocted on the stack and they're stored immediately before the data, thereby allowing zerocopy of a single span. (5) Don't bother with instant-resend on transmission failure; rather, leave it for a timer or an ACK packet to trigger. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Get rid of the Rx ringDavid Howells2022-11-0810-199/+279
| | | | | | | | | | | | | | | | | | | | | Get rid of the Rx ring and replace it with a pair of queues instead. One queue gets the packets that are in-sequence and are ready for processing by recvmsg(); the other queue gets the out-of-sequence packets for addition to the first queue as the holes get filled. The annotation ring is removed and replaced with a SACK table. The SACK table has the bits set that correspond exactly to the sequence number of the packet being acked. The SACK ring is copied when an ACK packet is being assembled and rotated so that the first ACK is in byte 0. Flow control handling is altered so that packets that are moved to the in-sequence queue are hard-ACK'd even before they're consumed - and then the Rx window size in the ACK packet (rsize) is shrunk down to compensate (even going to 0 if the window is full). Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Clone received jumbo subpackets and queue separatelyDavid Howells2022-11-087-410/+240
| | | | | | | | | | | | | | | | | | | | Split up received jumbo packets into separate skbuffs by cloning the original skbuff for each subpacket and setting the offset and length of the data in that subpacket in the skbuff's private data. The subpackets are then placed on the recvmsg queue separately. The security class then gets to revise the offset and length to remove its metadata. If we fail to clone a packet, we just drop it and let the peer resend it. The original packet gets used for the final subpacket. This should make it easier to handle parallel decryption of the subpackets. It also simplifies the handling of lost or misordered packets in the queuing/buffering loop as the possibility of overlapping jumbo packets no longer needs to be considered. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Split the rxrpc_recvmsg tracepointDavid Howells2022-11-081-19/+18
| | | | | | | | | | Split the rxrpc_recvmsg tracepoint so that the tracepoints that are about data packet processing (and which have extra pieces of information) are separate from the tracepoint that shows the general flow of recvmsg(). Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Clean up ACK handlingDavid Howells2022-11-0810-122/+50
| | | | | | | | | | | | | | | | | | | | | | | Clean up the rxrpc_propose_ACK() function. If deferred PING ACK proposal is split out, it's only really needed for deferred DELAY ACKs. All other ACKs, bar terminal IDLE ACK are sent immediately. The deferred IDLE ACK submission can be handled by conversion of a DELAY ACK into an IDLE ACK if there's nothing to be SACK'd. Also, because there's a delay between an ACK being generated and being transmitted, it's possible that other ACKs of the same type will be generated during that interval. Apart from the ACK time and the serial number responded to, most of the ACK body, including window and SACK parameters, are not filled out till the point of transmission - so we can avoid generating a new ACK if there's one pending that will cover the SACK data we need to convey. Therefore, don't propose a new DELAY or IDLE ACK for a call if there's one already pending. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Allocate ACK records at proposal and queue for transmissionDavid Howells2022-11-089-256/+250
| | | | | | | | | Allocate rxrpc_txbuf records for ACKs and put onto a queue for the transmitter thread to dispatch. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Define rxrpc_txbuf struct to carry data to be transmittedDavid Howells2022-11-084-1/+156
| | | | | | | | | | | Define a struct, rxrpc_txbuf, to carry data to be transmitted instead of a socket buffer so that it can be placed onto multiple queues at once. This also allows the data buffer to be in the same allocation as the internal data. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Remove call->tx_phaseDavid Howells2022-11-084-7/+1
| | | | | | | | Remove call->tx_phase as it's only ever set. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Remove the flags from the rxrpc_skb tracepointDavid Howells2022-11-083-15/+7
| | | | | | | | | | | | | Remove the flags from the rxrpc_skb tracepoint as we're no longer going to be using this for the transmission buffers and so marking which are transmission buffers isn't going to be necessary. Note that this also remove the rxrpc skb flag that indicates if this is a transmission buffer and so the count is not updated for the moment. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Remove unnecessary header inclusionsDavid Howells2022-11-081-14/+0
| | | | | | | | Remove a bunch of unnecessary header inclusions. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Call udp_sendmsg() directlyDavid Howells2022-11-082-10/+28
| | | | | | | | | | Call udp_sendmsg() and udpv6_sendmsg() directly rather than calling kernel_sendmsg() as the latter assumes we want a kvec-class iterator. However, zerocopy explicitly doesn't work with such an iterator. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Use the core ICMP/ICMP6 parsersDavid Howells2022-11-083-217/+42
| | | | | | | | | | | | | | | | | | | Make rxrpc_encap_rcv_err() pass the ICMP/ICMP6 skbuff to ip_icmp_error() or ipv6_icmp_error() as appropriate to do the parsing rather than trying to do it in rxrpc. This pushes an error report onto the UDP socket's error queue and calls ->sk_error_report() from which point rxrpc can pick it up. It would be preferable to steal the packet directly from ip*_icmp_error() rather than letting it get queued, but this is probably good enough. Also note that __udp4_lib_err() calls sk_error_report() twice in some cases. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* net: Change the udp encap_err_rcv to allow use of {ip,ipv6}_icmp_error()David Howells2022-11-086-57/+24
| | | | | | | | | | | Change the udp encap_err_rcv signature to match ip_icmp_error() and ipv6_icmp_error() so that those can be used from the called function and export them. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org cc: netdev@vger.kernel.org
* rxrpc: Fix ack.bufferSize to be 0 when generating an ackDavid Howells2022-11-081-1/+1
| | | | | | | | | | ack.bufferSize should be set to 0 when generating an ack. Fixes: 8d94aa381dab ("rxrpc: Calls shouldn't hold socket refs") Reported-by: Jeffrey Altman <jaltman@auristor.com> Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Record stats for why the REQUEST-ACK flag is being setDavid Howells2022-11-083-0/+17
| | | | | | | | Record stats for why the REQUEST-ACK flag is being set. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Record statistics about ACK typesDavid Howells2022-11-085-1/+48
| | | | | | | | | | Record statistics about the different types of ACKs that have been transmitted and received and the number of ACKs that have been filled out and transmitted or that have been skipped. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Add stats procfile and DATA packet statsDavid Howells2022-11-087-0/+78
| | | | | | | | | | | | | | | Add a procfile, /proc/net/rxrpc/stats, to display some statistics about what rxrpc has been doing. Writing a blank line to the stats file will clear the increment-only counters. Allocated resource counters don't get cleared. Add some counters to count various things about DATA packets, including the number created, transmitted and retransmitted and the number received, the number of ACK-requests markings and the number of jumbo packets received. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Track highest acked serialDavid Howells2022-11-082-0/+5
| | | | | | | | | Keep track of the highest DATA serial number that has been acked by the peer for future purposes. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Split call timer-expiration from call timer-set tracepointDavid Howells2022-11-081-1/+1
| | | | | | | | | Split the tracepoint for call timer-set to separate out the call timer-expiration event Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* rxrpc: Trace setting of the request-ack flagDavid Howells2022-11-081-9/+23
| | | | | | | | | Add a tracepoint to log why the request-ack flag is set on an outgoing DATA packet, allowing debugging as to why. Signed-off-by: David Howells <dhowells@redhat.com> cc: Marc Dionne <marc.dionne@auristor.com> cc: linux-afs@lists.infradead.org
* tcp: add rcv_wnd and plb_rehash to TCP_INFOMubashir Adnan Qureshi2022-10-281-0/+2
| | | | | | | | | | | | rcv_wnd can be useful to diagnose TCP performance where receiver window becomes the bottleneck. rehash reports the PLB and timeout triggered rehash attempts by the TCP connection. Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add u32 counter in tcp_sock and an SNMP counter for PLBMubashir Adnan Qureshi2022-10-283-0/+6
| | | | | | | | | | | | | | | | | | A u32 counter is added to tcp_sock for counting the number of PLB triggered rehashes for a TCP connection. An SNMP counter is also added to count overall PLB triggered rehash events for a host. These counters are hooked up to PLB implementation for DCTCP. TCP_NLA_REHASH is added to SCM_TIMESTAMPING_OPT_STATS that reports the rehash attempts triggered due to PLB or timeouts. This gives a historical view of sustained congestion or timeouts experienced by the TCP connection. Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add support for PLB in DCTCPMubashir Adnan Qureshi2022-10-281-1/+22
| | | | | | | | | | | | PLB support is added to TCP DCTCP code. As DCTCP uses ECN as the congestion signal, PLB also uses ECN to make decisions whether to change the path or not upon sustained congestion. Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add PLB functionality for TCPMubashir Adnan Qureshi2022-10-283-2/+109
| | | | | | | | | | | | | | | | | | | | | | | Congestion control algorithms track PLB state and cause the connection to trigger a path change when either of the 2 conditions is satisfied: - No packets are in flight and (# consecutive congested rounds >= sysctl_tcp_plb_idle_rehash_rounds) - (# consecutive congested rounds >= sysctl_tcp_plb_rehash_rounds) A round (RTT) is marked as congested when congestion signal (ECN ce_ratio) over an RTT is greater than sysctl_tcp_plb_cong_thresh. In the event of RTO, PLB (via tcp_write_timeout()) triggers a path change and disables congestion-triggered path changes for random time between (sysctl_tcp_plb_suspend_rto_sec, 2*sysctl_tcp_plb_suspend_rto_sec) to avoid hopping onto the "connectivity blackhole". RTO-triggered path changes can still happen during this cool-off period. Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: add sysctls for TCP PLB parametersMubashir Adnan Qureshi2022-10-282-0/+51
| | | | | | | | | | | | | | | | | | | | | PLB (Protective Load Balancing) is a host based mechanism for load balancing across switch links. It leverages congestion signals(e.g. ECN) from transport layer to randomly change the path of the connection experiencing congestion. PLB changes the path of the connection by changing the outgoing IPv6 flow label for IPv6 connections (implemented in Linux by calling sk_rethink_txhash()). Because of this implementation mechanism, PLB can currently only work for IPv6 traffic. For more information, see the SIGCOMM 2022 paper: https://doi.org/10.1145/3544216.3544226 This commit adds new sysctl knobs and sets their default values for TCP PLB. Signed-off-by: Mubashir Adnan Qureshi <mubashirq@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-nextJakub Kicinski2022-10-288-30/+621
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next 1) Move struct nft_payload_set definition to .c file where it is only used. 2) Shrink transport and inner header offset fields in the nft_pktinfo structure to 16-bits, from Florian Westphal. 3) Get rid of nft_objref Kbuild toggle, make it built-in into nf_tables. This expression is used to instantiate conntrack helpers in nftables. After removing the conntrack helper auto-assignment toggle it this feature became more important so move it to the nf_tables core module. Also from Florian. 4) Extend the existing function to calculate payload inner header offset to deal with the GRE and IPIP transport protocols. 6) Add inner expression support for nf_tables. This new expression provides a packet parser for tunneled packets which uses a userspace description of the expected inner headers. The inner expression invokes the payload expression (via direct call) to match on the inner header protocol fields using the inner link, network and transport header offsets. An example of the bytecode generated from userspace to match on IP source encapsulated in a VxLAN packet: # nft --debug=netlink add rule netdev x y udp dport 4789 vxlan ip saddr 1.2.3.4 netdev x y [ meta load l4proto => reg 1 ] [ cmp eq reg 1 0x00000011 ] [ payload load 2b @ transport header + 2 => reg 1 ] [ cmp eq reg 1 0x0000b512 ] [ inner type vxlan hdrsize 8 flags f [ meta load protocol => reg 1 ] ] [ cmp eq reg 1 0x00000008 ] [ inner type vxlan hdrsize 8 flags f [ payload load 4b @ network header + 12 => reg 1 ] ] [ cmp eq reg 1 0x04030201 ] 7) Store inner link, network and transport header offsets in percpu area to parse inner packet header once only. Matching on a different tunnel type invalidates existing offsets in the percpu area and it invokes the inner tunnel parser again. 8) Add support for inner meta matching. This support for NFTA_META_PROTOCOL, which specifies the inner ethertype, and NFT_META_L4PROTO, which specifies the inner transport protocol. 9) Extend nft_inner to parse GENEVE optional fields to calculate the link layer offset. 10) Update inner expression so tunnel offset points to GRE header to normalize tunnel header handling. This also allows to perform different interpretations of the GRE header from userspace. * git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf-next: netfilter: nft_inner: set tunnel offset to GRE header offset netfilter: nft_inner: add geneve support netfilter: nft_meta: add inner match support netfilter: nft_inner: add percpu inner context netfilter: nft_inner: support for inner tunnel header matching netfilter: nft_payload: access ipip payload for inner offset netfilter: nft_payload: access GRE payload via inner offset netfilter: nft_objref: make it builtin netfilter: nf_tables: reduce nft_pktinfo by 8 bytes netfilter: nft_payload: move struct nft_payload_set definition where it belongs ==================== Link: https://lore.kernel.org/r/20221026132227.3287-1-pablo@netfilter.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * netfilter: nft_inner: set tunnel offset to GRE header offsetPablo Neira Ayuso2022-10-251-2/+7
| | | | | | | | | | | | | | | | | | | | Set inner tunnel offset to the GRE header, this is redundant to existing transport header offset, but this normalizes the handling of the tunnel header regardless its location in the layering. GRE version 0 is overloaded with RFCs, the type decorator in the inner expression might also be useful to interpret matching fields from the netlink delinearize path in userspace. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nft_inner: add geneve supportPablo Neira Ayuso2022-10-251-0/+17
| | | | | | | | | | | | | | Geneve tunnel header may contain options, parse geneve header and update offset to point to the link layer header according to the opt_len field. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nft_meta: add inner match supportPablo Neira Ayuso2022-10-252-0/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for inner meta matching on: - NFT_META_PROTOCOL: to match on the ethertype, this can be used regardless tunnel protocol provides no link layer header, in that case nft_inner sets on the ethertype based on the IP header version field. - NFT_META_L4PROTO: to match on the layer 4 protocol. These meta expression are usually autogenerated as dependencies by userspace nftables. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nft_inner: add percpu inner contextPablo Neira Ayuso2022-10-251-4/+22
| | | | | | | | | | | | | | | | Add NFT_PKTINFO_INNER_FULL flag to annotate that inner offsets are available. Store nft_inner_tun_ctx object in percpu area to cache existing inner offsets for this skbuff. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nft_inner: support for inner tunnel header matchingPablo Neira Ayuso2022-10-255-2/+463
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This new expression allows you to match on the inner headers that are encapsulated by any of the existing tunneling protocols. This expression parses the inner packet to set the link, network and transport offsets, so the existing expressions (with a few updates) can be reused to match on the inner headers. The inner expression supports for different tunnel combinations such as: - ethernet frame over IPv4/IPv6 packet, eg. VxLAN. - IPv4/IPv6 packet over IPv4/IPv6 packet, eg. IPIP. - IPv4/IPv6 packet over IPv4/IPv6 + transport header, eg. GRE. - transport header (ESP or SCTP) over transport header (usually UDP) The following fields are used to describe the tunnel protocol: - flags, which describe how to parse the inner headers: NFT_PAYLOAD_CTX_INNER_TUN, the tunnel provides its own header. NFT_PAYLOAD_CTX_INNER_ETHER, the ethernet frame is available as inner header. NFT_PAYLOAD_CTX_INNER_NH, the network header is available as inner header. NFT_PAYLOAD_CTX_INNER_TH, the transport header is available as inner header. For example, VxLAN sets on all of these flags. While GRE only sets on NFT_PAYLOAD_CTX_INNER_NH and NFT_PAYLOAD_CTX_INNER_TH. Then, ESP over UDP only sets on NFT_PAYLOAD_CTX_INNER_TH. The tunnel description is composed of the following attributes: - header size: in case the tunnel comes with its own header, eg. VxLAN. - type: this provides a hint to userspace on how to delinearize the rule. This is useful for VxLAN and Geneve since they run over UDP, since transport does not provide a hint. This is also useful in case hardware offload is ever supported. The type is not currently interpreted by the kernel. - expression: currently only payload supported. Follow up patch adds also inner meta support which is required by autogenerated dependencies. The exthdr expression should be supported too at some point. There is a new inner_ops operation that needs to be set on to allow to use an existing expression from the inner expression. This patch adds a new NFT_PAYLOAD_TUN_HEADER base which allows to match on the tunnel header fields, eg. vxlan vni. The payload expression is embedded into nft_inner private area and this private data area is passed to the payload inner eval function via direct call. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nft_payload: access ipip payload for inner offsetPablo Neira Ayuso2022-10-251-0/+3
| | | | | | | | | | | | | | | | ipip is an special case, transport and inner header offset are set to the same offset to use the upcoming inner expression for matching on inner tunnel headers. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nft_payload: access GRE payload via inner offsetPablo Neira Ayuso2022-10-251-0/+32
| | | | | | | | | | | | | | Parse GRE v0 packets to properly set up inner offset, this allow for matching on inner headers. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nft_objref: make it builtinFlorian Westphal2022-10-254-29/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nft_objref is needed to reference named objects, it makes no sense to disable it. Before: text data bss dec filename 4014 424 0 4438 nft_objref.o 4174 1128 0 5302 nft_objref.ko 359351 15276 864 375491 nf_tables.ko After: text data bss dec filename 3815 408 0 4223 nft_objref.o 363161 15692 864 379717 nf_tables.ko Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * netfilter: nft_payload: move struct nft_payload_set definition where it belongsPablo Neira Ayuso2022-10-251-0/+10
| | | | | | | | | | | | | | Not required to expose this header in nf_tables_core.h, move it to where it is used, ie. nft_payload. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2022-10-2817-106/+174
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | drivers/net/can/usb/kvaser_usb/kvaser_usb_leaf.c 2871edb32f46 ("can: kvaser_usb: Fix possible completions during init_completion") abb8670938b2 ("can: kvaser_usb_leaf: Ignore stale bus-off after start") 8d21f5927ae6 ("can: kvaser_usb_leaf: Fix improved state not being reported") Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | kcm: do not sense pfmemalloc status in kcm_sendpage()Eric Dumazet2022-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to changes done in TCP in blamed commit. We should not sense pfmemalloc status in sendpage() methods. Fixes: 326140063946 ("tcp: TX zerocopy should not sense pfmemalloc status") Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221027040637.1107703-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | net: do not sense pfmemalloc status in skb_append_pagefrags()Eric Dumazet2022-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | skb_append_pagefrags() is used by af_unix and udp sendpage() implementation so far. In commit 326140063946 ("tcp: TX zerocopy should not sense pfmemalloc status") we explained why we should not sense pfmemalloc status for pages owned by user space. We should also use skb_fill_page_desc_noacc() in skb_append_pagefrags() to avoid following KCSAN report: BUG: KCSAN: data-race in lru_add_fn / skb_append_pagefrags write to 0xffffea00058fc1c8 of 8 bytes by task 17319 on cpu 0: __list_add include/linux/list.h:73 [inline] list_add include/linux/list.h:88 [inline] lruvec_add_folio include/linux/mm_inline.h:323 [inline] lru_add_fn+0x327/0x410 mm/swap.c:228 folio_batch_move_lru+0x1e1/0x2a0 mm/swap.c:246 lru_add_drain_cpu+0x73/0x250 mm/swap.c:669 lru_add_drain+0x21/0x60 mm/swap.c:773 free_pages_and_swap_cache+0x16/0x70 mm/swap_state.c:311 tlb_batch_pages_flush mm/mmu_gather.c:59 [inline] tlb_flush_mmu_free mm/mmu_gather.c:256 [inline] tlb_flush_mmu+0x5b2/0x640 mm/mmu_gather.c:263 tlb_finish_mmu+0x86/0x100 mm/mmu_gather.c:363 exit_mmap+0x190/0x4d0 mm/mmap.c:3098 __mmput+0x27/0x1b0 kernel/fork.c:1185 mmput+0x3d/0x50 kernel/fork.c:1207 copy_process+0x19fc/0x2100 kernel/fork.c:2518 kernel_clone+0x166/0x550 kernel/fork.c:2671 __do_sys_clone kernel/fork.c:2812 [inline] __se_sys_clone kernel/fork.c:2796 [inline] __x64_sys_clone+0xc3/0xf0 kernel/fork.c:2796 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd read to 0xffffea00058fc1c8 of 8 bytes by task 17325 on cpu 1: page_is_pfmemalloc include/linux/mm.h:1817 [inline] __skb_fill_page_desc include/linux/skbuff.h:2432 [inline] skb_fill_page_desc include/linux/skbuff.h:2453 [inline] skb_append_pagefrags+0x210/0x600 net/core/skbuff.c:3974 unix_stream_sendpage+0x45e/0x990 net/unix/af_unix.c:2338 kernel_sendpage+0x184/0x300 net/socket.c:3561 sock_sendpage+0x5a/0x70 net/socket.c:1054 pipe_to_sendpage+0x128/0x160 fs/splice.c:361 splice_from_pipe_feed fs/splice.c:415 [inline] __splice_from_pipe+0x222/0x4d0 fs/splice.c:559 splice_from_pipe fs/splice.c:594 [inline] generic_splice_sendpage+0x89/0xc0 fs/splice.c:743 do_splice_from fs/splice.c:764 [inline] direct_splice_actor+0x80/0xa0 fs/splice.c:931 splice_direct_to_actor+0x305/0x620 fs/splice.c:886 do_splice_direct+0xfb/0x180 fs/splice.c:974 do_sendfile+0x3bf/0x910 fs/read_write.c:1255 __do_sys_sendfile64 fs/read_write.c:1323 [inline] __se_sys_sendfile64 fs/read_write.c:1309 [inline] __x64_sys_sendfile64+0x10c/0x150 fs/read_write.c:1309 do_syscall_x64 arch/x86/entry/common.c:50 [inline] do_syscall_64+0x2b/0x70 arch/x86/entry/common.c:80 entry_SYSCALL_64_after_hwframe+0x63/0xcd value changed: 0x0000000000000000 -> 0xffffea00058fc188 Reported by Kernel Concurrency Sanitizer on: CPU: 1 PID: 17325 Comm: syz-executor.0 Not tainted 6.1.0-rc1-syzkaller-00158-g440b7895c990-dirty #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/11/2022 Fixes: 326140063946 ("tcp: TX zerocopy should not sense pfmemalloc status") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221027040346.1104204-1-edumazet@google.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | Merge tag 'linux-can-fixes-for-6.1-20221027' of ↵Jakub Kicinski2022-10-271-1/+3
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can Marc Kleine-Budde says: ==================== pull-request: can 2022-10-27 Anssi Hannula fixes the use of the completions in the kvaser_usb driver. Biju Das contributes 2 patches for the rcar_canfd driver. A IRQ storm that can be triggered by high CAN bus load and channel specific IRQ handlers are fixed. Yang Yingliang fixes the j1939 transport protocol by moving a kfree_skb() out of a spin_lock_irqsave protected section. * tag 'linux-can-fixes-for-6.1-20221027' of git://git.kernel.org/pub/scm/linux/kernel/git/mkl/linux-can: can: j1939: transport: j1939_session_skb_drop_old(): spin_unlock_irqrestore() before kfree_skb() can: rcar_canfd: fix channel specific IRQ handling for RZ/G2L can: rcar_canfd: rcar_canfd_handle_global_receive(): fix IRQ storm on global FIFO receive can: kvaser_usb: Fix possible completions during init_completion ==================== Link: https://lore.kernel.org/r/20221027114356.1939821-1-mkl@pengutronix.de Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| | * | can: j1939: transport: j1939_session_skb_drop_old(): ↵Yang Yingliang2022-10-271-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | spin_unlock_irqrestore() before kfree_skb() It is not allowed to call kfree_skb() from hardware interrupt context or with interrupts being disabled. The skb is unlinked from the queue, so it can be freed after spin_unlock_irqrestore(). Fixes: 9d71dd0c7009 ("can: add support of SAE J1939 protocol") Signed-off-by: Yang Yingliang <yangyingliang@huawei.com> Acked-by: Oleksij Rempel <o.rempel@pengutronix.de> Link: https://lore.kernel.org/all/20221027091237.2290111-1-yangyingliang@huawei.com Cc: stable@vger.kernel.org [mkl: adjust subject] Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
| * | | nh: fix scope used to find saddr when adding non gw nhNicolas Dichtel2022-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As explained by Julian, fib_nh_scope is related to fib_nh_gw4, but fib_info_update_nhc_saddr() needs the scope of the route, which is the scope "before" fib_nh_scope, ie fib_nh_scope - 1. This patch fixes the problem described in commit 747c14307214 ("ip: fix dflt addr selection for connected nexthop"). Fixes: 597cfe4fc339 ("nexthop: Add support for IPv4 nexthops") Link: https://lore.kernel.org/netdev/6c8a44ba-c2d5-cdf-c5c7-5baf97cba38@ssi.bg/ Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | Revert "ip: fix dflt addr selection for connected nexthop"Nicolas Dichtel2022-10-271-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit 747c14307214b55dbd8250e1ab44cad8305756f1. As explained by Julian, nhc_scope is related to nhc_gw, not to the route. Revert the original patch. The initial problem is fixed differently in the next commit. Link: https://lore.kernel.org/netdev/6c8a44ba-c2d5-cdf-c5c7-5baf97cba38@ssi.bg/ Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | Revert "ip: fix triggering of 'icmp redirect'"Nicolas Dichtel2022-10-271-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit eb55dc09b5dd040232d5de32812cc83001a23da6. The patch that introduces this bug is reverted right after this one. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Reviewed-by: Julian Anastasov <ja@ssi.bg> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | genetlink: limit the use of validation workarounds to old opsJakub Kicinski2022-10-271-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | During review of previous change another thing came up - we should limit the use of validation workarounds to old commands. Don't list the workarounds one by one, as we're rejecting all existing ones. We can deal with the masking in the unlikely event that new flag is added. Link: https://lore.kernel.org/all/6ba9f727e555fd376623a298d5d305ad408c3d47.camel@sipsolutions.net/ Link: https://lore.kernel.org/r/20221026001524.1892202-1-kuba@kernel.org Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | openvswitch: switch from WARN to pr_warnAaron Conole2022-10-271-1/+2
| |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As noted by Paolo Abeni, pr_warn doesn't generate any splat and can still preserve the warning to the user that feature downgrade occurred. We likely cannot introduce other kinds of checks / enforcement here because syzbot can generate different genl versions to the datapath. Reported-by: syzbot+31cde0bef4bbf8ba2d86@syzkaller.appspotmail.com Fixes: 44da5ae5fbea ("openvswitch: Drop user features if old user space attempted to create datapath") Cc: Thomas Graf <tgraf@suug.ch> Signed-off-by: Aaron Conole <aconole@redhat.com> Acked-by: Ilya Maximets <i.maximets@ovn.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com>
| * | ipv6: ensure sane device mtu in tunnelsEric Dumazet2022-10-263-13/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Another syzbot report [1] with no reproducer hints at a bug in ip6_gre tunnel (dev:ip6gretap0) Since ipv6 mcast code makes sure to read dev->mtu once and applies a sanity check on it (see commit b9b312a7a451 "ipv6: mcast: better catch silly mtu values"), a remaining possibility is that a layer is able to set dev->mtu to an underflowed value (high order bit set). This could happen indeed in ip6gre_tnl_link_config_route(), ip6_tnl_link_config() and ipip6_tunnel_bind_dev() Make sure to sanitize mtu value in a local variable before it is written once on dev->mtu, as lockless readers could catch wrong temporary value. [1] skbuff: skb_over_panic: text:ffff80000b7a2f38 len:40 put:40 head:ffff000149dcf200 data:ffff000149dcf2b0 tail:0xd8 end:0xc0 dev:ip6gretap0 ------------[ cut here ]------------ kernel BUG at net/core/skbuff.c:120 Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP Modules linked in: CPU: 1 PID: 10241 Comm: kworker/1:1 Not tainted 6.0.0-rc7-syzkaller-18095-gbbed346d5a96 #0 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 09/30/2022 Workqueue: mld mld_ifc_work pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--) pc : skb_panic+0x4c/0x50 net/core/skbuff.c:116 lr : skb_panic+0x4c/0x50 net/core/skbuff.c:116 sp : ffff800020dd3b60 x29: ffff800020dd3b70 x28: 0000000000000000 x27: ffff00010df2a800 x26: 00000000000000c0 x25: 00000000000000b0 x24: ffff000149dcf200 x23: 00000000000000c0 x22: 00000000000000d8 x21: ffff80000b7a2f38 x20: ffff00014c2f7800 x19: 0000000000000028 x18: 00000000000001a9 x17: 0000000000000000 x16: ffff80000db49158 x15: ffff000113bf1a80 x14: 0000000000000000 x13: 00000000ffffffff x12: ffff000113bf1a80 x11: ff808000081c0d5c x10: 0000000000000000 x9 : 73f125dc5c63ba00 x8 : 73f125dc5c63ba00 x7 : ffff800008161d1c x6 : 0000000000000000 x5 : 0000000000000080 x4 : 0000000000000001 x3 : 0000000000000000 x2 : ffff0001fefddcd0 x1 : 0000000100000000 x0 : 0000000000000089 Call trace: skb_panic+0x4c/0x50 net/core/skbuff.c:116 skb_over_panic net/core/skbuff.c:125 [inline] skb_put+0xd4/0xdc net/core/skbuff.c:2049 ip6_mc_hdr net/ipv6/mcast.c:1714 [inline] mld_newpack+0x14c/0x270 net/ipv6/mcast.c:1765 add_grhead net/ipv6/mcast.c:1851 [inline] add_grec+0xa20/0xae0 net/ipv6/mcast.c:1989 mld_send_cr+0x438/0x5a8 net/ipv6/mcast.c:2115 mld_ifc_work+0x38/0x290 net/ipv6/mcast.c:2653 process_one_work+0x2d8/0x504 kernel/workqueue.c:2289 worker_thread+0x340/0x610 kernel/workqueue.c:2436 kthread+0x12c/0x158 kernel/kthread.c:376 ret_from_fork+0x10/0x20 arch/arm64/kernel/entry.S:860 Code: 91011400 aa0803e1 a90027ea 94373093 (d4210000) Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Reported-by: syzbot <syzkaller@googlegroups.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Link: https://lore.kernel.org/r/20221024020124.3756833-1-eric.dumazet@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>