diff options
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/altera_tse.txt | 4 | ||||
-rw-r--r-- | Documentation/networking/bonding.txt | 4 | ||||
-rw-r--r-- | Documentation/networking/checksum-offloads.txt | 14 | ||||
-rw-r--r-- | Documentation/networking/dsa/bcm_sf2.txt | 2 | ||||
-rw-r--r-- | Documentation/networking/dsa/dsa.txt | 20 | ||||
-rw-r--r-- | Documentation/networking/filter.txt | 101 | ||||
-rw-r--r-- | Documentation/networking/gen_stats.txt | 6 | ||||
-rw-r--r-- | Documentation/networking/ip-sysctl.txt | 10 | ||||
-rw-r--r-- | Documentation/networking/ipvlan.txt | 6 | ||||
-rw-r--r-- | Documentation/networking/mac80211-injection.txt | 17 | ||||
-rw-r--r-- | Documentation/networking/netdev-features.txt | 10 | ||||
-rw-r--r-- | Documentation/networking/netdevices.txt | 9 | ||||
-rw-r--r-- | Documentation/networking/pktgen.txt | 6 | ||||
-rw-r--r-- | Documentation/networking/segmentation-offloads.txt | 130 | ||||
-rw-r--r-- | Documentation/networking/stmmac.txt | 44 | ||||
-rw-r--r-- | Documentation/networking/switchdev.txt | 30 | ||||
-rw-r--r-- | Documentation/networking/timestamping.txt | 48 | ||||
-rw-r--r-- | Documentation/networking/vrf.txt | 2 | ||||
-rw-r--r-- | Documentation/networking/xfrm_sync.txt | 6 |
19 files changed, 395 insertions, 74 deletions
diff --git a/Documentation/networking/altera_tse.txt b/Documentation/networking/altera_tse.txt index cd417d7b5bd4..50b8589d12fd 100644 --- a/Documentation/networking/altera_tse.txt +++ b/Documentation/networking/altera_tse.txt @@ -65,14 +65,14 @@ Driver parameters can be also passed in command line by using: 4.1) Transmit process When the driver's transmit routine is called by the kernel, it sets up a transmit descriptor by calling the underlying DMA transmit routine (SGDMA or -MSGDMA), and initites a transmit operation. Once the transmit is complete, an +MSGDMA), and initiates a transmit operation. Once the transmit is complete, an interrupt is driven by the transmit DMA logic. The driver handles the transmit completion in the context of the interrupt handling chain by recycling resource required to send and track the requested transmit operation. 4.2) Receive process The driver will post receive buffers to the receive DMA logic during driver -intialization. Receive buffers may or may not be queued depending upon the +initialization. Receive buffers may or may not be queued depending upon the underlying DMA logic (MSGDMA is able queue receive buffers, SGDMA is not able to queue receive buffers to the SGDMA receive logic). When a packet is received, the DMA logic generates an interrupt. The driver handles a receive diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 334b49ef02d1..57f52cdce32e 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -1880,8 +1880,8 @@ or more peers on the local network. The ARP monitor relies on the device driver itself to verify that traffic is flowing. In particular, the driver must keep up to -date the last receive time, dev->last_rx, and transmit start time, -dev->trans_start. If these are not updated by the driver, then the +date the last receive time, dev->last_rx. Drivers that use NETIF_F_LLTX +flag must also update netdev_queue->trans_start. If they do not, then the ARP monitor will immediately fail any slaves using that driver, and those slaves will stay down. If networking monitoring (tcpdump, etc) shows the ARP requests and replies on the network, then it may be that diff --git a/Documentation/networking/checksum-offloads.txt b/Documentation/networking/checksum-offloads.txt index de2a327766a7..56e36861245f 100644 --- a/Documentation/networking/checksum-offloads.txt +++ b/Documentation/networking/checksum-offloads.txt @@ -69,18 +69,18 @@ LCO: Local Checksum Offload LCO is a technique for efficiently computing the outer checksum of an encapsulated datagram when the inner checksum is due to be offloaded. The ones-complement sum of a correctly checksummed TCP or UDP packet is - equal to the sum of the pseudo header, because everything else gets - 'cancelled out' by the checksum field. This is because the sum was + equal to the complement of the sum of the pseudo header, because everything + else gets 'cancelled out' by the checksum field. This is because the sum was complemented before being written to the checksum field. More generally, this holds in any case where the 'IP-style' ones complement checksum is used, and thus any checksum that TX Checksum Offload supports. That is, if we have set up TX Checksum Offload with a start/offset pair, we - know that _after the device has filled in that checksum_, the ones + know that after the device has filled in that checksum, the ones complement sum from csum_start to the end of the packet will be equal to - _whatever value we put in the checksum field beforehand_. This allows us - to compute the outer checksum without looking at the payload: we simply - stop summing when we get to csum_start, then add the 16-bit word at - (csum_start + csum_offset). + the complement of whatever value we put in the checksum field beforehand. + This allows us to compute the outer checksum without looking at the payload: + we simply stop summing when we get to csum_start, then add the complement of + the 16-bit word at (csum_start + csum_offset). Then, when the true inner checksum is filled in (either by hardware or by skb_checksum_help()), the outer checksum will become correct by virtue of the arithmetic. diff --git a/Documentation/networking/dsa/bcm_sf2.txt b/Documentation/networking/dsa/bcm_sf2.txt index d999d0c1c5b8..eba3a2431e91 100644 --- a/Documentation/networking/dsa/bcm_sf2.txt +++ b/Documentation/networking/dsa/bcm_sf2.txt @@ -38,7 +38,7 @@ Implementation details ====================== The driver is located in drivers/net/dsa/bcm_sf2.c and is implemented as a DSA -driver; see Documentation/networking/dsa/dsa.txt for details on the subsytem +driver; see Documentation/networking/dsa/dsa.txt for details on the subsystem and what it provides. The SF2 switch is configured to enable a Broadcom specific 4-bytes switch tag diff --git a/Documentation/networking/dsa/dsa.txt b/Documentation/networking/dsa/dsa.txt index 3b196c304b73..631b0f7ae16f 100644 --- a/Documentation/networking/dsa/dsa.txt +++ b/Documentation/networking/dsa/dsa.txt @@ -334,7 +334,7 @@ more specifically with its VLAN filtering portion when configuring VLANs on top of per-port slave network devices. Since DSA primarily deals with MDIO-connected switches, although not exclusively, SWITCHDEV's prepare/abort/commit phases are often simplified into a prepare phase which -checks whether the operation is supporte by the DSA switch driver, and a commit +checks whether the operation is supported by the DSA switch driver, and a commit phase which applies the changes. As of today, the only SWITCHDEV objects supported by DSA are the FDB and VLAN @@ -533,7 +533,7 @@ Bridge layer out at the switch hardware for the switch to (re) learn MAC addresses behind this port. -- port_stp_update: bridge layer function invoked when a given switch port STP +- port_stp_state_set: bridge layer function invoked when a given switch port STP state is computed by the bridge layer and should be propagated to switch hardware to forward/block/learn traffic. The switch driver is responsible for computing a STP state change based on current and asked parameters and perform @@ -542,6 +542,12 @@ Bridge layer Bridge VLAN filtering --------------------- +- port_vlan_prepare: bridge layer function invoked when the bridge prepares the + configuration of a VLAN on the given port. If the operation is not supported + by the hardware, this function should return -EOPNOTSUPP to inform the bridge + code to fallback to a software implementation. No hardware setup must be done + in this function. See port_vlan_add for this and details. + - port_vlan_add: bridge layer function invoked when a VLAN is configured (tagged or untagged) for the given switch port @@ -552,6 +558,12 @@ Bridge VLAN filtering function that the driver has to call for each VLAN the given port is a member of. A switchdev object is used to carry the VID and bridge flags. +- port_fdb_prepare: bridge layer function invoked when the bridge prepares the + installation of a Forwarding Database entry. If the operation is not + supported, this function should return -EOPNOTSUPP to inform the bridge code + to fallback to a software implementation. No hardware setup must be done in + this function. See port_fdb_add for this and details. + - port_fdb_add: bridge layer function invoked when the bridge wants to install a Forwarding Database entry, the switch hardware should be programmed with the specified address in the specified VLAN Id in the forwarding database @@ -565,6 +577,10 @@ of DSA, would be the its port-based VLAN, used by the associated bridge device. the specified MAC address from the specified VLAN ID if it was mapped into this port forwarding database +- port_fdb_dump: bridge layer function invoked with a switchdev callback + function that the driver has to call for each MAC address known to be behind + the given port. A switchdev object is used to carry the VID and FDB info. + TODO ==== diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index 11f67f181d39..683ada5ad81d 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -216,14 +216,14 @@ opcodes as defined in linux/filter.h stand for: jmp 6 Jump to label ja 6 Jump to label - jeq 7, 8 Jump on k == A - jneq 8 Jump on k != A - jne 8 Jump on k != A - jlt 8 Jump on k < A - jle 8 Jump on k <= A - jgt 7, 8 Jump on k > A - jge 7, 8 Jump on k >= A - jset 7, 8 Jump on k & A + jeq 7, 8 Jump on A == k + jneq 8 Jump on A != k + jne 8 Jump on A != k + jlt 8 Jump on A < k + jle 8 Jump on A <= k + jgt 7, 8 Jump on A > k + jge 7, 8 Jump on A >= k + jset 7, 8 Jump on A & k add 0, 4 A + <x> sub 0, 4 A - <x> @@ -1095,6 +1095,87 @@ all use cases. See details of eBPF verifier in kernel/bpf/verifier.c +Direct packet access +-------------------- +In cls_bpf and act_bpf programs the verifier allows direct access to the packet +data via skb->data and skb->data_end pointers. +Ex: +1: r4 = *(u32 *)(r1 +80) /* load skb->data_end */ +2: r3 = *(u32 *)(r1 +76) /* load skb->data */ +3: r5 = r3 +4: r5 += 14 +5: if r5 > r4 goto pc+16 +R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp +6: r0 = *(u16 *)(r3 +12) /* access 12 and 13 bytes of the packet */ + +this 2byte load from the packet is safe to do, since the program author +did check 'if (skb->data + 14 > skb->data_end) goto err' at insn #5 which +means that in the fall-through case the register R3 (which points to skb->data) +has at least 14 directly accessible bytes. The verifier marks it +as R3=pkt(id=0,off=0,r=14). +id=0 means that no additional variables were added to the register. +off=0 means that no additional constants were added. +r=14 is the range of safe access which means that bytes [R3, R3 + 14) are ok. +Note that R5 is marked as R5=pkt(id=0,off=14,r=14). It also points +to the packet data, but constant 14 was added to the register, so +it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14) +which is zero bytes. + +More complex packet access may look like: + R0=imm1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp + 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ + 7: r4 = *(u8 *)(r3 +12) + 8: r4 *= 14 + 9: r3 = *(u32 *)(r1 +76) /* load skb->data */ +10: r3 += r4 +11: r2 = r1 +12: r2 <<= 48 +13: r2 >>= 48 +14: r3 += r2 +15: r2 = r3 +16: r2 += 8 +17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ +18: if r2 > r1 goto pc+2 + R0=inv56 R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv52 R5=pkt(id=0,off=14,r=14) R10=fp +19: r1 = *(u8 *)(r3 +4) +The state of the register R3 is R3=pkt(id=2,off=0,r=8) +id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some +offset within a packet and since the program author did +'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8). +The verifier only allows 'add' operation on packet registers. Any other +operation will set the register state to 'unknown_value' and it won't be +available for direct packet access. +Operation 'r3 += rX' may overflow and become less than original skb->data, +therefore the verifier has to prevent that. So it tracks the number of +upper zero bits in all 'uknown_value' registers, so when it sees +'r3 += rX' instruction and rX is more than 16-bit value, it will error as: +"cannot add integer value with N upper zero bits to ptr_to_packet" +Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is +R4=inv56 which means that upper 56 bits on the register are guaranteed +to be zero. After insn 'r4 *= 14' the state becomes R4=inv52, since +multiplying 8-bit value by constant 14 will keep upper 52 bits as zero. +Similarly 'r2 >>= 48' will make R2=inv48, since the shift is not sign +extending. This logic is implemented in evaluate_reg_alu() function. + +The end result is that bpf program author can access packet directly +using normal C code as: + void *data = (void *)(long)skb->data; + void *data_end = (void *)(long)skb->data_end; + struct eth_hdr *eth = data; + struct iphdr *iph = data + sizeof(*eth); + struct udphdr *udp = data + sizeof(*eth) + sizeof(*iph); + + if (data + sizeof(*eth) + sizeof(*iph) + sizeof(*udp) > data_end) + return 0; + if (eth->h_proto != htons(ETH_P_IP)) + return 0; + if (iph->protocol != IPPROTO_UDP || iph->ihl != 5) + return 0; + if (udp->dest == 53 || udp->source == 9) + ...; +which makes such programs easier to write comparing to LD_ABS insn +and significantly faster. + eBPF maps --------- 'maps' is a generic storage of different types for sharing data between kernel @@ -1293,5 +1374,5 @@ to give potential BPF hackers or security auditors a better overview of the underlying architecture. Jay Schulist <jschlst@samba.org> -Daniel Borkmann <dborkman@redhat.com> -Alexei Starovoitov <ast@plumgrid.com> +Daniel Borkmann <daniel@iogearbox.net> +Alexei Starovoitov <ast@kernel.org> diff --git a/Documentation/networking/gen_stats.txt b/Documentation/networking/gen_stats.txt index 70e6275b757a..ff630a87b511 100644 --- a/Documentation/networking/gen_stats.txt +++ b/Documentation/networking/gen_stats.txt @@ -33,7 +33,8 @@ my_dumping_routine(struct sk_buff *skb, ...) { struct gnet_dump dump; - if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump) < 0) + if (gnet_stats_start_copy(skb, TCA_STATS2, &mystruct->lock, &dump, + TCA_PAD) < 0) goto rtattr_failure; if (gnet_stats_copy_basic(&dump, &mystruct->bstats) < 0 || @@ -56,7 +57,8 @@ existing TLV types. my_dumping_routine(struct sk_buff *skb, ...) { if (gnet_stats_start_copy_compat(skb, TCA_STATS2, TCA_STATS, - TCA_XSTATS, &mystruct->lock, &dump) < 0) + TCA_XSTATS, &mystruct->lock, &dump, + TCA_PAD) < 0) goto rtattr_failure; ... } diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index b183e2b606c8..6c7f365b1515 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -63,6 +63,16 @@ fwmark_reflect - BOOLEAN fwmark of the packet they are replying to. Default: 0 +fib_multipath_use_neigh - BOOLEAN + Use status of existing neighbor entry when determining nexthop for + multipath routes. If disabled, neighbor information is not used and + packets could be directed to a failed nexthop. Only valid for kernels + built with CONFIG_IP_ROUTE_MULTIPATH enabled. + Default: 0 (disabled) + Possible values: + 0 - disabled + 1 - enabled + route/max_size - INTEGER Maximum number of routes allowed in the kernel. Increase this when using large numbers of interfaces and/or routes. diff --git a/Documentation/networking/ipvlan.txt b/Documentation/networking/ipvlan.txt index cf996394e466..14422f8fcdc4 100644 --- a/Documentation/networking/ipvlan.txt +++ b/Documentation/networking/ipvlan.txt @@ -8,7 +8,7 @@ Initial Release: This is conceptually very similar to the macvlan driver with one major exception of using L3 for mux-ing /demux-ing among slaves. This property makes the master device share the L2 with it's slave devices. I have developed this -driver in conjuntion with network namespaces and not sure if there is use case +driver in conjunction with network namespaces and not sure if there is use case outside of it. @@ -42,7 +42,7 @@ out. In this mode the slaves will RX/TX multicast and broadcast (if applicable) as well. 4.2 L3 mode: - In this mode TX processing upto L3 happens on the stack instance attached + In this mode TX processing up to L3 happens on the stack instance attached to the slave device and packets are switched to the stack instance of the master device for the L2 processing and routing from that instance will be used before packets are queued on the outbound device. In this mode the slaves @@ -56,7 +56,7 @@ situations defines your use case then you can choose to use ipvlan - (a) The Linux host that is connected to the external switch / router has policy configured that allows only one mac per port. (b) No of virtual devices created on a master exceed the mac capacity and -puts the NIC in promiscous mode and degraded performance is a concern. +puts the NIC in promiscuous mode and degraded performance is a concern. (c) If the slave device is to be put into the hostile / untrusted network namespace where L2 on the slave could be changed / misused. diff --git a/Documentation/networking/mac80211-injection.txt b/Documentation/networking/mac80211-injection.txt index ec8f934c2eb2..d58d78df9ca2 100644 --- a/Documentation/networking/mac80211-injection.txt +++ b/Documentation/networking/mac80211-injection.txt @@ -37,14 +37,27 @@ radiotap headers and used to control injection: HT rate for the transmission (only for devices without own rate control). Also some flags are parsed - IEEE80211_TX_RC_SHORT_GI: use short guard interval - IEEE80211_TX_RC_40_MHZ_WIDTH: send in HT40 mode + IEEE80211_RADIOTAP_MCS_SGI: use short guard interval + IEEE80211_RADIOTAP_MCS_BW_40: send in HT40 mode * IEEE80211_RADIOTAP_DATA_RETRIES number of retries when either IEEE80211_RADIOTAP_RATE or IEEE80211_RADIOTAP_MCS was used + * IEEE80211_RADIOTAP_VHT + + VHT mcs and number of streams used in the transmission (only for devices + without own rate control). Also other fields are parsed + + flags field + IEEE80211_RADIOTAP_VHT_FLAG_SGI: use short guard interval + + bandwidth field + 1: send using 40MHz channel width + 4: send using 80MHz channel width + 11: send using 160MHz channel width + The injection code can also skip all other currently defined radiotap fields facilitating replay of captured radiotap headers directly. diff --git a/Documentation/networking/netdev-features.txt b/Documentation/networking/netdev-features.txt index f310edec8a77..7413eb05223b 100644 --- a/Documentation/networking/netdev-features.txt +++ b/Documentation/networking/netdev-features.txt @@ -131,13 +131,11 @@ stack. Driver should not change behaviour based on them. * LLTX driver (deprecated for hardware drivers) -NETIF_F_LLTX should be set in drivers that implement their own locking in -transmit path or don't need locking at all (e.g. software tunnels). -In ndo_start_xmit, it is recommended to use a try_lock and return -NETDEV_TX_LOCKED when the spin lock fails. The locking should also properly -protect against other callbacks (the rules you need to find out). +NETIF_F_LLTX is meant to be used by drivers that don't need locking at all, +e.g. software tunnels. -Don't use it for new drivers. +This is also used in a few legacy drivers that implement their +own locking, don't use it for new (hardware) drivers. * netns-local device diff --git a/Documentation/networking/netdevices.txt b/Documentation/networking/netdevices.txt index 0b1cf6b2a592..7fec2061a334 100644 --- a/Documentation/networking/netdevices.txt +++ b/Documentation/networking/netdevices.txt @@ -69,10 +69,9 @@ ndo_start_xmit: When the driver sets NETIF_F_LLTX in dev->features this will be called without holding netif_tx_lock. In this case the driver - has to lock by itself when needed. It is recommended to use a try lock - for this and return NETDEV_TX_LOCKED when the spin lock fails. - The locking there should also properly protect against - set_rx_mode. Note that the use of NETIF_F_LLTX is deprecated. + has to lock by itself when needed. + The locking there should also properly protect against + set_rx_mode. WARNING: use of NETIF_F_LLTX is deprecated. Don't use it for new drivers. Context: Process with BHs disabled or BH (timer), @@ -83,8 +82,6 @@ ndo_start_xmit: o NETDEV_TX_BUSY Cannot transmit packet, try later Usually a bug, means queue start/stop flow control is broken in the driver. Note: the driver must NOT put the skb in its DMA ring. - o NETDEV_TX_LOCKED Locking failed, please retry quickly. - Only valid when NETIF_F_LLTX is set. ndo_tx_timeout: Synchronization: netif_tx_lock spinlock; all TX queues frozen. diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt index f4be85e96005..2c4e3354e128 100644 --- a/Documentation/networking/pktgen.txt +++ b/Documentation/networking/pktgen.txt @@ -67,12 +67,12 @@ The two basic thread commands are: * add_device DEVICE@NAME -- adds a single device * rem_device_all -- remove all associated devices -When adding a device to a thread, a corrosponding procfile is created +When adding a device to a thread, a corresponding procfile is created which is used for configuring this device. Thus, device names need to be unique. To support adding the same device to multiple threads, which is useful -with multi queue NICs, a the device naming scheme is extended with "@": +with multi queue NICs, the device naming scheme is extended with "@": device@something The part after "@" can be anything, but it is custom to use the thread @@ -221,7 +221,7 @@ Sample scripts A collection of tutorial scripts and helpers for pktgen is in the samples/pktgen directory. The helper parameters.sh file support easy -and consistant parameter parsing across the sample scripts. +and consistent parameter parsing across the sample scripts. Usage example and help: ./pktgen_sample01_simple.sh -i eth4 -m 00:1B:21:3C:9D:F8 -d 192.168.8.2 diff --git a/Documentation/networking/segmentation-offloads.txt b/Documentation/networking/segmentation-offloads.txt new file mode 100644 index 000000000000..f200467ade38 --- /dev/null +++ b/Documentation/networking/segmentation-offloads.txt @@ -0,0 +1,130 @@ +Segmentation Offloads in the Linux Networking Stack + +Introduction +============ + +This document describes a set of techniques in the Linux networking stack +to take advantage of segmentation offload capabilities of various NICs. + +The following technologies are described: + * TCP Segmentation Offload - TSO + * UDP Fragmentation Offload - UFO + * IPIP, SIT, GRE, and UDP Tunnel Offloads + * Generic Segmentation Offload - GSO + * Generic Receive Offload - GRO + * Partial Generic Segmentation Offload - GSO_PARTIAL + +TCP Segmentation Offload +======================== + +TCP segmentation allows a device to segment a single frame into multiple +frames with a data payload size specified in skb_shinfo()->gso_size. +When TCP segmentation requested the bit for either SKB_GSO_TCP or +SKB_GSO_TCP6 should be set in skb_shinfo()->gso_type and +skb_shinfo()->gso_size should be set to a non-zero value. + +TCP segmentation is dependent on support for the use of partial checksum +offload. For this reason TSO is normally disabled if the Tx checksum +offload for a given device is disabled. + +In order to support TCP segmentation offload it is necessary to populate +the network and transport header offsets of the skbuff so that the device +drivers will be able determine the offsets of the IP or IPv6 header and the +TCP header. In addition as CHECKSUM_PARTIAL is required csum_start should +also point to the TCP header of the packet. + +For IPv4 segmentation we support one of two types in terms of the IP ID. +The default behavior is to increment the IP ID with every segment. If the +GSO type SKB_GSO_TCP_FIXEDID is specified then we will not increment the IP +ID and all segments will use the same IP ID. If a device has +NETIF_F_TSO_MANGLEID set then the IP ID can be ignored when performing TSO +and we will either increment the IP ID for all frames, or leave it at a +static value based on driver preference. + +UDP Fragmentation Offload +========================= + +UDP fragmentation offload allows a device to fragment an oversized UDP +datagram into multiple IPv4 fragments. Many of the requirements for UDP +fragmentation offload are the same as TSO. However the IPv4 ID for +fragments should not increment as a single IPv4 datagram is fragmented. + +IPIP, SIT, GRE, UDP Tunnel, and Remote Checksum Offloads +======================================================== + +In addition to the offloads described above it is possible for a frame to +contain additional headers such as an outer tunnel. In order to account +for such instances an additional set of segmentation offload types were +introduced including SKB_GSO_IPIP, SKB_GSO_SIT, SKB_GSO_GRE, and +SKB_GSO_UDP_TUNNEL. These extra segmentation types are used to identify +cases where there are more than just 1 set of headers. For example in the +case of IPIP and SIT we should have the network and transport headers moved +from the standard list of headers to "inner" header offsets. + +Currently only two levels of headers are supported. The convention is to +refer to the tunnel headers as the outer headers, while the encapsulated +data is normally referred to as the inner headers. Below is the list of +calls to access the given headers: + +IPIP/SIT Tunnel: + Outer Inner +MAC skb_mac_header +Network skb_network_header skb_inner_network_header +Transport skb_transport_header + +UDP/GRE Tunnel: + Outer Inner +MAC skb_mac_header skb_inner_mac_header +Network skb_network_header skb_inner_network_header +Transport skb_transport_header skb_inner_transport_header + +In addition to the above tunnel types there are also SKB_GSO_GRE_CSUM and +SKB_GSO_UDP_TUNNEL_CSUM. These two additional tunnel types reflect the +fact that the outer header also requests to have a non-zero checksum +included in the outer header. + +Finally there is SKB_GSO_REMCSUM which indicates that a given tunnel header +has requested a remote checksum offload. In this case the inner headers +will be left with a partial checksum and only the outer header checksum +will be computed. + +Generic Segmentation Offload +============================ + +Generic segmentation offload is a pure software offload that is meant to +deal with cases where device drivers cannot perform the offloads described +above. What occurs in GSO is that a given skbuff will have its data broken +out over multiple skbuffs that have been resized to match the MSS provided +via skb_shinfo()->gso_size. + +Before enabling any hardware segmentation offload a corresponding software +offload is required in GSO. Otherwise it becomes possible for a frame to +be re-routed between devices and end up being unable to be transmitted. + +Generic Receive Offload +======================= + +Generic receive offload is the complement to GSO. Ideally any frame +assembled by GRO should be segmented to create an identical sequence of +frames using GSO, and any sequence of frames segmented by GSO should be +able to be reassembled back to the original by GRO. The only exception to +this is IPv4 ID in the case that the DF bit is set for a given IP header. +If the value of the IPv4 ID is not sequentially incrementing it will be +altered so that it is when a frame assembled via GRO is segmented via GSO. + +Partial Generic Segmentation Offload +==================================== + +Partial generic segmentation offload is a hybrid between TSO and GSO. What +it effectively does is take advantage of certain traits of TCP and tunnels +so that instead of having to rewrite the packet headers for each segment +only the inner-most transport header and possibly the outer-most network +header need to be updated. This allows devices that do not support tunnel +offloads or tunnel offloads with checksum to still make use of segmentation. + +With the partial offload what occurs is that all headers excluding the +inner transport header are updated such that they will contain the correct +values for if the header was simply duplicated. The one exception to this +is the outer IPv4 ID field. It is up to the device drivers to guarantee +that the IPv4 ID field is incremented in the case that a given header does +not have the DF bit set. diff --git a/Documentation/networking/stmmac.txt b/Documentation/networking/stmmac.txt index d64a14714236..671fe3dd56d3 100644 --- a/Documentation/networking/stmmac.txt +++ b/Documentation/networking/stmmac.txt @@ -1,6 +1,6 @@ STMicroelectronics 10/100/1000 Synopsys Ethernet driver -Copyright (C) 2007-2014 STMicroelectronics Ltd +Copyright (C) 2007-2015 STMicroelectronics Ltd Author: Giuseppe Cavallaro <peppe.cavallaro@st.com> This is the driver for the MAC 10/100/1000 on-chip Ethernet controllers @@ -138,6 +138,8 @@ struct plat_stmmacenet_data { int (*init)(struct platform_device *pdev, void *priv); void (*exit)(struct platform_device *pdev, void *priv); void *bsp_priv; + int has_gmac4; + bool tso_en; }; Where: @@ -181,6 +183,8 @@ Where: registers. init/exit callbacks should not use or modify platform data. o bsp_priv: another private pointer. + o has_gmac4: uses GMAC4 core. + o tso_en: Enables TSO (TCP Segmentation Offload) feature. For MDIO bus The we have: @@ -278,6 +282,13 @@ Please see the following document: o stmmac_ethtool.c: to implement the ethtool support; o stmmac.h: private driver structure; o common.h: common definitions and VFTs; + o mmc_core.c/mmc.h: Management MAC Counters; + o stmmac_hwtstamp.c: HW timestamp support for PTP; + o stmmac_ptp.c: PTP 1588 clock; + o dwmac-<XXX>.c: these are for the platform glue-logic file; e.g. dwmac-sti.c + for STMicroelectronics SoCs. + +- GMAC 3.x o descs.h: descriptor structure definitions; o dwmac1000_core.c: dwmac GiGa core functions; o dwmac1000_dma.c: dma functions for the GMAC chip; @@ -289,11 +300,32 @@ Please see the following document: o enh_desc.c: functions for handling enhanced descriptors; o norm_desc.c: functions for handling normal descriptors; o chain_mode.c/ring_mode.c:: functions to manage RING/CHAINED modes; - o mmc_core.c/mmc.h: Management MAC Counters; - o stmmac_hwtstamp.c: HW timestamp support for PTP; - o stmmac_ptp.c: PTP 1588 clock; - o dwmac-<XXX>.c: these are for the platform glue-logic file; e.g. dwmac-sti.c - for STMicroelectronics SoCs. + +- GMAC4.x generation + o dwmac4_core.c: dwmac GMAC4.x core functions; + o dwmac4_desc.c: functions for handling GMAC4.x descriptors; + o dwmac4_descs.h: descriptor definitions; + o dwmac4_dma.c: dma functions for the GMAC4.x chip; + o dwmac4_dma.h: dma definitions for the GMAC4.x chip; + o dwmac4.h: core definitions for the GMAC4.x chip; + o dwmac4_lib.c: generic GMAC4.x functions; + +4.12) TSO support (GMAC4.x) + +TSO (Tcp Segmentation Offload) feature is supported by GMAC 4.x chip family. +When a packet is sent through TCP protocol, the TCP stack ensures that +the SKB provided to the low level driver (stmmac in our case) matches with +the maximum frame len (IP header + TCP header + payload <= 1500 bytes (for +MTU set to 1500)). It means that if an application using TCP want to send a +packet which will have a length (after adding headers) > 1514 the packet +will be split in several TCP packets: The data payload is split and headers +(TCP/IP ..) are added. It is done by software. + +When TSO is enabled, the TCP stack doesn't care about the maximum frame +length and provide SKB packet to stmmac as it is. The GMAC IP will have to +perform the segmentation by it self to match with maximum frame length. + +This feature can be enabled in device tree through "snps,tso" entry. 5) Debug Information diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt index fad63136ee3e..31c39115834d 100644 --- a/Documentation/networking/switchdev.txt +++ b/Documentation/networking/switchdev.txt @@ -89,6 +89,18 @@ Typically, the management port is not participating in offloaded data plane and is loaded with a different driver, such as a NIC driver, on the management port device. +Switch ID +^^^^^^^^^ + +The switchdev driver must implement the switchdev op switchdev_port_attr_get +for SWITCHDEV_ATTR_ID_PORT_PARENT_ID for each port netdev, returning the same +physical ID for each port of a switch. The ID must be unique between switches +on the same system. The ID does not need to be unique between switches on +different systems. + +The switch ID is used to locate ports on a switch and to know if aggregated +ports belong to the same switch. + Port Netdev Naming ^^^^^^^^^^^^^^^^^^ @@ -104,25 +116,13 @@ external configuration. For example, if a physical 40G port is split logically into 4 10G ports, resulting in 4 port netdevs, the device can give a unique name for each port using port PHYS name. The udev rule would be: -SUBSYSTEM=="net", ACTION=="add", DRIVER="<driver>", ATTR{phys_port_name}!="", \ - NAME="$attr{phys_port_name}" +SUBSYSTEM=="net", ACTION=="add", ATTR{phys_switch_id}=="<phys_switch_id>", \ + ATTR{phys_port_name}!="", NAME="swX$attr{phys_port_name}" Suggested naming convention is "swXpYsZ", where X is the switch name or ID, Y is the port name or ID, and Z is the sub-port name or ID. For example, sw1p1s0 would be sub-port 0 on port 1 on switch 1. -Switch ID -^^^^^^^^^ - -The switchdev driver must implement the switchdev op switchdev_port_attr_get -for SWITCHDEV_ATTR_ID_PORT_PARENT_ID for each port netdev, returning the same -physical ID for each port of a switch. The ID must be unique between switches -on the same system. The ID does not need to be unique between switches on -different systems. - -The switch ID is used to locate ports on a switch and to know if aggregated -ports belong to the same switch. - Port Features ^^^^^^^^^^^^^ @@ -386,7 +386,7 @@ used. First phase is to "prepare" anything needed, including various checks, memory allocation, etc. The goal is to handle the stuff that is not unlikely to fail here. The second phase is to "commit" the actual changes. -Switchdev provides an inftrastructure for sharing items (for example memory +Switchdev provides an infrastructure for sharing items (for example memory allocations) between the two phases. The object created by a driver in "prepare" phase and it is queued up by: diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt index a977339fbe0a..671cccf0dcd2 100644 --- a/Documentation/networking/timestamping.txt +++ b/Documentation/networking/timestamping.txt @@ -44,11 +44,17 @@ timeval of SO_TIMESTAMP (ms). Supports multiple types of timestamp requests. As a result, this socket option takes a bitmap of flags, not a boolean. In - err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, &val); + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, + sizeof(val)); val is an integer with any of the following bits set. Setting other bit returns EINVAL and does not change the current state. +The socket option configures timestamp generation for individual +sk_buffs (1.3.1), timestamp reporting to the socket's error +queue (1.3.2) and options (1.3.3). Timestamp generation can also +be enabled for individual sendmsg calls using cmsg (1.3.4). + 1.3.1 Timestamp Generation @@ -71,13 +77,16 @@ SOF_TIMESTAMPING_RX_SOFTWARE: kernel receive stack. SOF_TIMESTAMPING_TX_HARDWARE: - Request tx timestamps generated by the network adapter. + Request tx timestamps generated by the network adapter. This flag + can be enabled via both socket options and control messages. SOF_TIMESTAMPING_TX_SOFTWARE: Request tx timestamps when data leaves the kernel. These timestamps are generated in the device driver as close as possible, but always prior to, passing the packet to the network interface. Hence, they require driver support and may not be available for all devices. + This flag can be enabled via both socket options and control messages. + SOF_TIMESTAMPING_TX_SCHED: Request tx timestamps prior to entering the packet scheduler. Kernel @@ -90,7 +99,8 @@ SOF_TIMESTAMPING_TX_SCHED: machines with virtual devices where a transmitted packet travels through multiple devices and, hence, multiple packet schedulers, a timestamp is generated at each layer. This allows for fine - grained measurement of queuing delay. + grained measurement of queuing delay. This flag can be enabled + via both socket options and control messages. SOF_TIMESTAMPING_TX_ACK: Request tx timestamps when all data in the send buffer has been @@ -99,6 +109,7 @@ SOF_TIMESTAMPING_TX_ACK: over-report measurement, because the timestamp is generated when all data up to and including the buffer at send() was acknowledged: the cumulative acknowledgment. The mechanism ignores SACK and FACK. + This flag can be enabled via both socket options and control messages. 1.3.2 Timestamp Reporting @@ -183,6 +194,37 @@ having access to the contents of the original packet, so cannot be combined with SOF_TIMESTAMPING_OPT_TSONLY. +1.3.4. Enabling timestamps via control messages + +In addition to socket options, timestamp generation can be requested +per write via cmsg, only for SOF_TIMESTAMPING_TX_* (see Section 1.3.1). +Using this feature, applications can sample timestamps per sendmsg() +without paying the overhead of enabling and disabling timestamps via +setsockopt: + + struct msghdr *msg; + ... + cmsg = CMSG_FIRSTHDR(msg); + cmsg->cmsg_level = SOL_SOCKET; + cmsg->cmsg_type = SO_TIMESTAMPING; + cmsg->cmsg_len = CMSG_LEN(sizeof(__u32)); + *((__u32 *) CMSG_DATA(cmsg)) = SOF_TIMESTAMPING_TX_SCHED | + SOF_TIMESTAMPING_TX_SOFTWARE | + SOF_TIMESTAMPING_TX_ACK; + err = sendmsg(fd, msg, 0); + +The SOF_TIMESTAMPING_TX_* flags set via cmsg will override +the SOF_TIMESTAMPING_TX_* flags set via setsockopt. + +Moreover, applications must still enable timestamp reporting via +setsockopt to receive timestamps: + + __u32 val = SOF_TIMESTAMPING_SOFTWARE | + SOF_TIMESTAMPING_OPT_ID /* or any other flag */; + err = setsockopt(fd, SOL_SOCKET, SO_TIMESTAMPING, (void *) val, + sizeof(val)); + + 1.4 Bytestream Timestamps The SO_TIMESTAMPING interface supports timestamping of bytes in a diff --git a/Documentation/networking/vrf.txt b/Documentation/networking/vrf.txt index d52aa10cfe91..5da679c573d2 100644 --- a/Documentation/networking/vrf.txt +++ b/Documentation/networking/vrf.txt @@ -41,7 +41,7 @@ using an rx_handler which gives the impression that packets flow through the VRF device. Similarly on egress routing rules are used to send packets to the VRF device driver before getting sent out the actual interface. This allows tcpdump on a VRF device to capture all packets into and out of the -VRF as a whole.[1] Similiarly, netfilter [2] and tc rules can be applied +VRF as a whole.[1] Similarly, netfilter [2] and tc rules can be applied using the VRF device to specify rules that apply to the VRF domain as a whole. [1] Packets in the forwarded state do not flow through the device, so those diff --git a/Documentation/networking/xfrm_sync.txt b/Documentation/networking/xfrm_sync.txt index d7aac9dedeb4..8d88e0f2ec49 100644 --- a/Documentation/networking/xfrm_sync.txt +++ b/Documentation/networking/xfrm_sync.txt @@ -4,7 +4,7 @@ Krisztian <hidden@balabit.hu> and others and additional patches from Jamal <hadi@cyberus.ca>. The end goal for syncing is to be able to insert attributes + generate -events so that the an SA can be safely moved from one machine to another +events so that the SA can be safely moved from one machine to another for HA purposes. The idea is to synchronize the SA so that the takeover machine can do the processing of the SA as accurate as possible if it has access to it. @@ -13,7 +13,7 @@ We already have the ability to generate SA add/del/upd events. These patches add ability to sync and have accurate lifetime byte (to ensure proper decay of SAs) and replay counters to avoid replay attacks with as minimal loss at failover time. -This way a backup stays as closely uptodate as an active member. +This way a backup stays as closely up-to-date as an active member. Because the above items change for every packet the SA receives, it is possible for a lot of the events to be generated. @@ -163,7 +163,7 @@ If you have an SA that is getting hit by traffic in bursts such that there is a period where the timer threshold expires with no packets seen, then an odd behavior is seen as follows: The first packet arrival after a timer expiry will trigger a timeout -aevent; i.e we dont wait for a timeout period or a packet threshold +event; i.e we don't wait for a timeout period or a packet threshold to be reached. This is done for simplicity and efficiency reasons. -JHS |