Diffstat (limited to 'Documentation/networking')
20 files changed, 1436 insertions, 47 deletions
diff --git a/Documentation/networking/af_xdp.rst b/Documentation/networking/af_xdp.rst index e14d7d40fc75..eeedc2e826aa 100644 --- a/Documentation/networking/af_xdp.rst +++ b/Documentation/networking/af_xdp.rst @@ -220,7 +220,21 @@ Usage In order to use AF_XDP sockets there are two parts needed. The user-space application and the XDP program. For a complete setup and usage example, please refer to the sample application. The user-space -side is xdpsock_user.c and the XDP side xdpsock_kern.c. +side is xdpsock_user.c and the XDP side is part of libbpf. + +The XDP code sample included in tools/lib/bpf/xsk.c is the following:: + + SEC("xdp_sock") int xdp_sock_prog(struct xdp_md *ctx) + { + int index = ctx->rx_queue_index; + + // A set entry here means that the corresponding queue_id + // has an active AF_XDP socket bound to it. + if (bpf_map_lookup_elem(&xsks_map, &index)) + return bpf_redirect_map(&xsks_map, index, 0); + + return XDP_PASS; + } Naive ring dequeue and enqueue could look like this:: @@ -316,16 +330,16 @@ A: When a netdev of a physical NIC is initialized, Linux usually all the traffic, you can force the netdev to only have 1 queue, queue id 0, and then bind to queue 0. You can use ethtool to do this:: - sudo ethtool -L <interface> combined 1 + sudo ethtool -L <interface> combined 1 If you want to only see part of the traffic, you can program the NIC through ethtool to filter out your traffic to a single queue id that you can bind your XDP socket to. Here is one example in which UDP traffic to and from port 4242 is sent to queue 2:: - sudo ethtool -N <interface> rx-flow-hash udp4 fn - sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \ - 4242 action 2 + sudo ethtool -N <interface> rx-flow-hash udp4 fn + sudo ethtool -N <interface> flow-type udp4 src-port 4242 dst-port \ + 4242 action 2 A number of other ways are possible, all up to the capabilities of the NIC you have. 
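Reading aid for the af_xdp.rst hunk above: the quoted XDP program redirects a frame only when the receive queue it arrived on has an entry in xsks_map, and otherwise returns XDP_PASS. The sketch below is a plain user-space C model of that decision, not BPF code and not part of the patch. The names `model_xsks_map`, `MODEL_MAX_QUEUES`, `model_verdict` and `model_xdp_sock_prog` are hypothetical stand-ins chosen for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-ins for the two outcomes of the quoted XDP program:
 * XDP_PASS, or a redirect through bpf_redirect_map(). */
enum model_verdict { MODEL_PASS, MODEL_REDIRECT };

/* Assumed queue count for this model only. */
#define MODEL_MAX_QUEUES 4

/* bound[q] == true models "queue q has an active AF_XDP socket bound". */
struct model_xsks_map {
    bool bound[MODEL_MAX_QUEUES];
};

/* Mirrors the lookup-then-redirect logic: redirect only if the queue
 * the frame arrived on has a socket entry, otherwise pass the frame on. */
enum model_verdict model_xdp_sock_prog(const struct model_xsks_map *map,
                                       unsigned int rx_queue_index)
{
    assert(map != NULL);

    if (rx_queue_index < MODEL_MAX_QUEUES && map->bound[rx_queue_index])
        return MODEL_REDIRECT;

    return MODEL_PASS;
}
```

With only queue 2 marked as bound, a frame on queue 2 is redirected and frames on every other queue are passed. This matches the ethtool example above, which steers the port 4242 UDP traffic to queue 2.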
diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index d3e5dd26db12..e3abfbd32f71 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -706,9 +706,9 @@ num_unsol_na unsolicited IPv6 Neighbor Advertisements) to be issued after a failover event. As soon as the link is up on the new slave (possibly immediately) a peer notification is sent on the - bonding device and each VLAN sub-device. This is repeated at - each link monitor interval (arp_interval or miimon, whichever - is active) if the number is greater than 1. + bonding device and each VLAN sub-device. This is repeated at + the rate specified by peer_notif_delay if the number is + greater than 1. The valid range is 0 - 255; the default value is 1. These options affect only the active-backup mode. These options were added for @@ -727,6 +727,16 @@ packets_per_slave The valid range is 0 - 65535; the default value is 1. This option has effect only in balance-rr mode. +peer_notif_delay + + Specify the delay, in milliseconds, between each peer + notification (gratuitous ARP and unsolicited IPv6 Neighbor + Advertisement) when they are issued after a failover event. + This delay should be a multiple of the link monitor interval + (arp_interval or miimon, whichever is active). The default + value is 0 which means to match the value of the link monitor + interval. + primary A string (eth0, eth2, etc) specifying which slave is the diff --git a/Documentation/networking/device_drivers/amazon/ena.txt b/Documentation/networking/device_drivers/amazon/ena.txt index 2b4b6f57e549..1bb55c7b604c 100644 --- a/Documentation/networking/device_drivers/amazon/ena.txt +++ b/Documentation/networking/device_drivers/amazon/ena.txt @@ -73,7 +73,7 @@ operation. AQ is used for submitting management commands, and the results/responses are reported asynchronously through ACQ. 
-ENA introduces a very small set of management commands with room for +ENA introduces a small set of management commands with room for vendor-specific extensions. Most of the management operations are framed in a generic Get/Set feature command. @@ -202,11 +202,14 @@ delay value to each level. The user can enable/disable adaptive moderation, modify the interrupt delay table and restore its default values through sysfs. +RX copybreak: +============= The rx_copybreak is initialized by default to ENA_DEFAULT_RX_COPYBREAK and can be configured by the ETHTOOL_STUNABLE command of the SIOCETHTOOL ioctl. SKB: +==== The driver-allocated SKB for frames received from Rx handling using NAPI context. The allocation method depends on the size of the packet. If the frame length is larger than rx_copybreak, napi_get_frags() diff --git a/Documentation/networking/device_drivers/aquantia/atlantic.txt b/Documentation/networking/device_drivers/aquantia/atlantic.txt new file mode 100644 index 000000000000..d235cbaeccc6 --- /dev/null +++ b/Documentation/networking/device_drivers/aquantia/atlantic.txt @@ -0,0 +1,439 @@ +aQuantia AQtion Driver for the aQuantia Multi-Gigabit PCI Express Family of +Ethernet Adapters +============================================================================= + +Contents +======== + +- Identifying Your Adapter +- Configuration +- Supported ethtool options +- Command Line Parameters +- Config file parameters +- Support +- License + +Identifying Your Adapter +======================== + +The driver in this release is compatible with AQC-100, AQC-107, AQC-108 based ethernet adapters. + + +SFP+ Devices (for AQC-100 based adapters) +---------------------------------- + +This release tested with passive Direct Attach Cables (DAC) and SFP+/LC Optical Transceiver. + +Configuration +========================= + Viewing Link Messages + --------------------- + Link messages will not be displayed to the console if the distribution is + restricting system messages. 
In order to see network driver link messages on + your console, set dmesg to eight by entering the following: + + dmesg -n 8 + + NOTE: This setting is not saved across reboots. + + Jumbo Frames + ------------ + The driver supports Jumbo Frames for all adapters. Jumbo Frames support is + enabled by changing the MTU to a value larger than the default of 1500. + The maximum value for the MTU is 16000. Use the `ip` command to + increase the MTU size. For example: + + ip link set mtu 16000 dev enp1s0 + + ethtool + ------- + The driver utilizes the ethtool interface for driver configuration and + diagnostics, as well as displaying statistical information. The latest + ethtool version is required for this functionality. + + NAPI + ---- + NAPI (Rx polling mode) is supported in the atlantic driver. + +Supported ethtool options +============================ + Viewing adapter settings + --------------------- + ethtool <ethX> + + Output example: + + Settings for enp1s0: + Supported ports: [ TP ] + Supported link modes: 100baseT/Full + 1000baseT/Full + 10000baseT/Full + 2500baseT/Full + 5000baseT/Full + Supported pause frame use: Symmetric + Supports auto-negotiation: Yes + Supported FEC modes: Not reported + Advertised link modes: 100baseT/Full + 1000baseT/Full + 10000baseT/Full + 2500baseT/Full + 5000baseT/Full + Advertised pause frame use: Symmetric + Advertised auto-negotiation: Yes + Advertised FEC modes: Not reported + Speed: 10000Mb/s + Duplex: Full + Port: Twisted Pair + PHYAD: 0 + Transceiver: internal + Auto-negotiation: on + MDI-X: Unknown + Supports Wake-on: g + Wake-on: d + Link detected: yes + + --- + Note: AQrate speeds (2.5/5 Gb/s) will be displayed only with linux kernels > 4.10. 
+ But you can still use these speeds: + ethtool -s eth0 autoneg off speed 2500 + + Viewing adapter information + --------------------- + ethtool -i <ethX> + + Output example: + + driver: atlantic + version: 5.2.0-050200rc5-generic-kern + firmware-version: 3.1.78 + expansion-rom-version: + bus-info: 0000:01:00.0 + supports-statistics: yes + supports-test: no + supports-eeprom-access: no + supports-register-dump: yes + supports-priv-flags: no + + + Viewing Ethernet adapter statistics: + --------------------- + ethtool -S <ethX> + + Output example: + NIC statistics: + InPackets: 13238607 + InUCast: 13293852 + InMCast: 52 + InBCast: 3 + InErrors: 0 + OutPackets: 23703019 + OutUCast: 23704941 + OutMCast: 67 + OutBCast: 11 + InUCastOctects: 213182760 + OutUCastOctects: 22698443 + InMCastOctects: 6600 + OutMCastOctects: 8776 + InBCastOctects: 192 + OutBCastOctects: 704 + InOctects: 2131839552 + OutOctects: 226938073 + InPacketsDma: 95532300 + OutPacketsDma: 59503397 + InOctetsDma: 1137102462 + OutOctetsDma: 2394339518 + InDroppedDma: 0 + Queue[0] InPackets: 23567131 + Queue[0] OutPackets: 20070028 + Queue[0] InJumboPackets: 0 + Queue[0] InLroPackets: 0 + Queue[0] InErrors: 0 + Queue[1] InPackets: 45428967 + Queue[1] OutPackets: 11306178 + Queue[1] InJumboPackets: 0 + Queue[1] InLroPackets: 0 + Queue[1] InErrors: 0 + Queue[2] InPackets: 3187011 + Queue[2] OutPackets: 13080381 + Queue[2] InJumboPackets: 0 + Queue[2] InLroPackets: 0 + Queue[2] InErrors: 0 + Queue[3] InPackets: 23349136 + Queue[3] OutPackets: 15046810 + Queue[3] InJumboPackets: 0 + Queue[3] InLroPackets: 0 + Queue[3] InErrors: 0 + + Interrupt coalescing support + --------------------------------- + ITR mode, TX/RX coalescing timings could be viewed with: + + ethtool -c <ethX> + + and changed with: + + ethtool -C <ethX> tx-usecs <usecs> rx-usecs <usecs> + + To disable coalescing: + + ethtool -C <ethX> tx-usecs 0 rx-usecs 0 tx-max-frames 1 tx-max-frames 1 + + Wake on LAN support + 
--------------------------------- + + WOL support by magic packet: + + ethtool -s <ethX> wol g + + To disable WOL: + + ethtool -s <ethX> wol d + + Set and check the driver message level + --------------------------------- + + Set message level + + ethtool -s <ethX> msglvl <level> + + Level values: + + 0x0001 - general driver status. + 0x0002 - hardware probing. + 0x0004 - link state. + 0x0008 - periodic status check. + 0x0010 - interface being brought down. + 0x0020 - interface being brought up. + 0x0040 - receive error. + 0x0080 - transmit error. + 0x0200 - interrupt handling. + 0x0400 - transmit completion. + 0x0800 - receive completion. + 0x1000 - packet contents. + 0x2000 - hardware status. + 0x4000 - Wake-on-LAN status. + + By default, the level of debugging messages is set 0x0001(general driver status). + + Check message level + + ethtool <ethX> | grep "Current message level" + + If you want to disable the output of messages + + ethtool -s <ethX> msglvl 0 + + RX flow rules (ntuple filters) + --------------------------------- + There are separate rules supported, that applies in that order: + 1. 16 VLAN ID rules + 2. 16 L2 EtherType rules + 3. 8 L3/L4 5-Tuple rules + + + The driver utilizes the ethtool interface for configuring ntuple filters, + via "ethtool -N <device> <filter>". + + To enable or disable the RX flow rules: + + ethtool -K ethX ntuple <on|off> + + When disabling ntuple filters, all the user programed filters are + flushed from the driver cache and hardware. All needed filters must + be re-added when ntuple is re-enabled. + + Because of the fixed order of the rules, the location of filters is also fixed: + - Locations 0 - 15 for VLAN ID filters + - Locations 16 - 31 for L2 EtherType filters + - Locations 32 - 39 for L3/L4 5-tuple filters (locations 32, 36 for IPv6) + + The L3/L4 5-tuple (protocol, source and destination IP address, source and + destination TCP/UDP/SCTP port) is compared against 8 filters. 
For IPv4, up to + 8 source and destination addresses can be matched. For IPv6, up to 2 pairs of + addresses can be supported. Source and destination ports are only compared for + TCP/UDP/SCTP packets. + + To add a filter that directs packet to queue 5, use <-N|-U|--config-nfc|--config-ntuple> switch: + + ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.1 dst-ip 10.0.0.2 src-port 2000 dst-port 2001 action 5 <loc 32> + + - action is the queue number. + - loc is the rule number. + + For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you must set the loc + number within 32 - 39. + For "flow-type ip4|udp4|tcp4|sctp4|ip6|udp6|tcp6|sctp6" you can set 8 rules + for traffic IPv4 or you can set 2 rules for traffic IPv6. Loc number traffic + IPv6 is 32 and 36. + At the moment you can not use IPv4 and IPv6 filters at the same time. + + Example filter for IPv6 filter traffic: + + sudo ethtool -N <ethX> flow-type tcp6 src-ip 2001:db8:0:f101::1 dst-ip 2001:db8:0:f101::2 action 1 loc 32 + sudo ethtool -N <ethX> flow-type ip6 src-ip 2001:db8:0:f101::2 dst-ip 2001:db8:0:f101::5 action -1 loc 36 + + Example filter for IPv4 filter traffic: + + sudo ethtool -N <ethX> flow-type udp4 src-ip 10.0.0.4 dst-ip 10.0.0.7 src-port 2000 dst-port 2001 loc 32 + sudo ethtool -N <ethX> flow-type tcp4 src-ip 10.0.0.3 dst-ip 10.0.0.9 src-port 2000 dst-port 2001 loc 33 + sudo ethtool -N <ethX> flow-type ip4 src-ip 10.0.0.6 dst-ip 10.0.0.4 loc 34 + + If you set action -1, then all traffic corresponding to the filter will be discarded. + The maximum value action is 31. + + + The VLAN filter (VLAN id) is compared against 16 filters. + VLAN id must be accompanied by mask 0xF000. That is to distinguish VLAN filter + from L2 Ethertype filter with UserPriority since both User Priority and VLAN ID + are passed in the same 'vlan' parameter. 
+ + To add a filter that directs packets from VLAN 2001 to queue 5: + ethtool -N <ethX> flow-type ip4 vlan 2001 m 0xF000 action 1 loc 0 + + + L2 EtherType filters allows filter packet by EtherType field or both EtherType + and User Priority (PCP) field of 802.1Q. + UserPriority (vlan) parameter must be accompanied by mask 0x1FFF. That is to + distinguish VLAN filter from L2 Ethertype filter with UserPriority since both + User Priority and VLAN ID are passed in the same 'vlan' parameter. + + To add a filter that directs IP4 packess of priority 3 to queue 3: + ethtool -N <ethX> flow-type ether proto 0x800 vlan 0x600 m 0x1FFF action 3 loc 16 + + + To see the list of filters currently present: + + ethtool <-u|-n|--show-nfc|--show-ntuple> <ethX> + + Rules may be deleted from the table itself. This is done using: + + sudo ethtool <-N|-U|--config-nfc|--config-ntuple> <ethX> delete <loc> + + - loc is the rule number to be deleted. + + Rx filters is an interface to load the filter table that funnels all flow + into queue 0 unless an alternative queue is specified using "action". In that + case, any flow that matches the filter criteria will be directed to the + appropriate queue. RX filters is supported on all kernels 2.6.30 and later. + + RSS for UDP + --------------------------------- + Currently, NIC does not support RSS for fragmented IP packets, which leads to + incorrect working of RSS for fragmented UDP traffic. To disable RSS for UDP the + RX Flow L3/L4 rule may be used. + + Example: + ethtool -N eth0 flow-type udp4 action 0 loc 32 + +Command Line Parameters +======================= +The following command line parameters are available on atlantic driver: + +aq_itr -Interrupt throttling mode +---------------------------------------- +Accepted values: 0, 1, 0xFFFF +Default value: 0xFFFF +0 - Disable interrupt throttling. +1 - Enable interrupt throttling and use specified tx and rx rates. +0xFFFF - Auto throttling mode. 
Driver will choose the best RX and TX + interrupt throttling settings based on link speed. + +aq_itr_tx - TX interrupt throttle rate +---------------------------------------- +Accepted values: 0 - 0x1FF +Default value: 0 +TX side throttling in microseconds. The adapter sets the maximum interrupt delay +to this value. The minimum interrupt delay will be half of this value. + +aq_itr_rx - RX interrupt throttle rate +---------------------------------------- +Accepted values: 0 - 0x1FF +Default value: 0 +RX side throttling in microseconds. The adapter sets the maximum interrupt delay +to this value. The minimum interrupt delay will be half of this value. + +Note: ITR settings can be changed at runtime with ethtool -C (see above) + +Config file parameters +======================= +For some fine tuning and performance optimizations, +some parameters can be changed in the {source_dir}/aq_cfg.h file. + +AQ_CFG_RX_PAGEORDER +---------------------------------------- +Default value: 0 +RX page order override. That is the power-of-2 number of RX pages allocated for +each descriptor. Received descriptor size is still limited by AQ_CFG_RX_FRAME_MAX. +Increasing pageorder improves page reuse (relevant on IOMMU-enabled systems). + +AQ_CFG_RX_REFILL_THRES +---------------------------------------- +Default value: 32 +RX refill threshold. RX path will not refill freed descriptors until the +specified number of free descriptors is observed. Larger values may improve +page reuse but may lead to packet drops as well. + +AQ_CFG_VECS_DEF +------------------------------------------------------------ +Number of queues +Valid Range: 0 - 8 (up to AQ_CFG_VECS_MAX) +Default value: 8 +Note that this value will be capped by the number of cores available on the system. 
+ +AQ_CFG_IS_RSS_DEF +------------------------------------------------------------ +Enable/disable Receive Side Scaling + +This feature allows the adapter to distribute receive processing +across multiple CPU-cores and to prevent from overloading a single CPU core. + +Valid values +0 - disabled +1 - enabled + +Default value: 1 + +AQ_CFG_NUM_RSS_QUEUES_DEF +------------------------------------------------------------ +Number of queues for Receive Side Scaling +Valid Range: 0 - 8 (up to AQ_CFG_VECS_DEF) + +Default value: AQ_CFG_VECS_DEF + +AQ_CFG_IS_LRO_DEF +------------------------------------------------------------ +Enable/disable Large Receive Offload + +This offload enables the adapter to coalesce multiple TCP segments and indicate +them as a single coalesced unit to the OS networking subsystem. +The system consumes less energy but it also introduces more latency in packets processing. + +Valid values +0 - disabled +1 - enabled + +Default value: 1 + +AQ_CFG_TX_CLEAN_BUDGET +---------------------------------------- +Maximum descriptors to cleanup on TX at once. +Default value: 256 + +After the aq_cfg.h file changed the driver must be rebuilt to take effect. + +Support +======= + +If an issue is identified with the released source code on the supported +kernel with a supported adapter, email the specific information related +to the issue to support@aquantia.com + +License +======= + +aQuantia Corporation Network Driver +Copyright(c) 2014 - 2019 aQuantia Corporation. + +This program is free software; you can redistribute it and/or modify it +under the terms and conditions of the GNU General Public License, +version 2, as published by the Free Software Foundation. 
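Reading aid for the atlantic.txt file added above: its "RX flow rules" section fixes the `loc` ranges per rule type (0 - 15 for VLAN ID, 16 - 31 for L2 EtherType, 32 - 39 for L3/L4 5-tuple, with IPv6 limited to 32 and 36) and allows `action` values from -1 (discard) up to 31. The sketch below restates those limits as two small C checks. It is an illustration only: the `example_*` names are hypothetical and are not taken from the driver or from ethtool.

```c
#include <stdbool.h>

/* Hypothetical rule categories, following the three rule types in the text,
 * with the L3/L4 type split into IPv4 and IPv6. */
enum example_rule_kind {
    EXAMPLE_RULE_VLAN,
    EXAMPLE_RULE_ETHERTYPE,
    EXAMPLE_RULE_L3L4_IPV4,
    EXAMPLE_RULE_L3L4_IPV6,
};

/* Returns true if 'loc' lies in the fixed range documented for 'kind'. */
bool example_loc_is_valid(enum example_rule_kind kind, int loc)
{
    switch (kind) {
    case EXAMPLE_RULE_VLAN:
        return loc >= 0 && loc <= 15;
    case EXAMPLE_RULE_ETHERTYPE:
        return loc >= 16 && loc <= 31;
    case EXAMPLE_RULE_L3L4_IPV4:
        return loc >= 32 && loc <= 39;
    case EXAMPLE_RULE_L3L4_IPV6:
        /* The text allows only two IPv6 rules, at locations 32 and 36. */
        return loc == 32 || loc == 36;
    }
    return false;
}

/* -1 means "discard"; otherwise the action is a queue number up to 31. */
bool example_action_is_valid(int action)
{
    return action >= -1 && action <= 31;
}
```

A pre-check like this could catch a rule such as `flow-type tcp6 ... loc 33` before it is handed to ethtool. Whether the driver reports such a rule in the same way is not stated in the document.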
diff --git a/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst b/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst index 5045df990a4c..17dbee1ac53e 100644 --- a/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst +++ b/Documentation/networking/device_drivers/freescale/dpaa2/dpio-driver.rst @@ -39,8 +39,7 @@ The Linux DPIO driver consists of 3 primary components-- DPIO service-- provides APIs to other Linux drivers for services - QBman portal interface-- sends portal commands, gets responses -:: + QBman portal interface-- sends portal commands, gets responses:: fsl-mc other bus drivers @@ -60,6 +59,7 @@ The Linux DPIO driver consists of 3 primary components-- The diagram below shows how the DPIO driver components fit with the other DPAA2 Linux driver components:: + +------------+ | OS Network | | Stack | diff --git a/Documentation/networking/device_drivers/google/gve.rst b/Documentation/networking/device_drivers/google/gve.rst new file mode 100644 index 000000000000..793693cef6e3 --- /dev/null +++ b/Documentation/networking/device_drivers/google/gve.rst @@ -0,0 +1,123 @@ +.. SPDX-License-Identifier: GPL-2.0+ + +============================================================== +Linux kernel driver for Compute Engine Virtual Ethernet (gve): +============================================================== + +Supported Hardware +=================== +The GVE driver binds to a single PCI device id used by the virtual +Ethernet device found in some Compute Engine VMs. 
+ ++--------------+----------+---------+ +|Field | Value | Comments| ++==============+==========+=========+ +|Vendor ID | `0x1AE0` | Google | ++--------------+----------+---------+ +|Device ID | `0x0042` | | ++--------------+----------+---------+ +|Sub-vendor ID | `0x1AE0` | Google | ++--------------+----------+---------+ +|Sub-device ID | `0x0058` | | ++--------------+----------+---------+ +|Revision ID | `0x0` | | ++--------------+----------+---------+ +|Device Class | `0x200` | Ethernet| ++--------------+----------+---------+ + +PCI Bars +======== +The gVNIC PCI device exposes three 32-bit memory BARS: +- Bar0 - Device configuration and status registers. +- Bar1 - MSI-X vector table +- Bar2 - IRQ, RX and TX doorbells + +Device Interactions +=================== +The driver interacts with the device in the following ways: + - Registers + - A block of MMIO registers + - See gve_register.h for more detail + - Admin Queue + - See description below + - Reset + - At any time the device can be reset + - Interrupts + - See supported interrupts below + - Transmit and Receive Queues + - See description below + +Registers +--------- +All registers are MMIO and big endian. + +The registers are used for initializing and configuring the device as well as +querying device status in response to management interrupts. + +Admin Queue (AQ) +---------------- +The Admin Queue is a PAGE_SIZE memory block, treated as an array of AQ +commands, used by the driver to issue commands to the device and set up +resources.The driver and the device maintain a count of how many commands +have been submitted and executed. 
To issue AQ commands, the driver must do +the following (with proper locking): + +1) Copy new commands into the next available slots in the AQ array +2) Increment its counter by the number of new commands +3) Write the counter into the GVE_ADMIN_QUEUE_DOORBELL register +4) Poll the ADMIN_QUEUE_EVENT_COUNTER register until it equals + the value written to the doorbell, or until a timeout. + +The device will update the status field in each AQ command reported as +executed through the ADMIN_QUEUE_EVENT_COUNTER register. + +Device Resets +------------- +A device reset is triggered by writing 0x0 to the AQ PFN register. +This causes the device to release all resources allocated by the +driver, including the AQ itself. + +Interrupts +---------- +The following interrupts are supported by the driver: + +Management Interrupt +~~~~~~~~~~~~~~~~~~~~ +The management interrupt is used by the device to tell the driver to +look at the GVE_DEVICE_STATUS register. + +The handler for the management irq simply queues the service task in +the workqueue to check the register and acks the irq. + +Notification Block Interrupts +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ +The notification block interrupts are used to tell the driver to poll +the queues associated with that interrupt. + +The handler for these irqs schedules the napi for that block to run +and poll the queues. + +Traffic Queues +-------------- +gVNIC's queues are composed of a descriptor ring and a buffer and are +assigned to a notification block. + +The descriptor rings are power-of-two-sized ring buffers consisting of +fixed-size descriptors. They advance their head pointer using a __be32 +doorbell located in Bar2. The tail pointers are advanced by consuming +descriptors in-order and updating a __be32 counter. Both the doorbell +and the counter overflow to zero. + +Each queue's buffers must be registered in advance with the device as a +queue page list, and packet data can only be put in those pages. 
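Reading aid for the "Traffic Queues" text above: a power-of-two ring driven by 32-bit counters that overflow to zero is commonly handled by masking the counter to get a slot and by using unsigned subtraction to count outstanding descriptors. The sketch below shows that arithmetic under the assumption that the counters are free-running, which the text implies but does not spell out. The `example_*` names are hypothetical, and the big-endian (`__be32`) conversion on the real doorbell is left out.

```c
#include <assert.h>
#include <stdint.h>

/* Slot index for a free-running 32-bit counter on a power-of-two ring.
 * Masking is equivalent to counter % ring_size for power-of-two sizes. */
uint32_t example_ring_slot(uint32_t counter, uint32_t ring_size)
{
    /* The text states that ring sizes are powers of two. */
    assert(ring_size != 0 && (ring_size & (ring_size - 1)) == 0);
    return counter & (ring_size - 1);
}

/* Descriptors posted (head) but not yet consumed (tail).
 * Unsigned 32-bit subtraction stays correct when head has wrapped
 * past zero and tail has not. */
uint32_t example_ring_in_flight(uint32_t head, uint32_t tail)
{
    return head - tail;
}
```

For example, with a 1024-entry ring a counter value of 1025 maps to slot 1, and a head of 2 with a tail of 0xFFFFFFFE still yields 4 outstanding descriptors.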
+ +Transmit +~~~~~~~~ +gve maps the buffers for transmit rings into a FIFO and copies the packets +into the FIFO before sending them to the NIC. + +Receive +~~~~~~~ +The buffers for receive rings are put into a data ring that is the same +length as the descriptor ring and the head and tail pointers advance over +the rings together. diff --git a/Documentation/networking/device_drivers/index.rst b/Documentation/networking/device_drivers/index.rst index 75fa537763a4..2b7fefe72351 100644 --- a/Documentation/networking/device_drivers/index.rst +++ b/Documentation/networking/device_drivers/index.rst @@ -21,6 +21,8 @@ Contents: intel/i40e intel/iavf intel/ice + google/gve + mellanox/mlx5 .. only:: subproject diff --git a/Documentation/networking/device_drivers/mellanox/mlx5.rst b/Documentation/networking/device_drivers/mellanox/mlx5.rst new file mode 100644 index 000000000000..214325897732 --- /dev/null +++ b/Documentation/networking/device_drivers/mellanox/mlx5.rst @@ -0,0 +1,192 @@ +.. SPDX-License-Identifier: GPL-2.0 OR Linux-OpenIB + +================================================= +Mellanox ConnectX(R) mlx5 core VPI Network Driver +================================================= + +Copyright (c) 2019, Mellanox Technologies LTD. + +Contents +======== + +- `Enabling the driver and kconfig options`_ +- `Devlink info`_ +- `Devlink health reporters`_ + +Enabling the driver and kconfig options +================================================ + +| mlx5 core is modular and most of the major mlx5 core driver features can be selected (compiled in/out) +| at build time via kernel Kconfig flags. +| Basic features, ethernet net device rx/tx offloads and XDP, are available with the most basic flags +| CONFIG_MLX5_CORE=y/m and CONFIG_MLX5_CORE_EN=y. +| For the list of advanced features please see below. + +**CONFIG_MLX5_CORE=(y/m/n)** (module mlx5_core.ko) + +| The driver can be enabled by choosing CONFIG_MLX5_CORE=y/m in kernel config. 
+| This will provide mlx5 core driver for mlx5 ulps to interface with (mlx5e, mlx5_ib). + + +**CONFIG_MLX5_CORE_EN=(y/n)** + +| Choosing this option will allow basic ethernet netdevice support with all of the standard rx/tx offloads. +| mlx5e is the mlx5 ulp driver which provides netdevice kernel interface, when chosen, mlx5e will be +| built-in into mlx5_core.ko. + + +**CONFIG_MLX5_EN_ARFS=(y/n)** + +| Enables Hardware-accelerated receive flow steering (arfs) support, and ntuple filtering. +| https://community.mellanox.com/s/article/howto-configure-arfs-on-connectx-4 + + +**CONFIG_MLX5_EN_RXNFC=(y/n)** + +| Enables ethtool receive network flow classification, which allows user defined +| flow rules to direct traffic into arbitrary rx queue via ethtool set/get_rxnfc API. + + +**CONFIG_MLX5_CORE_EN_DCB=(y/n)**: + +| Enables `Data Center Bridging (DCB) Support <https://community.mellanox.com/s/article/howto-auto-config-pfc-and-ets-on-connectx-4-via-lldp-dcbx>`_. + + +**CONFIG_MLX5_MPFS=(y/n)** + +| Ethernet Multi-Physical Function Switch (MPFS) support in ConnectX NIC. +| MPFs is required for when `Multi-Host <http://www.mellanox.com/page/multihost>`_ configuration is enabled to allow passing +| user configured unicast MAC addresses to the requesting PF. + + +**CONFIG_MLX5_ESWITCH=(y/n)** + +| Ethernet SRIOV E-Switch support in ConnectX NIC. E-Switch provides internal SRIOV packet steering +| and switching for the enabled VFs and PF in two available modes: +| 1) `Legacy SRIOV mode (L2 mac vlan steering based) <https://community.mellanox.com/s/article/howto-configure-sr-iov-for-connectx-4-connectx-5-with-kvm--ethernet-x>`_. +| 2) `Switchdev mode (eswitch offloads) <https://www.mellanox.com/related-docs/prod_software/ASAP2_Hardware_Offloading_for_vSwitches_User_Manual_v4.4.pdf>`_. + + +**CONFIG_MLX5_CORE_IPOIB=(y/n)** + +| IPoIB offloads & acceleration support. +| Requires CONFIG_MLX5_CORE_EN to provide an accelerated interface for the rdma +| IPoIB ulp netdevice. 
+ + +**CONFIG_MLX5_FPGA=(y/n)** + +| Build support for the Innova family of network cards by Mellanox Technologies. +| Innova network cards are comprised of a ConnectX chip and an FPGA chip on one board. +| If you select this option, the mlx5_core driver will include the Innova FPGA core and allow +| building sandbox-specific client drivers. + + +**CONFIG_MLX5_EN_IPSEC=(y/n)** + +| Enables `IPSec XFRM cryptography-offload accelaration <http://www.mellanox.com/related-docs/prod_software/Mellanox_Innova_IPsec_Ethernet_Adapter_Card_User_Manual.pdf>`_. + +**CONFIG_MLX5_EN_TLS=(y/n)** + +| TLS cryptography-offload accelaration. + + +**CONFIG_MLX5_INFINIBAND=(y/n/m)** (module mlx5_ib.ko) + +| Provides low-level InfiniBand/RDMA and `RoCE <https://community.mellanox.com/s/article/recommended-network-configuration-examples-for-roce-deployment>`_ support. + + +**External options** ( Choose if the corresponding mlx5 feature is required ) + +- CONFIG_PTP_1588_CLOCK: When chosen, mlx5 ptp support will be enabled +- CONFIG_VXLAN: When chosen, mlx5 vxaln support will be enabled. +- CONFIG_MLXFW: When chosen, mlx5 firmware flashing support will be enabled (via devlink and ethtool). + +Devlink info +============ + +The devlink info reports the running and stored firmware versions on device. +It also prints the device PSID which represents the HCA board type ID. + +User command example:: + + $ devlink dev info pci/0000:00:06.0 + pci/0000:00:06.0: + driver mlx5_core + versions: + fixed: + fw.psid MT_0000000009 + running: + fw.version 16.26.0100 + stored: + fw.version 16.26.0100 + +Devlink health reporters +======================== + +tx reporter +----------- +The tx reporter is responsible of two error scenarios: + +- TX timeout + Report on kernel tx timeout detection. + Recover by searching lost interrupts. +- TX error completion + Report on error tx completion. + Recover by flushing the TX queue and reset it. 
+ +The TX reporter also supports a diagnose callback, which provides +real-time information about the status of its send queues. + +User command examples: + +- Diagnose send queues status:: + + $ devlink health diagnose pci/0000:82:00.0 reporter tx + +- Show number of tx errors indicated, number of recover flows ended successfully, + is autorecover enabled and graceful period from last recover:: + + $ devlink health show pci/0000:82:00.0 reporter tx + +fw reporter +----------- +The fw reporter implements diagnose and dump callbacks. +It follows symptoms of fw error such as fw syndrome by triggering +fw core dump and storing it into the dump buffer. +The fw reporter diagnose command can be triggered any time by the user to check +current fw status. + +User command examples: + +- Check fw health status:: + + $ devlink health diagnose pci/0000:82:00.0 reporter fw + +- Read FW core dump if already stored or trigger new one:: + + $ devlink health dump show pci/0000:82:00.0 reporter fw + +NOTE: This command can run only on the PF which has fw tracer ownership, +running it on other PF or any VF will return "Operation not permitted". + +fw fatal reporter +----------------- +The fw fatal reporter implements dump and recover callbacks. +It follows fatal error indications by CR-space dump and recover flow. +The CR-space dump uses vsc interface which is valid even if the FW command +interface is not functional, which is the case in most FW fatal errors. +The recover function runs recover flow which reloads the driver and triggers fw +reset if needed. + +User command examples: + +- Run fw recover flow manually:: + + $ devlink health recover pci/0000:82:00.0 reporter fw_fatal + +- Read FW CR-space dump if already stored or trigger new one:: + + $ devlink health dump show pci/0000:82:00.1 reporter fw_fatal + +NOTE: This command can run only on PF. 
diff --git a/Documentation/networking/dsa/b53.rst b/Documentation/networking/dsa/b53.rst new file mode 100644 index 000000000000..b41637cdb82b --- /dev/null +++ b/Documentation/networking/dsa/b53.rst @@ -0,0 +1,183 @@ +.. SPDX-License-Identifier: GPL-2.0 + +========================================== +Broadcom RoboSwitch Ethernet switch driver +========================================== + +The Broadcom RoboSwitch Ethernet switch family is used in quite a range of +xDSL router, cable modems and other multimedia devices. + +The actual implementation supports the devices BCM5325E, BCM5365, BCM539x, +BCM53115 and BCM53125 as well as BCM63XX. + +Implementation details +====================== + +The driver is located in ``drivers/net/dsa/b53/`` and is implemented as a +DSA driver; see ``Documentation/networking/dsa/dsa.rst`` for details on the +subsystem and what it provides. + +The switch is, if possible, configured to enable a Broadcom specific 4-bytes +switch tag which gets inserted by the switch for every packet forwarded to the +CPU interface, conversely, the CPU network interface should insert a similar +tag for packets entering the CPU port. The tag format is described in +``net/dsa/tag_brcm.c``. + +The configuration of the device depends on whether or not tagging is +supported. + +The interface names and example network configuration are used according the +configuration described in the :ref:`dsa-config-showcases`. + +Configuration with tagging support +---------------------------------- + +The tagging based configuration is desired. It is not specific to the b53 +DSA driver and will work like all DSA drivers which supports tagging. + +See :ref:`dsa-tagged-configuration`. + +Configuration without tagging support +------------------------------------- + +Older models (5325, 5365) support a different tag format that is not supported +yet. 539x and 531x5 require managed mode and some special handling, which is +also not yet supported. 
The tagging support is disabled in these cases and the +switch needs a different configuration. + +The configuration differs slightly from the :ref:`dsa-vlan-configuration`. + +The b53 tags the CPU port in all VLANs, since otherwise any PVID untagged +VLAN programming would basically change the CPU port's default PVID and make +it untagged, which is undesirable. + +In contrast to the configuration described in :ref:`dsa-vlan-configuration` +the default VLAN 1 has to be removed from the slave interface configuration in +single port and gateway configuration, while there is no need to add an extra +VLAN configuration in the bridge showcase. + +single port +~~~~~~~~~~~ +The configuration can only be set up via VLAN tagging and bridge setup. +By default packets are tagged with vid 1: + +.. code-block:: sh + + # tag traffic on CPU port + ip link add link eth0 name eth0.1 type vlan id 1 + ip link add link eth0 name eth0.2 type vlan id 2 + ip link add link eth0 name eth0.3 type vlan id 3 + + # The master interface needs to be brought up before the slave ports. + ip link set eth0 up + ip link set eth0.1 up + ip link set eth0.2 up + ip link set eth0.3 up + + # bring up the slave interfaces + ip link set wan up + ip link set lan1 up + ip link set lan2 up + + # create bridge + ip link add name br0 type bridge + + # activate VLAN filtering + ip link set dev br0 type bridge vlan_filtering 1 + + # add ports to bridge + ip link set dev wan master br0 + ip link set dev lan1 master br0 + ip link set dev lan2 master br0 + + # tag traffic on ports + bridge vlan add dev lan1 vid 2 pvid untagged + bridge vlan del dev lan1 vid 1 + bridge vlan add dev lan2 vid 3 pvid untagged + bridge vlan del dev lan2 vid 1 + + # configure the VLANs + ip addr add 192.0.2.1/30 dev eth0.1 + ip addr add 192.0.2.5/30 dev eth0.2 + ip addr add 192.0.2.9/30 dev eth0.3 + + # bring up the bridge devices + ip link set br0 up + + +bridge +~~~~~~ + +..
code-block:: sh + + # tag traffic on CPU port + ip link add link eth0 name eth0.1 type vlan id 1 + + # The master interface needs to be brought up before the slave ports. + ip link set eth0 up + ip link set eth0.1 up + + # bring up the slave interfaces + ip link set wan up + ip link set lan1 up + ip link set lan2 up + + # create bridge + ip link add name br0 type bridge + + # activate VLAN filtering + ip link set dev br0 type bridge vlan_filtering 1 + + # add ports to bridge + ip link set dev wan master br0 + ip link set dev lan1 master br0 + ip link set dev lan2 master br0 + ip link set eth0.1 master br0 + + # configure the bridge + ip addr add 192.0.2.129/25 dev br0 + + # bring up the bridge + ip link set dev br0 up + +gateway +~~~~~~~ + +.. code-block:: sh + + # tag traffic on CPU port + ip link add link eth0 name eth0.1 type vlan id 1 + ip link add link eth0 name eth0.2 type vlan id 2 + + # The master interface needs to be brought up before the slave ports. + ip link set eth0 up + ip link set eth0.1 up + ip link set eth0.2 up + + # bring up the slave interfaces + ip link set wan up + ip link set lan1 up + ip link set lan2 up + + # create bridge + ip link add name br0 type bridge + + # activate VLAN filtering + ip link set dev br0 type bridge vlan_filtering 1 + + # add ports to bridges + ip link set dev wan master br0 + ip link set eth0.1 master br0 + ip link set dev lan1 master br0 + ip link set dev lan2 master br0 + + # tag traffic on ports + bridge vlan add dev wan vid 2 pvid untagged + bridge vlan del dev wan vid 1 + + # configure the VLANs + ip addr add 192.0.2.1/30 dev eth0.2 + ip addr add 192.0.2.129/25 dev br0 + + # bring up the bridge devices + ip link set br0 up diff --git a/Documentation/networking/dsa/configuration.rst b/Documentation/networking/dsa/configuration.rst new file mode 100644 index 000000000000..af029b3ca2ab --- /dev/null +++ b/Documentation/networking/dsa/configuration.rst @@ -0,0 +1,292 @@ +.. 
SPDX-License-Identifier: GPL-2.0 + +======================================= +DSA switch configuration from userspace +======================================= + +The DSA switch configuration is not yet integrated into the main userspace +network configuration suites and has to be performed manually. + +.. _dsa-config-showcases: + +Configuration showcases +----------------------- + +To configure a DSA switch a couple of commands need to be executed. In this +documentation some common configuration scenarios are handled as showcases: + +*single port* + Every switch port acts as a different configurable Ethernet port + +*bridge* + Every switch port is part of one configurable Ethernet bridge + +*gateway* + Every switch port except one upstream port is part of a configurable + Ethernet bridge. + The upstream port acts as a different configurable Ethernet port. + +All configurations are performed with tools from iproute2, which is available +at https://www.kernel.org/pub/linux/utils/net/iproute2/ + +Through DSA every port of a switch is handled like a normal Linux Ethernet +interface. The CPU port is the switch port connected to an Ethernet MAC chip. +The corresponding Linux Ethernet interface is called the master interface. +All other corresponding Linux interfaces are called slave interfaces. + +The slave interfaces depend on the master interface. They can only be brought up +when the master interface is up. + +In this documentation the following Ethernet interfaces are used: + +*eth0* + the master interface + +*lan1* + a slave interface + +*lan2* + another slave interface + +*lan3* + a third slave interface + +*wan* + A slave interface dedicated for upstream traffic + +Further Ethernet interfaces can be configured similarly.
+The configured IPs and networks are: + +*single port* + * lan1: 192.0.2.1/30 (192.0.2.0 - 192.0.2.3) + * lan2: 192.0.2.5/30 (192.0.2.4 - 192.0.2.7) + * lan3: 192.0.2.9/30 (192.0.2.8 - 192.0.2.11) + +*bridge* + * br0: 192.0.2.129/25 (192.0.2.128 - 192.0.2.255) + +*gateway* + * br0: 192.0.2.129/25 (192.0.2.128 - 192.0.2.255) + * wan: 192.0.2.1/30 (192.0.2.0 - 192.0.2.3) + +.. _dsa-tagged-configuration: + +Configuration with tagging support +---------------------------------- + +The tagging based configuration is preferred and supported by the majority of +DSA switches. These switches are capable of tagging incoming and outgoing traffic +without using a VLAN based configuration. + +single port +~~~~~~~~~~~ + +.. code-block:: sh + + # configure each interface + ip addr add 192.0.2.1/30 dev lan1 + ip addr add 192.0.2.5/30 dev lan2 + ip addr add 192.0.2.9/30 dev lan3 + + # The master interface needs to be brought up before the slave ports. + ip link set eth0 up + + # bring up the slave interfaces + ip link set lan1 up + ip link set lan2 up + ip link set lan3 up + +bridge +~~~~~~ + +.. code-block:: sh + + # The master interface needs to be brought up before the slave ports. + ip link set eth0 up + + # bring up the slave interfaces + ip link set lan1 up + ip link set lan2 up + ip link set lan3 up + + # create bridge + ip link add name br0 type bridge + + # add ports to bridge + ip link set dev lan1 master br0 + ip link set dev lan2 master br0 + ip link set dev lan3 master br0 + + # configure the bridge + ip addr add 192.0.2.129/25 dev br0 + + # bring up the bridge + ip link set dev br0 up + +gateway +~~~~~~~ + +.. code-block:: sh + + # The master interface needs to be brought up before the slave ports.
+ ip link set eth0 up + + # bring up the slave interfaces + ip link set wan up + ip link set lan1 up + ip link set lan2 up + + # configure the upstream port + ip addr add 192.0.2.1/30 dev wan + + # create bridge + ip link add name br0 type bridge + + # add ports to bridge + ip link set dev lan1 master br0 + ip link set dev lan2 master br0 + + # configure the bridge + ip addr add 192.0.2.129/25 dev br0 + + # bring up the bridge + ip link set dev br0 up + +.. _dsa-vlan-configuration: + +Configuration without tagging support +------------------------------------- + +A minority of switches are not capable of using a tagging protocol +(DSA_TAG_PROTO_NONE). These switches can be configured by a VLAN based +configuration. + +single port +~~~~~~~~~~~ +The configuration can only be set up via VLAN tagging and bridge setup. + +.. code-block:: sh + + # tag traffic on CPU port + ip link add link eth0 name eth0.1 type vlan id 1 + ip link add link eth0 name eth0.2 type vlan id 2 + ip link add link eth0 name eth0.3 type vlan id 3 + + # The master interface needs to be brought up before the slave ports. + ip link set eth0 up + ip link set eth0.1 up + ip link set eth0.2 up + ip link set eth0.3 up + + # bring up the slave interfaces + ip link set lan1 up + ip link set lan2 up + ip link set lan3 up + + # create bridge + ip link add name br0 type bridge + + # activate VLAN filtering + ip link set dev br0 type bridge vlan_filtering 1 + + # add ports to bridge + ip link set dev lan1 master br0 + ip link set dev lan2 master br0 + ip link set dev lan3 master br0 + + # tag traffic on ports + bridge vlan add dev lan1 vid 1 pvid untagged + bridge vlan add dev lan2 vid 2 pvid untagged + bridge vlan add dev lan3 vid 3 pvid untagged + + # configure the VLANs + ip addr add 192.0.2.1/30 dev eth0.1 + ip addr add 192.0.2.5/30 dev eth0.2 + ip addr add 192.0.2.9/30 dev eth0.3 + + # bring up the bridge devices + ip link set br0 up + + +bridge +~~~~~~ + +..
code-block:: sh + + # tag traffic on CPU port + ip link add link eth0 name eth0.1 type vlan id 1 + + # The master interface needs to be brought up before the slave ports. + ip link set eth0 up + ip link set eth0.1 up + + # bring up the slave interfaces + ip link set lan1 up + ip link set lan2 up + ip link set lan3 up + + # create bridge + ip link add name br0 type bridge + + # activate VLAN filtering + ip link set dev br0 type bridge vlan_filtering 1 + + # add ports to bridge + ip link set dev lan1 master br0 + ip link set dev lan2 master br0 + ip link set dev lan3 master br0 + ip link set eth0.1 master br0 + + # tag traffic on ports + bridge vlan add dev lan1 vid 1 pvid untagged + bridge vlan add dev lan2 vid 1 pvid untagged + bridge vlan add dev lan3 vid 1 pvid untagged + + # configure the bridge + ip addr add 192.0.2.129/25 dev br0 + + # bring up the bridge + ip link set dev br0 up + +gateway +~~~~~~~ + +.. code-block:: sh + + # tag traffic on CPU port + ip link add link eth0 name eth0.1 type vlan id 1 + ip link add link eth0 name eth0.2 type vlan id 2 + + # The master interface needs to be brought up before the slave ports. 
+ ip link set eth0 up + ip link set eth0.1 up + ip link set eth0.2 up + + # bring up the slave interfaces + ip link set wan up + ip link set lan1 up + ip link set lan2 up + + # create bridge + ip link add name br0 type bridge + + # activate VLAN filtering + ip link set dev br0 type bridge vlan_filtering 1 + + # add ports to bridges + ip link set dev wan master br0 + ip link set eth0.1 master br0 + ip link set dev lan1 master br0 + ip link set dev lan2 master br0 + + # tag traffic on ports + bridge vlan add dev lan1 vid 1 pvid untagged + bridge vlan add dev lan2 vid 1 pvid untagged + bridge vlan add dev wan vid 2 pvid untagged + + # configure the VLANs + ip addr add 192.0.2.1/30 dev eth0.2 + ip addr add 192.0.2.129/25 dev br0 + + # bring up the bridge devices + ip link set br0 up diff --git a/Documentation/networking/dsa/dsa.rst b/Documentation/networking/dsa/dsa.rst index ca87068b9ab9..563d56c6a25c 100644 --- a/Documentation/networking/dsa/dsa.rst +++ b/Documentation/networking/dsa/dsa.rst @@ -531,7 +531,7 @@ Bridge VLAN filtering a software implementation. .. note:: VLAN ID 0 corresponds to the port private database, which, in the context - of DSA, would be the its port-based VLAN, used by the associated bridge device. + of DSA, would be its port-based VLAN, used by the associated bridge device. - ``port_fdb_del``: bridge layer function invoked when the bridge wants to remove a Forwarding Database entry, the switch hardware should be programmed to delete @@ -554,7 +554,7 @@ Bridge VLAN filtering associated with this VLAN ID. .. note:: VLAN ID 0 corresponds to the port private database, which, in the context - of DSA, would be the its port-based VLAN, used by the associated bridge device. + of DSA, would be its port-based VLAN, used by the associated bridge device. 
- ``port_mdb_del``: bridge layer function invoked when the bridge wants to remove a multicast database entry, the switch hardware should be programmed to delete diff --git a/Documentation/networking/dsa/index.rst b/Documentation/networking/dsa/index.rst index 0e5b7a9be406..ee631e2d646f 100644 --- a/Documentation/networking/dsa/index.rst +++ b/Documentation/networking/dsa/index.rst @@ -6,6 +6,8 @@ Distributed Switch Architecture :maxdepth: 1 dsa + b53 bcm_sf2 lan9303 sja1105 + configuration diff --git a/Documentation/networking/dsa/sja1105.rst b/Documentation/networking/dsa/sja1105.rst index ea7bac438cfd..cb2858dece93 100644 --- a/Documentation/networking/dsa/sja1105.rst +++ b/Documentation/networking/dsa/sja1105.rst @@ -86,13 +86,13 @@ functionality. The following traffic modes are supported over the switch netdevices: +--------------------+------------+------------------+------------------+ -| | Standalone | Bridged with | Bridged with | -| | ports | vlan_filtering 0 | vlan_filtering 1 | +| | Standalone | Bridged with | Bridged with | +| | ports | vlan_filtering 0 | vlan_filtering 1 | +====================+============+==================+==================+ | Regular traffic | Yes | Yes | No (use master) | +--------------------+------------+------------------+------------------+ | Management traffic | Yes | Yes | Yes | -| (BPDU, PTP) | | | | +| (BPDU, PTP) | | | | +--------------------+------------+------------------+------------------+ Switching features diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 14fe93049d28..df33674799b5 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -80,6 +80,7 @@ fib_multipath_hash_policy - INTEGER Possible values: 0 - Layer 3 1 - Layer 4 + 2 - Layer 3 or inner Layer 3 if present fib_sync_mem - UNSIGNED INTEGER Amount of dirty memory from fib entries that can be backlogged before @@ -255,6 +256,14 @@ tcp_base_mss - INTEGER Path MTU 
discovery (MTU probing). If MTU probing is enabled, this is the initial MSS used by the connection. +tcp_min_snd_mss - INTEGER + TCP SYN and SYNACK messages usually advertise an ADVMSS option, + as described in RFC 1122 and RFC 6691. + If this ADVMSS option is smaller than tcp_min_snd_mss, + it is silently capped to tcp_min_snd_mss. + + Default : 48 (at least 8 bytes of payload per segment) + tcp_congestion_control - STRING Set the congestion control algorithm to be used for new connections. The algorithm "reno" is always available, but @@ -648,6 +657,26 @@ tcp_fastopen_blackhole_timeout_sec - INTEGER 0 to disable the blackhole detection. By default, it is set to 1hr. +tcp_fastopen_key - list of comma separated 32-digit hexadecimal INTEGERs + The list consists of a primary key and an optional backup key. The + primary key is used for both creating and validating cookies, while the + optional backup key is only used for validating cookies. The purpose of + the backup key is to maximize TFO validation when keys are rotated. + + A randomly chosen primary key may be configured by the kernel if + the tcp_fastopen sysctl is set to 0x400 (see above), or if the + TCP_FASTOPEN setsockopt() optname is set and a key has not been + previously configured via sysctl. If keys are configured via + setsockopt() by using the TCP_FASTOPEN_KEY optname, then those + per-socket keys will be used instead of any keys that are specified via + sysctl. + + A key is specified as 4 8-digit hexadecimal integers which are separated + by a '-' as: xxxxxxxx-xxxxxxxx-xxxxxxxx-xxxxxxxx. Leading zeros may be + omitted. A primary and a backup key may be specified by separating them + by a comma. If only one key is specified, it becomes the primary key and + any previously configured backup keys are removed. + tcp_syn_retries - INTEGER Number of times initial SYNs for an active TCP connection attempt will be retransmitted. Should not be higher than 127. 
Default value @@ -772,6 +801,14 @@ tcp_challenge_ack_limit - INTEGER in RFC 5961 (Improving TCP's Robustness to Blind In-Window Attacks) Default: 100 +tcp_rx_skb_cache - BOOLEAN + Controls a per TCP socket cache of one skb, that might help + performance of some workloads. This might be dangerous + on systems with a lot of TCP sockets, since it increases + memory usage. + + Default: 0 (disabled) + UDP variables: udp_l3mdev_accept - BOOLEAN @@ -1409,14 +1446,26 @@ flowlabel_state_ranges - BOOLEAN FALSE: disabled Default: true -flowlabel_reflect - BOOLEAN - Automatically reflect the flow label. Needed for Path MTU +flowlabel_reflect - INTEGER + Control flow label reflection. Needed for Path MTU Discovery to work with Equal Cost Multipath Routing in anycast environments. See RFC 7690 and: https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 - TRUE: enabled - FALSE: disabled - Default: FALSE + + This is a bitmask. + 1: enabled for established flows + + Note that this prevents automatic flowlabel changes, as done + in "tcp: change IPv6 flow-label upon receiving spurious retransmission" + and "tcp: Change txhash on every SYN and RTO retransmit" + + 2: enabled for TCP RESET packets (no active listener) + If set, a RST packet sent in response to a SYN packet on a closed + port will reflect the incoming flow label. + + 4: enabled for ICMPv6 echo reply messages. + + Default: 0 fib_multipath_hash_policy - INTEGER Controls which hash policy to use for multipath routes. @@ -1424,6 +1473,7 @@ fib_multipath_hash_policy - INTEGER Possible values: 0 - Layer 3 (source and destination addresses plus flow label) 1 - Layer 4 (standard 5-tuple) + 2 - Layer 3 or inner Layer 3 if present anycast_src_echo_reply - BOOLEAN Controls the use of anycast addresses as source addresses for ICMPv6 @@ -2237,7 +2287,7 @@ addr_scope_policy - INTEGER /proc/sys/net/core/* - Please see: Documentation/sysctl/net.txt for descriptions of these entries. 
+ Please see: Documentation/admin-guide/sysctl/net.rst for descriptions of these entries. /proc/sys/net/unix/* diff --git a/Documentation/networking/mpls-sysctl.txt b/Documentation/networking/mpls-sysctl.txt index 2f24a1912a48..025cc9b96992 100644 --- a/Documentation/networking/mpls-sysctl.txt +++ b/Documentation/networking/mpls-sysctl.txt @@ -30,7 +30,7 @@ ip_ttl_propagate - BOOL 0 - disabled / RFC 3443 [Short] Pipe Model 1 - enabled / RFC 3443 Uniform Model (default) -default_ttl - BOOL +default_ttl - INTEGER Default TTL value to use for MPLS packets where it cannot be propagated from an IP header, either because one isn't present or ip_ttl_propagate has been disabled. diff --git a/Documentation/networking/phy.rst b/Documentation/networking/phy.rst index 0dd90d7df5ec..a689966bc4be 100644 --- a/Documentation/networking/phy.rst +++ b/Documentation/networking/phy.rst @@ -202,7 +202,8 @@ the PHY/controller, of which the PHY needs to be aware. *interface* is a u32 which specifies the connection type used between the controller and the PHY. Examples are GMII, MII, -RGMII, and SGMII. For a full list, see include/linux/phy.h +RGMII, and SGMII. See "PHY interface mode" below. For a full +list, see include/linux/phy.h Now just make sure that phydev->supported and phydev->advertising have any values pruned from them which don't make sense for your controller (a 10/100 @@ -225,6 +226,48 @@ When you want to disconnect from the network (even if just briefly), you call phy_stop(phydev). This function also stops the phylib state machine and disables PHY interrupts. +PHY interface modes +=================== + +The PHY interface mode supplied in the phy_connect() family of functions +defines the initial operating mode of the PHY interface. This is not +guaranteed to remain constant; there are PHYs which dynamically change +their interface mode without software interaction depending on the +negotiation results. 
+ +Some of the interface modes are described below: + +``PHY_INTERFACE_MODE_1000BASEX`` + This defines the 1000BASE-X single-lane serdes link as defined by the + 802.3 standard section 36. The link operates at a fixed bit rate of + 1.25Gbaud using an 8B/10B encoding scheme, resulting in an underlying + data rate of 1Gbps. Embedded in the data stream is a 16-bit control + word which is used to negotiate the duplex and pause modes with the + remote end. This does not include "up-clocked" variants such as 2.5Gbps + speeds (see below). + +``PHY_INTERFACE_MODE_2500BASEX`` + This defines a variant of 1000BASE-X which is clocked 2.5 times faster + than the 802.3 standard, giving a fixed bit rate of 3.125Gbaud. + +``PHY_INTERFACE_MODE_SGMII`` + This is used for Cisco SGMII, which is a modification of 1000BASE-X + as defined by the 802.3 standard. The SGMII link consists of a single + serdes lane running at a fixed bit rate of 1.25Gbaud with 8B/10B + encoding. The underlying data rate is 1Gbps, with the slower speeds of + 100Mbps and 10Mbps being achieved through replication of each data symbol. + The 802.3 control word is re-purposed to send the negotiated speed and + duplex information from the PHY to the MAC, and for the MAC to acknowledge + receipt. This does not include "up-clocked" variants such as 2.5Gbps + speeds. + + Note: mismatched SGMII vs 1000BASE-X configuration on a link can + successfully pass data in some circumstances, but the 16-bit control + word will not be correctly interpreted, which may cause mismatches in + duplex, pause or other settings. This is dependent on the MAC and/or + PHY behaviour. + + Pause frames / flow control =========================== diff --git a/Documentation/networking/rds.txt b/Documentation/networking/rds.txt index 0235ae69af2a..f2a0147c933d 100644 --- a/Documentation/networking/rds.txt +++ b/Documentation/networking/rds.txt @@ -389,7 +389,7 @@ Multipath RDS (mprds) a common (to all paths) part, and a per-path struct rds_conn_path.
All I/O workqs and reconnect threads are driven from the rds_conn_path. Transports such as TCP that are multipath capable may then set up a - TPC socket per rds_conn_path, and this is managed by the transport via + TCP socket per rds_conn_path, and this is managed by the transport via the transport privatee cp_transport_data pointer. Transports announce themselves as multipath capable by setting the diff --git a/Documentation/networking/sfp-phylink.rst b/Documentation/networking/sfp-phylink.rst index 5bd26cb07244..91446b431b70 100644 --- a/Documentation/networking/sfp-phylink.rst +++ b/Documentation/networking/sfp-phylink.rst @@ -98,6 +98,7 @@ this documentation. 4. Add:: struct phylink *phylink; + struct phylink_config phylink_config; to the driver's private data structure. We shall refer to the driver's private data pointer as ``priv`` below, and the driver's @@ -223,8 +224,10 @@ this documentation. .. code-block:: c struct phylink *phylink; + priv->phylink_config.dev = &dev.dev; + priv->phylink_config.type = PHYLINK_NETDEV; - phylink = phylink_create(dev, node, phy_mode, &phylink_ops); + phylink = phylink_create(&priv->phylink_config, node, phy_mode, &phylink_ops); if (IS_ERR(phylink)) { err = PTR_ERR(phylink); fail probe; diff --git a/Documentation/networking/timestamping.txt b/Documentation/networking/timestamping.txt index bbdaf8990031..8dd6333c3270 100644 --- a/Documentation/networking/timestamping.txt +++ b/Documentation/networking/timestamping.txt @@ -368,7 +368,7 @@ ts[1] used to hold hardware timestamps converted to system time. Instead, expose the hardware clock device on the NIC directly as a HW PTP clock source, to allow time conversion in userspace and optionally synchronize system time with a userspace PTP stack such -as linuxptp. For the PTP clock API, see Documentation/ptp/ptp.txt. +as linuxptp. For the PTP clock API, see Documentation/driver-api/ptp.rst. 
Note that if the SO_TIMESTAMP or SO_TIMESTAMPNS option is enabled together with SO_TIMESTAMPING using SOF_TIMESTAMPING_SOFTWARE, a false diff --git a/Documentation/networking/tls-offload.rst b/Documentation/networking/tls-offload.rst index cb85af559dff..048e5ca44824 100644 --- a/Documentation/networking/tls-offload.rst +++ b/Documentation/networking/tls-offload.rst @@ -206,7 +206,11 @@ TX Segments transmitted from an offloaded socket can get out of sync in similar ways to the receive side-retransmissions - local drops -are possible, though network reorders are not. +are possible, though network reorders are not. There are currently +two mechanisms for dealing with out of order segments. + +Crypto state rebuilding +~~~~~~~~~~~~~~~~~~~~~~~ Whenever an out of order segment is transmitted the driver provides the device with enough information to perform cryptographic operations. @@ -225,6 +229,35 @@ was just a retransmission. The former is simpler, and does not require retransmission detection therefore it is the recommended method until such time it is proven inefficient. +Next record sync +~~~~~~~~~~~~~~~~ + +Whenever an out of order segment is detected the driver requests +that the ``ktls`` software fallback code encrypt it. If the segment's +sequence number is lower than expected the driver assumes retransmission +and doesn't change device state. If the segment is in the future, it +may imply a local drop, the driver asks the stack to sync the device +to the next record state and falls back to software. + +Resync request is indicated with: + +.. code-block:: c + + void tls_offload_tx_resync_request(struct sock *sk, u32 got_seq, u32 exp_seq) + +Until resync is complete driver should not access its expected TCP +sequence number (as it will be updated from a different context). +Following helper should be used to test if resync is complete: + +.. 
code-block:: c + + bool tls_offload_tx_resync_pending(struct sock *sk) + +Next time ``ktls`` pushes a record it will first send its TCP sequence number +and TLS record number to the driver. Stack will also make sure that +the new record will start on a segment boundary (like it does when +the connection is initially added). + RX -- @@ -268,6 +301,9 @@ Device can only detect that segment 4 also contains a TLS header if it knows the length of the previous record from segment 2. In this case the device will lose synchronization with the stream. +Stream scan resynchronization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + When the device gets out of sync and the stream reaches TCP sequence numbers more than a max size record past the expected TCP sequence number, the device starts scanning for a known header pattern. For example @@ -298,6 +334,22 @@ Special care has to be taken if the confirmation request is passed asynchronously to the packet stream and record may get processed by the kernel before the confirmation request. +Stack-driven resynchronization +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +The driver may also request the stack to perform resynchronization +whenever it sees the records are no longer getting decrypted. +If the connection is configured in this mode the stack automatically +schedules resynchronization after it has received two completely encrypted +records. + +The stack waits for the socket to drain and informs the device about +the next expected record number and its TCP sequence number. If the +records continue to be received fully encrypted stack retries the +synchronization with an exponential back off (first after 2 encrypted +records, then after 4 records, after 8, after 16... up until every +128 records). 
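The retry schedule described above can be sketched as a small shell function. This is an illustration only, assuming the threshold simply doubles from 2 up to a cap of 128 records; `resync_threshold` is a hypothetical name and does not correspond to any kernel function.

```sh
#!/bin/sh
# Illustrative model of the back-off schedule, not kernel code.
# Given the number of resync attempts that have already failed, print
# how many fully encrypted records are awaited before the next attempt.
resync_threshold() {
    failed=$1
    threshold=2    # first attempt after 2 encrypted records
    while [ "$failed" -gt 0 ] && [ "$threshold" -lt 128 ]; do
        threshold=$((threshold * 2))    # exponential back off
        failed=$((failed - 1))
    done
    echo "$threshold"    # never exceeds 128 records
}

# Show the schedule for the first few failed attempts.
for n in 0 1 2 3 4 5 6 7; do
    printf 'failed attempts: %s, next retry after %s records\n' \
        "$n" "$(resync_threshold "$n")"
done
```

Capping the doubling means that a connection which stays out of sync is still retried at a steady interval instead of being abandoned.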
+ Error handling ============== @@ -379,7 +431,6 @@ by the driver: but did not arrive in the expected order * ``tx_tls_drop_no_sync_data`` - number of TX packets dropped because they arrived out of order and associated record could not be found - (see also :ref:`pre_tls_data`) Notable corner cases, exceptions and additional requirements ============================================================ @@ -462,21 +513,3 @@ Redirects leak clear text In the RX direction, if segment has already been decrypted by the device and it gets redirected or mirrored - clear text will be transmitted out. - -.. _pre_tls_data: - -Transmission of pre-TLS data ----------------------------- - -User can enqueue some already encrypted and framed records before enabling -``ktls`` on the socket. Those records have to get sent as they are. This is -perfectly easy to handle in the software case - such data will be waiting -in the TCP layer, TLS ULP won't see it. In the offloaded case when pre-queued -segment reaches transmission point it appears to be out of order (before the -expected TCP sequence number) and the stack does not have a record information -associated. - -All segments without record information cannot, however, be assumed to be -pre-queued data, because a race condition exists between TCP stack queuing -a retransmission, the driver seeing the retransmission and TCP ACK arriving -for the retransmitted data.