Diffstat (limited to 'Documentation/networking')
23 files changed, 663 insertions, 644 deletions
diff --git a/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst index 1a4fc6607582..1661d13174d5 100644 --- a/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst +++ b/Documentation/networking/device_drivers/can/ctu/ctucanfd-driver.rst @@ -229,8 +229,7 @@ frames for a while. This has a potential to avoid the costly round of enabling interrupts, handling an incoming IRQ in ISR, re-enabling the softirq and switching context back to softirq. -More detailed documentation of NAPI may be found on the pages of Linux -Foundation `<https://wiki.linuxfoundation.org/networking/napi>`_. +See :ref:`Documentation/networking/napi.rst <napi>` for more information. Integrating the core to Xilinx Zynq ----------------------------------- diff --git a/Documentation/networking/device_drivers/ethernet/index.rst b/Documentation/networking/device_drivers/ethernet/index.rst index 392969ac88ad..6e9e7012d000 100644 --- a/Documentation/networking/device_drivers/ethernet/index.rst +++ b/Documentation/networking/device_drivers/ethernet/index.rst @@ -31,7 +31,6 @@ Contents: intel/fm10k intel/igb intel/igbvf - intel/ixgb intel/ixgbe intel/ixgbevf intel/i40e diff --git a/Documentation/networking/device_drivers/ethernet/intel/e100.rst b/Documentation/networking/device_drivers/ethernet/intel/e100.rst index 3d4a9ba21946..5dee1b53e977 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/e100.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/e100.rst @@ -151,8 +151,7 @@ NAPI NAPI (Rx polling mode) is supported in the e100 driver. -See https://wiki.linuxfoundation.org/networking/napi for more -information on NAPI. +See :ref:`Documentation/networking/napi.rst <napi>` for more information. 
Multiple Interfaces on Same Ethernet Broadcast Network ------------------------------------------------------ @@ -181,8 +180,6 @@ Support For general information, go to the Intel support website at: https://www.intel.com/support/ -or the Intel Wired Networking project hosted by Sourceforge at: -http://sourceforge.net/projects/e1000 If an issue is identified with the released source code on a supported kernel with a supported adapter, email the specific information related to the issue -to e1000-devel@lists.sf.net. +to intel-wired-lan@lists.osuosl.org. diff --git a/Documentation/networking/device_drivers/ethernet/intel/e1000.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst index 4aaae0f7d6ba..52a7fb9ce8d9 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/e1000.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/e1000.rst @@ -451,13 +451,8 @@ Support ======= For general information, go to the Intel support website at: - - http://support.intel.com - -or the Intel Wired Networking project hosted by Sourceforge at: - - http://sourceforge.net/projects/e1000 +http://support.intel.com If an issue is identified with the released source code on the supported kernel with a supported adapter, email the specific information related -to the issue to e1000-devel@lists.sf.net +to the issue to intel-wired-lan@lists.osuosl.org. 
diff --git a/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
index f49cd370e7bf..d8f810afdd49 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/e1000e.rst
@@ -371,13 +371,8 @@ NOTE: Wake on LAN is only supported on port A for the following devices:
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/

-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
index 9258ef6f515c..396a2c8c3db1 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/fm10k.rst
@@ -130,13 +130,8 @@ the Intel Ethernet Controller XL710.
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/

-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/i40e.rst b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
index ac35bd472bdc..4fbaa1a2d674 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/i40e.rst
@@ -399,8 +399,8 @@ operate only in full duplex and only at their native speed.
 NAPI
 ----
 NAPI (Rx polling mode) is supported in the i40e driver.
-For more information on NAPI, see
-https://wiki.linuxfoundation.org/networking/napi
+
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.

 Flow Control
 ------------
@@ -759,13 +759,8 @@ enabled when setting up DCB on your switch.
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/

-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/iavf.rst b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
index 151af0a8da9c..eb926c3bd4cd 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/iavf.rst
@@ -319,13 +319,8 @@ This is caused by the way the Linux kernel reports this stressed condition.
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://support.intel.com

-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on the supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ice.rst b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
index 5efea4dd1251..69695e5511f4 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/ice.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/ice.rst
@@ -817,10 +817,10 @@ NOTE:

 NAPI
 ----
+
 This driver supports NAPI (Rx polling mode).

-For more information on NAPI, see
-https://wiki.linuxfoundation.org/networking/napi
+See :ref:`Documentation/networking/napi.rst <napi>` for more information.

 MACVLAN
 -------
@@ -1026,12 +1026,9 @@ Support
 For general information, go to the Intel support website at:
 https://www.intel.com/support/

-or the Intel Wired Networking project hosted by Sourceforge at:
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.


 Trademarks
diff --git a/Documentation/networking/device_drivers/ethernet/intel/igb.rst b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
index d46289e182cf..fbd590b6a0d6 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/igb.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/igb.rst
@@ -201,13 +201,8 @@ NOTE: This feature is exclusive to i210 models.
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/

-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
index 40fa210c5e14..11a9017f3069 100644
--- a/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
+++ b/Documentation/networking/device_drivers/ethernet/intel/igbvf.rst
@@ -53,13 +53,8 @@ https://www.kernel.org/pub/software/network/ethtool/
 Support
 =======
 For general information, go to the Intel support website at:
-
 https://www.intel.com/support/

-or the Intel Wired Networking project hosted by Sourceforge at:
-
-https://sourceforge.net/projects/e1000
-
 If an issue is identified with the released source code on a supported kernel
 with a supported adapter, email the specific information related to the issue
-to e1000-devel@lists.sf.net.
+to intel-wired-lan@lists.osuosl.org.
diff --git a/Documentation/networking/device_drivers/ethernet/intel/ixgb.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgb.rst
deleted file mode 100644
index c6a233e68ad6..000000000000
--- a/Documentation/networking/device_drivers/ethernet/intel/ixgb.rst
+++ /dev/null
@@ -1,468 +0,0 @@
-..
SPDX-License-Identifier: GPL-2.0+ - -===================================================================== -Linux Base Driver for 10 Gigabit Intel(R) Ethernet Network Connection -===================================================================== - -October 1, 2018 - - -Contents -======== - -- In This Release -- Identifying Your Adapter -- Command Line Parameters -- Improving Performance -- Additional Configurations -- Known Issues/Troubleshooting -- Support - - - -In This Release -=============== - -This file describes the ixgb Linux Base Driver for the 10 Gigabit Intel(R) -Network Connection. This driver includes support for Itanium(R)2-based -systems. - -For questions related to hardware requirements, refer to the documentation -supplied with your 10 Gigabit adapter. All hardware requirements listed apply -to use with Linux. - -The following features are available in this kernel: - - Native VLANs - - Channel Bonding (teaming) - - SNMP - -Channel Bonding documentation can be found in the Linux kernel source: -/Documentation/networking/bonding.rst - -The driver information previously displayed in the /proc filesystem is not -supported in this release. Alternatively, you can use ethtool (version 1.6 -or later), lspci, and iproute2 to obtain the same information. - -Instructions on updating ethtool can be found in the section "Additional -Configurations" later in this document. 
- - -Identifying Your Adapter -======================== - -The following Intel network adapters are compatible with the drivers in this -release: - -+------------+------------------------------+----------------------------------+ -| Controller | Adapter Name | Physical Layer | -+============+==============================+==================================+ -| 82597EX | Intel(R) PRO/10GbE LR/SR/CX4 | - 10G Base-LR (fiber) | -| | Server Adapters | - 10G Base-SR (fiber) | -| | | - 10G Base-CX4 (copper) | -+------------+------------------------------+----------------------------------+ - -For more information on how to identify your adapter, go to the Adapter & -Driver ID Guide at: - - https://support.intel.com - - -Command Line Parameters -======================= - -If the driver is built as a module, the following optional parameters are -used by entering them on the command line with the modprobe command using -this syntax:: - - modprobe ixgb [<option>=<VAL1>,<VAL2>,...] - -For example, with two 10GbE PCI adapters, entering:: - - modprobe ixgb TxDescriptors=80,128 - -loads the ixgb driver with 80 TX resources for the first adapter and 128 TX -resources for the second adapter. - -The default value for each parameter is generally the recommended setting, -unless otherwise noted. - -Copybreak ---------- -:Valid Range: 0-XXXX -:Default Value: 256 - - This is the maximum size of packet that is copied to a new buffer on - receive. - -Debug ------ -:Valid Range: 0-16 (0=none,...,16=all) -:Default Value: 0 - - This parameter adjusts the level of debug messages displayed in the - system logs. - -FlowControl ------------ -:Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx) -:Default Value: 1 if no EEPROM, otherwise read from EEPROM - - This parameter controls the automatic generation(Tx) and response(Rx) to - Ethernet PAUSE frames. There are hardware bugs associated with enabling - Tx flow control so beware. 
- -RxDescriptors -------------- -:Valid Range: 64-4096 -:Default Value: 1024 - - This value is the number of receive descriptors allocated by the driver. - Increasing this value allows the driver to buffer more incoming packets. - Each descriptor is 16 bytes. A receive buffer is also allocated for - each descriptor and can be either 2048, 4056, 8192, or 16384 bytes, - depending on the MTU setting. When the MTU size is 1500 or less, the - receive buffer size is 2048 bytes. When the MTU is greater than 1500 the - receive buffer size will be either 4056, 8192, or 16384 bytes. The - maximum MTU size is 16114. - -TxDescriptors -------------- -:Valid Range: 64-4096 -:Default Value: 256 - - This value is the number of transmit descriptors allocated by the driver. - Increasing this value allows the driver to queue more transmits. Each - descriptor is 16 bytes. - -RxIntDelay ----------- -:Valid Range: 0-65535 (0=off) -:Default Value: 72 - - This value delays the generation of receive interrupts in units of - 0.8192 microseconds. Receive interrupt reduction can improve CPU - efficiency if properly tuned for specific network traffic. Increasing - this value adds extra latency to frame reception and can end up - decreasing the throughput of TCP traffic. If the system is reporting - dropped receives, this value may be set too high, causing the driver to - run out of available receive descriptors. - -TxIntDelay ----------- -:Valid Range: 0-65535 (0=off) -:Default Value: 32 - - This value delays the generation of transmit interrupts in units of - 0.8192 microseconds. Transmit interrupt reduction can improve CPU - efficiency if properly tuned for specific network traffic. Increasing - this value adds extra latency to frame transmission and can end up - decreasing the throughput of TCP traffic. If this value is set too high, - it will cause the driver to run out of available transmit descriptors. 
- -XsumRX ------- -:Valid Range: 0-1 -:Default Value: 1 - - A value of '1' indicates that the driver should enable IP checksum - offload for received packets (both UDP and TCP) to the adapter hardware. - -RxFCHighThresh --------------- -:Valid Range: 1,536-262,136 (0x600 - 0x3FFF8, 8 byte granularity) -:Default Value: 196,608 (0x30000) - - Receive Flow control high threshold (when we send a pause frame) - -RxFCLowThresh -------------- -:Valid Range: 64-262,136 (0x40 - 0x3FFF8, 8 byte granularity) -:Default Value: 163,840 (0x28000) - - Receive Flow control low threshold (when we send a resume frame) - -FCReqTimeout ------------- -:Valid Range: 1-65535 -:Default Value: 65535 - - Flow control request timeout (how long to pause the link partner's tx) - -IntDelayEnable --------------- -:Value Range: 0,1 -:Default Value: 1 - - Interrupt Delay, 0 disables transmit interrupt delay and 1 enables it. - - -Improving Performance -===================== - -With the 10 Gigabit server adapters, the default Linux configuration will -very likely limit the total available throughput artificially. There is a set -of configuration changes that, when applied together, will increase the ability -of Linux to transmit and receive data. The following enhancements were -originally acquired from settings published at https://www.spec.org/web99/ for -various submitted results using Linux. - -NOTE: - These changes are only suggestions, and serve as a starting point for - tuning your network performance. - -The changes are made in three major ways, listed in order of greatest effect: - -- Use ip link to modify the mtu (maximum transmission unit) and the txqueuelen - parameter. -- Use sysctl to modify /proc parameters (essentially kernel tuning) -- Use setpci to modify the MMRBC field in PCI-X configuration space to increase - transmit burst lengths on the bus. - -NOTE: - setpci modifies the adapter's configuration registers to allow it to read - up to 4k bytes at a time (for transmits). 
However, for some systems the - behavior after modifying this register may be undefined (possibly errors of - some kind). A power-cycle, hard reset or explicitly setting the e6 register - back to 22 (setpci -d 8086:1a48 e6.b=22) may be required to get back to a - stable configuration. - -- COPY these lines and paste them into ixgb_perf.sh: - -:: - - #!/bin/bash - echo "configuring network performance , edit this file to change the interface - or device ID of 10GbE card" - # set mmrbc to 4k reads, modify only Intel 10GbE device IDs - # replace 1a48 with appropriate 10GbE device's ID installed on the system, - # if needed. - setpci -d 8086:1a48 e6.b=2e - # set the MTU (max transmission unit) - it requires your switch and clients - # to change as well. - # set the txqueuelen - # your ixgb adapter should be loaded as eth1 for this to work, change if needed - ip li set dev eth1 mtu 9000 txqueuelen 1000 up - # call the sysctl utility to modify /proc/sys entries - sysctl -p ./sysctl_ixgb.conf - -- COPY these lines and paste them into sysctl_ixgb.conf: - -:: - - # some of the defaults may be different for your kernel - # call this file with sysctl -p <this file> - # these are just suggested values that worked well to increase throughput in - # several network benchmark tests, your mileage may vary - - ### IPV4 specific settings - # turn TCP timestamp support off, default 1, reduces CPU use - net.ipv4.tcp_timestamps = 0 - # turn SACK support off, default on - # on systems with a VERY fast bus -> memory interface this is the big gainer - net.ipv4.tcp_sack = 0 - # set min/default/max TCP read buffer, default 4096 87380 174760 - net.ipv4.tcp_rmem = 10000000 10000000 10000000 - # set min/pressure/max TCP write buffer, default 4096 16384 131072 - net.ipv4.tcp_wmem = 10000000 10000000 10000000 - # set min/pressure/max TCP buffer space, default 31744 32256 32768 - net.ipv4.tcp_mem = 10000000 10000000 10000000 - - ### CORE settings (mostly for socket and UDP effect) - # set maximum 
receive socket buffer size, default 131071 - net.core.rmem_max = 524287 - # set maximum send socket buffer size, default 131071 - net.core.wmem_max = 524287 - # set default receive socket buffer size, default 65535 - net.core.rmem_default = 524287 - # set default send socket buffer size, default 65535 - net.core.wmem_default = 524287 - # set maximum amount of option memory buffers, default 10240 - net.core.optmem_max = 524287 - # set number of unprocessed input packets before kernel starts dropping them; default 300 - net.core.netdev_max_backlog = 300000 - -Edit the ixgb_perf.sh script if necessary to change eth1 to whatever interface -your ixgb driver is using and/or replace '1a48' with appropriate 10GbE device's -ID installed on the system. - -NOTE: - Unless these scripts are added to the boot process, these changes will - only last only until the next system reboot. - - -Resolving Slow UDP Traffic --------------------------- -If your server does not seem to be able to receive UDP traffic as fast as it -can receive TCP traffic, it could be because Linux, by default, does not set -the network stack buffers as large as they need to be to support high UDP -transfer rates. One way to alleviate this problem is to allow more memory to -be used by the IP stack to store incoming data. - -For instance, use the commands:: - - sysctl -w net.core.rmem_max=262143 - -and:: - - sysctl -w net.core.rmem_default=262143 - -to increase the read buffer memory max and default to 262143 (256k - 1) from -defaults of max=131071 (128k - 1) and default=65535 (64k - 1). These variables -will increase the amount of memory used by the network stack for receives, and -can be increased significantly more if necessary for your application. 
- - -Additional Configurations -========================= - -Configuring the Driver on Different Distributions -------------------------------------------------- -Configuring a network driver to load properly when the system is started is -distribution dependent. Typically, the configuration process involves adding -an alias line to /etc/modprobe.conf as well as editing other system startup -scripts and/or configuration files. Many popular Linux distributions ship -with tools to make these changes for you. To learn the proper way to -configure a network device for your system, refer to your distribution -documentation. If during this process you are asked for the driver or module -name, the name for the Linux Base Driver for the Intel 10GbE Family of -Adapters is ixgb. - -Viewing Link Messages ---------------------- -Link messages will not be displayed to the console if the distribution is -restricting system messages. In order to see network driver link messages on -your console, set dmesg to eight by entering the following:: - - dmesg -n 8 - -NOTE: This setting is not saved across reboots. - -Jumbo Frames ------------- -The driver supports Jumbo Frames for all adapters. Jumbo Frames support is -enabled by changing the MTU to a value larger than the default of 1500. -The maximum value for the MTU is 16114. Use the ip command to -increase the MTU size. For example:: - - ip li set dev ethx mtu 9000 - -The maximum MTU setting for Jumbo Frames is 16114. This value coincides -with the maximum Jumbo Frames size of 16128. - -Ethtool -------- -The driver utilizes the ethtool interface for driver configuration and -diagnostics, as well as displaying statistical information. The ethtool -version 1.6 or later is required for this functionality. - -The latest release of ethtool can be found from -https://www.kernel.org/pub/software/network/ethtool/ - -NOTE: - The ethtool version 1.6 only supports a limited set of ethtool options. 
- Support for a more complete ethtool feature set can be enabled by - upgrading to the latest version. - -NAPI ----- -NAPI (Rx polling mode) is supported in the ixgb driver. - -See https://wiki.linuxfoundation.org/networking/napi for more information on -NAPI. - - -Known Issues/Troubleshooting -============================ - -NOTE: - After installing the driver, if your Intel Network Connection is not - working, verify in the "In This Release" section of the readme that you have - installed the correct driver. - -Cable Interoperability Issue with Fujitsu XENPAK Module in SmartBits Chassis ----------------------------------------------------------------------------- -Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4 -Server adapter is connected to a Fujitsu XENPAK CX4 module in a SmartBits -chassis using 15 m/24AWG cable assemblies manufactured by Fujitsu or Leoni. -The CRC errors may be received either by the Intel(R) PRO/10GbE CX4 -Server adapter or the SmartBits. If this situation occurs using a different -cable assembly may resolve the issue. - -Cable Interoperability Issues with HP Procurve 3400cl Switch Port ------------------------------------------------------------------ -Excessive CRC errors may be observed if the Intel(R) PRO/10GbE CX4 Server -adapter is connected to an HP Procurve 3400cl switch port using short cables -(1 m or shorter). If this situation occurs, using a longer cable may resolve -the issue. - -Excessive CRC errors may be observed using Fujitsu 24AWG cable assemblies that -Are 10 m or longer or where using a Leoni 15 m/24AWG cable assembly. The CRC -errors may be received either by the CX4 Server adapter or at the switch. If -this situation occurs, using a different cable assembly may resolve the issue. - -Jumbo Frames System Requirement -------------------------------- -Memory allocation failures have been observed on Linux systems with 64 MB -of RAM or less that are running Jumbo Frames. 
If you are using Jumbo -Frames, your system may require more than the advertised minimum -requirement of 64 MB of system memory. - -Performance Degradation with Jumbo Frames ------------------------------------------ -Degradation in throughput performance may be observed in some Jumbo frames -environments. If this is observed, increasing the application's socket buffer -size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values may help. -See the specific application manual and /usr/src/linux*/Documentation/ -networking/ip-sysctl.txt for more details. - -Allocating Rx Buffers when Using Jumbo Frames ---------------------------------------------- -Allocating Rx buffers when using Jumbo Frames on 2.6.x kernels may fail if -the available memory is heavily fragmented. This issue may be seen with PCI-X -adapters or with packet split disabled. This can be reduced or eliminated -by changing the amount of available memory for receive buffer allocation, by -increasing /proc/sys/vm/min_free_kbytes. - -Multiple Interfaces on Same Ethernet Broadcast Network ------------------------------------------------------- -Due to the default ARP behavior on Linux, it is not possible to have -one system on two IP networks in the same Ethernet broadcast domain -(non-partitioned switch) behave as expected. All Ethernet interfaces -will respond to IP traffic for any IP address assigned to the system. -This results in unbalanced receive traffic. - -If you have multiple interfaces in a server, do either of the following: - - - Turn on ARP filtering by entering:: - - echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter - - - Install the interfaces in separate broadcast domains - either in - different switches or in a switch partitioned to VLANs. - -UDP Stress Test Dropped Packet Issue --------------------------------------- -Under small packets UDP stress test with 10GbE driver, the Linux system -may drop UDP packets due to the fullness of socket buffers. 
You may want -to change the driver's Flow Control variables to the minimum value for -controlling packet reception. - -Tx Hangs Possible Under Stress ------------------------------- -Under stress conditions, if TX hangs occur, turning off TSO -"ethtool -K eth0 tso off" may resolve the problem. - - -Support -======= -For general information, go to the Intel support website at: - -https://www.intel.com/support/ - -or the Intel Wired Networking project hosted by Sourceforge at: - -https://sourceforge.net/projects/e1000 - -If an issue is identified with the released source code on a supported kernel -with a supported adapter, email the specific information related to the issue -to e1000-devel@lists.sf.net diff --git a/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst index 0a233b17c664..1e5f16993f69 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbe.rst @@ -545,13 +545,8 @@ on the Intel Ethernet Controller XL710. Support ======= For general information, go to the Intel support website at: - https://www.intel.com/support/ -or the Intel Wired Networking project hosted by Sourceforge at: - -https://sourceforge.net/projects/e1000 - If an issue is identified with the released source code on a supported kernel with a supported adapter, email the specific information related to the issue -to e1000-devel@lists.sf.net. +to intel-wired-lan@lists.osuosl.org. diff --git a/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst index 76bbde736f21..08dc0d368a48 100644 --- a/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst +++ b/Documentation/networking/device_drivers/ethernet/intel/ixgbevf.rst @@ -55,13 +55,8 @@ VLANs: There is a limit of a total of 64 shared VLANs to 1 or more VFs. 
Support ======= For general information, go to the Intel support website at: - https://www.intel.com/support/ -or the Intel Wired Networking project hosted by Sourceforge at: - -https://sourceforge.net/projects/e1000 - If an issue is identified with the released source code on a supported kernel with a supported adapter, email the specific information related to the issue -to e1000-devel@lists.sf.net. +to intel-wired-lan@lists.osuosl.org. diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst index 4cd8e869762b..6b2d1fe74ecf 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/counters.rst @@ -346,32 +346,6 @@ the software port. - The number of receive packets with CQE compression on ring i [#accel]_. - Acceleration - * - `rx[i]_cache_reuse` - - The number of events of successful reuse of a page from a driver's - internal page cache. - - Acceleration - - * - `rx[i]_cache_full` - - The number of events of full internal page cache where driver can't put a - page back to the cache for recycling (page will be freed). - - Acceleration - - * - `rx[i]_cache_empty` - - The number of events where cache was empty - no page to give. Driver - shall allocate new page. - - Acceleration - - * - `rx[i]_cache_busy` - - The number of events where cache head was busy and cannot be recycled. - Driver allocated new page. - - Acceleration - - * - `rx[i]_cache_waive` - - The number of cache evacuation. This can occur due to page move to - another NUMA node or page was pfmemalloc-ed and should be freed as soon - as possible. - - Acceleration - * - `rx[i]_arfs_err` - Number of flow rules that failed to be added to the flow table. 
- Error diff --git a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst index 9b5c40ba7f0d..0995e4e5acd7 100644 --- a/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst +++ b/Documentation/networking/device_drivers/ethernet/mellanox/mlx5/devlink.rst @@ -122,6 +122,41 @@ users try to enable them. $ devlink dev eswitch set pci/0000:06:00.0 mode switchdev +hairpin_num_queues: Number of hairpin queues +-------------------------------------------- +We refer to a TC NIC rule that involves forwarding as "hairpin". + +Hairpin queues are mlx5 hardware specific implementation for hardware +forwarding of such packets. + +- Show the number of hairpin queues:: + + $ devlink dev param show pci/0000:06:00.0 name hairpin_num_queues + pci/0000:06:00.0: + name hairpin_num_queues type driver-specific + values: + cmode driverinit value 2 + +- Change the number of hairpin queues:: + + $ devlink dev param set pci/0000:06:00.0 name hairpin_num_queues value 4 cmode driverinit + +hairpin_queue_size: Size of the hairpin queues +---------------------------------------------- +Control the size of the hairpin queues. + +- Show the size of the hairpin queues:: + + $ devlink dev param show pci/0000:06:00.0 name hairpin_queue_size + pci/0000:06:00.0: + name hairpin_queue_size type driver-specific + values: + cmode driverinit value 1024 + +- Change the size (in packets) of the hairpin queues:: + + $ devlink dev param set pci/0000:06:00.0 name hairpin_queue_size value 512 cmode driverinit + Health reporters ================ diff --git a/Documentation/networking/devlink/mlx5.rst b/Documentation/networking/devlink/mlx5.rst index 3321117cf605..202798d6501e 100644 --- a/Documentation/networking/devlink/mlx5.rst +++ b/Documentation/networking/devlink/mlx5.rst @@ -72,6 +72,18 @@ parameters. 
Default: disabled + * - ``hairpin_num_queues`` + - u32 + - driverinit + - We refer to a TC NIC rule that involves forwarding as "hairpin". + Hairpin queues are mlx5 hardware specific implementation for hardware + forwarding of such packets. + + Control the number of hairpin queues. + * - ``hairpin_queue_size`` + - u32 + - driverinit + - Control the size (in packets) of the hairpin queues. The ``mlx5`` driver supports reloading via ``DEVLINK_CMD_RELOAD`` diff --git a/Documentation/networking/driver.rst b/Documentation/networking/driver.rst index 64f7236ff10b..4f5dfa9c022e 100644 --- a/Documentation/networking/driver.rst +++ b/Documentation/networking/driver.rst @@ -4,94 +4,124 @@ Softnet Driver Issues ===================== -Transmit path guidelines: +Probing guidelines +================== -1) The ndo_start_xmit method must not return NETDEV_TX_BUSY under - any normal circumstances. It is considered a hard error unless - there is no way your device can tell ahead of time when its - transmit function will become busy. +Address validation +------------------ - Instead it must maintain the queue properly. For example, - for a driver implementing scatter-gather this means:: +Any hardware layer address you obtain for your device should +be verified. For example, for ethernet check it with +linux/etherdevice.h:is_valid_ether_addr() + +Close/stop guidelines +===================== + +Quiescence +---------- + +After the ndo_stop routine has been called, the hardware must +not receive or transmit any data. All in flight packets must +be aborted. If necessary, poll or wait for completion of +any reset commands. + +Auto-close +---------- + +The ndo_stop routine will be called by unregister_netdevice +if device is still UP. + +Transmit path guidelines +======================== + +Stop queues in advance +---------------------- + +The ndo_start_xmit method must not return NETDEV_TX_BUSY under +any normal circumstances. 
It is considered a hard error unless +there is no way your device can tell ahead of time when its +transmit function will become busy. + +Instead it must maintain the queue properly. For example, +for a driver implementing scatter-gather this means: + +.. code-block:: c + + static u32 drv_tx_avail(struct drv_ring *dr) + { + u32 used = READ_ONCE(dr->prod) - READ_ONCE(dr->cons); + + return dr->tx_ring_size - (used & dr->tx_ring_mask); + } static netdev_tx_t drv_hard_start_xmit(struct sk_buff *skb, struct net_device *dev) { struct drv *dp = netdev_priv(dev); + struct netdev_queue *txq; + struct drv_ring *dr; + int idx; - lock_tx(dp); - ... - /* This is a hard error log it. */ - if (TX_BUFFS_AVAIL(dp) <= (skb_shinfo(skb)->nr_frags + 1)) { + idx = skb_get_queue_mapping(skb); + dr = dp->tx_rings[idx]; + txq = netdev_get_tx_queue(dev, idx); + + //... + /* This should be a very rare race - log it. */ + if (drv_tx_avail(dr) <= skb_shinfo(skb)->nr_frags + 1) { netif_stop_queue(dev); - unlock_tx(dp); - printk(KERN_ERR PFX "%s: BUG! Tx Ring full when queue awake!\n", - dev->name); + netdev_warn(dev, "Tx Ring full when queue awake!\n"); return NETDEV_TX_BUSY; } - ... queue packet to card ... - ... update tx consumer index ... - - if (TX_BUFFS_AVAIL(dp) <= (MAX_SKB_FRAGS + 1)) - netif_stop_queue(dev); - - ... - unlock_tx(dp); - ... - return NETDEV_TX_OK; - } - - And then at the end of your TX reclamation event handling:: + //... queue packet to card ... - if (netif_queue_stopped(dp->dev) && - TX_BUFFS_AVAIL(dp) > (MAX_SKB_FRAGS + 1)) - netif_wake_queue(dp->dev); + netdev_tx_sent_queue(txq, skb->len); - For a non-scatter-gather supporting card, the three tests simply become:: + //... update tx producer index using WRITE_ONCE() ... - /* This is a hard error log it. */ - if (TX_BUFFS_AVAIL(dp) <= 0) + if (!netif_txq_maybe_stop(txq, drv_tx_avail(dr), + MAX_SKB_FRAGS + 1, 2 * MAX_SKB_FRAGS)) + dr->stats.stopped++; - and:: + //...
+ return NETDEV_TX_OK; + } - if (TX_BUFFS_AVAIL(dp) == 0) +And then at the end of your TX reclamation event handling: - and:: +.. code-block:: c - if (netif_queue_stopped(dp->dev) && - TX_BUFFS_AVAIL(dp) > 0) - netif_wake_queue(dp->dev); + //... update tx consumer index using WRITE_ONCE() ... -2) An ndo_start_xmit method must not modify the shared parts of a - cloned SKB. + netif_txq_completed_wake(txq, cmpl_pkts, cmpl_bytes, + drv_tx_avail(dr), 2 * MAX_SKB_FRAGS); -3) Do not forget that once you return NETDEV_TX_OK from your - ndo_start_xmit method, it is your driver's responsibility to free - up the SKB and in some finite amount of time. +Lockless queue stop / wake helper macros +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ - For example, this means that it is not allowed for your TX - mitigation scheme to let TX packets "hang out" in the TX - ring unreclaimed forever if no new TX packets are sent. - This error can deadlock sockets waiting for send buffer room - to be freed up. +.. kernel-doc:: include/net/netdev_queues.h + :doc: Lockless queue stopping / waking helpers. - If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you - must not keep any reference to that SKB and you must not attempt - to free it up. +No exclusive ownership +---------------------- -Probing guidelines: +An ndo_start_xmit method must not modify the shared parts of a +cloned SKB. -1) Any hardware layer address you obtain for your device should - be verified. For example, for ethernet check it with - linux/etherdevice.h:is_valid_ether_addr() +Timely completions +------------------ -Close/stop guidelines: +Do not forget that once you return NETDEV_TX_OK from your +ndo_start_xmit method, it is your driver's responsibility to free +up the SKB and in some finite amount of time. -1) After the ndo_stop routine has been called, the hardware must - not receive or transmit any data. All in flight packets must - be aborted. If necessary, poll or wait for completion of - any reset commands. 
+For example, this means that it is not allowed for your TX +mitigation scheme to let TX packets "hang out" in the TX +ring unreclaimed forever if no new TX packets are sent. +This error can deadlock sockets waiting for send buffer room +to be freed up. -2) The ndo_stop routine will be called by unregister_netdevice - if device is still UP. +If you return NETDEV_TX_BUSY from the ndo_start_xmit method, you +must not keep any reference to that SKB and you must not attempt +to free it up. diff --git a/Documentation/networking/ethtool-netlink.rst b/Documentation/networking/ethtool-netlink.rst index e1bc6186d7ea..cd0973d4ba01 100644 --- a/Documentation/networking/ethtool-netlink.rst +++ b/Documentation/networking/ethtool-netlink.rst @@ -860,22 +860,24 @@ Request contents: Kernel response contents: - ==================================== ====== =========================== - ``ETHTOOL_A_RINGS_HEADER`` nested reply header - ``ETHTOOL_A_RINGS_RX_MAX`` u32 max size of RX ring - ``ETHTOOL_A_RINGS_RX_MINI_MAX`` u32 max size of RX mini ring - ``ETHTOOL_A_RINGS_RX_JUMBO_MAX`` u32 max size of RX jumbo ring - ``ETHTOOL_A_RINGS_TX_MAX`` u32 max size of TX ring - ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring - ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring - ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring - ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring - ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring - ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split - ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE - ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode - ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode - ==================================== ====== =========================== + ======================================= ====== =========================== + ``ETHTOOL_A_RINGS_HEADER`` nested reply header + ``ETHTOOL_A_RINGS_RX_MAX`` u32 max size of RX ring + ``ETHTOOL_A_RINGS_RX_MINI_MAX`` u32 max size of RX mini ring + 
``ETHTOOL_A_RINGS_RX_JUMBO_MAX`` u32 max size of RX jumbo ring + ``ETHTOOL_A_RINGS_TX_MAX`` u32 max size of TX ring + ``ETHTOOL_A_RINGS_RX`` u32 size of RX ring + ``ETHTOOL_A_RINGS_RX_MINI`` u32 size of RX mini ring + ``ETHTOOL_A_RINGS_RX_JUMBO`` u32 size of RX jumbo ring + ``ETHTOOL_A_RINGS_TX`` u32 size of TX ring + ``ETHTOOL_A_RINGS_RX_BUF_LEN`` u32 size of buffers on the ring + ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` u8 TCP header / data split + ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE + ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode + ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode + ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer + ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN_MAX`` u32 max size of TX push buffer + ======================================= ====== =========================== ``ETHTOOL_A_RINGS_TCP_DATA_SPLIT`` indicates whether the device is usable with page-flipping TCP zero-copy receive (``getsockopt(TCP_ZEROCOPY_RECEIVE)``). @@ -891,6 +893,18 @@ through MMIO writes, thus reducing the latency. However, enabling this feature may increase the CPU cost. Drivers may enforce additional per-packet eligibility checks (e.g. on packet size). +``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` specifies the maximum number of bytes of a +transmitted packet a driver can push directly to the underlying device +('push' mode). Pushing some of the payload bytes to the device has the +advantages of reducing latency for small packets by avoiding DMA mapping (same +as ``ETHTOOL_A_RINGS_TX_PUSH`` parameter) as well as allowing the underlying +device to process packet headers ahead of fetching its payload. +This can help the device to make fast actions based on the packet's headers. +This is similar to the "tx-copybreak" parameter, which copies the packet to a +preallocated DMA memory area instead of mapping new memory. However, +tx-push-buff parameter copies the packet directly to the device to allow the +device to take faster actions on the packet. 
+ RINGS_SET ========= @@ -908,6 +922,7 @@ Request contents: ``ETHTOOL_A_RINGS_CQE_SIZE`` u32 Size of TX/RX CQE ``ETHTOOL_A_RINGS_TX_PUSH`` u8 flag of TX Push mode ``ETHTOOL_A_RINGS_RX_PUSH`` u8 flag of RX Push mode + ``ETHTOOL_A_RINGS_TX_PUSH_BUF_LEN`` u32 size of TX push buffer ==================================== ====== =========================== Kernel checks that requested ring sizes do not exceed limits reported by diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index 4ddcae33c336..a164ff074356 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -36,6 +36,7 @@ Contents: scaling tls tls-offload + tls-handshake nfc 6lowpan 6pack @@ -73,6 +74,7 @@ Contents: mpls-sysctl mptcp-sysctl multiqueue + napi netconsole netdev-features netdevices diff --git a/Documentation/networking/napi.rst b/Documentation/networking/napi.rst new file mode 100644 index 000000000000..a7a047742e93 --- /dev/null +++ b/Documentation/networking/napi.rst @@ -0,0 +1,254 @@ +.. SPDX-License-Identifier: (GPL-2.0-only OR BSD-2-Clause) + +.. _napi: + +==== +NAPI +==== + +NAPI is the event handling mechanism used by the Linux networking stack. +The name NAPI no longer stands for anything in particular [#]_. + +In basic operation the device notifies the host about new events +via an interrupt. +The host then schedules a NAPI instance to process the events. +The device may also be polled for events via NAPI without receiving +interrupts first (:ref:`busy polling<poll>`). + +NAPI processing usually happens in the software interrupt context, +but there is an option to use :ref:`separate kernel threads<threaded>` +for NAPI processing. + +All in all NAPI abstracts away from the drivers the context and configuration +of event (packet Rx and Tx) processing. + +Driver API +========== + +The two most important elements of NAPI are the struct napi_struct +and the associated poll method. 
struct napi_struct holds the state +of the NAPI instance while the method is the driver-specific event +handler. The method will typically free Tx packets that have been +transmitted and process newly received packets. + +.. _drv_ctrl: + +Control API +----------- + +netif_napi_add() and netif_napi_del() add/remove a NAPI instance +from the system. The instances are attached to the netdevice passed +as argument (and will be deleted automatically when netdevice is +unregistered). Instances are added in a disabled state. + +napi_enable() and napi_disable() manage the disabled state. +A disabled NAPI can't be scheduled and its poll method is guaranteed +to not be invoked. napi_disable() waits for ownership of the NAPI +instance to be released. + +The control APIs are not idempotent. Control API calls are safe against +concurrent use of datapath APIs but an incorrect sequence of control API +calls may result in crashes, deadlocks, or race conditions. For example, +calling napi_disable() multiple times in a row will deadlock. + +Datapath API +------------ + +napi_schedule() is the basic method of scheduling a NAPI poll. +Drivers should call this function in their interrupt handler +(see :ref:`drv_sched` for more info). A successful call to napi_schedule() +will take ownership of the NAPI instance. + +Later, after NAPI is scheduled, the driver's poll method will be +called to process the events/packets. The method takes a ``budget`` +argument - drivers can process completions for any number of Tx +packets but should only process up to ``budget`` number of +Rx packets. Rx processing is usually much more expensive. + +In other words, it is recommended to ignore the budget argument when +performing TX buffer reclamation to ensure that the reclamation is not +arbitrarily bounded; however, it is required to honor the budget argument +for RX processing. + +.. warning:: + + The ``budget`` argument may be 0 if core tries to only process Tx completions + and no Rx packets. 
+ +The poll method returns the amount of work done. If the driver still +has outstanding work to do (e.g. ``budget`` was exhausted) +the poll method should return exactly ``budget``. In that case, +the NAPI instance will be serviced/polled again (without the +need to be scheduled). + +If event processing has been completed (all outstanding packets +processed) the poll method should call napi_complete_done() +before returning. napi_complete_done() releases the ownership +of the instance. + +.. warning:: + + The case of finishing all events and using exactly ``budget`` + must be handled carefully. There is no way to report this + (rare) condition to the stack, so the driver must either + not call napi_complete_done() and wait to be called again, + or return ``budget - 1``. + + If the ``budget`` is 0 napi_complete_done() should never be called. + +Call sequence +------------- + +Drivers should not make assumptions about the exact sequencing +of calls. The poll method may be called without the driver scheduling +the instance (unless the instance is disabled). Similarly, +it's not guaranteed that the poll method will be called, even +if napi_schedule() succeeded (e.g. if the instance gets disabled). + +As mentioned in the :ref:`drv_ctrl` section - napi_disable() and subsequent +calls to the poll method only wait for the ownership of the instance +to be released, not for the poll method to exit. This means that +drivers should avoid accessing any data structures after calling +napi_complete_done(). + +.. _drv_sched: + +Scheduling and IRQ masking +-------------------------- + +Drivers should keep the interrupts masked after scheduling +the NAPI instance - until NAPI polling finishes any further +interrupts are unnecessary. + +Drivers which have to mask the interrupts explicitly (as opposed +to IRQ being auto-masked by the device) should use the napi_schedule_prep() +and __napi_schedule() calls: + +.. 
code-block:: c + + if (napi_schedule_prep(&v->napi)) { + mydrv_mask_rxtx_irq(v->idx); + /* schedule after masking to avoid races */ + __napi_schedule(&v->napi); + } + +IRQ should only be unmasked after a successful call to napi_complete_done(): + +.. code-block:: c + + if (budget && napi_complete_done(&v->napi, work_done)) { + mydrv_unmask_rxtx_irq(v->idx); + return min(work_done, budget - 1); + } + +napi_schedule_irqoff() is a variant of napi_schedule() which takes advantage +of guarantees given by being invoked in IRQ context (no need to +mask interrupts). Note that PREEMPT_RT forces all interrupts +to be threaded so the interrupt may need to be marked ``IRQF_NO_THREAD`` +to avoid issues on real-time kernel configurations. + +Instance to queue mapping +------------------------- + +Modern devices have multiple NAPI instances (struct napi_struct) per +interface. There is no strong requirement on how the instances are +mapped to queues and interrupts. NAPI is primarily a polling/processing +abstraction without specific user-facing semantics. That said, most networking +devices end up using NAPI in fairly similar ways. + +NAPI instances most often correspond 1:1:1 to interrupts and queue pairs +(queue pair is a set of a single Rx and single Tx queue). + +In less common cases a NAPI instance may be used for multiple queues +or Rx and Tx queues can be serviced by separate NAPI instances on a single +core. Regardless of the queue assignment, however, there is usually still +a 1:1 mapping between NAPI instances and interrupts. + +It's worth noting that the ethtool API uses a "channel" terminology where +each channel can be either ``rx``, ``tx`` or ``combined``. It's not clear +what constitutes a channel; the recommended interpretation is to understand +a channel as an IRQ/NAPI which services queues of a given type. For example, +a configuration of 1 ``rx``, 1 ``tx`` and 1 ``combined`` channel is expected +to utilize 3 interrupts, 2 Rx and 2 Tx queues. 
+ +User API +======== + +User interactions with NAPI depend on NAPI instance ID. The instance IDs +are only visible to the user thru the ``SO_INCOMING_NAPI_ID`` socket option. +It's not currently possible to query IDs used by a given device. + +Software IRQ coalescing +----------------------- + +NAPI does not perform any explicit event coalescing by default. +In most scenarios batching happens due to IRQ coalescing which is done +by the device. There are cases where software coalescing is helpful. + +NAPI can be configured to arm a repoll timer instead of unmasking +the hardware interrupts as soon as all packets are processed. +The ``gro_flush_timeout`` sysfs configuration of the netdevice +is reused to control the delay of the timer, while +``napi_defer_hard_irqs`` controls the number of consecutive empty polls +before NAPI gives up and goes back to using hardware IRQs. + +.. _poll: + +Busy polling +------------ + +Busy polling allows a user process to check for incoming packets before +the device interrupt fires. As is the case with any busy polling it trades +off CPU cycles for lower latency (production uses of NAPI busy polling +are not well known). + +Busy polling is enabled by either setting ``SO_BUSY_POLL`` on +selected sockets or using the global ``net.core.busy_poll`` and +``net.core.busy_read`` sysctls. An io_uring API for NAPI busy polling +also exists. + +IRQ mitigation +--------------- + +While busy polling is supposed to be used by low latency applications, +a similar mechanism can be used for IRQ mitigation. + +Very high request-per-second applications (especially routing/forwarding +applications and especially applications using AF_XDP sockets) may not +want to be interrupted until they finish processing a request or a batch +of packets. + +Such applications can pledge to the kernel that they will perform a busy +polling operation periodically, and the driver should keep the device IRQs +permanently masked. 
This mode is enabled by using the ``SO_PREFER_BUSY_POLL`` +socket option. To avoid system misbehavior the pledge is revoked +if ``gro_flush_timeout`` passes without any busy poll call. + +The NAPI budget for busy polling is lower than the default (which makes +sense given the low latency intention of normal busy polling). This is +not the case with IRQ mitigation, however, so the budget can be adjusted +with the ``SO_BUSY_POLL_BUDGET`` socket option. + +.. _threaded: + +Threaded NAPI +------------- + +Threaded NAPI is an operating mode that uses dedicated kernel +threads rather than software IRQ context for NAPI processing. +The configuration is per netdevice and will affect all +NAPI instances of that device. Each NAPI instance will spawn a separate +thread (called ``napi/${ifc-name}-${napi-id}``). + +It is recommended to pin each kernel thread to a single CPU, the same +CPU as the CPU which services the interrupt. Note that the mapping +between IRQs and NAPI instances may not be trivial (and is driver +dependent). The NAPI instance IDs will be assigned in the opposite +order than the process IDs of the kernel threads. + +Threaded NAPI is controlled by writing 0/1 to the ``threaded`` file in +netdev's sysfs directory. + +.. rubric:: Footnotes + +.. [#] NAPI was originally referred to as New API in 2.4 Linux. diff --git a/Documentation/networking/page_pool.rst b/Documentation/networking/page_pool.rst index 30f1344e7cca..873efd97f822 100644 --- a/Documentation/networking/page_pool.rst +++ b/Documentation/networking/page_pool.rst @@ -165,6 +165,7 @@ Registration pp_params.pool_size = DESC_NUM; pp_params.nid = NUMA_NO_NODE; pp_params.dev = priv->dev; + pp_params.napi = napi; /* only if locking is tied to NAPI */ pp_params.dma_dir = xdp_prog ? 
DMA_BIDIRECTIONAL : DMA_FROM_DEVICE; page_pool = page_pool_create(&pp_params); diff --git a/Documentation/networking/tls-handshake.rst b/Documentation/networking/tls-handshake.rst new file mode 100644 index 000000000000..a2817a88e905 --- /dev/null +++ b/Documentation/networking/tls-handshake.rst @@ -0,0 +1,217 @@ +.. SPDX-License-Identifier: GPL-2.0 + +======================= +In-Kernel TLS Handshake +======================= + +Overview +======== + +Transport Layer Security (TLS) is an Upper Layer Protocol (ULP) that runs +over TCP. TLS provides end-to-end data integrity and confidentiality in +addition to peer authentication. + +The kernel's kTLS implementation handles the TLS record subprotocol, but +does not handle the TLS handshake subprotocol which is used to establish +a TLS session. Kernel consumers can use the API described here to +request TLS session establishment. + +There are several possible ways to provide a handshake service in the +kernel. The API described here is designed to hide the details of those +implementations so that in-kernel TLS consumers do not need to be +aware of how the handshake gets done. + + +User handshake agent +==================== + +As of this writing, there is no TLS handshake implementation in the +Linux kernel. To provide a handshake service, a handshake agent +(typically in user space) is started in each network namespace where a +kernel consumer might require a TLS handshake. Handshake agents listen +for events sent from the kernel that indicate a handshake request is +waiting. + +An open socket is passed to a handshake agent via a netlink operation, +which creates a socket descriptor in the agent's file descriptor table. +If the handshake completes successfully, the handshake agent promotes +the socket to use the TLS ULP and sets the session information using the +SOL_TLS socket options. The handshake agent returns the socket to the +kernel via a second netlink operation.
+ + +Kernel Handshake API +==================== + +A kernel TLS consumer initiates a client-side TLS handshake on an open +socket by invoking one of the tls_client_hello() functions. First, it +fills in a structure that contains the parameters of the request: + +.. code-block:: c + + struct tls_handshake_args { + struct socket *ta_sock; + tls_done_func_t ta_done; + void *ta_data; + unsigned int ta_timeout_ms; + key_serial_t ta_keyring; + key_serial_t ta_my_cert; + key_serial_t ta_my_privkey; + unsigned int ta_num_peerids; + key_serial_t ta_my_peerids[5]; + }; + +The @ta_sock field references an open and connected socket. The consumer +must hold a reference on the socket to prevent it from being destroyed +while the handshake is in progress. The consumer must also have +instantiated a struct file in sock->file. + + +@ta_done contains a callback function that is invoked when the handshake +has completed. Further explanation of this function is in the "Handshake +Completion" section below. + +The consumer can fill in the @ta_timeout_ms field to force the servicing +handshake agent to exit after a number of milliseconds. This enables the +socket to be fully closed once both the kernel and the handshake agent +have closed their endpoints. + +Authentication material such as x.509 certificates, private certificate +keys, and pre-shared keys is provided to the handshake agent in keys +that are instantiated by the consumer before making the handshake +request. The consumer can provide a private keyring that is linked into +the handshake agent's process keyring in the @ta_keyring field to prevent +access of those keys by other subsystems. + +To request an x.509-authenticated TLS session, the consumer fills in +the @ta_my_cert and @ta_my_privkey fields with the serial numbers of +keys containing an x.509 certificate and the private key for that +certificate. Then, it invokes this function: + +..
code-block:: c + + ret = tls_client_hello_x509(args, gfp_flags); + +The function returns zero when the handshake request is under way. A +zero return guarantees the callback function @ta_done will be invoked +for this socket. The function returns a negative errno if the handshake +could not be started. A negative errno guarantees the callback function +@ta_done will not be invoked on this socket. + + +To initiate a client-side TLS handshake with a pre-shared key, use: + +.. code-block:: c + + ret = tls_client_hello_psk(args, gfp_flags); + +However, in this case, the consumer fills in the @ta_my_peerids array +with serial numbers of keys containing the peer identities it wishes +to offer, and the @ta_num_peerids field with the number of array +entries it has filled in. The other fields are filled in as above. + + +To initiate an anonymous client-side TLS handshake use: + +.. code-block:: c + + ret = tls_client_hello_anon(args, gfp_flags); + +The handshake agent presents no peer identity information to the remote +during this type of handshake. Only server authentication (i.e. the client +verifies the server's identity) is performed during the handshake. Thus +the established session uses encryption only. + + +Consumers that are in-kernel servers use: + +.. code-block:: c + + ret = tls_server_hello_x509(args, gfp_flags); + +or + +.. code-block:: c + + ret = tls_server_hello_psk(args, gfp_flags); + +The argument structure is filled in as above. + + +If the consumer needs to cancel the handshake request, say, due to a ^C +or other exigent event, the consumer can invoke: + +.. code-block:: c + + bool tls_handshake_cancel(sock); + +This function returns true if the handshake request associated with +@sock has been canceled. The consumer's handshake completion callback +will not be invoked. If this function returns false, then the consumer's +completion callback has already been invoked.
+ + +Handshake Completion +==================== + +When the handshake agent has completed processing, it notifies the +kernel that the socket may be used by the consumer again. At this point, +the consumer's handshake completion callback, provided in the @ta_done +field in the tls_handshake_args structure, is invoked. + +The synopsis of this function is: + +.. code-block:: c + + typedef void (*tls_done_func_t)(void *data, int status, + key_serial_t peerid); + +The consumer provides a cookie in the @ta_data field of the +tls_handshake_args structure that is returned in the @data parameter of +this callback. The consumer uses the cookie to match the callback to the +thread waiting for the handshake to complete. + +The success status of the handshake is returned via the @status +parameter: + ++------------+----------------------------------------------+ +| status | meaning | ++============+==============================================+ +| 0 | TLS session established successfully | ++------------+----------------------------------------------+ +| -EACCES | Remote peer rejected the handshake or | +| | authentication failed | ++------------+----------------------------------------------+ +| -ENOMEM | Temporary resource allocation failure | ++------------+----------------------------------------------+ +| -EINVAL | Consumer provided an invalid argument | ++------------+----------------------------------------------+ +| -ENOKEY | Missing authentication material | ++------------+----------------------------------------------+ +| -EIO | An unexpected fault occurred | ++------------+----------------------------------------------+ + +The @peerid parameter contains the serial number of a key containing the +remote peer's identity or the value TLS_NO_PEERID if the session is not +authenticated. + +A best practice is to close and destroy the socket immediately if the +handshake failed.
+ + +Other considerations +-------------------- + +While a handshake is under way, the kernel consumer must alter the +socket's sk_data_ready callback function to ignore all incoming data. +Once the handshake completion callback function has been invoked, normal +receive operation can be resumed. + +Once a TLS session is established, the consumer must provide a buffer +for and then examine the control message (CMSG) that is part of every +subsequent sock_recvmsg(). Each control message indicates whether the +received message data is TLS record data or session metadata. + +See tls.rst for details on how a kTLS consumer recognizes incoming +(decrypted) application data, alerts, and handshake packets once the +socket has been promoted to use the TLS ULP.