diff options
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/00-INDEX | 2 | ||||
-rw-r--r-- | Documentation/networking/batman-adv.rst | 220 | ||||
-rw-r--r-- | Documentation/networking/batman-adv.txt | 215 | ||||
-rw-r--r-- | Documentation/networking/bonding.txt | 2 | ||||
-rw-r--r-- | Documentation/networking/dpaa.txt | 68 | ||||
-rw-r--r-- | Documentation/networking/filter.txt | 134 | ||||
-rw-r--r-- | Documentation/networking/hinic.txt | 125 | ||||
-rw-r--r-- | Documentation/networking/ieee802154.txt | 16 | ||||
-rw-r--r-- | Documentation/networking/index.rst | 1 | ||||
-rw-r--r-- | Documentation/networking/ip-sysctl.txt | 47 | ||||
-rw-r--r-- | Documentation/networking/msg_zerocopy.rst | 257 | ||||
-rw-r--r-- | Documentation/networking/netdev-FAQ.txt | 8 | ||||
-rw-r--r-- | Documentation/networking/netvsc.txt | 75 | ||||
-rw-r--r-- | Documentation/networking/nf_conntrack-sysctl.txt | 11 | ||||
-rw-r--r-- | Documentation/networking/rmnet.txt | 82 | ||||
-rw-r--r-- | Documentation/networking/rxrpc.txt | 57 | ||||
-rw-r--r-- | Documentation/networking/strparser.txt | 209 | ||||
-rw-r--r-- | Documentation/networking/switchdev.txt | 68 |
18 files changed, 1217 insertions, 380 deletions
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index c6beb5f1637f..7a79b3587dd3 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -30,8 +30,6 @@ atm.txt - info on where to get ATM programs and support for Linux. ax25.txt - info on using AX.25 and NET/ROM code for Linux -batman-adv.txt - - B.A.T.M.A.N routing protocol on top of layer 2 Ethernet Frames. baycom.txt - info on the driver for Baycom style amateur radio modems bonding.txt diff --git a/Documentation/networking/batman-adv.rst b/Documentation/networking/batman-adv.rst new file mode 100644 index 000000000000..a342b2cc3dc6 --- /dev/null +++ b/Documentation/networking/batman-adv.rst @@ -0,0 +1,220 @@ +========== +batman-adv +========== + +Batman advanced is a new approach to wireless networking which does no longer +operate on the IP basis. Unlike the batman daemon, which exchanges information +using UDP packets and sets routing tables, batman-advanced operates on ISO/OSI +Layer 2 only and uses and routes (or better: bridges) Ethernet Frames. It +emulates a virtual network switch of all nodes participating. Therefore all +nodes appear to be link local, thus all higher operating protocols won't be +affected by any changes within the network. You can run almost any protocol +above batman advanced, prominent examples are: IPv4, IPv6, DHCP, IPX. + +Batman advanced was implemented as a Linux kernel driver to reduce the overhead +to a minimum. It does not depend on any (other) network driver, and can be used +on wifi as well as ethernet lan, vpn, etc ... (anything with ethernet-style +layer 2). + + +Configuration +============= + +Load the batman-adv module into your kernel:: + + $ insmod batman-adv.ko + +The module is now waiting for activation. You must add some interfaces on which +batman can operate. After loading the module batman advanced will scan your +systems interfaces to search for compatible interfaces. Once found, it will +create subfolders in the ``/sys`` directories of each supported interface, +e.g.:: + + $ ls /sys/class/net/eth0/batman_adv/ + elp_interval iface_status mesh_iface throughput_override + +If an interface does not have the ``batman_adv`` subfolder, it probably is not +supported. Not supported interfaces are: loopback, non-ethernet and batman's +own interfaces. + +Note: After the module was loaded it will continuously watch for new +interfaces to verify the compatibility. There is no need to reload the module +if you plug your USB wifi adapter into your machine after batman advanced was +initially loaded. + +The batman-adv soft-interface can be created using the iproute2 tool ``ip``:: + + $ ip link add name bat0 type batadv + +To activate a given interface simply attach it to the ``bat0`` interface:: + + $ ip link set dev eth0 master bat0 + +Repeat this step for all interfaces you wish to add. Now batman starts +using/broadcasting on this/these interface(s). + +By reading the "iface_status" file you can check its status:: + + $ cat /sys/class/net/eth0/batman_adv/iface_status + active + +To deactivate an interface you have to detach it from the "bat0" interface:: + + $ ip link set dev eth0 nomaster + + +All mesh wide settings can be found in batman's own interface folder:: + + $ ls /sys/class/net/bat0/mesh/ + aggregated_ogms fragmentation isolation_mark routing_algo + ap_isolation gw_bandwidth log_level vlan0 + bonding gw_mode multicast_mode + bridge_loop_avoidance gw_sel_class network_coding + distributed_arp_table hop_penalty orig_interval + +There is a special folder for debugging information:: + + $ ls /sys/kernel/debug/batman_adv/bat0/ + bla_backbone_table log neighbors transtable_local + bla_claim_table mcast_flags originators + dat_cache nc socket + gateways nc_nodes transtable_global + +Some of the files contain all sort of status information regarding the mesh +network. For example, you can view the table of originators (mesh +participants) with:: + + $ cat /sys/kernel/debug/batman_adv/bat0/originators + +Other files allow to change batman's behaviour to better fit your requirements. +For instance, you can check the current originator interval (value in +milliseconds which determines how often batman sends its broadcast packets):: + + $ cat /sys/class/net/bat0/mesh/orig_interval + 1000 + +and also change its value:: + + $ echo 3000 > /sys/class/net/bat0/mesh/orig_interval + +In very mobile scenarios, you might want to adjust the originator interval to a +lower value. This will make the mesh more responsive to topology changes, but +will also increase the overhead. + + +Usage +===== + +To make use of your newly created mesh, batman advanced provides a new +interface "bat0" which you should use from this point on. All interfaces added +to batman advanced are not relevant any longer because batman handles them for +you. Basically, one "hands over" the data by using the batman interface and +batman will make sure it reaches its destination. + +The "bat0" interface can be used like any other regular interface. It needs an +IP address which can be either statically configured or dynamically (by using +DHCP or similar services):: + + NodeA: ip link set up dev bat0 + NodeA: ip addr add 192.168.0.1/24 dev bat0 + + NodeB: ip link set up dev bat0 + NodeB: ip addr add 192.168.0.2/24 dev bat0 + NodeB: ping 192.168.0.1 + +Note: In order to avoid problems remove all IP addresses previously assigned to +interfaces now used by batman advanced, e.g.:: + + $ ip addr flush dev eth0 + + +Logging/Debugging +================= + +All error messages, warnings and information messages are sent to the kernel +log. Depending on your operating system distribution this can be read in one of +a number of ways. Try using the commands: ``dmesg``, ``logread``, or looking in +the files ``/var/log/kern.log`` or ``/var/log/syslog``. All batman-adv messages +are prefixed with "batman-adv:" So to see just these messages try:: + + $ dmesg | grep batman-adv + +When investigating problems with your mesh network, it is sometimes necessary to +see more detail debug messages. This must be enabled when compiling the +batman-adv module. When building batman-adv as part of kernel, use "make +menuconfig" and enable the option ``B.A.T.M.A.N. debugging`` +(``CONFIG_BATMAN_ADV_DEBUG=y``). + +Those additional debug messages can be accessed using a special file in +debugfs:: + + $ cat /sys/kernel/debug/batman_adv/bat0/log + +The additional debug output is by default disabled. It can be enabled during +run time. Following log_levels are defined: + +.. flat-table:: + + * - 0 + - All debug output disabled + * - 1 + - Enable messages related to routing / flooding / broadcasting + * - 2 + - Enable messages related to route added / changed / deleted + * - 4 + - Enable messages related to translation table operations + * - 8 + - Enable messages related to bridge loop avoidance + * - 16 + - Enable messages related to DAT, ARP snooping and parsing + * - 32 + - Enable messages related to network coding + * - 64 + - Enable messages related to multicast + * - 128 + - Enable messages related to throughput meter + * - 255 + - Enable all messages + +The debug output can be changed at runtime using the file +``/sys/class/net/bat0/mesh/log_level``. e.g.:: + + $ echo 6 > /sys/class/net/bat0/mesh/log_level + +will enable debug messages for when routes change. + +Counters for different types of packets entering and leaving the batman-adv +module are available through ethtool:: + + $ ethtool --statistics bat0 + + +batctl +====== + +As batman advanced operates on layer 2, all hosts participating in the virtual +switch are completely transparent for all protocols above layer 2. Therefore +the common diagnosis tools do not work as expected. To overcome these problems, +batctl was created. At the moment the batctl contains ping, traceroute, tcpdump +and interfaces to the kernel module settings. + +For more information, please see the manpage (``man batctl``). + +batctl is available on https://www.open-mesh.org/ + + +Contact +======= + +Please send us comments, experiences, questions, anything :) + +IRC: + #batman on irc.freenode.org +Mailing-list: + b.a.t.m.a.n@open-mesh.org (optional subscription at + https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n) + +You can also contact the Authors: + +* Marek Lindner <mareklindner@neomailbox.ch> +* Simon Wunderlich <sw@simonwunderlich.de> diff --git a/Documentation/networking/batman-adv.txt b/Documentation/networking/batman-adv.txt deleted file mode 100644 index ccf94677b240..000000000000 --- a/Documentation/networking/batman-adv.txt +++ /dev/null @@ -1,215 +0,0 @@ -BATMAN-ADV ----------- - -Batman advanced is a new approach to wireless networking which -does no longer operate on the IP basis. Unlike the batman daemon, -which exchanges information using UDP packets and sets routing -tables, batman-advanced operates on ISO/OSI Layer 2 only and uses -and routes (or better: bridges) Ethernet Frames. It emulates a -virtual network switch of all nodes participating. Therefore all -nodes appear to be link local, thus all higher operating proto- -cols won't be affected by any changes within the network. You can -run almost any protocol above batman advanced, prominent examples -are: IPv4, IPv6, DHCP, IPX. - -Batman advanced was implemented as a Linux kernel driver to re- -duce the overhead to a minimum. It does not depend on any (other) -network driver, and can be used on wifi as well as ethernet lan, -vpn, etc ... (anything with ethernet-style layer 2). - - -CONFIGURATION -------------- - -Load the batman-adv module into your kernel: - -# insmod batman-adv.ko - -The module is now waiting for activation. You must add some in- -terfaces on which batman can operate. After loading the module -batman advanced will scan your systems interfaces to search for -compatible interfaces. Once found, it will create subfolders in -the /sys directories of each supported interface, e.g. - -# ls /sys/class/net/eth0/batman_adv/ -# elp_interval iface_status mesh_iface throughput_override - -If an interface does not have the "batman_adv" subfolder it prob- -ably is not supported. Not supported interfaces are: loopback, -non-ethernet and batman's own interfaces. - -Note: After the module was loaded it will continuously watch for -new interfaces to verify the compatibility. There is no need to -reload the module if you plug your USB wifi adapter into your ma- -chine after batman advanced was initially loaded. - -The batman-adv soft-interface can be created using the iproute2 -tool "ip" - -# ip link add name bat0 type batadv - -To activate a given interface simply attach it to the "bat0" -interface - -# ip link set dev eth0 master bat0 - -Repeat this step for all interfaces you wish to add. Now batman -starts using/broadcasting on this/these interface(s). - -By reading the "iface_status" file you can check its status: - -# cat /sys/class/net/eth0/batman_adv/iface_status -# active - -To deactivate an interface you have to detach it from the -"bat0" interface: - -# ip link set dev eth0 nomaster - - -All mesh wide settings can be found in batman's own interface -folder: - -# ls /sys/class/net/bat0/mesh/ -# aggregated_ogms fragmentation isolation_mark routing_algo -# ap_isolation gw_bandwidth log_level vlan0 -# bonding gw_mode multicast_mode -# bridge_loop_avoidance gw_sel_class network_coding -# distributed_arp_table hop_penalty orig_interval - -There is a special folder for debugging information: - -# ls /sys/kernel/debug/batman_adv/bat0/ -# bla_backbone_table log neighbors transtable_local -# bla_claim_table mcast_flags originators -# dat_cache nc socket -# gateways nc_nodes transtable_global - -Some of the files contain all sort of status information regard- -ing the mesh network. For example, you can view the table of -originators (mesh participants) with: - -# cat /sys/kernel/debug/batman_adv/bat0/originators - -Other files allow to change batman's behaviour to better fit your -requirements. For instance, you can check the current originator -interval (value in milliseconds which determines how often batman -sends its broadcast packets): - -# cat /sys/class/net/bat0/mesh/orig_interval -# 1000 - -and also change its value: - -# echo 3000 > /sys/class/net/bat0/mesh/orig_interval - -In very mobile scenarios, you might want to adjust the originator -interval to a lower value. This will make the mesh more respon- -sive to topology changes, but will also increase the overhead. - - -USAGE ------ - -To make use of your newly created mesh, batman advanced provides -a new interface "bat0" which you should use from this point on. -All interfaces added to batman advanced are not relevant any -longer because batman handles them for you. Basically, one "hands -over" the data by using the batman interface and batman will make -sure it reaches its destination. - -The "bat0" interface can be used like any other regular inter- -face. It needs an IP address which can be either statically con- -figured or dynamically (by using DHCP or similar services): - -# NodeA: ip link set up dev bat0 -# NodeA: ip addr add 192.168.0.1/24 dev bat0 - -# NodeB: ip link set up dev bat0 -# NodeB: ip addr add 192.168.0.2/24 dev bat0 -# NodeB: ping 192.168.0.1 - -Note: In order to avoid problems remove all IP addresses previ- -ously assigned to interfaces now used by batman advanced, e.g. - -# ip addr flush dev eth0 - - -LOGGING/DEBUGGING ------------------ - -All error messages, warnings and information messages are sent to -the kernel log. Depending on your operating system distribution -this can be read in one of a number of ways. Try using the com- -mands: dmesg, logread, or looking in the files /var/log/kern.log -or /var/log/syslog. All batman-adv messages are prefixed with -"batman-adv:" So to see just these messages try - -# dmesg | grep batman-adv - -When investigating problems with your mesh network it is some- -times necessary to see more detail debug messages. This must be -enabled when compiling the batman-adv module. When building bat- -man-adv as part of kernel, use "make menuconfig" and enable the -option "B.A.T.M.A.N. debugging". - -Those additional debug messages can be accessed using a special -file in debugfs - -# cat /sys/kernel/debug/batman_adv/bat0/log - -The additional debug output is by default disabled. It can be en- -abled during run time. Following log_levels are defined: - - 0 - All debug output disabled - 1 - Enable messages related to routing / flooding / broadcasting - 2 - Enable messages related to route added / changed / deleted - 4 - Enable messages related to translation table operations - 8 - Enable messages related to bridge loop avoidance - 16 - Enable messages related to DAT, ARP snooping and parsing - 32 - Enable messages related to network coding - 64 - Enable messages related to multicast -128 - Enable messages related to throughput meter -255 - Enable all messages - -The debug output can be changed at runtime using the file -/sys/class/net/bat0/mesh/log_level. e.g. - -# echo 6 > /sys/class/net/bat0/mesh/log_level - -will enable debug messages for when routes change. - -Counters for different types of packets entering and leaving the -batman-adv module are available through ethtool: - -# ethtool --statistics bat0 - - -BATCTL ------- - -As batman advanced operates on layer 2 all hosts participating in -the virtual switch are completely transparent for all protocols -above layer 2. Therefore the common diagnosis tools do not work -as expected. To overcome these problems batctl was created. At -the moment the batctl contains ping, traceroute, tcpdump and -interfaces to the kernel module settings. - -For more information, please see the manpage (man batctl). - -batctl is available on https://www.open-mesh.org/ - - -CONTACT -------- - -Please send us comments, experiences, questions, anything :) - -IRC: #batman on irc.freenode.org -Mailing-list: b.a.t.m.a.n@open-mesh.org (optional subscription - at https://lists.open-mesh.org/mm/listinfo/b.a.t.m.a.n) - -You can also contact the Authors: - -Marek Lindner <mareklindner@neomailbox.ch> -Simon Wunderlich <sw@simonwunderlich.de> diff --git a/Documentation/networking/bonding.txt b/Documentation/networking/bonding.txt index 57f52cdce32e..9ba04c0bab8d 100644 --- a/Documentation/networking/bonding.txt +++ b/Documentation/networking/bonding.txt @@ -2387,7 +2387,7 @@ broadcast: Like active-backup, there is not much advantage to this and packet type ID), so in a "gatewayed" configuration, all outgoing traffic will generally use the same device. Incoming traffic may also end up on a single device, but that is - dependent upon the balancing policy of the peer's 8023.ad + dependent upon the balancing policy of the peer's 802.3ad implementation. In a "local" configuration, traffic will be distributed across the devices in the bond. diff --git a/Documentation/networking/dpaa.txt b/Documentation/networking/dpaa.txt index 76e016d4d344..f88194f71c54 100644 --- a/Documentation/networking/dpaa.txt +++ b/Documentation/networking/dpaa.txt @@ -13,6 +13,7 @@ Contents - Configuring DPAA Ethernet in your kernel - DPAA Ethernet Frame Processing - DPAA Ethernet Features + - DPAA IRQ Affinity and Receive Side Scaling - Debugging DPAA Ethernet Overview @@ -147,7 +148,10 @@ gradually. The driver has Rx and Tx checksum offloading for UDP and TCP. Currently the Rx checksum offload feature is enabled by default and cannot be controlled through -ethtool. +ethtool. Also, rx-flow-hash and rx-hashing was added. The addition of RSS +provides a big performance boost for the forwarding scenarios, allowing +different traffic flows received by one interface to be processed by different +CPUs in parallel. The driver has support for multiple prioritized Tx traffic classes. Priorities range from 0 (lowest) to 3 (highest). These are mapped to HW workqueues with @@ -166,6 +170,68 @@ classes as follows: tc qdisc add dev <int> root handle 1: \ mqprio num_tc 4 map 0 0 0 0 1 1 1 1 2 2 2 2 3 3 3 3 hw 1 +DPAA IRQ Affinity and Receive Side Scaling +========================================== + +Traffic coming on the DPAA Rx queues or on the DPAA Tx confirmation +queues is seen by the CPU as ingress traffic on a certain portal. +The DPAA QMan portal interrupts are affined each to a certain CPU. +The same portal interrupt services all the QMan portal consumers. + +By default the DPAA Ethernet driver enables RSS, making use of the +DPAA FMan Parser and Keygen blocks to distribute traffic on 128 +hardware frame queues using a hash on IP v4/v6 source and destination +and L4 source and destination ports, in present in the received frame. +When RSS is disabled, all traffic received by a certain interface is +received on the default Rx frame queue. The default DPAA Rx frame +queues are configured to put the received traffic into a pool channel +that allows any available CPU portal to dequeue the ingress traffic. +The default frame queues have the HOLDACTIVE option set, ensuring that +traffic bursts from a certain queue are serviced by the same CPU. +This ensures a very low rate of frame reordering. A drawback of this +is that only one CPU at a time can service the traffic received by a +certain interface when RSS is not enabled. + +To implement RSS, the DPAA Ethernet driver allocates an extra set of +128 Rx frame queues that are configured to dedicated channels, in a +round-robin manner. The mapping of the frame queues to CPUs is now +hardcoded, there is no indirection table to move traffic for a certain +FQ (hash result) to another CPU. The ingress traffic arriving on one +of these frame queues will arrive at the same portal and will always +be processed by the same CPU. This ensures intra-flow order preservation +and workload distribution for multiple traffic flows. + +RSS can be turned off for a certain interface using ethtool, i.e. + + # ethtool -N fm1-mac9 rx-flow-hash tcp4 "" + +To turn it back on, one needs to set rx-flow-hash for tcp4/6 or udp4/6: + + # ethtool -N fm1-mac9 rx-flow-hash udp4 sfdn + +There is no independent control for individual protocols, any command +run for one of tcp4|udp4|ah4|esp4|sctp4|tcp6|udp6|ah6|esp6|sctp6 is +going to control the rx-flow-hashing for all protocols on that interface. + +Besides using the FMan Keygen computed hash for spreading traffic on the +128 Rx FQs, the DPAA Ethernet driver also sets the skb hash value when +the NETIF_F_RXHASH feature is on (active by default). This can be turned +on or off through ethtool, i.e.: + + # ethtool -K fm1-mac9 rx-hashing off + # ethtool -k fm1-mac9 | grep hash + receive-hashing: off + # ethtool -K fm1-mac9 rx-hashing on + Actual changes: + receive-hashing: on + # ethtool -k fm1-mac9 | grep hash + receive-hashing: on + +Please note that Rx hashing depends upon the rx-flow-hashing being on +for that interface - turning off rx-flow-hashing will also disable the +rx-hashing (without ethtool reporting it as off as that depends on the +NETIF_F_RXHASH feature flag). + Debugging ========= diff --git a/Documentation/networking/filter.txt b/Documentation/networking/filter.txt index b69b205501de..87814859cfc2 100644 --- a/Documentation/networking/filter.txt +++ b/Documentation/networking/filter.txt @@ -45,7 +45,7 @@ in many more places. There's xt_bpf for netfilter, cls_bpf in the kernel qdisc layer, SECCOMP-BPF (SECure COMPuting [1]), and lots of other places such as team driver, PTP code, etc where BPF is being used. - [1] Documentation/prctl/seccomp_filter.txt + [1] Documentation/userspace-api/seccomp_filter.rst Original BPF paper: @@ -337,7 +337,7 @@ Examples for low-level BPF: jeq #14, good /* __NR_rt_sigprocmask */ jeq #13, good /* __NR_rt_sigaction */ jeq #35, good /* __NR_nanosleep */ - bad: ret #0 /* SECCOMP_RET_KILL */ + bad: ret #0 /* SECCOMP_RET_KILL_THREAD */ good: ret #0x7fff0000 /* SECCOMP_RET_ALLOW */ The above example code can be placed into a file (here called "foo"), and @@ -596,8 +596,8 @@ skb pointer). All constraints and restrictions from bpf_check_classic() apply before a conversion to the new layout is being done behind the scenes! Currently, the classic BPF format is being used for JITing on most 32-bit -architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64 perform JIT -compilation from eBPF instruction set. +architectures, whereas x86-64, aarch64, s390x, powerpc64, sparc64, arm32 perform +JIT compilation from eBPF instruction set. Some core changes of the new internal format: @@ -793,7 +793,7 @@ Some core changes of the new internal format: bpf_exit After the call the registers R1-R5 contain junk values and cannot be read. - In the future an eBPF verifier can be used to validate internal BPF programs. + An in-kernel eBPF verifier is used to validate internal BPF programs. Also in the new design, eBPF is limited to 4096 insns, which means that any program will terminate quickly and will only call a fixed number of kernel @@ -906,6 +906,10 @@ If BPF_CLASS(code) == BPF_JMP, BPF_OP(code) is one of: BPF_JSGE 0x70 /* eBPF only: signed '>=' */ BPF_CALL 0x80 /* eBPF only: function call */ BPF_EXIT 0x90 /* eBPF only: function return */ + BPF_JLT 0xa0 /* eBPF only: unsigned '<' */ + BPF_JLE 0xb0 /* eBPF only: unsigned '<=' */ + BPF_JSLT 0xc0 /* eBPF only: signed '<' */ + BPF_JSLE 0xd0 /* eBPF only: signed '<=' */ So BPF_ADD | BPF_X | BPF_ALU means 32-bit addition in both classic BPF and eBPF. There are only two registers in classic BPF, so it means A += X. @@ -1017,7 +1021,7 @@ At the start of the program the register R1 contains a pointer to context and has type PTR_TO_CTX. If verifier sees an insn that does R2=R1, then R2 has now type PTR_TO_CTX as well and can be used on the right hand side of expression. -If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=UNKNOWN_VALUE, +If R1=PTR_TO_CTX and insn is R2=R1+R1, then R2=SCALAR_VALUE, since addition of two valid pointers makes invalid pointer. (In 'secure' mode verifier will reject any type of pointer arithmetic to make sure that kernel addresses don't leak to unprivileged users) @@ -1039,7 +1043,7 @@ is a correct program. If there was R1 instead of R6, it would have been rejected. load/store instructions are allowed only with registers of valid types, which -are PTR_TO_CTX, PTR_TO_MAP, FRAME_PTR. They are bounds and alignment checked. +are PTR_TO_CTX, PTR_TO_MAP, PTR_TO_STACK. They are bounds and alignment checked. For example: bpf_mov R1 = 1 bpf_mov R2 = 2 @@ -1058,7 +1062,7 @@ intends to load a word from address R6 + 8 and store it into R0 If R6=PTR_TO_CTX, via is_valid_access() callback the verifier will know that offset 8 of size 4 bytes can be accessed for reading, otherwise the verifier will reject the program. -If R6=FRAME_PTR, then access should be aligned and be within +If R6=PTR_TO_STACK, then access should be aligned and be within stack bounds, which are [-MAX_BPF_STACK, 0). In this example offset is 8, so it will fail verification, since it's out of bounds. @@ -1069,7 +1073,7 @@ For example: bpf_ld R0 = *(u32 *)(R10 - 4) bpf_exit is invalid program. -Though R10 is correct read-only register and has type FRAME_PTR +Though R10 is correct read-only register and has type PTR_TO_STACK and R10 - 4 is within stack bounds, there were no stores into that location. Pointer register spill/fill is tracked as well, since four (R6-R9) @@ -1094,6 +1098,71 @@ all use cases. See details of eBPF verifier in kernel/bpf/verifier.c +Register value tracking +----------------------- +In order to determine the safety of an eBPF program, the verifier must track +the range of possible values in each register and also in each stack slot. +This is done with 'struct bpf_reg_state', defined in include/linux/ +bpf_verifier.h, which unifies tracking of scalar and pointer values. Each +register state has a type, which is either NOT_INIT (the register has not been +written to), SCALAR_VALUE (some value which is not usable as a pointer), or a +pointer type. The types of pointers describe their base, as follows: + PTR_TO_CTX Pointer to bpf_context. + CONST_PTR_TO_MAP Pointer to struct bpf_map. "Const" because arithmetic + on these pointers is forbidden. + PTR_TO_MAP_VALUE Pointer to the value stored in a map element. + PTR_TO_MAP_VALUE_OR_NULL + Either a pointer to a map value, or NULL; map accesses + (see section 'eBPF maps', below) return this type, + which becomes a PTR_TO_MAP_VALUE when checked != NULL. + Arithmetic on these pointers is forbidden. + PTR_TO_STACK Frame pointer. + PTR_TO_PACKET skb->data. + PTR_TO_PACKET_END skb->data + headlen; arithmetic forbidden. +However, a pointer may be offset from this base (as a result of pointer +arithmetic), and this is tracked in two parts: the 'fixed offset' and 'variable +offset'. The former is used when an exactly-known value (e.g. an immediate +operand) is added to a pointer, while the latter is used for values which are +not exactly known. The variable offset is also used in SCALAR_VALUEs, to track +the range of possible values in the register. +The verifier's knowledge about the variable offset consists of: +* minimum and maximum values as unsigned +* minimum and maximum values as signed +* knowledge of the values of individual bits, in the form of a 'tnum': a u64 +'mask' and a u64 'value'. 1s in the mask represent bits whose value is unknown; +1s in the value represent bits known to be 1. Bits known to be 0 have 0 in both +mask and value; no bit should ever be 1 in both. For example, if a byte is read +into a register from memory, the register's top 56 bits are known zero, while +the low 8 are unknown - which is represented as the tnum (0x0; 0xff). If we +then OR this with 0x40, we get (0x40; 0xcf), then if we add 1 we get (0x0; +0x1ff), because of potential carries. +Besides arithmetic, the register state can also be updated by conditional +branches. For instance, if a SCALAR_VALUE is compared > 8, in the 'true' branch +it will have a umin_value (unsigned minimum value) of 9, whereas in the 'false' +branch it will have a umax_value of 8. A signed compare (with BPF_JSGT or +BPF_JSGE) would instead update the signed minimum/maximum values. Information +from the signed and unsigned bounds can be combined; for instance if a value is +first tested < 8 and then tested s> 4, the verifier will conclude that the value +is also > 4 and s< 8, since the bounds prevent crossing the sign boundary. +PTR_TO_PACKETs with a variable offset part have an 'id', which is common to all +pointers sharing that same variable offset. This is important for packet range +checks: after adding some variable to a packet pointer, if you then copy it to +another register and (say) add a constant 4, both registers will share the same +'id' but one will have a fixed offset of +4. Then if it is bounds-checked and +found to be less than a PTR_TO_PACKET_END, the other register is now known to +have a safe range of at least 4 bytes. See 'Direct packet access', below, for +more on PTR_TO_PACKET ranges. +The 'id' field is also used on PTR_TO_MAP_VALUE_OR_NULL, common to all copies of +the pointer returned from a map lookup. This means that when one copy is +checked and found to be non-NULL, all copies can become PTR_TO_MAP_VALUEs. +As well as range-checking, the tracked information is also used for enforcing +alignment of pointer accesses. For instance, on most systems the packet pointer +is 2 bytes after a 4-byte alignment. If a program adds 14 bytes to that to jump +over the Ethernet header, then reads IHL and addes (IHL * 4), the resulting +pointer will have a variable offset known to be 4n+2 for some n, so adding the 2 +bytes (NET_IP_ALIGN) gives a 4-byte alignment and so word-sized accesses through +that pointer are safe. + Direct packet access -------------------- In cls_bpf and act_bpf programs the verifier allows direct access to the packet @@ -1121,7 +1190,7 @@ it now points to 'skb->data + 14' and accessible range is [R5, R5 + 14 - 14) which is zero bytes. More complex packet access may look like: - R0=imm1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp + R0=inv1 R1=ctx R3=pkt(id=0,off=0,r=14) R4=pkt_end R5=pkt(id=0,off=14,r=14) R10=fp 6: r0 = *(u8 *)(r3 +7) /* load 7th byte from the packet */ 7: r4 = *(u8 *)(r3 +12) 8: r4 *= 14 @@ -1135,26 +1204,31 @@ More complex packet access may look like: 16: r2 += 8 17: r1 = *(u32 *)(r1 +80) /* load skb->data_end */ 18: if r2 > r1 goto pc+2 - R0=inv56 R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv52 R5=pkt(id=0,off=14,r=14) R10=fp + R0=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) R1=pkt_end R2=pkt(id=2,off=8,r=8) R3=pkt(id=2,off=0,r=8) R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)) R5=pkt(id=0,off=14,r=14) R10=fp 19: r1 = *(u8 *)(r3 +4) The state of the register R3 is R3=pkt(id=2,off=0,r=8) id=2 means that two 'r3 += rX' instructions were seen, so r3 points to some offset within a packet and since the program author did 'if (r3 + 8 > r1) goto err' at insn #18, the safe range is [R3, R3 + 8). -The verifier only allows 'add' operation on packet registers. Any other -operation will set the register state to 'unknown_value' and it won't be +The verifier only allows 'add'/'sub' operations on packet registers. Any other +operation will set the register state to 'SCALAR_VALUE' and it won't be available for direct packet access. Operation 'r3 += rX' may overflow and become less than original skb->data, -therefore the verifier has to prevent that. So it tracks the number of -upper zero bits in all 'uknown_value' registers, so when it sees -'r3 += rX' instruction and rX is more than 16-bit value, it will error as: -"cannot add integer value with N upper zero bits to ptr_to_packet" +therefore the verifier has to prevent that. So when it sees 'r3 += rX' +instruction and rX is more than 16-bit value, any subsequent bounds-check of r3 +against skb->data_end will not give us 'range' information, so attempts to read +through the pointer will give "invalid access to packet" error. Ex. after insn 'r4 = *(u8 *)(r3 +12)' (insn #7 above) the state of r4 is -R4=inv56 which means that upper 56 bits on the register are guaranteed -to be zero. After insn 'r4 *= 14' the state becomes R4=inv52, since -multiplying 8-bit value by constant 14 will keep upper 52 bits as zero. -Similarly 'r2 >>= 48' will make R2=inv48, since the shift is not sign -extending. This logic is implemented in evaluate_reg_alu() function. +R4=inv(id=0,umax_value=255,var_off=(0x0; 0xff)) which means that upper 56 bits +of the register are guaranteed to be zero, and nothing is known about the lower +8 bits. After insn 'r4 *= 14' the state becomes +R4=inv(id=0,umax_value=3570,var_off=(0x0; 0xfffe)), since multiplying an 8-bit +value by constant 14 will keep upper 52 bits as zero, also the least significant +bit will be zero as 14 is even. Similarly 'r2 >>= 48' will make +R2=inv(id=0,umax_value=65535,var_off=(0x0; 0xffff)), since the shift is not sign +extending. This logic is implemented in adjust_reg_min_max_vals() function, +which calls adjust_ptr_min_max_vals() for adding pointer to scalar (or vice +versa) and adjust_scalar_min_max_vals() for operations on two scalars. The end result is that bpf program author can access packet directly using normal C code as: @@ -1214,6 +1288,22 @@ The map is defined by: . key size in bytes . value size in bytes +Pruning +------- +The verifier does not actually walk all possible paths through the program. For +each new branch to analyse, the verifier looks at all the states it's previously +been in when at this instruction. If any of them contain the current state as a +subset, the branch is 'pruned' - that is, the fact that the previous state was +accepted implies the current state would be as well. For instance, if in the +previous state, r1 held a packet-pointer, and in the current state, r1 holds a +packet-pointer with a range as long or longer and at least as strict an +alignment, then r1 is safe. Similarly, if r2 was NOT_INIT before then it can't +have been used by any path from that point, so any value in r2 (including +another NOT_INIT) is safe. The implementation is in the function regsafe(). +Pruning considers not only the registers but also the stack (and any spilled +registers it may hold). They must all be safe for the branch to be pruned. +This is implemented in states_equal(). + Understanding eBPF verifier messages ------------------------------------ diff --git a/Documentation/networking/hinic.txt b/Documentation/networking/hinic.txt new file mode 100644 index 000000000000..989366a4039c --- /dev/null +++ b/Documentation/networking/hinic.txt @@ -0,0 +1,125 @@ +Linux Kernel Driver for Huawei Intelligent NIC(HiNIC) family +============================================================ + +Overview: +========= +HiNIC is a network interface card for the Data Center Area. + +The driver supports a range of link-speed devices (10GbE, 25GbE, 40GbE, etc.). +The driver supports also a negotiated and extendable feature set. + +Some HiNIC devices support SR-IOV. This driver is used for Physical Function +(PF). + +HiNIC devices support MSI-X interrupt vector for each Tx/Rx queue and +adaptive interrupt moderation. + +HiNIC devices support also various offload features such as checksum offload, +TCP Transmit Segmentation Offload(TSO), Receive-Side Scaling(RSS) and +LRO(Large Receive Offload). + + +Supported PCI vendor ID/device IDs: +=================================== + +19e5:1822 - HiNIC PF + + +Driver Architecture and Source Code: +==================================== + +hinic_dev - Implement a Logical Network device that is independent from +specific HW details about HW data structure formats. + +hinic_hwdev - Implement the HW details of the device and include the components +for accessing the PCI NIC. + +hinic_hwdev contains the following components: +=============================================== + +HW Interface: +============= + +The interface for accessing the pci device (DMA memory and PCI BARs). +(hinic_hw_if.c, hinic_hw_if.h) + +Configuration Status Registers Area that describes the HW Registers on the +configuration and status BAR0. (hinic_hw_csr.h) + +MGMT components: +================ + +Asynchronous Event Queues(AEQs) - The event queues for receiving messages from +the MGMT modules on the cards. (hinic_hw_eqs.c, hinic_hw_eqs.h) + +Application Programmable Interface commands(API CMD) - Interface for sending +MGMT commands to the card. (hinic_hw_api_cmd.c, hinic_hw_api_cmd.h) + +Management (MGMT) - the PF to MGMT channel that uses API CMD for sending MGMT +commands to the card and receives notifications from the MGMT modules on the +card by AEQs. Also set the addresses of the IO CMDQs in HW. +(hinic_hw_mgmt.c, hinic_hw_mgmt.h) + +IO components: +============== + +Completion Event Queues(CEQs) - The completion Event Queues that describe IO +tasks that are finished. (hinic_hw_eqs.c, hinic_hw_eqs.h) + +Work Queues(WQ) - Contain the memory and operations for use by CMD queues and +the Queue Pairs. The WQ is a Memory Block in a Page. The Block contains +pointers to Memory Areas that are the Memory for the Work Queue Elements(WQEs). +(hinic_hw_wq.c, hinic_hw_wq.h) + +Command Queues(CMDQ) - The queues for sending commands for IO management and is +used to set the QPs addresses in HW. The commands completion events are +accumulated on the CEQ that is configured to receive the CMDQ completion events. +(hinic_hw_cmdq.c, hinic_hw_cmdq.h) + +Queue Pairs(QPs) - The HW Receive and Send queues for Receiving and Transmitting +Data. (hinic_hw_qp.c, hinic_hw_qp.h, hinic_hw_qp_ctxt.h) + +IO - de/constructs all the IO components. (hinic_hw_io.c, hinic_hw_io.h) + +HW device: +========== + +HW device - de/constructs the HW Interface, the MGMT components on the +initialization of the driver and the IO components on the case of Interface +UP/DOWN Events. (hinic_hw_dev.c, hinic_hw_dev.h) + + +hinic_dev contains the following components: +=============================================== + +PCI ID table - Contains the supported PCI Vendor/Device IDs. +(hinic_pci_tbl.h) + +Port Commands - Send commands to the HW device for port management +(MAC, Vlan, MTU, ...). (hinic_port.c, hinic_port.h) + +Tx Queues - Logical Tx Queues that use the HW Send Queues for transmit. +The Logical Tx queue is not dependent on the format of the HW Send Queue. +(hinic_tx.c, hinic_tx.h) + +Rx Queues - Logical Rx Queues that use the HW Receive Queues for receive. +The Logical Rx queue is not dependent on the format of the HW Receive Queue. +(hinic_rx.c, hinic_rx.h) + +hinic_dev - de/constructs the Logical Tx and Rx Queues. +(hinic_main.c, hinic_dev.h) + + +Miscellaneous: +============= + +Common functions that are used by HW and Logical Device. +(hinic_common.c, hinic_common.h) + + +Support +======= + +If an issue is identified with the released source code on the supported kernel +with a supported adapter, email the specific information related to the issue to +aviad.krawczyk@huawei.com. diff --git a/Documentation/networking/ieee802154.txt b/Documentation/networking/ieee802154.txt index c4114346f054..057e9fdbfac9 100644 --- a/Documentation/networking/ieee802154.txt +++ b/Documentation/networking/ieee802154.txt @@ -84,17 +84,17 @@ Device drivers API ================== The include/net/mac802154.h defines following functions: - - struct ieee802154_dev *ieee802154_alloc_device - (size_t priv_size, struct ieee802154_ops *ops): - allocation of IEEE 802.15.4 compatible device + - struct ieee802154_hw * + ieee802154_alloc_hw(size_t priv_data_len, const struct ieee802154_ops *ops): + allocation of IEEE 802.15.4 compatible hardware device - - void ieee802154_free_device(struct ieee802154_dev *dev): - freeing allocated device + - void ieee802154_free_hw(struct ieee802154_hw *hw): + freeing allocated hardware device - - int ieee802154_register_device(struct ieee802154_dev *dev): - register PHY in the system + - int ieee802154_register_hw(struct ieee802154_hw *hw): + register PHY which is the allocated hardware device, in the system - - void ieee802154_unregister_device(struct ieee802154_dev *dev): + - void ieee802154_unregister_hw(struct ieee802154_hw *hw): freeing registered PHY Moreover IEEE 802.15.4 device operations structure should be filled. diff --git a/Documentation/networking/index.rst b/Documentation/networking/index.rst index b5bd87e01f52..66e620866245 100644 --- a/Documentation/networking/index.rst +++ b/Documentation/networking/index.rst @@ -6,6 +6,7 @@ Contents: .. toctree:: :maxdepth: 2 + batman-adv kapi z8530book diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index 974ab47ae53a..77f4de59dc9c 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -109,7 +109,10 @@ neigh/default/unres_qlen_bytes - INTEGER queued for each unresolved address by other network layers. (added in linux 3.3) Setting negative value is meaningless and will return error. - Default: 65536 Bytes(64KB) + Default: SK_WMEM_MAX, (same as net.core.wmem_default). + Exact value depends on architecture and kernel options, + but should be enough to allow queuing 256 packets + of medium size. neigh/default/unres_qlen - INTEGER The maximum number of packets which may be queued for each @@ -119,7 +122,7 @@ neigh/default/unres_qlen - INTEGER unexpected packet loss. The current default value is calculated according to default value of unres_qlen_bytes and true size of packet. - Default: 31 + Default: 101 mtu_expires - INTEGER Time, in seconds, that cached PMTU information is kept. @@ -353,12 +356,7 @@ tcp_l3mdev_accept - BOOLEAN compiled with CONFIG_NET_L3_MASTER_DEV. tcp_low_latency - BOOLEAN - If set, the TCP stack makes decisions that prefer lower - latency as opposed to higher throughput. By default, this - option is not set meaning that higher throughput is preferred. - An example of an application where this default should be - changed would be a Beowulf compute cluster. - Default: 0 + This is a legacy option, it has no effect anymore. tcp_max_orphans - INTEGER Maximal number of TCP sockets not attached to any user file handle, @@ -1291,8 +1289,7 @@ tag - INTEGER xfrm4_gc_thresh - INTEGER The threshold at which we will start garbage collecting for IPv4 destination cache entries. At twice this value the system will - refuse new allocations. The value must be set below the flowcache - limit (4096 * number of online cpus) to take effect. + refuse new allocations. igmp_link_local_mcast_reports - BOOLEAN Enable IGMP reports for link local multicast groups in the @@ -1356,6 +1353,15 @@ flowlabel_state_ranges - BOOLEAN FALSE: disabled Default: true +flowlabel_reflect - BOOLEAN + Automatically reflect the flow label. Needed for Path MTU + Discovery to work with Equal Cost Multipath Routing in anycast + environments. See RFC 7690 and: + https://tools.ietf.org/html/draft-wang-6man-flow-label-reflection-01 + TRUE: enabled + FALSE: disabled + Default: FALSE + anycast_src_echo_reply - BOOLEAN Controls the use of anycast addresses as source addresses for ICMPv6 echo reply @@ -1674,6 +1680,9 @@ accept_dad - INTEGER 2: Enable DAD, and disable IPv6 operation if MAC-based duplicate link-local address has been found. + DAD operation and mode on a given interface will be selected according + to the maximum value of conf/{all,interface}/accept_dad. + force_tllao - BOOLEAN Enable sending the target link-layer address option even when responding to a unicast neighbor solicitation. @@ -1721,16 +1730,23 @@ suppress_frag_ndisc - INTEGER optimistic_dad - BOOLEAN Whether to perform Optimistic Duplicate Address Detection (RFC 4429). - 0: disabled (default) - 1: enabled + 0: disabled (default) + 1: enabled + + Optimistic Duplicate Address Detection for the interface will be enabled + if at least one of conf/{all,interface}/optimistic_dad is set to 1, + it will be disabled otherwise. use_optimistic - BOOLEAN If enabled, do not classify optimistic addresses as deprecated during source address selection. Preferred addresses will still be chosen before optimistic addresses, subject to other ranking in the source address selection algorithm. - 0: disabled (default) - 1: enabled + 0: disabled (default) + 1: enabled + + This will be enabled if at least one of + conf/{all,interface}/use_optimistic is set to 1, disabled otherwise. stable_secret - IPv6 address This IPv6 address will be used as a secret to generate IPv6 @@ -1778,8 +1794,7 @@ ratelimit - INTEGER xfrm6_gc_thresh - INTEGER The threshold at which we will start garbage collecting for IPv6 destination cache entries. At twice this value the system will - refuse new allocations. The value must be set below the flowcache - limit (4096 * number of online cpus) to take effect. + refuse new allocations. IPv6 Update by: diff --git a/Documentation/networking/msg_zerocopy.rst b/Documentation/networking/msg_zerocopy.rst new file mode 100644 index 000000000000..77f6d7e25cfd --- /dev/null +++ b/Documentation/networking/msg_zerocopy.rst @@ -0,0 +1,257 @@ + +============ +MSG_ZEROCOPY +============ + +Intro +===== + +The MSG_ZEROCOPY flag enables copy avoidance for socket send calls. +The feature is currently implemented for TCP sockets. + + +Opportunity and Caveats +----------------------- + +Copying large buffers between user process and kernel can be +expensive. Linux supports various interfaces that eschew copying, +such as sendpage and splice. The MSG_ZEROCOPY flag extends the +underlying copy avoidance mechanism to common socket send calls. + +Copy avoidance is not a free lunch. As implemented, with page pinning, +it replaces per byte copy cost with page accounting and completion +notification overhead. As a result, MSG_ZEROCOPY is generally only +effective at writes over around 10 KB. + +Page pinning also changes system call semantics. It temporarily shares +the buffer between process and network stack. Unlike with copying, the +process cannot immediately overwrite the buffer after system call +return without possibly modifying the data in flight. Kernel integrity +is not affected, but a buggy program can possibly corrupt its own data +stream. + +The kernel returns a notification when it is safe to modify data. +Converting an existing application to MSG_ZEROCOPY is not always as +trivial as just passing the flag, then. + + +More Info +--------- + +Much of this document was derived from a longer paper presented at +netdev 2.1. For more in-depth information see that paper and talk, +the excellent reporting over at LWN.net or read the original code. + + paper, slides, video + https://netdevconf.org/2.1/session.html?debruijn + + LWN article + https://lwn.net/Articles/726917/ + + patchset + [PATCH net-next v4 0/9] socket sendmsg MSG_ZEROCOPY + http://lkml.kernel.org/r/20170803202945.70750-1-willemdebruijn.kernel@gmail.com + + +Interface +========= + +Passing the MSG_ZEROCOPY flag is the most obvious step to enable copy +avoidance, but not the only one. + +Socket Setup +------------ + +The kernel is permissive when applications pass undefined flags to the +send system call. By default it simply ignores these. To avoid enabling +copy avoidance mode for legacy processes that accidentally already pass +this flag, a process must first signal intent by setting a socket option: + +:: + + if (setsockopt(fd, SOL_SOCKET, SO_ZEROCOPY, &one, sizeof(one))) + error(1, errno, "setsockopt zerocopy"); + + +Transmission +------------ + +The change to send (or sendto, sendmsg, sendmmsg) itself is trivial. +Pass the new flag. + +:: + + ret = send(fd, buf, sizeof(buf), MSG_ZEROCOPY); + +A zerocopy failure will return -1 with errno ENOBUFS. This happens if +the socket option was not set, the socket exceeds its optmem limit or +the user exceeds its ulimit on locked pages. + + +Mixing copy avoidance and copying +~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ + +Many workloads have a mixture of large and small buffers. Because copy +avoidance is more expensive than copying for small packets, the +feature is implemented as a flag. It is safe to mix calls with the flag +with those without. + + +Notifications +------------- + +The kernel has to notify the process when it is safe to reuse a +previously passed buffer. It queues completion notifications on the +socket error queue, akin to the transmit timestamping interface. + +The notification itself is a simple scalar value. Each socket +maintains an internal unsigned 32-bit counter. Each send call with +MSG_ZEROCOPY that successfully sends data increments the counter. The +counter is not incremented on failure or if called with length zero. +The counter counts system call invocations, not bytes. It wraps after +UINT_MAX calls. + + +Notification Reception +~~~~~~~~~~~~~~~~~~~~~~ + +The below snippet demonstrates the API. In the simplest case, each +send syscall is followed by a poll and recvmsg on the error queue. + +Reading from the error queue is always a non-blocking operation. The +poll call is there to block until an error is outstanding. It will set +POLLERR in its output flags. That flag does not have to be set in the +events field. Errors are signaled unconditionally. + +:: + + pfd.fd = fd; + pfd.events = 0; + if (poll(&pfd, 1, -1) != 1 || pfd.revents & POLLERR == 0) + error(1, errno, "poll"); + + ret = recvmsg(fd, &msg, MSG_ERRQUEUE); + if (ret == -1) + error(1, errno, "recvmsg"); + + read_notification(msg); + +The example is for demonstration purpose only. In practice, it is more +efficient to not wait for notifications, but read without blocking +every couple of send calls. + +Notifications can be processed out of order with other operations on +the socket. A socket that has an error queued would normally block +other operations until the error is read. Zerocopy notifications have +a zero error code, however, to not block send and recv calls. + + +Notification Batching +~~~~~~~~~~~~~~~~~~~~~ + +Multiple outstanding packets can be read at once using the recvmmsg +call. This is often not needed. In each message the kernel returns not +a single value, but a range. It coalesces consecutive notifications +while one is outstanding for reception on the error queue. + +When a new notification is about to be queued, it checks whether the +new value extends the range of the notification at the tail of the +queue. If so, it drops the new notification packet and instead increases +the range upper value of the outstanding notification. + +For protocols that acknowledge data in-order, like TCP, each +notification can be squashed into the previous one, so that no more +than one notification is outstanding at any one point. + +Ordered delivery is the common case, but not guaranteed. Notifications +may arrive out of order on retransmission and socket teardown. + + +Notification Parsing +~~~~~~~~~~~~~~~~~~~~ + +The below snippet demonstrates how to parse the control message: the +read_notification() call in the previous snippet. A notification +is encoded in the standard error format, sock_extended_err. + +The level and type fields in the control data are protocol family +specific, IP_RECVERR or IPV6_RECVERR. + +Error origin is the new type SO_EE_ORIGIN_ZEROCOPY. ee_errno is zero, +as explained before, to avoid blocking read and write system calls on +the socket. + +The 32-bit notification range is encoded as [ee_info, ee_data]. This +range is inclusive. Other fields in the struct must be treated as +undefined, bar for ee_code, as discussed below. + +:: + + struct sock_extended_err *serr; + struct cmsghdr *cm; + + cm = CMSG_FIRSTHDR(msg); + if (cm->cmsg_level != SOL_IP && + cm->cmsg_type != IP_RECVERR) + error(1, 0, "cmsg"); + + serr = (void *) CMSG_DATA(cm); + if (serr->ee_errno != 0 || + serr->ee_origin != SO_EE_ORIGIN_ZEROCOPY) + error(1, 0, "serr"); + + printf("completed: %u..%u\n", serr->ee_info, serr->ee_data); + + +Deferred copies +~~~~~~~~~~~~~~~ + +Passing flag MSG_ZEROCOPY is a hint to the kernel to apply copy +avoidance, and a contract that the kernel will queue a completion +notification. It is not a guarantee that the copy is elided. + +Copy avoidance is not always feasible. Devices that do not support +scatter-gather I/O cannot send packets made up of kernel generated +protocol headers plus zerocopy user data. A packet may need to be +converted to a private copy of data deep in the stack, say to compute +a checksum. + +In all these cases, the kernel returns a completion notification when +it releases its hold on the shared pages. That notification may arrive +before the (copied) data is fully transmitted. A zerocopy completion +notification is not a transmit completion notification, therefore. + +Deferred copies can be more expensive than a copy immediately in the +system call, if the data is no longer warm in the cache. The process +also incurs notification processing cost for no benefit. For this +reason, the kernel signals if data was completed with a copy, by +setting flag SO_EE_CODE_ZEROCOPY_COPIED in field ee_code on return. +A process may use this signal to stop passing flag MSG_ZEROCOPY on +subsequent requests on the same socket. + + +Implementation +============== + +Loopback +-------- + +Data sent to local sockets can be queued indefinitely if the receive +process does not read its socket. Unbound notification latency is not +acceptable. For this reason all packets generated with MSG_ZEROCOPY +that are looped to a local socket will incur a deferred copy. This +includes looping onto packet sockets (e.g., tcpdump) and tun devices. + + +Testing +======= + +More realistic example code can be found in the kernel source under +tools/testing/selftests/net/msg_zerocopy.c. + +Be cognizant of the loopback constraint. The test can be run between +a pair of hosts. But if run between a local pair of processes, for +instance when run with msg_zerocopy.sh between a veth pair across +namespaces, the test will not show any improvement. For testing, the +loopback restriction can be temporarily relaxed by making +skb_orphan_frags_rx identical to skb_orphan_frags. diff --git a/Documentation/networking/netdev-FAQ.txt b/Documentation/networking/netdev-FAQ.txt index 247a30ba8e17..cfc66ea72329 100644 --- a/Documentation/networking/netdev-FAQ.txt +++ b/Documentation/networking/netdev-FAQ.txt @@ -111,6 +111,14 @@ A: Generally speaking, the patches get triaged quickly (in less than 48h). patch is a good way to ensure your patch is ignored or pushed to the bottom of the priority list. +Q: I submitted multiple versions of the patch series, should I directly update + patchwork for the previous versions of these patch series? + +A: No, please don't interfere with the patch status on patchwork, leave it to + the maintainer to figure out what is the most recent and current version that + should be applied. If there is any doubt, the maintainer will reply and ask + what should be done. + Q: How can I tell what patches are queued up for backporting to the various stable releases? diff --git a/Documentation/networking/netvsc.txt b/Documentation/networking/netvsc.txt new file mode 100644 index 000000000000..93560fb1170a --- /dev/null +++ b/Documentation/networking/netvsc.txt @@ -0,0 +1,75 @@ +Hyper-V network driver +====================== + +Compatibility +============= + +This driver is compatible with Windows Server 2012 R2, 2016 and +Windows 10. + +Features +======== + + Checksum offload + ---------------- + The netvsc driver supports checksum offload as long as the + Hyper-V host version does. Windows Server 2016 and Azure + support checksum offload for TCP and UDP for both IPv4 and + IPv6. Windows Server 2012 only supports checksum offload for TCP. + + Receive Side Scaling + -------------------- + Hyper-V supports receive side scaling. For TCP, packets are + distributed among available queues based on IP address and port + number. + + For UDP, we can switch UDP hash level between L3 and L4 by ethtool + command. UDP over IPv4 and v6 can be set differently. The default + hash level is L4. We currently only allow switching TX hash level + from within the guests. + + On Azure, fragmented UDP packets have high loss rate with L4 + hashing. Using L3 hashing is recommended in this case. + + For example, for UDP over IPv4 on eth0: + To include UDP port numbers in hashing: + ethtool -N eth0 rx-flow-hash udp4 sdfn + To exclude UDP port numbers in hashing: + ethtool -N eth0 rx-flow-hash udp4 sd + To show UDP hash level: + ethtool -n eth0 rx-flow-hash udp4 + + Generic Receive Offload, aka GRO + -------------------------------- + The driver supports GRO and it is enabled by default. GRO coalesces + like packets and significantly reduces CPU usage under heavy Rx + load. + + SR-IOV support + -------------- + Hyper-V supports SR-IOV as a hardware acceleration option. If SR-IOV + is enabled in both the vSwitch and the guest configuration, then the + Virtual Function (VF) device is passed to the guest as a PCI + device. In this case, both a synthetic (netvsc) and VF device are + visible in the guest OS and both NIC's have the same MAC address. + + The VF is enslaved by netvsc device. The netvsc driver will transparently + switch the data path to the VF when it is available and up. + Network state (addresses, firewall, etc) should be applied only to the + netvsc device; the slave device should not be accessed directly in + most cases. The exceptions are if some special queue discipline or + flow direction is desired, these should be applied directly to the + VF slave device. + + Receive Buffer + -------------- + Packets are received into a receive area which is created when device + is probed. The receive area is broken into MTU sized chunks and each may + contain one or more packets. The number of receive sections may be changed + via ethtool Rx ring parameters. + + There is a similar send buffer which is used to aggregate packets for sending. + The send area is broken into chunks of 6144 bytes, each of section may + contain one or more packets. The send buffer is an optimization, the driver + will use slower method to handle very large packets or if the send buffer + area is exhausted. diff --git a/Documentation/networking/nf_conntrack-sysctl.txt b/Documentation/networking/nf_conntrack-sysctl.txt index 497d668288f9..433b6724797a 100644 --- a/Documentation/networking/nf_conntrack-sysctl.txt +++ b/Documentation/networking/nf_conntrack-sysctl.txt @@ -96,17 +96,6 @@ nf_conntrack_max - INTEGER Size of connection tracking table. Default value is nf_conntrack_buckets value * 4. -nf_conntrack_default_on - BOOLEAN - 0 - don't register conntrack in new net namespaces - 1 - register conntrack in new net namespaces (default) - - This controls wheter newly created network namespaces have connection - tracking enabled by default. It will be enabled automatically - regardless of this setting if the new net namespace requires - connection tracking, e.g. when NAT rules are created. - This setting is only visible in initial user namespace, it has no - effect on existing namespaces. - nf_conntrack_tcp_be_liberal - BOOLEAN 0 - disabled (default) not 0 - enabled diff --git a/Documentation/networking/rmnet.txt b/Documentation/networking/rmnet.txt new file mode 100644 index 000000000000..6b341eaf2062 --- /dev/null +++ b/Documentation/networking/rmnet.txt @@ -0,0 +1,82 @@ +1. Introduction + +rmnet driver is used for supporting the Multiplexing and aggregation +Protocol (MAP). This protocol is used by all recent chipsets using Qualcomm +Technologies, Inc. modems. + +This driver can be used to register onto any physical network device in +IP mode. Physical transports include USB, HSIC, PCIe and IP accelerator. + +Multiplexing allows for creation of logical netdevices (rmnet devices) to +handle multiple private data networks (PDN) like a default internet, tethering, +multimedia messaging service (MMS) or IP media subsystem (IMS). Hardware sends +packets with MAP headers to rmnet. Based on the multiplexer id, rmnet +routes to the appropriate PDN after removing the MAP header. + +Aggregation is required to achieve high data rates. This involves hardware +sending aggregated bunch of MAP frames. rmnet driver will de-aggregate +these MAP frames and send them to appropriate PDN's. + +2. Packet format + +a. MAP packet (data / control) + +MAP header has the same endianness of the IP packet. + +Packet format - + +Bit 0 1 2-7 8 - 15 16 - 31 +Function Command / Data Reserved Pad Multiplexer ID Payload length +Bit 32 - x +Function Raw Bytes + +Command (1)/ Data (0) bit value is to indicate if the packet is a MAP command +or data packet. Control packet is used for transport level flow control. Data +packets are standard IP packets. + +Reserved bits are usually zeroed out and to be ignored by receiver. + +Padding is number of bytes to be added for 4 byte alignment if required by +hardware. + +Multiplexer ID is to indicate the PDN on which data has to be sent. + +Payload length includes the padding length but does not include MAP header +length. + +b. MAP packet (command specific) + +Bit 0 1 2-7 8 - 15 16 - 31 +Function Command Reserved Pad Multiplexer ID Payload length +Bit 32 - 39 40 - 45 46 - 47 48 - 63 +Function Command name Reserved Command Type Reserved +Bit 64 - 95 +Function Transaction ID +Bit 96 - 127 +Function Command data + +Command 1 indicates disabling flow while 2 is enabling flow + +Command types - +0 for MAP command request +1 is to acknowledge the receipt of a command +2 is for unsupported commands +3 is for error during processing of commands + +c. Aggregation + +Aggregation is multiple MAP packets (can be data or command) delivered to +rmnet in a single linear skb. rmnet will process the individual +packets and either ACK the MAP command or deliver the IP packet to the +network stack as needed + +MAP header|IP Packet|Optional padding|MAP header|IP Packet|Optional padding.... +MAP header|IP Packet|Optional padding|MAP header|Command Packet|Optional pad... + +3. Userspace configuration + +rmnet userspace configuration is done through netlink library librmnetctl +and command line utility rmnetcli. Utility is hosted in codeaurora forum git. +The driver uses rtnl_link_ops for communication. + +https://source.codeaurora.org/quic/la/platform/vendor/qcom-opensource/dataservices/tree/rmnetctl diff --git a/Documentation/networking/rxrpc.txt b/Documentation/networking/rxrpc.txt index 8c70ba5dee4d..810620153a44 100644 --- a/Documentation/networking/rxrpc.txt +++ b/Documentation/networking/rxrpc.txt @@ -818,10 +818,15 @@ The kernel interface functions are as follows: (*) Send data through a call. + typedef void (*rxrpc_notify_end_tx_t)(struct sock *sk, + unsigned long user_call_ID, + struct sk_buff *skb); + int rxrpc_kernel_send_data(struct socket *sock, struct rxrpc_call *call, struct msghdr *msg, - size_t len); + size_t len, + rxrpc_notify_end_tx_t notify_end_rx); This is used to supply either the request part of a client call or the reply part of a server call. msg.msg_iovlen and msg.msg_iov specify the @@ -832,6 +837,11 @@ The kernel interface functions are as follows: The msg must not specify a destination address, control data or any flags other than MSG_MORE. len is the total amount of data to transmit. + notify_end_rx can be NULL or it can be used to specify a function to be + called when the call changes state to end the Tx phase. This function is + called with the call-state spinlock held to prevent any reply or final ACK + from being delivered first. + (*) Receive data from a call. int rxrpc_kernel_recv_data(struct socket *sock, @@ -965,6 +975,51 @@ The kernel interface functions are as follows: size should be set when the call is begun. tx_total_len may not be less than zero. + (*) Check to see the completion state of a call so that the caller can assess + whether it needs to be retried. + + enum rxrpc_call_completion { + RXRPC_CALL_SUCCEEDED, + RXRPC_CALL_REMOTELY_ABORTED, + RXRPC_CALL_LOCALLY_ABORTED, + RXRPC_CALL_LOCAL_ERROR, + RXRPC_CALL_NETWORK_ERROR, + }; + + int rxrpc_kernel_check_call(struct socket *sock, struct rxrpc_call *call, + enum rxrpc_call_completion *_compl, + u32 *_abort_code); + + On return, -EINPROGRESS will be returned if the call is still ongoing; if + it is finished, *_compl will be set to indicate the manner of completion, + *_abort_code will be set to any abort code that occurred. 0 will be + returned on a successful completion, -ECONNABORTED will be returned if the + client failed due to a remote abort and anything else will return an + appropriate error code. + + The caller should look at this information to decide if it's worth + retrying the call. + + (*) Retry a client call. + + int rxrpc_kernel_retry_call(struct socket *sock, + struct rxrpc_call *call, + struct sockaddr_rxrpc *srx, + struct key *key); + + This attempts to partially reinitialise a call and submit it again whilst + reusing the original call's Tx queue to avoid the need to repackage and + re-encrypt the data to be sent. call indicates the call to retry, srx the + new address to send it to and key the encryption key to use for signing or + encrypting the packets. + + For this to work, the first Tx data packet must still be in the transmit + queue, and currently this is only permitted for local and network errors + and the call must not have been aborted. Any partially constructed Tx + packet is left as is and can continue being filled afterwards. + + It returns 0 if the call was requeued and an error otherwise. + ======================= CONFIGURABLE PARAMETERS diff --git a/Documentation/networking/strparser.txt b/Documentation/networking/strparser.txt index a0bf573dfa61..13081b3decef 100644 --- a/Documentation/networking/strparser.txt +++ b/Documentation/networking/strparser.txt @@ -1,45 +1,107 @@ -Stream Parser -------------- +Stream Parser (strparser) + +Introduction +============ The stream parser (strparser) is a utility that parses messages of an -application layer protocol running over a TCP connection. The stream +application layer protocol running over a data stream. The stream parser works in conjunction with an upper layer in the kernel to provide kernel support for application layer messages. For instance, Kernel Connection Multiplexor (KCM) uses the Stream Parser to parse messages using a BPF program. +The strparser works in one of two modes: receive callback or general +mode. + +In receive callback mode, the strparser is called from the data_ready +callback of a TCP socket. Messages are parsed and delivered as they are +received on the socket. + +In general mode, a sequence of skbs are fed to strparser from an +outside source. Message are parsed and delivered as the sequence is +processed. This modes allows strparser to be applied to arbitrary +streams of data. + Interface ---------- +========= The API includes a context structure, a set of callbacks, utility -functions, and a data_ready function. The callbacks include -a parse_msg function that is called to perform parsing (e.g. -BPF parsing in case of KCM), and a rcv_msg function that is called -when a full message has been completed. +functions, and a data_ready function for receive callback mode. The +callbacks include a parse_msg function that is called to perform +parsing (e.g. BPF parsing in case of KCM), and a rcv_msg function +that is called when a full message has been completed. + +Functions +========= + +strp_init(struct strparser *strp, struct sock *sk, + const struct strp_callbacks *cb) + + Called to initialize a stream parser. strp is a struct of type + strparser that is allocated by the upper layer. sk is the TCP + socket associated with the stream parser for use with receive + callback mode; in general mode this is set to NULL. Callbacks + are called by the stream parser (the callbacks are listed below). + +void strp_pause(struct strparser *strp) + + Temporarily pause a stream parser. Message parsing is suspended + and no new messages are delivered to the upper layer. + +void strp_pause(struct strparser *strp) + + Unpause a paused stream parser. + +void strp_stop(struct strparser *strp); + + strp_stop is called to completely stop stream parser operations. + This is called internally when the stream parser encounters an + error, and it is called from the upper layer to stop parsing + operations. + +void strp_done(struct strparser *strp); + + strp_done is called to release any resources held by the stream + parser instance. This must be called after the stream processor + has been stopped. -A stream parser can be instantiated for a TCP connection. This is done -by: +int strp_process(struct strparser *strp, struct sk_buff *orig_skb, + unsigned int orig_offset, size_t orig_len, + size_t max_msg_size, long timeo) -strp_init(struct strparser *strp, struct sock *csk, - struct strp_callbacks *cb) + strp_process is called in general mode for a stream parser to + parse an sk_buff. The number of bytes processed or a negative + error number is returned. Note that strp_process does not + consume the sk_buff. max_msg_size is maximum size the stream + parser will parse. timeo is timeout for completing a message. -strp is a struct of type strparser that is allocated by the upper layer. -csk is the TCP socket associated with the stream parser. Callbacks are -called by the stream parser. +void strp_data_ready(struct strparser *strp); + + The upper layer calls strp_tcp_data_ready when data is ready on + the lower socket for strparser to process. This should be called + from a data_ready callback that is set on the socket. Note that + maximum messages size is the limit of the receive socket + buffer and message timeout is the receive timeout for the socket. + +void strp_check_rcv(struct strparser *strp); + + strp_check_rcv is called to check for new messages on the socket. + This is normally called at initialization of a stream parser + instance or after strp_unpause. Callbacks ---------- +========= -There are four callbacks: +There are six callbacks: int (*parse_msg)(struct strparser *strp, struct sk_buff *skb); parse_msg is called to determine the length of the next message in the stream. The upper layer must implement this function. It should parse the sk_buff as containing the headers for the - next application layer messages in the stream. + next application layer message in the stream. - The skb->cb in the input skb is a struct strp_rx_msg. Only + The skb->cb in the input skb is a struct strp_msg. Only the offset field is relevant in parse_msg and gives the offset where the message starts in the skb. @@ -50,26 +112,41 @@ int (*parse_msg)(struct strparser *strp, struct sk_buff *skb); -ESTRPIPE : current message should not be processed by the kernel, return control of the socket to userspace which can proceed to read the messages itself - other < 0 : Error is parsing, give control back to userspace + other < 0 : Error in parsing, give control back to userspace assuming that synchronization is lost and the stream is unrecoverable (application expected to close TCP socket) In the case that an error is returned (return value is less than - zero) the stream parser will set the error on TCP socket and wake - it up. If parse_msg returned -ESTRPIPE and the stream parser had - previously read some bytes for the current message, then the error - set on the attached socket is ENODATA since the stream is - unrecoverable in that case. + zero) and the parser is in receive callback mode, then it will set + the error on TCP socket and wake it up. If parse_msg returned + -ESTRPIPE and the stream parser had previously read some bytes for + the current message, then the error set on the attached socket is + ENODATA since the stream is unrecoverable in that case. + +void (*lock)(struct strparser *strp) + + The lock callback is called to lock the strp structure when + the strparser is performing an asynchronous operation (such as + processing a timeout). In receive callback mode the default + function is to lock_sock for the associated socket. In general + mode the callback must be set appropriately. + +void (*unlock)(struct strparser *strp) + + The unlock callback is called to release the lock obtained + by the lock callback. In receive callback mode the default + function is release_sock for the associated socket. In general + mode the callback must be set appropriately. void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb); rcv_msg is called when a full message has been received and is queued. The callee must consume the sk_buff; it can call strp_pause to prevent any further messages from being - received in rcv_msg (see strp_pause below). This callback + received in rcv_msg (see strp_pause above). This callback must be set. - The skb->cb in the input skb is a struct strp_rx_msg. This + The skb->cb in the input skb is a struct strp_msg. This struct contains two fields: offset and full_len. Offset is where the message starts in the skb, and full_len is the the length of the message. skb->len - offset may be greater @@ -78,59 +155,53 @@ void (*rcv_msg)(struct strparser *strp, struct sk_buff *skb); int (*read_sock_done)(struct strparser *strp, int err); read_sock_done is called when the stream parser is done reading - the TCP socket. The stream parser may read multiple messages - in a loop and this function allows cleanup to occur when existing - the loop. If the callback is not set (NULL in strp_init) a - default function is used. + the TCP socket in receive callback mode. The stream parser may + read multiple messages in a loop and this function allows cleanup + to occur when exiting the loop. If the callback is not set (NULL + in strp_init) a default function is used. void (*abort_parser)(struct strparser *strp, int err); This function is called when stream parser encounters an error - in parsing. The default function stops the stream parser for the - TCP socket and sets the error in the socket. The default function - can be changed by setting the callback to non-NULL in strp_init. + in parsing. The default function stops the stream parser and + sets the error in the socket if the parser is in receive callback + mode. The default function can be changed by setting the callback + to non-NULL in strp_init. -Functions ---------- +Statistics +========== -The upper layer calls strp_tcp_data_ready when data is ready on the lower -socket for strparser to process. This should be called from a data_ready -callback that is set on the socket. +Various counters are kept for each stream parser instance. These are in +the strp_stats structure. strp_aggr_stats is a convenience structure for +accumulating statistics for multiple stream parser instances. +save_strp_stats and aggregate_strp_stats are helper functions to save +and aggregate statistics. -strp_stop is called to completely stop stream parser operations. This -is called internally when the stream parser encounters an error, and -it is called from the upper layer when unattaching a TCP socket. +Message assembly limits +======================= -strp_done is called to unattach the stream parser from the TCP socket. -This must be called after the stream processor has be stopped. +The stream parser provide mechanisms to limit the resources consumed by +message assembly. -strp_check_rcv is called to check for new messages on the socket. This -is normally called at initialization of the a stream parser instance -of after strp_unpause. +A timer is set when assembly starts for a new message. In receive +callback mode the message timeout is taken from rcvtime for the +associated TCP socket. In general mode, the timeout is passed as an +argument in strp_process. If the timer fires before assembly completes +the stream parser is aborted and the ETIMEDOUT error is set on the TCP +socket if in receive callback mode. -Statistics ----------- +In receive callback mode, message length is limited to the receive +buffer size of the associated TCP socket. If the length returned by +parse_msg is greater than the socket buffer size then the stream parser +is aborted with EMSGSIZE error set on the TCP socket. Note that this +makes the maximum size of receive skbuffs for a socket with a stream +parser to be 2*sk_rcvbuf of the TCP socket. -Various counters are kept for each stream parser for a TCP socket. -These are in the strp_stats structure. strp_aggr_stats is a convenience -structure for accumulating statistics for multiple stream parser -instances. save_strp_stats and aggregate_strp_stats are helper functions -to save and aggregate statistics. +In general mode the message length limit is passed in as an argument +to strp_process. -Message assembly limits ------------------------ +Author +====== -The stream parser provide mechanisms to limit the resources consumed by -message assembly. +Tom Herbert (tom@quantonium.net) -A timer is set when assembly starts for a new message. The message -timeout is taken from rcvtime for the associated TCP socket. If the -timer fires before assembly completes the stream parser is aborted -and the ETIMEDOUT error is set on the TCP socket. - -Message length is limited to the receive buffer size of the associated -TCP socket. If the length returned by parse_msg is greater than -the socket buffer size then the stream parser is aborted with -EMSGSIZE error set on the TCP socket. Note that this makes the -maximum size of receive skbuffs for a socket with a stream parser -to be 2*sk_rcvbuf of the TCP socket. diff --git a/Documentation/networking/switchdev.txt b/Documentation/networking/switchdev.txt index 5e40e1f68873..82236a17b5e6 100644 --- a/Documentation/networking/switchdev.txt +++ b/Documentation/networking/switchdev.txt @@ -13,42 +13,42 @@ an example setup using a data-center-class switch ASIC chip. Other setups with SR-IOV or soft switches, such as OVS, are possible. - User-space tools - - user space | - +-------------------------------------------------------------------+ - kernel | Netlink - | - +--------------+-------------------------------+ - | Network stack | - | (Linux) | - | | - +----------------------------------------------+ + User-space tools + + user space | + +-------------------------------------------------------------------+ + kernel | Netlink + | + +--------------+-------------------------------+ + | Network stack | + | (Linux) | + | | + +----------------------------------------------+ sw1p2 sw1p4 sw1p6 - sw1p1 + sw1p3 + sw1p5 + eth1 - + | + | + | + - | | | | | | | - +--+----+----+----+-+--+----+---+ +-----+-----+ - | Switch driver | | mgmt | - | (this document) | | driver | - | | | | - +--------------+----------------+ +-----------+ - | - kernel | HW bus (eg PCI) - +-------------------------------------------------------------------+ - hardware | - +--------------+---+------------+ - | Switch device (sw1) | - | +----+ +--------+ - | | v offloaded data path | mgmt port - | | | | - +--|----|----+----+----+----+---+ - | | | | | | - + + + + + + - p1 p2 p3 p4 p5 p6 - - front-panel ports + sw1p1 + sw1p3 + sw1p5 + eth1 + + | + | + | + + | | | | | | | + +--+----+----+----+----+----+---+ +-----+-----+ + | Switch driver | | mgmt | + | (this document) | | driver | + | | | | + +--------------+----------------+ +-----------+ + | + kernel | HW bus (eg PCI) + +-------------------------------------------------------------------+ + hardware | + +--------------+----------------+ + | Switch device (sw1) | + | +----+ +--------+ + | | v offloaded data path | mgmt port + | | | | + +--|----|----+----+----+----+---+ + | | | | | | + + + + + + + + p1 p2 p3 p4 p5 p6 + + front-panel ports Fig 1. |