diff options
Diffstat (limited to 'Documentation/networking')
-rw-r--r-- | Documentation/networking/00-INDEX | 2 | ||||
-rw-r--r-- | Documentation/networking/NAPI_HOWTO.txt | 26 | ||||
-rw-r--r-- | Documentation/networking/cs89x0.txt | 6 | ||||
-rw-r--r-- | Documentation/networking/dccp.txt | 88 | ||||
-rw-r--r-- | Documentation/networking/e1000.txt | 451 | ||||
-rw-r--r-- | Documentation/networking/generic_netlink.txt | 3 | ||||
-rw-r--r-- | Documentation/networking/ip-sysctl.txt | 347 | ||||
-rw-r--r-- | Documentation/networking/iphase.txt | 2 | ||||
-rw-r--r-- | Documentation/networking/packet_mmap.txt | 2 | ||||
-rw-r--r-- | Documentation/networking/phy.txt | 13 | ||||
-rw-r--r-- | Documentation/networking/pktgen.txt | 6 | ||||
-rw-r--r-- | Documentation/networking/proc_net_tcp.txt | 2 | ||||
-rw-r--r-- | Documentation/networking/sk98lin.txt | 2 | ||||
-rw-r--r-- | Documentation/networking/slicecom.txt | 2 | ||||
-rw-r--r-- | Documentation/networking/udplite.txt | 281 | ||||
-rw-r--r-- | Documentation/networking/wan-router.txt | 8 | ||||
-rw-r--r-- | Documentation/networking/xfrm_sync.txt | 5 |
17 files changed, 842 insertions, 404 deletions
diff --git a/Documentation/networking/00-INDEX b/Documentation/networking/00-INDEX index b1181ce232d9..e06b6e3c1db5 100644 --- a/Documentation/networking/00-INDEX +++ b/Documentation/networking/00-INDEX @@ -58,6 +58,8 @@ fore200e.txt - FORE Systems PCA-200E/SBA-200E ATM NIC driver info. framerelay.txt - info on using Frame Relay/Data Link Connection Identifier (DLCI). +generic_netlink.txt + - info on Generic Netlink ip-sysctl.txt - /proc/sys/net/ipv4/* variables ip_dynaddr.txt diff --git a/Documentation/networking/NAPI_HOWTO.txt b/Documentation/networking/NAPI_HOWTO.txt index 93af3e87c65b..fb8dc6422a52 100644 --- a/Documentation/networking/NAPI_HOWTO.txt +++ b/Documentation/networking/NAPI_HOWTO.txt @@ -95,8 +95,8 @@ There are two types of event register ACK mechanisms. Move all to dev->poll() C) Ability to detect new work correctly. -NAPI works by shutting down event interrupts when theres work and -turning them on when theres none. +NAPI works by shutting down event interrupts when there's work and +turning them on when there's none. New packets might show up in the small window while interrupts were being re-enabled (refer to appendix 2). A packet might sneak in during the period we are enabling interrupts. We only get to know about such a packet when the @@ -114,7 +114,7 @@ Locking rules and environmental guarantees only one CPU can pick the initial interrupt and hence the initial netif_rx_schedule(dev); - The core layer invokes devices to send packets in a round robin format. -This implies receive is totaly lockless because of the guarantee only that +This implies receive is totally lockless because of the guarantee that only one CPU is executing it. - contention can only be the result of some other CPU accessing the rx ring. This happens only in close() and suspend() (when these methods @@ -510,7 +510,7 @@ static int my_poll (struct net_device *dev, int *budget) an interrupt will be generated */ goto done; } - /* done! at least thats what it looks like ;-> + /* done! at least that's what it looks like ;-> if new packets came in after our last check on status bits they'll be caught by the while check and we go back and clear them since we havent exceeded our quota */ @@ -535,11 +535,11 @@ done: * 1. it can race with disabling irqs in irq handler (which are done to * schedule polls) * 2. it can race with dis/enabling irqs in other poll threads - * 3. if an irq raised after the begining of the outer beginning - * loop(marked in the code above), it will be immediately + * 3. if an irq raised after the beginning of the outer beginning + * loop (marked in the code above), it will be immediately * triggered here. * - * Summarizing: the logic may results in some redundant irqs both + * Summarizing: the logic may result in some redundant irqs both * due to races in masking and due to too late acking of already * processed irqs. The good news: no events are ever lost. */ @@ -601,7 +601,7 @@ a) 5) dev->close() and dev->suspend() issues ========================================== -The driver writter neednt worry about this. The top net layer takes +The driver writer needn't worry about this; the top net layer takes care of it. 6) Adding new Stats to /proc @@ -622,9 +622,9 @@ FC should be programmed to apply in the case when the system cant pull out packets fast enough i.e send a pause only when you run out of rx buffers. Note FC in itself is a good solution but we have found it to not be much of a commodity feature (both in NICs and switches) and hence falls -under the same category as using NIC based mitigation. Also experiments -indicate that its much harder to resolve the resource allocation -issue (aka lazy receiving that NAPI offers) and hence quantify its usefullness +under the same category as using NIC based mitigation. Also, experiments +indicate that it's much harder to resolve the resource allocation +issue (aka lazy receiving that NAPI offers) and hence quantify its usefulness proved harder. In any case, FC works even better with NAPI but is not necessary. @@ -678,10 +678,10 @@ routine: CSR5 bit of interest is only the rx status. If you look at the last if statement: you just finished grabbing all the packets from the rx ring .. you check if -status bit says theres more packets just in ... it says none; you then +status bit says there are more packets just in ... it says none; you then enable rx interrupts again; if a new packet just came in during this check, we are counting that CSR5 will be set in that small window of opportunity -and that by re-enabling interrupts, we would actually triger an interrupt +and that by re-enabling interrupts, we would actually trigger an interrupt to register the new packet for processing. [The above description nay be very verbose, if you have better wording diff --git a/Documentation/networking/cs89x0.txt b/Documentation/networking/cs89x0.txt index 64896470e279..6387d3decf85 100644 --- a/Documentation/networking/cs89x0.txt +++ b/Documentation/networking/cs89x0.txt @@ -248,7 +248,7 @@ c) The driver's hardware probe routine is designed to avoid with device probing. To avoid this behaviour, add one to the `io=' module parameter. This doesn't actually change the I/O address, but it is a flag to tell the driver - topartially initialise the hardware before trying to + to partially initialise the hardware before trying to identify the card. This could be dangerous if you are not sure that there is a cs89x0 card at the provided address. @@ -620,8 +620,8 @@ I/O Address Device IRQ Device 12 Mouse (PS/2) Memory Address Device 13 Math Coprocessor -------------- --------------------- 14 Hard Disk controller -A000-BFFF EGA Graphics Adpater -A000-C7FF VGA Graphics Adpater +A000-BFFF EGA Graphics Adapter +A000-C7FF VGA Graphics Adapter B000-BFFF Mono Graphics Adapter B800-BFFF Color Graphics Adapter E000-FFFF AT BIOS diff --git a/Documentation/networking/dccp.txt b/Documentation/networking/dccp.txt index 74563b38ffd9..387482e46c47 100644 --- a/Documentation/networking/dccp.txt +++ b/Documentation/networking/dccp.txt @@ -19,40 +19,92 @@ for real time and multimedia traffic. It has a base protocol and pluggable congestion control IDs (CCIDs). -It is at draft RFC status and the homepage for DCCP as a protocol is at: - http://www.icir.org/kohler/dcp/ +It is at proposed standard RFC status and the homepage for DCCP as a protocol +is at: + http://www.read.cs.ucla.edu/dccp/ Missing features ================ The DCCP implementation does not currently have all the features that are in -the draft RFC. +the RFC. -In particular the following are missing: -- CCID2 support -- feature negotiation - -When testing against other implementations it appears that elapsed time -options are not coded compliant to the specification. +The known bugs are at: + http://linux-net.osdl.org/index.php/TODO#DCCP Socket options ============== -DCCP_SOCKOPT_PACKET_SIZE is used for CCID3 to set default packet size for -calculations. - DCCP_SOCKOPT_SERVICE sets the service. The specification mandates use of service codes (RFC 4340, sec. 8.1.2); if this socket option is not set, the socket will fall back to 0 (which means that no meaningful service code is present). Connecting sockets set at most one service option; for listening sockets, multiple service codes can be specified. +DCCP_SOCKOPT_SEND_CSCOV and DCCP_SOCKOPT_RECV_CSCOV are used for setting the +partial checksum coverage (RFC 4340, sec. 9.2). The default is that checksums +always cover the entire packet and that only fully covered application data is +accepted by the receiver. Hence, when using this feature on the sender, it must +be enabled at the receiver, too with suitable choice of CsCov. + +DCCP_SOCKOPT_SEND_CSCOV sets the sender checksum coverage. Values in the + range 0..15 are acceptable. The default setting is 0 (full coverage), + values between 1..15 indicate partial coverage. +DCCP_SOCKOPT_SEND_CSCOV is for the receiver and has a different meaning: it + sets a threshold, where again values 0..15 are acceptable. The default + of 0 means that all packets with a partial coverage will be discarded. + Values in the range 1..15 indicate that packets with minimally such a + coverage value are also acceptable. The higher the number, the more + restrictive this setting (see [RFC 4340, sec. 9.2.1]). + +Sysctl variables +================ +Several DCCP default parameters can be managed by the following sysctls +(sysctl net.dccp.default or /proc/sys/net/dccp/default): + +request_retries + The number of active connection initiation retries (the number of + Requests minus one) before timing out. In addition, it also governs + the behaviour of the other, passive side: this variable also sets + the number of times DCCP repeats sending a Response when the initial + handshake does not progress from RESPOND to OPEN (i.e. when no Ack + is received after the initial Request). This value should be greater + than 0, suggested is less than 10. Analogue of tcp_syn_retries. + +retries1 + How often a DCCP Response is retransmitted until the listening DCCP + side considers its connecting peer dead. Analogue of tcp_retries1. + +retries2 + The number of times a general DCCP packet is retransmitted. This has + importance for retransmitted acknowledgments and feature negotiation, + data packets are never retransmitted. Analogue of tcp_retries2. + +send_ndp = 1 + Whether or not to send NDP count options (sec. 7.7.2). + +send_ackvec = 1 + Whether or not to send Ack Vector options (sec. 11.5). + +ack_ratio = 2 + The default Ack Ratio (sec. 11.3) to use. + +tx_ccid = 2 + Default CCID for the sender-receiver half-connection. + +rx_ccid = 2 + Default CCID for the receiver-sender half-connection. + +seq_window = 100 + The initial sequence window (sec. 7.5.2). + +tx_qlen = 5 + The size of the transmit buffer in packets. A value of 0 corresponds + to an unbounded transmit buffer. + Notes ===== -SELinux does not yet have support for DCCP. You will need to turn it off or -else you will get EACCES. - -DCCP does not travel through NAT successfully at present. This is because -the checksum covers the psuedo-header as per TCP and UDP. It should be -relatively trivial to add Linux NAT support for DCCP. +DCCP does not travel through NAT successfully at present on many boxes. This is +because the checksum covers the psuedo-header as per TCP and UDP. Linux NAT +support for DCCP has been added. diff --git a/Documentation/networking/e1000.txt b/Documentation/networking/e1000.txt index 5c0a5cc03998..61b171cf5313 100644 --- a/Documentation/networking/e1000.txt +++ b/Documentation/networking/e1000.txt @@ -1,7 +1,7 @@ Linux* Base Driver for the Intel(R) PRO/1000 Family of Adapters =============================================================== -November 15, 2005 +September 26, 2006 Contents @@ -9,6 +9,7 @@ Contents - In This Release - Identifying Your Adapter +- Building and Installation - Command Line Parameters - Speed and Duplex Configuration - Additional Configurations @@ -41,6 +42,9 @@ or later), lspci, and ifconfig to obtain the same information. Instructions on updating ethtool can be found in the section "Additional Configurations" later in this document. +NOTE: The Intel(R) 82562v 10/100 Network Connection only provides 10/100 +support. + Identifying Your Adapter ======================== @@ -51,28 +55,27 @@ Driver ID Guide at: http://support.intel.com/support/network/adapter/pro100/21397.htm For the latest Intel network drivers for Linux, refer to the following -website. In the search field, enter your adapter name or type, or use the +website. In the search field, enter your adapter name or type, or use the networking link on the left to search for your adapter: http://downloadfinder.intel.com/scripts-df/support_intel.asp -Command Line Parameters ======================= +Command Line Parameters +======================= If the driver is built as a module, the following optional parameters -are used by entering them on the command line with the modprobe or insmod -command using this syntax: +are used by entering them on the command line with the modprobe command +using this syntax: modprobe e1000 [<option>=<VAL1>,<VAL2>,...] - insmod e1000 [<option>=<VAL1>,<VAL2>,...] - For example, with two PRO/1000 PCI adapters, entering: - insmod e1000 TxDescriptors=80,128 + modprobe e1000 TxDescriptors=80,128 -loads the e1000 driver with 80 TX descriptors for the first adapter and 128 -TX descriptors for the second adapter. +loads the e1000 driver with 80 TX descriptors for the first adapter and +128 TX descriptors for the second adapter. The default value for each parameter is generally the recommended setting, unless otherwise noted. @@ -87,7 +90,7 @@ NOTES: For more information about the AutoNeg, Duplex, and Speed http://www.intel.com/design/network/applnots/ap450.htm A descriptor describes a data buffer and attributes related to - the data buffer. This information is accessed by the hardware. + the data buffer. This information is accessed by the hardware. AutoNeg @@ -96,9 +99,9 @@ AutoNeg Valid Range: 0x01-0x0F, 0x20-0x2F Default Value: 0x2F -This parameter is a bit mask that specifies which speed and duplex -settings the board advertises. When this parameter is used, the Speed -and Duplex parameters must not be specified. +This parameter is a bit-mask that specifies the speed and duplex settings +advertised by the adapter. When this parameter is used, the Speed and +Duplex parameters must not be specified. NOTE: Refer to the Speed and Duplex section of this readme for more information on the AutoNeg parameter. @@ -110,14 +113,15 @@ Duplex Valid Range: 0-2 (0=auto-negotiate, 1=half, 2=full) Default Value: 0 -Defines the direction in which data is allowed to flow. Can be either -one or two-directional. If both Duplex and the link partner are set to -auto-negotiate, the board auto-detects the correct duplex. If the link -partner is forced (either full or half), Duplex defaults to half-duplex. +This defines the direction in which data is allowed to flow. Can be +either one or two-directional. If both Duplex and the link partner are +set to auto-negotiate, the board auto-detects the correct duplex. If the +link partner is forced (either full or half), Duplex defaults to half- +duplex. FlowControl ----------- +----------- Valid Range: 0-3 (0=none, 1=Rx only, 2=Tx only, 3=Rx&Tx) Default Value: Reads flow control settings from the EEPROM @@ -127,57 +131,107 @@ to Ethernet PAUSE frames. InterruptThrottleRate --------------------- -(not supported on Intel 82542, 82543 or 82544-based adapters) -Valid Range: 100-100000 (0=off, 1=dynamic) -Default Value: 8000 - -This value represents the maximum number of interrupts per second the -controller generates. InterruptThrottleRate is another setting used in -interrupt moderation. Dynamic mode uses a heuristic algorithm to adjust -InterruptThrottleRate based on the current traffic load. +(not supported on Intel(R) 82542, 82543 or 82544-based adapters) +Valid Range: 0,1,3,100-100000 (0=off, 1=dynamic, 3=dynamic conservative) +Default Value: 3 + +The driver can limit the amount of interrupts per second that the adapter +will generate for incoming packets. It does this by writing a value to the +adapter that is based on the maximum amount of interrupts that the adapter +will generate per second. + +Setting InterruptThrottleRate to a value greater or equal to 100 +will program the adapter to send out a maximum of that many interrupts +per second, even if more packets have come in. This reduces interrupt +load on the system and can lower CPU utilization under heavy load, +but will increase latency as packets are not processed as quickly. + +The default behaviour of the driver previously assumed a static +InterruptThrottleRate value of 8000, providing a good fallback value for +all traffic types,but lacking in small packet performance and latency. +The hardware can handle many more small packets per second however, and +for this reason an adaptive interrupt moderation algorithm was implemented. + +Since 7.3.x, the driver has two adaptive modes (setting 1 or 3) in which +it dynamically adjusts the InterruptThrottleRate value based on the traffic +that it receives. After determining the type of incoming traffic in the last +timeframe, it will adjust the InterruptThrottleRate to an appropriate value +for that traffic. + +The algorithm classifies the incoming traffic every interval into +classes. Once the class is determined, the InterruptThrottleRate value is +adjusted to suit that traffic type the best. There are three classes defined: +"Bulk traffic", for large amounts of packets of normal size; "Low latency", +for small amounts of traffic and/or a significant percentage of small +packets; and "Lowest latency", for almost completely small packets or +minimal traffic. + +In dynamic conservative mode, the InterruptThrottleRate value is set to 4000 +for traffic that falls in class "Bulk traffic". If traffic falls in the "Low +latency" or "Lowest latency" class, the InterruptThrottleRate is increased +stepwise to 20000. This default mode is suitable for most applications. + +For situations where low latency is vital such as cluster or +grid computing, the algorithm can reduce latency even more when +InterruptThrottleRate is set to mode 1. In this mode, which operates +the same as mode 3, the InterruptThrottleRate will be increased stepwise to +70000 for traffic in class "Lowest latency". + +Setting InterruptThrottleRate to 0 turns off any interrupt moderation +and may improve small packet latency, but is generally not suitable +for bulk throughput traffic. NOTE: InterruptThrottleRate takes precedence over the TxAbsIntDelay and - RxAbsIntDelay parameters. In other words, minimizing the receive + RxAbsIntDelay parameters. In other words, minimizing the receive and/or transmit absolute delays does not force the controller to generate more interrupts than what the Interrupt Throttle Rate allows. -CAUTION: If you are using the Intel PRO/1000 CT Network Connection +CAUTION: If you are using the Intel(R) PRO/1000 CT Network Connection (controller 82547), setting InterruptThrottleRate to a value greater than 75,000, may hang (stop transmitting) adapters - under certain network conditions. If this occurs a NETDEV - WATCHDOG message is logged in the system event log. In + under certain network conditions. If this occurs a NETDEV + WATCHDOG message is logged in the system event log. In addition, the controller is automatically reset, restoring - the network connection. To eliminate the potential for the + the network connection. To eliminate the potential for the hang, ensure that InterruptThrottleRate is set no greater than 75,000 and is not set to 0. NOTE: When e1000 is loaded with default settings and multiple adapters are in use simultaneously, the CPU utilization may increase non- - linearly. In order to limit the CPU utilization without impacting + linearly. In order to limit the CPU utilization without impacting the overall throughput, we recommend that you load the driver as follows: - insmod e1000.o InterruptThrottleRate=3000,3000,3000 + modprobe e1000 InterruptThrottleRate=3000,3000,3000 This sets the InterruptThrottleRate to 3000 interrupts/sec for - the first, second, and third instances of the driver. The range + the first, second, and third instances of the driver. The range of 2000 to 3000 interrupts per second works on a majority of systems and is a good starting point, but the optimal value will - be platform-specific. If CPU utilization is not a concern, use + be platform-specific. If CPU utilization is not a concern, use RX_POLLING (NAPI) and default driver settings. + RxDescriptors ------------- Valid Range: 80-256 for 82542 and 82543-based adapters 80-4096 for all other supported adapters Default Value: 256 -This value specifies the number of receive descriptors allocated by the -driver. Increasing this value allows the driver to buffer more incoming -packets. Each descriptor is 16 bytes. A receive buffer is also -allocated for each descriptor and is 2048. +This value specifies the number of receive buffer descriptors allocated +by the driver. Increasing this value allows the driver to buffer more +incoming packets, at the expense of increased system memory utilization. + +Each descriptor is 16 bytes. A receive buffer is also allocated for each +descriptor and can be either 2048, 4096, 8192, or 16384 bytes, depending +on the MTU setting. The maximum MTU size is 16110. + +NOTE: MTU designates the frame size. It only needs to be set for Jumbo + Frames. Depending on the available system resources, the request + for a higher number of receive descriptors may be denied. In this + case, use a lower number. RxIntDelay @@ -187,17 +241,17 @@ Default Value: 0 This value delays the generation of receive interrupts in units of 1.024 microseconds. Receive interrupt reduction can improve CPU efficiency if -properly tuned for specific network traffic. Increasing this value adds +properly tuned for specific network traffic. Increasing this value adds extra latency to frame reception and can end up decreasing the throughput -of TCP traffic. If the system is reporting dropped receives, this value +of TCP traffic. If the system is reporting dropped receives, this value may be set too high, causing the driver to run out of available receive descriptors. CAUTION: When setting RxIntDelay to a value other than 0, adapters may - hang (stop transmitting) under certain network conditions. If + hang (stop transmitting) under certain network conditions. If this occurs a NETDEV WATCHDOG message is logged in the system - event log. In addition, the controller is automatically reset, - restoring the network connection. To eliminate the potential + event log. In addition, the controller is automatically reset, + restoring the network connection. To eliminate the potential for the hang ensure that RxIntDelay is set to 0. @@ -208,7 +262,7 @@ Valid Range: 0-65535 (0=off) Default Value: 128 This value, in units of 1.024 microseconds, limits the delay in which a -receive interrupt is generated. Useful only if RxIntDelay is non-zero, +receive interrupt is generated. Useful only if RxIntDelay is non-zero, this value ensures that an interrupt is generated after the initial packet is received within the set amount of time. Proper tuning, along with RxIntDelay, may improve traffic throughput in specific network @@ -222,9 +276,9 @@ Valid Settings: 0, 10, 100, 1000 Default Value: 0 (auto-negotiate at all supported speeds) Speed forces the line speed to the specified value in megabits per second -(Mbps). If this parameter is not specified or is set to 0 and the link +(Mbps). If this parameter is not specified or is set to 0 and the link partner is set to auto-negotiate, the board will auto-detect the correct -speed. Duplex should also be set when Speed is set to either 10 or 100. +speed. Duplex should also be set when Speed is set to either 10 or 100. TxDescriptors @@ -234,7 +288,7 @@ Valid Range: 80-256 for 82542 and 82543-based adapters Default Value: 256 This value is the number of transmit descriptors allocated by the driver. -Increasing this value allows the driver to queue more transmits. Each +Increasing this value allows the driver to queue more transmits. Each descriptor is 16 bytes. NOTE: Depending on the available system resources, the request for a @@ -248,8 +302,8 @@ Valid Range: 0-65535 (0=off) Default Value: 64 This value delays the generation of transmit interrupts in units of -1.024 microseconds. Transmit interrupt reduction can improve CPU -efficiency if properly tuned for specific network traffic. If the +1.024 microseconds. Transmit interrupt reduction can improve CPU +efficiency if properly tuned for specific network traffic. If the system is reporting dropped transmits, this value may be set too high causing the driver to run out of available transmit descriptors. @@ -261,7 +315,7 @@ Valid Range: 0-65535 (0=off) Default Value: 64 This value, in units of 1.024 microseconds, limits the delay in which a -transmit interrupt is generated. Useful only if TxIntDelay is non-zero, +transmit interrupt is generated. Useful only if TxIntDelay is non-zero, this value ensures that an interrupt is generated after the initial packet is sent on the wire within the set amount of time. Proper tuning, along with TxIntDelay, may improve traffic throughput in specific @@ -288,15 +342,15 @@ fiber interface board only links at 1000 Mbps full-duplex. For copper-based boards, the keywords interact as follows: - The default operation is auto-negotiate. The board advertises all + The default operation is auto-negotiate. The board advertises all supported speed and duplex combinations, and it links at the highest common speed and duplex mode IF the link partner is set to auto-negotiate. If Speed = 1000, limited auto-negotiation is enabled and only 1000 Mbps is advertised (The 1000BaseT spec requires auto-negotiation.) - If Speed = 10 or 100, then both Speed and Duplex should be set. Auto- - negotiation is disabled, and the AutoNeg parameter is ignored. Partner + If Speed = 10 or 100, then both Speed and Duplex should be set. Auto- + negotiation is disabled, and the AutoNeg parameter is ignored. Partner SHOULD also be forced. The AutoNeg parameter is used when more control is required over the @@ -304,7 +358,7 @@ auto-negotiation process. It should be used when you wish to control which speed and duplex combinations are advertised during the auto-negotiation process. -The parameter may be specified as either a decimal or hexidecimal value as +The parameter may be specified as either a decimal or hexadecimal value as determined by the bitmap below. Bit position 7 6 5 4 3 2 1 0 @@ -337,20 +391,19 @@ Additional Configurations Configuring the Driver on Different Distributions ------------------------------------------------- - Configuring a network driver to load properly when the system is started - is distribution dependent. Typically, the configuration process involves + is distribution dependent. Typically, the configuration process involves adding an alias line to /etc/modules.conf or /etc/modprobe.conf as well - as editing other system startup scripts and/or configuration files. Many + as editing other system startup scripts and/or configuration files. Many popular Linux distributions ship with tools to make these changes for you. To learn the proper way to configure a network device for your system, - refer to your distribution documentation. If during this process you are + refer to your distribution documentation. If during this process you are asked for the driver or module name, the name for the Linux Base Driver - for the Intel PRO/1000 Family of Adapters is e1000. + for the Intel(R) PRO/1000 Family of Adapters is e1000. As an example, if you install the e1000 driver for two PRO/1000 adapters (eth0 and eth1) and set the speed and duplex to 10full and 100half, add - the following to modules.conf or modprobe.conf: + the following to modules.conf or or modprobe.conf: alias eth0 e1000 alias eth1 e1000 @@ -358,9 +411,8 @@ Additional Configurations Viewing Link Messages --------------------- - Link messages will not be displayed to the console if the distribution is - restricting system messages. In order to see network driver link messages + restricting system messages. In order to see network driver link messages on your console, set dmesg to eight by entering the following: dmesg -n 8 @@ -369,11 +421,9 @@ Additional Configurations Jumbo Frames ------------ - - The driver supports Jumbo Frames for all adapters except 82542 and - 82573-based adapters. Jumbo Frames support is enabled by changing the - MTU to a value larger than the default of 1500. Use the ifconfig command - to increase the MTU size. For example: + Jumbo Frames support is enabled by changing the MTU to a value larger than + the default of 1500. Use the ifconfig command to increase the MTU size. + For example: ifconfig eth<x> mtu 9000 up @@ -390,26 +440,49 @@ Additional Configurations - To enable Jumbo Frames, increase the MTU size on the interface beyond 1500. - - The maximum MTU setting for Jumbo Frames is 16110. This value coincides + + - The maximum MTU setting for Jumbo Frames is 16110. This value coincides with the maximum Jumbo Frames size of 16128. + - Using Jumbo Frames at 10 or 100 Mbps may result in poor performance or loss of link. + - Some Intel gigabit adapters that support Jumbo Frames have a frame size limit of 9238 bytes, with a corresponding MTU size limit of 9216 bytes. - The adapters with this limitation are based on the Intel 82571EB and - 82572EI controllers, which correspond to these product names: - Intel® PRO/1000 PT Dual Port Server Adapter - Intel® PRO/1000 PF Dual Port Server Adapter - Intel® PRO/1000 PT Server Adapter - Intel® PRO/1000 PT Desktop Adapter - Intel® PRO/1000 PF Server Adapter - - - The Intel PRO/1000 PM Network Connection does not support jumbo frames. + The adapters with this limitation are based on the Intel(R) 82571EB, + 82572EI, 82573L and 80003ES2LAN controller. These correspond to the + following product names: + Intel(R) PRO/1000 PT Server Adapter + Intel(R) PRO/1000 PT Desktop Adapter + Intel(R) PRO/1000 PT Network Connection + Intel(R) PRO/1000 PT Dual Port Server Adapter + Intel(R) PRO/1000 PT Dual Port Network Connection + Intel(R) PRO/1000 PF Server Adapter + Intel(R) PRO/1000 PF Network Connection + Intel(R) PRO/1000 PF Dual Port Server Adapter + Intel(R) PRO/1000 PB Server Connection + Intel(R) PRO/1000 PL Network Connection + Intel(R) PRO/1000 EB Network Connection with I/O Acceleration + Intel(R) PRO/1000 EB Backplane Connection with I/O Acceleration + Intel(R) PRO/1000 PT Quad Port Server Adapter + + - Adapters based on the Intel(R) 82542 and 82573V/E controller do not + support Jumbo Frames. These correspond to the following product names: + Intel(R) PRO/1000 Gigabit Server Adapter + Intel(R) PRO/1000 PM Network Connection + + - The following adapters do not support Jumbo Frames: + Intel(R) 82562V 10/100 Network Connection + Intel(R) 82566DM Gigabit Network Connection + Intel(R) 82566DC Gigabit Network Connection + Intel(R) 82566MM Gigabit Network Connection + Intel(R) 82566MC Gigabit Network Connection + Intel(R) 82562GT 10/100 Network Connection + Intel(R) 82562G 10/100 Network Connection Ethtool ------- - The driver utilizes the ethtool interface for driver configuration and diagnostics, as well as displaying statistical information. Ethtool version 1.6 or later is required for this functionality. @@ -417,15 +490,14 @@ Additional Configurations The latest release of ethtool can be found from http://sourceforge.net/projects/gkernel. - NOTE: Ethtool 1.6 only supports a limited set of ethtool options. Support + NOTE: Ethtool 1.6 only supports a limited set of ethtool options. Support for a more complete ethtool feature set can be enabled by upgrading ethtool to ethtool-1.8.1. Enabling Wake on LAN* (WoL) --------------------------- - - WoL is configured through the Ethtool* utility. Ethtool is included with - all versions of Red Hat after Red Hat 7.2. For other Linux distributions, + WoL is configured through the Ethtool* utility. Ethtool is included with + all versions of Red Hat after Red Hat 7.2. For other Linux distributions, download and install Ethtool from the following website: http://sourceforge.net/projects/gkernel. @@ -436,11 +508,17 @@ Additional Configurations For this driver version, in order to enable WoL, the e1000 driver must be loaded when shutting down or rebooting the system. + Wake On LAN is only supported on port A for the following devices: + Intel(R) PRO/1000 PT Dual Port Network Connection + Intel(R) PRO/1000 PT Dual Port Server Connection + Intel(R) PRO/1000 PT Dual Port Server Adapter + Intel(R) PRO/1000 PF Dual Port Server Adapter + Intel(R) PRO/1000 PT Quad Port Server Adapter + NAPI ---- - - NAPI (Rx polling mode) is supported in the e1000 driver. NAPI is enabled - or disabled based on the configuration of the kernel. To override + NAPI (Rx polling mode) is supported in the e1000 driver. NAPI is enabled + or disabled based on the configuration of the kernel. To override the default, use the following compile-time flags. To enable NAPI, compile the driver module, passing in a configuration option: @@ -457,88 +535,105 @@ Additional Configurations Known Issues ============ - Jumbo Frames System Requirement - ------------------------------- - - Memory allocation failures have been observed on Linux systems with 64 MB - of RAM or less that are running Jumbo Frames. If you are using Jumbo - Frames, your system may require more than the advertised minimum - requirement of 64 MB of system memory. - - Performance Degradation with Jumbo Frames - ----------------------------------------- - - Degradation in throughput performance may be observed in some Jumbo frames - environments. If this is observed, increasing the application's socket - buffer size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values - may help. See the specific application manual and - /usr/src/linux*/Documentation/ - networking/ip-sysctl.txt for more details. - - Jumbo frames on Foundry BigIron 8000 switch - ------------------------------------------- - There is a known issue using Jumbo frames when connected to a Foundry - BigIron 8000 switch. This is a 3rd party limitation. If you experience - loss of packets, lower the MTU size. - - Multiple Interfaces on Same Ethernet Broadcast Network - ------------------------------------------------------ - - Due to the default ARP behavior on Linux, it is not possible to have - one system on two IP networks in the same Ethernet broadcast domain - (non-partitioned switch) behave as expected. All Ethernet interfaces - will respond to IP traffic for any IP address assigned to the system. - This results in unbalanced receive traffic. - - If you have multiple interfaces in a server, either turn on ARP - filtering by entering: - - echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter - (this only works if your kernel's version is higher than 2.4.5), - - NOTE: This setting is not saved across reboots. The configuration - change can be made permanent by adding the line: - net.ipv4.conf.all.arp_filter = 1 - to the file /etc/sysctl.conf - - or, - - install the interfaces in separate broadcast domains (either in - different switches or in a switch partitioned to VLANs). - - 82541/82547 can't link or are slow to link with some link partners - ----------------------------------------------------------------- - - There is a known compatibility issue with 82541/82547 and some - low-end switches where the link will not be established, or will - be slow to establish. In particular, these switches are known to - be incompatible with 82541/82547: - - Planex FXG-08TE - I-O Data ETG-SH8 - - To workaround this issue, the driver can be compiled with an override - of the PHY's master/slave setting. Forcing master or forcing slave - mode will improve time-to-link. - - # make EXTRA_CFLAGS=-DE1000_MASTER_SLAVE=<n> - - Where <n> is: - - 0 = Hardware default - 1 = Master mode - 2 = Slave mode - 3 = Auto master/slave - - Disable rx flow control with ethtool - ------------------------------------ - - In order to disable receive flow control using ethtool, you must turn - off auto-negotiation on the same command line. - - For example: - - ethtool -A eth? autoneg off rx off +Dropped Receive Packets on Half-duplex 10/100 Networks +------------------------------------------------------ +If you have an Intel PCI Express adapter running at 10mbps or 100mbps, half- +duplex, you may observe occasional dropped receive packets. There are no +workarounds for this problem in this network configuration. The network must +be updated to operate in full-duplex, and/or 1000mbps only. + +Jumbo Frames System Requirement +------------------------------- +Memory allocation failures have been observed on Linux systems with 64 MB +of RAM or less that are running Jumbo Frames. If you are using Jumbo +Frames, your system may require more than the advertised minimum +requirement of 64 MB of system memory. + +Performance Degradation with Jumbo Frames +----------------------------------------- +Degradation in throughput performance may be observed in some Jumbo frames +environments. If this is observed, increasing the application's socket +buffer size and/or increasing the /proc/sys/net/ipv4/tcp_*mem entry values +may help. See the specific application manual and +/usr/src/linux*/Documentation/ +networking/ip-sysctl.txt for more details. + +Jumbo Frames on Foundry BigIron 8000 switch +------------------------------------------- +There is a known issue using Jumbo frames when connected to a Foundry +BigIron 8000 switch. This is a 3rd party limitation. If you experience +loss of packets, lower the MTU size. + +Allocating Rx Buffers when Using Jumbo Frames +--------------------------------------------- +Allocating Rx buffers when using Jumbo Frames on 2.6.x kernels may fail if +the available memory is heavily fragmented. This issue may be seen with PCI-X +adapters or with packet split disabled. This can be reduced or eliminated +by changing the amount of available memory for receive buffer allocation, by +increasing /proc/sys/vm/min_free_kbytes. + +Multiple Interfaces on Same Ethernet Broadcast Network +------------------------------------------------------ +Due to the default ARP behavior on Linux, it is not possible to have +one system on two IP networks in the same Ethernet broadcast domain +(non-partitioned switch) behave as expected. All Ethernet interfaces +will respond to IP traffic for any IP address assigned to the system. +This results in unbalanced receive traffic. + +If you have multiple interfaces in a server, either turn on ARP +filtering by entering: + + echo 1 > /proc/sys/net/ipv4/conf/all/arp_filter +(this only works if your kernel's version is higher than 2.4.5), + +NOTE: This setting is not saved across reboots. The configuration +change can be made permanent by adding the line: + net.ipv4.conf.all.arp_filter = 1 +to the file /etc/sysctl.conf + + or, + +install the interfaces in separate broadcast domains (either in +different switches or in a switch partitioned to VLANs). + +82541/82547 can't link or are slow to link with some link partners +----------------------------------------------------------------- +There is a known compatibility issue with 82541/82547 and some +low-end switches where the link will not be established, or will +be slow to establish. In particular, these switches are known to +be incompatible with 82541/82547: + + Planex FXG-08TE + I-O Data ETG-SH8 + +To workaround this issue, the driver can be compiled with an override +of the PHY's master/slave setting. Forcing master or forcing slave +mode will improve time-to-link. + + # make CFLAGS_EXTRA=-DE1000_MASTER_SLAVE=<n> + +Where <n> is: + + 0 = Hardware default + 1 = Master mode + 2 = Slave mode + 3 = Auto master/slave + +Disable rx flow control with ethtool +------------------------------------ +In order to disable receive flow control using ethtool, you must turn +off auto-negotiation on the same command line. + +For example: + + ethtool -A eth? autoneg off rx off + +Unplugging network cable while ethtool -p is running +---------------------------------------------------- +In kernel versions 2.5.50 and later (including 2.6 kernel), unplugging +the network cable while ethtool -p is running will cause the system to +become unresponsive to keyboard commands, except for control-alt-delete. +Restarting the system appears to be the only remedy. Support @@ -548,24 +643,10 @@ For general information, go to the Intel support website at: http://support.intel.com - or the Intel Wired Networking project hosted by Sourceforge at: +or the Intel Wired Networking project hosted by Sourceforge at: http://sourceforge.net/projects/e1000 If an issue is identified with the released source code on the supported kernel with a supported adapter, email the specific information related -to the issue to e1000-devel@lists.sourceforge.net - - -License -======= - -This software program is released under the terms of a license agreement -between you ('Licensee') and Intel. Do not use or load this software or any -associated materials (collectively, the 'Software') until you have carefully -read the full terms and conditions of the file COPYING located in this software -package. By loading or using the Software, you agree to the terms of this -Agreement. If you do not agree with the terms of this Agreement, do not -install or use the Software. - -* Other names and brands may be claimed as the property of others. +to the issue to e1000-devel@lists.sf.net diff --git a/Documentation/networking/generic_netlink.txt b/Documentation/networking/generic_netlink.txt new file mode 100644 index 000000000000..d4f8b8b9b53c --- /dev/null +++ b/Documentation/networking/generic_netlink.txt @@ -0,0 +1,3 @@ +A wiki document on how to use Generic Netlink can be found here: + + * http://linux-net.osdl.org/index.php/Generic_Netlink_HOWTO diff --git a/Documentation/networking/ip-sysctl.txt b/Documentation/networking/ip-sysctl.txt index fd3c0c012351..a0f6842368c3 100644 --- a/Documentation/networking/ip-sysctl.txt +++ b/Documentation/networking/ip-sysctl.txt @@ -101,6 +101,11 @@ inet_peer_gc_maxtime - INTEGER TCP variables: +somaxconn - INTEGER + Limit of socket listen() backlog, known in userspace as SOMAXCONN. + Defaults to 128. See also tcp_max_syn_backlog for additional tuning + for TCP sockets. + tcp_abc - INTEGER Controls Appropriate Byte Count (ABC) defined in RFC3465. ABC is a way of increasing congestion window (cwnd) more slowly @@ -112,48 +117,51 @@ tcp_abc - INTEGER of two segments to compensate for delayed acknowledgments. Default: 0 (off) -tcp_syn_retries - INTEGER - Number of times initial SYNs for an active TCP connection attempt - will be retransmitted. Should not be higher than 255. Default value - is 5, which corresponds to ~180seconds. +tcp_abort_on_overflow - BOOLEAN + If listening service is too slow to accept new connections, + reset them. Default state is FALSE. It means that if overflow + occurred due to a burst, connection will recover. Enable this + option _only_ if you are really sure that listening daemon + cannot be tuned to accept connections faster. Enabling this + option can harm clients of your server. -tcp_synack_retries - INTEGER - Number of times SYNACKs for a passive TCP connection attempt will - be retransmitted. Should not be higher than 255. Default value - is 5, which corresponds to ~180seconds. +tcp_adv_win_scale - INTEGER + Count buffering overhead as bytes/2^tcp_adv_win_scale + (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), + if it is <= 0. + Default: 2 -tcp_keepalive_time - INTEGER - How often TCP sends out keepalive messages when keepalive is enabled. - Default: 2hours. +tcp_allowed_congestion_control - STRING + Show/set the congestion control choices available to non-privileged + processes. The list is a subset of those listed in + tcp_available_congestion_control. + Default is "reno" and the default setting (tcp_congestion_control). -tcp_keepalive_probes - INTEGER - How many keepalive probes TCP sends out, until it decides that the - connection is broken. Default value: 9. +tcp_app_win - INTEGER + Reserve max(window/2^tcp_app_win, mss) of window for application + buffer. Value 0 is special, it means that nothing is reserved. + Default: 31 -tcp_keepalive_intvl - INTEGER - How frequently the probes are send out. Multiplied by - tcp_keepalive_probes it is time to kill not responding connection, - after probes started. Default value: 75sec i.e. connection - will be aborted after ~11 minutes of retries. +tcp_available_congestion_control - STRING + Shows the available congestion control choices that are registered. + More congestion control algorithms may be available as modules, + but not loaded. -tcp_retries1 - INTEGER - How many times to retry before deciding that something is wrong - and it is necessary to report this suspicion to network layer. - Minimal RFC value is 3, it is default, which corresponds - to ~3sec-8min depending on RTO. +tcp_congestion_control - STRING + Set the congestion control algorithm to be used for new + connections. The algorithm "reno" is always available, but + additional choices may be available based on kernel configuration. + Default is set as part of kernel configuration. -tcp_retries2 - INTEGER - How may times to retry before killing alive TCP connection. - RFC1122 says that the limit should be longer than 100 sec. - It is too small number. Default value 15 corresponds to ~13-30min - depending on RTO. +tcp_dsack - BOOLEAN + Allows TCP to send "duplicate" SACKs. -tcp_orphan_retries - INTEGER - How may times to retry before killing TCP connection, closed - by our side. Default value 7 corresponds to ~50sec-16min - depending on RTO. If you machine is loaded WEB server, - you should think about lowering this value, such sockets - may consume significant resources. Cf. tcp_max_orphans. +tcp_ecn - BOOLEAN + Enable Explicit Congestion Notification in TCP. + +tcp_fack - BOOLEAN + Enable FACK congestion avoidance and fast retransmission. + The value is not used, if tcp_sack is not enabled. tcp_fin_timeout - INTEGER Time to hold socket in state FIN-WAIT-2, if it was closed @@ -166,24 +174,33 @@ tcp_fin_timeout - INTEGER because they eat maximum 1.5K of memory, but they tend to live longer. Cf. tcp_max_orphans. -tcp_max_tw_buckets - INTEGER - Maximal number of timewait sockets held by system simultaneously. - If this number is exceeded time-wait socket is immediately destroyed - and warning is printed. This limit exists only to prevent - simple DoS attacks, you _must_ not lower the limit artificially, - but rather increase it (probably, after increasing installed memory), - if network conditions require more than default value. +tcp_frto - BOOLEAN + Enables F-RTO, an enhanced recovery algorithm for TCP retransmission + timeouts. It is particularly beneficial in wireless environments + where packet loss is typically due to random radio interference + rather than intermediate router congestion. -tcp_tw_recycle - BOOLEAN - Enable fast recycling TIME-WAIT sockets. Default value is 0. - It should not be changed without advice/request of technical - experts. +tcp_keepalive_time - INTEGER + How often TCP sends out keepalive messages when keepalive is enabled. + Default: 2hours. -tcp_tw_reuse - BOOLEAN - Allow to reuse TIME-WAIT sockets for new connections when it is - safe from protocol viewpoint. Default value is 0. - It should not be changed without advice/request of technical - experts. +tcp_keepalive_probes - INTEGER + How many keepalive probes TCP sends out, until it decides that the + connection is broken. Default value: 9. + +tcp_keepalive_intvl - INTEGER + How frequently the probes are send out. Multiplied by + tcp_keepalive_probes it is time to kill not responding connection, + after probes started. Default value: 75sec i.e. connection + will be aborted after ~11 minutes of retries. + +tcp_low_latency - BOOLEAN + If set, the TCP stack makes decisions that prefer lower + latency as opposed to higher throughput. By default, this + option is not set meaning that higher throughput is preferred. + An example of an application where this default should be + changed would be a Beowulf compute cluster. + Default: 0 tcp_max_orphans - INTEGER Maximal number of TCP sockets not attached to any user file handle, @@ -197,41 +214,6 @@ tcp_max_orphans - INTEGER more aggressively. Let me to remind again: each orphan eats up to ~64K of unswappable memory. -tcp_abort_on_overflow - BOOLEAN - If listening service is too slow to accept new connections, - reset them. Default state is FALSE. It means that if overflow - occurred due to a burst, connection will recover. Enable this - option _only_ if you are really sure that listening daemon - cannot be tuned to accept connections faster. Enabling this - option can harm clients of your server. - -tcp_syncookies - BOOLEAN - Only valid when the kernel was compiled with CONFIG_SYNCOOKIES - Send out syncookies when the syn backlog queue of a socket - overflows. This is to prevent against the common 'syn flood attack' - Default: FALSE - - Note, that syncookies is fallback facility. - It MUST NOT be used to help highly loaded servers to stand - against legal connection rate. If you see synflood warnings - in your logs, but investigation shows that they occur - because of overload with legal connections, you should tune - another parameters until this warning disappear. - See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow. - - syncookies seriously violate TCP protocol, do not allow - to use TCP extensions, can result in serious degradation - of some services (f.e. SMTP relaying), visible not by you, - but your clients and relays, contacting you. While you see - synflood warnings in logs not being really flooded, your server - is seriously misconfigured. - -tcp_stdurg - BOOLEAN - Use the Host requirements interpretation of the TCP urg pointer field. - Most hosts use the older BSD interpretation, so if you turn this on - Linux might not communicate correctly with them. - Default: FALSE - tcp_max_syn_backlog - INTEGER Maximal number of remembered connection requests, which are still did not receive an acknowledgment from connecting client. @@ -239,24 +221,34 @@ tcp_max_syn_backlog - INTEGER and 128 for low memory machines. If server suffers of overload, try to increase this number. -tcp_window_scaling - BOOLEAN - Enable window scaling as defined in RFC1323. +tcp_max_tw_buckets - INTEGER + Maximal number of timewait sockets held by system simultaneously. + If this number is exceeded time-wait socket is immediately destroyed + and warning is printed. This limit exists only to prevent + simple DoS attacks, you _must_ not lower the limit artificially, + but rather increase it (probably, after increasing installed memory), + if network conditions require more than default value. -tcp_timestamps - BOOLEAN - Enable timestamps as defined in RFC1323. +tcp_mem - vector of 3 INTEGERs: min, pressure, max + min: below this number of pages TCP is not bothered about its + memory appetite. -tcp_sack - BOOLEAN - Enable select acknowledgments (SACKS). + pressure: when amount of memory allocated by TCP exceeds this number + of pages, TCP moderates its memory consumption and enters memory + pressure mode, which is exited when memory consumption falls + under "min". -tcp_fack - BOOLEAN - Enable FACK congestion avoidance and fast retransmission. - The value is not used, if tcp_sack is not enabled. + max: number of pages allowed for queueing by all TCP sockets. -tcp_dsack - BOOLEAN - Allows TCP to send "duplicate" SACKs. + Defaults are calculated at boot time from amount of available + memory. -tcp_ecn - BOOLEAN - Enable Explicit Congestion Notification in TCP. +tcp_orphan_retries - INTEGER + How may times to retry before killing TCP connection, closed + by our side. Default value 7 corresponds to ~50sec-16min + depending on RTO. If you machine is loaded WEB server, + you should think about lowering this value, such sockets + may consume significant resources. Cf. tcp_max_orphans. tcp_reordering - INTEGER Maximal reordering of packets in a TCP stream. @@ -267,20 +259,23 @@ tcp_retrans_collapse - BOOLEAN On retransmit try to send bigger packets to work around bugs in certain TCP stacks. -tcp_wmem - vector of 3 INTEGERs: min, default, max - min: Amount of memory reserved for send buffers for TCP socket. - Each TCP socket has rights to use it due to fact of its birth. - Default: 4K +tcp_retries1 - INTEGER + How many times to retry before deciding that something is wrong + and it is necessary to report this suspicion to network layer. + Minimal RFC value is 3, it is default, which corresponds + to ~3sec-8min depending on RTO. - default: Amount of memory allowed for send buffers for TCP socket - by default. This value overrides net.core.wmem_default used - by other protocols, it is usually lower than net.core.wmem_default. - Default: 16K +tcp_retries2 - INTEGER + How may times to retry before killing alive TCP connection. + RFC1122 says that the limit should be longer than 100 sec. + It is too small number. Default value 15 corresponds to ~13-30min + depending on RTO. - max: Maximal amount of memory allowed for automatically selected - send buffers for TCP socket. This value does not override - net.core.wmem_max, "static" selection via SO_SNDBUF does not use this. - Default: 128K +tcp_rfc1337 - BOOLEAN + If set, the TCP stack behaves conforming to RFC1337. If unset, + we are not conforming to RFC, but prevent TCP TIME_WAIT + assassination. + Default: 0 tcp_rmem - vector of 3 INTEGERs: min, default, max min: Minimal size of receive buffer used by TCP sockets. @@ -299,67 +294,91 @@ tcp_rmem - vector of 3 INTEGERs: min, default, max net.core.rmem_max, "static" selection via SO_RCVBUF does not use this. Default: 87380*2 bytes. -tcp_mem - vector of 3 INTEGERs: min, pressure, max - min: below this number of pages TCP is not bothered about its - memory appetite. +tcp_sack - BOOLEAN + Enable select acknowledgments (SACKS). - pressure: when amount of memory allocated by TCP exceeds this number - of pages, TCP moderates its memory consumption and enters memory - pressure mode, which is exited when memory consumption falls - under "min". +tcp_slow_start_after_idle - BOOLEAN + If set, provide RFC2861 behavior and time out the congestion + window after an idle period. An idle period is defined at + the current RTO. If unset, the congestion window will not + be timed out after an idle period. + Default: 1 - max: number of pages allowed for queueing by all TCP sockets. +tcp_stdurg - BOOLEAN + Use the Host requirements interpretation of the TCP urg pointer field. + Most hosts use the older BSD interpretation, so if you turn this on + Linux might not communicate correctly with them. + Default: FALSE - Defaults are calculated at boot time from amount of available - memory. +tcp_synack_retries - INTEGER + Number of times SYNACKs for a passive TCP connection attempt will + be retransmitted. Should not be higher than 255. Default value + is 5, which corresponds to ~180seconds. -tcp_app_win - INTEGER - Reserve max(window/2^tcp_app_win, mss) of window for application - buffer. Value 0 is special, it means that nothing is reserved. - Default: 31 +tcp_syncookies - BOOLEAN + Only valid when the kernel was compiled with CONFIG_SYNCOOKIES + Send out syncookies when the syn backlog queue of a socket + overflows. This is to prevent against the common 'syn flood attack' + Default: FALSE -tcp_adv_win_scale - INTEGER - Count buffering overhead as bytes/2^tcp_adv_win_scale - (if tcp_adv_win_scale > 0) or bytes-bytes/2^(-tcp_adv_win_scale), - if it is <= 0. - Default: 2 + Note, that syncookies is fallback facility. + It MUST NOT be used to help highly loaded servers to stand + against legal connection rate. If you see synflood warnings + in your logs, but investigation shows that they occur + because of overload with legal connections, you should tune + another parameters until this warning disappear. + See: tcp_max_syn_backlog, tcp_synack_retries, tcp_abort_on_overflow. -tcp_rfc1337 - BOOLEAN - If set, the TCP stack behaves conforming to RFC1337. If unset, - we are not conforming to RFC, but prevent TCP TIME_WAIT - assassination. - Default: 0 + syncookies seriously violate TCP protocol, do not allow + to use TCP extensions, can result in serious degradation + of some services (f.e. SMTP relaying), visible not by you, + but your clients and relays, contacting you. While you see + synflood warnings in logs not being really flooded, your server + is seriously misconfigured. -tcp_low_latency - BOOLEAN - If set, the TCP stack makes decisions that prefer lower - latency as opposed to higher throughput. By default, this - option is not set meaning that higher throughput is preferred. - An example of an application where this default should be - changed would be a Beowulf compute cluster. - Default: 0 +tcp_syn_retries - INTEGER + Number of times initial SYNs for an active TCP connection attempt + will be retransmitted. Should not be higher than 255. Default value + is 5, which corresponds to ~180seconds. + +tcp_timestamps - BOOLEAN + Enable timestamps as defined in RFC1323. tcp_tso_win_divisor - INTEGER - This allows control over what percentage of the congestion window - can be consumed by a single TSO frame. - The setting of this parameter is a choice between burstiness and - building larger TSO frames. - Default: 3 + This allows control over what percentage of the congestion window + can be consumed by a single TSO frame. + The setting of this parameter is a choice between burstiness and + building larger TSO frames. + Default: 3 -tcp_frto - BOOLEAN - Enables F-RTO, an enhanced recovery algorithm for TCP retransmission - timeouts. It is particularly beneficial in wireless environments - where packet loss is typically due to random radio interference - rather than intermediate router congestion. +tcp_tw_recycle - BOOLEAN + Enable fast recycling TIME-WAIT sockets. Default value is 0. + It should not be changed without advice/request of technical + experts. -tcp_congestion_control - STRING - Set the congestion control algorithm to be used for new - connections. The algorithm "reno" is always available, but - additional choices may be available based on kernel configuration. +tcp_tw_reuse - BOOLEAN + Allow to reuse TIME-WAIT sockets for new connections when it is + safe from protocol viewpoint. Default value is 0. + It should not be changed without advice/request of technical + experts. -somaxconn - INTEGER - Limit of socket listen() backlog, known in userspace as SOMAXCONN. - Defaults to 128. See also tcp_max_syn_backlog for additional tuning - for TCP sockets. +tcp_window_scaling - BOOLEAN + Enable window scaling as defined in RFC1323. + +tcp_wmem - vector of 3 INTEGERs: min, default, max + min: Amount of memory reserved for send buffers for TCP socket. + Each TCP socket has rights to use it due to fact of its birth. + Default: 4K + + default: Amount of memory allowed for send buffers for TCP socket + by default. This value overrides net.core.wmem_default used + by other protocols, it is usually lower than net.core.wmem_default. + Default: 16K + + max: Maximal amount of memory allowed for automatically selected + send buffers for TCP socket. This value does not override + net.core.wmem_max, "static" selection via SO_SNDBUF does not use this. + Default: 128K tcp_workaround_signed_windows - BOOLEAN If set, assume no receipt of a window scaling option means the @@ -368,13 +387,6 @@ tcp_workaround_signed_windows - BOOLEAN not receive a window scaling option from them. Default: 0 -tcp_slow_start_after_idle - BOOLEAN - If set, provide RFC2861 behavior and time out the congestion - window after an idle period. An idle period is defined at - the current RTO. If unset, the congestion window will not - be timed out after an idle period. - Default: 1 - CIPSOv4 Variables: cipso_cache_enable - BOOLEAN @@ -974,4 +986,3 @@ no_cong_thresh FIXME slot_timeout FIXME warn_noreply_time FIXME -$Id: ip-sysctl.txt,v 1.20 2001/12/13 09:00:18 davem Exp $ diff --git a/Documentation/networking/iphase.txt b/Documentation/networking/iphase.txt index 493203a080a8..55eac4a784e2 100644 --- a/Documentation/networking/iphase.txt +++ b/Documentation/networking/iphase.txt @@ -81,7 +81,7 @@ Installation 1M. The RAM size decides the number of buffers and buffer size. The default size and number of buffers are set as following: - Totol Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf + Total Rx RAM Tx RAM Rx Buf Tx Buf Rx buf Tx buf RAM size size size size size cnt cnt -------- ------ ------ ------ ------ ------ ------ 128K 64K 64K 10K 10K 6 6 diff --git a/Documentation/networking/packet_mmap.txt b/Documentation/networking/packet_mmap.txt index 12a008a5c221..5a232d946be3 100644 --- a/Documentation/networking/packet_mmap.txt +++ b/Documentation/networking/packet_mmap.txt @@ -284,7 +284,7 @@ the necessary memory, so normally limits can be reached. ------------------- If you check the source code you will see that what I draw here as a frame -is not only the link level frame. At the begining of each frame there is a +is not only the link level frame. At the beginning of each frame there is a header called struct tpacket_hdr used in PACKET_MMAP to hold link level's frame meta information like timestamp. So what we draw here a frame it's really the following (from include/linux/if_packet.h): diff --git a/Documentation/networking/phy.txt b/Documentation/networking/phy.txt index 29ccae409031..0bc95eab1512 100644 --- a/Documentation/networking/phy.txt +++ b/Documentation/networking/phy.txt @@ -1,7 +1,7 @@ ------- PHY Abstraction Layer -(Updated 2005-07-21) +(Updated 2006-11-30) Purpose @@ -97,11 +97,12 @@ Letting the PHY Abstraction Layer do Everything Next, you need to know the device name of the PHY connected to this device. The name will look something like, "phy0:0", where the first number is the - bus id, and the second is the PHY's address on that bus. + bus id, and the second is the PHY's address on that bus. Typically, + the bus is responsible for making its ID unique. Now, to connect, just call this function: - phydev = phy_connect(dev, phy_name, &adjust_link, flags); + phydev = phy_connect(dev, phy_name, &adjust_link, flags, interface); phydev is a pointer to the phy_device structure which represents the PHY. If phy_connect is successful, it will return the pointer. dev, here, is the @@ -115,6 +116,10 @@ Letting the PHY Abstraction Layer do Everything This is useful if the system has put hardware restrictions on the PHY/controller, of which the PHY needs to be aware. + interface is a u32 which specifies the connection type used + between the controller and the PHY. Examples are GMII, MII, + RGMII, and SGMII. For a full list, see include/linux/phy.h + Now just make sure that phydev->supported and phydev->advertising have any values pruned from them which don't make sense for your controller (a 10/100 controller may be connected to a gigabit capable PHY, so you would need to @@ -191,7 +196,7 @@ Doing it all yourself start, or disables then frees them for stop. struct phy_device * phy_attach(struct net_device *dev, const char *phy_id, - u32 flags); + u32 flags, phy_interface_t interface); Attaches a network device to a particular PHY, binding the PHY to a generic driver if none was found during bus initialization. Passes in diff --git a/Documentation/networking/pktgen.txt b/Documentation/networking/pktgen.txt index c8eee23be8c0..c6cf4a3c16e0 100644 --- a/Documentation/networking/pktgen.txt +++ b/Documentation/networking/pktgen.txt @@ -63,8 +63,8 @@ Current: Result: OK: 13101142(c12220741+d880401) usec, 10000000 (60byte,0frags) 763292pps 390Mb/sec (390805504bps) errors: 39664 -Confguring threads and devices -============================== +Configuring threads and devices +================================ This is done via the /proc interface easiest done via pgset in the scripts Examples: @@ -116,7 +116,7 @@ Examples: there must be no spaces between the arguments. Leading zeros are required. Do not set the bottom of stack bit, - thats done automatically. If you do + that's done automatically. If you do set the bottom of stack bit, that indicates that you want to randomly generate that address and the flag diff --git a/Documentation/networking/proc_net_tcp.txt b/Documentation/networking/proc_net_tcp.txt index 59cb915c3713..5e21f7cb6383 100644 --- a/Documentation/networking/proc_net_tcp.txt +++ b/Documentation/networking/proc_net_tcp.txt @@ -25,7 +25,7 @@ up into 3 parts because of the length of the line): 1000 0 54165785 4 cd1e6040 25 4 27 3 -1 | | | | | | | | | |--> slow start size threshold, - | | | | | | | | | or -1 if the treshold + | | | | | | | | | or -1 if the threshold | | | | | | | | | is >= 0xFFFF | | | | | | | | |----> sending congestion window | | | | | | | |-------> (ack.quick<<1)|ack.pingpong diff --git a/Documentation/networking/sk98lin.txt b/Documentation/networking/sk98lin.txt index 4e1cc745ec63..8590a954df1d 100644 --- a/Documentation/networking/sk98lin.txt +++ b/Documentation/networking/sk98lin.txt @@ -346,7 +346,7 @@ Possible modes: depending on the load of the system. If the driver detects that the system load is too high, the driver tries to shield the system against too much network load by enabling interrupt moderation. If - at a later - time - the CPU utilizaton decreases again (or if the network load is + time - the CPU utilization decreases again (or if the network load is negligible) the interrupt moderation will automatically be disabled. Interrupt moderation should be used when the driver has to handle one or more diff --git a/Documentation/networking/slicecom.txt b/Documentation/networking/slicecom.txt index 2f04c9267f89..32d3b916afad 100644 --- a/Documentation/networking/slicecom.txt +++ b/Documentation/networking/slicecom.txt @@ -126,7 +126,7 @@ comx0/boardnum - board number of the SliceCom in the PC (using the 'natural' Though the options below are to be set on a single interface, they apply to the whole board. The restriction, to use them on 'UP' interfaces, is because the -command sequence below could lead to unpredicable results. +command sequence below could lead to unpredictable results. # echo 0 >boardnum # echo internal >clock_source diff --git a/Documentation/networking/udplite.txt b/Documentation/networking/udplite.txt new file mode 100644 index 000000000000..dd6f46b83dab --- /dev/null +++ b/Documentation/networking/udplite.txt @@ -0,0 +1,281 @@ + =========================================================================== + The UDP-Lite protocol (RFC 3828) + =========================================================================== + + + UDP-Lite is a Standards-Track IETF transport protocol whose characteristic + is a variable-length checksum. This has advantages for transport of multimedia + (video, VoIP) over wireless networks, as partly damaged packets can still be + fed into the codec instead of being discarded due to a failed checksum test. + + This file briefly describes the existing kernel support and the socket API. + For in-depth information, you can consult: + + o The UDP-Lite Homepage: http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/ + Fom here you can also download some example application source code. + + o The UDP-Lite HOWTO on + http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/files/UDP-Lite-HOWTO.txt + + o The Wireshark UDP-Lite WiKi (with capture files): + http://wiki.wireshark.org/Lightweight_User_Datagram_Protocol + + o The Protocol Spec, RFC 3828, http://www.ietf.org/rfc/rfc3828.txt + + + I) APPLICATIONS + + Several applications have been ported successfully to UDP-Lite. Ethereal + (now called wireshark) has UDP-Litev4/v6 support by default. The tarball on + + http://www.erg.abdn.ac.uk/users/gerrit/udp-lite/files/udplite_linux.tar.gz + + has source code for several v4/v6 client-server and network testing examples. + + Porting applications to UDP-Lite is straightforward: only socket level and + IPPROTO need to be changed; senders additionally set the checksum coverage + length (default = header length = 8). Details are in the next section. + + + II) PROGRAMMING API + + UDP-Lite provides a connectionless, unreliable datagram service and hence + uses the same socket type as UDP. In fact, porting from UDP to UDP-Lite is + very easy: simply add `IPPROTO_UDPLITE' as the last argument of the socket(2) + call so that the statement looks like: + + s = socket(PF_INET, SOCK_DGRAM, IPPROTO_UDPLITE); + + or, respectively, + + s = socket(PF_INET6, SOCK_DGRAM, IPPROTO_UDPLITE); + + With just the above change you are able to run UDP-Lite services or connect + to UDP-Lite servers. The kernel will assume that you are not interested in + using partial checksum coverage and so emulate UDP mode (full coverage). + + To make use of the partial checksum coverage facilities requires setting a + single socket option, which takes an integer specifying the coverage length: + + * Sender checksum coverage: UDPLITE_SEND_CSCOV + + For example, + + int val = 20; + setsockopt(s, SOL_UDPLITE, UDPLITE_SEND_CSCOV, &val, sizeof(int)); + + sets the checksum coverage length to 20 bytes (12b data + 8b header). + Of each packet only the first 20 bytes (plus the pseudo-header) will be + checksummed. This is useful for RTP applications which have a 12-byte + base header. + + + * Receiver checksum coverage: UDPLITE_RECV_CSCOV + + This option is the receiver-side analogue. It is truly optional, i.e. not + required to enable traffic with partial checksum coverage. Its function is + that of a traffic filter: when enabled, it instructs the kernel to drop + all packets which have a coverage _less_ than this value. For example, if + RTP and UDP headers are to be protected, a receiver can enforce that only + packets with a minimum coverage of 20 are admitted: + + int min = 20; + setsockopt(s, SOL_UDPLITE, UDPLITE_RECV_CSCOV, &min, sizeof(int)); + + The calls to getsockopt(2) are analogous. Being an extension and not a stand- + alone protocol, all socket options known from UDP can be used in exactly the + same manner as before, e.g. UDP_CORK or UDP_ENCAP. + + A detailed discussion of UDP-Lite checksum coverage options is in section IV. + + + III) HEADER FILES + + The socket API requires support through header files in /usr/include: + + * /usr/include/netinet/in.h + to define IPPROTO_UDPLITE + + * /usr/include/netinet/udplite.h + for UDP-Lite header fields and protocol constants + + For testing purposes, the following can serve as a `mini' header file: + + #define IPPROTO_UDPLITE 136 + #define SOL_UDPLITE 136 + #define UDPLITE_SEND_CSCOV 10 + #define UDPLITE_RECV_CSCOV 11 + + Ready-made header files for various distros are in the UDP-Lite tarball. + + + IV) KERNEL BEHAVIOUR WITH REGARD TO THE VARIOUS SOCKET OPTIONS + + To enable debugging messages, the log level need to be set to 8, as most + messages use the KERN_DEBUG level (7). + + 1) Sender Socket Options + + If the sender specifies a value of 0 as coverage length, the module + assumes full coverage, transmits a packet with coverage length of 0 + and according checksum. If the sender specifies a coverage < 8 and + different from 0, the kernel assumes 8 as default value. Finally, + if the specified coverage length exceeds the packet length, the packet + length is used instead as coverage length. + + 2) Receiver Socket Options + + The receiver specifies the minimum value of the coverage length it + is willing to accept. A value of 0 here indicates that the receiver + always wants the whole of the packet covered. In this case, all + partially covered packets are dropped and an error is logged. + + It is not possible to specify illegal values (<0 and <8); in these + cases the default of 8 is assumed. + + All packets arriving with a coverage value less than the specified + threshold are discarded, these events are also logged. + + 3) Disabling the Checksum Computation + + On both sender and receiver, checksumming will always be performed + and can not be disabled using SO_NO_CHECK. Thus + + setsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, ... ); + + will always will be ignored, while the value of + + getsockopt(sockfd, SOL_SOCKET, SO_NO_CHECK, &value, ...); + + is meaningless (as in TCP). Packets with a zero checksum field are + illegal (cf. RFC 3828, sec. 3.1) will be silently discarded. + + 4) Fragmentation + + The checksum computation respects both buffersize and MTU. The size + of UDP-Lite packets is determined by the size of the send buffer. The + minimum size of the send buffer is 2048 (defined as SOCK_MIN_SNDBUF + in include/net/sock.h), the default value is configurable as + net.core.wmem_default or via setting the SO_SNDBUF socket(7) + option. The maximum upper bound for the send buffer is determined + by net.core.wmem_max. + + Given a payload size larger than the send buffer size, UDP-Lite will + split the payload into several individual packets, filling up the + send buffer size in each case. + + The precise value also depends on the interface MTU. The interface MTU, + in turn, may trigger IP fragmentation. In this case, the generated + UDP-Lite packet is split into several IP packets, of which only the + first one contains the L4 header. + + The send buffer size has implications on the checksum coverage length. + Consider the following example: + + Payload: 1536 bytes Send Buffer: 1024 bytes + MTU: 1500 bytes Coverage Length: 856 bytes + + UDP-Lite will ship the 1536 bytes in two separate packets: + + Packet 1: 1024 payload + 8 byte header + 20 byte IP header = 1052 bytes + Packet 2: 512 payload + 8 byte header + 20 byte IP header = 540 bytes + + The coverage packet covers the UDP-Lite header and 848 bytes of the + payload in the first packet, the second packet is fully covered. Note + that for the second packet, the coverage length exceeds the packet + length. The kernel always re-adjusts the coverage length to the packet + length in such cases. + + As an example of what happens when one UDP-Lite packet is split into + several tiny fragments, consider the following example. + + Payload: 1024 bytes Send buffer size: 1024 bytes + MTU: 300 bytes Coverage length: 575 bytes + + +-+-----------+--------------+--------------+--------------+ + |8| 272 | 280 | 280 | 280 | + +-+-----------+--------------+--------------+--------------+ + 280 560 840 1032 + ^ + *****checksum coverage************* + + The UDP-Lite module generates one 1032 byte packet (1024 + 8 byte + header). According to the interface MTU, these are split into 4 IP + packets (280 byte IP payload + 20 byte IP header). The kernel module + sums the contents of the entire first two packets, plus 15 bytes of + the last packet before releasing the fragments to the IP module. + + To see the analogous case for IPv6 fragmentation, consider a link + MTU of 1280 bytes and a write buffer of 3356 bytes. If the checksum + coverage is less than 1232 bytes (MTU minus IPv6/fragment header + lengths), only the first fragment needs to be considered. When using + larger checksum coverage lengths, each eligible fragment needs to be + checksummed. Suppose we have a checksum coverage of 3062. The buffer + of 3356 bytes will be split into the following fragments: + + Fragment 1: 1280 bytes carrying 1232 bytes of UDP-Lite data + Fragment 2: 1280 bytes carrying 1232 bytes of UDP-Lite data + Fragment 3: 948 bytes carrying 900 bytes of UDP-Lite data + + The first two fragments have to be checksummed in full, of the last + fragment only 598 (= 3062 - 2*1232) bytes are checksummed. + + While it is important that such cases are dealt with correctly, they + are (annoyingly) rare: UDP-Lite is designed for optimising multimedia + performance over wireless (or generally noisy) links and thus smaller + coverage lenghts are likely to be expected. + + + V) UDP-LITE RUNTIME STATISTICS AND THEIR MEANING + + Exceptional and error conditions are logged to syslog at the KERN_DEBUG + level. Live statistics about UDP-Lite are available in /proc/net/snmp + and can (with newer versions of netstat) be viewed using + + netstat -svu + + This displays UDP-Lite statistics variables, whose meaning is as follows. + + InDatagrams: Total number of received datagrams. + + NoPorts: Number of packets received to an unknown port. + These cases are counted separately (not as InErrors). + + InErrors: Number of erroneous UDP-Lite packets. Errors include: + * internal socket queue receive errors + * packet too short (less than 8 bytes or stated + coverage length exceeds received length) + * xfrm4_policy_check() returned with error + * application has specified larger min. coverage + length than that of incoming packet + * checksum coverage violated + * bad checksum + + OutDatagrams: Total number of sent datagrams. + + These statistics derive from the UDP MIB (RFC 2013). + + + VI) IPTABLES + + There is packet match support for UDP-Lite as well as support for the LOG target. + If you copy and paste the following line into /etc/protcols, + + udplite 136 UDP-Lite # UDP-Lite [RFC 3828] + + then + iptables -A INPUT -p udplite -j LOG + + will produce logging output to syslog. Dropping and rejecting packets also works. + + + VII) MAINTAINER ADDRESS + + The UDP-Lite patch was developed at + University of Aberdeen + Electronics Research Group + Department of Engineering + Fraser Noble Building + Aberdeen AB24 3UE; UK + The current maintainer is Gerrit Renker, <gerrit@erg.abdn.ac.uk>. Initial + code was developed by William Stanislaus, <william@erg.abdn.ac.uk>. diff --git a/Documentation/networking/wan-router.txt b/Documentation/networking/wan-router.txt index 0cf654147634..653978dcea7f 100644 --- a/Documentation/networking/wan-router.txt +++ b/Documentation/networking/wan-router.txt @@ -412,7 +412,7 @@ beta-2.1.4 Jul 2000 o Dynamic interface configuration: beta3-2.1.4 Jul 2000 o X25 M_BIT Problem fix. o Added the Multi-Port PPP - Updated utilites for the Multi-Port PPP. + Updated utilities for the Multi-Port PPP. 2.1.4 Aut 2000 o In X25API: @@ -444,13 +444,13 @@ beta1-2.1.5 Nov 15 2000 o Cpipemon - Added set FT1 commands to the cpipemon. Thus CSU/DSU - configuraiton can be performed using cpipemon. + configuration can be performed using cpipemon. All systems that cannot run cfgft1 GUI utility should use cpipemon to configure the on board CSU/DSU. o Keyboard Led Monitor/Debugger - - A new utilty /usr/sbin/wpkbdmon uses keyboard leds + - A new utility /usr/sbin/wpkbdmon uses keyboard leds to convey operational statistic information of the Sangoma WANPIPE cards. NUM_LOCK = Line State (On=connected, Off=disconnected) @@ -464,7 +464,7 @@ beta1-2.1.5 Nov 15 2000 - Appropriate number of devices are dynamically loaded based on the number of Sangoma cards found. - Note: The kernel configuraiton option + Note: The kernel configuration option CONFIG_WANPIPE_CARDS has been taken out. o Fixed the Frame Relay and Chdlc network interfaces so they are diff --git a/Documentation/networking/xfrm_sync.txt b/Documentation/networking/xfrm_sync.txt index 8be626f7c0b8..d7aac9dedeb4 100644 --- a/Documentation/networking/xfrm_sync.txt +++ b/Documentation/networking/xfrm_sync.txt @@ -47,10 +47,13 @@ aevent_id structure looks like: struct xfrm_aevent_id { struct xfrm_usersa_id sa_id; + xfrm_address_t saddr; __u32 flags; + __u32 reqid; }; -xfrm_usersa_id in this message layout identifies the SA. +The unique SA is identified by the combination of xfrm_usersa_id, +reqid and saddr. flags are used to indicate different things. The possible flags are: |