From 1da177e4c3f41524e886b7f1b8a0c1fc7321cac2 Mon Sep 17 00:00:00 2001 From: Linus Torvalds Date: Sat, 16 Apr 2005 15:20:36 -0700 Subject: Linux-2.6.12-rc2 Initial git repository build. I'm not bothering with the full history, even though we have it. We can create a separate "historical" git archive of that later if we want to, and in the meantime it's about 3.2GB when imported into git - space that would just make the early git days unnecessarily complicated, when we don't have a lot of good infrastructure for it. Let it rip! --- Documentation/networking/NAPI_HOWTO.txt | 766 ++++++++++++++++++++++++++++++++ 1 file changed, 766 insertions(+) create mode 100644 Documentation/networking/NAPI_HOWTO.txt (limited to 'Documentation/networking/NAPI_HOWTO.txt') diff --git a/Documentation/networking/NAPI_HOWTO.txt b/Documentation/networking/NAPI_HOWTO.txt new file mode 100644 index 000000000000..54376e8249c1 --- /dev/null +++ b/Documentation/networking/NAPI_HOWTO.txt @@ -0,0 +1,766 @@ +HISTORY: +February 16/2002 -- revision 0.2.1: +COR typo corrected +February 10/2002 -- revision 0.2: +some spell checking ;-> +January 12/2002 -- revision 0.1 +This is still work in progress so may change. +To keep up to date please watch this space. + +Introduction to NAPI +==================== + +NAPI is a proven (www.cyberus.ca/~hadi/usenix-paper.tgz) technique +to improve network performance on Linux. For more details please +read that paper. +NAPI provides a "inherent mitigation" which is bound by system capacity +as can be seen from the following data collected by Robert on Gigabit +ethernet (e1000): + + Psize Ipps Tput Rxint Txint Done Ndone + --------------------------------------------------------------- + 60 890000 409362 17 27622 7 6823 + 128 758150 464364 21 9301 10 7738 + 256 445632 774646 42 15507 21 12906 + 512 232666 994445 241292 19147 241192 1062 + 1024 119061 1000003 872519 19258 872511 0 + 1440 85193 1000003 946576 19505 946569 0 + + +Legend: +"Ipps" stands for input packets per second. +"Tput" == packets out of total 1M that made it out. +"txint" == transmit completion interrupts seen +"Done" == The number of times that the poll() managed to pull all +packets out of the rx ring. Note from this that the lower the +load the more we could clean up the rxring +"Ndone" == is the converse of "Done". Note again, that the higher +the load the more times we couldnt clean up the rxring. + +Observe that: +when the NIC receives 890Kpackets/sec only 17 rx interrupts are generated. +The system cant handle the processing at 1 interrupt/packet at that load level. +At lower rates on the other hand, rx interrupts go up and therefore the +interrupt/packet ratio goes up (as observable from that table). So there is +possibility that under low enough input, you get one poll call for each +input packet caused by a single interrupt each time. And if the system +cant handle interrupt per packet ratio of 1, then it will just have to +chug along .... + + +0) Prerequisites: +================== +A driver MAY continue using the old 2.4 technique for interfacing +to the network stack and not benefit from the NAPI changes. +NAPI additions to the kernel do not break backward compatibility. +NAPI, however, requires the following features to be available: + +A) DMA ring or enough RAM to store packets in software devices. + +B) Ability to turn off interrupts or maybe events that send packets up +the stack. + +NAPI processes packet events in what is known as dev->poll() method. +Typically, only packet receive events are processed in dev->poll(). +The rest of the events MAY be processed by the regular interrupt handler +to reduce processing latency (justified also because there are not that +many of them). +Note, however, NAPI does not enforce that dev->poll() only processes +receive events. +Tests with the tulip driver indicated slightly increased latency if +all of the interrupt handler is moved to dev->poll(). Also MII handling +gets a little trickier. +The example used in this document is to move the receive processing only +to dev->poll(); this is shown with the patch for the tulip driver. +For an example of code that moves all the interrupt driver to +dev->poll() look at the ported e1000 code. + +There are caveats that might force you to go with moving everything to +dev->poll(). Different NICs work differently depending on their status/event +acknowledgement setup. +There are two types of event register ACK mechanisms. + I) what is known as Clear-on-read (COR). + when you read the status/event register, it clears everything! + The natsemi and sunbmac NICs are known to do this. + In this case your only choice is to move all to dev->poll() + + II) Clear-on-write (COW) + i) you clear the status by writing a 1 in the bit-location you want. + These are the majority of the NICs and work the best with NAPI. + Put only receive events in dev->poll(); leave the rest in + the old interrupt handler. + ii) whatever you write in the status register clears every thing ;-> + Cant seem to find any supported by Linux which do this. If + someone knows such a chip email us please. + Move all to dev->poll() + +C) Ability to detect new work correctly. +NAPI works by shutting down event interrupts when theres work and +turning them on when theres none. +New packets might show up in the small window while interrupts were being +re-enabled (refer to appendix 2). A packet might sneak in during the period +we are enabling interrupts. We only get to know about such a packet when the +next new packet arrives and generates an interrupt. +Essentially, there is a small window of opportunity for a race condition +which for clarity we'll refer to as the "rotting packet". + +This is a very important topic and appendix 2 is dedicated for more +discussion. + +Locking rules and environmental guarantees +========================================== + +-Guarantee: Only one CPU at any time can call dev->poll(); this is because +only one CPU can pick the initial interrupt and hence the initial +netif_rx_schedule(dev); +- The core layer invokes devices to send packets in a round robin format. +This implies receive is totaly lockless because of the guarantee only that +one CPU is executing it. +- contention can only be the result of some other CPU accessing the rx +ring. This happens only in close() and suspend() (when these methods +try to clean the rx ring); +****guarantee: driver authors need not worry about this; synchronization +is taken care for them by the top net layer. +-local interrupts are enabled (if you dont move all to dev->poll()). For +example link/MII and txcomplete continue functioning just same old way. +This improves the latency of processing these events. It is also assumed that +the receive interrupt is the largest cause of noise. Note this might not +always be true. +[according to Manfred Spraul, the winbond insists on sending one +txmitcomplete interrupt for each packet (although this can be mitigated)]. +For these broken drivers, move all to dev->poll(). + +For the rest of this text, we'll assume that dev->poll() only +processes receive events. + +new methods introduce by NAPI +============================= + +a) netif_rx_schedule(dev) +Called by an IRQ handler to schedule a poll for device + +b) netif_rx_schedule_prep(dev) +puts the device in a state which allows for it to be added to the +CPU polling list if it is up and running. You can look at this as +the first half of netif_rx_schedule(dev) above; the second half +being c) below. + +c) __netif_rx_schedule(dev) +Add device to the poll list for this CPU; assuming that _prep above +has already been called and returned 1. + +d) netif_rx_reschedule(dev, undo) +Called to reschedule polling for device specifically for some +deficient hardware. Read Appendix 2 for more details. + +e) netif_rx_complete(dev) + +Remove interface from the CPU poll list: it must be in the poll list +on current cpu. This primitive is called by dev->poll(), when +it completes its work. The device cannot be out of poll list at this +call, if it is then clearly it is a BUG(). You'll know ;-> + +All these above nethods are used below. So keep reading for clarity. + +Device driver changes to be made when porting NAPI +================================================== + +Below we describe what kind of changes are required for NAPI to work. + +1) introduction of dev->poll() method +===================================== + +This is the method that is invoked by the network core when it requests +for new packets from the driver. A driver is allowed to send upto +dev->quota packets by the current CPU before yielding to the network +subsystem (so other devices can also get opportunity to send to the stack). + +dev->poll() prototype looks as follows: +int my_poll(struct net_device *dev, int *budget) + +budget is the remaining number of packets the network subsystem on the +current CPU can send up the stack before yielding to other system tasks. +*Each driver is responsible for decrementing budget by the total number of +packets sent. + Total number of packets cannot exceed dev->quota. + +dev->poll() method is invoked by the top layer, the driver just sends if it +can to the stack the packet quantity requested. + +more on dev->poll() below after the interrupt changes are explained. + +2) registering dev->poll() method +=================================== + +dev->poll should be set in the dev->probe() method. +e.g: +dev->open = my_open; +. +. +/* two new additions */ +/* first register my poll method */ +dev->poll = my_poll; +/* next register my weight/quanta; can be overridden in /proc */ +dev->weight = 16; +. +. +dev->stop = my_close; + + + +3) scheduling dev->poll() +============================= +This involves modifying the interrupt handler and the code +path which takes the packet off the NIC and sends them to the +stack. + +it's important at this point to introduce the classical D Becker +interrupt processor: + +------------------ +static irqreturn_t +netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs) +{ + + struct net_device *dev = (struct net_device *)dev_instance; + struct my_private *tp = (struct my_private *)dev->priv; + + int work_count = my_work_count; + status = read_interrupt_status_reg(); + if (status == 0) + return IRQ_NONE; /* Shared IRQ: not us */ + if (status == 0xffff) + return IRQ_HANDLED; /* Hot unplug */ + if (status & error) + do_some_error_handling() + + do { + acknowledge_ints_ASAP(); + + if (status & link_interrupt) { + spin_lock(&tp->link_lock); + do_some_link_stat_stuff(); + spin_lock(&tp->link_lock); + } + + if (status & rx_interrupt) { + receive_packets(dev); + } + + if (status & rx_nobufs) { + make_rx_buffs_avail(); + } + + if (status & tx_related) { + spin_lock(&tp->lock); + tx_ring_free(dev); + if (tx_died) + restart_tx(); + spin_unlock(&tp->lock); + } + + status = read_interrupt_status_reg(); + + } while (!(status & error) || more_work_to_be_done); + return IRQ_HANDLED; +} + +---------------------------------------------------------------------- + +We now change this to what is shown below to NAPI-enable it: + +---------------------------------------------------------------------- +static irqreturn_t +netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs) +{ + struct net_device *dev = (struct net_device *)dev_instance; + struct my_private *tp = (struct my_private *)dev->priv; + + status = read_interrupt_status_reg(); + if (status == 0) + return IRQ_NONE; /* Shared IRQ: not us */ + if (status == 0xffff) + return IRQ_HANDLED; /* Hot unplug */ + if (status & error) + do_some_error_handling(); + + do { +/************************ start note *********************************/ + acknowledge_ints_ASAP(); // dont ack rx and rxnobuff here +/************************ end note *********************************/ + + if (status & link_interrupt) { + spin_lock(&tp->link_lock); + do_some_link_stat_stuff(); + spin_unlock(&tp->link_lock); + } +/************************ start note *********************************/ + if (status & rx_interrupt || (status & rx_nobuffs)) { + if (netif_rx_schedule_prep(dev)) { + + /* disable interrupts caused + * by arriving packets */ + disable_rx_and_rxnobuff_ints(); + /* tell system we have work to be done. */ + __netif_rx_schedule(dev); + } else { + printk("driver bug! interrupt while in poll\n"); + /* FIX by disabling interrupts */ + disable_rx_and_rxnobuff_ints(); + } + } +/************************ end note note *********************************/ + + if (status & tx_related) { + spin_lock(&tp->lock); + tx_ring_free(dev); + + if (tx_died) + restart_tx(); + spin_unlock(&tp->lock); + } + + status = read_interrupt_status_reg(); + +/************************ start note *********************************/ + } while (!(status & error) || more_work_to_be_done(status)); +/************************ end note note *********************************/ + return IRQ_HANDLED; +} + +--------------------------------------------------------------------- + + +We note several things from above: + +I) Any interrupt source which is caused by arriving packets is now +turned off when it occurs. Depending on the hardware, there could be +several reasons that arriving packets would cause interrupts; these are the +interrupt sources we wish to avoid. The two common ones are a) a packet +arriving (rxint) b) a packet arriving and finding no DMA buffers available +(rxnobuff) . +This means also acknowledge_ints_ASAP() will not clear the status +register for those two items above; clearing is done in the place where +proper work is done within NAPI; at the poll() and refill_rx_ring() +discussed further below. +netif_rx_schedule_prep() returns 1 if device is in running state and +gets successfully added to the core poll list. If we get a zero value +we can _almost_ assume are already added to the list (instead of not running. +Logic based on the fact that you shouldn't get interrupt if not running) +We rectify this by disabling rx and rxnobuf interrupts. + +II) that receive_packets(dev) and make_rx_buffs_avail() may have disappeared. +These functionalities are still around actually...... + +infact, receive_packets(dev) is very close to my_poll() and +make_rx_buffs_avail() is invoked from my_poll() + +4) converting receive_packets() to dev->poll() +=============================================== + +We need to convert the classical D Becker receive_packets(dev) to my_poll() + +First the typical receive_packets() below: +------------------------------------------------------------------- + +/* this is called by interrupt handler */ +static void receive_packets (struct net_device *dev) +{ + + struct my_private *tp = (struct my_private *)dev->priv; + rx_ring = tp->rx_ring; + cur_rx = tp->cur_rx; + int entry = cur_rx % RX_RING_SIZE; + int received = 0; + int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx; + + while (rx_ring_not_empty) { + u32 rx_status; + unsigned int rx_size; + unsigned int pkt_size; + struct sk_buff *skb; + /* read size+status of next frame from DMA ring buffer */ + /* the number 16 and 4 are just examples */ + rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset)); + rx_size = rx_status >> 16; + pkt_size = rx_size - 4; + + /* process errors */ + if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) || + (!(rx_status & RxStatusOK))) { + netdrv_rx_err (rx_status, dev, tp, ioaddr); + return; + } + + if (--rx_work_limit < 0) + break; + + /* grab a skb */ + skb = dev_alloc_skb (pkt_size + 2); + if (skb) { + . + . + netif_rx (skb); + . + . + } else { /* OOM */ + /*seems very driver specific ... some just pass + whatever is on the ring already. */ + } + + /* move to the next skb on the ring */ + entry = (++tp->cur_rx) % RX_RING_SIZE; + received++ ; + + } + + /* store current ring pointer state */ + tp->cur_rx = cur_rx; + + /* Refill the Rx ring buffers if they are needed */ + refill_rx_ring(); + . + . + +} +------------------------------------------------------------------- +We change it to a new one below; note the additional parameter in +the call. + +------------------------------------------------------------------- + +/* this is called by the network core */ +static int my_poll (struct net_device *dev, int *budget) +{ + + struct my_private *tp = (struct my_private *)dev->priv; + rx_ring = tp->rx_ring; + cur_rx = tp->cur_rx; + int entry = cur_rx % RX_BUF_LEN; + /* maximum packets to send to the stack */ +/************************ note note *********************************/ + int rx_work_limit = dev->quota; + +/************************ end note note *********************************/ + do { // outer beginning loop starts here + + clear_rx_status_register_bit(); + + while (rx_ring_not_empty) { + u32 rx_status; + unsigned int rx_size; + unsigned int pkt_size; + struct sk_buff *skb; + /* read size+status of next frame from DMA ring buffer */ + /* the number 16 and 4 are just examples */ + rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset)); + rx_size = rx_status >> 16; + pkt_size = rx_size - 4; + + /* process errors */ + if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) || + (!(rx_status & RxStatusOK))) { + netdrv_rx_err (rx_status, dev, tp, ioaddr); + return 1; + } + +/************************ note note *********************************/ + if (--rx_work_limit < 0) { /* we got packets, but no quota */ + /* store current ring pointer state */ + tp->cur_rx = cur_rx; + + /* Refill the Rx ring buffers if they are needed */ + refill_rx_ring(dev); + goto not_done; + } +/********************** end note **********************************/ + + /* grab a skb */ + skb = dev_alloc_skb (pkt_size + 2); + if (skb) { + . + . +/************************ note note *********************************/ + netif_receive_skb (skb); +/********************** end note **********************************/ + . + . + } else { /* OOM */ + /*seems very driver specific ... common is just pass + whatever is on the ring already. */ + } + + /* move to the next skb on the ring */ + entry = (++tp->cur_rx) % RX_RING_SIZE; + received++ ; + + } + + /* store current ring pointer state */ + tp->cur_rx = cur_rx; + + /* Refill the Rx ring buffers if they are needed */ + refill_rx_ring(dev); + + /* no packets on ring; but new ones can arrive since we last + checked */ + status = read_interrupt_status_reg(); + if (rx status is not set) { + /* If something arrives in this narrow window, + an interrupt will be generated */ + goto done; + } + /* done! at least thats what it looks like ;-> + if new packets came in after our last check on status bits + they'll be caught by the while check and we go back and clear them + since we havent exceeded our quota */ + } while (rx_status_is_set); + +done: + +/************************ note note *********************************/ + dev->quota -= received; + *budget -= received; + + /* If RX ring is not full we are out of memory. */ + if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) + goto oom; + + /* we are happy/done, no more packets on ring; put us back + to where we can start processing interrupts again */ + netif_rx_complete(dev); + enable_rx_and_rxnobuf_ints(); + + /* The last op happens after poll completion. Which means the following: + * 1. it can race with disabling irqs in irq handler (which are done to + * schedule polls) + * 2. it can race with dis/enabling irqs in other poll threads + * 3. if an irq raised after the begining of the outer beginning + * loop(marked in the code above), it will be immediately + * triggered here. + * + * Summarizing: the logic may results in some redundant irqs both + * due to races in masking and due to too late acking of already + * processed irqs. The good news: no events are ever lost. + */ + + return 0; /* done */ + +not_done: + if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 || + tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) + refill_rx_ring(dev); + + if (!received) { + printk("received==0\n"); + received = 1; + } + dev->quota -= received; + *budget -= received; + return 1; /* not_done */ + +oom: + /* Start timer, stop polling, but do not enable rx interrupts. */ + start_poll_timer(dev); + return 0; /* we'll take it from here so tell core "done"*/ + +/************************ End note note *********************************/ +} +------------------------------------------------------------------- + +From above we note that: +0) rx_work_limit = dev->quota +1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when +it does the work. +2) We have a done and not_done state. +3) instead of netif_rx() we call netif_receive_skb() to pass the skb. +4) we have a new way of handling oom condition +5) A new outer for (;;) loop has been added. This serves the purpose of +ensuring that if a new packet has come in, after we are all set and done, +and we have not exceeded our quota that we continue sending packets up. + + +----------------------------------------------------------- +Poll timer code will need to do the following: + +a) + + if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 || + tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) + refill_rx_ring(dev); + + /* If RX ring is not full we are still out of memory. + Restart the timer again. Else we re-add ourselves + to the master poll list. + */ + + if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL) + restart_timer(); + + else netif_rx_schedule(dev); /* we are back on the poll list */ + +5) dev->close() and dev->suspend() issues +========================================== +The driver writter neednt worry about this. The top net layer takes +care of it. + +6) Adding new Stats to /proc +============================= +In order to debug some of the new features, we introduce new stats +that need to be collected. +TODO: Fill this later. + +APPENDIX 1: discussion on using ethernet HW FC +============================================== +Most chips with FC only send a pause packet when they run out of Rx buffers. +Since packets are pulled off the DMA ring by a softirq in NAPI, +if the system is slow in grabbing them and we have a high input +rate (faster than the system's capacity to remove packets), then theoretically +there will only be one rx interrupt for all packets during a given packetstorm. +Under low load, we might have a single interrupt per packet. +FC should be programmed to apply in the case when the system cant pull out +packets fast enough i.e send a pause only when you run out of rx buffers. +Note FC in itself is a good solution but we have found it to not be +much of a commodity feature (both in NICs and switches) and hence falls +under the same category as using NIC based mitigation. Also experiments +indicate that its much harder to resolve the resource allocation +issue (aka lazy receiving that NAPI offers) and hence quantify its usefullness +proved harder. In any case, FC works even better with NAPI but is not +necessary. + + +APPENDIX 2: the "rotting packet" race-window avoidance scheme +============================================================= + +There are two types of associations seen here + +1) status/int which honors level triggered IRQ + +If a status bit for receive or rxnobuff is set and the corresponding +interrupt-enable bit is not on, then no interrupts will be generated. However, +as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is +generated. [assuming the status bit was not turned off]. +Generally the concept of level triggered IRQs in association with a status and +interrupt-enable CSR register set is used to avoid the race. + +If we take the example of the tulip: +"pending work" is indicated by the status bit(CSR5 in tulip). +the corresponding interrupt bit (CSR7 in tulip) might be turned off (but +the CSR5 will continue to be turned on with new packet arrivals even if +we clear it the first time) +Very important is the fact that if we turn on the interrupt bit on when +status is set that an immediate irq is triggered. + +If we cleared the rx ring and proclaimed there was "no more work +to be done" and then went on to do a few other things; then when we enable +interrupts, there is a possibility that a new packet might sneak in during +this phase. It helps to look at the pseudo code for the tulip poll +routine: + +-------------------------- + do { + ACK; + while (ring_is_not_empty()) { + work-work-work + if quota is exceeded: exit, no touching irq status/mask + } + /* No packets, but new can arrive while we are doing this*/ + CSR5 := read + if (CSR5 is not set) { + /* If something arrives in this narrow window here, + * where the comments are ;-> irq will be generated */ + unmask irqs; + exit poll; + } + } while (rx_status_is_set); +------------------------ + +CSR5 bit of interest is only the rx status. +If you look at the last if statement: +you just finished grabbing all the packets from the rx ring .. you check if +status bit says theres more packets just in ... it says none; you then +enable rx interrupts again; if a new packet just came in during this check, +we are counting that CSR5 will be set in that small window of opportunity +and that by re-enabling interrupts, we would actually triger an interrupt +to register the new packet for processing. + +[The above description nay be very verbose, if you have better wording +that will make this more understandable, please suggest it.] + +2) non-capable hardware + +These do not generally respect level triggered IRQs. Normally, +irqs may be lost while being masked and the only way to leave poll is to do +a double check for new input after netif_rx_complete() is invoked +and re-enable polling (after seeing this new input). + +Sample code: + +--------- + . + . +restart_poll: + while (ring_is_not_empty()) { + work-work-work + if quota is exceeded: exit, not touching irq status/mask + } + . + . + . + enable_rx_interrupts() + netif_rx_complete(dev); + if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) { + disable_rx_and_rxnobufs() + goto restart_poll + } while (rx_status_is_set); +--------- + +Basically netif_rx_complete() removes us from the poll list, but because a +new packet which will never be caught due to the possibility of a race +might come in, we attempt to re-add ourselves to the poll list. + + + + +APPENDIX 3: Scheduling issues. +============================== +As seen NAPI moves processing to softirq level. Linux uses the ksoftirqd as the +general solution to schedule softirq's to run before next interrupt and by putting +them under scheduler control. Also this prevents consecutive softirq's from +monopolize the CPU. This also have the effect that the priority of ksoftirq needs +to be considered when running very CPU-intensive applications and networking to +get the proper balance of softirq/user balance. Increasing ksoftirq priority to 0 +(eventually more) is reported cure problems with low network performance at high +CPU load. + +Most used processes in a GIGE router: +USER PID %CPU %MEM SIZE RSS TTY STAT START TIME COMMAND +root 3 0.2 0.0 0 0 ? RWN Aug 15 602:00 (ksoftirqd_CPU0) +root 232 0.0 7.9 41400 40884 ? S Aug 15 74:12 gated + +-------------------------------------------------------------------- + +relevant sites: +================== +ftp://robur.slu.se/pub/Linux/net-development/NAPI/ + + +-------------------------------------------------------------------- +TODO: Write net-skeleton.c driver. +------------------------------------------------------------- + +Authors: +======== +Alexey Kuznetsov +Jamal Hadi Salim +Robert Olsson + +Acknowledgements: +================ +People who made this document better: + +Lennert Buytenhek +Andrew Morton +Manfred Spraul +Donald Becker +Jeff Garzik -- cgit v1.2.3