HISTORY:
February 16/2002 -- revision 0.2.1:
COR typo corrected
February 10/2002 -- revision 0.2:
some spell checking ;->
January 12/2002 -- revision 0.1
This is still work in progress so may change.
To keep up to date please watch this space.

Introduction to NAPI
====================

NAPI is a proven technique (see www.cyberus.ca/~hadi/usenix-paper.tgz)
for improving network performance on Linux. For more details please
read that paper.
NAPI provides an "inherent mitigation" which is bound by system capacity,
as can be seen from the following data collected by Robert on Gigabit
ethernet (e1000):

Psize      Ipps      Tput     Rxint     Txint      Done    Ndone
----------------------------------------------------------------
   60    890000    409362        17     27622         7     6823
  128    758150    464364        21      9301        10     7738
  256    445632    774646        42     15507        21    12906
  512    232666    994445    241292     19147    241192     1062
 1024    119061   1000003    872519     19258    872511        0
 1440     85193   1000003    946576     19505    946569        0


Legend:
"Psize" == packet size in bytes.
"Ipps"  == input packets per second.
"Tput"  == packets, out of a total of 1M, that made it out.
"Rxint" == receive interrupts seen.
"Txint" == transmit completion interrupts seen.
"Done"  == the number of times that poll() managed to pull all
           packets out of the rx ring. Note from this that the lower the
           load, the more often we could clean up the rx ring.
"Ndone" == the converse of "Done". Note again that the higher
           the load, the more times we couldn't clean up the rx ring.

Observe that:
when the NIC receives 890K packets/sec, only 17 rx interrupts are generated.
The system can't handle the processing at 1 interrupt/packet at that load level.
At lower rates, on the other hand, rx interrupts go up and therefore the
interrupt/packet ratio goes up (as observable from the table). So there is
a possibility that under low enough input, you get one poll call for each
input packet, each caused by a single interrupt. And if the system
can't handle an interrupt-per-packet ratio of 1, then it will just have to
chug along ....


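As a rough illustration of the interrupt/packet ratio discussed above, the
following C sketch divides the Rxint column by the number of packets offered.
It assumes that each run offered 1M packets (inferred from the "Tput" legend
entry, not stated outright); the struct and function names are made up for
this example.

```c
#include <stdio.h>

/* One row of the measurement table above (only the columns used here). */
struct napi_sample {
    int  psize;   /* packet size */
    long ipps;    /* input packets per second */
    long rxint;   /* rx interrupts seen during the run */
};

/* ASSUMPTION: every run offered 1M packets, per the "Tput" legend entry. */
#define PACKETS_OFFERED 1000000L

static const struct napi_sample samples[] = {
    {   60, 890000,     17 },
    {  128, 758150,     21 },
    {  256, 445632,     42 },
    {  512, 232666, 241292 },
    { 1024, 119061, 872519 },
    { 1440,  85193, 946576 },
};

/* Rx interrupts per offered packet for one row. */
double rx_ints_per_packet(const struct napi_sample *s)
{
    return (double)s->rxint / (double)PACKETS_OFFERED;
}

/* Print the ratio for every row of the table. */
void print_ratios(void)
{
    size_t i;

    for (i = 0; i < sizeof(samples) / sizeof(samples[0]); i++)
        printf("%5d bytes: %.6f rx interrupts/packet\n",
               samples[i].psize, rx_ints_per_packet(&samples[i]));
}
```

Under that assumption the ratio is far below one interrupt per thousand
packets at 60 bytes and close to one interrupt per packet at 1440 bytes.
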
0) Prerequisites:
==================
A driver MAY continue using the old 2.4 technique for interfacing
to the network stack and not benefit from the NAPI changes.
NAPI additions to the kernel do not break backward compatibility.
NAPI, however, requires the following features to be available:

A) DMA ring or enough RAM to store packets in software devices.

B) Ability to turn off interrupts or maybe events that send packets up
the stack.

NAPI processes packet events in what is known as the dev->poll() method.
Typically, only packet receive events are processed in dev->poll().
The rest of the events MAY be processed by the regular interrupt handler
to reduce processing latency (justified also because there are not that
many of them).
Note, however, that NAPI does not enforce that dev->poll() only processes
receive events.
Tests with the tulip driver indicated slightly increased latency if
all of the interrupt handler is moved to dev->poll(). Also MII handling
gets a little trickier.
The example used in this document moves only the receive processing
to dev->poll(); this is shown with the patch for the tulip driver.
For an example of code that moves all of the interrupt handler to
dev->poll(), look at the ported e1000 code.

There are caveats that might force you to go with moving everything to
dev->poll(). Different NICs work differently depending on their status/event
acknowledgement setup.
There are two types of event register ACK mechanisms.
    I)  what is known as Clear-on-read (COR).
        When you read the status/event register, it clears everything!
        The natsemi and sunbmac NICs are known to do this.
        In this case your only choice is to move all to dev->poll()

    II) Clear-on-write (COW)
        i)  You clear the status by writing a 1 in the bit-location you want.
            These are the majority of the NICs and work the best with NAPI.
            Put only receive events in dev->poll(); leave the rest in
            the old interrupt handler.
        ii) Whatever you write in the status register clears everything ;->
            Can't seem to find any supported by Linux which do this. If
            someone knows such a chip, email us please.
            Move all to dev->poll()

C) Ability to detect new work correctly.
NAPI works by shutting down event interrupts when there's work and
turning them on when there's none.
New packets might show up in the small window while interrupts are being
re-enabled (refer to appendix 2). We only get to know about such a packet
when the next new packet arrives and generates an interrupt.
Essentially, there is a small window of opportunity for a race condition
which for clarity we'll refer to as the "rotting packet".

This is a very important topic and appendix 2 is dedicated to a more
detailed discussion.

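The two acknowledgement styles described under prerequisite B can be modelled
in plain user-space C. This is only a sketch: the event bit values, the
register variable and all function names are hypothetical and do not
correspond to any real NIC or kernel API.

```c
#include <stdint.h>

/* Hypothetical event bits, for illustration only. */
#define EV_RX   0x1u
#define EV_TX   0x2u
#define EV_LINK 0x4u

/* Clear-on-read: a read returns the pending events and clears all of them. */
uint32_t cor_read(uint32_t *reg)
{
    uint32_t pending = *reg;

    *reg = 0;
    return pending;
}

/* Clear-on-write, variant i): writing a 1 clears only that bit position. */
void w1c_write(uint32_t *reg, uint32_t bits)
{
    *reg &= ~bits;
}

/*
 * What an interrupt handler could do on a write-1-to-clear device when
 * rx work is left to dev->poll(): acknowledge everything except rx.
 * Returns the events the handler has to process itself.
 */
uint32_t irq_ack_all_but_rx(uint32_t *reg)
{
    uint32_t status = *reg;            /* plain read, no side effect */

    w1c_write(reg, status & ~EV_RX);
    return status & ~EV_RX;
}
```

With the write-1-to-clear model the rx bit survives the interrupt handler and
is still there for the poll routine to see. With the clear-on-read model a
single read wipes every event, which is why such hardware forces all event
handling into one place.
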
Locking rules and environmental guarantees
==========================================

- Guarantee: Only one CPU at any time can call dev->poll(); this is because
only one CPU can pick the initial interrupt and hence the initial
netif_rx_schedule(dev);
- The core layer invokes devices to send packets in a round robin format.
This implies receive is totally lockless because of the guarantee that only
one CPU is executing it.
- Contention can only be the result of some other CPU accessing the rx
ring. This happens only in close() and suspend() (when these methods
try to clean the rx ring);
****guarantee: driver authors need not worry about this; synchronization
is taken care of for them by the top net layer.
- Local interrupts are enabled (if you don't move all to dev->poll()). For
example link/MII and txcomplete continue functioning just the same old way.
This improves the latency of processing these events. It is also assumed that
the receive interrupt is the largest cause of noise. Note this might not
always be true.
[according to Manfred Spraul, the winbond insists on sending one
txmitcomplete interrupt for each packet (although this can be mitigated)].
For these broken drivers, move all to dev->poll().

For the rest of this text, we'll assume that dev->poll() only
processes receive events.

New methods introduced by NAPI
==============================

a) netif_rx_schedule(dev)
    Called by an IRQ handler to schedule a poll for device

b) netif_rx_schedule_prep(dev)
    puts the device in a state which allows for it to be added to the
    CPU polling list if it is up and running. You can look at this as
    the first half of netif_rx_schedule(dev) above; the second half
    being c) below.

c) __netif_rx_schedule(dev)
    Add device to the poll list for this CPU; assuming that _prep above
    has already been called and returned 1.

d) netif_rx_reschedule(dev, undo)
    Called to reschedule polling for device specifically for some
    deficient hardware. Read Appendix 2 for more details.

e) netif_rx_complete(dev)
    Remove interface from the CPU poll list: it must be in the poll list
    on current cpu. This primitive is called by dev->poll(), when
    it completes its work. The device cannot be out of poll list at this
    call, if it is then clearly it is a BUG(). You'll know ;->

All of the above methods are used below, so keep reading for clarity.

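To make the relationship between a), b), c) and e) concrete, here is a small
user-space model of the scheduling state. It is a sketch under stated
assumptions, not the kernel implementation: struct sched_model_dev and the
model_* functions are invented for this example, and the BUG() case is
reported through a return value instead.

```c
#include <stdbool.h>

/* Hypothetical stand-in for struct net_device; not the kernel structure. */
struct sched_model_dev {
    bool running;       /* device is up and running */
    bool rx_sched;      /* device has been claimed for polling */
    bool on_poll_list;  /* device sits on this CPU's poll list */
};

/* b) first half: claim the device. Returns 1 on success, 0 if the
 * device is down or has already been claimed. */
int model_rx_schedule_prep(struct sched_model_dev *dev)
{
    if (!dev->running || dev->rx_sched)
        return 0;
    dev->rx_sched = true;
    return 1;
}

/* c) second half: add to the poll list; _prep must have returned 1. */
void model_rx_schedule_add(struct sched_model_dev *dev)
{
    dev->on_poll_list = true;
}

/* a) both halves together, as an IRQ handler would use them. */
void model_rx_schedule(struct sched_model_dev *dev)
{
    if (model_rx_schedule_prep(dev))
        model_rx_schedule_add(dev);
}

/* e) called when poll has finished its work. Returns 0 on success and
 * -1 where the text above says the kernel would hit a BUG(). */
int model_rx_complete(struct sched_model_dev *dev)
{
    if (!dev->on_poll_list)
        return -1;
    dev->on_poll_list = false;
    dev->rx_sched = false;
    return 0;
}
```

In this model a second _prep call fails until the poll routine has called
the completion function, which mirrors the guarantee that a device is only
scheduled once at a time.
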
Device driver changes to be made when porting NAPI
==================================================

Below we describe what kind of changes are required for NAPI to work.

1) introduction of dev->poll() method
=====================================

This is the method that is invoked by the network core when it requests
new packets from the driver. A driver is allowed to send up to
dev->quota packets by the current CPU before yielding to the network
subsystem (so other devices can also get an opportunity to send to the stack).

dev->poll() prototype looks as follows:
int my_poll(struct net_device *dev, int *budget)

budget is the remaining number of packets the network subsystem on the
current CPU can send up the stack before yielding to other system tasks.
*Each driver is responsible for decrementing budget by the total number of
packets sent.
    Total number of packets cannot exceed dev->quota.

dev->poll() method is invoked by the top layer; the driver just sends,
if it can, the requested quantity of packets to the stack.

More on dev->poll() below, after the interrupt changes are explained.

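The quota and budget accounting described above can be sketched as a
user-space model. Everything here is hypothetical: struct quota_model_dev
stands in for the real device structure, and "pending" simulates packets
waiting on the rx ring. Clamping the work limit to *budget as well as to
dev->quota is a choice made for this sketch.

```c
/* Hypothetical stand-in for struct net_device; not the kernel structure. */
struct quota_model_dev {
    int quota;    /* packets this device may still send in this round */
    int pending;  /* packets waiting on the simulated rx ring */
};

/*
 * Model of a dev->poll() method: send at most min(dev->quota, *budget)
 * packets, then charge both counters with what was actually sent.
 * Returns 0 when the ring is empty ("done") and 1 when packets remain
 * ("not done").
 */
int model_poll(struct quota_model_dev *dev, int *budget)
{
    int limit = dev->quota < *budget ? dev->quota : *budget;
    int sent = dev->pending < limit ? dev->pending : limit;

    dev->pending -= sent;   /* "send" the packets up the stack */
    dev->quota -= sent;
    *budget -= sent;

    return dev->pending > 0;
}
```

A caller that hands the device a fresh quota of 16 before each call would
need three calls to drain 40 pending packets: 16, 16 and then the remaining 8.
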
2) registering dev->poll() method
===================================

dev->poll should be set in the dev->probe() method.
e.g:
    dev->open = my_open;
    .
    .
    /* two new additions */
    /* first register my poll method */
    dev->poll = my_poll;
    /* next register my weight/quanta; can be overridden in /proc */
    dev->weight = 16;
    .
    .
    dev->stop = my_close;



3) scheduling dev->poll()
=============================
This involves modifying the interrupt handler and the code
path which takes the packets off the NIC and sends them to the
stack.

It's important at this point to introduce the classical D Becker
interrupt processor:

------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct net_device *dev = (struct net_device *)dev_id;
    struct my_private *tp = (struct my_private *)dev->priv;
    int work_count = my_work_count;

    status = read_interrupt_status_reg();
    if (status == 0)
        return IRQ_NONE;        /* Shared IRQ: not us */
    if (status == 0xffff)
        return IRQ_HANDLED;     /* Hot unplug */
    if (status & error)
        do_some_error_handling();

    do {
        acknowledge_ints_ASAP();

        if (status & link_interrupt) {
            spin_lock(&tp->link_lock);
            do_some_link_stat_stuff();
            spin_unlock(&tp->link_lock);
        }

        if (status & rx_interrupt) {
            receive_packets(dev);
        }

        if (status & rx_nobufs) {
            make_rx_buffs_avail();
        }

        if (status & tx_related) {
            spin_lock(&tp->lock);
            tx_ring_free(dev);
            if (tx_died)
                restart_tx();
            spin_unlock(&tp->lock);
        }

        status = read_interrupt_status_reg();

    } while (!(status & error) || more_work_to_be_done);
    return IRQ_HANDLED;
}

----------------------------------------------------------------------

We now change this to what is shown below to NAPI-enable it:

----------------------------------------------------------------------
static irqreturn_t
netdevice_interrupt(int irq, void *dev_id, struct pt_regs *regs)
{
    struct net_device *dev = (struct net_device *)dev_id;
    struct my_private *tp = (struct my_private *)dev->priv;

    status = read_interrupt_status_reg();
    if (status == 0)
        return IRQ_NONE;        /* Shared IRQ: not us */
    if (status == 0xffff)
        return IRQ_HANDLED;     /* Hot unplug */
    if (status & error)
        do_some_error_handling();

    do {
/************************ start note *********************************/
        acknowledge_ints_ASAP();  /* don't ack rx and rxnobuff here */
/************************ end note *********************************/

        if (status & link_interrupt) {
            spin_lock(&tp->link_lock);
            do_some_link_stat_stuff();
            spin_unlock(&tp->link_lock);
        }
/************************ start note *********************************/
        if ((status & rx_interrupt) || (status & rx_nobuffs)) {
            if (netif_rx_schedule_prep(dev)) {

                /* disable interrupts caused
                 * by arriving packets */
                disable_rx_and_rxnobuff_ints();
                /* tell system we have work to be done. */
                __netif_rx_schedule(dev);
            } else {
                printk("driver bug! interrupt while in poll\n");
                /* FIX by disabling interrupts  */
                disable_rx_and_rxnobuff_ints();
            }
        }
/************************ end note *********************************/

        if (status & tx_related) {
            spin_lock(&tp->lock);
            tx_ring_free(dev);

            if (tx_died)
                restart_tx();
            spin_unlock(&tp->lock);
        }

        status = read_interrupt_status_reg();

/************************ start note *********************************/
    } while (!(status & error) || more_work_to_be_done(status));
/************************ end note *********************************/
    return IRQ_HANDLED;
}

---------------------------------------------------------------------


We note several things from above:

I) Any interrupt source which is caused by arriving packets is now
turned off when it occurs. Depending on the hardware, there could be
several reasons that arriving packets would cause interrupts; these are the
interrupt sources we wish to avoid. The two common ones are a) a packet
arriving (rxint) b) a packet arriving and finding no DMA buffers available
(rxnobuff).
This also means that acknowledge_ints_ASAP() will not clear the status
register for those two items above; clearing is done in the place where
proper work is done within NAPI: in poll() and refill_rx_ring(),
discussed further below.
netif_rx_schedule_prep() returns 1 if the device is in the running state and
is successfully claimed for addition to the core poll list (the addition
itself is done by __netif_rx_schedule()). If we get a zero value
we can _almost_ assume we are already on the list (instead of not running;
the logic is based on the fact that you shouldn't get an interrupt if not
running). We rectify this by disabling rx and rxnobuf interrupts.

II) receive_packets(dev) and make_rx_buffs_avail() appear to have disappeared.
These functionalities are still around actually......

In fact, receive_packets(dev) is very close to my_poll() and
make_rx_buffs_avail() is invoked from my_poll()

4) converting receive_packets() to dev->poll()
===============================================

We need to convert the classical D Becker receive_packets(dev) to my_poll()

First the typical receive_packets() below:
-------------------------------------------------------------------

/* this is called by interrupt handler */
static void receive_packets (struct net_device *dev)
{
    struct my_private *tp = (struct my_private *)dev->priv;
    rx_ring = tp->rx_ring;
    cur_rx = tp->cur_rx;
    int entry = cur_rx % RX_RING_SIZE;
    int received = 0;
    int rx_work_limit = tp->dirty_rx + RX_RING_SIZE - tp->cur_rx;

    while (rx_ring_not_empty) {
        u32 rx_status;
        unsigned int rx_size;
        unsigned int pkt_size;
        struct sk_buff *skb;
        /* read size+status of next frame from DMA ring buffer */
        /* the number 16 and 4 are just examples */
        rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
        rx_size = rx_status >> 16;
        pkt_size = rx_size - 4;

        /* process errors */
        if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
            (!(rx_status & RxStatusOK))) {
            netdrv_rx_err (rx_status, dev, tp, ioaddr);
            return;
        }

        if (--rx_work_limit < 0)
            break;

        /* grab a skb */
        skb = dev_alloc_skb (pkt_size + 2);
        if (skb) {
            .
            .
            netif_rx (skb);
            .
            .
        } else {  /* OOM */
            /* seems very driver specific ... some just pass
               whatever is on the ring already. */
        }

        /* move to the next skb on the ring; advance the local copy,
           which is written back to tp->cur_rx after the loop */
        entry = (++cur_rx) % RX_RING_SIZE;
        received++;

    }

    /* store current ring pointer state */
    tp->cur_rx = cur_rx;

    /* Refill the Rx ring buffers if they are needed */
    refill_rx_ring();
    .
    .

}
-------------------------------------------------------------------
We change it to a new one below; note the additional parameter in
the call.

-------------------------------------------------------------------

/* this is called by the network core */
static int my_poll (struct net_device *dev, int *budget)
{
    struct my_private *tp = (struct my_private *)dev->priv;
    rx_ring = tp->rx_ring;
    cur_rx = tp->cur_rx;
    int entry = cur_rx % RX_RING_SIZE;
    int received = 0;
    /* maximum packets to send to the stack */
/************************ note note *********************************/
    int rx_work_limit = dev->quota;

/************************ end note note *********************************/
    do {  /* outer loop starts here */

        clear_rx_status_register_bit();

        while (rx_ring_not_empty) {
            u32 rx_status;
            unsigned int rx_size;
            unsigned int pkt_size;
            struct sk_buff *skb;
            /* read size+status of next frame from DMA ring buffer */
            /* the number 16 and 4 are just examples */
            rx_status = le32_to_cpu (*(u32 *) (rx_ring + ring_offset));
            rx_size = rx_status >> 16;
            pkt_size = rx_size - 4;

            /* process errors */
            if ((rx_size > (MAX_ETH_FRAME_SIZE+4)) ||
                (!(rx_status & RxStatusOK))) {
                netdrv_rx_err (rx_status, dev, tp, ioaddr);
                return 1;
            }

/************************ note note *********************************/
            if (--rx_work_limit < 0) { /* we got packets, but no quota */
                /* store current ring pointer state */
                tp->cur_rx = cur_rx;

                /* Refill the Rx ring buffers if they are needed */
                refill_rx_ring(dev);
                goto not_done;
            }
/**********************  end note **********************************/

            /* grab a skb */
            skb = dev_alloc_skb (pkt_size + 2);
            if (skb) {
                .
                .
/************************ note note *********************************/
                netif_receive_skb (skb);
/**********************  end note **********************************/
                .
                .
            } else {  /* OOM */
                /* seems very driver specific ... common is just pass
                   whatever is on the ring already. */
            }

            /* move to the next skb on the ring; advance the local copy,
               which is written back to tp->cur_rx below */
            entry = (++cur_rx) % RX_RING_SIZE;
            received++;

        }

        /* store current ring pointer state */
        tp->cur_rx = cur_rx;

        /* Refill the Rx ring buffers if they are needed */
        refill_rx_ring(dev);

        /* no packets on ring; but new ones can arrive since we last
           checked  */
        status = read_interrupt_status_reg();
        if (rx status is not set) {
            /* If something arrives in this narrow window,
               an interrupt will be generated */
            goto done;
        }
        /* done! at least that's what it looks like ;->
           if new packets came in after our last check on status bits
           they'll be caught by the while check and we go back and clear them
           since we haven't exceeded our quota */
    } while (rx_status_is_set);

done:

/************************ note note *********************************/
    dev->quota -= received;
    *budget -= received;

    /* If RX ring is not full we are out of memory. */
    if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
        goto oom;

    /* we are happy/done, no more packets on ring; put us back
       to where we can start processing interrupts again */
    netif_rx_complete(dev);
    enable_rx_and_rxnobuff_ints();

    /* The last op happens after poll completion. Which means the following:
     * 1. it can race with disabling irqs in irq handler (which are done to
     * schedule polls)
     * 2. it can race with dis/enabling irqs in other poll threads
     * 3. if an irq is raised after the beginning of the outer
     * loop (marked in the code above), it will be immediately
     * triggered here.
     *
     * Summarizing: the logic may result in some redundant irqs both
     * due to races in masking and due to too late acking of already
     * processed irqs. The good news: no events are ever lost.
     */

    return 0;   /* done */

not_done:
    if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
        tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
        refill_rx_ring(dev);

    if (!received) {
        printk("received==0\n");
        received = 1;
    }
    dev->quota -= received;
    *budget -= received;
    return 1;  /* not_done */

oom:
    /* Start timer, stop polling, but do not enable rx interrupts. */
    start_poll_timer(dev);
    return 0;  /* we'll take it from here so tell core "done"*/

/************************ End note note *********************************/
}
-------------------------------------------------------------------

From above we note that:
0) rx_work_limit = dev->quota
1) refill_rx_ring() is in charge of clearing the bit for rxnobuff when
it does the work.
2) We have a done and not_done state.
3) instead of netif_rx() we call netif_receive_skb() to pass the skb.
4) we have a new way of handling oom condition
5) A new outer do { } while loop has been added. This serves the purpose of
ensuring that if a new packet has come in after we are all set and done,
and we have not exceeded our quota, we continue sending packets up.


-----------------------------------------------------------
Poll timer code will need to do the following:

a)

    if (tp->cur_rx - tp->dirty_rx > RX_RING_SIZE/2 ||
        tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
        refill_rx_ring(dev);

    /* If RX ring is not full we are still out of memory.
       Restart the timer again. Else we re-add ourselves
       to the master poll list.
    */

    if (tp->rx_buffers[tp->dirty_rx % RX_RING_SIZE].skb == NULL)
        restart_timer();
    else
        netif_rx_schedule(dev);  /* we are back on the poll list */

5) dev->close() and dev->suspend() issues
==========================================
The driver writer needn't worry about this; the top net layer takes
care of it.

6) Adding new Stats to /proc
=============================
In order to debug some of the new features, we introduce new stats
that need to be collected.
TODO: Fill this later.

APPENDIX 1: discussion on using ethernet HW FC
==============================================
Most chips with FC (flow control) only send a pause packet when they run out
of Rx buffers.
Since packets are pulled off the DMA ring by a softirq in NAPI,
if the system is slow in grabbing them and we have a high input
rate (faster than the system's capacity to remove packets), then theoretically
there will only be one rx interrupt for all packets during a given packetstorm.
Under low load, we might have a single interrupt per packet.
FC should be programmed to apply in the case when the system can't pull out
packets fast enough, i.e. send a pause only when you run out of rx buffers.
Note FC in itself is a good solution but we have found it to not be
much of a commodity feature (both in NICs and switches) and hence it falls
under the same category as using NIC based mitigation. Also, experiments
indicate that it's much harder to resolve the resource allocation
issue (aka lazy receiving that NAPI offers), and hence quantifying its
usefulness proved harder. In any case, FC works even better with NAPI but
is not necessary.


|  | 632 | APPENDIX 2: the "rotting packet" race-window avoidance scheme | 
|  | 633 | ============================================================= | 
|  | 634 |  | 
|  | 635 | There are two types of associations seen here | 
|  | 636 |  | 
|  | 637 | 1) status/int which honors level triggered IRQ | 
|  | 638 |  | 
|  | 639 | If a status bit for receive or rxnobuff is set and the corresponding | 
|  | 640 | interrupt-enable bit is not on, then no interrupts will be generated. However, | 
|  | 641 | as soon as the "interrupt-enable" bit is unmasked, an immediate interrupt is | 
|  | 642 | generated.  [assuming the status bit was not turned off]. | 
|  | 643 | Generally the concept of level triggered IRQs in association with a status and | 
|  | 644 | interrupt-enable CSR register set is used to avoid the race. | 
|  | 645 |  | 
|  | 646 | If we take the example of the tulip: | 
|  | 647 | "pending work" is indicated by the status bit (CSR5 in tulip). | 
|  | 648 | The corresponding interrupt-enable bit (CSR7 in tulip) might be turned off (but | 
|  | 649 | CSR5 will continue to be set by new packet arrivals even if | 
|  | 650 | we clear it the first time). | 
|  | 651 | Very important is the fact that if we turn the interrupt bit on while the | 
|  | 652 | status bit is set, an immediate irq is triggered. | 
|  | 653 |  | 
|  | 654 | If we cleared the rx ring and proclaimed there was "no more work | 
|  | 655 | to be done" and then went on to do a few other things, then by the time we enable | 
|  | 656 | interrupts there is a possibility that a new packet has sneaked in during | 
|  | 657 | this phase. It helps to look at the pseudo code for the tulip poll | 
|  | 658 | routine: | 
|  | 659 |  | 
|  | 660 | -------------------------- | 
|  | 661 | do { | 
|  | 662 |     ACK; | 
|  | 663 |     while (ring_is_not_empty()) { | 
|  | 664 |         work-work-work | 
|  | 665 |         if quota is exceeded: exit, no touching irq status/mask | 
|  | 666 |     } | 
|  | 667 |     /* No packets, but new can arrive while we are doing this */ | 
|  | 668 |     CSR5 := read | 
|  | 669 |     if (CSR5 is not set) { | 
|  | 670 |         /* If something arrives in this narrow window here, | 
|  | 671 |          * where the comments are ;-> irq will be generated */ | 
|  | 672 |         unmask irqs; | 
|  | 673 |         exit poll; | 
|  | 674 |     } | 
|  | 675 | } while (rx_status_is_set); | 
|  | 676 | ------------------------ | 
|  | 677 |  | 
|  | 678 | The only CSR5 bit of interest is the rx status. | 
|  | 679 | If you look at the last if statement: | 
|  | 680 | you just finished grabbing all the packets from the rx ring .. you check if | 
|  | 681 | the status bit says there are more packets just in ... it says none; you then | 
|  | 682 | enable rx interrupts again; if a new packet just came in during this check, | 
|  | 683 | we are counting on CSR5 being set in that small window of opportunity, | 
|  | 684 | so that by re-enabling interrupts we actually trigger an interrupt | 
|  | 685 | to register the new packet for processing. | 
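
The scheme above can be sketched as a small software model. This is an
illustration only, not tulip driver code: struct nic_model and all of the
function names are hypothetical, and the two booleans merely stand in for
the rx bits of CSR5 (status) and CSR7 (interrupt-enable). The model
assumes the level-triggered rule described above, i.e. an irq is raised
whenever both bits are set.

```c
#include <stdbool.h>

/* Hypothetical model of a NIC with a status bit and an enable bit. */
struct nic_model {
    bool rx_status;   /* stands in for the rx bit of CSR5 */
    bool rx_int_en;   /* stands in for the rx bit of CSR7 */
    int  irq_count;   /* interrupts delivered so far */
};

/* Level-triggered rule: an irq fires whenever both bits are set. */
static void nic_eval_irq(struct nic_model *nic)
{
    if (nic->rx_status && nic->rx_int_en) {
        nic->irq_count++;
        nic->rx_int_en = false;  /* irq handler masks rx irqs, schedules poll */
    }
}

static void nic_packet_arrives(struct nic_model *nic)
{
    nic->rx_status = true;
    nic_eval_irq(nic);
}

static void nic_unmask_rx(struct nic_model *nic)
{
    nic->rx_int_en = true;
    nic_eval_irq(nic);
}

/*
 * Tail of the poll routine: the ring is empty and the status was read as
 * clear. One late packet arrives either just before or just after irqs
 * are unmasked. Returns the number of irqs that packet caused.
 */
int late_packet_irqs(bool arrives_before_unmask)
{
    struct nic_model nic = { false, false, 0 };
    bool pending = nic.rx_status;        /* CSR5 := read, nothing pending */

    if (arrives_before_unmask)
        nic_packet_arrives(&nic);        /* masked: no irq yet, status set */
    if (!pending)
        nic_unmask_rx(&nic);             /* status set => immediate irq */
    if (!arrives_before_unmask)
        nic_packet_arrives(&nic);        /* unmasked: ordinary irq */
    return nic.irq_count;
}
```

In this model the late packet produces exactly one interrupt on either
side of the unmask, which is the property the tulip poll routine relies
on to avoid a rotting packet.
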
|  | 686 |  | 
|  | 687 | [The above description may be very verbose; if you have better wording | 
|  | 688 | that will make this more understandable, please suggest it.] | 
|  | 689 |  | 
|  | 690 | 2) non-capable hardware | 
|  | 691 |  | 
|  | 692 | These do not generally respect level triggered IRQs. Normally, | 
|  | 693 | irqs may be lost while being masked, and the only safe way to leave poll is to | 
|  | 694 | double check for new input after netif_rx_complete() is invoked | 
|  | 695 | and to re-enable polling if this new input is seen. | 
|  | 696 |  | 
|  | 697 | Sample code: | 
|  | 698 |  | 
|  | 699 | --------- | 
|  | 700 | . | 
|  | 701 | . | 
|  | 702 | restart_poll: | 
|  | 703 | while (ring_is_not_empty()) { | 
|  | 704 |     work-work-work | 
|  | 705 |     if quota is exceeded: exit, not touching irq status/mask | 
|  | 706 | } | 
|  | 707 | . | 
|  | 708 | . | 
|  | 709 | . | 
|  | 710 | enable_rx_interrupts() | 
|  | 711 | netif_rx_complete(dev); | 
|  | 712 | if (ring_has_new_packet() && netif_rx_reschedule(dev, received)) { | 
|  | 713 |     disable_rx_and_rxnobufs() | 
|  | 714 |     goto restart_poll | 
|  | 715 | } | 
|  | 716 | --------- | 
|  | 717 |  | 
|  | 718 | Basically netif_rx_complete() removes us from the poll list. Because a new | 
|  | 719 | packet might arrive in the race window and would otherwise never be caught, | 
|  | 720 | we check the ring again and, if needed, re-add ourselves to the poll list. | 
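
The effect of the double check can be sketched with a toy model. This is
a simulation under stated assumptions and not driver code: struct
edge_nic and rotting_packets() are hypothetical names, the quota is not
modelled, and the hardware is assumed to raise an irq only at the moment
a packet arrives, so an arrival while rx irqs are masked raises nothing
later. The comments map each step to the sample code above.

```c
#include <stdbool.h>

/* Hypothetical model of hardware that loses irqs while they are masked. */
struct edge_nic {
    int  ring_pkts;     /* packets waiting in the rx ring */
    bool rx_int_en;     /* rx interrupts enabled */
    bool on_poll_list;  /* device is scheduled for polling */
};

static void packet_arrives(struct edge_nic *nic)
{
    nic->ring_pkts++;
    /* The irq is raised only now; if masked, it is simply lost. */
    if (nic->rx_int_en && !nic->on_poll_list) {
        nic->rx_int_en = false;    /* irq handler masks rx irqs ... */
        nic->on_poll_list = true;  /* ... and schedules the poll */
    }
}

/*
 * Run one poll in which a packet lands after the ring was drained but
 * before rx irqs are re-enabled. Returns the number of packets left
 * rotting in the ring once poll has exited.
 */
int rotting_packets(bool double_check)
{
    struct edge_nic nic = { 3, false, true };
    bool injected = false;

restart_poll:
    while (nic.ring_pkts > 0)
        nic.ring_pkts--;           /* work-work-work */

    if (!injected) {               /* the race: arrival while still masked */
        packet_arrives(&nic);
        injected = true;
    }

    nic.rx_int_en = true;          /* enable_rx_interrupts() */
    nic.on_poll_list = false;      /* netif_rx_complete() */

    if (double_check && nic.ring_pkts > 0) {
        nic.on_poll_list = true;   /* netif_rx_reschedule() succeeded */
        nic.rx_int_en = false;     /* disable_rx_and_rxnobufs() */
        goto restart_poll;
    }
    return nic.ring_pkts;
}
```

Without the double check the late packet stays in the ring until some
later packet happens to raise an interrupt; with it, the poll restarts
and the ring is left empty.
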
|  | 721 |  | 
|  | 722 |  | 
|  | 723 |  | 
|  | 724 |  | 
|  | 725 | APPENDIX 3: Scheduling issues. | 
|  | 726 | ============================== | 
|  | 727 | As seen, NAPI moves processing to softirq level. Linux uses ksoftirqd as the | 
|  | 728 | general solution to schedule softirqs to run before the next interrupt, by putting | 
|  | 729 | them under scheduler control. This also prevents consecutive softirqs from | 
|  | 730 | monopolizing the CPU. A side effect is that the priority of ksoftirqd needs | 
|  | 731 | to be considered when running very CPU-intensive applications alongside networking, to | 
|  | 732 | get the proper softirq/user balance. Increasing the ksoftirqd priority to 0 | 
|  | 733 | (eventually more) is reported to cure problems with low network performance at high | 
|  | 734 | CPU load. | 
|  | 735 |  | 
|  | 736 | Most used processes in a GIGE router: | 
|  | 737 | USER       PID %CPU %MEM  SIZE   RSS TTY STAT START   TIME COMMAND | 
|  | 738 | root         3  0.2  0.0     0     0  ?  RWN Aug 15 602:00 (ksoftirqd_CPU0) | 
|  | 739 | root       232  0.0  7.9 41400 40884  ?  S   Aug 15  74:12 gated | 
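
One possible way to change the ksoftirqd priority from a program is the
standard setpriority() call, which does the same job as the renice
command. The sketch below is only an illustration: set_nice() and
get_nice() are hypothetical wrapper names, the pid of ksoftirqd has to be
looked up first (it is 3 in the listing above, but that is specific to
that machine), and raising the priority of another process normally
requires root.

```c
#include <errno.h>
#include <sys/resource.h>
#include <sys/types.h>

/*
 * Set the nice value of a process. Returns 0 on success, -1 on failure
 * with errno set by setpriority(). A pid of 0 means the calling process.
 */
int set_nice(pid_t pid, int nice_value)
{
    return setpriority(PRIO_PROCESS, (id_t)pid, nice_value);
}

/*
 * Read the nice value of a process into *nice_out. Returns 0 on success,
 * -1 on failure. getpriority() can legitimately return -1, so errno is
 * cleared first and checked afterwards.
 */
int get_nice(pid_t pid, int *nice_out)
{
    errno = 0;
    int value = getpriority(PRIO_PROCESS, (id_t)pid);

    if (value == -1 && errno != 0)
        return -1;
    *nice_out = value;
    return 0;
}
```

For example, set_nice(ksoftirqd_pid, 0) would correspond to the
"priority 0" setting mentioned above, where ksoftirqd_pid is whatever pid
the local system reports for the ksoftirqd thread.
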
|  | 740 |  | 
|  | 741 | -------------------------------------------------------------------- | 
|  | 742 |  | 
|  | 743 | relevant sites: | 
|  | 744 | ================== | 
|  | 745 | ftp://robur.slu.se/pub/Linux/net-development/NAPI/ | 
|  | 746 |  | 
|  | 747 |  | 
|  | 748 | -------------------------------------------------------------------- | 
|  | 749 | TODO: Write net-skeleton.c driver. | 
|  | 750 | ------------------------------------------------------------- | 
|  | 751 |  | 
|  | 752 | Authors: | 
|  | 753 | ======== | 
|  | 754 | Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> | 
|  | 755 | Jamal Hadi Salim <hadi@cyberus.ca> | 
|  | 756 | Robert Olsson <Robert.Olsson@data.slu.se> | 
|  | 757 |  | 
|  | 758 | Acknowledgements: | 
|  | 759 | ================= | 
|  | 760 | People who made this document better: | 
|  | 761 |  | 
|  | 762 | Lennert Buytenhek <buytenh@gnu.org> | 
|  | 763 | Andrew Morton  <akpm@zip.com.au> | 
|  | 764 | Manfred Spraul <manfred@colorfullife.com> | 
|  | 765 | Donald Becker <becker@scyld.com> | 
|  | 766 | Jeff Garzik <jgarzik@pobox.com> |