--------------------------------------------------------------------------------
+ ABSTRACT
--------------------------------------------------------------------------------

This file documents the mmap() facility available with the PACKET
socket interface on 2.4 and 2.6 kernels. This type of socket is used to
capture network traffic with utilities like tcpdump, or by any other program
that needs raw access to a network interface.

You can find the latest version of this document at:
    http://wiki.ipxwarzone.com/index.php5?title=Linux_packet_mmap

A howto can be found at:
    http://wiki.gnu-log.net (packet_mmap)

Please send your comments to
    Ulisses Alonso Camaró <uaca@i.hate.spam.alumni.uv.es>
    Johann Baudy <johann.baudy@gnu-log.net>

--------------------------------------------------------------------------------
+ Why use PACKET_MMAP
--------------------------------------------------------------------------------

In Linux 2.4/2.6, if PACKET_MMAP is not enabled, the capture process is very
inefficient. It uses very limited buffers and requires one system call
to capture each packet, or two if you want to get the packet's
timestamp (as libpcap always does).

PACKET_MMAP, on the other hand, is very efficient. It provides a
size-configurable circular buffer mapped in user space that can be used to
either send or receive packets. This way, reading packets just means waiting
for them; most of the time there is no need to issue a single system call.
For transmission, multiple packets can be sent through one system call to get
the highest bandwidth. Using a buffer shared between the kernel and the user
also has the benefit of minimizing packet copies.

It's fine to use PACKET_MMAP to improve the performance of the capture and
transmission process, but it isn't everything. At the very least, if you are
capturing at high speeds (this is relative to the CPU speed), you should check
whether the device driver of your network interface card supports some sort of
interrupt load mitigation or (even better) NAPI, and make sure it is enabled.
For transmission, check the MTU (Maximum Transmission Unit) used and
supported by the devices of your network.

--------------------------------------------------------------------------------
+ How to use mmap() to improve capture process
--------------------------------------------------------------------------------

From the user standpoint, you should use the higher level libpcap library, which
is a de facto standard, portable across nearly all operating systems
including Win32.

That said, at the time of this writing, the official libpcap 0.8.1 is out and
doesn't include support for PACKET_MMAP, and the libpcap included in your
distribution probably doesn't either.

I'm aware of two implementations of PACKET_MMAP in libpcap:

    http://wiki.ipxwarzone.com/              (by Simon Patarin, based on libpcap 0.6.2)
    http://public.lanl.gov/cpw/              (by Phil Wood, based on latest libpcap)

The rest of this document is intended for people who want to understand
the low level details or want to improve libpcap by including PACKET_MMAP
support.

--------------------------------------------------------------------------------
+ How to use mmap() directly to improve capture process
--------------------------------------------------------------------------------

From the system calls standpoint, the use of PACKET_MMAP involves
the following process:


[setup]     socket() -------> creation of the capture socket
            setsockopt() ---> allocation of the circular buffer (ring)
                              option: PACKET_RX_RING
            mmap() ---------> mapping of the allocated buffer to the
                              user process

[capture]   poll() ---------> to wait for incoming packets

[shutdown]  close() --------> destruction of the capture socket and
                              deallocation of all associated
                              resources.


Socket creation and destruction is straightforward, and is done
the same way with or without PACKET_MMAP:

    int fd;

    fd = socket(PF_PACKET, mode, htons(ETH_P_ALL));

where mode is SOCK_RAW for the raw interface, where link level
information can be captured, or SOCK_DGRAM for the cooked
interface, where link level information capture is not
supported and a link level pseudo-header is provided
by the kernel.

The destruction of the socket and all associated resources
is done by a simple call to close(fd).

Next I will describe the PACKET_MMAP settings and their constraints,
as well as the mapping of the circular buffer in the user process and
the use of this buffer.

--------------------------------------------------------------------------------
+ How to use mmap() directly to improve transmission process
--------------------------------------------------------------------------------

The transmission process is similar to capture, as shown below.

[setup]          socket() -------> creation of the transmission socket
                 setsockopt() ---> allocation of the circular buffer (ring)
                                   option: PACKET_TX_RING
                 bind() ---------> bind transmission socket with a network interface
                 mmap() ---------> mapping of the allocated buffer to the
                                   user process

[transmission]   poll() ---------> wait for free packets (optional)
                 send() ---------> send all packets that are set as ready in
                                   the ring
                                   The flag MSG_DONTWAIT can be used to return
                                   before end of transfer.

[shutdown]       close() --------> destruction of the transmission socket and
                                   deallocation of all associated resources.

Binding the socket to your network interface is mandatory (with zero copy) to
know the header size of frames used in the circular buffer.

As with capture, each frame contains two parts:

 --------------------
| struct tpacket_hdr | Header. It contains the status of
|                    | this frame
|--------------------|
| data buffer        |
.                    .  Data that will be sent over the network interface.
.                    .
 --------------------

bind() associates the socket with your network interface through the
sll_ifindex parameter of struct sockaddr_ll.

Initialization example:

    struct sockaddr_ll my_addr;
    struct ifreq s_ifr;
    ...

    strncpy(s_ifr.ifr_name, "eth0", sizeof(s_ifr.ifr_name));

    /* get interface index of eth0 (fd is the packet socket) */
    ioctl(fd, SIOCGIFINDEX, &s_ifr);

    /* fill sockaddr_ll struct to prepare binding */
    memset(&my_addr, 0, sizeof(my_addr));
    my_addr.sll_family = AF_PACKET;
    my_addr.sll_protocol = htons(ETH_P_ALL);
    my_addr.sll_ifindex = s_ifr.ifr_ifindex;

    /* bind socket to eth0 */
    bind(fd, (struct sockaddr *)&my_addr, sizeof(struct sockaddr_ll));

A complete tutorial is available at: http://wiki.gnu-log.net/

--------------------------------------------------------------------------------
+ PACKET_MMAP settings
--------------------------------------------------------------------------------


Setting up PACKET_MMAP from user level code is done with a call like

 - Capture process
     setsockopt(fd, SOL_PACKET, PACKET_RX_RING, (void *) &req, sizeof(req))
 - Transmission process
     setsockopt(fd, SOL_PACKET, PACKET_TX_RING, (void *) &req, sizeof(req))

The most significant argument in the previous call is the req parameter.
This parameter must have the following structure:

    struct tpacket_req
    {
        unsigned int    tp_block_size;  /* Minimal size of contiguous block */
        unsigned int    tp_block_nr;    /* Number of blocks */
        unsigned int    tp_frame_size;  /* Size of frame */
        unsigned int    tp_frame_nr;    /* Total number of frames */
    };

This structure is defined in /usr/include/linux/if_packet.h and establishes a
circular buffer (ring) of unswappable memory.
Mapping it into the capture process allows reading the captured frames and
related meta-information like timestamps without requiring a system call.

Frames are grouped in blocks. Each block is a physically contiguous
region of memory and holds tp_block_size/tp_frame_size frames. The total number
of blocks is tp_block_nr. Note that tp_frame_nr is a redundant parameter because

    frames_per_block = tp_block_size/tp_frame_size

Indeed, packet_set_ring checks that the following condition is true:

    frames_per_block * tp_block_nr == tp_frame_nr


Let's see an example, with the following values:

     tp_block_size= 4096
     tp_frame_size= 2048
     tp_block_nr  = 4
     tp_frame_nr  = 8

we will get the following buffer structure:

        block #1                 block #2
+---------+---------+    +---------+---------+
| frame 1 | frame 2 |    | frame 3 | frame 4 |
+---------+---------+    +---------+---------+

        block #3                 block #4
+---------+---------+    +---------+---------+
| frame 5 | frame 6 |    | frame 7 | frame 8 |
+---------+---------+    +---------+---------+

A frame can be of any size, with the only condition that it fits in a block. A
block can only hold an integer number of frames, or in other words, a frame
cannot span two blocks, so there are some details you have to take into
account when choosing the frame_size. See "Mapping and use of the circular
buffer (ring)".
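
The relationship between the four fields can be sketched as a small helper
that derives tp_frame_nr from the other three values, so that the request
satisfies the frames_per_block * tp_block_nr == tp_frame_nr condition. The
name ring_req_init is hypothetical and not part of any kernel or libc API;
only struct tpacket_req comes from <linux/if_packet.h>.

```c
#include <linux/if_packet.h>

/*
 * Hypothetical helper: fill a tpacket_req with a consistent tp_frame_nr.
 * Returns 0 on success, -1 if a frame cannot fit in a block.
 */
int ring_req_init(struct tpacket_req *req, unsigned int block_size,
                  unsigned int frame_size, unsigned int block_nr)
{
    unsigned int frames_per_block;

    if (frame_size == 0 || frame_size > block_size)
        return -1;

    /* Integer division: a frame never spans two blocks. */
    frames_per_block = block_size / frame_size;

    req->tp_block_size = block_size;
    req->tp_block_nr = block_nr;
    req->tp_frame_size = frame_size;
    req->tp_frame_nr = frames_per_block * block_nr;
    return 0;
}
```

Called with the example values (4096, 2048, 4), it sets tp_frame_nr to 8,
which matches the layout drawn above. The filled structure would then be
passed to setsockopt() with PACKET_RX_RING or PACKET_TX_RING.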


--------------------------------------------------------------------------------
+ PACKET_MMAP setting constraints
--------------------------------------------------------------------------------

In kernel versions prior to 2.4.26 (for the 2.4 branch) and 2.6.5 (2.6 branch),
the PACKET_MMAP buffer could hold only 32768 frames in a 32 bit architecture or
16384 in a 64 bit architecture. For information on these kernel versions
see http://pusa.uv.es/~ulisses/packet_mmap/packet_mmap.pre-2.4.26_2.6.5.txt

 Block size limit
------------------

As stated earlier, each block is a contiguous physical region of memory. These
memory regions are allocated with calls to the __get_free_pages() function. As
the name indicates, this function allocates pages of memory, and the second
argument is "order", a power of two number of pages, that is
(for PAGE_SIZE == 4096) order=0 ==> 4096 bytes, order=1 ==> 8192 bytes,
order=2 ==> 16384 bytes, etc. The maximum size of a
region allocated by __get_free_pages is determined by the MAX_ORDER macro. More
precisely, the limit can be calculated as:

   PAGE_SIZE << MAX_ORDER

   In an i386 architecture PAGE_SIZE is 4096 bytes
   In a 2.4/i386 kernel MAX_ORDER is 10
   In a 2.6/i386 kernel MAX_ORDER is 11

So __get_free_pages can allocate as much as 4MB or 8MB in a 2.4/2.6 kernel
respectively, with an i386 architecture.

User space programs can include /usr/include/sys/user.h and
/usr/include/linux/mmzone.h to get the PAGE_SIZE and MAX_ORDER declarations.

The page size can also be determined dynamically with the getpagesize (2)
system call.


 Block number limit
--------------------

To understand the constraints of PACKET_MMAP, we have to see the structure
used to hold the pointers to each block.

Currently, this structure is a vector called pg_vec, dynamically allocated
with kmalloc. Its size limits the number of blocks that can be allocated.

    +---+---+---+---+
    | x | x | x | x |
    +---+---+---+---+
      |   |   |   |
      |   |   |   v
      |   |   v  block #4
      |   v  block #3
      v  block #2
     block #1


kmalloc allocates any number of bytes of physically contiguous memory from
a pool of pre-determined sizes. This pool of memory is maintained by the slab
allocator, which is ultimately responsible for doing the allocation and
hence imposes the maximum amount of memory that kmalloc can allocate.

In a 2.4/2.6 kernel and the i386 architecture, the limit is 131072 bytes. The
predetermined sizes that kmalloc uses can be checked in the "size-<bytes>"
entries of /proc/slabinfo

In a 32 bit architecture, pointers are 4 bytes long, so the total number of
pointers to blocks is

     131072/4 = 32768 blocks


 PACKET_MMAP buffer size calculator
------------------------------------

Definitions:

<size-max>    : is the maximum size allocatable with kmalloc (see /proc/slabinfo)
<pointer size>: depends on the architecture -- sizeof(void *)
<page size>   : depends on the architecture -- PAGE_SIZE or getpagesize (2)
<max-order>   : is the value defined with MAX_ORDER
<frame size>  : it's an upper bound of frame's capture size (more on this later)

From these definitions we will derive

    <block number> = <size-max>/<pointer size>
    <block size> = <page size> << <max-order>

so the max buffer size is

    <block number> * <block size>

and the number of frames is

    <block number> * <block size> / <frame size>

Suppose the following parameters, which apply for a 2.6 kernel and an
i386 architecture:

    <size-max> = 131072 bytes
    <pointer size> = 4 bytes
    <page size> = 4096 bytes
    <max-order> = 11

and a value for <frame size> of 2048 bytes. These parameters will yield

    <block number> = 131072/4 = 32768 blocks
    <block size> = 4096 << 11 = 8 MiB.

and hence the buffer will have a 262144 MiB size. So it can hold
262144 MiB / 2048 bytes = 134217728 frames

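The derivation above can be written out as a short helper. This is only a
sketch: struct ring_limits and ring_limits_calc are hypothetical names, and
the caller is assumed to supply the inputs from the sources listed in the
definitions. 64-bit arithmetic is used because the maximum buffer size does
not fit in 32 bits.

```c
#include <stdint.h>

/* Hypothetical result type for the calculator described in the text. */
struct ring_limits {
    uint64_t max_blocks;       /* <block number> */
    uint64_t max_block_size;   /* <block size>, in bytes */
    uint64_t max_buffer_size;  /* <block number> * <block size>, in bytes */
    uint64_t max_frames;       /* max buffer size / <frame size> */
};

/* All sizes are in bytes; pointer_size and frame_size must be non-zero. */
struct ring_limits ring_limits_calc(uint64_t size_max, uint64_t pointer_size,
                                    uint64_t page_size, unsigned int max_order,
                                    uint64_t frame_size)
{
    struct ring_limits l;

    l.max_blocks = size_max / pointer_size;
    l.max_block_size = page_size << max_order;
    l.max_buffer_size = l.max_blocks * l.max_block_size;
    l.max_frames = l.max_buffer_size / frame_size;
    return l;
}
```

With the 2.6/i386 parameters above (131072, 4, 4096, 11, 2048) this gives
32768 blocks of 8 MiB each and 134217728 frames, the same figures as the
worked example.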

Actually, this buffer size is not possible with an i386 architecture.
Remember that the memory is allocated in kernel space, which in the case of
an i386 kernel is limited to 1GiB.

Memory allocations are not freed until the socket is closed. The memory
allocations are done with GFP_KERNEL priority. This basically means that
the allocation can wait and swap other processes' memory in order to allocate
the necessary memory, so normally limits can be reached.

 Other constraints
-------------------

If you check the source code you will see that what I draw here as a frame
is not only the link level frame. At the beginning of each frame there is a
header called struct tpacket_hdr, used in PACKET_MMAP to hold the link level
frame's meta information, like the timestamp. So what we draw here as a frame
is really the following (from include/linux/if_packet.h):

/*
   Frame structure:

   - Start. Frame must be aligned to TPACKET_ALIGNMENT=16
   - struct tpacket_hdr
   - pad to TPACKET_ALIGNMENT=16
   - struct sockaddr_ll
   - Gap, chosen so that packet data (Start+tp_net) aligns to
     TPACKET_ALIGNMENT=16
   - Start+tp_mac: [ Optional MAC header ]
   - Start+tp_net: Packet data, aligned to TPACKET_ALIGNMENT=16.
   - Pad to align to TPACKET_ALIGNMENT=16
 */


The following conditions are checked in packet_set_ring:

   tp_block_size must be a multiple of PAGE_SIZE (1)
   tp_frame_size must be greater than TPACKET_HDRLEN (obvious)
   tp_frame_size must be a multiple of TPACKET_ALIGNMENT
   tp_frame_nr   must be exactly frames_per_block*tp_block_nr

(1) Note that tp_block_size should be chosen to be a power of two or there
will be a waste of memory.
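
As a rough user space sketch, the four conditions listed above can be checked
before calling setsockopt(). The enum and the function ring_req_check are
hypothetical names; this mirrors the list in the text and is not the kernel's
actual packet_set_ring code. The page size is passed in by the caller (for
example from getpagesize()) so that the check stays deterministic.

```c
#include <linux/if_packet.h>

/* Hypothetical error codes, one per condition listed in the text. */
enum ring_req_error {
    RING_REQ_OK = 0,
    RING_REQ_BAD_BLOCK_SIZE,   /* not a multiple of the page size */
    RING_REQ_FRAME_TOO_SMALL,  /* not greater than TPACKET_HDRLEN */
    RING_REQ_FRAME_UNALIGNED,  /* not a multiple of TPACKET_ALIGNMENT */
    RING_REQ_BAD_FRAME_NR      /* != frames_per_block * tp_block_nr */
};

enum ring_req_error ring_req_check(const struct tpacket_req *req,
                                   unsigned int page_size)
{
    unsigned int frames_per_block;

    if (page_size == 0 || req->tp_block_size == 0 ||
        req->tp_block_size % page_size != 0)
        return RING_REQ_BAD_BLOCK_SIZE;
    if (req->tp_frame_size <= TPACKET_HDRLEN)
        return RING_REQ_FRAME_TOO_SMALL;
    if (req->tp_frame_size % TPACKET_ALIGNMENT != 0)
        return RING_REQ_FRAME_UNALIGNED;

    /* A frame larger than a block gives zero frames per block. */
    frames_per_block = req->tp_block_size / req->tp_frame_size;
    if (frames_per_block == 0 ||
        frames_per_block * req->tp_block_nr != req->tp_frame_nr)
        return RING_REQ_BAD_FRAME_NR;

    return RING_REQ_OK;
}
```

Checking in user space first makes it easier to tell which field is wrong,
since a failed setsockopt() reports only an errno value.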

--------------------------------------------------------------------------------
+ Mapping and use of the circular buffer (ring)
--------------------------------------------------------------------------------

The mapping of the buffer in the user process is done with the conventional
mmap function. Even though the circular buffer is composed of several
physically discontiguous blocks of memory, they are contiguous in user space,
hence just one call to mmap is needed:

    mmap(0, size, PROT_READ|PROT_WRITE, MAP_SHARED, fd, 0);

If tp_frame_size is a divisor of tp_block_size, frames will be
contiguously spaced by tp_frame_size bytes. If not, there will be a gap
after every tp_block_size/tp_frame_size frames. This is because a frame
cannot span two blocks.
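
The placement rule above can be sketched as an offset calculation into the
mapped area. The function name ring_frame_offset is hypothetical, frames are
numbered from 0 here (the diagrams in this document count from 1), and
frame_size is assumed to be non-zero and no larger than block_size.

```c
#include <stddef.h>

/*
 * Byte offset of frame number frame_index (0-based) from the start of the
 * mmap()ed ring. Frames never span blocks, so any space left at the end of
 * a block is skipped.
 */
size_t ring_frame_offset(size_t block_size, size_t frame_size,
                         size_t frame_index)
{
    size_t frames_per_block = block_size / frame_size;
    size_t block = frame_index / frames_per_block;
    size_t slot = frame_index % frames_per_block;

    return block * block_size + slot * frame_size;
}
```

For tp_block_size = 4096 and tp_frame_size = 2048, frame index 3 is at offset
6144. With a 1536 byte frame size only two frames fit per block, so frame
index 2 starts at offset 4096 and the last 1024 bytes of each block are unused.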

At the beginning of each frame there is a status field (see
struct tpacket_hdr). If this field is 0, the frame is ready
to be used by the kernel. If not, there is a frame the user can read
and the following flags apply:

++ Capture process
     from include/linux/if_packet.h

     #define TP_STATUS_COPY          2
     #define TP_STATUS_LOSING        4
     #define TP_STATUS_CSUMNOTREADY  8


TP_STATUS_COPY        : This flag indicates that the frame (and associated
                        meta information) has been truncated because it's
                        larger than tp_frame_size. This packet can be
                        read entirely with recvfrom().

                        In order to make this work it must be
                        enabled previously with setsockopt() and
                        the PACKET_COPY_THRESH option.

                        The number of frames that can be buffered to
                        be read with recvfrom is limited like a normal socket.
                        See the SO_RCVBUF option in the socket (7) man page.

TP_STATUS_LOSING      : indicates there were packet drops since the last time
                        statistics were checked with getsockopt() and
                        the PACKET_STATISTICS option.

TP_STATUS_CSUMNOTREADY: currently it's used for outgoing IP packets whose
                        checksum will be computed in hardware. So while
                        reading the packet we should not try to check the
                        checksum.

For convenience there are also the following defines:

     #define TP_STATUS_KERNEL        0
     #define TP_STATUS_USER          1

The kernel initializes all frames to TP_STATUS_KERNEL. When the kernel
receives a packet, it puts it in the buffer and updates the status with
at least the TP_STATUS_USER flag. Then the user can read the packet.
Once the packet is read, the user must zero the status field so the kernel
can use that frame buffer again.

The user can use poll (any other variant should apply too) to check if new
packets are in the ring:

    struct pollfd pfd;

    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLIN|POLLRDNORM|POLLERR;

    if (status == TP_STATUS_KERNEL)
        retval = poll(&pfd, 1, timeout);

First checking the status value and then polling for frames does not cause
a race condition.
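
The status handshake described above can be sketched as a loop over the
mapped ring. Everything named here (frame_handler, sum_lengths,
ring_rx_drain) is hypothetical. The sketch assumes a valid tpacket_req, that
the caller wants the tp_snaplen bytes starting at tp_mac, and that a full
memory barrier (the GCC/Clang builtin __sync_synchronize()) is an acceptable
precaution around the status accesses; the barrier is an assumption of this
sketch, not something this document prescribes.

```c
#include <stddef.h>
#include <linux/if_packet.h>

/* Hypothetical callback type invoked once per captured frame. */
typedef void (*frame_handler)(const unsigned char *data, unsigned int len,
                              void *ctx);

/* Example handler: adds up captured lengths in an unsigned int at ctx. */
void sum_lengths(const unsigned char *data, unsigned int len, void *ctx)
{
    (void)data;
    *(unsigned int *)ctx += len;
}

/*
 * Consume every frame currently owned by user space, starting at *next.
 * Returns the number of frames handed to the handler. When it returns 0,
 * the caller would normally poll() before trying again.
 */
unsigned int ring_rx_drain(unsigned char *ring, const struct tpacket_req *req,
                           unsigned int *next, frame_handler handle, void *ctx)
{
    unsigned int frames_per_block = req->tp_block_size / req->tp_frame_size;
    unsigned int consumed = 0;

    while (consumed < req->tp_frame_nr) {
        size_t off = (size_t)(*next / frames_per_block) * req->tp_block_size +
                     (size_t)(*next % frames_per_block) * req->tp_frame_size;
        volatile struct tpacket_hdr *hdr =
            (volatile struct tpacket_hdr *)(ring + off);

        if (!(hdr->tp_status & TP_STATUS_USER))
            break;                      /* frame still belongs to the kernel */

        __sync_synchronize();           /* read status before frame contents */
        handle(ring + off + hdr->tp_mac, hdr->tp_snaplen, ctx);

        __sync_synchronize();           /* finish reading before release */
        hdr->tp_status = TP_STATUS_KERNEL;

        *next = (*next + 1) % req->tp_frame_nr;
        consumed++;
    }
    return consumed;
}
```

A capture loop would keep next across calls, for example calling
ring_rx_drain(ring, &req, &next, sum_lengths, &total) after each poll()
wakeup. Flags such as TP_STATUS_LOSING could be inspected inside the loop
before the status is reset.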


++ Transmission process
These defines are also used for transmission:

     #define TP_STATUS_AVAILABLE        0 // Frame is available
     #define TP_STATUS_SEND_REQUEST     1 // Frame will be sent on next send()
     #define TP_STATUS_SENDING          2 // Frame is currently in transmission
     #define TP_STATUS_WRONG_FORMAT     4 // Frame format is not correct

First, the kernel initializes all frames to TP_STATUS_AVAILABLE. To send a
packet, the user fills the data buffer of an available frame, sets tp_len to
the current data buffer size and sets its status field to
TP_STATUS_SEND_REQUEST. This can be done on multiple frames. Once the user is
ready to transmit, it calls send(). Then all buffers with status equal to
TP_STATUS_SEND_REQUEST are forwarded to the network device. The kernel sets
the status of each frame being sent to TP_STATUS_SENDING until the end of the
transfer. At the end of each transfer, the buffer status returns to
TP_STATUS_AVAILABLE.

    header->tp_len = in_i_size;
    header->tp_status = TP_STATUS_SEND_REQUEST;
    retval = send(fd, NULL, 0, 0);

The user can also use poll() to wait for a buffer to become available
(that is, while status == TP_STATUS_SENDING):

    struct pollfd pfd;
    pfd.fd = fd;
    pfd.revents = 0;
    pfd.events = POLLOUT;
    retval = poll(&pfd, 1, timeout);
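
The "fill, set tp_len, set status" sequence can be sketched as a helper that
queues one payload into one TX frame. The function name ring_tx_queue is
hypothetical. The data offset used here, TPACKET_HDRLEN minus
sizeof(struct sockaddr_ll), is an assumption about the TX frame layout and
should be checked against the kernel version in use. The memory barrier is
likewise a precaution added by this sketch.

```c
#include <string.h>
#include <linux/if_packet.h>

/*
 * Copy a payload into one TX ring frame and mark it for sending.
 * Returns 0 on success, -1 if the frame is not available or the payload
 * does not fit. A later send(fd, NULL, 0, 0) would transmit all frames
 * marked this way.
 */
int ring_tx_queue(unsigned char *frame, unsigned int frame_size,
                  const void *payload, unsigned int len)
{
    volatile struct tpacket_hdr *hdr = (volatile struct tpacket_hdr *)frame;
    /* ASSUMPTION: packet data starts right after the aligned header. */
    unsigned int data_off =
        (unsigned int)(TPACKET_HDRLEN - sizeof(struct sockaddr_ll));

    if (hdr->tp_status != TP_STATUS_AVAILABLE)
        return -1;              /* still queued, sending, or in error */
    if (frame_size < data_off || len > frame_size - data_off)
        return -1;              /* payload does not fit in this frame */

    memcpy(frame + data_off, payload, len);
    hdr->tp_len = len;

    __sync_synchronize();       /* publish data and length before status */
    hdr->tp_status = TP_STATUS_SEND_REQUEST;
    return 0;
}
```

Several frames can be queued this way before a single send() call, which is
the batching described above. A return value of -1 on a frame that is still
in TP_STATUS_SENDING is the case where poll() with POLLOUT is useful.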

--------------------------------------------------------------------------------
+ PACKET_TIMESTAMP
--------------------------------------------------------------------------------

The PACKET_TIMESTAMP setting determines the source of the timestamp in
the packet meta information.  If your NIC is capable of timestamping
packets in hardware, you can request those hardware timestamps to be used.
Note: you may need to enable the generation of hardware timestamps with
SIOCSHWTSTAMP.

PACKET_TIMESTAMP accepts the same integer bit field as
SO_TIMESTAMPING.  However, only the SOF_TIMESTAMPING_SYS_HARDWARE
and SOF_TIMESTAMPING_RAW_HARDWARE values are recognized by
PACKET_TIMESTAMP.  SOF_TIMESTAMPING_SYS_HARDWARE takes precedence over
SOF_TIMESTAMPING_RAW_HARDWARE if both bits are set.

    int req = 0;
    req |= SOF_TIMESTAMPING_SYS_HARDWARE;
    setsockopt(fd, SOL_PACKET, PACKET_TIMESTAMP, (void *) &req, sizeof(req));

If PACKET_TIMESTAMP is not set, a software timestamp generated inside
the networking stack is used (the behavior before this setting was added).

See include/linux/net_tstamp.h and Documentation/networking/timestamping
for more information on hardware timestamps.

--------------------------------------------------------------------------------
+ THANKS
--------------------------------------------------------------------------------

   Jesse Brandeburg, for fixing my grammatical/spelling errors
