Andy Grover | 0c5f9b8 | 2009-02-24 15:30:38 +0000
Overview
========

This readme tries to provide some background on the hows and whys of RDS,
and will hopefully help you find your way around the code.

In addition, please see this email about RDS origins:
http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html

RDS Architecture
================

RDS provides reliable, ordered datagram delivery by using a single
reliable connection between any two nodes in the cluster. This allows
applications to use a single socket to talk to any other process in the
cluster - so in a cluster with N processes you need N sockets, in contrast
to N*N if you use a connection-oriented socket transport like TCP.

RDS is not Infiniband-specific; it was designed to support different
transports. The current implementation supports RDS over TCP as well
as IB. Work is in progress to support RDS over iWARP, and with DCE
guaranteeing no dropped packets on Ethernet, it may become possible to
run RDS over UDP in the future.

The high-level semantics of RDS from the application's point of view are:

*   Addressing
    RDS uses IPv4 addresses and 16-bit port numbers to identify
    the end point of a connection. All socket operations that involve
    passing addresses between kernel and user space generally
    use a struct sockaddr_in.

    The fact that IPv4 addresses are used does not mean the underlying
    transport has to be IP-based. In fact, RDS over IB uses a
    reliable IB connection; the IP address is used exclusively to
    locate the remote node's GID (by ARPing for the given IP).

    The port space is entirely independent of UDP, TCP or any other
    protocol.

*   Socket interface
    RDS sockets work *mostly* as you would expect from a BSD
    socket. The next section will cover the details. At any rate,
    all I/O is performed through the standard BSD socket API.
    Some additions like zerocopy support are implemented through
    control messages, while other extensions use the
    getsockopt/setsockopt calls.

    Sockets must be bound before you can send or receive data.
    This is needed because binding also selects a transport and
    attaches it to the socket. Once bound, the transport assignment
    does not change. RDS will tolerate IPs moving around (e.g. in
    an active-active HA scenario), but only as long as the address
    doesn't move to a different transport.

*   sysctls
    RDS supports a number of sysctls in /proc/sys/net/rds.


Socket Interface
================

AF_RDS, PF_RDS, SOL_RDS
    These constants haven't been assigned yet, because RDS isn't in
    mainline yet. Currently, the kernel module assigns some constant
    and publishes it to user space through two sysctl files:
        /proc/sys/net/rds/pf_rds
        /proc/sys/net/rds/sol_rds

fd = socket(PF_RDS, SOCK_SEQPACKET, 0);
    This creates a new, unbound RDS socket.

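Because PF_RDS is published through sysctl rather than being a fixed constant, creating a socket means reading the family number first. The sketch below is illustrative, not part of any RDS API: `read_sysctl_int` and `rds_socket_bind` are made-up helper names, and the code simply fails gracefully when the RDS module is not loaded.

```c
#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>

/* Read a small integer from a sysctl file; returns -1 if the file
 * is absent (e.g. the RDS module is not loaded). */
int read_sysctl_int(const char *path)
{
    FILE *f = fopen(path, "r");
    int val = -1;

    if (!f)
        return -1;
    if (fscanf(f, "%d", &val) != 1)
        val = -1;
    fclose(f);
    return val;
}

/* Create an RDS socket and bind it to a local address and port
 * (binding also selects the transport, as described above). */
int rds_socket_bind(const char *local_ip, unsigned short port)
{
    int pf = read_sysctl_int("/proc/sys/net/rds/pf_rds");
    struct sockaddr_in sin;
    int fd;

    if (pf < 0)
        return -1;              /* RDS not available on this system */

    fd = socket(pf, SOCK_SEQPACKET, 0);
    if (fd < 0)
        return -1;

    memset(&sin, 0, sizeof(sin));
    sin.sin_family = pf;
    sin.sin_port = htons(port);             /* RDS port space, not TCP/UDP */
    sin.sin_addr.s_addr = inet_addr(local_ip);
    if (bind(fd, (struct sockaddr *)&sin, sizeof(sin)) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```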
setsockopt(SOL_SOCKET): send and receive buffer size
    RDS honors the send and receive buffer size socket options.
    You are not allowed to queue more than SO_SNDBUF bytes to
    a socket. A message is queued when sendmsg is called, and
    it leaves the queue when the remote system acknowledges
    its arrival.

    The SO_RCVBUF option controls the maximum receive queue length.
    This is a soft limit rather than a hard limit - RDS will
    continue to accept and queue incoming messages, even if that
    takes the queue length over the limit. However, it will also
    mark the port as "congested" and send a congestion update to
    the source node. The source node is supposed to throttle any
    processes sending to this congested port.

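Since SO_SNDBUF and SO_RCVBUF are standard SOL_SOCKET options, setting them on an RDS socket looks the same as on any other socket. The helper name below is made up for this sketch; it is exercised on a UDP socket purely so it can run without the RDS module loaded.

```c
#include <sys/socket.h>

/* Set the send and receive buffer sizes that RDS accounts queued
 * messages against. Returns 0 on success, -1 on error. */
int set_buffer_sizes(int fd, int sndbuf, int rcvbuf)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sndbuf, sizeof(sndbuf)) < 0)
        return -1;
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &rcvbuf, sizeof(rcvbuf)) < 0)
        return -1;
    return 0;
}
```

Note that the kernel may round these values up (Linux doubles them to leave room for bookkeeping), so a later getsockopt can return more than was requested.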
bind(fd, &sockaddr_in, ...)
    This binds the socket to a local IP address and port, and to a
    transport.

sendmsg(fd, ...)
    Sends a message to the indicated recipient. The kernel will
    transparently establish the underlying reliable connection
    if it isn't up yet.

    An attempt to send a message that exceeds SO_SNDBUF will
    fail with EMSGSIZE.

    An attempt to send a message that would take the total number
    of queued bytes over the SO_SNDBUF threshold will fail with
    EAGAIN.

    An attempt to send a message to a destination that is marked
    as "congested" will fail with ENOBUFS.

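The three failure modes above call for different reactions from the caller, which a small classifier makes explicit. The enum and function names here are made up for this sketch:

```c
#include <errno.h>

enum send_action {
    SEND_TOO_BIG,        /* EMSGSIZE: message can never fit, don't retry */
    SEND_WAIT_SNDBUF,    /* EAGAIN: wait for acks to drain the send queue */
    SEND_WAIT_CONGESTED, /* ENOBUFS: wait for a congestion update */
    SEND_OTHER_ERROR,
};

/* Map errno from a failed sendmsg() on an RDS socket to an action. */
enum send_action classify_send_error(int err)
{
    switch (err) {
    case EMSGSIZE:
        return SEND_TOO_BIG;
    case EAGAIN:
        return SEND_WAIT_SNDBUF;
    case ENOBUFS:
        return SEND_WAIT_CONGESTED;
    default:
        return SEND_OTHER_ERROR;
    }
}
```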
recvmsg(fd, ...)
    Receives a message that was queued to this socket. The socket's
    receive queue accounting is adjusted, and if the queue length
    drops below SO_RCVBUF, the port is marked uncongested, and
    a congestion update is sent to all peers.

    Applications can ask the RDS kernel module to receive
    notifications via control messages (for instance, there is a
    notification when a congestion update arrives, or when an RDMA
    operation completes). These notifications are received through
    the msg.msg_control buffer of struct msghdr. The format of the
    messages is described in the manpages.

poll(fd)
    RDS supports the poll interface to allow the application
    to implement async I/O.

    POLLIN handling is pretty straightforward. When there's an
    incoming message queued to the socket, or a pending notification,
    we signal POLLIN.

    POLLOUT is a little harder. Since you can essentially send
    to any destination, RDS will always signal POLLOUT as long as
    there's room on the send queue (i.e. the number of bytes queued
    is less than the sendbuf size).

    However, the kernel will refuse to accept messages to
    a destination marked congested - in this case you will loop
    forever if you rely on poll to tell you what to do.
    This isn't a trivial problem, but applications can deal with
    it - by using congestion notifications, and by checking for
    ENOBUFS errors returned by sendmsg.

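A minimal sketch of the resulting pattern (helper name made up): wait for POLLOUT, attempt the send, and on ENOBUFS wait for a congestion notification instead of re-polling immediately - POLLOUT only promises send-queue room, not an uncongested destination.

```c
#include <poll.h>

/* Block until fd is writable, or until timeout_ms elapses.
 * For RDS, POLLOUT only means there is send-queue room; a send
 * to a congested destination can still fail with ENOBUFS, and
 * the caller should then wait for a congestion notification. */
int wait_writable(int fd, int timeout_ms)
{
    struct pollfd p = { .fd = fd, .events = POLLOUT };
    int r = poll(&p, 1, timeout_ms);

    if (r <= 0)
        return -1;              /* error or timeout */
    return (p.revents & POLLOUT) ? 0 : -1;
}
```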
setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in)
    This allows the application to discard all messages queued to a
    specific destination on this particular socket.

    This lets the application cancel outstanding messages if
    it detects a timeout. For instance, if it tried to send a message,
    and the remote host is unreachable, RDS will keep trying forever.
    The application may decide it's not worth it, and cancel the
    operation. In this case, it would use RDS_CANCEL_SENT_TO to
    nuke any pending messages.


RDMA for RDS
============

See the rds-rdma(7) manpage (available in rds-tools).


Congestion Notifications
========================

See the rds(7) manpage.

RDS Protocol
============

Message header

    The message header is a 'struct rds_header' (see rds.h):
    Fields:

    h_sequence:
        per-packet sequence number
    h_ack:
        piggybacked acknowledgment of last packet received
    h_len:
        length of data, not including header
    h_sport:
        source port
    h_dport:
        destination port
    h_flags:
        CONG_BITMAP - this is a congestion update bitmap
        ACK_REQUIRED - receiver must ack this packet
        RETRANSMITTED - packet has previously been sent
    h_credit:
        indicates to the other end of the connection that
        it has more credits available (i.e. there is
        more send room)
    h_padding[4]:
        unused, reserved for future use
    h_csum:
        header checksum
    h_exthdr:
        optional data can be passed here. This is currently used for
        passing RDMA-related information.

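Put together as a C struct, the header looks roughly like this. This is a sketch based on the field list above; the exact types, the 16-byte extension-header size, and the wire layout are assumptions that should be checked against rds.h, and all multi-byte fields are big-endian on the wire:

```c
#include <stdint.h>

/* Illustrative layout of the RDS message header; consult rds.h in
 * the kernel sources for the authoritative definition. */
struct rds_header_sketch {
    uint64_t h_sequence;    /* per-packet sequence number */
    uint64_t h_ack;         /* piggybacked ack of last packet received */
    uint32_t h_len;         /* payload length, header not included */
    uint16_t h_sport;       /* source port */
    uint16_t h_dport;       /* destination port */
    uint8_t  h_flags;       /* CONG_BITMAP | ACK_REQUIRED | RETRANSMITTED */
    uint8_t  h_credit;      /* new send credits granted to the peer */
    uint8_t  h_padding[4];  /* reserved for future use */
    uint16_t h_csum;        /* header checksum */
    uint8_t  h_exthdr[16];  /* optional data, e.g. RDMA information
                               (16 bytes is an assumption) */
};
```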
ACK and retransmit handling

    One might think that with reliable IB connections you wouldn't need
    to ack messages that have been received. The problem is that IB
    hardware generates an ack message before it has DMAed the message
    into memory. This creates a potential message loss if the HCA is
    disabled for any reason between when it sends the ack and before
    the message is DMAed and processed. This is only a potential issue
    if another HCA is available for fail-over.

    Sending an ack immediately would allow the sender to free the sent
    message from its send queue quickly, but could cause excessive
    traffic to be used for acks. RDS piggybacks acks on sent data
    packets. Ack-only packets are reduced by only allowing one to be
    in flight at a time, and by the sender only asking for acks when
    its send buffers start to fill up. All retransmissions are also
    acked.

Flow Control

    RDS's IB transport uses a credit-based mechanism to verify that
    there is space in the peer's receive buffers for more data. This
    eliminates the need for hardware retries on the connection.

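The mechanism can be illustrated with a toy credit counter. This is not the kernel's code: in the real implementation credits correspond to posted receive work requests, and new grants travel to the peer in the h_credit header field described above.

```c
/* Toy credit accounting: the sender consumes one credit per posted
 * send and stalls at zero; the receiver grants credits as it
 * reposts receive buffers, advertising them via h_credit. */
struct credits {
    int avail;
};

/* Returns 0 if a send may be posted, -1 if the sender must wait
 * for the peer to grant more credits. */
int try_consume_credit(struct credits *c)
{
    if (c->avail <= 0)
        return -1;
    c->avail--;
    return 0;
}

/* Called when a packet arrives carrying a nonzero h_credit. */
void grant_credits(struct credits *c, int newly_posted)
{
    c->avail += newly_posted;
}
```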
Congestion

    Messages waiting in the receive queue on the receiving socket
    are accounted against the socket's SO_RCVBUF option value. Only
    the payload bytes in the message are accounted for. If the
    number of bytes queued equals or exceeds rcvbuf then the socket
    is congested. All sends attempted to this socket's address
    should block or return -EWOULDBLOCK.

    Applications are expected to be reasonably tuned such that this
    situation very rarely occurs. An application encountering this
    "back-pressure" is considered a bug.

    This is implemented by having each node maintain bitmaps which
    indicate which ports on bound addresses are congested. As the
    bitmap changes, it is sent through all the connections which
    terminate in the local address of the bitmap which changed.

    The bitmaps are allocated as connections are brought up. This
    avoids allocation in the interrupt handling path which queues
    messages on sockets. The dense bitmaps let transports send the
    entire bitmap on any bitmap change reasonably efficiently. This
    is much easier to implement than some finer-grained
    communication of per-port congestion. The sender does a very
    inexpensive bit test to determine whether the port it's about
    to send to is congested or not.

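A sketch of such a bitmap, one bit per 16-bit port, shows why the sender's check is so cheap. The kernel's actual layout lives in struct rds_cong_map and differs from this; the names and sizes here are illustrative only.

```c
#include <stdint.h>

/* One bit per port: 65536 bits = 8 KiB per remote address. */
#define CONG_PORT_BITS 65536

typedef struct {
    uint8_t bits[CONG_PORT_BITS / 8];
} cong_map;

/* Mark a port congested (receive queue over SO_RCVBUF). */
void cong_set(cong_map *m, uint16_t port)
{
    m->bits[port >> 3] |= (uint8_t)(1u << (port & 7));
}

/* Mark a port uncongested again (queue drained by recvmsg). */
void cong_clear(cong_map *m, uint16_t port)
{
    m->bits[port >> 3] &= (uint8_t)~(1u << (port & 7));
}

/* The sender's per-send check: a single bit test. */
int cong_test(const cong_map *m, uint16_t port)
{
    return (m->bits[port >> 3] >> (port & 7)) & 1;
}
```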
RDS Transport Layer
===================

As mentioned above, RDS is not IB-specific. Its code is divided
into a general RDS layer and a transport layer.

The general layer handles the socket API, congestion handling,
loopback, stats, usermem pinning, and the connection state machine.

The transport layer handles the details of the transport. The IB
transport, for example, handles all the queue pairs, work requests,
CM event handlers, and other Infiniband details.

RDS Kernel Structures
=====================

struct rds_message
    aka possibly "rds_outgoing", the generic RDS layer copies data to
    be sent and sets header fields as needed, based on the socket API.
    This is then queued for the individual connection and sent by the
    connection's transport.
struct rds_incoming
    a generic struct referring to incoming data that can be handed from
    the transport to the general code and queued by the general code
    while the socket is awoken. It is then passed back to the transport
    code to handle the actual copy-to-user.
struct rds_socket
    per-socket information
struct rds_connection
    per-connection information
struct rds_transport
    pointers to transport-specific functions
struct rds_statistics
    non-transport-specific statistics
struct rds_cong_map
    wraps the raw congestion bitmap, contains rbnode, waitq, etc.

Connection management
=====================

Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and
ERROR states.

The first time an attempt is made by an RDS socket to send data to
a node, a connection is allocated and connected. That connection is
then maintained forever -- if there are transport errors, the
connection will be dropped and re-established.

Dropping a connection while packets are queued will cause queued or
partially-sent datagrams to be retransmitted when the connection is
re-established.


The send path
=============

rds_sendmsg()
    struct rds_message built from incoming data
    CMSGs parsed (e.g. RDMA ops)
    transport connection allocated and connected if not already
    rds_message placed on send queue
    send worker awoken
rds_send_worker()
    calls rds_send_xmit() until queue is empty
rds_send_xmit()
    transmits congestion map if one is pending
    may set ACK_REQUIRED
    calls transport to send either non-RDMA or RDMA message
    (RDMA ops never retransmitted)
rds_ib_xmit()
    allocates work requests from send ring
    adds any new send credits available to peer (h_credits)
    maps the rds_message's sg list
    piggybacks ack
    populates work requests
    posts send to connection's queue pair

The recv path
=============

rds_ib_recv_cq_comp_handler()
    looks at recv completions
    unmaps recv buffer from device
    no errors, call rds_ib_process_recv()
    refill recv ring
rds_ib_process_recv()
    validate header checksum
    copy header to rds_ib_incoming struct if start of a new datagram
    add to ibinc's fraglist
    if completed datagram:
        update cong map if datagram was cong update
        call rds_recv_incoming() otherwise
        note if ack is required
rds_recv_incoming()
    drop duplicate packets
    respond to pings
    find the sock associated with this datagram
    add to sock queue
    wake up sock
    do some congestion calculations
rds_recvmsg
    copy data into user iovec
    handle CMSGs
    return to application