| Andy Grover | 0c5f9b8 | 2009-02-24 15:30:38 +0000 | [diff] [blame] | 1 |  | 
 | 2 | Overview | 
 | 3 | ======== | 
 | 4 |  | 
 | 5 | This readme tries to provide some background on the hows and whys of RDS, | 
 | 6 | and will hopefully help you find your way around the code. | 
 | 7 |  | 
 | 8 | In addition, please see this email about RDS origins: | 
 | 9 | http://oss.oracle.com/pipermail/rds-devel/2007-November/000228.html | 
 | 10 |  | 
 | 11 | RDS Architecture | 
 | 12 | ================ | 
 | 13 |  | 
 | 14 | RDS provides reliable, ordered datagram delivery by using a single | 
 | 15 | reliable connection between any two nodes in the cluster. This allows | 
 | 16 | applications to use a single socket to talk to any other process in the | 
 | 17 | cluster - so in a cluster with N processes you need N sockets, in contrast | 
 | 18 | to N*N if you use a connection-oriented socket transport like TCP. | 
 | 19 |  | 
 | 20 | RDS is not Infiniband-specific; it was designed to support different | 
 | 21 | transports.  The current implementation used to support RDS over TCP as well | 
 | 22 | as IB. Work is in progress to support RDS over iWARP, and using DCE to | 
 | 23 | guarantee no dropped packets on Ethernet, it may be possible to use RDS over | 
 | 24 | UDP in the future. | 
 | 25 |  | 
 | 26 | The high-level semantics of RDS from the application's point of view are | 
 | 27 |  | 
 | 28 |  *	Addressing | 
 | 29 |         RDS uses IPv4 addresses and 16bit port numbers to identify | 
 | 30 |         the end point of a connection. All socket operations that involve | 
 | 31 |         passing addresses between kernel and user space generally | 
 | 32 |         use a struct sockaddr_in. | 
 | 33 |  | 
 | 34 |         The fact that IPv4 addresses are used does not mean the underlying | 
 | 35 |         transport has to be IP-based. In fact, RDS over IB uses a | 
 | 36 |         reliable IB connection; the IP address is used exclusively to | 
 | 37 |         locate the remote node's GID (by ARPing for the given IP). | 
 | 38 |  | 
 | 39 |         The port space is entirely independent of UDP, TCP or any other | 
 | 40 |         protocol. | 
 | 41 |  | 
 | 42 |  *	Socket interface | 
 | 43 |         RDS sockets work *mostly* as you would expect from a BSD | 
 | 44 |         socket. The next section will cover the details. At any rate, | 
 | 45 |         all I/O is performed through the standard BSD socket API. | 
 | 46 |         Some additions like zerocopy support are implemented through | 
 | 47 |         control messages, while other extensions use the getsockopt/ | 
 | 48 |         setsockopt calls. | 
 | 49 |  | 
 | 50 |         Sockets must be bound before you can send or receive data. | 
 | 51 |         This is needed because binding also selects a transport and | 
 | 52 |         attaches it to the socket. Once bound, the transport assignment | 
 | 53 |         does not change. RDS will tolerate IPs moving around (eg in | 
 | 54 |         a active-active HA scenario), but only as long as the address | 
 | 55 |         doesn't move to a different transport. | 
 | 56 |  | 
 | 57 |  *	sysctls | 
 | 58 |         RDS supports a number of sysctls in /proc/sys/net/rds | 
 | 59 |  | 
 | 60 |  | 
 | 61 | Socket Interface | 
 | 62 | ================ | 
 | 63 |  | 
 | 64 |   AF_RDS, PF_RDS, SOL_RDS | 
 | 65 |         These constants haven't been assigned yet, because RDS isn't in | 
 | 66 |         mainline yet. Currently, the kernel module assigns some constant | 
 | 67 |         and publishes it to user space through two sysctl files | 
 | 68 |                 /proc/sys/net/rds/pf_rds | 
 | 69 |                 /proc/sys/net/rds/sol_rds | 
 | 70 |  | 
 | 71 |   fd = socket(PF_RDS, SOCK_SEQPACKET, 0); | 
 | 72 |         This creates a new, unbound RDS socket. | 
 | 73 |  | 
 | 74 |   setsockopt(SOL_SOCKET): send and receive buffer size | 
 | 75 |         RDS honors the send and receive buffer size socket options. | 
 | 76 |         You are not allowed to queue more than SO_SNDSIZE bytes to | 
 | 77 |         a socket. A message is queued when sendmsg is called, and | 
 | 78 |         it leaves the queue when the remote system acknowledges | 
 | 79 |         its arrival. | 
 | 80 |  | 
 | 81 |         The SO_RCVSIZE option controls the maximum receive queue length. | 
 | 82 |         This is a soft limit rather than a hard limit - RDS will | 
 | 83 |         continue to accept and queue incoming messages, even if that | 
 | 84 |         takes the queue length over the limit. However, it will also | 
 | 85 |         mark the port as "congested" and send a congestion update to | 
 | 86 |         the source node. The source node is supposed to throttle any | 
 | 87 |         processes sending to this congested port. | 
 | 88 |  | 
 | 89 |   bind(fd, &sockaddr_in, ...) | 
 | 90 |         This binds the socket to a local IP address and port, and a | 
 | 91 |         transport. | 
 | 92 |  | 
 | 93 |   sendmsg(fd, ...) | 
 | 94 |         Sends a message to the indicated recipient. The kernel will | 
 | 95 |         transparently establish the underlying reliable connection | 
 | 96 |         if it isn't up yet. | 
 | 97 |  | 
 | 98 |         An attempt to send a message that exceeds SO_SNDSIZE will | 
 | 99 |         return with -EMSGSIZE | 
 | 100 |  | 
 | 101 |         An attempt to send a message that would take the total number | 
 | 102 |         of queued bytes over the SO_SNDSIZE threshold will return | 
 | 103 |         EAGAIN. | 
 | 104 |  | 
 | 105 |         An attempt to send a message to a destination that is marked | 
 | 106 |         as "congested" will return ENOBUFS. | 
 | 107 |  | 
 | 108 |   recvmsg(fd, ...) | 
 | 109 |         Receives a message that was queued to this socket. The sockets | 
 | 110 |         recv queue accounting is adjusted, and if the queue length | 
 | 111 |         drops below SO_SNDSIZE, the port is marked uncongested, and | 
 | 112 |         a congestion update is sent to all peers. | 
 | 113 |  | 
 | 114 |         Applications can ask the RDS kernel module to receive | 
 | 115 |         notifications via control messages (for instance, there is a | 
 | 116 |         notification when a congestion update arrived, or when a RDMA | 
 | 117 |         operation completes). These notifications are received through | 
 | 118 |         the msg.msg_control buffer of struct msghdr. The format of the | 
 | 119 |         messages is described in manpages. | 
 | 120 |  | 
 | 121 |   poll(fd) | 
 | 122 |         RDS supports the poll interface to allow the application | 
 | 123 |         to implement async I/O. | 
 | 124 |  | 
 | 125 |         POLLIN handling is pretty straightforward. When there's an | 
 | 126 |         incoming message queued to the socket, or a pending notification, | 
 | 127 |         we signal POLLIN. | 
 | 128 |  | 
 | 129 |         POLLOUT is a little harder. Since you can essentially send | 
 | 130 |         to any destination, RDS will always signal POLLOUT as long as | 
 | 131 |         there's room on the send queue (ie the number of bytes queued | 
 | 132 |         is less than the sendbuf size). | 
 | 133 |  | 
 | 134 |         However, the kernel will refuse to accept messages to | 
 | 135 |         a destination marked congested - in this case you will loop | 
 | 136 |         forever if you rely on poll to tell you what to do. | 
 | 137 |         This isn't a trivial problem, but applications can deal with | 
 | 138 |         this - by using congestion notifications, and by checking for | 
 | 139 |         ENOBUFS errors returned by sendmsg. | 
 | 140 |  | 
 | 141 |   setsockopt(SOL_RDS, RDS_CANCEL_SENT_TO, &sockaddr_in) | 
 | 142 |         This allows the application to discard all messages queued to a | 
 | 143 |         specific destination on this particular socket. | 
 | 144 |  | 
 | 145 |         This allows the application to cancel outstanding messages if | 
 | 146 |         it detects a timeout. For instance, if it tried to send a message, | 
 | 147 |         and the remote host is unreachable, RDS will keep trying forever. | 
 | 148 |         The application may decide it's not worth it, and cancel the | 
 | 149 |         operation. In this case, it would use RDS_CANCEL_SENT_TO to | 
 | 150 |         nuke any pending messages. | 
 | 151 |  | 
 | 152 |  | 
 | 153 | RDMA for RDS | 
 | 154 | ============ | 
 | 155 |  | 
 | 156 |   see rds-rdma(7) manpage (available in rds-tools) | 
 | 157 |  | 
 | 158 |  | 
 | 159 | Congestion Notifications | 
 | 160 | ======================== | 
 | 161 |  | 
 | 162 |   see rds(7) manpage | 
 | 163 |  | 
 | 164 |  | 
 | 165 | RDS Protocol | 
 | 166 | ============ | 
 | 167 |  | 
 | 168 |   Message header | 
 | 169 |  | 
 | 170 |     The message header is a 'struct rds_header' (see rds.h): | 
 | 171 |     Fields: | 
 | 172 |       h_sequence: | 
 | 173 |           per-packet sequence number | 
 | 174 |       h_ack: | 
 | 175 |           piggybacked acknowledgment of last packet received | 
 | 176 |       h_len: | 
 | 177 |           length of data, not including header | 
 | 178 |       h_sport: | 
 | 179 |           source port | 
 | 180 |       h_dport: | 
 | 181 |           destination port | 
 | 182 |       h_flags: | 
 | 183 |           CONG_BITMAP - this is a congestion update bitmap | 
 | 184 |           ACK_REQUIRED - receiver must ack this packet | 
 | 185 |           RETRANSMITTED - packet has previously been sent | 
 | 186 |       h_credit: | 
 | 187 |           indicate to other end of connection that | 
 | 188 |           it has more credits available (i.e. there is | 
 | 189 |           more send room) | 
 | 190 |       h_padding[4]: | 
 | 191 |           unused, for future use | 
 | 192 |       h_csum: | 
 | 193 |           header checksum | 
 | 194 |       h_exthdr: | 
 | 195 |           optional data can be passed here. This is currently used for | 
 | 196 |           passing RDMA-related information. | 
 | 197 |  | 
 | 198 |   ACK and retransmit handling | 
 | 199 |  | 
 | 200 |       One might think that with reliable IB connections you wouldn't need | 
 | 201 |       to ack messages that have been received.  The problem is that IB | 
 | 202 |       hardware generates an ack message before it has DMAed the message | 
 | 203 |       into memory.  This creates a potential message loss if the HCA is | 
 | 204 |       disabled for any reason between when it sends the ack and before | 
 | 205 |       the message is DMAed and processed.  This is only a potential issue | 
 | 206 |       if another HCA is available for fail-over. | 
 | 207 |  | 
 | 208 |       Sending an ack immediately would allow the sender to free the sent | 
 | 209 |       message from their send queue quickly, but could cause excessive | 
 | 210 |       traffic to be used for acks. RDS piggybacks acks on sent data | 
 | 211 |       packets.  Ack-only packets are reduced by only allowing one to be | 
 | 212 |       in flight at a time, and by the sender only asking for acks when | 
 | 213 |       its send buffers start to fill up. All retransmissions are also | 
 | 214 |       acked. | 
 | 215 |  | 
 | 216 |   Flow Control | 
 | 217 |  | 
 | 218 |       RDS's IB transport uses a credit-based mechanism to verify that | 
 | 219 |       there is space in the peer's receive buffers for more data. This | 
 | 220 |       eliminates the need for hardware retries on the connection. | 
 | 221 |  | 
 | 222 |   Congestion | 
 | 223 |  | 
 | 224 |       Messages waiting in the receive queue on the receiving socket | 
 | 225 |       are accounted against the sockets SO_RCVBUF option value.  Only | 
 | 226 |       the payload bytes in the message are accounted for.  If the | 
 | 227 |       number of bytes queued equals or exceeds rcvbuf then the socket | 
 | 228 |       is congested.  All sends attempted to this socket's address | 
 | 229 |       should return block or return -EWOULDBLOCK. | 
 | 230 |  | 
 | 231 |       Applications are expected to be reasonably tuned such that this | 
 | 232 |       situation very rarely occurs.  An application encountering this | 
 | 233 |       "back-pressure" is considered a bug. | 
 | 234 |  | 
 | 235 |       This is implemented by having each node maintain bitmaps which | 
 | 236 |       indicate which ports on bound addresses are congested.  As the | 
 | 237 |       bitmap changes it is sent through all the connections which | 
 | 238 |       terminate in the local address of the bitmap which changed. | 
 | 239 |  | 
 | 240 |       The bitmaps are allocated as connections are brought up.  This | 
 | 241 |       avoids allocation in the interrupt handling path which queues | 
 | 242 |       sages on sockets.  The dense bitmaps let transports send the | 
 | 243 |       entire bitmap on any bitmap change reasonably efficiently.  This | 
 | 244 |       is much easier to implement than some finer-grained | 
 | 245 |       communication of per-port congestion.  The sender does a very | 
 | 246 |       inexpensive bit test to test if the port it's about to send to | 
 | 247 |       is congested or not. | 
 | 248 |  | 
 | 249 |  | 
 | 250 | RDS Transport Layer | 
 | 251 | ================== | 
 | 252 |  | 
 | 253 |   As mentioned above, RDS is not IB-specific. Its code is divided | 
 | 254 |   into a general RDS layer and a transport layer. | 
 | 255 |  | 
 | 256 |   The general layer handles the socket API, congestion handling, | 
 | 257 |   loopback, stats, usermem pinning, and the connection state machine. | 
 | 258 |  | 
 | 259 |   The transport layer handles the details of the transport. The IB | 
 | 260 |   transport, for example, handles all the queue pairs, work requests, | 
 | 261 |   CM event handlers, and other Infiniband details. | 
 | 262 |  | 
 | 263 |  | 
 | 264 | RDS Kernel Structures | 
 | 265 | ===================== | 
 | 266 |  | 
 | 267 |   struct rds_message | 
 | 268 |     aka possibly "rds_outgoing", the generic RDS layer copies data to | 
 | 269 |     be sent and sets header fields as needed, based on the socket API. | 
 | 270 |     This is then queued for the individual connection and sent by the | 
 | 271 |     connection's transport. | 
 | 272 |   struct rds_incoming | 
 | 273 |     a generic struct referring to incoming data that can be handed from | 
 | 274 |     the transport to the general code and queued by the general code | 
 | 275 |     while the socket is awoken. It is then passed back to the transport | 
 | 276 |     code to handle the actual copy-to-user. | 
 | 277 |   struct rds_socket | 
 | 278 |     per-socket information | 
 | 279 |   struct rds_connection | 
 | 280 |     per-connection information | 
 | 281 |   struct rds_transport | 
 | 282 |     pointers to transport-specific functions | 
 | 283 |   struct rds_statistics | 
 | 284 |     non-transport-specific statistics | 
 | 285 |   struct rds_cong_map | 
 | 286 |     wraps the raw congestion bitmap, contains rbnode, waitq, etc. | 
 | 287 |  | 
 | 288 | Connection management | 
 | 289 | ===================== | 
 | 290 |  | 
 | 291 |   Connections may be in UP, DOWN, CONNECTING, DISCONNECTING, and | 
 | 292 |   ERROR states. | 
 | 293 |  | 
 | 294 |   The first time an attempt is made by an RDS socket to send data to | 
 | 295 |   a node, a connection is allocated and connected. That connection is | 
 | 296 |   then maintained forever -- if there are transport errors, the | 
 | 297 |   connection will be dropped and re-established. | 
 | 298 |  | 
 | 299 |   Dropping a connection while packets are queued will cause queued or | 
 | 300 |   partially-sent datagrams to be retransmitted when the connection is | 
 | 301 |   re-established. | 
 | 302 |  | 
 | 303 |  | 
 | 304 | The send path | 
 | 305 | ============= | 
 | 306 |  | 
 | 307 |   rds_sendmsg() | 
 | 308 |     struct rds_message built from incoming data | 
 | 309 |     CMSGs parsed (e.g. RDMA ops) | 
 | 310 |     transport connection alloced and connected if not already | 
 | 311 |     rds_message placed on send queue | 
 | 312 |     send worker awoken | 
 | 313 |   rds_send_worker() | 
 | 314 |     calls rds_send_xmit() until queue is empty | 
 | 315 |   rds_send_xmit() | 
 | 316 |     transmits congestion map if one is pending | 
 | 317 |     may set ACK_REQUIRED | 
 | 318 |     calls transport to send either non-RDMA or RDMA message | 
 | 319 |     (RDMA ops never retransmitted) | 
 | 320 |   rds_ib_xmit() | 
 | 321 |     allocs work requests from send ring | 
 | 322 |     adds any new send credits available to peer (h_credits) | 
 | 323 |     maps the rds_message's sg list | 
 | 324 |     piggybacks ack | 
 | 325 |     populates work requests | 
 | 326 |     post send to connection's queue pair | 
 | 327 |  | 
 | 328 | The recv path | 
 | 329 | ============= | 
 | 330 |  | 
 | 331 |   rds_ib_recv_cq_comp_handler() | 
 | 332 |     looks at write completions | 
 | 333 |     unmaps recv buffer from device | 
 | 334 |     no errors, call rds_ib_process_recv() | 
 | 335 |     refill recv ring | 
 | 336 |   rds_ib_process_recv() | 
 | 337 |     validate header checksum | 
 | 338 |     copy header to rds_ib_incoming struct if start of a new datagram | 
 | 339 |     add to ibinc's fraglist | 
 | 340 |     if competed datagram: | 
 | 341 |       update cong map if datagram was cong update | 
 | 342 |       call rds_recv_incoming() otherwise | 
 | 343 |       note if ack is required | 
 | 344 |   rds_recv_incoming() | 
 | 345 |     drop duplicate packets | 
 | 346 |     respond to pings | 
 | 347 |     find the sock associated with this datagram | 
 | 348 |     add to sock queue | 
 | 349 |     wake up sock | 
 | 350 |     do some congestion calculations | 
 | 351 |   rds_recvmsg | 
 | 352 |     copy data into user iovec | 
 | 353 |     handle CMSGs | 
 | 354 |     return to application | 
 | 355 |  | 
 | 356 |  |