One of the most important optimizations that can be made for high-speed Sockets applications is the tuning of the socket send and receive buffer sizes. The SO_SNDBUF and SO_RCVBUF socket options provide the means to adjust these buffers; the remainder of this section provides some background and discusses how to identify the optimal size of these buffers.
Although the socket buffer sizes might appear to be detached from the protocol stack, they actually modify the operation of the TCP layer for the receive path. Let’s first review flow control for TCP for the receive path.
When TCP initiates a connection, each side advertises a window. The window is the maximum amount of data that side can accept before its application consumes any of it. Therefore, if a peer advertises a window of 8 KB and receives 8 KB of data, the sender can transmit no further data until the receiving application consumes some of it. As data is received, the window shrinks to communicate how much new data can still be accepted. This advertisement occurs in each packet sent from the receiver to the sender. The maximum size of this advertised window is determined by the size of the buffer configured at the Sockets layer with the SO_RCVBUF socket option.
Conversely, the sending peer also advertises its window for receive data. This is part of flow control for the send path. TCP also maintains another window called the congestion window that operates within the advertised window. Whereas the advertised window provides an upper bound for transmit data, the congestion window provides an optimized window given detected congestion on the link. The size of the send buffer at the Sockets layer is configured with the SO_SNDBUF socket option.
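Before changing either buffer, it can help to see what the system currently uses. The following is a minimal sketch, assuming a POSIX Sockets API; the helper name get_socket_buffer_size is hypothetical and introduced only for illustration.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper (not part of the Sockets API): returns the current
 * size in bytes of the buffer selected by optname, which is expected to be
 * SO_RCVBUF or SO_SNDBUF. Returns -1 if getsockopt fails. */
int get_socket_buffer_size(int sock, int optname)
{
    int size = 0;
    socklen_t len = sizeof(size);

    if (getsockopt(sock, SOL_SOCKET, optname, &size, &len) < 0)
        return -1;

    return size;
}
```

Calling get_socket_buffer_size(sock, SO_RCVBUF) on a freshly created TCP socket reports the default receive buffer; the value differs between operating systems and system configurations.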
What is so fundamentally important about the size of the socket buffer is that it defines how much data can be sent to a peer before an acknowledgment must arrive to shift the window and allow additional data to be sent. Consider what happens if we have a socket buffer the same size as our segment (packet payload size). We’d send one packet, and then await a response (an acknowledgment) from the peer before being able to send another. This lock-step operation isn’t very efficient, considering the time that it takes for the peer to receive the packet and then return an acknowledgment. Now consider a socket buffer that is the size of two packets. We emit two packets, and then await one (or possibly two) acknowledgments from the peer. In roughly the same amount of time, two packets have been transmitted instead of one. On long-haul networks (geographically distant nodes with very large round-trip times), quite a bit of data could be sent before the peer actually receives any.
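The lock-step effect can be put into numbers with a simplified model: if at most one window of data may be outstanding per round trip, throughput cannot exceed the window size divided by the round-trip time. The sketch below assumes that model only (it ignores transmission time, delayed acknowledgments, and congestion control), and the function name window_limited_bps is hypothetical.

```c
/* Hypothetical helper: upper bound on throughput, in bits per second, when
 * at most window_bytes may be unacknowledged and an acknowledgment takes
 * rtt_seconds to come back. Simplified model: one window per round trip.
 * Returns 0.0 for a non-positive round-trip time. */
double window_limited_bps(double window_bytes, double rtt_seconds)
{
    if (rtt_seconds <= 0.0)
        return 0.0;

    return (window_bytes * 8.0) / rtt_seconds;
}
```

With an assumed 1,460-byte segment and an assumed 100-ms round-trip time, a one-segment buffer is limited to about 116,800 bits per second under this model, and a two-segment buffer to twice that.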
Another way to think about the socket buffer size is that it should be similar to the amount of data that can be in flight on the connection at once (the bit length of the pipe, measured over a full round trip). If we can tune the two to be similar in size, then we have an optimal setting for both transmit and receipt. This concept is known as the “Bandwidth-Delay Product” (BDP), which is the bandwidth of the particular link connecting the two peers times the round-trip time of the connection.
Let’s now look at an example of BDP in practice. Consider a 10-Mbps link with a round-trip time of 35 ms. The BDP is 10,000,000 bits per second times 0.035 seconds, or 350,000 bits, which is 43,750 bytes. That’s roughly 43 KB for an optimal socket buffer size to maintain a full pipe for the given connection. Given the default of 8 KB on many systems, the default would be suboptimal for this link and round-trip time.
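The same arithmetic can be written as a small function. This is a sketch that assumes the bandwidth is given in bits per second and the round-trip time in microseconds, so the calculation stays in integer arithmetic; the name bdp_bytes is hypothetical.

```c
/* Hypothetical helper: Bandwidth-Delay Product in bytes, rounded up.
 * bandwidth_bps is the link speed in bits per second and rtt_us is the
 * round-trip time in microseconds. The intermediate product is assumed
 * to fit in an unsigned long long. */
unsigned long long bdp_bytes(unsigned long long bandwidth_bps,
                             unsigned long long rtt_us)
{
    /* bits in flight = bandwidth * RTT; 8,000,000 converts
     * (bits per second * microseconds) into bytes. */
    unsigned long long bit_microseconds = bandwidth_bps * rtt_us;

    return (bit_microseconds + 7999999ULL) / 8000000ULL;
}
```

For the 10-Mbps, 35-ms example, bdp_bytes(10000000ULL, 35000ULL) yields 43,750 bytes.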
After the optimal socket buffer sizes are known, it’s relatively easy to configure them. Note that this must be done before a TCP socket goes into the connected state. For clients, that means prior to connect; for servers, it means configuring the listening socket prior to accept, because accepted sockets typically inherit the listening socket’s buffer settings. An example of the socket buffer configuration is shown in Listing 7.1.
Listing 7.1 Configuring the socket buffer sizes.
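#include <sys/socket.h>
#include <netinet/in.h>

int sock, ret;
int size;

sock = socket( AF_INET, SOCK_STREAM, 0 );
/* ... */

/* Request roughly the BDP (in bytes) for both buffers */
size = 44000;
ret = setsockopt( sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof( size ) );
ret = setsockopt( sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof( size ) );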
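The value passed to setsockopt is a request, and the system may adjust it. Linux, for example, documents that it doubles the requested value to allow for bookkeeping overhead and caps it at a system-wide limit. A cautious variant of Listing 7.1 therefore checks each return value and reads the sizes back with getsockopt. The following is a sketch along those lines; the helper name set_socket_buffers is hypothetical.

```c
#include <sys/types.h>
#include <sys/socket.h>

/* Hypothetical helper: requests the same size for the send and receive
 * buffers, then reads back what the system actually applied. Returns 0 on
 * success and -1 if any call fails. The applied sizes are stored in
 * *sndbuf and *rcvbuf and may differ from the requested value. */
int set_socket_buffers(int sock, int requested, int *sndbuf, int *rcvbuf)
{
    socklen_t len;

    if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF,
                   &requested, sizeof(requested)) < 0)
        return -1;
    if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF,
                   &requested, sizeof(requested)) < 0)
        return -1;

    len = sizeof(*sndbuf);
    if (getsockopt(sock, SOL_SOCKET, SO_SNDBUF, sndbuf, &len) < 0)
        return -1;

    len = sizeof(*rcvbuf);
    if (getsockopt(sock, SOL_SOCKET, SO_RCVBUF, rcvbuf, &len) < 0)
        return -1;

    return 0;
}
```

If the sizes read back are much smaller than the request, the system-wide limits are probably lower than the computed BDP and would need to be raised by an administrator.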
If the round-trip time (RTT) for a particular connection can’t be determined prior to connection setup, another method is to sample the link beforehand to estimate the average RTT, and then use that value together with the link speed in the BDP calculation to choose a static socket buffer size.
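As a sketch of this approach, assume that a set of RTT samples in microseconds has already been collected by some means (how the samples are gathered is outside this example). The samples are averaged and the result is combined with the link speed. Both function names are hypothetical.

```c
#include <stddef.h>

/* Hypothetical helper: arithmetic mean of count RTT samples, in
 * microseconds. Returns 0 when there are no samples. */
unsigned long long average_rtt_us(const unsigned long long *samples_us,
                                  size_t count)
{
    unsigned long long sum = 0;
    size_t i;

    if (count == 0)
        return 0;

    for (i = 0; i < count; i++)
        sum += samples_us[i];

    return sum / count;
}

/* Hypothetical helper: static buffer size in bytes (rounded up) from the
 * link speed in bits per second and the averaged RTT samples. */
unsigned long long buffer_size_from_samples(unsigned long long bandwidth_bps,
                                            const unsigned long long *samples_us,
                                            size_t count)
{
    unsigned long long rtt_us = average_rtt_us(samples_us, count);

    /* 8,000,000 converts (bits per second * microseconds) into bytes. */
    return (bandwidth_bps * rtt_us + 7999999ULL) / 8000000ULL;
}
```

For example, samples of 33, 35, and 37 ms on a 10-Mbps link average to 35 ms and give a buffer size of 43,750 bytes.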
Proposals exist to provide an auto-tuning mechanism for socket buffer sizes. Many of the proposals include ICMP-based testing mechanisms to identify the RTT, and they use this to set the socket buffer sizes for a given connection. In many cases, setting the socket buffers to known, generously sized values works just as well.