Socket Programming - Advanced Topics
Introduction
This article covers some advanced topics about socket programming.
setsockopt
#include <sys/socket.h>
int
getsockopt(int socket, int level, int option_name,
    void *restrict option_value, socklen_t *restrict option_len);
int
setsockopt(int socket, int level, int option_name,
    const void *option_value, socklen_t option_len);
getsockopt() and setsockopt() manipulate the options associated with a socket. Options may exist at multiple protocol levels; they are always present at the uppermost “socket” level.
Table of Contents
- Introduction
- setsockopt
- SOL_SOCKET
    - SO_DEBUG enables recording of debugging information
    - SO_REUSEADDR enables local address reuse
    - SO_REUSEPORT enables duplicate address and port bindings
    - SO_KEEPALIVE enables keep connections alive
    - SO_DONTROUTE enables routing bypass for outgoing messages
    - SO_LINGER linger on close if data present
    - SO_BROADCAST enables permission to transmit broadcast messages
    - SO_OOBINLINE enables reception of out-of-band data in band
    - SO_SNDBUF set buffer size for output
    - SO_RCVBUF set buffer size for input
    - SO_SNDLOWAT set minimum count for output
    - SO_RCVLOWAT set minimum count for input
    - SO_SNDTIMEO set timeout value for output
    - SO_RCVTIMEO set timeout value for input
    - SO_TYPE get the type of the socket (get only)
    - SO_TIMESTAMP enable or disable the receiving of the SO_TIMESTAMP control message
    - SO_ERROR get and clear error on the socket (get only)
    - SO_NOSIGPIPE do not generate SIGPIPE, instead return EPIPE
    - SO_NREAD number of bytes to be read (get only)
    - SO_NWRITE number of bytes written not yet sent by the protocol (get only)
    - SO_LINGER_SEC linger on close if data present with timeout in seconds
    - Options not in the manual
    - /proc/sys/net/core/ interfaces
- IPPROTO_IP
- IPPROTO_IPV6
- IPPROTO_TCP
- TCP FLAGS
- Enable Non-Blocking Socket Option
- EINTR
- SIGPIPE
- OOB Data
- ACK
- Glossary
- References
SOL_SOCKET
SO_DEBUG        enables recording of debugging information
SO_REUSEADDR    enables local address reuse
Indicates that the rules used in validating addresses supplied in a bind(2) call should allow reuse of local addresses. For AF_INET sockets this means that a socket may bind, except when there is an active listening socket bound to the address. When the listening socket is bound to INADDR_ANY with a specific port then it is not possible to bind to this port for any local address. Argument is an integer boolean flag.
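As a minimal sketch of how this option is typically applied, the helper below sets SO_REUSEADDR before calling bind(2), since that is the point where the address rules are checked. The name make_listener and the choice to bind to the loopback address are assumptions made for illustration; they are not part of any library.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

/* Hypothetical helper: create a TCP listener on 127.0.0.1:port with
 * SO_REUSEADDR enabled. Returns the socket, or -1 on error. */
int make_listener(uint16_t port)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return -1;

    /* Set the flag before bind(2); setting it afterwards has no effect
     * on the address check that bind(2) already performed. */
    int on = 1;
    if (setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)) < 0) {
        close(fd);
        return -1;
    }

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_port = htons(port);               /* 0 lets the kernel pick */
    addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);

    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0 ||
        listen(fd, SOMAXCONN) < 0) {
        close(fd);
        return -1;
    }
    return fd;
}
```

A server would call make_listener(8080) (or any port it owns) at startup and then accept(2) on the returned descriptor.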
SO_REUSEPORT    enables duplicate address and port bindings
SO_KEEPALIVE    enables keep connections alive
Enable sending of keep-alive messages on connection-oriented sockets. Expects an integer boolean flag.
SO_DONTROUTE    enables routing bypass for outgoing messages
Don’t send via a gateway, only send to directly connected hosts.
The same effect can be achieved by setting the MSG_DONTROUTE flag on a socket send(2) operation.
Expects an integer boolean flag.
SO_LINGER       linger on close if data present
When enabled, a close(2) or shutdown(2) will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise, the call returns immediately and the closing is done in the background. When the socket is closed as part of exit(2), it always lingers in the background.
The typical reason to set a SO_LINGER timeout of zero is to avoid large numbers of connections sitting in the TIME_WAIT state, tying up all the available resources on a server.
When a TCP connection is closed cleanly, the end that initiated the close (“active close”) ends up with the connection sitting in TIME_WAIT for several minutes. So if your protocol is one where the server initiates the connection close, and involves very large numbers of short-lived connections, then it might be susceptible to this problem.
This isn’t a good idea, though - TIME_WAIT exists for a reason (to ensure that stray packets from old connections don’t interfere with new connections). It’s a better idea to redesign your protocol to one where the client initiates the connection close, if possible.
Moreover, the purpose of SO_LINGER is very, very specific and only a 
tiny minority of socket applications actually need it. Unless you are 
extremely familiar with the intricacies of TCP and the BSD socket 
API, you could very easily end up using SO_LINGER in a way for which 
it was not designed.
The effect of SO_LINGER depends on the values in the linger structure (the fourth parameter passed to setsockopt()):
/* /usr/include/sys/socket.h -> /usr/include/bits/socket.h */
/* structure used to manipulate the SO_LINGER option */
struct linger {
    int l_onoff;      /* non-zero to linger on close */
    int l_linger;     /* time to linger, in seconds */
};
which has 3 cases:
- linger->l_onoff == 0: linger->l_linger has no meaning. This is the default. On close(), the underlying stack attempts to gracefully shut down the connection after ensuring all unsent data is sent. In the case of connection-oriented protocols such as TCP, the stack also ensures that sent data is acknowledged by the peer. The stack will perform the above-mentioned graceful shutdown in the background (after the call to close() returns), regardless of whether the socket is blocking or non-blocking.
- linger->l_onoff != 0 && linger->l_linger == 0: close() returns immediately. The underlying stack discards any unsent data, and, in the case of connection-oriented protocols such as TCP, sends a RST (reset) to the peer (this is termed a hard or abortive close). All subsequent attempts by the peer’s application to read()/recv() data will result in an ECONNRESET.
- linger->l_onoff != 0 && linger->l_linger != 0: close() will either block (if a blocking socket) or fail with EWOULDBLOCK (if non-blocking) until a graceful shutdown completes or the time specified in linger->l_linger elapses (time-out). Upon time-out the stack behaves as in case 2 above.
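The three cases map directly onto the two fields of struct linger. The following is a small sketch, assuming a hypothetical helper named set_linger (not a library function), that selects a case by filling in the structure and passing it to setsockopt().

```c
#include <sys/socket.h>

/* Hypothetical helper selecting one of the three cases:
 *   set_linger(fd, 0, 0)  -> case 1: default, graceful close in background
 *   set_linger(fd, 1, 0)  -> case 2: abortive close (RST for TCP)
 *   set_linger(fd, 1, n)  -> case 3: close() waits up to n seconds
 * Returns 0 on success, -1 on error (errno is set by setsockopt). */
int set_linger(int fd, int onoff, int seconds)
{
    struct linger lg;
    lg.l_onoff = onoff;     /* non-zero enables lingering */
    lg.l_linger = seconds;  /* ignored when l_onoff is zero */
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}
```

Because behaviour differs between platforms, it is worth checking the return value and testing the close() behaviour on each target system rather than relying on this sketch alone.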
Portability Note
- Some implementations of the BSD socket API do not implement SO_LINGER at all. On such systems, applying SO_LINGER either fails with EINVAL or is (silently) ignored. Having SO_LINGER defined in the headers is no guarantee that SO_LINGER is actually implemented.
- Since the BSD documentation on SO_LINGER is sparse and inadequate, it is not surprising to find the various implementations interpreting the effect of SO_LINGER differently. For instance, the effect of SO_LINGER on non-blocking sockets is not mentioned at all in BSD documentation, and is consequently treated differently on different platforms. Taking case 3 for example: some implementations behave as described above. With others, close() on a non-blocking socket succeeds immediately, leaving the rest to a background process. Others ignore non-blocking’ness and behave as if the socket were blocking. Yet others behave as if SO_LINGER wasn’t in effect (as if case 1, the default, was in effect), or ignore linger->l_linger (case 3 is treated as case 2). Given the lack of adequate documentation, such differences are not (by themselves) indicative of an “incomplete” or “broken” implementation. They are simply different, not incorrect.
- Some implementations of the BSD socket API do not implement SO_LINGER completely. On such systems, the value of linger->l_linger is ignored (always treated as if it were zero).
Technical/Developer note: SO_LINGER does (should) not affect a stack’s implementation of TIME_WAIT. In any event, SO_LINGER is not the way to get around TIME_WAIT. If an application expects to open and close many TCP sockets in quick succession, it should be written to use only a fixed number and/or range of ports, and apply SO_REUSEPORT to sockets that use those ports.
Related Note
- SO_DONTLINGER
    - This socket option has the exact opposite meaning of SO_LINGER, and the two are treated (after inverting the value of linger->l_onoff) as equivalent. In other words, SO_LINGER with a zero linger->l_onoff is the same as SO_DONTLINGER with a non-zero linger->l_onoff, and vice versa.
Using SO_LINGER with a timeout of 0 should really be a last resort.
SO_BROADCAST    enables permission to transmit broadcast messages
Set or get the broadcast flag. When enabled, datagram sockets are allowed to send packets to a broadcast address. This option has no effect on stream-oriented sockets.
SO_OOBINLINE    enables reception of out-of-band data in band
If this option is enabled, out-of-band data is directly placed into the receive data stream. Otherwise out-of-band data is only passed when the MSG_OOB flag is set during receiving.
SO_SNDBUF       set buffer size for output
Sets or gets the maximum socket send buffer in bytes.
The kernel doubles this value (to allow space for bookkeeping overhead) when it is set using setsockopt(2), and this doubled value is returned by getsockopt(2).
The default value is set by the /proc/sys/net/core/wmem_default file and the maximum allowed value is set by the /proc/sys/net/core/wmem_max file.
The minimum (doubled) value for this option is 2048.
NOTES
Linux assumes that half of the send/receive buffer is used for internal kernel structures; thus the sysctls are twice what can be observed on the wire.
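The doubling can be observed by setting the option and reading it straight back. The sketch below assumes a hypothetical helper named set_and_get_sndbuf; the doubled result is what Linux is expected to report, while other systems may return the requested value unchanged.

```c
#include <sys/socket.h>

/* Hypothetical helper: request a send buffer of `requested` bytes and
 * return the size the kernel reports afterwards, or -1 on error.
 * On Linux the reported value is expected to be twice the request
 * (subject to the wmem_max cap). */
int set_and_get_sndbuf(int fd, int requested)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &requested, sizeof(requested)) < 0)
        return -1;

    int actual = 0;
    socklen_t len = sizeof(actual);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &actual, &len) < 0)
        return -1;
    return actual;
}
```

Calling set_and_get_sndbuf(fd, 8192) on a freshly created socket and printing the result is a quick way to see how a given kernel treats the request.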
SO_RCVBUF       set buffer size for input
SO_SNDLOWAT     set minimum count for output
Specify the minimum number of bytes in the buffer until the socket layer will pass the data to the protocol (SO_SNDLOWAT) or the user on receiving (SO_RCVLOWAT).
These two values are initialized to 1.
SO_SNDLOWAT is not changeable on Linux (setsockopt(2) fails with the error ENOPROTOOPT).
SO_RCVLOWAT is changeable only since Linux 2.4.
Before Linux 2.6.28, the select(2) and poll(2) system calls did not respect the SO_RCVLOWAT setting, and marked a socket readable when even a single byte of data was available.
A subsequent read from the socket would then block until SO_RCVLOWAT bytes were available. Since Linux 2.6.28, these calls report a socket as readable only if at least SO_RCVLOWAT bytes are available.
SO_RCVLOWAT     set minimum count for input
SO_SNDTIMEO     set timeout value for output
Specify the receiving or sending timeouts until reporting an error.
The argument is a struct timeval.
- If an input or output function blocks for this period of time, and data has been sent or received, the return value of that function will be the amount of data transferred;
- if no data has been transferred and the timeout has been reached then -1 is returned with errno set to EAGAIN or EWOULDBLOCK, or EINPROGRESS (for connect(2)), just as if the socket was specified to be nonblocking.
- If the timeout is set to zero (the default) then the operation will never time out.
Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on.
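A receive timeout can be sketched as follows. The helper names set_recv_timeout_ms and recv_or_timeout, and the -2 return code used to mark a timeout, are assumptions made for this example only.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>

/* Hypothetical helper: make blocking receives on fd give up after
 * `ms` milliseconds. Returns 0 on success, -1 on error. */
int set_recv_timeout_ms(int fd, long ms)
{
    struct timeval tv;
    tv.tv_sec = ms / 1000;
    tv.tv_usec = (ms % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/* Hypothetical helper: returns the byte count from recv(2), 0 on EOF,
 * -1 on error, or -2 if the receive timeout expired with no data. */
ssize_t recv_or_timeout(int fd, void *buf, size_t len)
{
    ssize_t n = recv(fd, buf, len, 0);
    if (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;
    return n;
}
```

The same pattern applies to SO_SNDTIMEO with send(2). Since the timeout does not affect select(2) or poll(2), code built around those calls needs to pass its own timeout to them.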
SO_RCVTIMEO     set timeout value for input
SO_TYPE         get the type of the socket (get only)
SO_TIMESTAMP    enable or disable the receiving of the SO_TIMESTAMP control message
The timestamp control message is sent with level SOL_SOCKET and the cmsg_data field is a struct timeval indicating the reception time of the last packet passed to the user in this call. See cmsg(3) for details on control messages.
SO_ERROR        get and clear error on the socket (get only)
SO_NOSIGPIPE    do not generate SIGPIPE, instead return EPIPE
SO_NREAD        number of bytes to be read (get only)
SO_NWRITE       number of bytes written not yet sent by the protocol (get only)
SO_LINGER_SEC   linger on close if data present with timeout in seconds
Options not in the manual
SO_EXCLUSIVEADDRUSE
SO_USELOOPBACK
SO_BSDCOMPAT
/proc/sys/net/core/ interfaces
- bpf_jit_enable
- busy_poll
- busy_read
- default_qdisc
- dev_weight
- message_burst
- message_cost
    - configure the token bucket filter used to load limit warning messages caused by external network events.
 
- netdev_budget
- netdev_max_backlog
    - Maximum number of packets in the global input queue.
 
- netdev_rss_key
- netdev_tstamp_prequeue
- optmem_max
    - Maximum length of ancillary data and user control data like the iovecs per socket.
 
- rmem_default
    - contains the default setting in bytes of the socket receive buffer.
 
- rmem_max
    - contains the maximum socket receive buffer size in bytes which a user may set by using the SO_RCVBUF socket option.
 
- rps_sock_flow_entries
- somaxconn
- warnings
- wmem_default
    - contains the default setting in bytes of the socket send buffer.
 
- wmem_max
    - contains the maximum socket send buffer size in bytes which a user may set by using the SO_SNDBUF socket option.
 
- xfrm_acq_expires
- xfrm_aevent_etime
- xfrm_aevent_rseqth
- xfrm_larval_drop
IPPROTO_IP
IP_HDRINCL
IP_OPTIONS
IP_RECVDSTADDR
IP_RECVIF
IP_TOS
- IPTOS_LOWDELAY
- IPTOS_THROUGHPUT
- IPTOS_RELIABILITY
- IPTOS_LOWCOST
IP_TTL
ICMP6_FILTER
IPPROTO_IPV6
IPV6_ADDRFORM
IPV6_CHECKSUM
IPV6_DSTOPTS
IPV6_HOPLIMIT
IPV6_HOPOPTS
IPV6_NEXTHOP
IPV6_PKTINFO
IPV6_PKTOPTIONS
IPV6_RTHDR
IPV6_UNICAST_HOPS
IPPROTO_TCP
TCP_MAXSEG
TCP_NODELAY
TCP_TW_REUSE
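Allows TIME_WAIT sockets to be reused for new connections when it is safe from a protocol viewpoint. Default value is 0. It should not be changed without the advice or request of technical experts. Note that on Linux this is a system-wide sysctl (net.ipv4.tcp_tw_reuse), not a per-socket option set with setsockopt().
It is generally a safer alternative to TCP_TW_RECYCLE.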
The TCP_TW_REUSE setting is particularly useful in environments where numerous short connections are open and left in TIME_WAIT state, such as web servers.
Reusing the sockets can be very effective in reducing server load.
TCP_TW_RECYCLE
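Enables fast recycling of TIME_WAIT sockets. Default value is 0. On Linux this was the sysctl net.ipv4.tcp_tw_recycle, which was removed in Linux 4.12.
It is known to cause some issues with hoststated (load balancing and failover) if enabled, and should be used with caution.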
TCP_NODELAY
First off, be sure you really want to use it in the first place.
It will disable the Nagle algorithm, which will cause network traffic to increase, with smaller than needed packets wasting bandwidth.
Also, from what I have been able to tell, the speed increase is very small, so you should probably do it without TCP_NODELAY first, and only turn it on if there is a problem.
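#include <stdio.h>        /* perror */
#include <sys/socket.h>   /* setsockopt */
#include <netinet/in.h>   /* IPPROTO_TCP */
#include <netinet/tcp.h>  /* TCP_NODELAY */

/* sock is an already-created TCP socket */
int optval = 1;
int result = setsockopt(sock,     /* socket affected */
                IPPROTO_TCP,      /* set option at TCP level */
                TCP_NODELAY,      /* name of option */
                (char *) &optval, /* the cast is historical cruft */
                sizeof(int));     /* length of option value */
if (result < 0)
    perror("setsockopt(TCP_NODELAY)");  /* report why the call failed */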
TCP_NODELAY is for a specific purpose: to disable the Nagle buffering algorithm.
It should only be set for applications that send frequent small bursts of information without getting an immediate response, where timely delivery of data is required (the canonical example is mouse movements).
TCP_CORK
TCP_DEFER_ACCEPT
TCP_QUICKACK
TCP FLAGS
- SYN
- FIN
- ACK
- PSH
    - Data is not empty
- RST
- URG
    - Urgent pointer is valid
Enable Non-Blocking Socket Option
int flags = fcntl(sockfd, F_GETFL, 0);      /* needs <fcntl.h> */
fcntl(sockfd, F_SETFL, flags | O_NONBLOCK);
or
int one = 1;                                /* needs <sys/ioctl.h> */
ioctl(sockfd, FIONBIO, (char *)&one);
EINTR
This isn’t really so much an error as an exit condition. It means that the call was interrupted by a signal. Any call that might block should be wrapped in a loop that checks for EINTR.
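A retry loop of this kind might look like the sketch below. The wrapper name read_retry is hypothetical; the same shape works for write(2), recv(2), accept(2) and other calls that can block.

```c
#include <errno.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wrapper: read(2) that retries when interrupted by a signal.
 * Returns whatever read(2) returns for any outcome other than EINTR. */
ssize_t read_retry(int fd, void *buf, size_t len)
{
    ssize_t n;
    do {
        n = read(fd, buf, len);
    } while (n < 0 && errno == EINTR);  /* interrupted: just try again */
    return n;
}
```

The loop only hides EINTR; every other error, and a return of 0 at end of file, still reaches the caller unchanged.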
SIGPIPE
With TCP you get SIGPIPE if your end of the connection has received an RST from the other end. What this also means is that if you were using select instead of write, the select would have indicated the socket as being readable, since the RST is there for you to read (read will return an error with errno set to ECONNRESET).
Basically an RST is TCP’s response to some packet that it doesn’t expect and has no other way of dealing with. A common case is when the peer closes the connection (sending you a FIN) but you ignore it because you’re writing and not reading. (You should be using select.) So you write to a connection that has been closed by the other end and the other end’s TCP responds with an RST.
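One common way to keep a write to a closed connection from killing the process is to ignore SIGPIPE and handle the error return instead. The sketch below assumes two hypothetical helpers, ignore_sigpipe and write_detect_closed; ignoring the signal affects the whole process, which may or may not suit a given program.

```c
#include <errno.h>
#include <signal.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical helper: ignore SIGPIPE process-wide so that writing to a
 * connection the peer has closed fails with EPIPE instead of killing us.
 * Returns 0 on success, -1 on error. */
int ignore_sigpipe(void)
{
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Hypothetical helper: returns 0 if write(2) accepted some or all of the
 * data, 1 if it failed because the peer is gone (EPIPE or ECONNRESET),
 * and -1 on any other error. */
int write_detect_closed(int fd, const void *buf, size_t len)
{
    ssize_t n = write(fd, buf, len);
    if (n >= 0)
        return 0;
    if (errno == EPIPE || errno == ECONNRESET)
        return 1;
    return -1;
}
```

A program would call ignore_sigpipe() once at startup and then treat a return of 1 from write_detect_closed() as the cue to close the descriptor.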
OOB Data
Out-of-band data is data transferred through a stream that is independent from the main in-band data stream. An out-of-band data mechanism provides a conceptually independent channel, which allows any data sent via that mechanism to be kept separate from in-band data.
- Out-of-band data (called “urgent data” in TCP) looks to the application like a separate stream of data from the main data stream.
- This can be useful for separating two different kinds of data.
- Note that just because it is called “urgent data” does not mean that it will be delivered any faster, or with higher priority than data in the in-band data stream.
- Also beware that unlike the main data stream, the out-of-band data may be lost if your application can’t keep up with it.
ACK
ACK
tcp_ack
tcp_clean_rtx_queue
tcp_ack_update_rtt
tcp_ack_saw_tstamp | tcp_ack_no_tstamp
tcp_valid_rtt_meas
tcp_rtt_estimator
tcp_set_rto
Glossary
- RTO
    - Retransmission Timeout
- MSL
    - Maximum Segment Lifetime
- RTT
    - Round-Trip Time
- MSS
    - Maximum Segment Size
- MTU
    - Maximum Transmission Unit