Socket Programming - Advanced Topics

Introduction

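This article covers advanced topics in socket programming: the options controlled through `getsockopt()` and `setsockopt()`, non-blocking mode, `EINTR` and `SIGPIPE` handling, and out-of-band data.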

`setsockopt`

#include <sys/socket.h>
int
getsockopt(int socket, int level, int option_name,
    void *restrict option_value, socklen_t *restrict option_len);
int
setsockopt(int socket, int level, int option_name,
    const void *option_value, socklen_t option_len);

getsockopt() and setsockopt() manipulate the options associated with a socket. Options may exist at multiple protocol levels; they are always present at the uppermost “socket” level.
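As a minimal sketch of the calling convention, the helpers below set an integer-valued option at the `SOL_SOCKET` level and read it back. The names `set_int_option`, `get_int_option` and `demo_reuseaddr` are illustrative only (they are not part of any API), and the sketch assumes a POSIX system.

```c
#include <stdio.h>
#include <sys/socket.h>
#include <unistd.h>

/* Set an integer-valued option; returns 0 on success, -1 on error. */
static int set_int_option(int fd, int level, int name, int value)
{
    return setsockopt(fd, level, name, &value, sizeof(value));
}

/* Read an integer-valued option into *value; returns 0 on success. */
static int get_int_option(int fd, int level, int name, int *value)
{
    socklen_t len = sizeof(*value);  /* in: buffer size, out: bytes written */
    return getsockopt(fd, level, name, value, &len);
}

/* Enable SO_REUSEADDR on a fresh TCP socket and read it back.
 * Returns 1 if the option reads back as enabled, 0 if not, -1 on error. */
static int demo_reuseaddr(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0) {
        perror("socket");
        return -1;
    }

    int enabled = 0;
    int rc = -1;
    if (set_int_option(fd, SOL_SOCKET, SO_REUSEADDR, 1) == 0 &&
        get_int_option(fd, SOL_SOCKET, SO_REUSEADDR, &enabled) == 0)
        rc = (enabled != 0);  /* any non-zero value means "on" */

    close(fd);
    return rc;
}
```

Note that `option_len` is a plain value for `setsockopt()` but a pointer for `getsockopt()`, which overwrites it with the number of bytes actually stored.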

Table of Contents

Introduction
setsockopt
SOL_SOCKET
IPPROTO_IP
- IP_HDRINCL
- IP_OPTIONS
- IP_RECVDSTADDR
- IP_RECVIF
- IP_TOS
- IP_TTL
- ICMP6_FILTER
IPPROTO_IPV6
IPPROTO_TCP
TCP FLAGS
Enable Non-Blocking Socket Option
EINTR
SIGPIPE
OOB Data
ACK
Glossary
References

SOL_SOCKET

`SO_DEBUG` enables recording of debugging information

`SO_REUSEADDR` enables local address reuse

Indicates that the rules used in validating addresses supplied in a bind(2) call should allow reuse of local addresses. For AF_INET sockets this means that a socket may bind, except when there is an active listening socket bound to the address. When the listening socket is bound to INADDR_ANY with a specific port then it is not possible to bind to this port for any local address. Argument is an integer boolean flag.

`SO_REUSEPORT` enables duplicate address and port bindings

`SO_KEEPALIVE` enables keeping connections alive

Enable sending of keep-alive messages on connection-oriented sockets. Expects an integer boolean flag.

`SO_DONTROUTE` enables routing bypass for outgoing messages

Don’t send via a gateway, only send to directly connected hosts. The same effect can be achieved by setting the MSG_DONTROUTE flag on a socket send(2) operation. Expects an integer boolean flag.

`SO_LINGER` linger on close if data present

When enabled, a close(2) or shutdown(2) will not return until all queued messages for the socket have been successfully sent or the linger timeout has been reached. Otherwise, the call returns immediately and the closing is done in the background. When the socket is closed as part of exit(2), it always lingers in the background.

The typical reason to set a SO_LINGER timeout of zero is to avoid large numbers of connections sitting in the TIME_WAIT state, tying up all the available resources on a server.

When a TCP connection is closed cleanly, the end that initiated the close (“active close”) ends up with the connection sitting in TIME_WAIT for twice the maximum segment lifetime (2 × MSL): 60 seconds on Linux, and up to four minutes with the two-minute MSL of RFC 793. So if your protocol is one where the server initiates the connection close, and involves very large numbers of short-lived connections, then it might be susceptible to this problem.

This isn’t a good idea, though - TIME_WAIT exists for a reason (to ensure that stray packets from old connections don’t interfere with new connections). It’s a better idea to redesign your protocol to one where the client initiates the connection close, if possible.

Moreover, the purpose of SO_LINGER is very, very specific and only a tiny minority of socket applications actually need it. Unless you are extremely familiar with the intricacies of TCP and the BSD socket API, you could very easily end up using SO_LINGER in a way for which it was not designed.

The effect of SO_LINGER depends on the values in the linger structure (the fourth argument passed to setsockopt()):

/* /usr/include/sys/socket.h -> /usr/include/bits/socket.h */
/* structure used to manipulate the SO_LINGER option */
struct linger {
    int l_onoff;      /* non-zero to linger on close */
    int l_linger;     /* time to linger */
};

There are three cases:

Case 1: linger->l_onoff == 0

linger->l_linger has no meaning. This is the default.

On close(), the underlying stack attempts to gracefully shut down the connection after ensuring all unsent data is sent. In the case of connection-oriented protocols such as TCP, the stack also ensures that sent data is acknowledged by the peer. The stack performs this graceful shutdown in the background (after the call to close() returns), regardless of whether the socket is blocking or non-blocking.

Case 2: linger->l_onoff != 0 && linger->l_linger == 0

close() returns immediately. The underlying stack discards any unsent data and, in the case of connection-oriented protocols such as TCP, sends a RST (reset) to the peer (this is termed a hard or abortive close). All subsequent attempts by the peer’s application to read()/recv() data will result in an ECONNRESET.

Case 3: linger->l_onoff != 0 && linger->l_linger != 0

On a blocking socket, close() blocks until a graceful shutdown completes or the time specified in linger->l_linger elapses (time-out). On a non-blocking socket, close() fails with EWOULDBLOCK instead of blocking. Upon time-out the stack behaves as in case 2 above.
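The three cases above can be sketched as three small wrappers around one `setsockopt()` call. The helper names (`set_linger`, `linger_background`, `linger_abort`, `linger_wait`, `get_linger`) are hypothetical, and the sketch assumes `l_linger` is measured in seconds, as POSIX specifies.

```c
#include <sys/socket.h>

/* Apply one SO_LINGER configuration; returns 0 on success, -1 on error. */
static int set_linger(int fd, int onoff, int seconds)
{
    struct linger lg;
    lg.l_onoff = onoff;     /* non-zero enables lingering */
    lg.l_linger = seconds;  /* assumed to be seconds; ignored if l_onoff == 0 */
    return setsockopt(fd, SOL_SOCKET, SO_LINGER, &lg, sizeof(lg));
}

/* Case 1: the default, graceful shutdown continues in the background. */
static int linger_background(int fd)
{
    return set_linger(fd, 0, 0);
}

/* Case 2: abortive close, unsent data is discarded and TCP sends a RST. */
static int linger_abort(int fd)
{
    return set_linger(fd, 1, 0);
}

/* Case 3: close() waits up to `seconds` for a graceful shutdown. */
static int linger_wait(int fd, int seconds)
{
    return set_linger(fd, 1, seconds);
}

/* Read the current setting back into *out; returns 0 on success. */
static int get_linger(int fd, struct linger *out)
{
    socklen_t len = sizeof(*out);
    return getsockopt(fd, SOL_SOCKET, SO_LINGER, out, &len);
}
```

Reading the setting back with `get_linger()` after each call is a cheap way to check whether a given platform stores the values at all, which matters given the portability caveats that follow.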

Portability Note

Some implementations of the BSD socket API do not implement SO_LINGER at all.

On such systems, applying SO_LINGER either fails with EINVAL or is (silently) ignored. Having SO_LINGER defined in the headers is no guarantee that SO_LINGER is actually implemented.

Since the BSD documentation on SO_LINGER is sparse and inadequate, it is not surprising to find the various implementations interpreting the effect of SO_LINGER differently.

For instance, the effect of SO_LINGER on non-blocking sockets is not mentioned at all in BSD documentation, and is consequently treated differently on different platforms. Taking case 3 as an example: some implementations behave as described above. With others, close() on a non-blocking socket succeeds immediately, leaving the rest to a background process. Others ignore non-blocking-ness and behave as if the socket were blocking. Yet others behave as if SO_LINGER were not in effect (as if case 1, the default, applied), or ignore linger->l_linger (case 3 is treated as case 2). Given the lack of adequate documentation, such differences are not (by themselves) indicative of an “incomplete” or “broken” implementation. They are simply different, not incorrect.

Some implementations of the BSD socket API do not implement SO_LINGER completely.

On such systems, the value of linger->l_linger is ignored (always treated as if it were zero).

Technical/Developer note: SO_LINGER does not (and should not) affect a stack’s implementation of TIME_WAIT. In any event, SO_LINGER is not the way to get around TIME_WAIT. If an application expects to open and close many TCP sockets in quick succession, it should be written to use only a fixed number and/or range of ports, and apply SO_REUSEPORT to sockets that use those ports.

SO_DONTLINGER: This socket option has the exact opposite meaning of SO_LINGER, and the two are treated (after inverting the value of linger->l_onoff) as equivalent. In other words, SO_LINGER with a zero linger->l_onoff is the same as SO_DONTLINGER with a non-zero linger->l_onoff, and vice versa.

Using SO_LINGER with a timeout of 0 should really be a last resort.

`SO_BROADCAST` enables permission to transmit broadcast messages

Set or get the broadcast flag. When enabled, datagram sockets are allowed to send packets to a broadcast address. This option has no effect on stream-oriented sockets.

`SO_OOBINLINE` enables reception of out-of-band data in band

If this option is enabled, out-of-band data is directly placed into the receive data stream. Otherwise out-of-band data is only passed when the MSG_OOB flag is set during receiving.

`SO_SNDBUF` set buffer size for output

Sets or gets the maximum socket send buffer in bytes. The kernel doubles this value (to allow space for bookkeeping overhead) when it is set using setsockopt(2), and this doubled value is returned by getsockopt(2). The default value is set by the /proc/sys/net/core/wmem_default file and the maximum allowed value is set by the /proc/sys/net/core/wmem_max file. The minimum (doubled) value for this option is 2048.

NOTES

Linux assumes that half of the send/receive buffer is used for internal kernel structures; thus the sysctls are twice what can be observed on the wire.
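The doubling described above is easy to observe by requesting a size and reading the option back. The following is a small sketch; `request_sndbuf` is a hypothetical helper name, and the exact value returned depends on the platform and on the `wmem_max` limit.

```c
#include <sys/socket.h>

/* Request a send-buffer size and return the size the kernel reports
 * afterwards, or -1 on error. On Linux the reported value is normally
 * twice the (possibly clamped) requested value. */
static int request_sndbuf(int fd, int requested)
{
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF,
                   &requested, sizeof(requested)) < 0)
        return -1;

    int effective = 0;
    socklen_t len = sizeof(effective);
    if (getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &effective, &len) < 0)
        return -1;

    return effective;
}
```

For example, `request_sndbuf(fd, 8192)` would typically report 16384 on Linux, while other systems may report the requested value unchanged, so code should treat the result as informational rather than compare it for equality.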

`SO_RCVBUF` set buffer size for input

`SO_SNDLOWAT` set minimum count for output

Specify the minimum number of bytes in the buffer until the socket layer will pass the data to the protocol (SO_SNDLOWAT) or the user on receiving (SO_RCVLOWAT). These two values are initialized to 1. SO_SNDLOWAT is not changeable on Linux (setsockopt(2) fails with the error ENOPROTOOPT). SO_RCVLOWAT is changeable only since Linux 2.4. Before Linux 2.6.28, the select(2) and poll(2) system calls did not respect the SO_RCVLOWAT setting and marked a socket readable when even a single byte of data was available, so a subsequent read from the socket would block until SO_RCVLOWAT bytes were available. Since Linux 2.6.28 they report a socket as readable only when at least SO_RCVLOWAT bytes are available.

`SO_RCVLOWAT` set minimum count for input

`SO_SNDTIMEO` set timeout value for output

Specify the receiving or sending timeouts until reporting an error. The argument is a struct timeval.

If an input or output function blocks for this period of time and data has been sent or received, the return value of that function will be the amount of data transferred; if no data has been transferred and the timeout has been reached, then -1 is returned with errno set to EAGAIN or EWOULDBLOCK, or EINPROGRESS (for connect(2)), just as if the socket was specified to be nonblocking. If the timeout is set to zero (the default), then the operation will never time out.

Timeouts only have effect for system calls that perform socket I/O (e.g., read(2), recvmsg(2), send(2), sendmsg(2)); timeouts have no effect for select(2), poll(2), epoll_wait(2), and so on.
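To illustrate the timeout behaviour described above, this sketch sets `SO_RCVTIMEO` on one end of a local socket pair and calls `recv()` while nothing is ever written to the other end. The helper names `set_recv_timeout` and `demo_recv_timeout` are made up for the example, and a Unix-domain `socketpair()` is used only to keep it self-contained.

```c
#include <errno.h>
#include <sys/socket.h>
#include <sys/time.h>
#include <unistd.h>

/* Set the receive timeout in milliseconds; returns 0 on success. */
static int set_recv_timeout(int fd, long millis)
{
    struct timeval tv;
    tv.tv_sec = millis / 1000;
    tv.tv_usec = (millis % 1000) * 1000;
    return setsockopt(fd, SOL_SOCKET, SO_RCVTIMEO, &tv, sizeof(tv));
}

/* Returns the errno seen when recv() times out, 0 if recv() did not
 * fail, or -1 if the sockets could not be set up. */
static int demo_recv_timeout(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int result = -1;
    if (set_recv_timeout(sv[0], 50) == 0) {
        char buf[16];
        /* Nothing is written to sv[1], so this blocks for about 50 ms. */
        ssize_t n = recv(sv[0], buf, sizeof(buf), 0);
        result = (n < 0) ? errno : 0;  /* save errno before close() */
    }

    close(sv[0]);
    close(sv[1]);
    return result;
}
```

The value returned by `demo_recv_timeout()` should be `EAGAIN` (or `EWOULDBLOCK`, which is the same number on many systems), matching the error the socket would report if it were non-blocking.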

`SO_RCVTIMEO` set timeout value for input

`SO_TYPE` get the type of the socket (get only)

`SO_TIMESTAMP` enable or disable the receiving of the `SO_TIMESTAMP` control message

The timestamp control message is sent with level SOL_SOCKET and the cmsg_data field is a struct timeval indicating the reception time of the last packet passed to the user in this call. See cmsg(3) for details on control messages.

`SO_ERROR` get and clear error on the socket (get only)

`SO_NOSIGPIPE` do not generate `SIGPIPE`, instead return `EPIPE`

`SO_NREAD` number of bytes to be read (get only)

`SO_NWRITE` number of bytes written not yet sent by the protocol (get only)

`SO_LINGER_SEC` linger on close if data present with timeout in seconds

Options not in the manual

`SO_EXCLUSIVEADDRUSE`

`SO_USELOOPBACK`

`SO_BSDCOMPAT`

`/proc/sys/net/core/` interfaces

bpf_jit_enable
busy_poll
busy_read
default_qdisc
dev_weight
message_burst
message_cost
- configure the token bucket filter used to load limit warning messages caused by external network events.
netdev_budget
netdev_max_backlog
- Maximum number of packets in the global input queue.
netdev_rss_key
netdev_tstamp_prequeue
optmem_max
- Maximum length of ancillary data and user control data like the iovecs per socket.
rmem_default
- contains the default setting in bytes of the socket receive buffer.
rmem_max
- contains the maximum socket receive buffer size in bytes which a user may set by using the SO_RCVBUF socket option.
rps_sock_flow_entries
somaxconn
warnings
wmem_default
- contains the default setting in bytes of the socket send buffer.
wmem_max
- contains the maximum socket send buffer size in bytes which a user may set by using the SO_SNDBUF socket option.
xfrm_acq_expires
xfrm_aevent_etime
xfrm_aevent_rseqth
xfrm_larval_drop

IPPROTO_IP

IP_HDRINCL

IP_OPTIONS

IP_RECVDSTADDR

IP_RECVIF

IP_TOS

IPTOS_LOWDELAY
IPTOS_THROUGHPUT
IPTOS_RELIABILITY
IPTOS_LOWCOST

IP_TTL

ICMP6_FILTER

IPPROTO_IPV6

IPV6_ADDRFORM

IPV6_CHECKSUM

IPV6_DSTOPTS

IPV6_HOPLIMIT

IPV6_HOPOPTS

IPV6_NEXTHOP

IPV6_PKTINFO

IPV6_PKTOPTIONS

IPV6_RTHDR

IPV6_UNICAST_HOPS

IPPROTO_TCP

`TCP_MAXSEG`

`TCP_TW_REUSE`

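Allows TIME-WAIT sockets to be reused for new connections when it is safe from the protocol viewpoint. Despite the name used here, this is not a setsockopt() option: on Linux it is the system-wide sysctl net.ipv4.tcp_tw_reuse (/proc/sys/net/ipv4/tcp_tw_reuse), and it affects only outgoing connections (the side that calls connect()). The default value was originally 0; recent kernels default to 2 (reuse for loopback traffic only). It should not be changed without advice/request of technical experts.

It is generally a safer alternative to TCP_TW_RECYCLE.

The TCP_TW_REUSE setting is particularly useful in environments where numerous short connections are opened and left in TIME_WAIT state, such as web servers. Reusing the sockets can be very effective in reducing server load.

`TCP_TW_RECYCLE`

Enables fast recycling of TIME-WAIT sockets. Like TCP_TW_REUSE, it is a Linux sysctl (net.ipv4.tcp_tw_recycle) rather than a socket option. Default value is 0.

Known to cause issues with hoststated (load balancing and failover) and with clients behind NAT if enabled, so it should be used with caution. The sysctl was removed entirely in Linux 4.12.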

`TCP_NODELAY`

First off, be sure you really want to use it in the first place. It will disable the Nagle algorithm, which will cause network traffic to increase, with smaller than needed packets wasting bandwidth. Also, from what I have been able to tell, the speed increase is very small, so you should probably do it without TCP_NODELAY first, and only turn it on if there is a problem.

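#include <netinet/in.h>   /* IPPROTO_TCP */
#include <netinet/tcp.h>  /* TCP_NODELAY */
#include <stdio.h>        /* perror */
#include <sys/socket.h>   /* setsockopt */

/* sock is an already-created TCP socket */
int optval = 1;
int result = setsockopt(sock,             /* socket affected */
                        IPPROTO_TCP,      /* set option at TCP level */
                        TCP_NODELAY,      /* name of option */
                        &optval,          /* no cast needed: parameter is const void * */
                        sizeof(optval));  /* length of option value */
if (result < 0)
    perror("setsockopt(TCP_NODELAY)");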

TCP_NODELAY is for a specific purpose: to disable the Nagle buffering algorithm. It should only be set for applications that send frequent small bursts of information without getting an immediate response, where timely delivery of data is required (the canonical example is mouse movements).

`TCP_CORK`

`TCP_DEFER_ACCEPT`

`TCP_QUICKACK`

TCP FLAGS

SYN
FIN
ACK
PSH
- Data is not empty
RST
URG
- Urgent pointer is valid

Enable Non-Blocking Socket Option

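/* POSIX: add O_NONBLOCK to the file status flags (needs <fcntl.h>) */
int flags = fcntl(sockfd, F_GETFL, 0);
fcntl(sockfd, F_SETFL, flags | O_NONBLOCK);

/* Alternative: the older FIONBIO ioctl (needs <sys/ioctl.h>) */
int one = 1;
ioctl(sockfd, FIONBIO, &one);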
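A slightly fuller sketch of the fcntl() approach, with error checking, might look like the following. `set_nonblocking` and `demo_nonblocking_recv` are hypothetical names; the demo uses a Unix-domain `socketpair()` purely so that it needs no network setup.

```c
#include <errno.h>
#include <fcntl.h>
#include <sys/socket.h>
#include <unistd.h>

/* Put a descriptor into non-blocking mode; returns 0 on success. */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL, 0);
    if (flags < 0)
        return -1;
    if (flags & O_NONBLOCK)
        return 0;  /* already non-blocking */
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Returns 1 if recv() on an empty non-blocking socket fails at once
 * with EAGAIN/EWOULDBLOCK, 0 if it behaves otherwise, -1 on setup error. */
static int demo_nonblocking_recv(void)
{
    int sv[2];
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    int result = -1;
    if (set_nonblocking(sv[0]) == 0) {
        char buf[16];
        ssize_t n = recv(sv[0], buf, sizeof(buf), 0);  /* no data queued */
        result = (n < 0 && (errno == EAGAIN || errno == EWOULDBLOCK)) ? 1 : 0;
    }

    close(sv[0]);
    close(sv[1]);
    return result;
}
```

Reading the existing flags first matters: passing only `O_NONBLOCK` to `F_SETFL` would silently clear any other status flags already set on the descriptor.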

`EINTR`

This isn’t really so much an error as an exit condition. It means that the call was interrupted by a signal. Any call that might block should be wrapped in a loop that checks for EINTR and retries the call.
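Such a retry loop can be sketched as thin wrappers around read(2) and write(2). The names `read_retry` and `write_retry` are illustrative; the same pattern applies to recv(), send(), accept() and similar calls.

```c
#include <errno.h>
#include <unistd.h>

/* read() that restarts when interrupted by a signal before any data
 * was transferred; otherwise behaves exactly like read(). */
static ssize_t read_retry(int fd, void *buf, size_t count)
{
    ssize_t n;
    do {
        n = read(fd, buf, count);
    } while (n < 0 && errno == EINTR);
    return n;
}

/* write() with the same EINTR handling. */
static ssize_t write_retry(int fd, const void *buf, size_t count)
{
    ssize_t n;
    do {
        n = write(fd, buf, count);
    } while (n < 0 && errno == EINTR);
    return n;
}
```

These wrappers only hide `EINTR`; a short read or write (fewer bytes than requested) is still possible and has to be handled by the caller.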

`SIGPIPE`

With TCP you get SIGPIPE when you write to a connection on which your end has already received an RST from the other end. This also means that if you had called select before writing, it would have reported the socket as readable, since the RST is there for you to read (read will return an error with errno set to ECONNRESET).

Basically an RST is TCP’s response to some packet that it doesn’t expect and has no other way of dealing with. A common case is when the peer closes the connection (sending you a FIN) but you ignore it because you’re writing and not reading. (You should be using select.) So you write to a connection that has been closed by the other end and the other end’s TCP responds with an RST.
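One common way to cope with this is to ignore SIGPIPE so that the failed write reports `EPIPE` instead of killing the process. The sketch below demonstrates the effect on a local socket pair, which stands in for a TCP connection whose peer has gone away; `ignore_sigpipe` and `demo_write_after_peer_close` are hypothetical names.

```c
#include <errno.h>
#include <signal.h>
#include <sys/socket.h>
#include <unistd.h>

/* Ignore SIGPIPE process-wide; returns 0 on success, -1 on error. */
static int ignore_sigpipe(void)
{
    return signal(SIGPIPE, SIG_IGN) == SIG_ERR ? -1 : 0;
}

/* Write to a socket whose peer has already closed. Returns the errno
 * from write(), 0 if the write did not fail, or -1 on setup error. */
static int demo_write_after_peer_close(void)
{
    int sv[2];
    if (ignore_sigpipe() < 0)
        return -1;
    if (socketpair(AF_UNIX, SOCK_STREAM, 0, sv) < 0)
        return -1;

    close(sv[1]);                      /* the peer goes away */
    ssize_t n = write(sv[0], "x", 1);  /* would raise SIGPIPE by default */
    int result = (n < 0) ? errno : 0;  /* save errno before close() */

    close(sv[0]);
    return result;
}
```

With the signal ignored, `demo_write_after_peer_close()` should return `EPIPE`. Where it is available, passing `MSG_NOSIGNAL` to send() gives the same result for a single call without changing the process-wide signal disposition.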

OOB Data

Out-of-band data is data transferred through a stream that is independent of the main in-band data stream. An out-of-band data mechanism provides a conceptually independent channel, which allows any data sent via that mechanism to be kept separate from in-band data.

Out-of-band data (called “urgent data” in TCP) looks to the application like a separate stream of data from the main data stream. This can be useful for separating two different kinds of data. Note that just because it is called “urgent data” does not mean that it will be delivered any faster, or with higher priority, than data in the in-band data stream. Also beware that, unlike the main data stream, out-of-band data may be lost if your application can’t keep up with it.

ACK

ACK
tcp_ack
tcp_clean_rtx_queue
tcp_ack_update_rtt
tcp_ack_saw_tstamp | tcp_ack_no_tstamp
tcp_valid_rtt_meas
tcp_rtt_estimator
tcp_set_rto
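The call chain above ends in an RTT estimator and an RTO update. As a rough illustration of what such an estimator computes, here is the standard smoothed-RTT calculation from RFC 6298 in floating point. This is not the kernel's code, which uses scaled integer arithmetic and its own refinements. The names `rtt_state` and `rtt_update` are invented for the sketch, and the clock-granularity term and the minimum-RTO clamp from the RFC are left out for brevity.

```c
/* Estimator state: smoothed RTT and RTT variation, both in milliseconds. */
struct rtt_state {
    double srtt;
    double rttvar;
    int has_sample;  /* 0 until the first measurement arrives */
};

/* Feed one RTT measurement (milliseconds) and return the new RTO. */
static double rtt_update(struct rtt_state *s, double sample_ms)
{
    if (!s->has_sample) {
        /* First measurement: SRTT = R, RTTVAR = R / 2 */
        s->srtt = sample_ms;
        s->rttvar = sample_ms / 2.0;
        s->has_sample = 1;
    } else {
        /* RTTVAR is updated first, using the old SRTT (beta = 1/4) */
        double diff = s->srtt - sample_ms;
        if (diff < 0)
            diff = -diff;
        s->rttvar = 0.75 * s->rttvar + 0.25 * diff;
        /* Then SRTT is smoothed toward the new sample (alpha = 1/8) */
        s->srtt = 0.875 * s->srtt + 0.125 * sample_ms;
    }
    /* RTO = SRTT + 4 * RTTVAR */
    return s->srtt + 4.0 * s->rttvar;
}
```

Starting from a zeroed `struct rtt_state`, a first sample of 100 ms gives an RTO of 300 ms, and a second identical sample lowers it to 250 ms as the variation estimate decays.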

Glossary

RTO: Retransmission Timeout
MSL: Maximum Segment Lifetime
RTT: Round-Trip Time
MSS: Maximum Segment Size
MTU: Maximum Transmission Unit
