Linux select

Introduction

This is the first article in a series on Linux I/O event notification mechanisms; it covers select. The other two articles are:

Table of Contents

Introduction
select, pselect, FD_CLR, FD_ISSET, FD_SET, FD_ZERO - synchronous I/O multiplexing
Description
Return Value
Errors
Conforming To
Notes
Bugs
Example
Sample Programs
- Input monitoring
- Web server
References

select, pselect, FD_CLR, FD_ISSET, FD_SET, FD_ZERO - synchronous I/O multiplexing

/* According to POSIX.1-2001 */
#include <sys/select.h>

/* According to earlier standards */
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int select(int nfds, fd_set *readfds, fd_set *writefds,
           fd_set *exceptfds, struct timeval *timeout);

void FD_CLR(int fd, fd_set *set);
int  FD_ISSET(int fd, fd_set *set);
void FD_SET(int fd, fd_set *set);
void FD_ZERO(fd_set *set);

#include <sys/select.h>

int pselect(int nfds, fd_set *readfds, fd_set *writefds,
            fd_set *exceptfds, const struct timespec *timeout,
            const sigset_t *sigmask);

Feature Test Macro Requirements for glibc (see feature_test_macros(7)):

pselect(): _POSIX_C_SOURCE >= 200112L || _XOPEN_SOURCE >= 600

Description

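select() and pselect() allow a program to monitor multiple file descriptors, waiting until at least one of them becomes ready for some class of I/O operation (for example, input). A file descriptor is ready when the corresponding operation (e.g., read(2)) can be performed without blocking.

The operation of select() and pselect() is identical, except for these three differences:

(i): select() uses a timeout that is a struct timeval (with seconds and microseconds), while pselect() uses a struct timespec (with seconds and nanoseconds).
(ii): select() may update the timeout argument to indicate how much time was left. pselect() does not change this argument.
(iii): select() has no sigmask argument, and behaves like pselect() called with a NULL sigmask.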

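Three independent sets of file descriptors are watched. Those listed in readfds are watched to see whether characters become available for reading (more precisely, whether a read will not block; in particular, a file descriptor is also ready at end-of-file), those in writefds are watched to see whether a write will not block, and those in exceptfds are watched for exceptions. On exit, the sets are modified in place to indicate which file descriptors actually changed status. Each of the three sets may be NULL if no file descriptors are to be watched for that class of event.

Four macros are provided to manipulate the sets. FD_ZERO() clears a set. FD_SET() and FD_CLR() add a given file descriptor to a set and remove it from a set, respectively. FD_ISSET() tests whether a file descriptor is part of a set; this is useful after select() returns.

nfds is the highest-numbered file descriptor in any of the three sets, plus 1.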

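The timeout argument specifies how long select() should block waiting for a file descriptor to become ready; the call returns sooner if a descriptor becomes ready or a signal handler interrupts it. (This interval will be rounded up to the system clock granularity, and kernel scheduling delays mean that the blocking interval may overrun by a small amount.) If both fields of the timeval structure are zero, select() returns immediately. (This is useful for polling.) If timeout is NULL (no timeout), select() can block indefinitely.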

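sigmask is a pointer to a signal mask (see sigprocmask(2)); if it is not NULL, pselect() first replaces the current signal mask with the one pointed to by sigmask, then does the "select" function, and then restores the original signal mask.

Other than the difference in the precision of the timeout argument, the following pselect() call: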

ready = pselect(nfds, &readfds, &writefds, &exceptfds, timeout, &sigmask);

is equivalent to atomically executing the following calls:

sigset_t origmask;

pthread_sigmask(SIG_SETMASK, &sigmask, &origmask);
ready = select(nfds, &readfds, &writefds, &exceptfds, timeout);
pthread_sigmask(SIG_SETMASK, &origmask, NULL);

pselect() is needed because anyone who wants to wait for either a signal or a file descriptor to become ready needs an atomic test to prevent race conditions. (Suppose the signal handler sets a global flag and returns. Then a test of this global flag followed by a call of select() could hang indefinitely if the signal arrived just after the test but just before the call. By contrast, pselect() allows one to first block signals, handle the signals that have come in, then call pselect() with the desired sigmask, avoiding the race.)

The timeout

The time structures involved are defined in <sys/time.h> and look like

struct timeval {
    long    tv_sec;         /* seconds */
    long    tv_usec;        /* microseconds */
};

and

struct timespec {
    long    tv_sec;         /* seconds */
    long    tv_nsec;        /* nanoseconds */
};

(However, see below on the POSIX.1-2001 versions.)

Some code calls select() with all three sets empty, nfds zero, and a non-NULL timeout as a fairly portable way to sleep with subsecond precision.

On Linux, select() modifies timeout to reflect the amount of time not slept; most other implementations do not do this. (POSIX.1-2001 permits either behavior.) This causes problems both when Linux code which reads timeout is ported to other operating systems, and when code is ported to Linux that reuses a struct timeval for multiple select()s in a loop without reinitializing it. It is best to treat timeout as undefined after select() returns.

Return Value

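On success, select() and pselect() return the number of file descriptors contained in the three returned sets (that is, the total number of bits set in readfds, writefds, and exceptfds); the return value is 0 if the timeout expires first. On error, -1 is returned and errno is set appropriately; the sets and timeout become undefined, so do not rely on their contents after an error.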

Errors

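EBADF: An invalid file descriptor was given in one of the sets. (Perhaps a file descriptor that was already closed, or one on which an error has occurred.)
EINTR: A signal was caught; see signal(7).
EINVAL: nfds is negative or the value contained within timeout is invalid.
ENOMEM: Unable to allocate memory for internal tables.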

Conforming To

select() conforms to POSIX.1-2001 and 4.4BSD (select() first appeared in 4.2BSD). Generally portable to/from non-BSD systems supporting clones of the BSD socket layer (including System V variants). However, note that the System V variant typically sets the timeout variable before exit, but the BSD variant does not.

pselect() is defined in POSIX.1g, and in POSIX.1-2001.

Notes

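An fd_set is a fixed-size buffer. Executing FD_CLR() or FD_SET() with a value of fd that is negative or is equal to or larger than FD_SETSIZE will result in undefined behavior. Moreover, POSIX requires fd to be a valid file descriptor.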
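Since an out-of-range fd is undefined behavior rather than a reported error, it can help to check the range before touching the set. A small sketch, with fd_set_checked() as a made-up wrapper name:

```c
#include <sys/select.h>

/* Hypothetical wrapper: add fd to the set only if it is in range.
 * Returns 0 on success, -1 if fd is negative or >= FD_SETSIZE. */
int fd_set_checked(int fd, fd_set *set)
{
    if (fd < 0 || fd >= FD_SETSIZE)
        return -1;      /* FD_SET() here would write outside the buffer */
    FD_SET(fd, set);
    return 0;
}
```

This check only covers the range; it does not verify that fd refers to an open file.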

Concerning the types involved, the classical situation is that the two fields of a timeval structure are typed as long (as shown above), and the structure is defined in <sys/time.h>. The POSIX.1-2001 situation is

struct timeval {
    time_t         tv_sec;     /* seconds */
    suseconds_t    tv_usec;    /* microseconds */
};

where the structure is defined in <sys/select.h> and the data types time_t and suseconds_t are defined in <sys/types.h>.

Concerning prototypes, the classical situation is that one should include <time.h> for select(). The POSIX.1-2001 situation is that one should include <sys/select.h> for select() and pselect().

Libc4 and libc5 do not have a <sys/select.h> header; under glibc 2.0 and later this header exists. Under glibc 2.0 it unconditionally gives the wrong prototype for pselect(). Under glibc 2.1 to 2.2.1 it gives pselect() when _GNU_SOURCE is defined. Since glibc 2.2.2 the requirements are as shown in the SYNOPSIS.

Multithreaded applications

If a file descriptor being monitored by select() is closed in another thread, the result is unspecified. On some UNIX systems, select() unblocks and returns, with an indication that the file descriptor is ready (a subsequent I/O operation will likely fail with an error, unless the file descriptor was reopened between the time select() returned and the time the I/O operation was performed). On Linux (and some other systems), closing the file descriptor in another thread has no effect on select(). In summary, any application that relies on a particular behavior in this scenario must be considered buggy.

Linux notes

The pselect() interface described in this page is implemented by glibc. The underlying Linux system call is named pselect6(). This system call has somewhat different behavior from the glibc wrapper function.

The Linux pselect6() system call modifies its timeout argument. However, the glibc wrapper function hides this behavior by using a local variable for the timeout argument that is passed to the system call. Thus, the glibc pselect() function does not modify its timeout argument; this is the behavior required by POSIX.1-2001.

The final argument of the pselect6() system call is not a sigset_t * pointer, but is instead a structure of the form:

struct {
    const sigset_t *ss;     /* Pointer to signal set */
    size_t          ss_len; /* Size (in bytes) of object pointed
                               to by 'ss' */
};

This allows the system call to obtain both a pointer to the signal set and its size, while allowing for the fact that most architectures support a maximum of 6 arguments to a system call.

Bugs

Glibc 2.0 provided a version of pselect() that did not take a sigmask argument.

Starting with version 2.1, glibc provided an emulation of pselect() that was implemented using sigprocmask(2) and select(). This implementation remained vulnerable to the very race condition that pselect() was designed to prevent. Modern versions of glibc use the (race-free) pselect() system call on kernels where it is provided.

On systems that lack pselect(), reliable (and more portable) signal trapping can be achieved using the self-pipe trick. In this technique, a signal handler writes a byte to a pipe whose other end is monitored by select() in the main program. (To avoid possibly blocking when writing to a pipe that may be full or reading from a pipe that may be empty, nonblocking I/O is used when reading from and writing to the pipe.)
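A compact sketch of the self-pipe trick follows. The names self_pipe, self_pipe_init() and wait_signal_or_fd() are assumptions for this example, and the return-value convention is invented here rather than taken from any library.

```c
#define _POSIX_C_SOURCE 200809L
#include <errno.h>
#include <fcntl.h>
#include <signal.h>
#include <sys/select.h>
#include <unistd.h>

static int self_pipe[2] = { -1, -1 };   /* [0] read end, [1] write end */

static void notify_pipe(int signo)
{
    int saved_errno = errno;            /* write() may change errno */
    ssize_t n = write(self_pipe[1], "x", 1);

    (void)n;            /* a full pipe already guarantees a wakeup */
    (void)signo;
    errno = saved_errno;
}

/* Hypothetical setup: create the pipe, make both ends nonblocking,
 * and install the handler for signo. Returns 0 or -1. */
int self_pipe_init(int signo)
{
    struct sigaction sa;

    if (pipe(self_pipe) == -1)
        return -1;
    for (int i = 0; i < 2; i++) {
        int flags = fcntl(self_pipe[i], F_GETFL);
        if (flags == -1 ||
            fcntl(self_pipe[i], F_SETFL, flags | O_NONBLOCK) == -1)
            return -1;
    }
    sa.sa_handler = notify_pipe;
    sa.sa_flags = 0;
    sigemptyset(&sa.sa_mask);
    return sigaction(signo, &sa, NULL);
}

/* Wait for the signal or for fd to become readable.
 * Returns 1 if the signal was seen, 0 if only fd is ready, -1 on error. */
int wait_signal_or_fd(int fd)
{
    int maxfd = fd > self_pipe[0] ? fd : self_pipe[0];
    char buf[16];

    for (;;) {
        fd_set readfds;

        FD_ZERO(&readfds);
        FD_SET(fd, &readfds);
        FD_SET(self_pipe[0], &readfds);
        if (select(maxfd + 1, &readfds, NULL, NULL, NULL) == -1) {
            if (errno == EINTR)
                continue;   /* the handler ran: loop and see its byte */
            return -1;
        }
        if (FD_ISSET(self_pipe[0], &readfds)) {
            while (read(self_pipe[0], buf, sizeof buf) > 0)
                ;           /* drain every pending byte */
            return 1;
        }
        return 0;
    }
}
```

Because the signal is turned into ordinary pipe data, it no longer matters whether it arrives before or during the select() call: the byte is still waiting in the pipe either way.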

Under Linux, select() may report a socket file descriptor as "ready for reading", while nevertheless a subsequent read blocks. This could for example happen when data has arrived but upon examination has wrong checksum and is discarded. There may be other circumstances in which a file descriptor is spuriously reported as ready. Thus it may be safer to use O_NONBLOCK on sockets that should not block.
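A sketch of that defensive setup: put the descriptor in nonblocking mode and treat EAGAIN after a "ready" report as a cue to go back to select(). The helper names and the -2 return code are invented for this example.

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: switch fd to nonblocking mode. Returns 0 or -1. */
int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) == -1 ? -1 : 0;
}

/* Read after select() reported fd as ready. Returns the byte count,
 * 0 on end-of-file, -2 if the readiness report was spurious (nothing to
 * read after all), or -1 on any other error. */
ssize_t read_ready(int fd, void *buf, size_t len)
{
    ssize_t n = read(fd, buf, len);

    if (n == -1 && (errno == EAGAIN || errno == EWOULDBLOCK))
        return -2;      /* return to select() instead of blocking */
    return n;
}
```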

On Linux, select() also modifies timeout if the call is interrupted by a signal handler (i.e., the EINTR error return). This is not permitted by POSIX.1-2001. The Linux pselect() system call has the same behavior, but the glibc wrapper hides this behavior by internally copying the timeout to a local variable and passing that variable to the system call.


Example

#include <stdio.h>
#include <stdlib.h>
#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>
#include <unistd.h>

int
main(void)
{
    fd_set rfds;
    struct timeval tv;
    int retval;

    /* Watch stdin (fd 0) to see when it has input. */
    FD_ZERO(&rfds);
    FD_SET(0, &rfds);

    /* Wait up to five seconds. */
    tv.tv_sec = 5;
    tv.tv_usec = 0;

    retval = select(1, &rfds, NULL, NULL, &tv);
    /* Don't rely on the value of tv now! */

    if (retval == -1)
        perror("select()");
    else if (retval)
        printf("Data is available now.\n");
        /* FD_ISSET(0, &rfds) will be true. */
    else
        printf("No data within five seconds.\n");

    exit(EXIT_SUCCESS);
}

Sample Programs

Input monitoring

This program detects whether there is keyboard input. The complete program follows:

#include <stdio.h>
#include <stdlib.h>

#include <unistd.h>

#include <sys/select.h>
#include <sys/time.h>
#include <sys/types.h>

#define BUFLEN 32U

/* Print errmsg with the errno text and exit when condition is true. */
void errpro(int condition, const char *errmsg) {
    if (condition) {
        perror(errmsg);
        exit(EXIT_FAILURE);
    }
}

int main(void) {
    fd_set rdfds;
    struct timeval tv;
    while (1) {
        /* select() overwrites the set and may change the timeout,
           so rebuild both on every round. */
        FD_ZERO(&rdfds);
        FD_SET(STDIN_FILENO, &rdfds);
        tv.tv_sec = 2;
        tv.tv_usec = 0;
        int cnt = select(STDIN_FILENO+1, &rdfds, NULL, NULL, &tv);
        switch (cnt) {
            case -1:
                errpro(1, "select");
                break;
            case 0:
                printf("timeout\n");
                break;
            default:
                if (FD_ISSET(STDIN_FILENO, &rdfds)) {
                    char buf[BUFLEN];
                    ssize_t len = read(STDIN_FILENO, buf, BUFLEN-1);
                    errpro(-1 == len, "read");
                    if (0 == len) // EOF
                        return 0;
                    buf[len] = '\0'; // read() does not NUL-terminate
                    printf("read: %s\n", buf);
                }
                break;
        }
    }
    return 0;
}

Web server

A web server designed around the select model:

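Summary

The key to understanding the select model is understanding fd_set. For ease of explanation, take fd_set to be 1 byte long, with each bit corresponding to one file descriptor fd. A 1-byte fd_set can then represent at most 8 fds (0 through 7).

After fd_set set; FD_ZERO(&set); the bits of set are 0000,0000.
If fd = 5, then after FD_SET(fd, &set); set becomes 0010,0000 (bit 5 is set to 1, counting from bit 0 on the right).
If fd = 2 and fd = 1 are then added, set becomes 0010,0110.
select(6, &set, 0, 0, 0) then blocks and waits.
If readable events occur on both fd = 1 and fd = 2, select returns and set becomes 0000,0110. Note that fd = 5, on which no event occurred, has been cleared.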
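To see this bit picture for a real fd_set without depending on its internal layout, one can render the first eight descriptors through FD_ISSET(). The helper fd_set_bits() below is made up for this illustration; it draws descriptor 0 as the rightmost bit, so descriptor 5 appears as 0010,0000.

```c
#include <sys/select.h>

/* Hypothetical helper: render descriptors 7..0 of `set` as "bbbb,bbbb",
 * with descriptor 0 as the rightmost bit. `out` must hold 10 bytes. */
void fd_set_bits(fd_set *set, char out[10])
{
    int pos = 0;

    for (int fd = 7; fd >= 0; fd--) {
        out[pos++] = FD_ISSET(fd, set) ? '1' : '0';
        if (fd == 4)
            out[pos++] = ',';   /* separator between the two nibbles */
    }
    out[pos] = '\0';
}
```

Calling it after FD_ZERO(), after FD_SET(5, &set), and again after adding descriptors 2 and 1 yields the strings 0000,0000, 0010,0000 and 0010,0110 in turn.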

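Based on the discussion above, the characteristics of the select model follow readily:

- The number of file descriptors that can be monitored depends on sizeof(fd_set). On my server sizeof(fd_set) = 512 and each bit represents one file descriptor, so that server supports at most 512*8 = 4096 file descriptors. Reportedly this is tunable, although others say the upper bound is fixed by a value set when the kernel is compiled. I am not much interested in resizing fd_set; model 2(1) in "Technical Series: Network Models (Part 2)" can effectively get past the limit on the number of descriptors select can monitor.
- When adding an fd to the select monitoring set, a separate data structure, array, is also needed to store the fds placed in the set. First, after select returns, array serves as the source data for the FD_ISSET checks against fd_set. Second, select clears any fd that was added earlier but had no event, so before every select call the fds must be taken from array and added again one by one (after FD_ZERO); the same scan of array also finds the largest fd, maxfd, which is used for the first argument of select.
- It follows that the select model must loop over array before select (to add the fds and find maxfd) and loop over array again after select returns (to check with FD_ISSET whether an event occurred).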

References

http://blog.csdn.net/tianmohust/article/details/6595998
http://publib.boulder.ibm.com/infocenter/aix/v7r1/index.jsp?topic=%2Fcom.ibm.aix.progcomm%2Fdoc%2Fprogcomc%2Fsocket_io_md.htm
http://stackoverflow.com/questions/1150635/unix-nonblocking-i-o-o-nonblock-vs-fionbio