Linux Asynchronous I/O
Introduction
Asynchronous I/O (AIO) is a method for performing I/O operations so that the process that issued an I/O request is not blocked until the operation completes. Instead, after an I/O request is submitted, the process continues to execute its code and can check the status of the submitted request later.
There are several ways to accomplish asynchronous I/O in Linux:
- kernel system calls
- a user-space library that uses those system calls internally (libaio)
- AIO emulated entirely in user space without any kernel support (currently librt, part of libc)
I/O Models
Mode | Blocking | Non-blocking |
---|---|---|
Synchronous | read/write | read/write (O_NONBLOCK) |
Asynchronous | I/O multiplexing (select/poll/epoll) | AIO |
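The synchronous non-blocking cell of the table can be illustrated with a short sketch. It assumes a descriptor such as a pipe or socket; the helper names set_nonblocking() and try_read() are made up for this illustration and are not part of any library.

```c
#include <assert.h>
#include <errno.h>
#include <fcntl.h>
#include <stddef.h>
#include <unistd.h>

/* Hypothetical helper: switch a descriptor to non-blocking mode.
 * Returns 0 on success, -1 on error (errno is set by fcntl). */
static int set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);
    if (flags < 0)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK) < 0 ? -1 : 0;
}

/* Hypothetical helper: a single read attempt that never sleeps on a
 * non-blocking descriptor. Returns the number of bytes read (0 at end
 * of file) or a negative errno value; -EAGAIN means "no data yet". */
static ssize_t try_read(int fd, void *buf, size_t len)
{
    ssize_t n;

    assert(buf != NULL);
    n = read(fd, buf, len);
    if (n < 0)
        return -errno;
    return n;
}
```

The call is still synchronous: when data is available, read() copies it before returning. The only difference from the blocking case is that an empty pipe yields -EAGAIN instead of putting the process to sleep.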
AIO System Calls
ABI Interface
AIO system call entry points are located in the fs/aio.c file in the kernel’s source code.
Types and constants exported to user space reside in the /usr/include/linux/aio_abi.h header file.
The Linux kernel provides only 5 system calls for performing asynchronous I/O.
#include <linux/aio_abi.h>
int io_setup(unsigned nr_events, aio_context_t *ctxp);
int io_destroy(aio_context_t ctx);
int io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp);
int io_cancel(aio_context_t ctx, struct iocb *, struct io_event *result);
int io_getevents(aio_context_t ctx, long min_nr, long nr,
                 struct io_event *events, struct timespec *timeout);
man 2 io_submit
Every I/O request that is submitted to an AIO context is represented by an I/O control block structure, struct iocb.
io_submit() takes the AIO context ID, the size of the array, and the array itself as arguments. Note that the array should contain pointers to the iocb structures, not the structures themselves.
io_submit()’s return code can be one of the following values:
- ret = (number of iocbs submitted)
Ideal case: all iocbs were accepted for processing.
- 0 < ret < (number of iocbs submitted)
io_submit() processes iocbs one by one, starting from the first entry in the passed array. If submission of some iocb fails, it stops at that point and returns the number of iocbs accepted so far, which is also the index of the iocb that failed. The return value alone does not reveal the exact reason for the failure. If the very first iocb fails, the ret < 0 case applies instead.
- ret < 0
There are two reasons why this could happen:
  - Some error happened even before io_submit() started to iterate over the iocbs in the array (e.g., the AIO context was invalid).
  - The submission of the very first iocb (cbs[0]) failed.
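The three return-code cases can be condensed into a small decision helper. This is only a sketch: the enum and the function classify_submit() are hypothetical names, and the helper just interprets a value that io_submit() has already returned.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical classification of an io_submit() return value. */
enum submit_outcome {
    SUBMIT_ALL,     /* ret == nr: every iocb was accepted          */
    SUBMIT_PARTIAL, /* 0 <= ret < nr: submission stopped early     */
    SUBMIT_ERROR    /* ret < 0: nothing was accepted               */
};

/* ret  - value returned by io_submit()
 * nr   - number of iocbs that were passed in
 * next - receives the index of the first iocb that was NOT accepted,
 *        i.e. where a retry would have to start */
static enum submit_outcome classify_submit(long ret, long nr, long *next)
{
    assert(nr > 0 && next != NULL);

    if (ret < 0) {
        *next = 0;              /* even the first iocb was rejected */
        return SUBMIT_ERROR;
    }
    *next = ret;                /* iocbs [0, ret) are already queued */
    return ret == nr ? SUBMIT_ALL : SUBMIT_PARTIAL;
}
```

A caller that gets SUBMIT_PARTIAL could call io_submit() again with the array advanced by next entries. The failing iocb is then the first one in the new call, so its failure would show up as a negative return value.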
After an iocb is submitted, we can perform any other actions without waiting for the I/O to complete. For every completed I/O request (successful or not), the kernel creates an io_event structure.
To obtain the list of io_events (and consequently all completed iocbs), the io_getevents() system call should be used.
When calling io_getevents(), one needs to specify:
- which AIO context to get events from (the ctx variable)
- a buffer where the kernel should store the events (the events variable)
- the minimum number of events one wants to get. If fewer than this number of iocbs have completed, io_getevents() blocks until enough events appear. The timeout in the last item controls how long it blocks.
- the maximum number of events one wants to get. This is usually the size of the events buffer (the second 1 in our program)
- a timeout. If not enough events are available, we don’t want to wait forever, so one can specify a relative timeout as the last argument. NULL means to wait indefinitely. If one wants io_getevents() not to block at all, the timespec structure needs to be initialized to zero seconds and zero nanoseconds.
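The three timeout behaviours (wait indefinitely, do not block, wait for a bounded time) map onto the last argument as sketched below. The helpers timeout_from_ms() and timeout_arg() are hypothetical, as is the convention that a negative millisecond count means "no timeout".

```c
#include <assert.h>
#include <stddef.h>
#include <time.h>

/* Hypothetical helper: convert a relative timeout in milliseconds
 * into the struct timespec form that io_getevents() expects. */
static struct timespec timeout_from_ms(long ms)
{
    struct timespec ts;

    assert(ms >= 0);
    ts.tv_sec = ms / 1000;
    ts.tv_nsec = (ms % 1000) * 1000000L;
    return ts;
}

/* Hypothetical helper: pick the pointer to pass as the last argument.
 *   ms <  0  -> NULL, wait indefinitely
 *   ms == 0  -> zeroed timespec, never block
 *   ms >  0  -> relative timeout of ms milliseconds
 * The caller provides the storage so the pointer stays valid. */
static struct timespec *timeout_arg(long ms, struct timespec *storage)
{
    assert(storage != NULL);
    if (ms < 0)
        return NULL;
    *storage = timeout_from_ms(ms);
    return storage;
}
```

With these helpers, a non-blocking poll for up to 8 events would pass timeout_arg(0, &ts) as the last argument, and a half-second wait would pass timeout_arg(500, &ts).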
The return code of io_getevents() can be:
- ret = (max number of events)
All events that fit in the user-provided buffer were obtained from the kernel. There might be more pending events in the kernel.
- (min number of events) <= ret <= (max number of events)
All currently available events were read from the kernel. The minimum was reached, so the call did not wait for the timeout to expire.
- 0 < ret < (min number of events)
All currently available events were read from the kernel, and the call blocked for the timeout the user specified without reaching the minimum.
- ret = 0
No events were available. With a nonzero minimum and a nonzero timeout, the call blocked until the timeout expired; with a minimum of zero or a zero timeout, it returned immediately.
- ret < 0
An error happened.
struct io_event
/* read() from /dev/aio returns these structures. */
struct io_event {
    __u64 data; /* the data field from the iocb */
    __u64 obj;  /* what iocb this event came from */
    __s64 res;  /* result code for this event */
    __s64 res2; /* secondary result */
};
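How a completed event might be decoded can be sketched as follows. The sketch assumes the usual kernel convention that res holds the number of bytes transferred on success and a negated errno value on failure. The helper names event_status() and event_iocb() are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <linux/aio_abi.h>

/* Hypothetical helper: split the res field of an event into
 * "errno value" and "bytes transferred".
 * Returns 0 and stores the byte count on success, otherwise returns
 * the positive errno value and stores 0. */
static int event_status(const struct io_event *ev, uint64_t *bytes)
{
    assert(ev != NULL && bytes != NULL);

    if (ev->res < 0) {
        *bytes = 0;
        return (int)-ev->res;
    }
    *bytes = (uint64_t)ev->res;
    return 0;
}

/* Hypothetical helper: obj carries the user-space address of the iocb
 * that produced this event, stored as a 64-bit integer. */
static struct iocb *event_iocb(const struct io_event *ev)
{
    assert(ev != NULL);
    return (struct iocb *)(uintptr_t)ev->obj;
}
```

The data field works the same way as obj, except that it echoes whatever the submitter stored in aio_data, which makes it a convenient place for a request tag or a pointer to per-request state.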
struct iocb
/*
 * we always use a 64bit off_t when communicating
 * with userland. its up to libraries to do the
 * proper padding and aio_error abstraction
 */
struct iocb {
    /* these are internal to the kernel/libc. */
    __u64 aio_data; /* data to be returned in event's data */
    __u32 PADDED(aio_key, aio_reserved1);
                    /* the kernel sets aio_key to the req # */

    /* common fields */
    __u16 aio_lio_opcode; /* see IOCB_CMD_ above */
    __s16 aio_reqprio;
    __u32 aio_fildes;

    __u64 aio_buf;
    __u64 aio_nbytes;
    __s64 aio_offset;

    /* extra parameters */
    __u64 aio_reserved2; /* TODO: use this for a (struct sigevent *) */

    /* flags for the "struct iocb" */
    __u32 aio_flags;

    /*
     * if the IOCB_FLAG_RESFD flag of "aio_flags" is set, this is an
     * eventfd to signal AIO readiness to
     */
    __u32 aio_resfd;
}; /* 64 bytes */
AIO Command
/* /usr/include/linux/aio_abi.h */
enum {
    IOCB_CMD_PREAD = 0,
    IOCB_CMD_PWRITE = 1,
    IOCB_CMD_FSYNC = 2,
    IOCB_CMD_FDSYNC = 3,
    /* These two are experimental.
     * IOCB_CMD_PREADX = 4,
     * IOCB_CMD_POLL = 5,
     */
    IOCB_CMD_NOOP = 6,
    IOCB_CMD_PREADV = 7,
    IOCB_CMD_PWRITEV = 8,
};
- IOCB_CMD_PREAD: positioned read; corresponds to the pread() system call.
- IOCB_CMD_PWRITE: positioned write; corresponds to the pwrite() system call.
- IOCB_CMD_FSYNC: sync the file’s data and metadata with disk; corresponds to the fsync() system call.
- IOCB_CMD_FDSYNC: sync the file’s data with disk, but write only the metadata needed to access the modified file data; corresponds to the fdatasync() system call.
- IOCB_CMD_PREADV: vectored positioned read, sometimes called “scattered input”; corresponds to the preadv() system call.
- IOCB_CMD_PWRITEV: vectored positioned write, sometimes called “gathered output”; corresponds to the pwritev() system call.
- IOCB_CMD_NOOP: defined in the header file, but not used anywhere else in the kernel.
The semantics of the other fields in the iocb structure depend on the command specified.
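As an illustration of how the command changes the meaning of the fields, the sketch below fills an iocb for a plain read and for a vectored read. For IOCB_CMD_PREAD, aio_buf is the buffer address and aio_nbytes its length in bytes. For IOCB_CMD_PREADV, aio_buf points to an array of struct iovec and aio_nbytes is the number of array entries. The helpers prep_pread() and prep_preadv() are hypothetical and only loosely modelled on the libaio helpers of similar name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>
#include <sys/uio.h>
#include <linux/aio_abi.h>

/* Hypothetical helper: plain positioned read.
 * aio_buf = address of the buffer, aio_nbytes = length in bytes. */
static void prep_pread(struct iocb *cb, int fd, void *buf,
                       size_t len, long long off)
{
    assert(cb != NULL);
    memset(cb, 0, sizeof(*cb));
    cb->aio_lio_opcode = IOCB_CMD_PREAD;
    cb->aio_fildes = (uint32_t)fd;
    cb->aio_buf = (uint64_t)(uintptr_t)buf;
    cb->aio_nbytes = len;
    cb->aio_offset = off;
}

/* Hypothetical helper: vectored positioned read.
 * aio_buf = address of the iovec array, aio_nbytes = number of
 * iovec entries (not bytes). */
static void prep_preadv(struct iocb *cb, int fd, const struct iovec *iov,
                        int iovcnt, long long off)
{
    assert(cb != NULL && iovcnt >= 0);
    memset(cb, 0, sizeof(*cb));
    cb->aio_lio_opcode = IOCB_CMD_PREADV;
    cb->aio_fildes = (uint32_t)fd;
    cb->aio_buf = (uint64_t)(uintptr_t)iov;
    cb->aio_nbytes = (uint64_t)iovcnt;
    cb->aio_offset = off;
}
```

The double cast through uintptr_t keeps the pointer-to-integer conversion well defined on both 32-bit and 64-bit builds, since aio_buf is always 64 bits wide.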
AIO Context
An AIO context is a set of data structures that the kernel maintains to perform AIO.
Every process can have multiple AIO contexts, so each AIO context in a process needs an identifier.
A pointer to the ctx variable is passed to io_setup() as the second argument, and the kernel fills this variable with a context identifier. Interestingly, aio_context_t is actually just an unsigned long, defined in the kernel (linux/aio_abi.h) like this:
typedef unsigned long aio_context_t;
The first argument of the io_setup() function is the maximum number of requests that can simultaneously reside in the context.
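The life cycle of a context can be sketched with two thin wrappers around the raw system calls. The names aio_context_create() and aio_context_destroy() are hypothetical. The sketch relies on one documented detail of io_setup(): the variable that receives the identifier must be zero beforehand, otherwise the call fails with EINVAL.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

/* Hypothetical wrapper: create a context that can hold up to
 * nr_events in-flight requests.
 * Returns 0 on success or a negative errno value. */
static int aio_context_create(unsigned nr_events, aio_context_t *ctx)
{
    assert(ctx != NULL);

    *ctx = 0;   /* io_setup() rejects a non-zero initial value */
    if (syscall(SYS_io_setup, nr_events, ctx) < 0)
        return -errno;
    return 0;
}

/* Hypothetical wrapper: release the context and everything the
 * kernel allocated for it.
 * Returns 0 on success or a negative errno value. */
static int aio_context_destroy(aio_context_t ctx)
{
    if (syscall(SYS_io_destroy, ctx) < 0)
        return -errno;
    return 0;
}
```

Returning a negative errno value rather than -1 is a design choice of this sketch; it lets the caller keep the error code without reading the global errno afterwards.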
syscall()
man syscall
#define _GNU_SOURCE /* See feature_test_macros(7) */
#include <unistd.h>
#include <sys/syscall.h> /* For SYS_xxx definitions */
long syscall(long number, ...);
syscall() is a small library function that invokes the system call whose assembly language interface has the specified number with the specified arguments. Employing syscall() is useful, for example, when invoking a system call that has no wrapper function in the C library.
syscall() saves CPU registers before making the system call, restores the registers upon return from the system call, and stores any error code returned by the system call in errno(3) if an error occurs.
Symbolic constants for system call numbers can be found in the header file <sys/syscall.h>.
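To make the calling convention concrete, here is a minimal sketch that invokes two well-known system calls by number. The function names raw_getpid() and raw_close() are made up for the illustration. SYS_getpid and SYS_close are the symbolic constants from <sys/syscall.h>.

```c
#define _GNU_SOURCE
#include <assert.h>
#include <errno.h>
#include <unistd.h>
#include <sys/syscall.h>

/* Invoke getpid by number instead of through the libc wrapper. */
static long raw_getpid(void)
{
    long pid = syscall(SYS_getpid);

    assert(pid > 0);    /* getpid cannot fail */
    return pid;
}

/* Invoke close by number. On failure syscall() returns -1 and leaves
 * the error code in errno, exactly like a regular libc wrapper.
 * Returns 0 on success or a negative errno value. */
static int raw_close(int fd)
{
    if (syscall(SYS_close, fd) < 0)
        return -errno;
    return 0;
}
```

Calling raw_close(-1) shows the error path: the kernel reports EBADF, syscall() stores it in errno and returns -1, and the sketch converts that into -EBADF.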
Example
#include <stdio.h>
#include <stdint.h>
#include <string.h>
#include <inttypes.h>
#include <time.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/syscall.h>
#include <linux/aio_abi.h>

/* glibc provides no wrappers for these system calls, so define thin ones.
 * "static inline" avoids link errors that a plain "inline" can cause in C99. */
static inline int io_setup(unsigned nr, aio_context_t *ctxp) {
    return syscall(__NR_io_setup, nr, ctxp);
}

static inline int io_destroy(aio_context_t ctx) {
    return syscall(__NR_io_destroy, ctx);
}

static inline int io_submit(aio_context_t ctx, long nr, struct iocb **iocbpp) {
    return syscall(__NR_io_submit, ctx, nr, iocbpp);
}

static inline int io_getevents(aio_context_t ctx, long min_nr, long max_nr,
                               struct io_event *events, struct timespec *timeout) {
    return syscall(__NR_io_getevents, ctx, min_nr, max_nr, events, timeout);
}

int main(int argc, char *argv[]) {
    aio_context_t ctx;
    struct iocb cb;
    struct iocb *cbs[1];
    char data[4096];
    struct io_event events[1];
    int ret;
    int fd;

    /* O_CREAT requires the file mode as a third argument */
    fd = open("/tmp/test", O_RDWR | O_CREAT, 0644);
    if (fd < 0) {
        perror("open");
        return -1;
    }

    ctx = 0; /* must be zero before io_setup() */
    ret = io_setup(128, &ctx);
    if (ret < 0) {
        perror("io_setup");
        return -1;
    }

    /* setup I/O control block */
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_lio_opcode = IOCB_CMD_PWRITE;

    /* command-specific options */
    memset(data, 'A', sizeof(data));
    cb.aio_buf = (uint64_t)(uintptr_t)data; /* pointer stored as a 64-bit integer */
    cb.aio_offset = 0;
    cb.aio_nbytes = sizeof(data);

    cbs[0] = &cb;
    ret = io_submit(ctx, 1, cbs);
    if (ret != 1) {
        if (ret < 0) perror("io_submit");
        else fprintf(stderr, "io_submit failed\n");
        return -1;
    }

    /* get reply: wait for at least 1 and at most 1 event, no timeout */
    ret = io_getevents(ctx, 1, 1, events, NULL);
    if (ret < 0) {
        perror("io_getevents");
        return -1;
    }
    printf("events: %d\n", ret);

    ret = io_destroy(ctx);
    if (ret < 0) {
        perror("io_destroy");
        return -1;
    }
    close(fd);
    return 0;
}
System Tuning
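/proc/sys/fs/aio-max-nr: the system-wide limit on the total number of events that all active AIO contexts may request through io_setup()
/proc/sys/fs/aio-nr: the running total of events requested through io_setup() by all currently active AIO contexts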
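Both files contain a single decimal number, so they can be read with a few lines of C. The sketch below assumes the usual meaning of the two files (aio-max-nr is the system-wide ceiling, aio-nr the amount currently in use). The helpers read_proc_ulong() and aio_events_available() are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/* Hypothetical helper: read one unsigned decimal number from a file.
 * Returns 0 on success, -1 if the file is missing or malformed. */
static int read_proc_ulong(const char *path, unsigned long *value)
{
    FILE *f;
    int ok;

    assert(path != NULL && value != NULL);
    f = fopen(path, "r");
    if (f == NULL)
        return -1;
    ok = fscanf(f, "%lu", value) == 1;
    fclose(f);
    return ok ? 0 : -1;
}

/* Hypothetical helper: how many more events could still be requested
 * system-wide before io_setup() starts failing.
 * Returns 0 on success, -1 if either file could not be read. */
static int aio_events_available(unsigned long *avail)
{
    unsigned long max_nr, nr;

    assert(avail != NULL);
    if (read_proc_ulong("/proc/sys/fs/aio-max-nr", &max_nr) != 0 ||
        read_proc_ulong("/proc/sys/fs/aio-nr", &nr) != 0)
        return -1;
    *avail = max_nr > nr ? max_nr - nr : 0;
    return 0;
}
```

The result is only a snapshot, since other processes may create or destroy contexts at any time. Raising the limit is a matter of writing a larger number to aio-max-nr as root.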
libaio
Install
[oxnz@localhost aio]$ sudo yum install libaio-devel
[oxnz@localhost aio]$ rpm -ql libaio
/lib64/libaio.so.1
/lib64/libaio.so.1.0.0
/lib64/libaio.so.1.0.1
/usr/share/doc/libaio-0.3.109
/usr/share/doc/libaio-0.3.109/COPYING
/usr/share/doc/libaio-0.3.109/TODO
[oxnz@localhost aio]$ rpm -ql libaio-devel
/usr/include/libaio.h
/usr/lib64/libaio.so
Syscall Wrappers
/* /usr/include/libaio.h */
/* Actual syscalls */
extern int io_setup(int maxevents, io_context_t *ctxp);
extern int io_destroy(io_context_t ctx);
extern int io_submit(io_context_t ctx, long nr, struct iocb *ios[]);
extern int io_cancel(io_context_t ctx, struct iocb *iocb, struct io_event *evt);
extern int io_getevents(io_context_t ctx_id, long min_nr, long nr, struct io_event *events, struct timespec *timeout);
Helper Functions
static inline void io_prep_pread(struct iocb *iocb, int fd, void *buf, size_t count, long long offset)
static inline void io_prep_pwrite(struct iocb *iocb, int fd, void *buf, size_t count, long long offset)
static inline void io_prep_preadv(struct iocb *iocb, int fd, const struct iovec *iov, int iovcnt, long long offset)
static inline void io_prep_pwritev(struct iocb *iocb, int fd, const struct iovec *iov, int iovcnt, long long offset)
static inline void io_prep_poll(struct iocb *iocb, int fd, int events)
static inline void io_prep_fsync(struct iocb *iocb, int fd)
static inline void io_prep_fdsync(struct iocb *iocb, int fd)
static inline int io_poll(io_context_t ctx, struct iocb *iocb, io_callback_t cb, int fd, int events)
static inline int io_fsync(io_context_t ctx, struct iocb *iocb, io_callback_t cb, int fd)
static inline int io_fdsync(io_context_t ctx, struct iocb *iocb, io_callback_t cb, int fd)
static inline void io_set_eventfd(struct iocb *iocb, int eventfd);
struct iocb
struct io_iocb_poll {
    PADDED(int events, __pad1);
}; /* result code is the set of result flags or -'ve errno */

struct io_iocb_sockaddr {
    struct sockaddr *addr;
    int len;
}; /* result code is the length of the sockaddr, or -'ve errno */

struct io_iocb_common {
    PADDEDptr(void *buf, __pad1);
    PADDEDul(nbytes, __pad2);
    long long offset;
    long long __pad3;
    unsigned flags;
    unsigned resfd;
}; /* result code is the amount read or -'ve errno */

struct io_iocb_vector {
    const struct iovec *vec;
    int nr;
    long long offset;
}; /* result code is the amount read or -'ve errno */

struct iocb {
    PADDEDptr(void *data, __pad1); /* Return in the io completion event */
    PADDED(unsigned key, __pad2);  /* For use in identifying io requests */
    short aio_lio_opcode;
    short aio_reqprio;
    int aio_fildes;
    union {
        struct io_iocb_common c;
        struct io_iocb_vector v;
        struct io_iocb_poll poll;
        struct io_iocb_sockaddr saddr;
    } u;
};

struct io_event {
    PADDEDptr(void *data, __pad1);
    PADDEDptr(struct iocb *obj, __pad2);
    PADDEDul(res, __pad3);
    PADDEDul(res2, __pad4);
};
Example
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <err.h>
#include <errno.h>
#include <time.h>
#include <unistd.h>
#include <fcntl.h>
#include <libaio.h>

int main(void) {
    io_context_t ctx;
    struct iocb iocb;
    struct iocb *iocbs[1];
    struct io_event events[1];
    struct timespec timeout;
    int fd;
    int ret;

    /* O_CREAT requires the file mode as a third argument */
    fd = open("/tmp/test", O_WRONLY | O_CREAT, 0644);
    if (fd < 0) err(1, "open");

    memset(&ctx, 0, sizeof(ctx));
    /* libaio wrappers return a negative errno value instead of setting errno */
    ret = io_setup(10, &ctx);
    if (ret != 0) errx(1, "io_setup: %s", strerror(-ret));

    const char *msg = "hello";
    io_prep_pwrite(&iocb, fd, (void *)msg, strlen(msg), 0);
    iocb.data = (void *)msg;

    iocbs[0] = &iocb;
    ret = io_submit(ctx, 1, iocbs);
    if (ret != 1) {
        io_destroy(ctx);
        if (ret < 0) errx(1, "io_submit: %s", strerror(-ret));
        errx(1, "io_submit: request was not accepted");
    }

    while (1) {
        timeout.tv_sec = 0;
        timeout.tv_nsec = 500000000;
        ret = io_getevents(ctx, 0, 1, events, &timeout);
        if (ret == 1) {
            close(fd);
            break;
        }
        if (ret < 0) errx(1, "io_getevents: %s", strerror(-ret));
        printf("not done yet\n");
        sleep(1);
    }
    io_destroy(ctx);
    return 0;
}
Compile:
cc libaio.c -o libaio -laio
POSIX asynchronous I/O
Library
/lib64/librt.so
/usr/include/aio.h
Interfaces
The POSIX AIO interface consists of the following functions:
- aio_read(3) Enqueue a read request. This is the asynchronous analog of read(2).
- aio_write(3) Enqueue a write request. This is the asynchronous analog of write(2).
- aio_fsync(3) Enqueue a sync request for the I/O operations on a file descriptor. This is the asynchronous analog of fsync(2) and fdatasync(2).
- aio_error(3) Obtain the error status of an enqueued I/O request.
- aio_return(3) Obtain the return status of a completed I/O request.
- aio_suspend(3) Suspend the caller until one or more of a specified set of I/O requests completes.
- aio_cancel(3) Attempt to cancel outstanding I/O requests on a specified file descriptor.
- lio_listio(3) Enqueue multiple I/O requests using a single function call.
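A minimal sketch of how these functions fit together: queue a write with aio_write(), sleep in aio_suspend() until aio_error() no longer reports EINPROGRESS, then collect the result with aio_return(). The helper aio_write_and_wait(), the demo function aio_roundtrip_demo(), and the scratch file name are all hypothetical. On older glibc versions the program may need to be linked with -lrt.

```c
#define _POSIX_C_SOURCE 200809L
#include <aio.h>
#include <assert.h>
#include <errno.h>
#include <signal.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

/* Hypothetical helper: write len bytes at offset off and wait for the
 * request to finish.
 * Returns the number of bytes written or a negative errno value. */
static ssize_t aio_write_and_wait(int fd, const void *buf, size_t len, off_t off)
{
    struct aiocb cb;
    const struct aiocb *const list[1] = { &cb };
    int err;
    ssize_t n;

    assert(buf != NULL);
    memset(&cb, 0, sizeof(cb));
    cb.aio_fildes = fd;
    cb.aio_buf = (void *)buf;   /* aio_buf is not const-qualified */
    cb.aio_nbytes = len;
    cb.aio_offset = off;
    cb.aio_sigevent.sigev_notify = SIGEV_NONE;  /* no completion signal */

    if (aio_write(&cb) < 0)
        return -errno;

    /* cb lives in this stack frame, so stay here while it is in flight.
     * aio_suspend() may return early (e.g. EINTR); the loop just retries. */
    while ((err = aio_error(&cb)) == EINPROGRESS)
        (void)aio_suspend(list, 1, NULL);

    n = aio_return(&cb);        /* call exactly once per request */
    return err != 0 ? -err : n;
}

/* Usage sketch: write to a scratch file and read the bytes back.
 * Returns 0 if the round trip worked, -1 otherwise. */
static int aio_roundtrip_demo(void)
{
    char path[] = "/tmp/aio_sketch_XXXXXX";
    char back[5];
    int fd = mkstemp(path);
    int ok;

    if (fd < 0)
        return -1;
    ok = aio_write_and_wait(fd, "hello", 5, 0) == 5 &&
         pread(fd, back, sizeof(back), 0) == 5 &&
         memcmp(back, "hello", 5) == 0;
    close(fd);
    unlink(path);
    return ok ? 0 : -1;
}
```

Waiting immediately after submitting defeats the purpose of asynchronous I/O, of course. The sketch does so only to keep the control block on the stack; a real program would keep the aiocb alive elsewhere and do other work between aio_write() and aio_suspend().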
man 7 aio
The current Linux POSIX AIO implementation is provided in user space by glibc. This has a number of limitations, most notably that maintaining multiple threads to perform I/O operations is expensive and scales poorly. Work has been in progress for some time on a kernel state-machine-based implementation of asynchronous I/O (see io_submit(2), io_setup(2), io_cancel(2), io_destroy(2), io_getevents(2)), but this implementation hasn’t yet matured to the point where the POSIX AIO implementation can be completely reimplemented using the kernel system calls.