| .\" Copyright (C) 2003 Davide Libenzi |
| .\" |
| .\" %%%LICENSE_START(GPLv2+_SW_3_PARA) |
| .\" This program is free software; you can redistribute it and/or modify |
| .\" it under the terms of the GNU General Public License as published by |
| .\" the Free Software Foundation; either version 2 of the License, or |
| .\" (at your option) any later version. |
| .\" |
| .\" This program is distributed in the hope that it will be useful, |
| .\" but WITHOUT ANY WARRANTY; without even the implied warranty of |
| .\" MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the |
| .\" GNU General Public License for more details. |
| .\" |
| .\" You should have received a copy of the GNU General Public |
| .\" License along with this manual; if not, see |
| .\" <http://www.gnu.org/licenses/>. |
| .\" %%%LICENSE_END |
| .\" |
| .\" Davide Libenzi <davidel@xmailserver.org> |
| .\" |
| .TH EPOLL 7 2021-03-22 "Linux" "Linux Programmer's Manual" |
| .SH NAME |
| epoll \- I/O event notification facility |
| .SH SYNOPSIS |
| .nf |
| .B #include <sys/epoll.h> |
| .fi |
| .SH DESCRIPTION |
| The |
| .B epoll |
| API performs a similar task to |
| .BR poll (2): |
| monitoring multiple file descriptors to see if I/O is possible on any of them. |
| The |
| .B epoll |
| API can be used either as an edge-triggered or a level-triggered |
| interface and scales well to large numbers of watched file descriptors. |
| .PP |
| The central concept of the |
| .B epoll |
| API is the |
| .B epoll |
| .IR instance , |
| an in-kernel data structure which, from a user-space perspective, |
| can be considered as a container for two lists: |
| .IP \(bu 2 |
| The |
| .I interest |
| list (sometimes also called the |
| .B epoll |
| set): the set of file descriptors that the process has registered |
| an interest in monitoring. |
| .IP \(bu |
| The |
| .I ready |
| list: the set of file descriptors that are "ready" for I/O. |
| The ready list is a subset of |
| (or, more precisely, a set of references to) |
| the file descriptors in the interest list. |
| The ready list is dynamically populated |
| by the kernel as a result of I/O activity on those file descriptors. |
| .PP |
| The following system calls are provided to |
| create and manage an |
| .B epoll |
| instance: |
| .IP \(bu 2 |
| .BR epoll_create (2) |
| creates a new |
| .B epoll |
| instance and returns a file descriptor referring to that instance. |
| (The more recent |
| .BR epoll_create1 (2) |
| extends the functionality of |
| .BR epoll_create (2).) |
| .IP \(bu |
| Interest in particular file descriptors is then registered via |
| .BR epoll_ctl (2), |
| which adds items to the interest list of the |
| .B epoll |
| instance. |
| .IP \(bu |
| .BR epoll_wait (2) |
| waits for I/O events, |
| blocking the calling thread if no events are currently available. |
| (This system call can be thought of as fetching items from |
| the ready list of the |
| .B epoll |
| instance.) |
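| .PP |
| The three calls can be combined as in the following minimal sketch. |
| The helper name |
| .IR wait_for_input () |
| is hypothetical and not part of the |
| .B epoll |
| API; it watches a single descriptor and discards the instance afterward, |
| whereas a real program would normally keep the instance open across many waits. |

```c
#include <stdio.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: report how many events epoll_wait() returns for
   a single descriptor within timeout_ms, or -1 on error. */
int
wait_for_input(int fd, int timeout_ms)
{
    struct epoll_event ev, ready[1];
    int epfd, nready;

    epfd = epoll_create1(0);            /* create the epoll instance */
    if (epfd == -1) {
        perror("epoll_create1");
        return -1;
    }

    ev.events = EPOLLIN;                /* level-triggered by default */
    ev.data.fd = fd;
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, fd, &ev) == -1) {
        perror("epoll_ctl");            /* could not add to interest list */
        close(epfd);
        return -1;
    }

    /* Fetch at most one item from the ready list. */
    nready = epoll_wait(epfd, ready, 1, timeout_ms);
    close(epfd);
    return nready;
}
```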
| .\" |
| .SS Level-triggered and edge-triggered |
| The |
| .B epoll |
| event distribution interface is able to behave both as edge-triggered |
| (ET) and as level-triggered (LT). |
| The difference between the two mechanisms |
| can be described as follows. |
| Suppose that |
| this scenario happens: |
| .IP 1. 3 |
| The file descriptor that represents the read side of a pipe |
| .RI ( rfd ) |
| is registered on the |
| .B epoll |
| instance. |
| .IP 2. |
| A pipe writer writes 2\ kB of data on the write side of the pipe. |
| .IP 3. |
| A call to |
| .BR epoll_wait (2) |
| is done that will return |
| .I rfd |
| as a ready file descriptor. |
| .IP 4. |
| The pipe reader reads 1\ kB of data from |
| .IR rfd . |
| .IP 5. |
| A call to |
| .BR epoll_wait (2) |
| is done. |
| .PP |
| If the |
| .I rfd |
| file descriptor has been added to the |
| .B epoll |
| interface using the |
| .B EPOLLET |
| (edge-triggered) |
| flag, the call to |
| .BR epoll_wait (2) |
| done in step |
| .B 5 |
| will probably hang even though data is still available in the file |
| input buffer; |
| meanwhile the remote peer might be expecting a response based on the |
| data it already sent. |
| The reason for this is that edge-triggered mode |
| delivers events only when changes occur on the monitored file descriptor. |
| So, in step |
| .B 5 |
| the caller might end up waiting for some data that is already present inside |
| the input buffer. |
| In the above example, an event on |
| .I rfd |
| will be generated because of the write done in |
| .B 2 |
| and the event is consumed in |
| .BR 3 . |
| Since the read operation done in |
| .B 4 |
| does not consume the whole buffer data, the call to |
| .BR epoll_wait (2) |
| done in step |
| .B 5 |
| might block indefinitely. |
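| .PP |
| The scenario can be replayed with a short sketch such as the following, |
| which assumes a pipe as the monitored file. |
| The helper |
| .IR second_wait_result () |
| is hypothetical; it uses a zero timeout in step 5 so that the |
| edge-triggered case reports "nothing ready" instead of blocking. |

```c
#include <stdint.h>
#include <string.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: replay steps 1-5 and return what the second
   epoll_wait() reports (number of ready descriptors), or -1 on error.
   Pass 0 for level-triggered or EPOLLET for edge-triggered. */
int
second_wait_result(uint32_t trigger)
{
    char buf[2048];
    int pfd[2], epfd, n = -1;
    struct epoll_event ev, out;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto done;

    ev.events = EPOLLIN | trigger;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)       /* step 1 */
        goto done;

    memset(buf, 'x', sizeof(buf));
    if (write(pfd[1], buf, sizeof(buf)) != (ssize_t) sizeof(buf)) /* step 2 */
        goto done;
    if (epoll_wait(epfd, &out, 1, 0) != 1)                       /* step 3 */
        goto done;
    if (read(pfd[0], buf, 1024) != 1024)                         /* step 4 */
        goto done;

    /* Step 5, with a zero timeout so the demonstration cannot hang. */
    n = epoll_wait(epfd, &out, 1, 0);

done:
    if (epfd != -1)
        close(epfd);
    close(pfd[0]);
    close(pfd[1]);
    return n;
}
```

| Under these assumptions, passing 0 (level-triggered) is expected to |
| yield 1, because 1\ kB is still unread, while passing |
| .B EPOLLET |
| is expected to yield 0, because no new change has occurred since step 3. |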
| .PP |
| An application that employs the |
| .B EPOLLET |
| flag should use nonblocking file descriptors to avoid having a blocking |
| read or write starve a task that is handling multiple file descriptors. |
| The suggested way to use |
| .B epoll |
| as an edge-triggered |
| .RB ( EPOLLET ) |
| interface is as follows: |
| .IP a) 3 |
| with nonblocking file descriptors; and |
| .IP b) |
| by waiting for an event only after |
| .BR read (2) |
| or |
| .BR write (2) |
| return |
| .BR EAGAIN . |
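| .PP |
| A minimal sketch of points a) and b) for the read side might look like this. |
| Both helper names, |
| .IR set_nonblocking () |
| and |
| .IR drain_fd (), |
| are hypothetical, and the data read is simply discarded here. |

```c
#include <errno.h>
#include <fcntl.h>
#include <unistd.h>

/* Hypothetical helper: put fd into nonblocking mode. Returns 0 or -1. */
int
set_nonblocking(int fd)
{
    int flags = fcntl(fd, F_GETFL);

    if (flags == -1)
        return -1;
    return fcntl(fd, F_SETFL, flags | O_NONBLOCK);
}

/* Hypothetical helper: read from a nonblocking fd until read(2) fails
   with EAGAIN, the point at which it is safe to call epoll_wait(2)
   again. End-of-file also stops the loop. Returns the number of bytes
   consumed, or -1 on any other error. */
long
drain_fd(int fd)
{
    char buf[4096];
    long total = 0;

    for (;;) {
        ssize_t n = read(fd, buf, sizeof(buf));

        if (n > 0) {
            total += n;         /* a real program would process buf here */
        } else if (n == 0) {
            return total;       /* end of file */
        } else if (errno == EINTR) {
            continue;           /* interrupted by a signal: retry */
        } else if (errno == EAGAIN || errno == EWOULDBLOCK) {
            return total;       /* drained: wait for the next event */
        } else {
            return -1;
        }
    }
}
```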
| .PP |
| By contrast, when used as a level-triggered interface |
| (the default, when |
| .B EPOLLET |
| is not specified), |
| .B epoll |
| is simply a faster |
| .BR poll (2), |
| and can be used wherever the latter is used since it shares the |
| same semantics. |
| .PP |
| Since even with edge-triggered |
| .BR epoll , |
| multiple events can be generated upon receipt of multiple chunks of data, |
| the caller has the option to specify the |
| .B EPOLLONESHOT |
| flag, to tell |
| .B epoll |
| to disable the associated file descriptor after the receipt of an event with |
| .BR epoll_wait (2). |
| When the |
| .B EPOLLONESHOT |
| flag is specified, |
| it is the caller's responsibility to rearm the file descriptor using |
| .BR epoll_ctl (2) |
| with |
| .BR EPOLL_CTL_MOD . |
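| .PP |
| Rearming could be sketched as follows. |
| The helper name |
| .IR rearm_oneshot () |
| is hypothetical; it assumes the caller stores the file descriptor in |
| .I data.fd |
| and wants |
| .B EPOLLONESHOT |
| to stay in effect for the next event as well. |

```c
#include <stdint.h>
#include <sys/epoll.h>

/* Hypothetical helper: re-enable a descriptor that was registered with
   EPOLLONESHOT and has already delivered its one event.
   Returns 0 on success, or -1 with errno set by epoll_ctl(2). */
int
rearm_oneshot(int epfd, int fd, uint32_t events)
{
    struct epoll_event ev;

    ev.events = events | EPOLLONESHOT;  /* keep one-shot behavior */
    ev.data.fd = fd;
    return epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev);
}
```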
| .PP |
| If multiple threads |
| (or processes, if child processes have inherited the |
| .B epoll |
| file descriptor across |
| .BR fork (2)) |
| are blocked in |
| .BR epoll_wait (2) |
| waiting on the same epoll file descriptor and a file descriptor |
| in the interest list that is marked for edge-triggered |
| .RB ( EPOLLET ) |
| notification becomes ready, |
| just one of the threads (or processes) is awoken from |
| .BR epoll_wait (2). |
| This provides a useful optimization for avoiding "thundering herd" wake-ups |
| in some scenarios. |
| .\" |
| .SS Interaction with autosleep |
| If the system is in |
| .B autosleep |
| mode via |
| .I /sys/power/autosleep |
| and an event happens which wakes the device from sleep, the device |
| driver will keep the device awake only until that event is queued. |
| To keep the device awake until the event has been processed, |
| it is necessary to use the |
| .BR epoll_ctl (2) |
| .B EPOLLWAKEUP |
| flag. |
| .PP |
| When the |
| .B EPOLLWAKEUP |
| flag is set in the |
| .B events |
| field for a |
| .IR "struct epoll_event" , |
| the system will be kept awake from the moment the event is queued, |
| through the |
| .BR epoll_wait (2) |
| call which returns the event until the subsequent |
| .BR epoll_wait (2) |
| call. |
| If the event should keep the system awake beyond that time, |
| then a separate |
| .I wake_lock |
| should be taken before the second |
| .BR epoll_wait (2) |
| call. |
| .SS /proc interfaces |
| The following interfaces can be used to limit the amount of |
| kernel memory consumed by epoll: |
| .\" Following was added in 2.6.28, but then removed in 2.6.29 |
| .\" .TP |
| .\" .IR /proc/sys/fs/epoll/max_user_instances " (since Linux 2.6.28)" |
| .\" This specifies an upper limit on the number of epoll instances |
| .\" that can be created per real user ID. |
| .TP |
| .IR /proc/sys/fs/epoll/max_user_watches " (since Linux 2.6.28)" |
| This specifies a limit on the total number of |
| file descriptors that a user can register across |
| all epoll instances on the system. |
| The limit is per real user ID. |
| Each registered file descriptor costs roughly 90 bytes on a 32-bit kernel, |
| and roughly 160 bytes on a 64-bit kernel. |
| Currently, |
| .\" 2.6.29 (in 2.6.28, the default was 1/32 of lowmem) |
| the default value for |
| .I max_user_watches |
| is 1/25 (4%) of the available low memory, |
| divided by the registration cost in bytes. |
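| .IP |
| As a rough illustration of that arithmetic only (the kernel's own |
| calculation works in pages and may differ slightly), the hypothetical function |
| .IR estimate_max_user_watches () |
| below applies the 1/25 rule to figures supplied by the caller. |
| For example, 1\ GiB of low memory at 160 bytes per registration |
| gives roughly 268,000 watches. |

```c
/* Rough estimate following the description above: 4% (1/25) of low
   memory, divided by the per-registration cost. Both arguments are in
   bytes. This is an illustration, not the kernel's exact computation. */
unsigned long long
estimate_max_user_watches(unsigned long long lowmem_bytes,
                          unsigned long long cost_bytes)
{
    return lowmem_bytes / 25 / cost_bytes;
}
```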
| .SS Example for suggested usage |
| While the usage of |
| .B epoll |
| when employed as a level-triggered interface does have the same |
| semantics as |
| .BR poll (2), |
| the edge-triggered usage requires more clarification to avoid stalls |
| in the application event loop. |
| In this example, listener is a |
| nonblocking socket on which |
| .BR listen (2) |
| has been called. |
| The function |
| .I do_use_fd() |
| uses the new ready file descriptor until |
| .B EAGAIN |
| is returned by either |
| .BR read (2) |
| or |
| .BR write (2). |
| An event-driven state machine application should, after having received |
| .BR EAGAIN , |
| record its current state so that at the next call to |
| .I do_use_fd() |
| it will continue to |
| .BR read (2) |
| or |
| .BR write (2) |
| from where it stopped before. |
| .PP |
| .in +4n |
| .EX |
| #define MAX_EVENTS 10 |
| struct epoll_event ev, events[MAX_EVENTS]; |
| int listen_sock, conn_sock, nfds, epollfd, n; |
| struct sockaddr_storage addr; |
| socklen_t addrlen; |
| |
| /* Code to set up listening socket, \(aqlisten_sock\(aq, |
|    (socket(), bind(), listen()) omitted. */ |
| |
| epollfd = epoll_create1(0); |
| if (epollfd == \-1) { |
|     perror("epoll_create1"); |
|     exit(EXIT_FAILURE); |
| } |
| |
| ev.events = EPOLLIN; |
| ev.data.fd = listen_sock; |
| if (epoll_ctl(epollfd, EPOLL_CTL_ADD, listen_sock, &ev) == \-1) { |
|     perror("epoll_ctl: listen_sock"); |
|     exit(EXIT_FAILURE); |
| } |
| |
| for (;;) { |
|     nfds = epoll_wait(epollfd, events, MAX_EVENTS, \-1); |
|     if (nfds == \-1) { |
|         perror("epoll_wait"); |
|         exit(EXIT_FAILURE); |
|     } |
| |
|     for (n = 0; n < nfds; ++n) { |
|         if (events[n].data.fd == listen_sock) { |
|             addrlen = sizeof(addr);   /* reset before each accept() */ |
|             conn_sock = accept(listen_sock, |
|                                (struct sockaddr *) &addr, &addrlen); |
|             if (conn_sock == \-1) { |
|                 perror("accept"); |
|                 exit(EXIT_FAILURE); |
|             } |
|             setnonblocking(conn_sock); |
|             ev.events = EPOLLIN | EPOLLET; |
|             ev.data.fd = conn_sock; |
|             if (epoll_ctl(epollfd, EPOLL_CTL_ADD, conn_sock, |
|                           &ev) == \-1) { |
|                 perror("epoll_ctl: conn_sock"); |
|                 exit(EXIT_FAILURE); |
|             } |
|         } else { |
|             do_use_fd(events[n].data.fd); |
|         } |
|     } |
| } |
| .EE |
| .in |
| .PP |
| When used as an edge-triggered interface, for performance reasons, it is |
| possible to add the file descriptor to the |
| .B epoll |
| instance |
| .RB ( EPOLL_CTL_ADD ) |
| once, specifying |
| .RB ( EPOLLIN | EPOLLOUT ). |
| This avoids having to switch continuously between |
| .B EPOLLIN |
| and |
| .B EPOLLOUT |
| by calling |
| .BR epoll_ctl (2) |
| with |
| .BR EPOLL_CTL_MOD . |
| .SS Questions and answers |
| .IP 0. 4 |
| What is the key used to distinguish the file descriptors registered in an |
| interest list? |
| .IP |
| The key is the combination of the file descriptor number and |
| the open file description |
| (also known as an "open file handle", |
| the kernel's internal representation of an open file). |
| .IP 1. |
| What happens if you register the same file descriptor on an |
| .B epoll |
| instance twice? |
| .IP |
| You will probably get |
| .BR EEXIST . |
| However, it is possible to add a duplicate |
| .RB ( dup (2), |
| .BR dup2 (2), |
| .BR fcntl (2) |
| .BR F_DUPFD ) |
| file descriptor to the same |
| .B epoll |
| instance. |
| .\" But a file descriptor duplicated by fork(2) can't be added to the |
| .\" set, because the [file *, fd] pair is already in the epoll set. |
| .\" That is a somewhat ugly inconsistency. On the one hand, a child process |
| .\" cannot add the duplicate file descriptor to the epoll set. (In every |
| .\" other case that I can think of, file descriptors duplicated by fork have |
| .\" similar semantics to file descriptors duplicated by dup() and friends.) On |
| .\" the other hand, the very fact that the child has a duplicate of the |
| .\" file descriptor means that even if the parent closes its file descriptor, |
| .\" then epoll_wait() in the parent will continue to receive notifications for |
| .\" that file descriptor because of the duplicated file descriptor in the child. |
| .\" |
| .\" See http://thread.gmane.org/gmane.linux.kernel/596462/ |
| .\" "epoll design problems with common fork/exec patterns" |
| .\" |
| .\" mtk, Feb 2008 |
| This can be a useful technique for filtering events, |
| if the duplicate file descriptors are registered with different |
| .I events |
| masks. |
| .IP 2. |
| Can two |
| .B epoll |
| instances wait for the same file descriptor? |
| If so, are events reported to both |
| .B epoll |
| file descriptors? |
| .IP |
| Yes, and events would be reported to both. |
| However, careful programming may be needed to do this correctly. |
| .IP 3. |
| Is the |
| .B epoll |
| file descriptor itself poll/epoll/selectable? |
| .IP |
| Yes. |
| If an |
| .B epoll |
| file descriptor has events waiting, then it is |
| reported as readable. |
| .IP 4. |
| What happens if one attempts to put an |
| .B epoll |
| file descriptor into its own file descriptor set? |
| .IP |
| The |
| .BR epoll_ctl (2) |
| call fails |
| .RB ( EINVAL ). |
| However, you can add an |
| .B epoll |
| file descriptor inside another |
| .B epoll |
| file descriptor set. |
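| .IP |
| Both cases can be checked with a small sketch like the following. |
| The helper name |
| .IR check_nesting () |
| is hypothetical. |

```c
#include <errno.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical helper: returns 1 if adding an epoll descriptor to
   itself fails with EINVAL while adding it to a different epoll
   instance succeeds, 0 if either check turns out otherwise, and -1 if
   the instances could not be created. */
int
check_nesting(void)
{
    struct epoll_event ev;
    int outer, inner, self_rejected, nested_ok;

    outer = epoll_create1(0);
    inner = epoll_create1(0);
    if (outer == -1 || inner == -1) {
        if (outer != -1)
            close(outer);
        if (inner != -1)
            close(inner);
        return -1;
    }

    ev.events = EPOLLIN;
    ev.data.fd = outer;
    self_rejected = (epoll_ctl(outer, EPOLL_CTL_ADD, outer, &ev) == -1
                     && errno == EINVAL);

    ev.data.fd = inner;
    nested_ok = (epoll_ctl(outer, EPOLL_CTL_ADD, inner, &ev) == 0);

    close(inner);
    close(outer);
    return self_rejected && nested_ok;
}
```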
| .IP 5. |
| Can I send an |
| .B epoll |
| file descriptor over a UNIX domain socket to another process? |
| .IP |
| Yes, but it does not make sense to do this, since the receiving process |
| would not have copies of the file descriptors in the interest list. |
| .IP 6. |
| Will closing a file descriptor cause it to be removed from all |
| .B epoll |
| interest lists? |
| .IP |
| Yes, but be aware of the following point. |
| A file descriptor is a reference to an open file description (see |
| .BR open (2)). |
| Whenever a file descriptor is duplicated via |
| .BR dup (2), |
| .BR dup2 (2), |
| .BR fcntl (2) |
| .BR F_DUPFD , |
| or |
| .BR fork (2), |
| a new file descriptor referring to the same open file description is |
| created. |
| An open file description continues to exist until all |
| file descriptors referring to it have been closed. |
| .IP |
| A file descriptor is removed from an |
| interest list only after all the file descriptors referring to the underlying |
| open file description have been closed. |
| This means that even after a file descriptor that is part of an |
| interest list has been closed, |
| events may be reported for that file descriptor if other file |
| descriptors referring to the same underlying file description remain open. |
| To prevent this happening, |
| the file descriptor must be explicitly removed from the interest list (using |
| .BR epoll_ctl (2) |
| .BR EPOLL_CTL_DEL ) |
| before it is duplicated. |
| Alternatively, |
| the application must ensure that all file descriptors are closed |
| (which may be difficult if file descriptors were duplicated |
| behind the scenes by library functions that used |
| .BR dup (2) |
| or |
| .BR fork (2)). |
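| .IP |
| The effect can be observed with a sketch such as the following, |
| which assumes a pipe as the monitored file. |
| The helper name |
| .IR events_after_close () |
| is hypothetical. |
| In this sketch the removal happens after the duplication but before the |
| close, while the registered descriptor number is still valid. |

```c
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical demonstration: register the read end of a pipe,
   duplicate it, close the registered descriptor, then make the pipe
   readable. If del_first is nonzero, the descriptor is removed with
   EPOLL_CTL_DEL before it is closed. Returns the number of events
   epoll_wait(2) then reports, or -1 on setup failure. */
int
events_after_close(int del_first)
{
    struct epoll_event ev, out;
    int pfd[2], epfd, dupfd = -1, n = -1;

    if (pipe(pfd) == -1)
        return -1;
    epfd = epoll_create1(0);
    if (epfd == -1)
        goto done;

    ev.events = EPOLLIN;
    ev.data.fd = pfd[0];
    if (epoll_ctl(epfd, EPOLL_CTL_ADD, pfd[0], &ev) == -1)
        goto done;

    dupfd = dup(pfd[0]);        /* second reference to the description */
    if (dupfd == -1)
        goto done;

    if (del_first && epoll_ctl(epfd, EPOLL_CTL_DEL, pfd[0], NULL) == -1)
        goto done;
    close(pfd[0]);              /* description stays open via dupfd */
    pfd[0] = -1;

    if (write(pfd[1], "x", 1) != 1)
        goto done;
    n = epoll_wait(epfd, &out, 1, 0);

done:
    if (dupfd != -1)
        close(dupfd);
    if (pfd[0] != -1)
        close(pfd[0]);
    close(pfd[1]);
    if (epfd != -1)
        close(epfd);
    return n;
}
```

| Without the explicit removal, one event is expected even though the |
| registered descriptor has been closed; with it, none is expected. |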
| .IP 7. |
| If more than one event occurs between |
| .BR epoll_wait (2) |
| calls, are they combined or reported separately? |
| .IP |
| They will be combined. |
| .IP 8. |
| Does an operation on a file descriptor affect the |
| already collected but not yet reported events? |
| .IP |
| You can do two operations on an existing file descriptor: |
| remove and modify. |
| Removal is meaningless in this case. |
| Modification causes the available I/O to be reread. |
| .IP 9. |
| Do I need to continuously read/write a file descriptor |
| until |
| .B EAGAIN |
| when using the |
| .B EPOLLET |
| flag (edge-triggered behavior)? |
| .IP |
| Receiving an event from |
| .BR epoll_wait (2) |
| should suggest to you that the |
| file descriptor is ready for the requested I/O operation. |
| You must consider it ready until the next (nonblocking) |
| read/write yields |
| .BR EAGAIN . |
| When and how you will use the file descriptor is entirely up to you. |
| .IP |
| For packet/token-oriented files (e.g., datagram socket, |
| terminal in canonical mode), |
| the only way to detect the end of the read/write I/O space |
| is to continue to read/write until |
| .BR EAGAIN . |
| .IP |
| For stream-oriented files (e.g., pipe, FIFO, stream socket), the |
| condition that the read/write I/O space is exhausted can also be detected by |
| checking the amount of data read from / written to the target file |
| descriptor. |
| For example, if you call |
| .BR read (2) |
| by asking to read a certain amount of data and |
| .BR read (2) |
| returns a lower number of bytes, you |
| can be sure of having exhausted the read I/O space for the file |
| descriptor. |
| The same is true when writing using |
| .BR write (2). |
| (Avoid this latter technique if you cannot guarantee that |
| the monitored file descriptor always refers to a stream-oriented file.) |
| .SS Possible pitfalls and ways to avoid them |
| .TP |
| .B o Starvation (edge-triggered) |
| .PP |
| If there is a large amount of I/O space, |
| it is possible that, by trying to drain it, |
| the other files will not get processed, causing starvation. |
| (This problem is not specific to |
| .BR epoll .) |
| .PP |
| The solution is to maintain a ready list |
| and mark the file descriptor as ready |
| in its associated data structure, thereby allowing the application to |
| remember which files need to be processed but still round robin amongst |
| all the ready files. |
| This also supports ignoring subsequent events you |
| receive for file descriptors that are already ready. |
| .TP |
| .B o If using an event cache... |
| .PP |
| If you use an event cache or store all the file descriptors returned from |
| .BR epoll_wait (2), |
| then make sure to provide a way to mark a file descriptor's closure |
| dynamically (i.e., as caused by |
| a previous event's processing). |
| Suppose you receive 100 events from |
| .BR epoll_wait (2), |
| and in event #47 a condition causes event #13 to be closed. |
| If you remove the structure and |
| .BR close (2) |
| the file descriptor for event #13, then your |
| event cache might still say there are events waiting for that |
| file descriptor, causing confusion. |
| .PP |
| One solution for this is to call, during the processing of event 47, |
| .BR epoll_ctl ( EPOLL_CTL_DEL ) |
| to delete file descriptor 13 and |
| .BR close (2), |
| then mark its associated |
| data structure as removed and link it to a cleanup list. |
| If you find another |
| event for file descriptor 13 in your batch processing, |
| you will discover the file descriptor had been |
| previously removed and there will be no confusion. |
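| .PP |
| One way to structure this is sketched below. |
| All names |
| .RI ( "struct conn" , |
| .IR conn_retire (), |
| .IR process_batch ()) |
| are hypothetical, and the sketch assumes that each registration stores a |
| heap-allocated |
| .I struct conn |
| in |
| .IR data.ptr . |

```c
#include <stdlib.h>
#include <sys/epoll.h>
#include <unistd.h>

/* Hypothetical per-descriptor state, reachable via epoll_event.data.ptr. */
struct conn {
    int fd;
    int removed;                  /* nonzero once deleted and closed */
    struct conn *next_cleanup;    /* link in the deferred-free list */
};

static struct conn *cleanup_list;

/* Deregister and close c now, but defer freeing it until the current
   batch is finished, so later events in the batch can still inspect it. */
void
conn_retire(int epfd, struct conn *c)
{
    if (c->removed)
        return;
    epoll_ctl(epfd, EPOLL_CTL_DEL, c->fd, NULL);
    close(c->fd);
    c->removed = 1;
    c->next_cleanup = cleanup_list;
    cleanup_list = c;
}

/* Run handler on each event of a batch returned by epoll_wait(2),
   skipping entries retired earlier in the same batch.
   Returns the number of events actually handled. */
int
process_batch(int epfd, struct epoll_event *evs, int nfds,
              void (*handler)(int epfd, struct conn *c))
{
    int handled = 0;

    for (int n = 0; n < nfds; n++) {
        struct conn *c = evs[n].data.ptr;

        if (c->removed)
            continue;             /* stale event for a closed descriptor */
        handler(epfd, c);
        handled++;
    }

    while (cleanup_list != NULL) {    /* batch done: safe to free */
        struct conn *c = cleanup_list;

        cleanup_list = c->next_cleanup;
        free(c);
    }
    return handled;
}
```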
| .SH VERSIONS |
| The |
| .B epoll |
| API was introduced in Linux kernel 2.5.44. |
| .\" Its interface should be finalized in Linux kernel 2.5.66. |
| Support was added to glibc in version 2.3.2. |
| .SH CONFORMING TO |
| The |
| .B epoll |
| API is Linux-specific. |
| Some other systems provide similar |
| mechanisms, for example, FreeBSD has |
| .IR kqueue , |
| and Solaris has |
| .IR /dev/poll . |
| .SH NOTES |
| The set of file descriptors that is being monitored via |
| an epoll file descriptor can be viewed via the entry for |
| the epoll file descriptor in the process's |
| .IR /proc/[pid]/fdinfo |
| directory. |
| See |
| .BR proc (5) |
| for further details. |
| .PP |
| The |
| .BR kcmp (2) |
| .B KCMP_EPOLL_TFD |
| operation can be used to test whether a file descriptor |
| is present in an epoll instance. |
| .SH SEE ALSO |
| .BR epoll_create (2), |
| .BR epoll_create1 (2), |
| .BR epoll_ctl (2), |
| .BR epoll_wait (2), |
| .BR poll (2), |
| .BR select (2) |