| .\" Copyright (C) 2015 Alexei Starovoitov <ast@kernel.org> |
| .\" and Copyright (C) 2015 Michael Kerrisk <mtk.manpages@gmail.com> |
| .\" |
| .\" %%%LICENSE_START(VERBATIM) |
| .\" Permission is granted to make and distribute verbatim copies of this |
| .\" manual provided the copyright notice and this permission notice are |
| .\" preserved on all copies. |
| .\" |
| .\" Permission is granted to copy and distribute modified versions of this |
| .\" manual under the conditions for verbatim copying, provided that the |
| .\" entire resulting derived work is distributed under the terms of a |
| .\" permission notice identical to this one. |
| .\" |
| .\" Since the Linux kernel and libraries are constantly changing, this |
| .\" manual page may be incorrect or out-of-date. The author(s) assume no |
| .\" responsibility for errors or omissions, or for damages resulting from |
| .\" the use of the information contained herein. The author(s) may not |
| .\" have taken the same level of care in the production of this manual, |
| .\" which is licensed free of charge, as they might when working |
| .\" professionally. |
| .\" |
| .\" Formatted or processed versions of this manual, if unaccompanied by |
| .\" the source, must acknowledge the copyright and authors of this work. |
| .\" %%%LICENSE_END |
| .\" |
| .TH BPF 2 2018-02-02 "Linux" "Linux Programmer's Manual" |
| .SH NAME |
| bpf \- perform a command on an extended BPF map or program |
| .SH SYNOPSIS |
| .nf |
| .B #include <linux/bpf.h> |
| .PP |
| .BI "int bpf(int " cmd ", union bpf_attr *" attr ", unsigned int " size "); |
| .SH DESCRIPTION |
| The |
| .BR bpf () |
| system call performs a range of operations related to extended |
| Berkeley Packet Filters. |
| Extended BPF (or eBPF) is similar to |
| the original ("classic") BPF (cBPF) used to filter network packets. |
| For both cBPF and eBPF programs, |
| the kernel statically analyzes the programs before loading them, |
| in order to ensure that they cannot harm the running system. |
| .PP |
| eBPF extends cBPF in multiple ways, including the ability to call |
| a fixed set of in-kernel helper functions |
| .\" See 'enum bpf_func_id' in include/uapi/linux/bpf.h |
| (via the |
| .B BPF_CALL |
| opcode extension provided by eBPF) |
| and access shared data structures such as eBPF maps. |
| .\" |
| .SS Extended BPF Design/Architecture |
| eBPF maps are a generic data structure for storage of different data types. |
| Data types are generally treated as binary blobs, so a user just specifies |
| the size of the key and the size of the value at map-creation time. |
| In other words, a key/value for a given map can have an arbitrary structure. |
| .PP |
| A user process can create multiple maps (with key/value-pairs being |
| opaque bytes of data) and access them via file descriptors. |
| Different eBPF programs can access the same maps in parallel. |
| It's up to the user process and eBPF program to decide what they store |
| inside maps. |
| .PP |
| There's one special map type, called a program array. |
| This type of map stores file descriptors referring to other eBPF programs. |
| When a lookup in the map is performed, the program flow is |
| redirected in-place to the beginning of another eBPF program and does not |
| return back to the calling program. |
| The level of nesting has a fixed limit of 32, |
| .\" Defined by the kernel constant MAX_TAIL_CALL_CNT in include/linux/bpf.h |
| so that infinite loops cannot be crafted. |
| At run time, the program file descriptors stored in the map can be modified, |
| so program functionality can be altered based on specific requirements. |
| All programs referred to in a program-array map must |
| have been previously loaded into the kernel via |
| .BR bpf (). |
| If a map lookup fails, the current program continues its execution. |
| See |
| .B BPF_MAP_TYPE_PROG_ARRAY |
| below for further details. |
| .PP |
| Generally, eBPF programs are loaded by the user process and automatically |
| unloaded when the process exits. |
| In some cases, for example, |
| .BR tc-bpf (8), |
| the program will continue to stay alive inside the kernel even after the |
| process that loaded the program exits. |
| In that case, |
| the tc subsystem holds a reference to the eBPF program after the |
| file descriptor has been closed by the user-space program. |
| Thus, whether a specific program continues to live inside the kernel |
| depends on how it is further attached to a given kernel subsystem |
| after it was loaded via |
| .BR bpf (). |
| .PP |
| Each eBPF program is a set of instructions that is safe to run until |
| its completion. |
| An in-kernel verifier statically determines that the eBPF program |
| terminates and is safe to execute. |
| During verification, the kernel increments reference counts for each of |
| the maps that the eBPF program uses, |
| so that the attached maps can't be removed until the program is unloaded. |
| .PP |
| eBPF programs can be attached to different events. |
| These events can be the arrival of network packets, tracing |
| events, classification events by network queueing disciplines |
| (for eBPF programs attached to a |
| .BR tc (8) |
| classifier), and other types that may be added in the future. |
| A new event triggers execution of the eBPF program, which |
| may store information about the event in eBPF maps. |
| Beyond storing data, eBPF programs may call a fixed set of |
| in-kernel helper functions. |
| .PP |
| The same eBPF program can be attached to multiple events and different |
| eBPF programs can access the same map: |
| .PP |
| .in +4n |
| .EX |
| tracing tracing tracing packet packet packet |
| event A event B event C on eth0 on eth1 on eth2 |
| | | | | | ^ |
| | | | | v | |
| --> tracing <-- tracing socket tc ingress tc egress |
| prog_1 prog_2 prog_3 classifier action |
| | | | | prog_4 prog_5 |
| |--- -----| |------| map_3 | | |
| map_1 map_2 --| map_4 |-- |
| .EE |
| .in |
| .\" |
| .SS Arguments |
| The operation to be performed by the |
| .BR bpf () |
| system call is determined by the |
| .IR cmd |
| argument. |
| Each operation takes an accompanying argument, |
| provided via |
| .IR attr , |
| which is a pointer to a union of type |
| .IR bpf_attr |
| (see below). |
| The |
| .I size |
| argument is the size of the union pointed to by |
| .IR attr . |
| .PP |
| The value provided in |
| .IR cmd |
| is one of the following: |
| .TP |
| .B BPF_MAP_CREATE |
| Create a map and return a file descriptor that refers to the map. |
| The close-on-exec file descriptor flag (see |
| .BR fcntl (2)) |
| is automatically enabled for the new file descriptor. |
| .TP |
| .B BPF_MAP_LOOKUP_ELEM |
| Look up an element by key in a specified map and return its value. |
| .TP |
| .B BPF_MAP_UPDATE_ELEM |
| Create or update an element (key/value pair) in a specified map. |
| .TP |
| .B BPF_MAP_DELETE_ELEM |
| Look up and delete an element by key in a specified map. |
| .TP |
| .B BPF_MAP_GET_NEXT_KEY |
| Look up an element by key in a specified map and return the key |
| of the next element. |
| .TP |
| .B BPF_PROG_LOAD |
| Verify and load an eBPF program, |
| returning a new file descriptor associated with the program. |
| The close-on-exec file descriptor flag (see |
| .BR fcntl (2)) |
| is automatically enabled for the new file descriptor. |
| .IP |
| The |
| .I bpf_attr |
| union consists of various anonymous structures that are used by different |
| .BR bpf () |
| commands: |
| .PP |
| .in +4n |
| .EX |
| union bpf_attr { |
| struct { /* Used by BPF_MAP_CREATE */ |
| __u32 map_type; |
| __u32 key_size; /* size of key in bytes */ |
| __u32 value_size; /* size of value in bytes */ |
| __u32 max_entries; /* maximum number of entries |
| in a map */ |
| }; |
| |
| struct { /* Used by BPF_MAP_*_ELEM and BPF_MAP_GET_NEXT_KEY |
| commands */ |
| __u32 map_fd; |
| __aligned_u64 key; |
| union { |
| __aligned_u64 value; |
| __aligned_u64 next_key; |
| }; |
| __u64 flags; |
| }; |
| |
| struct { /* Used by BPF_PROG_LOAD */ |
| __u32 prog_type; |
| __u32 insn_cnt; |
| __aligned_u64 insns; /* 'const struct bpf_insn *' */ |
| __aligned_u64 license; /* 'const char *' */ |
| __u32 log_level; /* verbosity level of verifier */ |
| __u32 log_size; /* size of user buffer */ |
| __aligned_u64 log_buf; /* user supplied 'char *' |
| buffer */ |
| __u32 kern_version; |
| /* checked when prog_type=kprobe |
| (since Linux 4.1) */ |
| .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5 |
| }; |
| } __attribute__((aligned(8))); |
| .EE |
| .in |
| .\" |
| .SS eBPF maps |
| Maps are a generic data structure for storage of different types of data. |
| They allow sharing of data between eBPF kernel programs, |
| and also between kernel and user-space applications. |
| .PP |
| Each map type has the following attributes: |
| .IP * 3 |
| type |
| .IP * |
| maximum number of elements |
| .IP * |
| key size in bytes |
| .IP * |
| value size in bytes |
| .PP |
| The following wrapper functions demonstrate how various |
| .BR bpf () |
| commands can be used to access the maps. |
| The functions use the |
| .IR cmd |
| argument to invoke different operations. |
| .TP |
| .B BPF_MAP_CREATE |
| The |
| .B BPF_MAP_CREATE |
| command creates a new map, |
| returning a new file descriptor that refers to the map. |
| .IP |
| .in +4n |
| .EX |
| int |
| bpf_create_map(enum bpf_map_type map_type, |
| unsigned int key_size, |
| unsigned int value_size, |
| unsigned int max_entries) |
| { |
| union bpf_attr attr = { |
| .map_type = map_type, |
| .key_size = key_size, |
| .value_size = value_size, |
| .max_entries = max_entries |
| }; |
| |
| return bpf(BPF_MAP_CREATE, &attr, sizeof(attr)); |
| } |
| .EE |
| .in |
| .IP |
| The new map has the type specified by |
| .IR map_type , |
| and attributes as specified in |
| .IR key_size , |
| .IR value_size , |
| and |
| .IR max_entries . |
| On success, this operation returns a file descriptor. |
| On error, \-1 is returned and |
| .I errno |
| is set to |
| .BR EINVAL , |
| .BR EPERM , |
| or |
| .BR ENOMEM . |
| .IP |
| The |
| .I key_size |
| and |
| .I value_size |
| attributes will be used by the verifier during program loading |
| to check that the program is calling |
| .BR bpf_map_*_elem () |
| helper functions with a correctly initialized |
| .I key |
| and to check that the program doesn't access the map element |
| .I value |
| beyond the specified |
| .IR value_size . |
| For example, when a map is created with a |
| .IR key_size |
| of 8 and the eBPF program calls |
| .IP |
| .in +4n |
| .EX |
| bpf_map_lookup_elem(map_fd, fp - 4) |
| .EE |
| .in |
| .IP |
| the program will be rejected, |
| since the in-kernel helper function |
| .IP |
| .EX |
| bpf_map_lookup_elem(map_fd, void *key) |
| .EE |
| .IP |
| expects to read 8 bytes from the location pointed to by |
| .IR key , |
| but the |
| .IR "fp\ -\ 4" |
| (where |
| .I fp |
| is the top of the stack) |
| starting address will cause out-of-bounds stack access. |
| .IP |
| Similarly, when a map is created with a |
| .I value_size |
| of 1 and the eBPF program contains |
| .IP |
| .in +4n |
| .EX |
| value = bpf_map_lookup_elem(...); |
| *(u32 *) value = 1; |
| .EE |
| .in |
| .IP |
| the program will be rejected, since it accesses the |
| .I value |
| pointer beyond the specified 1 byte |
| .I value_size |
| limit. |
| .IP |
| Currently, the following values are supported for |
| .IR map_type : |
| .IP |
| .in +4n |
| .EX |
| enum bpf_map_type { |
| BPF_MAP_TYPE_UNSPEC, /* Reserve 0 as invalid map type */ |
| BPF_MAP_TYPE_HASH, |
| BPF_MAP_TYPE_ARRAY, |
| BPF_MAP_TYPE_PROG_ARRAY, |
| BPF_MAP_TYPE_PERF_EVENT_ARRAY, |
| BPF_MAP_TYPE_PERCPU_HASH, |
| BPF_MAP_TYPE_PERCPU_ARRAY, |
| BPF_MAP_TYPE_STACK_TRACE, |
| BPF_MAP_TYPE_CGROUP_ARRAY, |
| BPF_MAP_TYPE_LRU_HASH, |
| BPF_MAP_TYPE_LRU_PERCPU_HASH, |
| BPF_MAP_TYPE_LPM_TRIE, |
| BPF_MAP_TYPE_ARRAY_OF_MAPS, |
| BPF_MAP_TYPE_HASH_OF_MAPS, |
| BPF_MAP_TYPE_DEVMAP, |
| BPF_MAP_TYPE_SOCKMAP, |
| BPF_MAP_TYPE_CPUMAP, |
| }; |
| .EE |
| .in |
| .IP |
| .I map_type |
| selects one of the available map implementations in the kernel. |
| .\" FIXME We need an explanation of why one might choose each of |
| .\" these map implementations |
| For all map types, |
| eBPF programs access maps with the same |
| .BR bpf_map_lookup_elem () |
| and |
| .BR bpf_map_update_elem () |
| helper functions. |
| Further details of the various map types are given below. |
| .TP |
| .B BPF_MAP_LOOKUP_ELEM |
| The |
| .B BPF_MAP_LOOKUP_ELEM |
| command looks up an element with a given |
| .I key |
| in the map referred to by the file descriptor |
| .IR fd . |
| .IP |
| .in +4n |
| .EX |
| int |
| bpf_lookup_elem(int fd, const void *key, void *value) |
| { |
| union bpf_attr attr = { |
| .map_fd = fd, |
| .key = ptr_to_u64(key), |
| .value = ptr_to_u64(value), |
| }; |
| |
| return bpf(BPF_MAP_LOOKUP_ELEM, &attr, sizeof(attr)); |
| } |
| .EE |
| .in |
| .IP |
| If an element is found, |
| the operation returns zero and stores the element's value into |
| .IR value , |
| which must point to a buffer of |
| .I value_size |
| bytes. |
| .IP |
| If no element is found, the operation returns \-1 and sets |
| .I errno |
| to |
| .BR ENOENT . |
| .TP |
| .B BPF_MAP_UPDATE_ELEM |
| The |
| .B BPF_MAP_UPDATE_ELEM |
| command |
| creates or updates an element with a given |
| .I key/value |
| in the map referred to by the file descriptor |
| .IR fd . |
| .IP |
| .in +4n |
| .EX |
| int |
| bpf_update_elem(int fd, const void *key, const void *value, |
| uint64_t flags) |
| { |
| union bpf_attr attr = { |
| .map_fd = fd, |
| .key = ptr_to_u64(key), |
| .value = ptr_to_u64(value), |
| .flags = flags, |
| }; |
| |
| return bpf(BPF_MAP_UPDATE_ELEM, &attr, sizeof(attr)); |
| } |
| .EE |
| .in |
| .IP |
| The |
| .I flags |
| argument should be specified as one of the following: |
| .RS |
| .TP |
| .B BPF_ANY |
| Create a new element or update an existing element. |
| .TP |
| .B BPF_NOEXIST |
| Create a new element only if it did not exist. |
| .TP |
| .B BPF_EXIST |
| Update an existing element. |
| .RE |
| .IP |
| On success, the operation returns zero. |
| On error, \-1 is returned and |
| .I errno |
| is set to |
| .BR EINVAL , |
| .BR EPERM , |
| .BR ENOMEM , |
| or |
| .BR E2BIG . |
| .B E2BIG |
| indicates that the number of elements in the map reached the |
| .I max_entries |
| limit specified at map creation time. |
| .B EEXIST |
| will be returned if |
| .I flags |
| specifies |
| .B BPF_NOEXIST |
| and the element with |
| .I key |
| already exists in the map. |
| .B ENOENT |
| will be returned if |
| .I flags |
| specifies |
| .B BPF_EXIST |
| and the element with |
| .I key |
| doesn't exist in the map. |
| .TP |
| .B BPF_MAP_DELETE_ELEM |
| The |
| .B BPF_MAP_DELETE_ELEM |
| command |
| deleted the element whose key is |
| .I key |
| from the map referred to by the file descriptor |
| .IR fd . |
| .IP |
| .in +4n |
| .EX |
| int |
| bpf_delete_elem(int fd, const void *key) |
| { |
| union bpf_attr attr = { |
| .map_fd = fd, |
| .key = ptr_to_u64(key), |
| }; |
| |
| return bpf(BPF_MAP_DELETE_ELEM, &attr, sizeof(attr)); |
| } |
| .EE |
| .in |
| .IP |
| On success, zero is returned. |
| If the element is not found, \-1 is returned and |
| .I errno |
| is set to |
| .BR ENOENT . |
| .TP |
| .B BPF_MAP_GET_NEXT_KEY |
| The |
| .B BPF_MAP_GET_NEXT_KEY |
| command looks up an element by |
| .I key |
| in the map referred to by the file descriptor |
| .IR fd |
| and sets the |
| .I next_key |
| pointer to the key of the next element. |
| .IP |
| .in +4n |
| .EX |
| int |
| bpf_get_next_key(int fd, const void *key, void *next_key) |
| { |
| union bpf_attr attr = { |
| .map_fd = fd, |
| .key = ptr_to_u64(key), |
| .next_key = ptr_to_u64(next_key), |
| }; |
| |
| return bpf(BPF_MAP_GET_NEXT_KEY, &attr, sizeof(attr)); |
| } |
| .EE |
| .in |
| .IP |
| If |
| .I key |
| is found, the operation returns zero and sets the |
| .I next_key |
| pointer to the key of the next element. |
| If |
| .I key |
| is not found, the operation returns zero and sets the |
| .I next_key |
| pointer to the key of the first element. |
| If |
| .I key |
| is the last element, \-1 is returned and |
| .I errno |
| is set to |
| .BR ENOENT . |
| Other possible |
| .I errno |
| values are |
| .BR ENOMEM , |
| .BR EFAULT , |
| .BR EPERM , |
| and |
| .BR EINVAL . |
| This method can be used to iterate over all elements in the map. |
| .TP |
| .B close(map_fd) |
| Delete the map referred to by the file descriptor |
| .IR map_fd . |
| When the user-space program that created a map exits, all maps will |
| be deleted automatically (but see NOTES). |
| .\" |
| .SS eBPF map types |
| The following map types are supported: |
| .TP |
| .B BPF_MAP_TYPE_HASH |
| .\" commit 0f8e4bd8a1fc8c4185f1630061d0a1f2d197a475 |
| Hash-table maps have the following characteristics: |
| .RS |
| .IP * 3 |
| Maps are created and destroyed by user-space programs. |
| Both user-space and eBPF programs |
| can perform lookup, update, and delete operations. |
| .IP * |
| The kernel takes care of allocating and freeing key/value pairs. |
| .IP * |
| The |
| .BR map_update_elem () |
| helper will fail to insert new element when the |
| .I max_entries |
| limit is reached. |
| (This ensures that eBPF programs cannot exhaust memory.) |
| .IP * |
| .BR map_update_elem () |
| replaces existing elements atomically. |
| .RE |
| .IP |
| Hash-table maps are |
| optimized for speed of lookup. |
| .TP |
| .B BPF_MAP_TYPE_ARRAY |
| .\" commit 28fbcfa08d8ed7c5a50d41a0433aad222835e8e3 |
| Array maps have the following characteristics: |
| .RS |
| .IP * 3 |
| Optimized for fastest possible lookup. |
| In the future the verifier/JIT compiler |
| may recognize lookup() operations that employ a constant key |
| and optimize it into constant pointer. |
| It is possible to optimize a non-constant |
| key into direct pointer arithmetic as well, since pointers and |
| .I value_size |
| are constant for the life of the eBPF program. |
| In other words, |
| .BR array_map_lookup_elem () |
| may be 'inlined' by the verifier/JIT compiler |
| while preserving concurrent access to this map from user space. |
| .IP * |
| All array elements pre-allocated and zero initialized at init time |
| .IP * |
| The key is an array index, and must be exactly four bytes. |
| .IP * |
| .BR map_delete_elem () |
| fails with the error |
| .BR EINVAL , |
| since elements cannot be deleted. |
| .IP * |
| .BR map_update_elem () |
| replaces elements in a |
| .B nonatomic |
| fashion; |
| for atomic updates, a hash-table map should be used instead. |
| There is however one special case that can also be used with arrays: |
| the atomic built-in |
| .BR __sync_fetch_and_add() |
| can be used on 32 and 64 bit atomic counters. |
| For example, it can be |
| applied on the whole value itself if it represents a single counter, |
| or in case of a structure containing multiple counters, it could be |
| used on individual counters. |
| This is quite often useful for aggregation and accounting of events. |
| .RE |
| .IP |
| Among the uses for array maps are the following: |
| .RS |
| .IP * 3 |
| As "global" eBPF variables: an array of 1 element whose key is (index) 0 |
| and where the value is a collection of 'global' variables which |
| eBPF programs can use to keep state between events. |
| .IP * |
| Aggregation of tracing events into a fixed set of buckets. |
| .IP * |
| Accounting of networking events, for example, number of packets and packet |
| sizes. |
| .RE |
| .TP |
| .BR BPF_MAP_TYPE_PROG_ARRAY " (since Linux 4.2)" |
| A program array map is a special kind of array map whose map values |
| contain only file descriptors referring to other eBPF programs. |
| Thus, both the |
| .I key_size |
| and |
| .I value_size |
| must be exactly four bytes. |
| This map is used in conjunction with the |
| .BR bpf_tail_call () |
| helper. |
| .IP |
| This means that an eBPF program with a program array map attached to it |
| can call from kernel side into |
| .IP |
| .in +4n |
| .EX |
| void bpf_tail_call(void *context, void *prog_map, |
| unsigned int index); |
| .EE |
| .in |
| .IP |
| and therefore replace its own program flow with the one from the program |
| at the given program array slot, if present. |
| This can be regarded as kind of a jump table to a different eBPF program. |
| The invoked program will then reuse the same stack. |
| When a jump into the new program has been performed, |
| it won't return to the old program anymore. |
| .IP |
| If no eBPF program is found at the given index of the program array |
| (because the map slot doesn't contain a valid program file descriptor, |
| the specified lookup index/key is out of bounds, |
| or the limit of 32 |
| .\" MAX_TAIL_CALL_CNT |
| nested calls has been exceed), |
| execution continues with the current eBPF program. |
| This can be used as a fall-through for default cases. |
| .IP |
| A program array map is useful, for example, in tracing or networking, to |
| handle individual system calls or protocols in their own subprograms and |
| use their identifiers as an individual map index. |
| This approach may result in performance benefits, |
| and also makes it possible to overcome the maximum |
| instruction limit of a single eBPF program. |
| In dynamic environments, |
| a user-space daemon might atomically replace individual subprograms |
| at run-time with newer versions to alter overall program behavior, |
| for instance, if global policies change. |
| .\" |
| .SS eBPF programs |
| The |
| .B BPF_PROG_LOAD |
| command is used to load an eBPF program into the kernel. |
| The return value for this command is a new file descriptor associated |
| with this eBPF program. |
| .PP |
| .in +4n |
| .EX |
| char bpf_log_buf[LOG_BUF_SIZE]; |
| |
| int |
| bpf_prog_load(enum bpf_prog_type type, |
| const struct bpf_insn *insns, int insn_cnt, |
| const char *license) |
| { |
| union bpf_attr attr = { |
| .prog_type = type, |
| .insns = ptr_to_u64(insns), |
| .insn_cnt = insn_cnt, |
| .license = ptr_to_u64(license), |
| .log_buf = ptr_to_u64(bpf_log_buf), |
| .log_size = LOG_BUF_SIZE, |
| .log_level = 1, |
| }; |
| |
| return bpf(BPF_PROG_LOAD, &attr, sizeof(attr)); |
| } |
| .EE |
| .in |
| .PP |
| .I prog_type |
| is one of the available program types: |
| .IP |
| .in +4n |
| .EX |
| enum bpf_prog_type { |
| BPF_PROG_TYPE_UNSPEC, /* Reserve 0 as invalid |
| program type */ |
| BPF_PROG_TYPE_SOCKET_FILTER, |
| BPF_PROG_TYPE_KPROBE, |
| BPF_PROG_TYPE_SCHED_CLS, |
| BPF_PROG_TYPE_SCHED_ACT, |
| }; |
| .EE |
| .in |
| .PP |
| For further details of eBPF program types, see below. |
| .PP |
| The remaining fields of |
| .I bpf_attr |
| are set as follows: |
| .IP * 3 |
| .I insns |
| is an array of |
| .I "struct bpf_insn" |
| instructions. |
| .IP * |
| .I insn_cnt |
| is the number of instructions in the program referred to by |
| .IR insns . |
| .IP * |
| .I license |
| is a license string, which must be GPL compatible to call helper functions |
| marked |
| .IR gpl_only . |
| (The licensing rules are the same as for kernel modules, |
| so that also dual licenses, such as "Dual BSD/GPL", may be used.) |
| .IP * |
| .I log_buf |
| is a pointer to a caller-allocated buffer in which the in-kernel |
| verifier can store the verification log. |
| This log is a multi-line string that can be checked by |
| the program author in order to understand how the verifier came to |
| the conclusion that the eBPF program is unsafe. |
| The format of the output can change at any time as the verifier evolves. |
| .IP * |
| .I log_size |
| size of the buffer pointed to by |
| .IR log_buf . |
| If the size of the buffer is not large enough to store all |
| verifier messages, \-1 is returned and |
| .I errno |
| is set to |
| .BR ENOSPC . |
| .IP * |
| .I log_level |
| verbosity level of the verifier. |
| A value of zero means that the verifier will not provide a log; |
| in this case, |
| .I log_buf |
| must be a NULL pointer, and |
| .I log_size |
| must be zero. |
| .PP |
| Applying |
| .BR close (2) |
| to the file descriptor returned by |
| .B BPF_PROG_LOAD |
| will unload the eBPF program (but see NOTES). |
| .PP |
| Maps are accessible from eBPF programs and are used to exchange data between |
| eBPF programs and between eBPF programs and user-space programs. |
| For example, |
| eBPF programs can process various events (like kprobe, packets) and |
| store their data into a map, |
| and user-space programs can then fetch data from the map. |
| Conversely, user-space programs can use a map as a configuration mechanism, |
| populating the map with values checked by the eBPF program, |
| which then modifies its behavior on the fly according to those values. |
| .\" |
| .\" |
| .SS eBPF program types |
| The eBPF program type |
| .RI ( prog_type ) |
| determines the subset of kernel helper functions that the program |
| may call. |
| The program type also determines the program input (context)\(emthe |
| format of |
| .I "struct bpf_context" |
| (which is the data blob passed into the eBPF program as the first argument). |
| .\" |
| .\" FIXME |
| .\" Somewhere in this page we need a general introduction to the |
| .\" bpf_context. For example, how does a BPF program access the |
| .\" context? |
| .PP |
| For example, a tracing program does not have the exact same |
| subset of helper functions as a socket filter program |
| (though they may have some helpers in common). |
| Similarly, |
| the input (context) for a tracing program is a set of register values, |
| while for a socket filter it is a network packet. |
| .PP |
| The set of functions available to eBPF programs of a given type may increase |
| in the future. |
| .PP |
| The following program types are supported: |
| .TP |
| .BR BPF_PROG_TYPE_SOCKET_FILTER " (since Linux 3.19)" |
| Currently, the set of functions for |
| .B BPF_PROG_TYPE_SOCKET_FILTER |
| is: |
| .IP |
| .in +4n |
| .EX |
| bpf_map_lookup_elem(map_fd, void *key) |
| /* look up key in a map_fd */ |
| bpf_map_update_elem(map_fd, void *key, void *value) |
| /* update key/value */ |
| bpf_map_delete_elem(map_fd, void *key) |
| /* delete key in a map_fd */ |
| .EE |
| .in |
| .IP |
| The |
| .I bpf_context |
| argument is a pointer to a |
| .IR "struct __sk_buff" . |
| .\" FIXME: We need some text here to explain how the program |
| .\" accesses __sk_buff. |
| .\" See 'struct __sk_buff' and commit 9bac3d6d548e5 |
| .\" |
| .\" Alexei commented: |
| .\" Actually now in case of SOCKET_FILTER, SCHED_CLS, SCHED_ACT |
| .\" the program can now access skb fields. |
| .\" |
| .TP |
| .BR BPF_PROG_TYPE_KPROBE " (since Linux 4.1) |
| .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5 |
| [To be documented] |
| .\" FIXME Document this program type |
| .\" Describe allowed helper functions for this program type |
| .\" Describe bpf_context for this program type |
| .\" |
| .\" FIXME We need text here to describe 'kern_version' |
| .TP |
| .BR BPF_PROG_TYPE_SCHED_CLS " (since Linux 4.1) |
| .\" commit 96be4325f443dbbfeb37d2a157675ac0736531a1 |
| .\" commit e2e9b6541dd4b31848079da80fe2253daaafb549 |
| [To be documented] |
| .\" FIXME Document this program type |
| .\" Describe allowed helper functions for this program type |
| .\" Describe bpf_context for this program type |
| .TP |
| .BR BPF_PROG_TYPE_SCHED_ACT " (since Linux 4.1) |
| .\" commit 94caee8c312d96522bcdae88791aaa9ebcd5f22c |
| .\" commit a8cb5f556b567974d75ea29c15181c445c541b1f |
| [To be documented] |
| .\" FIXME Document this program type |
| .\" Describe allowed helper functions for this program type |
| .\" Describe bpf_context for this program type |
| .SS Events |
| Once a program is loaded, it can be attached to an event. |
| Various kernel subsystems have different ways to do so. |
| .PP |
| Since Linux 3.19, |
| .\" commit 89aa075832b0da4402acebd698d0411dcc82d03e |
| the following call will attach the program |
| .I prog_fd |
| to the socket |
| .IR sockfd , |
| which was created by an earlier call to |
| .BR socket (2): |
| .PP |
| .in +4n |
| .EX |
| setsockopt(sockfd, SOL_SOCKET, SO_ATTACH_BPF, |
| &prog_fd, sizeof(prog_fd)); |
| .EE |
| .in |
| .PP |
| Since Linux 4.1, |
| .\" commit 2541517c32be2531e0da59dfd7efc1ce844644f5 |
| the following call may be used to attach |
| the eBPF program referred to by the file descriptor |
| .I prog_fd |
| to a perf event file descriptor, |
| .IR event_fd , |
| that was created by a previous call to |
| .BR perf_event_open (2): |
| .PP |
| .in +4n |
| .EX |
| ioctl(event_fd, PERF_EVENT_IOC_SET_BPF, prog_fd); |
| .EE |
| .in |
| .\" |
| .\" |
| .SH EXAMPLES |
| .EX |
| /* bpf+sockets example: |
| * 1. create array map of 256 elements |
| * 2. load program that counts number of packets received |
| * r0 = skb->data[ETH_HLEN + offsetof(struct iphdr, protocol)] |
| * map[r0]++ |
| * 3. attach prog_fd to raw socket via setsockopt() |
| * 4. print number of received TCP/UDP packets every second |
| */ |
| int |
| main(int argc, char **argv) |
| { |
| int sock, map_fd, prog_fd, key; |
| long long value = 0, tcp_cnt, udp_cnt; |
| |
| map_fd = bpf_create_map(BPF_MAP_TYPE_ARRAY, sizeof(key), |
| sizeof(value), 256); |
| if (map_fd < 0) { |
| printf("failed to create map '%s'\\n", strerror(errno)); |
| /* likely not run as root */ |
| return 1; |
| } |
| |
| struct bpf_insn prog[] = { |
| BPF_MOV64_REG(BPF_REG_6, BPF_REG_1), /* r6 = r1 */ |
| BPF_LD_ABS(BPF_B, ETH_HLEN + offsetof(struct iphdr, protocol)), |
| /* r0 = ip->proto */ |
| BPF_STX_MEM(BPF_W, BPF_REG_10, BPF_REG_0, -4), |
| /* *(u32 *)(fp - 4) = r0 */ |
| BPF_MOV64_REG(BPF_REG_2, BPF_REG_10), /* r2 = fp */ |
| BPF_ALU64_IMM(BPF_ADD, BPF_REG_2, -4), /* r2 = r2 - 4 */ |
| BPF_LD_MAP_FD(BPF_REG_1, map_fd), /* r1 = map_fd */ |
| BPF_CALL_FUNC(BPF_FUNC_map_lookup_elem), |
| /* r0 = map_lookup(r1, r2) */ |
| BPF_JMP_IMM(BPF_JEQ, BPF_REG_0, 0, 2), |
| /* if (r0 == 0) goto pc+2 */ |
| BPF_MOV64_IMM(BPF_REG_1, 1), /* r1 = 1 */ |
| BPF_XADD(BPF_DW, BPF_REG_0, BPF_REG_1, 0, 0), |
| /* lock *(u64 *) r0 += r1 */ |
| .\" == atomic64_add |
| BPF_MOV64_IMM(BPF_REG_0, 0), /* r0 = 0 */ |
| BPF_EXIT_INSN(), /* return r0 */ |
| }; |
| |
| prog_fd = bpf_prog_load(BPF_PROG_TYPE_SOCKET_FILTER, prog, |
| sizeof(prog), "GPL"); |
| |
| sock = open_raw_sock("lo"); |
| |
| assert(setsockopt(sock, SOL_SOCKET, SO_ATTACH_BPF, &prog_fd, |
| sizeof(prog_fd)) == 0); |
| |
| for (;;) { |
| key = IPPROTO_TCP; |
| assert(bpf_lookup_elem(map_fd, &key, &tcp_cnt) == 0); |
| key = IPPROTO_UDP; |
| assert(bpf_lookup_elem(map_fd, &key, &udp_cnt) == 0); |
| printf("TCP %lld UDP %lld packets\\n", tcp_cnt, udp_cnt); |
| sleep(1); |
| } |
| |
| return 0; |
| } |
| .EE |
| .PP |
| Some complete working code can be found in the |
| .IR samples/bpf |
| directory in the kernel source tree. |
| .SH RETURN VALUE |
| For a successful call, the return value depends on the operation: |
| .TP |
| .B BPF_MAP_CREATE |
| The new file descriptor associated with the eBPF map. |
| .TP |
| .B BPF_PROG_LOAD |
| The new file descriptor associated with the eBPF program. |
| .TP |
| All other commands |
| Zero. |
| .PP |
| On error, \-1 is returned, and |
| .I errno |
| is set appropriately. |
| .SH ERRORS |
| .TP |
| .BR E2BIG |
| The eBPF program is too large or a map reached the |
| .I max_entries |
| limit (maximum number of elements). |
| .TP |
| .BR EACCES |
| For |
| .BR BPF_PROG_LOAD, |
| even though all program instructions are valid, the program has been |
| rejected because it was deemed unsafe. |
| This may be because it may have |
| accessed a disallowed memory region or an uninitialized stack/register or |
| because the function constraints don't match the actual types or because |
| there was a misaligned memory access. |
| In this case, it is recommended to call |
| .BR bpf () |
| again with |
| .I log_level = 1 |
| and examine |
| .I log_buf |
| for the specific reason provided by the verifier. |
| .TP |
| .B EBADF |
| .I fd |
| is not an open file descriptor. |
| .TP |
| .B EFAULT |
| One of the pointers |
| .RI ( key |
| or |
| .I value |
| or |
| .I log_buf |
| or |
| .IR insns ) |
| is outside the accessible address space. |
| .TP |
| .B EINVAL |
| The value specified in |
| .I cmd |
| is not recognized by this kernel. |
| .TP |
| .B EINVAL |
| For |
| .BR BPF_MAP_CREATE , |
| either |
| .I map_type |
| or attributes are invalid. |
| .TP |
| .B EINVAL |
| For |
| .BR BPF_MAP_*_ELEM |
| commands, |
| some of the fields of |
| .I "union bpf_attr" |
| that are not used by this command |
| are not set to zero. |
| .TP |
| .B EINVAL |
| For |
| .BR BPF_PROG_LOAD, |
| indicates an attempt to load an invalid program. |
| eBPF programs can be deemed |
| invalid due to unrecognized instructions, the use of reserved fields, jumps |
| out of range, infinite loops or calls of unknown functions. |
| .TP |
| .BR ENOENT |
| For |
| .B BPF_MAP_LOOKUP_ELEM |
| or |
| .BR BPF_MAP_DELETE_ELEM , |
| indicates that the element with the given |
| .I key |
| was not found. |
| .TP |
| .B ENOMEM |
| Cannot allocate sufficient memory. |
| .TP |
| .B EPERM |
| The call was made without sufficient privilege |
| (without the |
| .B CAP_SYS_ADMIN |
| capability). |
| .SH VERSIONS |
| The |
| .BR bpf () |
| system call first appeared in Linux 3.18. |
| .SH CONFORMING TO |
| The |
| .BR bpf () |
| system call is Linux-specific. |
| .SH NOTES |
| In the current implementation, all |
| .BR bpf () |
| commands require the caller to have the |
| .B CAP_SYS_ADMIN |
| capability. |
| .PP |
| eBPF objects (maps and programs) can be shared between processes. |
| For example, after |
| .BR fork (2), |
| the child inherits file descriptors referring to the same eBPF objects. |
| In addition, file descriptors referring to eBPF objects can be |
| transferred over UNIX domain sockets. |
| File descriptors referring to eBPF objects can be duplicated |
| in the usual way, using |
| .BR dup (2) |
| and similar calls. |
| An eBPF object is deallocated only after all file descriptors |
| referring to the object have been closed. |
| .PP |
| eBPF programs can be written in a restricted C that is compiled (using the |
| .B clang |
| compiler) into eBPF bytecode. |
| Various features are omitted from this restricted C, such as loops, |
| global variables, variadic functions, floating-point numbers, |
| and passing structures as function arguments. |
| Some examples can be found in the |
| .I samples/bpf/*_kern.c |
| files in the kernel source tree. |
| .\" There are also examples for the tc classifier, in the iproute2 |
| .\" project, in examples/bpf |
| .PP |
| The kernel contains a just-in-time (JIT) compiler that translates |
| eBPF bytecode into native machine code for better performance. |
| In kernels before Linux 4.15, |
| the JIT compiler is disabled by default, |
| but its operation can be controlled by writing one of the |
| following integer strings to the file |
| .IR /proc/sys/net/core/bpf_jit_enable : |
| .IP 0 3 |
| Disable JIT compilation (default). |
| .IP 1 |
| Normal compilation. |
| .IP 2 |
| Debugging mode. |
| The generated opcodes are dumped in hexadecimal into the kernel log. |
| These opcodes can then be disassembled using the program |
| .IR tools/net/bpf_jit_disasm.c |
| provided in the kernel source tree. |
| .PP |
| Since Linux 4.15, |
| .\" commit 290af86629b25ffd1ed6232c4e9107da031705cb |
| the kernel may configured with the |
| .B CONFIG_BPF_JIT_ALWAYS_ON |
| option. |
| In this case, the JIT compiler is always enabled, and the |
| .I bpf_jit_enable |
| is initialized to 1 and is immutable. |
| (This kernel configuration option was provided as a mitigation for |
| one of the Spectre attacks against the BPF interpreter.) |
| .PP |
| The JIT compiler for eBPF is currently |
| .\" last reviewed in Linux 4.16-rc by grepping for BPF_ALU64 |
| available for the following architectures: |
| .IP * 3 |
| x86-64 (since Linux 3.18); |
| .PD 0 |
| .IP * |
| ARM-64 (since Linux 3.18); |
| .IP * |
| s390 (since Linux 4.1); |
| .IP * |
| PowerPC 64 (since Linux 4.8); |
| .IP * |
| SPARC 64 (since Linux 4.12); |
| .IP * |
| MIPS (since Linux 4.13); |
| .IP * |
| ARM32 (since Linux 4.14). |
| .PD |
| .SH SEE ALSO |
| .BR seccomp (2), |
| .BR socket (7), |
| .BR tc (8), |
| .BR tc-bpf (8) |
| .PP |
| Both classic and extended BPF are explained in the kernel source file |
| .IR Documentation/networking/filter.txt . |