Commit message (author, date; files changed, lines -/+)
* bpf, sockmap: convert to generic sk_msg interface (Daniel Borkmann, 2018-10-15; 17 files, -2860/+2925)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add a generic sk_msg layer, and convert current sockmap and later kTLS over to make use of it. While sk_buff handles network packet representation from netdevice up to socket, sk_msg handles data representation from application to socket layer. This means that sk_msg framework spans across ULP users in the kernel, and enables features such as introspection or filtering of data with the help of BPF programs that operate on this data structure. Latter becomes in particular useful for kTLS where data encryption is deferred into the kernel, and as such enabling the kernel to perform L7 introspection and policy based on BPF for TLS connections where the record is being encrypted after BPF has run and came to a verdict. In order to get there, first step is to transform open coding of scatter-gather list handling into a common core framework that subsystems can use. The code itself has been split and refactored into three bigger pieces: i) the generic sk_msg API which deals with managing the scatter gather ring, providing helpers for walking and mangling, transferring application data from user space into it, and preparing it for BPF pre/post-processing, ii) the plain sock map itself where sockets can be attached to or detached from; these bits are independent of i) which can now be used also without sock map, and iii) the integration with plain TCP as one protocol to be used for processing L7 application data (later this could e.g. also be extended to other protocols like UDP). The semantics are the same with the old sock map code and therefore no change of user facing behavior or APIs. While pursuing this work it also helped finding a number of bugs in the old sockmap code that we've fixed already in earlier commits. The test_sockmap kselftest suite passes through fine as well. Joint work with John. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* tcp, ulp: remove ulp bits from sockmap (Daniel Borkmann, 2018-10-15; 3 files, -88/+23)
| | | | | | | | | | | | In order to prepare sockmap logic to be used in combination with kTLS we need to detangle it from ULP, and further split it in later commits into a generic API. Joint work with John. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* tcp, ulp: enforce sock_owned_by_me upon ulp init and cleanup (Daniel Borkmann, 2018-10-15; 1 file, -0/+4)
| | | | | | | | | | | | | Whenever the ULP data on the socket is mangled, enforce that the caller has the socket lock held as otherwise things may race with initialization and cleanup callbacks from ulp ops as both would mangle internal socket state. Joint work with John. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* bpf: Fix dev pointer dereference from sk_skb (Joe Stringer, 2018-10-14; 1 file, -1/+4)
| | | | | | | | | | | | | | | Dan Carpenter reports: The patch 6acc9b432e67: "bpf: Add helper to retrieve socket in BPF" from Oct 2, 2018, leads to the following Smatch complaint: net/core/filter.c:4893 bpf_sk_lookup() error: we previously assumed 'skb->dev' could be null (see line 4885) Fix this issue by checking skb->dev before using it. Signed-off-by: Joe Stringer <joe@wand.net.nz> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
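    For illustration, a minimal sketch of the guard pattern the fix applies
    (the function name here is invented; the real change is in bpf_sk_lookup()
    in net/core/filter.c):

        /* Only dereference skb->dev once it has been checked, mirroring the
         * NULL assumption that Smatch flagged a few lines earlier. */
        static struct net *net_from_skb(const struct sk_buff *skb)
        {
                struct net_device *dev = skb->dev;

                if (!dev)               /* skb->dev can legitimately be NULL */
                        return NULL;    /* caller bails out in that case */
                return dev_net(dev);    /* safe: dev checked above */
        }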
* bpf: wait for running BPF programs when updating map-in-map (Daniel Colascione, 2018-10-13; 1 file, -0/+13)
| | | | | | | | | | | | | | | | The map-in-map frequently serves as a mechanism for atomic snapshotting of state that a BPF program might record. The current implementation is dangerous to use in this way, however, since userspace has no way of knowing when all programs that might have retrieved the "old" value of the map may have completed. This change ensures that map update operations on map-in-map map types always wait for all references to the old map to drop before returning to userspace. Signed-off-by: Daniel Colascione <dancol@google.com> Reviewed-by: Joel Fernandes (Google) <joel@joelfernandes.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
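    As a rough sketch of the idea (simplified, with an invented helper name,
    not the exact kernel code): BPF programs run under RCU, so waiting for an
    RCU grace period after the update guarantees that no program can still
    hold a reference to the old inner map:

        static void wait_for_old_map_refs(const struct bpf_map *map)
        {
                /* Only map-in-map types hand out references to an inner map,
                 * so only they need the grace period before returning to
                 * user space. */
                if (map->map_type == BPF_MAP_TYPE_HASH_OF_MAPS ||
                    map->map_type == BPF_MAP_TYPE_ARRAY_OF_MAPS)
                        synchronize_rcu();
        }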
* selftests: bpf: install script with_addr.sh (Anders Roxell, 2018-10-11; 1 file, -0/+2)
| | | | | | | | | | | | | | | | When test_flow_dissector.sh runs it complains that it can't find script with_addr.sh: ./test_flow_dissector.sh: line 81: ./with_addr.sh: No such file or directory Rework so that with_addr.sh gets installed, add it to TEST_PROGS_EXTENDED variable. Fixes: 50b3ed57dee9 ("selftests/bpf: test bpf flow dissection") Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* selftests: bpf: add config fragment LWTUNNEL (Anders Roxell, 2018-10-11; 1 file, -0/+1)
| | | | | | | | | | | | When test_lwt_seg6local.sh was added commit c99a84eac026 ("selftests/bpf: test for seg6local End.BPF action") config fragment wasn't added, and without CONFIG_LWTUNNEL enabled we see this: Error: CONFIG_LWTUNNEL is not enabled in this kernel. selftests: test_lwt_seg6local [FAILED] Signed-off-by: Anders Roxell <anders.roxell@linaro.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpftool: Allow add linker flags via EXTRA_LDFLAGS variable (Jiri Olsa, 2018-10-11; 1 file, -1/+4)
    Add an EXTRA_LDFLAGS variable that lets the user pass extra flags into the
    LDFLAGS variable. Also add LDFLAGS to the build command line.
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* bpftool: Allow to add compiler flags via EXTRA_CFLAGS variable (Jiri Olsa, 2018-10-11; 1 file, -0/+4)
    Add an EXTRA_CFLAGS variable that lets the user pass extra flags into the
    CFLAGS variable.
    Signed-off-by: Jiri Olsa <jolsa@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* tools/bpf: use proper type and uapi perf_event.h header for libbpf (Yonghong Song, 2018-10-10; 2 files, -5/+5)
    Use __u32 instead of u32 in libbpf.c, and use the uapi perf_event.h header
    instead of tools/perf/perf-sys.h.
    Signed-off-by: Yonghong Song <yhs@fb.com>
    Acked-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
* Merge branch 'xdp-vlan' (Alexei Starovoitov, 2018-10-10; 5 files, -2/+509)
    Jesper Dangaard Brouer says:
    ====================
    While implementing PoC building blocks for eBPF code XDP+TC that can
    manipulate VLAN headers, I discovered a bug in generic-XDP. The fix should
    be backported to stable kernels. Even though generic-XDP was introduced in
    v4.12, I think the bug is not exposed until v4.14 in the mentioned fixes
    commit.
    ====================
    Acked-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * selftests/bpf: add XDP selftests for modifying and popping VLAN headers (Jesper Dangaard Brouer, 2018-10-10; 3 files, -2/+491)
    This XDP selftest also contains a small TC-bpf component. It provokes the
    generic-XDP bug fixed in the previous commit. The selftest itself shows how
    to do VLAN manipulation from XDP and TC: XDP ingress removes a VLAN tag,
    and TC egress adds it back.
    This use case originates from a production need at an ISP (kviknet.dk),
    which gets DSL lines terminated as VLAN Q-in-Q tagged packets and wants to
    avoid having a net_device for every end customer on the box doing the L2
    to L3 termination.
    The test setup uses a veth pair and two network namespaces (ns1 and ns2).
    'ns1' simulates the ISP network, which loads the BPF programs that strip
    and add VLAN IDs. 'ns2' simulates the DSL customer, which uses VLAN-tagged
    packets.
    Running the script with --interactive simply skips the cleanup function,
    effectively creating a test lab that the user can inspect and play with.
    The --verbose option requests that the shell print input lines as they are
    read; this includes comments, which in effect turns the comments into
    visible documentation.
    Reported-by: Yoel Caspersen <yoel@kviknet.dk>
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: make TC vlan bpf_helpers avail to selftests (Jesper Dangaard Brouer, 2018-10-10; 1 file, -0/+4)
| | | | | | | | | | | | | | | | The helper bpf_skb_vlan_push is needed by next patch, and the helper bpf_skb_vlan_pop is added for completeness, regarding VLAN helpers. Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
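    As a rough idea of how the newly exposed helper can be used from a TC
    egress program (the includes, section name and VLAN ID below are
    assumptions for this sketch, not taken from the selftest):

        #include <linux/bpf.h>
        #include <linux/if_ether.h>
        #include <linux/pkt_cls.h>
        #include "bpf_helpers.h"
        #include "bpf_endian.h"

        SEC("tc_vlan_push")
        int tc_egress_vlan_push(struct __sk_buff *skb)
        {
                /* Re-add an 802.1Q tag (example VLAN ID 4011) on egress. */
                if (bpf_skb_vlan_push(skb, bpf_htons(ETH_P_8021Q), 4011))
                        return TC_ACT_SHOT;
                return TC_ACT_OK;
        }

        char _license[] SEC("license") = "GPL";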
| * net: fix generic XDP to handle if eth header was mangled (Jesper Dangaard Brouer, 2018-10-10; 1 file, -0/+14)
    XDP can modify (and resize) the Ethernet header in the packet. There is a
    bug in generic-XDP because skb->protocol and skb->pkt_type are set up
    before reaching (netif_receive_)generic_xdp. The bug was hit when XDP
    popped VLAN headers (changing eth->h_proto): skb->protocol still contained
    the VLAN indication (ETH_P_8021Q), causing an invocation of
    skb_vlan_untag(skb), which corrupted the packet (basically popping the
    VLAN again). This patch catches the case where XDP changed the eth header
    in such a way that the SKB fields need to be updated.
    V2: on request from Song Liu, use ETH_HLEN instead of mac_len in
    __skb_push(), as eth_type_trans() uses ETH_HLEN in the paired
    skb_pull_inline().
    Fixes: d445516966dc ("net: xdp: support xdp generic on virtual devices")
    Signed-off-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
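    Conceptually, the repair step looks like the sketch below (the helper name
    is invented for illustration; the real change lives in the generic XDP
    path of net/core/dev.c):

        /* If the XDP program may have rewritten or resized the Ethernet
         * header, re-derive skb->protocol and skb->pkt_type from whatever
         * header is there now. */
        static void generic_xdp_fixup_skb(struct sk_buff *skb)
        {
                __skb_push(skb, ETH_HLEN);    /* expose the (new) eth header */
                skb->protocol = eth_type_trans(skb, skb->dev);
                /* eth_type_trans() pulls ETH_HLEN again and sets pkt_type */
        }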
* Merge branch 'unsupported-map-lookup' (Alexei Starovoitov, 2018-10-10; 7 files, -231/+389)
    Prashant Bhole says:
    ====================
    Currently, when a map lookup fails, the user space API cannot distinguish
    whether the given key was not found or lookup is not supported by the
    particular map.
    In this series we modify the return value of maps which do not support
    lookup: lookup on such a map implementation will return -EOPNOTSUPP, and
    the bpf() syscall with the BPF_MAP_LOOKUP_ELEM command will set the
    EOPNOTSUPP errno. We also handle this error in bpftool to print an
    appropriate message.
    Patch 1: adds handling of the BPF_MAP_LOOKUP_ELEM command of the bpf
    syscall such that errno is set to EOPNOTSUPP when the map doesn't support
    lookup.
    Patch 2: modifies the return value of map_lookup_elem() to EOPNOTSUPP for
    maps which do not support lookup.
    Patch 3: splits do_dump() in bpftool/map.c. Element printing code is moved
    out into a new function dump_map_elem(). This was done in order to reduce
    deep indentation and accommodate further changes.
    Patch 4: changes bpftool to print a strerror() message when a lookup error
    occurs. This results in an appropriate message like "Operation not
    supported" when the map doesn't support lookup.
    Patch 5: test_verifier: change the fixup map naming convention as
    suggested by Alexei.
    Patch 6: adds verifier tests to check that the verifier rejects calls to
    bpf_map_lookup_elem from a bpf program for all map types that do not
    support map lookup.
    ====================
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * selftests/bpf: test_verifier, check bpf_map_lookup_elem access in bpf prog (Prashant Bhole, 2018-10-10; 1 file, -1/+120)
    map_lookup_elem isn't supported by certain map types, such as:
    - BPF_MAP_TYPE_PROG_ARRAY
    - BPF_MAP_TYPE_STACK_TRACE
    - BPF_MAP_TYPE_XSKMAP
    - BPF_MAP_TYPE_SOCKMAP/BPF_MAP_TYPE_SOCKHASH
    Let's add verifier tests to check that the verifier prevents a
    bpf_map_lookup_elem call on the above map types from a bpf program.
    Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
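    For illustration, a C-level equivalent of what these test cases exercise
    might look like the sketch below (map and program names are made up, and
    the usual selftests includes such as linux/bpf.h and bpf_helpers.h are
    assumed; the actual tests are written as raw BPF instructions in
    test_verifier.c, and the verifier is expected to refuse to load such a
    program):

        struct bpf_map_def SEC("maps") jmp_table = {
                .type        = BPF_MAP_TYPE_PROG_ARRAY,
                .key_size    = sizeof(__u32),
                .value_size  = sizeof(__u32),
                .max_entries = 4,
        };

        SEC("socket")
        int lookup_on_prog_array(struct __sk_buff *skb)
        {
                __u32 key = 0;
                /* the verifier rejects bpf_map_lookup_elem() on a prog array */
                void *val = bpf_map_lookup_elem(&jmp_table, &key);

                return val ? 1 : 0;
        }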
| * selftests/bpf: test_verifier, change names of fixup maps (Prashant Bhole, 2018-10-10; 1 file, -190/+190)
    Currently the fixup maps are named fixup_map1, fixup_map2, and so on. As
    suggested by Alexei, let's change the map names so that the map type can
    be identified from the name. This patch is basically a find-and-replace
    change:
    fixup_map1 -> fixup_map_hash_8b
    fixup_map2 -> fixup_map_hash_48b
    fixup_map3 -> fixup_map_hash_16b
    fixup_map4 -> fixup_map_array_48b
    Suggested-by: Alexei Starovoitov <ast@kernel.org>
    Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * tools/bpf: bpftool, print strerror when map lookup error occurs (Prashant Bhole, 2018-10-10; 1 file, -5/+24)
    Since a map lookup error can be ENOENT or EOPNOTSUPP, let's print
    strerror() as the error message in normal and JSON output. This patch adds
    a helper function print_entry_error() to print an entry when a lookup
    error occurs.
    Example: the following dumps a map which does not support lookup.
    Output before:
    root# bpftool map -jp dump id 40
    [ "key": ["0x0a","0x00","0x00","0x00" ],
      "value": { "error": "can\'t lookup element" },
      "key": ["0x0b","0x00","0x00","0x00" ],
      "value": { "error": "can\'t lookup element" } ]
    root# bpftool map dump id 40
    can't lookup element with key: 0a 00 00 00
    can't lookup element with key: 0b 00 00 00
    Found 0 elements
    Output after the changes:
    root# bpftool map dump -jp id 45
    [ "key": ["0x0a","0x00","0x00","0x00" ],
      "value": { "error": "Operation not supported" },
      "key": ["0x0b","0x00","0x00","0x00" ],
      "value": { "error": "Operation not supported" } ]
    root# bpftool map dump id 45
    key: 0a 00 00 00 value: Operation not supported
    key: 0b 00 00 00 value: Operation not supported
    Found 0 elements
    Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
    Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
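    A simplified sketch of the plain-output side of such a helper (the real
    print_entry_error() in bpftool also handles JSON output; this assumes
    <stdio.h> and <string.h>):

        static void print_entry_error(int lookup_errno,
                                      const unsigned char *key,
                                      unsigned int key_size)
        {
                unsigned int i;

                printf("key:");
                for (i = 0; i < key_size; i++)
                        printf(" %02x", key[i]);
                /* e.g. "Operation not supported" for EOPNOTSUPP */
                printf("  value: %s\n", strerror(lookup_errno));
        }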
| * tools/bpf: bpftool, split the function do_dump() (Prashant Bhole, 2018-10-10; 1 file, -34/+49)
| | | | | | | | | | | | | | | | | | | | | | | | do_dump() function in bpftool/map.c has deep indentations. In order to reduce deep indent, let's move element printing code out of do_dump() into dump_map_elem() function. Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: return EOPNOTSUPP when map lookup isn't supported (Prashant Bhole, 2018-10-10; 4 files, -4/+4)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Return ERR_PTR(-EOPNOTSUPP) from map_lookup_elem() methods of below map types: - BPF_MAP_TYPE_PROG_ARRAY - BPF_MAP_TYPE_STACK_TRACE - BPF_MAP_TYPE_XSKMAP - BPF_MAP_TYPE_SOCKMAP/BPF_MAP_TYPE_SOCKHASH Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
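    The pattern is essentially a one-line lookup callback per affected map
    type, roughly (sketch):

        /* .map_lookup_elem callback for map types that cannot be read from
         * user space: report "not supported" instead of "key not found". */
        static void *prog_array_map_lookup_elem(struct bpf_map *map, void *key)
        {
                return ERR_PTR(-EOPNOTSUPP);
        }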
| * bpf: error handling when map_lookup_elem isn't supported (Prashant Bhole, 2018-10-10; 1 file, -2/+7)
    The error value returned by map_lookup_elem doesn't differentiate whether
    the lookup failed because of an invalid key or because lookup is not
    supported. Let's add handling for the -EOPNOTSUPP return value of a map's
    map_lookup_elem() method, with the expectation that a map implementation
    returns -EOPNOTSUPP if lookup is not supported. The errno for the bpf
    syscall BPF_MAP_LOOKUP_ELEM command will then be set to EOPNOTSUPP if map
    lookup is not supported.
    Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp>
    Acked-by: Alexei Starovoitov <ast@kernel.org>
    Acked-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Alexei Starovoitov <ast@kernel.org>
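    Schematically, the syscall side then distinguishes the two cases roughly
    as in this simplified sketch (not the literal kernel/bpf/syscall.c code;
    the function name is invented):

        static int lookup_copy_value(struct bpf_map *map, void *key, void *value)
        {
                void *ptr = map->ops->map_lookup_elem(map, key);

                if (IS_ERR(ptr))
                        return PTR_ERR(ptr);  /* e.g. -EOPNOTSUPP: unsupported */
                if (!ptr)
                        return -ENOENT;       /* key genuinely not present */
                memcpy(value, ptr, map->value_size);
                return 0;
        }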
* bpf: btf: Fix a missing check bug (Wenwen Wang, 2018-10-10; 1 file, -0/+3)
| | | | | | | | | | | | | | | | | | | | In btf_parse_hdr(), the length of the btf data header is firstly copied from the user space to 'hdr_len' and checked to see whether it is larger than 'btf_data_size'. If yes, an error code EINVAL is returned. Otherwise, the whole header is copied again from the user space to 'btf->hdr'. However, after the second copy, there is no check between 'btf->hdr->hdr_len' and 'hdr_len' to confirm that the two copies get the same value. Given that the btf data is in the user space, a malicious user can race to change the data between the two copies. By doing so, the user can provide malicious data to the kernel and cause undefined behavior. This patch adds a necessary check after the second copy, to make sure 'btf->hdr->hdr_len' has the same value as 'hdr_len'. Otherwise, an error code EINVAL will be returned. Signed-off-by: Wenwen Wang <wang6495@umn.edu> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
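    The added check boils down to comparing the two copies of the length
    field, roughly as below (a sketch; the helper and parameter names are
    chosen here for illustration):

        static int btf_hdr_len_consistent(const struct btf_header *hdr,
                                          u32 hdr_len_first_copy)
        {
                /* User space could have changed the header between the two
                 * copy_from_user() calls (a double-fetch race); reject any
                 * mismatch between the validated length and the second copy. */
                if (hdr->hdr_len != hdr_len_first_copy)
                        return -EINVAL;
                return 0;
        }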
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next (David S. Miller, 2018-10-09; 66 files, -663/+4140)
    Alexei Starovoitov says:
    ====================
    pull-request: bpf-next 2018-10-08
    The following pull-request contains BPF updates for your *net-next* tree.
    The main changes are:
    1) sk_lookup_[tcp|udp] and sk_release helpers from Joe Stringer, which
       allow BPF programs to perform lookups for sockets in a network
       namespace. This allows programs to determine early on in processing
       whether the stack is expecting to receive the packet, and perform some
       action (e.g. drop, forward somewhere) based on this information (see
       the sketch after this entry).
    2) per-cpu cgroup local storage from Roman Gushchin. Per-cpu cgroup local
       storage is very similar to simple cgroup storage except all the data is
       per-cpu. The main goal of the per-cpu variant is to implement super
       fast counters (e.g. packet counters), which require neither lookups nor
       atomic operations in the fast path. An example of these hybrid counters
       is in selftests/bpf/netcnt_prog.c.
    3) allow HW offload of programs with BPF-to-BPF function calls, from
       Quentin Monnet.
    4) support more than 64-byte key/value in HW offloaded BPF maps, from
       Jakub Kicinski.
    5) rename of libbpf interfaces, from Andrey Ignatov. libbpf is maturing as
       a library and should follow good practices in library design and
       implementation to play well with other libraries. This patch set brings
       a consistent naming convention to global symbols.
    6) relicense libbpf as LGPL-2.1 OR BSD-2-Clause, from Alexei Starovoitov,
       to let Apache2 projects use libbpf.
    7) various AF_XDP fixes from Björn and Magnus.
    ====================
    Signed-off-by: David S. Miller <davem@davemloft.net>
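    To give a feel for item 1), a minimal TC-classifier style sketch (tuple
    filling is omitted, the section name and drop policy are assumptions, and
    the helper prototypes are taken to come from the selftests' bpf_helpers.h):

        SEC("classifier")
        int drop_if_no_socket(struct __sk_buff *skb)
        {
                struct bpf_sock_tuple tuple = {};
                struct bpf_sock *sk;

                /* ... fill tuple.ipv4 from the parsed packet headers ... */
                sk = bpf_sk_lookup_tcp(skb, &tuple, sizeof(tuple.ipv4), 0, 0);
                if (!sk)
                        return TC_ACT_SHOT;  /* stack is not expecting this packet */
                bpf_sk_release(sk);          /* acquired references must be released */
                return TC_ACT_OK;
        }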
| * bpf: fix building without CONFIG_INET (Arnd Bergmann, 2018-10-09; 1 file, -2/+8)
    The newly added TCP and UDP handling fails to link when CONFIG_INET is
    disabled:
    net/core/filter.o: In function `sk_lookup':
    filter.c:(.text+0x7ff8): undefined reference to `tcp_hashinfo'
    filter.c:(.text+0x7ffc): undefined reference to `tcp_hashinfo'
    filter.c:(.text+0x8020): undefined reference to `__inet_lookup_established'
    filter.c:(.text+0x8058): undefined reference to `__inet_lookup_listener'
    filter.c:(.text+0x8068): undefined reference to `udp_table'
    filter.c:(.text+0x8070): undefined reference to `udp_table'
    filter.c:(.text+0x808c): undefined reference to `__udp4_lib_lookup'
    net/core/filter.o: In function `bpf_sk_release':
    filter.c:(.text+0x82e8): undefined reference to `sock_gen_put'
    Wrap the related sections of code in #ifdefs for the config option.
    Furthermore, sk_lookup() should always have been marked 'static'; this
    also avoids a warning about a missing prototype when building with
    'make W=1'.
    Fixes: 6acc9b432e67 ("bpf: Add helper to retrieve socket in BPF")
    Signed-off-by: Arnd Bergmann <arnd@arndb.de>
    Signed-off-by: Joe Stringer <joe@wand.net.nz>
    Acked-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * Merge branch 'bpf-to-bpf-calls-nfp' (Daniel Borkmann, 2018-10-08; 10 files, -46/+589)
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Quentin Monnet says: ==================== This patch series adds support for hardware offload of programs containing BPF-to-BPF function calls. First, a new callback is added to the kernel verifier, to collect information after the main part of the verification has been performed. Then support for BPF-to-BPF calls is incrementally added to the nfp driver, before offloading programs containing such calls is eventually allowed by lifting the restriction in the kernel verifier, in the last patch. Please refer to individual patches for details. Many thanks to Jiong and Jakub for their precious help and contribution on the main patches for the JIT-compiler, and everything related to stack accesses. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * bpf: allow offload of programs with BPF-to-BPF function calls (Quentin Monnet, 2018-10-08; 1 file, -7/+3)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that there is at least one driver supporting BPF-to-BPF function calls, lift the restriction, in the verifier, on hardware offload of eBPF programs containing such calls. But prevent jit_subprogs(), still in the verifier, from being run for offloaded programs. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * nfp: bpf: support pointers to other stack frames for BPF-to-BPF calls (Quentin Monnet, 2018-10-08; 3 files, -1/+6)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark instructions that use pointers to areas in the stack outside of the current stack frame, and process them accordingly in mem_op_stack(). This way, we also support BPF-to-BPF calls where the caller passes a pointer to data in its own stack frame to the callee (typically, when the caller passes an address to one of its local variables located in the stack, as an argument). Thanks to Jakub and Jiong for figuring out how to deal with this case, I just had to turn their email discussion into this patch. Suggested-by: Jiong Wang <jiong.wang@netronome.com> Suggested-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * nfp: bpf: optimise save/restore for R6~R9 based on register usage (Quentin Monnet, 2018-10-08; 3 files, -23/+78)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When pre-processing the instructions, it is trivial to detect what subprograms are using R6, R7, R8 or R9 as destination registers. If a subprogram uses none of those, then we do not need to jump to the subroutines dedicated to saving and restoring callee-saved registers in its prologue and epilogue. This patch introduces detection of callee-saved registers in subprograms and prevents the JIT from adding calls to those subroutines whenever we can: we save some instructions in the translated program, and some time at runtime on BPF-to-BPF calls and returns. If no subprogram needs to save those registers, we can avoid appending the subroutines at the end of the program. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * nfp: bpf: fix return address from register-saving subroutine to callee (Quentin Monnet, 2018-10-08; 1 file, -1/+27)
    On performing a BPF-to-BPF call, we first jump to a subroutine that pushes
    callee-saved registers (R6~R9) to the stack, and from there we then go to
    the start of the callee. In order to do so, the caller must pass to the
    subroutine the address of the NFP instruction to jump to at the end of
    that subroutine. This cannot be reliably implemented when translating the
    caller, as we do not always know the start offset of the callee yet. This
    patch implements the required fixup step for passing the start offset of
    the callee via the register used by the subroutine to hold its return
    address.
    Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
    Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * nfp: bpf: update fixup function for BPF-to-BPF calls support (Quentin Monnet, 2018-10-08; 2 files, -3/+24)
    Relocation for targets of BPF-to-BPF calls is required at the end of
    translation. Update the nfp_fixup_branches() function in that regard.
    When checking that the last instruction of each block is a branch, we must
    account for the length of the instructions required to pop the return
    address from the stack.
    Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com>
    Signed-off-by: Jiong Wang <jiong.wang@netronome.com>
    Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * nfp: bpf: account for additional stack usage when checking stack limit (Quentin Monnet, 2018-10-08; 2 files, -8/+68)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Offloaded programs using BPF-to-BPF calls use the stack to store the return address when calling into a subprogram. Callees also need some space to save eBPF registers R6 to R9. And contrarily to kernel verifier, we align stack frames on 64 bytes (and not 32). Account for all this when checking the stack size limit before JIT-ing the program. This means we have to recompute maximum stack usage for the program, we cannot get the value from the kernel. In addition to adapting the checks on stack usage, move them to the finalize() callback, now that we have it and because such checks are part of the verification step rather than translation. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * nfp: bpf: add main logics for BPF-to-BPF calls support in nfp driver (Quentin Monnet, 2018-10-08; 5 files, -4/+295)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the main patch for the logics of BPF-to-BPF calls in the nfp driver. The functions called on BPF_JUMP | BPF_CALL and BPF_JUMP | BPF_EXIT were used to call helpers and exit from the program, respectively; make them usable for calling into, or returning from, a BPF subprogram as well. For all calls, push the return address as well as the callee-saved registers (R6 to R9) to the stack, and pop them upon returning from the calls. In order to limit the overhead in terms of instruction number, this is done through dedicated subroutines. Jumping to the callee actually consists in jumping to the subroutine, that "returns" to the callee: this will require some fixup for passing the address in a later patch. Similarly, returning consists in jumping to the subroutine, which pops registers and then return directly to the caller (but no fixup is needed here). Return to the caller is performed with the RTN instruction newly added to the JIT. For the few steps where we need to know what subprogram an instruction belongs to, the struct nfp_insn_meta is extended with a new subprog_idx field. Note that checks on the available stack size, to take into account the additional requirements associated to BPF-to-BPF calls (storing R6-R9 and return addresses), are added in a later patch. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Jiong Wang <jiong.wang@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * nfp: bpf: account for BPF-to-BPF calls when preparing nfp JIT (Quentin Monnet, 2018-10-08; 2 files, -11/+27)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similarly to "exit" or "helper call" instructions, BPF-to-BPF calls will require additional processing before translation starts, in order to record and mark jump destinations. We also mark the instructions where each subprogram begins. This will be used in a following commit to determine where to add prologues for subprograms. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * nfp: bpf: ignore helper-related checks for BPF calls in nfp verifier (Quentin Monnet, 2018-10-08; 2 files, -4/+13)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The checks related to eBPF helper calls are performed each time the nfp driver meets a BPF_JUMP | BPF_CALL instruction. However, these checks are not relevant for BPF-to-BPF call (same instruction code, different value in source register), so just skip the checks for such calls. While at it, rename the function that runs those checks to make it clear they apply to _helper_ calls only. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * nfp: bpf: copy eBPF subprograms information from kernel verifier (Quentin Monnet, 2018-10-08; 3 files, -0/+29)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to support BPF-to-BPF calls in offloaded programs, the nfp driver must collect information about the distinct subprograms: namely, the number of subprograms composing the complete program and the stack depth of those subprograms. The latter in particular is non-trivial to collect, so we copy those elements from the kernel verifier via the newly added post-verification hook. The struct nfp_prog is extended to store this information. Stack depths are stored in an array of dedicated structs. Subprogram start indexes are not collected. Instead, meta instructions associated to the start of a subprogram will be marked with a flag in a later patch. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * nfp: bpf: rename nfp_prog->stack_depth as nfp_prog->stack_frame_depth (Quentin Monnet, 2018-10-08; 3 files, -8/+8)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for support for BPF to BPF calls in offloaded programs, rename the "stack_depth" field of the struct nfp_prog as "stack_frame_depth". This is to make it clear that the field refers to the maximum size of the current stack frame (as opposed to the maximum size of the whole stack memory). Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * bpf: add verifier callback to get stack usage info for offloaded progs (Quentin Monnet, 2018-10-08; 6 files, -2/+37)
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In preparation for BPF-to-BPF calls in offloaded programs, add a new function attribute to the struct bpf_prog_offload_ops so that drivers supporting eBPF offload can hook at the end of program verification, and potentially extract information collected by the verifier. Implement a minimal callback (returning 0) in the drivers providing the structs, namely netdevsim and nfp. This will be useful in the nfp driver, in later commits, to extract the number of subprograms as well as the stack depth for those subprograms. Signed-off-by: Quentin Monnet <quentin.monnet@netronome.com> Reviewed-by: Jiong Wang <jiong.wang@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * bpf, doc: Document Jump X addressing mode (Arthur Fabre, 2018-10-08; 1 file, -14/+16)
| | | | | | | | | | | | | | | | | | | | | | | | bpf_asm and the other classic BPF tools support jump conditions comparing register A to register X, in addition to comparing register A with constant K. Only the latter was documented in filter.txt, add two new addressing modes that describe the former. Signed-off-by: Arthur Fabre <arthur@arthurfabre.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * libbpf: relicense libbpf as LGPL-2.1 OR BSD-2-Clause (Alexei Starovoitov, 2018-10-08; 13 files, -62/+13)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | libbpf is maturing as a library and gaining features that no other bpf libraries support (BPF Type Format, bpf to bpf calls, etc) Many Apache2 licensed projects (like bcc, bpftrace, gobpf, cilium, etc) would like to use libbpf, but cannot do this yet, since Apache Foundation explicitly states that LGPL is incompatible with Apache2. Hence let's relicense libbpf as dual license LGPL-2.1 or BSD-2-Clause, since BSD-2 is compatible with Apache2. Dual LGPL or Apache2 is invalid combination. Fix license mistake in Makefile as well. Signed-off-by: Alexei Starovoitov <ast@kernel.org> Acked-by: Andrey Ignatov <rdna@fb.com> Acked-by: Arnaldo Carvalho de Melo <acme@kernel.org> Acked-by: Björn Töpel <bjorn.topel@intel.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: David Beckett <david.beckett@netronome.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Joe Stringer <joe@ovn.org> Acked-by: John Fastabend <john.fastabend@gmail.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Acked-by: Quentin Monnet <quentin.monnet@netronome.com> Acked-by: Thomas Graf <tgraf@suug.ch> Acked-by: Roman Gushchin <guro@fb.com> Acked-by: Wang Nan <wangnan0@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
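    In practice the dual license is expressed with a per-file SPDX tag, e.g.
    (illustrative; exact placement varies per file):

        // SPDX-License-Identifier: (LGPL-2.1 OR BSD-2-Clause)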
| * xsk: proper AF_XDP socket teardown ordering (Björn Töpel, 2018-10-08; 2 files, -13/+11)
    The AF_XDP socket struct can exist in three different, implicit states:
    setup, bound and released. Setup is prior to the socket having been bound
    to a device. Bound is when the socket is active for receive and send.
    Released is when the process/userspace side of the socket is released, but
    the sock object is still lingering, e.g. when there is a reference to the
    socket in an XSKMAP after process termination.
    The Rx fast-path code uses the "dev" member of struct xdp_sock to check
    whether a socket is bound or released, and the Tx code uses the struct
    xdp_umem "xsk_list" member in conjunction with "dev" to determine the
    state of a socket.
    However, the transition from bound to released did not tear the socket
    down in the correct order. On the Rx side, "dev" was cleared after
    synchronize_net(), making the synchronization useless. On the Tx side, the
    internal queues were destroyed prior to removing them from the "xsk_list".
    This commit corrects the cleanup order, and by doing so xdp_del_sk_umem()
    can be simplified and one synchronize_net() can be removed.
    Fixes: 965a99098443 ("xsk: add support for bind for Rx")
    Fixes: ac98d8aab61b ("xsk: wire upp Tx zero-copy functions")
    Reported-by: Jesper Dangaard Brouer <brouer@redhat.com>
    Signed-off-by: Björn Töpel <bjorn.topel@intel.com>
    Acked-by: Song Liu <songliubraving@fb.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * Merge branch 'bpf-xsk-fix-mixed-mode' (Daniel Borkmann, 2018-10-05; 6 files, -41/+91)
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Magnus Karlsson says: ==================== Previously, the xsk code did not record which umem was bound to a specific queue id. This was not required if all drivers were zero-copy enabled as this had to be recorded in the driver anyway. So if a user tried to bind two umems to the same queue, the driver would say no. But if copy-mode was first enabled and then zero-copy mode (or the reverse order), we mistakenly enabled both of them on the same umem leading to buggy behavior. The main culprit for this is that we did not store the association of umem to queue id in the copy case and only relied on the driver reporting this. As this relation was not stored in the driver for copy mode (it does not rely on the AF_XDP NDOs), this obviously could not work. This patch fixes the problem by always recording the umem to queue id relationship in the netdev_queue and netdev_rx_queue structs. This way we always know what kind of umem has been bound to a queue id and can act appropriately at bind time. To make the bind semantics consistent with ethtool queue manipulations and to facilitate the implementation of drivers, we also forbid decreasing the number of queues/channels with ethtool if there is an active AF_XDP socket in the set of queues that are disabled. Jakub, please take a look at your patches. The last one I had to change slightly to make it fit with the new interface xdp_get_umem_from_qid(). An added bonus with this function is that we, in the future, can also use it from the driver to get a umem, thus simplifying driver implementations (and later remove the umem from the NDO completely). Björn will mail patches, at a later point in time, using this in the i40e and ixgbe drivers, that removes a good chunk of code from the ZC implementations. I also made your code aware of Tx queues. If we create a socket that only has a Tx queue, then the queue id will refer to a Tx queue id only and could be larger than the available amount of Rx queues. Please take a look at it. Differences against v1: * Included patches from Jakub that forbids decreasing the number of active queues if a queue to be deactivated has an AF_XDP socket. These have been adapted somewhat to the new interfaces in patch 2. * Removed redundant check against real_num_[rt]x_queue in xsk_bind * Only need to test against real_num_[rt]x_queues in xdp_clear_umem_at_qid. Patch 1: Introduces a umem reference in the netdev_rx_queue and netdev_queue structs. Patch 2: Records which queue_id is bound to which umem and make sure that you cannot bind two different umems to the same queue_id. Patch 3: Pre patch to ethtool_set_channels. Patch 4: Forbid decreasing the number of active queues if a deactivated queue has an AF_XDP socket. Patch 5: Simplify xdp_clear_umem_at_qid now when ethtool cannot deactivate the queue id we are running on. ==================== Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * xsk: simplify xdp_clear_umem_at_qid implementation (Magnus Karlsson, 2018-10-05; 1 file, -5/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | As we now do not allow ethtool to deactivate the queue id we are running an AF_XDP socket on, we can simplify the implementation of xdp_clear_umem_at_qid(). Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * ethtool: don't allow disabling queues with umem installed (Jakub Kicinski, 2018-10-05; 3 files, -2/+20)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We already check the RSS indirection table does not use queues which would be disabled by channel reconfiguration. Make sure user does not try to disable queues which have a UMEM and zero-copy AF_XDP socket installed. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * ethtool: rename local variable max -> curr (Jakub Kicinski, 2018-10-05; 1 file, -6/+6)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ethtool_set_channels() validates the config against driver's max settings. It retrieves the current config and stores it in a variable called max. This was okay when only max settings were accessed but we will soon want to access current settings as well, so calling the entire structure max makes the code less readable. While at it drop unnecessary parenthesis. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Quentin Monnet <quentin.monnet@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * xsk: fix bug when trying to use both copy and zero-copy on one queue id (Magnus Karlsson, 2018-10-05; 3 files, -35/+64)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, the xsk code did not record which umem was bound to a specific queue id. This was not required if all drivers were zero-copy enabled as this had to be recorded in the driver anyway. So if a user tried to bind two umems to the same queue, the driver would say no. But if copy-mode was first enabled and then zero-copy mode (or the reverse order), we mistakenly enabled both of them on the same umem leading to buggy behavior. The main culprit for this is that we did not store the association of umem to queue id in the copy case and only relied on the driver reporting this. As this relation was not stored in the driver for copy mode (it does not rely on the AF_XDP NDOs), this obviously could not work. This patch fixes the problem by always recording the umem to queue id relationship in the netdev_queue and netdev_rx_queue structs. This way we always know what kind of umem has been bound to a queue id and can act appropriately at bind time. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * net: add umem reference in netdev{_rx}_queue (Magnus Karlsson, 2018-10-05; 1 file, -0/+6)
| |/ | | | | | | | | | | | | | | These references to the umem will be used to store information on what kind of AF_XDP umem that is bound to a queue id, if any. Signed-off-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * bpf: typo fix in Documentation/networking/af_xdp.rst (Konrad Djimeli, 2018-10-05; 1 file, -2/+2)
| | | | | | | | | | | | | | | | Fix a simple typo: Completetion -> Completion Signed-off-by: Konrad Djimeli <kdjimeli@igalia.com> Acked-by: Björn Töpel <bjorn.topel@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * bpf, tracex3_user: erase "ARRAY_SIZE" redefined (Bo YU, 2018-10-04; 1 file, -2/+0)
    There is a warning when compiling bpf sample programs in samples/bpf:
    make -C /home/foo/bpf/samples/bpf/../../tools/lib/bpf/ RM='rm -rf' LDFLAGS= srctree=/home/foo/bpf/samples/bpf/../../ O=
    HOSTCC /home/foo/bpf/samples/bpf/tracex3_user.o
    /home/foo/bpf/samples/bpf/tracex3_user.c:20:0: warning: "ARRAY_SIZE" redefined
    #define ARRAY_SIZE(x) (sizeof(x) / sizeof(*(x)))
    In file included from /home/foo/bpf/samples/bpf/tracex3_user.c:18:0:
    ./tools/testing/selftests/bpf/bpf_util.h:48:0: note: this is the location of the previous definition
    # define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
    Signed-off-by: Bo YU <tsu.yubo@gmail.com>
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
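    The commit itself simply drops the duplicate define in tracex3_user.c and
    relies on the one from bpf_util.h. A common alternative, when a local
    fallback is still wanted, is to guard the definition (a generic sketch,
    not what this commit does):

        #ifndef ARRAY_SIZE
        # define ARRAY_SIZE(x) (sizeof(x) / sizeof((x)[0]))
        #endif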
| * Merge branch 'bpf-libbpf-consistent-iface' (Daniel Borkmann, 2018-10-04; 11 files, -154/+171)
    Andrey Ignatov says:
    ====================
    This patch set renames a few interfaces in libbpf, mostly netlink related,
    so that all symbols provided by the library have only three possible
    prefixes:
    % nm -D tools/lib/bpf/libbpf.so | \
        awk '$2 == "T" {sub(/[_\(].*/, "", $3); if ($3) print $3}' | \
        sort | \
        uniq -c
         91 bpf
          8 btf
         14 libbpf
    libbpf is used more and more outside the kernel tree. That means the
    library should follow good practices in library design and implementation
    to play well with third party code that uses it.
    One such practice is to have a common prefix (or a few) for every
    interface, function or data structure the library provides. It helps to
    avoid name conflicts with other libraries and keeps the API/ABI
    consistent. Inconsistent names in libbpf already cause problems in real
    life. E.g. an application can't use both libbpf and libnl due to
    conflicting symbols (specifically nla_parse, nla_parse_nested and a few
    others).
    Some of the problematic global symbols are not part of the ABI and can be
    restricted from export with either a visibility attribute/pragma or an
    export map (which is useful by itself and can be done in addition). That
    won't solve the problem for those that are part of the ABI, though. Also,
    export restrictions would help only in the DSO case. If a third party
    application links libbpf statically it won't help, and people do that
    (e.g. Facebook links most of its libraries statically, including libbpf).
    libbpf already uses the following prefixes for its interfaces:
    * bpf_ for bpf system call wrappers, program/map/elf-object abstractions
      and a few other things;
    * btf_ for BTF related API;
    * libbpf_ for everything else.
    The patch set adds the libbpf_ prefix to interfaces that use none of the
    prefixes mentioned above and don't fit well into the first two categories.
    The long term benefits of having a common prefix should outweigh the
    possible inconvenience of changing the API for those functions now.
    Patches 2-4 add the libbpf_ prefix to libbpf interfaces: separate patch
    per header. Other patches are simple improvements in the API.
    ====================
    Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * libbpf: Use __u32 instead of u32 in bpf_program__load (Andrey Ignatov, 2018-10-04; 2 files, -2/+2)
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make bpf_program__load consistent with other interfaces: use __u32 instead of u32. That in turn fixes build of samples: In file included from ./samples/bpf/trace_output_user.c:21:0: ./tools/lib/bpf/libbpf.h:132:9: error: unknown type name ‘u32’ u32 kern_version); ^ Fixes: commit 29cd77f41620d ("libbpf: Support loading individual progs") Signed-off-by: Andrey Ignatov <rdna@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>