summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* rhashtable: replace rht_ptr_locked() with rht_assign_locked()NeilBrown2019-04-132-6/+9
| | | | | | | | | | The only times rht_ptr_locked() is used, it is to store a new value in a bucket-head. This is the only time it makes sense to use it too. So replace it by a function which does the whole task: Sets the lock bit and assigns to a bucket head. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rhashtable: move dereference inside rht_ptr()NeilBrown2019-04-133-34/+49
| | | | | | | | | | | | | | | | | | | | | | | Rather than dereferencing a pointer to a bucket and then passing the result to rht_ptr(), we now pass in the pointer and do the dereference in rht_ptr(). This requires that we pass in the tbl and hash as well to support RCU checks, and means that the various rht_for_each functions can expect a pointer that can be dereferenced without further care. There are two places where we dereference a bucket pointer where there is no testable protection - in each case we know that we much have exclusive access without having taken a lock. The previous code used rht_dereference() to pretend that holding the mutex provided protects, but holding the mutex never provides protection for accessing buckets. So instead introduce rht_ptr_exclusive() that can be used when there is known to be exclusive access without holding any locks. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rhashtable: reorder some inline functions and macros.NeilBrown2019-04-131-71/+71
| | | | | | | | | | | This patch only moves some code around, it doesn't change the code at all. A subsequent patch will benefit from this as it needs to add calls to functions which are now defined before the call-site, but weren't before. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rhashtable: fix some __rcu annotation errorsNeilBrown2019-04-132-7/+8
| | | | | | | | With these annotations, the rhashtable now gets no warnings when compiled with "C=1" for sparse checking. Signed-off-by: NeilBrown <neilb@suse.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rhashtable: use struct_size() in kvzalloc()Gustavo A. R. Silva2019-04-131-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | One of the more common cases of allocation size calculations is finding the size of a structure that has a zero-sized array at the end, along with memory for some number of elements for that array. For example: struct foo { int stuff; struct boo entry[]; }; size = sizeof(struct foo) + count * sizeof(struct boo); instance = kvzalloc(size, GFP_KERNEL); Instead of leaving these open-coded and prone to type mistakes, we can now use the new struct_size() helper: instance = kvzalloc(struct_size(instance, entry, count), GFP_KERNEL); This code was detected with the help of Coccinelle. Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'nfp-update-to-control-structures'David S. Miller2019-04-1312-293/+451
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jakub Kicinski says: ==================== nfp: update to control structures This series prepares NFP control structures for crypto offloads. So far we mostly dealt with configuration requests under rtnl lock. This will no longer be the case with crypto. Additionally we will try to reuse the BPF control message format, so we move common code out of BPF. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * nfp: split out common control message handling codeJakub Kicinski2019-04-138-263/+340
| | | | | | | | | | | | | | | | | | | | BPF's control message handler seems like a good base to built on for request-reply control messages. Split it out to allow for reuse. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * nfp: move vNIC reset before netdev initJakub Kicinski2019-04-131-3/+3
| | | | | | | | | | | | | | | | | | | | | | During probe we clear vNIC configuration in case the device wasn't closed cleanly by previous driver. Move that code before netdev init, so netdev init can already try to apply its config parameters. Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * nfp: add a mutex lock for the vNIC ctrl BARJakub Kicinski2019-04-134-23/+87
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Soon we will try to write to the vNIC mailbox without RTNL held. Add a new mutex to protect access to specific parts of the PCI control BAR. Move the mailbox size checking to the mailbox lock() helper, where it can be more effective (happen prior to potential overwrite of other data). Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * nfp: opportunistically poll for reconfig resultDirk van der Merwe2019-04-131-4/+21
|/ | | | | | | | | If the reconfig was a quick update, we could have results available from firmware within 200us. Signed-off-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Remove flowi6_oif compare from __ip6_route_redirectDavid Ahern2019-04-131-2/+0
| | | | | | | | | | | In the review of 0b34eb004347 ("ipv6: Refactor __ip6_route_redirect"), Martin noted that the flowi6_oif compare is moved to the new helper and should be removed from __ip6_route_redirect. Fix the oversight. Fixes: 0b34eb004347 ("ipv6: Refactor __ip6_route_redirect") Reported-by: Martin Lau <kafai@fb.com> Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'netdevsim-Mostly-cleanup-in-sdev-bpf-iface-area'David S. Miller2019-04-136-90/+159
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | Jiri Pirko says: ==================== netdevsim: Mostly cleanup in sdev/bpf iface area This patches does mainly internal netdevsim code shuffle. Nothing serious, just small changes to help readability and preparations for future work. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * netdevsim: move sdev-specific init/uninit code into separate functionsJiri Pirko2019-04-131-25/+38
| | | | | | | | | | | | | | | | In order to improve readability and prepare for future code changes, move sdev specific init/uninit code into separate functions. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * netdevsim: make bpf_offload_dev_create() per-sdev instead of first nsJiri Pirko2019-04-131-12/+14
| | | | | | | | | | | | | | | | | | offload dev is stored in sdev struct. However, first netdevsim instance is used as a priv. Change this to be sdev to as it is shared among multiple netdevsim instances. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * netdevsim: move sdev specific bpf debugfs files to sdev dirJiri Pirko2019-04-133-13/+13
| | | | | | | | | | | | | | | | Some netdevsim bpf debugfs files are per-sdev, yet they are defined per netdevsim instance. Move them under sdev directory. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * netdevsim: move shared dev creation and destruction into separate fileJiri Pirko2019-04-134-50/+104
|/ | | | | | | To make code easier to read, move shared dev bits into a separate file. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Documentation: net: dsa: transition to the rst formatIoana Ciornei2019-04-135-154/+169
| | | | | | | | | This patch also performs some minor adjustments such as numbering for the receive path sequence, conversion of keywords to inline literals and adding an index page so it looks better in the output of 'make htmldocs'. Signed-off-by: Ioana Ciornei <ciorneiioana@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: veth: use generic helper to report timestamping infoJulian Wiedmann2019-04-131-13/+1
| | | | | | | | For reporting the common set of SW timestamping capabilities, use ethtool_op_get_ts_info() instead of re-implementing it. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: loopback: use generic helper to report timestamping infoJulian Wiedmann2019-04-131-13/+1
| | | | | | | | For reporting the common set of SW timestamping capabilities, use ethtool_op_get_ts_info() instead of re-implementing it. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dummy: use generic helper to report timestamping infoJulian Wiedmann2019-04-131-13/+2
| | | | | | | | For reporting the common set of SW timestamping capabilities, use ethtool_op_get_ts_info() instead of re-implementing it. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'smc-next'David S. Miller2019-04-128-274/+294
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ursula Braun says: ==================== net/smc: patches 2019-04-12 here are patches for SMC: * patch 1 improves behavior of non-blocking connect * patches 2, 3, 5, 7, and 8 improve connecting return codes * patches 4 and 6 are a cleanups without functional change ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: improve smc_conn_create reason codesKarsten Graul2019-04-124-59/+58
| | | | | | | | | | | | | | | | | | | | | | Rework smc_conn_create() to always return a valid DECLINE reason code. This removes the need to translate the return codes on 4 different places and allows to easily add more detailed return codes by changing smc_conn_create() only. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: improve smc_listen_work reason codesKarsten Graul2019-04-122-46/+54
| | | | | | | | | | | | | | | | | | | | | | | | Rework smc_listen_work() to provide improved reason codes when an SMC connection is declined. This allows better debugging on user side. This also adds 3 more detailed reason codes in smc_clc.h to indicate what type of device was not found (ism or rdma or both), or if ism cannot talk to the peer. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: code cleanup smc_listen_workKarsten Graul2019-04-121-15/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | In smc_listen_work() the variables rc and reason_code are defined which have the same meaning. Eliminate reason_code in favor of the shorter name rc. No functional changes. Rename the functions smc_check_ism() and smc_check_rdma() into smc_find_ism_device() and smc_find_rdma_device() to make there purpose more clear. No functional changes. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: cleanup of get vlan idKarsten Graul2019-04-123-6/+10
| | | | | | | | | | | | | | | | | | | | | | | | The vlan_id of the underlying CLC socket was retrieved two times during processing of the listen handshaking. Change this to get the vlan id one time in connect and in listen processing, and reuse the id. And add a new CLC DECLINE return code for the case when the retrieval of the vlan id failed. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: consolidate function parametersKarsten Graul2019-04-127-141/+139
| | | | | | | | | | | | | | | | | | | | | | During initialization of an SMC socket a lot of function parameters need to get passed down the function call path. Consolidate the parameters in a helper struct so there are less enough parameters to get all passed by register. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: check for ip prefix and subnetKarsten Graul2019-04-122-3/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The check for a matching ip prefix and subnet was only done for SMC-R in smc_listen_rdma_check() but not when an SMC-D connection was possible. Rename the function into smc_listen_prfx_check() and move its call to a place where it is called for both SMC variants. And add a new CLC DECLINE reason for the case when the IP prefix or subnet check fails so the reason for the failing SMC connection can be found out more easily. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: fallback to TCP after connect problemsKarsten Graul2019-04-121-4/+4
| | | | | | | | | | | | | | | | | | | | Correct the CLC decline reason codes for internal problems to not have the sign bit set, negative reason codes are interpreted as not eligible for TCP fallback. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: nonblocking connect reworkUrsula Braun2019-04-122-42/+47
|/ | | | | | | | | | For nonblocking sockets move the kernel_connect() from the connect worker into the initial smc_connect part to return kernel_connect() errors other than -EINPROGRESS to user space. Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* xen-netback: add reference from xenvif to backend_info to facilitate ↵Dongli Zhang2019-04-122-16/+19
| | | | | | | | | | | | | | | | | | | | | | coredump analysis During coredump analysis, it is not easy to obtain the address of backend_info in xen-netback. So far there are two ways to obtain backend_info: 1. Do what xenbus_device_find() does for vmcore to find the xenbus_device and then derive it from dev_get_drvdata(). 2. Extract backend_info from callstack of xenwatch (e.g., netback_remove() or frontend_changed()). This patch adds a reference from xenvif to backend_info so that it would be much more easier to obtain backend_info during coredump analysis. Signed-off-by: Dongli Zhang <dongli.zhang@oracle.com> Acked-by: Wei Liu <wei.liu2@citrix.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'sctp-skb-list'David S. Miller2019-04-123-41/+71
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | David Miller says: ==================== SCTP: Event skb list overhaul. This patch series eliminates the explicit reference to the skb list implementation via skb->prev dereferences. The approach used is to pass a non-empty skb list around instead of an event skb object which may or may not be on a list. I'd like to thank Marcelo Leitner, Xin Long, and Neil Horman for reviewing previous versions of this series. Testing would be very much appreciated, in addition to the review of course. v4 --> v5: Rebase to net-next v3 --> v4: Fix the logic in patch #4 so that we don't miss cases where we should add event to the on-stack temp list. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: Pass sk_buff_head explicitly to sctp_ulpq_tail_event().David Miller2019-04-123-20/+13
| | | | | | | | | | | | | | | | | | | | | | Now the SKB list implementation assumption can be removed. And now that we know that the list head is always non-NULL we can remove the code blocks dealing with that as well. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: Make sctp_enqueue_event tak an skb list.David Miller2019-04-122-15/+39
| | | | | | | | | | | | | | | | | | | | | | Pass this, instead of an event. Then everything trickles down and we always have events a non-empty list. Then we needs a list creating stub to place into .enqueue_event for sctp_stream_interleave_1. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: Use helper for sctp_ulpq_tail_event() when hooked up to ->enqueue_eventDavid Miller2019-04-121-1/+10
| | | | | | | | | | | | | | | | | | | | This way we can make sure events sent this way to sctp_ulpq_tail_event() are on a list as well. Now all such code paths are fully covered. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: Always pass skbs on a list to sctp_ulpq_tail_event().David Miller2019-04-121-6/+10
| | | | | | | | | | | | | | | | | | This way we can simplify the logic and remove assumptions about the implementation of skb lists. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sctp: Remove superfluous test in sctp_ulpq_reasm_drain().David Miller2019-04-121-1/+1
|/ | | | | | | | Inside the loop, we always start with event non-NULL. Signed-off-by: David S. Miller <davem@davemloft.net> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: flower: fix filter net reference countingVlad Buslov2019-04-121-11/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix net reference counting in fl_change() and remove redundant call to tcf_exts_get_net() from __fl_delete(). __fl_put() already tries to get net before releasing exts and deallocating a filter, so this code caused flower classifier to obtain net twice per filter that is being deleted. Implementation of __fl_delete() called tcf_exts_get_net() to pass its result as 'async' flag to fl_mask_put(). However, 'async' flag is redundant and only complicates fl_mask_put() implementation. This functionality seems to be copied from filter cleanup code, where it was added by Cong with following explanation: This patchset tries to fix the race between call_rcu() and cleanup_net() again. Without holding the netns refcnt the tc_action_net_exit() in netns workqueue could be called before filter destroy works in tc filter workqueue. This patchset moves the netns refcnt from tc actions to tcf_exts, without breaking per-netns tc actions. This doesn't apply to flower mask, which doesn't call any tc action code during cleanup. Simplify fl_mask_put() by removing the flag parameter and always use tcf_queue_work() to free mask objects. Fixes: 061775583e35 ("net: sched: flower: introduce reference counting for filters") Fixes: 1f17f7742eeb ("net: sched: flower: insert filter to ht before offloading it to hw") Fixes: 05cd271fd61a ("cls_flower: Support multiple masks per priority") Reported-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* selftests: Add debugging options to pmtu.shDavid Ahern2019-04-121-89/+124
| | | | | | | | | | | | | | | | | | | | | pmtu.sh script runs a number of tests and dumps a summary of pass/fail. If a test fails, it is near impossible to debug why. For example: TEST: ipv6: PMTU exceptions [FAIL] There are a lot of commands run behind the scenes for this test. Which one is failing? Add a VERBOSE option to show commands that are run and any output from those commands. Add a PAUSE_ON_FAIL option to halt the script if a test fails allowing users to poke around with the setup in the failed state. In the process, rename tracing to TRACING and move declaration to top with the new variables. Signed-off-by: David Ahern <dsahern@gmail.com> Reviewed-by: Stefano Brivio <sbrivio@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2019-04-1275-425/+4603
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2019-04-12 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Improve BPF verifier scalability for large programs through two optimizations: i) remove verifier states that are not useful in pruning, ii) stop walking parentage chain once first LIVE_READ is seen. Combined gives approx 20x speedup. Increase limits for accepting large programs under root, and add various stress tests, from Alexei. 2) Implement global data support in BPF. This enables static global variables for .data, .rodata and .bss sections to be properly handled which allows for more natural program development. This also opens up the possibility to optimize program workflow by compiling ELFs only once and later only rewriting section data before reload, from Daniel and with test cases and libbpf refactoring from Joe. 3) Add config option to generate BTF type info for vmlinux as part of the kernel build process. DWARF debug info is converted via pahole to BTF. Latter relies on libbpf and makes use of BTF deduplication algorithm which results in 100x savings compared to DWARF data. Resulting .BTF section is typically about 2MB in size, from Andrii. 4) Add BPF verifier support for stack access with variable offset from helpers and add various test cases along with it, from Andrey. 5) Extend bpf_skb_adjust_room() growth BPF helper to mark inner MAC header so that L2 encapsulation can be used for tc tunnels, from Alan. 6) Add support for input __sk_buff context in BPF_PROG_TEST_RUN so that users can define a subset of allowed __sk_buff fields that get fed into the test program, from Stanislav. 7) Add bpf fs multi-dimensional array tests for BTF test suite and fix up various UBSAN warnings in bpftool, from Yonghong. 8) Generate a pkg-config file for libbpf, from Luca. 9) Dump program's BTF id in bpftool, from Prashant. 10) libbpf fix to use smaller BPF log buffer size for AF_XDP's XDP program, from Magnus. 11) kallsyms related fixes for the case when symbols are not present in BPF selftests and samples, from Daniel ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * bpf: explicitly prohibit ctx_{in, out} in non-skb BPF_PROG_TEST_RUNStanislav Fomichev2019-04-121-0/+6
| | | | | | | | | | | | | | | | | | | | This should allow us later to extend BPF_PROG_TEST_RUN for non-skb case and be sure that nobody is erroneously setting ctx_{in,out}. Fixes: b0b9395d865e ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN") Reported-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * tools: add smp_* barrier variants to include infrastructureDaniel Borkmann2019-04-112-2/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add the definition for smp_rmb(), smp_wmb(), and smp_mb() to the tools include infrastructure: this patch adds the implementation for x86-64 and arm64, and have it fall back as currently is for other archs which do not have it implemented at this point. The x86-64 one uses lock + add combination for smp_mb() with address below red zone. This is on top of 09d62154f613 ("tools, perf: add and use optimized ring_buffer_{read_head, write_tail} helpers"), which didn't touch smp_* barrier implementations. Magnus recently rightfully reported however that the latter on x86-64 still wrongly falls back to sfence, lfence and mfence respectively, thus fix that for applications under tools making use of these to avoid such ugly surprises. The main header under tools (include/asm/barrier.h) will in that case not select the fallback implementation. Reported-by: Magnus Karlsson <magnus.karlsson@intel.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * Merge branch 'bpf-l2-encap'Daniel Borkmann2019-04-116-80/+417
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alan Maguire says: ==================== Extend bpf_skb_adjust_room growth to mark inner MAC header so that L2 encapsulation can be used for tc tunnels. Patch #1 extends the existing test_tc_tunnel to support UDP encapsulation; later we want to be able to test MPLS over UDP and MPLS over GRE encapsulation. Patch #2 adds the BPF_F_ADJ_ROOM_ENCAP_L2(len) macro, which allows specification of inner mac length. Other approaches were explored prior to taking this approach. Specifically, I tried automatically computing the inner mac length on the basis of the specified flags (so inner maclen for GRE/IPv4 encap is the len_diff specified to bpf_skb_adjust_room minus GRE + IPv4 header length for example). Problem with this is that we don't know for sure what form of GRE/UDP header we have; is it a full GRE header, or is it a FOU UDP header or generic UDP encap header? My fear here was we'd end up with an explosion of flags. The other approach tried was to support inner L2 header marking as a separate room adjustment, i.e. adjust for L3/L4 encap, then call bpf_skb_adjust_room for L2 encap. This can be made to work but because it imposed an order on operations, felt a bit clunky. Patch #3 syncs tools/ bpf.h. Patch #4 extends the tests again to support MPLSoverGRE, MPLSoverUDP, and transparent ethernet bridging (TEB) where the inner L2 header is an ethernet header. Testing of BPF encap against tunnels is done for cases where configuration of such tunnels is possible (MPLSoverGRE[6], MPLSoverUDP, gre[6]tap), and skipped otherwise. Testing of BPF encap/decap is always carried out. Changes since v2: - updated tools/testing/selftest/bpf/config with FOU/MPLS CONFIG variables (patches 1, 4) - reduced noise in patch 1 by avoiding unnecessary movement of code - eliminated inner_mac variable in bpf_skb_net_grow (patch 2) Changes since v1: - fixed formatting of commit references. - BPF_F_ADJ_ROOM_FIXED_GSO flag enabled on all variants (patch 1) - fixed fou6 options for UDP encap; checksum errors observed were due to the fact fou6 tunnel was not set up with correct ipproto options (41 -6). 0 checksums work fine (patch 1) - added definitions for mask and shift used in setting L2 length (patch 2) - allow udp encap with fixed GSO (patch 2) - changed "elen" to "l2_len" to be more descriptive (patch 4) ==================== Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * selftests_bpf: add L2 encap to test_tc_tunnelAlan Maguire2019-04-113-59/+277
| | | | | | | | | | | | | | | | | | | | | | | | Update test_tc_tunnel to verify adding inner L2 header encapsulation (an MPLS label or ethernet header) works. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * bpf: sync bpf.h to tools/ for BPF_F_ADJ_ROOM_ENCAP_L2Alan Maguire2019-04-111-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | Sync include/uapi/linux/bpf.h with tools/ equivalent to add BPF_F_ADJ_ROOM_ENCAP_L2(len) macro. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * bpf: add layer 2 encap support to bpf_skb_adjust_roomAlan Maguire2019-04-112-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 868d523535c2 ("bpf: add bpf_skb_adjust_room encap flags") introduced support to bpf_skb_adjust_room for GSO-friendly GRE and UDP encapsulation. For GSO to work for skbs, the inner headers (mac and network) need to be marked. For L3 encapsulation using bpf_skb_adjust_room, the mac and network headers are identical. Here we provide a way of specifying the inner mac header length for cases where L2 encap is desired. Such an approach can support encapsulated ethernet headers, MPLS headers etc. For example to convert from a packet of form [eth][ip][tcp] to [eth][ip][udp][inner mac][ip][tcp], something like the following could be done: headroom = sizeof(iph) + sizeof(struct udphdr) + inner_maclen; ret = bpf_skb_adjust_room(skb, headroom, BPF_ADJ_ROOM_MAC, BPF_F_ADJ_ROOM_ENCAP_L4_UDP | BPF_F_ADJ_ROOM_ENCAP_L3_IPV4 | BPF_F_ADJ_ROOM_ENCAP_L2(inner_maclen)); Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * selftests_bpf: extend test_tc_tunnel for UDP encapAlan Maguire2019-04-113-48/+143
| |/ | | | | | | | | | | | | | | | | | | commit 868d523535c2 ("bpf: add bpf_skb_adjust_room encap flags") introduced support to bpf_skb_adjust_room for GSO-friendly GRE and UDP encapsulation and later introduced associated test_tc_tunnel tests. Here those tests are extended to cover UDP encapsulation also. Signed-off-by: Alan Maguire <alan.maguire@oracle.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * bpf: fix missing bpf_check_uarg_tail_zero in BPF_PROG_TEST_RUNStanislav Fomichev2019-04-112-9/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit b0b9395d865e ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN") started using bpf_check_uarg_tail_zero in BPF_PROG_TEST_RUN. However, bpf_check_uarg_tail_zero is not defined for !CONFIG_BPF_SYSCALL: net/bpf/test_run.c: In function ‘bpf_ctx_init’: net/bpf/test_run.c:142:9: error: implicit declaration of function ‘bpf_check_uarg_tail_zero’ [-Werror=implicit-function-declaration] err = bpf_check_uarg_tail_zero(data_in, max_size, size); ^~~~~~~~~~~~~~~~~~~~~~~~ Let's not build net/bpf/test_run.c when CONFIG_BPF_SYSCALL is not set. Reported-by: kbuild test robot <lkp@intel.com> Fixes: b0b9395d865e ("bpf: support input __sk_buff context in BPF_PROG_TEST_RUN") Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * selftests: bpf: add selftest for __sk_buff context in BPF_PROG_TEST_RUNStanislav Fomichev2019-04-112-0/+110
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simple test that sets cb to {1,2,3,4,5} and priority to 6, runs bpf program that fails if cb is not what we expect and increments cb[i] and priority. When the test finishes, we check that cb is now {2,3,4,5,6} and priority is 7. We also test the sanity checks: * ctx_in is provided, but ctx_size_in is zero (same for ctx_out/ctx_size_out) * unexpected non-zero fields in __sk_buff return EINVAL Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * libbpf: add support for ctx_{size, }_{in, out} in BPF_PROG_TEST_RUNStanislav Fomichev2019-04-113-0/+17
| | | | | | | | | | | | | | | | | | | | Support recently introduced input/output context for test runs. We extend only bpf_prog_test_run_xattr. bpf_prog_test_run is unextendable and left as is. Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * bpf: support input __sk_buff context in BPF_PROG_TEST_RUNStanislav Fomichev2019-04-113-9/+151
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add new set of arguments to bpf_attr for BPF_PROG_TEST_RUN: * ctx_in/ctx_size_in - input context * ctx_out/ctx_size_out - output context The intended use case is to pass some meta data to the test runs that operate on skb (this has being brought up on recent LPC). For programs that use bpf_prog_test_run_skb, support __sk_buff input and output. Initially, from input __sk_buff, copy _only_ cb and priority into skb, all other non-zero fields are prohibited (with EINVAL). If the user has set ctx_out/ctx_size_out, copy the potentially modified __sk_buff back to the userspace. We require all fields of input __sk_buff except the ones we explicitly support to be set to zero. The expectation is that in the future we might add support for more fields and we want to fail explicitly if the user runs the program on the kernel where we don't yet support them. The API is intentionally vague (i.e. we don't explicitly add __sk_buff to bpf_attr, but ctx_in) to potentially let other test_run types use this interface in the future (this can be xdp_md for xdp types for example). v4: * don't copy more than allowed in bpf_ctx_init [Martin] v3: * handle case where ctx_in is NULL, but ctx_out is not [Martin] * convert size==0 checks to ptr==NULL checks and add some extra ptr checks [Martin] v2: * Addressed comments from Martin Lau Signed-off-by: Stanislav Fomichev <sdf@google.com> Acked-by: Martin KaFai Lau <kafai@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>