summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* net: vhost: factor out busy polling logic to vhost_net_busy_poll()Tonghao Zhang2018-09-271-40/+70
| | | | | | | | | | | | | Factor out generic busy polling logic and will be used for in tx path in the next patch. And with the patch, qemu can set differently the busyloop_timeout for rx queue. To avoid duplicate codes, introduce the helper functions: * sock_has_rx_data(changed from sk_has_rx_data) * vhost_net_busy_poll_try_queue Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: vhost: replace magic number of lock annotationTonghao Zhang2018-09-271-3/+3
| | | | | | | | Use the VHOST_NET_VQ_XXX as a subclass for mutex_lock_nested. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: vhost: lock the vqs one by oneTonghao Zhang2018-09-271-17/+7
| | | | | | | | | | | This patch changes the way that lock all vqs at the same, to lock them one by one. It will be used for next patch to avoid the deadlock. Signed-off-by: Tonghao Zhang <xiangxia.m.yue@gmail.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: expose sk_state in tcp_retransmit_skb tracepointYafang Shao2018-09-271-2/+5
| | | | | | | | | | | | | | After sk_state exposed, we can get in which state this retransmission occurs. That could give us more detail for dignostic. For example, if this retransmission occurs in SYN_SENT state, it may also indicates that the syn packet may be dropped on the remote peer due to syn backlog queue full and then we could check the remote peer. BTW,SYNACK retransmission is traced in tcp_retransmit_synack tracepoint. Signed-off-by: Yafang Shao <laoar.shao@gmail.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: faraday: fix return type of ndo_start_xmit functionYueHaibing2018-09-262-5/+6
| | | | | | | | | | | | The method ndo_start_xmit() is defined as returning an 'netdev_tx_t', which is a typedef for an enum type, so make sure the implementation in this driver has returns 'netdev_tx_t' value, and change the function return type to netdev_tx_t. Found by coccinelle. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: smsc: fix return type of ndo_start_xmit functionYueHaibing2018-09-263-3/+6
| | | | | | | | | | | | The method ndo_start_xmit() is defined as returning an 'netdev_tx_t', which is a typedef for an enum type, so make sure the implementation in this driver has returns 'netdev_tx_t' value, and change the function return type to netdev_tx_t. Found by coccinelle. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: liquidio: list usage cleanupzhong jiang2018-09-261-2/+1
| | | | | | | | Trival cleanup, list_move_tail will implement the same function that list_del() + list_add_tail() will do. hence just replace them. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: qed: list usage cleanupzhong jiang2018-09-263-13/+9
| | | | | | | | Trival cleanup, list_move_tail will implement the same function that list_del() + list_add_tail() will do. hence just replace them. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'net-bridge-convert-bool-options-to-bits'David S. Miller2018-09-2612-116/+151
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Nikolay Aleksandrov says: ==================== net: bridge: convert bool options to bits A lot of boolean bridge options have been added around the net_bridge structure resulting in holes and more importantly different cache lines that need to be fetched in the fast path. This set moves all of those to bits in a bitfield which resides in a hot cache line thus reducing the size of net_bridge, the number of holes and the number of cache lines needed for the fast path. The set is also sent in preparation for new boolean options to avoid spreading them in the structure and making new holes. One nice side-effect is that we avoid potential race conditions by using the bitops since some of the options were bits being directly set in parallel risking hard to debug issues (has_ipv6_addr). Before: size: 1184, holes: 8, sum holes: 30 After: size: 1160, holes: 3, sum holes: 7 Patch 01 is a trivial style fix Patch 02 adds the new options bitfield and converts the vlan boolean options to bits Patches 03-08 convert the rest of the boolean options to bits Patch 09 re-arranges a few fields in net_bridge to further reduce size v2: patch 09: remove the comment about offload_fwd_mark in net_bridge and leave it where it is now, thanks to Ido for spotting it ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bridge: pack net_bridge betterNikolay Aleksandrov2018-09-261-2/+2
| | | | | | | | | | | | | | | | | | | | Further reduce the size of net_bridge with 8 bytes and reduce the number of holes in it: Before: holes: 5, sum holes: 15 After: holes: 3, sum holes: 7 Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bridge: convert mtu_set_by_user to a bitNikolay Aleksandrov2018-09-263-4/+4
| | | | | | | | | | | | | | | | | | Convert the last remaining bool option to a bit thus reducing the overall net_bridge size further by 8 bytes. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bridge: convert neigh_suppress_enabled option to a bitNikolay Aleksandrov2018-09-264-9/+12
| | | | | | | | | | | | | | | | Convert the neigh_suppress_enabled option to a bit. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bridge: convert mcast options to bitsNikolay Aleksandrov2018-09-264-32/+33
| | | | | | | | | | | | | | | | | | | | This patch converts the rest of the mcast options to bits. It also packs the mcast options a little better by moving multicast_mld_version to an existing hole, reducing the net_bridge size by 8 bytes. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bridge: convert and rename mcast disabledNikolay Aleksandrov2018-09-265-21/+24
| | | | | | | | | | | | | | | | | | | | | | | | Convert mcast disabled to an option bit and while doing so convert the logic to check if multicast is enabled instead. That is make the logic follow the option value - if it's set then mcast is enabled and vice versa. This avoids a few confusing places where we inverted the value that's being set to follow the mcast_disabled logic. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bridge: convert group_addr_set option to a bitNikolay Aleksandrov2018-09-264-4/+4
| | | | | | | | | | | | | | | | Convert group_addr_set internal bridge opt to a bit. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bridge: convert nf call options to bitsNikolay Aleksandrov2018-09-264-18/+19
| | | | | | | | | | | | | | | | No functional change, convert of nf_call_[ip|ip6|arp]tables to bits. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bridge: add bitfield for options and convert vlan optsNikolay Aleksandrov2018-09-265-18/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | Bridge options have usually been added as separate fields all over the net_bridge struct taking up space and ending up in different cache lines. Let's move them to a single bitfield to save up space and speedup lookups. This patch adds a simple API for option modifying and retrieving using bitops and converts the first user of the API - the bridge vlan options (vlan_enabled and vlan_stats_enabled). Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: bridge: make struct opening bracket consistentNikolay Aleksandrov2018-09-261-8/+4
|/ | | | | | | | | Currently we have a mix of opening brackets on new lines and on the same line, let's move them all on the same line. Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Reviewed-by: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 's390-net-next'David S. Miller2018-09-267-292/+200
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | Julian Wiedmann says: ==================== s390/net: updates 2018-09-26 please apply one more series of cleanups and small improvements for qeth to net-next. Note that one patch needs to touch both af_iucv and qeth, in order to untangle their receive paths. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: remove duplicated carrier state trackingJulian Wiedmann2018-09-265-22/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The netdevice is always available, apply any carrier state changes to it without caching them. On a STARTLAN event (ie. carrier-up), defer updating the state to qeth_core_hardsetup_card() in the subsequent recovery action. Also remove the carrier-state checks from the xmit routines. Stopping transmission on carrier-down is the responsibility of upper-level code (eg see dev_direct_xmit()). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: clean up drop conditions for received cmdsJulian Wiedmann2018-09-261-10/+11
| | | | | | | | | | | | | | | | | | If qeth_check_ipa_data() consumed an event, there's no point in processing it further. So drop it early, and make the surrounding code a tiny bit more readable. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: re-indent qeth_check_ipa_data()Julian Wiedmann2018-09-261-71/+58
| | | | | | | | | | | | | | | | Pull one level of checking up into qeth_send_control_data_cb(), and clean up an else-after-return. No functional change. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: consume local address eventsJulian Wiedmann2018-09-261-2/+2
| | | | | | | | | | | | | | | | We have no code that is waiting for these events, so just drop them when they arrive. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: remove various redundant codeJulian Wiedmann2018-09-263-15/+2
| | | | | | | | | | | | | | | | | | | | | | | | 1. tracing iob->rc makes no sense when it hasn't been modified by the callback, 2. the qeth_dbf_list is declared with LIST_HEAD, which also initializes the list, 3. the ccwgroup core only calls the thaw/restore callbacks if the gdev is online, so we don't have to check for it again. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: remove CARD_FROM_CDEV helperJulian Wiedmann2018-09-261-44/+29
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The cdev-to-card translation walks through two layers of drvdata, with no locking or refcounting (where eg. the ccwgroup core only accesses a cdev's drvdata while holding the ccwlock). This might be safe for now, but any careless usage of the helper has the potential for subtle races and use-after-free's. Luckily there's only one occurrence where we _really_ need it (in qeth_irq()), for any other user we can just pass through an appropriate card pointer. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: pass card pointer in iob callbackJulian Wiedmann2018-09-262-24/+32
| | | | | | | | | | | | | | This allows us to remove the CARD_FROM_CDEV calls in the iob callbacks. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: re-use qeth_notify_skbs()Julian Wiedmann2018-09-261-24/+5
| | | | | | | | | | | | | | | | When not using the CQ, this allows us avoid the second skb queue walk in qeth_release_skbs(). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: remove additional skb refcountJulian Wiedmann2018-09-261-2/+0
| | | | | | | | | | | | | | | | This was presumably left over from back when qeth recursed into dev_queue_xmit(). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: replace open-coded skb_queue_walk()Julian Wiedmann2018-09-261-15/+4
| | | | | | | | | | | | | | | | To match the use of __skb_queue_purge(), also make the skb's enqueue in qeth_fill_buffer() lockless. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/af_iucv: locate IUCV header via skb_network_header()Julian Wiedmann2018-09-263-36/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch attempts to untangle the TX and RX code in qeth from af_iucv's respective HiperTransport path: On the TX side, pointing skb_network_header() at the IUCV header means that qeth_l3_fill_af_iucv_hdr() no longer needs a magical offset to access the header. On the RX side, qeth pulls the (fake) L2 header off the skb like any normal ethernet driver would. This makes working with the IUCV header in af_iucv easier, since we no longer have to assume a fixed skb layout. While at it, replace the open-coded length checks in af_iucv's RX path with pskb_may_pull(). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: on gdev release, reset drvdataJulian Wiedmann2018-09-261-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | qeth_core_probe_device() sets the gdev's drvdata, but doesn't reset it on a subsequent error. Move the (re-)setting around a bit, so that it happens symmetrically on allocating/freeing the qeth_card struct. This is no actual problem, as the ccwgroup core will discard the gdev on a probe error. But from qeth's perspective the gdev is an external resource, so it's best to manage it cleanly. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: fix discipline unload after setup errorJulian Wiedmann2018-09-264-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Device initialization code usually first loads a subdriver (via qeth_core_load_discipline()), and then runs its setup() callback. If this fails, it rolls back the load via qeth_core_free_discipline(). qeth_core_free_discipline() expects the options.layer attribute to be initialized, but on error in setup() that's currently not the case. Resulting in misbalanced symbol_put() calls. Fix this by setting options.layer when loading the subdriver. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: use DEFINE_MUTEX for qeth_mod_mutexJulian Wiedmann2018-09-261-7/+6
| | | | | | | | | | | | | | | | | | Consolidate declaration and initialization of a static variable. While at it reduce its scope in qeth_core_load_discipline(), and simplify the return logic accordingly. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * s390/qeth: convert layer attribute to enumJulian Wiedmann2018-09-265-26/+25
|/ | | | | | | | | | | While the raw values are fixed due to their use in a sysfs attribute, we can still use the proper QETH_DISCIPLINE_* enum within the driver. Also move the initialization into qeth_set_initial_options(), along with all other user-configurable fields. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: phy: marvell: Fix build.David S. Miller2018-09-261-1/+1
| | | | | | | | | | | | | Local variable 'autoneg' doesn't even exist: drivers/net/phy/marvell.c: In function 'm88e1121_config_aneg': drivers/net/phy/marvell.c:468:25: error: 'autoneg' undeclared (first use in this function); did you mean 'put_net'? if (phydev->autoneg != autoneg || changed) { ^~~~~~~ Fixes: d6ab93364734 ("net: phy: marvell: Avoid unnecessary soft reset") Reported-by:Vakul Garg <vakul.garg@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bridge: br_arp_nd_proxy: set icmp6_router if neigh has NTF_ROUTERRoopa Prabhu2018-09-261-1/+1
| | | | | | Fixes: ed842faeb2bd ("bridge: suppress nd pkts on BR_NEIGH_SUPPRESS ports") Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2018-09-2647-231/+3066
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf-next 2018-09-25 The following pull-request contains BPF updates for your *net-next* tree. The main changes are: 1) Allow for RX stack hardening by implementing the kernel's flow dissector in BPF. Idea was originally presented at netconf 2017 [0]. Quote from merge commit: [...] Because of the rigorous checks of the BPF verifier, this provides significant security guarantees. In particular, the BPF flow dissector cannot get inside of an infinite loop, as with CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot read outside of packet bounds, because all memory accesses are checked. Also, with BPF the administrator can decide which protocols to support, reducing potential attack surface. Rarely encountered protocols can be excluded from dissection and the program can be updated without kernel recompile or reboot if a bug is discovered. [...] Also, a sample flow dissector has been implemented in BPF as part of this work, from Petar and Willem. [0] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf 2) Add support for bpftool to list currently active attachment points of BPF networking programs providing a quick overview similar to bpftool's perf subcommand, from Yonghong. 3) Fix a verifier pruning instability bug where a union member from the register state was not cleared properly leading to branches not being pruned despite them being valid candidates, from Alexei. 4) Various smaller fast-path optimizations in XDP's map redirect code, from Jesper. 5) Enable to recognize BPF_MAP_TYPE_REUSEPORT_SOCKARRAY maps in bpftool, from Roman. 6) Remove a duplicate check in libbpf that probes for function storage, from Taeung. 7) Fix an issue in test_progs by avoid checking for errno since on success its value should not be checked, from Mauricio. 8) Fix unused variable warning in bpf_getsockopt() helper when CONFIG_INET is not configured, from Anders. 9) Fix a compilation failure in the BPF sample code's use of bpf_flow_keys, from Prashant. 10) Minor cleanups in BPF code, from Yue and Zhong. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * flow_dissector: lookup netns by skb->sk if skb->dev is NULLWillem de Bruijn2018-09-251-3/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | BPF flow dissectors are configured per network namespace. __skb_flow_dissect looks up the netns through dev_net(skb->dev). In some dissector paths skb->dev is NULL, such as for Unix sockets. In these cases fall back to looking up the netns by socket. Analyzing the codepaths leading to __skb_flow_dissect I did not find a case where both skb->dev and skb->sk are NULL. Warn and fall back to standard flow dissector if one is found. Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") Reported-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * bpftool: add support for BPF_MAP_TYPE_REUSEPORT_SOCKARRAY mapsRoman Gushchin2018-09-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Add BPF_MAP_TYPE_REUSEPORT_SOCKARRAY map type to the list of maps types which bpftool recognizes. Signed-off-by: Roman Gushchin <guro@fb.com> Cc: Alexei Starovoitov <ast@kernel.org> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Jakub Kicinski <jakub.kicinski@netronome.com> Cc: Yonghong Song <yhs@fb.com> Acked-by: Jakub Kicinski <jakub.kicinski@netronome.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * bpf: remove redundant null pointer check before consume_skbzhong jiang2018-09-211-4/+2
| | | | | | | | | | | | | | | | | | consume_skb has taken the null pointer into account. hence it is safe to remove the redundant null pointer check before consume_skb. Signed-off-by: zhong jiang <zhongjiang@huawei.com> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * samples/bpf: fix compilation failurePrashant Bhole2018-09-213-11/+12
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | following commit: commit d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") added struct bpf_flow_keys which conflicts with the struct with same name in sockex2_kern.c and sockex3_kern.c similar to commit: commit 534e0e52bc23 ("samples/bpf: fix a compilation failure") we tried the rename it "flow_keys" but it also conflicted with struct having same name in include/net/flow_dissector.h. Hence renaming the struct to "flow_key_record". Also, this commit doesn't fix the compilation error completely because the similar struct is present in sockex3_kern.c. Hence renaming it in both files sockex3_user.c and sockex3_kern.c Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * flow_dissector: fix build failure without CONFIG_NETWillem de Bruijn2018-09-192-1/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If boolean CONFIG_BPF_SYSCALL is enabled, kernel/bpf/syscall.c will call flow_dissector functions from net/core/flow_dissector.c. This causes this build failure if CONFIG_NET is disabled: kernel/bpf/syscall.o: In function `__x64_sys_bpf': syscall.c:(.text+0x3278): undefined reference to `skb_flow_dissector_bpf_prog_attach' syscall.c:(.text+0x3310): undefined reference to `skb_flow_dissector_bpf_prog_detach' kernel/bpf/syscall.o:(.rodata+0x3f0): undefined reference to `flow_dissector_prog_ops' kernel/bpf/verifier.o:(.rodata+0x250): undefined reference to `flow_dissector_verifier_ops' Analogous to other optional BPF program types in syscall.c, add stubs if the relevant functions are not compiled and move the BPF_PROG_TYPE definition in the #ifdef CONFIG_NET block. Fixes: d58e468b1112 ("flow_dissector: implements flow dissector BPF hook") Reported-by: Randy Dunlap <rdunlap@infradead.org> Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Randy Dunlap <rdunlap@infradead.org> # build-tested Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * samples/bpf: fix a compilation failureYonghong Song2018-09-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | samples/bpf build failed with the following errors: $ make samples/bpf/ ... HOSTCC samples/bpf/sockex3_user.o /data/users/yhs/work/net-next/samples/bpf/sockex3_user.c:16:8: error: redefinition of ‘struct bpf_flow_keys’ struct bpf_flow_keys { ^ In file included from /data/users/yhs/work/net-next/samples/bpf/sockex3_user.c:4:0: ./usr/include/linux/bpf.h:2338:9: note: originally defined here struct bpf_flow_keys *flow_keys; ^ make[3]: *** [samples/bpf/sockex3_user.o] Error 1 Commit d58e468b1112d ("flow_dissector: implements flow dissector BPF hook") introduced struct bpf_flow_keys in include/uapi/linux/bpf.h and hence caused the naming conflict with samples/bpf/sockex3_user.c. The fix is to rename struct bpf_flow_keys in samples/bpf/sockex3_user.c to flow_keys to avoid the conflict. Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * samples/bpf: remove duplicated includesYueHaibing2018-09-183-3/+0
| | | | | | | | | | | | | | | | Remove duplicated includes. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * tools/bpf: bpftool: improve output format for bpftool netYonghong Song2018-09-185-130/+161
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a followup patch for Commit f6f3bac08ff9 ("tools/bpf: bpftool: add net support"). Some improvements are made for the bpftool net output. Specially, plain output is more concise such that per attachment should nicely fit in one line. Compared to previous output, the prog tag is removed since it can be easily obtained with program id. Similar to xdp attachments, the device name is added to tc attachments. The bpf program attached through shared block mechanism is supported as well. $ ip link add dev v1 type veth peer name v2 $ tc qdisc add dev v1 ingress_block 10 egress_block 20 clsact $ tc qdisc add dev v2 ingress_block 10 egress_block 20 clsact $ tc filter add block 10 protocol ip prio 25 bpf obj bpf_shared.o sec ingress flowid 1:1 $ tc filter add block 20 protocol ip prio 30 bpf obj bpf_cyclic.o sec classifier flowid 1:1 $ bpftool net xdp: tc: v2(7) clsact/ingress bpf_shared.o:[ingress] id 23 v2(7) clsact/egress bpf_cyclic.o:[classifier] id 24 v1(8) clsact/ingress bpf_shared.o:[ingress] id 23 v1(8) clsact/egress bpf_cyclic.o:[classifier] id 24 The documentation and "bpftool net help" are updated to make it clear that current implementation only supports xdp and tc attachments. For programs attached to cgroups, "bpftool cgroup" can be used to dump attachments. For other programs e.g. sk_{filter,skb,msg,reuseport} and lwt/seg6, iproute2 tools should be used. The new output: $ bpftool net xdp: eth0(2) driver id 198 tc: eth0(2) clsact/ingress fbflow_icmp id 335 act [{icmp_action id 336}] eth0(2) clsact/egress fbflow_egress id 334 $ bpftool -jp net [{ "xdp": [{ "devname": "eth0", "ifindex": 2, "mode": "driver", "id": 198 } ], "tc": [{ "devname": "eth0", "ifindex": 2, "kind": "clsact/ingress", "name": "fbflow_icmp", "id": 335, "act": [{ "name": "icmp_action", "id": 336 } ] },{ "devname": "eth0", "ifindex": 2, "kind": "clsact/egress", "name": "fbflow_egress", "id": 334 } ] } ] Signed-off-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * selftests/bpf: fix bpf_flow.c buildAlexei Starovoitov2018-09-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | fix the following build error: clang -I. -I./include/uapi -I../../../include/uapi -idirafter /usr/local/include -idirafter /data/users/ast/llvm/bld/lib/clang/7.0.0/include -idirafter /usr/include -Wno-compare-distinct-pointer-types \ -O2 -target bpf -emit-llvm -c bpf_flow.c -o - | \ llc -march=bpf -mcpu=generic -filetype=obj -o /data/users/ast/bpf-next/tools/testing/selftests/bpf/bpf_flow.o LLVM ERROR: 'dissect' label emitted multiple times to assembly file make: *** [/data/users/ast/bpf-next/tools/testing/selftests/bpf/bpf_flow.o] Error 1 Fixes: 9c98b13cc3bb ("flow_dissector: implements eBPF parser") Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * Merge branch 'bpf-flow-dissector'Alexei Starovoitov2018-09-1422-6/+1828
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Petar Penkov says: ==================== This patch series hardens the RX stack by allowing flow dissection in BPF, as previously discussed [1]. Because of the rigorous checks of the BPF verifier, this provides significant security guarantees. In particular, the BPF flow dissector cannot get inside of an infinite loop, as with CVE-2013-4348, because BPF programs are guaranteed to terminate. It cannot read outside of packet bounds, because all memory accesses are checked. Also, with BPF the administrator can decide which protocols to support, reducing potential attack surface. Rarely encountered protocols can be excluded from dissection and the program can be updated without kernel recompile or reboot if a bug is discovered. Patch 1 adds infrastructure to execute a BPF program in __skb_flow_dissect. This includes a new BPF program and attach type. Patch 2 adds the new BPF flow dissector definitions to tools/uapi. Patch 3 adds support for the new BPF program type to libbpf and bpftool. Patch 4 adds a flow dissector program in BPF. This parses most protocols in __skb_flow_dissect in BPF for a subset of flow keys (basic, control, ports, and address types). Patch 5 adds a selftest that attaches the BPF program to the flow dissector and sends traffic with different levels of encapsulation. Performance Evaluation: The in-kernel implementation was compared against the demo program from patch 4 using the test in patch 5 with IPv4/UDP traffic over 10 seconds. $perf record -a -C 4 taskset -c 4 ./test_flow_dissector -i 4 -f 8 \ -t 10 In-kernel Dissector: __skb_flow_dissect overhead: 2.12% Total Packets: 3,272,597 (from output of ./test_flow_dissector) BPF Dissector: __skb_flow_dissect overhead: 1.63% Total Packets: 3,232,356 (from output of ./test_flow_dissector) No-op BPF Dissector: __skb_flow_dissect overhead: 1.52% Total Packets: 3,330,635 (from output of ./test_flow_dissector) Changes since v3: 1/ struct bpf_flow_keys reorganized to remove holes in patch 1 and patch 2. Changes since v2: 1/ Changes to tools/include/uapi pulled into a separate patch 2 2/ Changes to tools/lib and tools/bpftool pulled into a separate patch 3 3/ Changed flow_keys in __sk_buff from __u32 to struct bpf_flow_keys * 4/ Added nhoff field in struct bpf_flow_keys to pass initial offset 5/ Saving all of the modified control block, rather than just the qdisc 6/ Sample BPF program in patch 4 modified to use the changes above Changes since v1: 1/ LD_ABS instructions now disallowed for the new BPF prog type 2/ now checks if skb is NULL in __skb_flow_dissect() 3/ fixed incorrect accesses in flow_dissector_is_valid_access() - writes to the flow_keys field now disallowed - reads/writes to tc_classid and data_meta now disallowed 4/ headers pulled with bpf_skb_load_data if direct access fails Changes since RFC: 1/ Flow dissector hook changed from global to per-netns 2/ Defined struct bpf_flow_keys to be used in BPF flow dissector programs instead of exposing the internal flow keys layout. Added a function to translate from bpf_flow_keys to the internal layout after BPF dissection is complete. The pointer to this struct is stored in qdisc_skb_cb rather than inside of the 20 byte control block which simplifies verification and allows access to all 20 bytes of the cb. 3/ Removed GUE parsing as it relied on a hardcoded port 4/ MPLS parsing now stops at the first label which is consistent with the in-kernel flow dissector 5/ Refactored to use direct packet access and to write out to struct bpf_flow_keys [1] http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | * selftests/bpf: test bpf flow dissectionPetar Penkov2018-09-148-2/+1134
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds a test that sends different types of packets over multiple tunnels and verifies that valid packets are dissected correctly. To do so, a tc-flower rule is added to drop packets on UDP src port 9, and packets are sent from ports 8, 9, and 10. Only the packets on port 9 should be dropped. Because tc-flower relies on the flow dissector to match flows, correct classification demonstrates correct dissection. Also add support logic to load the BPF program and to inject the test packets. Signed-off-by: Petar Penkov <ppenkov@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | * flow_dissector: implements eBPF parserPetar Penkov2018-09-142-1/+374
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This eBPF program extracts basic/control/ip address/ports keys from incoming packets. It supports recursive parsing for IP encapsulation, and VLAN, along with IPv4/IPv6 and extension headers. This program is meant to show how flow dissection and key extraction can be done in eBPF. Link: http://vger.kernel.org/netconf2017_files/rx_hardening_and_udp_gso.pdf Signed-off-by: Petar Penkov <ppenkov@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | * bpf: support flow dissector in libbpf and bpftoolPetar Penkov2018-09-142-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | This patch extends libbpf and bpftool to work with programs of type BPF_PROG_TYPE_FLOW_DISSECTOR. Signed-off-by: Petar Penkov <ppenkov@google.com> Signed-off-by: Willem de Bruijn <willemb@google.com> Signed-off-by: Alexei Starovoitov <ast@kernel.org>