path: root/net
Commit message (Author, Age, Files, Lines)
* net: bridge: fdb: convert is_static to bitops (Nikolay Aleksandrov, 2019-10-30, 2 files, -23/+21)
| Convert the is_static to bitops, make use of the combined test_and_set/clear_bit to simplify expressions in fdb_add_entry.
|
| Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
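The combined test-and-set/clear idea can be illustrated with a small userspace sketch. This is not the bridge code: the flag numbers, `struct fdb_entry`, and the helper names are hypothetical, and the helpers are plain non-atomic stand-ins for the kernel's atomic `test_and_set_bit()` and `test_and_clear_bit()`.

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical flag bit numbers, chosen only for this sketch. */
enum { FDB_FLAG_LOCAL = 0, FDB_FLAG_STATIC = 1 };

struct fdb_entry {
	unsigned long flags;
};

/* Set bit nr and report whether it was already set (non-atomic stand-in). */
static bool flag_test_and_set(unsigned long *flags, unsigned int nr)
{
	assert(nr < 8 * sizeof(*flags));
	unsigned long mask = 1UL << nr;
	bool was_set = (*flags & mask) != 0;

	*flags |= mask;
	return was_set;
}

/* Clear bit nr and report whether it was set before (non-atomic stand-in). */
static bool flag_test_and_clear(unsigned long *flags, unsigned int nr)
{
	assert(nr < 8 * sizeof(*flags));
	unsigned long mask = 1UL << nr;
	bool was_set = (*flags & mask) != 0;

	*flags &= ~mask;
	return was_set;
}

/* Returns true only when the entry actually changed state, so a caller
 * can decide in one expression whether a notification is needed. */
static bool fdb_set_static(struct fdb_entry *e, bool want_static)
{
	if (want_static)
		return !flag_test_and_set(&e->flags, FDB_FLAG_STATIC);
	return flag_test_and_clear(&e->flags, FDB_FLAG_STATIC);
}
```

Folding the test and the update into one call removes the separate "compare old value, then assign" step that a plain boolean field needs.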
* net: bridge: fdb: convert is_local to bitops (Nikolay Aleksandrov, 2019-10-30, 3 files, -16/+27)
| The patch adds a new fdb flags field in the hole between the two cache lines and uses it to convert is_local to bitops.
|
| Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* tipc: improve throughput between nodes in netns (Hoang Le, 2019-10-30, 8 files, -11/+197)
| Currently, TIPC transports intra-node user data messages directly socket to socket, hence shortcutting all the lower layers of the communication stack. This gives TIPC very good intra node performance, both regarding throughput and latency.
|
| We now introduce a similar mechanism for TIPC data traffic across network namespaces located in the same kernel. On the send path, the call chain is as always accompanied by the sending node's network namespace pointer. However, once we have reliably established that the receiving node is represented by a namespace on the same host, we just replace the namespace pointer with the receiving node/namespace's ditto, and follow the regular socket receive path through the receiving node. This technique gives us a throughput similar to the node internal throughput, several times larger than if we let the traffic go through the full network stacks. As a comparison, max throughput for 64k messages is four times larger than TCP throughput for the same type of traffic.
|
| To meet any security concerns, the following should be noted.
|
| - All nodes joining a cluster are supposed to have been certified and authenticated by mechanisms outside TIPC. This is no different for nodes/namespaces on the same host; they have to auto discover each other using the attached interfaces, and establish links which are supervised via the regular link monitoring mechanism. Hence, a kernel local node has no other way to join a cluster than any other node, and has to obey policies set in the IP or device layers of the stack.
|
| - Only when a sender has established with 100% certainty that the peer node is located in a kernel local namespace does it choose to let user data messages, and only those, take the crossover path to the receiving node/namespace.
|
| - If the receiving node/namespace is removed, its namespace pointer is invalidated at all peer nodes, and their neighbor link monitoring will eventually note that this node is gone.
|
| - To ensure the "100% certainty" criterion, and prevent any possible spoofing, received discovery messages must contain a proof that the sender knows a common secret. We use the hash mix of the sending node/namespace for this purpose, since it can be accessed directly by all other namespaces in the kernel. Upon reception of a discovery message, the receiver checks this proof against the hash_mix values of all the local namespaces. If it finds a match, that, along with a matching node id and cluster id, is deemed sufficient proof that the peer node in question is in a local namespace, and a wormhole can be opened.
|
| - We should also consider that TIPC is intended to be a cluster local IPC mechanism (just like e.g. UNIX sockets) rather than a network protocol, and hence we think it can be justified to allow it to shortcut the lower protocol layers.
|
| Regarding traceability, we should notice that since commit 6c9081a3915d ("tipc: add loopback device tracking") it is possible to follow the node internal packet flow by just activating tcpdump on the loopback interface. This will be true even for this mechanism; by activating tcpdump on the involved nodes' loopback interfaces their inter-namespace messaging can easily be tracked.
|
| v2:
| - update 'net' pointer when node left/rejoined
| v3:
| - grab read/write lock when using node ref obj
| v4:
| - clone traffic between netns to loopback
|
| Suggested-by: Jon Maloy <jon.maloy@ericsson.com>
| Acked-by: Jon Maloy <jon.maloy@ericsson.com>
| Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: do not call sublist_rcv on empty list (Florian Westphal, 2019-10-30, 2 files, -2/+4)
| syzbot triggered struct net NULL deref in NF_HOOK_LIST:
| RIP: 0010:NF_HOOK_LIST include/linux/netfilter.h:331 [inline]
| RIP: 0010:ip6_sublist_rcv+0x5c9/0x930 net/ipv6/ip6_input.c:292
|  ipv6_list_rcv+0x373/0x4b0 net/ipv6/ip6_input.c:328
|  __netif_receive_skb_list_ptype net/core/dev.c:5274 [inline]
|
| Reason:
| void ipv6_list_rcv(struct list_head *head, struct packet_type *pt,
|                    struct net_device *orig_dev)
| [..]
|         list_for_each_entry_safe(skb, next, head, list) {
|                 /* iterates list */
|                 skb = ip6_rcv_core(skb, dev, net);
|                 /* ip6_rcv_core drops skb -> NULL is returned */
|                 if (skb == NULL)
|                         continue;
|                 [..]
|         }
|         /* sublist is empty -> curr_net is NULL */
|         ip6_sublist_rcv(&sublist, curr_dev, curr_net);
|
| Before the recent change NF_HOOK_LIST did a list iteration before struct net deref, i.e. it was a no-op in the empty list case.
|
| List iteration now happens after *net deref, causing crash.
|
| Follow the same pattern as the ip(v6)_list_rcv loop and add a list_empty test for the final sublist dispatch too.
|
| Cc: Edward Cree <ecree@solarflare.com>
| Reported-by: syzbot+c54f457cad330e57e967@syzkaller.appspotmail.com
| Fixes: ca58fbe06c54 ("netfilter: add and use nf_hook_slow_list()")
| Signed-off-by: Florian Westphal <fw@strlen.de>
| Tested-by: Leon Romanovsky <leonro@mellanox.com>
| Tested-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
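The failure mode and the guard can be reproduced in a self-contained userspace sketch. Everything here is a hypothetical stand-in: `struct pkt`, `struct net_ctx`, `sublist_rcv()` and `list_rcv()` only mimic the shape of the kernel code, and the `assert()` plays the role of the unconditional `struct net` dereference.

```c
#include <assert.h>
#include <stddef.h>

struct pkt {
	int drop;          /* nonzero: the "core" step drops this packet */
	struct pkt *next;
};

struct net_ctx {
	int dispatched;    /* stand-in for per-namespace state */
};

/* Touches net before walking the list, so it must never be called
 * with an empty sublist and a NULL net. */
static void sublist_rcv(struct pkt *sublist, struct net_ctx *net)
{
	assert(net != NULL);
	for (struct pkt *p = sublist; p; p = p->next)
		net->dispatched++;
}

/* Builds a sublist of surviving packets and dispatches it once. */
static int list_rcv(struct pkt *pkts, size_t n, struct net_ctx *net)
{
	struct pkt *sublist = NULL, **tail = &sublist;
	struct net_ctx *curr_net = NULL;

	for (size_t i = 0; i < n; i++) {
		if (pkts[i].drop)
			continue;      /* dropped: curr_net may stay NULL */
		curr_net = net;
		pkts[i].next = NULL;
		*tail = &pkts[i];
		tail = &pkts[i].next;
	}

	/* Final dispatch only when something survived. */
	if (sublist)
		sublist_rcv(sublist, curr_net);

	return curr_net ? curr_net->dispatched : 0;
}
```

Without the `if (sublist)` test, a batch in which every packet is dropped would reach `sublist_rcv()` with a NULL context, which is the situation the commit message describes.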
* sock: remove unneeded semicolon (YueHaibing, 2019-10-29, 1 file, -1/+1)
| Remove unneeded semicolon.
|
| Signed-off-by: YueHaibing <yuehaibing@huawei.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dsa: Add support for devlink device parameters (Andrew Lunn, 2019-10-29, 2 files, -1/+54)
| Add plumbing to allow DSA drivers to register parameters with devlink.
|
| To keep with the abstraction, the DSA drivers pass the ds structure to these helpers, and the DSA core then translates that to the devlink structure associated with the device.
|
| Signed-off-by: Andrew Lunn <andrew@lunn.ch>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* tipc: Spelling s/enpoint/endpoint/ (Geert Uytterhoeven, 2019-10-28, 1 file, -1/+1)
| Fix misspelling of "endpoint".
|
| Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Fix various misspellings of "connect" (Geert Uytterhoeven, 2019-10-28, 2 files, -2/+2)
| Fix misspellings of "disconnect", "disconnecting", "connections", and "disconnected".
|
| Signed-off-by: Geert Uytterhoeven <geert+renesas@glider.be>
| Acked-by: Kalle Valo <kvalo@codeaurora.org>
| Acked-by: Simon Horman <horms@verge.net.au>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dsa: fix dereference on ds->dev before null check error (Colin Ian King, 2019-10-28, 1 file, -2/+5)
| Currently ds->dev is dereferenced on the assignments of pdata and np before ds->dev is null checked, hence there is a potential null pointer dereference on ds->dev. Fix this by assigning pdata and np after the ds->dev null pointer sanity check.
|
| Addresses-Coverity: ("Dereference before null check")
| Fixes: 7e99e3470172 ("net: dsa: remove dsa_switch_alloc helper")
| Signed-off-by: Colin Ian King <colin.king@canonical.com>
| Reviewed-by: Andrew Lunn <andrew@lunn.ch>
| Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
| Reported-by: kbuild test robot <lkp@intel.com>
| Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
| Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
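The ordering fix follows a general pattern that a short sketch can show. The types and the function below are hypothetical stand-ins rather than the DSA core's real definitions, and the `-ENODEV` return value is an assumption made for the example.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

struct device {
	void *platform_data;
	void *of_node;
};

struct switch_like {
	struct device *dev;
};

static int switch_probe(struct switch_like *ds)
{
	void *pdata, *np;

	assert(ds != NULL);

	/* Check the pointer first ... */
	if (!ds->dev)
		return -ENODEV;

	/* ... and only then read through it. Initialising pdata and np at
	 * their declaration would dereference ds->dev before the check. */
	pdata = ds->dev->platform_data;
	np = ds->dev->of_node;

	if (!pdata && !np)
		return -ENODEV;

	return 0;
}
```

Static analysers flag the original shape because the later NULL test implies the pointer may be NULL, which makes the earlier dereference suspect.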
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-next (David S. Miller, 2019-10-27, 2 files, -1/+23)
|\
| | Daniel Borkmann says:
| |
| | ====================
| | pull-request: bpf-next 2019-10-27
| |
| | The following pull-request contains BPF updates for your *net-next* tree.
| |
| | We've added 52 non-merge commits during the last 11 day(s) which contain a total of 65 files changed, 2604 insertions(+), 1100 deletions(-).
| |
| | The main changes are:
| |
| | 1) Revolutionize BPF tracing by using in-kernel BTF to type check BPF assembly code. The work here teaches BPF verifier to recognize kfree_skb()'s first argument as 'struct sk_buff *' in tracepoints such that verifier allows direct use of bpf_skb_event_output() helper used in tc BPF et al (w/o probing memory access) that dumps skb data into perf ring buffer. Also add direct loads to probe memory in order to speed up/replace bpf_probe_read() calls, from Alexei Starovoitov.
| |
| | 2) Big batch of changes to improve libbpf and BPF kselftests. Besides others: generalization of libbpf's CO-RE relocation support to now also include field existence relocations, revamp the BPF kselftest Makefile to add a test runner concept that allows exercising various ways to build BPF programs, and teach bpf_object__open() and friends to automatically derive BPF program type/expected attach type from section names to ease their use, from Andrii Nakryiko.
| |
| | 3) Fix deadlock in stackmap's build-id lookup on rq_lock(), from Song Liu.
| |
| | 4) Allow reading BTF as raw data from bpftool. Most notable use case is to dump /sys/kernel/btf/vmlinux through this, from Jiri Olsa.
| |
| | 5) Use bpf_redirect_map() helper in libbpf's AF_XDP helper prog which manages to improve "rx_drop" performance by ~4%, from Björn Töpel.
| |
| | 6) Fix to restore the flow dissector after reattach BPF test and also fix error handling in bpf_helper_defs.h generation, from Jakub Sitnicki.
| |
| | 7) Improve verifier's BTF ctx access for use outside of raw_tp, from Martin KaFai Lau.
| |
| | 8) Improve documentation for AF_XDP with new sections and to reflect latest features, from Magnus Karlsson.
| |
| | 9) Add back 'version' section parsing to libbpf for old kernels, from John Fastabend.
| |
| | 10) Fix strncat bounds error in libbpf's libbpf_prog_type_by_name(), from KP Singh.
| |
| | 11) Turn on -mattr=+alu32 in LLVM by default for BPF kselftests in order to improve insn coverage for built BPF progs, from Yonghong Song.
| |
| | 12) Misc minor cleanups and fixes, from various others.
| | ====================
| |
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| * bpf: Check types of arguments passed into helpers (Alexei Starovoitov, 2019-10-17, 1 file, -1/+14)
| | Introduce new helper that reuses existing skb perf_event output implementation, but can be called from raw_tracepoint programs that receive 'struct sk_buff *' as tracepoint argument or can walk other kernel data structures to skb pointer.
| |
| | In order to do that teach verifier to resolve true C types of bpf helpers into in-kernel BTF ids. The type of kernel pointer passed by raw tracepoint into bpf program will be tracked by the verifier all the way until it's passed into helper function.
| |
| | For example: the kfree_skb() kernel function calls trace_kfree_skb(skb, loc); a bpf program receives that skb pointer and may eventually pass it into the bpf_skb_output() bpf helper, which in-kernel is implemented via the bpf_skb_event_output() kernel function. Its first argument in the kernel is 'struct sk_buff *'. The verifier makes sure that types match all the way.
| |
| | Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | Acked-by: Andrii Nakryiko <andriin@fb.com>
| | Acked-by: Martin KaFai Lau <kafai@fb.com>
| | Link: https://lore.kernel.org/bpf/20191016032505.2089704-11-ast@kernel.org
| * bpf: Allow __sk_buff tstamp in BPF_PROG_TEST_RUN (Stanislav Fomichev, 2019-10-16, 1 file, -0/+9)
| | It's useful for implementing EDT related tests (set tstamp, run the test, see how the tstamp is changed or observe some other parameter).
| |
| | Note that bpf_ktime_get_ns() helper is using monotonic clock, so for the BPF programs that compare tstamp against it, tstamp should be derived from clock_gettime(CLOCK_MONOTONIC, ...).
| |
| | Signed-off-by: Stanislav Fomichev <sdf@google.com>
| | Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | Acked-by: Martin KaFai Lau <kafai@fb.com>
| | Link: https://lore.kernel.org/bpf/20191015183125.124413-1-sdf@google.com
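A test harness that follows the note above would derive its timestamp from the monotonic clock. The sketch below shows only that derivation; `monotonic_ns()` and `edt_tstamp()` are hypothetical helper names, and feeding the value into a `BPF_PROG_TEST_RUN` context is left out.

```c
#ifndef _POSIX_C_SOURCE
#define _POSIX_C_SOURCE 200809L
#endif
#include <assert.h>
#include <stdint.h>
#include <time.h>

/* Current CLOCK_MONOTONIC time in nanoseconds. */
static uint64_t monotonic_ns(void)
{
	struct timespec ts;
	int rc = clock_gettime(CLOCK_MONOTONIC, &ts);

	assert(rc == 0);
	(void)rc;
	return (uint64_t)ts.tv_sec * 1000000000ULL + (uint64_t)ts.tv_nsec;
}

/* Hypothetical helper: a departure time delay_ns in the future, expressed
 * on the monotonic clock so it is comparable with monotonic readings. */
static uint64_t edt_tstamp(uint64_t delay_ns)
{
	return monotonic_ns() + delay_ns;
}
```

Using `CLOCK_REALTIME` here instead would produce values on a different time base, so comparisons against a monotonic reading inside the program would be meaningless.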
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-next (David S. Miller, 2019-10-26, 32 files, -497/+1064)
|\ \
| | | Pablo Neira Ayuso says:
| | |
| | | ====================
| | | Netfilter/IPVS updates for net-next
| | |
| | | The following patchset contains Netfilter/IPVS updates for net-next, more specifically:
| | |
| | | * Updates for ipset:
| | |
| | | 1) Coding style fix for ipset comment extension, from Jeremy Sowden.
| | | 2) De-inline many functions in ipset, from Jeremy Sowden.
| | | 3) Move ipset function definition from header to source file.
| | | 4) Move ip_set_put_flags() to source, export it as a symbol, remove inline.
| | | 5) Move range_to_mask() to the source file where this is used.
| | | 6) Move ip_set_get_ip_port() to the source file where this is used.
| | |
| | | * IPVS selftests and netns improvements:
| | |
| | | 7) Two patches to speed up ipvs netns dismantle, from Haishuang Yan.
| | | 8) Three patches to add selftest script for ipvs, also from Haishuang Yan.
| | |
| | | * Conntrack updates and new nf_hook_slow_list() function:
| | |
| | | 9) Document ct ecache extension, from Florian Westphal.
| | | 10) Skip ct extensions from ctnetlink dump, from Florian.
| | | 11) Free ct extension immediately, from Florian.
| | | 12) Skip access to ecache extension from nf_ct_deliver_cached_events(); this is not correct, as reported by Syzbot.
| | | 13) Add and use nf_hook_slow_list(), from Florian.
| | |
| | | * Flowtable infrastructure updates:
| | |
| | | 14) Move priority to nf_flowtable definition.
| | | 15) Dynamic allocation of per-device hooks in flowtables.
| | | 16) Allow including a netdevice only once in flowtable definitions.
| | | 17) Raise maximum number of devices per flowtable.
| | |
| | | * Netfilter hardware offload infrastructure updates:
| | |
| | | 18) Add nft_flow_block_chain() helper function.
| | | 19) Pass callback list to nft_setup_cb_call().
| | | 20) Add nft_flow_cls_offload_setup() helper function.
| | | 21) Remove rules for the unregistered device via netdevice event.
| | | 22) Support for multiple devices in a basechain definition at the ingress hook.
| | | 23) Add nft_chain_offload_cmd() helper function.
| | | 24) Add nft_flow_block_offload_init() helper function.
| | | 25) Rewind in case of failing to bind multiple devices to hook.
| | | 26) Typo in IPv6 tproxy module description, from Norman Rasmussen.
| | | ====================
| | |
| | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: nf_tables_offload: unbind if multi-device binding fails (Pablo Neira Ayuso, 2019-10-26, 1 file, -2/+17)
| | | nft_flow_block_chain() needs to unbind in case of error when performing the multi-device binding.
| | |
| | | Fixes: d54725cd11a5 ("netfilter: nf_tables: support for multiple devices per netdev hook")
| | | Reported-by: wenxu <wenxu@ucloud.cn>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables_offload: add nft_flow_block_offload_init() (Pablo Neira Ayuso, 2019-10-26, 1 file, -21/+21)
| | | This patch adds the nft_flow_block_offload_init() helper function to initialize the flow_block_offload object.
| | |
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables_offload: add nft_chain_offload_cmd() (Pablo Neira Ayuso, 2019-10-26, 1 file, -5/+15)
| | | This patch adds the nft_chain_offload_cmd() helper function.
| | |
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ecache: don't look for ecache extension on dying/unconfirmed conntracks (Florian Westphal, 2019-10-26, 1 file, -3/+3)
| | | syzbot reported the following splat:
| | | BUG: KASAN: use-after-free in __nf_ct_ext_exist include/net/netfilter/nf_conntrack_extend.h:53 [inline]
| | | BUG: KASAN: use-after-free in nf_ct_deliver_cached_events+0x5c3/0x6d0 net/netfilter/nf_conntrack_ecache.c:205
| | |  nf_conntrack_confirm include/net/netfilter/nf_conntrack_core.h:65 [inline]
| | |  nf_confirm+0x3d8/0x4d0 net/netfilter/nf_conntrack_proto.c:154
| | |  [..]
| | |
| | | While there is no reproducer yet, the syzbot report contains one interesting bit of information:
| | |
| | | Freed by task 27585:
| | | [..]
| | |  kfree+0x10a/0x2c0 mm/slab.c:3757
| | |  nf_ct_ext_destroy+0x2ab/0x2e0 net/netfilter/nf_conntrack_extend.c:38
| | |  nf_conntrack_free+0x8f/0xe0 net/netfilter/nf_conntrack_core.c:1418
| | |  destroy_conntrack+0x1a2/0x270 net/netfilter/nf_conntrack_core.c:626
| | |  nf_conntrack_put include/linux/netfilter/nf_conntrack_common.h:31 [inline]
| | |  nf_ct_resolve_clash net/netfilter/nf_conntrack_core.c:915 [inline]
| | |  ^^^^^^^^^^^^^^^^^^^
| | |  __nf_conntrack_confirm+0x21ca/0x2830 net/netfilter/nf_conntrack_core.c:1038
| | |  nf_conntrack_confirm include/net/netfilter/nf_conntrack_core.h:63 [inline]
| | |  nf_confirm+0x3e7/0x4d0 net/netfilter/nf_conntrack_proto.c:154
| | |
| | | This is what's happening:
| | |
| | | 1. a conntrack entry is about to be confirmed (added to hash table).
| | | 2. a clash with existing entry is detected.
| | | 3. nf_ct_resolve_clash() puts skb->nfct (the "losing" entry).
| | | 4. this entry now has a refcount of 0 and is freed to SLAB_TYPESAFE_BY_RCU kmem cache.
| | |
| | | skb->nfct has been replaced by the one found in the hash. Problem is that nf_conntrack_confirm() uses the old ct:
| | |
| | | static inline int nf_conntrack_confirm(struct sk_buff *skb)
| | | {
| | |         struct nf_conn *ct = (struct nf_conn *)skb_nfct(skb);
| | |         int ret = NF_ACCEPT;
| | |
| | |         if (ct) {
| | |                 if (!nf_ct_is_confirmed(ct))
| | |                         ret = __nf_conntrack_confirm(skb);
| | |                 if (likely(ret == NF_ACCEPT))
| | |                         nf_ct_deliver_cached_events(ct); /* This ct has refcount 0! */
| | |         }
| | |         return ret;
| | | }
| | |
| | | As of "netfilter: conntrack: free extension area immediately", we can't access conntrack extensions in this case.
| | |
| | | To fix this, make sure we check the dying bit presence before attempting to get the ecache extension.
| | |
| | | Reported-by: syzbot+c7aabc9fe93e7f3637ba@syzkaller.appspotmail.com
| | | Fixes: 2ad9d7747c10d1 ("netfilter: conntrack: free extension area immediately")
| | | Signed-off-by: Florian Westphal <fw@strlen.de>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables: support for multiple devices per netdev hook (Pablo Neira Ayuso, 2019-10-23, 3 files, -96/+289)
| | | This patch allows you to register one netdev basechain to multiple devices. This adds a new NFTA_HOOK_DEVS netlink attribute to specify the list of netdevices. Basechains store a list of hooks.
| | |
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables_offload: remove rules on unregistered device only (Pablo Neira Ayuso, 2019-10-23, 1 file, -13/+13)
| | | After unbinding the list of flow_block callbacks, iterate over it to remove the existing rules in the netdevice that has just been unregistered.
| | |
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables_offload: add nft_flow_cls_offload_setup() (Pablo Neira Ayuso, 2019-10-23, 1 file, -13/+24)
| | | Add helper function to set up the flow_cls_offload object.
| | |
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables_offload: Pass callback list to nft_setup_cb_call() (Pablo Neira Ayuso, 2019-10-23, 1 file, -4/+5)
| | | This allows reuse of nft_setup_cb_call() from the callback unbind path.
| | |
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables_offload: add nft_flow_block_chain() (Pablo Neira Ayuso, 2019-10-23, 1 file, -4/+11)
| | | Add nft_flow_block_chain() helper function to reuse this function from netdev event handler.
| | |
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables: increase maximum devices number per flowtable (Pablo Neira Ayuso, 2019-10-23, 1 file, -1/+1)
| | | Raise the maximum limit of devices per flowtable up to 256. Rename NFT_FLOWTABLE_DEVICE_MAX to NFT_NETDEVICE_MAX in preparation to reuse the netdev hook parser for ingress basechain.
| | |
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables: allow netdevice to be used only once per flowtable (Pablo Neira Ayuso, 2019-10-23, 1 file, -0/+17)
| | | Allow netdevice only once per flowtable, otherwise hit EEXIST.
| | |
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
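The "only once, otherwise EEXIST" rule amounts to a duplicate check before insertion. The sketch below illustrates it with a fixed-size table of device names; `struct flowtable_like`, `flowtable_add_dev()`, the size limits, and the `-ENOSPC` result for a full table are all assumptions for the example and do not mirror the nf_tables data structures.

```c
#include <assert.h>
#include <errno.h>
#include <string.h>

#define MAX_DEVS 8    /* hypothetical limit for this sketch */
#define NAME_LEN 16

struct flowtable_like {
	char devs[MAX_DEVS][NAME_LEN];
	int n;
};

/* Add a device name; reject a name that is already present. */
static int flowtable_add_dev(struct flowtable_like *ft, const char *name)
{
	assert(strlen(name) < NAME_LEN);

	for (int i = 0; i < ft->n; i++)
		if (strcmp(ft->devs[i], name) == 0)
			return -EEXIST;   /* same device listed twice */

	if (ft->n == MAX_DEVS)
		return -ENOSPC;           /* assumption: table is full */

	strcpy(ft->devs[ft->n++], name);
	return 0;
}
```

Rejecting the duplicate at insertion time keeps every later walk over the device list free of special cases for repeated entries.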
| * | netfilter: nf_tables: dynamically allocate hooks per net_device in flowtables (Pablo Neira Ayuso, 2019-10-23, 1 file, -102/+151)
| | | Use a list of hooks per device instead of an array.
| | |
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_flow_table: move priority to struct nf_flowtable (Pablo Neira Ayuso, 2019-10-23, 1 file, -5/+5)
| | | Hardware offload needs access to the priority field, so store this field in the nf_flowtable object.
| | |
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nft_tproxy: Fix typo in IPv6 module description. (Norman Rasmussen, 2019-10-17, 1 file, -1/+1)
| | | Signed-off-by: Norman Rasmussen <norman@rasmussen.co.za>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: add and use nf_hook_slow_list() (Florian Westphal, 2019-10-17, 1 file, -0/+20)
| | | At this time, the NF_HOOK_LIST() macro iterates the list and then calls nf_hook() for each individual skb.
| | |
| | | This makes it so the entire list is passed into the netfilter core. The advantage is that we only need to fetch the rule blob once per list instead of per-skb.
| | |
| | | NF_HOOK_LIST now only works for ipv4 and ipv6, as those are the only callers.
| | |
| | | v2: use skb_list_del_init() instead of list_del (Edward Cree)
| | |
| | | Signed-off-by: Florian Westphal <fw@strlen.de>
| | | Acked-by: Edward Cree <ecree@solarflare.com>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: conntrack: free extension area immediately (Florian Westphal, 2019-10-17, 2 files, -13/+10)
| | | Instead of waiting for rcu grace period just free it directly.
| | |
| | | This is safe because conntrack lookup doesn't consider extensions.
| | |
| | | Other accesses happen while ct->ext can't be free'd, either because a ct refcount was taken or because the conntrack hash bucket lock or the dying list spinlock have been taken.
| | |
| | | This allows removing __krealloc in a followup patch, netfilter was the only user.
| | |
| | | Signed-off-by: Florian Westphal <fw@strlen.de>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ctnetlink: don't dump ct extensions of unconfirmed conntracks (Florian Westphal, 2019-10-17, 1 file, -26/+50)
| | | When dumping the unconfirmed lists, the cpu that is processing the ct entry can reallocate ct->ext at any time.
| | |
| | | Right now accessing the extensions from another CPU is ok provided we're holding rcu read lock: extension reallocation does use rcu.
| | |
| | | Once RCU isn't used anymore this becomes unsafe, so skip extensions for the unconfirmed list.
| | |
| | | Dumping the extension area for confirmed or dying conntracks is fine: no reallocations are allowed and list iteration holds appropriate locks that prevent ct (and thus ct->ext) from getting free'd.
| | |
| | | v2: fix compiler warnings due to misuse of 'const' and missing return statement (kbuild robot).
| | |
| | | Signed-off-by: Florian Westphal <fw@strlen.de>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | Merge tag 'ipvs-next-for-v5.5' of https://git.kernel.org/pub/scm/linux/kernel/git/horms/ipvs-next (Pablo Neira Ayuso, 2019-10-17, 3 files, -34/+43)
| |\ \
| | | | Pablo Neira Ayuso says:
| | | |
| | | | ====================
| | | | IPVS updates for v5.5
| | | |
| | | | 1) Two patches to speed up ipvs netns dismantle, from Haishuang Yan.
| | | | 2) Three patches to add selftest script for ipvs, also from Haishuang Yan.
| | | | 3) Simplify __ip_vs_get_out_rt() from zhang kai.
| | | | ====================
| | | |
| | | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | ipvs: batch __ip_vs_dev_cleanup (Haishuang Yan, 2019-10-08, 1 file, -7/+12)
| | | | It's better to batch __ip_vs_dev_cleanup to speed up ipvs devices dismantle.
| | | |
| | | | Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
| | | | Acked-by: Julian Anastasov <ja@ssi.bg>
| | | | Signed-off-by: Simon Horman <horms@verge.net.au>
| | * | ipvs: batch __ip_vs_cleanup (Haishuang Yan, 2019-10-08, 2 files, -15/+25)
| | | | It's better to batch __ip_vs_cleanup to speed up ipvs connections dismantle.
| | | |
| | | | Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
| | | | Acked-by: Julian Anastasov <ja@ssi.bg>
| | | | Signed-off-by: Simon Horman <horms@verge.net.au>
| | * | ipvs: no need to update skb route entry for local destination packets. (zhang kai, 2019-10-08, 1 file, -12/+6)
| | | | At the end of function __ip_vs_get_out_rt/__ip_vs_get_out_rt_v6, the 'local' variable is always zero.
| | | |
| | | | Signed-off-by: zhang kai <zhangkaiheb@126.com>
| | | | Acked-by: Julian Anastasov <ja@ssi.bg>
| | | | Signed-off-by: Simon Horman <horms@verge.net.au>
| * | | netfilter: ecache: document extension area access rules (Florian Westphal, 2019-10-17, 1 file, -2/+15)
| |/ /
| | | Once ct->ext gets free'd via kfree() rather than kfree_rcu we can't access the extension area anymore without owning the conntrack.
| | |
| | | This is a special case:
| | |
| | | The worker is walking the pcpu dying list while holding dying list lock: Neither ct nor ct->ext can be free'd until after the walk has completed.
| | |
| | | Signed-off-by: Florian Westphal <fw@strlen.de>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ipset: move ip_set_get_ip_port() to ip_set_bitmap_port.c. (Jeremy Sowden, 2019-10-07, 2 files, -28/+27)
| | | ip_set_get_ip_port() is only used in ip_set_bitmap_port.c. Move it there and make it static.
| | |
| | | Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
| | | Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ipset: move function to ip_set_bitmap_ip.c. (Jeremy Sowden, 2019-10-07, 1 file, -0/+12)
| | | One inline function in ip_set_bitmap.h is only called in ip_set_bitmap_ip.c: move it and remove inline function specifier.
| | |
| | | Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
| | | Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ipset: make ip_set_put_flags extern. (Jeremy Sowden, 2019-10-07, 1 file, -0/+24)
| | | ip_set_put_flags is rather large for a static inline function in a header-file. Move it to ip_set_core.c and export it.
| | |
| | | Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
| | | Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ipset: move functions to ip_set_core.c. (Jeremy Sowden, 2019-10-07, 1 file, -0/+102)
| | | Several inline functions in ip_set.h are only called in ip_set_core.c: move them and remove inline function specifier.
| | |
| | | Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
| | | Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ipset: move ip_set_comment functions from ip_set.h to ip_set_core.c. (Jeremy Sowden, 2019-10-07, 1 file, -1/+65)
| | | Most of the functions are only called from within ip_set_core.c.
| | |
| | | The exception is ip_set_init_comment. However, this is too complex to be a good candidate for a static inline function. Move it to ip_set_core.c, change its linkage to extern and export it, leaving a declaration in ip_set.h.
| | |
| | | ip_set_comment_free is only used as an extension destructor, so change its prototype to match and drop cast.
| | |
| | | Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
| | | Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: ipset: remove inline from static functions in .c files. (Jeremy Sowden, 2019-10-07, 19 files, -138/+138)
| | | The inline function-specifier should not be used for static functions defined in .c files since it bloats the kernel. Instead leave the compiler to decide which functions to inline.
| | |
| | | While a couple of the files affected (ip_set_*_gen.h) are technically headers, they contain templates for generating the common parts of particular set-types and so we treat them like .c files.
| | |
| | | Signed-off-by: Jeremy Sowden <jeremy@azazel.net>
| | | Acked-by: Jozsef Kadlecsik <kadlec@netfilter.org>
| | | Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | Merge branch 'for-upstream' of git://git.kernel.org/pub/scm/linux/kernel/git/bluetooth/bluetooth-next (David S. Miller, 2019-10-26, 3 files, -9/+18)
|\ \ \
| | | | Johan Hedberg says:
| | | |
| | | | ====================
| | | | pull request: bluetooth-next 2019-10-23
| | | |
| | | | Here's the main bluetooth-next pull request for the 5.5 kernel:
| | | |
| | | | - Multiple fixes to hci_qca driver
| | | | - Fix for HCI_USER_CHANNEL initialization
| | | | - btwilink: drop superseded driver
| | | | - Add support for Intel FW download error recovery
| | | | - Various other smaller fixes & improvements
| | | |
| | | | Please let me know if there are any issues pulling. Thanks.
| | | | ====================
| | | |
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Bluetooth: hci_core: fix init for HCI_USER_CHANNEL (Mattijs Korpershoek, 2019-10-17, 1 file, -1/+8)
| | | | During the setup() stage, HCI device drivers expect the chip to acknowledge its setup() completion via vendor specific frames.
| | | |
| | | | If userspace open()s such an HCI device in HCI_USER_CHANNEL [1] mode, the vendor specific frames are never transmitted to the driver, as they are filtered in hci_rx_work().
| | | |
| | | | Allow HCI devices which operate in HCI_USER_CHANNEL mode to receive frames if the HCI device is in HCI_INIT state.
| | | |
| | | | [1] https://www.spinics.net/lists/linux-bluetooth/msg37345.html
| | | |
| | | | Fixes: 23500189d7e0 ("Bluetooth: Introduce new HCI socket channel for user operation")
| | | | Signed-off-by: Mattijs Korpershoek <mkorpershoek@baylibre.com>
| | | | Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
| * | | Bluetooth: Workaround directed advertising bug in Broadcom controllers  Szymon Janc  2019-10-16  1  -0/+8
| | | | It appears that some Broadcom controllers (e.g. BCM20702A0) reject the LE Set Advertising Parameters command if the advertising intervals provided are not within range for undirected and low duty directed advertising.
| | | |
| | | | Work around this bug by populating min and max intervals with 'valid' values.
| | | |
| | | | < HCI Command: LE Set Advertising Parameters (0x08|0x0006) plen 15
| | | |         Min advertising interval: 0.000 msec (0x0000)
| | | |         Max advertising interval: 0.000 msec (0x0000)
| | | |         Type: Connectable directed - ADV_DIRECT_IND (high duty cycle) (0x01)
| | | |         Own address type: Public (0x00)
| | | |         Direct address type: Random (0x01)
| | | |         Direct address: E2:F0:7B:9F:DC:F4 (Static)
| | | |         Channel map: 37, 38, 39 (0x07)
| | | |         Filter policy: Allow Scan Request from Any, Allow Connect Request from Any (0x00)
| | | | > HCI Event: Command Complete (0x0e) plen 4
| | | |       LE Set Advertising Parameters (0x08|0x0006) ncmd 1
| | | |         Status: Invalid HCI Command Parameters (0x12)
| | | |
| | | | Signed-off-by: Szymon Janc <szymon.janc@codecoup.pl>
| | | | Tested-by: Sören Beye <linux@hypfer.de>
| | | | Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
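A rough sketch of such a workaround follows. It assumes the usual advertising interval range of 0x0020 to 0x4000 in units of 0.625 ms; the struct, the constant names, and the choice of the lower bound as the substituted value are illustrative assumptions, not taken from the patch.

```c
#include <stdint.h>

/* Assumed valid range for advertising intervals, in 0.625 ms units. */
#define SKETCH_ADV_INTERVAL_MIN 0x0020 /* 20 ms */
#define SKETCH_ADV_INTERVAL_MAX 0x4000 /* 10.24 s */

/* Type value shown in the trace above for high duty cycle ADV_DIRECT_IND. */
#define SKETCH_ADV_DIRECT_IND_HIGH_DUTY 0x01

/* Hypothetical, host-endian subset of the command parameters. */
struct sketch_adv_params {
    uint16_t min_interval;
    uint16_t max_interval;
    uint8_t type;
};

/* Convert an interval in 0.625 ms units to microseconds. */
uint32_t sketch_adv_interval_to_usec(uint16_t units)
{
    return (uint32_t)units * 625u;
}

/* For high duty cycle directed advertising, replace out-of-range or
 * inconsistent intervals with an in-range value so that a strict
 * controller does not reject the command. */
void sketch_adv_params_fixup(struct sketch_adv_params *p)
{
    if (p->type != SKETCH_ADV_DIRECT_IND_HIGH_DUTY)
        return;

    if (p->min_interval < SKETCH_ADV_INTERVAL_MIN ||
        p->max_interval > SKETCH_ADV_INTERVAL_MAX ||
        p->min_interval > p->max_interval) {
        p->min_interval = SKETCH_ADV_INTERVAL_MIN;
        p->max_interval = SKETCH_ADV_INTERVAL_MIN;
    }
}
```

With the parameters from the trace (both intervals 0x0000), the fixup would send 0x0020 for both fields instead.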
| * | | Bluetooth: missed cpu_to_le16 conversion in hci_init4_req  Ben Dooks (Codethink)  2019-10-16  1  -2/+2
| | | | It looks like in hci_init4_req() the request is being initialised from cpu-endian data but the packet is specified to be little-endian. This causes a warning from sparse due to __le16 to u16 conversion.
| | | |
| | | | Fix this by using cpu_to_le16() on the two fields in the packet.
| | | |
| | | | net/bluetooth/hci_core.c:845:27: warning: incorrect type in assignment (different base types)
| | | | net/bluetooth/hci_core.c:845:27:    expected restricted __le16 [usertype] tx_len
| | | | net/bluetooth/hci_core.c:845:27:    got unsigned short [usertype] le_max_tx_len
| | | | net/bluetooth/hci_core.c:846:28: warning: incorrect type in assignment (different base types)
| | | | net/bluetooth/hci_core.c:846:28:    expected restricted __le16 [usertype] tx_time
| | | | net/bluetooth/hci_core.c:846:28:    got unsigned short [usertype] le_max_tx_time
| | | |
| | | | Signed-off-by: Ben Dooks <ben.dooks@codethink.co.uk>
| | | | Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
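To illustrate why the conversion matters, here is a portable userspace sketch. `sketch_to_le16()` is a stand-in for the kernel's cpu_to_le16(), and `struct sketch_def_data_len` is a hypothetical packet layout that only borrows the field names tx_len and tx_time from the sparse output above.

```c
#include <stdint.h>
#include <string.h>

/* Userspace stand-in for cpu_to_le16(): returns a value whose in-memory
 * byte order is little-endian regardless of the host byte order. */
uint16_t sketch_to_le16(uint16_t v)
{
    uint8_t bytes[2] = { (uint8_t)(v & 0xff), (uint8_t)(v >> 8) };
    uint16_t out;

    memcpy(&out, bytes, sizeof(out));
    return out;
}

/* Hypothetical on-the-wire layout; both fields are little-endian. */
struct sketch_def_data_len {
    uint16_t tx_len;
    uint16_t tx_time;
};

/* Fill the packet from host-endian values, converting each field. */
void sketch_fill_def_data_len(struct sketch_def_data_len *cp,
                              uint16_t le_max_tx_len,
                              uint16_t le_max_tx_time)
{
    cp->tx_len = sketch_to_le16(le_max_tx_len);
    cp->tx_time = sketch_to_le16(le_max_tx_time);
}
```

On a little-endian host the conversion is a no-op, which is why a missing cpu_to_le16() tends to go unnoticed until sparse or a big-endian machine exposes it.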
| * | | Bluetooth: remove set but not used variable 'smp'  YueHaibing  2019-10-16  1  -6/+0
| | | | Fixes gcc '-Wunused-but-set-variable' warning:
| | | |
| | | | net/bluetooth/smp.c: In function 'smp_irk_matches':
| | | | net/bluetooth/smp.c:505:18: warning: variable 'smp' set but not used [-Wunused-but-set-variable]
| | | | net/bluetooth/smp.c: In function 'smp_generate_rpa':
| | | | net/bluetooth/smp.c:526:18: warning: variable 'smp' set but not used [-Wunused-but-set-variable]
| | | |
| | | | It is not used since commit 28a220aac596 ("bluetooth: switch to AES library")
| | | |
| | | | Signed-off-by: YueHaibing <yuehaibing@huawei.com>
| | | | Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
* | | | tcp: add TCP_INFO status for failed client TFO  Jason Baron  2019-10-26  3  -1/+10
| | | | The TCPI_OPT_SYN_DATA bit as part of tcpi_options currently reports whether or not data-in-SYN was ack'd on both the client and server side. We'd like to gather more information on the client-side in the failure case in order to indicate the reason for the failure. This can be useful for not only debugging TFO, but also for creating TFO socket policies. For example, if a middle box removes the TFO option or drops a data-in-SYN, we can detect this case, and turn off TFO for these connections saving the extra retransmits.
| | | |
| | | | The newly added tcpi_fastopen_client_fail status is 2 bits and has the following 4 states:
| | | |
| | | | 1) TFO_STATUS_UNSPEC
| | | |
| | | | Catch-all state which includes when TFO is disabled via black hole detection, which is indicated via LINUX_MIB_TCPFASTOPENBLACKHOLE.
| | | |
| | | | 2) TFO_COOKIE_UNAVAILABLE
| | | |
| | | | If TFO_CLIENT_NO_COOKIE mode is off, this state indicates that no cookie is available in the cache.
| | | |
| | | | 3) TFO_DATA_NOT_ACKED
| | | |
| | | | Data was sent with SYN, we received a SYN/ACK but it did not cover the data portion. Cookie is not accepted by server because the cookie may be invalid or the server may be overloaded.
| | | |
| | | | 4) TFO_SYN_RETRANSMITTED
| | | |
| | | | Data was sent with SYN, we received a SYN/ACK which did not cover the data after at least 1 additional SYN was sent (without data). It may be the case that a middle-box is dropping data-in-SYN packets. Thus, it would be more efficient to not use TFO on this connection to avoid extra retransmits during connection establishment.
| | | |
| | | | These new fields do not cover all the cases where TFO may fail, but other failures, such as SYN/ACK + data being dropped, will result in the connection not becoming established. And a connection blackhole after session establishment shows up as a stalled connection.
| | | |
| | | | Signed-off-by: Jason Baron <jbaron@akamai.com>
| | | | Cc: Eric Dumazet <edumazet@google.com>
| | | | Cc: Neal Cardwell <ncardwell@google.com>
| | | | Cc: Christoph Paasch <cpaasch@apple.com>
| | | | Cc: Yuchung Cheng <ycheng@google.com>
| | | | Acked-by: Yuchung Cheng <ycheng@google.com>
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
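The four states listed above could be consumed by a socket policy along these lines. This is a sketch under assumptions: the `SKETCH_` enum is a local copy whose numeric values are assumed to follow the listed order starting at 0 (which fits the 2-bit field), and the decision in `sketch_tfo_should_disable()` is one possible policy, not something the patch prescribes.

```c
#include <stdbool.h>

/* Local mirror of the four client-side failure states; values are
 * assumed to follow the order in which the message lists them. */
enum sketch_tfo_client_fail {
    SKETCH_TFO_STATUS_UNSPEC = 0,
    SKETCH_TFO_COOKIE_UNAVAILABLE = 1,
    SKETCH_TFO_DATA_NOT_ACKED = 2,
    SKETCH_TFO_SYN_RETRANSMITTED = 3,
};

/* Human-readable description of a 2-bit status value. */
const char *sketch_tfo_fail_str(unsigned int status)
{
    switch (status & 0x3) {
    case SKETCH_TFO_COOKIE_UNAVAILABLE:
        return "no cookie available in the cache";
    case SKETCH_TFO_DATA_NOT_ACKED:
        return "SYN/ACK did not cover the SYN data";
    case SKETCH_TFO_SYN_RETRANSMITTED:
        return "SYN data not acked after SYN retransmit";
    default:
        return "unspecified";
    }
}

/* Hypothetical policy: stop using TFO towards a peer when the path
 * appears to drop data-in-SYN packets. */
bool sketch_tfo_should_disable(unsigned int status)
{
    return (status & 0x3) == SKETCH_TFO_SYN_RETRANSMITTED;
}
```

An application would read the status from the tcpi_fastopen_client_fail field returned by the TCP_INFO socket option and feed it to helpers like these.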
* | | | net: sch_generic: Use pfifo_fast as fallback scheduler for CAN hardware  Vincent Prince  2019-10-26  1  -0/+2
| | | | There is networking hardware that isn't based on Ethernet for layers 1 and 2.
| | | |
| | | | For example CAN.
| | | |
| | | | CAN is a multi-master serial bus standard for connecting Electronic Control Units [ECUs] also known as nodes. A frame on the CAN bus carries up to 8 bytes of payload. Frame corruption is detected by a CRC. Frame loss due to corruption is possible, but quite unusual.
| | | |
| | | | While fq_codel works great for TCP/IP, it doesn't for CAN. There are a lot of legacy protocols on top of CAN, which are not built with flow control or high CAN frame drop rates in mind.
| | | |
| | | | When using fq_codel, as soon as the queue reaches a certain delay based length, skbs from the head of the queue are silently dropped. Silently meaning that the user space using a send() or similar syscall doesn't get an error. However TCP's flow control algorithm will detect dropped packets and adjust the bandwidth accordingly.
| | | |
| | | | When using fq_codel and sending raw frames over CAN, which is the common use case, the user space thinks the packet has been sent without problems, because send() returned without an error. pfifo_fast will drop skbs, if the queue length exceeds the maximum. But with this scheduler the skbs at the tail are dropped and an error (-ENOBUFS) is propagated to user space, so that the user space can slow down the packet generation.
| | | |
| | | | On distributions, where fq_codel is made default via CONFIG_DEFAULT_NET_SCH during compile time, or set default during runtime with sysctl net.core.default_qdisc (see [1]), we get a bad user experience.
| | | |
| | | | In my test case with pfifo_fast, I can transfer thousands of millions of CAN frames without a frame drop. On the other hand with fq_codel there is more than one lost CAN frame per thousand frames.
| | | |
| | | | As pointed out fq_codel is not suited for CAN hardware, so this patch changes attach_one_default_qdisc() to use pfifo_fast for "ARPHRD_CAN" network devices.
| | | |
| | | | During transition of a netdev from down to up state the default queuing discipline is attached by attach_default_qdiscs() with the help of attach_one_default_qdisc(). This patch modifies attach_one_default_qdisc() to attach the pfifo_fast (pfifo_fast_ops) if the network device type is "ARPHRD_CAN".
| | | |
| | | | [1] https://github.com/systemd/systemd/issues/9194
| | | |
| | | | Suggested-by: Marc Kleine-Budde <mkl@pengutronix.de>
| | | | Signed-off-by: Marc Kleine-Budde <mkl@pengutronix.de>
| | | | Signed-off-by: Vincent Prince <vincent.prince.fr@gmail.com>
| | | | Acked-by: Dave Taht <dave.taht@gmail.com>
| | | | Signed-off-by: David S. Miller <davem@davemloft.net>
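The behaviour the message relies on, tail drop with an error visible to the sender, plus the per-device-type fallback, can be sketched in userspace as follows. Everything prefixed `sketch_`/`SKETCH_` is hypothetical; the real qdisc code works on skbs and `struct Qdisc_ops`, not on integers and strings.

```c
#include <errno.h>
#include <stddef.h>

#define SKETCH_QUEUE_LEN 4

/* Minimal bounded FIFO standing in for a tail-dropping queue. */
struct sketch_fifo {
    int frames[SKETCH_QUEUE_LEN];
    size_t count;
};

/* Tail drop: when the queue is full the new frame is refused and the
 * caller gets -ENOBUFS, so it can slow down instead of losing frames
 * silently. */
int sketch_fifo_enqueue(struct sketch_fifo *q, int frame)
{
    if (q->count == SKETCH_QUEUE_LEN)
        return -ENOBUFS;

    q->frames[q->count++] = frame;
    return 0;
}

/* Hypothetical device classes; the patch keys on ARPHRD_CAN. */
enum sketch_dev_type {
    SKETCH_DEV_ETHER,
    SKETCH_DEV_CAN,
};

/* Pick the default queueing discipline: CAN devices always fall back to
 * pfifo_fast, everything else keeps the configured default. */
const char *sketch_default_qdisc(enum sketch_dev_type type,
                                 const char *configured_default)
{
    return type == SKETCH_DEV_CAN ? "pfifo_fast" : configured_default;
}
```

A sender that sees -ENOBUFS can retry after a short delay, which is the feedback loop that head-dropping schedulers do not provide for raw CAN frames.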
* | | | fq_codel: do not include <linux/jhash.h>  Eric Dumazet  2019-10-23  1  -1/+0
| | | | Since commit 342db221829f ("sched: Call skb_get_hash_perturb in sch_fq_codel") we no longer need anything from this file.
| | | |
| | | | Signed-off-by: Eric Dumazet <edumazet@google.com>
| | | | Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
* | | | net: dsa: remove dsa_switch_alloc helper  Vivien Didelot  2019-10-22  1  -15/+6
| | | | Now that ports are dynamically listed in the fabric, there is no need to provide a special helper to allocate the dsa_switch structure. This will give more flexibility to drivers to embed this structure as they wish in their private structure.
| | | |
| | | | Signed-off-by: Vivien Didelot <vivien.didelot@gmail.com>
| | | | Reviewed-by: Florian Fainelli <f.fainelli@gmail.com>
| | | | Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>