summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* cxgb4: collect HMA memory dumpRahul Lakkireddy2017-12-089-4/+51
| | | | | | Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* cxgb4: collect MC memory dumpRahul Lakkireddy2017-12-085-56/+108
| | | | | | | | | Use meminfo to get base address and size of MC memory. Also use same meminfo for EDC memory dumps. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* cxgb4: collect on-chip memory informationRahul Lakkireddy2017-12-087-241/+374
| | | | | | | | | | | | Collect memory layout of various on-chip memory regions. Move code for collecting on-chip memory information to cudbg_lib.c and update cxgb4_debugfs.c to use the common function. Also include cudbg_entity.h before cudbg_lib.h to avoid adding cudbg entity structure forward declarations in cudbg_lib.h. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'veth-and-GSO-maximums'David S. Miller2017-12-082-0/+37
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Stephen Hemminger says: ==================== veth and GSO maximums This is the more general way to solving the issue of GSO limits not being set correctly for containers on Azure. If a GSO packet is sent to host that exceeds the limit (reported by NDIS), then the host is forced to do segmentation in software which has noticeable performance impact. The core rtnetlink infrastructure already has the messages and infrastructure to allow changing gso limits. With an updated iproute2 the following already works: # ip li set dev dummy0 gso_max_size 30000 These patches are about making it easier with veth. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * veth: set peer GSO valuesStephen Hemminger2017-12-081-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When new veth is created, and GSO values have been configured on one device, clone those values to the peer. For example: # ip link add dev vm1 gso_max_size 65530 type veth peer name vm2 This should create vm1 <--> vm2 with both having GSO maximum size set to 65530. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * rtnetlink: allow GSO maximums to be set on device creationStephen Hemminger2017-12-081-0/+34
|/ | | | | | | | | Netlink device already allows changing GSO sizes with ip set command. The part that is missing is allowing overriding GSO settings on device creation. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: stmmac: fix broken dma_interrupt handling for multi-queuesNiklas Cassel2017-12-081-8/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is nothing that says that number of TX queues == number of RX queues. E.g. the ARTPEC-6 SoC has 2 TX queues and 1 RX queue. This code is obviously wrong: for (chan = 0; chan < tx_channel_count; chan++) { struct stmmac_rx_queue *rx_q = &priv->rx_queue[chan]; priv->rx_queue has size MTL_MAX_RX_QUEUES, so this will send an uninitialized napi_struct to __napi_schedule(), causing us to crash in net_rx_action(), because napi_struct->poll is zero. [12846.759880] Unable to handle kernel NULL pointer dereference at virtual address 00000000 [12846.768014] pgd = (ptrval) [12846.770742] [00000000] *pgd=39ec7831, *pte=00000000, *ppte=00000000 [12846.777023] Internal error: Oops: 80000007 [#1] PREEMPT SMP ARM [12846.782942] Modules linked in: [12846.785998] CPU: 0 PID: 161 Comm: dropbear Not tainted 4.15.0-rc2-00285-gf5fb5f2f39a7 #36 [12846.794177] Hardware name: Axis ARTPEC-6 Platform [12846.798879] task: (ptrval) task.stack: (ptrval) [12846.803407] PC is at 0x0 [12846.805942] LR is at net_rx_action+0x274/0x43c [12846.810383] pc : [<00000000>] lr : [<80bff064>] psr: 200e0113 [12846.816648] sp : b90d9ae8 ip : b90d9ae8 fp : b90d9b44 [12846.821871] r10: 00000008 r9 : 0013250e r8 : 00000100 [12846.827094] r7 : 0000012c r6 : 00000000 r5 : 00000001 r4 : bac84900 [12846.833619] r3 : 00000000 r2 : b90d9b08 r1 : 00000000 r0 : bac84900 Since each DMA channel can be used for rx and tx simultaneously, the current code should probably be rewritten so that napi_struct is embedded in a new struct stmmac_channel. That way, stmmac_poll() can call stmmac_tx_clean() on just the tx queue where we got the IRQ, instead of looping through all tx queues. This is also how the xgbe driver does it (another driver for this IP). Fixes: c22a3f48ef99 ("net: stmmac: adding multiple napi mechanism") Signed-off-by: Niklas Cassel <niklas.cassel@axis.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dsa: lan9303: Protect ALR operations with mutexEgil Hjelmeland2017-12-082-2/+13
| | | | | | | | | | ALR table operations are a sequence of related register operations which should be protected from concurrent access. The alr_cache should also be protected. Add alr_mutex doing that. Signed-off-by: Egil Hjelmeland <privat@egil-hjelmeland.no> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sched: fix use-after-free in tcf_block_put_extJiri Pirko2017-12-081-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the block is freed with last chain being put, once we reach the end of iteration of list_for_each_entry_safe, the block may be already freed. I'm hitting this only by creating and deleting clsact: [ 202.171952] ================================================================== [ 202.180182] BUG: KASAN: use-after-free in tcf_block_put_ext+0x240/0x390 [ 202.187590] Read of size 8 at addr ffff880225539a80 by task tc/796 [ 202.194508] [ 202.196185] CPU: 0 PID: 796 Comm: tc Not tainted 4.15.0-rc2jiri+ #5 [ 202.203200] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016 [ 202.213613] Call Trace: [ 202.216369] dump_stack+0xda/0x169 [ 202.220192] ? dma_virt_map_sg+0x147/0x147 [ 202.224790] ? show_regs_print_info+0x54/0x54 [ 202.229691] ? tcf_chain_destroy+0x1dc/0x250 [ 202.234494] print_address_description+0x83/0x3d0 [ 202.239781] ? tcf_block_put_ext+0x240/0x390 [ 202.244575] kasan_report+0x1ba/0x460 [ 202.248707] ? tcf_block_put_ext+0x240/0x390 [ 202.253518] tcf_block_put_ext+0x240/0x390 [ 202.258117] ? tcf_chain_flush+0x290/0x290 [ 202.262708] ? qdisc_hash_del+0x82/0x1a0 [ 202.267111] ? qdisc_hash_add+0x50/0x50 [ 202.271411] ? __lock_is_held+0x5f/0x1a0 [ 202.275843] clsact_destroy+0x3d/0x80 [sch_ingress] [ 202.281323] qdisc_destroy+0xcb/0x240 [ 202.285445] qdisc_graft+0x216/0x7b0 [ 202.289497] tc_get_qdisc+0x260/0x560 Fix this by holding the block also by chain 0 and put chain 0 explicitly, out of the list_for_each_entry_safe loop at the very end of tcf_block_put_ext. Fixes: efbf78973978 ("net_sched: get rid of rcu_barrier() in tcf_block_put_ext()") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'lockless-qdisc-series'David S. Miller2017-12-0810-176/+512
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | John Fastabend says: ==================== lockless qdisc series This series adds support for building lockless qdiscs. This is the result of noticing the qdisc lock is a common hot-spot in perf analysis of the Linux network stack, especially when testing with high packet per second rates. However, nothing is free and most qdiscs rely on the qdisc lock for their data structures so each qdisc must be converted on a case by case basis. In this series, to kick things off, we make pfifo_fast, mq, and mqprio lockless. Follow up series can address additional qdiscs as needed. For example sch_tbf might be useful. To allow this the lockless design is an opt-in flag. In some future utopia we convert all qdiscs and we get to drop this case analysis, but in order to make progress we live in the real-world. There are also a handful of optimizations I have behind this series and a few code cleanups that I couldn't figure out how to fit neatly into this series with out increasing the patch count. Once this is in additional patches can address this. The most notable is in skb_dequeue we can push the consumer lock out a bit and consume multiple skbs off the skb_array in pfifo fast per iteration. Ideally we could push arrays of packets at drivers as well but we would need the infrastructure for this. The other notable improvement is to do less locking in the overrun cases where bad tx queue list and gso_skb are being hit. Although, nice in theory in practice this is the error case and I haven't found a benchmark where this matters yet. For testing... My first test case uses multiple containers (via cilium) where multiple client containers use 'wrk' to benchmark connections with a server container running lighttpd. Where lighttpd is configured to use multiple threads, one per core. Additionally this test has a proxy agent running so all traffic takes an extra hop through a proxy container. In these cases each TCP packet traverses the egress qdisc layer at least four times and the ingress qdisc layer an additional four times. This makes for a good stress test IMO, perf details below. The other micro-benchmark I run is injecting packets directly into qdisc layer using pktgen. This uses the benchmark script, ./pktgen_bench_xmit_mode_queue_xmit.sh Benchmarks taken in two cases, "base" running latest net-next no changes to qdisc layer and "qdisc" tests run with qdisc lockless updates. Numbers reported in req/sec. All virtual 'veth' devices run with pfifo_fast in the qdisc test case. `wrk -t16 -c $conns -d30 "http://[$SERVER_IP4]:80"` conns 16 32 64 1024 ----------------------------------------------- base: 18831 20201 21393 29151 qdisc: 19309 21063 23899 29265 notice in all cases we see performance improvement when running with qdisc case. Microbenchmarks using pktgen are as follows, `pktgen_bench_xmit_mode_queue_xmit.sh -t 1 -i eth2 -c 20000000 base(mq): 2.1Mpps base(pfifo_fast): 2.1Mpps qdisc(mq): 2.6Mpps qdisc(pfifo_fast): 2.6Mpps notice numbers are the same for mq and pfifo_fast because only testing a single thread here. In both tests we see a nice bump in performance gain. The key with 'mq' is it is already per txq ring so contention is minimal in the above cases. Qdiscs such as tbf or htb which have more contention will likely show larger gains when/if lockless versions are implemented. Thanks to everyone who helped with this work especially Daniel Borkmann, Eric Dumazet and Willem de Bruijn for discussing the design and reviewing versions of the code. Changes from the RFC: dropped a couple patches off the end, fixed a bug with skb_queue_walk_safe not unlinking skb in all cases, fixed a lockdep splat with pfifo_fast_destroy not calling *_bh lock variant, addressed _most_ of Willem's comments, there was a bug in the bulk locking (final patch) of the RFC series. @Willem, I left out lockdep annotation for a follow on series to add lockdep more completely, rather than just in code I touched. Comments and feedback welcome. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: pfifo_fast use skb_arrayJohn Fastabend2017-12-082-53/+92
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This converts the pfifo_fast qdisc to use the skb_array data structure and set the lockless qdisc bit. pfifo_fast is the first qdisc to support the lockless bit that can be a child of a qdisc requiring locking. So we add logic to clear the lock bit on initialization in these cases when the qdisc graft operation occurs. This also removes the logic used to pick the next band to dequeue from and instead just checks a per priority array for packets from top priority to lowest. This might need to be a bit more clever but seems to work for now. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: skb_array: expose peek APIJohn Fastabend2017-12-081-0/+5
| | | | | | | | | | | | | | This adds a peek routine to skb_array.h for use with qdisc. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprioJohn Fastabend2017-12-082-35/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | The sch_mqprio qdisc creates a sub-qdisc per tx queue which are then called independently for enqueue and dequeue operations. However statistics are aggregated and pushed up to the "master" qdisc. This patch adds support for any of the sub-qdiscs to be per cpu statistic qdiscs. To handle this case add a check when calculating stats and aggregate the per cpu stats if needed. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqJohn Fastabend2017-12-083-11/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sch_mq qdisc creates a sub-qdisc per tx queue which are then called independently for enqueue and dequeue operations. However statistics are aggregated and pushed up to the "master" qdisc. This patch adds support for any of the sub-qdiscs to be per cpu statistic qdiscs. To handle this case add a check when calculating stats and aggregate the per cpu stats if needed. Also exports __gnet_stats_copy_queue() to use as a helper function. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: helpers to sum qlen and qlen for per cpu logicJohn Fastabend2017-12-082-1/+22
| | | | | | | | | | | | | | | | | | | | | | | | Add qdisc qlen helper routines for lockless qdiscs to use. The qdisc qlen is no longer used in the hotpath but it is reported via stats query on the qdisc so it still needs to be tracked. This adds the per cpu operations needed along with a helper to return the summation of per cpu stats. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: check for frozen queue before skb_bad_txq checkJohn Fastabend2017-12-081-4/+7
| | | | | | | | | | | | | | | | | | | | I can not think of any reason to pull the bad txq skb off the qdisc if the txq we plan to send this on is still frozen. So check for frozen queue first and abort before dequeuing either skb_bad_txq skb or normal qdisc dequeue() skb. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: use skb list for skb_bad_txJohn Fastabend2017-12-082-21/+87
| | | | | | | | | | | | | | | | | | Similar to how gso is handled use skb list for skb_bad_tx this is required with lockless qdiscs because we may have multiple cores attempting to push skbs into skb_bad_tx concurrently Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: drop qdisc_reset from dev_graft_qdiscJohn Fastabend2017-12-081-9/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In qdisc_graft_qdisc a "new" qdisc is attached and the 'qdisc_destroy' operation is called on the old qdisc. The destroy operation will wait a rcu grace period and call qdisc_rcu_free(). At which point gso_cpu_skb is free'd along with all stats so no need to zero stats and gso_cpu_skb from the graft operation itself. Further after dropping the qdisc locks we can not continue to call qdisc_reset before waiting an rcu grace period so that the qdisc is detached from all cpus. By removing the qdisc_reset() here we get the correct property of waiting an rcu grace period and letting the qdisc_destroy operation clean up the qdisc correctly. Note, a refcnt greater than 1 would cause the destroy operation to be aborted however if this ever happened the reference to the qdisc would be lost and we would have a memory leak. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: explicit locking in gso_cpu fallbackJohn Fastabend2017-12-082-21/+84
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This work is preparing the qdisc layer to support egress lockless qdiscs. If we are running the egress qdisc lockless in the case we overrun the netdev, for whatever reason, the netdev returns a busy error code and the skb is parked on the gso_skb pointer. With many cores all hitting this case at once its possible to have multiple sk_buffs here so we turn gso_skb into a queue. This should be the edge case and if we see this frequently then the netdev/qdisc layer needs to back off. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: a dflt qdisc may be used with per cpu statsJohn Fastabend2017-12-082-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | Enable dflt qdisc support for per cpu stats before this patch a dflt qdisc was required to use the global statistics qstats and bstats. This adds a static flags field to qdisc_ops that is propagated into qdisc->flags in qdisc allocate call. This allows the allocation block to completely allocate the qdisc object so we don't have dangling allocations after qdisc init. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: provide per cpu qstat helpersJohn Fastabend2017-12-081-0/+35
| | | | | | | | | | | | | | | | | | The per cpu qstats support was added with per cpu bstat support which is currently used by the ingress qdisc. This patch adds a set of helpers needed to make other qdiscs that use qstats per cpu as well. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: remove remaining uses for qdisc_qlen in xmit pathJohn Fastabend2017-12-082-18/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sch_direct_xmit() uses qdisc_qlen as a return value but all call sites of the routine only check if it is zero or not. Simplify the logic so that we don't need to return an actual queue length value. This introduces a case now where sch_direct_xmit would have returned a qlen of zero but now it returns true. However in this case all call sites of sch_direct_xmit will implement a dequeue() and get a null skb and abort. This trades tracking qlen in the hotpath for an extra dequeue operation. Overall this seems to be good for performance. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: allow qdiscs to handle lockingJohn Fastabend2017-12-083-14/+43
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds a flag for queueing disciplines to indicate the stack does not need to use the qdisc lock to protect operations. This can be used to build lockless scheduling algorithms and improving performance. The flag is checked in the tx path and the qdisc lock is only taken if it is not set. For now use a conditional if statement. Later we could be more aggressive if it proves worthwhile and use a static key or wrap this in a likely(). Also the lockless case drops the TCQ_F_CAN_BYPASS logic. The reason for this is synchronizing a qlen counter across threads proves to cost more than doing the enqueue/dequeue operations when tested with pktgen. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: cleanup qdisc_run and __qdisc_run semanticsJohn Fastabend2017-12-083-5/+6
|/ | | | | | | | | | | | Currently __qdisc_run calls qdisc_run_end() but does not call qdisc_run_begin(). This makes it hard to track pairs of qdisc_run_{begin,end} across function calls. To simplify reading these code paths this patch moves begin/end calls into qdisc_run(). Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* virtio_net: Disable interrupts if napi_complete_done rescheduled napiToshiaki Makita2017-12-081-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 39e6c8208d7b ("net: solve a NAPI race") napi has been able to be rescheduled within napi_complete_done() even in non-busypoll case, but virtnet_poll() always enabled interrupts before complete, and when napi was rescheduled within napi_complete_done() it did not disable interrupts. This caused more interrupts when event idx is disabled. According to commit cbdadbbf0c79 ("virtio_net: fix race in RX VQ processing") we cannot place virtqueue_enable_cb_prepare() after NAPI_STATE_SCHED is cleared, so disable interrupts again if napi_complete_done() returned false. Tested with vhost-user of OVS 2.7 on host, which does not have the event idx feature. * Before patch: $ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 212992 1472 60.00 32763206 0 6430.32 212992 60.00 23384299 4589.56 Interrupts on guest: 9872369 Packets/interrupt: 2.37 * After patch $ netperf -t UDP_STREAM -H 192.168.150.253 -l 60 -- -m 1472 MIGRATED UDP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.150.253 () port 0 AF_INET Socket Message Elapsed Messages Size Size Time Okay Errors Throughput bytes bytes secs # # 10^6bits/sec 212992 1472 60.00 32794646 0 6436.49 212992 60.00 32793501 6436.27 Interrupts on guest: 4941299 Packets/interrupt: 6.64 Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpf-nextDavid S. Miller2017-12-086-5/+572
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Alexei Starovoitov says: ==================== pull-request: bpf-next 2017-12-07 The following pull-request contains BPF updates for your net-next tree. The main changes are: 1) Detailed documentation of BPF development process from Daniel. 2) Addition of is_fullsock, snd_cwnd and srtt_us fields to bpf_sock_ops from Lawrence. 3) Minor follow up for bpf_skb_set_tunnel_key() from William. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge branch 'bpf-devel-doc'Alexei Starovoitov2017-12-062-0/+523
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== Few BPF doc updates Two changes, i) add BPF trees into maintainers file, and ii) add a BPF doc around the development process, similarly as we have with netdev FAQ, but just describing BPF specifics. Thanks! ==================== Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | * bpf, doc: add faq about bpf development processDaniel Borkmann2017-12-061-0/+519
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the same spirit of netdev FAQ, start a BPF FAQ as a collection of expectations and/or workflow details in the context of BPF patch processing. Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| | * bpf, doc: add bpf trees and tps to maintainers entryDaniel Borkmann2017-12-061-0/+4
| |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | i) Add the bpf and bpf-next trees to the maintainers entry so they can be found easily and picked up by test bots etc that would integrate all trees from maintainers file. Suggested by Stephen while integrating the trees into linux-next. ii) Add the two headers defining BPF/XDP tracepoints to the list of files as well. Suggested-by: Stephen Rothwell <sfr@canb.auug.org.au> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Alexei Starovoitov <ast@kernel.org>
| * bpf: Add access to snd_cwnd and others in sock_opsLawrence Brakmo2017-12-054-2/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adds read access to snd_cwnd and srtt_us fields of tcp_sock. Since these fields are only valid if the socket associated with the sock_ops program call is a full socket, the field is_fullsock is also added to the bpf_sock_ops struct. If the socket is not a full socket, reading these fields returns 0. Note that in most cases it will not be necessary to check is_fullsock to know if there is a full socket. The context of the call, as specified by the 'op' field, can sometimes determine whether there is a full socket. The struct bpf_sock_ops has the following fields added: __u32 is_fullsock; /* Some TCP fields are only valid if * there is a full socket. If not, the * fields read as zero. */ __u32 snd_cwnd; __u32 srtt_us; /* Averaged RTT << 3 in usecs */ There is a new macro, SOCK_OPS_GET_TCP32(NAME), to make it easier to add read access to more 32 bit tcp_sock fields. Signed-off-by: Lawrence Brakmo <brakmo@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * bpf: move bpf csum flag checkWilliam Tu2017-12-041-3/+2
| | | | | | | | | | | | | | | | | | | | trivial move the BPF_F_ZERO_CSUM_TX check right below the 'flags & BPF_F_DONT_FRAGMENT', so common tun_flags handling is logically together. Signed-off-by: William Tu <u9012063@gmail.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
* | tuntap: fix possible deadlock when fail to register netdevJason Wang2017-12-081-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Private destructor could be called when register_netdev() fail with rtnl lock held. This will lead deadlock in tun_free_netdev() who tries to hold rtnl_lock. Fixing this by switching to use spinlock to synchronize. Fixes: 96f84061620c ("tun: add eBPF based queue selection method") Reported-by: Eric Dumazet <eric.dumazet@gmail.com> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Willem de Bruijn <willemb@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'smc-fixes-next'David S. Miller2017-12-078-55/+118
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ursula Braun says: ==================== smc: fixes 2017-12-07 here are some smc-patches. The initial 4 patches are cleanups. Patch 5 gets rid of ib_post_sends in tasklet context to avoid peer drops due to out-of-order receivals. Patch 6 makes sure, the Linux SMC code understands variable sized CLC proposal messages built according to RFC7609. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | smc: support variable CLC proposal messagesUrsula Braun2017-12-073-24/+107
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | According to RFC7609 [1] the CLC proposal message contains an area of unknown length for future growth. Additionally it may contain up to 8 IPv6 prefixes. The current version of the SMC-code does not understand CLC proposal messages using these variable length fields and, thus, is incompatible with SMC implementations in other operating systems. This patch makes sure, SMC understands incoming CLC proposals * with arbitrary length values for future growth * with up to 8 IPv6 prefixes [1] SMC-R Informational RFC: http://www.rfc-editor.org/info/rfc7609 Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Reviewed-by: Hans Wippel <hwippel@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | smc: no consumer update in tasklet contextUrsula Braun2017-12-072-16/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The SMC protocol requires to send a separate consumer cursor update, if it cannot be piggybacked to updates of the producer cursor. When receiving a blocked signal from the sender, this update is sent already in tasklet context. In addition consumer cursor updates are sent after data receival. Sending of cursor updates is controlled by sequence numbers. Assuming receiving stray messages the receiver drops updates with older sequence numbers than an already received cursor update with a higher sequence number. Sending consumer cursor updates in tasklet context may result in wrong order sends and its corresponding drops at the receiver. Since it is sufficient to send consumer cursor updates once the data is received, this patch gets rid of the consumer cursor update in tasklet context to guarantee in-sequence arrival of cursor updates. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | smc: cleanup close checking during data receivalUrsula Braun2017-12-071-4/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When waiting for data to be received it must be checked if the peer signals shutdown. The SMC code uses two different checks for this purpose, even though just one check is sufficient. This patch removes the superfluous test for SOCK_DONE. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | smc: no update for unused sk_write_pendingUrsula Braun2017-12-071-2/+0
| | | | | | | | | | | | | | | | | | | | | | | | The smc code never checks the sk_write_pending sock field. Thus there is no need to update it. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | smc: improve smc_clc_send_decline() error handlingUrsula Braun2017-12-072-7/+4
| | | | | | | | | | | | | | | | | | | | | | | | Let smc_clc_send_decline() return with an error, if the amount sent is smaller than the length of an smc decline message. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | smc: make smc_close_active_abort() staticUrsula Braun2017-12-072-2/+1
|/ / | | | | | | | | | | | | | | smc_close_active_abort() is used in smc_close.c only. Make it static. Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: dsa: Allow compiling out legacy supportFlorian Fainelli2017-12-078-22/+56
| | | | | | | | | | | | | | | | | | Introduce a configuration option: CONFIG_NET_DSA_LEGACY allowing to compile out support for the old platform device and Device Tree binding registration. Support for these configurations is scheduled to be removed in 4.17. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | bnxt_en: Don't print "Link speed -1 no longer supported" messages.Michael Chan2017-12-071-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On some dual port NICs, the 2 ports have to be configured with compatible link speeds. Under some conditions, a port's configured speed may no longer be supported. The firmware will send a message to the driver when this happens. Improve this logic that prints out the warning by only printing it if we can determine the link speed that is no longer supported. If the speed is unknown or it is in autoneg mode, skip the warning message. Reported-by: Thomas Bogendoerfer <tbogendoerfer@suse.de> Signed-off-by: Michael Chan <michael.chan@broadcom.com> Tested-by: Thomas Bogendoerfer <tbogendoerfer@suse.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipvlan: Eliminate duplicated codes with existing functionGao Feng2017-12-061-11/+2
| | | | | | | | | | | | | | | | | | The recv flow of ipvlan l2 mode performs as same as l3 mode for non-multicast packet, so use the existing func ipvlan_handle_mode_l3 instead of these duplicated statements in non-multicast case. Signed-off-by: Gao Feng <gfree.wind@vip.163.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mlxsw: spectrum: handle NETIF_F_HW_TC changes correctlyJiri Pirko2017-12-063-0/+52
| | | | | | | | | | | | | | | | | | | | | | Currently, whenever the NETIF_F_HW_TC feature changes, we silently always allow it, but we actually do not disable the flows in HW on disable. That breaks user's expectations. So just forbid the feature disable in case there are any filters offloaded. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tun: avoid unnecessary READ_ONCE in tun_net_xmitWillem de Bruijn2017-12-061-3/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The statement no longer serves a purpose. Commit fa35864e0bb7 ("tuntap: Fix for a race in accessing numqueues") added the ACCESS_ONCE to avoid a race condition with skb_queue_len. Commit 436accebb530 ("tuntap: remove unnecessary sk_receive_queue length check during xmit") removed the affected skb_queue_len check. Commit 96f84061620c ("tun: add eBPF based queue selection method") split the function, reading the field a second time in the callee. The temp variable is now only read once, so just remove it. Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rds: debug: fix null check on static arrayPrashant Bhole2017-12-061-2/+2
| | | | | | | | | | | | | | | | | | | | | | t_name cannot be NULL since it is an array field of a struct. Replacing null check on static array with string length check using strnlen() Signed-off-by: Prashant Bhole <bhole_prashant_q7@lab.ntt.co.jp> Acked-by: Sowmini Varadhan <sowmini.varadhan@oracle.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | act_mirred: get rid of mirred_list_lock spinlockCong Wang2017-12-061-9/+1
| | | | | | | | | | | | | | | | | | | | | | | | TC actions are no longer freed in RCU callbacks and we should always have RTNL lock, so this spinlock is no longer needed. Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: Jiri Pirko <jiri@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | act_mirred: get rid of tcfm_ifindex from struct tcf_mirredCong Wang2017-12-0611-41/+26
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcfm_dev always points to the correct netdev and we already hold a refcnt, so no need to use tcfm_ifindex to lookup again. If we would support moving target netdev across netns, using pointer would be better than ifindex. This also fixes dumping obsolete ifindex, now after the target device is gone we just dump 0 as ifindex. Cc: Jiri Pirko <jiri@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Acked-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'ipv6-add-ip6erspan-collect_md-mode'David S. Miller2017-12-063-25/+180
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | William Tu says: ==================== ipv6: add ip6erspan collect_md mode Similar to erspan collect_md mode in ipv4, the first patch adds support for ip6erspan collect metadata mode. The second patch adds the test case using bpf_skb_[gs]et_tunnel_key helpers. The corresponding iproute2 patch: https://marc.info/?l=linux-netdev&m=151251545410047&w=2 ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | samples/bpf: add ip6erspan sample codeWilliam Tu2017-12-062-0/+95
| | | | | | | | | | | | | | | | | | | | | Extend the existing tests for ip6erspan. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | ip6_gre: add ip6 erspan collect_md modeWilliam Tu2017-12-061-25/+85
|/ / | | | | | | | | | | | | | | | | | | Similar to ip6 gretap and ip4 gretap, the patch allows erspan tunnel to operate in collect metadata mode. bpf_skb_[gs]et_tunnel_key() helpers can make use of it right away. Signed-off-by: William Tu <u9012063@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>