summaryrefslogtreecommitdiffstats
path: root/kernel/sysctl.c (unfollow)
Commit message (Collapse)AuthorFilesLines
2018-05-17tcp: don't mark recently sent packets lost on RTOYuchung Cheng1-4/+8
An RTO event indicates the head has not been acked for a long time after its last (re)transmission. But the other packets are not necessarily lost if they have been only sent recently (for example due to application limit). This patch would prohibit marking packets sent within an RTT to be lost on RTO event, using similar logic in TCP RACK detection. Normally the head (SND.UNA) would be marked lost since RTO should fire strictly after the head was sent. An exception is when the most recent RACK RTT measurement is larger than the (previous) RTO. To address this exception the head is always marked lost. Congestion control interaction: since we may not mark every packet lost, the congestion window may be more than 1 (inflight plus 1). But only one packet will be retransmitted after RTO, since tcp_retransmit_timer() calls tcp_retransmit_skb(...,segs=1). The connection still performs slow start from one packet (with Cubic congestion control). This commit was tested in an A/B test with Google web servers, and showed a reduction of 2% in (spurious) retransmits post timeout (SlowStartRetrans), and correspondingly reduced DSACKs (DSACKIgnoredOld) by 7%. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: new helper tcp_rack_skb_timeoutYuchung Cheng3-7/+14
Create and export a new helper tcp_rack_skb_timeout and move tcp_is_rack to prepare the final RTO change. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: separate loss marking and state update on RTOYuchung Cheng1-2/+2
Previously when TCP times out, it first updates cwnd and ssthresh, marks packets lost, and then updates congestion state again. This was fine because everything not yet delivered is marked lost, so the inflight is always 0 and cwnd can be safely set to 1 to retransmit one packet on timeout. But the inflight may not always be 0 on timeout if TCP changes to mark packets lost based on packet sent time. Therefore we must first mark the packet lost, then set the cwnd based on the (updated) inflight. This is not a pure refactor. Congestion control may potentially break if it uses (not yet updated) inflight to compute ssthresh. Fortunately all existing congestion control modules does not do that. Also it changes the inflight when CA_LOSS_EVENT is called, and only westwood processes such an event but does not use inflight. This change has two other minor side benefits: 1) consistent with Fast Recovery s.t. the inflight is updated first before tcp_enter_recovery flips state to CA_Recovery. 2) avoid intertwining loss marking with state update, making the code more readable. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: new helper tcp_timeout_mark_lostYuchung Cheng1-21/+29
Refactor using a new helper, tcp_timeout_mark_loss(), that marks packets lost upon RTO. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: account lost retransmit after timeoutYuchung Cheng3-17/+6
The previous approach for the lost and retransmit bits was to wipe the slate clean: zero all the lost and retransmit bits, correspondingly zero the lost_out and retrans_out counters, and then add back the lost bits (and correspondingly increment lost_out). The new approach is to treat this very much like marking packets lost in fast recovery. We don’t wipe the slate clean. We just say that for all packets that were not yet marked sacked or lost, we now mark them as lost in exactly the same way we do for fast recovery. This fixes the lost retransmit accounting at RTO time and greatly simplifies the RTO code by sharing much of the logic with Fast Recovery. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: simpler NewReno implementationYuchung Cheng3-8/+39
This is a rewrite of NewReno loss recovery implementation that is simpler and standalone for readability and better performance by using less states. Note that NewReno refers to RFC6582 as a modification to the fast recovery algorithm. It is used only if the connection does not support SACK in Linux. It should not to be confused with the Reno (AIMD) congestion control. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: disable RFC6675 loss detectionYuchung Cheng2-5/+10
This patch disables RFC6675 loss detection and make sysctl net.ipv4.tcp_recovery = 1 controls a binary choice between RACK (1) or RFC6675 (0). Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17tcp: support DUPACK threshold in RACKYuchung Cheng3-13/+29
This patch adds support for the classic DUPACK threshold rule (#DupThresh) in RACK. When the number of packets SACKed is greater or equal to the threshold, RACK sets the reordering window to zero which would immediately mark all the unsacked packets below the highest SACKed sequence lost. Since this approach is known to not work well with reordering, RACK only uses it if no reordering has been observed. The DUPACK threshold rule is a particularly useful extension to the fast recoveries triggered by RACK reordering timer. For example data-center transfers where the RTT is much smaller than a timer tick, or high RTT path where the default RTT/4 may take too long. Note that this patch differs slightly from RFC6675. RFC6675 considers a packet lost when at least #DupThresh higher-sequence packets are SACKed. With RACK, for connections that have seen reordering, RACK continues to use a dynamically-adaptive time-based reordering window to detect losses. But for connections on which we have not yet seen reordering, this patch considers a packet lost when at least one higher sequence packet is SACKed and the total number of SACKed packets is at least DupThresh. For example, suppose a connection has not seen reordering, and sends 10 packets, and packets 3, 5, 7 are SACKed. RFC6675 considers packets 1 and 2 lost. RACK considers packets 1, 2, 4, 6 lost. There is some small risk of spurious retransmits here due to reordering. However, this is mostly limited to the first flight of a connection on which the sender receives SACKs from reordering. And RFC 6675 and FACK loss detection have a similar risk on the first flight with reordering (it's just that the risk of spurious retransmits from reordering was slightly narrower for those older algorithms due to the margin of 3*MSS). Also the minimum reordering window is reduced from 1 msec to 0 to recover quicker on short RTT transfers. Therefore RACK is more aggressive in marking packets lost during recovery to reduce the reordering window timeouts. Signed-off-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Reviewed-by: Soheil Hassas Yeganeh <soheil@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17net: ethernet: ti: cpsw: disable mq feature for "AM33xx ES1.0" devicesIvan Khoronzhuk1-49/+60
The early versions of am33xx devices, related to ES1.0 SoC revision have errata limiting mq support. That's the same errata as commit 7da1160002f1 ("drivers: net: cpsw: add am335x errata workarround for interrutps") AM33xx Errata [1] Advisory 1.0.9 http://www.ti.com/lit/er/sprz360f/sprz360f.pdf After additional investigation were found that drivers w/a is propagated on all AM33xx SoCs and on DM814x. But the errata exists only for ES1.0 of AM33xx family, limiting mq support for revisions after ES1.0. So, disable mq support only for related SoCs and use separate polls for revisions allowing mq. Signed-off-by: Ivan Khoronzhuk <ivan.khoronzhuk@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17pfifo_fast: drop unneeded additional lock on dequeuePaolo Abeni2-2/+7
After the previous patch, for NOLOCK qdiscs, q->seqlock is always held when the dequeue() is invoked, we can drop any additional locking to protect such operation. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17sched: replace __QDISC_STATE_RUNNING bit with a spin lockPaolo Abeni2-5/+16
So that we can use lockdep on it. The newly introduced sequence lock has the same scope of busylock, so it shares the same lockdep annotation, but it's only used for NOLOCK qdiscs. With this changeset we acquire such lock in the control path around flushing operation (qdisc reset), to allow more NOLOCK qdisc perf improvement in the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-17bpf: sockmap, on update propagate errors back to userspaceJohn Fastabend1-1/+1
When an error happens in the update sockmap element logic also pass the err up to the user. Fixes: e5cd3abcb31a ("bpf: sockmap, refactor sockmap routines to work with hashmap") Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-17bpf: fix sock hashmap kmalloc warningYonghong Song1-0/+6
syzbot reported a kernel warning below: WARNING: CPU: 0 PID: 4499 at mm/slab_common.c:996 kmalloc_slab+0x56/0x70 mm/slab_common.c:996 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 4499 Comm: syz-executor050 Not tainted 4.17.0-rc3+ #9 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x1b9/0x294 lib/dump_stack.c:113 panic+0x22f/0x4de kernel/panic.c:184 __warn.cold.8+0x163/0x1b3 kernel/panic.c:536 report_bug+0x252/0x2d0 lib/bug.c:186 fixup_bug arch/x86/kernel/traps.c:178 [inline] do_error_trap+0x1de/0x490 arch/x86/kernel/traps.c:296 do_invalid_op+0x1b/0x20 arch/x86/kernel/traps.c:315 invalid_op+0x14/0x20 arch/x86/entry/entry_64.S:992 RIP: 0010:kmalloc_slab+0x56/0x70 mm/slab_common.c:996 RSP: 0018:ffff8801d907fc58 EFLAGS: 00010246 RAX: 0000000000000000 RBX: ffff8801aeecb280 RCX: ffffffff8185ebd7 RDX: 0000000000000000 RSI: 0000000000000000 RDI: 00000000ffffffe1 RBP: ffff8801d907fc58 R08: ffff8801adb5e1c0 R09: ffffed0035a84700 R10: ffffed0035a84700 R11: ffff8801ad423803 R12: ffff8801aeecb280 R13: 00000000fffffff4 R14: ffff8801ad891a00 R15: 00000000014200c0 __do_kmalloc mm/slab.c:3713 [inline] __kmalloc+0x25/0x760 mm/slab.c:3727 kmalloc include/linux/slab.h:517 [inline] map_get_next_key+0x24a/0x640 kernel/bpf/syscall.c:858 __do_sys_bpf kernel/bpf/syscall.c:2131 [inline] __se_sys_bpf kernel/bpf/syscall.c:2096 [inline] __x64_sys_bpf+0x354/0x4f0 kernel/bpf/syscall.c:2096 do_syscall_64+0x1b1/0x800 arch/x86/entry/common.c:287 entry_SYSCALL_64_after_hwframe+0x49/0xbe The test case is against sock hashmap with a key size 0xffffffe1. Such a large key size will cause the below code in function sock_hash_alloc() overflowing and produces a smaller elem_size, hence map creation will be successful. htab->elem_size = sizeof(struct htab_elem) + round_up(htab->map.key_size, 8); Later, when map_get_next_key is called and kernel tries to allocate the key unsuccessfully, it will issue the above warning. Similar to hashtab, ensure the key size is at most MAX_BPF_STACK for a successful map creation. Fixes: 81110384441a ("bpf: sockmap, add hash map support") Reported-by: syzbot+e4566d29080e7f3460ff@syzkaller.appspotmail.com Signed-off-by: Yonghong Song <yhs@fb.com> Acked-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-17libbpf: add ifindex to enable offload supportDavid Beckett4-3/+20
BPF programs currently can only be offloaded using iproute2. This patch will allow programs to be offloaded using libbpf calls. Signed-off-by: David Beckett <david.beckett@netronome.com> Reviewed-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-17bpf: add __printf verification to bpf_verifier_vlogMathieu Malaterre1-2/+2
__printf is useful to verify format and arguments. ‘bpf_verifier_vlog’ function is used twice in verifier.c in both cases the caller function already uses the __printf gcc attribute. Remove the following warning, triggered with W=1: kernel/bpf/verifier.c:176:2: warning: function might be possible candidate for ‘gnu_printf’ format attribute [-Wsuggest-attribute=format] Signed-off-by: Mathieu Malaterre <malat@debian.org> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-16samples/bpf: Decrement ttl in fib forwarding exampleDavid Ahern1-8/+31
Only consider forwarding packets if ttl in received packet is > 1 and decrement ttl before handing off to bpf_redirect_map. Signed-off-by: David Ahern <dsahern@gmail.com> Acked-by: Yonghong Song <yhs@fb.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-16bpf: bpftool, support for sockhashJohn Fastabend1-0/+1
This adds the SOCKHASH map type to bpftools so that we get correct pretty printing. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-16bpf: selftest additions for SOCKHASHJohn Fastabend7-349/+453
This runs existing SOCKMAP tests with SOCKHASH map type. To do this we push programs into include file and build two BPF programs. One for SOCKHASH and one for SOCKMAP. We then run the entire test suite with each type. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Acked-by: David S. Miller <davem@davemloft.net> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
2018-05-16cxgb4: update LE-TCAM collection for T6Rahul Lakkireddy4-8/+37
For T6, clip table is separated from main TCAM. So, update LE-TCAM collection logic to collect clip table TCAM as well. IPv6 takes 4 entries in clip table TCAM compared to 2 entries in main TCAM. Also, in case of errors, keep LE-TCAM collected so far and set the status to partial dump. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16qed: Fix LL2 race during connection terminateMichal Kalderon1-10/+14
Stress on qedi/qedr load unload lead to list_del corruption. This is due to ll2 connection terminate freeing resources without verifying that no more ll2 processing will occur. This patch unregisters the ll2 status block before terminating the connection to assure this race does not occur. Fixes: 1d6cff4fca4366 ("qed: Add iSCSI out of order packet handling") Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16qed: Fix possibility of list corruption during rmmod flowsMichal Kalderon1-1/+10
The ll2 flows of flushing the txq/rxq need to be synchronized with the regular fp processing. Caused list corruption during load/unload stress tests. Fixes: 0a7fb11c23c0f ("qed: Add Light L2 support") Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16qed: LL2 flush isles when connection is closedMichal Kalderon1-0/+26
Driver should free all pending isles once it gets a FLUSH cqe from FW. Part of iSCSI out of order flow. Fixes: 1d6cff4fca4366 ("qed: Add iSCSI out of order packet handling") Signed-off-by: Ariel Elior <Ariel.Elior@cavium.com> Signed-off-by: Michal Kalderon <Michal.Kalderon@cavium.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net: ethoc: Remove useless test before clk_disable_unprepareYueHaibing1-4/+2
clk_disable_unprepare() already checks that the clock pointer is valid. No need to test it before calling it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Reviewed-by: Tobias Klauser <tklauser@distanz.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net: stmmac: Remove useless test before clk_disable_unprepareYueHaibing1-17/+7
clk_disable_unprepare() already checks that the clock pointer is valid. No need to test it before calling it. Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net: qcom/emac: Encapsulate sgmii ops under one structureHemanth Puranik4-71/+103
This patch introduces ops structure for sgmii, This by ensures that we do not need dummy functions in case of emulation platforms. Signed-off-by: Hemanth Puranik <hpuranik@codeaurora.org> Acked-by: Timur Tabi <timur@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net: qualcomm: rmnet: Remove redundant command checkSubash Abhinov Kasiviswanathan1-11/+3
The command packet size is already checked once in rmnet_map_deaggregate() for the header, packet and trailer size, so this additional check is not needed. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net: qualcomm: rmnet: Add support for ethtool private statsSubash Abhinov Kasiviswanathan3-16/+112
Add ethtool private stats handler to debug the handling of packets with checksum offload header / trailer. This allows to keep track of the number of packets for which hardware computes the checksum and counts and reasons where checksum computation was skipped in hardware and was done in the network stack. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net: qualcomm: rmnet: Capture all drops in transmit pathSubash Abhinov Kasiviswanathan1-11/+10
Packets in transmit path could potentially be dropped if there were errors while adding the MAP header or the checksum header. Increment the tx_drops stats in these cases. Additionally, refactor the code to free the packet and increment the tx_drops stat under a single label. Signed-off-by: Subash Abhinov Kasiviswanathan <subashab@codeaurora.org> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16drivers: net: Remove device_node checks with of_mdiobus_register()Florian Fainelli11-61/+20
A number of drivers have the following pattern: if (np) of_mdiobus_register() else mdiobus_register() which the implementation of of_mdiobus_register() now takes care of. Remove that pattern in drivers that strictly adhere to it. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Reviewed-by: Grygorii Strashko <grygorii.strashko@ti.com> Reviewed-by: Fugang Duan <fugang.duan@nxp.com> Reviewed-by: Antoine Tenart <antoine.tenart@bootlin.com> Reviewed-by: Jose Abreu <joabreu@synopsys.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16of: mdio: Fall back to mdiobus_register() with NULL device_nodeFlorian Fainelli1-0/+3
When the device_node specified is NULL, fall back to mdiobus_register(). We have a number of drivers having a similar pattern which is: if (np) of_mdiobus_register() else mdiobus_register() so incorporate that behavior within the core of_mdiobus_register() function. This is also consistent with the stub version that we defined when CONFIG_OF=n. Signed-off-by: Florian Fainelli <f.fainelli@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net: ethernet: ti: cpsw-phy-sel: check bus_find_device() ret valueGrygorii Strashko1-1/+7
This fixes klockworks warnings: Pointer 'dev' returned from call to function 'bus_find_device' at line 179 may be NULL and will be dereferenced at line 181. cpsw-phy-sel.c:179: 'dev' is assigned the return value from function 'bus_find_device'. bus.c:342: 'bus_find_device' explicitly returns a NULL value. cpsw-phy-sel.c:181: 'dev' is dereferenced by passing argument 1 to function 'dev_get_drvdata'. device.h:1024: 'dev' is passed to function 'dev_get_drvdata'. device.h:1026: 'dev' is explicitly dereferenced. Signed-off-by: Grygorii Strashko <grygorii.strashko@ti.com> [nsekhar@ti.com: add an error message, fix return path] Signed-off-by: Sekhar Nori <nsekhar@ti.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16Revert "bonding: allow carrier and link status to determine link state"Debabrata Banerjee3-14/+9
This reverts commit 1386c36b30388f46a95100924bfcae75160db715. We don't want to encourage drivers to not report carrier status correctly, therefore remove this commit. Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16tc-testing: updated mirred and vlan with more testsRoman Mashak2-20/+324
Added extra test cases for different control actions (reclassify, pipe etc.), cookies, max values & exceeding maximum, and replace existing actions unit tests. Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16tc-testing: fixed copy-pasting error in police testsRoman Mashak1-2/+2
Signed-off-by: Roman Mashak <mrv@mojatatu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16sched: manipulate __QDISC_STATE_RUNNING in qdisc_run_* helpersPaolo Abeni3-24/+19
Currently NOLOCK qdiscs pay a measurable overhead to atomically manipulate the __QDISC_STATE_RUNNING. Such bit is flipped twice per packet in the uncontended scenario with packet rate below the line rate: on packed dequeue and on the next, failing dequeue attempt. This changeset moves the bit manipulation into the qdisc_run_{begin,end} helpers, so that the bit is now flipped only once per packet, with measurable performance improvement in the uncontended scenario. This also allows simplifying the qdisc teardown code path - since qdisc_is_running() is now effective for each qdisc type - and avoid a possible race between qdisc_run() and dev_deactivate_many(), as now the some_qdisc_is_busy() can properly detect NOLOCK qdiscs being busy dequeuing packets. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16bonding: allow carrier and link status to determine link stateDebabrata Banerjee3-9/+14
In a mixed environment it may be difficult to tell if your hardware support carrier, if it does not it can always report true. With a new use_carrier option of 2, we can check both carrier and link status sequentially, instead of one or the other Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16bonding: allow use of tx hashing in balance-albDebabrata Banerjee4-15/+43
The rx load balancing provided by balance-alb is not mutually exclusive with using hashing for tx selection, and should provide a decent speed increase because this eliminates spinlocks and cache contention. Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16bonding: use common mac addr checksDebabrata Banerjee1-19/+9
Replace homegrown mac addr checks with faster defs from etherdevice.h Note that this will also prevent any rlb arp updates for multicast addresses, however this should have been forbidden anyway. Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16bonding: don't queue up extraneous rlb updatesDebabrata Banerjee1-6/+12
arps for incomplete entries can't be sent anyway. Signed-off-by: Debabrata Banerjee <dbanerje@akamai.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net/smc: check for pending terminationKarsten Graul3-3/+7
Avoid to run the processing in smc_lgr_terminate() more than once, remember when the link group termination is triggered. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net/smc: drop messages when link state is inactiveKarsten Graul1-0/+2
Drop incoming messages when the link is flagged as inactive. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net/smc: set link inactive before calling smc_lgr_free()Karsten Graul2-1/+5
Before smc_lgr_free() is called the link must be set inactive by calling smc_llc_link_inactive(). Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net/smc: handle all error codes from smc_conn_create()Karsten Graul1-0/+2
Always set a reason_code when smc_conn_create() returns an error code. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net/smc: use a workqueue to defer llc sendKarsten Graul4-43/+104
SMC handles deferred work in tasklets. As tasklets cannot sleep this can result in rare EBUSY conditions, so defer this work in a work queue. The high level api functions do not defer work because they can sleep until the llc send is actually completed. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net/smc: move link llc initialization to llc layerKarsten Graul3-6/+12
Move the llc layer specific initialization and cleanup out of smc_core.c into smc_llc.c (smc_llc_link_init and smc_llc_link_clear). Move all initialization of a link into the new init function. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net/smc: simplify test_link function usageKarsten Graul2-9/+5
Make smc_llc_send_test_link() static and remove it from the header file. And to send a test_link response set the response flag and send the message back as-is, without using smc_llc_send_test_link(). And because smc_llc_send_test_link() must no longer send responses, remove the response flag handling from the function. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net/smc: remove unnecessary castKarsten Graul1-3/+3
Remove an unneeded (void *) cast from the calls to smc_llc_send_message(). No functional changes. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net/smc: register new rmbs with the peerKarsten Graul5-8/+64
Register new rmb buffers with the remote peer by exchanging a confirm_rkey llc message. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net/smc: no tx work trigger for fallback socketsUrsula Braun1-2/+2
If TCP_NODELAY is set or TCP_CORK is reset, setsockopt triggers the tx worker. This does not make sense, if the SMC socket switched to the TCP fallback when the connection is created. This patch adds the additional check for the fallback case. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
2018-05-16net: hns3: Fixes the missing PCI iounmap for various legsFuyun Liang1-0/+2
We call pcim_iomap in hclge_pci_init, pcim_iounmap should be called in error handle of hclge_init_ae_dev. We call pcim_iomap in hclge_pci_init, but do not call pcim_iounmap in hclge_pci_uninit. When we remove the hclge.ko and insert it again, a problem that pci can not map will happen. pcim_iounmap need to be called in hclge_pci_uninit. Fixes: 46a3df9f9718 ("net: hns3: Add HNS3 Acceleration Engine & Compatibility Layer Support") Signed-off-by: Fuyun Liang <liangfuyun1@huawei.com> Signed-off-by: Peng Li <lipeng321@huawei.com> Signed-off-by: Salil Mehta <salil.mehta@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>