Commit message | Author | Age | Files | Lines
* s390/qeth: prepare for copy-free TSO transmission | Julian Wiedmann | 2018-09-17 | 4 | -28/+77

  Add all the necessary TSO plumbing to the copy-less transmit path. This includes calculating the right length of required protocol headers, and always building a separate buffer element for the TSO headers. A follow-up patch will then switch TSO traffic over to this path.

  Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* s390/qeth: check size of required HW header cache object | Julian Wiedmann | 2018-09-17 | 1 | -2/+9

  When qeth_add_hw_header() falls back to the header cache, ensure that the requested length doesn't exceed the object size. For current usage this is a no-brainer, but TSO transmission will introduce protocol headers of varying length.

  Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* s390/qeth: fix up protocol headers early | Julian Wiedmann | 2018-09-17 | 1 | -3/+11

  When qeth_add_hw_header() falls back to the HW header cache, it also copies over the necessary protocol headers. Thus any manipulation of the protocol headers needs to happen before adding the HW header. For current usage this doesn't matter, but it becomes relevant when moving TSO transmission over to the faster code path.

  Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* s390/qeth: limit csum offload erratum to L3 devices | Julian Wiedmann | 2018-09-17 | 2 | -5/+5

  Combined L3+L4 csum offload is only required for some L3 HW. So for L2 devices, don't offload the IP header csum calculation.

  Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
  Reference-ID: JUP 394553
  Signed-off-by: David S. Miller <davem@davemloft.net>

* s390/qeth: remove qeth_get_elements_no() | Julian Wiedmann | 2018-09-17 | 4 | -33/+14

  Convert the last remaining user of qeth_get_elements_no() to qeth_count_elements(), so this helper can be removed.

  Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* s390/qeth: remove unused L3 xmit code | Julian Wiedmann | 2018-09-17 | 1 | -54/+17

  qeth_l3_xmit() is now only used for TSOv4 traffic, so shrink it down accordingly.

  Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* s390/qeth: run non-offload L3 traffic over common xmit path | Julian Wiedmann | 2018-09-17 | 1 | -26/+44

  L3 OSAs can only offload IPv4 traffic, so use the common L2 transmit path for all other traffic. In particular there's no support for TX VLAN offload, so any such packet needs to be manually de-accelerated via ndo_features_check().

  Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* s390/qeth: move L2 xmit code to core module | Julian Wiedmann | 2018-09-17 | 3 | -63/+75

  We need the exact same transmit path for non-offload-eligible traffic on L3 OSAs. So make it accessible from both sub-drivers.

  Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* liquidio: Add the features to show FEC settings and set FEC settings | Weilin Chang | 2018-09-17 | 6 | -3/+243

  1. Add functions for get_fecparam and set_fecparam.
  2. Modify lio_get_link_ksettings to display the FEC setting.

  Signed-off-by: Weilin Chang <weilin.chang@cavium.com>
  Acked-by: Derek Chickles <derek.chickles@cavium.com>
  Signed-off-by: Felix Manlunas <felix.manlunas@cavium.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* net: ethernet: remove redundant null pointer check before of_node_put | zhong jiang | 2018-09-17 | 1 | -2/+1

  of_node_put() already handles a NULL pointer internally, so it is safe to remove the duplicated check before calling it.

  Signed-off-by: zhong jiang <zhongjiang@huawei.com>
  Reviewed-by: Vladimir Zapolskiy <vz@mleia.com>
  Acked-by: Fugang Duan <fugang.duan@nxp.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

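  The cleanup pattern, as a minimal sketch (the "phy-handle" lookup is only an illustrative source of the device_node pointer; the same shape applies to the put_device and of_node_put commits below):

      struct device_node *np = of_parse_phandle(dev->of_node, "phy-handle", 0);

      /* before: redundant NULL test */
      if (np)
          of_node_put(np);

      /* after: of_node_put() is a no-op for NULL, call it unconditionally */
      of_node_put(np);
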
* net: dsa: remove redundant null pointer check before put_device | zhong jiang | 2018-09-17 | 1 | -2/+1

  put_device() already handles a NULL pointer internally, so it is safe to remove the duplicated check before calling it (the same shape as the of_node_put cleanup above).

  Signed-off-by: zhong jiang <zhongjiang@huawei.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* net: dsa: remove redundant null pointer check before of_node_put | zhong jiang | 2018-09-17 | 1 | -2/+1

  of_node_put() already handles a NULL pointer internally, so it is safe to remove the duplicated check before calling it.

  Signed-off-by: zhong jiang <zhongjiang@huawei.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* net: usb: remove redundant null pointer check before of_node_put | zhong jiang | 2018-09-17 | 1 | -2/+1

  of_node_put() already handles a NULL pointer internally, so it is safe to remove the duplicated check before calling it.

  Signed-off-by: zhong jiang <zhongjiang@huawei.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* net: rds: use memset to optimize the recv | Zhu Yanjun | 2018-09-17 | 1 | -4/+1

  The function rds_inc_init() is on the receive path; using memset() to zero its fields speeds it up. The test result:

  Before: 1) + 24.950 us | rds_inc_init [rds]();
  After:  1) + 10.990 us | rds_inc_init [rds]();

  Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com>
  Signed-off-by: Zhu Yanjun <yanjun.zhu@oracle.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

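  A sketch of the kind of change this makes (the element-by-element zeroing loop is an assumption; the actual field names live in the rds code):

      int i;

      /* before: zero each trace slot in a loop */
      for (i = 0; i < RDS_RX_MAX_TRACES; i++)
          inc->i_rx_lat_trace[i] = 0;

      /* after: one memset over the whole array, which the compiler
       * can lower to a handful of wide stores */
      memset(inc->i_rx_lat_trace, 0, sizeof(inc->i_rx_lat_trace));
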
* selftests/tls: Add MSG_WAITALL in recv() syscall | Vakul Garg | 2018-09-17 | 1 | -9/+9

  A number of tls selftests rely upon recv() returning an exact number of data bytes. When tls record crypto is done using an async accelerator, it is possible that recv() returns fewer bytes than expected, which leads to failure of many test cases. To fix it, MSG_WAITALL is now passed in the flags of the recv() syscall.

  Signed-off-by: Vakul Garg <vakul.garg@nxp.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

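  The difference, as a minimal userspace sketch (fd is assumed to be a connected TLS socket):

      char buf[512];
      ssize_t n;

      /* without MSG_WAITALL, a short read is a legal outcome */
      n = recv(fd, buf, sizeof(buf), 0);

      /* with MSG_WAITALL, recv() blocks until the full length arrives
       * (barring an error, signal or EOF), so exact-byte-count
       * assertions in the tests hold */
      n = recv(fd, buf, sizeof(buf), MSG_WAITALL);
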
* net: aquantia: memory corruption on jumbo frames | Friedemann Gerold | 2018-09-17 | 1 | -14/+18

  This patch fixes corruption of the skb shared area upon reception of 4K jumbo packets. Originally, build_skb was used so that the page could be reused for the skb, eliminating the need for extra fragments. But that logic does not take into account that skb_shared_info must be reserved at the end of the skb data area. In case the packet data consumes the whole page (4K), the skb_shinfo location overflows the page. As a consequence, __build_skb zeroes shinfo data beyond the allocated page, corrupting the next page. The issue is rarely seen in real life because jumbo frames are normally larger than 4K, which triggers another code path. But it is 100% reproducible with a simple scapy packet, like:

  sendp(IP(dst="192.168.100.3") / TCP(dport=443) / Raw(RandString(size=(4096-40))), iface="enp1s0")

  Fixes: 018423e90bee ("net: ethernet: aquantia: Add ring support code")
  Reported-by: Friedemann Gerold <f.gerold@b-c-s.de>
  Reported-by: Michael Rauch <michael@rauch.be>
  Signed-off-by: Friedemann Gerold <f.gerold@b-c-s.de>
  Tested-by: Nikita Danilov <nikita.danilov@aquantia.com>
  Signed-off-by: Igor Russkikh <igor.russkikh@aquantia.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

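  The invariant being violated, as a sketch (frag_len and page are illustrative; this shows the sizing rule, not the driver's exact patch):

      /* the buffer handed to build_skb() must reserve aligned room for
       * struct skb_shared_info after the payload; without the second
       * term, a frame that fills the 4K page pushes shinfo past it */
      unsigned int size = SKB_DATA_ALIGN(frag_len) +
                          SKB_DATA_ALIGN(sizeof(struct skb_shared_info));
      struct sk_buff *skb = build_skb(page_address(page), size);
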
* Merge branch 'lantiq-Minor-fixes-for-vrx200-and-gswip' | David S. Miller | 2018-09-17 | 5 | -39/+45

  Hauke Mehrtens says:

  ====================
  net: lantiq: Minor fixes for vrx200 and gswip

  These are mostly minor fixes for problems raised in the latest round of review of the original series adding these drivers, which were not applied before the patches got merged into net-next. In addition this fixes a data bus error on poweroff.
  ====================

  Signed-off-by: David S. Miller <davem@davemloft.net>

| * net: dsa: tag_gswip: Add gswip to dsa_tag_protocol_to_str() | Hauke Mehrtens | 2018-09-17 | 1 | -0/+3

    The gswip tag was missing in the dsa_tag_protocol_to_str() function, add it.

    Fixes: 7969119293f5 ("net: dsa: Add Lantiq / Intel GSWIP tag support")
    Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * net: dsa: lantiq_gswip: Minor code style improvements | Hauke Mehrtens | 2018-09-17 | 1 | -20/+18

    Use one code block when returning because the interface type is unsupported, and also check if some unsupported port gets configured. In addition, fix a doubled "the" and use dsa_is_cpu_port() instead of manually getting the CPU port.

    Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * net: lantiq: lantiq_xrx200: Move clock prepare to probe function | Hauke Mehrtens | 2018-09-17 | 1 | -9/+12

    The switch and the MAC are in one IP core and they use the same clock signal from the clock generation unit. The clock architecture in the lantiq SoC code currently does not allow the same clocks to be shared easily; this has to be fixed by switching to the common clock framework. As a workaround, the clock of the switch and MAC is activated when the MAC gets probed and only disabled when the MAC gets removed. This way it is ensured that the clock is always enabled while the switch or MAC is used; the switch can not be used without the MAC. This fixes a data bus error on reboot, where the switch and MAC were deactivated and some registers were later accessed in the cleanup paths while the clocks were disabled.

    Fixes: fe1a56420cf2 ("net: lantiq: Add Lantiq / Intel VRX200 Ethernet driver")
    Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
    Signed-off-by: David S. Miller <davem@davemloft.net>

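    The resulting lifetime, as a sketch (priv->clk is an illustrative field; the lantiq code may obtain its clock differently):

        /* probe: clock comes on before the first register access and
         * stays on for the life of the MAC, and hence the switch */
        priv->clk = devm_clk_get(&pdev->dev, NULL);
        if (IS_ERR(priv->clk))
            return PTR_ERR(priv->clk);
        err = clk_prepare_enable(priv->clk);
        if (err)
            return err;

        /* remove: clock goes off only after the last register access */
        clk_disable_unprepare(priv->clk);
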
| * dt-bindings: net: dsa: lantiq,xrx200-gswip: Fix minor style fixes | Hauke Mehrtens | 2018-09-17 | 1 | -8/+10

    * Use one compatible string per line in the documentation.
    * Remove SoC-revision-dependent compatible strings; we can detect the revision in the driver.
    * Use lower case letters in hex addresses.
    * Fix the size of the address ranges in the example; this now matches the sizes used by the SoC. The old ones will also work, they just add some empty address space.
    * Change the reg size of the gphy-fw node.

    Fixes: 86ce2bc73c7a ("dt-bindings: net: dsa: Add lantiq,xrx200-gswip DT bindings")
    Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
    Cc: devicetree@vger.kernel.org
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * dt-bindings: net: lantiq,xrx200-net: Use lower case in hex | Hauke Mehrtens | 2018-09-17 | 1 | -2/+2

    Use lower case letters in the addresses of the device tree binding. In addition, replace "eth" with "ethernet" and fix the size of the reg element in the example. The additional range does not contain any registers but is used by the IP block on this SoC.

    Fixes: 839790e88a3c ("dt-bindings: net: Add lantiq,xrx200-net DT bindings")
    Signed-off-by: Hauke Mehrtens <hauke@hauke-m.de>
    Cc: devicetree@vger.kernel.org
    Signed-off-by: David S. Miller <davem@davemloft.net>

* net: hns: make function hns_gmac_wait_fifo_clean() static | Wei Yongjun | 2018-09-17 | 1 | -1/+1

  Fixes the following sparse warning:

  drivers/net/ethernet/hisilicon/hns/hns_dsaf_gmac.c:322:5: warning: symbol 'hns_gmac_wait_fifo_clean' was not declared. Should it be static?

  Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* net: lantiq: Fix return value check in xrx200_probe() | Wei Yongjun | 2018-09-17 | 1 | -2/+2

  In case of error, the function devm_ioremap_resource() returns ERR_PTR() and never returns NULL. The NULL test in the return value check should be replaced with IS_ERR().

  Fixes: fe1a56420cf2 ("net: lantiq: Add Lantiq / Intel VRX200 Ethernet driver")
  Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
  Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
  Signed-off-by: David S. Miller <davem@davemloft.net>

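  The corrected pattern, as a sketch (base is an illustrative variable; the same fix applies to gswip_probe() below):

      void __iomem *base = devm_ioremap_resource(&pdev->dev, res);

      if (IS_ERR(base))            /* failure is ERR_PTR(), never NULL */
          return PTR_ERR(base);
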
* net: dsa: gswip: Fix copy-paste error in gswip_gphy_fw_probe() | Wei Yongjun | 2018-09-17 | 1 | -3/+3

  The return value from of_reset_control_array_get_exclusive() is not checked correctly: the test is done against the wrong variable. This patch fixes it.

  Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
  Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
  Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* net: dsa: gswip: Fix return value check in gswip_probe() | Wei Yongjun | 2018-09-17 | 1 | -6/+6

  In case of error, the function devm_ioremap_resource() returns ERR_PTR() and never returns NULL. The NULL test in the return value check should be replaced with IS_ERR(), as in the xrx200_probe() fix above.

  Fixes: 14fceff4771e ("net: dsa: Add Lantiq / Intel DSA driver for vrx200")
  Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com>
  Acked-by: Hauke Mehrtens <hauke@hauke-m.de>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* tls: async support causes out-of-bounds access in crypto APIs | John Fastabend | 2018-09-17 | 2 | -20/+23

  When async support was added, it needed to access the sk from the async callback to report errors up the stack. The patch tried to use the space after the aead request struct by directly setting the reqsize field in aead_request. This is an internal field that should not be used outside the crypto APIs. It is used by the crypto code to define extra space for private structures used in the crypto context. Users of the API then use crypto_aead_reqsize() and add the returned amount of bytes to the end of the request memory allocation before posting the request to the encrypt/decrypt APIs. So this breaks (with a general protection fault and a KASAN error, if enabled) because the request sent to decrypt is shorter than required, causing out-of-bounds accesses in the crypto API. It also seems unlikely the sk is even valid by the time it gets to the callback, because of a memset in the crypto layer.

  Anyway, fix this by holding the sk in the skb->sk field when the callback is set up; because the skb is already passed through to the callback handler via a void*, we can access it in the handler. In the handler we need to be careful to NULL the pointer again before kfree_skb. Comments were added both at the setup (in tls_do_decryption) and where we clear it in the crypto callback handler tls_decrypt_done(). After this, the selftests pass again and the KASAN errors/warnings are fixed.

  Fixes: 94524d8fc965 ("net/tls: Add support for async decryption of tls records")
  Signed-off-by: John Fastabend <john.fastabend@gmail.com>
  Reviewed-by: Vakul Garg <Vakul.garg@nxp.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

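  The allocation contract the commit describes, as a sketch (tfm is an already-allocated struct crypto_aead; in practice aead_request_alloc() wraps exactly this sizing):

      /* a request must carry the transform's private context after it;
       * repurposing that area for other state truncates the allocation
       * the crypto core expects */
      struct aead_request *req = kzalloc(sizeof(*req) +
                                         crypto_aead_reqsize(tfm),
                                         GFP_KERNEL);
      if (!req)
          return -ENOMEM;
      aead_request_set_tfm(req, tfm);
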
* ip6_gre: simplify gre header parsing in ip6gre_err | Haishuang Yan | 2018-09-17 | 1 | -22/+4

  Same as in ip_gre, use gre_parse_header() to parse the GRE header in the error handler code.

  Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* ip_gre: fix parsing gre header in ipgre_err | Haishuang Yan | 2018-09-17 | 2 | -9/+7

  gre_parse_header() stops parsing when a checksum error is encountered, which means tpi->key is left undefined and ip_tunnel_lookup() improperly returns NULL. This patch allows passing NULL as the csum_err parameter: in that case a checksum error does not stop the parser, and the GRE header is parsed fully, as expected.

  Fixes: 9f57c67c379d ("gre: Remove support for sharing GRE protocol hook.")
  Reported-by: Jiri Benc <jbenc@redhat.com>
  Signed-off-by: Haishuang Yan <yanhaishuang@cmss.chinamobile.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

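  A sketch of the call from the error handler under this scheme:

      struct tnl_ptk_info tpi;

      /* NULL csum_err: a bad checksum no longer aborts parsing, so
       * tpi.key is valid for the tunnel lookup that follows */
      if (gre_parse_header(skb, &tpi, NULL, htons(ETH_P_IP), 0) < 0)
          return;
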
* net: phy: et011c: Remove incorrect PHY_POLL flags | Florian Fainelli | 2018-09-17 | 1 | -1/+0

  PHY_POLL is defined as -1, which means that we would be setting all flags of the PHY driver; it is also not a valid flag to tell PHYLIB about. Just remove it.

  Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
  Reviewed-by: Andrew Lunn <andrew@lunn.ch>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* Merge branch 'act_police-lockless-data-path' | David S. Miller | 2018-09-17 | 1 | -80/+106

  Davide Caratti says:

  ====================
  net/sched: act_police: lockless data path

  The data path of the 'police' action can be faster if we avoid using spinlocks:
  - patch 1 converts act_police to use per-cpu counters
  - patch 2 lets act_police use RCU to access its configuration data

  test procedure (using pktgen from https://github.com/netoptimizer):

   # ip link add name eth1 type dummy
   # ip link set dev eth1 up
   # tc qdisc add dev eth1 clsact
   # tc filter add dev eth1 egress matchall action police \
   > rate 2gbit burst 100k conform-exceed pass/pass index 100
   # for c in 1 2 4; do
   > ./pktgen_bench_xmit_mode_queue_xmit.sh -v -s 64 -t $c -n 5000000 -i eth1
   > done

  test results (avg. pps/thread):

   $c | before patch | after patch | improvement
  ----+--------------+-------------+-------------
    1 |      3518448 |     3591240 | irrelevant
    2 |      3070065 |     3383393 | 10%
    4 |      1540969 |     3238385 | 110%
  ====================

  Signed-off-by: David S. Miller <davem@davemloft.net>

| * net/sched: act_police: don't use spinlock in the data path | Davide Caratti | 2018-09-17 | 1 | -64/+92

    Use RCU instead of spinlocks to protect concurrent read/write on the act_police configuration. This reduces the effects of contention in the data path, in case multiple readers are present.

    Signed-off-by: Davide Caratti <dcaratti@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

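    The two halves of the pattern, as a sketch (the params/rcu field names are illustrative; kfree_rcu assumes the params struct embeds a struct rcu_head named rcu):

        /* data path: lock-free read of the current parameters */
        rcu_read_lock();
        p = rcu_dereference(police->params);
        /* ... conform/exceed decision based on *p ... */
        rcu_read_unlock();

        /* control path (RTNL held): publish the new parameters and
         * reclaim the old ones after a grace period */
        old = rtnl_dereference(police->params);
        rcu_assign_pointer(police->params, new);
        if (old)
            kfree_rcu(old, rcu);
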
| * net/sched: act_police: use per-cpu counters | Davide Caratti | 2018-09-17 | 1 | -24/+22

    Use per-CPU counters instead of sharing a single set of stats with all cores. This removes the need for taking a spinlock when statistics are read or updated.

    Signed-off-by: Davide Caratti <dcaratti@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

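    What the per-CPU update looks like, as a sketch (the field path police->common.cpu_bstats is assumed; this mirrors the generic bstats_cpu_update() helper):

        /* each CPU bumps its own counters: no shared cache line,
         * no lock in the fast path */
        struct gnet_stats_basic_cpu *b = this_cpu_ptr(police->common.cpu_bstats);

        u64_stats_update_begin(&b->syncp);
        b->bstats.bytes += qdisc_pkt_len(skb);
        b->bstats.packets++;
        u64_stats_update_end(&b->syncp);
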
* cxgb4: update supported DCB version | Ganesh Goudar | 2018-09-14 | 2 | -2/+31

  - In the CXGB4_DCB_STATE_FW_INCOMPLETE state, check if the dcb version has changed and update the supported dcb version.
  - Also, fill the priority code point value for priority based flow control.

  Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* cxgb4: add per rx-queue counter for packet errors | Ganesh Goudar | 2018-09-14 | 3 | -0/+6

  Print per rx-queue packet errors in sge_qinfo.

  Signed-off-by: Casey Leedom <leedom@chelsio.com>
  Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* cxgb4: Fix endianness issue in t4_fwcache() | Ganesh Goudar | 2018-09-14 | 1 | -1/+1

  Do not put a host-endian 0 or 1 into a big endian field.

  Reported-by: Al Viro <viro@zeniv.linux.org.uk>
  Signed-off-by: Ganesh Goudar <ganeshgr@chelsio.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* net: move definition of pcpu_lstats to header file | Li RongQing | 2018-09-14 | 4 | -22/+10

  pcpu_lstats is defined in several files, so unify the definitions into one and move it to a header file.

  Signed-off-by: Zhang Yu <zhangyu31@baidu.com>
  Signed-off-by: Li RongQing <lirongqing@baidu.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

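  The consolidated definition is a small per-CPU stats triple, roughly:

      struct pcpu_lstats {
          u64                     packets;
          u64                     bytes;
          struct u64_stats_sync   syncp;  /* guards 64-bit reads on 32-bit hosts */
      };
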
* net/ibm/emac: Remove VLA usage | Kees Cook | 2018-09-14 | 2 | -1/+8

  In the quest to remove all stack VLA usage from the kernel[1], this removes the VLA used for the emac xaht registers size. Since the size of registers can only ever be 4 or 8, as detected in emac_init_config(), the max can be hardcoded and a runtime test added for robustness.

  [1] https://lkml.kernel.org/r/CA+55aFzCG-zNmZwX4A2FQpadafLfEzK6CC=qPXydAacU1RqZWA@mail.gmail.com

  Cc: "David S. Miller" <davem@davemloft.net>
  Cc: Christian Lamparter <chunkeey@gmail.com>
  Cc: Ivan Mikhaylov <ivan@de.ibm.com>
  Cc: netdev@vger.kernel.org
  Co-developed-by: Benjamin Herrenschmidt <benh@kernel.crashing.org>
  Signed-off-by: Kees Cook <keescook@chromium.org>
  Signed-off-by: David S. Miller <davem@davemloft.net>

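  The shape of the change, as a sketch (dev->xaht_regs and EMAC_XAHT_MAX_REGS are illustrative names; only the 4-or-8 bound comes from the commit text):

      /* before: stack VLA, size known only at runtime */
      u32 buf[dev->xaht_regs];

      /* after: hardcode the worst case (8) and verify at runtime */
      u32 buf[EMAC_XAHT_MAX_REGS];

      if (WARN_ON(dev->xaht_regs > EMAC_XAHT_MAX_REGS))
          return -EINVAL;
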
* pktgen: Fix fall-through annotation | Gustavo A. R. Silva | 2018-09-14 | 1 | -1/+1

  Replace "fallthru" with a proper "fall through" annotation. This fix is part of the ongoing efforts to enable -Wimplicit-fallthrough.

  Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

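  What the annotation looks like in context, as a sketch (the case labels are illustrative); the same change applies to the tg3 commit below:

      switch (cmd) {
      case CMD_SETUP:
          setup();
          /* fall through */  /* annotation style used for -Wimplicit-fallthrough */
      case CMD_RUN:
          run();
          break;
      }
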
* tg3: Fix fall-through annotations | Gustavo A. R. Silva | 2018-09-14 | 1 | -6/+6

  Replace "fallthru" with a proper "fall through" annotation. This fix is part of the ongoing efforts to enable -Wimplicit-fallthrough.

  Signed-off-by: Gustavo A. R. Silva <gustavo@embeddedor.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* gso_segment: Reset skb->mac_len after modifying network header | Toke Høiland-Jørgensen | 2018-09-13 | 2 | -0/+2

  When splitting a GSO segment that consists of encapsulated packets, the skb->mac_len of the segments can end up being set wrong, causing packet drops in particular when using act_mirred and ifb interfaces in combination with a qdisc that splits GSO packets. This happens because at the time skb_segment() is called, network_header will point to the inner header, throwing off the calculation in skb_reset_mac_len(). The network_header is subsequently adjusted by the outer IP gso_segment handlers, but they don't set the mac_len. Fix this by adding skb_reset_mac_len() calls to both the IPv4 and IPv6 gso_segment handlers, after they modify the network_header.

  Many thanks to Eric Dumazet for his help in identifying the cause of the bug.

  Acked-by: Dave Taht <dave.taht@gmail.com>
  Reviewed-by: Eric Dumazet <edumazet@google.com>
  Signed-off-by: Toke Høiland-Jørgensen <toke@toke.dk>
  Signed-off-by: David S. Miller <davem@davemloft.net>

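  The essence of the fix in each gso_segment handler, as a sketch (nhoff stands for whatever outer network header offset the handler restores):

      /* the outer network header offset is put back in place ... */
      skb_set_network_header(skb, nhoff);

      /* ... so recompute mac_len = network_header - mac_header, which
       * skb_segment() left pointing at the inner header */
      skb_reset_mac_len(skb);
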
* vxlan: Remove duplicated include from vxlan.h | YueHaibing | 2018-09-13 | 1 | -1/+0

  Remove a duplicated include.

  Signed-off-by: YueHaibing <yuehaibing@huawei.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

* net: dsa: b53: Do not fail when IRQ are not initialized | Florian Fainelli | 2018-09-13 | 1 | -1/+7

  When the Device Tree is not providing the per-port interrupts, do not fail during b53_srab_irq_enable() but instead bail out gracefully. The SRAB driver is used on the BCM5301X (Northstar) platforms, which do not yet have the SRAB interrupts wired up.

  Fixes: 16994374a6fc ("net: dsa: b53: Make SRAB driver manage port interrupts")
  Signed-off-by: Florian Fainelli <f.fainelli@gmail.com>
  Signed-off-by: David S. Miller <davem@davemloft.net>

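  The graceful-bail idea, as a sketch (priv->irq is an illustrative field; the driver's actual per-port bookkeeping differs):

      /* IRQs not wired up in the Device Tree (e.g. BCM5301X):
       * skip interrupt setup instead of failing the port */
      if (priv->irq <= 0)
          return 0;
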
* Merge branch 'vhost_net-TX-batching' | David S. Miller | 2018-09-13 | 5 | -80/+471

  Jason Wang says:

  ====================
  vhost_net TX batching

  This series tries to batch submitting packets to the underlayer socket through msg_control during sendmsg(). This is done by:

  1) doing the userspace copy inside vhost_net
  2) building the XDP buff
  3) batching at most 64 (VHOST_NET_BATCH) XDP buffs and submitting them at once through msg_control during sendmsg()
  4) letting underlayer sockets use the XDP buffs directly when XDP is enabled, or build skbs based on the XDP buffs

  For packets that can not be built easily with XDP, or for cases where batch submission is hard (e.g. sndbuf is limited), we fall back to the previous slow path, passing an iov iterator to the underlayer socket through sendmsg() once per packet. Batching helps to improve cache utilization and avoid lots of indirect calls with sendmsg(). It can also co-operate with the batching support of the underlayer sockets (e.g. the case of XDP redirection through maps).

  Testpmd (txonly) in the guest shows obvious improvements:

  Test                 /+pps%
  XDP_DROP on TAP      /+44.8%
  XDP_REDIRECT on TAP  /+29%
  macvtap (skb)        /+26%

  Netperf TCP_STREAM TX from the guest shows obvious improvements on small packets:

  size/session/+thu%/+normalize%
     64/ 1/  +2%/   0%
     64/ 2/  +3%/  +1%
     64/ 4/  +7%/  +5%
     64/ 8/  +8%/  +6%
    256/ 1/  +3%/   0%
    256/ 2/ +10%/  +7%
    256/ 4/ +26%/ +22%
    256/ 8/ +27%/ +23%
    512/ 1/  +3%/  +2%
    512/ 2/ +19%/ +14%
    512/ 4/ +43%/ +40%
    512/ 8/ +45%/ +41%
   1024/ 1/  +4%/   0%
   1024/ 2/ +27%/ +21%
   1024/ 4/ +38%/ +73%
   1024/ 8/ +15%/ +24%
   2048/ 1/ +10%/  +7%
   2048/ 2/ +16%/ +12%
   2048/ 4/   0%/  +2%
   2048/ 8/   0%/  +2%
   4096/ 1/ +36%/ +60%
   4096/ 2/ -11%/ -26%
   4096/ 4/   0%/ +14%
   4096/ 8/   0%/  +4%
  16384/ 1/  -1%/  +5%
  16384/ 2/   0%/  +2%
  16384/ 4/   0%/  -3%
  16384/ 8/   0%/  +4%
  65535/ 1/   0%/ +10%
  65535/ 2/   0%/  +8%
  65535/ 4/   0%/  +1%
  65535/ 8/   0%/  +3%

  Please review.
  ====================

  Signed-off-by: David S. Miller <davem@davemloft.net>

| * vhost_net: batch submitting XDP buffers to underlayer sockets | Jason Wang | 2018-09-13 | 1 | -13/+161

    This patch implements XDP batching for vhost_net. The idea is to first try to do the userspace copy and build the XDP buff directly in vhost. Instead of submitting the packet immediately, vhost_net batches packets in an array and submits every 64 (VHOST_NET_BATCH) of them to the underlayer sockets through the msg_control of sendmsg().

    When XDP is enabled on the TUN/TAP, TUN/TAP can process the XDP buffs inside a loop without caring about GUP, and can thus do batched map flushing. When XDP is not enabled or not supported, the underlayer socket needs to build an skb and pass it to the network core. The batched packet submission allows us to do batching like netif_receive_skb_list() in the future. This saves lots of indirect calls for better cache utilization. For the cases where we can't do batching, e.g. when sndbuf is limited or the packet size is too large, we go for the usual one-packet-per-sendmsg() way.

    Doing testpmd on various setups gives us:

    Test                 /+pps%
    XDP_DROP on TAP      /+44.8%
    XDP_REDIRECT on TAP  /+29%
    macvtap (skb)        /+26%

    Netperf tests show obvious improvements for small packet transmission:

    size/session/+thu%/+normalize%
       64/ 1/  +2%/   0%
       64/ 2/  +3%/  +1%
       64/ 4/  +7%/  +5%
       64/ 8/  +8%/  +6%
      256/ 1/  +3%/   0%
      256/ 2/ +10%/  +7%
      256/ 4/ +26%/ +22%
      256/ 8/ +27%/ +23%
      512/ 1/  +3%/  +2%
      512/ 2/ +19%/ +14%
      512/ 4/ +43%/ +40%
      512/ 8/ +45%/ +41%
     1024/ 1/  +4%/   0%
     1024/ 2/ +27%/ +21%
     1024/ 4/ +38%/ +73%
     1024/ 8/ +15%/ +24%
     2048/ 1/ +10%/  +7%
     2048/ 2/ +16%/ +12%
     2048/ 4/   0%/  +2%
     2048/ 8/   0%/  +2%
     4096/ 1/ +36%/ +60%
     4096/ 2/ -11%/ -26%
     4096/ 4/   0%/ +14%
     4096/ 8/   0%/  +4%
    16384/ 1/  -1%/  +5%
    16384/ 2/   0%/  +2%
    16384/ 4/   0%/  -3%
    16384/ 8/   0%/  +4%
    65535/ 1/   0%/ +10%
    65535/ 2/   0%/  +8%
    65535/ 4/   0%/  +1%
    65535/ 8/   0%/  +3%

    Signed-off-by: Jason Wang <jasowang@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * tap: accept an array of XDP buffs through sendmsg() | Jason Wang | 2018-09-13 | 1 | -2/+72

    This patch implements the TUN_MSG_PTR msg_control type. This type allows the caller to pass an array of XDP buffs to tuntap through the ptr field of struct tun_msg_ctl. Tap will build skbs from those XDP buffers. This avoids lots of indirect calls, thus improving icache utilization, and allows batched XDP flushing when doing XDP redirection.

    Signed-off-by: Jason Wang <jasowang@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * tuntap: accept an array of XDP buffs through sendmsg() | Jason Wang | 2018-09-13 | 1 | -3/+114

    This patch implements the TUN_MSG_PTR msg_control type. This type allows the caller to pass an array of XDP buffs to tuntap through the ptr field of struct tun_msg_ctl. If an XDP program is attached, tuntap can run the XDP program directly. If not, tuntap will build an skb and do a fast receive, since part of the work has already been done by vhost_net. This avoids lots of indirect calls, thus improving icache utilization, and allows batched XDP flushing when doing XDP redirection.

    Signed-off-by: Jason Wang <jasowang@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * tun: switch to new type of msg_control | Jason Wang | 2018-09-13 | 4 | -9/+36

    This patch introduces a new tun/tap specific msg_control:

    #define TUN_MSG_UBUF 1
    #define TUN_MSG_PTR  2
    struct tun_msg_ctl {
            int type;
            void *ptr;
    };

    This allows us to pass different kinds of msg_control through sendmsg(). The first supported type is ubuf (TUN_MSG_UBUF), which will be used by the existing vhost_net zerocopy code. The second is the XDP buff, which allows vhost_net to pass an XDP buff to TUN. This will be used to implement accepting an array of XDP buffs from vhost_net in the following patches.

    Signed-off-by: Jason Wang <jasowang@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

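    How vhost_net would hand a batch over with this type, as a sketch (xdp_batch is an illustrative array of prepared XDP buffs):

        struct tun_msg_ctl ctl = {
            .type = TUN_MSG_PTR,
            .ptr  = xdp_batch,      /* consumed by tun/tap's sendmsg() */
        };
        struct msghdr msg = { .msg_control = &ctl };

        err = sock->ops->sendmsg(sock, &msg, 0);
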
| * tuntap: move XDP flushing out of tun_do_xdp() | Jason Wang | 2018-09-13 | 1 | -1/+2

    This will allow adding batch flushing on top.

    Signed-off-by: Jason Wang <jasowang@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>

| * tuntap: split out XDP logic | Jason Wang | 2018-09-13 | 1 | -37/+51

    This patch splits the XDP logic out into a single function, so it can be reused by the XDP batching path in the following patch.

    Signed-off-by: Jason Wang <jasowang@redhat.com>
    Signed-off-by: David S. Miller <davem@davemloft.net>