summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* ipv6: honor RT6_LOOKUP_F_DST_NOREF in rule lookup logicWei Wang2019-06-233-7/+20
| | | | | | | | | | | | This patch specifically converts the rule lookup logic to honor this flag and not release refcnt when traversing each rule and calling lookup() on each routing table. Similar to previous patch, we also need some special handling of dst entries in uncached list because there is always 1 refcnt taken for them even if RT6_LOOKUP_F_DST_NOREF flag is set. Signed-off-by: Wei Wang <weiwan@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: initialize rt6->rt6i_uncached in all pre-allocated dst entriesWei Wang2019-06-231-0/+3
| | | | | | | | | | | | | | | | Initialize rt6->rt6i_uncached on the following pre-allocated dsts: net->ipv6.ip6_null_entry net->ipv6.ip6_prohibit_entry net->ipv6.ip6_blk_hole_entry This is a preparation patch for later commits to be able to distinguish dst entries in uncached list by doing: !list_empty(rt6->rt6i_uncached) Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: introduce RT6_LOOKUP_F_DST_NOREF flag in ip6_pol_route()Wei Wang2019-06-232-43/+31
| | | | | | | | | | | | | | | | | | | | | | | | This new flag is to instruct the route lookup function to not take refcnt on the dst entry. The user which does route lookup with this flag must properly use rcu protection. ip6_pol_route() is the major route lookup function for both tx and rx path. In this function: Do not take refcnt on dst if RT6_LOOKUP_F_DST_NOREF flag is set, and directly return the route entry. The caller should be holding rcu lock when using this flag, and decide whether to take refcnt or not. One note on the dst cache in the uncached_list: As uncached_list does not consume refcnt, one refcnt is always returned back to the caller even if RT6_LOOKUP_F_DST_NOREF flag is set. Uncached dst is only possible in the output path. So in such call path, caller MUST check if the dst is in the uncached_list before assuming that there is no refcnt taken on the returned dst. Signed-off-by: Wei Wang <weiwan@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Mahesh Bandewar <maheshb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* doc: phy: document some PHY_INTERFACE_MODE_xxx settingsRussell King2019-06-231-1/+44
| | | | | | | | | | There seems to be some confusion surrounding three PHY interface modes, specifically 1000BASE-X, 2500BASE-X and SGMII. Add some documentation to phylib detailing precisely what these interface modes refer to. Signed-off-by: Russell King <rmk+kernel@armlinux.org.uk> Reviewed-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: fix compilation warnings in fqdir_pre_exit()Qian Cai2019-06-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | The linux-next commit "inet: fix various use-after-free in defrags units" [1] introduced compilation warnings, ./include/net/inet_frag.h:117:1: warning: 'inline' is not at beginning of declaration [-Wold-style-declaration] static void inline fqdir_pre_exit(struct fqdir *fqdir) ^~~~~~ In file included from ./include/net/netns/ipv4.h:10, from ./include/net/net_namespace.h:20, from ./include/linux/netdevice.h:38, from ./include/linux/icmpv6.h:13, from ./include/linux/ipv6.h:86, from ./include/net/ipv6.h:12, from ./include/rdma/ib_verbs.h:51, from ./include/linux/mlx5/device.h:37, from ./include/linux/mlx5/driver.h:51, from drivers/net/ethernet/mellanox/mlx5/core/pagealloc.c:37: [1] https://lore.kernel.org/netdev/20190618180900.88939-3-edumazet@google.com/ Signed-off-by: Qian Cai <cai@lca.pw> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: dsa: mv88e6xxx: introduce helpers for handling chip->reg_lockRasmus Villemoes2019-06-239-148/+158
| | | | | | | | | | | | | This is a no-op that simply moves all locking and unlocking of ->reg_lock into trivial helpers. I did that to be able to easily add some ad hoc instrumentation to those helpers to get some information on contention and hold times of the mutex. Perhaps others want to do something similar at some point, so this frees them from doing the 'sed -i' yoga, and have a much smaller 'git diff' while fiddling. Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Reviewed-by: Vivien Didelot <vivien.didelot@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ena: Fix bug where ring allocation backoff stopped too lateSameeh Jubran2019-06-231-2/+2
| | | | | | | | | | | | | | | | The current code of create_queues_with_size_backoff() allows the ring size to become as small as ENA_MIN_RING_SIZE/2. This is a bug since we don't want the queue ring to be smaller than ENA_MIN_RING_SIZE In this commit we change the loop's termination condition to look at the queue size of the next iteration instead of that of the current one, so that the minimal queue size again becomes ENA_MIN_RING_SIZE. Fixes: eece4d2ab9d2 ("net: ena: add ethtool function for changing io queue sizes") Signed-off-by: Arthur Kiyanovski <akiyano@amazon.com> Signed-off-by: Sameeh Jubran <sameehj@amazon.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* hinic: fix dereference of pointer hwdev before it is null checkedColin Ian King2019-06-231-6/+15
| | | | | | | | | | | Currently pointer hwdev is dereferenced when assigning hwif before hwdev is null checked. Fix this by only derefencing hwdev after the null check. Addresses-Coverity: ("Dereference before null check") Fixes: 4fdc51bb4e92 ("hinic: add support for rss parameters with ethtool") Signed-off-by: Colin Ian King <colin.king@canonical.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'net-mediatek-Add-MT7621-TRGMII-mode-support'David S. Miller2019-06-234-14/+85
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | René van Dorst says: ==================== net: mediatek: Add MT7621 TRGMII mode support Like many other mediatek SOCs, the MT7621 SOC and the internal MT7530 switch both supports TRGMII mode. MT7621 TRGMII speed is fix 1200MBit. v1->v2: - Fix breakage on non MT7621 SOC - Support 25MHz and 40MHz XTAL as MT7530 clocksource ==================== Tested-by: "Frank Wunderlich" <frank-w@public-files.de> Acked-by: Sean Wang <sean.wang@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: dsa: mt7530: Add MT7621 TRGMII mode supportRené van Dorst2019-06-232-10/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch add support TRGMII mode for MT7621 internal MT7530 switch. MT7621 TRGMII has only one fix speed mode of 1200MBit. Also adding support for mt7530 25MHz and 40MHz crystal clocksource. Values are based on Banana Pi R2 bsp [1]. Don't change MT7623 registers on a MT7621 device. [1] https://github.com/BPI-SINOVOIP/BPI-R2-bsp/blob/master/linux-mt/drivers/net/ethernet/mediatek/gsw_mt7623.c#L769 Signed-off-by: René van Dorst <opensource@vdorst.com> Tested-by: Frank Wunderlich <frank-w@public-files.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ethernet: mediatek: Add MT7621 TRGMII mode supportRené van Dorst2019-06-232-4/+45
|/ | | | | | | | MT7621 SOC also supports TRGMII. TRGMII speed is 1200MBit. Signed-off-by: René van Dorst <opensource@vdorst.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netns: restore ops before calling ops_exit_listLi RongQing2019-06-231-1/+1
| | | | | | | | | | ops has been iterated to first element when call pre_exit, and it needs to restore from save_ops, not save ops to save_ops Fixes: d7d99872c144 ("netns: add pre_exit method to struct pernet_operations") Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Error when route does not have any valid nexthopsIdo Schimmel2019-06-231-0/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When user space sends invalid information in RTA_MULTIPATH, the nexthop list in ip6_route_multipath_add() is empty and 'rt_notif' is set to NULL. The code that emits the in-kernel notifications does not check for this condition, which results in a NULL pointer dereference [1]. Fix this by bailing earlier in the function if the parsed nexthop list is empty. This is consistent with the corresponding IPv4 code. v2: * Check if parsed nexthop list is empty and bail with extack set [1] kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] PREEMPT SMP KASAN CPU: 0 PID: 9190 Comm: syz-executor149 Not tainted 5.2.0-rc5+ #38 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 RIP: 0010:call_fib6_multipath_entry_notifiers+0xd1/0x1a0 net/ipv6/ip6_fib.c:396 Code: 8b b5 30 ff ff ff 48 c7 85 68 ff ff ff 00 00 00 00 48 c7 85 70 ff ff ff 00 00 00 00 89 45 88 4c 89 e0 48 c1 e8 03 4c 89 65 80 <42> 80 3c 28 00 0f 85 9a 00 00 00 48 b8 00 00 00 00 00 fc ff df 4d RSP: 0018:ffff88809788f2c0 EFLAGS: 00010246 RAX: 0000000000000000 RBX: 1ffff11012f11e59 RCX: 00000000ffffffff RDX: 0000000000000000 RSI: 0000000000000000 RDI: 0000000000000000 RBP: ffff88809788f390 R08: ffff88809788f8c0 R09: 000000000000000c R10: ffff88809788f5d8 R11: ffff88809788f527 R12: 0000000000000000 R13: dffffc0000000000 R14: ffff88809788f8c0 R15: ffffffff89541d80 FS: 000055555632c880(0000) GS:ffff8880ae800000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000080 CR3: 000000009ba7c000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: ip6_route_multipath_add+0xc55/0x1490 net/ipv6/route.c:5094 inet6_rtm_newroute+0xed/0x180 net/ipv6/route.c:5208 rtnetlink_rcv_msg+0x463/0xb00 net/core/rtnetlink.c:5219 netlink_rcv_skb+0x177/0x450 net/netlink/af_netlink.c:2477 rtnetlink_rcv+0x1d/0x30 net/core/rtnetlink.c:5237 netlink_unicast_kernel net/netlink/af_netlink.c:1302 [inline] netlink_unicast+0x531/0x710 net/netlink/af_netlink.c:1328 netlink_sendmsg+0x8ae/0xd70 net/netlink/af_netlink.c:1917 sock_sendmsg_nosec net/socket.c:646 [inline] sock_sendmsg+0xd7/0x130 net/socket.c:665 ___sys_sendmsg+0x803/0x920 net/socket.c:2286 __sys_sendmsg+0x105/0x1d0 net/socket.c:2324 __do_sys_sendmsg net/socket.c:2333 [inline] __se_sys_sendmsg net/socket.c:2331 [inline] __x64_sys_sendmsg+0x78/0xb0 net/socket.c:2331 do_syscall_64+0xfd/0x680 arch/x86/entry/common.c:301 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x4401f9 Code: 18 89 d0 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 0f 83 fb 13 fc ff c3 66 2e 0f 1f 84 00 00 00 00 RSP: 002b:00007ffc09fd0028 EFLAGS: 00000246 ORIG_RAX: 000000000000002e RAX: ffffffffffffffda RBX: 00000000004002c8 RCX: 00000000004401f9 RDX: 0000000000000000 RSI: 0000000020000080 RDI: 0000000000000003 RBP: 00000000006ca018 R08: 0000000000000000 R09: 00000000004002c8 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000401a80 R13: 0000000000401b10 R14: 0000000000000000 R15: 0000000000000000 Reported-by: syzbot+382566d339d52cd1a204@syzkaller.appspotmail.com Fixes: ebee3cad835f ("ipv6: Add IPv6 multipath notifications for add / replace") Signed-off-by: Ido Schimmel <idosch@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Reviewed-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* fjes: no need to check return value of debugfs_create functionsGreg Kroah-Hartman2019-06-231-13/+2
| | | | | | | | | | | | When calling debugfs functions, there is no need to ever check the return value. The function can work or not, but the code logic should never do something different based on this. Cc: "David S. Miller" <davem@davemloft.net> Cc: Yangtao Li <tiny.windzz@gmail.com> Cc: netdev@vger.kernel.org Signed-off-by: Greg Kroah-Hartman <gregkh@linuxfoundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: fastopen: robustness and endianness fixes for SipHashArd Biesheuvel2019-06-235-27/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some changes to the TCP fastopen code to make it more robust against future changes in the choice of key/cookie size, etc. - Instead of keeping the SipHash key in an untyped u8[] buffer and casting it to the right type upon use, use the correct type directly. This ensures that the key will appear at the correct alignment if we ever change the way these data structures are allocated. (Currently, they are only allocated via kmalloc so they always appear at the correct alignment) - Use DIV_ROUND_UP when sizing the u64[] array to hold the cookie, so it is always of sufficient size, even if TCP_FASTOPEN_COOKIE_MAX is no longer a multiple of 8. - Drop the 'len' parameter from the tcp_fastopen_reset_cipher() function, which is no longer used. - Add endian swabbing when setting the keys and calculating the hash, to ensure that cookie values are the same for a given key and source/destination address pair regardless of the endianness of the server. Note that none of these are functional changes wrt the current state of the code, with the exception of the swabbing, which only affects big endian systems. Signed-off-by: Ard Biesheuvel <ard.biesheuvel@linaro.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2019-06-225301-26605/+7190
|\ | | | | | | | | | | Minor SPDX change conflict. Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2019-06-2225-112/+123
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Fix leak of unqueued fragments in ipv6 nf_defrag, from Guillaume Nault. 2) Don't access the DDM interface unless the transceiver implements it in bnx2x, from Mauro S. M. Rodrigues. 3) Don't double fetch 'len' from userspace in sock_getsockopt(), from JingYi Hou. 4) Sign extension overflow in lio_core, from Colin Ian King. 5) Various netem bug fixes wrt. corrupted packets from Jakub Kicinski. 6) Fix epollout hang in hvsock, from Sunil Muthuswamy. 7) Fix regression in default fib6_type, from David Ahern. 8) Handle memory limits in tcp_fragment more appropriately, from Eric Dumazet. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (24 commits) tcp: refine memory limit test in tcp_fragment() inet: clear num_timeout reqsk_alloc() net: mvpp2: debugfs: Add pmap to fs dump ipv6: Default fib6_type to RTN_UNICAST when not set net: hns3: Fix inconsistent indenting net/af_iucv: always register net_device notifier net/af_iucv: build proper skbs for HiperTransport net/af_iucv: remove GFP_DMA restriction for HiperTransport net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6185_g1_vtu_loadpurge() hvsock: fix epollout hang from race condition net/udp_gso: Allow TX timestamp with UDP GSO net: netem: fix use after free and double free with packet corruption net: netem: fix backlog accounting for corrupted GSO frames net: lio_core: fix potential sign-extension overflow on large shift tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skb ip6_tunnel: allow not to count pkts on tstats by passing dev as NULL ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL tun: wake up waitqueues after IFF_UP is set net: remove duplicate fetch in sock_getsockopt tipc: fix issues with early FAILOVER_MSG from peer ...
| | * tcp: refine memory limit test in tcp_fragment()Eric Dumazet2019-06-221-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | tcp_fragment() might be called for skbs in the write queue. Memory limits might have been exceeded because tcp_sendmsg() only checks limits at full skb (64KB) boundaries. Therefore, we need to make sure tcp_fragment() wont punish applications that might have setup very low SO_SNDBUF values. Fixes: f070ef2ac667 ("tcp: tcp_fragment() should apply sane memory limits") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Christoph Paasch <cpaasch@apple.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * inet: clear num_timeout reqsk_alloc()Eric Dumazet2019-06-193-8/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | KMSAN caught uninit-value in tcp_create_openreq_child() [1] This is caused by a recent change, combined by the fact that TCP cleared num_timeout, num_retrans and sk fields only when a request socket was about to be queued. Under syncookie mode, a temporary request socket is used, and req->num_timeout could contain garbage. Lets clear these three fields sooner, there is really no point trying to defer this and risk other bugs. [1] BUG: KMSAN: uninit-value in tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526 CPU: 1 PID: 13357 Comm: syz-executor591 Not tainted 5.2.0-rc4+ #3 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0x191/0x1f0 lib/dump_stack.c:113 kmsan_report+0x162/0x2d0 mm/kmsan/kmsan.c:611 __msan_warning+0x75/0xe0 mm/kmsan/kmsan_instr.c:304 tcp_create_openreq_child+0x157f/0x1cc0 net/ipv4/tcp_minisocks.c:526 tcp_v6_syn_recv_sock+0x761/0x2d80 net/ipv6/tcp_ipv6.c:1152 tcp_get_cookie_sock+0x16e/0x6b0 net/ipv4/syncookies.c:209 cookie_v6_check+0x27e0/0x29a0 net/ipv6/syncookies.c:252 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline] tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397 ip6_input_finish net/ipv6/ip6_input.c:438 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447 dst_input include/net/dst.h:439 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272 __netif_receive_skb_one_core net/core/dev.c:4981 [inline] __netif_receive_skb net/core/dev.c:5095 [inline] process_backlog+0x721/0x1410 net/core/dev.c:5906 napi_poll net/core/dev.c:6329 [inline] net_rx_action+0x738/0x1940 net/core/dev.c:6395 __do_softirq+0x4ad/0x858 kernel/softirq.c:293 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052 </IRQ> do_softirq kernel/softirq.c:338 [inline] __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline] ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167 dst_output include/net/dst.h:433 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline] tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470 __sock_release net/socket.c:601 [inline] sock_close+0x156/0x490 net/socket.c:1273 __fput+0x4c9/0xba0 fs/file_table.c:280 ____fput+0x37/0x40 fs/file_table.c:313 task_work_run+0x22e/0x2a0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x63/0xe7 RIP: 0033:0x401d50 Code: 01 f0 ff ff 0f 83 40 0d 00 00 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 44 00 00 83 3d dd 8d 2d 00 00 75 14 b8 03 00 00 00 0f 05 <48> 3d 01 f0 ff ff 0f 83 14 0d 00 00 c3 48 83 ec 08 e8 7a 02 00 00 RSP: 002b:00007fff1cf58cf8 EFLAGS: 00000246 ORIG_RAX: 0000000000000003 RAX: 0000000000000000 RBX: 0000000000000004 RCX: 0000000000401d50 RDX: 000000000000001c RSI: 0000000000000000 RDI: 0000000000000003 RBP: 00000000004a9050 R08: 0000000020000040 R09: 000000000000001c R10: 0000000020004004 R11: 0000000000000246 R12: 0000000000402ef0 R13: 0000000000402f80 R14: 0000000000000000 R15: 0000000000000000 Uninit was created at: kmsan_save_stack_with_flags mm/kmsan/kmsan.c:201 [inline] kmsan_internal_poison_shadow+0x53/0xa0 mm/kmsan/kmsan.c:160 kmsan_kmalloc+0xa4/0x130 mm/kmsan/kmsan_hooks.c:177 kmem_cache_alloc+0x534/0xb00 mm/slub.c:2781 reqsk_alloc include/net/request_sock.h:84 [inline] inet_reqsk_alloc+0xa8/0x600 net/ipv4/tcp_input.c:6384 cookie_v6_check+0xadb/0x29a0 net/ipv6/syncookies.c:173 tcp_v6_cookie_check net/ipv6/tcp_ipv6.c:1039 [inline] tcp_v6_do_rcv+0xf1c/0x1ce0 net/ipv6/tcp_ipv6.c:1344 tcp_v6_rcv+0x60b7/0x6a30 net/ipv6/tcp_ipv6.c:1554 ip6_protocol_deliver_rcu+0x1433/0x22f0 net/ipv6/ip6_input.c:397 ip6_input_finish net/ipv6/ip6_input.c:438 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_input+0x2af/0x340 net/ipv6/ip6_input.c:447 dst_input include/net/dst.h:439 [inline] ip6_rcv_finish net/ipv6/ip6_input.c:76 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ipv6_rcv+0x683/0x710 net/ipv6/ip6_input.c:272 __netif_receive_skb_one_core net/core/dev.c:4981 [inline] __netif_receive_skb net/core/dev.c:5095 [inline] process_backlog+0x721/0x1410 net/core/dev.c:5906 napi_poll net/core/dev.c:6329 [inline] net_rx_action+0x738/0x1940 net/core/dev.c:6395 __do_softirq+0x4ad/0x858 kernel/softirq.c:293 do_softirq_own_stack+0x49/0x80 arch/x86/entry/entry_64.S:1052 do_softirq kernel/softirq.c:338 [inline] __local_bh_enable_ip+0x199/0x1e0 kernel/softirq.c:190 local_bh_enable+0x36/0x40 include/linux/bottom_half.h:32 rcu_read_unlock_bh include/linux/rcupdate.h:682 [inline] ip6_finish_output2+0x213f/0x2670 net/ipv6/ip6_output.c:117 ip6_finish_output+0xae4/0xbc0 net/ipv6/ip6_output.c:150 NF_HOOK_COND include/linux/netfilter.h:294 [inline] ip6_output+0x5d3/0x720 net/ipv6/ip6_output.c:167 dst_output include/net/dst.h:433 [inline] NF_HOOK include/linux/netfilter.h:305 [inline] ip6_xmit+0x1f53/0x2650 net/ipv6/ip6_output.c:271 inet6_csk_xmit+0x3df/0x4f0 net/ipv6/inet6_connection_sock.c:135 __tcp_transmit_skb+0x4076/0x5b40 net/ipv4/tcp_output.c:1156 tcp_transmit_skb net/ipv4/tcp_output.c:1172 [inline] tcp_write_xmit+0x39a9/0xa730 net/ipv4/tcp_output.c:2397 __tcp_push_pending_frames+0x124/0x4e0 net/ipv4/tcp_output.c:2573 tcp_send_fin+0xd43/0x1540 net/ipv4/tcp_output.c:3118 tcp_close+0x16ba/0x1860 net/ipv4/tcp.c:2403 inet_release+0x1f7/0x270 net/ipv4/af_inet.c:427 inet6_release+0xaf/0x100 net/ipv6/af_inet6.c:470 __sock_release net/socket.c:601 [inline] sock_close+0x156/0x490 net/socket.c:1273 __fput+0x4c9/0xba0 fs/file_table.c:280 ____fput+0x37/0x40 fs/file_table.c:313 task_work_run+0x22e/0x2a0 kernel/task_work.c:113 tracehook_notify_resume include/linux/tracehook.h:185 [inline] exit_to_usermode_loop arch/x86/entry/common.c:168 [inline] prepare_exit_to_usermode+0x39d/0x4d0 arch/x86/entry/common.c:199 syscall_return_slowpath+0x90/0x5c0 arch/x86/entry/common.c:279 do_syscall_64+0xe2/0xf0 arch/x86/entry/common.c:305 entry_SYSCALL_64_after_hwframe+0x63/0xe7 Fixes: 336c39a03151 ("tcp: undo init congestion window on false SYNACK timeout") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * net: mvpp2: debugfs: Add pmap to fs dumpNathan Huckleberry2019-06-191-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There was an unused variable 'mvpp2_dbgfs_prs_pmap_fops' Added a usage consistent with other fops to dump pmap to userspace. Cc: clang-built-linux@googlegroups.com Link: https://github.com/ClangBuiltLinux/linux/issues/529 Signed-off-by: Nathan Huckleberry <nhuck@google.com> Tested-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * ipv6: Default fib6_type to RTN_UNICAST when not setDavid Ahern2019-06-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A user reported that routes are getting installed with type 0 (RTN_UNSPEC) where before the routes were RTN_UNICAST. One example is from accel-ppp which apparently still uses the ioctl interface and does not set rtmsg_type. Another is the netlink interface where ipv6 does not require rtm_type to be set (v4 does). Prior to the commit in the Fixes tag the ipv6 stack converted type 0 to RTN_UNICAST, so restore that behavior. Fixes: e8478e80e5a7 ("net/ipv6: Save route type in rt6_info") Signed-off-by: David Ahern <dsahern@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * net: hns3: Fix inconsistent indentingKrzysztof Kozlowski2019-06-191-1/+1
| | | | | | | | | | | | | | | | | | | | | Fix wrong indentation of goto return. Signed-off-by: Krzysztof Kozlowski <krzk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * Merge branch 'af_iucv-fixes'David S. Miller2019-06-191-13/+36
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Julian Wiedmann says: ==================== net/af_iucv: fixes 2019-06-18 I spent a few cycles on transmit problems for af_iucv over regular netdevices - please apply the following fixes to -net. The first patch allows for skb allocations outside of GFP_DMA, while the second patch respects that drivers might use skb_cow_head() and/or want additional dev->needed_headroom. Patch 3 is for a separate issue, where we didn't setup some of the netdevice-specific infrastructure when running as a z/VM guest. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * net/af_iucv: always register net_device notifierJulian Wiedmann2019-06-191-7/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Even when running as VM guest (ie pr_iucv != NULL), af_iucv can still open HiperTransport-based connections. For robust operation these connections require the af_iucv_netdev_notifier, so register it unconditionally. Also handle any error that register_netdevice_notifier() returns. Fixes: 9fbd87d41392 ("af_iucv: handle netdev events") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * net/af_iucv: build proper skbs for HiperTransportJulian Wiedmann2019-06-191-5/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The HiperSockets-based transport path in af_iucv is still too closely entangled with qeth. With commit a647a02512ca ("s390/qeth: speed-up L3 IQD xmit"), the relevant xmit code in qeth has begun to use skb_cow_head(). So to avoid unnecessary skb head expansions, af_iucv must learn to 1) respect dev->needed_headroom when allocating skbs, and 2) drop the header reference before cloning the skb. While at it, also stop hard-coding the LL-header creation stage and just use the appropriate helper. Fixes: a647a02512ca ("s390/qeth: speed-up L3 IQD xmit") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * net/af_iucv: remove GFP_DMA restriction for HiperTransportJulian Wiedmann2019-06-191-1/+5
| | |/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | af_iucv sockets over z/VM IUCV require that their skbs are allocated in DMA memory. This restriction doesn't apply to connections over HiperSockets. So only set this limit for z/VM IUCV sockets, thereby increasing the likelihood that the large (and linear!) allocations for HiperTransport messages succeed. Fixes: 3881ac441f64 ("af_iucv: add HiperSockets transport") Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Reviewed-by: Hendrik Brueckner <brueckner@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * net: dsa: mv88e6xxx: fix shift of FID bits in mv88e6185_g1_vtu_loadpurge()Rasmus Villemoes2019-06-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | The comment is correct, but the code ends up moving the bits four places too far, into the VTUOp field. Fixes: 11ea809f1a74 (net: dsa: mv88e6xxx: support 256 databases) Signed-off-by: Rasmus Villemoes <rasmus.villemoes@prevas.dk> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2019-06-193-14/+14
| | |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net 1) Module autoload for masquerade and redirection does not work. 2) Leak in unqueued packets in nf_ct_frag6_queue(). Ignore duplicated fragments, pretend they are placed into the queue. Patches from Guillaume Nault. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * netfilter: ipv6: nf_defrag: accept duplicate fragments againGuillaume Nault2019-06-071-3/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When fixing the skb leak introduced by the conversion to rbtree, I forgot about the special case of duplicate fragments. The condition under the 'insert_error' label isn't effective anymore as nf_ct_frg6_gather() doesn't override the returned value anymore. So duplicate fragments now get NF_DROP verdict. To accept duplicate fragments again, handle them specially as soon as inet_frag_queue_insert() reports them. Return -EINPROGRESS which will translate to NF_STOLEN verdict, like any accepted fragment. However, such packets don't carry any new information and aren't queued, so we just drop them immediately. Fixes: a0d56cb911ca ("netfilter: ipv6: nf_defrag: fix leakage of unqueued fragments") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | | * netfilter: ipv6: nf_defrag: fix leakage of unqueued fragmentsGuillaume Nault2019-06-041-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With commit 997dd9647164 ("net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c"), nf_ct_frag6_reasm() is now called from nf_ct_frag6_queue(). With this change, nf_ct_frag6_queue() can fail after the skb has been added to the fragment queue and nf_ct_frag6_gather() was adapted to handle this case. But nf_ct_frag6_queue() can still fail before the fragment has been queued. nf_ct_frag6_gather() can't handle this case anymore, because it has no way to know if nf_ct_frag6_queue() queued the fragment before failing. If it didn't, the skb is lost as the error code is overwritten with -EINPROGRESS. Fix this by setting -EINPROGRESS directly in nf_ct_frag6_queue(), so that nf_ct_frag6_gather() can propagate the error as is. Fixes: 997dd9647164 ("net: IP6 defrag: use rbtrees in nf_conntrack_reasm.c") Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | | * netfilter: nf_tables: fix module autoload with inet familyPablo Neira Ayuso2019-05-312-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use MODULE_ALIAS_NFT_EXPR() to make happy the inet family with nat. Fixes: 63ce3940f3ab ("netfilter: nft_redir: add inet support") Fixes: 071657d2c38c ("netfilter: nft_masq: add inet support") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| | * | hvsock: fix epollout hang from race conditionSunil Muthuswamy2019-06-191-31/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, hvsock can enter into a state where epoll_wait on EPOLLOUT will not return even when the hvsock socket is writable, under some race condition. This can happen under the following sequence: - fd = socket(hvsocket) - fd_out = dup(fd) - fd_in = dup(fd) - start a writer thread that writes data to fd_out with a combination of epoll_wait(fd_out, EPOLLOUT) and - start a reader thread that reads data from fd_in with a combination of epoll_wait(fd_in, EPOLLIN) - On the host, there are two threads that are reading/writing data to the hvsocket stack: hvs_stream_has_space hvs_notify_poll_out vsock_poll sock_poll ep_poll Race condition: check for epollout from ep_poll(): assume no writable space in the socket hvs_stream_has_space() returns 0 check for epollin from ep_poll(): assume socket has some free space < HVS_PKT_LEN(HVS_SEND_BUF_SIZE) hvs_stream_has_space() will clear the channel pending send size host will not notify the guest because the pending send size has been cleared and so the hvsocket will never mark the socket writable Now, the EPOLLOUT will never return even if the socket write buffer is empty. The fix is to set the pending size to the default size and never change it. This way the host will always notify the guest whenever the writable space is bigger than the pending size. The host is already optimized to *only* notify the guest when the pending size threshold boundary is crossed and not everytime. This change also reduces the cpu usage somewhat since hv_stream_has_space() is in the hotpath of send: vsock_stream_sendmsg()->hv_stream_has_space() Earlier hv_stream_has_space was setting/clearing the pending size on every call. Signed-off-by: Sunil Muthuswamy <sunilmut@microsoft.com> Reviewed-by: Dexuan Cui <decui@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net/udp_gso: Allow TX timestamp with UDP GSOFred Klassen2019-06-191-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fixes an issue where TX Timestamps are not arriving on the error queue when UDP_SEGMENT CMSG type is combined with CMSG type SO_TIMESTAMPING. This can be illustrated with an updated updgso_bench_tx program which includes the '-T' option to test for this condition. It also introduces the '-P' option which will call poll() before reading the error queue. ./udpgso_bench_tx -4ucTPv -S 1472 -l2 -D 172.16.120.18 poll timeout udp tx: 0 MB/s 1 calls/s 1 msg/s The "poll timeout" message above indicates that TX timestamp never arrived. This patch preserves tx_flags for the first UDP GSO segment. Only the first segment is timestamped, even though in some cases there may be benefital in timestamping both the first and last segment. Factors in deciding on first segment timestamp only: - Timestamping both first and last segmented is not feasible. Hardware can only have one outstanding TS request at a time. - Timestamping last segment may under report network latency of the previous segments. Even though the doorbell is suppressed, the ring producer counter has been incremented. - Timestamping the first segment has the upside in that it reports timestamps from the application's view, e.g. RTT. - Timestamping the first segment has the downside that it may underreport tx host network latency. It appears that we have to pick one or the other. And possibly follow-up with a config flag to choose behavior. v2: Remove tests as noted by Willem de Bruijn <willemb@google.com> Moving tests from net to net-next v3: Update only relevant tx_flag bits as per Willem de Bruijn <willemb@google.com> v4: Update comments and commit message as per Willem de Bruijn <willemb@google.com> Fixes: ee80d1ebe5ba ("udp: add udp gso") Signed-off-by: Fred Klassen <fklassen@appneta.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | Merge branch 'net-netem-fix-issues-with-corrupting-GSO-frames'David S. Miller2019-06-191-12/+14
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Jakub Kicinski says: ==================== net: netem: fix issues with corrupting GSO frames Corrupting GSO frames currently leads to crashes, due to skb use after free. These stem from the skb list handling - the segmented skbs come back on a list, and this list is not properly unlinked before enqueuing the segments. Turns out this condition is made very likely to occur because of another bug - in backlog accounting. Segments are counted twice, which means qdisc's limit gets reached leading to drops and making the use after free very likely to happen. The bugs are fixed in order in which they were added to the tree. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * | net: netem: fix use after free and double free with packet corruptionJakub Kicinski2019-06-191-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Brendan reports that the use of netem's packet corruption capability leads to strange crashes. This seems to be caused by commit d66280b12bd7 ("net: netem: use a list in addition to rbtree") which uses skb->next pointer to construct a fast-path queue of in-order skbs. Packet corruption code has to invoke skb_gso_segment() in case of skbs in need of GSO. skb_gso_segment() returns a list of skbs. If next pointers of the skbs on that list do not get cleared fast path list may point to freed skbs or skbs which are also on the RB tree. Let's say skb gets segmented into 3 frames: A -> B -> C A gets hooked to the t_head t_tail list by tfifo_enqueue(), but it's next pointer didn't get cleared so we have: h t |/ A -> B -> C Now if B and C get also get enqueued successfully all is fine, because tfifo_enqueue() will overwrite the list in order. IOW: Enqueue B: h t | | A -> B C Enqueue C: h t | | A -> B -> C But if B and C get reordered we may end up with: h t RB tree |/ | A -> B -> C B \ C Or if they get dropped just: h t |/ A -> B -> C where A and B are already freed. To reproduce either limit has to be set low to cause freeing of segs or reorders have to happen (due to delay jitter). Note that we only have to mark the first segment as not on the list, "finish_segs" handling of other frags already does that. Another caveat is that qdisc_drop_all() still has to free all segments correctly in case of drop of first segment, therefore we re-link segs before calling it. v2: - re-link before drop, v1 was leaking non-first segs if limit was hit at the first seg - better commit message which lead to discovering the above :) Reported-by: Brendan Galloway <brendan.galloway@netronome.com> Fixes: d66280b12bd7 ("net: netem: use a list in addition to rbtree") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * | net: netem: fix backlog accounting for corrupted GSO framesJakub Kicinski2019-06-191-5/+8
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When GSO frame has to be corrupted netem uses skb_gso_segment() to produce the list of frames, and re-enqueues the segments one by one. The backlog length has to be adjusted to account for new frames. The current calculation is incorrect, leading to wrong backlog lengths in the parent qdisc (both bytes and packets), and incorrect packet backlog count in netem itself. Parent backlog goes negative, netem's packet backlog counts all non-first segments twice (thus remaining non-zero even after qdisc is emptied). Move the variables used to count the adjustment into local scope to make 100% sure they aren't used at any stage in backports. Fixes: 6071bd1aa13e ("netem: Segment GSO packets on enqueue") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Acked-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net: lio_core: fix potential sign-extension overflow on large shiftColin Ian King2019-06-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Left shifting the signed int value 1 by 31 bits has undefined behaviour and the shift amount oq_no can be as much as 63. Fix this by using BIT_ULL(oq_no) instead. Addresses-Coverity: ("Bad shift operation") Fixes: f21fb3ed364b ("Add support of Cavium Liquidio ethernet adapters") Signed-off-by: Colin Ian King <colin.king@canonical.com> Reviewed-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | Merge branch 'net-fix-quite-a-few-dst_cache-crashes-reported-by-syzbot'David S. Miller2019-06-193-11/+15
| | |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Xin Long says: ==================== net: fix quite a few dst_cache crashes reported by syzbot There are two kinds of crashes reported many times by syzbot with no reproducer. Call Traces are like: BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 __mkroute_output net/ipv4/route.c:2332 [inline] ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564 ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393 __ip_route_output_key include/net/route.h:125 [inline] ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651 ip_route_output_key include/net/route.h:135 [inline] ... or: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168 <IRQ> rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline] free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2437 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline] rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697 ... They were caused by the fib_nh_common percpu member 'nhc_pcpu_rth_output' overwritten by another percpu variable 'dev->tstats' access overflow in tipc udp media xmit path when counting packets on a non tunnel device. The fix is to make udp tunnel work with no tunnel device by allowing not to count packets on the tstats when the tunnel dev is NULL in Patches 1/3 and 2/3, then pass a NULL tunnel dev in tipc_udp_tunnel() in Patch 3/3. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * | tipc: pass tunnel dev as NULL to udp_tunnel(6)_xmit_skbXin Long2019-06-191-5/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | udp_tunnel(6)_xmit_skb() called by tipc_udp_xmit() expects a tunnel device to count packets on dev->tstats, a perpcu variable. However, TIPC is using udp tunnel with no tunnel device, and pass the lower dev, like veth device that only initializes dev->lstats(a perpcu variable) when creating it. Later iptunnel_xmit_stats() called by ip(6)tunnel_xmit() thinks the dev as a tunnel device, and uses dev->tstats instead of dev->lstats. tstats' each pointer points to a bigger struct than lstats, so when tstats->tx_bytes is increased, other percpu variable's members could be overwritten. syzbot has reported quite a few crashes due to fib_nh_common percpu member 'nhc_pcpu_rth_output' overwritten, call traces are like: BUG: KASAN: slab-out-of-bounds in rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 rt_cache_valid+0x158/0x190 net/ipv4/route.c:1556 __mkroute_output net/ipv4/route.c:2332 [inline] ip_route_output_key_hash_rcu+0x819/0x2d50 net/ipv4/route.c:2564 ip_route_output_key_hash+0x1ef/0x360 net/ipv4/route.c:2393 __ip_route_output_key include/net/route.h:125 [inline] ip_route_output_flow+0x28/0xc0 net/ipv4/route.c:2651 ip_route_output_key include/net/route.h:135 [inline] ... or: kasan: GPF could be caused by NULL-ptr deref or user memory access RIP: 0010:dst_dev_put+0x24/0x290 net/core/dst.c:168 <IRQ> rt_fibinfo_free_cpus net/ipv4/fib_semantics.c:200 [inline] free_fib_info_rcu+0x2e1/0x490 net/ipv4/fib_semantics.c:217 __rcu_reclaim kernel/rcu/rcu.h:240 [inline] rcu_do_batch kernel/rcu/tree.c:2437 [inline] invoke_rcu_callbacks kernel/rcu/tree.c:2716 [inline] rcu_process_callbacks+0x100a/0x1ac0 kernel/rcu/tree.c:2697 ... The issue exists since tunnel stats update is moved to iptunnel_xmit by Commit 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()"), and here to fix it by passing a NULL tunnel dev to udp_tunnel(6)_xmit_skb so that the packets counting won't happen on dev->tstats. Reported-by: syzbot+9d4c12bfd45a58738d0a@syzkaller.appspotmail.com Reported-by: syzbot+a9e23ea2aa21044c2798@syzkaller.appspotmail.com Reported-by: syzbot+c4c4b2bb358bb936ad7e@syzkaller.appspotmail.com Reported-by: syzbot+0290d2290a607e035ba1@syzkaller.appspotmail.com Reported-by: syzbot+a43d8d4e7e8a7a9e149e@syzkaller.appspotmail.com Reported-by: syzbot+a47c5f4c6c00fc1ed16e@syzkaller.appspotmail.com Fixes: 039f50629b7f ("ip_tunnel: Move stats update to iptunnel_xmit()") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * | ip6_tunnel: allow not to count pkts on tstats by passing dev as NULLXin Long2019-06-191-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A similar fix to Patch "ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULL" is also needed by ip6_tunnel. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | | * | ip_tunnel: allow not to count pkts on tstats by setting skb's dev to NULLXin Long2019-06-191-3/+6
| | |/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | iptunnel_xmit() works as a common function, also used by a udp tunnel which doesn't have to have a tunnel device, like how TIPC works with udp media. In these cases, we should allow not to count pkts on dev's tstats, so that udp tunnel can work with no tunnel device safely. Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | tun: wake up waitqueues after IFF_UP is setFei Li2019-06-181-10/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently after setting tap0 link up, the tun code wakes tx/rx waited queues up in tun_net_open() when .ndo_open() is called, however the IFF_UP flag has not been set yet. If there's already a wait queue, it would fail to transmit when checking the IFF_UP flag in tun_sendmsg(). Then the saving vhost_poll_start() will add the wq into wqh until it is waken up again. Although this works when IFF_UP flag has been set when tun_chr_poll detects; this is not true if IFF_UP flag has not been set at that time. Sadly the latter case is a fatal error, as the wq will never be waken up in future unless later manually setting link up on purpose. Fix this by moving the wakeup process into the NETDEV_UP event notifying process, this makes sure IFF_UP has been set before all waited queues been waken up. Signed-off-by: Fei Li <lifei.shirley@bytedance.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | net: remove duplicate fetch in sock_getsockoptJingYi Hou2019-06-181-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In sock_getsockopt(), 'optlen' is fetched the first time from userspace. 'len < 0' is then checked. Then in condition 'SO_MEMINFO', 'optlen' is fetched the second time from userspace. If change it between two fetches may cause security problems or unexpected behaivor, and there is no reason to fetch it a second time. To fix this, we need to remove the second fetch. Signed-off-by: JingYi Hou <houjingyi647@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | tipc: fix issues with early FAILOVER_MSG from peerTuong Lien2019-06-182-4/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It appears that a FAILOVER_MSG can come from peer even when the failure link is resetting (i.e. just after the 'node_write_unlock()'...). This means the failover procedure on the node has not been started yet. The situation is as follows: node1 node2 linkb linka linka linkb | | | | | | x failure | | | RESETTING | | | | | | x failure RESET | | RESETTING FAILINGOVER | | | (FAILOVER_MSG) | | |<-------------------------------------------------| | *FAILINGOVER | | | | | (dummy FAILOVER_MSG) | | |------------------------------------------------->| | RESET | | FAILOVER_END | FAILINGOVER RESET | . . . . . . . . . . . . Once this happens, the link failover procedure will be triggered wrongly on the receiving node since the node isn't in FAILINGOVER state but then another link failover will be carried out. The consequences are: 1) A peer might get stuck in FAILINGOVER state because the 'sync_point' was set, reset and set incorrectly, the criteria to end the failover would not be met, it could keep waiting for a message that has already received. 2) The early FAILOVER_MSG(s) could be queued in the link failover deferdq but would be purged or not pulled out because the 'drop_point' was not set correctly. 3) The early FAILOVER_MSG(s) could be dropped too. 4) The dummy FAILOVER_MSG could make the peer leaving FAILINGOVER state shortly, but later on it would be restarted. The same situation can also happen when the link is in PEER_RESET state and a FAILOVER_MSG arrives. The commit resolves the issues by forcing the link down immediately, so the failover procedure will be started normally (which is the same as when receiving a FAILOVER_MSG and the link is in up state). Also, the function "tipc_node_link_failover()" is toughen to avoid such a situation from happening. Acked-by: Jon Maloy <jon.maloy@ericsson.se> Signed-off-by: Tuong Lien <tuong.t.lien@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | bnx2x: Check if transceiver implements DDM before accessMauro S. M. Rodrigues2019-06-182-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some transceivers may comply with SFF-8472 even though they do not implement the Digital Diagnostic Monitoring (DDM) interface described in the spec. The existence of such area is specified by the 6th bit of byte 92, set to 1 if implemented. Currently, without checking this bit, bnx2x fails trying to read sfp module's EEPROM with the follow message: ethtool -m enP5p1s0f1 Cannot get Module EEPROM data: Input/output error Because it fails to read the additional 256 bytes in which it is assumed to exist the DDM data. This issue was noticed using a Mellanox Passive DAC PN 01FT738. The EEPROM data was confirmed by Mellanox as correct and similar to other Passive DACs from other manufacturers. Signed-off-by: Mauro S. M. Rodrigues <maurosr@linux.vnet.ibm.com> Acked-by: Sudarsana Reddy Kalluru <skalluru@marvell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdmaLinus Torvalds2019-06-2117-62/+174
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull rdma fixes from Doug Ledford: "This is probably our last -rc pull request. We don't have anything else outstanding at the moment anyway, and with the summer months on us and people taking trips, I expect the next weeks leading up to the merge window to be pretty calm and sedate. This has two simple, no brainer fixes for the EFA driver. Then it has ten not quite so simple fixes for the hfi1 driver. The problem with them is that they aren't simply one liner typo fixes. They're still fixes, but they're more complex issues like livelock under heavy load where the answer was to change work queue usage and spinlock usage to resolve the problem, or issues with orphaned requests during certain types of failures like link down which required some more complex work to fix too. They all look like legitimate fixes to me, they just aren't small like I wish they were. Summary: - 2 minor EFA fixes - 10 hfi1 fixes related to scaling issues" * tag 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/rdma/rdma: RDMA/efa: Handle mmap insertions overflow RDMA/efa: Fix success return value in case of error IB/hfi1: Handle port down properly in pio IB/hfi1: Handle wakeup of orphaned QPs for pio IB/hfi1: Wakeup QPs orphaned on wait list after flush IB/hfi1: Use aborts to trigger RC throttling IB/hfi1: Create inline to get extended headers IB/hfi1: Silence txreq allocation warnings IB/hfi1: Avoid hardlockup with flushlist_lock IB/hfi1: Correct tid qp rcd to match verbs context IB/hfi1: Close PSM sdma_progress sleep window IB/hfi1: Validate fault injection opcode user input
| | * | | RDMA/efa: Handle mmap insertions overflowGal Pressman2019-06-181-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When inserting a new mmap entry to the xarray we should check for 'mmap_page' overflow as it is limited to 32 bits. Fixes: 40909f664d27 ("RDMA/efa: Add EFA verbs implementation") Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
| | * | | RDMA/efa: Fix success return value in case of errorGal Pressman2019-06-181-6/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Existing code would mistakenly return success in case of error instead of a proper return value. Fixes: e9c6c5373088 ("RDMA/efa: Add common command handlers") Reviewed-by: Firas JahJah <firasj@amazon.com> Reviewed-by: Yossi Leybovich <sleybo@amazon.com> Signed-off-by: Gal Pressman <galpress@amazon.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
| | * | | IB/hfi1: Handle port down properly in pioMike Marciniszyn2019-06-184-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The call to sc_buffer_alloc currently returns NULL (no buffer) or a buffer descriptor. There is a third case when the port is down. Currently that returns NULL and this prevents the caller from properly handling the sc_buffer_alloc() failure. A verbs code link test after the call is racy so the indication needs to come from the state check inside the allocation routine to be valid. Fix by encoding the ECOMM failure like SDMA. IS_ERR_OR_NULL() tests are added at all call sites. For verbs send, this needs to treat any error by returning a completion without any MMIO copy. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>
| | * | | IB/hfi1: Handle wakeup of orphaned QPs for pioMike Marciniszyn2019-06-181-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Once a send context is taken down due to a link failure, any QPs waiting for pio credits will stay on the waitlist indefinitely. Fix by wakeing up all QPs linked to piowait list. Fixes: 7724105686e7 ("IB/hfi1: add driver files") Reviewed-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Mike Marciniszyn <mike.marciniszyn@intel.com> Signed-off-by: Dennis Dalessandro <dennis.dalessandro@intel.com> Signed-off-by: Doug Ledford <dledford@redhat.com>