summaryrefslogtreecommitdiffstats
path: root/net/ipv6 (follow)
Commit message (Collapse)AuthorAgeFilesLines
* ip6_offload: check segs for NULL in ipv6_gso_segment.Artem Savkov2016-12-021-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | segs needs to be checked for being NULL in ipv6_gso_segment() before calling skb_shinfo(segs), otherwise kernel can run into a NULL-pointer dereference: [ 97.811262] BUG: unable to handle kernel NULL pointer dereference at 00000000000000cc [ 97.819112] IP: [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0 [ 97.825214] PGD 0 [ 97.827047] [ 97.828540] Oops: 0000 [#1] SMP [ 97.831678] Modules linked in: vhost_net vhost macvtap macvlan nfsv3 rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache xt_CHECKSUM iptable_mangle ipt_MASQUERADE nf_nat_masquerade_ipv4 iptable_nat nf_nat_ipv4 nf_nat nf_conntrack_ipv4 nf_defrag_ipv4 xt_conntrack nf_conntrack ipt_REJECT nf_reject_ipv4 tun ebtable_filter ebtables ip6table_filter ip6_tables iptable_filter bridge stp llc snd_hda_codec_realtek snd_hda_codec_hdmi snd_hda_codec_generic snd_hda_intel snd_hda_codec edac_mce_amd snd_hda_core edac_core snd_hwdep kvm_amd snd_seq kvm snd_seq_device snd_pcm irqbypass snd_timer ppdev parport_serial snd parport_pc k10temp pcspkr soundcore parport sp5100_tco shpchp sg wmi i2c_piix4 acpi_cpufreq nfsd auth_rpcgss nfs_acl lockd grace sunrpc ip_tables xfs libcrc32c sr_mod cdrom sd_mod ata_generic pata_acpi amdkfd amd_iommu_v2 radeon broadcom bcm_phy_lib i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm ahci serio_raw tg3 firewire_ohci libahci pata_atiixp drm ptp libata firewire_core pps_core i2c_core crc_itu_t fjes dm_mirror dm_region_hash dm_log dm_mod [ 97.927721] CPU: 1 PID: 3504 Comm: vhost-3495 Not tainted 4.9.0-7.el7.test.x86_64 #1 [ 97.935457] Hardware name: AMD Snook/Snook, BIOS ESK0726A 07/26/2010 [ 97.941806] task: ffff880129a1c080 task.stack: ffffc90001bcc000 [ 97.947720] RIP: 0010:[<ffffffff816e52f9>] [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0 [ 97.956251] RSP: 0018:ffff88012fc43a10 EFLAGS: 00010207 [ 97.961557] RAX: 0000000000000000 RBX: ffff8801292c8700 RCX: 0000000000000594 [ 97.968687] RDX: 0000000000000593 RSI: ffff880129a846c0 RDI: 0000000000240000 [ 97.975814] RBP: ffff88012fc43a68 R08: ffff880129a8404e R09: 0000000000000000 [ 97.982942] R10: 0000000000000000 R11: ffff880129a84076 R12: 00000020002949b3 [ 97.990070] R13: ffff88012a580000 R14: 0000000000000000 R15: ffff88012a580000 [ 97.997198] FS: 0000000000000000(0000) GS:ffff88012fc40000(0000) knlGS:0000000000000000 [ 98.005280] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 [ 98.011021] CR2: 00000000000000cc CR3: 0000000126c5d000 CR4: 00000000000006e0 [ 98.018149] Stack: [ 98.020157] 00000000ffffffff ffff88012fc43ac8 ffffffffa017ad0a 000000000000000e [ 98.027584] 0000001300000000 0000000077d59998 ffff8801292c8700 00000020002949b3 [ 98.035010] ffff88012a580000 0000000000000000 ffff88012a580000 ffff88012fc43a98 [ 98.042437] Call Trace: [ 98.044879] <IRQ> [ 98.046803] [<ffffffffa017ad0a>] ? tg3_start_xmit+0x84a/0xd60 [tg3] [ 98.053156] [<ffffffff815eeee0>] skb_mac_gso_segment+0xb0/0x130 [ 98.059158] [<ffffffff815eefd3>] __skb_gso_segment+0x73/0x110 [ 98.064985] [<ffffffff815ef40d>] validate_xmit_skb+0x12d/0x2b0 [ 98.070899] [<ffffffff815ef5d2>] validate_xmit_skb_list+0x42/0x70 [ 98.077073] [<ffffffff81618560>] sch_direct_xmit+0xd0/0x1b0 [ 98.082726] [<ffffffff815efd86>] __dev_queue_xmit+0x486/0x690 [ 98.088554] [<ffffffff8135c135>] ? cpumask_next_and+0x35/0x50 [ 98.094380] [<ffffffff815effa0>] dev_queue_xmit+0x10/0x20 [ 98.099863] [<ffffffffa09ce057>] br_dev_queue_push_xmit+0xa7/0x170 [bridge] [ 98.106907] [<ffffffffa09ce161>] br_forward_finish+0x41/0xc0 [bridge] [ 98.113430] [<ffffffff81627cf2>] ? nf_iterate+0x52/0x60 [ 98.118735] [<ffffffff81627d6b>] ? nf_hook_slow+0x6b/0xc0 [ 98.124216] [<ffffffffa09ce32c>] __br_forward+0x14c/0x1e0 [bridge] [ 98.130480] [<ffffffffa09ce120>] ? br_dev_queue_push_xmit+0x170/0x170 [bridge] [ 98.137785] [<ffffffffa09ce4bd>] br_forward+0x9d/0xb0 [bridge] [ 98.143701] [<ffffffffa09cfbb7>] br_handle_frame_finish+0x267/0x560 [bridge] [ 98.150834] [<ffffffffa09d0064>] br_handle_frame+0x174/0x2f0 [bridge] [ 98.157355] [<ffffffff8102fb89>] ? sched_clock+0x9/0x10 [ 98.162662] [<ffffffff810b63b2>] ? sched_clock_cpu+0x72/0xa0 [ 98.168403] [<ffffffff815eccf5>] __netif_receive_skb_core+0x1e5/0xa20 [ 98.174926] [<ffffffff813659f9>] ? timerqueue_add+0x59/0xb0 [ 98.180580] [<ffffffff815ed548>] __netif_receive_skb+0x18/0x60 [ 98.186494] [<ffffffff815ee625>] process_backlog+0x95/0x140 [ 98.192145] [<ffffffff815edccd>] net_rx_action+0x16d/0x380 [ 98.197713] [<ffffffff8170cff1>] __do_softirq+0xd1/0x283 [ 98.203106] [<ffffffff8170b2bc>] do_softirq_own_stack+0x1c/0x30 [ 98.209107] <EOI> [ 98.211029] [<ffffffff8108a5c0>] do_softirq+0x50/0x60 [ 98.216166] [<ffffffff815ec853>] netif_rx_ni+0x33/0x80 [ 98.221386] [<ffffffffa09eeff7>] tun_get_user+0x487/0x7f0 [tun] [ 98.227388] [<ffffffffa09ef3ab>] tun_sendmsg+0x4b/0x60 [tun] [ 98.233129] [<ffffffffa0b68932>] handle_tx+0x282/0x540 [vhost_net] [ 98.239392] [<ffffffffa0b68c25>] handle_tx_kick+0x15/0x20 [vhost_net] [ 98.245916] [<ffffffffa0abacfe>] vhost_worker+0x9e/0xf0 [vhost] [ 98.251919] [<ffffffffa0abac60>] ? vhost_umem_alloc+0x40/0x40 [vhost] [ 98.258440] [<ffffffff81003a47>] ? do_syscall_64+0x67/0x180 [ 98.264094] [<ffffffff810a44d9>] kthread+0xd9/0xf0 [ 98.268965] [<ffffffff810a4400>] ? kthread_park+0x60/0x60 [ 98.274444] [<ffffffff8170a4d5>] ret_from_fork+0x25/0x30 [ 98.279836] Code: 8b 93 d8 00 00 00 48 2b 93 d0 00 00 00 4c 89 e6 48 89 df 66 89 93 c2 00 00 00 ff 10 48 3d 00 f0 ff ff 49 89 c2 0f 87 52 01 00 00 <41> 8b 92 cc 00 00 00 48 8b 80 d0 00 00 00 44 0f b7 74 10 06 66 [ 98.299425] RIP [<ffffffff816e52f9>] ipv6_gso_segment+0x119/0x2f0 [ 98.305612] RSP <ffff88012fc43a10> [ 98.309094] CR2: 00000000000000cc [ 98.312406] ---[ end trace 726a2c7a2d2d78d0 ]--- Signed-off-by: Artem Savkov <asavkov@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Revert: "ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"Eli Cooper2016-12-021-1/+0
| | | | | | | | | | | | This reverts commit ae148b085876fa771d9ef2c05f85d4b4bf09ce0d ("ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()"). skb->protocol is now set in __ip_local_out() and __ip6_local_out() before dst_output() is called. It is no longer necessary to do it for each tunnel. Cc: stable@vger.kernel.org Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Set skb->protocol properly for local outputEli Cooper2016-12-021-0/+2
| | | | | | | | | | | | | | | | When xfrm is applied to TSO/GSO packets, it follows this path: xfrm_output() -> xfrm_output_gso() -> skb_gso_segment() where skb_gso_segment() relies on skb->protocol to function properly. This patch sets skb->protocol to ETH_P_IPV6 before dst_output() is called, fixing a bug where GSO packets sent through an ipip6 tunnel are dropped when xfrm is involved. Cc: stable@vger.kernel.org Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2016-12-011-1/+1
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2016-12-01 1) Change the error value when someone tries to run 32bit userspace on a 64bit host from -ENOTSUPP to the userspace exported -EOPNOTSUPP. Fix from Yi Zhao. 2) On inbound, ESN sequence numbers are already in network byte order. So don't try to convert it again, this fixes integrity verification for ESN. Fixes from Tobias Brunner. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * esp6: Fix integrity verification when ESN are usedTobias Brunner2016-11-301-1/+1
| | | | | | | | | | | | | | | | | | | | When handling inbound packets, the two halves of the sequence number stored on the skb are already in network order. Fixes: 000ae7b2690e ("esp6: Switch to new AEAD interface") Signed-off-by: Tobias Brunner <tobias@strongswan.org> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2016-12-013-3/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net This is a large batch of Netfilter fixes for net, they are: 1) Three patches to fix NAT conversion to rhashtable: Switch to rhlist structure that allows to have several objects with the same key. Moreover, fix wrong comparison logic in nf_nat_bysource_cmp() as this is expecting a return value similar to memcmp(). Change location of the nat_bysource field in the nf_conn structure to avoid zeroing this as it breaks interaction with SLAB_DESTROY_BY_RCU and lead us to crashes. From Florian Westphal. 2) Don't allow malformed fragments go through in IPv6, drop them, otherwise we hit GPF, patch from Florian Westphal. 3) Fix crash if attributes are missing in nft_range, from Liping Zhang. 4) Fix arptables 32-bits userspace 64-bits kernel compat, from Hongxu Jia. 5) Two patches from David Ahern to fix netfilter interaction with vrf. From David Ahern. 6) Fix element timeout calculation in nf_tables, we take milliseconds from userspace, but we use jiffies from kernelspace. Patch from Anders K. Pedersen. 7) Missing validation length netlink attribute for nft_hash, from Laura Garcia. 8) Fix nf_conntrack_helper documentation, we don't default to off anymore for a bit of time so let's get this in sync with the code. I know is late but I think these are important, specifically the NAT bits, as they are mostly addressing fallout from recent changes. I also read there are chances to have -rc8, if that is the case, that would also give us a bit more time to test this. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: ipv6: nf_defrag: drop mangled skb on ream errorFlorian Westphal2016-11-292-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dmitry Vyukov reported GPF in network stack that Andrey traced down to negative nh offset in nf_ct_frag6_queue(). Problem is that all network headers before fragment header are pulled. Normal ipv6 reassembly will drop the skb when errors occur further down the line. netfilter doesn't do this, and instead passed the original fragment along. That was also fine back when netfilter ipv6 defrag worked with cloned fragments, as the original, pristine fragment was passed on. So we either have to undo the pull op, or discard such fragments. Since they're malformed after all (e.g. overlapping fragment) it seems preferrable to just drop them. Same for temporary errors -- it doesn't make sense to accept (and perhaps forward!) only some fragments of same datagram. Fixes: 029f7f3b8701cc7ac ("netfilter: ipv6: nf_defrag: avoid/free clone operations") Reported-by: Dmitry Vyukov <dvyukov@google.com> Debugged-by: Andrey Konovalov <andreyknvl@google.com> Diagnosed-by: Eric Dumazet <Eric Dumazet <edumazet@google.com> Signed-off-by: Florian Westphal <fw@strlen.de> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: Update nf_send_reset6 to consider L3 domainDavid Ahern2016-11-241-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | nf_send_reset6 is not considering the L3 domain and lookups are sent to the wrong table. For example consider the following output rule: ip6tables -A OUTPUT -p tcp --dport 12345 -j REJECT --reject-with tcp-reset using perf to analyze lookups via the fib6_table_lookup tracepoint shows: swapper 0 [001] 248.787816: fib6:fib6_table_lookup: table 255 oif 0 iif 1 src 2100:1::3 dst 2100:1: ffffffff81439cdc perf_trace_fib6_table_lookup ([kernel.kallsyms]) ffffffff814c1ce3 trace_fib6_table_lookup ([kernel.kallsyms]) ffffffff814c3e89 ip6_pol_route ([kernel.kallsyms]) ffffffff814c40d5 ip6_pol_route_output ([kernel.kallsyms]) ffffffff814e7b6f fib6_rule_action ([kernel.kallsyms]) ffffffff81437f60 fib_rules_lookup ([kernel.kallsyms]) ffffffff814e7c79 fib6_rule_lookup ([kernel.kallsyms]) ffffffff814c2541 ip6_route_output_flags ([kernel.kallsyms]) 528 nf_send_reset6 ([nf_reject_ipv6]) The lookup is directed to table 255 rather than the table associated with the device via the L3 domain. Update nf_send_reset6 to pull the L3 domain from the dst currently attached to the skb. Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | l2tp: lock socket before checking flags in connect()Guillaume Nault2016-11-301-1/+3
| |/ |/| | | | | | | | | | | | | | | | | | | Socket flags aren't updated atomically, so the socket must be locked while reading the SOCK_ZAPPED flag. This issue exists for both l2tp_ip and l2tp_ip6. For IPv6, this patch also brings error handling for __ip6_datagram_connect() failures. Signed-off-by: Guillaume Nault <g.nault@alphalink.fr> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: handle no dst on skb in icmp6_sendDavid Ahern2016-11-281-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Andrey reported the following while fuzzing the kernel with syzkaller: kasan: CONFIG_KASAN_INLINE enabled kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Modules linked in: CPU: 0 PID: 3859 Comm: a.out Not tainted 4.9.0-rc6+ #429 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011 task: ffff8800666d4200 task.stack: ffff880067348000 RIP: 0010:[<ffffffff833617ec>] [<ffffffff833617ec>] icmp6_send+0x5fc/0x1e30 net/ipv6/icmp.c:451 RSP: 0018:ffff88006734f2c0 EFLAGS: 00010206 RAX: ffff8800666d4200 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000000 RSI: dffffc0000000000 RDI: 0000000000000018 RBP: ffff88006734f630 R08: ffff880064138418 R09: 0000000000000003 R10: dffffc0000000000 R11: 0000000000000005 R12: 0000000000000000 R13: ffffffff84e7e200 R14: ffff880064138484 R15: ffff8800641383c0 FS: 00007fb3887a07c0(0000) GS:ffff88006cc00000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 0000000020000000 CR3: 000000006b040000 CR4: 00000000000006f0 Stack: ffff8800666d4200 ffff8800666d49f8 ffff8800666d4200 ffffffff84c02460 ffff8800666d4a1a 1ffff1000ccdaa2f ffff88006734f498 0000000000000046 ffff88006734f440 ffffffff832f4269 ffff880064ba7456 0000000000000000 Call Trace: [<ffffffff83364ddc>] icmpv6_param_prob+0x2c/0x40 net/ipv6/icmp.c:557 [< inline >] ip6_tlvopt_unknown net/ipv6/exthdrs.c:88 [<ffffffff83394405>] ip6_parse_tlv+0x555/0x670 net/ipv6/exthdrs.c:157 [<ffffffff8339a759>] ipv6_parse_hopopts+0x199/0x460 net/ipv6/exthdrs.c:663 [<ffffffff832ee773>] ipv6_rcv+0xfa3/0x1dc0 net/ipv6/ip6_input.c:191 ... icmp6_send / icmpv6_send is invoked for both rx and tx paths. In both cases the dst->dev should be preferred for determining the L3 domain if the dst has been set on the skb. Fallback to the skb->dev if it has not. This covers the case reported here where icmp6_send is invoked on Rx before the route lookup. Fixes: 5d41ce29e ("net: icmp6_send should use dst dev to determine L3 domain") Reported-by: Andrey Konovalov <andreyknvl@google.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of ↵David S. Miller2016-11-281-0/+31
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/klassert/ipsec Steffen Klassert says: ==================== pull request (net): ipsec 2016-11-25 1) Fix a refcount leak in vti6. From Nicolas Dichtel. 2) Fix a wrong if statement in xfrm_sk_policy_lookup. From Florian Westphal. 3) The flowcache watermarks are per cpu. Take this into account when comparing to the threshold where we refusing new allocations. From Miroslav Urbanek. Please pull or let me know if there are problems. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | vti6: flush x-netns xfrm cache when vti interface is removedNicolas Dichtel2016-10-111-0/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is the same fix than commit a5d0dc810abf ("vti: flush x-netns xfrm cache when vti interface is removed") This patch fixes a refcnt problem when a x-netns vti6 interface is removed: unregister_netdevice: waiting for vti6_test to become free. Usage count = 1 Here is a script to reproduce the problem: ip link set dev ntfp2 up ip addr add dev ntfp2 2001::1/64 ip link add vti6_test type vti6 local 2001::1 remote 2001::2 key 1 ip netns add secure ip link set vti6_test netns secure ip netns exec secure ip link set vti6_test up ip netns exec secure ip link s lo up ip netns exec secure ip addr add dev vti6_test 2003::1/64 ip -6 xfrm policy add dir out tmpl src 2001::1 dst 2001::2 proto esp \ mode tunnel mark 1 ip -6 xfrm policy add dir in tmpl src 2001::2 dst 2001::1 proto esp \ mode tunnel mark 1 ip xfrm state add src 2001::1 dst 2001::2 proto esp spi 1 mode tunnel \ enc des3_ede 0x112233445566778811223344556677881122334455667788 mark 1 ip xfrm state add src 2001::2 dst 2001::1 proto esp spi 1 mode tunnel \ enc des3_ede 0x112233445566778811223344556677881122334455667788 mark 1 ip netns exec secure ping6 -c 4 2003::2 ip netns del secure CC: Lance Richardson <lrichard@redhat.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Lance Richardson <lrichard@redhat.com> Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
* | | udplite: call proper backlog handlersEric Dumazet2016-11-243-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commits 93821778def10 ("udp: Fix rcv socket locking") and f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb") UDP backlog handlers were renamed, but UDPlite was forgotten. This leads to crashes if UDPlite header is pulled twice, which happens starting from commit e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Bug found by syzkaller team, thanks a lot guys ! Note that backlog use in UDP/UDPlite is scheduled to be removed starting from linux-4.10, so this patch is only needed up to linux-4.9 Fixes: 93821778def1 ("udp: Fix rcv socket locking") Fixes: f7ad74fef3af ("net/ipv6/udp: UDP encapsulation: break backlog_rcv into __udpv6_queue_rcv_skb") Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Andrey Konovalov <andreyknvl@google.com> Cc: Benjamin LaHaise <bcrl@kvack.org> Cc: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: bump genid when the IFA_F_TENTATIVE flag is clearPaolo Abeni2016-11-241-6/+12
| |/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When an ipv6 address has the tentative flag set, it can't be used as source for egress traffic, while the associated route, if any, can be looked up and even stored into some dst_cache. In the latter scenario, the source ipv6 address selected and stored in the cache is most probably wrong (e.g. with link-local scope) and the entity using the dst_cache will experience lack of ipv6 connectivity until said cache is cleared or invalidated. Overall this may cause lack of connectivity over most IPv6 tunnels (comprising geneve and vxlan), if the first egress packet reaches the tunnel before the DaD is completed for the used ipv6 address. This patch bumps a new genid after that the IFA_F_TENTATIVE flag is cleared, so that dst_cache will be invalidated on next lookup and ipv6 connectivity restored. Fixes: 0c1d70af924b ("net: use dst_cache for vxlan device") Fixes: 468dfffcd762 ("geneve: add dst caching support") Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip6_tunnel: disable caching when the traffic class is inheritedPaolo Abeni2016-11-171-2/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | If an ip6 tunnel is configured to inherit the traffic class from the inner header, the dst_cache must be disabled or it will foul the policy routing. The issue is apprently there since at leat Linux-2.6.12-rc2. Reported-by: Liam McBirnie <liam.mcbirnie@boeing.com> Cc: Liam McBirnie <liam.mcbirnie@boeing.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | udp: restore UDPlite many-cast deliveryPablo Neira2016-11-161-3/+3
| | | | | | | | | | | | | | | | | | | | Honor udptable parameter that is passed to __udp*_lib_mcast_deliver(), otherwise udplite broadcast/multicast use the wrong table and it breaks. Fixes: 2dc41cff7545 ("udp: Use hash2 for long hash1 chains in __udp*_lib_mcast_deliver.") Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: take care of truncations done by sk_filter()Eric Dumazet2016-11-131-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With syzkaller help, Marco Grassi found a bug in TCP stack, crashing in tcp_collapse() Root cause is that sk_filter() can truncate the incoming skb, but TCP stack was not really expecting this to happen. It probably was expecting a simple DROP or ACCEPT behavior. We first need to make sure no part of TCP header could be removed. Then we need to adjust TCP_SKB_CB(skb)->end_seq Many thanks to syzkaller team and Marco for giving us a reproducer. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Marco Grassi <marco.gra@gmail.com> Reported-by: Vladis Dronov <vdronov@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: tcp response should set oif only if it is L3 masterDavid Ahern2016-11-101-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Lorenzo noted an Android unit test failed due to e0d56fdd7342: "The expectation in the test was that the RST replying to a SYN sent to a closed port should be generated with oif=0. In other words it should not prefer the interface where the SYN came in on, but instead should follow whatever the routing table says it should do." Revert the change to ip_send_unicast_reply and tcp_v6_send_response such that the oif in the flow is set to the skb_iif only if skb_iif is an L3 master. Fixes: e0d56fdd7342 ("net: l3mdev: remove redundant calls") Reported-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Tested-by: Lorenzo Colitti <lorenzo@google.com> Acked-by: Lorenzo Colitti <lorenzo@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nfDavid S. Miller2016-11-101-2/+4
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter fixes for net The following patchset contains a larger than usual batch of Netfilter fixes for your net tree. This series contains a mixture of old bugs and recently introduced bugs, they are: 1) Fix a crash when using nft_dynset with nft_set_rbtree, which doesn't support the set element updates from the packet path. From Liping Zhang. 2) Fix leak when nft_expr_clone() fails, from Liping Zhang. 3) Fix a race when inserting new elements to the set hash from the packet path, also from Liping. 4) Handle segmented TCP SIP packets properly, basically avoid that the INVITE in the allow header create bogus expectations by performing stricter SIP message parsing, from Ulrich Weber. 5) nft_parse_u32_check() should return signed integer for errors, from John Linville. 6) Fix wrong allocation instead of connlabels, allocate 16 instead of 32 bytes, from Florian Westphal. 7) Fix compilation breakage when building the ip_vs_sync code with CONFIG_OPTIMIZE_INLINING on x86, from Arnd Bergmann. 8) Destroy the new set if the transaction object cannot be allocated, also from Liping Zhang. 9) Use device to route duplicated packets via nft_dup only when set by the user, otherwise packets may not follow the right route, again from Liping. 10) Fix wrong maximum genetlink attribute definition in IPVS, from WANG Cong. 11) Ignore untracked conntrack objects from xt_connmark, from Florian Westphal. 12) Allow to use conntrack helpers that are registered NFPROTO_UNSPEC via CT target, otherwise we cannot use the h.245 helper, from Florian. 13) Revisit garbage collection heuristic in the new workqueue-based timer approach for conntrack to evict objects earlier, again from Florian. 14) Fix crash in nf_tables when inserting an element into a verdict map, from Liping Zhang. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: nft_dup: do not use sreg_dev if the user doesn't specify itLiping Zhang2016-10-311-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | The NFTA_DUP_SREG_DEV attribute is not a must option, so we should use it in routing lookup only when the user specify it. Fixes: d877f07112f1 ("netfilter: nf_tables: add nft_dup expression") Signed-off-by: Liping Zhang <zlpnobody@gmail.com> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* | | net-ipv6: on device mtu change do not add mtu to mtu-less routesMaciej Żenczykowski2016-11-091-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Routes can specify an mtu explicitly or inherit the mtu from the underlying device - this inheritance is implemented in dst->ops->mtu handlers ip6_mtu() and ip6_blackhole_mtu(). Currently changing the mtu of a device adds mtu explicitly to routes using that device. ie. # ip link set dev lo mtu 65536 # ip -6 route add local 2000::1 dev lo # ip -6 route get 2000::1 local 2000::1 dev lo table local src ... metric 1024 pref medium # ip link set dev lo mtu 65535 # ip -6 route get 2000::1 local 2000::1 dev lo table local src ... metric 1024 mtu 65535 pref medium # ip link set dev lo mtu 65536 # ip -6 route get 2000::1 local 2000::1 dev lo table local src ... metric 1024 mtu 65536 pref medium # ip -6 route del local 2000::1 After this patch the route entry no longer changes unless it already has an mtu. There is no need: this inheritance is already done in ip6_mtu() # ip link set dev lo mtu 65536 # ip -6 route add local 2000::1 dev lo # ip -6 route add local 2000::2 dev lo mtu 2000 # ip -6 route get 2000::1; ip -6 route get 2000::2 local 2000::1 dev lo table local src ... metric 1024 pref medium local 2000::2 dev lo table local src ... metric 1024 mtu 2000 pref medium # ip link set dev lo mtu 65535 # ip -6 route get 2000::1; ip -6 route get 2000::2 local 2000::1 dev lo table local src ... metric 1024 pref medium local 2000::2 dev lo table local src ... metric 1024 mtu 2000 pref medium # ip link set dev lo mtu 1501 # ip -6 route get 2000::1; ip -6 route get 2000::2 local 2000::1 dev lo table local src ... metric 1024 pref medium local 2000::2 dev lo table local src ... metric 1024 mtu 1501 pref medium # ip link set dev lo mtu 65536 # ip -6 route get 2000::1; ip -6 route get 2000::2 local 2000::1 dev lo table local src ... metric 1024 pref medium local 2000::2 dev lo table local src ... metric 1024 mtu 65536 pref medium # ip -6 route del local 2000::1 # ip -6 route del local 2000::2 This is desirable because changing device mtu and then resetting it to the previous value shouldn't change the user visible routing table. Signed-off-by: Maciej Żenczykowski <maze@google.com> CC: Eric Dumazet <edumazet@google.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: icmp6_send should use dst dev to determine L3 domainDavid Ahern2016-11-081-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | icmp6_send is called in response to some event. The skb may not have the device set (skb->dev is NULL), but it is expected to have a dst set. Update icmp6_send to use the dst on the skb to determine L3 domain. Fixes: ca254490c8dfd ("net: Add VRF support to IPv6 stack") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ip6_udp_tunnel: remove unused IPCB related codesEli Cooper2016-11-021-3/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Some IPCB fields are currently set in udp_tunnel6_xmit_skb(), which are never used before it reaches ip6tunnel_xmit(), and past that point the control buffer is no longer interpreted as IPCB. This clears these unused IPCB related codes. Currently there is no skb scrubbing in ip6_udp_tunnel, otherwise IPCB(skb)->opt might need to be cleared for IPv4 packets, as shown in 5146d1f1511 ("tunnel: Clear IPCB(skb)->opt before dst_link_failure called"). Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: add mtu lock check in __ip6_rt_update_pmtuXin Long2016-10-311-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prior to this patch, ipv6 didn't do mtu lock check in ip6_update_pmtu. It leaded to that mtu lock doesn't really work when receiving the pkt of ICMPV6_PKT_TOOBIG. This patch is to add mtu lock check in __ip6_rt_update_pmtu just as ipv4 did in __ip_rt_update_pmtu. Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ipv6: Don't use ufo handling on later transformed packetsJakub Sitnicki2016-10-311-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to commit c146066ab802 ("ipv4: Don't use ufo handling on later transformed packets"), don't perform UFO on packets that will be IPsec transformed. To detect it we rely on the fact that headerlen in dst_entry is non-zero only for transformation bundles (xfrm_dst objects). Unwanted segmentation can be observed with a NETIF_F_UFO capable device, such as a dummy device: DEV=dum0 LEN=1493 ip li add $DEV type dummy ip addr add fc00::1/64 dev $DEV nodad ip link set $DEV up ip xfrm policy add dir out src fc00::1 dst fc00::2 \ tmpl src fc00::1 dst fc00::2 proto esp spi 1 ip xfrm state add src fc00::1 dst fc00::2 \ proto esp spi 1 enc 'aes' 0x0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b0b tcpdump -n -nn -i $DEV -t & socat /dev/zero,readbytes=$LEN udp6:[fc00::2]:$LEN tcpdump output before: IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448 IP6 fc00::1 > fc00::2: frag (1448|48) IP6 fc00::1 > fc00::2: ESP(spi=0x00000001,seq=0x2), length 88 ... and after: IP6 fc00::1 > fc00::2: frag (0|1448) ESP(spi=0x00000001,seq=0x1), length 1448 IP6 fc00::1 > fc00::2: frag (1448|80) Fixes: e89e9cf539a2 ("[IPv4/IPv6]: UFO Scatter-gather approach") Signed-off-by: Jakub Sitnicki <jkbs@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | ip6_tunnel: Update skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit()Eli Cooper2016-10-291-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch updates skb->protocol to ETH_P_IPV6 in ip6_tnl_xmit() when an IPv6 header is installed to a socket buffer. This is not a cosmetic change. Without updating this value, GSO packets transmitted through an ipip6 tunnel have the protocol of ETH_P_IP and skb_mac_gso_segment() will attempt to call gso_segment() for IPv4, which results in the packets being dropped. Fixes: b8921ca83eed ("ip4ip6: Support for GSO/GRO") Signed-off-by: Eli Cooper <elicooper@gmx.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | inet: Fix missing return value in inet6_hashCraig Gallek2016-10-291-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | As part of a series to implement faster SO_REUSEPORT lookups, commit 086c653f5862 ("sock: struct proto hash function may error") added return values to protocol hash functions and commit 496611d7b5ea ("inet: create IPv6-equivalent inet_hash function") implemented a new hash function for IPv6. However, the latter does not respect the former's convention. This properly propagates the hash errors in the IPv6 case. Fixes: 496611d7b5ea ("inet: create IPv6-equivalent inet_hash function") Reported-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: Craig Gallek <kraig@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: ipv6: Do not consider link state for nexthop validationDavid Ahern2016-10-271-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to IPv4, do not consider link state when validating next hops. Currently, if the link is down default routes can fail to insert: $ ip -6 ro add vrf blue default via 2100:2::64 dev eth2 RTNETLINK answers: No route to host With this patch the command succeeds. Fixes: 8c14586fc320 ("net: ipv6: Use passed in table for nexthop lookups") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | net: ipv6: Fix processing of RAs in presence of VRFDavid Ahern2016-10-271-20/+48
|/ / | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rt6_add_route_info and rt6_add_dflt_router were updated to pull the FIB table from the device index, but the corresponding rt6_get_route_info and rt6_get_dflt_router functions were not leading to the failure to process RA's: ICMPv6: RA: ndisc_router_discovery failed to add default route Fix the 'get' functions by using the table id associated with the device when applicable. Also, now that default routes can be added to tables other than the default table, rt6_purge_dflt_routers needs to be updated as well to look at all tables. To handle that efficiently, add a flag to the table denoting if it is has a default route via RA. Fixes: ca254490c8dfd ("net: Add VRF support to IPv6 stack") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | udp: fix IP_CHECKSUM handlingEric Dumazet2016-10-261-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | First bug was added in commit ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") : Tom missed that ipv4 udp messages could be received on AF_INET6 socket. ip_cmsg_recv(msg, skb) should have been replaced by ip_cmsg_recv_offset(msg, skb, sizeof(struct udphdr)); Then commit e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") forgot to adjust the offsets now UDP headers are pulled before skb are put in receive queue. Fixes: ad6f939ab193 ("ip: Add offset parameter to ip_cmsg_recv") Fixes: e6afc8ace6dd ("udp: remove headers from UDP packets before queueing") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Sam Kumar <samanthakumar@google.com> Cc: Willem de Bruijn <willemb@google.com> Tested-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: do not increment mac header when it's unsetJason A. Donenfeld2016-10-231-1/+2
| | | | | | | | | | | | | | | | Otherwise we'll overflow the integer. This occurs when layer 3 tunneled packets are handed off to the IPv6 layer. Signed-off-by: Jason A. Donenfeld <Jason@zx2c4.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: fix a potential deadlock in do_ipv6_setsockopt()WANG Cong2016-10-212-6/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Baozeng reported this deadlock case: CPU0 CPU1 ---- ---- lock([ 165.136033] sk_lock-AF_INET6); lock([ 165.136033] rtnl_mutex); lock([ 165.136033] sk_lock-AF_INET6); lock([ 165.136033] rtnl_mutex); Similar to commit 87e9f0315952 ("ipv4: fix a potential deadlock in mcast getsockopt() path") this is due to we still have a case, ipv6_sock_mc_close(), where we acquire sk_lock before rtnl_lock. Close this deadlock with the similar solution, that is always acquire rtnl lock first. Fixes: baf606d9c9b1 ("ipv4,ipv6: grab rtnl before locking the socket") Reported-by: Baozeng Ding <sploving1@gmail.com> Tested-by: Baozeng Ding <sploving1@gmail.com> Cc: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | udp: must lock the socket in udp_disconnect()Eric Dumazet2016-10-202-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Baozeng Ding reported KASAN traces showing uses after free in udp_lib_get_port() and other related UDP functions. A CONFIG_DEBUG_PAGEALLOC=y kernel would eventually crash. I could write a reproducer with two threads doing : static int sock_fd; static void *thr1(void *arg) { for (;;) { connect(sock_fd, (const struct sockaddr *)arg, sizeof(struct sockaddr_in)); } } static void *thr2(void *arg) { struct sockaddr_in unspec; for (;;) { memset(&unspec, 0, sizeof(unspec)); connect(sock_fd, (const struct sockaddr *)&unspec, sizeof(unspec)); } } Problem is that udp_disconnect() could run without holding socket lock, and this was causing list corruptions. Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: add recursion limit to GROSabrina Dubroca2016-10-201-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, GRO can do unlimited recursion through the gro_receive handlers. This was fixed for tunneling protocols by limiting tunnel GRO to one level with encap_mark, but both VLAN and TEB still have this problem. Thus, the kernel is vulnerable to a stack overflow, if we receive a packet composed entirely of VLAN headers. This patch adds a recursion counter to the GRO layer to prevent stack overflow. When a gro_receive function hits the recursion limit, GRO is aborted for this skb and it is processed normally. This recursion counter is put in the GRO CB, but could be turned into a percpu counter if we run out of space in the CB. Thanks to Vladimír Beneš <vbenes@redhat.com> for the initial bug report. Fixes: CVE-2016-7039 Fixes: 9b174d88c257 ("net: Add Transparent Ethernet Bridging GRO support.") Fixes: 66e5133f19e9 ("vlan: Add GRO support for non hardware accelerated vlan") Signed-off-by: Sabrina Dubroca <sd@queasysnail.net> Reviewed-by: Jiri Benc <jbenc@redhat.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: properly prevent temp_prefered_lft sysctl raceJiri Bohac2016-10-201-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The check for an underflow of tmp_prefered_lft is always false because tmp_prefered_lft is unsigned. The intention of the check was to guard against racing with an update of the temp_prefered_lft sysctl, potentially resulting in an underflow. As suggested by David Miller, the best way to prevent the race is by reading the sysctl variable using READ_ONCE. Signed-off-by: Jiri Bohac <jbohac@suse.cz> Reported-by: Julia Lawall <julia.lawall@lip6.fr> Fixes: 76506a986dc3 ("IPv6: fix DESYNC_FACTOR") Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: Require exact match for TCP socket lookups if dif is l3mdevDavid Ahern2016-10-171-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, socket lookups for l3mdev (vrf) use cases can match a socket that is bound to a port but not a device (ie., a global socket). If the sysctl tcp_l3mdev_accept is not set this leads to ack packets going out based on the main table even though the packet came in from an L3 domain. The end result is that the connection does not establish creating confusion for users since the service is running and a socket shows in ss output. Fix by requiring an exact dif to sk_bound_dev_if match if the skb came through an interface enslaved to an l3mdev device and the tcp_l3mdev_accept is not set. skb's through an l3mdev interface are marked by setting a flag in inet{6}_skb_parm. The IPv6 variant is already set; this patch adds the flag for IPv4. Using an skb flag avoids a device lookup on the dif. The flag is set in the VRF driver using the IP{6}CB macros. For IPv4, the inet_skb_parm struct is moved in the cb per commit 971f10eca186, so the match function in the TCP stack needs to use TCP_SKB_CB. For IPv6, the move is done after the socket lookup, so IP6CB is used. The flags field in inet_skb_parm struct needs to be increased to add another flag. There is currently a 1-byte hole following the flags, so it can be expanded to u16 without increasing the size of the struct. Fixes: 193125dbd8eb ("net: Introduce VRF device driver") Signed-off-by: David Ahern <dsa@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | IPv6: fix DESYNC_FACTORJiri Bohac2016-10-141-8/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The IPv6 temporary address generation uses a variable called DESYNC_FACTOR to prevent hosts updating the addresses at the same time. Quoting RFC 4941: ... The value DESYNC_FACTOR is a random value (different for each client) that ensures that clients don't synchronize with each other and generate new addresses at exactly the same time ... DESYNC_FACTOR is defined as: DESYNC_FACTOR -- A random value within the range 0 - MAX_DESYNC_FACTOR. It is computed once at system start (rather than each time it is used) and must never be greater than (TEMP_VALID_LIFETIME - REGEN_ADVANCE). First, I believe the RFC has a typo in it and meant to say: "and must never be greater than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE)" The reason is that at various places in the RFC, DESYNC_FACTOR is used in a calculation like (TEMP_PREFERRED_LIFETIME - DESYNC_FACTOR) or (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE - DESYNC_FACTOR). It needs to be smaller than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE) for the result of these calculations to be larger than zero. It's never used in a calculation together with TEMP_VALID_LIFETIME. I already submitted an errata to the rfc-editor: https://www.rfc-editor.org/errata_search.php?rfc=4941 The Linux implementation of DESYNC_FACTOR is very wrong: max_desync_factor is used in places DESYNC_FACTOR should be used. max_desync_factor is initialized to the RFC-recommended value for MAX_DESYNC_FACTOR (600) but the whole point is to get a _random_ value. And nothing ensures that the value used is not greater than (TEMP_PREFERRED_LIFETIME - REGEN_ADVANCE), which leads to underflows. The effect can easily be observed when setting the temp_prefered_lft sysctl e.g. to 60. The preferred lifetime of the temporary addresses will be bogus. TEMP_PREFERRED_LIFETIME and REGEN_ADVANCE are not constants and can be influenced by these three sysctls: regen_max_retry, dad_transmits and temp_prefered_lft. Thus, the upper bound for desync_factor needs to be re-calculated each time a new address is generated and if desync_factor is larger than the new upper bound, a new random value needs to be re-generated. And since we already have max_desync_factor configurable per interface, we also need to calculate and store desync_factor per interface. Signed-off-by: Jiri Bohac <jbohac@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
* | IPv6: Drop the temporary address regen_timerJiri Bohac2016-10-141-52/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The randomized interface identifier (rndid) was periodically updated from the regen_timer timer. Simplify the code by updating the rndid only when needed by ipv6_try_regen_rndid(). This makes the follow-up DESYNC_FACTOR fix much simpler. Also it fixes a reference counting error in this error path, where an in6_dev_put was missing: err = addrconf_sysctl_register(ndev); if (err) { ipv6_mc_destroy_dev(ndev); - del_timer(&ndev->regen_timer); snmp6_unregister_dev(ndev); goto err_release; Signed-off-by: Jiri Bohac <jbohac@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: correctly add local routes when lo goes upNicolas Dichtel2016-10-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The goal of the patch is to fix this scenario: ip link add dummy1 type dummy ip link set dummy1 up ip link set lo down ; ip link set lo up After that sequence, the local route to the link layer address of dummy1 is not there anymore. When the loopback is set down, all local routes are deleted by addrconf_ifdown()/rt6_ifdown(). At this time, the rt6_info entry still exists, because the corresponding idev has a reference on it. After the rcu grace period, dst_rcu_free() is called, and thus ___dst_free(), which will set obsolete to DST_OBSOLETE_DEAD. In this case, init_loopback() is called before dst_rcu_free(), thus obsolete is still sets to something <= 0. So, the function doesn't add the route again. To avoid that race, let's check the rt6 refcnt instead. Fixes: 25fb6ca4ed9c ("net IPv6 : Fix broken IPv6 routing table after loopback down-up") Fixes: a881ae1f625c ("ipv6: don't call addrconf_dst_alloc again when enable lo") Fixes: 33d99113b110 ("ipv6: reallocate addrconf router for ipv6 address when lo device up") Reported-by: Francesco Santoro <francesco.santoro@6wind.com> Reported-by: Samuel Gauthier <samuel.gauthier@6wind.com> CC: Balakumaran Kannan <Balakumaran.Kannan@ap.sony.com> CC: Maruthi Thotad <Maruthi.Thotad@ap.sony.com> CC: Sabrina Dubroca <sd@queasysnail.net> CC: Hannes Frederic Sowa <hannes@stressinduktion.org> CC: Weilong Chen <chenweilong@huawei.com> CC: Gao feng <gaofeng@cn.fujitsu.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ip6_tunnel: fix ip6_tnl_lookupVadim Fedorenko2016-10-141-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit ea3dc9601bda ("ip6_tunnel: Add support for wildcard tunnel endpoints.") introduces support for wildcards in tunnels endpoints, but in some rare circumstances ip6_tnl_lookup selects wrong tunnel interface relying only on source or destination address of the packet and not checking presence of wildcard in tunnels endpoints. Later in ip6_tnl_rcv this packets can be dicarded because of difference in ipproto even if fallback device have proper ipproto configuration. This patch adds checks of wildcard endpoint in tunnel avoiding such behavior Fixes: ea3dc9601bda ("ip6_tunnel: Add support for wildcard tunnel endpoints.") Signed-off-by: Vadim Fedorenko <junk@yandex-team.ru> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: tcp: restore IP6CB for pktoptions skbsEric Dumazet2016-10-131-9/+11
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Baozeng Ding reported following KASAN splat : BUG: KASAN: use-after-free in ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 at addr ffff880029c84ec8 Read of size 1 by task poc/25548 Call Trace: [<ffffffff82cf43c9>] dump_stack+0x12e/0x185 /lib/dump_stack.c:15 [< inline >] print_address_description /mm/kasan/report.c:204 [<ffffffff817ced3b>] kasan_report_error+0x48b/0x4b0 /mm/kasan/report.c:283 [< inline >] kasan_report /mm/kasan/report.c:303 [<ffffffff817ced9e>] __asan_report_load1_noabort+0x3e/0x40 /mm/kasan/report.c:321 [<ffffffff85c71da1>] ip6_datagram_recv_specific_ctl+0x13f1/0x15c0 /net/ipv6/datagram.c:687 [<ffffffff85c734c3>] ip6_datagram_recv_ctl+0x33/0x40 [<ffffffff85c0b07c>] do_ipv6_getsockopt.isra.4+0xaec/0x2150 [<ffffffff85c0c7f6>] ipv6_getsockopt+0x116/0x230 [<ffffffff859b5a12>] tcp_getsockopt+0x82/0xd0 /net/ipv4/tcp.c:3035 [<ffffffff855fb385>] sock_common_getsockopt+0x95/0xd0 /net/core/sock.c:2647 [< inline >] SYSC_getsockopt /net/socket.c:1776 [<ffffffff855f8ba2>] SyS_getsockopt+0x142/0x230 /net/socket.c:1758 [<ffffffff8685cdc5>] entry_SYSCALL_64_fastpath+0x23/0xc6 Memory state around the buggy address: ffff880029c84d80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff880029c84e00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff > ffff880029c84e80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ^ ffff880029c84f00: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ffff880029c84f80: ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff ff He also provided a syzkaller reproducer. Issue is that ip6_datagram_recv_specific_ctl() expects to find IP6CB data that was moved at a different place in tcp_v6_rcv() This patch moves tcp_v6_restore_cb() up and calls it from tcp_v6_do_rcv() when np->pktoptions is set. Fixes: 971f10eca186 ("tcp: better TCP_SKB_CB layout to reduce cache line misses") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Baozeng Ding <sploving1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6 addrconf: disallow rtr_solicits < -1Maciej Żenczykowski2016-10-081-1/+3
| | | | | | | | | | | | This disallows setting /proc/sys/net/ipv6/conf/*/router_solicitations to values below -1. -1 continues to mean an unlimited number of retransmits. Note: this depends on 'ipv6 addrconf: remove addrconf_sysctl_hop_limit()' Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6 addrconf: remove addrconf_sysctl_hop_limit()Maciej Żenczykowski2016-10-031-17/+14
| | | | | | | | | | | | | | | | | | | | | This is an effective no-op in terms of user observable behaviour. By preventing the overwrite of non-null extra1/extra2 fields in addrconf_sysctl() we can enable the use of proc_dointvec_minmax(). This allows us to eliminate the constant min/max (1..255) trampoline function that is addrconf_sysctl_hop_limit(). This is nice because it simplifies the code, and allows future sysctls with constant min/max limits to also not require trampolines. We still can't eliminate the trampoline for mtu because it isn't actually a constant (it depends on other tunables of the device) and thus requires at-write-time logic to enforce range. Signed-off-by: Maciej Żenczykowski <maze@google.com> Acked-by: Erik Kline <ek@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2016-10-033-4/+6
|\ | | | | | | | | | | Three sets of overlapping changes. Nothing serious. Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_routeNikolay Aleksandrov2016-09-262-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the commit below the ipmr/ip6mr rtnl_unicast() code uses the portid instead of the previous dst_pid which was copied from in_skb's portid. Since the skb is new the portid is 0 at that point so the packets are sent to the kernel and we get scheduling while atomic or a deadlock (depending on where it happens) by trying to acquire rtnl two times. Also since this is RTM_GETROUTE, it can be triggered by a normal user. Here's the sleeping while atomic trace: [ 7858.212557] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620 [ 7858.212748] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0 [ 7858.212881] 2 locks held by swapper/0/0: [ 7858.213013] #0: (((&mrt->ipmr_expire_timer))){+.-...}, at: [<ffffffff810fbbf5>] call_timer_fn+0x5/0x350 [ 7858.213422] #1: (mfc_unres_lock){+.....}, at: [<ffffffff8161e005>] ipmr_expire_process+0x25/0x130 [ 7858.213807] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc7+ #179 [ 7858.213934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014 [ 7858.214108] 0000000000000000 ffff88005b403c50 ffffffff813a7804 0000000000000000 [ 7858.214412] ffffffff81a1338e ffff88005b403c78 ffffffff810a4a72 ffffffff81a1338e [ 7858.214716] 000000000000026c 0000000000000000 ffff88005b403ca8 ffffffff810a4b9f [ 7858.215251] Call Trace: [ 7858.215412] <IRQ> [<ffffffff813a7804>] dump_stack+0x85/0xc1 [ 7858.215662] [<ffffffff810a4a72>] ___might_sleep+0x192/0x250 [ 7858.215868] [<ffffffff810a4b9f>] __might_sleep+0x6f/0x100 [ 7858.216072] [<ffffffff8165bea3>] mutex_lock_nested+0x33/0x4d0 [ 7858.216279] [<ffffffff815a7a5f>] ? netlink_lookup+0x25f/0x460 [ 7858.216487] [<ffffffff8157474b>] rtnetlink_rcv+0x1b/0x40 [ 7858.216687] [<ffffffff815a9a0c>] netlink_unicast+0x19c/0x260 [ 7858.216900] [<ffffffff81573c70>] rtnl_unicast+0x20/0x30 [ 7858.217128] [<ffffffff8161cd39>] ipmr_destroy_unres+0xa9/0xf0 [ 7858.217351] [<ffffffff8161e06f>] ipmr_expire_process+0x8f/0x130 [ 7858.217581] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180 [ 7858.217785] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180 [ 7858.217990] [<ffffffff810fbc95>] call_timer_fn+0xa5/0x350 [ 7858.218192] [<ffffffff810fbbf5>] ? call_timer_fn+0x5/0x350 [ 7858.218415] [<ffffffff8161dfe0>] ? ipmr_net_init+0x180/0x180 [ 7858.218656] [<ffffffff810fde10>] run_timer_softirq+0x260/0x640 [ 7858.218865] [<ffffffff8166379b>] ? __do_softirq+0xbb/0x54f [ 7858.219068] [<ffffffff816637c8>] __do_softirq+0xe8/0x54f [ 7858.219269] [<ffffffff8107a948>] irq_exit+0xb8/0xc0 [ 7858.219463] [<ffffffff81663452>] smp_apic_timer_interrupt+0x42/0x50 [ 7858.219678] [<ffffffff816625bc>] apic_timer_interrupt+0x8c/0xa0 [ 7858.219897] <EOI> [<ffffffff81055f16>] ? native_safe_halt+0x6/0x10 [ 7858.220165] [<ffffffff810d64dd>] ? trace_hardirqs_on+0xd/0x10 [ 7858.220373] [<ffffffff810298e3>] default_idle+0x23/0x190 [ 7858.220574] [<ffffffff8102a20f>] arch_cpu_idle+0xf/0x20 [ 7858.220790] [<ffffffff810c9f8c>] default_idle_call+0x4c/0x60 [ 7858.221016] [<ffffffff810ca33b>] cpu_startup_entry+0x39b/0x4d0 [ 7858.221257] [<ffffffff8164f995>] rest_init+0x135/0x140 [ 7858.221469] [<ffffffff81f83014>] start_kernel+0x50e/0x51b [ 7858.221670] [<ffffffff81f82120>] ? early_idt_handler_array+0x120/0x120 [ 7858.221894] [<ffffffff81f8243f>] x86_64_start_reservations+0x2a/0x2c [ 7858.222113] [<ffffffff81f8257c>] x86_64_start_kernel+0x13b/0x14a Fixes: 2942e9005056 ("[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ip6_gre: fix flowi6_proto value in ip6gre_xmit_other()Lance Richardson2016-09-241-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to commit 3be07244b733 ("ip6_gre: fix flowi6_proto value in xmit path"), set flowi6_proto to IPPROTO_GRE for output route lookup. Up until now, ip6gre_xmit_other() has set flowi6_proto to a bogus value. This affected output route lookup for packets sent on an ip6gretap device in cases where routing was dependent on the value of flowi6_proto. Since the correct proto is already set in the tunnel flowi6 template via commit 252f3f5a1189 ("ip6_gre: Set flowi6_proto as IPPROTO_GRE in xmit path."), simply delete the line setting the incorrect flowi6_proto value. Suggested-by: Jiri Benc <jbenc@redhat.com> Fixes: c12b395a4664 ("gre: Support GRE over IPv6") Reviewed-by: Shmulik Ladkani <shmulik.ladkani@gmail.com> Signed-off-by: Lance Richardson <lrichard@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6 addrconf: implement RFC7559 router solicitation backoffMaciej Żenczykowski2016-09-301-7/+44
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This implements: https://tools.ietf.org/html/rfc7559 Backoff is performed according to RFC3315 section 14: https://tools.ietf.org/html/rfc3315#section-14 We allow setting /proc/sys/net/ipv6/conf/*/router_solicitations to a negative value meaning an unlimited number of retransmits, and we make this the new default (inline with the RFC). We also add a new setting: /proc/sys/net/ipv6/conf/*/router_solicitation_max_interval defaulting to 1 hour (per RFC recommendation). Signed-off-by: Maciej Żenczykowski <maze@google.com> Acked-by: Erik Kline <ek@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | ipv6: Remove useless parameter in __snmp6_fill_statsdevJia He2016-09-301-6/+6
| | | | | | | | | | | | | | The parameter items(is always ICMP6_MIB_MAX) is useless for __snmp6_fill_statsdev Signed-off-by: Jia He <hejianet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | proc: Reduce cache miss in snmp6_seq_showJia He2016-09-301-8/+22
| | | | | | | | | | | | | | | | This is to use the generic interfaces snmp_get_cpu_field{,64}_batch to aggregate the data by going through all the items of each cpu sequentially. Signed-off-by: Jia He <hejianet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'master' of ↵Pablo Neira Ayuso2016-09-2520-106/+271
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next Conflicts: net/netfilter/core.c net/netfilter/nf_tables_netdev.c Resolve two conflicts before pull request for David's net-next tree: 1) Between c73c24849011 ("netfilter: nf_tables_netdev: remove redundant ip_hdr assignment") from the net tree and commit ddc8b6027ad0 ("netfilter: introduce nft_set_pktinfo_{ipv4, ipv6}_validate()"). 2) Between e8bffe0cf964 ("net: Add _nf_(un)register_hooks symbols") and Aaron Conole's patches to replace list_head with single linked list. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>