summaryrefslogtreecommitdiffstats
path: root/net (follow)
Commit message (Collapse)AuthorAgeFilesLines
* udpv6: Check address length before reading address familyTetsuo Handa2019-04-121-0/+2
| | | | | | | | | | | | | KMSAN will complain if valid address length passed to udpv6_pre_connect() is shorter than sizeof("struct sockaddr"->sa_family) bytes. (This patch is bogus if it is guaranteed that udpv6_pre_connect() is always called after checking "struct sockaddr"->sa_family. In that case, we want a comment why we don't need to check valid address length here.) Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Song Liu <songliubraving@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* bpf: Check address length before reading address familyTetsuo Handa2019-04-121-0/+2
| | | | | | | | | KMSAN will complain if valid address length passed to bpf_bind() is shorter than sizeof("struct sockaddr"->sa_family) bytes. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Andrey Ignatov <rdna@fb.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* llc: Check address length before reading address fieldTetsuo Handa2019-04-121-2/+1
| | | | | | | | KMSAN will complain if valid address length passed to bind() is shorter than sizeof(struct sockaddr_llc) bytes. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
* Bluetooth: Check address length before reading address fieldTetsuo Handa2019-04-121-2/+2
| | | | | | | | KMSAN will complain if valid address length passed to bind() is shorter than sizeof(struct sockaddr_sco) bytes. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
* rxrpc: Check address length before reading srx_service fieldTetsuo Handa2019-04-121-1/+2
| | | | | | | | KMSAN will complain if valid address length passed to bind() is shorter than sizeof(struct sockaddr_rxrpc) bytes. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: netlink: Check address length before reading groups fieldTetsuo Handa2019-04-121-1/+2
| | | | | | | | KMSAN will complain if valid address length passed to bind() is shorter than sizeof(struct sockaddr_nl) bytes. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: Check address length before reading address familyTetsuo Handa2019-04-121-1/+2
| | | | | | | | | KMSAN will complain if valid address length passed to connect() is shorter than sizeof("struct sockaddr"->sa_family) bytes. Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/rds: Check address length before reading address familyTetsuo Handa2019-04-122-0/+5
| | | | | | | | | | | | | | | | syzbot is reporting uninitialized value at rds_connect() [1] and rds_bind() [2]. This is because syzbot is passing ulen == 0 whereas these functions expect that it is safe to access sockaddr->family field in order to determine minimal address length for validation. [1] https://syzkaller.appspot.com/bug?id=f4e61c010416c1e6f0fa3ffe247561b60a50ad71 [2] https://syzkaller.appspot.com/bug?id=a4bf9e41b7e055c3823fdcd83e8c58ca7270e38f Reported-by: syzbot <syzbot+0049bebbf3042dbd2e8f@syzkaller.appspotmail.com> Reported-by: syzbot <syzbot+915c9f99f3dbc4bd6cd1@syzkaller.appspotmail.com> Signed-off-by: Tetsuo Handa <penguin-kernel@I-love.SAKURA.ne.jp> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* dctcp: more accurate tracking of packets deliveryEric Dumazet2019-04-121-28/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit e21db6f69a95 ("tcp: track total bytes delivered with ECN CE marks") core TCP stack does a very good job tracking ECN signals. The "sender's best estimate of CE information" Yuchung mentioned in his patch is indeed the best we can do. DCTCP can use tp->delivered_ce and tp->delivered to not duplicate the logic, and use the existing best estimate. This solves some problems, since current DCTCP logic does not deal with losses and/or GRO or ack aggregation very well. This also removes a dubious use of inet_csk(sk)->icsk_ack.rcv_mss (this should have been tp->mss_cache), and a 64 bit divide. Finally, we can see that the DCTCP logic, calling dctcp_update_alpha() for every ACK could be done differently, calling it only once per RTT. Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Soheil Hassas Yeganeh <soheil@google.com> Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Lawrence Brakmo <brakmo@fb.com> Cc: Abdul Kabbani <akabbani@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: netrom: Fix error cleanup path of nr_proto_initYueHaibing2019-04-114-25/+60
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Syzkaller report this: BUG: unable to handle kernel paging request at fffffbfff830524b PGD 237fe8067 P4D 237fe8067 PUD 237e64067 PMD 1c9716067 PTE 0 Oops: 0000 [#1] SMP KASAN PTI CPU: 1 PID: 4465 Comm: syz-executor.0 Not tainted 5.0.0+ #5 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.10.2-1ubuntu1 04/01/2014 RIP: 0010:__list_add_valid+0x21/0xe0 lib/list_debug.c:23 Code: 8b 0c 24 e9 17 fd ff ff 90 55 48 89 fd 48 8d 7a 08 53 48 89 d3 48 b8 00 00 00 00 00 fc ff df 48 89 fa 48 c1 ea 03 48 83 ec 08 <80> 3c 02 00 0f 85 8b 00 00 00 48 8b 53 08 48 39 f2 75 35 48 89 f2 RSP: 0018:ffff8881ea2278d0 EFLAGS: 00010282 RAX: dffffc0000000000 RBX: ffffffffc1829250 RCX: 1ffff1103d444ef4 RDX: 1ffffffff830524b RSI: ffffffff85659300 RDI: ffffffffc1829258 RBP: ffffffffc1879250 R08: fffffbfff0acb269 R09: fffffbfff0acb269 R10: ffff8881ea2278f0 R11: fffffbfff0acb268 R12: ffffffffc1829250 R13: dffffc0000000000 R14: 0000000000000008 R15: ffffffffc187c830 FS: 00007fe0361df700(0000) GS:ffff8881f7300000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: fffffbfff830524b CR3: 00000001eb39a001 CR4: 00000000007606e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 PKRU: 55555554 Call Trace: __list_add include/linux/list.h:60 [inline] list_add include/linux/list.h:79 [inline] proto_register+0x444/0x8f0 net/core/sock.c:3375 nr_proto_init+0x73/0x4b3 [netrom] ? 0xffffffffc1628000 ? 0xffffffffc1628000 do_one_initcall+0xbc/0x47d init/main.c:887 do_init_module+0x1b5/0x547 kernel/module.c:3456 load_module+0x6405/0x8c10 kernel/module.c:3804 __do_sys_finit_module+0x162/0x190 kernel/module.c:3898 do_syscall_64+0x9f/0x450 arch/x86/entry/common.c:290 entry_SYSCALL_64_after_hwframe+0x49/0xbe RIP: 0033:0x462e99 Code: f7 d8 64 89 02 b8 ff ff ff ff c3 66 0f 1f 44 00 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 c7 c1 bc ff ff ff f7 d8 64 89 01 48 RSP: 002b:00007fe0361dec58 EFLAGS: 00000246 ORIG_RAX: 0000000000000139 RAX: ffffffffffffffda RBX: 000000000073bf00 RCX: 0000000000462e99 RDX: 0000000000000000 RSI: 0000000020000100 RDI: 0000000000000003 RBP: 00007fe0361dec70 R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 00007fe0361df6bc R13: 00000000004bcefa R14: 00000000006f6fb0 R15: 0000000000000004 Modules linked in: netrom(+) ax25 fcrypt pcbc af_alg arizona_ldo1 v4l2_common videodev media v4l2_dv_timings hdlc ide_cd_mod snd_soc_sigmadsp_regmap snd_soc_sigmadsp intel_spi_platform intel_spi mtd spi_nor snd_usbmidi_lib usbcore lcd ti_ads7950 hi6421_regulator snd_soc_kbl_rt5663_max98927 snd_soc_hdac_hdmi snd_hda_ext_core snd_hda_core snd_soc_rt5663 snd_soc_core snd_pcm_dmaengine snd_compress snd_soc_rl6231 mac80211 rtc_rc5t583 spi_slave_time leds_pwm hid_gt683r hid industrialio_triggered_buffer kfifo_buf industrialio ir_kbd_i2c rc_core led_class_flash dwc_xlgmac snd_ymfpci gameport snd_mpu401_uart snd_rawmidi snd_ac97_codec snd_pcm ac97_bus snd_opl3_lib snd_timer snd_seq_device snd_hwdep snd soundcore iptable_security iptable_raw iptable_mangle iptable_nat nf_nat nf_conntrack nf_defrag_ipv6 nf_defrag_ipv4 iptable_filter bpfilter ip6_vti ip_vti ip_gre ipip sit tunnel4 ip_tunnel hsr veth netdevsim vxcan batman_adv cfg80211 rfkill chnl_net caif nlmon dummy team bonding vcan bridge stp llc ip6_gre gre ip6_tunnel tunnel6 tun joydev mousedev ppdev tpm kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul crc32c_intel ghash_clmulni_intel ide_pci_generic piix aesni_intel aes_x86_64 crypto_simd cryptd glue_helper ide_core psmouse input_leds i2c_piix4 serio_raw intel_agp intel_gtt ata_generic agpgart pata_acpi parport_pc rtc_cmos parport floppy sch_fq_codel ip_tables x_tables sha1_ssse3 sha1_generic ipv6 [last unloaded: rxrpc] Dumping ftrace buffer: (ftrace buffer empty) CR2: fffffbfff830524b ---[ end trace 039ab24b305c4b19 ]--- If nr_proto_init failed, it may forget to call proto_unregister, tiggering this issue.This patch rearrange code of nr_proto_init to avoid such issues. Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: bridge: multicast: use rcu to access port list from ↵Nikolay Aleksandrov2019-04-111-1/+3
| | | | | | | | | | | | br_multicast_start_querier br_multicast_start_querier() walks over the port list but it can be called from a timer with only multicast_lock held which doesn't protect the port list, so use RCU to walk over it. Fixes: c83b8fab06fc ("bridge: Restart queries when last querier expires") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: move unhash before release of clcsockUrsula Braun2019-04-111-3/+3
| | | | | | | | | | | | | Commit <26d92e951fe0> ("net/smc: move unhash as early as possible in smc_release()") fixes one occurrence in the smc code, but the same pattern exists in other places. This patch covers the remaining occurrences and makes sure, the unhash operation is done before the smc->clcsock is released. This avoids a potential use-after-free in smc_diag_dump(). Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: fix return code from FLUSH commandKarsten Graul2019-04-111-1/+2
| | | | | | | | | | | | | | | | The FLUSH command is used to empty the pnet table. No return code is expected from the command. Commit a9d8b0b1e3d6 added namespace support for the pnet table and changed the FLUSH command processing to call smc_pnet_remove_by_pnetid() to remove the pnet entries. This function returns -ENOENT when no entry was deleted, which is now the return code of the FLUSH command. As a result the FLUSH command will return an error when the pnet table is already empty. Restore the expected behavior and let FLUSH always return 0. Fixes: a9d8b0b1e3d6 ("net/smc: add pnet table namespace support") Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: propagate file from SMC to TCP socketUrsula Braun2019-04-111-10/+28
| | | | | | | | | | | | | fcntl(fd, F_SETOWN, getpid()) selects the recipient of SIGURG signals that are delivered when out-of-band data arrives on socket fd. If an SMC socket program makes use of such an fcntl() call, it fails in case of fallback to TCP-mode. In case of fallback the traffic is processed with the internal TCP socket. Propagating field "file" from the SMC socket to the internal TCP socket fixes the issue. Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: fix a NULL pointer dereferenceKangjie Lu2019-04-111-0/+5
| | | | | | | | | In case alloc_ordered_workqueue fails, the fix returns NULL to avoid NULL pointer dereference. Signed-off-by: Kangjie Lu <kjlu@umn.edu> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: wait for pending work before clcsock release_sockKarsten Graul2019-04-113-10/+30
| | | | | | | | | | | | | | When the clcsock is already released using sock_release() and a pending smc_listen_work accesses the clcsock than that will fail. Solve this by canceling and waiting for the work to complete first. Because the work holds the sock_lock it must make sure that the lock is not hold before the new helper smc_clcsock_release() is invoked. And before the smc_listen_work starts working check if the parent listen socket is still valid, otherwise stop the work early. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: fou: do not use guehdr after iptunnel_pull_offloads in gue_udp_recvLorenzo Bianconi2019-04-111-1/+3
| | | | | | | | | | | gue tunnels run iptunnel_pull_offloads on received skbs. This can determine a possible use-after-free accessing guehdr pointer since the packet will be 'uncloned' running pskb_expand_head if it is a cloned gso skb (e.g if the packet has been sent though a veth device) Fixes: a09a4c8dd1ec ("tunnels: Remove encapsulation offloads on decap") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tipc: missing entries in name table of publicationsHoang Le2019-04-111-1/+2
| | | | | | | | | | | | | | | | | | | | When binding multiple services with specific type 1Ki, 2Ki.., this leads to some entries in the name table of publications missing when listed out via 'tipc name show'. The problem is at identify zero last_type conditional provided via netlink. The first is initial 'type' when starting name table dummping. The second is continuously with zero type (node state service type). Then, lookup function failure to finding node state service type in next iteration. To solve this, adding more conditional to marked as dirty type and lookup correct service type for the next iteration instead of select the first service as initial 'type' zero. Acked-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: Hoang Le <hoang.h.le@dektech.com.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* failover: allow name change on IFF_UP slave interfacesSi-Wei Liu2019-04-112-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When a netdev appears through hot plug then gets enslaved by a failover master that is already up and running, the slave will be opened right away after getting enslaved. Today there's a race that userspace (udev) may fail to rename the slave if the kernel (net_failover) opens the slave earlier than when the userspace rename happens. Unlike bond or team, the primary slave of failover can't be renamed by userspace ahead of time, since the kernel initiated auto-enslavement is unable to, or rather, is never meant to be synchronized with the rename request from userspace. As the failover slave interfaces are not designed to be operated directly by userspace apps: IP configuration, filter rules with regard to network traffic passing and etc., should all be done on master interface. In general, userspace apps only care about the name of master interface, while slave names are less important as long as admin users can see reliable names that may carry other information describing the netdev. For e.g., they can infer that "ens3nsby" is a standby slave of "ens3", while for a name like "eth0" they can't tell which master it belongs to. Historically the name of IFF_UP interface can't be changed because there might be admin script or management software that is already relying on such behavior and assumes that the slave name can't be changed once UP. But failover is special: with the in-kernel auto-enslavement mechanism, the userspace expectation for device enumeration and bring-up order is already broken. Previously initramfs and various userspace config tools were modified to bypass failover slaves because of auto-enslavement and duplicate MAC address. Similarly, in case that users care about seeing reliable slave name, the new type of failover slaves needs to be taken care of specifically in userspace anyway. It's less risky to lift up the rename restriction on failover slave which is already UP. Although it's possible this change may potentially break userspace component (most likely configuration scripts or management software) that assumes slave name can't be changed while UP, it's relatively a limited and controllable set among all userspace components, which can be fixed specifically to listen for the rename events on failover slaves. Userspace component interacting with slaves is expected to be changed to operate on failover master interface instead, as the failover slave is dynamic in nature which may come and go at any point. The goal is to make the role of failover slaves less relevant, and userspace components should only deal with failover master in the long run. Fixes: 30c8bd5aa8b2 ("net: Introduce generic failover module") Signed-off-by: Si-Wei Liu <si-wei.liu@oracle.com> Reviewed-by: Liran Alon <liran.alon@oracle.com> Acked-by: Sridhar Samudrala <sridhar.samudrala@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/tls: fix build without CONFIG_TLS_DEVICEJakub Kicinski2019-04-111-0/+2
| | | | | | | | | | buildbot noticed that TLS_HW is not defined if CONFIG_TLS_DEVICE=n. Wrap the cleanup branch into an ifdef, tls_device_free_resources_tx() wouldn't be compiled either in this case. Fixes: 35b71a34ada6 ("net/tls: don't leak partially sent record in device mode") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: strparser: partially revert "strparser: Call skb_unclone conditionally"Jakub Kicinski2019-04-101-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts the first part of commit 4e485d06bb8c ("strparser: Call skb_unclone conditionally"). To build a message with multiple fragments we need our own root of frag_list. We can't simply use the frag_list of orig_skb, because it will lead to linking all orig_skbs together creating very long frag chains, and causing stack overflow on kfree_skb() (which is called recursively on the frag_lists). BUG: stack guard page was hit at 00000000d40fad41 (stack is 0000000029dde9f4..000000008cce03d5) kernel stack overflow (double-fault): 0000 [#1] PREEMPT SMP RIP: 0010:free_one_page+0x2b/0x490 Call Trace: __free_pages_ok+0x143/0x2c0 skb_release_data+0x8e/0x140 ? skb_release_data+0xad/0x140 kfree_skb+0x32/0xb0 [...] skb_release_data+0xad/0x140 ? skb_release_data+0xad/0x140 kfree_skb+0x32/0xb0 skb_release_data+0xad/0x140 ? skb_release_data+0xad/0x140 kfree_skb+0x32/0xb0 skb_release_data+0xad/0x140 ? skb_release_data+0xad/0x140 kfree_skb+0x32/0xb0 skb_release_data+0xad/0x140 ? skb_release_data+0xad/0x140 kfree_skb+0x32/0xb0 skb_release_data+0xad/0x140 __kfree_skb+0xe/0x20 tcp_disconnect+0xd6/0x4d0 tcp_close+0xf4/0x430 ? tcp_check_oom+0xf0/0xf0 tls_sk_proto_close+0xe4/0x1e0 [tls] inet_release+0x36/0x60 __sock_release+0x37/0xa0 sock_close+0x11/0x20 __fput+0xa2/0x1d0 task_work_run+0x89/0xb0 exit_to_usermode_loop+0x9a/0xa0 do_syscall_64+0xc0/0xf0 entry_SYSCALL_64_after_hwframe+0x44/0xa9 Let's leave the second unclone conditional, as I'm not entirely sure what is its purpose :) Fixes: 4e485d06bb8c ("strparser: Call skb_unclone conditionally") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/tls: don't leak partially sent record in device modeJakub Kicinski2019-04-103-14/+30
| | | | | | | | | | | | | | | | | | | | | | David reports that tls triggers warnings related to sk->sk_forward_alloc not being zero at destruction time: WARNING: CPU: 5 PID: 6831 at net/core/stream.c:206 sk_stream_kill_queues+0x103/0x110 WARNING: CPU: 5 PID: 6831 at net/ipv4/af_inet.c:160 inet_sock_destruct+0x15b/0x170 When sender fills up the write buffer and dies from SIGPIPE. This is due to the device implementation not cleaning up the partially_sent_record. This is because commit a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance") moved the partial record cleanup to the SW-only path. Fixes: a42055e8d2c3 ("net/tls: Add support for async encryption of records for performance") Reported-by: David Beckett <david.beckett@netronome.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/tls: fix the IV leaksJakub Kicinski2019-04-101-1/+4
| | | | | | | | | | | | | | Commit f66de3ee2c16 ("net/tls: Split conf to rx + tx") made freeing of IV and record sequence number conditional to SW path only, but commit e8f69799810c ("net/tls: Add generic NIC offload infrastructure") also allocates that state for the device offload configuration. Remember to free it. Fixes: e8f69799810c ("net/tls: Add generic NIC offload infrastructure") Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com> Reviewed-by: Dirk van der Merwe <dirk.vandermerwe@netronome.com> Reviewed-by: Simon Horman <simon.horman@netronome.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge tag 'mac80211-for-davem-2019-04-09' of ↵David S. Miller2019-04-0910-48/+102
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jberg/mac80211 Johannes Berg says: ==================== Various fixes: * iTXQ fixes from Felix * tracing fix - increase message length * fix SW_CRYPTO_CONTROL enforcement * WMM rule handling for regdomain intersection * max_interfaces in hwsim - reported by syzbot * clear private data in some more commands * a clang compiler warning fix I added a patch with two new (unused) macros for rate-limited printing to simplify getting the users into the tree. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * mac80211: Honor SW_CRYPTO_CONTROL for unicast keys in AP VLAN modeAlexander Wetzel2019-04-091-5/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Restore SW_CRYPTO_CONTROL operation on AP_VLAN interfaces for unicast keys, the original override was intended to be done for group keys as those are treated specially by mac80211 and would always have been rejected. Now the situation is that AP_VLAN support must be enabled by the driver if it can support it (meaning it can support software crypto GTK TX). Thus, also simplify the code - if we get here with AP_VLAN and non- pairwise key, software crypto must be used (driver doesn't know about the interface) and can be used (driver must've advertised AP_VLAN if it also uses SW_CRYPTO_CONTROL). Fixes: db3bdcb9c3ff ("mac80211: allow AP_VLAN operation on crypto controlled devices") Signed-off-by: Alexander Wetzel <alexander@wetzel-home.de> [rewrite commit message] Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: make ieee80211_schedule_txq schedule empty TXQsFelix Fietkau2019-04-081-4/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently there is no way for the driver to signal to mac80211 that it should schedule a TXQ even if there are no packets on the mac80211 part of that queue. This is problematic if the driver has an internal retry queue to deal with software A-MPDU retry. This patch changes the behavior of ieee80211_schedule_txq to always schedule the queue, as its only user (ath9k) seems to expect such behavior already: it calls this function on tx status and on powersave wakeup whenever its internal retry queue is not empty. Also add an extra argument to ieee80211_return_txq to get the same behavior. This fixes an issue on ath9k where tx queues with packets to retry (and no new packets in mac80211) would not get serviced. Fixes: 89cea7493a346 ("ath9k: Switch to mac80211 TXQ scheduling and airtime APIs") Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: rework locking for txq scheduling / airtime fairnessFelix Fietkau2019-03-291-28/+16
| | | | | | | | | | | | | | | | | | | | | | | | Holding the lock around the entire duration of tx scheduling can create some nasty lock contention, especially when processing airtime information from the tx status or the rx path. Improve locking by only holding the active_txq_lock for lookups / scheduling list modifications. Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * nl80211: Add NL80211_FLAG_CLEAR_SKB flag for other NL commandsSunil Dutt2019-03-291-6/+12
| | | | | | | | | | | | | | | | | | | | | | | | This commit adds NL80211_FLAG_CLEAR_SKB flag to other NL commands that carry key data to ensure they do not stick around on heap after the SKB is freed. Also introduced this flag for NL80211_CMD_VENDOR as there are sub commands which configure the keys. Signed-off-by: Sunil Dutt <usdutt@codeaurora.org> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * cfg80211: Use kmemdup in cfg80211_gen_new_ie()YueHaibing2019-03-291-2/+1
| | | | | | | | | | | | | | Use kmemdup rather than duplicating its implementation Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: do not call driver wake_tx_queue op during reconfigFelix Fietkau2019-03-291-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There are several scenarios in which mac80211 can call drv_wake_tx_queue after ieee80211_restart_hw has been called and has not yet completed. Driver private structs are considered uninitialized until mac80211 has uploaded the vifs, stations and keys again, so using private tx queue data during that time is not safe. The driver can also not rely on drv_reconfig_complete to figure out when it is safe to accept drv_wake_tx_queue calls again, because it is only called after all tx queues are woken again. To fix this, bail out early in drv_wake_tx_queue if local->in_reconfig is set. Cc: stable@vger.kernel.org Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * cfg80211: Change an 'else if' into an 'else' in cfg80211_calculate_bitrate_heNathan Chancellor2019-03-291-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When building with -Wsometimes-uninitialized, Clang warns: net/wireless/util.c:1223:11: warning: variable 'result' is used uninitialized whenever 'if' condition is false [-Wsometimes-uninitialized] Clang can't evaluate at this point that WARN(1, ...) always returns true because __ret_warn_on is defined as !!(condition), which isn't immediately evaluated as 1. Change this branch to else so that it's clear to Clang that we intend to bail out here. Link: https://github.com/ClangBuiltLinux/linux/issues/382 Suggested-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Reviewed-by: Arnd Bergmann <arnd@arndb.de> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: fix memory accounting with A-MSDU aggregationFelix Fietkau2019-03-291-0/+3
| | | | | | | | | | | | | | | | | | skb->truesize can change due to memory reallocation or when adding extra fragments. Adjust fq->memory_usage accordingly Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * cfg80211: Handle WMM rules in regulatory domain intersectionIlan Peer2019-03-291-0/+39
| | | | | | | | | | | | | | | | | | | | The support added for regulatory WMM rules did not handle the case of regulatory domain intersections. Fix it. Signed-off-by: Ilan Peer <ilan.peer@intel.com> Fixes: 230ebaa189af ("cfg80211: read wmm rules from regulatory database") Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: Increase MAX_MSG_LENAndrei Otcheretianski2019-03-291-1/+6
| | | | | | | | | | | | | | | | | | | | Looks that 100 chars isn't enough for messages, as we keep getting warnings popping from different places due to message shortening. Instead of trying to shorten the prints, just increase the buffer size. Signed-off-by: Andrei Otcheretianski <andrei.otcheretianski@intel.com> Signed-off-by: Luca Coelho <luciano.coelho@intel.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: fix unaligned access in mesh table hash functionFelix Fietkau2019-03-291-1/+1
| | | | | | | | | | | | | | | | The pointer to the last four bytes of the address is not guaranteed to be aligned, so we need to use __get_unaligned_cpu32 here Signed-off-by: Felix Fietkau <nbd@nbd.name> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
| * mac80211: un-schedule TXQs on powersave startFelix Fietkau2019-03-291-1/+9
| | | | | | | | | | | | | | | | | | | | | | Once a station enters powersave, its queues should not be returned by ieee80211_next_txq() anymore. They will be re-scheduled again after the station has woken up again Fixes: 1866760096bf4 ("mac80211: Add TXQ scheduling API") Signed-off-by: Felix Fietkau <nbd@nbd.name> Acked-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com>
* | net: ip6_gre: fix possible use-after-free in ip6erspan_rcvLorenzo Bianconi2019-04-091-7/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | erspan_v6 tunnels run __iptunnel_pull_header on received skbs to remove erspan header. This can determine a possible use-after-free accessing pkt_md pointer in ip6erspan_rcv since the packet will be 'uncloned' running pskb_expand_head if it is a cloned gso skb (e.g if the packet has been sent though a veth device). Fix it resetting pkt_md pointer after __iptunnel_pull_header Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: ip_gre: fix possible use-after-free in erspan_rcvLorenzo Bianconi2019-04-091-5/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | erspan tunnels run __iptunnel_pull_header on received skbs to remove gre and erspan headers. This can determine a possible use-after-free accessing pkt_md pointer in erspan_rcv since the packet will be 'uncloned' running pskb_expand_head if it is a cloned gso skb (e.g if the packet has been sent though a veth device). Fix it resetting pkt_md pointer after __iptunnel_pull_header Fixes: 1d7e2ed22f8d ("net: erspan: refactor existing erspan code") Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | NFC: nci: Add some bounds checking in nci_hci_cmd_received()Dan Carpenter2019-04-071-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is similar to commit 674d9de02aa7 ("NFC: Fix possible memory corruption when handling SHDLC I-Frame commands"). I'm not totally sure, but I think that commit description may have overstated the danger. I was under the impression that this data came from the firmware? If you can't trust your networking firmware, then you're already in trouble. Anyway, these days we add bounds checking where ever we can and we call it kernel hardening. Better safe than sorry. Fixes: 11f54f228643 ("NFC: nci: Add HCI over NCI protocol support") Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2019-04-0545-189/+250
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Several hash table refcount fixes in batman-adv, from Sven Eckelmann. 2) Use after free in bpf_evict_inode(), from Daniel Borkmann. 3) Fix mdio bus registration in ixgbe, from Ivan Vecera. 4) Unbounded loop in __skb_try_recv_datagram(), from Paolo Abeni. 5) ila rhashtable corruption fix from Herbert Xu. 6) Don't allow upper-devices to be added to vrf devices, from Sabrina Dubroca. 7) Add qmi_wwan device ID for Olicard 600, from Bjørn Mork. 8) Don't leave skb->next poisoned in __netif_receive_skb_list_ptype, from Alexander Lobakin. 9) Missing IDR checks in mlx5 driver, from Aditya Pakki. 10) Fix false connection termination in ktls, from Jakub Kicinski. 11) Work around some ASPM issues with r8169 by disabling rx interrupt coalescing on certain chips. From Heiner Kallweit. 12) Properly use per-cpu qstat values on NOLOCK qdiscs, from Paolo Abeni. 13) Fully initialize sockaddr_in structures in SCTP, from Xin Long. 14) Various BPF flow dissector fixes from Stanislav Fomichev. 15) Divide by zero in act_sample, from Davide Caratti. 16) Fix bridging multicast regression introduced by rhashtable conversion, from Nikolay Aleksandrov. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits) ibmvnic: Fix completion structure initialization ipv6: sit: reset ip header pointer in ipip6_rcv net: bridge: always clear mcast matching struct on reports and leaves libcxgb: fix incorrect ppmax calculation vlan: conditional inclusion of FCoE hooks to match netdevice.h and bnx2x sch_cake: Make sure we can write the IP header before changing DSCP bits sch_cake: Use tc_skb_protocol() helper for getting packet protocol tcp: Ensure DCTCP reacts to losses net/sched: act_sample: fix divide by zero in the traffic path net: thunderx: fix NULL pointer dereference in nicvf_open/nicvf_stop net: hns: Fix sparse: some warnings in HNS drivers net: hns: Fix WARNING when remove HNS driver with SMMU enabled net: hns: fix ICMP6 neighbor solicitation messages discard problem net: hns: Fix probabilistic memory overwrite when HNS driver initialized net: hns: Use NAPI_POLL_WEIGHT for hns driver net: hns: fix KASAN: use-after-free in hns_nic_net_xmit_hw() flow_dissector: rst'ify documentation ipv6: Fix dangling pointer when ipv6 fragment net-gro: Fix GRO flush when receiving a GSO packet. flow_dissector: document BPF flow dissector environment ...
| * | ipv6: sit: reset ip header pointer in ipip6_rcvLorenzo Bianconi2019-04-051-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ipip6 tunnels run iptunnel_pull_header on received skbs. This can determine the following use-after-free accessing iph pointer since the packet will be 'uncloned' running pskb_expand_head if it is a cloned gso skb (e.g if the packet has been sent though a veth device) [ 706.369655] BUG: KASAN: use-after-free in ipip6_rcv+0x1678/0x16e0 [sit] [ 706.449056] Read of size 1 at addr ffffe01b6bd855f5 by task ksoftirqd/1/= [ 706.669494] Hardware name: HPE ProLiant m400 Server/ProLiant m400 Server, BIOS U02 08/19/2016 [ 706.771839] Call trace: [ 706.801159] dump_backtrace+0x0/0x2f8 [ 706.845079] show_stack+0x24/0x30 [ 706.884833] dump_stack+0xe0/0x11c [ 706.925629] print_address_description+0x68/0x260 [ 706.982070] kasan_report+0x178/0x340 [ 707.025995] __asan_report_load1_noabort+0x30/0x40 [ 707.083481] ipip6_rcv+0x1678/0x16e0 [sit] [ 707.132623] tunnel64_rcv+0xd4/0x200 [tunnel4] [ 707.185940] ip_local_deliver_finish+0x3b8/0x988 [ 707.241338] ip_local_deliver+0x144/0x470 [ 707.289436] ip_rcv_finish+0x43c/0x14b0 [ 707.335447] ip_rcv+0x628/0x1138 [ 707.374151] __netif_receive_skb_core+0x1670/0x2600 [ 707.432680] __netif_receive_skb+0x28/0x190 [ 707.482859] process_backlog+0x1d0/0x610 [ 707.529913] net_rx_action+0x37c/0xf68 [ 707.574882] __do_softirq+0x288/0x1018 [ 707.619852] run_ksoftirqd+0x70/0xa8 [ 707.662734] smpboot_thread_fn+0x3a4/0x9e8 [ 707.711875] kthread+0x2c8/0x350 [ 707.750583] ret_from_fork+0x10/0x18 [ 707.811302] Allocated by task 16982: [ 707.854182] kasan_kmalloc.part.1+0x40/0x108 [ 707.905405] kasan_kmalloc+0xb4/0xc8 [ 707.948291] kasan_slab_alloc+0x14/0x20 [ 707.994309] __kmalloc_node_track_caller+0x158/0x5e0 [ 708.053902] __kmalloc_reserve.isra.8+0x54/0xe0 [ 708.108280] __alloc_skb+0xd8/0x400 [ 708.150139] sk_stream_alloc_skb+0xa4/0x638 [ 708.200346] tcp_sendmsg_locked+0x818/0x2b90 [ 708.251581] tcp_sendmsg+0x40/0x60 [ 708.292376] inet_sendmsg+0xf0/0x520 [ 708.335259] sock_sendmsg+0xac/0xf8 [ 708.377096] sock_write_iter+0x1c0/0x2c0 [ 708.424154] new_sync_write+0x358/0x4a8 [ 708.470162] __vfs_write+0xc4/0xf8 [ 708.510950] vfs_write+0x12c/0x3d0 [ 708.551739] ksys_write+0xcc/0x178 [ 708.592533] __arm64_sys_write+0x70/0xa0 [ 708.639593] el0_svc_handler+0x13c/0x298 [ 708.686646] el0_svc+0x8/0xc [ 708.739019] Freed by task 17: [ 708.774597] __kasan_slab_free+0x114/0x228 [ 708.823736] kasan_slab_free+0x10/0x18 [ 708.868703] kfree+0x100/0x3d8 [ 708.905320] skb_free_head+0x7c/0x98 [ 708.948204] skb_release_data+0x320/0x490 [ 708.996301] pskb_expand_head+0x60c/0x970 [ 709.044399] __iptunnel_pull_header+0x3b8/0x5d0 [ 709.098770] ipip6_rcv+0x41c/0x16e0 [sit] [ 709.146873] tunnel64_rcv+0xd4/0x200 [tunnel4] [ 709.200195] ip_local_deliver_finish+0x3b8/0x988 [ 709.255596] ip_local_deliver+0x144/0x470 [ 709.303692] ip_rcv_finish+0x43c/0x14b0 [ 709.349705] ip_rcv+0x628/0x1138 [ 709.388413] __netif_receive_skb_core+0x1670/0x2600 [ 709.446943] __netif_receive_skb+0x28/0x190 [ 709.497120] process_backlog+0x1d0/0x610 [ 709.544169] net_rx_action+0x37c/0xf68 [ 709.589131] __do_softirq+0x288/0x1018 [ 709.651938] The buggy address belongs to the object at ffffe01b6bd85580 which belongs to the cache kmalloc-1024 of size 1024 [ 709.804356] The buggy address is located 117 bytes inside of 1024-byte region [ffffe01b6bd85580, ffffe01b6bd85980) [ 709.946340] The buggy address belongs to the page: [ 710.003824] page:ffff7ff806daf600 count:1 mapcount:0 mapping:ffffe01c4001f600 index:0x0 [ 710.099914] flags: 0xfffff8000000100(slab) [ 710.149059] raw: 0fffff8000000100 dead000000000100 dead000000000200 ffffe01c4001f600 [ 710.242011] raw: 0000000000000000 0000000000380038 00000001ffffffff 0000000000000000 [ 710.334966] page dumped because: kasan: bad access detected Fix it resetting iph pointer after iptunnel_pull_header Fixes: a09a4c8dd1ec ("tunnels: Remove encapsulation offloads on decap") Tested-by: Jianlin Shi <jishi@redhat.com> Signed-off-by: Lorenzo Bianconi <lorenzo.bianconi@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net: bridge: always clear mcast matching struct on reports and leavesNikolay Aleksandrov2019-04-051-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need to be careful and always zero the whole br_ip struct when it is used for matching since the rhashtable change. This patch fixes all the places which didn't properly clear it which in turn might've caused mismatches. Thanks for the great bug report with reproducing steps and bisection. Steps to reproduce (from the bug report): ip link add br0 type bridge mcast_querier 1 ip link set br0 up ip link add v2 type veth peer name v3 ip link set v2 master br0 ip link set v2 up ip link set v3 up ip addr add 3.0.0.2/24 dev v3 ip netns add test ip link add v1 type veth peer name v1 netns test ip link set v1 master br0 ip link set v1 up ip -n test link set v1 up ip -n test addr add 3.0.0.1/24 dev v1 # Multicast receiver ip netns exec test socat UDP4-RECVFROM:5588,ip-add-membership=224.224.224.224:3.0.0.1,fork - # Multicast sender echo hello | nc -u -s 3.0.0.2 224.224.224.224 5588 Reported-by: liam.mcbirnie@boeing.com Fixes: 19e3a9c90c53 ("net: bridge: convert multicast to generic rhashtable") Signed-off-by: Nikolay Aleksandrov <nikolay@cumulusnetworks.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | vlan: conditional inclusion of FCoE hooks to match netdevice.h and bnx2xChris Leech2019-04-051-11/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Way back in 3c9c36bcedd426f2be2826da43e5163de61735f7 the ndo_fcoe_get_wwn pointer was switched from depending on CONFIG_FCOE to CONFIG_LIBFCOE in order to allow building FCoE support into the bnx2x driver and used by bnx2fc without including the generic software fcoe module. But, FCoE is generally used over an 802.1q VLAN, and the implementation of ndo_fcoe_get_wwn in the 8021q module was not similarly changed. The result is that if CONFIG_FCOE is disabled, then bnz2fc cannot make a call to ndo_fcoe_get_wwn through the 8021q interface to the underlying bnx2x interface. The bnx2fc driver then falls back to a potentially different mapping of Ethernet MAC to Fibre Channel WWN, creating an incompatibility with the fabric and target configurations when compared to the WWNs used by pre-boot firmware and differently-configured kernels. So make the conditional inclusion of FCoE code in 8021q match the conditional inclusion in netdevice.h Signed-off-by: Chris Leech <cleech@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/bpf/bpfDavid S. Miller2019-04-042-14/+6
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Daniel Borkmann says: ==================== pull-request: bpf 2019-04-04 The following pull-request contains BPF updates for your *net* tree. The main changes are: 1) Batch of fixes to the existing BPF flow dissector API to support calling BPF programs from the eth_get_headlen context (support for latter is planned to be added in bpf-next), from Stanislav. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| | * | flow_dissector: allow access only to a subset of __sk_buff fieldsStanislav Fomichev2019-04-031-13/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Use whitelist instead of a blacklist and allow only a small set of fields that might be relevant in the context of flow dissector: * data * data_end * flow_keys This is required for the eth_get_headlen case where we have only a chunk of data to dissect (i.e. trying to read the other skb fields doesn't make sense). Note, that it is a breaking API change! However, we've provided flow_keys->n_proto as a substitute for skb->protocol; and there is no need to manually handle skb->vlan_present. So even if we break somebody, the migration is trivial. Unfortunately, we can't support eth_get_headlen use-case without those breaking changes. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | flow_dissector: fix clamping of BPF flow_keys for non-zero nhoffStanislav Fomichev2019-04-031-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Don't allow BPF program to set flow_keys->nhoff to less than initial value. We currently don't read the value afterwards in anything but the tests, but it's still a good practice to return consistent values to the test programs. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| | * | net/flow_dissector: pass flow_keys->n_proto to BPF programsStanislav Fomichev2019-04-031-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a preparation for the next commit that would prohibit access to the most fields of __sk_buff from the BPF programs. Instead of requiring BPF flow dissector programs to look into skb, pass all input data in the flow_keys. Signed-off-by: Stanislav Fomichev <sdf@google.com> Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
| * | | sch_cake: Make sure we can write the IP header before changing DSCP bitsToke Høiland-Jørgensen2019-04-041-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is not actually any guarantee that the IP headers are valid before we access the DSCP bits of the packets. Fix this using the same approach taken in sch_dsmark. Reported-by: Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | sch_cake: Use tc_skb_protocol() helper for getting packet protocolToke Høiland-Jørgensen2019-04-041-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We shouldn't be using skb->protocol directly as that will miss cases with hardware-accelerated VLAN tags. Use the helper instead to get the right protocol number. Reported-by: Kevin Darbyshire-Bryant <kevin@darbyshire-bryant.me.uk> Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | tcp: Ensure DCTCP reacts to lossesKoen De Schepper2019-04-041-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RFC8257 §3.5 explicitly states that "A DCTCP sender MUST react to loss episodes in the same way as conventional TCP". Currently, Linux DCTCP performs no cwnd reduction when losses are encountered. Optionally, the dctcp_clamp_alpha_on_loss resets alpha to its maximal value if a RTO happens. This behavior is sub-optimal for at least two reasons: i) it ignores losses triggering fast retransmissions; and ii) it causes unnecessary large cwnd reduction in the future if the loss was isolated as it resets the historical term of DCTCP's alpha EWMA to its maximal value (i.e., denoting a total congestion). The second reason has an especially noticeable effect when using DCTCP in high BDP environments, where alpha normally stays at low values. This patch replace the clamping of alpha by setting ssthresh to half of cwnd for both fast retransmissions and RTOs, at most once per RTT. Consequently, the dctcp_clamp_alpha_on_loss module parameter has been removed. The table below shows experimental results where we measured the drop probability of a PIE AQM (not applying ECN marks) at a bottleneck in the presence of a single TCP flow with either the alpha-clamping option enabled or the cwnd halving proposed by this patch. Results using reno or cubic are given for comparison. | Link | RTT | Drop TCP CC | speed | base+AQM | probability ==================|=========|==========|============ CUBIC | 40Mbps | 7+20ms | 0.21% RENO | | | 0.19% DCTCP-CLAMP-ALPHA | | | 25.80% DCTCP-HALVE-CWND | | | 0.22% ------------------|---------|----------|------------ CUBIC | 100Mbps | 7+20ms | 0.03% RENO | | | 0.02% DCTCP-CLAMP-ALPHA | | | 23.30% DCTCP-HALVE-CWND | | | 0.04% ------------------|---------|----------|------------ CUBIC | 800Mbps | 1+1ms | 0.04% RENO | | | 0.05% DCTCP-CLAMP-ALPHA | | | 18.70% DCTCP-HALVE-CWND | | | 0.06% We see that, without halving its cwnd for all source of losses, DCTCP drives the AQM to large drop probabilities in order to keep the queue length under control (i.e., it repeatedly faces RTOs). Instead, if DCTCP reacts to all source of losses, it can then be controlled by the AQM using similar drop levels than cubic or reno. Signed-off-by: Koen De Schepper <koen.de_schepper@nokia-bell-labs.com> Signed-off-by: Olivier Tilmans <olivier.tilmans@nokia-bell-labs.com> Cc: Bob Briscoe <research@bobbriscoe.net> Cc: Lawrence Brakmo <brakmo@fb.com> Cc: Florian Westphal <fw@strlen.de> Cc: Daniel Borkmann <borkmann@iogearbox.net> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Eric Dumazet <edumazet@google.com> Cc: Andrew Shewmaker <agshew@gmail.com> Cc: Glenn Judd <glenn.judd@morganstanley.com> Acked-by: Florian Westphal <fw@strlen.de> Acked-by: Neal Cardwell <ncardwell@google.com> Acked-by: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: David S. Miller <davem@davemloft.net>