summaryrefslogtreecommitdiffstats
path: root/net/sched (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-10-0621-71/+30
|\ | | | | | | | | | | | | | | | | | | | | Rejecting non-native endian BTF overlapped with the addition of support for it. The rest were more simple overlapping changes, except the renesas ravb binding update, which had to follow a file move as well as a YAML conversion. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: check error pointer in tcf_dump_walker()Cong Wang2020-10-041-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Although we take RTNL on dump path, it is possible to skip RTNL on insertion path. So the following race condition is possible: rtnl_lock() // no rtnl lock mutex_lock(&idrinfo->lock); // insert ERR_PTR(-EBUSY) mutex_unlock(&idrinfo->lock); tc_dump_action() rtnl_unlock() So we have to skip those temporary -EBUSY entries on dump path too. Reported-and-tested-by: syzbot+b47bc4f247856fb4d9e1@syzkaller.appspotmail.com Fixes: 0fedc63fadf0 ("net_sched: commit action insertions together") Cc: Vlad Buslov <vladbu@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: remove a redundant goto chain checkCong Wang2020-09-281-14/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All TC actions call tcf_action_check_ctrlact() to validate goto chain, so this check in tcf_action_init_1() is actually redundant. Remove it to save troubles of leaking memory. Fixes: e49d8c22f126 ("net_sched: defer tcf_idr_insert() in tcf_action_init_1()") Reported-by: Vlad Buslov <vladbu@mellanox.com> Suggested-by: Davide Caratti <dcaratti@redhat.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Davide Caratti <dcaratti@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: commit action insertions togetherCong Wang2020-09-251-9/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzbot is able to trigger a failure case inside the loop in tcf_action_init(), and when this happens we clean up with tcf_action_destroy(). But, as these actions are already inserted into the global IDR, other parallel process could free them before tcf_action_destroy(), then we will trigger a use-after-free. Fix this by deferring the insertions even later, after the loop, and committing all the insertions in a separate loop, so we will never fail in the middle of the insertions any more. One side effect is that the window between alloction and final insertion becomes larger, now it is more likely that the loop in tcf_del_walker() sees the placeholder -EBUSY pointer. So we have to check for error pointer in tcf_del_walker(). Reported-and-tested-by: syzbot+2287853d392e4b42374a@syzkaller.appspotmail.com Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action") Cc: Vlad Buslov <vladbu@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: defer tcf_idr_insert() in tcf_action_init_1()Cong Wang2020-09-2521-64/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All TC actions call tcf_idr_insert() for new action at the end of their ->init(), so we can actually move it to a central place in tcf_action_init_1(). And once the action is inserted into the global IDR, other parallel process could free it immediately as its refcnt is still 1, so we can not fail after this, we need to move it after the goto action validation to avoid handling the failure case after insertion. This is found during code review, is not directly triggered by syzbot. And this prepares for the next patch. Cc: Vlad Buslov <vladbu@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/sched: act_mpls: Add action to push MPLS LSE before Ethernet headerGuillaume Nault2020-10-041-0/+18
| | | | | | | | | | | | | | | | | | | | | | | | Define the MAC_PUSH action which pushes an MPLS LSE before the mac header (instead of between the mac and the network headers as the plain PUSH action does). The only special case is when the skb has an offloaded VLAN. In that case, it has to be inlined before pushing the MPLS header. Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/sched: act_vlan: Add {POP,PUSH}_ETH actionsGuillaume Nault2020-10-041-0/+40
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Implement TCA_VLAN_ACT_POP_ETH and TCA_VLAN_ACT_PUSH_ETH, to respectively pop and push a base Ethernet header at the beginning of a frame. POP_ETH is just a matter of pulling ETH_HLEN bytes. VLAN tags, if any, must be stripped before calling POP_ETH. PUSH_ETH is restricted to skbs with no mac_header, and only the MAC addresses can be configured. The Ethertype is automatically set from skb->protocol. These restrictions ensure that all skb's fields remain consistent, so that this action can't confuse other part of the networking stack (like GSO). Since openvswitch already had these actions, consolidate the code in skbuff.c (like for vlan and mpls push/pop). Signed-off-by: Guillaume Nault <gnault@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/sched: cls_u32: Replace one-element array with flexible-array memberGustavo A. R. Silva2020-09-291-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. Refactor the code according to the use of a flexible-array member in struct tc_u_hnode and use the struct_size() helper to calculate the size for the allocations. Commit 5778d39d070b ("net_sched: fix struct tc_u_hnode layout in u32") makes it clear that the code is expected to dynamically allocate divisor + 1 entries for ->ht[] in tc_uhnode. Also, based on other observations, as the piece of code below: 1232 for (h = 0; h <= ht->divisor; h++) { 1233 for (n = rtnl_dereference(ht->ht[h]); 1234 n; 1235 n = rtnl_dereference(n->next)) { 1236 if (tc_skip_hw(n->flags)) 1237 continue; 1238 1239 err = u32_reoffload_knode(tp, n, add, cb, 1240 cb_priv, extack); 1241 if (err) 1242 return err; 1243 } 1244 } we can assume that, in general, the code is actually expecting to allocate that extra space for the one-element array in tc_uhnode, everytime it allocates memory for instances of tc_uhnode or tc_u_common structures. That's the reason for passing '1' as the last argument for struct_size() in the allocation for _root_ht_ and _tp_c_, and 'divisor + 1' in the allocation code for _ht_. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays Tested-by: kernel test robot <lkp@intel.com> Link: https://lore.kernel.org/lkml/5f7062af.z3T9tn9yIPv6h5Ny%25lkp@intel.com/ Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-09-235-37/+89
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | Two minor conflicts: 1) net/ipv4/route.c, adding a new local variable while moving another local variable and removing it's initial assignment. 2) drivers/net/dsa/microchip/ksz9477.c, overlapping changes. One pretty prints the port mode differently, whilst another changes the driver to try and obtain the port mode from the port node rather than the switch node. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: initialize with 0 before setting erspan md->uXin Long2020-09-151-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In fl_set_erspan_opt(), all bits of erspan md was set 1, as this function is also used to set opt MASK. However, when setting for md->u.index for opt VALUE, the rest bits of the union md->u will be left 1. It would cause to fail the match of the whole md when version is 1 and only index is set. This patch is to fix by initializing with 0 before setting erspan md->u. Reported-by: Shuang Li <shuali@redhat.com> Fixes: 79b1011cb33d ("net: sched: allow flower to match erspan options") Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sched: only keep the available bits when setting vxlan md->gbpXin Long2020-09-152-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | As we can see from vxlan_build/parse_gbp_hdr(), when processing metadata on vxlan rx/tx path, only dont_learn/policy_applied/policy_id fields can be set to or parse from the packet for vxlan gbp option. So we'd better do the mask when set it in act_tunnel_key and cls_flower. Otherwise, when users don't know these bits, they may configure with a value which can never be matched. Reported-by: Shuang Li <shuali@redhat.com> Signed-off-by: Xin Long <lucien.xin@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * taprio: Fix allowing too small intervalsVinicius Costa Gomes2020-09-121-11/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It's possible that the user specifies an interval that couldn't allow any packet to be transmitted. This also avoids the issue of the hrtimer handler starving the other threads because it's running too often. The solution is to reject interval sizes that according to the current link speed wouldn't allow any packet to be transmitted. Reported-by: syzbot+8267241609ae8c23b248@syzkaller.appspotmail.com Fixes: 5a781ccbd19e ("tc: Add support for configuring the taprio scheduler") Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sch_generic: aviod concurrent reset and enqueue op for lockless qdiscYunsheng Lin2020-09-101-15/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently there is concurrent reset and enqueue operation for the same lockless qdisc when there is no lock to synchronize the q->enqueue() in __dev_xmit_skb() with the qdisc reset operation in qdisc_deactivate() called by dev_deactivate_queue(), which may cause out-of-bounds access for priv->ring[] in hns3 driver if user has requested a smaller queue num when __dev_xmit_skb() still enqueue a skb with a larger queue_mapping after the corresponding qdisc is reset, and call hns3_nic_net_xmit() with that skb later. Reused the existing synchronize_net() in dev_deactivate_many() to make sure skb with larger queue_mapping enqueued to old qdisc(which is saved in dev_queue->qdisc_sleeping) will always be reset when dev_reset_queue() is called. Fixes: 6b3ba9146fe6 ("net: sched: allow qdiscs to handle locking") Signed-off-by: Yunsheng Lin <linyunsheng@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * act_ife: load meta modules before tcf_idr_check_alloc()Cong Wang2020-09-051-10/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following deadlock scenario is triggered by syzbot: Thread A: Thread B: tcf_idr_check_alloc() ... populate_metalist() rtnl_unlock() rtnl_lock() ... request_module() tcf_idr_check_alloc() rtnl_lock() At this point, thread A is waiting for thread B to release RTNL lock, while thread B is waiting for thread A to commit the IDR change with tcf_idr_insert() later. Break this deadlock situation by preloading ife modules earlier, before tcf_idr_check_alloc(), this is fine because we only need to load modules we need potentially. Reported-and-tested-by: syzbot+80e32b5d1f9923f8ace6@syzkaller.appspotmail.com Fixes: 0190c1d452a9 ("net: sched: atomically check-allocate action") Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Vlad Buslov <vladbu@mellanox.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* | net: sched: skip an unnecessay checkTom Rix2020-09-091-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Reviewing the error handling in tcf_action_init_1() most of the early handling uses err_out: if (cookie) { kfree(cookie->data); kfree(cookie); } before cookie could ever be set. So skip the unnecessay check. Signed-off-by: Tom Rix <trix@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netJakub Kicinski2020-09-053-23/+29
|\| | | | | | | | | | | | | | | | | | | | | | | We got slightly different patches removing a double word in a comment in net/ipv4/raw.c - picked the version from net. Simple conflict in drivers/net/ethernet/ibm/ibmvnic.c. Use cached values instead of VNIC login response buffer (following what commit 507ebe6444a4 ("ibmvnic: Fix use-after-free of VNIC login response buffer") did). Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netLinus Torvalds2020-09-042-22/+28
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: 1) Use netif_rx_ni() when necessary in batman-adv stack, from Jussi Kivilinna. 2) Fix loss of RTT samples in rxrpc, from David Howells. 3) Memory leak in hns_nic_dev_probe(), from Dignhao Liu. 4) ravb module cannot be unloaded, fix from Yuusuke Ashizuka. 5) We disable BH for too lokng in sctp_get_port_local(), add a cond_resched() here as well, from Xin Long. 6) Fix memory leak in st95hf_in_send_cmd, from Dinghao Liu. 7) Out of bound access in bpf_raw_tp_link_fill_link_info(), from Yonghong Song. 8) Missing of_node_put() in mt7530 DSA driver, from Sumera Priyadarsini. 9) Fix crash in bnxt_fw_reset_task(), from Michael Chan. 10) Fix geneve tunnel checksumming bug in hns3, from Yi Li. 11) Memory leak in rxkad_verify_response, from Dinghao Liu. 12) In tipc, don't use smp_processor_id() in preemptible context. From Tuong Lien. 13) Fix signedness issue in mlx4 memory allocation, from Shung-Hsi Yu. 14) Missing clk_disable_prepare() in gemini driver, from Dan Carpenter. 15) Fix ABI mismatch between driver and firmware in nfp, from Louis Peens. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (110 commits) net/smc: fix sock refcounting in case of termination net/smc: reset sndbuf_desc if freed net/smc: set rx_off for SMCR explicitly net/smc: fix toleration of fake add_link messages tg3: Fix soft lockup when tg3_reset_task() fails. doc: net: dsa: Fix typo in config code sample net: dp83867: Fix WoL SecureOn password nfp: flower: fix ABI mismatch between driver and firmware tipc: fix shutdown() of connectionless socket ipv6: Fix sysctl max for fib_multipath_hash_policy drivers/net/wan/hdlc: Change the default of hard_header_len to 0 net: gemini: Fix another missing clk_disable_unprepare() in probe net: bcmgenet: fix mask check in bcmgenet_validate_flow() amd-xgbe: Add support for new port mode net: usb: dm9601: Add USB ID of Keenetic Plus DSL vhost: fix typo in error message net: ethernet: mlx4: Fix memory allocation in mlx4_buddy_init() pktgen: fix error message with wrong function name net: ethernet: ti: am65-cpsw: fix rmii 100Mbit link mode cxgb4: fix thermal zone device registration ...
| | * net_sched: fix error path in red_init()Cong Wang2020-08-281-16/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When ->init() fails, ->destroy() is called to clean up. So it is unnecessary to clean up in red_init(), and it would cause some refcount underflow. Fixes: aee9caa03fc3 ("net: sched: sch_red: Add qevents "early_drop" and "mark"") Reported-and-tested-by: syzbot+b33c1cb0a30ebdc8a5f9@syzkaller.appspotmail.com Reported-and-tested-by: syzbot+e5ea5f8a3ecfd4427a1c@syzkaller.appspotmail.com Cc: Petr Machata <petrm@mellanox.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Petr Machata <petrm@nvidia.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| | * taprio: Fix using wrong queues in gate maskVinicius Costa Gomes2020-08-271-6/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 9c66d1564676 ("taprio: Add support for hardware offloading") there's a bit of inconsistency when offloading schedules to the hardware: In software mode, the gate masks are specified in terms of traffic classes, so if say "sched-entry S 03 20000", it means that the traffic classes 0 and 1 are open for 20us; when taprio is offloaded to hardware, the gate masks are specified in terms of hardware queues. The idea here is to fix hardware offloading, so schedules in hardware and software mode have the same behavior. What's needed to do is to map traffic classes to queues when applying the offload to the driver. Fixes: 9c66d1564676 ("taprio: Add support for hardware offloading") Signed-off-by: Vinicius Costa Gomes <vinicius.gomes@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | treewide: Use fallthrough pseudo-keywordGustavo A. R. Silva2020-08-241-1/+1
| |/ | | | | | | | | | | | | | | | | | | Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/v5.7/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-08-231-1/+1
|\|
| * net/sched: act_ct: Fix skb double-free in tcf_ct_handle_fragments() error flowAlaa Hleihel2020-08-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | tcf_ct_handle_fragments() shouldn't free the skb when ip_defrag() call fails. Otherwise, we will cause a double-free bug. In such cases, just return the error to the caller. Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: Alaa Hleihel <alaa@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netlink: consistently use NLA_POLICY_EXACT_LEN()Johannes Berg2020-08-183-10/+7
|/ | | | | | | | | | Change places that open-code NLA_POLICY_EXACT_LEN() to use the macro instead, giving us flexibility in how we handle the details of the macro. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds2020-08-0634-99/+304
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: 1) Support 6Ghz band in ath11k driver, from Rajkumar Manoharan. 2) Support UDP segmentation in code TSO code, from Eric Dumazet. 3) Allow flashing different flash images in cxgb4 driver, from Vishal Kulkarni. 4) Add drop frames counter and flow status to tc flower offloading, from Po Liu. 5) Support n-tuple filters in cxgb4, from Vishal Kulkarni. 6) Various new indirect call avoidance, from Eric Dumazet and Brian Vazquez. 7) Fix BPF verifier failures on 32-bit pointer arithmetic, from Yonghong Song. 8) Support querying and setting hardware address of a port function via devlink, use this in mlx5, from Parav Pandit. 9) Support hw ipsec offload on bonding slaves, from Jarod Wilson. 10) Switch qca8k driver over to phylink, from Jonathan McDowell. 11) In bpftool, show list of processes holding BPF FD references to maps, programs, links, and btf objects. From Andrii Nakryiko. 12) Several conversions over to generic power management, from Vaibhav Gupta. 13) Add support for SO_KEEPALIVE et al. to bpf_setsockopt(), from Dmitry Yakunin. 14) Various https url conversions, from Alexander A. Klimov. 15) Timestamping and PHC support for mscc PHY driver, from Antoine Tenart. 16) Support bpf iterating over tcp and udp sockets, from Yonghong Song. 17) Support 5GBASE-T i40e NICs, from Aleksandr Loktionov. 18) Add kTLS RX HW offload support to mlx5e, from Tariq Toukan. 19) Fix the ->ndo_start_xmit() return type to be netdev_tx_t in several drivers. From Luc Van Oostenryck. 20) XDP support for xen-netfront, from Denis Kirjanov. 21) Support receive buffer autotuning in MPTCP, from Florian Westphal. 22) Support EF100 chip in sfc driver, from Edward Cree. 23) Add XDP support to mvpp2 driver, from Matteo Croce. 24) Support MPTCP in sock_diag, from Paolo Abeni. 25) Commonize UDP tunnel offloading code by creating udp_tunnel_nic infrastructure, from Jakub Kicinski. 26) Several pci_ --> dma_ API conversions, from Christophe JAILLET. 27) Add FLOW_ACTION_POLICE support to mlxsw, from Ido Schimmel. 28) Add SK_LOOKUP bpf program type, from Jakub Sitnicki. 29) Refactor a lot of networking socket option handling code in order to avoid set_fs() calls, from Christoph Hellwig. 30) Add rfc4884 support to icmp code, from Willem de Bruijn. 31) Support TBF offload in dpaa2-eth driver, from Ioana Ciornei. 32) Support XDP_REDIRECT in qede driver, from Alexander Lobakin. 33) Support PCI relaxed ordering in mlx5 driver, from Aya Levin. 34) Support TCP syncookies in MPTCP, from Flowian Westphal. 35) Fix several tricky cases of PMTU handling wrt. briding, from Stefano Brivio. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2056 commits) net: thunderx: initialize VF's mailbox mutex before first usage usb: hso: remove bogus check for EINPROGRESS usb: hso: no complaint about kmalloc failure hso: fix bailout in error case of probe ip_tunnel_core: Fix build for archs without _HAVE_ARCH_IPV6_CSUM selftests/net: relax cpu affinity requirement in msg_zerocopy test mptcp: be careful on subflow creation selftests: rtnetlink: make kci_test_encap() return sub-test result selftests: rtnetlink: correct the final return value for the test net: dsa: sja1105: use detected device id instead of DT one on mismatch tipc: set ub->ifindex for local ipv6 address ipv6: add ipv6_dev_find() net: openvswitch: silence suspicious RCU usage warning Revert "vxlan: fix tos value before xmit" ptp: only allow phase values lower than 1 period farsync: switch from 'pci_' to 'dma_' API wan: wanxl: switch from 'pci_' to 'dma_' API hv_netvsc: do not use VF device if link is down dpaa2-eth: Fix passing zero to 'PTR_ERR' warning net: macb: Properly handle phylink on at91sam9x ...
| * net/sched: act_ct: fix miss set mru for ovs after defrag in act_ctwenxu2020-08-042-2/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When openvswitch conntrack offload with act_ct action. Fragment packets defrag in the ingress tc act_ct action and miss the next chain. Then the packet pass to the openvswitch datapath without the mru. The over mtu packet will be dropped in output action in openvswitch for over mtu. "kernel: net2: dropped over-mtu packet: 1528 > 1500" This patch add mru in the tc_skb_ext for adefrag and miss next chain situation. And also add mru in the qdisc_skb_cb. The act_ct set the mru to the qdisc_skb_cb when the packet defrag. And When the chain miss, The mru is set to tc_skb_ext which can be got by ovs datapath. Fixes: b57dc7c13ea9 ("net/sched: Introduce action ct") Signed-off-by: wenxu <wenxu@ucloud.cn> Reviewed-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-08-021-2/+2
| |\ | | | | | | | | | | | | | | | | | | Resolved kernel/bpf/btf.c using instructions from merge commit 69138b34a7248d2396ab85c8652e20c0c39beaba Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/sched: cls_u32: Use struct_size() helperGustavo A. R. Silva2020-08-011-14/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make use of the struct_size() helper, in multiple places, instead of an open-coded version in order to avoid any potential type mistakes and protect against potential integer overflows. Also, remove unnecessary object identifier size. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/sched: act_pedit: Use flex_array_size() helper in memcpy()Gustavo A. R. Silva2020-07-311-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Make use of the flex_array_size() helper to calculate the size of a flexible array member within an enclosing structure. This helper offers defense-in-depth against potential integer overflows, while at the same time makes it explicitly clear that we are dealing with a flexible array member. Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net_sched: initialize timer earlier in red_init()Cong Wang2020-07-291-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When red_init() fails, red_destroy() is called to clean up. If the timer is not initialized yet, del_timer_sync() will complain. So we have to move timer_setup() before any failure. Reported-and-tested-by: syzbot+6e95a4fabf88dc217145@syzkaller.appspotmail.com Fixes: aee9caa03fc3 ("net: sched: sch_red: Add qevents "early_drop" and "mark"") Cc: Petr Machata <petrm@mellanox.com> Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Reviewed-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-07-262-3/+14
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The UDP reuseport conflict was a little bit tricky. The net-next code, via bpf-next, extracted the reuseport handling into a helper so that the BPF sk lookup code could invoke it. At the same time, the logic for reuseport handling of unconnected sockets changed via commit efc6b6f6c3113e8b203b9debfb72d81e0f3dcace which changed the logic to carry on the reuseport result into the rest of the lookup loop if we do not return immediately. This requires moving the reuseport_has_conns() logic into the callers. While we are here, get rid of inline directives as they do not belong in foo.c files. The other changes were cases of more straightforward overlapping modifications. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net/sched: cls_flower: Add hash info to flow classificationAriel Levkovich2020-07-251-0/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adding new cls flower keys for hash value and hash mask and dissect the hash info from the skb into the flow key towards flow classication. Signed-off-by: Ariel Levkovich <lariel@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | sched: sch_api: add missing rcu read lock to silence the warningJiri Pirko2020-07-211-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In case the qdisc_match_from_root function() is called from non-rcu path with rtnl mutex held, a suspiciout rcu usage warning appears: [ 241.504354] ============================= [ 241.504358] WARNING: suspicious RCU usage [ 241.504366] 5.8.0-rc4-custom-01521-g72a7c7d549c3 #32 Not tainted [ 241.504370] ----------------------------- [ 241.504378] net/sched/sch_api.c:270 RCU-list traversed in non-reader section!! [ 241.504382] other info that might help us debug this: [ 241.504388] rcu_scheduler_active = 2, debug_locks = 1 [ 241.504394] 1 lock held by tc/1391: [ 241.504398] #0: ffffffff85a27850 (rtnl_mutex){+.+.}-{3:3}, at: rtnetlink_rcv_msg+0x49a/0xbd0 [ 241.504431] stack backtrace: [ 241.504440] CPU: 0 PID: 1391 Comm: tc Not tainted 5.8.0-rc4-custom-01521-g72a7c7d549c3 #32 [ 241.504446] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.13.0-2.fc32 04/01/2014 [ 241.504453] Call Trace: [ 241.504465] dump_stack+0x100/0x184 [ 241.504482] lockdep_rcu_suspicious+0x153/0x15d [ 241.504499] qdisc_match_from_root+0x293/0x350 Fix this by passing the rtnl held lockdep condition down to hlist_for_each_entry_rcu() Reported-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Revert "net: sched: Pass root lock to Qdisc_ops.enqueue"Petr Machata2020-07-1733-67/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts commit aebe4426ccaa4838f36ea805cdf7d76503e65117. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | net: sched: Do not drop root lock in tcf_qevent_handle()Petr Machata2020-07-172-10/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mirred currently does not mix well with blocks executed after the qdisc root lock is taken. This includes classification blocks (such as in PRIO, ETS, DRR qdiscs) and qevents. The locking caused by the packet mirrored by mirred can cause deadlocks: either when the thread of execution attempts to take the lock a second time, or when two threads end up waiting on each other's locks. The qevent patchset attempted to not introduce further badness of this sort, and dropped the lock before executing the qevent block. However this lead to too little locking and races between qdisc configuration and packet enqueue in the RED qdisc. Before the deadlock issues are solved in a way that can be applied across many qdiscs reasonably easily, do for qevents what is done for the classification blocks and just keep holding the root lock. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
| * | | net: sched: Pass qdisc reference in struct flow_block_offloadPetr Machata2020-07-141-7/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Previously, shared blocks were only relevant for the pseudo-qdiscs ingress and clsact. Recently, a qevent facility was introduced, which allows to bind blocks to well-defined slots of a qdisc instance. RED in particular got two qevents: early_drop and mark. Drivers that wish to offload these blocks will be sent the usual notification, and need to know which qdisc it is related to. To that end, extend flow_block_offload with a "sch" pointer, and initialize as appropriate. This prompts changes in the indirect block facility, which now tracks the scheduler in addition to the netdevice. Update signatures of several functions similarly. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: sched: kerneldoc fixesAndrew Lunn2020-07-142-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Simple fixes which require no deep knowledge of the code. Cc: Jamal Hadi Salim <jhs@mojatatu.com> Cc: Cong Wang <xiyou.wangcong@gmail.com> Cc: Jiri Pirko <jiri@resnulli.us> Signed-off-by: Andrew Lunn <andrew@lunn.ch> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-07-1116-33/+40
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | All conflicts seemed rather trivial, with some guidance from Saeed Mameed on the tc_ct.c one. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net/sched: Use fallthrough pseudo-keywordGustavo A. R. Silva2020-07-0814-14/+15
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the existing /* fall through */ comments and its variants with the new pseudo-keyword macro fallthrough[1]. Also, remove unnecessary fall-through markings when it is the case. [1] https://www.kernel.org/doc/html/latest/process/deprecated.html?highlight=fallthrough#implicit-switch-case-fall-through Signed-off-by: Gustavo A. R. Silva <gustavoars@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net: sched: Allow changing default qdisc to FQ-PIEDanny Lin2020-07-021-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to fq_codel and the other qdiscs that can set as default, fq_pie is also suitable for general use without explicit configuration, which makes it a valid choice for this. This is useful in situations where a painless out-of-the-box solution for reducing bufferbloat is desired but fq_codel is not necessarily the best choice. For example, fq_pie can be better for DASH streaming, but there could be more cases where it's the better choice of the two simple AQMs available in the kernel. Signed-off-by: Danny Lin <danny@kdrag0n.dev> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net:qos: police action offloading parameter 'burst' change to the original valuePo Liu2020-06-301-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since 'tcfp_burst' with TICK factor, driver side always need to recover it to the original value, this patch moves the generic calculation and recover to the 'burst' original value before offloading to device driver. Signed-off-by: Po Liu <po.liu@nxp.com> Acked-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net: sched: sch_red: Add qevents "early_drop" and "mark"Petr Machata2020-06-301-2/+56
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In order to allow acting on dropped and/or ECN-marked packets, add two new qevents to the RED qdisc: "early_drop" and "mark". Filters attached at "early_drop" block are executed as packets are early-dropped, those attached at the "mark" block are executed as packets are ECN-marked. Two new attributes are introduced: TCA_RED_EARLY_DROP_BLOCK with the block index for the "early_drop" qevent, and TCA_RED_MARK_BLOCK for the "mark" qevent. Absence of these attributes signifies "don't care": no block is allocated in that case, or the existing blocks are left intact in case of the change callback. For purposes of offloading, blocks attached to these qevents appear with newly-introduced binder types, FLOW_BLOCK_BINDER_TYPE_RED_EARLY_DROP and FLOW_BLOCK_BINDER_TYPE_RED_MARK. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net: sched: sch_red: Split init and change callbacksPetr Machata2020-06-301-12/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the following patches, RED will get two qevents. The implementation will be clearer if the callback for change is not a pure subset of the callback for init. Split the two and promote attribute parsing to the callbacks themselves from the common code, because it will be handy there. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net: sched: Introduce helpers for qevent blocksPetr Machata2020-06-301-0/+119
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Qevents are attach points for TC blocks, where filters can be put that are executed when "interesting events" take place in a qdisc. The data to keep and the functions to invoke to maintain a qevent will be largely the same between qevents. Therefore introduce sched-wide helpers for qevent management. Currently, similarly to ingress and egress blocks of clsact pseudo-qdisc, blocks attachment cannot be changed after the qdisc is created. To that end, add a helper tcf_qevent_validate_change(), which verifies whether block index attribute is not attached, or if it is, whether its value matches the current one (i.e. there is no material change). The function tcf_qevent_handle() should be invoked when qdisc hits the "interesting event" corresponding to a block. This function releases root lock for the duration of executing the attached filters, to allow packets generated through user actions (notably mirred) to be reinserted to the same qdisc tree. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | net: sched: Pass root lock to Qdisc_ops.enqueuePetr Machata2020-06-3033-67/+67
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A following patch introduces qevents, points in qdisc algorithm where packet can be processed by user-defined filters. Should this processing lead to a situation where a new packet is to be enqueued on the same port, holding the root lock would lead to deadlocks. To solve the issue, qevent handler needs to unlock and relock the root lock when necessary. To that end, add the root lock argument to the qdisc op enqueue, and propagate throughout. Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-06-266-85/+127
| |\ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Minor overlapping changes in xfrm_device.c, between the double ESP trailing bug fix setting the XFRM_INIT flag and the changes in net-next preparing for bonding encryption support. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | sch_cake: add RFC 8622 LE PHB support to CAKE diffserv handlingKevin Darbyshire-Bryant2020-06-261-4/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Change tin mapping on diffserv3, 4 & 8 for LE PHB support, in essence making LE a member of the Bulk tin. Bulk has the least priority and minimum of 1/16th total bandwidth in the face of higher priority traffic. NB: Diffserv 3 & 4 swap tin 0 & 1 priorities from the default order as found in diffserv8, in case anyone is wondering why it looks a bit odd. Signed-off-by: Kevin Darbyshire-Bryant <ldir@darbyshire-bryant.me.uk> [ reword commit message slightly ] Signed-off-by: Toke Høiland-Jørgensen <toke@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: qos: police action add index for tc flower offloadingPo Liu2020-06-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Hardware device may include more than one police entry. Specifying the action's index make it possible for several tc filters to share the same police action when installing the filters. Propagate this index to device drivers through the flow offload intermediate representation, so that drivers could share a single hardware policer between multiple filters. v1->v2 changes: - Update the commit message suggest by Ido Schimmel <idosch@idosch.org> Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | net: qos: add tc police offloading action with max frame size limitPo Liu2020-06-251-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current police offloading support the 'burst'' and 'rate_bytes_ps'. Some hardware own the capability to limit the frame size. If the frame size larger than the setting, the frame would be dropped. For the police action itself already accept the 'mtu' parameter in tc command. But not extend to tc flower offloading. So extend 'mtu' to tc flower offloading. Signed-off-by: Po Liu <Po.Liu@nxp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | tcindex_change: Remove redundant null checkGaurav Singh2020-06-231-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | arg cannot be NULL since its already being dereferenced before. Remove the redundant NULL check. Signed-off-by: Gaurav Singh <gaurav1086@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | Remove redundant skb null checkGaurav Singh2020-06-211-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove the redundant null check for skb. Signed-off-by: Gaurav Singh <gaurav1086@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>