summaryrefslogtreecommitdiffstats
Commit message (Collapse)AuthorAgeFilesLines
* netfilter: nf_tables: pass context to nft_set_destroy()Pablo Neira Ayuso2020-03-191-5/+5
| | | | | | | The patch that adds support for stateful expressions in set definitions require this. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* netfilter: nf_tables: move nft_expr_clone() to nf_tables_api.cPablo Neira Ayuso2020-03-193-17/+19
| | | | | | Move the nft_expr_clone() helper function to the core. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
* Merge tag 'mlx5-updates-2020-03-17' of ↵David S. Miller2020-03-1912-105/+218
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux Saeed Mahameed says: ==================== mlx5-updates-2020-03-17 1) Compiler warnings and cleanup for the connection tracking series 2) Bug fixes for the connection tracking series 3) Fix devlink port register sequence 4) Last five patches in the series, By Eli cohen Add the support for forwarding traffic between two eswitch uplink representors (Hairpin for eswitch), using mlx5 termination tables to change the direction of a packet in hw from RX to TX pipeline. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/mlx5: Avoid forwarding to other eswitch uplinkEli Cohen2020-03-181-0/+3
| | | | | | | | | | | | | | | | | | Do not allow forwarding of encapsulated traffic received from one eswtich's uplink to another eswtich's uplink. Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: Eswitch, enable forwarding back to uplink portEli Cohen2020-03-182-17/+45
| | | | | | | | | | | | | | | | | | | | | | | | Add dependencny on cap termination_table_raw_traffic to allow non encapsulated packets received from uplink to be forwarded back to the received uplink port. Refactor the conditions into a separate function. Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5e: Add support for offloading traffic from uplink to uplinkEli Cohen2020-03-181-27/+70
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Termination tables change the direction of a packet in hw from RX to SX pipeline. Use that to offload hairpin flows received from uplink and sent back to uplink. Currently termination tables are used for pushing VLAN to packets received from uplink and targeting a VF. Extend the implementation to allow forwarding packets to uplink. These packets can either be encapsulated or not. In case encapsulation is needed before forwarding, move the reformat object to the termination table as required. Extend the hash table key to include tunnel information for the sake of reusing reformat objects. Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: Don't use termination tables in slow pathEli Cohen2020-03-183-7/+19
| | | | | | | | | | | | | | | | | | | | | | Don't use termination tables for packets that are steered to the slow path, as a pre-step for supporting packet encap (packet reformat) action on termination tables. Packet encap (reformat action) actions steer the packet to the slow path until outer arp entries are resolved. Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: Avoid configuring eswitch QoS if not supportedEli Cohen2020-03-182-0/+11
| | | | | | | | | | | | | | | | | | | | | | | | Check if QoS is enabled for the eswitch before attempting to configure QoS parameters and emit a netlink error if not supported. Introduce an API to check if QoS is supported for the eswitch. Signed-off-by: Eli Cohen <eli@mellanox.com> Reviewed-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Paul Blakey <paulb@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5e: Fix devlink port register sequenceVladyslav Tarasiuk2020-03-183-24/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If udevd is configured to rename interfaces according to persistent naming rules and if a network interface has phys_port_name in sysfs, its contents will be appended to the interface name. However, register_netdev creates device in sysfs and if devlink_port_register is called after that, there is a timeframe in which udevd may read an empty phys_port_name value. The consequence is that the interface will lose this suffix and its name will not be really persistent. The solution is to register the port before registering a netdev. Fixes: c6acd629eec7 ("net/mlx5e: Add support for devlink-port in non-representors mode") Signed-off-by: Vladyslav Tarasiuk <vladyslavt@mellanox.com> Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com> Reviewed-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5e: Fix rejecting all egress rules not on vlanRoi Dayan2020-03-181-14/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | The original condition rejected all egress rules that are not on tunnel device. Also, the whole point of this egress reject was to disallow bad rules because of egdev which doesn't exists today, so remove this check entirely. Fixes: 0a7fcb78cc21 ("net/mlx5e: Support inner header rewrite with goto action") Signed-off-by: Roi Dayan <roid@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Reviewed-by: Vlad Buslov <vladbu@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5e: en_tc: Rely just on register loopback for tunnel restorationPaul Blakey2020-03-181-3/+3
| | | | | | | | | | | | | | | | | | | | Register loopback which is needed for tunnel restoration, is now always enabled if supported and not just with metadata enabled, check for that instead. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5e: CT: Fix stack usage compiler warningSaeed Mahameed2020-03-181-9/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix the following warnings: [-Werror=frame-larger-than=] In function ‘mlx5_tc_ct_entry_add_rule’: drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c:541:1: error: the frame size of 1136 bytes is larger than 1024 bytes In function ‘__mlx5_tc_ct_flow_offload’: drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c:1049:1: error: the frame size of 1168 bytes is larger than 1024 bytes Fixes: 4c3844d9e97e ("net/mlx5e: CT: Introduce connection tracking") Signed-off-by: Saeed Mahameed <saeedm@mellanox.com> Reviewed-by: Paul Blakey <paulb@mellanox.com>
| * net/mlx5e: CT: Fix insert rules when TC_CT config isn't enabledPaul Blakey2020-03-181-0/+9
| | | | | | | | | | | | | | | | | | | | | | | | If CONFIG_MLX5_TC_CT isn't enabled, all offloading of eswitch tc rules fails on parsing ct match, even if there is no ct match. Return success if there is no ct match, regardless of config. Fixes: 4c3844d9e97e ("net/mlx5e: CT: Introduce connection tracking") Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5e: CT: remove set but not used variable 'unnew'YueHaibing2020-03-181-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c: In function mlx5_tc_ct_parse_match: drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c:699:36: warning: variable unnew set but not used [-Wunused-but-set-variable] Fixes: 4c3844d9e97e ("net/mlx5e: CT: Introduce connection tracking") Reported-by: Hulk Robot <hulkci@huawei.com> Signed-off-by: YueHaibing <yuehaibing@huawei.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: E-Switch, Skip restore modify header between prios of same chainPaul Blakey2020-03-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Restore modify header writes the chain mapping on the packet. This modify header and action is added on all prios connections, and gets overwritten with the same value consecutively in prios of the same chain. Use the chain's modify header only for the last prio of a given tc chain. Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: E-Switch: Fix using fwd and modify when firmware doesn't support itPaul Blakey2020-03-181-2/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | Currently, if firmware doesn't support fwd and modify, driver fails initializing eswitch chains while entering switchdev mode. Instead, on such cases, disable the chains and prio feature (as we can't restore the chain on miss) and the usage of fwd and modify. Fixes: 8f1e0b97cc70 ("net/mlx5: E-Switch, Mark miss packets with new chain id mapping") Signed-off-by: Paul Blakey <paulb@mellanox.com> Reviewed-by: Oz Shlomo <ozsh@mellanox.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
| * net/mlx5: Add missing inline to stub esw_add_restore_ruleNathan Chancellor2020-03-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_MLX5_ESWITCH is unset, clang warns: In file included from drivers/net/ethernet/mellanox/mlx5/core/main.c:58: drivers/net/ethernet/mellanox/mlx5/core/eswitch.h:670:1: warning: unused function 'esw_add_restore_rule' [-Wunused-function] esw_add_restore_rule(struct mlx5_eswitch *esw, u32 tag) ^ 1 warning generated. This stub function is missing inline; add it to suppress the warning. Fixes: 11b717d61526 ("net/mlx5: E-Switch, Get reg_c0 value on CQE") Signed-off-by: Nathan Chancellor <natechancellor@gmail.com> Reviewed-by: Nick Desaulniers <ndesaulniers@google.com> Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
* | net: phy: realtek: read actual speed to detect downshiftHeiner Kallweit2020-03-191-1/+59
| | | | | | | | | | | | | | | | | | | | | | At least some integrated PHY's in RTL8168/RTL8125 chip versions support downshift, and the actual link speed can be read from a vendor-specific register. Info about this register was provided by Realtek. More details about downshift configuration (e.g. number of attempts) aren't available, therefore the downshift tunable is not implemented. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: Fix hw_stats_type setting in pedit loopPetr Machata2020-03-191-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In the commit referenced below, hw_stats_type of an entry is set for every entry that corresponds to a pedit action. However, the assignment is only done after the entry pointer is bumped, and therefore could overwrite memory outside of the entries array. The reason for this positioning may have been that the current entry's hw_stats_type is already set above, before the action-type dispatch. However, if there are no more actions, the assignment is wrong. And if there are, the next round of the for_each_action loop will make the assignment before the action-type dispatch anyway. Therefore fix this issue by simply reordering the two lines. Fixes: 74522e7baae2 ("net: sched: set the hw_stats_type in pedit loop") Signed-off-by: Petr Machata <petrm@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 'mlxsw-spectrum_cnt-Expose-counter-resources'David S. Miller2020-03-198-64/+343
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Ido Schimmel says: ==================== mlxsw: spectrum_cnt: Expose counter resources Jiri says: Capacity and utilization of existing flow and RIF counters are currently unavailable to be seen by the user. Use the existing devlink resources API to expose the information: $ sudo devlink resource show pci/0000:00:10.0 -v pci/0000:00:10.0: name kvd resource_path /kvd size 524288 unit entry dpipe_tables none name span_agents resource_path /span_agents size 8 occ 0 unit entry dpipe_tables none name counters resource_path /counters size 79872 occ 44 unit entry dpipe_tables none resources: name flow resource_path /counters/flow size 61440 occ 4 unit entry dpipe_tables none name rif resource_path /counters/rif size 18432 occ 40 unit entry dpipe_tables none ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | selftests: mlxsw: Add tc action hw_stats testsJiri Pirko2020-03-192-0/+139
| | | | | | | | | | | | | | | | | | | | | | | | Add tests for mlxsw hw_stats types. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mlxsw: spectrum_cnt: Expose devlink resource occupancy for countersJiri Pirko2020-03-191-1/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | Implement occupancy counting for counters and expose over devlink resource API. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mlxsw: spectrum_cnt: Consolidate subpools initializationJiri Pirko2020-03-191-22/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | Put all init operations related to subpools into mlxsw_sp_counter_sub_pools_init(). Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mlxsw: spectrum_cnt: Move config validation along with resource registerJiri Pirko2020-03-191-23/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Move the validation of subpools configuration, to avoid possible over commitment to resource registration. Add WARN_ON to indicate bug in the code. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mlxsw: spectrum_cnt: Expose subpool sizes over devlink resourcesJiri Pirko2020-03-194-17/+101
| | | | | | | | | | | | | | | | | | | | | | | | | | | Implement devlink resources support for counter pools. Move the subpool sizes calculations into the new resources register function. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mlxsw: spectrum_cnt: Add entry_size_res_id for each subpool and use it to ↵Jiri Pirko2020-03-191-12/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | query entry size Add new field to subpool struct that would indicate which resource id should be used to query the entry size for the subpool from the device. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mlxsw: spectrum_cnt: Move sub_pools under per-instance pool structJiri Pirko2020-03-191-18/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, the global static array of subpools is used. Make it per-instance as multiple instances of the mlxsw driver can have different values. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | selftests: spectrum-2: Adjust tc_flower_scale limit according to current ↵Jiri Pirko2020-03-191-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | counter count With the change that made the code to query counter bank size from device instead of using hard-coded value, the number of available counters changed for Spectrum-2. Adjust the limit in the selftests. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mlxsw: spectrum_cnt: Query bank size from FW resourcesJiri Pirko2020-03-192-6/+11
|/ / | | | | | | | | | | | | | | | | | | The bank size is different between Spectrum versions. Also it is a resource that can be queried. So instead of hard coding the value in code, query it from the firmware. Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: Ido Schimmel <idosch@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | cxgb4: rework TC filter rule insertion across regionsRahul Lakkireddy2020-03-196-152/+381
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Chelsio NICs have 3 filter regions, in following order of priority: 1. High Priority (HPFILTER) region (Highest Priority). 2. HASH region. 3. Normal FILTER region (Lowest Priority). Currently, there's a 1-to-1 mapping between the prio value passed by TC and the filter region index. However, it's possible to have multiple TC rules with the same prio value. In this case, if a region is exhausted, no attempt is made to try inserting the rule in the next available region. So, rework and remove the 1-to-1 mapping. Instead, dynamically select the region to insert the filter rule, as long as the new rule's prio value doesn't conflict with existing rules across all the 3 regions. Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | netfilter: revert introduction of egress hookDaniel Borkmann2020-03-198-160/+68
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This reverts the following commits: 8537f78647c0 ("netfilter: Introduce egress hook") 5418d3881e1f ("netfilter: Generalize ingress hook") b030f194aed2 ("netfilter: Rename ingress hook include file") >From the discussion in [0], the author's main motivation to add a hook in fast path is for an out of tree kernel module, which is a red flag to begin with. Other mentioned potential use cases like NAT{64,46} is on future extensions w/o concrete code in the tree yet. Revert as suggested [1] given the weak justification to add more hooks to critical fast-path. [0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/ [1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/ Signed-off-by: Daniel Borkmann <daniel@iogearbox.net> Cc: David Miller <davem@davemloft.net> Cc: Pablo Neira Ayuso <pablo@netfilter.org> Cc: Alexei Starovoitov <ast@kernel.org> Nacked-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge branch 's390-qeth-next'David S. Miller2020-03-196-70/+174
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Julian Wiedmann says: ==================== s390/qeth: updates 2020-03-18 please apply the following patch series for qeth to netdev's net-next tree. This consists of three parts: 1) support for __GFP_MEMALLOC, 2) several ethtool enhancements (.set_channels, SW Timestamping), 3) the usual cleanups. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | s390/qeth: use dev->reg_stateJulian Wiedmann2020-03-193-29/+17
| | | | | | | | | | | | | | | | | | | | | | | | To check whether a netdevice has already been registered, look at NETREG_REGISTERED to replace some hacks I added a while ago. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | s390/qeth: remove gratuitous NULL checksJulian Wiedmann2020-03-192-5/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | qeth_do_ioctl() is only reached through our own net_device_ops, so we can trust that dev->ml_priv still contains what we put there earlier. qeth_bridgeport_an_set() is an internal function that doesn't require such sanity checks. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | s390/qeth: add phys_to_virt() translation for AOBJulian Wiedmann2020-03-191-3/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | Data addresses in the AOB are absolute, and need to be translated before being fed into kmem_cache_free(). Currently this phys_to_virt() is a no-op. Also see commit 2db01da8d25f ("s390/qdio: fill SBALEs with absolute addresses"). Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | s390/qeth: don't report hard-coded driver versionJulian Wiedmann2020-03-191-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Versions are meaningless for an in-kernel driver. Instead use the UTS_RELEASE that is set by ethtool_get_drvinfo(). Cc: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Reviewed-by: Leon Romanovsky <leonro@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | s390/qeth: add SW timestamping support for IQD devicesJulian Wiedmann2020-03-192-1/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This adds support for SOF_TIMESTAMPING_TX_SOFTWARE. No support for non-IQD devices, since they orphan the skb in their xmit path. To play nice with TX bulking, set the timestamp when the buffer that contains the skb(s) is actually flushed out to HW. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | s390/qeth: balance the TX queue selection for IQD devicesJulian Wiedmann2020-03-193-2/+46
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For ucast traffic, qeth_iqd_select_queue() falls back to netdev_pick_tx(). This will potentially use skb_tx_hash() to distribute the flow over all active TX queues - so txq 0 is a valid selection, and qeth_iqd_select_queue() needs to check for this and put it on some other queue. As a result, the distribution for ucast flows is unbalanced and hits QETH_IQD_MIN_UCAST_TXQ heavier than the other queues. Open-coding a custom variant of skb_tx_hash() isn't an option, since netdev_pick_tx() also gives us eg. access to XPS. But we can pull a little trick: add a single TC class that excludes the mcast txq, and thus encourage skb_tx_hash() to not pick the mcast txq. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | s390/qeth: allow configuration of TX queues for IQD devicesJulian Wiedmann2020-03-192-4/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Similar to the support for z/VM NICs, but we need to take extra care about the dedicated mcast queue: 1. netdev_pick_tx() is unaware of this limitation and might select the mcast txq. Catch this. 2. require at least _two_ TX queues - one for ucast, one for mcast. 3. when reducing the number of TX queues, there's a potential race where netdev_cap_txqueue() over-rules the selected txq index and falls back to index 0. This would place ucast traffic on the mcast queue, and result in TX errors. So for IQD, reject a reduction while the interface is running. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | s390/qeth: allow configuration of TX queues for z/VM NICsJulian Wiedmann2020-03-191-0/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add support for ETHTOOL_SCHANNELS to change the count of active TX queues. Since all TX queue structs are pre-allocated and -registered, we just need to trivially adjust dev->real_num_tx_queues. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | s390/qeth: remove prio-queueing support for z/VM NICsJulian Wiedmann2020-03-195-25/+49
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | z/VM NICs don't offer HW QoS for TX rings. So just use netdev_pick_tx() to distribute the connections equally over all enabled TX queues. We start with just 1 enabled TX queue (this matches the typical configuration without prio-queueing). A follow-on patch will allow users to enable additional TX queues. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | s390/qeth: use memory reserves in TX slow pathJulian Wiedmann2020-03-192-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When falling back to an allocation from the HW header cache, check if the skb is eligible for using memory reserves. This only makes a difference if the cache is empty and needs to be refilled. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | s390/qeth: use memory reserves to back RX buffersJulian Wiedmann2020-03-191-2/+2
|/ / | | | | | | | | | | | | | | Use dev_alloc_page() for backing the RX buffers with pages. This way we pick up __GFP_MEMALLOC. Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/pablo/nf-nextDavid S. Miller2020-03-1852-581/+2781
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pablo Neira Ayuso says: ==================== Netfilter updates for net-next The following patchset contains Netfilter updates for net-next: 1) Use nf_flow_offload_tuple() to fetch flow stats, from Paul Blakey. 2) Add new xt_IDLETIMER hard mode, from Manoj Basapathi. Follow up patch to clean up this new mode, from Dan Carpenter. 3) Add support for geneve tunnel options, from Xin Long. 4) Make sets built-in and remove modular infrastructure for sets, from Florian Westphal. 5) Remove unused TEMPLATE_NULLS_VAL, from Li RongQing. 6) Statify nft_pipapo_get, from Chen Wandun. 7) Use C99 flexible-array member, from Gustavo A. R. Silva. 8) More descriptive variable names for bitwise, from Jeremy Sowden. 9) Four patches to add tunnel device hardware offload to the flowtable infrastructure, from wenxu. 10) pipapo set supports for 8-bit grouping, from Stefano Brivio. 11) pipapo can switch between nibble and byte grouping, also from Stefano. 12) Add AVX2 vectorized version of pipapo, from Stefano Brivio. 13) Update pipapo to be use it for single ranges, from Stefano. 14) Add stateful expression support to elements via control plane, eg. counter per element. 15) Re-visit sysctls in unprivileged namespaces, from Florian Westphal. 15) Add new egress hook, from Lukas Wunner. ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
| * | netfilter: Introduce egress hookLukas Wunner2020-03-187-8/+83
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key") introduced the ability to classify packets on ingress. Allow the same on egress. Position the hook immediately before a packet is handed to tc and then sent out on an interface, thereby mirroring the ingress order. This order allows marking packets in the netfilter egress hook and subsequently using the mark in tc. Another benefit of this order is consistency with a lot of existing documentation which says that egress tc is performed after netfilter hooks. Egress hooks already exist for the most common protocols, such as NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because they are executed earlier during packet processing. However for more exotic protocols, there is currently no provision to apply netfilter on egress. A common workaround is to enslave the interface to a bridge and use ebtables, or to resort to tc. But when the ingress hook was introduced, consensus was that users should be given the choice to use netfilter or tc, whichever tool suits their needs best: https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/ This hook is also useful for NAT46/NAT64, tunneling and filtering of locally generated af_packet traffic such as dhclient. There have also been occasional user requests for a netfilter egress hook in the past, e.g.: https://www.spinics.net/lists/netfilter/msg50038.html Performance measurements with pktgen surprisingly show a speedup rather than a slowdown with this commit: * Without this commit: Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags) 2920481pps 1401Mb/sec (1401830880bps) errors: 0 * With this commit: Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags) 2941410pps 1411Mb/sec (1411876800bps) errors: 0 * Without this commit + tc egress: Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags) 2562631pps 1230Mb/sec (1230062880bps) errors: 0 * With this commit + tc egress: Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags) 2659259pps 1276Mb/sec (1276444320bps) errors: 0 * With this commit + nft egress: Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags) 2413320pps 1158Mb/sec (1158393600bps) errors: 0 Tested on a bare-metal Core i7-3615QM, each measurement was performed three times to verify that the numbers are stable. Commands to perform a measurement: modprobe pktgen echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3 samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000 Commands for testing tc egress: tc qdisc add dev lo clsact tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32 Commands for testing nft egress: nft add table netdev t nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \} nft add rule netdev t co ip daddr 4.3.2.1/32 drop All testing was performed on the loopback interface to avoid distorting measurements by the packet handling in the low-level Ethernet driver. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: Generalize ingress hookLukas Wunner2020-03-182-15/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare for addition of a netfilter egress hook by generalizing the ingress hook introduced by commit e687ad60af09 ("netfilter: add netfilter ingress hook after handle_ing() under unique static key"). In particular, rename and refactor the ingress hook's static inlines such that they can be reused for an egress hook. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: Rename ingress hook include fileLukas Wunner2020-03-182-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Prepare for addition of a netfilter egress hook by renaming <linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>. The egress hook also necessitates a refactoring of the include file, but that is done in a separate commit to ease reviewing. No functional change intended. Signed-off-by: Lukas Wunner <lukas@wunner.de> Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: conntrack: re-visit sysctls in unprivileged namespacesFlorian Westphal2020-03-151-11/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | since commit b884fa46177659 ("netfilter: conntrack: unify sysctl handling") conntrack no longer exposes most of its sysctls (e.g. tcp timeouts settings) to network namespaces that are not owned by the initial user namespace. This patch exposes all sysctls even if the namespace is unpriviliged. compared to a 4.19 kernel, the newly visible and writeable sysctls are: net.netfilter.nf_conntrack_acct net.netfilter.nf_conntrack_timestamp .. to allow to enable accouting and timestamp extensions. net.netfilter.nf_conntrack_events .. to turn off conntrack event notifications. net.netfilter.nf_conntrack_checksum .. to disable checksum validation. net.netfilter.nf_conntrack_log_invalid .. to enable logging of packets deemed invalid by conntrack. newly visible sysctls that are only exported as read-only: net.netfilter.nf_conntrack_count .. current number of conntrack entries living in this netns. net.netfilter.nf_conntrack_max .. global upperlimit (maximum size of the table). net.netfilter.nf_conntrack_buckets .. size of the conntrack table (hash buckets). net.netfilter.nf_conntrack_expect_max .. maximum number of permitted expectations in this netns. net.netfilter.nf_conntrack_helper .. conntrack helper auto assignment. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nft_lookup: update element stateful expressionPablo Neira Ayuso2020-03-151-0/+1
| | | | | | | | | | | | | | | | | | If the set element comes with an stateful expression, update it. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
| * | netfilter: nf_tables: add nft_set_elem_update_expr() helper functionPablo Neira Ayuso2020-03-152-7/+13
| | | | | | | | | | | | | | | | | | | | | This helper function runs the eval path of the stateful expression of an existing set element. Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>