| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
| |
The patch that adds support for stateful expressions in set definitions
require this.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|
|
|
|
|
| |
Move the nft_expr_clone() helper function to the core.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux
Saeed Mahameed says:
====================
mlx5-updates-2020-03-17
1) Compiler warnings and cleanup for the connection tracking series
2) Bug fixes for the connection tracking series
3) Fix devlink port register sequence
4) Last five patches in the series, By Eli cohen
Add the support for forwarding traffic between two eswitch uplink
representors (Hairpin for eswitch), using mlx5 termination tables
to change the direction of a packet in hw from RX to TX pipeline.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Do not allow forwarding of encapsulated traffic received from one eswtich's
uplink to another eswtich's uplink.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add dependencny on cap termination_table_raw_traffic to allow non
encapsulated packets received from uplink to be forwarded back to the
received uplink port.
Refactor the conditions into a separate function.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Termination tables change the direction of a packet in hw from RX to SX
pipeline. Use that to offload hairpin flows received from uplink and
sent back to uplink.
Currently termination tables are used for pushing VLAN to packets
received from uplink and targeting a VF. Extend the implementation to
allow forwarding packets to uplink. These packets can either be
encapsulated or not.
In case encapsulation is needed before forwarding, move the reformat
object to the termination table as required.
Extend the hash table key to include tunnel information for the sake of
reusing reformat objects.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Don't use termination tables for packets that are steered to the slow path,
as a pre-step for supporting packet encap (packet reformat) action on
termination tables. Packet encap (reformat action) actions steer the packet
to the slow path until outer arp entries are resolved.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Check if QoS is enabled for the eswitch before attempting to configure
QoS parameters and emit a netlink error if not supported.
Introduce an API to check if QoS is supported for the eswitch.
Signed-off-by: Eli Cohen <eli@mellanox.com>
Reviewed-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If udevd is configured to rename interfaces according to persistent
naming rules and if a network interface has phys_port_name in sysfs,
its contents will be appended to the interface name.
However, register_netdev creates device in sysfs and if
devlink_port_register is called after that, there is a timeframe in
which udevd may read an empty phys_port_name value. The consequence is
that the interface will lose this suffix and its name will not be
really persistent.
The solution is to register the port before registering a netdev.
Fixes: c6acd629eec7 ("net/mlx5e: Add support for devlink-port in non-representors mode")
Signed-off-by: Vladyslav Tarasiuk <vladyslavt@mellanox.com>
Reviewed-by: Maxim Mikityanskiy <maximmi@mellanox.com>
Reviewed-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The original condition rejected all egress rules that
are not on tunnel device.
Also, the whole point of this egress reject was to disallow bad
rules because of egdev which doesn't exists today, so remove
this check entirely.
Fixes: 0a7fcb78cc21 ("net/mlx5e: Support inner header rewrite with goto action")
Signed-off-by: Roi Dayan <roid@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Reviewed-by: Vlad Buslov <vladbu@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Register loopback which is needed for tunnel restoration, is now always
enabled if supported and not just with metadata enabled, check for
that instead.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Fix the following warnings: [-Werror=frame-larger-than=]
In function ‘mlx5_tc_ct_entry_add_rule’:
drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c:541:1:
error: the frame size of 1136 bytes is larger than 1024 bytes
In function ‘__mlx5_tc_ct_flow_offload’:
drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c:1049:1:
error: the frame size of 1168 bytes is larger than 1024 bytes
Fixes: 4c3844d9e97e ("net/mlx5e: CT: Introduce connection tracking")
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
Reviewed-by: Paul Blakey <paulb@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If CONFIG_MLX5_TC_CT isn't enabled, all offloading of eswitch tc rules
fails on parsing ct match, even if there is no ct match.
Return success if there is no ct match, regardless of config.
Fixes: 4c3844d9e97e ("net/mlx5e: CT: Introduce connection tracking")
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c:
In function mlx5_tc_ct_parse_match:
drivers/net/ethernet/mellanox/mlx5/core/en/tc_ct.c:699:36: warning:
variable unnew set but not used [-Wunused-but-set-variable]
Fixes: 4c3844d9e97e ("net/mlx5e: CT: Introduce connection tracking")
Reported-by: Hulk Robot <hulkci@huawei.com>
Signed-off-by: YueHaibing <yuehaibing@huawei.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Restore modify header writes the chain mapping on the packet.
This modify header and action is added on all prios connections,
and gets overwritten with the same value consecutively in prios
of the same chain.
Use the chain's modify header only for the last prio of a given tc
chain.
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Currently, if firmware doesn't support fwd and modify, driver fails
initializing eswitch chains while entering switchdev mode.
Instead, on such cases, disable the chains and prio feature (as we can't
restore the chain on miss) and the usage of fwd and modify.
Fixes: 8f1e0b97cc70 ("net/mlx5: E-Switch, Mark miss packets with new chain id mapping")
Signed-off-by: Paul Blakey <paulb@mellanox.com>
Reviewed-by: Oz Shlomo <ozsh@mellanox.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When CONFIG_MLX5_ESWITCH is unset, clang warns:
In file included from drivers/net/ethernet/mellanox/mlx5/core/main.c:58:
drivers/net/ethernet/mellanox/mlx5/core/eswitch.h:670:1: warning: unused
function 'esw_add_restore_rule' [-Wunused-function]
esw_add_restore_rule(struct mlx5_eswitch *esw, u32 tag)
^
1 warning generated.
This stub function is missing inline; add it to suppress the warning.
Fixes: 11b717d61526 ("net/mlx5: E-Switch, Get reg_c0 value on CQE")
Signed-off-by: Nathan Chancellor <natechancellor@gmail.com>
Reviewed-by: Nick Desaulniers <ndesaulniers@google.com>
Signed-off-by: Saeed Mahameed <saeedm@mellanox.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
At least some integrated PHY's in RTL8168/RTL8125 chip versions support
downshift, and the actual link speed can be read from a vendor-specific
register. Info about this register was provided by Realtek.
More details about downshift configuration (e.g. number of attempts)
aren't available, therefore the downshift tunable is not implemented.
Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In the commit referenced below, hw_stats_type of an entry is set for every
entry that corresponds to a pedit action. However, the assignment is only
done after the entry pointer is bumped, and therefore could overwrite
memory outside of the entries array.
The reason for this positioning may have been that the current entry's
hw_stats_type is already set above, before the action-type dispatch.
However, if there are no more actions, the assignment is wrong. And if
there are, the next round of the for_each_action loop will make the
assignment before the action-type dispatch anyway.
Therefore fix this issue by simply reordering the two lines.
Fixes: 74522e7baae2 ("net: sched: set the hw_stats_type in pedit loop")
Signed-off-by: Petr Machata <petrm@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Ido Schimmel says:
====================
mlxsw: spectrum_cnt: Expose counter resources
Jiri says:
Capacity and utilization of existing flow and RIF counters are currently
unavailable to be seen by the user. Use the existing devlink resources
API to expose the information:
$ sudo devlink resource show pci/0000:00:10.0 -v
pci/0000:00:10.0:
name kvd resource_path /kvd size 524288 unit entry dpipe_tables none
name span_agents resource_path /span_agents size 8 occ 0 unit entry dpipe_tables none
name counters resource_path /counters size 79872 occ 44 unit entry dpipe_tables none
resources:
name flow resource_path /counters/flow size 61440 occ 4 unit entry dpipe_tables none
name rif resource_path /counters/rif size 18432 occ 40 unit entry dpipe_tables none
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add tests for mlxsw hw_stats types.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Implement occupancy counting for counters and expose over devlink
resource API.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Put all init operations related to subpools into
mlxsw_sp_counter_sub_pools_init().
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Move the validation of subpools configuration, to avoid possible over
commitment to resource registration. Add WARN_ON to indicate bug
in the code.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Implement devlink resources support for counter pools. Move the subpool
sizes calculations into the new resources register function.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
query entry size
Add new field to subpool struct that would indicate which
resource id should be used to query the entry size for
the subpool from the device.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Currently, the global static array of subpools is used. Make it
per-instance as multiple instances of the mlxsw driver can have
different values.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
counter count
With the change that made the code to query counter bank size from device
instead of using hard-coded value, the number of available counters
changed for Spectrum-2. Adjust the limit in the selftests.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The bank size is different between Spectrum versions. Also it is
a resource that can be queried. So instead of hard coding the value in
code, query it from the firmware.
Signed-off-by: Jiri Pirko <jiri@mellanox.com>
Signed-off-by: Ido Schimmel <idosch@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Chelsio NICs have 3 filter regions, in following order of priority:
1. High Priority (HPFILTER) region (Highest Priority).
2. HASH region.
3. Normal FILTER region (Lowest Priority).
Currently, there's a 1-to-1 mapping between the prio value passed
by TC and the filter region index. However, it's possible to have
multiple TC rules with the same prio value. In this case, if a region
is exhausted, no attempt is made to try inserting the rule in the
next available region.
So, rework and remove the 1-to-1 mapping. Instead, dynamically select
the region to insert the filter rule, as long as the new rule's prio
value doesn't conflict with existing rules across all the 3 regions.
Signed-off-by: Rahul Lakkireddy <rahul.lakkireddy@chelsio.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
This reverts the following commits:
8537f78647c0 ("netfilter: Introduce egress hook")
5418d3881e1f ("netfilter: Generalize ingress hook")
b030f194aed2 ("netfilter: Rename ingress hook include file")
>From the discussion in [0], the author's main motivation to add a hook
in fast path is for an out of tree kernel module, which is a red flag
to begin with. Other mentioned potential use cases like NAT{64,46}
is on future extensions w/o concrete code in the tree yet. Revert as
suggested [1] given the weak justification to add more hooks to critical
fast-path.
[0] https://lore.kernel.org/netdev/cover.1583927267.git.lukas@wunner.de/
[1] https://lore.kernel.org/netdev/20200318.011152.72770718915606186.davem@davemloft.net/
Signed-off-by: Daniel Borkmann <daniel@iogearbox.net>
Cc: David Miller <davem@davemloft.net>
Cc: Pablo Neira Ayuso <pablo@netfilter.org>
Cc: Alexei Starovoitov <ast@kernel.org>
Nacked-by: Pablo Neira Ayuso <pablo@netfilter.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Julian Wiedmann says:
====================
s390/qeth: updates 2020-03-18
please apply the following patch series for qeth to netdev's net-next
tree.
This consists of three parts:
1) support for __GFP_MEMALLOC,
2) several ethtool enhancements (.set_channels, SW Timestamping),
3) the usual cleanups.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
To check whether a netdevice has already been registered, look at
NETREG_REGISTERED to replace some hacks I added a while ago.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
qeth_do_ioctl() is only reached through our own net_device_ops, so we
can trust that dev->ml_priv still contains what we put there earlier.
qeth_bridgeport_an_set() is an internal function that doesn't require
such sanity checks.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Data addresses in the AOB are absolute, and need to be translated before
being fed into kmem_cache_free(). Currently this phys_to_virt() is a no-op.
Also see commit 2db01da8d25f ("s390/qdio: fill SBALEs with absolute addresses").
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Versions are meaningless for an in-kernel driver.
Instead use the UTS_RELEASE that is set by ethtool_get_drvinfo().
Cc: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Reviewed-by: Leon Romanovsky <leonro@mellanox.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This adds support for SOF_TIMESTAMPING_TX_SOFTWARE.
No support for non-IQD devices, since they orphan the skb in their xmit
path.
To play nice with TX bulking, set the timestamp when the buffer that
contains the skb(s) is actually flushed out to HW.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
For ucast traffic, qeth_iqd_select_queue() falls back to
netdev_pick_tx(). This will potentially use skb_tx_hash() to distribute
the flow over all active TX queues - so txq 0 is a valid selection, and
qeth_iqd_select_queue() needs to check for this and put it on some other
queue. As a result, the distribution for ucast flows is unbalanced and
hits QETH_IQD_MIN_UCAST_TXQ heavier than the other queues.
Open-coding a custom variant of skb_tx_hash() isn't an option, since
netdev_pick_tx() also gives us eg. access to XPS. But we can pull a
little trick: add a single TC class that excludes the mcast txq, and
thus encourage skb_tx_hash() to not pick the mcast txq.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Similar to the support for z/VM NICs, but we need to take extra care
about the dedicated mcast queue:
1. netdev_pick_tx() is unaware of this limitation and might select the
mcast txq. Catch this.
2. require at least _two_ TX queues - one for ucast, one for mcast.
3. when reducing the number of TX queues, there's a potential race
where netdev_cap_txqueue() over-rules the selected txq index and
falls back to index 0. This would place ucast traffic on the mcast
queue, and result in TX errors.
So for IQD, reject a reduction while the interface is running.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Add support for ETHTOOL_SCHANNELS to change the count of active
TX queues.
Since all TX queue structs are pre-allocated and -registered, we just
need to trivially adjust dev->real_num_tx_queues.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
z/VM NICs don't offer HW QoS for TX rings. So just use netdev_pick_tx()
to distribute the connections equally over all enabled TX queues.
We start with just 1 enabled TX queue (this matches the typical
configuration without prio-queueing). A follow-on patch will allow users
to enable additional TX queues.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When falling back to an allocation from the HW header cache, check if
the skb is eligible for using memory reserves.
This only makes a difference if the cache is empty and needs to be
refilled.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/ /
| |
| |
| |
| |
| |
| |
| | |
Use dev_alloc_page() for backing the RX buffers with pages. This way we
pick up __GFP_MEMALLOC.
Signed-off-by: Julian Wiedmann <jwi@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pablo Neira Ayuso says:
====================
Netfilter updates for net-next
The following patchset contains Netfilter updates for net-next:
1) Use nf_flow_offload_tuple() to fetch flow stats, from Paul Blakey.
2) Add new xt_IDLETIMER hard mode, from Manoj Basapathi.
Follow up patch to clean up this new mode, from Dan Carpenter.
3) Add support for geneve tunnel options, from Xin Long.
4) Make sets built-in and remove modular infrastructure for sets,
from Florian Westphal.
5) Remove unused TEMPLATE_NULLS_VAL, from Li RongQing.
6) Statify nft_pipapo_get, from Chen Wandun.
7) Use C99 flexible-array member, from Gustavo A. R. Silva.
8) More descriptive variable names for bitwise, from Jeremy Sowden.
9) Four patches to add tunnel device hardware offload to the flowtable
infrastructure, from wenxu.
10) pipapo set supports for 8-bit grouping, from Stefano Brivio.
11) pipapo can switch between nibble and byte grouping, also from
Stefano.
12) Add AVX2 vectorized version of pipapo, from Stefano Brivio.
13) Update pipapo to be use it for single ranges, from Stefano.
14) Add stateful expression support to elements via control plane,
eg. counter per element.
15) Re-visit sysctls in unprivileged namespaces, from Florian Westphal.
15) Add new egress hook, from Lukas Wunner.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Commit e687ad60af09 ("netfilter: add netfilter ingress hook after
handle_ing() under unique static key") introduced the ability to
classify packets on ingress.
Allow the same on egress. Position the hook immediately before a packet
is handed to tc and then sent out on an interface, thereby mirroring the
ingress order. This order allows marking packets in the netfilter
egress hook and subsequently using the mark in tc. Another benefit of
this order is consistency with a lot of existing documentation which
says that egress tc is performed after netfilter hooks.
Egress hooks already exist for the most common protocols, such as
NF_INET_LOCAL_OUT or NF_ARP_OUT, and those are to be preferred because
they are executed earlier during packet processing. However for more
exotic protocols, there is currently no provision to apply netfilter on
egress. A common workaround is to enslave the interface to a bridge and
use ebtables, or to resort to tc. But when the ingress hook was
introduced, consensus was that users should be given the choice to use
netfilter or tc, whichever tool suits their needs best:
https://lore.kernel.org/netdev/20150430153317.GA3230@salvia/
This hook is also useful for NAT46/NAT64, tunneling and filtering of
locally generated af_packet traffic such as dhclient.
There have also been occasional user requests for a netfilter egress
hook in the past, e.g.:
https://www.spinics.net/lists/netfilter/msg50038.html
Performance measurements with pktgen surprisingly show a speedup rather
than a slowdown with this commit:
* Without this commit:
Result: OK: 34240933(c34238375+d2558) usec, 100000000 (60byte,0frags)
2920481pps 1401Mb/sec (1401830880bps) errors: 0
* With this commit:
Result: OK: 33997299(c33994193+d3106) usec, 100000000 (60byte,0frags)
2941410pps 1411Mb/sec (1411876800bps) errors: 0
* Without this commit + tc egress:
Result: OK: 39022386(c39019547+d2839) usec, 100000000 (60byte,0frags)
2562631pps 1230Mb/sec (1230062880bps) errors: 0
* With this commit + tc egress:
Result: OK: 37604447(c37601877+d2570) usec, 100000000 (60byte,0frags)
2659259pps 1276Mb/sec (1276444320bps) errors: 0
* With this commit + nft egress:
Result: OK: 41436689(c41434088+d2600) usec, 100000000 (60byte,0frags)
2413320pps 1158Mb/sec (1158393600bps) errors: 0
Tested on a bare-metal Core i7-3615QM, each measurement was performed
three times to verify that the numbers are stable.
Commands to perform a measurement:
modprobe pktgen
echo "add_device lo@3" > /proc/net/pktgen/kpktgend_3
samples/pktgen/pktgen_bench_xmit_mode_queue_xmit.sh -i 'lo@3' -n 100000000
Commands for testing tc egress:
tc qdisc add dev lo clsact
tc filter add dev lo egress protocol ip prio 1 u32 match ip dst 4.3.2.1/32
Commands for testing nft egress:
nft add table netdev t
nft add chain netdev t co \{ type filter hook egress device lo priority 0 \; \}
nft add rule netdev t co ip daddr 4.3.2.1/32 drop
All testing was performed on the loopback interface to avoid distorting
measurements by the packet handling in the low-level Ethernet driver.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Prepare for addition of a netfilter egress hook by generalizing the
ingress hook introduced by commit e687ad60af09 ("netfilter: add
netfilter ingress hook after handle_ing() under unique static key").
In particular, rename and refactor the ingress hook's static inlines
such that they can be reused for an egress hook.
No functional change intended.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Prepare for addition of a netfilter egress hook by renaming
<linux/netfilter_ingress.h> to <linux/netfilter_netdev.h>.
The egress hook also necessitates a refactoring of the include file,
but that is done in a separate commit to ease reviewing.
No functional change intended.
Signed-off-by: Lukas Wunner <lukas@wunner.de>
Cc: Daniel Borkmann <daniel@iogearbox.net>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
since commit b884fa46177659 ("netfilter: conntrack: unify sysctl handling")
conntrack no longer exposes most of its sysctls (e.g. tcp timeouts
settings) to network namespaces that are not owned by the initial user
namespace.
This patch exposes all sysctls even if the namespace is unpriviliged.
compared to a 4.19 kernel, the newly visible and writeable sysctls are:
net.netfilter.nf_conntrack_acct
net.netfilter.nf_conntrack_timestamp
.. to allow to enable accouting and timestamp extensions.
net.netfilter.nf_conntrack_events
.. to turn off conntrack event notifications.
net.netfilter.nf_conntrack_checksum
.. to disable checksum validation.
net.netfilter.nf_conntrack_log_invalid
.. to enable logging of packets deemed invalid by conntrack.
newly visible sysctls that are only exported as read-only:
net.netfilter.nf_conntrack_count
.. current number of conntrack entries living in this netns.
net.netfilter.nf_conntrack_max
.. global upperlimit (maximum size of the table).
net.netfilter.nf_conntrack_buckets
.. size of the conntrack table (hash buckets).
net.netfilter.nf_conntrack_expect_max
.. maximum number of permitted expectations in this netns.
net.netfilter.nf_conntrack_helper
.. conntrack helper auto assignment.
Signed-off-by: Florian Westphal <fw@strlen.de>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | |
| | |
| | |
| | |
| | |
| | | |
If the set element comes with an stateful expression, update it.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
This helper function runs the eval path of the stateful expression
of an existing set element.
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|