linux - linux

	Commit message (Collapse)	Author	Age	Files	Lines
*	IPv4: Add "offload failed" indication to routes	Amit Cohen	2021-02-09	2	-0/+3
	After installing a route to the kernel, user space receives an acknowledgment, which means the route was installed in the kernel, but not necessarily in hardware.
	The asynchronous nature of route installation in hardware can lead to a routing daemon advertising a route before it was actually installed in hardware. This can result in packet loss or mis-routed packets until the route is installed in hardware.
	To avoid such cases, a previous patch set added the ability to emit RTM_NEWROUTE notifications whenever the RTM_F_OFFLOAD/RTM_F_TRAP flags change; this behavior is controlled by sysctl.
	With the above-mentioned behavior, user space can tell whether the route was offloaded, but if the offload fails there is no indication to user space. Following a failure, a routing daemon will wait indefinitely for a notification that will never come.
	This patch adds an "offload_failed" indication to IPv4 routes, so that users will have better visibility into the offload process.
	'struct fib_alias' and 'struct fib_rt_info' are extended with a new field that indicates whether route offload failed. Note that the new field uses an unused bit, so the structs do not grow.
	Signed-off-by: Amit Cohen <amcohen@nvidia.com>
	Signed-off-by: Ido Schimmel <idosch@nvidia.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
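	To make the user-space side of the message above concrete, here is a minimal sketch of how a routing daemon might classify a route from the rtm_flags of an RTM_NEWROUTE notification. It assumes the failure is surfaced as the RTM_F_OFFLOAD_FAILED route flag, which this commit message does not name; the numeric values and the "pending" label are assumptions to check against your own kernel headers.

```python
# Flag values as found in include/uapi/linux/rtnetlink.h; treat them as
# assumptions and check them against your own kernel headers.
RTM_F_OFFLOAD = 0x4000
RTM_F_TRAP = 0x8000
RTM_F_OFFLOAD_FAILED = 0x20000000


def offload_state(rtm_flags: int) -> str:
    """Classify a route by the offload-related bits of its rtm_flags."""
    if rtm_flags & RTM_F_OFFLOAD_FAILED:
        return "failed"      # hardware rejected the route: stop waiting
    if rtm_flags & RTM_F_OFFLOAD:
        return "offloaded"   # forwarding happens in hardware
    if rtm_flags & RTM_F_TRAP:
        return "trap"        # hardware traps matching packets to the CPU
    return "pending"         # hypothetical label: no offload bit seen yet
```

	With a "failed" state available, a daemon no longer has to wait indefinitely: it can withdraw or withhold the advertisement as soon as the failure is reported.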
*	Merge tag 'mlx5-updates-2021-02-04' of ↵	David S. Miller	2021-02-09	24	-1057/+3783
	git://git.kernel.org/pub/scm/linux/kernel/git/saeed/linux mlx5-updates-2021-02-04
	Vlad Buslov says:
	=================
	Implement support for VF tunneling
	Abstract
	Currently, mlx5 only supports configuration with tunnel endpoint IP address on uplink representor. Remove implicit and explicit assumptions of tunnel always being terminated on uplink and implement necessary infrastructure for configuring tunnels on VF representors and updating rules on such tunnels according to routing changes.
	SW TC model
	From TC perspective VF tunnel configuration requires two rules in both directions:
	TX rules
	1.
Rule that redirects packets from UL to VF rep that has the tunnel endpoint IP address: $ tc -s filter show dev enp8s0f0 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac 16:c9:a0:2d:69:2c src_mac 0c:42:a1:58:ab:e4 eth_type ipv4 ip_flags nofrag in_hw in_hw_count 1 action order 1: mirred (Egress Redirect to device enp8s0f0_0) stolen index 3 ref 1 bind 1 installed 377 sec used 0 sec Action statistics: Sent 114096 bytes 952 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 114096 bytes 952 pkt backlog 0b 0p requeues 0 cookie 878fa48d8c423fc08c3b6ca599b50a97 no_percpu used_hw_stats delayed 2. Rule that decapsulates the tunneled flow and redirects to destination VF representor: $ tc -s filter show dev vxlan_sys_4789 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac ca:2e:a7:3f:f5:0f src_mac 0a:40:bd:30:89:99 eth_type ipv4 enc_dst_ip 7.7.7.5 enc_src_ip 7.7.7.1 enc_key_id 98 enc_dst_port 4789 enc_tos 0 ip_flags nofrag in_hw in_hw_count 1 action order 1: tunnel_key unset pipe index 2 ref 1 bind 1 installed 434 sec used 434 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 used_hw_stats delayed action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen index 4 ref 1 bind 1 installed 434 sec used 0 sec Action statistics: Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 129936 bytes 1082 pkt backlog 0b 0p requeues 0 cookie ac17cf398c4c69e4a5b2f7aabd1b88ff no_percpu used_hw_stats delayed RX rules 1. 
Rule that encapsulates the tunneled flow and redirects packets from source VF rep to tunnel device: $ tc -s filter show dev enp8s0f0_1 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac 0a:40:bd:30:89:99 src_mac ca:2e:a7:3f:f5:0f eth_type ipv4 ip_tos 0/0x3 ip_flags nofrag in_hw in_hw_count 1 action order 1: tunnel_key set src_ip 7.7.7.5 dst_ip 7.7.7.1 key_id 98 dst_port 4789 nocsum ttl 64 pipe index 1 ref 1 bind 1 installed 411 sec used 411 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 no_percpu used_hw_stats delayed action order 2: mirred (Egress Redirect to device vxlan_sys_4789) stolen index 1 ref 1 bind 1 installed 411 sec used 0 sec Action statistics: Sent 5615833 bytes 4028 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 5615833 bytes 4028 pkt backlog 0b 0p requeues 0 cookie bb406d45d343bf7ade9690ae80c7cba4 no_percpu used_hw_stats delayed 2. 
Rule that redirects from tunnel device to UL rep: $ tc -s filter show dev vxlan_sys_4789 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac ca:2e:a7:3f:f5:0f src_mac 0a:40:bd:30:89:99 eth_type ipv4 enc_dst_ip 7.7.7.5 enc_src_ip 7.7.7.1 enc_key_id 98 enc_dst_port 4789 enc_tos 0 ip_flags nofrag in_hw in_hw_count 1 action order 1: tunnel_key unset pipe index 2 ref 1 bind 1 installed 434 sec used 434 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 used_hw_stats delayed action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen index 4 ref 1 bind 1 installed 434 sec used 0 sec Action statistics: Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 129936 bytes 1082 pkt backlog 0b 0p requeues 0 cookie ac17cf398c4c69e4a5b2f7aabd1b88ff no_percpu used_hw_stats delayed
	HW offloads model
	For hardware offload the goal is to match packet on both rules without exposing it to software on tunnel endpoint VF. In order to achieve this for tx, TC implementation marks encap rules with tunnel endpoint on mlx5 VF of same eswitch with MLX5_ESW_DEST_CHAIN_WITH_SRC_PORT_CHANGE flag and adds header modification rule to overwrite packet source port to the value of tunnel VF. Eswitch code is modified to recirculate such packets after source port value is changed, which allows second tx rules to match.
	For rx path indirect table infrastructure is used to allow fully processing VF tunnel traffic in hardware. To implement such pipeline driver needs to program the hardware after matching on UL rule to overwrite source vport from UL to tunnel VF and recirculate the packet to the root table to allow matching on the rule installed on tunnel VF. For this, indirect table matches all encapsulated traffic by tunnel parameters and all other IP traffic is sent to tunnel VF by the miss rule.
Such configuration will cause packet to appear on VF representor instead of VF itself if packet has been matched by indirect table rule based on tunnel parameters but missed on second rule (after recirculation). Handle such case by marking packets processed by indirect table with special 0xFFF value in reg_c1 and extending slow table with additional flow group that matches on reg_c0 (source port value set by indirect tables) and reg_c1 (special 0xFFF mark). When creating offloads fdb tables, install one rule per VF vport to match on recirculated miss packets and redirect them to appropriate VF vport.
	Routing events
	In order to support routing changes and migration of tunnel device between different endpoint VFs, implement routing infrastructure and update it with FIB events. Routing entry table is introduced to mlx5 TC. Every rx and tx VF tunnel rule is attached to a routing entry, which is shared for rules of same tunnel. On FIB event the work is scheduled to delete/recreate all rules of affected tunnel.
	Note: only vxlan tunnel type is supported by this series.
	=================
\| *	net/mlx5e: Handle FIB events to update tunnel endpoint device	Vlad Buslov	2021-02-06	7	-67/+773
	Process FIB route update events to dynamically update the stack device rules when tunnel routing changes. Use rtnl lock to prevent FIB event handler from running concurrently with neigh update and neigh stats workqueue tasks. Use encap_tbl_lock mutex to synchronize with TC rule update path that doesn't use rtnl lock.
	FIB event workflow for encap flows:
	- Unoffload all flows attached to route encaps from slow or fast path depending on encap destination endpoint neigh state.
	- Update encap IP header according to new route dev.
	- Update flows mod_hdr action that is responsible for overwriting reg_c0 source port bits to source port of new underlying VF of new route dev. This step requires changing flow create/delete code to save flow parse attribute mod_hdr_acts structure for whole flow lifetime instead of deallocating it after flow creation. Refactor mod_hdr code to allow saving id of individual mod_hdr actions and updating them with dedicated helper.
	- Offload all flows to either slow or fast path depending on encap destination endpoint neigh state.
	FIB event workflow for decap flows:
	- Unoffload all route flows from hardware. When last route flow is deleted all indirect table rules for the route dev will also be deleted.
	- Update flow attr decap_vport and destination MAC according to underlying VF of new route dev.
	- Offload all route flows back to hardware creating new indirect table rules according to updated flow attribute data.
	Extract some neigh update code to helper functions to be used by both neigh update and route update infrastructure.
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
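	The encap-flow workflow described in the message above can be sketched as a four-step update over an in-memory model. This is only an illustration of the ordering (unoffload, update header, update source port, re-offload); the class and field names are hypothetical and none of the driver's locking is modeled.

```python
from dataclasses import dataclass, field


@dataclass
class Flow:
    name: str
    offloaded: bool = True
    path: str = "fast"       # "fast" or "slow" path, per neigh state
    src_vport: int = 0       # stands in for the reg_c0 source port bits


@dataclass
class Encap:
    route_dev: str
    neigh_valid: bool
    flows: list = field(default_factory=list)


def handle_route_update(encap: Encap, new_dev: str, new_vport: int) -> None:
    """Re-home every flow of one encap entry onto a new route device."""
    # 1. Unoffload all flows attached to the encap.
    for flow in encap.flows:
        flow.offloaded = False
    # 2. Update the encap header for the new route device.
    encap.route_dev = new_dev
    # 3. Rewrite the source-port value the flows will carry.
    for flow in encap.flows:
        flow.src_vport = new_vport
    # 4. Offload again, to the fast or slow path depending on neigh state.
    path = "fast" if encap.neigh_valid else "slow"
    for flow in encap.flows:
        flow.path = path
        flow.offloaded = True
```

	Keeping the flows unoffloaded while steps 2 and 3 run means hardware never sees a rule that mixes the old header with the new source port.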
\| *	net/mlx5e: Rename some encap-specific API to generic names	Vlad Buslov	2021-02-06	5	-9/+9
	Some of the encap-specific functions and fields will also be used by route update infrastructure in following patches. Rename them to generic names.
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
\| *	net/mlx5e: TC preparation refactoring for routing update event	Vlad Buslov	2021-02-06	5	-9/+288
	Following patch in series implements routing update event which requires ability to modify rule match_to_reg modify header actions dynamically during rule lifetime. In order to accommodate such behavior, refactor and extend TC infrastructure in following ways:
	- Modify mod_hdr infrastructure to preserve its parse attribute for whole rule lifetime, instead of deallocating it after rule creation.
	- Extend match_to_reg infrastructure with new function mlx5e_tc_match_to_reg_set_and_get_id() that returns mod_hdr action id that can be used afterwards to update the action, and mlx5e_tc_match_to_reg_mod_hdr_change() that can modify existing actions by its id.
	- Extend tun API with new functions mlx5e_tc_tun_update_header_ipv{4|6}() that are used to update existing encap entry tunnel header.
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
\| *	net/mlx5e: Refactor neigh update infrastructure	Vlad Buslov	2021-02-06	9	-31/+35
	Following patches in series implement route update which can cause encap entries to migrate between routing devices. Consequently, their parent nhe's need to be also transferable between devices instead of having neigh device as a part of their immutable key. Move neigh device from struct mlx5_neigh to struct mlx5e_neigh_hash_entry and check that nhe and neigh devices are the same in workqueue neigh update handler. Save neigh net_device that can change dynamically in dedicated nhe->dev field.
	With FIB event handler that is implemented in following patches changing nhe->dev, NETEVENT_DELAY_PROBE_TIME_UPDATE handler can concurrently access the nhe entry when traversing neigh list under rcu read lock. Processing stale values in that handler doesn't change the handler logic, so just wrap all accesses to the dev pointer in {WRITE|READ}_ONCE() helpers.
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
\| *	net/mlx5e: Create route entry infrastructure	Vlad Buslov	2021-02-06	7	-11/+290
	Implement dedicated route entry infrastructure to be used in following patch by route update event. Both encap (indirectly through their corresponding encap entries) and decap (directly) flows are attached to routing entry. Since route update also requires updating encap (route device MAC address is a source MAC address of tunnel encapsulation), same encap_tbl_lock mutex is used for synchronization.
	The new infrastructure looks similar to existing infrastructures for shared encap, mod_hdr and hairpin entries:
	- Per-eswitch hash table is used for quick entry lookup.
	- Flows are attached to per-entry linked list and hold reference to entry during their lifetime.
	- Atomic reference counting and rcu mechanisms are used as synchronization primitives for concurrent access.
	The infrastructure also enables connection tracking on stacked devices topology by attaching CT chain 0 flow on tunneling dev to decap route entry.
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
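	The shape of the route entry infrastructure described above (hash table lookup, per-entry flow list, reference held per attached flow) can be modeled in a few lines. This is a single-threaded sketch with hypothetical names and a hypothetical key; the atomic reference counting and RCU protection mentioned in the message are deliberately left out.

```python
class RouteTable:
    """Toy model of a shared-entry table: one entry per route key."""

    def __init__(self):
        self._entries = {}   # key -> {"refcnt": int, "flows": list}

    def attach(self, key, flow):
        """Look up or create the entry, take a reference, link the flow."""
        entry = self._entries.setdefault(key, {"refcnt": 0, "flows": []})
        entry["refcnt"] += 1
        entry["flows"].append(flow)
        return entry

    def detach(self, key, flow):
        """Unlink the flow, drop its reference, free the entry at zero."""
        entry = self._entries[key]
        entry["flows"].remove(flow)
        entry["refcnt"] -= 1
        if entry["refcnt"] == 0:
            del self._entries[key]

    def flows(self, key):
        """Flows a route update event for this key would have to revisit."""
        entry = self._entries.get(key)
        return list(entry["flows"]) if entry else []
```

	Because every rule of the same tunnel shares one entry, a single lookup on a route event yields the full list of flows that need to be deleted and recreated.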
\| *	net/mlx5e: Extract tc tunnel encap/decap code to dedicated file	Vlad Buslov	2021-02-06	7	-885/+947
	Following patches in series extend the extracted code with routing infrastructure. To improve code modularity create a dedicated tc_tun_encap.c source file and move encap/decap related code to the new file. Export code that is used by both regular TC code and encap/decap code into tc_priv.h (new header intended to be used only by TC module). Rename some exported functions by adding "mlx5e_" prefix to their names.
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
\| *	net/mlx5e: Match recirculated packet miss in slow table using reg_c1	Vlad Buslov	2021-02-06	4	-7/+134
	Previous patch in series that implements stack devices RX path implements indirect table rules that match on tunnel VNI. After such rule is created all tunnel traffic is recirculated to root table. However, recirculated packet might not match on any rules installed in the table (for example, when IP traffic follows ARP traffic). In that case packets appear on representor of tunnel endpoint VF instead of being redirected to the VF itself.
	Extend slow table with additional flow group that matches on reg_c0 (source port value set by indirect tables implemented by previous patch in series) and reg_c1 (special 0xFFF mark). When creating offloads fdb tables, install one rule per VF vport to match on recirculated miss packets and redirect them to appropriate VF vport. Modify indirect tables code to also rewrite reg_c1 with special 0xFFF mark.
	Implementation reuses reg_c1 tunnel id bits. This is safe to do because recirculated packets are always matched before decapsulation.
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
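	A rough model of the extra slow-table group described above: one rule per VF vport, keyed on the source port in reg_c0 together with the 0xFFF mark in reg_c1. The function names are hypothetical, and reg_c0 is simplified to hold only the vport number, which is an assumption of this sketch rather than the real register layout.

```python
RECIRC_MISS_MARK = 0xFFF   # special mark the indirect table writes to reg_c1


def build_miss_rules(vf_vports):
    """Install one (reg_c0, reg_c1) -> vport rule per VF vport."""
    return {(vport, RECIRC_MISS_MARK): vport for vport in vf_vports}


def slow_table_lookup(rules, reg_c0, reg_c1):
    """Return the VF vport to redirect to, or None for the regular slow path."""
    return rules.get((reg_c0, reg_c1))
```

	A packet that was recirculated by the indirect table but missed afterwards carries the mark, so it is sent to the VF itself; a packet without the mark falls through to the normal slow path.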
\| *	net/mlx5e: Refactor reg_c1 usage	Vlad Buslov	2021-02-06	4	-9/+7
	Following patch in series uses reg_c1 in eswitch code. To use reg_c1 helpers in both TC and eswitch code, refactor existing helpers according to similar use case of reg_c0 and move the functionality into eswitch.h. Calculate reg mappings length from new defines to ensure that they are always in sync and only need to be changed in single place.
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
\| *	net/mlx5e: VF tunnel RX traffic offloading	Vlad Buslov	2021-02-06	5	-8/+271
	When tunnel endpoint is on VF the encapsulated RX traffic is exposed on the representor of the VF without any further processing of rules installed on the VF. Detect such case by checking if the device returned by route lookup in decap rule handling code is a mlx5 VF and handle it with new redirection tables API.
	Example TC rules for VF tunnel traffic:
	1. Rule that encapsulates the tunneled flow and redirects packets from source VF rep to tunnel device: $ tc -s filter show dev enp8s0f0_1 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac 0a:40:bd:30:89:99 src_mac ca:2e:a7:3f:f5:0f eth_type ipv4 ip_tos 0/0x3 ip_flags nofrag in_hw in_hw_count 1 action order 1: tunnel_key set src_ip 7.7.7.5 dst_ip 7.7.7.1 key_id 98 dst_port 4789 nocsum ttl 64 pipe index 1 ref 1 bind 1 installed 411 sec used 411 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 no_percpu used_hw_stats delayed action order 2: mirred (Egress Redirect to device vxlan_sys_4789) stolen index 1 ref 1 bind 1 installed 411 sec used 0 sec Action statistics: Sent 5615833 bytes 4028 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 5615833 bytes 4028 pkt backlog 0b 0p requeues 0 cookie bb406d45d343bf7ade9690ae80c7cba4 no_percpu used_hw_stats delayed
	2.
Rule that redirects from tunnel device to UL rep: $ tc -s filter show dev vxlan_sys_4789 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac ca:2e:a7:3f:f5:0f src_mac 0a:40:bd:30:89:99 eth_type ipv4 enc_dst_ip 7.7.7.5 enc_src_ip 7.7.7.1 enc_key_id 98 enc_dst_port 4789 enc_tos 0 ip_flags nofrag in_hw in_hw_count 1 action order 1: tunnel_key unset pipe index 2 ref 1 bind 1 installed 434 sec used 434 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 used_hw_stats delayed action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen index 4 ref 1 bind 1 installed 434 sec used 0 sec Action statistics: Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 129936 bytes 1082 pkt backlog 0b 0p requeues 0 cookie ac17cf398c4c69e4a5b2f7aabd1b88ff no_percpu used_hw_stats delayed Co-developed-by: Dmytro Linkin <dlinkin@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
\| *	net/mlx5e: Remove redundant match on tunnel destination mac	Vlad Buslov	2021-02-06	1	-8/+0
	Remove hardcoded match on tunnel destination MAC address. Such match is no longer required and would be wrong for stacked devices topology where encapsulation destination MAC address will be the address of tunnel VF that can change dynamically on route change (implemented in following patches in the series).
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
\| *	net/mlx5: E-Switch, Indirect table infrastructure	Vlad Buslov	2021-02-06	6	-0/+616
	Indirect table infrastructure is used to allow fully processing VF tunnel traffic in hardware. Kernel software model uses two TC rules for such traffic: UL rep to tunnel device, then tunnel VF rep to destination VF rep. To implement such pipeline driver needs to program the hardware after matching on UL rule to overwrite source vport from UL to tunnel VF and recirculate the packet to the root table to allow matching on the rule installed on tunnel VF. For this indirect table matches all encapsulated traffic by tunnel parameters and all other IP traffic is sent to tunnel VF by the miss rule.
	Indirect table API overview:
	- mlx5_esw_indir_table_{init|destroy}() - init and destroy opaque indirect table object.
	- mlx5_esw_indir_table_get() - get or create new table according to vport id and IP version. Table has following pre-created groups: recirculation group with match on ethertype and VNI (rules that match encapsulated packets are installed to this group) and forward group with default/miss rule that forwards to vport of tunnel endpoint VF (rule for regular non-encapsulated packets).
	- mlx5_esw_indir_table_put() - decrease reference to the indirect table and matching rule (for encapsulated traffic).
	- mlx5_esw_indir_table_needed() - check that in_port is an uplink port and out_port is VF on the same eswitch, verify that the rule is for IP traffic and source port rewrite functionality can be used.
	- mlx5_esw_indir_table_decap_vport() - function returns decap vport of flow attribute.
	Co-developed-by: Dmytro Linkin <dlinkin@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
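	The two pre-created groups described above (recirculation rules matching ethertype and VNI, plus a default forward rule) can be pictured with a small lookup model. The class and method names are hypothetical and do not mirror the mlx5_esw_indir_table_*() signatures; the VNI value 98 is borrowed from the TC examples elsewhere in this log.

```python
ETH_P_IP = 0x0800   # IPv4 ethertype


class IndirTable:
    """Toy indirect table for one tunnel endpoint VF vport."""

    def __init__(self, vf_vport):
        self.vf_vport = vf_vport
        self.recirc_rules = set()   # (ethertype, vni) pairs

    def add_recirc_rule(self, ethertype, vni):
        """Install a rule in the recirculation group."""
        self.recirc_rules.add((ethertype, vni))

    def lookup(self, ethertype, vni=None):
        """Encapsulated traffic recirculates; everything else hits the miss rule."""
        if (ethertype, vni) in self.recirc_rules:
            # Source vport is rewritten to the VF, packet returns to root table.
            return ("recirculate", self.vf_vport)
        # Default/miss rule: forward to the tunnel endpoint VF.
        return ("forward", self.vf_vport)
```

	In both outcomes the packet ends up associated with the tunnel VF; the difference is whether it gets a second pass through the root table, where the rule installed on the tunnel VF can match.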
\| *	net/mlx5e: Refactor tun routing helpers	Vlad Buslov	2021-02-06	1	-109/+126
	Refactor tun routing helpers to use dedicated struct mlx5e_tc_tun_route_attr instead of multiple output arguments. This simplifies the callers (no need to keep track of bunch of output param pointers) and allows to unify struct release code in new mlx5e_tc_tun_route_attr_cleanup() helper instead of requiring callers to manually release some of the output parameters that require it. Simplify code by unifying error handling at the end of the function and rearranging code. Remove redundant empty line.
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
\| *	net/mlx5e: VF tunnel TX traffic offloading	Vlad Buslov	2021-02-06	4	-11/+199
	When tunnel endpoint is on VF, driver still assumes that endpoint is on uplink and incorrectly configures encap rule offload according to that assumption. As a result, traffic is sent directly to the uplink and rules installed on representor of tunnel endpoint VF are ignored. Implement following changes to allow offloading tx traffic with tunnel endpoint on VF:
	- For tunneling flows perform route lookup on route and out devices pair. If out device is uplink and route device is VF of same physical port, then modify packet reg_c_0 metadata register (source port) with the value of VF vport. Use eswitch vhca_id->vport mapping introduced in one of previous patches in the series to obtain vport from route netdevice.
	- Recirculate encapsulated packets to VF vport in order to apply any flow rules installed on VF representor that match on encapsulated traffic.
	Only enable support for this functionality when all following conditions are true:
	- Hardware advertises capability to preserve reg_c_0 value on packet recirculation.
	- Vport metadata matching is enabled.
	- Termination tables are to be used by the flow.
	Example TC rules for VF tunnel traffic:
	1.
Rule that redirects packets from UL to VF rep that has the tunnel endpoint IP address: $ tc -s filter show dev enp8s0f0 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac 16:c9:a0:2d:69:2c src_mac 0c:42:a1:58:ab:e4 eth_type ipv4 ip_flags nofrag in_hw in_hw_count 1 action order 1: mirred (Egress Redirect to device enp8s0f0_0) stolen index 3 ref 1 bind 1 installed 377 sec used 0 sec Action statistics: Sent 114096 bytes 952 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 114096 bytes 952 pkt backlog 0b 0p requeues 0 cookie 878fa48d8c423fc08c3b6ca599b50a97 no_percpu used_hw_stats delayed 2. Rule that decapsulates the tunneled flow and redirects to destination VF representor: $ tc -s filter show dev vxlan_sys_4789 ingress filter protocol ip pref 4 flower chain 0 filter protocol ip pref 4 flower chain 0 handle 0x1 dst_mac ca:2e:a7:3f:f5:0f src_mac 0a:40:bd:30:89:99 eth_type ipv4 enc_dst_ip 7.7.7.5 enc_src_ip 7.7.7.1 enc_key_id 98 enc_dst_port 4789 enc_tos 0 ip_flags nofrag in_hw in_hw_count 1 action order 1: tunnel_key unset pipe index 2 ref 1 bind 1 installed 434 sec used 434 sec Action statistics: Sent 0 bytes 0 pkt (dropped 0, overlimits 0 requeues 0) backlog 0b 0p requeues 0 used_hw_stats delayed action order 2: mirred (Egress Redirect to device enp8s0f0_1) stolen index 4 ref 1 bind 1 installed 434 sec used 0 sec Action statistics: Sent 129936 bytes 1082 pkt (dropped 0, overlimits 0 requeues 0) Sent software 0 bytes 0 pkt Sent hardware 129936 bytes 1082 pkt backlog 0b 0p requeues 0 cookie ac17cf398c4c69e4a5b2f7aabd1b88ff no_percpu used_hw_stats delayed Co-developed-by: Dmytro Linkin <dlinkin@nvidia.com> Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com> Signed-off-by: Vlad Buslov <vladbu@nvidia.com> Reviewed-by: Roi Dayan <roid@nvidia.com> Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
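	The decision logic in the "VF tunnel TX traffic offloading" message above reduces to two predicates: whether a flow needs the source-port rewrite at all, and whether the feature may be enabled. A sketch with hypothetical function and parameter names, taking the three listed conditions as plain booleans:

```python
def needs_src_port_rewrite(out_is_uplink: bool,
                           route_is_vf: bool,
                           same_physical_port: bool) -> bool:
    """True when the tunnel endpoint sits on a VF behind the same uplink."""
    return all((out_is_uplink, route_is_vf, same_physical_port))


def vf_tunnel_tx_supported(reg_c0_preserved: bool,
                           vport_metadata_matching: bool,
                           uses_termination_table: bool) -> bool:
    """All three conditions from the commit message must hold."""
    return all((reg_c0_preserved,
                vport_metadata_matching,
                uses_termination_table))
```

	Only when both predicates hold would a flow take the rewrite-and-recirculate path; otherwise it is treated as a regular uplink tunnel flow.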
\| *	net/mlx5: E-Switch, Refactor rule offload forward action processing	Vlad Buslov	2021-02-06	1	-60/+129
	Following patches in the series extend forwarding functionality with VF tunnel TX and RX handling. Extract action forwarding processing code into dedicated functions to simplify further extensions:
	- Handle every forwarding case with dedicated function instead of inline code.
	- Extract forwarding dest dispatch conditional into helper function esw_setup_dests().
	- Unify forwarding cleanup code in error path of mlx5_eswitch_add_offloaded_rule() and in rule deletion code of __mlx5_eswitch_del_rule() in new helper function esw_cleanup_dests() (dual to new esw_setup_dests() helper).
	This patch does not change functionality.
	Co-developed-by: Dmytro Linkin <dlinkin@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
\| *	net/mlx5e: Always set attr mdev pointer	Vlad Buslov	2021-02-06	1	-0/+2
	Eswitch offloads extensions in following patches in the series require attr->esw_attr->in_mdev pointer to always be set. This is already the case for all code paths except mlx5_tc_ct_entry_add_rule() function. Fix the function to assign mdev pointer with priv->mdev value.
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
\| *	net/mlx5e: E-Switch, Maintain vhca_id to vport_num mapping	Vlad Buslov	2021-02-06	5	-0/+119
	Following patches in the series need to be able to map VF netdev to vport. Since it is trivial to obtain vhca_id from netdev, maintain mapping from vhca_id to vport_num inside eswitch offloads using xarray. Provide function mlx5_eswitch_vhca_id_to_vport() to be used by TC code in following patches to obtain the mapping.
	Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
	Signed-off-by: Dmytro Linkin <dlinkin@nvidia.com>
	Reviewed-by: Roi Dayan <roid@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
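	The mapping described above is a plain integer-keyed lookup. A sketch using a dict in place of the kernel xarray; the class and method names are hypothetical, and returning None on a miss is a choice of this sketch, not the calling convention of mlx5_eswitch_vhca_id_to_vport().

```python
class VhcaMap:
    """Toy vhca_id -> vport_num map, maintained as vports come and go."""

    def __init__(self):
        self._map = {}

    def insert(self, vhca_id: int, vport_num: int) -> None:
        """Record the mapping when a vport is loaded."""
        if vhca_id in self._map:
            raise ValueError(f"vhca_id {vhca_id} is already mapped")
        self._map[vhca_id] = vport_num

    def erase(self, vhca_id: int) -> None:
        """Drop the mapping when the vport is unloaded."""
        self._map.pop(vhca_id, None)

    def to_vport(self, vhca_id: int):
        """Return the vport number, or None if the id is unknown."""
        return self._map.get(vhca_id)
```

	TC code that has a route netdevice in hand would first derive its vhca_id and then resolve the vport through such a map.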
\| *	net/mlx5: E-Switch, Refactor setting source port	Mark Bloch	2021-02-06	1	-7/+12
	Setting the source port requires only the E-Switch and vport number. Refactor the function to get those parameters instead of passing the full attribute.
	Signed-off-by: Mark Bloch <mbloch@nvidia.com>
	Reviewed-by: Saeed Mahameed <saeedm@nvidia.com>
	Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
* \|	cxgb4: remove unused vpd_cap_addr	Heiner Kallweit	2021-02-09	2	-3/+0
	It is likely that this is a leftover from T3 driver heritage. cxgb4 uses the PCI core VPD access code that handles detection of VPD capabilities.
	Reviewed-by: Alexander Duyck <alexanderduyck@fb.com>
	Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com>
	Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	nfc: st-nci: Remove unnecessary variable	wengjianfeng	2021-02-08	1	-2/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The variable r is initialized to 0 at the beginning of the function and is never reassigned before the function returns it. Therefore, we do not need to define the variable r, just return 0 directly at the end of the function. Signed-off-by: wengjianfeng <wengjianfeng@yulong.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* \|	Merge branch '100GbE' of ↵	Jakub Kicinski	2021-02-07	12	-242/+906
\|\ \ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue Tony Nguyen says: ==================== 100GbE Intel Wired LAN Driver Updates 2021-02-05 This series contains updates to ice driver only. Jake adds adds reporting of timeout length during devlink flash and implements support to report devlink info regarding the version of firmware that is stored (downloaded) to the device, but is not yet active. ice_devlink_info_get will report "stored" versions when there is no pending flash update. Version info includes the UNDI Option ROM, the Netlist module, and the fw.bundle_id. Gustavo A. R. Silva replaces a one-element array to flexible-array member. Bruce utilizes flex_array_size() helper and removes dead code on a check for a condition that can't occur. v2: * removed security revision implementation, and re-ordered patches to account for this removal * squashed patches implementing ice_read_flash_module to avoid patches refactoring the implementation of a previous patch in the series * modify ice_devlink_info_get to always report "stored" versions instead of only reporting them when a pending flash update is ready. 
* '100GbE' of git://git.kernel.org/pub/scm/linux/kernel/git/tnguy/next-queue: ice: remove dead code ice: use flex_array_size where possible ice: Replace one-element array with flexible-array member ice: display stored UNDI firmware version via devlink info ice: display stored netlist versions via devlink info ice: display some stored NVM versions via devlink info ice: introduce function for reading from flash modules ice: cache NVM module bank information ice: introduce context struct for info report ice: create flash_info structure and separate NVM version ice: report timeout length for erasing during devlink flash ==================== Link: https://lore.kernel.org/r/20210206044101.636242-1-anthony.l.nguyen@intel.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
\| * \|	ice: remove dead code	Bruce Allan	2021-02-05	1	-7/+1
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The check for a NULL pf pointer is moot since the earlier declaration and assignment of struct device *dev already de-referenced the pointer. Also, the only caller of ice_set_dflt_mib() already ensures pf is not NULL. Cc: Dave Ertman <david.m.ertman@intel.com> Reported-by: kernel test robot <lkp@intel.com> Reported-by: Dan Carpenter <dan.carpenter@oracle.com> Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| * \|	ice: use flex_array_size where possible	Bruce Allan	2021-02-05	2	-2/+2
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Use the flex_array_size() helper with the recently added flexible array members in structures. Signed-off-by: Bruce Allan <bruce.w.allan@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| * \|	ice: Replace one-element array with flexible-array member	Gustavo A. R. Silva	2021-02-05	2	-4/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is a regular need in the kernel to provide a way to declare having a dynamically sized set of trailing elements in a structure. Kernel code should always use “flexible array members”[1] for these cases. The older style of one-element or zero-length arrays should no longer be used[2]. Refactor the code according to the use of a flexible-array member in struct ice_res_tracker, instead of a one-element array and use the struct_size() helper to calculate the size for the allocations. Also, notice that the code below suggests that, currently, two too many bytes are being allocated with devm_kzalloc(), as the total number of entries (pf->irq_tracker->num_entries) for pf->irq_tracker->list[] is _vectors_ and sizeof(pf->irq_tracker) also includes the size of the one-element array _list_ in struct ice_res_tracker. drivers/net/ethernet/intel/ice/ice_main.c:3511: 3511 / populate SW interrupts pool with number of OS granted IRQs. */ 3512 pf->num_avail_sw_msix = (u16)vectors; 3513 pf->irq_tracker->num_entries = (u16)vectors; 3514 pf->irq_tracker->end = pf->irq_tracker->num_entries; With this change, the right amount of dynamic memory is now allocated because, contrary to one-element arrays which occupy at least as much space as a single object of the type, flexible-array members don't occupy such space in the containing structure. [1] https://en.wikipedia.org/wiki/Flexible_array_member [2] https://www.kernel.org/doc/html/v5.9-rc1/process/deprecated.html#zero-length-and-one-element-arrays Built-tested-by: kernel test robot <lkp@intel.com> Signed-off-by: Gustavo A. R. 
Silva <gustavoars@kernel.org> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
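The two-byte over-allocation described in the message above can be reproduced in userspace. The following is only a sketch: the two structs are simplified, hypothetical stand-ins for struct ice_res_tracker (the field set is an assumption), and the helper names are invented for illustration. The second helper mimics what struct_size() computes, without its overflow checking.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified stand-ins for struct ice_res_tracker. */
struct tracker_one_elem {
	uint16_t num_entries;
	uint16_t end;
	uint16_t list[1];	/* old style: one element counts toward sizeof() */
};

struct tracker_flex {
	uint16_t num_entries;
	uint16_t end;
	uint16_t list[];	/* flexible-array member: adds nothing to sizeof() */
};

/* Old sizing: the header already holds one entry, then n more are added. */
static size_t alloc_size_one_elem(size_t n)
{
	return sizeof(struct tracker_one_elem) + n * sizeof(uint16_t);
}

/* New sizing: header plus exactly n trailing entries (no overflow checks). */
static size_t alloc_size_flex(size_t n)
{
	return sizeof(struct tracker_flex) + n * sizeof(uint16_t);
}
```

For any n, the old computation is larger by exactly one list element, which for a 16-bit element is the two bytes the message refers to.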
\| * \|	ice: display stored UNDI firmware version via devlink info	Jacob Keller	2021-02-05	3	-38/+113
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Just as we recently added support for other stored firmware flash versions, support display of the stored UNDI Option ROM version via devlink info. To do this, we need to introduce a new ice_get_inactive_orom_ver function. This is a little trickier than with other flash versions. The Option ROM version data was being read from a special "Boot Configuration" block of the NVM Preserved Field Area. This block only contains the active Option ROM version data. It is populated when the device firmware finishes updating the Option ROM. This method is ineffective at reading the stored Option ROM version data. Instead of reading from this section of the flash, replace this version extraction with one which locates the Combo Version information from within the Option ROM binary. This data is stored within the Option ROM at a 512 byte offset, in a simple structured format. The structure uses a simple modulo 256 checksum for integrity verification. Scan through the Option ROM to locate the CIVD data section, and extract the Combo Version. Refactor ice_get_orom_ver_info so that it takes the bank select enumeration parameter. Use this to implement ice_get_inactive_orom_ver. Although all ice devices have a Boot Configuration block in the NVM PFA, not all devices have a valid Option ROM. In this case, the old ice_get_orom_ver_info would "succeed" but report a version of all zeros. The new implementation would fail to locate the $CIV section in the Option ROM and report an error. Thus, we must ensure that ice_init_nvm does not fail if ice_get_orom_ver_info fails. 
Use the new ice_get_inactive_orom_ver to allow reporting the Option ROM versions for a pending update via devlink info. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
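The scan described in the message above (step through the Option ROM at 512-byte offsets, look for the $CIV signature, verify a modulo 256 checksum) might look roughly like the sketch below. The function names, the fixed block_len parameter, and the convention that all bytes of a valid block sum to zero are assumptions made for illustration, not the driver's actual layout.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define OROM_SCAN_STEP 512u	/* scan granularity taken from the commit message */

/* Sum of all bytes, truncated to 8 bits (that is, modulo 256). */
static uint8_t sum_mod256(const uint8_t *buf, size_t len)
{
	uint8_t sum = 0;
	size_t i;

	for (i = 0; i < len; i++)
		sum = (uint8_t)(sum + buf[i]);
	return sum;
}

/*
 * Return the offset of the first block that starts with "$CIV" on a
 * 512-byte boundary and whose block_len bytes sum to zero modulo 256,
 * or -1 if no such block exists. block_len is a hypothetical fixed size.
 */
static long find_civd(const uint8_t *orom, size_t orom_len, size_t block_len)
{
	size_t off;

	for (off = 0; off + block_len <= orom_len; off += OROM_SCAN_STEP) {
		if (block_len < 4 || memcmp(orom + off, "$CIV", 4) != 0)
			continue;
		if (sum_mod256(orom + off, block_len) == 0)
			return (long)off;
	}
	return -1;
}
```

Returning an error when nothing is found matches the behaviour the message describes for devices without a valid Option ROM, where the caller must tolerate the failure.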
\| * \|	ice: display stored netlist versions via devlink info	Jacob Keller	2021-02-05	7	-93/+176
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Add a function to read the inactive netlist bank for version information. To support this, refactor how we read the netlist version data. Instead of using the firmware AQ interface with a module ID, read from the flash as a flat NVM, using ice_read_flash_module. This change requires a slight adjustment to the offset values used, as reading from the flat NVM includes the type field (which was stripped by firmware previously). Cleanup the macro names and move them to ice_type.h. For clarity in how we calculate the offsets and so that programmers can easily map the offset value to the data sheet, use a wrapper macro to account for the offset adjustments. Use the newly added ice_get_inactive_netlist_ver function to extract the version data from the pending netlist module update. Add the stored variants of "fw.netlist", and "fw.netlist.build" to the info version map array. With this change, we now report the "fw.netlist" and "fw.netlist.build" versions into the stored section of the devlink info report. As with the main NVM module versions, if there is no pending update, we report the currently active values as stored. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| * \|	ice: display some stored NVM versions via devlink info	Jacob Keller	2021-02-05	3	-4/+94
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The devlink info interface supports drivers reporting "stored" versions. These versions indicate the version of an update that has been downloaded to the device, but is not yet active. The code for extracting the NVM version recently changed to enable support for reading from either the active or the inactive bank. Use this to implement ice_get_inactive_nvm_ver, which will read the NVM version data from the inactive section of flash. When reporting the versions via devlink info, first read the device capabilities. Determine if there is a pending flash update, and if so, extract relevant version information from the inactive flash. Store these within the info context structure. When reporting "stored" firmware versions, devlink documentation indicates that we ought to always report a stored value, even if there is no pending update. In this common case, the stored version should match the running version. This means that each stored version should by default fallback to the same value as reported by the running handler. To support this, modify the version structure to have both a "getter" and a "fallback". Modify the control loop so that it will use the "fallback" function if the "getter" function does not report a version. To report versions for which we can read the stored value, use a new "stored()" macro. This macro will insert two entries into the version list. The first entry is the traditional running version. 
The second is the stored version, implemented with a fallback to the active version. This is a little tricky, but reduces the overall duplication of elements in the entry list, and ensures that running and stored values remain consistent. To avoid some duplication, add a combined() macro that will insert both the running and stored versions into the version entry list. Using this new support, add pending version reporter functions for "fw.psid.api" and "fw.bundle_id". This enables reporting the stored values for some of the versions in the NVM module of the flash. Reporting management versions is not implemented by this patch. The active management version is reported to the driver via the AdminQ mailbox during load. Although the version must be in the firmware binary somewhere, accessing this from the inactive firmware is not trivial and has not been implemented in this change. Future changes will introduce support for reading the UNDI Option ROM version and the version associated with the Netlist module. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
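The "getter" plus "fallback" arrangement described in the message above can be sketched in plain C as follows. Everything here is hypothetical: the struct layouts, the function names, and the placeholder version strings are invented for illustration and do not reflect the driver's actual definitions.

```c
#include <stddef.h>
#include <stdio.h>
#include <string.h>

#define VER_BUF_LEN 32

/* Hypothetical context; the real context structure holds more state. */
struct info_ctx {
	char buf[VER_BUF_LEN];
	int has_pending_update;
};

typedef void (*ver_fn)(struct info_ctx *ctx);

struct ver_entry {
	const char *name;
	ver_fn getter;		/* may leave ctx->buf empty */
	ver_fn fallback;	/* optional; used only if getter reports nothing */
};

/* Placeholder running version. */
static void running_bundle_id(struct info_ctx *ctx)
{
	snprintf(ctx->buf, sizeof(ctx->buf), "1.0.0");
}

/* Placeholder stored version: only available while an update is pending. */
static void stored_bundle_id(struct info_ctx *ctx)
{
	if (ctx->has_pending_update)
		snprintf(ctx->buf, sizeof(ctx->buf), "2.0.0");
}

/* Return 1 if a version string was produced for this entry, 0 to skip it. */
static int report_version(const struct ver_entry *e, struct info_ctx *ctx)
{
	memset(ctx->buf, 0, sizeof(ctx->buf));	/* clear between entries */
	e->getter(ctx);
	if (ctx->buf[0] == '\0' && e->fallback)
		e->fallback(ctx);
	return ctx->buf[0] != '\0';
}
```

With no pending update, an entry built from stored_bundle_id and running_bundle_id reports the running value as stored; once an update is pending, the stored getter wins.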
\| * \|	ice: introduce function for reading from flash modules	Jacob Keller	2021-02-05	2	-5/+179
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When reading from the flash memory of the device, the ice driver has two interfaces available to it. First, it can use a mediated interface via firmware that allows specifying a module ID. This allows reading from specific modules of the active flash bank. The second interface available is to perform flat reads. This allows complete access to the entire flash. However, using it requires the software to handle calculating module location and interpret pointer addresses. While most data required is accessible through the convenient first interface, certain flash contents are not. This includes the CSS header information associated with the Option ROM and NVM banks, as well as any access to the "inactive" banks used as scratch space for performing flash updates. In order to access all of the relevant flash contents, software must use the flat reads. Rather than forcing all flows to perform flat read calculations, introduce a new abstraction for reading from the flash: ice_read_flash_module. This function provides an abstraction for reading from either the active or inactive flash bank at the requested module. This interface is very similar to the abstraction provided via firmware, but allows access to additional modules, as well as providing a mechanism to request access to both flash banks. 
At first glance, it might make sense for this abstraction to allow specifying precisely which bank (1st or 2nd) the caller wishes to read. This is simpler to implement but more difficult to use. In practice, most callers only know whether they want the active bank, or the inactive bank. Rather than force callers to determine for themselves which bank to read from, implement ice_read_flash_module in terms of "active" vs "inactive". This significantly simplifies the implementation at the caller level and is a more useful abstraction over the flash contents. Make use of this new interface to refactor reading of the main NVM version information. Instead of using the firmware's mediated ShadowRAM function, use the ice_read_flash_module abstraction. To do this, notice that most reads of the NVM are going to be in 2-byte word chunks. To simplify using ice_read_flash_module for this case, ice_read_nvm_module is introduced. This is a simple wrapper around ice_read_flash_module which takes the correct pointer address for the NVM bank, and forces the 2-byte word format onto the caller. When reading the NVM versions, some fields are read from the Shadow RAM. The Shadow RAM is the first 64KB of flash memory, and is populated during device load. Most fields are copied from a section within the active NVM bank. In order to read this data from both the active and inactive NVM banks, we need to read not from the first 64KB of flash, but instead from the correct offset into the NVM bank. Introduce ice_read_nvm_sr_copy for this purpose. This function wraps around ice_read_nvm_module and has the same interface as the ice_read_sr_word, with the exception of allowing the caller to specify whether to read the active or inactive flash bank. With this change, it is now trivial to refactor ice_get_nvm_ver_info to read using the software mediated ice_read_flash_module interface instead of relying on the firmware mediated interface. 
This will be used in the following change to implement support for stored versions in the devlink info report. Additionally, the overall ice_read_flash_module interface will be used and extended to support all three major flash banks, and additionally to support reading the flash image security revision information. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
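The "active" versus "inactive" abstraction argued for in the message above amounts to a small translation step. The sketch below is a guess at its shape: the type names, the cached fields, and the assumption that the second bank sits directly after the first (at pointer plus size) are illustrative only.

```c
#include <stdint.h>

enum bank_select { BANK_ACTIVE, BANK_INACTIVE };	/* what callers ask for */
enum flash_bank { BANK_1ST, BANK_2ND };			/* what the flash contains */

/* Hypothetical cached description of one module (NVM, Option ROM or Netlist). */
struct module_banks {
	uint32_t ptr;			/* flat offset of the 1st bank */
	uint32_t size;			/* size of a single bank */
	enum flash_bank active;		/* which copy is currently active */
};

/*
 * Translate an "active"/"inactive" request into a flat flash offset,
 * assuming the 2nd bank immediately follows the 1st.
 */
static uint32_t bank_offset(const struct module_banks *m, enum bank_select sel)
{
	enum flash_bank want = m->active;

	if (sel == BANK_INACTIVE)
		want = (m->active == BANK_1ST) ? BANK_2ND : BANK_1ST;

	return want == BANK_2ND ? m->ptr + m->size : m->ptr;
}
```

Callers never need to know which physical copy is live; they pass BANK_ACTIVE or BANK_INACTIVE and the cached bank state does the rest.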
\| * \|	ice: cache NVM module bank information	Jacob Keller	2021-02-05	2	-0/+188
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The ice flash contains two copies of each of the NVM, Option ROM, and Netlist modules. Each bank has a pointer word and a size word. In order to correctly read from the active flash bank, the driver must calculate the offset manually. During NVM initialization, read the Shadow RAM control word and determine which bank is active for each NVM module. Additionally, cache the size and pointer values for use in calculating the correct offset. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| * \|	ice: introduce context struct for info report	Jacob Keller	2021-02-05	1	-41/+68
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The ice driver uses an array of structures which link an info name with a function that formats the associated version data into a string. All existing format functions simply format already captured static data from the driver hw structure. Future changes will introduce format functions for reporting the versions of flash sections stored but not yet applied. This type of version data is not stored as a member of the hw structure. This is because (a) it might not yet exist in the case there is no pending flash update, and (b) even if it does, it might change such as if an update is canceled or replaced by a new update before finalizing. We could simply have each format function gather its own data upon being called. However, in some cases the raw binary version data is a combination of multiple different reported fields. Additionally, the current interface doesn't have a way for the function to indicate that the version doesn't exist. Refactor this function interface to take a new ice_info_ctx structure instead of the buffer pointer and length. This context structure allows for future extensions to pre-gather version data that is stored within the context struct instead of the hw struct. Allocate this context structure initially at the start of ice_devlink_info_get. We use dynamic allocation instead of a local stack variable in order to avoid using too much kernel stack once we extend it with additional data structures. Modify the main loop that drives the info reporting so that the version buffer string is always cleared between each format. 
Explicitly check that the format function actually filled in a version string of non-zero length. If the string is not provided, simply skip this version without reporting an error. This allows for introducing format functions of versions which may or may not be present, such as the version of a pending update that has not yet been activated. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| * \|	ice: create flash_info structure and separate NVM version	Jacob Keller	2021-02-05	4	-62/+91
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The ice_nvm_info structure has become somewhat of a dumping ground for all of the fields related to flash version. It holds the NVM version and EETRACK id, the OptionROM info structure, the flash size, the ShadowRAM size, and more. A future change is going to add the ability to read the NVM version and EETRACK ID from the inactive NVM bank. To make this simpler, it is useful to have these NVM version info fields extracted to their own structure. Rename ice_nvm_info into ice_flash_info, and create a separate ice_nvm_info structure that will contain the eetrack and NVM map version. Move the netlist_ver structure into ice_flash_info and rename it ice_netlist_info for consistency. Modify the static ice_get_orom_ver_info to take the option rom structure as a pointer. This makes it more obvious what portion of the hw struct is being modified. Do the same for ice_get_netlist_ver_info. Introduce a new ice_get_nvm_ver_info function, which will be similar to ice_get_orom_ver_info and ice_get_netlist_ver_info, used to keep the NVM version extraction code co-located. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
\| * \|	ice: report timeout length for erasing during devlink flash	Jacob Keller	2021-02-05	1	-3/+7
\| \|/ \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When erasing, notify userspace of how long we will potentially take to erase a module. Doing so allows userspace to report the timeout, giving a clear indication of the upper time bound of the operation. Since we're re-using the erase timeout value, make it a macro rather than a magic number. Signed-off-by: Jacob Keller <jacob.e.keller@intel.com> Reviewed-by: Shannon Nelson <snelson@pensando.io> Tested-by: Tony Brelinski <tonyx.brelinski@intel.com> Signed-off-by: Tony Nguyen <anthony.l.nguyen@intel.com>
* \|	r8169: don't try to disable interrupts if NAPI is scheduled already	Heiner Kallweit	2021-02-07	1	-2/+4
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There's no benefit in trying to disable interrupts if NAPI is scheduled already. This allows us to save a PCI write in this case. Signed-off-by: Heiner Kallweit <hkallweit1@gmail.com> Link: https://lore.kernel.org/r/78c7f2fb-9772-1015-8c1d-632cbdff253f@gmail.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: ipa: avoid field overflow	Alex Elder	2021-02-06	1	-8/+14
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It's possible that the length passed to ipa_header_size_encoded() is larger than what can be represented by the HDR_LEN field alone (starting with IPA v4.5). If we attempted that, u32_encode_bits() would trigger a build-time error. Avoid this problem by masking off high-order bits of the value encoded as the lower portion of the header length. The same sort of problem exists in ipa_metadata_offset_encoded(), so implement the same fix there. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
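The masking fix described above can be illustrated with a small userspace sketch. The 6-bit field width and the helper names are assumptions chosen for the example; the real register layout is not given in the message.

```c
#include <stdint.h>

#define HDR_LEN_BITS	6u	/* hypothetical width of the low-order field */
#define HDR_LEN_MASK	((1u << HDR_LEN_BITS) - 1u)

/* Low-order part: masked so the value always fits in the HDR_LEN field. */
static uint32_t hdr_len_low(uint32_t header_size)
{
	return header_size & HDR_LEN_MASK;
}

/* High-order part: the bits that do not fit go into a separate field. */
static uint32_t hdr_len_msb(uint32_t header_size)
{
	return header_size >> HDR_LEN_BITS;
}
```

Without the mask, a size such as 0x7f would not fit in a 6-bit field; with it, the low part is 0x3f and the remaining bit is carried in the high part.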
* \|	net: ipa: get rid of status size constraint	Alex Elder	2021-02-06	1	-3/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	There is a build-time check that the packet status structure is a multiple of 4 bytes in size. It's not clear where that constraint comes from, but the structure defines what hardware provides so its definition won't change. Get rid of the check; it adds no value. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: ipa: use a Boolean rather than count when replenishing	Alex Elder	2021-02-06	1	-16/+19
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The count argument to ipa_endpoint_replenish() is only ever 0 or 1, and always will be (because we always handle each receive buffer in a single transaction). Rename the argument to be add_one and change it to be Boolean. Update the function description to reflect the current code. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: ipa: remove two unused register definitions	Alex Elder	2021-02-06	1	-10/+0
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We do not support inter-EE channel or event ring commands. Inter-EE interrupts are disabled (and never re-enabled) for all channels and event rings, so we have no need for the GSI registers that clear those interrupt conditions. So remove their definitions. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: ipa: do not cache event ring state	Alex Elder	2021-02-06	2	-19/+21
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	An event ring's state only needs to be known when it is allocated, reset, or deallocated. We check an event ring's state both before and after performing an event ring control command that changes its state. These are only issued at startup and shutdown, so there is very little value in caching the state. Stop recording a copy of the event ring's last known state, and instead fetch the true state from hardware whenever it's needed. In such cases, do record the state in a local variable, in case an error message reports it (so the value reported is the value seen). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: ipa: synchronize NAPI only for suspend	Alex Elder	2021-02-06	1	-8/+10
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	When stopping a channel, gsi_channel_stop() will ensure NAPI polling is complete when it calls napi_disable(). So there is no need to call napi_synchronize() in that case. Move the call to napi_synchronize() out of __gsi_channel_stop() and into gsi_channel_suspend(), so it's only used where needed. Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: ipa: move mutex calls into __gsi_channel_stop()	Alex Elder	2021-02-06	1	-11/+17
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	Move the mutex calls out of gsi_channel_stop_retry() and into __gsi_channel_stop(), to make the latter more semantically similar to __gsi_channel_start(). Signed-off-by: Alex Elder <elder@linaro.org> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: dsa: felix: propagate the LAG offload ops towards the ocelot lib	Vladimir Oltean	2021-02-06	2	-6/+32
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The ocelot switch has been supporting LAG offload since its initial commit, however felix could not make use of that, due to lack of a LAG abstraction in DSA. Now that we have that, let's forward DSA's calls towards the ocelot library, which will deal with setting up the bonding. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: mscc: ocelot: rebalance LAGs on link up/down events	Vladimir Oltean	2021-02-06	3	-9/+63
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	At present there is an issue when ocelot is offloading a bonding interface, but one of the links of the physical ports goes down. Traffic keeps being hashed towards that destination, and of course gets dropped on egress. Monitor the netdev notifier events emitted by the bonding driver for changes in the physical state of lower interfaces, to determine which ports are active and which ones are no longer. Then extend ocelot_get_bond_mask to return either the configured bonding interfaces, or the active ones, depending on a boolean argument. The code that does rebalancing only needs to do so among the active ports, whereas the bridge forwarding mask and the logical port IDs still need to look at the permanently bonded ports. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
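Rebalancing among only the active ports, as described above, could be sketched like this. The sketch assumes 16 aggregation codes (the 4-bit code mentioned elsewhere in this log) and a simple round-robin assignment; the function name and the two-mask interface are invented for illustration.

```c
#include <stdint.h>

#define NUM_AGGR_CODES 16	/* assumes a 4-bit aggregation code */

/*
 * For each aggregation code, pick one egress port from the LAG members
 * whose link is up, round-robin. dest[] receives one port index per code,
 * or -1 when no member is active. Returns the number of active ports.
 */
static int rebalance_lag(uint32_t bond_mask, uint32_t link_up_mask,
			 int dest[NUM_AGGR_CODES])
{
	int active[32];
	int num_active = 0;
	uint32_t active_mask = bond_mask & link_up_mask;
	int port, code;

	for (port = 0; port < 32; port++)
		if (active_mask & (UINT32_C(1) << port))
			active[num_active++] = port;

	for (code = 0; code < NUM_AGGR_CODES; code++)
		dest[code] = num_active ? active[code % num_active] : -1;

	return num_active;
}
```

A port that is bonded but has lost its link stays in bond_mask, so forwarding masks and logical port IDs can keep using it, while no aggregation code hashes traffic towards it.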
* \|	net: mscc: ocelot: rename aggr_count to num_ports_in_lag	Vladimir Oltean	2021-02-06	1	-4/+3
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	It makes it a bit easier to read and understand the code that deals with balancing the 16 aggregation codes among the ports in a certain LAG. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: mscc: ocelot: drop the use of the "lags" array	Vladimir Oltean	2021-02-06	1	-56/+39
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	We can now simplify the implementation by always using ocelot_get_bond_mask to look up the other ports that are offloading the same bonding interface as us. In ocelot_set_aggr_pgids, the code had a way to uniquely iterate through LAGs. We need to achieve the same behavior by marking each LAG as visited, which we do now by using a temporary 32-bit "visited" bitmask. This is ok and we do not need dynamic memory allocation, because we know that this switch architecture will not have more than 32 ports (the PGID port masks are 32-bit anyway). Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
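The "visited" bitmask technique mentioned above can be sketched as follows. This is an illustration only: ports are identified by a hypothetical integer bond ID (0 meaning no bond) rather than by a pointer to the bonding interface, and NUM_PORTS and both function names are made up.

```c
#include <stdint.h>

#define NUM_PORTS 8	/* hypothetical; at most 32 so one uint32_t suffices */

/* Mask of all ports whose bond ID equals id (0 means "not in a bond"). */
static uint32_t get_bond_mask(const int bond_id[NUM_PORTS], int id)
{
	uint32_t mask = 0;
	int port;

	for (port = 0; port < NUM_PORTS; port++)
		if (id && bond_id[port] == id)
			mask |= UINT32_C(1) << port;
	return mask;
}

/* Visit each LAG exactly once and return how many distinct LAGs exist. */
static int count_lags(const int bond_id[NUM_PORTS])
{
	uint32_t visited = 0;
	int port, lags = 0;

	for (port = 0; port < NUM_PORTS; port++) {
		uint32_t mask;

		if (!bond_id[port] || (visited & (UINT32_C(1) << port)))
			continue;
		mask = get_bond_mask(bond_id, bond_id[port]);
		/* per-LAG work would go here */
		visited |= mask;	/* mark every member of this LAG as done */
		lags++;
	}
	return lags;
}
```

Because the mask fits in a single 32-bit word, no allocation is needed, which is the point the message makes about the 32-port limit.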
* \|	net: mscc: ocelot: set up logical port IDs centrally	Vladimir Oltean	2021-02-06	1	-19/+28
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The setup of logical port IDs is done in two places: from the inconclusively named ocelot_setup_lag and from ocelot_port_lag_leave, a function that also calls ocelot_setup_lag (which apparently does an incomplete setup of the LAG). To improve this situation, we can rename ocelot_setup_lag into ocelot_setup_logical_port_ids, and drop the "lag" argument. It will now set up the logical port IDs of all switch ports, which may be just slightly more inefficient but more maintainable. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: mscc: ocelot: avoid unneeded "lp" variable in LAG join	Vladimir Oltean	2021-02-06	1	-10/+6
\| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \| \|	The index of the LAG is equal to the logical port ID that all the physical port members have, which is further equal to the index of the first physical port that is a member of the LAG. The code gets a bit carried away with logic like this: if (a == b) c = a; else c = b; which can be simplified, of course, into: c = b; (with a being port, b being lp, c being lag) This further makes the "lp" variable redundant, since we can use "lag" everywhere where "lp" (logical port) was used. So instead of a "c = b" assignment, we can do a complete deletion of b. Only one comment here: if (bond_mask) { lp = __ffs(bond_mask); ocelot->lags[lp] = 0; } lp was clobbered before, because it was used as a temporary variable to hold the new smallest port ID from the bond. Now that we don't have "lp" any longer, we'll just avoid the temporary variable and zeroize the bonding mask directly. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: mscc: ocelot: set up the bonding mask in a way that avoids a net_device	Vladimir Oltean	2021-02-06	1	-7/+22
\| \|	Since this code should be called from pure switchdev as well as from DSA, we must find a way to determine the bonding mask not by looking directly at the net_device lowers of the bonding interface, since those could have different private structures. We keep a pointer to the bonding upper interface, if present, in struct ocelot_port. Then the bonding mask becomes the bitwise OR of all ports that have the same bonding upper interface. This adds a duplication of functionality with the current "lags" array, but the duplication will be short-lived, since further patches will remove the latter completely. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
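The bonding mask computation described above (the bitwise OR of all ports that share a bonding upper interface) can be sketched in plain C as follows. This is a minimal sketch under stated assumptions, not the patch itself: `struct sketch_port` is a hypothetical, simplified stand-in for struct ocelot_port, `sketch_get_bond_mask` is an illustrative name, and the bonding upper is modelled as an opaque pointer instead of a net_device.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical stand-in for struct ocelot_port: only the pointer to the
 * bonding upper interface matters for this sketch.
 */
struct sketch_port {
	const void *bond; /* bonding upper, or NULL if the port is not bonded */
};

/*
 * Return a bitmask with bit i set for every port whose bonding upper is
 * 'bond'. No net_device lowers are inspected, only per-port state.
 */
static uint32_t sketch_get_bond_mask(const struct sketch_port *ports,
				     size_t num_ports, const void *bond)
{
	uint32_t mask = 0;
	size_t i;

	if (!bond)
		return 0;

	/* The mask is 32 bits wide, so at most 32 ports are considered. */
	for (i = 0; i < num_ports && i < 32; i++)
		if (ports[i].bond == bond)
			mask |= UINT32_C(1) << i;

	return mask;
}
```

Because the mask is derived only from per-port state, the same helper would work whether the ports are registered by a pure switchdev driver or by DSA, which is the motivation given in the message.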
* \|	net: mscc: ocelot: use ipv6 in the aggregation code	Vladimir Oltean	2021-02-06	1	-1/+4
\| \|	IPv6 header information is not currently part of the entropy source for the 4-bit aggregation code used for LAG offload, even though it could be. The hardware reference manual says about these fields: ANA::AGGR_CFG.AC_IP6_TCPUDP_PORT_ENA: Use IPv6 TCP/UDP port when calculating aggregation code. Configure identically for all ports. Recommended value is 1. ANA::AGGR_CFG.AC_IP6_FLOW_LBL_ENA: Use IPv6 flow label when calculating AC. Configure identically for all ports. Recommended value is 1. Integration with the xmit_hash_policy of the bonding interface is TBD. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* \|	net: mscc: ocelot: don't refuse bonding interfaces we can't offload	Vladimir Oltean	2021-02-06	3	-28/+17
\| \|	Since switchdev/DSA exposes network interfaces that fulfill many of the same user space expectations that dedicated NICs do, it makes sense to not deny bonding interfaces with a bonding policy that we cannot offload, but instead allow the bonding driver to select the egress interface in software. Signed-off-by: Vladimir Oltean <vladimir.oltean@nxp.com> Reviewed-by: Alexandre Belloni <alexandre.belloni@bootlin.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>