| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
|
|
|
| |
To enable multicast packets which are offloaded in bridge multicast
offload mode to be sent also to uplink, FTE bit uplink_hairpin_en should
be set. Add this bit to FTE for the bridge multicast offload rules.
Fixes: 18c2916cee12 ("net/mlx5: Bridge, snoop igmp/mld packets")
Signed-off-by: Moshe Shemesh <moshe@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The below WARN [1] is reported once a callback command failed.
As a callback runs under an interrupt context, needs to use the IRQ
save/restore variant.
[1]
DEBUG_LOCKS_WARN_ON(lockdep_hardirq_context())
WARNING: CPU: 15 PID: 0 at kernel/locking/lockdep.c:4353
lockdep_hardirqs_on_prepare+0x11b/0x180
Modules linked in: vhost_net vhost tap mlx5_vfio_pci
vfio_pci vfio_pci_core vfio_iommu_type1 vfio mlx5_vdpa vringh
vhost_iotlb vdpa nfnetlink_cttimeout openvswitch nsh ip6table_mangle
ip6table_nat ip6table_filter ip6_tables iptable_mangle
xt_conntrackxt_MASQUERADE nf_conntrack_netlink nfnetlink
xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5
auth_rpcgss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi
scsi_transport_iscsi rdma_cm iw_cm ib_umad ib_ipoib ib_cm
mlx5_ib ib_uverbs ib_core fuse mlx5_core
CPU: 15 PID: 0 Comm: swapper/15 Tainted: G W 6.7.0-rc4+ #1587
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS
rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:lockdep_hardirqs_on_prepare+0x11b/0x180
Code: 00 5b c3 c3 e8 e6 0d 58 00 85 c0 74 d6 8b 15 f0 c3
76 01 85 d2 75 cc 48 c7 c6 04 a5 3b 82 48 c7 c7 f1
e9 39 82 e8 95 12 f9 ff <0f> 0b 5b c3 e8 bc 0d 58 00
85 c0 74 ac 8b 3d c6 c3 76 01 85 ff 75
RSP: 0018:ffffc900003ecd18 EFLAGS: 00010086
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000027
RDX: 0000000000000000 RSI: ffff88885fbdb880 RDI: ffff88885fbdb888
RBP: 00000000ffffff87 R08: 0000000000000000 R09: 0000000000000001
R10: 0000000000000000 R11: 284e4f5f4e524157 R12: 00000000002c9aa1
R13: ffff88810aace980 R14: ffff88810aace9b8 R15: 0000000000000003
FS: 0000000000000000(0000) GS:ffff88885fbc0000(0000)
knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f731436f4c8 CR3: 000000010aae6001 CR4: 0000000000372eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<IRQ>
? __warn+0x81/0x170
? lockdep_hardirqs_on_prepare+0x11b/0x180
? report_bug+0xf8/0x1c0
? handle_bug+0x3f/0x70
? exc_invalid_op+0x13/0x60
? asm_exc_invalid_op+0x16/0x20
? lockdep_hardirqs_on_prepare+0x11b/0x180
? lockdep_hardirqs_on_prepare+0x11b/0x180
trace_hardirqs_on+0x4a/0xa0
raw_spin_unlock_irq+0x24/0x30
cmd_status_err+0xc0/0x1a0 [mlx5_core]
cmd_status_err+0x1a0/0x1a0 [mlx5_core]
mlx5_cmd_exec_cb_handler+0x24/0x40 [mlx5_core]
mlx5_cmd_comp_handler+0x129/0x4b0 [mlx5_core]
cmd_comp_notifier+0x1a/0x20 [mlx5_core]
notifier_call_chain+0x3e/0xe0
atomic_notifier_call_chain+0x5f/0x130
mlx5_eq_async_int+0xe7/0x200 [mlx5_core]
notifier_call_chain+0x3e/0xe0
atomic_notifier_call_chain+0x5f/0x130
irq_int_handler+0x11/0x20 [mlx5_core]
__handle_irq_event_percpu+0x99/0x220
? tick_irq_enter+0x5d/0x80
handle_irq_event_percpu+0xf/0x40
handle_irq_event+0x3a/0x60
handle_edge_irq+0xa2/0x1c0
__common_interrupt+0x55/0x140
common_interrupt+0x7d/0xa0
</IRQ>
<TASK>
asm_common_interrupt+0x22/0x40
RIP: 0010:default_idle+0x13/0x20
Code: c0 08 00 00 00 4d 29 c8 4c 01 c7 4c 29 c2 e9 72 ff
ff ff cc cc cc cc 8b 05 ea 08 25 01 85 c0 7e 07 0f 00 2d 7f b0 26 00 fb
f4 <fa> c3 90 66 2e 0f 1f 84 00 00 00 00 00 65 48 8b 04 25 80 d0 02 00
RSP: 0018:ffffc9000010fec8 EFLAGS: 00000242
RAX: 0000000000000001 RBX: 000000000000000f RCX: 4000000000000000
RDX: 0000000000000000 RSI: 0000000000000000 RDI: ffffffff811c410c
RBP: ffffffff829478c0 R08: 0000000000000001 R09: 0000000000000001
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
? do_idle+0x1ec/0x210
default_idle_call+0x6c/0x90
do_idle+0x1ec/0x210
cpu_startup_entry+0x26/0x30
start_secondary+0x11b/0x150
secondary_startup_64_no_verify+0x165/0x16b
</TASK>
irq event stamp: 833284
hardirqs last enabled at (833283): [<ffffffff811c410c>]
do_idle+0x1ec/0x210
hardirqs last disabled at (833284): [<ffffffff81daf9ef>]
common_interrupt+0xf/0xa0
softirqs last enabled at (833224): [<ffffffff81dc199f>]
__do_softirq+0x2bf/0x40e
softirqs last disabled at (833177): [<ffffffff81178ddf>]
irq_exit_rcu+0x7f/0xa0
Fixes: 34f46ae0d4b3 ("net/mlx5: Add command failures data to debugfs")
Signed-off-by: Yishai Hadas <yishaih@nvidia.com>
Reviewed-by: Moshe Shemesh <moshe@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The cited change refactored mlx5e_tc_del_fdb_peer_flow() to only clear DUP
flag when list of peer flows has become empty. However, if any concurrent
user holds a reference to a peer flow (for example, the neighbor update
workqueue task is updating peer flow's parent encap entry concurrently),
then the flow will not be removed from the peer list and, consecutively,
DUP flag will remain set. Since mlx5e_tc_del_fdb_peers_flow() calls
mlx5e_tc_del_fdb_peer_flow() for every possible peer index the algorithm
will try to remove the flow from eswitch instances that it has never peered
with causing either NULL pointer dereference when trying to remove the flow
peer list head of peer_index that was never initialized or a warning if the
list debug config is enabled[0].
Fix the issue by always removing the peer flow from the list even when not
releasing the last reference to it.
[0]:
[ 3102.985806] ------------[ cut here ]------------
[ 3102.986223] list_del corruption, ffff888139110698->next is NULL
[ 3102.986757] WARNING: CPU: 2 PID: 22109 at lib/list_debug.c:53 __list_del_entry_valid_or_report+0x4f/0xc0
[ 3102.987561] Modules linked in: act_ct nf_flow_table bonding act_tunnel_key act_mirred act_skbedit vxlan cls_matchall nfnetlink_cttimeout act_gact cls_flower sch_ingress mlx5_vdpa vringh vhost_iotlb vdpa openvswitch nsh xt_MASQUERADE nf_conntrack_netlink nfnetlink iptable_nat xt_addrtype xt_conntrack nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcg
ss oid_registry overlay rpcrdma rdma_ucm ib_iser libiscsi scsi_transport_iscsi ib_umad rdma_cm ib_ipoib iw_cm ib_cm mlx5_ib ib_uverbs ib_core mlx5_core [last unloaded: bonding]
[ 3102.991113] CPU: 2 PID: 22109 Comm: revalidator28 Not tainted 6.6.0-rc6+ #3
[ 3102.991695] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
[ 3102.992605] RIP: 0010:__list_del_entry_valid_or_report+0x4f/0xc0
[ 3102.993122] Code: 39 c2 74 56 48 8b 32 48 39 fe 75 62 48 8b 51 08 48 39 f2 75 73 b8 01 00 00 00 c3 48 89 fe 48 c7 c7 48 fd 0a 82 e8 41 0b ad ff <0f> 0b 31 c0 c3 48 89 fe 48 c7 c7 70 fd 0a 82 e8 2d 0b ad ff 0f 0b
[ 3102.994615] RSP: 0018:ffff8881383e7710 EFLAGS: 00010286
[ 3102.995078] RAX: 0000000000000000 RBX: 0000000000000002 RCX: 0000000000000000
[ 3102.995670] RDX: 0000000000000001 RSI: ffff88885f89b640 RDI: ffff88885f89b640
[ 3102.997188] DEL flow 00000000be367878 on port 0
[ 3102.998594] RBP: dead000000000122 R08: 0000000000000000 R09: c0000000ffffdfff
[ 3102.999604] R10: 0000000000000008 R11: ffff8881383e7598 R12: dead000000000100
[ 3103.000198] R13: 0000000000000002 R14: ffff888139110000 R15: ffff888101901240
[ 3103.000790] FS: 00007f424cde4700(0000) GS:ffff88885f880000(0000) knlGS:0000000000000000
[ 3103.001486] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 3103.001986] CR2: 00007fd42e8dcb70 CR3: 000000011e68a003 CR4: 0000000000370ea0
[ 3103.002596] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[ 3103.003190] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[ 3103.003787] Call Trace:
[ 3103.004055] <TASK>
[ 3103.004297] ? __warn+0x7d/0x130
[ 3103.004623] ? __list_del_entry_valid_or_report+0x4f/0xc0
[ 3103.005094] ? report_bug+0xf1/0x1c0
[ 3103.005439] ? console_unlock+0x4a/0xd0
[ 3103.005806] ? handle_bug+0x3f/0x70
[ 3103.006149] ? exc_invalid_op+0x13/0x60
[ 3103.006531] ? asm_exc_invalid_op+0x16/0x20
[ 3103.007430] ? __list_del_entry_valid_or_report+0x4f/0xc0
[ 3103.007910] mlx5e_tc_del_fdb_peers_flow+0xcf/0x240 [mlx5_core]
[ 3103.008463] mlx5e_tc_del_flow+0x46/0x270 [mlx5_core]
[ 3103.008944] mlx5e_flow_put+0x26/0x50 [mlx5_core]
[ 3103.009401] mlx5e_delete_flower+0x25f/0x380 [mlx5_core]
[ 3103.009901] tc_setup_cb_destroy+0xab/0x180
[ 3103.010292] fl_hw_destroy_filter+0x99/0xc0 [cls_flower]
[ 3103.010779] __fl_delete+0x2d4/0x2f0 [cls_flower]
[ 3103.011207] fl_delete+0x36/0x80 [cls_flower]
[ 3103.011614] tc_del_tfilter+0x56f/0x750
[ 3103.011982] rtnetlink_rcv_msg+0xff/0x3a0
[ 3103.012362] ? netlink_ack+0x1c7/0x4e0
[ 3103.012719] ? rtnl_calcit.isra.44+0x130/0x130
[ 3103.013134] netlink_rcv_skb+0x54/0x100
[ 3103.013533] netlink_unicast+0x1ca/0x2b0
[ 3103.013902] netlink_sendmsg+0x361/0x4d0
[ 3103.014269] __sock_sendmsg+0x38/0x60
[ 3103.014643] ____sys_sendmsg+0x1f2/0x200
[ 3103.015018] ? copy_msghdr_from_user+0x72/0xa0
[ 3103.015265] ___sys_sendmsg+0x87/0xd0
[ 3103.016608] ? copy_msghdr_from_user+0x72/0xa0
[ 3103.017014] ? ___sys_recvmsg+0x9b/0xd0
[ 3103.017381] ? ttwu_do_activate.isra.137+0x58/0x180
[ 3103.017821] ? wake_up_q+0x49/0x90
[ 3103.018157] ? futex_wake+0x137/0x160
[ 3103.018521] ? __sys_sendmsg+0x51/0x90
[ 3103.018882] __sys_sendmsg+0x51/0x90
[ 3103.019230] ? exit_to_user_mode_prepare+0x56/0x130
[ 3103.019670] do_syscall_64+0x3c/0x80
[ 3103.020017] entry_SYSCALL_64_after_hwframe+0x46/0xb0
[ 3103.020469] RIP: 0033:0x7f4254811ef4
[ 3103.020816] Code: 89 f3 48 83 ec 10 48 89 7c 24 08 48 89 14 24 e8 42 eb ff ff 48 8b 14 24 41 89 c0 48 89 de 48 8b 7c 24 08 b8 2e 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 30 44 89 c7 48 89 04 24 e8 78 eb ff ff 48 8b
[ 3103.022290] RSP: 002b:00007f424cdd9480 EFLAGS: 00000293 ORIG_RAX: 000000000000002e
[ 3103.022970] RAX: ffffffffffffffda RBX: 00007f424cdd9510 RCX: 00007f4254811ef4
[ 3103.023564] RDX: 0000000000000000 RSI: 00007f424cdd9510 RDI: 0000000000000012
[ 3103.024158] RBP: 00007f424cdda238 R08: 0000000000000000 R09: 00007f41d801a4b0
[ 3103.024748] R10: 0000000000000000 R11: 0000000000000293 R12: 0000000000000001
[ 3103.025341] R13: 00007f424cdd9510 R14: 00007f424cdda240 R15: 00007f424cdd99a0
[ 3103.025931] </TASK>
[ 3103.026182] ---[ end trace 0000000000000000 ]---
[ 3103.027033] ------------[ cut here ]------------
Fixes: 9be6c21fdcf8 ("net/mlx5e: Handle offloads flows per peer")
Signed-off-by: Vlad Buslov <vladbu@nvidia.com>
Reviewed-by: Mark Bloch <mbloch@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The processing of traffic in hairpin queues occurs in HW/FW and does not
involve the cpus, hence the upper bound on max num channels does not
apply to them. Using this bound for the hairpin RQT max_table_size is
wrong. It could be too small, and cause the error below [1]. As the
RQT size provided on init does not get modified later, use the same
value for both actual and max table sizes.
[1]
mlx5_core 0000:08:00.1: mlx5_cmd_out_err:805:(pid 1200): CREATE_RQT(0x916) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x538faf), err(-22)
Fixes: 74a8dadac17e ("net/mlx5e: Preparations for supporting larger number of channels")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Reviewed-by: Gal Pressman <gal@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Indirection (*) is of lower precedence than postfix increment (++). Logic
in napi_poll context would cause an out-of-bound read by first increment
the pointer address by byte address space and then dereference the value.
Rather, the intended logic was to dereference first and then increment the
underlying value.
Fixes: 92214be5979c ("net/mlx5e: Update doorbell for port timestamping CQ before the software counter")
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Reviewed-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
|
|
|
|
|
|
|
|
| |
The sd_group field moved in the HW spec from the MPIR register
to the vport context.
Align the query accordingly.
Fixes: f5e956329960 ("net/mlx5: Expose Management PCIe Index Register (MPIR)")
Signed-off-by: Tariq Toukan <tariqt@nvidia.com>
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The cited commit moved the code of mlx5e_create_tises() and changed the
loop to create TISes over MLX5_MAX_PORTS constant value, instead of
getting the correct lag ports supported by the device, which can cause
FW errors on devices with less than MLX5_MAX_PORTS ports.
Change that back to mlx5e_get_num_lag_ports(mdev).
Also IPoIB interfaces create there own TISes, they don't use the eth
TISes, pass a flag to indicate that.
This fixes the following errors that might appear in kernel log:
mlx5_cmd_out_err:808:(pid 650): CREATE_TIS(0x912) op_mod(0x0) failed, status bad parameter(0x3), syndrome (0x595b5d), err(-22)
mlx5e_create_mdev_resources:174:(pid 650): alloc tises failed, -22
Fixes: b25bd37c859f ("net/mlx5: Move TISes from priv to mdev HW resources")
Signed-off-by: Saeed Mahameed <saeedm@nvidia.com>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When a qdisc is deleted from a net device the stack instructs the
underlying driver to remove its flow offload callback from the
associated filter block using the 'FLOW_BLOCK_UNBIND' command. The stack
then continues to replay the removal of the filters in the block for
this driver by iterating over the chains in the block and invoking the
'reoffload' operation of the classifier being used. In turn, the
classifier in its 'reoffload' operation prepares and emits a
'FLOW_CLS_DESTROY' command for each filter.
However, the stack does not do the same for chain templates and the
underlying driver never receives a 'FLOW_CLS_TMPLT_DESTROY' command when
a qdisc is deleted. This results in a memory leak [1] which can be
reproduced using [2].
Fix by introducing a 'tmplt_reoffload' operation and have the stack
invoke it with the appropriate arguments as part of the replay.
Implement the operation in the sole classifier that supports chain
templates (flower) by emitting the 'FLOW_CLS_TMPLT_{CREATE,DESTROY}'
command based on whether a flow offload callback is being bound to a
filter block or being unbound from one.
As far as I can tell, the issue happens since cited commit which
reordered tcf_block_offload_unbind() before tcf_block_flush_all_chains()
in __tcf_block_put(). The order cannot be reversed as the filter block
is expected to be freed after flushing all the chains.
[1]
unreferenced object 0xffff888107e28800 (size 2048):
comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s)
hex dump (first 32 bytes):
b1 a6 7c 11 81 88 ff ff e0 5b b3 10 81 88 ff ff ..|......[......
01 00 00 00 00 00 00 00 e0 aa b0 84 ff ff ff ff ................
backtrace:
[<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320
[<ffffffff81ab374e>] __kmalloc+0x4e/0x90
[<ffffffff832aec6d>] mlxsw_sp_acl_ruleset_get+0x34d/0x7a0
[<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180
[<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280
[<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340
[<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0
[<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170
[<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0
[<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440
[<ffffffff83ac6270>] netlink_unicast+0x540/0x820
[<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0
[<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80
[<ffffffff8379d29a>] ___sys_sendmsg+0x13a/0x1e0
[<ffffffff8379d50c>] __sys_sendmsg+0x11c/0x1f0
[<ffffffff843b9ce0>] do_syscall_64+0x40/0xe0
unreferenced object 0xffff88816d2c0400 (size 1024):
comm "tc", pid 1079, jiffies 4294958525 (age 3074.287s)
hex dump (first 32 bytes):
40 00 00 00 00 00 00 00 57 f6 38 be 00 00 00 00 @.......W.8.....
10 04 2c 6d 81 88 ff ff 10 04 2c 6d 81 88 ff ff ..,m......,m....
backtrace:
[<ffffffff81c06a68>] __kmem_cache_alloc_node+0x1e8/0x320
[<ffffffff81ab36c1>] __kmalloc_node+0x51/0x90
[<ffffffff81a8ed96>] kvmalloc_node+0xa6/0x1f0
[<ffffffff82827d03>] bucket_table_alloc.isra.0+0x83/0x460
[<ffffffff82828d2b>] rhashtable_init+0x43b/0x7c0
[<ffffffff832aed48>] mlxsw_sp_acl_ruleset_get+0x428/0x7a0
[<ffffffff832bc195>] mlxsw_sp_flower_tmplt_create+0x145/0x180
[<ffffffff832b2e1a>] mlxsw_sp_flow_block_cb+0x1ea/0x280
[<ffffffff83a10613>] tc_setup_cb_call+0x183/0x340
[<ffffffff83a9f85a>] fl_tmplt_create+0x3da/0x4c0
[<ffffffff83a22435>] tc_ctl_chain+0xa15/0x1170
[<ffffffff838a863c>] rtnetlink_rcv_msg+0x3cc/0xed0
[<ffffffff83ac87f0>] netlink_rcv_skb+0x170/0x440
[<ffffffff83ac6270>] netlink_unicast+0x540/0x820
[<ffffffff83ac6e28>] netlink_sendmsg+0x8d8/0xda0
[<ffffffff83793def>] ____sys_sendmsg+0x30f/0xa80
[2]
# tc qdisc add dev swp1 clsact
# tc chain add dev swp1 ingress proto ip chain 1 flower dst_ip 0.0.0.0/32
# tc qdisc del dev swp1 clsact
# devlink dev reload pci/0000:06:00.0
Fixes: bbf73830cd48 ("net: sched: traverse chains in block with tcf_get_next_chain()")
Signed-off-by: Ido Schimmel <idosch@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
We are missing a lot of config options from net selftests,
it seems:
tun/tap: CONFIG_TUN, CONFIG_MACVLAN, CONFIG_MACVTAP
fib_tests: CONFIG_NET_SCH_FQ_CODEL
l2tp: CONFIG_L2TP, CONFIG_L2TP_V3, CONFIG_L2TP_IP, CONFIG_L2TP_ETH
sctp-vrf: CONFIG_INET_DIAG
txtimestamp: CONFIG_NET_CLS_U32
vxlan_mdb: CONFIG_BRIDGE_VLAN_FILTERING
gre_gso: CONFIG_NET_IPGRE_DEMUX, CONFIG_IP_GRE, CONFIG_IPV6_GRE
srv6_end_dt*_l3vpn: CONFIG_IPV6_SEG6_LWTUNNEL
ip_local_port_range: CONFIG_MPTCP
fib_test: CONFIG_NET_CLS_BASIC
rtnetlink: CONFIG_MACSEC, CONFIG_NET_SCH_HTB, CONFIG_XFRM_INTERFACE
CONFIG_NET_IPGRE, CONFIG_BONDING
fib_nexthops: CONFIG_MPLS, CONFIG_MPLS_ROUTING
vxlan_mdb: CONFIG_NET_ACT_GACT
tls: CONFIG_TLS, CONFIG_CRYPTO_CHACHA20POLY1305
psample: CONFIG_PSAMPLE
fcnal: CONFIG_TCP_MD5SIG
Try to add them in a semi-alphabetical order.
Fixes: 62199e3f1658 ("selftests: net: Add VXLAN MDB test")
Fixes: c12e0d5f267d ("self-tests: introduce self-tests for RPS default mask")
Fixes: 122db5e3634b ("selftests/net: add MPTCP coverage for IP_LOCAL_PORT_RANGE")
Link: https://lore.kernel.org/r/20240122203528.672004-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Current code in netvsc_drv_init() incorrectly assumes that PAGE_SIZE
is 4 Kbytes, which is wrong on ARM64 with 16K or 64K page size. As a
result, the default VMBus ring buffer size on ARM64 with 64K page size
is 8 Mbytes instead of the expected 512 Kbytes. While this doesn't break
anything, a typical VM with 8 vCPUs and 8 netvsc channels wastes 120
Mbytes (8 channels * 2 ring buffers/channel * 7.5 Mbytes/ring buffer).
Unfortunately, the module parameter specifying the ring buffer size
is in units of 4 Kbyte pages. Ideally, it should be in units that
are independent of PAGE_SIZE, but backwards compatibility prevents
changing that now.
Fix this by having netvsc_drv_init() hardcode 4096 instead of using
PAGE_SIZE when calculating the ring buffer size in bytes. Also
use the VMBUS_RING_SIZE macro to ensure proper alignment when running
with page size larger than 4K.
Cc: <stable@vger.kernel.org> # 5.15.x
Fixes: 7aff79e297ee ("Drivers: hv: Enable Hyper-V code to be built on ARM64")
Signed-off-by: Michael Kelley <mhklinux@outlook.com>
Link: https://lore.kernel.org/r/20240122162028.348885-1-mhklinux@outlook.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
This reverts commit b34ab3527b9622ca4910df24ff5beed5aa66c6b5.
Using skb_ensure_writable_head_tail without a call to skb_unshare causes
the MACsec stack to operate on the original skb rather than a copy in the
macsec_encrypt path. This causes the buffer to be exceeded in space, and
leads to warnings generated by skb_put operations. Opting to revert this
change since skb_copy_expand is more efficient than
skb_ensure_writable_head_tail followed by a call to skb_unshare.
Log:
------------[ cut here ]------------
kernel BUG at net/core/skbuff.c:2464!
invalid opcode: 0000 [#1] SMP KASAN
CPU: 21 PID: 61997 Comm: iperf3 Not tainted 6.7.0-rc8_for_upstream_debug_2024_01_07_17_05 #1
Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS rel-1.13.0-0-gf21b5a4aeb02-prebuilt.qemu.org 04/01/2014
RIP: 0010:skb_put+0x113/0x190
Code: 03 0f b6 14 02 48 89 f8 83 e0 07 83 c0 03 38 d0 7c 04 84 d2 75 70 3b 9d bc 00 00 00 77 0e 48 83 c4 08 4c 89 e8 5b 5d 41 5d c3 <0f> 0b 4c 8b 6c 24 20 89 74 24 04 e8 6d b7 f0 fe 8b 74 24 04 48 c7
RSP: 0018:ffff8882694e7278 EFLAGS: 00010202
RAX: 0000000000000025 RBX: 0000000000000100 RCX: 0000000000000001
RDX: 0000000000000000 RSI: 0000000000000010 RDI: ffff88816ae0bad4
RBP: ffff88816ae0ba60 R08: 0000000000000004 R09: 0000000000000004
R10: 0000000000000001 R11: 0000000000000001 R12: ffff88811ba5abfa
R13: ffff8882bdecc100 R14: ffff88816ae0ba60 R15: ffff8882bdecc0ae
FS: 00007fe54df02740(0000) GS:ffff88881f080000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007fe54d92e320 CR3: 000000010a345003 CR4: 0000000000370eb0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
Call Trace:
<TASK>
? die+0x33/0x90
? skb_put+0x113/0x190
? do_trap+0x1b4/0x3b0
? skb_put+0x113/0x190
? do_error_trap+0xb6/0x180
? skb_put+0x113/0x190
? handle_invalid_op+0x2c/0x30
? skb_put+0x113/0x190
? exc_invalid_op+0x2b/0x40
? asm_exc_invalid_op+0x16/0x20
? skb_put+0x113/0x190
? macsec_start_xmit+0x4e9/0x21d0
macsec_start_xmit+0x830/0x21d0
? get_txsa_from_nl+0x400/0x400
? lock_downgrade+0x690/0x690
? dev_queue_xmit_nit+0x78b/0xae0
dev_hard_start_xmit+0x151/0x560
__dev_queue_xmit+0x1580/0x28f0
? check_chain_key+0x1c5/0x490
? netdev_core_pick_tx+0x2d0/0x2d0
? __ip_queue_xmit+0x798/0x1e00
? lock_downgrade+0x690/0x690
? mark_held_locks+0x9f/0xe0
ip_finish_output2+0x11e4/0x2050
? ip_mc_finish_output+0x520/0x520
? ip_fragment.constprop.0+0x230/0x230
? __ip_queue_xmit+0x798/0x1e00
__ip_queue_xmit+0x798/0x1e00
? __skb_clone+0x57a/0x760
__tcp_transmit_skb+0x169d/0x3490
? lock_downgrade+0x690/0x690
? __tcp_select_window+0x1320/0x1320
? mark_held_locks+0x9f/0xe0
? lockdep_hardirqs_on_prepare+0x286/0x400
? tcp_small_queue_check.isra.0+0x120/0x3d0
tcp_write_xmit+0x12b6/0x7100
? skb_page_frag_refill+0x1e8/0x460
__tcp_push_pending_frames+0x92/0x320
tcp_sendmsg_locked+0x1ed4/0x3190
? tcp_sendmsg_fastopen+0x650/0x650
? tcp_sendmsg+0x1a/0x40
? mark_held_locks+0x9f/0xe0
? lockdep_hardirqs_on_prepare+0x286/0x400
tcp_sendmsg+0x28/0x40
? inet_send_prepare+0x1b0/0x1b0
__sock_sendmsg+0xc5/0x190
sock_write_iter+0x222/0x380
? __sock_sendmsg+0x190/0x190
? kfree+0x96/0x130
vfs_write+0x842/0xbd0
? kernel_write+0x530/0x530
? __fget_light+0x51/0x220
? __fget_light+0x51/0x220
ksys_write+0x172/0x1d0
? update_socket_protocol+0x10/0x10
? __x64_sys_read+0xb0/0xb0
? lockdep_hardirqs_on_prepare+0x286/0x400
do_syscall_64+0x40/0xe0
entry_SYSCALL_64_after_hwframe+0x46/0x4e
RIP: 0033:0x7fe54d9018b7
Code: 0f 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007ffdbd4191d8 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 0000000000000025 RCX: 00007fe54d9018b7
RDX: 0000000000000025 RSI: 0000000000d9859c RDI: 0000000000000004
RBP: 0000000000d9859c R08: 0000000000000004 R09: 0000000000000000
R10: 00007fe54d80afe0 R11: 0000000000000246 R12: 0000000000000004
R13: 0000000000000025 R14: 00007fe54e00ec00 R15: 0000000000d982a0
</TASK>
Modules linked in: 8021q garp mrp iptable_raw bonding vfio_pci rdma_ucm ib_umad mlx5_vfio_pci mlx5_ib vfio_pci_core vfio_iommu_type1 ib_uverbs vfio mlx5_core ip_gre nf_tables ipip tunnel4 ib_ipoib ip6_gre gre ip6_tunnel tunnel6 geneve openvswitch nsh xt_conntrack xt_MASQUERADE nf_conntrack_netlink nfnetlink xt_addrtype iptable_nat nf_nat br_netfilter rpcsec_gss_krb5 auth_rpcgss oid_registry overlay rpcrdma ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm ib_core zram zsmalloc fuse [last unloaded: ib_uverbs]
---[ end trace 0000000000000000 ]---
Cc: Radu Pirea (NXP OSS) <radu-nicolae.pirea@oss.nxp.com>
Cc: Sabrina Dubroca <sd@queasysnail.net>
Signed-off-by: Rahul Rameshbabu <rrameshbabu@nvidia.com>
Link: https://lore.kernel.org/r/20240118191811.50271-1-rrameshbabu@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|\
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless
Kalle Valo says:
====================
wireless fixes for v6.8-rc2
The most visible fix here is the ath11k crash fix which was introduced
in v6.7. We also have a fix for iwlwifi memory corruption and few
smaller fixes in the stack.
* tag 'wireless-2024-01-22' of git://git.kernel.org/pub/scm/linux/kernel/git/wireless/wireless:
wifi: mac80211: fix race condition on enabling fast-xmit
wifi: iwlwifi: fix a memory corruption
wifi: mac80211: fix potential sta-link leak
wifi: cfg80211/mac80211: remove dependency on non-existing option
wifi: cfg80211: fix missing interfaces when dumping
wifi: ath11k: rely on mac80211 debugfs handling for vif
wifi: p54: fix GCC format truncation warning with wiphy->fw_version
====================
Link: https://lore.kernel.org/r/20240122153434.E0254C433C7@smtp.kernel.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
fast-xmit must only be enabled after the sta has been uploaded to the driver,
otherwise it could end up passing the not-yet-uploaded sta via drv_tx calls
to the driver, leading to potential crashes because of uninitialized drv_priv
data.
Add a missing sta->uploaded check and re-check fast xmit after inserting a sta.
Signed-off-by: Felix Fietkau <nbd@nbd.name>
Link: https://msgid.link/20240104181059.84032-1-nbd@nbd.name
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
iwl_fw_ini_trigger_tlv::data is a pointer to a __le32, which means that
if we copy to iwl_fw_ini_trigger_tlv::data + offset while offset is in
bytes, we'll write past the buffer.
Cc: stable@vger.kernel.org
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218233
Fixes: cf29c5b66b9f ("iwlwifi: dbg_ini: implement time point handling")
Signed-off-by: Emmanuel Grumbach <emmanuel.grumbach@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240111150610.2d2b8b870194.I14ed76505a5cf87304e0c9cc05cc0ae85ed3bf91@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When a station is allocated, links are added but not
set to valid yet (e.g. during connection to an AP MLD),
we might remove the station without ever marking links
valid, and leak them. Fix that.
Fixes: cb71f1d136a6 ("wifi: mac80211: add sta link addition/removal")
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
Reviewed-by: Ilan Peer <ilan.peer@intel.com>
Signed-off-by: Miri Korenblit <miriam.rachel.korenblit@intel.com>
Link: https://msgid.link/20240111181514.6573998beaf8.I09ac2e1d41c80f82a5a616b8bd1d9d8dd709a6a6@changeid
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Commit ffbd0c8c1e7f ("wifi: mac80211: add an element parsing unit test")
and commit 730eeb17bbdd ("wifi: cfg80211: add first kunit tests, for
element defrag") add new configs that depend on !KERNEL_6_2, but the config
option KERNEL_6_2 does not exist in the tree. This dependency is used for
handling backporting to restrict the option to certain kernels but this
really should not be carried around the mainline kernel tree.
Clean up this needless dependency on the non-existing option KERNEL_6_2.
Link: https://lore.kernel.org/lkml/CAKXUXMyfrM6amOR7Ysim3WNQ-Ckf9HJDqRhAoYmLXujo1UV+yA@mail.gmail.com/
Signed-off-by: Lukas Bulwahn <lukas.bulwahn@gmail.com>
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The nl80211_dump_interface() supports resumption
in case nl80211_send_iface() doesn't have the
resources to complete its work.
The logic would store the progress as iteration
offsets for rdev and wdev loops.
However the logic did not properly handle
resumption for non-last rdev. Assuming a system
with 2 rdevs, with 2 wdevs each, this could
happen:
dump(cb=[0, 0]):
if_start=cb[1] (=0)
send rdev0.wdev0 -> ok
send rdev0.wdev1 -> yield
cb[1] = 1
dump(cb=[0, 1]):
if_start=cb[1] (=1)
send rdev0.wdev1 -> ok
// since if_start=1 the rdev0.wdev0 got skipped
// through if_idx < if_start
send rdev1.wdev1 -> ok
The if_start needs to be reset back to 0 upon wdev
loop end.
The problem is actually hard to hit on a desktop,
and even on most routers. The prerequisites for
this manifesting was:
- more than 1 wiphy
- a few handful of interfaces
- dump without rdev or wdev filter
I was seeing this with 4 wiphys 9 interfaces each.
It'd miss 6 interfaces from the last wiphy
reported to userspace.
Signed-off-by: Michal Kazior <michal@plume.com>
Link: https://msgid.link/20240116142340.89678-1-kazikcz@gmail.com
Signed-off-by: Johannes Berg <johannes.berg@intel.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
mac80211 started to delete debugfs entries in certain cases, causing a
ath11k to crash when it tried to delete the entries later. Fix this by
relying on mac80211 to delete the entries when appropriate and adding
them from the vif_add_debugfs handler.
Fixes: 0a3d898ee9a8 ("wifi: mac80211: add/remove driver debugfs entries as appropriate")
Closes: https://bugzilla.kernel.org/show_bug.cgi?id=218364
Signed-off-by: Benjamin Berg <benjamin.berg@intel.com>
Acked-by: Jeff Johnson <quic_jjohnson@quicinc.com>
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://msgid.link/20240115101805.1277949-1-benjamin@sipsolutions.net
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
GCC 13.2 warns:
drivers/net/wireless/intersil/p54/fwio.c:128:34: warning: '%s' directive output may be truncated writing up to 39 bytes into a region of size 32 [-Wformat-truncation=]
drivers/net/wireless/intersil/p54/fwio.c:128:33: note: directive argument in the range [0, 16777215]
drivers/net/wireless/intersil/p54/fwio.c:128:33: note: directive argument in the range [0, 255]
drivers/net/wireless/intersil/p54/fwio.c:127:17: note: 'snprintf' output between 7 and 52 bytes into a destination of size 32
The issue here is that wiphy->fw_version is 32 bytes and in theory the string
we try to place there can be 39 bytes. wiphy->fw_version is used for providing
the firmware version to user space via ethtool, so not really important.
fw_version in theory can be 24 bytes but in practise it's shorter, so even if
print only 19 bytes via ethtool there should not be any practical difference.
I did consider removing fw_var from the string altogether or making the maximum
length for fw_version 19 bytes, but chose this approach as it was the least
intrusive.
Compile tested only.
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Acked-by: Christian Lamparter <chunkeey@gmail.com> # Tested with Dell 1450 USB
Signed-off-by: Kalle Valo <kvalo@kernel.org>
Link: https://msgid.link/20231219162516.898205-1-kvalo@kernel.org
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In commit 198bc90e0e73("tcp: make sure init the accept_queue's spinlocks
once"), the spinlocks of accept_queue are initialized only when socket is
created in the inet4 scenario. The locks are not initialized when socket
is created in the inet6 scenario. The kernel reports the following error:
INFO: trying to register non-static key.
The code is fine but needs lockdep annotation, or maybe
you didn't initialize this object before use?
turning off the locking correctness validator.
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
Call Trace:
<TASK>
dump_stack_lvl (lib/dump_stack.c:107)
register_lock_class (kernel/locking/lockdep.c:1289)
__lock_acquire (kernel/locking/lockdep.c:5015)
lock_acquire.part.0 (kernel/locking/lockdep.c:5756)
_raw_spin_lock_bh (kernel/locking/spinlock.c:178)
inet_csk_listen_stop (net/ipv4/inet_connection_sock.c:1386)
tcp_disconnect (net/ipv4/tcp.c:2981)
inet_shutdown (net/ipv4/af_inet.c:935)
__sys_shutdown (./include/linux/file.h:32 net/socket.c:2438)
__x64_sys_shutdown (net/socket.c:2445)
do_syscall_64 (arch/x86/entry/common.c:52)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
RIP: 0033:0x7f52ecd05a3d
Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7
48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff
ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48
RSP: 002b:00007f52ecf5dde8 EFLAGS: 00000293 ORIG_RAX: 0000000000000030
RAX: ffffffffffffffda RBX: 00007f52ecf5e640 RCX: 00007f52ecd05a3d
RDX: 00007f52ecc8b188 RSI: 0000000000000000 RDI: 0000000000000004
RBP: 00007f52ecf5de20 R08: 00007ffdae45c69f R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000293 R12: 00007f52ecf5e640
R13: 0000000000000000 R14: 00007f52ecc8b060 R15: 00007ffdae45c6e0
Fixes: 198bc90e0e73 ("tcp: make sure init the accept_queue's spinlocks once")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240122102001.2851701-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
I analyze the potential sleeping issue of the following processes:
Thread A Thread B
... netlink_create //ref = 1
do_mq_notify ...
sock = netlink_getsockbyfilp ... //ref = 2
info->notify_sock = sock; ...
... netlink_sendmsg
... skb = netlink_alloc_large_skb //skb->head is vmalloced
... netlink_unicast
... sk = netlink_getsockbyportid //ref = 3
... netlink_sendskb
... __netlink_sendskb
... skb_queue_tail //put skb to sk_receive_queue
... sock_put //ref = 2
... ...
... netlink_release
... deferred_put_nlk_sk //ref = 1
mqueue_flush_file
spin_lock
remove_notification
netlink_sendskb
sock_put //ref = 0
sk_free
...
__sk_destruct
netlink_sock_destruct
skb_queue_purge //get skb from sk_receive_queue
...
__skb_queue_purge_reason
kfree_skb_reason
__kfree_skb
...
skb_release_all
skb_release_head_state
netlink_skb_destructor
vfree(skb->head) //sleeping while holding spinlock
In netlink_sendmsg, if the memory pointed to by skb->head is allocated by
vmalloc, and is put to sk_receive_queue queue, also the skb is not freed.
When the mqueue executes flush, the sleeping bug will occur. Use
vfree_atomic instead of vfree in netlink_skb_destructor to solve the issue.
Fixes: c05cdb1b864f ("netlink: allow large data transfers from user-space")
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Link: https://lore.kernel.org/r/20240122011807.2110357-1-shaozhengchao@huawei.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Jakub reported that ASSERT_EQ(cpu, i) in so_incoming_cpu.c seems to
fire somewhat randomly.
# # RUN so_incoming_cpu.before_reuseport.test3 ...
# # so_incoming_cpu.c:191:test3:Expected cpu (32) == i (0)
# # test3: Test terminated by assertion
# # FAIL so_incoming_cpu.before_reuseport.test3
# not ok 3 so_incoming_cpu.before_reuseport.test3
When the test failed, not-yet-accepted CLOSE_WAIT sockets received
SYN with a "challenging" SEQ number, which was sent from an unexpected
CPU that did not create the receiver.
The test basically does:
1. for each cpu:
1-1. create a server
1-2. set SO_INCOMING_CPU
2. for each cpu:
2-1. set cpu affinity
2-2. create some clients
2-3. let clients connect() to the server on the same cpu
2-4. close() clients
3. for each server:
3-1. accept() all child sockets
3-2. check if all children have the same SO_INCOMING_CPU with the server
The root cause was the close() in 2-4. and net.ipv4.tcp_tw_reuse.
In a loop of 2., close() changed the client state to FIN_WAIT_2, and
the peer transitioned to CLOSE_WAIT.
In another loop of 2., connect() happened to select the same port of
the FIN_WAIT_2 socket, and it was reused as the default value of
net.ipv4.tcp_tw_reuse is 2.
As a result, the new client sent SYN to the CLOSE_WAIT socket from
a different CPU, and the receiver's sk_incoming_cpu was overwritten
with unexpected CPU ID.
Also, the SYN had a different SEQ number, so the CLOSE_WAIT socket
responded with Challenge ACK. The new client properly returned RST
and effectively killed the CLOSE_WAIT socket.
This way, all clients were created successfully, but the error was
detected later by 3-2., ASSERT_EQ(cpu, i).
To avoid the failure, let's make sure that (i) the number of clients
is less than the number of available ports and (ii) such reuse never
happens.
Fixes: 6df96146b202 ("selftest: Add test for SO_INCOMING_CPU.")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Tested-by: Jakub Kicinski <kuba@kernel.org>
Link: https://lore.kernel.org/r/20240120031642.67014-1-kuniyu@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
On CPUs with weak memory models, reads and updates performed by tcp_push
to the sk variables can get reordered leaving the socket throttled when
it should not. The tasklet running tcp_wfree() may also not observe the
memory updates in time and will skip flushing any packets throttled by
tcp_push(), delaying the sending. This can pathologically cause 40ms
extra latency due to bad interactions with delayed acks.
Adding a memory barrier in tcp_push removes the bug, similarly to the
previous commit bf06200e732d ("tcp: tsq: fix nonagle handling").
smp_mb__after_atomic() is used to not incur in unnecessary overhead
on x86 since not affected.
Patch has been tested using an AWS c7g.2xlarge instance with Ubuntu
22.04 and Apache Tomcat 9.0.83 running the basic servlet below:
import java.io.IOException;
import java.io.OutputStreamWriter;
import java.io.PrintWriter;
import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;
public class HelloWorldServlet extends HttpServlet {
@Override
protected void doGet(HttpServletRequest request, HttpServletResponse response)
throws ServletException, IOException {
response.setContentType("text/html;charset=utf-8");
OutputStreamWriter osw = new OutputStreamWriter(response.getOutputStream(),"UTF-8");
String s = "a".repeat(3096);
osw.write(s,0,s.length());
osw.flush();
}
}
Load was applied using wrk2 (https://github.com/kinvolk/wrk2) from an AWS
c6i.8xlarge instance. Before the patch an additional 40ms latency from P99.99+
values is observed while, with the patch, the extra latency disappears.
No patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello
...
50.000% 0.91ms
75.000% 1.13ms
90.000% 1.46ms
99.000% 1.74ms
99.900% 1.89ms
99.990% 41.95ms <<< 40+ ms extra latency
99.999% 48.32ms
100.000% 48.96ms
With patch and tcp_autocorking=1
./wrk -t32 -c128 -d40s --latency -R10000 http://172.31.60.173:8080/hello/hello
...
50.000% 0.90ms
75.000% 1.13ms
90.000% 1.45ms
99.000% 1.72ms
99.900% 1.83ms
99.990% 2.11ms <<< no 40+ ms extra latency
99.999% 2.53ms
100.000% 2.62ms
Patch has been also tested on x86 (m7i.2xlarge instance) which it is not
affected by this issue and the patch doesn't introduce any additional
delay.
Fixes: 7aa5470c2c09 ("tcp: tsq: move tsq_flags close to sk_wmem_alloc")
Signed-off-by: Salvatore Dipietro <dipiets@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240119190133.43698-1-dipiets@amazon.com
Signed-off-by: Paolo Abeni <pabeni@redhat.com>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Syzcaller UBSAN crash occurs in rds_cmsg_recv(),
which reads inc->i_rx_lat_trace[j + 1] with index 4 (3 + 1),
but with array size of 4 (RDS_RX_MAX_TRACES).
Here 'j' is assigned from rs->rs_rx_trace[i] and in-turn from
trace.rx_trace_pos[i] in rds_recv_track_latency(),
with both arrays sized 3 (RDS_MSG_RX_DGRAM_TRACE_MAX). So fix the
off-by-one bounds check in rds_recv_track_latency() to prevent
a potential crash in rds_cmsg_recv().
Found by syzcaller:
=================================================================
UBSAN: array-index-out-of-bounds in net/rds/recv.c:585:39
index 4 is out of range for type 'u64 [4]'
CPU: 1 PID: 8058 Comm: syz-executor228 Not tainted 6.6.0-gd2f51b3516da #1
Hardware name: QEMU Standard PC (i440FX + PIIX, 1996),
BIOS 1.15.0-1 04/01/2014
Call Trace:
<TASK>
__dump_stack lib/dump_stack.c:88 [inline]
dump_stack_lvl+0x136/0x150 lib/dump_stack.c:106
ubsan_epilogue lib/ubsan.c:217 [inline]
__ubsan_handle_out_of_bounds+0xd5/0x130 lib/ubsan.c:348
rds_cmsg_recv+0x60d/0x700 net/rds/recv.c:585
rds_recvmsg+0x3fb/0x1610 net/rds/recv.c:716
sock_recvmsg_nosec net/socket.c:1044 [inline]
sock_recvmsg+0xe2/0x160 net/socket.c:1066
__sys_recvfrom+0x1b6/0x2f0 net/socket.c:2246
__do_sys_recvfrom net/socket.c:2264 [inline]
__se_sys_recvfrom net/socket.c:2260 [inline]
__x64_sys_recvfrom+0xe0/0x1b0 net/socket.c:2260
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x40/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
==================================================================
Fixes: 3289025aedc0 ("RDS: add receive message trace used by application")
Reported-by: Chenyuan Yang <chenyuan0y@gmail.com>
Closes: https://lore.kernel.org/linux-rdma/CALGdzuoVdq-wtQ4Az9iottBqC5cv9ZhcE5q8N7LfYFvkRsOVcw@mail.gmail.com/
Signed-off-by: Sharath Srinivasan <sharath.srinivasan@oracle.com>
Reviewed-by: Simon Horman <horms@kernel.org>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The HW has the capability to check each frame if it is a PTP frame,
which domain it is, which ptp frame type it is, different ip address in
the frame. And if one of these checks fail then the frame is not
timestamp. Most of these checks were disabled except checking the field
minorVersionPTP inside the PTP header. Meaning that once a partner sends
a frame compliant to 8021AS which has minorVersionPTP set to 1, then the
frame was not timestamp because the HW expected by default a value of 0
in minorVersionPTP. This is exactly the same issue as on lan8841.
Fix this issue by removing this check so the userspace can decide on this.
Fixes: ece19502834d ("net: phy: micrel: 1588 support for LAN8814 phy")
Signed-off-by: Horatiu Vultur <horatiu.vultur@microchip.com>
Reviewed-by: Maxime Chevallier <maxime.chevallier@bootlin.com>
Reviewed-by: Divya Koppera <divya.koppera@microchip.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Arkadiusz Kubalewski says:
====================
dpll: fix unordered unbind/bind registerer issues
Fix issues when performing unordered unbind/bind of a kernel modules
which are using a dpll device with DPLL_PIN_TYPE_MUX pins.
Currently only serialized bind/unbind of such use case works, fix
the issues and allow for unserialized kernel module bind order.
The issues are observed on the ice driver, i.e.,
$ echo 0000:af:00.0 > /sys/bus/pci/drivers/ice/unbind
$ echo 0000:af:00.1 > /sys/bus/pci/drivers/ice/unbind
results in:
ice 0000:af:00.0: Removed PTP clock
BUG: kernel NULL pointer dereference, address: 0000000000000010
PF: supervisor read access in kernel mode
PF: error_code(0x0000) - not-present page
PGD 0 P4D 0
Oops: 0000 [#1] PREEMPT SMP PTI
CPU: 7 PID: 71848 Comm: bash Kdump: loaded Not tainted 6.6.0-rc5_next-queue_19th-Oct-2023-01625-g039e5d15e451 #1
Hardware name: Intel Corporation S2600STB/S2600STB, BIOS SE5C620.86B.02.01.0008.031920191559 03/19/2019
RIP: 0010:ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
Code: 41 57 4d 89 cf 41 56 41 55 4d 89 c5 41 54 55 48 89 f5 53 4c 8b 66 08 48 89 cb 4d 8d b4 24 f0 49 00 00 4c 89 f7 e8 71 ec 1f c5 <0f> b6 5b 10 41 0f b6 84 24 30 4b 00 00 29 c3 41 0f b6 84 24 28 4b
RSP: 0018:ffffc902b179fb60 EFLAGS: 00010246
RAX: 0000000000000000 RBX: 0000000000000000 RCX: 0000000000000000
RDX: ffff8882c1398000 RSI: ffff888c7435cc60 RDI: ffff888c7435cb90
RBP: ffff888c7435cc60 R08: ffffc902b179fbb0 R09: 0000000000000000
R10: ffff888ef1fc8050 R11: fffffffffff82700 R12: ffff888c743581a0
R13: ffffc902b179fbb0 R14: ffff888c7435cb90 R15: 0000000000000000
FS: 00007fdc7dae0740(0000) GS:ffff888c105c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000010 CR3: 0000000132c24002 CR4: 00000000007706e0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __die+0x20/0x70
? page_fault_oops+0x76/0x170
? exc_page_fault+0x65/0x150
? asm_exc_page_fault+0x22/0x30
? ice_dpll_rclk_state_on_pin_get+0x2f/0x90 [ice]
? __pfx_ice_dpll_rclk_state_on_pin_get+0x10/0x10 [ice]
dpll_msg_add_pin_parents+0x142/0x1d0
dpll_pin_event_send+0x7d/0x150
dpll_pin_on_pin_unregister+0x3f/0x100
ice_dpll_deinit_pins+0xa1/0x230 [ice]
ice_dpll_deinit+0x29/0xe0 [ice]
ice_remove+0xcd/0x200 [ice]
pci_device_remove+0x33/0xa0
device_release_driver_internal+0x193/0x200
unbind_store+0x9d/0xb0
kernfs_fop_write_iter+0x128/0x1c0
vfs_write+0x2bb/0x3e0
ksys_write+0x5f/0xe0
do_syscall_64+0x59/0x90
? filp_close+0x1b/0x30
? do_dup2+0x7d/0xd0
? syscall_exit_work+0x103/0x130
? syscall_exit_to_user_mode+0x22/0x40
? do_syscall_64+0x69/0x90
? syscall_exit_work+0x103/0x130
? syscall_exit_to_user_mode+0x22/0x40
? do_syscall_64+0x69/0x90
entry_SYSCALL_64_after_hwframe+0x6e/0xd8
RIP: 0033:0x7fdc7d93eb97
Code: 0b 00 f7 d8 64 89 02 48 c7 c0 ff ff ff ff eb b7 0f 1f 00 f3 0f 1e fa 64 8b 04 25 18 00 00 00 85 c0 75 10 b8 01 00 00 00 0f 05 <48> 3d 00 f0 ff ff 77 51 c3 48 83 ec 28 48 89 54 24 18 48 89 74 24
RSP: 002b:00007fff2aa91028 EFLAGS: 00000246 ORIG_RAX: 0000000000000001
RAX: ffffffffffffffda RBX: 000000000000000d RCX: 00007fdc7d93eb97
RDX: 000000000000000d RSI: 00005644814ec9b0 RDI: 0000000000000001
RBP: 00005644814ec9b0 R08: 0000000000000000 R09: 00007fdc7d9b14e0
R10: 00007fdc7d9b13e0 R11: 0000000000000246 R12: 000000000000000d
R13: 00007fdc7d9fb780 R14: 000000000000000d R15: 00007fdc7d9f69e0
</TASK>
Modules linked in: uinput vfio_pci vfio_pci_core vfio_iommu_type1 vfio irqbypass ixgbevf snd_seq_dummy snd_hrtimer snd_seq snd_timer snd_seq_device snd soundcore overlay qrtr rfkill vfat fat xfs libcrc32c rpcrdma sunrpc rdma_ucm ib_srpt ib_isert iscsi_target_mod target_core_mod ib_iser libiscsi scsi_transport_iscsi rdma_cm iw_cm ib_cm intel_rapl_msr intel_rapl_common intel_uncore_frequency intel_uncore_frequency_common isst_if_common skx_edac nfit libnvdimm ipmi_ssif x86_pkg_temp_thermal intel_powerclamp coretemp irdma rapl intel_cstate ib_uverbs iTCO_wdt iTCO_vendor_support acpi_ipmi intel_uncore mei_me ipmi_si pcspkr i2c_i801 ib_core mei ipmi_devintf intel_pch_thermal ioatdma i2c_smbus ipmi_msghandler lpc_ich joydev acpi_power_meter acpi_pad ext4 mbcache jbd2 sd_mod t10_pi sg ast i2c_algo_bit drm_shmem_helper drm_kms_helper ice crct10dif_pclmul ixgbe crc32_pclmul drm crc32c_intel ahci i40e libahci ghash_clmulni_intel libata mdio dca gnss wmi fuse [last unloaded: iavf]
CR2: 0000000000000010
v6:
- fix memory corruption on error path in patch [v5 2/4]
====================
Acked-by: Vadim Fedorenko <vadim.fedorenko@linux.dev>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
In case of multiple kernel module instances using the same dpll device:
if only one registers dpll device, then only that one can register
directly connected pins with a dpll device. When unregistered parent is
responsible for determining if the muxed pin can be registered with it
or not, the drivers need to be loaded in serialized order to work
correctly - first the driver instance which registers the direct pins
needs to be loaded, then the other instances could register muxed type
pins.
Allow registration of a pin with a parent even if the parent was not
yet registered, thus allow ability for unserialized driver instance
load order.
Do not WARN_ON notification for unregistered pin, which can be invoked
for described case, instead just return error.
Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base functions")
Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions")
Reviewed-by: Jan Glaza <jan.glaza@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
If parent pin was unregistered but child pin was not, the userspace
would see the "zombie" pins - the ones that were registered with
a parent pin (dpll_pin_on_pin_register(..)).
Technically those are not available - as there is no dpll device in the
system. Do not dump those pins and prevent userspace from any
interaction with them. Provide a unified function to determine if the
pin is available and use it before acting/responding for user requests.
Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions")
Reviewed-by: Jan Glaza <jan.glaza@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
When a kernel module is unbound but the pin resources were not entirely
freed (other kernel module instance of the same PCI device have had kept
the reference to that pin), and kernel module is again bound, the pin
properties would not be updated (the properties are only assigned when
memory for the pin is allocated), prop pointer still points to the
kernel module memory of the kernel module which was deallocated on the
unbind.
If the pin dump is invoked in this state, the result is a kernel crash.
Prevent the crash by storing persistent pin properties in dpll subsystem,
copy the content from the kernel module when pin is allocated, instead of
using memory of the kernel module.
Fixes: 9431063ad323 ("dpll: core: Add DPLL framework base functions")
Fixes: 9d71b54b65b1 ("dpll: netlink: Add DPLL framework base functions")
Reviewed-by: Jan Glaza <jan.glaza@intel.com>
Reviewed-by: Przemek Kitszel <przemyslaw.kitszel@intel.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If pin type is not expected, or pin properities failed to allocate
memory, the unwind error path shall not destroy pin's xarrays, which
were not yet initialized.
Add new goto label and use it to fix broken error path.
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Signed-off-by: Arkadiusz Kubalewski <arkadiusz.kubalewski@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Yunjian Wang says:
====================
fixes for tun
There are few places on the receive path where packet receives and
packet drops were not accounted for. This patchset fixes that issue.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The TUN can be used as vhost-net backend, and it is necessary to
count the packets transmitted from TUN to vhost-net/virtio-net.
However, there are some places in the receive path that were not
taken into account when using XDP. It would be beneficial to also
include new accounting for successfully received bytes using
dev_sw_netstats_rx_add.
Fixes: 761876c857cb ("tap: XDP support")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The commit 8ae1aff0b331 ("tuntap: split out XDP logic") includes
dropped counter for XDP_DROP, XDP_ABORTED, and invalid XDP actions.
Unfortunately, that commit missed the dropped counter when error
occurs during XDP_TX and XDP_REDIRECT actions. This patch fixes
this issue.
Fixes: 8ae1aff0b331 ("tuntap: split out XDP logic")
Signed-off-by: Yunjian Wang <wangyunjian@huawei.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Acked-by: Jason Wang <jasowang@redhat.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Mark reports a BUG() when a net namespace is removed.
kernel BUG at net/core/dev.c:11520!
Physical interfaces moved outside of init_net get "refunded"
to init_net when that namespace disappears. The main interface
name may get overwritten in the process if it would have
conflicted. We need to also discard all conflicting altnames.
Recent fixes addressed ensuring that altnames get moved
with the main interface, which surfaced this problem.
Reported-by: Марк Коренберг <socketpair@gmail.com>
Link: https://lore.kernel.org/all/CAEmTpZFZ4Sv3KwqFOY2WKDHeZYdi0O7N5H1nTvcGp=SAEavtDg@mail.gmail.com/
Fixes: 7663d522099e ("net: check for altname conflicts when changing netdev's netns")
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Jiri Pirko <jiri@nvidia.com>
Reviewed-by: Xin Long <lucien.xin@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
idpf registers multiple netdevs (virtual ports) for one PCI function,
but it does not provide a way for userspace to distinguish them with
sysfs attributes. Per Documentation/ABI/testing/sysfs-class-net, it is
a bug not to set dev_port for independent ports on the same PCI bus,
device and function.
Without dev_port set, systemd-udevd's default naming policy attempts
to assign the same name ("ens2f0") to all four idpf netdevs on my test
system and obviously fails, leaving three of them with the initial
eth<N> name.
With this patch, systemd-udevd is able to assign unique names to the
netdevs (e.g. "ens2f0", "ens2f0d1", "ens2f0d2", "ens2f0d3").
The Intel-provided out-of-tree idpf driver already sets dev_port. In
this patch I chose to do it in the same place in the idpf_cfg_netdev
function.
Fixes: 0fe45467a104 ("idpf: add create vport and netdev configuration")
Signed-off-by: Michal Schmidt <mschmidt@redhat.com>
Reviewed-by: Jesse Brandeburg <jesse.brandeburg@intel.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Generic sk_busy_loop_end() only looks at sk->sk_receive_queue
for presence of packets.
Problem is that for UDP sockets after blamed commit, some packets
could be present in another queue: udp_sk(sk)->reader_queue
In some cases, a busy poller could spin until timeout expiration,
even if some packets are available in udp_sk(sk)->reader_queue.
v3: - make sk_busy_loop_end() nicer (Willem)
v2: - add a READ_ONCE(sk->sk_family) in sk_is_inet() to avoid KCSAN splats.
- add a sk_is_inet() check in sk_is_udp() (Willem feedback)
- add a sk_is_inet() check in sk_is_tcp().
Fixes: 2276f58ac589 ("udp: use a separate rx queue for packet reception")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Paolo Abeni <pabeni@redhat.com>
Reviewed-by: Willem de Bruijn <willemb@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
syzbot reported an uninit-value bug below. [0]
llc supports ETH_P_802_2 (0x0004) and used to support ETH_P_TR_802_2
(0x0011), and syzbot abused the latter to trigger the bug.
write$tun(r0, &(0x7f0000000040)={@val={0x0, 0x11}, @val, @mpls={[], @llc={@snap={0xaa, 0x1, ')', "90e5dd"}}}}, 0x16)
llc_conn_handler() initialises local variables {saddr,daddr}.mac
based on skb in llc_pdu_decode_sa()/llc_pdu_decode_da() and passes
them to __llc_lookup().
However, the initialisation is done only when skb->protocol is
htons(ETH_P_802_2), otherwise, __llc_lookup_established() and
__llc_lookup_listener() will read garbage.
The missing initialisation existed prior to commit 211ed865108e
("net: delete all instances of special processing for token ring").
It removed the part to kick out the token ring stuff but forgot to
close the door allowing ETH_P_TR_802_2 packets to sneak into llc_rcv().
Let's remove llc_tr_packet_type and complete the deprecation.
[0]:
BUG: KMSAN: uninit-value in __llc_lookup_established+0xe9d/0xf90
__llc_lookup_established+0xe9d/0xf90
__llc_lookup net/llc/llc_conn.c:611 [inline]
llc_conn_handler+0x4bd/0x1360 net/llc/llc_conn.c:791
llc_rcv+0xfbb/0x14a0 net/llc/llc_input.c:206
__netif_receive_skb_one_core net/core/dev.c:5527 [inline]
__netif_receive_skb+0x1a6/0x5a0 net/core/dev.c:5641
netif_receive_skb_internal net/core/dev.c:5727 [inline]
netif_receive_skb+0x58/0x660 net/core/dev.c:5786
tun_rx_batched+0x3ee/0x980 drivers/net/tun.c:1555
tun_get_user+0x53af/0x66d0 drivers/net/tun.c:2002
tun_chr_write_iter+0x3af/0x5d0 drivers/net/tun.c:2048
call_write_iter include/linux/fs.h:2020 [inline]
new_sync_write fs/read_write.c:491 [inline]
vfs_write+0x8ef/0x1490 fs/read_write.c:584
ksys_write+0x20f/0x4c0 fs/read_write.c:637
__do_sys_write fs/read_write.c:649 [inline]
__se_sys_write fs/read_write.c:646 [inline]
__x64_sys_write+0x93/0xd0 fs/read_write.c:646
do_syscall_x64 arch/x86/entry/common.c:51 [inline]
do_syscall_64+0x44/0x110 arch/x86/entry/common.c:82
entry_SYSCALL_64_after_hwframe+0x63/0x6b
Local variable daddr created at:
llc_conn_handler+0x53/0x1360 net/llc/llc_conn.c:783
llc_rcv+0xfbb/0x14a0 net/llc/llc_input.c:206
CPU: 1 PID: 5004 Comm: syz-executor994 Not tainted 6.6.0-syzkaller-14500-g1c41041124bd #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 10/09/2023
Fixes: 211ed865108e ("net: delete all instances of special processing for token ring")
Reported-by: syzbot+b5ad66046b913bc04c6f@syzkaller.appspotmail.com
Closes: https://syzkaller.appspot.com/bug?extid=b5ad66046b913bc04c6f
Signed-off-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240119015515.61898-1-kuniyu@amazon.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
syzbot was able to trick llc_ui_sendmsg(), allocating an skb with no
headroom, but subsequently trying to push 14 bytes of Ethernet header [1]
Like some others, llc_ui_sendmsg() releases the socket lock before
calling sock_alloc_send_skb().
Then it acquires it again, but does not redo all the sanity checks
that were performed.
This fix:
- Uses LL_RESERVED_SPACE() to reserve space.
- Check all conditions again after socket lock is held again.
- Do not account Ethernet header for mtu limitation.
[1]
skbuff: skb_under_panic: text:ffff800088baa334 len:1514 put:14 head:ffff0000c9c37000 data:ffff0000c9c36ff2 tail:0x5dc end:0x6c0 dev:bond0
kernel BUG at net/core/skbuff.c:193 !
Internal error: Oops - BUG: 00000000f2000800 [#1] PREEMPT SMP
Modules linked in:
CPU: 0 PID: 6875 Comm: syz-executor.0 Not tainted 6.7.0-rc8-syzkaller-00101-g0802e17d9aca-dirty #0
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 11/17/2023
pstate: 60400005 (nZCv daif +PAN -UAO -TCO -DIT -SSBS BTYPE=--)
pc : skb_panic net/core/skbuff.c:189 [inline]
pc : skb_under_panic+0x13c/0x140 net/core/skbuff.c:203
lr : skb_panic net/core/skbuff.c:189 [inline]
lr : skb_under_panic+0x13c/0x140 net/core/skbuff.c:203
sp : ffff800096f97000
x29: ffff800096f97010 x28: ffff80008cc8d668 x27: dfff800000000000
x26: ffff0000cb970c90 x25: 00000000000005dc x24: ffff0000c9c36ff2
x23: ffff0000c9c37000 x22: 00000000000005ea x21: 00000000000006c0
x20: 000000000000000e x19: ffff800088baa334 x18: 1fffe000368261ce
x17: ffff80008e4ed000 x16: ffff80008a8310f8 x15: 0000000000000001
x14: 1ffff00012df2d58 x13: 0000000000000000 x12: 0000000000000000
x11: 0000000000000001 x10: 0000000000ff0100 x9 : e28a51f1087e8400
x8 : e28a51f1087e8400 x7 : ffff80008028f8d0 x6 : 0000000000000000
x5 : 0000000000000001 x4 : 0000000000000001 x3 : ffff800082b78714
x2 : 0000000000000001 x1 : 0000000100000000 x0 : 0000000000000089
Call trace:
skb_panic net/core/skbuff.c:189 [inline]
skb_under_panic+0x13c/0x140 net/core/skbuff.c:203
skb_push+0xf0/0x108 net/core/skbuff.c:2451
eth_header+0x44/0x1f8 net/ethernet/eth.c:83
dev_hard_header include/linux/netdevice.h:3188 [inline]
llc_mac_hdr_init+0x110/0x17c net/llc/llc_output.c:33
llc_sap_action_send_xid_c+0x170/0x344 net/llc/llc_s_ac.c:85
llc_exec_sap_trans_actions net/llc/llc_sap.c:153 [inline]
llc_sap_next_state net/llc/llc_sap.c:182 [inline]
llc_sap_state_process+0x1ec/0x774 net/llc/llc_sap.c:209
llc_build_and_send_xid_pkt+0x12c/0x1c0 net/llc/llc_sap.c:270
llc_ui_sendmsg+0x7bc/0xb1c net/llc/af_llc.c:997
sock_sendmsg_nosec net/socket.c:730 [inline]
__sock_sendmsg net/socket.c:745 [inline]
sock_sendmsg+0x194/0x274 net/socket.c:767
splice_to_socket+0x7cc/0xd58 fs/splice.c:881
do_splice_from fs/splice.c:933 [inline]
direct_splice_actor+0xe4/0x1c0 fs/splice.c:1142
splice_direct_to_actor+0x2a0/0x7e4 fs/splice.c:1088
do_splice_direct+0x20c/0x348 fs/splice.c:1194
do_sendfile+0x4bc/0xc70 fs/read_write.c:1254
__do_sys_sendfile64 fs/read_write.c:1322 [inline]
__se_sys_sendfile64 fs/read_write.c:1308 [inline]
__arm64_sys_sendfile64+0x160/0x3b4 fs/read_write.c:1308
__invoke_syscall arch/arm64/kernel/syscall.c:37 [inline]
invoke_syscall+0x98/0x2b8 arch/arm64/kernel/syscall.c:51
el0_svc_common+0x130/0x23c arch/arm64/kernel/syscall.c:136
do_el0_svc+0x48/0x58 arch/arm64/kernel/syscall.c:155
el0_svc+0x54/0x158 arch/arm64/kernel/entry-common.c:678
el0t_64_sync_handler+0x84/0xfc arch/arm64/kernel/entry-common.c:696
el0t_64_sync+0x190/0x194 arch/arm64/kernel/entry.S:595
Code: aa1803e6 aa1903e7 a90023f5 94792f6a (d4210000)
Fixes: 1da177e4c3f4 ("Linux-2.6.12-rc2")
Reported-and-tested-by: syzbot+2a7024e9502df538e8ef@syzkaller.appspotmail.com
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reviewed-by: Kuniyuki Iwashima <kuniyu@amazon.com>
Link: https://lore.kernel.org/r/20240118183625.4007013-1-edumazet@google.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In the vlan_changelink function, a loop is used to parse the nested
attributes IFLA_VLAN_EGRESS_QOS and IFLA_VLAN_INGRESS_QOS in order to
obtain the struct ifla_vlan_qos_mapping. These two nested attributes are
checked in the vlan_validate_qos_map function, which calls
nla_validate_nested_deprecated with the vlan_map_policy.
However, this deprecated validator applies a LIBERAL strictness, allowing
the presence of an attribute with the type IFLA_VLAN_QOS_UNSPEC.
Consequently, the loop in vlan_changelink may parse an attribute of type
IFLA_VLAN_QOS_UNSPEC and believe it carries a payload of
struct ifla_vlan_qos_mapping, which is not necessarily true.
To address this issue and ensure compatibility, this patch introduces two
type checks that skip attributes whose type is not IFLA_VLAN_QOS_MAPPING.
Fixes: 07b5b17e157b ("[VLAN]: Use rtnl_link API")
Signed-off-by: Lin Ma <linma@zju.edu.cn>
Reviewed-by: Simon Horman <horms@kernel.org>
Link: https://lore.kernel.org/r/20240118130306.1644001-1-linma@zju.edu.cn
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Michael Chan says:
====================
bnxt_en: Bug fixes
This series contains 5 miscellaneous fixes. The fixes include adding
delay for FLR, buffer memory leak, RSS table size calculation,
ethtool self test kernel warning, and mqprio crash.
====================
Link: https://lore.kernel.org/r/20240117234515.226944-1-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The driver relies on netdev_get_num_tc() to get the number of HW
offloaded mqprio TCs to allocate and free TX rings. This won't
work and can potentially crash the system if software mqprio or
taprio TCs have been setup. netdev_get_num_tc() will return the
number of software TCs and it may cause the driver to allocate or
free more TX rings that it should. Fix it by adding a bp->num_tc
field to store the number of HW offload mqprio TCs for the device.
Use bp->num_tc instead of netdev_get_num_tc().
This fixes a crash like this:
BUG: kernel NULL pointer dereference, address: 0000000000000000
PGD 42b8404067 P4D 0
Oops: 0000 [#1] PREEMPT SMP NOPTI
CPU: 120 PID: 8661 Comm: ifconfig Kdump: loaded Tainted: G OE 5.18.16 #1
Hardware name: Lenovo ThinkSystem SR650 V3/SB27A92818, BIOS ESE114N-2.12 04/25/2023
RIP: 0010:bnxt_hwrm_cp_ring_alloc_p5+0x10/0x90 [bnxt_en]
Code: 41 5c 41 5d 41 5e c3 cc cc cc cc 41 8b 44 24 08 66 89 03 eb c6 e8 b0 f1 7d db 0f 1f 44 00 00 41 56 41 55 41 54 55 48 89 fd 53 <48> 8b 06 48 89 f3 48 81 c6 28 01 00 00 0f b6 96 13 ff ff ff 44 8b
RSP: 0018:ff65907660d1fa88 EFLAGS: 00010202
RAX: 0000000000000010 RBX: ff4dde1d907e4980 RCX: f400000000000000
RDX: 0000000000000010 RSI: 0000000000000000 RDI: ff4dde1d907e4980
RBP: ff4dde1d907e4980 R08: 000000000000000f R09: 0000000000000000
R10: ff4dde5f02671800 R11: 0000000000000008 R12: 0000000088888889
R13: 0500000000000000 R14: 00f0000000000000 R15: ff4dde5f02671800
FS: 00007f4b126b5740(0000) GS:ff4dde9bff600000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000000000000 CR3: 000000416f9c6002 CR4: 0000000000771ee0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe07f0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
bnxt_hwrm_ring_alloc+0x204/0x770 [bnxt_en]
bnxt_init_chip+0x4d/0x680 [bnxt_en]
? bnxt_poll+0x1a0/0x1a0 [bnxt_en]
__bnxt_open_nic+0xd2/0x740 [bnxt_en]
bnxt_open+0x10b/0x220 [bnxt_en]
? raw_notifier_call_chain+0x41/0x60
__dev_open+0xf3/0x1b0
__dev_change_flags+0x1db/0x250
dev_change_flags+0x21/0x60
devinet_ioctl+0x590/0x720
? avc_has_extended_perms+0x1b7/0x420
? _copy_from_user+0x3a/0x60
inet_ioctl+0x189/0x1c0
? wp_page_copy+0x45a/0x6e0
sock_do_ioctl+0x42/0xf0
? ioctl_has_perm.constprop.0.isra.0+0xbd/0x120
sock_ioctl+0x1ce/0x2e0
__x64_sys_ioctl+0x87/0xc0
do_syscall_64+0x59/0x90
? syscall_exit_work+0x103/0x130
? syscall_exit_to_user_mode+0x12/0x30
? do_syscall_64+0x69/0x90
? exc_page_fault+0x62/0x150
Fixes: c0c050c58d84 ("bnxt_en: New Broadcom ethernet driver.")
Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240117234515.226944-6-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
We call bnxt_half_open_nic() to setup the chip partially to run
loopback tests. The rings and buffers are initialized normally
so that we can transmit and receive packets in loopback mode.
That means page pool buffers are allocated for the aggregation ring
just like the normal case. NAPI is not needed because we are just
polling for the loopback packets.
When we're done with the loopback tests, we call bnxt_half_close_nic()
to clean up. When freeing the page pools, we hit a WARN_ON()
in page_pool_unlink_napi() because the NAPI state linked to the
page pool is uninitialized.
The simplest way to avoid this warning is just to initialize the
NAPIs during half open and delete the NAPIs during half close.
Trying to skip the page pool initialization or skip linking of
NAPI during half open will be more complicated.
This fix avoids this warning:
WARNING: CPU: 4 PID: 46967 at net/core/page_pool.c:946 page_pool_unlink_napi+0x1f/0x30
CPU: 4 PID: 46967 Comm: ethtool Tainted: G S W 6.7.0-rc5+ #22
Hardware name: Dell Inc. PowerEdge R750/06V45N, BIOS 1.3.8 08/31/2021
RIP: 0010:page_pool_unlink_napi+0x1f/0x30
Code: 90 90 90 90 90 90 90 90 90 90 90 0f 1f 44 00 00 48 8b 47 18 48 85 c0 74 1b 48 8b 50 10 83 e2 01 74 08 8b 40 34 83 f8 ff 74 02 <0f> 0b 48 c7 47 18 00 00 00 00 c3 cc cc cc cc 66 90 90 90 90 90 90
RSP: 0018:ffa000003d0dfbe8 EFLAGS: 00010246
RAX: ff110003607ce640 RBX: ff110010baf5d000 RCX: 0000000000000008
RDX: 0000000000000000 RSI: ff110001e5e522c0 RDI: ff110010baf5d000
RBP: ff11000145539b40 R08: 0000000000000001 R09: ffffffffc063f641
R10: ff110001361eddb8 R11: 000000000040000f R12: 0000000000000001
R13: 000000000000001c R14: ff1100014553a080 R15: 0000000000003fc0
FS: 00007f9301c4f740(0000) GS:ff1100103fd00000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 00007f91344fa8f0 CR3: 00000003527cc005 CR4: 0000000000771ef0
DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
PKRU: 55555554
Call Trace:
<TASK>
? __warn+0x81/0x140
? page_pool_unlink_napi+0x1f/0x30
? report_bug+0x102/0x200
? handle_bug+0x44/0x70
? exc_invalid_op+0x13/0x60
? asm_exc_invalid_op+0x16/0x20
? bnxt_free_ring.isra.123+0xb1/0xd0 [bnxt_en]
? page_pool_unlink_napi+0x1f/0x30
page_pool_destroy+0x3e/0x150
bnxt_free_mem+0x441/0x5e0 [bnxt_en]
bnxt_half_close_nic+0x2a/0x40 [bnxt_en]
bnxt_self_test+0x21d/0x450 [bnxt_en]
__dev_ethtool+0xeda/0x2e30
? native_queued_spin_lock_slowpath+0x17f/0x2b0
? __link_object+0xa1/0x160
? _raw_spin_unlock_irqrestore+0x23/0x40
? __create_object+0x5f/0x90
? __kmem_cache_alloc_node+0x317/0x3c0
? dev_ethtool+0x59/0x170
dev_ethtool+0xa7/0x170
dev_ioctl+0xc3/0x530
sock_do_ioctl+0xa8/0xf0
sock_ioctl+0x270/0x310
__x64_sys_ioctl+0x8c/0xc0
do_syscall_64+0x3e/0xf0
entry_SYSCALL_64_after_hwframe+0x6e/0x76
Fixes: 294e39e0d034 ("bnxt: hook NAPIs to page pools")
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Ajit Khaparde <ajit.khaparde@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240117234515.226944-5-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
The existing formula used in the driver to calculate the number of RSS
table entries is to round up the number of RX rings to the next integer
multiples of 64 (e.g. 64, 128, 192, ..). This is incorrect. The valid
values supported by the chip are 64, 128, 256, 512 only (power of 2
starting from 64). When the number of RX rings is greater than 128, the
entry size will likely be wrong. Firmware will round down the invalid
value (e.g. 192 rounded down to 128) provided by the driver, causing some
RSS rings to not receive any packets.
We already have an existing function bnxt_calc_nr_ring_pages() to
do this calculation. Use it in bnxt_get_nr_rss_ctxs() to calculate the
number of RSS contexts correctly for P5_PLUS chips.
Reviewed-by: Andy Gospodarek <andrew.gospodarek@broadcom.com>
Reviewed-by: Pavan Chebbi <pavan.chebbi@broadcom.com>
Fixes: 7b3af4f75b81 ("bnxt_en: Add RSS support for 57500 chips.")
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240117234515.226944-4-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
bnxt_hwrm_get_rings() can abort and return error when there are not
enough ring resources. It aborts without releasing the HWRM DMA buffer,
causing a dma_pool_destroy warning when the driver is unloaded:
bnxt_en 0000:99:00.0: dma_pool_destroy bnxt_hwrm, 000000005b089ba8 busy
Fixes: f1e50b276d37 ("bnxt_en: Fix trimming of P5 RX and TX rings")
Reviewed-by: Somnath Kotur <somnath.kotur@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240117234515.226944-3-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
|/ /
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The first message to firmware may fail if the device is undergoing FLR.
The driver has some recovery logic for this failure scenario but we must
wait 100 msec for FLR to complete before proceeding. Otherwise the
recovery will always fail.
Fixes: ba02629ff6cb ("bnxt_en: log firmware status on firmware init failure")
Reviewed-by: Damodharam Ammepalli <damodharam.ammepalli@broadcom.com>
Signed-off-by: Michael Chan <michael.chan@broadcom.com>
Link: https://lore.kernel.org/r/20240117234515.226944-2-michael.chan@broadcom.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When I run syz's reproduction C program locally, it causes the following
issue:
pvqspinlock: lock 0xffff9d181cd5c660 has corrupted value 0x0!
WARNING: CPU: 19 PID: 21160 at __pv_queued_spin_unlock_slowpath (kernel/locking/qspinlock_paravirt.h:508)
Hardware name: Red Hat KVM, BIOS 0.5.1 01/01/2011
RIP: 0010:__pv_queued_spin_unlock_slowpath (kernel/locking/qspinlock_paravirt.h:508)
Code: 73 56 3a ff 90 c3 cc cc cc cc 8b 05 bb 1f 48 01 85 c0 74 05 c3 cc cc cc cc 8b 17 48 89 fe 48 c7 c7
30 20 ce 8f e8 ad 56 42 ff <0f> 0b c3 cc cc cc cc 0f 0b 0f 1f 40 00 90 90 90 90 90 90 90 90 90
RSP: 0018:ffffa8d200604cb8 EFLAGS: 00010282
RAX: 0000000000000000 RBX: 0000000000000000 RCX: ffff9d1ef60e0908
RDX: 00000000ffffffd8 RSI: 0000000000000027 RDI: ffff9d1ef60e0900
RBP: ffff9d181cd5c280 R08: 0000000000000000 R09: 00000000ffff7fff
R10: ffffa8d200604b68 R11: ffffffff907dcdc8 R12: 0000000000000000
R13: ffff9d181cd5c660 R14: ffff9d1813a3f330 R15: 0000000000001000
FS: 00007fa110184640(0000) GS:ffff9d1ef60c0000(0000) knlGS:0000000000000000
CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
CR2: 0000000020000000 CR3: 000000011f65e000 CR4: 00000000000006f0
Call Trace:
<IRQ>
_raw_spin_unlock (kernel/locking/spinlock.c:186)
inet_csk_reqsk_queue_add (net/ipv4/inet_connection_sock.c:1321)
inet_csk_complete_hashdance (net/ipv4/inet_connection_sock.c:1358)
tcp_check_req (net/ipv4/tcp_minisocks.c:868)
tcp_v4_rcv (net/ipv4/tcp_ipv4.c:2260)
ip_protocol_deliver_rcu (net/ipv4/ip_input.c:205)
ip_local_deliver_finish (net/ipv4/ip_input.c:234)
__netif_receive_skb_one_core (net/core/dev.c:5529)
process_backlog (./include/linux/rcupdate.h:779)
__napi_poll (net/core/dev.c:6533)
net_rx_action (net/core/dev.c:6604)
__do_softirq (./arch/x86/include/asm/jump_label.h:27)
do_softirq (kernel/softirq.c:454 kernel/softirq.c:441)
</IRQ>
<TASK>
__local_bh_enable_ip (kernel/softirq.c:381)
__dev_queue_xmit (net/core/dev.c:4374)
ip_finish_output2 (./include/net/neighbour.h:540 net/ipv4/ip_output.c:235)
__ip_queue_xmit (net/ipv4/ip_output.c:535)
__tcp_transmit_skb (net/ipv4/tcp_output.c:1462)
tcp_rcv_synsent_state_process (net/ipv4/tcp_input.c:6469)
tcp_rcv_state_process (net/ipv4/tcp_input.c:6657)
tcp_v4_do_rcv (net/ipv4/tcp_ipv4.c:1929)
__release_sock (./include/net/sock.h:1121 net/core/sock.c:2968)
release_sock (net/core/sock.c:3536)
inet_wait_for_connect (net/ipv4/af_inet.c:609)
__inet_stream_connect (net/ipv4/af_inet.c:702)
inet_stream_connect (net/ipv4/af_inet.c:748)
__sys_connect (./include/linux/file.h:45 net/socket.c:2064)
__x64_sys_connect (net/socket.c:2073 net/socket.c:2070 net/socket.c:2070)
do_syscall_64 (arch/x86/entry/common.c:51 arch/x86/entry/common.c:82)
entry_SYSCALL_64_after_hwframe (arch/x86/entry/entry_64.S:129)
RIP: 0033:0x7fa10ff05a3d
Code: 5b 41 5c c3 66 0f 1f 84 00 00 00 00 00 f3 0f 1e fa 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89
c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ab a3 0e 00 f7 d8 64 89 01 48
RSP: 002b:00007fa110183de8 EFLAGS: 00000202 ORIG_RAX: 000000000000002a
RAX: ffffffffffffffda RBX: 0000000020000054 RCX: 00007fa10ff05a3d
RDX: 000000000000001c RSI: 0000000020000040 RDI: 0000000000000003
RBP: 00007fa110183e20 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000202 R12: 00007fa110184640
R13: 0000000000000000 R14: 00007fa10fe8b060 R15: 00007fff73e23b20
</TASK>
The issue triggering process is analyzed as follows:
Thread A Thread B
tcp_v4_rcv //receive ack TCP packet inet_shutdown
tcp_check_req tcp_disconnect //disconnect sock
... tcp_set_state(sk, TCP_CLOSE)
inet_csk_complete_hashdance ...
inet_csk_reqsk_queue_add inet_listen //start listen
spin_lock(&queue->rskq_lock) inet_csk_listen_start
... reqsk_queue_alloc
... spin_lock_init
spin_unlock(&queue->rskq_lock) //warning
When the socket receives the ACK packet during the three-way handshake,
it will hold spinlock. And then the user actively shutdowns the socket
and listens to the socket immediately, the spinlock will be initialized.
When the socket is going to release the spinlock, a warning is generated.
Also the same issue to fastopenq.lock.
Move init spinlock to inet_create and inet_accept to make sure init the
accept_queue's spinlocks once.
Fixes: fff1f3001cc5 ("tcp: add a spinlock to protect struct request_sock_queue")
Fixes: 168a8f58059a ("tcp: TCP Fast Open Server - main code path")
Reported-by: Ming Shu <sming56@aliyun.com>
Signed-off-by: Zhengchao Shao <shaozhengchao@huawei.com>
Reviewed-by: Eric Dumazet <edumazet@google.com>
Link: https://lore.kernel.org/r/20240118012019.1751966-1-shaozhengchao@huawei.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When tests are run by runner.sh, bond_options.sh gets killed before
it can complete:
make -C tools/testing/selftests run_tests TARGETS="drivers/net/bonding"
[...]
# timeout set to 120
# selftests: drivers/net/bonding: bond_options.sh
# TEST: prio (active-backup miimon primary_reselect 0) [ OK ]
# TEST: prio (active-backup miimon primary_reselect 1) [ OK ]
# TEST: prio (active-backup miimon primary_reselect 2) [ OK ]
# TEST: prio (active-backup arp_ip_target primary_reselect 0) [ OK ]
# TEST: prio (active-backup arp_ip_target primary_reselect 1) [ OK ]
# TEST: prio (active-backup arp_ip_target primary_reselect 2) [ OK ]
#
not ok 7 selftests: drivers/net/bonding: bond_options.sh # TIMEOUT 120 seconds
This test includes many sleep statements, at least some of which are
related to timers in the operation of the bonding driver itself. Increase
the test timeout to allow the test to complete.
I ran the test in slightly different VMs (including one without HW
virtualization support) and got runtimes of 13m39.760s, 13m31.238s, and
13m2.956s. Use a ~1.5x "safety factor" and set the timeout to 1200s.
Fixes: 42a8d4aaea84 ("selftests: bonding: add bonding prio option test")
Reported-by: Jakub Kicinski <kuba@kernel.org>
Closes: https://lore.kernel.org/netdev/20240116104402.1203850a@kernel.org/#t
Suggested-by: Jakub Kicinski <kuba@kernel.org>
Signed-off-by: Benjamin Poirier <bpoirier@nvidia.com>
Reviewed-by: Hangbin Liu <liuhangbin@gmail.com>
Link: https://lore.kernel.org/r/20240118001233.304759-1-bpoirier@nvidia.com
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
A crash was found when dumping SMC-D connections. It can be reproduced
by following steps:
- run nginx/wrk test:
smc_run nginx
smc_run wrk -t 16 -c 1000 -d <duration> -H 'Connection: Close' <URL>
- continuously dump SMC-D connections in parallel:
watch -n 1 'smcss -D'
BUG: kernel NULL pointer dereference, address: 0000000000000030
CPU: 2 PID: 7204 Comm: smcss Kdump: loaded Tainted: G E 6.7.0+ #55
RIP: 0010:__smc_diag_dump.constprop.0+0x5e5/0x620 [smc_diag]
Call Trace:
<TASK>
? __die+0x24/0x70
? page_fault_oops+0x66/0x150
? exc_page_fault+0x69/0x140
? asm_exc_page_fault+0x26/0x30
? __smc_diag_dump.constprop.0+0x5e5/0x620 [smc_diag]
? __kmalloc_node_track_caller+0x35d/0x430
? __alloc_skb+0x77/0x170
smc_diag_dump_proto+0xd0/0xf0 [smc_diag]
smc_diag_dump+0x26/0x60 [smc_diag]
netlink_dump+0x19f/0x320
__netlink_dump_start+0x1dc/0x300
smc_diag_handler_dump+0x6a/0x80 [smc_diag]
? __pfx_smc_diag_dump+0x10/0x10 [smc_diag]
sock_diag_rcv_msg+0x121/0x140
? __pfx_sock_diag_rcv_msg+0x10/0x10
netlink_rcv_skb+0x5a/0x110
sock_diag_rcv+0x28/0x40
netlink_unicast+0x22a/0x330
netlink_sendmsg+0x1f8/0x420
__sock_sendmsg+0xb0/0xc0
____sys_sendmsg+0x24e/0x300
? copy_msghdr_from_user+0x62/0x80
___sys_sendmsg+0x7c/0xd0
? __do_fault+0x34/0x160
? do_read_fault+0x5f/0x100
? do_fault+0xb0/0x110
? __handle_mm_fault+0x2b0/0x6c0
__sys_sendmsg+0x4d/0x80
do_syscall_64+0x69/0x180
entry_SYSCALL_64_after_hwframe+0x6e/0x76
It is possible that the connection is in process of being established
when we dump it. Assumed that the connection has been registered in a
link group by smc_conn_create() but the rmb_desc has not yet been
initialized by smc_buf_create(), thus causing the illegal access to
conn->rmb_desc. So fix it by checking before dump.
Fixes: 4b1b7d3b30a6 ("net/smc: add SMC-D diag support")
Signed-off-by: Wen Gu <guwen@linux.alibaba.com>
Reviewed-by: Dust Li <dust.li@linux.alibaba.com>
Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ \
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net
Pull networking fixes from Jakub Kicinski:
"Including fixes from bpf and netfilter.
Previous releases - regressions:
- Revert "net: rtnetlink: Enslave device before bringing it up",
breaks the case inverse to the one it was trying to fix
- net: dsa: fix oob access in DSA's netdevice event handler
dereference netdev_priv() before check its a DSA port
- sched: track device in tcf_block_get/put_ext() only for clsact
binder types
- net: tls, fix WARNING in __sk_msg_free when record becomes full
during splice and MORE hint set
- sfp-bus: fix SFP mode detect from bitrate
- drv: stmmac: prevent DSA tags from breaking COE
Previous releases - always broken:
- bpf: fix no forward progress in in bpf_iter_udp if output buffer is
too small
- bpf: reject variable offset alu on registers with a type of
PTR_TO_FLOW_KEYS to prevent oob access
- netfilter: tighten input validation
- net: add more sanity check in virtio_net_hdr_to_skb()
- rxrpc: fix use of Don't Fragment flag on RESPONSE packets, avoid
infinite loop
- amt: do not use the portion of skb->cb area which may get clobbered
- mptcp: improve validation of the MPTCPOPT_MP_JOIN MCTCP option
Misc:
- spring cleanup of inactive maintainers"
* tag 'net-6.8-rc1' of git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net: (88 commits)
i40e: Include types.h to some headers
ipv6: mcast: fix data-race in ipv6_mc_down / mld_ifc_work
selftests: mlxsw: qos_pfc: Adjust the test to support 8 lanes
selftests: mlxsw: qos_pfc: Remove wrong description
mlxsw: spectrum_router: Register netdevice notifier before nexthop
mlxsw: spectrum_acl_tcam: Fix stack corruption
mlxsw: spectrum_acl_tcam: Fix NULL pointer dereference in error path
mlxsw: spectrum_acl_erp: Fix error flow of pool allocation failure
ethtool: netlink: Add missing ethnl_ops_begin/complete
selftests: bonding: Add more missing config options
selftests: netdevsim: add a config file
libbpf: warn on unexpected __arg_ctx type when rewriting BTF
selftests/bpf: add tests confirming type logic in kernel for __arg_ctx
bpf: enforce types for __arg_ctx-tagged arguments in global subprogs
bpf: extract bpf_ctx_convert_map logic and make it more reusable
libbpf: feature-detect arg:ctx tag support in kernel
ipvs: avoid stat macros calls from preemptible context
netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
netfilter: nf_tables: skip dead set elements in netlink dump
netfilter: nf_tables: do not allow mismatch field size and set key length
...
|
| |\ \
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | |
| | | | |
git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf
Pablo Neira Ayuso says:
====================
Netfilter fixes for net
The following batch contains Netfilter fixes for net. Slightly larger
than usual because this batch includes several patches to tighten the
nf_tables control plane to reject inconsistent configuration:
1) Restrict NFTA_SET_POLICY to NFT_SET_POL_PERFORMANCE and
NFT_SET_POL_MEMORY.
2) Bail out if a nf_tables expression registers more than 16 netlink
attributes which is what struct nft_expr_info allows.
3) Bail out if NFT_EXPR_STATEFUL provides no .clone interface, remove
existing fallback to memcpy() when cloning which might accidentally
duplicate memory reference to the same object.
4) Fix br_netfilter interaction with neighbour layer. This requires
three preparation patches:
- Use nf_bridge_get_physinif() in nfnetlink_log
- Use nf_bridge_info_exists() to check in br_netfilter context
is available in nf_queue.
- Pass net to nf_bridge_get_physindev()
And finally, the fix which replaces physindev with physinif
in nf_bridge_info.
Patches from Pavel Tikhomirov.
5) Catch-all deactivation happens in the transaction, hence this
oneliner to check for the next generation. This bug uncovered after
the removal of the _BUSY bit, which happened in set elements back in
summer 2023.
6) Ensure set (total) key length size and concat field length description
is consistent, otherwise bail out.
7) Skip set element with the _DEAD flag on from the netlink dump path.
A tests occasionally shows that dump is mismatching because GC might
lose race to get rid of this element while a netlink dump is in
progress.
8) Reject NFT_SET_CONCAT for field_count < 1.
9) Use IP6_INC_STATS in ipvs to fix preemption BUG splat, patch
from Fedor Pchelkin.
* tag 'nf-24-01-18' of git://git.kernel.org/pub/scm/linux/kernel/git/netfilter/nf:
ipvs: avoid stat macros calls from preemptible context
netfilter: nf_tables: reject NFT_SET_CONCAT with not field length description
netfilter: nf_tables: skip dead set elements in netlink dump
netfilter: nf_tables: do not allow mismatch field size and set key length
netfilter: nf_tables: check if catch-all set element is active in next generation
netfilter: bridge: replace physindev with physinif in nf_bridge_info
netfilter: propagate net to nf_bridge_get_physindev
netfilter: nf_queue: remove excess nf_bridge variable
netfilter: nfnetlink_log: use proper helper for fetching physinif
netfilter: nft_limit: do not ignore unsupported flags
netfilter: nf_tables: bail out if stateful expression provides no .clone
netfilter: nf_tables: validate .maxattr at expression registration
netfilter: nf_tables: reject invalid set policy
====================
Link: https://lore.kernel.org/r/20240118161726.14838-1-pablo@netfilter.org
Signed-off-by: Jakub Kicinski <kuba@kernel.org>
|