summaryrefslogtreecommitdiffstats
path: root/net/smc/smc.h (follow)
Commit message (Collapse)AuthorAgeFilesLines
* net/smc: prevent NULL pointer dereference in txopt_getJeongjun Park2024-08-301-0/+3
| | | | | | | | | | | | | | | | | | | | | Since smc_inet6_prot does not initialize ipv6_pinfo_offset, inet6_create() copies an incorrect address value, sk + 0 (offset), to inet_sk(sk)->pinet6. In addition, since inet_sk(sk)->pinet6 and smc_sk(sk)->clcsock practically point to the same address, when smc_create_clcsk() stores the newly created clcsock in smc_sk(sk)->clcsock, inet_sk(sk)->pinet6 is corrupted into clcsock. This causes NULL pointer dereference and various other memory corruptions. To solve this problem, you need to initialize ipv6_pinfo_offset, add a smc6_sock structure, and then add ipv6_pinfo as the second member of the smc_sock structure. Reported-by: syzkaller <syzkaller@googlegroups.com> Fixes: d25a92ccae6b ("net/smc: Introduce IPPROTO_SMC") Signed-off-by: Jeongjun Park <aha310510@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: expose smc proto operationsD. Wythe2024-06-171-0/+33
| | | | | | | | | | | | | | | Externalize smc proto operations (smc_xxx) to allow access from files other than af_smc.c This is in preparation for the subsequent implementation of the AF_INET version of SMC. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Niklas Schnelle <schnelle@linux.ibm.com> Tested-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: refactoring initialization of smc sockD. Wythe2024-06-171-0/+5
| | | | | | | | | | | | | | | | | This patch aims to isolate the shared components of SMC socket allocation by introducing smc_sk_init() for sock initialization and __smc_create_clcsk() for the initialization of clcsock. This is in preparation for the subsequent implementation of the AF_INET version of SMC. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Tested-by: Niklas Schnelle <schnelle@linux.ibm.com> Tested-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: change the term virtual ISM to Emulated-ISMWen Gu2024-02-081-2/+2
| | | | | | | | | | | | | | | According to latest release of SMCv2.1[1], the term 'virtual ISM' has been changed to 'Emulated-ISM' to avoid the ambiguity of the word 'virtual' in different contexts. So the names or comments in the code need be modified accordingly. [1] https://www.ibm.com/support/pages/node/7112343 Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Simon Horman <horms@kernel.org> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Link: https://lore.kernel.org/r/20240205033317.127269-1-guwen@linux.alibaba.com Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/smc: compatible with 128-bits extended GID of virtual ISM deviceWen Gu2023-12-261-3/+0
| | | | | | | | | | | According to virtual ISM support feature defined by SMCv2.1, GIDs of virtual ISM device are UUIDs defined by RFC4122, which are 128-bits long. So some adaptation work is required. And note that the GIDs of existing platform firmware ISM devices still remain 64-bits long. Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: introduce virtual ISM device support featureWen Gu2023-12-261-3/+6
| | | | | | | | This introduces virtual ISM device support feature to SMCv2.1 as the first supplemental feature. Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: support SMCv2.x supplemental features negotiationWen Gu2023-12-261-0/+4
| | | | | | | | | | | | | | | | | | This patch adds SMCv2.x supplemental features negotiation. Supported SMCv2.x supplemental features are represented by feature_mask in FCE field. The negotiation process is as follows. Server Client Proposal(features(c-mask bits)) <----------------------------------------- Accept(features(s-mask bits)) -----------------------------------------> Confirm(features(s&c-mask bits)) <----------------------------------------- Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Reviewed-and-tested-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: remove unneeded atomic operations in smc_tx_sndbuf_nonemptyLi RongQing2023-11-241-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The commit dcd2cf5f2fc0 ("net/smc: add autocorking support") adds an atomic variable tx_pushing in smc_connection to make sure only one can send to let it cork more and save CDC slot. since smc_tx_pending can be called in the soft IRQ without checking sock_owned_by_user() at that time, which would cause a race condition because bh_lock_sock() did not honor sock_lock() After commit 6b88af839d20 ("net/smc: don't send in the BH context if sock_owned_by_user"), the transmission is deferred to when sock_lock() is held by the user. Therefore, we no longer need tx_pending to hold message. So remove atomic variable tx_pushing and its operation, and smc_tx_sndbuf_nonempty becomes a wrapper of __smc_tx_sndbuf_nonempty, so rename __smc_tx_sndbuf_nonempty back to smc_tx_sndbuf_nonempty Suggested-by: Alexandra Winter <wintera@linux.ibm.com> Co-developed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: Li RongQing <lirongqing@baidu.com> Reviewed-by: Alexandra Winter <wintera@linux.ibm.com> diff v4: remove atomic variable tx_pushing diff v3: improvements in the commit body and comments diff v2: fix a typo in commit body and add net-next subject-prefix net/smc/smc.h | 1 - net/smc/smc_tx.c | 30 +----------------------------- 2 files changed, 1 insertion(+), 30 deletions(-) Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: fix dangling sock under state SMC_APPFINCLOSEWAITD. Wythe2023-11-061-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Considering scenario: smc_cdc_rx_handler __smc_release sock_set_flag smc_close_active() sock_set_flag __set_bit(DEAD) __set_bit(DONE) Dues to __set_bit is not atomic, the DEAD or DONE might be lost. if the DEAD flag lost, the state SMC_CLOSED will be never be reached in smc_close_passive_work: if (sock_flag(sk, SOCK_DEAD) && smc_close_sent_any_close(conn)) { sk->sk_state = SMC_CLOSED; } else { /* just shutdown, but not yet closed locally */ sk->sk_state = SMC_APPFINCLOSEWAIT; } Replace sock_set_flags or __set_bit to set_bit will fix this problem. Since set_bit is atomic. Fixes: b38d732477e4 ("smc: socket closing and linkgroup cleanup") Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: support smc release version negotiation in clc handshakeGuangguan Wang2023-08-191-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Support smc release version negotiation in clc handshake based on SMC v2, where no negotiation process for different releases, but for different versions. The latest smc release version was updated to v2.1. And currently there are two release versions of SMCv2, v2.0 and v2.1. In the release version negotiation, client sends the preferred release version by CLC Proposal Message, server makes decision for which release version to use based on the client preferred release version and self-supported release version (here choose the minimum release version of the client preferred and server latest supported), then the decision returns to client by CLC Accept Message. Client confirms the decision by CLC Confirm Message. Client Server Proposal(preferred release version) ------------------------------------> Accept(accpeted release version) min(client preferred, server latest supported) <------------------------------------ Confirm(accpeted release version) ------------------------------------> Signed-off-by: Guangguan Wang <guangguan.wang@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Reviewed-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Fix setsockopt and sysctl to specify same buffer size againGerd Bayer2023-08-091-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") introduced the net.smc.rmem and net.smc.wmem sysctls to specify the size of buffers to be used for SMC type connections. This created a regression for users that specified the buffer size via setsockopt() as the effective buffer size was now doubled. Re-introduce the division by 2 in the SMC buffer create code and level this out by duplicating the net.smc.[rw]mem values used for initializing sk_rcvbuf/sk_sndbuf at socket creation time. This gives users of both methods (setsockopt or sysctl) the effective buffer size that they expect. Initialize net.smc.[rw]mem from its own constant of 64kB, respectively. Internal performance tests show that this value is a good compromise between throughput/latency and memory consumption. Also, this decouples it from any tuning that was done to net.ipv4.tcp_[rw]mem[1] before the module for SMC protocol was loaded. Check that no more than INT_MAX / 2 is assigned to net.smc.[rw]mem, in order to avoid any overflow condition when that is doubled for use in sk_sndbuf or sk_rcvbuf. While at it, drop the confusing sk_buf_size variable from __smc_buf_create and name "compressed" buffer size variables more consistently. Background: Before the commit mentioned above, SMC's buffer allocator in __smc_buf_create() always used half of the sockets' sk_rcvbuf/sk_sndbuf value as initial value to search for appropriate buffers. If the search resorted to using a bigger buffer when all buffers of the specified size were busy, the duplicate of the used effective buffer size is stored back to sk_rcvbuf/sk_sndbuf. When available, buffers of exactly the size that a user had specified as input to setsockopt() were used, despite setsockopt()'s documentation in "man 7 socket" talking of a mandatory duplication: [...] SO_SNDBUF Sets or gets the maximum socket send buffer in bytes. The kernel doubles this value (to allow space for book‐ keeping overhead) when it is set using setsockopt(2), and this doubled value is returned by getsockopt(2). The default value is set by the /proc/sys/net/core/wmem_default file and the maximum allowed value is set by the /proc/sys/net/core/wmem_max file. The minimum (doubled) value for this option is 2048. [...] Fixes: 0227f058aa29 ("net/smc: Unbind r/w buffer size from clcsock and make them tunable") Co-developed-by: Jan Karcher <jaka@linux.ibm.com> Signed-off-by: Jan Karcher <jaka@linux.ibm.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: Gerd Bayer <gbayer@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* smc: preserve const qualifier in smc_sk()Eric Dumazet2023-03-181-4/+1
| | | | | | | | | | | | | We can change smc_sk() to propagate its argument const qualifier, thanks to container_of_const(). Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Karsten Graul <kgraul@linux.ibm.com> Cc: Wenjia Zhang <wenjia@linux.ibm.com> Cc: Jan Karcher <jaka@linux.ibm.com> Reviewed-by: Simon Horman <simon.horman@corigine.com> Reviewed-by: Wenjia Zhang <wenjia@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Only save the original clcsock callback functionsWen Gu2022-04-251-0/+29
| | | | | | | | | | | | | | Both listen and fallback process will save the current clcsock callback functions and establish new ones. But if both of them happen, the saved callback functions will be overwritten. So this patch introduces some helpers to ensure that only save the original callback functions of clcsock. Fixes: 341adeec9ada ("net/smc: Forward wakeup to smc socket waitqueue after fallback") Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <kuba@kernel.org>
* net/smc: don't send in the BH context if sock_owned_by_userDust Li2022-03-011-0/+4
| | | | | | | | | | | | | | | | | | | | | | | Send data all the way down to the RDMA device is a time consuming operation(get a new slot, maybe do RDMA Write and send a CDC, etc). Moving those operations from BH to user context is good for performance. If the sock_lock is hold by user, we don't try to send data out in the BH context, but just mark we should send. Since the user will release the sock_lock soon, we can do the sending there. Add smc_release_cb() which will be called in release_sock() and try send in the callback if needed. This patch moves the sending part out from BH if sock lock is hold by user. In my testing environment, this saves about 20% softirq in the qperf 4K tcp_bw test in the sender side with no noticeable throughput drop. Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: add autocorking supportDust Li2022-03-011-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch adds autocorking support for SMC which could improve throughput for small message by x3+. The main idea is borrowed from TCP autocorking with some RDMA specific modification: 1. The first message should never cork to make sure we won't bring extra latency 2. If we have posted any Tx WRs to the NIC that have not completed, cork the new messages until: a) Receive CQE for the last Tx WR b) We have corked enough message on the connection 3. Try to push the corked data out when we receive CQE of the last Tx WR to prevent the corked messages hang in the send queue. Both SMC autocorking and TCP autocorking check the TX completion to decide whether we should cork or not. The difference is when we got a SMC Tx WR completion, the data have been confirmed by the RNIC while TCP TX completion just tells us the data have been sent out by the local NIC. Add an atomic variable tx_pushing in smc_connection to make sure only one can send to let it cork more and save CDC slot. SMC autocorking should not bring extra latency since the first message will always been sent out immediately. The qperf tcp_bw test shows more than x4 increase under small message size with Mellanox connectX4-Lx, same result with other throughput benchmarks like sockperf/netperf. The qperf tcp_lat test shows SMC autocorking has not increase any ping-pong latency. Test command: client: smc_run taskset -c 1 qperf smc-server -oo msg_size:1:64K:*2 \ -t 30 -vu tcp_{bw|lat} server: smc_run taskset -c 1 qperf === Bandwidth ==== MsgSize(Bytes) SMC-NoCork TCP SMC-AutoCorking 1 0.578 MB/s 2.392 MB/s(313.57%) 2.647 MB/s(357.72%) 2 1.159 MB/s 4.780 MB/s(312.53%) 5.153 MB/s(344.71%) 4 2.283 MB/s 10.266 MB/s(349.77%) 10.363 MB/s(354.02%) 8 4.668 MB/s 19.040 MB/s(307.86%) 21.215 MB/s(354.45%) 16 9.147 MB/s 38.904 MB/s(325.31%) 41.740 MB/s(356.32%) 32 18.369 MB/s 79.587 MB/s(333.25%) 82.392 MB/s(348.52%) 64 36.562 MB/s 148.668 MB/s(306.61%) 161.564 MB/s(341.89%) 128 72.961 MB/s 274.913 MB/s(276.80%) 325.363 MB/s(345.94%) 256 144.705 MB/s 512.059 MB/s(253.86%) 633.743 MB/s(337.96%) 512 288.873 MB/s 884.977 MB/s(206.35%) 1250.681 MB/s(332.95%) 1024 574.180 MB/s 1337.736 MB/s(132.98%) 2246.121 MB/s(291.19%) 2048 1095.192 MB/s 1865.952 MB/s( 70.38%) 2057.767 MB/s( 87.89%) 4096 2066.157 MB/s 2380.337 MB/s( 15.21%) 2173.983 MB/s( 5.22%) 8192 3717.198 MB/s 2733.073 MB/s(-26.47%) 3491.223 MB/s( -6.08%) 16384 4742.221 MB/s 2958.693 MB/s(-37.61%) 4637.692 MB/s( -2.20%) 32768 5349.550 MB/s 3061.285 MB/s(-42.77%) 5385.796 MB/s( 0.68%) 65536 5162.919 MB/s 3731.408 MB/s(-27.73%) 5223.890 MB/s( 1.18%) ==== Latency ==== MsgSize(Bytes) SMC-NoCork TCP SMC-AutoCorking 1 10.540 us 11.938 us( 13.26%) 10.573 us( 0.31%) 2 10.996 us 11.992 us( 9.06%) 10.269 us( -6.61%) 4 10.229 us 11.687 us( 14.25%) 10.240 us( 0.11%) 8 10.203 us 11.653 us( 14.21%) 10.402 us( 1.95%) 16 10.530 us 11.313 us( 7.44%) 10.599 us( 0.66%) 32 10.241 us 11.586 us( 13.13%) 10.223 us( -0.18%) 64 10.693 us 11.652 us( 8.97%) 10.251 us( -4.13%) 128 10.597 us 11.579 us( 9.27%) 10.494 us( -0.97%) 256 10.409 us 11.957 us( 14.87%) 10.710 us( 2.89%) 512 11.088 us 12.505 us( 12.78%) 10.547 us( -4.88%) 1024 11.240 us 12.255 us( 9.03%) 10.787 us( -4.03%) 2048 11.485 us 16.970 us( 47.76%) 11.256 us( -1.99%) 4096 12.077 us 13.948 us( 15.49%) 12.230 us( 1.27%) 8192 13.683 us 16.693 us( 22.00%) 13.786 us( 0.75%) 16384 16.470 us 23.615 us( 43.38%) 16.459 us( -0.07%) 32768 22.540 us 40.966 us( 81.75%) 23.284 us( 3.30%) 65536 34.192 us 73.003 us(113.51%) 34.233 us( 0.12%) With SMC autocorking support, we can archive better throughput than TCP in most message sizes without any latency trade-off. Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Add global configure for handshake limitation by netlinkD. Wythe2022-02-111-0/+6
| | | | | | | | | | | | | | Although we can control SMC handshake limitation through socket options, which means that applications who need it must modify their code. It's quite troublesome for many existing applications. This patch modifies the global default value of SMC handshake limitation through netlink, providing a way to put constraint on handshake without modifies any code for applications. Suggested-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Reviewed-by: Tony Lu <tonylu@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Dynamic control handshake limitation by socket optionsD. Wythe2022-02-111-0/+1
| | | | | | | | | | | | | | | | | | | | | This patch aims to add dynamic control for SMC handshake limitation for every smc sockets, in production environment, it is possible for the same applications to handle different service types, and may have different opinion on SMC handshake limitation. This patch try socket options to complete it, since we don't have socket option level for SMC yet, which requires us to implement it at the same time. This patch does the following: - add new socket option level: SOL_SMC. - add new SMC socket option: SMC_LIMIT_HS. - provide getter/setter for SMC socket options. Link: https://lore.kernel.org/all/20f504f961e1a803f85d64229ad84260434203bd.1644323503.git.alibuda@linux.alibaba.com/ Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Limit backlog connectionsD. Wythe2022-02-111-1/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Current implementation does not handling backlog semantics, one potential risk is that server will be flooded by infinite amount connections, even if client was SMC-incapable. This patch works to put a limit on backlog connections, referring to the TCP implementation, we divides SMC connections into two categories: 1. Half SMC connection, which includes all TCP established while SMC not connections. 2. Full SMC connection, which includes all SMC established connections. For half SMC connection, since all half SMC connections starts with TCP established, we can achieve our goal by put a limit before TCP established. Refer to the implementation of TCP, this limits will based on not only the half SMC connections but also the full connections, which is also a constraint on full SMC connections. For full SMC connections, although we know exactly where it starts, it's quite hard to put a limit before it. The easiest way is to block wait before receive SMC confirm CLC message, while it's under protection by smc_server_lgr_pending, a global lock, which leads this limit to the entire host instead of a single listen socket. Another way is to drop the full connections, but considering the cast of SMC connections, we prefer to keep full SMC connections. Even so, the limits of full SMC connections still exists, see commits about half SMC connection below. After this patch, the limits of backend connection shows like: For SMC: 1. Client with SMC-capability can makes 2 * backlog full SMC connections or 1 * backlog half SMC connections and 1 * backlog full SMC connections at most. 2. Client without SMC-capability can only makes 1 * backlog half TCP connections and 1 * backlog full TCP connections. Signed-off-by: D. Wythe <alibuda@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Forward wakeup to smc socket waitqueue after fallbackWen Gu2022-01-311-1/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When we replace TCP with SMC and a fallback occurs, there may be some socket waitqueue entries remaining in smc socket->wq, such as eppoll_entries inserted by userspace applications. After the fallback, data flows over TCP/IP and only clcsocket->wq will be woken up. Applications can't be notified by the entries which were inserted in smc socket->wq before fallback. So we need a mechanism to wake up smc socket->wq at the same time if some entries remaining in it. The current workaround is to transfer the entries from smc socket->wq to clcsock->wq during the fallback. But this may cause a crash like this: general protection fault, probably for non-canonical address 0xdead000000000100: 0000 [#1] PREEMPT SMP PTI CPU: 3 PID: 0 Comm: swapper/3 Kdump: loaded Tainted: G E 5.16.0+ #107 RIP: 0010:__wake_up_common+0x65/0x170 Call Trace: <IRQ> __wake_up_common_lock+0x7a/0xc0 sock_def_readable+0x3c/0x70 tcp_data_queue+0x4a7/0xc40 tcp_rcv_established+0x32f/0x660 ? sk_filter_trim_cap+0xcb/0x2e0 tcp_v4_do_rcv+0x10b/0x260 tcp_v4_rcv+0xd2a/0xde0 ip_protocol_deliver_rcu+0x3b/0x1d0 ip_local_deliver_finish+0x54/0x60 ip_local_deliver+0x6a/0x110 ? tcp_v4_early_demux+0xa2/0x140 ? tcp_v4_early_demux+0x10d/0x140 ip_sublist_rcv_finish+0x49/0x60 ip_sublist_rcv+0x19d/0x230 ip_list_rcv+0x13e/0x170 __netif_receive_skb_list_core+0x1c2/0x240 netif_receive_skb_list_internal+0x1e6/0x320 napi_complete_done+0x11d/0x190 mlx5e_napi_poll+0x163/0x6b0 [mlx5_core] __napi_poll+0x3c/0x1b0 net_rx_action+0x27c/0x300 __do_softirq+0x114/0x2d2 irq_exit_rcu+0xb4/0xe0 common_interrupt+0xba/0xe0 </IRQ> <TASK> The crash is caused by privately transferring waitqueue entries from smc socket->wq to clcsock->wq. The owners of these entries, such as epoll, have no idea that the entries have been transferred to a different socket wait queue and still use original waitqueue spinlock (smc socket->wq.wait.lock) to make the entries operation exclusive, but it doesn't work. The operations to the entries, such as removing from the waitqueue (now is clcsock->wq after fallback), may cause a crash when clcsock waitqueue is being iterated over at the moment. This patch tries to fix this by no longer transferring wait queue entries privately, but introducing own implementations of clcsock's callback functions in fallback situation. The callback functions will forward the wakeup to smc socket->wq if clcsock->wq is actually woken up and smc socket->wq has remaining entries. Fixes: 2153bd1e3d3d ("net/smc: Transfer remaining wait queue entries during fallback") Suggested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Acked-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: Resolve the race between link group access and terminationWen Gu2022-01-131-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We encountered some crashes caused by the race between the access and the termination of link groups. Here are some of panic stacks we met: 1) Race between smc_clc_wait_msg() and __smc_lgr_terminate() BUG: kernel NULL pointer dereference, address: 00000000000002f0 Workqueue: smc_hs_wq smc_listen_work [smc] RIP: 0010:smc_clc_wait_msg+0x3eb/0x5c0 [smc] Call Trace: <TASK> ? smc_clc_send_accept+0x45/0xa0 [smc] ? smc_clc_send_accept+0x45/0xa0 [smc] smc_listen_work+0x783/0x1220 [smc] ? finish_task_switch+0xc4/0x2e0 ? process_one_work+0x1ad/0x3c0 process_one_work+0x1ad/0x3c0 worker_thread+0x4c/0x390 ? rescuer_thread+0x320/0x320 kthread+0x149/0x190 ? set_kthread_struct+0x40/0x40 ret_from_fork+0x1f/0x30 </TASK> smc_listen_work() abnormal case like port error --------------------------------------------------------------- | __smc_lgr_terminate() | |- smc_conn_kill() | |- smc_lgr_unregister_conn() | |- set conn->lgr = NULL smc_clc_wait_msg() | |- access conn->lgr (panic) | 2) Race between smc_setsockopt() and __smc_lgr_terminate() BUG: kernel NULL pointer dereference, address: 00000000000002e8 RIP: 0010:smc_setsockopt+0x17a/0x280 [smc] Call Trace: <TASK> __sys_setsockopt+0xfc/0x190 __x64_sys_setsockopt+0x20/0x30 do_syscall_64+0x34/0x90 entry_SYSCALL_64_after_hwframe+0x44/0xae </TASK> smc_setsockopt() abnormal case like port error -------------------------------------------------------------- | __smc_lgr_terminate() | |- smc_conn_kill() | |- smc_lgr_unregister_conn() | |- set conn->lgr = NULL mod_delayed_work() | |- access conn->lgr (panic) | There are some other panic places and they are caused by the similar reason as described above, which is accessing link group after termination, thus getting a NULL pointer or invalid resource. Currently, there seems to be no synchronization between the link group access and a sudden termination of it. This patch tries to fix this by introducing reference count of link group and not freeing link group until reference count is zero. Link group might be referred to by links or smc connections. So the operation to the link group reference count can be concluded as follows: object [hold or initialized as 1] [put] ------------------------------------------------------------------- link group smc_lgr_create() smc_lgr_free() connections smc_conn_create() smc_conn_free() links smcr_link_init() smcr_link_clear() Througth this way, we extend the life cycle of link group and ensure it is longer than the life cycle of connections and links above it, so that avoid invalid access to link group after its termination. Signed-off-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: fix kernel panic caused by race of smc_sockDust Li2021-12-281-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A crash occurs when smc_cdc_tx_handler() tries to access smc_sock but smc_release() has already freed it. [ 4570.695099] BUG: unable to handle page fault for address: 000000002eae9e88 [ 4570.696048] #PF: supervisor write access in kernel mode [ 4570.696728] #PF: error_code(0x0002) - not-present page [ 4570.697401] PGD 0 P4D 0 [ 4570.697716] Oops: 0002 [#1] PREEMPT SMP NOPTI [ 4570.698228] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 5.16.0-rc4+ #111 [ 4570.699013] Hardware name: Alibaba Cloud Alibaba Cloud ECS, BIOS 8c24b4c 04/0 [ 4570.699933] RIP: 0010:_raw_spin_lock+0x1a/0x30 <...> [ 4570.711446] Call Trace: [ 4570.711746] <IRQ> [ 4570.711992] smc_cdc_tx_handler+0x41/0xc0 [ 4570.712470] smc_wr_tx_tasklet_fn+0x213/0x560 [ 4570.712981] ? smc_cdc_tx_dismisser+0x10/0x10 [ 4570.713489] tasklet_action_common.isra.17+0x66/0x140 [ 4570.714083] __do_softirq+0x123/0x2f4 [ 4570.714521] irq_exit_rcu+0xc4/0xf0 [ 4570.714934] common_interrupt+0xba/0xe0 Though smc_cdc_tx_handler() checked the existence of smc connection, smc_release() may have already dismissed and released the smc socket before smc_cdc_tx_handler() further visits it. smc_cdc_tx_handler() |smc_release() if (!conn) | | |smc_cdc_tx_dismiss_slots() | smc_cdc_tx_dismisser() | |sock_put(&smc->sk) <- last sock_put, | smc_sock freed bh_lock_sock(&smc->sk) (panic) | To make sure we won't receive any CDC messages after we free the smc_sock, add a refcount on the smc_connection for inflight CDC message(posted to the QP but haven't received related CQE), and don't release the smc_connection until all the inflight CDC messages haven been done, for both success or failed ones. Using refcount on CDC messages brings another problem: when the link is going to be destroyed, smcr_link_clear() will reset the QP, which then remove all the pending CQEs related to the QP in the CQ. To make sure all the CQEs will always come back so the refcount on the smc_connection can always reach 0, smc_ib_modify_qp_reset() was replaced by smc_ib_modify_qp_error(). And remove the timeout in smc_wr_tx_wait_no_pending_sends() since we need to wait for all pending WQEs done, or we may encounter use-after- free when handling CQEs. For IB device removal routine, we need to wait for all the QPs on that device been destroyed before we can destroy CQs on the device, or the refcount on smc_connection won't reach 0 and smc_sock cannot be released. Fixes: 5f08318f617b ("smc: connection data control (CDC)") Reported-by: Wen Gu <guwen@linux.alibaba.com> Signed-off-by: Dust Li <dust.li@linux.alibaba.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: extend LLC layer for SMC-Rv2Karsten Graul2021-10-161-1/+19
| | | | | | | | | | | Add support for large v2 LLC control messages in smc_llc.c. The new large work request buffer allows to combine control messages into one packet that had to be spread over several packets before. Add handling of the new v2 LLC messages. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: add support for user defined EIDsKarsten Graul2021-09-141-3/+0
| | | | | | | | | | | | SMC-Dv2 allows users to define EIDs which allows to create separate name spaces enabling users to cluster their SMC-Dv2 connections. Add support for user defined EIDs and extent the generic netlink interface so users can add, remove and dump EIDs. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: introduce CLC first contact extensionUrsula Braun2020-09-291-0/+1
| | | | | | | | | | SMC Version 2 defines a first contact extension for CLC accept and CLC confirm. This patch covers sending and receiving of the CLC first contact extension. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: build and send V2 CLC proposalUrsula Braun2020-09-291-0/+6
| | | | | | | | | | The new format of an SMCD V2 CLC proposal is introduced, and building and checking of SMCD V2 CLC proposals is adapted accordingly. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: determine proposed ISM devicesUrsula Braun2020-09-291-0/+1
| | | | | | | | | | | | SMCD Version 2 allows to propose up to 8 additional ISM devices offered to the peer as candidates for SMCD communication. This patch covers determination of the ISM devices to be proposed. ISM devices without PNETID are preferred, since ISM devices with PNETID are a V1 leftover and will disappear over the time. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: prepare for more proposed ISM devicesUrsula Braun2020-09-291-0/+4
| | | | | | | | | | | SMCD Version 2 allows proposing of up to 8 ISM devices in addition to the native ISM device of SMCD Version 1. This patch prepares the struct smc_init_info to deal with these additional 8 ISM devices. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: use separate work queues for different worker typesKarsten Graul2020-09-111-0/+3
| | | | | | | | | | | | | | | | | | | There are 6 types of workers which exist per smc connection. 3 of them are used for listen and handshake processing, another 2 are used for close and abort processing and 1 is the tx worker that moves calls to sleeping functions into a worker. To prevent flooding of the system work queue when many connections are opened or closed at the same time (some pattern uperf implements), move those workers to one of 3 smc-specific work queues. Two work queues are module-global and used for handshake and close workers. The third work queue is defined per link group and used by the tx workers that may sleep waiting for resources of this link group. And in smc_llc_enqueue() queue the llc_event_work work to the system prio work queue because its critical that this work is started fast. Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: introduce better field namesUrsula Braun2020-09-111-0/+2
| | | | | | | | | | | | | Field names "srv_first_contact" and "cln_first_contact" are misleading, since they apply to both, server and client. Rename them to "first_contact_peer" and "first_contact_local". Rename "ism_gid" by the more precise name "ism_peer_gid". Rename version constant "SMC_CLC_V1" into "SMC_V1". No functional change. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: reduce active tcp_listen workersUrsula Braun2020-09-111-0/+2
| | | | | | | | | | | | | | | SMC starts a separate tcp_listen worker for every SMC socket in state SMC_LISTEN, and can accept an incoming connection request only, if this worker is really running and waiting in kernel_accept(). But the number of running workers is limited. This patch reworks the listening SMC code and starts a tcp_listen worker after the SYN-ACK handshake on the internal clc-socket only. Suggested-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Reviewed-by: Guvenc Gulce <guvenc@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: handle incoming CDC validation messageKarsten Graul2020-05-041-0/+2
| | | | | | | | | | | Call smc_cdc_msg_validate() when a CDC message with the failover validation bit enabled was received. Validate that the sequence number sent with the message is one we already have received. If not, messages were lost and the connection is terminated using a new abort_work. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: save state of last sent CDC messageKarsten Graul2020-05-041-0/+4
| | | | | | | | | | | | | When a link goes down and all connections of this link need to be switched to an other link then the producer cursor and the sequence of the last successfully sent CDC message must be known. Add the two fields to the SMC connection and update it in the tx completion handler. And to allow matching of sequences in error cases reset the seqno to the old value in smc_cdc_msg_send() when the actual send failed. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: convert static link ID to dynamic referencesKarsten Graul2020-04-291-0/+1
| | | | | | | | | As a preparation for the support of multiple links remove the usage of a static link id (SMC_SINGLE_LINK) and allow dynamic link ids. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Reviewed-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: cancel send and receive for terminated socketUrsula Braun2019-10-221-0/+1
| | | | | | | | | | | The resources for a terminated socket are being cleaned up. This patch makes sure * no more data is received for an actively terminated socket * no more data is sent for an actively or passively terminated socket Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Jakub Kicinski <jakub.kicinski@netronome.com>
* net/smc: nonblocking connect reworkUrsula Braun2019-04-121-7/+4
| | | | | | | | | | For nonblocking sockets move the kernel_connect() from the connect worker into the initial smc_connect part to return kernel_connect() errors other than -EINPROGRESS to user space. Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: fix smc_poll in SMC_INIT stateUrsula Braun2019-02-211-3/+3
| | | | | | | | | | | | | | smc_poll() returns with mask bit EPOLLPRI if the connection urg_state is SMC_URG_VALID. Since SMC_URG_VALID is zero, smc_poll signals EPOLLPRI errorneously if called in state SMC_INIT before the connection is created, for instance in a non-blocking connect scenario. This patch switches to non-zero values for the urg states. Reviewed-by: Karsten Graul <kgraul@linux.ibm.com> Fixes: de8474eb9d50 ("net/smc: urgent data support") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: fix TCP fallback socket releaseMyungho Jung2018-12-191-0/+4
| | | | | | | | | | | | | clcsock can be released while kernel_accept() references it in TCP listen worker. Also, clcsock needs to wake up before released if TCP fallback is used and the clcsock is blocked by accept. Add a lock to safely release clcsock and call kernel_sock_shutdown() to wake up clcsock from accept in smc_release(). Reported-by: syzbot+0bf2e01269f1274b4b03@syzkaller.appspotmail.com Reported-by: syzbot+e3132895630f957306bc@syzkaller.appspotmail.com Signed-off-by: Myungho Jung <mhjungk@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: provide fallback reason codeKarsten Graul2018-07-261-0/+2
| | | | | | | | | | Remember the fallback reason code and the peer diagnosis code for smc sockets, and provide them in smc_diag.c to the netlink interface. And add more detailed reason codes. Signed-off-by: Karsten Graul <kgraul@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge ra.kernel.org:/pub/scm/linux/kernel/git/davem/netDavid S. Miller2018-07-031-0/+8
|\ | | | | | | | | | | | | | | | | | | Simple overlapping changes in stmmac driver. Adjust skb_gro_flush_final_remcsum function signature to make GRO list changes in net-next, as per Stephen Rothwell's example merge resolution. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/smc: rebuild nonblocking connectUrsula Braun2018-06-281-0/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The recent poll change may lead to stalls for non-blocking connecting SMC sockets, since sock_poll_wait is no longer performed on the internal CLC socket, but on the outer SMC socket. kernel_connect() on the internal CLC socket returns with -EINPROGRESS, but the wake up logic does not work in all cases. If the internal CLC socket is still in state TCP_SYN_SENT when polled, sock_poll_wait() from sock_poll() does not sleep. It is supposed to sleep till the state of the internal CLC socket switches to TCP_ESTABLISHED. This problem triggered a redesign of the SMC nonblocking connect logic. This patch introduces a connect worker covering all connect steps followed by a wake up of socket waiters. It allows to get rid of all delays and locks in smc_poll(). Fixes: c0129a061442 ("smc: convert to ->poll_mask") Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: add SMC-D support in data transferHans Wippel2018-06-301-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The data transfer and CDC message headers differ in SMC-R and SMC-D. This patch adds support for the SMC-D data transfer to the existing SMC code. It consists of the following: * SMC-D CDC support * SMC-D tx support * SMC-D rx support The CDC header is stored at the beginning of the receive buffer. Thus, a rx_offset variable is added for the CDC header offset within the buffer (0 for SMC-R). Signed-off-by: Hans Wippel <hwippel@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Suggested-by: Thomas Richter <tmricht@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net/smc: determine port attributes independent from pnet tableUrsula Braun2018-06-301-2/+0
|/ | | | | | | | | | | | | | | | | For SMC it is important to know the current port state of RoCE devices. Monitoring port states has been triggered, when a RoCE device was added to the pnet table. To support future alternatives to the pnet table the monitoring of ports is made independent of the existence of a pnet table. It starts once the smc_ib_device is established. Due to this change smc_ib_remember_port_attr() is now a local function and shuffling its location and the location of its used functions makes any forward references obsolete. And the duplicate SMC_MAX_PORTS definition is removed. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: urgent data supportStefan Raspl2018-05-231-0/+15
| | | | | | | | Add support for out of band data send and receive. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: move smc_core specific code from smc.h to smc_coreHans Wippel2018-05-181-41/+0
| | | | | | | | | SMC connection and buffer handling belong to smc_core. So, this patch moves this code from smc.h to smc_core. Signed-off-by: Hans Wippel <hwippel@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: calculate write offset in RMB only once per connectionHans Wippel2018-05-181-0/+1
| | | | | | | | | | Currently, the write offset within the RMB is calculated on each write operation although it is fixed for each connection. With this patch, the offset is calculated once and stored in a connection specific variable. Signed-off-by: Hans Wippel <hwippel@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: rename connection index to RMBE indexHans Wippel2018-05-181-1/+1
| | | | | | | | | The connection index is actually a RMBE index. So, this patch changes the name accordingly. Signed-off-by: Hans Wippel <hwippel@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: add common buffer size in send and receive buffer descriptorsHans Wippel2018-05-181-2/+0
| | | | | | | | | | | In addition to the buffer references, SMC currently stores the sizes of the receive and send buffers in each connection as separate variables. This patch introduces a buffer length variable in the common buffer descriptor and uses this length instead. Signed-off-by: Hans Wippel <hwippel@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* smc: add support for splice()Stefan Raspl2018-05-041-0/+3
| | | | | | | | | Provide an implementation for splice() when we are using SMC. See smc_splice_read() for further details. Signed-off-by: Stefan Raspl <raspl@linux.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.ibm.com>< Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: handle sockopt TCP_DEFER_ACCEPTUrsula Braun2018-04-271-0/+4
| | | | | | | | If sockopt TCP_DEFER_ACCEPT is set, the accept is delayed till data is available. Signed-off-by: Ursula Braun <ubraun@linux.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net/smc: enable ipv6 support for smcKarsten Graul2018-03-161-1/+3
| | | | | | | | | | | | | Add ipv6 support to the smc socket layer functions. Make use of the updated clc layer functions to retrieve and match ipv6 information. The indicator for ipv4 or ipv6 is the protocol constant that is provided in the socket() call with address family AF_SMC. Based-on-patch-by: Takanori Ueda <tkueda@jp.ibm.com> Signed-off-by: Karsten Graul <kgraul@linux.vnet.ibm.com> Signed-off-by: Ursula Braun <ubraun@linux.vnet.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>