summaryrefslogtreecommitdiffstats
path: root/net/mptcp (follow)
Commit message (Collapse)AuthorAgeFilesLines
* mptcp: avoid data corruption on reinsertPaolo Abeni2020-07-231-1/+6
| | | | | | | | | | | | | | When updating a partially acked data fragment, we actually corrupt it. This is irrelevant till we send data on a single subflow, as retransmitted data, if any are discarded by the peer as duplicate, but it will cause data corruption as soon as we will start creating non backup subflows. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* subflow: always init 'rel_write_seq'Paolo Abeni2020-07-232-1/+1
| | | | | | | | | | | Currently we do not init the subflow write sequence for MP_JOIN subflows. This will cause bad mapping being generated as soon as we will use non backup subflow. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mptcp: zero token hash at creation time.Paolo Abeni2020-07-231-1/+1
| | | | | | | | | | | | Otherwise the 'chain_len' filed will carry random values, some token creation calls will fail due to excessive chain length, causing unexpected fallback to TCP. Fixes: 2c5ebd001d4f ("mptcp: refactor token container") Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Tested-by: Christoph Paasch <cpaasch@apple.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mptcp: move helper to where its usedFlorian Westphal2020-07-222-11/+12
| | | | | | | Only used in token.c. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: remove compat_sock_common_{get,set}sockoptChristoph Hellwig2020-07-201-6/+0
| | | | | | | | | | | Add the compat handling to sock_common_{get,set}sockopt instead, keyed of in_compat_syscall(). This allow to remove the now unused ->compat_{get,set}sockopt methods from struct proto_ops. Signed-off-by: Christoph Hellwig <hch@lst.de> Acked-by: Matthieu Baerts <matthieu.baerts@tessares.net> Acked-by: Stefan Schmidt <stefan@datenfreihafen.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* mptcp: silence warning in subflow_data_ready()Davide Caratti2020-07-171-2/+3
| | | | | | | | | | | | | | | | | | | | since commit d47a72152097 ("mptcp: fix race in subflow_data_ready()"), it is possible to observe a regression in MP_JOIN kselftests. For sockets in TCP_CLOSE state, it's not sufficient to just wake up the main socket: we also need to ensure that received data are made available to the reader. Silence the WARN_ON_ONCE() in these cases: it preserves the syzkaller fix and restores kselftests when they are ran as follows: # while true; do > make KBUILD_OUTPUT=/tmp/kselftest TARGETS=net/mptcp kselftest > done Reported-by: Florian Westphal <fw@strlen.de> Fixes: d47a72152097 ("mptcp: fix race in subflow_data_ready()") Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/47 Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-07-111-3/+3
|\ | | | | | | | | | | | | All conflicts seemed rather trivial, with some guidance from Saeed Mameed on the tc_ct.c one. Signed-off-by: David S. Miller <davem@davemloft.net>
| * mptcp: fix DSS map generation on fin retransmissionPaolo Abeni2020-07-081-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The RFC 8684 mandates that no-data DATA FIN packets should carry a DSS with 0 sequence number and data len equal to 1. Currently, on FIN retransmission we re-use the existing mapping; if the previous fin transmission was part of a partially acked data packet, we could end-up writing in the egress packet a non-compliant DSS. The above will be detected by a "Bad mapping" warning on the receiver side. This change addresses the issue explicitly checking for 0 len packet when adding the DATA_FIN option. Fixes: 6d0060f600ad ("mptcp: Write MPTCP DSS headers to outgoing data packets") Reported-by: syzbot+42a07faa5923cfaeb9c9@syzkaller.appspotmail.com Tested-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Christoph Paasch <cpaasch@apple.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: add MPTCP socket diag interfacePaolo Abeni2020-07-093-0/+175
| | | | | | | | | | | | | | | | | | exposes basic inet socket attribute, plus some MPTCP socket fields comprising PM status and MPTCP-level sequence numbers. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: add msk interations helperPaolo Abeni2020-07-092-1/+62
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mptcp_token_iter_next() allow traversing all the MPTCP sockets inside the token container belonging to the given network namespace with a quite standard iterator semantic. That will be used by the next patch, but keep the API generic, as we plan to use this later for PM's sake. Additionally export mptcp_token_get_sock(), as it also will be used by the diag module. Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: use mptcp worker for path managementFlorian Westphal2020-07-073-47/+27
| | | | | | | | | | | | | | | | | | | | We can re-use the existing work queue to handle path management instead of a dedicated work queue. Just move pm_worker to protocol.c, call it from the mptcp worker and get rid of the msk lock (already held). Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: fix race in subflow_data_ready()Davide Caratti2020-07-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | syzkaller was able to make the kernel reach subflow_data_ready() for a server subflow that was closed before subflow_finish_connect() completed. In these cases we can avoid using the path for regular/fallback MPTCP data, and just wake the main socket, to avoid the following warning: WARNING: CPU: 0 PID: 9370 at net/mptcp/subflow.c:885 subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885 Kernel panic - not syncing: panic_on_warn set ... CPU: 0 PID: 9370 Comm: syz-executor.0 Not tainted 5.7.0 #106 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.12.1-0-ga5cab58e9a3f-prebuilt.qemu.org 04/01/2014 Call Trace: <IRQ> __dump_stack lib/dump_stack.c:77 [inline] dump_stack+0xb7/0xfe lib/dump_stack.c:118 panic+0x29e/0x692 kernel/panic.c:221 __warn.cold+0x2f/0x3d kernel/panic.c:582 report_bug+0x28b/0x2f0 lib/bug.c:195 fixup_bug arch/x86/kernel/traps.c:105 [inline] fixup_bug arch/x86/kernel/traps.c:100 [inline] do_error_trap+0x10f/0x180 arch/x86/kernel/traps.c:197 do_invalid_op+0x32/0x40 arch/x86/kernel/traps.c:216 invalid_op+0x1e/0x30 arch/x86/entry/entry_64.S:1027 RIP: 0010:subflow_data_ready+0x1e6/0x290 net/mptcp/subflow.c:885 Code: 04 02 84 c0 74 06 0f 8e 91 00 00 00 41 0f b6 5e 48 31 ff 83 e3 18 89 de e8 37 ec 3d fe 84 db 0f 85 65 ff ff ff e8 fa ea 3d fe <0f> 0b e9 59 ff ff ff e8 ee ea 3d fe 48 89 ee 4c 89 ef e8 f3 77 ff RSP: 0018:ffff88811b2099b0 EFLAGS: 00010206 RAX: ffff888111197000 RBX: 0000000000000000 RCX: ffffffff82fbc609 RDX: 0000000000000100 RSI: ffffffff82fbc616 RDI: 0000000000000001 RBP: ffff8881111bc800 R08: ffff888111197000 R09: ffffed10222a82af R10: ffff888111541577 R11: ffffed10222a82ae R12: 1ffff11023641336 R13: ffff888111541000 R14: ffff88810fd4ca00 R15: ffff888111541570 tcp_child_process+0x754/0x920 net/ipv4/tcp_minisocks.c:841 tcp_v4_do_rcv+0x749/0x8b0 net/ipv4/tcp_ipv4.c:1642 tcp_v4_rcv+0x2666/0x2e60 net/ipv4/tcp_ipv4.c:1999 ip_protocol_deliver_rcu+0x29/0x1f0 net/ipv4/ip_input.c:204 ip_local_deliver_finish net/ipv4/ip_input.c:231 [inline] NF_HOOK include/linux/netfilter.h:421 [inline] ip_local_deliver+0x2da/0x390 net/ipv4/ip_input.c:252 dst_input include/net/dst.h:441 [inline] ip_rcv_finish net/ipv4/ip_input.c:428 [inline] ip_rcv_finish net/ipv4/ip_input.c:414 [inline] NF_HOOK include/linux/netfilter.h:421 [inline] ip_rcv+0xef/0x140 net/ipv4/ip_input.c:539 __netif_receive_skb_one_core+0x197/0x1e0 net/core/dev.c:5268 __netif_receive_skb+0x27/0x1c0 net/core/dev.c:5382 process_backlog+0x1e5/0x6d0 net/core/dev.c:6226 napi_poll net/core/dev.c:6671 [inline] net_rx_action+0x3e3/0xd70 net/core/dev.c:6739 __do_softirq+0x18c/0x634 kernel/softirq.c:292 do_softirq_own_stack+0x2a/0x40 arch/x86/entry/entry_64.S:1082 </IRQ> do_softirq.part.0+0x26/0x30 kernel/softirq.c:337 do_softirq arch/x86/include/asm/preempt.h:26 [inline] __local_bh_enable_ip+0x46/0x50 kernel/softirq.c:189 local_bh_enable include/linux/bottom_half.h:32 [inline] rcu_read_unlock_bh include/linux/rcupdate.h:723 [inline] ip_finish_output2+0x78a/0x19c0 net/ipv4/ip_output.c:229 __ip_finish_output+0x471/0x720 net/ipv4/ip_output.c:306 dst_output include/net/dst.h:435 [inline] ip_local_out+0x181/0x1e0 net/ipv4/ip_output.c:125 __ip_queue_xmit+0x7a1/0x14e0 net/ipv4/ip_output.c:530 __tcp_transmit_skb+0x19dc/0x35e0 net/ipv4/tcp_output.c:1238 __tcp_send_ack.part.0+0x3c2/0x5b0 net/ipv4/tcp_output.c:3785 __tcp_send_ack net/ipv4/tcp_output.c:3791 [inline] tcp_send_ack+0x7d/0xa0 net/ipv4/tcp_output.c:3791 tcp_rcv_synsent_state_process net/ipv4/tcp_input.c:6040 [inline] tcp_rcv_state_process+0x36a4/0x49c2 net/ipv4/tcp_input.c:6209 tcp_v4_do_rcv+0x343/0x8b0 net/ipv4/tcp_ipv4.c:1651 sk_backlog_rcv include/net/sock.h:996 [inline] __release_sock+0x1ad/0x310 net/core/sock.c:2548 release_sock+0x54/0x1a0 net/core/sock.c:3064 inet_wait_for_connect net/ipv4/af_inet.c:594 [inline] __inet_stream_connect+0x57e/0xd50 net/ipv4/af_inet.c:686 inet_stream_connect+0x53/0xa0 net/ipv4/af_inet.c:725 mptcp_stream_connect+0x171/0x5f0 net/mptcp/protocol.c:1920 __sys_connect_file net/socket.c:1854 [inline] __sys_connect+0x267/0x2f0 net/socket.c:1871 __do_sys_connect net/socket.c:1882 [inline] __se_sys_connect net/socket.c:1879 [inline] __x64_sys_connect+0x6f/0xb0 net/socket.c:1879 do_syscall_64+0xb7/0x3d0 arch/x86/entry/common.c:295 entry_SYSCALL_64_after_hwframe+0x44/0xa9 RIP: 0033:0x7fb577d06469 Code: 00 f3 c3 66 2e 0f 1f 84 00 00 00 00 00 0f 1f 40 00 48 89 f8 48 89 f7 48 89 d6 48 89 ca 4d 89 c2 4d 89 c8 4c 8b 4c 24 08 0f 05 <48> 3d 01 f0 ff ff 73 01 c3 48 8b 0d ff 49 2b 00 f7 d8 64 89 01 48 RSP: 002b:00007fb5783d5dd8 EFLAGS: 00000246 ORIG_RAX: 000000000000002a RAX: ffffffffffffffda RBX: 000000000068bfa0 RCX: 00007fb577d06469 RDX: 000000000000004d RSI: 0000000020000040 RDI: 0000000000000003 RBP: 00000000ffffffff R08: 0000000000000000 R09: 0000000000000000 R10: 0000000000000000 R11: 0000000000000246 R12: 0000000000000000 R13: 000000000041427c R14: 00007fb5783d65c0 R15: 0000000000000003 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/39 Reported-by: Christoph Paasch <cpaasch@apple.com> Fixes: e1ff9e82e2ea ("net: mptcp: improve fallback to TCP") Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: support IPV6_V6ONLY setsockoptFlorian Westphal2020-07-051-0/+30
| | | | | | | | | | | | | | | | | | | | | | | | | | Without this, Opensshd fails to open an ipv6 socket listening socket: error: setsockopt IPV6_V6ONLY: Operation not supported error: Bind to port 22 on :: failed: Address already in use. Opensshd opens an ipv4 and and ipv6 listening socket, but because IPV6_V6ONLY setsockopt fails, the port number is already in use. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: add REUSEADDR/REUSEPORT supportFlorian Westphal2020-07-051-1/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This will e.g. make 'sshd restart' work when MPTCP is used, as we will now set this option on the listener socket instead of only the mptcp socket (where it has no effect). We still need to copy the setting to the master socket so that a subsequent getsockopt() returns the expected value. Reported-by: Christoph Paasch <cpaasch@apple.com> Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: use mptcp setsockopt function for SOL_SOCKET on mptcp socketsFlorian Westphal2020-07-051-0/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | setsockopt(mptcp_fd, SOL_SOCKET, ...)... appears to work (returns 0), but it has no effect -- this is because the MPTCP layer never has a chance to copy the settings to the subflow socket. Skip the generic handling for the mptcp case and instead call the mptcp specific handler instead for SOL_SOCKET too. Next patch adds more specific handling for SOL_SOCKET to mptcp. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: add receive buffer auto-tuningFlorian Westphal2020-07-023-8/+127
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When mptcp is used, userspace doesn't read from the tcp (subflow) socket but from the parent (mptcp) socket receive queue. skbs are moved from the subflow socket to the mptcp rx queue either from 'data_ready' callback (if mptcp socket can be locked), a work queue, or the socket receive function. This means tcp_rcv_space_adjust() is never called and thus no receive buffer size auto-tuning is done. An earlier (not merged) patch added tcp_rcv_space_adjust() calls to the function that moves skbs from subflow to mptcp socket. While this enabled autotuning, it also meant tuning was done even if userspace was reading the mptcp socket very slowly. This adds mptcp_rcv_space_adjust() and calls it after userspace has read data from the mptcp socket rx queue. Its very similar to tcp_rcv_space_adjust, with two differences: 1. The rtt estimate is the largest one observed on a subflow 2. The rcvbuf size and window clamp of all subflows is adjusted to the mptcp-level rcvbuf. Otherwise, we get spurious drops at tcp (subflow) socket level if the skbs are not moved to the mptcp socket fast enough. Before: time mptcp_connect.sh -t -f $((4*1024*1024)) -d 300 -l 0.01% -r 0 -e "" -m mmap [..] ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 40823ms) [ OK ] ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 23119ms) [ OK ] ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5421ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 41446ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 23427ms) [ OK ] ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5426ms) [ OK ] Time: 1396 seconds After: ns4 MPTCP -> ns3 (10.0.3.2:10108 ) MPTCP (duration 5417ms) [ OK ] ns4 MPTCP -> ns3 (10.0.3.2:10109 ) TCP (duration 5427ms) [ OK ] ns4 TCP -> ns3 (10.0.3.2:10110 ) MPTCP (duration 5422ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10111) MPTCP (duration 5415ms) [ OK ] ns4 MPTCP -> ns3 (dead:beef:3::2:10112) TCP (duration 5422ms) [ OK ] ns4 TCP -> ns3 (dead:beef:3::2:10113) MPTCP (duration 5423ms) [ OK ] Time: 296 seconds Signed-off-by: Florian Westphal <fw@strlen.de> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: do nonce initialization at subflow creation timePaolo Abeni2020-06-301-34/+20
| | | | | | | | | | | | | | | | | | | | This clean-up the code a bit, reduces the number of used hooks and indirect call requested, and allow better error reporting from __mptcp_subflow_connect() Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: close poll() racesPaolo Abeni2020-06-301-5/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | mptcp_poll always return POLLOUT for unblocking connect(), ensure that the socket is a suitable state. The MPTCP_DATA_READY bit is never cleared on accept: ensure we don't leave mptcp_accept() with an empty accept queue and such bit set. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: __mptcp_tcp_fallback() returns a struct sockPaolo Abeni2020-06-301-12/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently __mptcp_tcp_fallback() always return NULL on incoming connections, because MPTCP does not create the additional socket for the first subflow. Since the previous commit no __mptcp_tcp_fallback() caller needs a struct socket, so let __mptcp_tcp_fallback() return the first subflow sock and cope correctly even with incoming connections. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: create first subflow at msk creation timePaolo Abeni2020-06-301-33/+20
| | | | | | | | | | | | | | | | | | This cleans the code a bit and makes the behavior more consistent. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: check for plain TCP sock at accept timePaolo Abeni2020-06-301-62/+7
| | | | | | | | | | | | | | | | | | | | This cleanup the code a bit and avoid corrupted states on weird syscall sequence (accept(), connect()). Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: fallback in case of simultaneous connectDavide Caratti2020-06-302-0/+20
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | when a MPTCP client tries to connect to itself, tcp_finish_connect() is never reached. Because of this, depending on the socket current state, multiple faulty behaviours can be observed: 1) a WARN_ON() in subflow_data_ready() is hit WARNING: CPU: 2 PID: 882 at net/mptcp/subflow.c:911 subflow_data_ready+0x18b/0x230 [...] CPU: 2 PID: 882 Comm: gh35 Not tainted 5.7.0+ #187 [...] RIP: 0010:subflow_data_ready+0x18b/0x230 [...] Call Trace: tcp_data_queue+0xd2f/0x4250 tcp_rcv_state_process+0xb1c/0x49d3 tcp_v4_do_rcv+0x2bc/0x790 __release_sock+0x153/0x2d0 release_sock+0x4f/0x170 mptcp_shutdown+0x167/0x4e0 __sys_shutdown+0xe6/0x180 __x64_sys_shutdown+0x50/0x70 do_syscall_64+0x9a/0x370 entry_SYSCALL_64_after_hwframe+0x44/0xa9 2) client is stuck forever in mptcp_sendmsg() because the socket is not TCP_ESTABLISHED crash> bt 4847 PID: 4847 TASK: ffff88814b2fb100 CPU: 1 COMMAND: "gh35" #0 [ffff8881376ff680] __schedule at ffffffff97248da4 #1 [ffff8881376ff778] schedule at ffffffff9724a34f #2 [ffff8881376ff7a0] schedule_timeout at ffffffff97252ba0 #3 [ffff8881376ff8a8] wait_woken at ffffffff958ab4ba #4 [ffff8881376ff940] sk_stream_wait_connect at ffffffff96c2d859 #5 [ffff8881376ffa28] mptcp_sendmsg at ffffffff97207fca #6 [ffff8881376ffbc0] sock_sendmsg at ffffffff96be1b5b #7 [ffff8881376ffbe8] sock_write_iter at ffffffff96be1daa #8 [ffff8881376ffce8] new_sync_write at ffffffff95e5cb52 #9 [ffff8881376ffe50] vfs_write at ffffffff95e6547f #10 [ffff8881376ffe90] ksys_write at ffffffff95e65d26 #11 [ffff8881376fff28] do_syscall_64 at ffffffff956088ba #12 [ffff8881376fff50] entry_SYSCALL_64_after_hwframe at ffffffff9740008c RIP: 00007f126f6956ed RSP: 00007ffc2a320278 RFLAGS: 00000217 RAX: ffffffffffffffda RBX: 0000000020000044 RCX: 00007f126f6956ed RDX: 0000000000000004 RSI: 00000000004007b8 RDI: 0000000000000003 RBP: 00007ffc2a3202a0 R8: 0000000000400720 R9: 0000000000400720 R10: 0000000000400720 R11: 0000000000000217 R12: 00000000004004b0 R13: 00007ffc2a320380 R14: 0000000000000000 R15: 0000000000000000 ORIG_RAX: 0000000000000001 CS: 0033 SS: 002b 3) tcpdump captures show that DSS is exchanged even when MP_CAPABLE handshake didn't complete. $ tcpdump -tnnr bad.pcap IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S], seq 3208913911, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291694721,nop,wscale 7,mptcp capable v1], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [S.], seq 3208913911, ack 3208913912, win 65483, options [mss 65495,sackOK,TS val 3291706876 ecr 3291706876,nop,wscale 7,mptcp capable v1], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 1, win 512, options [nop,nop,TS val 3291706876 ecr 3291706876], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [F.], seq 1, ack 1, win 512, options [nop,nop,TS val 3291707876 ecr 3291706876,mptcp dss fin seq 0 subseq 0 len 1,nop,nop], length 0 IP 127.0.0.1.20000 > 127.0.0.1.20000: Flags [.], ack 2, win 512, options [nop,nop,TS val 3291707876 ecr 3291707876], length 0 force a fallback to TCP in these cases, and adjust the main socket state to avoid hanging in mptcp_sendmsg(). Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/35 Reported-by: Christoph Paasch <cpaasch@apple.com> Suggested-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: mptcp: improve fallback to TCPDavide Caratti2020-06-304-89/+98
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Keep using MPTCP sockets and a use "dummy mapping" in case of fallback to regular TCP. When fallback is triggered, skip addition of the MPTCP option on send. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/11 Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/22 Co-developed-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Davide Caratti <dcaratti@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: introduce token KUNIT self-testsPaolo Abeni2020-06-274-2/+152
| | | | | | | | | | | | | | | | | | | | | | | | Unit tests for the internal MPTCP token APIs, using KUNIT v1 -> v2: - use the correct RCU annotation when initializing icsk ulp - fix a few checkpatch issues Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: move crypto test to KUNITPaolo Abeni2020-06-274-67/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | currently MPTCP uses a custom hook to executed unit tests at boot time. Let's use the KUNIT framework instead. Additionally move the relevant code to a separate file and export the function needed by the test when self-tests are build as a module. Co-developed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: refactor token containerPaolo Abeni2020-06-274-113/+236
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace the radix tree with a hash table allocated at boot time. The radix tree has some shortcoming: a single lock is contented by all the mptcp operation, the lookup currently use such lock, and traversing all the items would require a lock, too. With hash table instead we trade a little memory to address all the above - a per bucket lock is used. To hash the MPTCP sockets, we re-use the msk' sk_node entry: the MPTCP sockets are never hashed by the stack. Replace the existing hash proto callbacks with a dummy implementation, annotating the above constraint. Additionally refactor the token creation to code to: - limit the number of consecutive attempts to a fixed maximum. Hitting a hash bucket with a long chain is considered a failed attempt - accept() no longer can fail to token management. - if token creation fails at connect() time, we do fallback to TCP (before the connection was closed) v1 -> v2: - fix "no newline at end of file" - Jakub Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | mptcp: add __init annotation on setup functionsPaolo Abeni2020-06-275-10/+10
| | | | | | | | | | | | | | | | | | Add the missing annotation in some setup-only functions. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-06-263-32/+28
|\| | | | | | | | | | | | | | | Minor overlapping changes in xfrm_device.c, between the double ESP trailing bug fix setting the XFRM_INIT flag and the changes in net-next preparing for bonding encryption support. Signed-off-by: David S. Miller <davem@davemloft.net>
| * mptcp: drop sndr_key in mptcp_syn_optionsGeliang Tang2020-06-231-2/+0
| | | | | | | | | | | | | | | | | | | | In RFC 8684, we don't need to send sndr_key in SYN package anymore, so drop it. Fixes: cc7972ea1932 ("mptcp: parse and emit MP_CAPABLE option according to v1 spec") Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mptcp: drop MP_JOIN request sock on syn cookiesPaolo Abeni2020-06-191-8/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently any MPTCP socket using syn cookies will fallback to TCP at 3rd ack time. In case of MP_JOIN requests, the RFC mandate closing the child and sockets, but the existing error paths do not handle the syncookie scenario correctly. Address the issue always forcing the child shutdown in case of MP_JOIN fallback. Fixes: ae2dd7164943 ("mptcp: handle tcp fallback when using syn cookies") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * mptcp: cache msk on MP_JOIN init_reqPaolo Abeni2020-06-192-22/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The msk ownership is transferred to the child socket at 3rd ack time, so that we avoid more lookups later. If the request does not reach the 3rd ack, the MSK reference is dropped at request sock release time. As a side effect, fallback is now tracked by a NULL msk reference instead of zeroed 'mp_join' field. This will simplify the next patch. Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Mat Martineau <mathew.j.martineau@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: move ipv4_specific to tcp include fileEric Dumazet2020-06-241-2/+0
| | | | | | | | | | | | | | Declare ipv4_specific once, in tcp.h were it belongs. Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tcp: move ipv6_specific declaration to remove a warningEric Dumazet2020-06-241-3/+0
|/ | | | | | | | | | | | ipv6_specific should be declared in tcp include files, not mptcp. This removes the following warning : CHECK net/ipv6/tcp_ipv6.c net/ipv6/tcp_ipv6.c:78:42: warning: symbol 'ipv6_specific' was not declared. Should it be static? Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mptcp: fix memory leak in mptcp_subflow_create_socket()Wei Yongjun2020-06-161-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | socket malloced by sock_create_kern() should be release before return in the error handling, otherwise it cause memory leak. unreferenced object 0xffff88810910c000 (size 1216): comm "00000003_test_m", pid 12238, jiffies 4295050289 (age 54.237s) hex dump (first 32 bytes): 01 00 00 00 01 00 00 00 00 00 00 00 00 00 00 00 ................ 00 00 00 00 00 00 00 00 00 2f 30 0a 81 88 ff ff ........./0..... backtrace: [<00000000e877f89f>] sock_alloc_inode+0x18/0x1c0 [<0000000093d1dd51>] alloc_inode+0x63/0x1d0 [<000000005673fec6>] new_inode_pseudo+0x14/0xe0 [<00000000b5db6be8>] sock_alloc+0x3c/0x260 [<00000000e7e3cbb2>] __sock_create+0x89/0x620 [<0000000023e48593>] mptcp_subflow_create_socket+0xc0/0x5e0 [<00000000419795e4>] __mptcp_socket_create+0x1ad/0x3f0 [<00000000b2f942e8>] mptcp_stream_connect+0x281/0x4f0 [<00000000c80cd5cc>] __sys_connect_file+0x14d/0x190 [<00000000dc761f11>] __sys_connect+0x128/0x160 [<000000008b14e764>] __x64_sys_connect+0x6f/0xb0 [<000000007b4f93bd>] do_syscall_64+0xa1/0x530 [<00000000d3e770b6>] entry_SYSCALL_64_after_hwframe+0x49/0xb3 Fixes: 2303f994b3e1 ("mptcp: Associate MPTCP context with TCP socket") Signed-off-by: Wei Yongjun <weiyongjun1@huawei.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mptcp: use list_first_entry_or_nullGeliang Tang2020-06-151-4/+1
| | | | | | | Use list_first_entry_or_null to simplify the code. Signed-off-by: Geliang Tang <geliangtang@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* mptcp: drop MPTCP_PM_MAX_ADDRGeliang Tang2020-06-151-2/+0
| | | | | | | | | We have defined MPTCP_PM_ADDR_MAX in pm_netlink.c, so drop this duplicate macro. Fixes: 1b1c7a0ef7f3 ("mptcp: Add path manager interface") Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* mptcp: don't leak msk in token containerPaolo Abeni2020-06-111-0/+1
| | | | | | | | | | | | | | | | | | If a listening MPTCP socket has unaccepted sockets at close time, the related msks are freed via mptcp_sock_destruct(), which in turn does not invoke the proto->destroy() method nor the mptcp_token_destroy() function. Due to the above, the child msk socket is not removed from the token container, leading to later UaF. Address the issue explicitly removing the token even in the above error path. Fixes: 79c0949e9a09 ("mptcp: Add key generation and token tree") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* mptcp: fix races between shutdown and recvmsgPaolo Abeni2020-06-101-21/+24
| | | | | | | | | | | | | | | | The msk sk_shutdown flag is set by a workqueue, possibly introducing some delay in user-space notification. If the last subflow carries some data with the fin packet, the user space can wake-up before RCV_SHUTDOWN is set. If it executes unblocking recvmsg(), it may return with an error instead of eof. Address the issue explicitly checking for eof in recvmsg(), when no data is found. Fixes: 59832e246515 ("mptcp: subflow: check parent mptcp socket on subflow state change") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* mptcp: bugfix for RM_ADDR option parsingGeliang Tang2020-06-091-0/+2
| | | | | | | | | | | | | | | | | | | In MPTCPOPT_RM_ADDR option parsing, the pointer "ptr" pointed to the "Subtype" octet, the pointer "ptr+1" pointed to the "Address ID" octet: +-------+-------+---------------+ |Subtype|(resvd)| Address ID | +-------+-------+---------------+ | | ptr ptr+1 We should set mp_opt->rm_id to the value of "ptr+1", not "ptr". This patch will fix this bug. Fixes: 3df523ab582c ("mptcp: Add ADD_ADDR handling") Signed-off-by: Geliang Tang <geliangtang@gmail.com> Reviewed-by: Matthieu Baerts <matthieu.baerts@tessares.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-nextLinus Torvalds2020-06-044-64/+196
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking updates from David Miller: 1) Allow setting bluetooth L2CAP modes via socket option, from Luiz Augusto von Dentz. 2) Add GSO partial support to igc, from Sasha Neftin. 3) Several cleanups and improvements to r8169 from Heiner Kallweit. 4) Add IF_OPER_TESTING link state and use it when ethtool triggers a device self-test. From Andrew Lunn. 5) Start moving away from custom driver versions, use the globally defined kernel version instead, from Leon Romanovsky. 6) Support GRO vis gro_cells in DSA layer, from Alexander Lobakin. 7) Allow hard IRQ deferral during NAPI, from Eric Dumazet. 8) Add sriov and vf support to hinic, from Luo bin. 9) Support Media Redundancy Protocol (MRP) in the bridging code, from Horatiu Vultur. 10) Support netmap in the nft_nat code, from Pablo Neira Ayuso. 11) Allow UDPv6 encapsulation of ESP in the ipsec code, from Sabrina Dubroca. Also add ipv6 support for espintcp. 12) Lots of ReST conversions of the networking documentation, from Mauro Carvalho Chehab. 13) Support configuration of ethtool rxnfc flows in bcmgenet driver, from Doug Berger. 14) Allow to dump cgroup id and filter by it in inet_diag code, from Dmitry Yakunin. 15) Add infrastructure to export netlink attribute policies to userspace, from Johannes Berg. 16) Several optimizations to sch_fq scheduler, from Eric Dumazet. 17) Fallback to the default qdisc if qdisc init fails because otherwise a packet scheduler init failure will make a device inoperative. From Jesper Dangaard Brouer. 18) Several RISCV bpf jit optimizations, from Luke Nelson. 19) Correct the return type of the ->ndo_start_xmit() method in several drivers, it's netdev_tx_t but many drivers were using 'int'. From Yunjian Wang. 20) Add an ethtool interface for PHY master/slave config, from Oleksij Rempel. 21) Add BPF iterators, from Yonghang Song. 22) Add cable test infrastructure, including ethool interfaces, from Andrew Lunn. Marvell PHY driver is the first to support this facility. 23) Remove zero-length arrays all over, from Gustavo A. R. Silva. 24) Calculate and maintain an explicit frame size in XDP, from Jesper Dangaard Brouer. 25) Add CAP_BPF, from Alexei Starovoitov. 26) Support terse dumps in the packet scheduler, from Vlad Buslov. 27) Support XDP_TX bulking in dpaa2 driver, from Ioana Ciornei. 28) Add devm_register_netdev(), from Bartosz Golaszewski. 29) Minimize qdisc resets, from Cong Wang. 30) Get rid of kernel_getsockopt and kernel_setsockopt in order to eliminate set_fs/get_fs calls. From Christoph Hellwig. * git://git.kernel.org/pub/scm/linux/kernel/git/netdev/net-next: (2517 commits) selftests: net: ip_defrag: ignore EPERM net_failover: fixed rollback in net_failover_open() Revert "tipc: Fix potential tipc_aead refcnt leak in tipc_crypto_rcv" Revert "tipc: Fix potential tipc_node refcnt leak in tipc_rcv" vmxnet3: allow rx flow hash ops only when rss is enabled hinic: add set_channels ethtool_ops support selftests/bpf: Add a default $(CXX) value tools/bpf: Don't use $(COMPILE.c) bpf, selftests: Use bpf_probe_read_kernel s390/bpf: Use bcr 0,%0 as tail call nop filler s390/bpf: Maintain 8-byte stack alignment selftests/bpf: Fix verifier test selftests/bpf: Fix sample_cnt shared between two threads bpf, selftests: Adapt cls_redirect to call csum_level helper bpf: Add csum_level helper for fixing up csum levels bpf: Fix up bpf_skb_adjust_room helper's skb csum setting sfc: add missing annotation for efx_ef10_try_update_nic_stats_vf() crypto/chtls: IPv6 support for inline TLS Crypto/chcr: Fixes a coccinile check error Crypto/chcr: Fixes compilations warnings ...
| * Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-06-011-19/+48
| |\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | xdp_umem.c had overlapping changes between the 64-bit math fix for the calculation of npgs and the removal of the zerocopy memory type which got rid of the chunk_size_nohdr member. The mlx5 Kconfig conflict is a case where we just take the net-next copy of the Kconfig entry dependency as it takes on the ESWITCH dependency by one level of indirection which is what the 'net' conflicting change is trying to ensure. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mptcp: fix NULL ptr dereference in MP_JOIN error pathPaolo Abeni2020-05-311-4/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When token lookup on MP_JOIN 3rd ack fails, the server socket closes with a reset the incoming child. Such socket has the 'is_mptcp' flag set, but no msk socket associated - due to the failed lookup. While crafting the reset packet mptcp_established_options_mp() will try to dereference the child's master socket, causing a NULL ptr dereference. This change addresses the issue with explicit fallback to TCP in such error path. Fixes: 729cd6436f35 ("mptcp: cope better with MP_JOIN failure") Signed-off-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | mptcp: attempt coalescing when moving skbs to mptcp rx queueFlorian Westphal2020-05-271-2/+19
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We can try to coalesce skbs we take from the subflows rx queue with the tail of the mptcp rx queue. If successful, the skb head can be discarded early. We can also free the skb extensions, we do not access them after this. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | Merge git://git.kernel.org/pub/scm/linux/kernel/git/netdev/netDavid S. Miller2020-05-244-25/+24
| |\ \ | | | | | | | | | | | | | | | | | | | | | | | | The MSCC bug fix in 'net' had to be slightly adjusted because the register accesses are done slightly differently in net-next. Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | ipv6: move SIOCADDRT and SIOCDELRT handling into ->compat_ioctlChristoph Hellwig2020-05-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | To prepare removing the global routing_ioctl hack start lifting the code into a newly added ipv6 ->compat_ioctl handler. Signed-off-by: Christoph Hellwig <hch@lst.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | net: allow __skb_ext_alloc to sleepFlorian Westphal2020-05-171-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | mptcp calls this from the transmit side, from process context. Allow a sleeping allocation instead of unconditional GFP_ATOMIC. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | mptcp: remove inner wait loop from mptcp_sendmsg_fragFlorian Westphal2020-05-171-14/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | previous patches made sure we only call into this function when these prerequisites are met, so no need to wait on the subflow socket anymore. Closes: https://github.com/multipath-tcp/mptcp_net-next/issues/7 Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | mptcp: fill skb page frag cache outside of mptcp_sendmsg_fragFlorian Westphal2020-05-171-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The mptcp_sendmsg_frag helper contains a loop that will wait on the subflow sk. It seems preferrable to only wait in mptcp_sendmsg() when blocking io is requested. mptcp_sendmsg already has such a wait loop that is used when no subflow socket is available for transmission. This is another preparation patch that makes sure we call mptcp_sendmsg_frag only if the page frag cache has been refilled. Followup patch will remove the wait loop from mptcp_sendmsg_frag(). The retransmit worker doesn't need to do this refill as it won't transmit new mptcp-level data. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | mptcp: fill skb extension cache outside of mptcp_sendmsg_fragFlorian Westphal2020-05-171-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The mptcp_sendmsg_frag helper contains a loop that will wait on the subflow sk. It seems preferrable to only wait in mptcp_sendmsg() when blocking io is requested. mptcp_sendmsg already has such a wait loop that is used when no subflow socket is available for transmission. This is a preparation patch that makes sure we call mptcp_sendmsg_frag only if a skb extension has been allocated. Moreover, such allocation currently uses GFP_ATOMIC while it could use sleeping allocation instead. Followup patches will remove the wait loop from mptcp_sendmsg_frag() and will allow to do a sleeping allocation for the extension. Acked-by: Paolo Abeni <pabeni@redhat.com> Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | mptcp: avoid blocking in tcp_sendpagesFlorian Westphal2020-05-171-3/+32
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The transmit loop continues to xmit new data until an error is returned or all data was transmitted. For the blocking i/o case, this means that tcp_sendpages() may block on the subflow until more space becomes available, i.e. we end up sleeping with the mptcp socket lock held. Instead we should check if a different subflow is ready to be used. This restarts the subflow sk lookup when the tx operation succeeded and the tcp subflow can't accept more data or if tcp_sendpages indicates -EAGAIN on a blocking mptcp socket. In that case we also need to set the NOSPACE bit to make sure we get notified once memory becomes available. In case all subflows are busy, the existing logic will wait until a subflow is ready, releasing the mptcp socket lock while doing so. The mptcp worker already sets DONTWAIT, so no need to make changes there. v2: * set NOSPACE bit * add a comment to clarify that mptcp-sk sndbuf limits need to be checked as well. Signed-off-by: Florian Westphal <fw@strlen.de> Signed-off-by: David S. Miller <davem@davemloft.net>