summaryrefslogtreecommitdiffstats
path: root/net (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Bluetooth: Use common error handling code in bt_init()Markus Elfring2017-12-131-21/+17
| | | | | | | | | | | | * Improve jump targets so that a bit of exception handling can be better reused at the end of this function. * Adjust five condition checks. This issue was detected by using the Coccinelle software. Signed-off-by: Markus Elfring <elfring@users.sourceforge.net> Signed-off-by: Marcel Holtmann <marcel@holtmann.org>
* tcp: smoother receiver autotuningEric Dumazet2017-12-121-14/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Back in linux-3.13 (commit b0983d3c9b13 ("tcp: fix dynamic right sizing")) I addressed the pressing issues we had with receiver autotuning. But DRS suffers from extra latencies caused by rcv_rtt_est.rtt_us drifts. One common problem happens during slow start, since the apparent RTT measured by the receiver can be inflated by ~50%, at the end of one packet train. Also, a single drop can delay read() calls by one RTT, meaning tcp_rcv_space_adjust() can be called one RTT too late. By replacing the tri-modal heuristic with a continuous function, we can offset the effects of not growing 'at the optimal time'. The curve of the function matches prior behavior if the space increased by 25% and 50% exactly. Cost of added multiply/divide is small, considering a TCP flow typically would run this part of the code few times in its life. I tested this patch with 100 ms RTT / 1% loss link, 100 runs of (netperf -l 5), and got an average throughput of 4600 Mbit instead of 1700 Mbit. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Wei Wang <weiwan@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: avoid integer overflows in tcp_rcv_space_adjust()Eric Dumazet2017-12-121-5/+7
| | | | | | | | | | | | | When using large tcp_rmem[2] values (I did tests with 500 MB), I noticed overflows while computing rcvwin. Lets fix this before the following patch. Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Wei Wang <weiwan@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: do not overshoot window_clamp in tcp_rcv_space_adjust()Eric Dumazet2017-12-121-1/+1
| | | | | | | | | | | | | | | | | | While rcvbuf is properly clamped by tcp_rmem[2], rcvwin is left to a potentially too big value. It has no serious effect, since : 1) tcp_grow_window() has very strict checks. 2) window_clamp can be mangled by user space to any value anyway. tcp_init_buffer_space() and companions use tcp_full_space(), we use tcp_win_from_space() to avoid reloading sk->sk_rcvbuf Signed-off-by: Eric Dumazet <edumazet@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Wei Wang <weiwan@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: add support for the process of unordered idataXin Long2017-12-113-27/+378
| | | | | | | | | | | | | | | | | | | | | | Unordered idata process is more complicated than unordered data: - It has to add mid into sctp_stream_out to save the next mid value, which is separated from ordered idata's. - To support pd for unordered idata, another mid and pd_mode need to be added to save the message id and pd state in sctp_stream_in. - To make unordered idata reasm easier, it adds a new event queue to save frags for idata. The patch mostly adds the samilar reasm functions for unordered idata as ordered idata's, and also adjusts some other codes on assign_mid, abort_pd and ulpevent_data for idata. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: implement abort_pd for sctp_stream_interleaveXin Long2017-12-114-5/+107
| | | | | | | | | | | | | | | | | | abort_pd is added as a member of sctp_stream_interleave, used to abort partial delivery for data or idata, called in sctp_cmd_assoc_failed. Since stream interleave allows to do partial delivery for each stream at the same time, sctp_intl_abort_pd for idata would be very different from the old function sctp_ulpq_abort_pd for data. Note that sctp_ulpevent_make_pdapi will support per stream in this patch by adding pdapi_stream and pdapi_seq in sctp_pdapi_event, as described in section 6.1.7 of RFC6458. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: implement start_pd for sctp_stream_interleaveXin Long2017-12-112-1/+3
| | | | | | | | | | | | | start_pd is added as a member of sctp_stream_interleave, used to do partial_delivery for data or idata when datalen >= asoc->rwnd in sctp_eat_data. The codes have been done in last patches, but they need to be extracted into start_pd, so that it could be used for SCTP_CMD_PART_DELIVER cmd as well. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: implement renege_events for sctp_stream_interleaveXin Long2017-12-113-4/+114
| | | | | | | | | | | | | | renege_events is added as a member of sctp_stream_interleave, used to renege some old data or idata in reasm or lobby queue properly to free some memory for the new data when there's memory stress. It defines sctp_renege_events for idata, and leaves sctp_ulpq_renege as it is for data. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: implement enqueue_event for sctp_stream_interleaveXin Long2017-12-115-7/+10
| | | | | | | | | | | | | | enqueue_event is added as a member of sctp_stream_interleave, used to enqueue either data, idata or notification events into user socket rx queue. It replaces sctp_ulpq_tail_event used in the other places with enqueue_event. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: implement ulpevent_data for sctp_stream_interleaveXin Long2017-12-114-8/+429
| | | | | | | | | | | | | | | | | | | | | | | | | | ulpevent_data is added as a member of sctp_stream_interleave, used to do the most process in ulpq, including to convert data or idata chunk to event, reasm them in reasm queue and put them in lobby queue in right order, and deliver them up to user sk rx queue. This procedure is described in section 2.2.3 of RFC8260. It adds most functions for idata here to do the similar process as the old functions for data. But since the details are very different between them, the old functions can not be reused for idata. event->ssn and event->ppid settings are moved to ulpevent_data from sctp_ulpevent_make_rcvmsg, so that sctp_ulpevent_make_rcvmsg could work for both data and idata. Note that mid is added in sctp_ulpevent for idata, __packed has to be used for defining sctp_ulpevent, or it would exceeds the skb cb that saves a sctp_ulpevent variable for ulp layer process. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: implement validate_data for sctp_stream_interleaveXin Long2017-12-113-13/+50
| | | | | | | | | | | | | | | | | | | | | | | | | | validate_data is added as a member of sctp_stream_interleave, used to validate ssn/chunk type for data or mid (message id)/chunk type for idata, called in sctp_eat_data. If this check fails, an abort packet will be sent, as said in section 2.2.3 of RFC8260. It also adds the process for idata in rx path. As Marcelo pointed out, there's no need to add event table for idata, but just share chunk_event_table with data's. It would drop data chunk for idata and drop idata chunk for data by calling validate_data in sctp_eat_data. As last patch did, it also replaces sizeof(struct sctp_data_chunk) with sctp_datachk_len for rx path. After this patch, the idata can be accepted and delivered to ulp layer. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: implement assign_number for sctp_stream_interleaveXin Long2017-12-114-6/+57
| | | | | | | | | | | | | | | | | | | | | | | | | assign_number is added as a member of sctp_stream_interleave, used to assign ssn for data or mid (message id) for idata, called in sctp_packet_append_data. sctp_chunk_assign_ssn is left as it is, and sctp_chunk_assign_mid is added for sctp_stream_interleave_1. This procedure is described in section 2.2.2 of RFC8260. All sizeof(struct sctp_data_chunk) in tx path is replaced with sctp_datachk_len, to make it right for idata as well. And also adjust sctp_chunk_is_data for SCTP_CID_I_DATA. After this patch, idata can be built and sent in tx path. Note that if sp strm_interleave is set, it has to wait_connect in sctp_sendmsg, as asoc intl_enable need to be known after 4 shake- hands, to decide if it should use data or idata later. data and idata can't be mixed to send in one asoc. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: implement make_datafrag for sctp_stream_interleaveXin Long2017-12-115-18/+91
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | To avoid hundreds of checks for the different process on I-DATA chunk, struct sctp_stream_interleave is defined as a group of functions used to replace the codes in some place where it needs to do different job according to if the asoc intl_enabled is set. With these ops, it only needs to initialize asoc->stream.si with sctp_stream_interleave_0 for normal data if asoc intl_enable is 0, or sctp_stream_interleave_1 for idata if asoc intl_enable is set in sctp_stream_init. After that, the members in asoc->stream.si can be used directly in some special places without checking asoc intl_enable. make_datafrag is the first member for sctp_stream_interleave, it's used to make data or idata frags, called in sctp_datamsg_from_user. The old function sctp_make_datafrag_empty needs to be adjust some to fit in this ops. Note that as idata and data chunks have different length, it also defines data_chunk_len for sctp_stream_interleave to describe the chunk size. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: add basic structures and make chunk function for idataXin Long2017-12-111-0/+6
| | | | | | | | | | | | | sctp_idatahdr and sctp_idata_chunk are used to define and parse I-DATA chunk format, and sctp_make_idata is a function to build the chunk. The I-DATA Chunk Format is defined in section 2.1 of RFC8260. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: add asoc intl_enable negotiation during 4 shakehandsXin Long2017-12-111-2/+16
| | | | | | | | | | | | | asoc intl_enable will be set when local sp strm_interleave is set and there's I-DATA chunk in init and init_ack extensions, as said in section 2.2.1 of RFC8260. asoc intl_enable indicates all data will be sent as I-DATA chunks. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* sctp: add stream interleave enable members and sockoptXin Long2017-12-111-1/+87
| | | | | | | | | | | | | | | | | | | | | This patch adds intl_enable in asoc and netns, and strm_interleave in sctp_sock to indicate if stream interleave is enabled and supported. netns intl_enable would be set via procfs, but that is not added yet until all stream interleave codes are completely implemented; asoc intl_enable will be set when doing 4-shakehands. sp strm_interleave can be set by sockopt SCTP_INTERLEAVING_SUPPORTED which is also added in this patch. This socket option is defined in section 4.3.1 of RFC8260. Note that strm_interleave can only be set by sockopt when both netns intl_enable and sp frag_interleave are set. Signed-off-by: Xin Long <lucien.xin@gmail.com> Acked-by: Marcelo Ricardo Leitner <marcelo.leitner@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netlink: convert netlink tap spinlock to mutexCong Wang2017-12-111-6/+6
| | | | | | | | | | | | | Both netlink_add_tap() and netlink_remove_tap() are called in process context, no need to bother spinlock. Note, in fact, currently we always hold RTNL when calling these two functions, so we don't need any other lock at all, but keeping this lock doesn't harm anything. Cc: Daniel Borkmann <daniel@iogearbox.net> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* netlink: make netlink tap per netnsCong Wang2017-12-111-17/+49
| | | | | | | | | | | nlmon device is not supposed to capture netlink events from other netns, so instead of filtering events, we can simply make netlink tap itself per netns. Cc: Daniel Borkmann <daniel@iogearbox.net> Cc: Kevin Cernekee <cernekee@chromium.org> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* rhashtable: Change rhashtable_walk_start to return voidTom Herbert2017-12-119-70/+26
| | | | | | | | | | | | | | | | | | | | | | | | | Most callers of rhashtable_walk_start don't care about a resize event which is indicated by a return value of -EAGAIN. So calls to rhashtable_walk_start are wrapped wih code to ignore -EAGAIN. Something like this is common: ret = rhashtable_walk_start(rhiter); if (ret && ret != -EAGAIN) goto out; Since zero and -EAGAIN are the only possible return values from the function this check is pointless. The condition never evaluates to true. This patch changes rhashtable_walk_start to return void. This simplifies code for the callers that ignore -EAGAIN. For the few cases where the caller cares about the resize event, particularly where the table can be walked in mulitple parts for netlink or seq file dump, the function rhashtable_walk_start_check has been added that returns -EAGAIN on a resize event. Signed-off-by: Tom Herbert <tom@quantonium.net> Acked-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* rtnetlink: fix typo in GSO max segmentsStephen Hemminger2017-12-111-1/+1
| | | | | | Fixes: 46e6b992c250 ("rtnetlink: allow GSO maximums to be set on device creation") Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2017-12-1021-143/+136
|\ | | | | | | | | | | | | Conflict was two parallel additions of include files to sch_generic.c, no biggie. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: evaluate packet losses upon RTT changeYuchung Cheng2017-12-081-11/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RACK skips an ACK unless it advances the most recently delivered TX timestamp (rack.mstamp). Since RACK also uses the most recent RTT to decide if a packet is lost, RACK should still run the loss detection whenever the most recent RTT changes. For example, an ACK that does not advance the timestamp but triggers the cwnd undo due to reordering, would then use the most recent (higher) RTT measurement to detect further losses. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: fix off-by-one bug in RACKYuchung Cheng2017-12-081-3/+3
| | | | | | | | | | | | | | | | | | | | RACK should mark a packet lost when remaining wait time is zero. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: always evaluate losses in RACK upon undoYuchung Cheng2017-12-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When sender detects spurious retransmission, all packets marked lost are remarked to be in-flight. However some may be considered lost based on its timestamps in RACK. This patch forces RACK to re-evaluate, which may be skipped previously if the ACK does not advance RACK timestamp. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: correctly test congestion state in RACKYuchung Cheng2017-12-081-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | RACK does not test the loss recovery state correctly to compute the reordering window. It assumes if lost_out is zero then TCP is not in loss recovery. But it can be zero during recovery before calling tcp_rack_detect_loss(): when an ACK acknowledges all packets marked lost before receiving this ACK, but has not yet to discover new ones by tcp_rack_detect_loss(). The fix is to simply test the congestion state directly. Signed-off-by: Yuchung Cheng <ycheng@google.com> Reviewed-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Priyaranjan Jha <priyarjha@google.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp_bbr: reset long-term bandwidth sampling on loss recovery undoNeal Cardwell2017-12-081-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix BBR so that upon notification of a loss recovery undo BBR resets long-term bandwidth sampling. Under high reordering, reordering events can be interpreted as loss. If the reordering and spurious loss estimates are high enough, this can cause BBR to spuriously estimate that we are seeing loss rates high enough to trigger long-term bandwidth estimation. To avoid that problem, this commit resets long-term bandwidth sampling on loss recovery undo events. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp_bbr: reset full pipe detection on loss recovery undoNeal Cardwell2017-12-081-0/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Fix BBR so that upon notification of a loss recovery undo BBR resets the full pipe detection (STARTUP exit) state machine. Under high reordering, reordering events can be interpreted as loss. If the reordering and spurious loss estimates are high enough, this could previously cause BBR to spuriously estimate that the pipe is full. Since spurious loss recovery means that our overall sending will have slowed down spuriously, this commit gives a flow more time to probe robustly for bandwidth and decide the pipe is really full. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp_bbr: record "full bw reached" decision in new full_bw_reached bitNeal Cardwell2017-12-081-2/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This commit records the "full bw reached" decision in a new full_bw_reached bit. This is a pure refactor that does not change the current behavior, but enables subsequent fixes and improvements. In particular, this enables simple and clean fixes because the full_bw and full_bw_cnt can be unconditionally zeroed without worrying about forgetting that we estimated we filled the pipe in Startup. And it enables future improvements because multiple code paths can be used for estimating that we filled the pipe in Startup; any new code paths only need to set this bit when they think the pipe is full. Note that this fix intentionally reduces the width of the full_bw_cnt counter, since we have never used the most significant bit. Signed-off-by: Neal Cardwell <ncardwell@google.com> Reviewed-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: invalidate rate samples during SACK renegingYousuk Seung2017-12-083-5/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Mark tcp_sock during a SACK reneging event and invalidate rate samples while marked. Such rate samples may overestimate bw by including packets that were SACKed before reneging. < ack 6001 win 10000 sack 7001:38001 < ack 7001 win 0 sack 8001:38001 // Reneg detected > seq 7001:8001 // RTO, SACK cleared. < ack 38001 win 10000 In above example the rate sample taken after the last ack will count 7001-38001 as delivered while the actual delivery rate likely could be much lower i.e. 7001-8001. This patch adds a new field tcp_sock.sack_reneg and marks it when we declare SACK reneging and entering TCP_CA_Loss, and unmarks it after the last rate sample was taken before moving back to TCP_CA_Open. This patch also invalidates rate samples taken while tcp_sock.is_sack_reneg is set. Fixes: b9f64820fb22 ("tcp: track data delivery rate for a TCP connection") Signed-off-by: Yousuk Seung <ysseung@google.com> Signed-off-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: Yuchung Cheng <ycheng@google.com> Acked-by: Soheil Hassas Yeganeh <soheil@google.com> Acked-by: Eric Dumazet <edumazet@google.com> Acked-by: Priyaranjan Jha <priyarjha@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: use current time in tcp_rcv_space_adjust()Eric Dumazet2017-12-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | When I switched rcv_rtt_est to high resolution timestamps, I forgot that tp->tcp_mstamp needed to be refreshed in tcp_rcv_space_adjust() Using an old timestamp leads to autotuning lags. Fixes: 645f4c6f2ebd ("tcp: switch rcv_rtt_est and rcvq_space to high resolution timestamps") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Wei Wang <weiwan@google.com> Cc: Neal Cardwell <ncardwell@google.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * adding missing rcu_read_unlock in ipxip6_rcvNikita V. Shirokov2017-12-071-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels") introduced new exit point in ipxip6_rcv. however rcu_read_unlock is missing there. this diff is fixing this v1->v2: instead of doing rcu_read_unlock in place, we are going to "drop" section (to prevent skb leakage) Fixes: 8d79266bc48c ("ip6_tunnel: add collect_md mode to IPv6 tunnels") Signed-off-by: Nikita V. Shirokov <tehnerd@fb.com> Acked-by: Alexei Starovoitov <ast@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * rds: Fix NULL pointer dereference in __rds_rdma_mapHåkon Bugge2017-12-061-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This is a fix for syzkaller719569, where memory registration was attempted without any underlying transport being loaded. Analysis of the case reveals that it is the setsockopt() RDS_GET_MR (2) and RDS_GET_MR_FOR_DEST (7) that are vulnerable. Here is an example stack trace when the bug is hit: BUG: unable to handle kernel NULL pointer dereference at 00000000000000c0 IP: __rds_rdma_map+0x36/0x440 [rds] PGD 2f93d03067 P4D 2f93d03067 PUD 2f93d02067 PMD 0 Oops: 0000 [#1] SMP Modules linked in: bridge stp llc tun rpcsec_gss_krb5 nfsv4 dns_resolver nfs fscache rds binfmt_misc sb_edac intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul c rc32_pclmul ghash_clmulni_intel pcbc aesni_intel crypto_simd glue_helper cryptd iTCO_wdt mei_me sg iTCO_vendor_support ipmi_si mei ipmi_devintf nfsd shpchp pcspkr i2c_i801 ioatd ma ipmi_msghandler wmi lpc_ich mfd_core auth_rpcgss nfs_acl lockd grace sunrpc ip_tables ext4 mbcache jbd2 mgag200 i2c_algo_bit drm_kms_helper ixgbe syscopyarea ahci sysfillrect sysimgblt libahci mdio fb_sys_fops ttm ptp libata sd_mod mlx4_core drm crc32c_intel pps_core megaraid_sas i2c_core dca dm_mirror dm_region_hash dm_log dm_mod CPU: 48 PID: 45787 Comm: repro_set2 Not tainted 4.14.2-3.el7uek.x86_64 #2 Hardware name: Oracle Corporation ORACLE SERVER X5-2L/ASM,MOBO TRAY,2U, BIOS 31110000 03/03/2017 task: ffff882f9190db00 task.stack: ffffc9002b994000 RIP: 0010:__rds_rdma_map+0x36/0x440 [rds] RSP: 0018:ffffc9002b997df0 EFLAGS: 00010202 RAX: 0000000000000000 RBX: ffff882fa2182580 RCX: 0000000000000000 RDX: 0000000000000000 RSI: ffffc9002b997e40 RDI: ffff882fa2182580 RBP: ffffc9002b997e30 R08: 0000000000000000 R09: 0000000000000002 R10: ffff885fb29e3838 R11: 0000000000000000 R12: ffff882fa2182580 R13: ffff882fa2182580 R14: 0000000000000002 R15: 0000000020000ffc FS: 00007fbffa20b700(0000) GS:ffff882fbfb80000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00000000000000c0 CR3: 0000002f98a66006 CR4: 00000000001606e0 Call Trace: rds_get_mr+0x56/0x80 [rds] rds_setsockopt+0x172/0x340 [rds] ? __fget_light+0x25/0x60 ? __fdget+0x13/0x20 SyS_setsockopt+0x80/0xe0 do_syscall_64+0x67/0x1b0 entry_SYSCALL64_slow_path+0x25/0x25 RIP: 0033:0x7fbff9b117f9 RSP: 002b:00007fbffa20aed8 EFLAGS: 00000293 ORIG_RAX: 0000000000000036 RAX: ffffffffffffffda RBX: 00000000000c84a4 RCX: 00007fbff9b117f9 RDX: 0000000000000002 RSI: 0000400000000114 RDI: 000000000000109b RBP: 00007fbffa20af10 R08: 0000000000000020 R09: 00007fbff9dd7860 R10: 0000000020000ffc R11: 0000000000000293 R12: 0000000000000000 R13: 00007fbffa20b9c0 R14: 00007fbffa20b700 R15: 0000000000000021 Code: 41 56 41 55 49 89 fd 41 54 53 48 83 ec 18 8b 87 f0 02 00 00 48 89 55 d0 48 89 4d c8 85 c0 0f 84 2d 03 00 00 48 8b 87 00 03 00 00 <48> 83 b8 c0 00 00 00 00 0f 84 25 03 00 0 0 48 8b 06 48 8b 56 08 The fix is to check the existence of an underlying transport in __rds_rdma_map(). Signed-off-by: Håkon Bugge <haakon.bugge@oracle.com> Reported-by: syzbot <syzkaller@googlegroups.com> Acked-by: Santosh Shilimkar <santosh.shilimkar@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: use macvlan real dev trans_start in dev_trans_start()Chris Dion2017-12-061-0/+3
| | | | | | | | | | | | | | | | | | Macvlan devices are similar to vlans and do not update their own trans_start. In order for arp monitoring to work for a bond device when the slaves are macvlans, obtain its real device. Signed-off-by: Chris Dion <christopher.dion@dell.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * make sock_alloc_file() do sock_release() on failuresAl Viro2017-12-064-25/+9
| | | | | | | | | | | | | | | | | | | | | | | | This changes calling conventions (and simplifies the hell out the callers). New rules: once struct socket had been passed to sock_alloc_file(), it's been consumed either by struct file or by sock_release() done by sock_alloc_file(). Either way the caller should not do sock_release() after that point. Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
| * socketpair(): allocate descriptors firstAl Viro2017-12-061-51/+38
| | | | | | | | | | | | | | | | simplifies failure exits considerably... Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
| * fix kcm_clone()Al Viro2017-12-061-44/+27
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | 1) it's fput() or sock_release(), not both 2) don't do fd_install() until the last failure exit. 3) not a bug per se, but... don't attach socket to struct file until it's set up. Take reserving descriptor into the caller, move fd_install() to the caller, sanitize failure exits and calling conventions. Cc: stable@vger.kernel.org # v4.6+ Acked-by: Tom Herbert <tom@herbertland.com> Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
| * dccp: CVE-2017-8824: use-after-free in DCCP codeMohamed Ghannam2017-12-061-0/+5
| | | | | | | | | | | | | | | | | | | | Whenever the sock object is in DCCP_CLOSED state, dccp_disconnect() must free dccps_hc_tx_ccid and dccps_hc_rx_ccid and set to NULL. Signed-off-by: Mohamed Ghannam <simo.ghannam@gmail.com> Reviewed-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * VSOCK: fix outdated sk_state value in hvs_release()Stefan Hajnoczi2017-12-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since commit 3b4477d2dcf2709d0be89e2a8dced3d0f4a017f2 ("VSOCK: use TCP state constants for sk_state") VSOCK has used TCP_* constants for sk_state. Commit b4562ca7925a3bedada87a3dd072dd5bad043288 ("hv_sock: add locking in the open/close/release code paths") reintroduced the SS_DISCONNECTING constant. This patch replaces the old SS_DISCONNECTING with the new TCP_CLOSING constant. CC: Dexuan Cui <decui@microsoft.com> CC: Cathy Avery <cavery@redhat.com> Signed-off-by: Stefan Hajnoczi <stefanha@redhat.com> Reviewed-by: Jorgen Hansen <jhansen@vmware.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: fix memory leak in tipc_accept_from_sock()Jon Maloy2017-12-051-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | When the function tipc_accept_from_sock() fails to create an instance of struct tipc_subscriber it omits to free the already created instance of struct tipc_conn instance before it returns. We fix that with this commit. Reported-by: David S. Miller <davem@davemloft.net> Signed-off-by: Jon Maloy <jon.maloy@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: fix a null pointer deref on error pathCong Wang2017-12-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In tipc_topsrv_kern_subscr() when s->tipc_conn_new() fails we call tipc_close_conn() to clean up, but in this case calling conn_put() is just enough. This fixes the folllowing crash: kasan: GPF could be caused by NULL-ptr deref or user memory access general protection fault: 0000 [#1] SMP KASAN Dumping ftrace buffer: (ftrace buffer empty) Modules linked in: CPU: 0 PID: 3085 Comm: syzkaller064164 Not tainted 4.15.0-rc1+ #137 Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011 task: 00000000c24413a5 task.stack: 000000005e8160b5 RIP: 0010:__lock_acquire+0xd55/0x47f0 kernel/locking/lockdep.c:3378 RSP: 0018:ffff8801cb5474a8 EFLAGS: 00010002 RAX: dffffc0000000000 RBX: 0000000000000000 RCX: 0000000000000000 RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffff85ecb400 RBP: ffff8801cb547830 R08: 0000000000000001 R09: 0000000000000000 R10: 0000000000000000 R11: ffffffff87489d60 R12: ffff8801cd2980c0 R13: 0000000000000000 R14: 0000000000000001 R15: 0000000000000020 FS: 00000000014ee880(0000) GS:ffff8801db400000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033 CR2: 00007ffee2426e40 CR3: 00000001cb85a000 CR4: 00000000001406f0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400 Call Trace: lock_acquire+0x1d5/0x580 kernel/locking/lockdep.c:4004 __raw_spin_lock_bh include/linux/spinlock_api_smp.h:135 [inline] _raw_spin_lock_bh+0x31/0x40 kernel/locking/spinlock.c:175 spin_lock_bh include/linux/spinlock.h:320 [inline] tipc_subscrb_subscrp_delete+0x8f/0x470 net/tipc/subscr.c:201 tipc_subscrb_delete net/tipc/subscr.c:238 [inline] tipc_subscrb_release_cb+0x17/0x30 net/tipc/subscr.c:316 tipc_close_conn+0x171/0x270 net/tipc/server.c:204 tipc_topsrv_kern_subscr+0x724/0x810 net/tipc/server.c:514 tipc_group_create+0x702/0x9c0 net/tipc/group.c:184 tipc_sk_join net/tipc/socket.c:2747 [inline] tipc_setsockopt+0x249/0xc10 net/tipc/socket.c:2861 SYSC_setsockopt net/socket.c:1851 [inline] SyS_setsockopt+0x189/0x360 net/socket.c:1830 entry_SYSCALL_64_fastpath+0x1f/0x96 Fixes: 14c04493cb77 ("tipc: add ability to order and receive topology events in driver") Reported-by: syzbot <syzkaller@googlegroups.com> Cc: Jon Maloy <jon.maloy@ericsson.com> Cc: Ying Xue <ying.xue@windriver.com> Signed-off-by: Cong Wang <xiyou.wangcong@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net_sched: red: Avoid illegal valuesNogah Frankel2017-12-054-0/+11
| | | | | | | | | | | | | | | | | | Check the qmin & qmax values doesn't overflow for the given Wlog value. Check that qmin <= qmax. Fixes: a783474591f2 ("[PKT_SCHED]: Generic RED layer") Signed-off-by: Nogah Frankel <nogahf@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Revert "net: core: maybe return -EEXIST in __dev_alloc_name"Johannes Berg2017-12-051-1/+1
| | | | | | | | | | | | | | | | | | | | | | This reverts commit d6f295e9def0; some userspace (in the case we noticed it's wpa_supplicant), is relying on the current error code to determine that a fixed name interface already exists. Reported-by: Jouni Malinen <j@w1.fi> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * Revert "tcp: must block bh in __inet_twsk_hashdance()"Eric Dumazet2017-12-051-3/+3
| | | | | | | | | | | | | | | | | | | | | | We had to disable BH _before_ calling __inet_twsk_hashdance() in commit cfac7f836a71 ("tcp/dccp: block bh before arming time_wait timer"). This means we can revert 614bdd4d6e61 ("tcp: must block bh in __inet_twsk_hashdance()"). Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | rtnetlink: allow GSO maximums to be set on device creationStephen Hemminger2017-12-081-0/+34
| | | | | | | | | | | | | | | | | | Netlink device already allows changing GSO sizes with ip set command. The part that is missing is allowing overriding GSO settings on device creation. Signed-off-by: Stephen Hemminger <sthemmin@microsoft.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: fix use-after-free in tcf_block_put_extJiri Pirko2017-12-081-6/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Since the block is freed with last chain being put, once we reach the end of iteration of list_for_each_entry_safe, the block may be already freed. I'm hitting this only by creating and deleting clsact: [ 202.171952] ================================================================== [ 202.180182] BUG: KASAN: use-after-free in tcf_block_put_ext+0x240/0x390 [ 202.187590] Read of size 8 at addr ffff880225539a80 by task tc/796 [ 202.194508] [ 202.196185] CPU: 0 PID: 796 Comm: tc Not tainted 4.15.0-rc2jiri+ #5 [ 202.203200] Hardware name: Mellanox Technologies Ltd. "MSN2100-CB2F"/"SA001017", BIOS 5.6.5 06/07/2016 [ 202.213613] Call Trace: [ 202.216369] dump_stack+0xda/0x169 [ 202.220192] ? dma_virt_map_sg+0x147/0x147 [ 202.224790] ? show_regs_print_info+0x54/0x54 [ 202.229691] ? tcf_chain_destroy+0x1dc/0x250 [ 202.234494] print_address_description+0x83/0x3d0 [ 202.239781] ? tcf_block_put_ext+0x240/0x390 [ 202.244575] kasan_report+0x1ba/0x460 [ 202.248707] ? tcf_block_put_ext+0x240/0x390 [ 202.253518] tcf_block_put_ext+0x240/0x390 [ 202.258117] ? tcf_chain_flush+0x290/0x290 [ 202.262708] ? qdisc_hash_del+0x82/0x1a0 [ 202.267111] ? qdisc_hash_add+0x50/0x50 [ 202.271411] ? __lock_is_held+0x5f/0x1a0 [ 202.275843] clsact_destroy+0x3d/0x80 [sch_ingress] [ 202.281323] qdisc_destroy+0xcb/0x240 [ 202.285445] qdisc_graft+0x216/0x7b0 [ 202.289497] tc_get_qdisc+0x260/0x560 Fix this by holding the block also by chain 0 and put chain 0 explicitly, out of the list_for_each_entry_safe loop at the very end of tcf_block_put_ext. Fixes: efbf78973978 ("net_sched: get rid of rcu_barrier() in tcf_block_put_ext()") Signed-off-by: Jiri Pirko <jiri@mellanox.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: pfifo_fast use skb_arrayJohn Fastabend2017-12-082-53/+92
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This converts the pfifo_fast qdisc to use the skb_array data structure and set the lockless qdisc bit. pfifo_fast is the first qdisc to support the lockless bit that can be a child of a qdisc requiring locking. So we add logic to clear the lock bit on initialization in these cases when the qdisc graft operation occurs. This also removes the logic used to pick the next band to dequeue from and instead just checks a per priority array for packets from top priority to lowest. This might need to be a bit more clever but seems to work for now. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqprioJohn Fastabend2017-12-082-35/+69
| | | | | | | | | | | | | | | | | | | | | | | | | | The sch_mqprio qdisc creates a sub-qdisc per tx queue which are then called independently for enqueue and dequeue operations. However statistics are aggregated and pushed up to the "master" qdisc. This patch adds support for any of the sub-qdiscs to be per cpu statistic qdiscs. To handle this case add a check when calculating stats and aggregate the per cpu stats if needed. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: add support for TCQ_F_NOLOCK subqueues to sch_mqJohn Fastabend2017-12-082-11/+23
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The sch_mq qdisc creates a sub-qdisc per tx queue which are then called independently for enqueue and dequeue operations. However statistics are aggregated and pushed up to the "master" qdisc. This patch adds support for any of the sub-qdiscs to be per cpu statistic qdiscs. To handle this case add a check when calculating stats and aggregate the per cpu stats if needed. Also exports __gnet_stats_copy_queue() to use as a helper function. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: helpers to sum qlen and qlen for per cpu logicJohn Fastabend2017-12-081-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | Add qdisc qlen helper routines for lockless qdiscs to use. The qdisc qlen is no longer used in the hotpath but it is reported via stats query on the qdisc so it still needs to be tracked. This adds the per cpu operations needed along with a helper to return the summation of per cpu stats. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: sched: check for frozen queue before skb_bad_txq checkJohn Fastabend2017-12-081-4/+7
| | | | | | | | | | | | | | | | | | | | I can not think of any reason to pull the bad txq skb off the qdisc if the txq we plan to send this on is still frozen. So check for frozen queue first and abort before dequeuing either skb_bad_txq skb or normal qdisc dequeue() skb. Signed-off-by: John Fastabend <john.fastabend@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>