summaryrefslogtreecommitdiffstats
path: root/net (follow)
Commit message (Collapse)AuthorAgeFilesLines
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netLinus Torvalds2013-11-2052-921/+694
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull networking fixes from David Miller: "Mostly these are fixes for fallout due to merge window changes, as well as cures for problems that have been with us for a much longer period of time" 1) Johannes Berg noticed two major deficiencies in our genetlink registration. Some genetlink protocols we passing in constant counts for their ops array rather than something like ARRAY_SIZE(ops) or similar. Also, some genetlink protocols were using fixed IDs for their multicast groups. We have to retain these fixed IDs to keep existing userland tools working, but reserve them so that other multicast groups used by other protocols can not possibly conflict. In dealing with these two problems, we actually now use less state management for genetlink operations and multicast groups. 2) When configuring interface hardware timestamping, fix several drivers that simply do not validate that the hwtstamp_config value is one the driver actually supports. From Ben Hutchings. 3) Invalid memory references in mwifiex driver, from Amitkumar Karwar. 4) In dev_forward_skb(), set the skb->protocol in the right order relative to skb_scrub_packet(). From Alexei Starovoitov. 5) Bridge erroneously fails to use the proper wrapper functions to make calls to netdev_ops->ndo_vlan_rx_{add,kill}_vid. Fix from Toshiaki Makita. 6) When detaching a bridge port, make sure to flush all VLAN IDs to prevent them from leaking, also from Toshiaki Makita. 7) Put in a compromise for TCP Small Queues so that deep queued devices that delay TX reclaim non-trivially don't have such a performance decrease. One particularly problematic area is 802.11 AMPDU in wireless. From Eric Dumazet. 8) Fix crashes in tcp_fastopen_cache_get(), we can see NULL socket dsts here. Fix from Eric Dumzaet, reported by Dave Jones. 9) Fix use after free in ipv6 SIT driver, from Willem de Bruijn. 10) When computing mergeable buffer sizes, virtio-net fails to take the virtio-net header into account. From Michael Dalton. 11) Fix seqlock deadlock in ip4_datagram_connect() wrt. statistic bumping, this one has been with us for a while. From Eric Dumazet. 12) Fix NULL deref in the new TIPC fragmentation handling, from Erik Hugne. 13) 6lowpan bit used for traffic classification was wrong, from Jukka Rissanen. 14) macvlan has the same issue as normal vlans did wrt. propagating LRO disabling down to the real device, fix it the same way. From Michal Kubecek. 15) CPSW driver needs to soft reset all slaves during suspend, from Daniel Mack. 16) Fix small frame pacing in FQ packet scheduler, from Eric Dumazet. 17) The xen-netfront RX buffer refill timer isn't properly scheduled on partial RX allocation success, from Ma JieYue. 18) When ipv6 ping protocol support was added, the AF_INET6 protocol initialization cleanup path on failure was borked a little. Fix from Vlad Yasevich. 19) If a socket disconnects during a read/recvmsg/recvfrom/etc that blocks we can do the wrong thing with the msg_name we write back to userspace. From Hannes Frederic Sowa. There is another fix in the works from Hannes which will prevent future problems of this nature. 20) Fix route leak in VTI tunnel transmit, from Fan Du. * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net: (106 commits) genetlink: make multicast groups const, prevent abuse genetlink: pass family to functions using groups genetlink: add and use genl_set_err() genetlink: remove family pointer from genl_multicast_group genetlink: remove genl_unregister_mc_group() hsr: don't call genl_unregister_mc_group() quota/genetlink: use proper genetlink multicast APIs drop_monitor/genetlink: use proper genetlink multicast APIs genetlink: only pass array to genl_register_family_with_ops() tcp: don't update snd_nxt, when a socket is switched from repair mode atm: idt77252: fix dev refcnt leak xfrm: Release dst if this dst is improper for vti tunnel netlink: fix documentation typo in netlink_set_err() be2net: Delete secondary unicast MAC addresses during be_close be2net: Fix unconditional enabling of Rx interface options net, virtio_net: replace the magic value ping: prevent NULL pointer dereference on write to msg_name bnx2x: Prevent "timeout waiting for state X" bnx2x: prevent CFC attention bnx2x: Prevent panic during DMAE timeout ...
| * genetlink: make multicast groups const, prevent abuseJohannes Berg2013-11-1913-325/+298
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Register generic netlink multicast groups as an array with the family and give them contiguous group IDs. Then instead of passing the global group ID to the various functions that send messages, pass the ID relative to the family - for most families that's just 0 because the only have one group. This avoids the list_head and ID in each group, adding a new field for the mcast group ID offset to the family. At the same time, this allows us to prevent abusing groups again like the quota and dropmon code did, since we can now check that a family only uses a group it owns. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * genetlink: pass family to functions using groupsJohannes Berg2013-11-1911-74/+105
| | | | | | | | | | | | | | | | | | This doesn't really change anything, but prepares for the next patch that will change the APIs to pass the group ID within the family, rather than the global group ID. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * genetlink: add and use genl_set_err()Johannes Berg2013-11-192-7/+7
| | | | | | | | | | | | | | | | | | Add a static inline to generic netlink to wrap netlink_set_err() to make it easier to use here - use it in openvswitch (the only generic netlink user of netlink_set_err()). Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * genetlink: remove family pointer from genl_multicast_groupJohannes Berg2013-11-191-20/+18
| | | | | | | | | | | | | | | | There's no reason to have the family pointer there since it can just be passed internally where needed, so remove it. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * genetlink: remove genl_unregister_mc_group()Johannes Berg2013-11-191-23/+0
| | | | | | | | | | | | | | | | There are no users of this API remaining, and we'll soon change group registration to be static (like ops are now) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * hsr: don't call genl_unregister_mc_group()Johannes Berg2013-11-191-2/+0
| | | | | | | | | | | | | | | | There's no need to unregister the multicast group if the generic netlink family is registered immediately after. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * quota/genetlink: use proper genetlink multicast APIsJohannes Berg2013-11-191-2/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The quota code is abusing the genetlink API and is using its family ID as the multicast group ID, which is invalid and may belong to somebody else (and likely will.) Make the quota code use the correct API, but since this is already used as-is by userspace, reserve a family ID for this code and also reserve that group ID to not break userspace assumptions. Acked-by: Jan Kara <jack@suse.cz> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * drop_monitor/genetlink: use proper genetlink multicast APIsJohannes Berg2013-11-192-4/+22
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The drop monitor code is abusing the genetlink API and is statically using the generic netlink multicast group 1, even if that group belongs to somebody else (which it invariably will, since it's not reserved.) Make the drop monitor code use the proper APIs to reserve a group ID, but also reserve the group id 1 in generic netlink code to preserve the userspace API. Since drop monitor can be a module, don't clear the bit for it on unregistration. Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * genetlink: only pass array to genl_register_family_with_ops()Johannes Berg2013-11-1916-39/+31
| | | | | | | | | | | | | | | | | | | | | | | | | | | | As suggested by David Miller, make genl_register_family_with_ops() a macro and pass only the array, evaluating ARRAY_SIZE() in the macro, this is a little safer. The openvswitch has some indirection, assing ops/n_ops directly in that code. This might ultimately just assign the pointers in the family initializations, saving the struct genl_family_and_ops and code (once mcast groups are handled differently.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: don't update snd_nxt, when a socket is switched from repair modeAndrey Vagin2013-11-191-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | snd_nxt must be updated synchronously with sk_send_head. Otherwise tp->packets_out may be updated incorrectly, what may bring a kernel panic. Here is a kernel panic from my host. [ 103.043194] BUG: unable to handle kernel NULL pointer dereference at 0000000000000048 [ 103.044025] IP: [<ffffffff815aaaaf>] tcp_rearm_rto+0xcf/0x150 ... [ 146.301158] Call Trace: [ 146.301158] [<ffffffff815ab7f0>] tcp_ack+0xcc0/0x12c0 Before this panic a tcp socket was restored. This socket had sent and unsent data in the write queue. Sent data was restored in repair mode, then the socket was switched from reapair mode and unsent data was restored. After that the socket was switched back into repair mode. In that moment we had a socket where write queue looks like this: snd_una snd_nxt write_seq |_________|________| | sk_send_head After a second switching from repair mode the state of socket was changed: snd_una snd_nxt, write_seq |_________ ________| | sk_send_head This state is inconsistent, because snd_nxt and sk_send_head are not synchronized. Bellow you can find a call trace, how packets_out can be incremented twice for one skb, if snd_nxt and sk_send_head are not synchronized. In this case packets_out will be always positive, even when sk_write_queue is empty. tcp_write_wakeup skb = tcp_send_head(sk); tcp_fragment if (!before(tp->snd_nxt, TCP_SKB_CB(buff)->end_seq)) tcp_adjust_pcount(sk, skb, diff); tcp_event_new_data_sent tp->packets_out += tcp_skb_pcount(skb); I think update of snd_nxt isn't required, when a socket is switched from repair mode. Because it's initialized in tcp_connect_init. Then when a write queue is restored, snd_nxt is incremented in tcp_event_new_data_sent, so it's always is in consistent state. I have checked, that the bug is not reproduced with this patch and all tests about restoring tcp connections work fine. Cc: Pavel Emelyanov <xemul@parallels.com> Cc: Eric Dumazet <edumazet@google.com> Cc: "David S. Miller" <davem@davemloft.net> Cc: Alexey Kuznetsov <kuznet@ms2.inr.ac.ru> Cc: James Morris <jmorris@namei.org> Cc: Hideaki YOSHIFUJI <yoshfuji@linux-ipv6.org> Cc: Patrick McHardy <kaber@trash.net> Signed-off-by: Andrey Vagin <avagin@openvz.org> Acked-by: Pavel Emelyanov <xemul@parallels.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * xfrm: Release dst if this dst is improper for vti tunnelfan.du2013-11-191-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | After searching rt by the vti tunnel dst/src parameter, if this rt has neither attached to any transformation nor the transformation is not tunnel oriented, this rt should be released back to ip layer. otherwise causing dst memory leakage. Signed-off-by: Fan Du <fan.du@windriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * netlink: fix documentation typo in netlink_set_err()Johannes Berg2013-11-191-1/+1
| | | | | | | | | | | | | | The parameter is just 'group', not 'groups', fix the documentation typo. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ping: prevent NULL pointer dereference on write to msg_nameHannes Frederic Sowa2013-11-181-15/+19
| | | | | | | | | | | | | | | | A plain read() on a socket does set msg->msg_name to NULL. So check for NULL pointer first. Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv6: Fix inet6_init() cleanup orderVlad Yasevich2013-11-181-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 6d0bfe22611602f36617bc7aa2ffa1bbb2f54c67 net: ipv6: Add IPv6 support to the ping socket introduced a change in the cleanup logic of inet6_init and has a bug in that ipv6_packet_cleanup() may not be called. Fix the cleanup ordering. CC: Hannes Frederic Sowa <hannes@stressinduktion.org> CC: Lorenzo Colitti <lorenzo@google.com> CC: Fabio Estevam <fabio.estevam@freescale.com> Signed-off-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * genetlink: rename shadowed variableJohannes Berg2013-11-181-5/+5
| | | | | | | | | | | | | | | | | | Sparse pointed out that the new flags variable I had added shadowed an existing one, rename the new one to avoid that, making the code clearer. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * inet: prevent leakage of uninitialized memory to user in recv syscallsHannes Frederic Sowa2013-11-188-38/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Only update *addr_len when we actually fill in sockaddr, otherwise we can return uninitialized memory from the stack to the caller in the recvfrom, recvmmsg and recvmsg syscalls. Drop the the (addr_len == NULL) checks because we only get called with a valid addr_len pointer either from sock_common_recvmsg or inet_recvmsg. If a blocking read waits on a socket which is concurrently shut down we now return zero and set msg_msgnamelen to 0. Reported-by: mpb <mpb.mail@gmail.com> Suggested-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: Hannes Frederic Sowa <hannes@stressinduktion.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: ipv6: ndisc: Fix warning when CONFIG_SYSCTL=nFabio Estevam2013-11-181-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | When CONFIG_SYSCTL=n the following build warning happens: net/ipv6/ndisc.c:1730:1: warning: label 'out' defined but not used [-Wunused-label] The 'out' label is only used when CONFIG_SYSCTL=y, so move it inside the 'ifdef CONFIG_SYSCTL' block. Signed-off-by: Fabio Estevam <fabio.estevam@freescale.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * pkt_sched: fq: fix pacing for small framesEric Dumazet2013-11-161-4/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | For performance reasons, sch_fq tried hard to not setup timers for every sent packet, using a quantum based heuristic : A delay is setup only if the flow exhausted its credit. Problem is that application limited flows can refill their credit for every queued packet, and they can evade pacing. This problem can also be triggered when TCP flows use small MSS values, as TSO auto sizing builds packets that are smaller than the default fq quantum (3028 bytes) This patch adds a 40 ms delay to guard flow credit refill. Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler") Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Maciej Żenczykowski <maze@google.com> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Cc: Neal Cardwell <ncardwell@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * pkt_sched: fq: warn users using defrateEric Dumazet2013-11-161-6/+4
| | | | | | | | | | | | | | | | | | | | Commit 7eec4174ff29 ("pkt_sched: fq: fix non TCP flows pacing") obsoleted TCA_FQ_FLOW_DEFAULT_RATE without notice for the users. Suggested by David Miller Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * genetlink: unify registration functionsJohannes Berg2013-11-161-37/+16
| | | | | | | | | | | | | | | | | | | | | | | | Now that the ops assignment is just two variables rather than a long list iteration etc., there's no reason to separately export __genl_register_family() and __genl_register_family_with_ops(). Unify the two functions into __genl_register_family() and make genl_register_family_with_ops() call it after assigning the ops. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * macvlan: disable LRO on lower device instead of macvlanMichal Kubeček2013-11-151-0/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A macvlan device has always LRO disabled so that calling dev_disable_lro() on it does nothing. If we need to disable LRO e.g. because - the macvlan device is inserted into a bridge - IPv6 forwarding is enabled for it - it is in a different namespace than lowerdev and IPv4 forwarding is enabled in it we need to disable LRO on its underlying device instead (as we do for 802.1q VLAN devices). v2: use newly introduced netif_is_macvlan() Signed-off-by: Michal Kubecek <mkubecek@suse.cz> Signed-off-by: David S. Miller <davem@davemloft.net>
| * 6lowpan: Uncompression of traffic class field was incorrectJukka Rissanen2013-11-151-2/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If priority/traffic class field in IPv6 header is set (seen when using ssh), the uncompression sets the TC and Flow fields incorrectly. Example: This is IPv6 header of a sent packet. Note the priority/TC (=1) in the first byte. 00000000: 61 00 00 00 00 2c 06 40 fe 80 00 00 00 00 00 00 00000010: 02 02 72 ff fe c6 42 10 fe 80 00 00 00 00 00 00 00000020: 02 1e ab ff fe 4c 52 57 This gets compressed like this in the sending side 00000000: 72 31 04 06 02 1e ab ff fe 4c 52 57 ec c2 00 16 00000010: aa 2d fe 92 86 4e be c6 .... In the receiving end, the packet gets uncompressed to this IPv6 header 00000000: 60 06 06 02 00 2a 1e 40 fe 80 00 00 00 00 00 00 00000010: 02 02 72 ff fe c6 42 10 fe 80 00 00 00 00 00 00 00000020: ab ff fe 4c 52 57 ec c2 First four bytes are set incorrectly and we have also lost two bytes from destination address. The fix is to switch the case values in switch statement when checking the TC field. Signed-off-by: Jukka Rissanen <jukka.rissanen@linux.intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tipc: fix dereference before check warningErik Hugne2013-11-151-1/+2
| | | | | | | | | | | | | | | | | | | | | | | | This fixes the following Smatch warning: net/tipc/link.c:2364 tipc_link_recv_fragment() warn: variable dereferenced before check '*head' (see line 2361) A null pointer might be passed to skb_try_coalesce if a malicious sender injects orphan fragments on a link. Signed-off-by: Erik Hugne <erik.hugne@ericsson.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: fix possible seqlock deadlockEric Dumazet2013-11-141-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | ip4_datagram_connect() being called from process context, it should use IP_INC_STATS() instead of IP_INC_STATS_BH() otherwise we can deadlock on 32bit arches, or get corruptions of SNMP counters. Fixes: 584bdf8cbdf6 ("[IPV4]: Fix "ipOutNoRoutes" counter error for TCP and UDP") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net/hsr: Fix possible leak in 'hsr_get_node_status()'Geyslan G. Bem2013-11-141-1/+1
| | | | | | | | | | | | | | | | If 'hsr_get_node_data()' returns error, going directly to 'fail' label doesn't free the memory pointed by 'skb_out'. Signed-off-by: Geyslan G. Bem <geyslan@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * pkt_sched: fq: change classification of control packetsMaciej Żenczykowski2013-11-141-7/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Initial sch_fq implementation copied code from pfifo_fast to classify a packet as a high prio packet. This clashes with setups using PRIO with say 7 bands, as one of the band could be incorrectly (mis)classified by FQ. Packets would be queued in the 'internal' queue, and no pacing ever happen for this special queue. Fixes: afe4fd062416 ("pkt_sched: fq: Fair Queue packet scheduler") Signed-off-by: Maciej Żenczykowski <maze@google.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: Stephen Hemminger <stephen@networkplumber.org> Cc: Willem de Bruijn <willemb@google.com> Cc: Yuchung Cheng <ycheng@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * genetlink: make all genl_ops users constJohannes Berg2013-11-1414-18/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Now that genl_ops are no longer modified in place when registering, they can be made const. This patch was done mostly with spatch: @@ identifier ops; @@ +const struct genl_ops ops[] = { ... }; (except the struct thing in net/openvswitch/datapath.c) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * genetlink: allow making ops constJohannes Berg2013-11-142-19/+25
| | | | | | | | | | | | | | | | | | | | | | | | Allow making the ops array const by not modifying the ops flags on registration but rather only when ops are sent out in the family information. No users are updated yet except for the pre_doit/post_doit calls in wireless (the only ones that exist now.) Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * genetlink: register family ops as arrayJohannes Berg2013-11-141-45/+33
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Instead of using a linked list, use an array. This reduces the data size needed by the users of genetlink, for example in wireless (net/wireless/nl80211.c) on 64-bit it frees up over 1K of data space. Remove the attempted sending of CTRL_CMD_NEWOPS ctrl event since genl_ctrl_event(CTRL_CMD_NEWOPS, ...) only returns -EINVAL anyway, therefore no such event could ever be sent. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * genetlink: remove genl_register_ops/genl_unregister_opsJohannes Berg2013-11-141-56/+1
| | | | | | | | | | | | | | | | genl_register_ops() is still needed for internal registration, but is no longer available to users of the API. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * wimax: use genl_register_family_with_ops()Johannes Berg2013-11-146-119/+47
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This simplifies the code since there's no longer a need to have error handling in the registration. Unfortunately it means more extern function declarations are needed, but the overall goal would seem to justify this. Due to the removal of duplication in the netlink policies, this reduces the size of wimax by almost 1k. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ieee802154: use genl_register_family_with_ops()Johannes Berg2013-11-144-94/+51
| | | | | | | | | | | | | | | | | | | | | | | | | | | | This simplifies the code since there's no longer a need to have error handling in the registration. Unfortunately it means more extern function declarations are needed, but the overall goal would seem to justify this. While at it, also fix the registration error path - if the family registration failed then it shouldn't be unregistered. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * hsr: use genl_register_family_with_ops()Johannes Berg2013-11-141-29/+17
| | | | | | | | | | | | | | | | This simplifies the code since there's no longer a need to have error handling in the registration. Signed-off-by: Johannes Berg <johannes.berg@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * ip6tnl: fix use after free of fb_tnl_devNicolas Dichtel2013-11-141-5/+13
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bug has been introduced by commit bb8140947a24 ("ip6tnl: allow to use rtnl ops on fb tunnel"). When ip6_tunnel.ko is unloaded, FB device is delete by rtnl_link_unregister() and then we try to use the pointer in ip6_tnl_destroy_tunnels(). Let's add an handler for dellink, which will never remove the FB tunnel. With this patch it will no more be possible to remove it via 'ip link del ip6tnl0', but it's safer. The same fix was already proposed by Willem de Bruijn <willemb@google.com> for sit interfaces. CC: Willem de Bruijn <willemb@google.com> Reported-by: Steven Rostedt <rostedt@goodmis.org> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Willem de Bruijn <willemb@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sit/gre6: don't try to add the same route two timesNicolas Dichtel2013-11-141-3/+0
| | | | | | | | | | | | | | | | addrconf_add_linklocal() already adds the link local route, so there is no reason to add it before calling this function. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sit: link local routes are missingNicolas Dichtel2013-11-141-19/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | When a link local address was added to a sit interface, the corresponding route was not configured. This breaks routing protocols that use the link local address, like OSPFv3. To ease the code reading, I remove sit_route_add(), which only adds v4 mapped routes, and add this kind of route directly in sit_add_v4_addrs(). Thus link local and v4 mapped routes are configured in the same place. Reported-by: Li Hongjun <hongjun.li@6wind.com> Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sit: fix prefix length of ll and v4mapped addressesNicolas Dichtel2013-11-141-7/+4
| | | | | | | | | | | | | | | | | | | | | | When the local IPv4 endpoint is wilcard (0.0.0.0), the prefix length is correctly set, ie 64 if the address is a link local one or 96 if the address is a v4 mapped one. But when the local endpoint is specified, the prefix length is set to 128 for both kind of address. This patch fix this wrong prefix length. Signed-off-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * sit: fix use after free of fb_tunnel_devWillem de Bruijn2013-11-141-4/+14
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Bug: The fallback device is created in sit_init_net and assumed to be freed in sit_exit_net. First, it is dereferenced in that function, in sit_destroy_tunnels: struct net *net = dev_net(sitn->fb_tunnel_dev); Prior to this, rtnl_unlink_register has removed all devices that match rtnl_link_ops == sit_link_ops. Commit 205983c43700 added the line + sitn->fb_tunnel_dev->rtnl_link_ops = &sit_link_ops; which cases the fallback device to match here and be freed before it is last dereferenced. Fix: This commit adds an explicit .delllink callback to sit_link_ops that skips deallocation at rtnl_unlink_register for the fallback device. This mechanism is comparable to the one in ip_tunnel. It also modifies sit_destroy_tunnels and its only caller sit_exit_net to avoid the offending dereference in the first place. That double lookup is more complicated than required. Test: The bug is only triggered when CONFIG_NET_NS is enabled. It causes a GPF only when CONFIG_DEBUG_SLAB is enabled. Verified that this bug exists at the mentioned commit, at davem-net HEAD and at 3.11.y HEAD. Verified that it went away after applying this patch. Fixes: 205983c43700 ("sit: allow to use rtnl ops on fb tunnel") Signed-off-by: Willem de Bruijn <willemb@google.com> Acked-by: Nicolas Dichtel <nicolas.dichtel@6wind.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: sctp: bug-fixing: retran_path not set properly after transports ↵Chang Xiangzhong2013-11-141-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | recovering (v3) When a transport recovers due to the new coming sack, SCTP should iterate all of its transport_list to locate the __two__ most recently used transport and set to active_path and retran_path respectively. The exising code does not find the two properly - In case of the following list: [most-recent] -> [2nd-most-recent] -> ... Both active_path and retran_path would be set to the 1st element. The bug happens when: 1) multi-homing 2) failure/partial_failure transport recovers Both active_path and retran_path would be set to the same most-recent one, in other words, retran_path would not take its role - an end user might not even notice this issue. Signed-off-by: Chang Xiangzhong <changxiangzhong@gmail.com> Acked-by: Vlad Yasevich <vyasevich@gmail.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * net-tcp: fix panic in tcp_fastopen_cache_set()Eric Dumazet2013-11-141-1/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We had some reports of crashes using TCP fastopen, and Dave Jones gave a nice stack trace pointing to the error. Issue is that tcp_get_metrics() should not be called with a NULL dst Fixes: 1fe4c481ba637 ("net-tcp: Fast Open client - cookie cache") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Dave Jones <davej@redhat.com> Cc: Yuchung Cheng <ycheng@google.com> Acked-by: Yuchung Cheng <ycheng@google.com> Tested-by: Dave Jones <davej@fedoraproject.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tcp: tsq: restore minimal amount of queueingEric Dumazet2013-11-142-7/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | After commit c9eeec26e32e ("tcp: TSQ can use a dynamic limit"), several users reported throughput regressions, notably on mvneta and wifi adapters. 802.11 AMPDU requires a fair amount of queueing to be effective. This patch partially reverts the change done in tcp_write_xmit() so that the minimal amount is sysctl_tcp_limit_output_bytes. It also remove the use of this sysctl while building skb stored in write queue, as TSO autosizing does the right thing anyway. Users with well behaving NICS and correct qdisc (like sch_fq), can then lower the default sysctl_tcp_limit_output_bytes value from 128KB to 8KB. This new usage of sysctl_tcp_limit_output_bytes permits each driver authors to check how their driver performs when/if the value is set to a minimum of 4KB. Normally, line rate for a single TCP flow should be possible, but some drivers rely on timers to perform TX completion and too long TX completion delays prevent reaching full throughput. Fixes: c9eeec26e32e ("tcp: TSQ can use a dynamic limit") Signed-off-by: Eric Dumazet <edumazet@google.com> Reported-by: Sujith Manoharan <sujith@msujith.org> Reported-by: Arnaud Ebalard <arno@natisbad.org> Tested-by: Sujith Manoharan <sujith@msujith.org> Cc: Felix Fietkau <nbd@openwrt.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * bridge: Fix memory leak when deleting bridge with vlan filtering enabledToshiaki Makita2013-11-141-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We currently don't call br_vlan_flush() when deleting a bridge, which leads to memory leak if br->vlan_info is allocated. Steps to reproduce: while : do brctl addbr br0 bridge vlan add dev br0 vid 10 self brctl delbr br0 done We can observe the cache size of corresponding slab entry (as kmalloc-2048 in SLUB) is increased. kmemleak output: unreferenced object 0xffff8800b68a7000 (size 2048): comm "bridge", pid 2086, jiffies 4295774704 (age 47.656s) hex dump (first 32 bytes): 00 00 00 00 00 00 00 00 00 48 9b 36 00 88 ff ff .........H.6.... 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 00 ................ backtrace: [<ffffffff815eb6ae>] kmemleak_alloc+0x4e/0xb0 [<ffffffff8116a1ca>] kmem_cache_alloc_trace+0xca/0x220 [<ffffffffa03eddd6>] br_vlan_add+0x66/0xe0 [bridge] [<ffffffffa03e543c>] br_setlink+0x2dc/0x340 [bridge] [<ffffffff8150e481>] rtnl_bridge_setlink+0x101/0x200 [<ffffffff8150d9d9>] rtnetlink_rcv_msg+0x99/0x260 [<ffffffff81528679>] netlink_rcv_skb+0xa9/0xc0 [<ffffffff8150d938>] rtnetlink_rcv+0x28/0x30 [<ffffffff81527ccd>] netlink_unicast+0xdd/0x190 [<ffffffff8152807f>] netlink_sendmsg+0x2ff/0x740 [<ffffffff814e8368>] sock_sendmsg+0x88/0xc0 [<ffffffff814e8ac8>] ___sys_sendmsg.part.14+0x298/0x2b0 [<ffffffff814e91de>] __sys_sendmsg+0x4e/0x90 [<ffffffff814e922e>] SyS_sendmsg+0xe/0x10 [<ffffffff81601669>] system_call_fastpath+0x16/0x1b [<ffffffffffffffff>] 0xffffffffffffffff Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
| * bridge: Call vlan_vid_del for all vids at nbp_vlan_flushToshiaki Makita2013-11-141-0/+4
| | | | | | | | | | | | | | | | We should call vlan_vid_del for all vids at nbp_vlan_flush to prevent vid_info->refcount from being leaked when detaching a bridge port. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
| * bridge: Use vlan_vid_[add/del] instead of direct ndo_vlan_rx_[add/kill]_vid ↵Toshiaki Makita2013-11-141-14/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | calls We should use wrapper functions vlan_vid_[add/del] instead of ndo_vlan_rx_[add/kill]_vid. Otherwise, we might be not able to communicate using vlan interface in a certain situation. Example of problematic case: vconfig add eth0 10 brctl addif br0 eth0 bridge vlan add dev eth0 vid 10 bridge vlan del dev eth0 vid 10 brctl delif br0 eth0 In this case, we cannot communicate via eth0.10 because vlan 10 is filtered by NIC that has the vlan filtering feature. Signed-off-by: Toshiaki Makita <makita.toshiaki@lab.ntt.co.jp> Signed-off-by: David S. Miller <davem@davemloft.net>
| * core/dev: do not ignore dmac in dev_forward_skb()Alexei Starovoitov2013-11-142-7/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit 06a23fe31ca3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") and refactoring 64261f230a91 ("dev: move skb_scrub_packet() after eth_type_trans()") are forcing pkt_type to be PACKET_HOST when skb traverses veth. which means that ip forwarding will kick in inside netns even if skb->eth->h_dest != dev->dev_addr Fix order of eth_type_trans() and skb_scrub_packet() in dev_forward_skb() and in ip_tunnel_rcv() Fixes: 06a23fe31ca3 ("core/dev: set pkt_type after eth_type_trans() in dev_forward_skb()") CC: Isaku Yamahata <yamahatanetdev@gmail.com> CC: Maciej Zenczykowski <zenczykowski@gmail.com> CC: Nicolas Dichtel <nicolas.dichtel@6wind.com> Signed-off-by: Alexei Starovoitov <ast@plumgrid.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge tag 'nfs-for-3.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfsLinus Torvalds2013-11-162-19/+38
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull NFS client bugfixes: - Stable fix for data corruption when retransmitting O_DIRECT writes - Stable fix for a deep recursion/stack overflow bug in rpc_release_client - Stable fix for infinite looping when mounting a NFSv4.x volume - Fix a typo in the nfs mount option parser - Allow pNFS layouts to be compiled into the kernel when NFSv4.1 is * tag 'nfs-for-3.13-2' of git://git.linux-nfs.org/projects/trondmy/linux-nfs: nfs: fix pnfs Kconfig defaults NFS: correctly report misuse of "migration" mount option. nfs: don't retry detect_trunking with RPC_AUTH_UNIX more than once SUNRPC: Avoid deep recursion in rpc_release_client SUNRPC: Fix a data corruption issue when retransmitting RPC calls
| * | SUNRPC: Avoid deep recursion in rpc_release_clientTrond Myklebust2013-11-131-12/+17
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In cases where an rpc client has a parent hierarchy, then rpc_free_client may end up calling rpc_release_client() on the parent, thus recursing back into rpc_free_client. If the hierarchy is deep enough, then we can get into situations where the stack simply overflows. The fix is to have rpc_release_client() loop so that it can take care of the parent rpc client hierarchy without needing to recurse. Reported-by: Jeff Layton <jlayton@redhat.com> Reported-by: Weston Andros Adamson <dros@netapp.com> Reported-by: Bruce Fields <bfields@fieldses.org> Link: http://lkml.kernel.org/r/2C73011F-0939-434C-9E4D-13A1EB1403D7@netapp.com Cc: stable@vger.kernel.org Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| * | SUNRPC: Fix a data corruption issue when retransmitting RPC callsTrond Myklebust2013-11-081-7/+21
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The following scenario can cause silent data corruption when doing NFS writes. It has mainly been observed when doing database writes using O_DIRECT. 1) The RPC client uses sendpage() to do zero-copy of the page data. 2) Due to networking issues, the reply from the server is delayed, and so the RPC client times out. 3) The client issues a second sendpage of the page data as part of an RPC call retransmission. 4) The reply to the first transmission arrives from the server _before_ the client hardware has emptied the TCP socket send buffer. 5) After processing the reply, the RPC state machine rules that the call to be done, and triggers the completion callbacks. 6) The application notices the RPC call is done, and reuses the pages to store something else (e.g. a new write). 7) The client NIC drains the TCP socket send buffer. Since the page data has now changed, it reads a corrupted version of the initial RPC call, and puts it on the wire. This patch fixes the problem in the following manner: The ordering guarantees of TCP ensure that when the server sends a reply, then we know that the _first_ transmission has completed. Using zero-copy in that situation is therefore safe. If a time out occurs, we then send the retransmission using sendmsg() (i.e. no zero-copy), We then know that the socket contains a full copy of the data, and so it will retransmit a faithful reproduction even if the RPC call completes, and the application reuses the O_DIRECT buffer in the meantime. Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com> Cc: stable@vger.kernel.org
* | | Merge branch 'nfsd-next' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2013-11-166-28/+28
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pull nfsd changes from Bruce Fields: "This includes miscellaneous bugfixes and cleanup and a performance fix for write-heavy NFSv4 workloads. (The most significant nfsd-relevant change this time is actually in the delegation patches that went through Viro, fixing a long-standing bug that can cause NFSv4 clients to miss updates made by non-nfs users of the filesystem. Those enable some followup nfsd patches which I have queued locally, but those can wait till 3.14)" * 'nfsd-next' of git://linux-nfs.org/~bfields/linux: (24 commits) nfsd: export proper maximum file size to the client nfsd4: improve write performance with better sendspace reservations svcrpc: remove an unnecessary assignment sunrpc: comment typo fix Revert "nfsd: remove_stid can be incorporated into nfs4_put_delegation" nfsd4: fix discarded security labels on setattr NFSD: Add support for NFS v4.2 operation checking nfsd4: nfsd_shutdown_net needs state lock NFSD: Combine decode operations for v4 and v4.1 nfsd: -EINVAL on invalid anonuid/gid instead of silent failure nfsd: return better errors to exportfs nfsd: fh_update should error out in unexpected cases nfsd4: need to destroy revoked delegations in destroy_client nfsd: no need to unhash_stid before free nfsd: remove_stid can be incorporated into nfs4_put_delegation nfsd: nfs4_open_delegation needs to remove_stid rather than unhash_stid nfsd: nfs4_free_stid nfsd: fix Kconfig syntax sunrpc: trim off EC bytes in GSSAPI v2 unwrap gss_krb5: document that we ignore sequence number ...