summaryrefslogtreecommitdiffstats
path: root/drivers/net/tun.c (follow)
Commit message (Collapse)AuthorAgeFilesLines
* tuntap: limit head length of skb allocatedJason Wang2013-11-141-1/+9
| | | | | | | | | | | | | | | | We currently use hdr_len as a hint of head length which is advertised by guest. But when guest advertise a very big value, it can lead to an 64K+ allocating of kmalloc() which has a very high possibility of failure when host memory is fragmented or under heavy stress. The huge hdr_len also reduce the effect of zerocopy or even disable if a gso skb is linearized in guest. To solves those issues, this patch introduces an upper limit (PAGE_SIZE) of the head, which guarantees an order 0 allocation each time. Cc: Stefan Hajnoczi <stefanha@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tun: don't look at current when non-blockingMichael S. Tsirkin2013-10-081-3/+5
| | | | | | | | | | | We play with a wait queue even if socket is non blocking. This is an obvious waste. Besides, it will prevent calling the non blocking variant when current is not valid. Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tuntap: correctly handle error in tun_set_iff()Jason Wang2013-09-121-3/+8
| | | | | | | | | | | | | | | | | | Commit c8d68e6be1c3b242f1c598595830890b65cea64a (tuntap: multiqueue support) only call free_netdev() on error in tun_set_iff(). This causes several issues: - memory of tun security were leaked - use after free since the flow gc timer was not deleted and the tfile were not detached This patch solves the above issues. Reported-by: Wannes Rombouts <wannes.rombouts@epitech.eu> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tuntap: orphan frags before trying to set tx timestampJason Wang2013-09-051-3/+5
| | | | | | | | | | | | | | | sock_tx_timestamp() will clear all zerocopy flags of skb which may lead the frags never to be orphaned. This will break guest to guest traffic when zerocopy is enabled. Fix this by orphaning the frags before trying to set tx time stamp. The issue were introduced by commit eda297729171fe16bf34fe5b0419dfb69060f623 (tun: Support software transmit time stamping). Cc: Richard Cochran <richardcochran@gmail.com> Cc: Sergei Shtylyov <sergei.shtylyov@cogentembedded.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tuntap: purge socket error queue on detachJason Wang2013-09-051-3/+9
| | | | | | | | | | | | Commit eda297729171fe16bf34fe5b0419dfb69060f623 (tun: Support software transmit time stamping) will queue skbs into error queue when tx stamping is enabled. But it forgets to purge the error queue during detach. This patch fixes this. Cc: Richard Cochran <richardcochran@gmail.com> Acked-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tun: Get skfilter layoutPavel Emelyanov2013-08-211-0/+10
| | | | | | | | | | The only thing we may have from tun device is the fprog, whic contains the number of filter elements and a pointer to (user-space) memory where the elements are. The program itself may not be available if the device is persistent and detached. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tun: Allow to skip filter on attachPavel Emelyanov2013-08-211-5/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | There's a small problem with sk-filters on tun devices. Consider an application doing this sequence of steps: fd = open("/dev/net/tun"); ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" }); ioctl(fd, TUNATTACHFILTER, &my_filter); ioctl(fd, TUNSETPERSIST, 1); close(fd); At that point the tun0 will remain in the system and will keep in mind that there should be a socket filter at address '&my_filter'. If after that we do fd = open("/dev/net/tun"); ioctl(fd, TUNSETIFF, { .ifr_name = "tun0" }); we most likely receive the -EFAULT error, since tun_attach() would try to connect the filter back. But (!) if we provide a filter at address &my_filter, then tun0 will be created and the "new" filter would be attached, but application may not know about that. This may create certain problems to anyone using tun-s, but it's critical problem for c/r -- if we meet a persistent tun device with a filter in mind, we will not be able to attach to it to dump its state (flags, owner, address, vnethdr size, etc.). The proposal is to allow to attach to tun device (with TUNSETIFF) w/o attaching the filter to the tun-file's socket. After this attach app may e.g clean the device by dropping the filter, it doesn't want to have one, or (in case of c/r) get information about the device with tun ioctls. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tun: Report whether the queue is attached or notPavel Emelyanov2013-08-211-0/+3
| | | | | | | | | | Multiqueue tun devices allow to attach and detach from its queues while keeping the interface itself set on file. Knowing this is critical for the checkpoint part of criu project. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tun: Add ability to create tun device with given indexPavel Emelyanov2013-08-211-1/+20
| | | | | | | | | | | | | | | | | | | | | | | | Tun devices cannot be created with ifidex user wants, but it's required by checkpoint-restore project. Long time ago such ability was implemented for rtnl_ops-based interface for creating links (9c7dafbf net: Allow to create links with given ifindex), but the only API for creating and managing tuntap devices is ioctl-based and is evolving with adding new ones (cde8b15f tuntap: add ioctl to attach or detach a file form tuntap device). Following that trend, here's how a new ioctl that sets the ifindex for device, that _will_ be created by TUNSETIFF ioctl looks like. So those who want a tuntap device with the ifindex N, should open the tun device, call ioctl(fd, TUNSETIFINDEX, &N), then call TUNSETIFF. If the index N is busy, then the register_netdev will find this out and the ioctl would be failed with -EBUSY. If setifindex is not called, then it will be generated as before. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-08-171-2/+4
|\
| * tun: signedness bug in tun_get_user()Dan Carpenter2013-08-151-2/+4
| | | | | | | | | | | | | | | | | | | | | | The recent fix d9bf5f1309 "tun: compare with 0 instead of total_len" is not totally correct. Because "len" and "sizeof()" are size_t type, that means they are never less than zero. Signed-off-by: Dan Carpenter <dan.carpenter@oracle.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tun: compare with 0 instead of total_lenWeiping Pan2013-08-141-2/+2
| | | | | | | | | | | | | | | | | | Since we set "len = total_len" in the beginning of tun_get_user(), so we should compare the new len with 0, instead of total_len, or the if statement always returns false. Signed-off-by: Weiping Pan <wpan@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: attempt high order allocations in sock_alloc_send_pskb()Eric Dumazet2013-08-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Adding paged frags skbs to af_unix sockets introduced a performance regression on large sends because of additional page allocations, even if each skb could carry at least 100% more payload than before. We can instruct sock_alloc_send_pskb() to attempt high order allocations. Most of the time, it does a single page allocation instead of 8. I added an additional parameter to sock_alloc_send_pskb() to let other users to opt-in for this new feature on followup patches. Tested: Before patch : $ netperf -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 2304 212992 212992 10.00 46861.15 After patch : $ netperf -t STREAM_STREAM STREAM STREAM TEST Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 2304 212992 212992 10.00 57981.11 Signed-off-by: Eric Dumazet <edumazet@google.com> Cc: David Rientjes <rientjes@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: move zerocopy_sg_from_iovec() to net/core/datagram.cJason Wang2013-08-081-80/+0
| | | | | | | | | | | | | | To let it be reused and reduce code duplication. Also document this function. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: move iov_pages() to net/core/iovec.cJason Wang2013-08-081-23/+0
| | | | | | | | | | | | | | To let it be reused and reduce code duplication. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tuntap: hardware vlan tx supportJason Wang2013-07-281-4/+36
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Inspired by commit f09e2249c4f5c7c13261ec73f5a7807076af0c8e (macvtap: restore vlan header on user read). This patch adds hardware vlan tx support for tuntap. This is done by copying vlan header directly into userspace in tun_put_user() instead of doing it through __vlan_put_tag() in dev_hard_start_xmit(). This eliminates one unnecessary memmove() in vlan_insert_tag() for 802.1ad and 802.1q traffic. pktgen test shows about 20% improvement for 802.1q traffic: Before: 662149pps 317Mb/sec (317831520bps) errors: 0 After: 801033pps 384Mb/sec (384495840bps) errors: 0 Cc: Basil Gor <basil.gor@gmail.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tun: Support software transmit time stamping.Richard Cochran2013-07-221-2/+12
|/ | | | | | | | This patch adds transmit time stamping to the tun/tap driver. Similar support already exists for UDP, can, and raw packets. Signed-off-by: Richard Cochran <richardcochran@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tuntap: do not zerocopy if iov needs more pages than MAX_SKB_FRAGSJason Wang2013-07-181-24/+38
| | | | | | | | | | | | | | | | | | | | | | | | We try to linearize part of the skb when the number of iov is greater than MAX_SKB_FRAGS. This is not enough since each single vector may occupy more than one pages, so zerocopy_sg_fromiovec() may still fail and may break the guest network. Solve this problem by calculate the pages needed for iov before trying to do zerocopy and switch to use copy instead of zerocopy if it needs more than MAX_SKB_FRAGS. This is done through introducing a new helper to count the pages for iov, and call uarg->callback() manually when switching from zerocopy to copy to notify vhost. We can do further optimization on top. The bug were introduced from commit 0690899b4d4501b3505be069b9a687e68ccbe15b (tun: experimental zero copy tx support) Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tuntap: correctly linearize skb when zerocopy is usedJason Wang2013-07-111-3/+6
| | | | | | | | | | | | | | | | | | Userspace may produce vectors greater than MAX_SKB_FRAGS. When we try to linearize parts of the skb to let the rest of iov to be fit in the frags, we need count copylen into linear when calling tun_alloc_skb() instead of partly counting it into data_len. Since this breaks zerocopy_sg_from_iovec() since its inner counter assumes nr_frags should be zero at beginning. This cause nr_frags to be increased wrongly without setting the correct frags. This bug were introduced from 0690899b4d4501b3505be069b9a687e68ccbe15b (tun: experimental zero copy tx support) Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-07-031-2/+4
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/freescale/fec_main.c drivers/net/ethernet/renesas/sh_eth.c net/ipv4/gre.c The GRE conflict is between a bug fix (kfree_skb --> kfree_skb_list) and the splitting of the gre.c code into seperate files. The FEC conflict was two sets of changes adding ethtool support code in an "!CONFIG_M5272" CPP protected block. Finally the sh_eth.c conflict was between one commit add bits set in the .eesr_err_check mask whilst another commit removed the .tx_error_check member and assignments. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tun: fix recovery from gup errorsMichael S. Tsirkin2013-06-261-2/+4
| | | | | | | | | | | | | | | | | | | | | | | | | | get user pages might fail partially in tun zero copy mode. To recover we need to put all pages that we got, but code used a wrong index resulting in double-free errors. Reported-by: Brad Hubbard <bhubbard@redhat.com> Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-06-201-1/+3
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/wireless/ath/ath9k/Kconfig drivers/net/xen-netback/netback.c net/batman-adv/bat_iv_ogm.c net/wireless/nl80211.c The ath9k Kconfig conflict was a change of a Kconfig option name right next to the deletion of another option. The xen-netback conflict was overlapping changes involving the handling of the notify list in xen_netbk_rx_action(). Batman conflict resolution provided by Antonio Quartulli, basically keep everything in both conflict hunks. The nl80211 conflict is a little more involved. In 'net' we added a dynamic memory allocation to nl80211_dump_wiphy() to fix a race that Linus reported. Meanwhile in 'net-next' the handlers were converted to use pre and post doit handlers which use a flag to determine whether to hold the RTNL mutex around the operation. However, the dump handlers to not use this logic. Instead they have to explicitly do the locking. There were apparent bugs in the conversion of nl80211_dump_wiphy() in that we were not dropping the RTNL mutex in all the return paths, and it seems we very much should be doing so. So I fixed that whilst handling the overlapping changes. To simplify the initial returns, I take the RTNL mutex after we try to allocate 'tb'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tuntap: set SOCK_ZEROCOPY flag during openJason Wang2013-06-121-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | Commit 54f968d6efdbf7dec36faa44fc11f01b0e4d1990 (tuntap: move socket to tun_file) forgets to set SOCK_ZEROCOPY flag, which will prevent vhost_net from doing zercopy w/ tap. This patch fixes this by setting it during file open. Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tuntap: fix a possible race between queue selection and changing queuesJason Wang2013-06-101-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Complier may generate codes that re-read the tun->numqueues during tun_select_queue(). This may be a race if vlan->numqueues were changed in the same time and can lead unexpected result (e.g. very huge value). We need prevent the compiler from generating such codes by adding an ACCESS_ONCE() to make sure tun->numqueues were only read once. Bug were introduced by commit c8d68e6be1c3b242f1c598595830890b65cea64a (tuntap: multiqueue support). Reported-by: Michael S. Tsirkin <mst@redhat.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tun: Turn tun_flow_init() into void fnPavel Emelyanov2013-06-131-7/+2
| | | | | | | | | | | | | | | | This routine doesn't fail since 9fdc6bef (tuntap: dont use a private kmem_cache) so it makes sense to compact the code a little bit. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tun: Report "persist" flag to userspacePavel Emelyanov2013-06-131-0/+3
|/ | | | | | | | | | | | | The TUN_PERSIST flag is not reported at all -- both TUNGETIFF, and sysfs "flags" attribute skip one. Knowing whether a device is persistent or not is critical for checkpoint-restore, thus I propose to add the read-only IFF_PERSIST one for this. Setting this new IFF_PERSIST is hardly possible, as TUNSETIFF doesn't check for unknown flags being zero and thus there can be trash. Signed-off-by: Pavel Emelyanov <xemul@parallels.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tuntap: forbid changing mq flag for persistent deviceJason Wang2013-05-291-0/+4
| | | | | | | | | | | | | | | | | | | We currently allow changing the mq flag (IFF_MULTI_QUEUE) for a persistent device. This will result a mismatch between the number the queues in netdev and tuntap. This is because we only allocate a 1q netdevice when IFF_MULTI_QUEUE was not specified, so when we set the IFF_MULTI_QUEUE and try to attach more queues later, netif_set_real_num_tx_queues() may fail which result a single queue netdevice with multiple sockets attached. Solve this by disallowing changing the mq flag for persistent device. Bug was introduced by commit edfb6a148ce62e5e19354a1dcd9a34e00815c2a1 (tuntap: reduce memory using of queues). Reported-by: Sriram Narasimhan <sriram.narasimhan@hp.com> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-04-301-4/+11
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/broadcom/bnx2x/bnx2x_main.c drivers/net/ethernet/emulex/benet/be.h include/net/tcp.h net/mac802154/mac802154.h Most conflicts were minor overlapping stuff. The be2net driver brought in some fixes that added __vlan_put_tag calls, which in net-next take an additional argument. Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: tun: release the reference of tun device in tun_recvmsgGao feng2013-04-291-2/+5
| | | | | | | | | | | | | | | | | | | | We forget to release the reference of tun device in tun_recvmsg. bug introduced in commit 54f968d6efdbf7dec36faa44fc11f01b0e4d1990 (tuntap: move socket to tun_file) Signed-off-by: Gao feng <gaofeng@cn.fujitsu.com> Acked-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tuntap: correct the return value in tun_set_iff()Jason Wang2013-04-251-2/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | commit (3be8fbab tuntap: fix error return code in tun_set_iff()) breaks the creation of multiqueue tuntap since it forbids to create more than one queues for a multiqueue tuntap device. We need return 0 instead -EBUSY here since we don't want to re-initialize the device when one or more queues has been already attached. Add a comment and correct the return value to zero. Reported-by: Jerry Chu <hkchu@google.com> Cc: Jerry Chu <hkchu@google.com> Cc: Wei Yongjun <weiyj.lk@gmail.com> Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Jerry Chu <hkchu@google.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-04-231-1/+1
|\| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/emulex/benet/be_main.c drivers/net/ethernet/intel/igb/igb_main.c drivers/net/wireless/brcm80211/brcmsmac/mac80211_if.c include/net/scm.h net/batman-adv/routing.c net/ipv4/tcp_input.c The e{uid,gid} --> {uid,gid} credentials fix conflicted with the cleanup in net-next to now pass cred structs around. The be2net driver had a bug fix in 'net' that overlapped with the VLAN interface changes by Patrick McHardy in net-next. An IGB conflict existed because in 'net' the build_skb() support was reverted, and in 'net-next' there was a comment style fix within that code. Several batman-adv conflicts were resolved by making sure that all calls to batadv_is_my_mac() are changed to have a new bat_priv first argument. Eric Dumazet's TS ECR fix in TCP in 'net' conflicted with the F-RTO rewrite in 'net-next', mostly overlapping changes. Thanks to Stephen Rothwell and Antonio Quartulli for help with several of these merge resolutions. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tuntap: fix error return code in tun_set_iff()Wei Yongjun2013-04-121-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | Fix to return a negative error code from the error handling case instead of 0, as returned elsewhere in this function. [ Bug added in linux-3.8 , commit 4008e97f866db665 ("tuntap: fix ambigious multiqueue API") ] Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tuntap: initialize vlan_featuresJason Wang2013-04-111-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The vlan_features was zero which prevents vlan GSO packets to be transmitted to userspace. This is suboptimal so enable this by initialize vlan_features for tuntap. Netperf shows better performance of guest receiving since vlan TSO works for tuntap: before: netperf -H 192.168.5.4 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.4 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.01 2786.67 after: netperf -H 192.168.5.4 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.5.4 () port 0 AF_INET : demo Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 8085.49 Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: switch to use skb_probe_transport_header()Jason Wang2013-03-271-9/+1
| | | | | | | | | | | | | | | | | | | | | | Switch to use the new help skb_probe_transport_header() to do the l4 header probing for untrusted sources. For packets with partial csum, the header should already been set by skb_partial_csum_set(). Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tuntap: set transport header before passing it to kernelJason Wang2013-03-261-0/+10
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Currently, for the packets receives from tuntap, before doing header check, kernel just reset the transport header in netif_receive_skb() which pretends no l4 header. This is suboptimal for precise packet length estimation (introduced in 1def9238) which needs correct l4 header for gso packets. So this patch set the transport header to csum_start for partial checksum packets, otherwise it first try skb_flow_dissect(), if it fails, just reset the transport header. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | tuntap: remove unused variable in __tun_detach()Wei Yongjun2013-03-131-2/+0
|/ | | | | | | | | The variable dev is initialized but never used otherwise, so remove the unused variable. Signed-off-by: Wei Yongjun <yongjun_wei@trendmicro.com.cn> Acked-by: Neil Horman <nhorman@tuxdriver.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tun: add a missing nf_reset() in tun_net_xmit()Eric Dumazet2013-03-061-0/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Dave reported following crash : general protection fault: 0000 [#1] SMP CPU 2 Pid: 25407, comm: qemu-kvm Not tainted 3.7.9-205.fc18.x86_64 #1 Hewlett-Packard HP Z400 Workstation/0B4Ch RIP: 0010:[<ffffffffa0399bd5>] [<ffffffffa0399bd5>] destroy_conntrack+0x35/0x120 [nf_conntrack] RSP: 0018:ffff880276913d78 EFLAGS: 00010206 RAX: 50626b6b7876376c RBX: ffff88026e530d68 RCX: ffff88028d158e00 RDX: ffff88026d0d5470 RSI: 0000000000000011 RDI: 0000000000000002 RBP: ffff880276913d88 R08: 0000000000000000 R09: ffff880295002900 R10: 0000000000000000 R11: 0000000000000003 R12: ffffffff81ca3b40 R13: ffffffff8151a8e0 R14: ffff880270875000 R15: 0000000000000002 FS: 00007ff3bce38a00(0000) GS:ffff88029fc40000(0000) knlGS:0000000000000000 CS: 0010 DS: 0000 ES: 0000 CR0: 000000008005003b CR2: 00007fd1430bd000 CR3: 000000027042b000 CR4: 00000000000027e0 DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000 DR3: 0000000000000000 DR6: 00000000ffff0ff0 DR7: 0000000000000400 Process qemu-kvm (pid: 25407, threadinfo ffff880276912000, task ffff88028c369720) Stack: ffff880156f59100 ffff880156f59100 ffff880276913d98 ffffffff815534f7 ffff880276913db8 ffffffff8151a74b ffff880270875000 ffff880156f59100 ffff880276913dd8 ffffffff8151a5a6 ffff880276913dd8 ffff88026d0d5470 Call Trace: [<ffffffff815534f7>] nf_conntrack_destroy+0x17/0x20 [<ffffffff8151a74b>] skb_release_head_state+0x7b/0x100 [<ffffffff8151a5a6>] __kfree_skb+0x16/0xa0 [<ffffffff8151a666>] kfree_skb+0x36/0xa0 [<ffffffff8151a8e0>] skb_queue_purge+0x20/0x40 [<ffffffffa02205f7>] __tun_detach+0x117/0x140 [tun] [<ffffffffa022184c>] tun_chr_close+0x3c/0xd0 [tun] [<ffffffff8119669c>] __fput+0xec/0x240 [<ffffffff811967fe>] ____fput+0xe/0x10 [<ffffffff8107eb27>] task_work_run+0xa7/0xe0 [<ffffffff810149e1>] do_notify_resume+0x71/0xb0 [<ffffffff81640152>] int_signal+0x12/0x17 Code: 00 00 04 48 89 e5 41 54 53 48 89 fb 4c 8b a7 e8 00 00 00 0f 85 de 00 00 00 0f b6 73 3e 0f b7 7b 2a e8 10 40 00 00 48 85 c0 74 0e <48> 8b 40 28 48 85 c0 74 05 48 89 df ff d0 48 c7 c7 08 6a 3a a0 RIP [<ffffffffa0399bd5>] destroy_conntrack+0x35/0x120 [nf_conntrack] RSP <ffff880276913d78> This is because tun_net_xmit() needs to call nf_reset() before queuing skb into receive_queue Reported-by: Dave Jones <davej@redhat.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* hlist: drop the node parameter from iteratorsSasha Levin2013-02-281-8/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | I'm not sure why, but the hlist for each entry iterators were conceived list_for_each_entry(pos, head, member) The hlist ones were greedy and wanted an extra parameter: hlist_for_each_entry(tpos, pos, head, member) Why did they need an extra pos parameter? I'm not quite sure. Not only they don't really need it, it also prevents the iterator from looking exactly like the list iterator, which is unfortunate. Besides the semantic patch, there was some manual work required: - Fix up the actual hlist iterators in linux/list.h - Fix up the declaration of other iterators based on the hlist ones. - A very small amount of places were using the 'node' parameter, this was modified to use 'obj->member' instead. - Coccinelle didn't handle the hlist_for_each_entry_safe iterator properly, so those had to be fixed up manually. The semantic patch which is mostly the work of Peter Senna Tschudin is here: @@ iterator name hlist_for_each_entry, hlist_for_each_entry_continue, hlist_for_each_entry_from, hlist_for_each_entry_rcu, hlist_for_each_entry_rcu_bh, hlist_for_each_entry_continue_rcu_bh, for_each_busy_worker, ax25_uid_for_each, ax25_for_each, inet_bind_bucket_for_each, sctp_for_each_hentry, sk_for_each, sk_for_each_rcu, sk_for_each_from, sk_for_each_safe, sk_for_each_bound, hlist_for_each_entry_safe, hlist_for_each_entry_continue_rcu, nr_neigh_for_each, nr_neigh_for_each_safe, nr_node_for_each, nr_node_for_each_safe, for_each_gfn_indirect_valid_sp, for_each_gfn_sp, for_each_host; type T; expression a,c,d,e; identifier b; statement S; @@ -T b; <+... when != b ( hlist_for_each_entry(a, - b, c, d) S | hlist_for_each_entry_continue(a, - b, c) S | hlist_for_each_entry_from(a, - b, c) S | hlist_for_each_entry_rcu(a, - b, c, d) S | hlist_for_each_entry_rcu_bh(a, - b, c, d) S | hlist_for_each_entry_continue_rcu_bh(a, - b, c) S | for_each_busy_worker(a, c, - b, d) S | ax25_uid_for_each(a, - b, c) S | ax25_for_each(a, - b, c) S | inet_bind_bucket_for_each(a, - b, c) S | sctp_for_each_hentry(a, - b, c) S | sk_for_each(a, - b, c) S | sk_for_each_rcu(a, - b, c) S | sk_for_each_from -(a, b) +(a) S + sk_for_each_from(a) S | sk_for_each_safe(a, - b, c, d) S | sk_for_each_bound(a, - b, c) S | hlist_for_each_entry_safe(a, - b, c, d, e) S | hlist_for_each_entry_continue_rcu(a, - b, c) S | nr_neigh_for_each(a, - b, c) S | nr_neigh_for_each_safe(a, - b, c, d) S | nr_node_for_each(a, - b, c) S | nr_node_for_each_safe(a, - b, c, d) S | - for_each_gfn_sp(a, c, d, b) S + for_each_gfn_sp(a, c, d) S | - for_each_gfn_indirect_valid_sp(a, c, d, b) S + for_each_gfn_indirect_valid_sp(a, c, d) S | for_each_host(a, - b, c) S | for_each_host_safe(a, - b, c, d) S | for_each_mesh_entry(a, - b, c, d) S ) ...+> [akpm@linux-foundation.org: drop bogus change from net/ipv4/raw.c] [akpm@linux-foundation.org: drop bogus hunk from net/ipv6/raw.c] [akpm@linux-foundation.org: checkpatch fixes] [akpm@linux-foundation.org: fix warnings] [akpm@linux-foudnation.org: redo intrusive kvm changes] Tested-by: Peter Senna Tschudin <peter.senna@gmail.com> Acked-by: Paul E. McKenney <paulmck@linux.vnet.ibm.com> Signed-off-by: Sasha Levin <sasha.levin@oracle.com> Cc: Wu Fengguang <fengguang.wu@intel.com> Cc: Marcelo Tosatti <mtosatti@redhat.com> Cc: Gleb Natapov <gleb@redhat.com> Signed-off-by: Andrew Morton <akpm@linux-foundation.org> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* net: Fix possible wrong checksum generation.Pravin B Shelar2013-02-131-8/+5
| | | | | | | | | | | | | | | | | | | | | Patch cef401de7be8c4e (net: fix possible wrong checksum generation) fixed wrong checksum calculation but it broke TSO by defining new GSO type but not a netdev feature for that type. net_gso_ok() would not allow hardware checksum/segmentation offload of such packets without the feature. Following patch fixes TSO and wrong checksum. This patch uses same logic that Eric Dumazet used. Patch introduces new flag SKBTX_SHARED_FRAG if at least one frag can be modified by the user. but SKBTX_SHARED_FRAG flag is kept in skb shared info tx_flags rather than gso_type. tx_flags is better compared to gso_type since we can have skb with shared frag without gso packet. It does not link SHARED_FRAG to GSO, So there is no need to define netdev feature for this. Signed-off-by: Pravin B Shelar <pshelar@nicira.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-02-051-14/+24
|\ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: drivers/net/ethernet/intel/e1000e/ethtool.c drivers/net/vmxnet3/vmxnet3_drv.c drivers/net/wireless/iwlwifi/dvm/tx.c net/ipv6/route.c The ipv6 route.c conflict is simple, just ignore the 'net' side change as we fixed the same problem in 'net-next' by eliminating cached neighbours from ipv6 routes. The e1000e conflict is an addition of a new statistic in the ethtool code, trivial. The vmxnet3 conflict is about one change in 'net' removing a guarding conditional, whilst in 'net-next' we had a netdev_info() conversion. The iwlwifi conflict is dealing with a WARN_ON() conversion in 'net-next' vs. a revert happening in 'net'. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tuntap: allow polling/writing/reading when detachedJason Wang2013-01-291-9/+16
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We forbid polling, writing and reading when the file were detached, this may complex the user in several cases: - when guest pass some buffers to vhost/qemu and then disable some queues, host/qemu needs to do its own cleanup on those buffers which is complex sometimes. We can do this simply by allowing a user can still write to an disabled queue. Write to an disabled queue will cause the packet pass to the kernel and read will get nothing. - align the polling behavior with macvtap which never fails when the queue is created. This can simplify the polling errors handling of its user (e.g vhost) We can simply achieve this by don't assign NULL to tfile->tun when detached. Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tun: fix carrier on/off statusMichael S. Tsirkin2013-01-291-5/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | Commit c8d68e6be1c3b242f1c598595830890b65cea64a removed carrier off call from tun_detach since it's now called on queue disable and not only on tun close. This confuses userspace which used this flag to detect a free tun. To fix, put this back but under if (clean). Signed-off-by: Michael S. Tsirkin <mst@redhat.com> Tested-by: Jason Wang <jasowang@redhat.com> Acked-by: Jason Wang <jasowang@redhat.com> Tested-by: Toralf Förster <toralf.foerster@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/netDavid S. Miller2013-01-291-8/+14
|\| | | | | | | | | | | | | Bring in the 'net' tree so that we can get some ipv4/ipv6 bug fixes that some net-next work will build upon. Signed-off-by: David S. Miller <davem@davemloft.net>
| * tuntap: limit the number of flow cachesJason Wang2013-01-231-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We create new flow caches when a new flow is identified by tuntap, This may lead some issues: - userspace may produce a huge amount of short live flows to exhaust host memory - the unlimited number of flow caches may produce a long list which increase the time in the linear searching Solve this by introducing a limit of total number of flow caches. Cc: Stephen Hemminger <stephen@networkplumber.org> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * tuntap: reduce memory using of queuesJason Wang2013-01-231-7/+8
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | A MAX_TAP_QUEUES(1024) queues of tuntap device is always allocated unconditionally even userspace only requires a single queue device. This is unnecessary and will lead a very high order of page allocation when has a high possibility to fail. Solving this by creating a one queue net device when userspace only use one queue and also reduce MAX_TAP_QUEUES to DEFAULT_MAX_NUM_RSS_QUEUES which can guarantee the success of the allocation. Reported-by: Dirk Hohndel <dirk@hohndel.org> Cc: Eric Dumazet <eric.dumazet@gmail.com> Cc: David Woodhouse <dwmw2@infradead.org> Cc: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Acked-by: Michael S. Tsirkin <mst@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | net: fix possible wrong checksum generationEric Dumazet2013-01-281-4/+8
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Pravin Shelar mentioned that GSO could potentially generate wrong TX checksum if skb has fragments that are overwritten by the user between the checksum computation and transmit. He suggested to linearize skbs but this extra copy can be avoided for normal tcp skbs cooked by tcp_sendmsg(). This patch introduces a new SKB_GSO_SHARED_FRAG flag, set in skb_shinfo(skb)->gso_type if at least one frag can be modified by the user. Typical sources of such possible overwrites are {vm}splice(), sendfile(), and macvtap/tun/virtio_net drivers. Tested: $ netperf -H 7.7.8.84 MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3959.52 $ netperf -H 7.7.8.84 -t TCP_SENDFILE TCP SENDFILE TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 7.7.8.84 () port 0 AF_INET Recv Send Send Socket Socket Message Elapsed Size Size Size Time Throughput bytes bytes bytes secs. 10^6bits/sec 87380 16384 16384 10.00 3216.80 Performance of the SENDFILE is impacted by the extra allocation and copy, and because we use order-0 pages, while the TCP_STREAM uses bigger pages. Reported-by: Pravin Shelar <pshelar@nicira.com> Signed-off-by: Eric Dumazet <edumazet@google.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tun: fix LSM/SELinux labeling of tun/tap devicesPaul Moore2013-01-151-5/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch corrects some problems with LSM/SELinux that were introduced with the multiqueue patchset. The problem stems from the fact that the multiqueue work changed the relationship between the tun device and its associated socket; before the socket persisted for the life of the device, however after the multiqueue changes the socket only persisted for the life of the userspace connection (fd open). For non-persistent devices this is not an issue, but for persistent devices this can cause the tun device to lose its SELinux label. We correct this problem by adding an opaque LSM security blob to the tun device struct which allows us to have the LSM security state, e.g. SELinux labeling information, persist for the lifetime of the tun device. In the process we tweak the LSM hooks to work with this new approach to TUN device/socket labeling and introduce a new LSM hook, security_tun_dev_attach_queue(), to approve requests to attach to a TUN queue via TUNSETQUEUE. The SELinux code has been adjusted to match the new LSM hooks, the other LSMs do not make use of the LSM TUN controls. This patch makes use of the recently added "tun_socket:attach_queue" permission to restrict access to the TUNSETQUEUE operation. On older SELinux policies which do not define the "tun_socket:attach_queue" permission the access control decision for TUNSETQUEUE will be handled according to the SELinux policy's unknown permission setting. Signed-off-by: Paul Moore <pmoore@redhat.com> Acked-by: Eric Paris <eparis@parisplace.org> Tested-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tuntap: fix leaking reference countJason Wang2013-01-121-3/+9
| | | | | | | | | | | | | | Reference count leaking of both module and sock were found: - When a detached file were closed, its sock refcnt from device were not released, solving this by add the sock_put(). - The module were hold or drop unconditionally in TUNSETPERSIST, which means we if we set the persist flag for N times, we need unset it for another N times. Solving this by only hold or drop an reference when there's a flag change and also drop the reference count when the persist device is deleted. Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tuntap: forbid calling TUNSETIFF when detachedJason Wang2013-01-121-2/+3
| | | | | | | | | | | | | Michael points out that even after Stefan's fix the TUNSETIFF is still allowed to create a new tap device. This because we only check tfile->tun but the tfile->detached were introduced. Fix this by failing early in tun_set_iff() if the file is detached. After this fix, there's no need to do the check again in tun_set_iff(), so this patch removes it. Cc: Michael S. Tsirkin <mst@redhat.com> Cc: Stefan Hajnoczi <stefanha@redhat.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tuntap: switch to use rtnl_dereference()Jason Wang2013-01-121-17/+10
| | | | | | | | Switch to use rtnl_dereference() instead of the open code, suggested by Eric. Cc: Eric Dumazet <edumazet@google.com> Signed-off-by: Jason Wang <jasowang@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>