summaryrefslogtreecommitdiffstats
path: root/net (follow)
Commit message (Collapse)AuthorAgeFilesLines
* NFSD: Prevent a buffer overflow in svc_xprt_names()Chuck Lever2009-04-281-17/+39
| | | | | | | | | | | | | | | | | | | | | | | | | The svc_xprt_names() function can overflow its buffer if it's so near the end of the passed in buffer that the "name too long" string still doesn't fit. Of course, it could never tell if it was near the end of the passed in buffer, since its only caller passes in zero as the buffer length. Let's make this API a little safer. Change svc_xprt_names() so it *always* checks for a buffer overflow, and change its only caller to pass in the correct buffer length. If svc_xprt_names() does overflow its buffer, it now fails with an ENAMETOOLONG errno, instead of trying to write a message at the end of the buffer. I don't like this much, but I can't figure out a clean way that's always safe to return some of the names, *and* an indication that the buffer was not long enough. The displayed error when doing a 'cat /proc/fs/nfsd/portlist' is "File name too long". Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* SUNRPC: Fix error return value of svc_addr_len()Chuck Lever2009-04-281-3/+4
| | | | | | | | | | | | | | | | | | | | | | The svc_addr_len() helper function returns -EAFNOSUPPORT if it doesn't recognize the address family of the passed-in socket address. However, the return type of this function is size_t, which means -EAFNOSUPPORT is turned into a very large positive value in this case. The check in svc_udp_recvfrom() to see if the return value is less than zero therefore won't work at all. Additionally, handle_connect_req() passes this value directly to memset(). This could cause memset() to clobber a large chunk of memory if svc_addr_len() has returned an error. Currently the address family of these addresses, however, is known to be supported long before handle_connect_req() is called, so this isn't a real risk. Change the error return value of svc_addr_len() to zero, which fits in the range of size_t, and is safer to pass to memset() directly. Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* net/sunrpc/svc_xprt.c: fix sparse warningsH Hartley Sweeten2009-04-281-0/+1
| | | | | | | | | | | | Fix the following sparse warnings in net/sunrpc/svc_xprt.c. warning: symbol 'svc_recv' was not declared. Should it be static? warning: symbol 'svc_drop' was not declared. Should it be static? warning: symbol 'svc_send' was not declared. Should it be static? warning: symbol 'svc_close_all' was not declared. Should it be static? Signed-off-by: H Hartley Sweeten <hsweeten@visionengravers.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* FRV: Fix the section attribute on UP DECLARE_PER_CPU()David Howells2009-04-221-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | In non-SMP mode, the variable section attribute specified by DECLARE_PER_CPU() does not agree with that specified by DEFINE_PER_CPU(). This means that architectures that have a small data section references relative to a base register may throw up linkage errors due to too great a displacement between where the base register points and the per-CPU variable. On FRV, the .h declaration says that the variable is in the .sdata section, but the .c definition says it's actually in the .data section. The linker throws up the following errors: kernel/built-in.o: In function `release_task': kernel/exit.c:78: relocation truncated to fit: R_FRV_GPREL12 against symbol `per_cpu__process_counts' defined in .data section in kernel/built-in.o kernel/exit.c:78: relocation truncated to fit: R_FRV_GPREL12 against symbol `per_cpu__process_counts' defined in .data section in kernel/built-in.o To fix this, DECLARE_PER_CPU() should simply apply the same section attribute as does DEFINE_PER_CPU(). However, this is made slightly more complex by virtue of the fact that there are several variants on DEFINE, so these need to be matched by variants on DECLARE. Signed-off-by: David Howells <dhowells@redhat.com> Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
* packet: avoid warnings when high-order page allocation failsEric Dumazet2009-04-151-2/+3
| | | | | | | | | | | | | | | | | | | | | | | Latest tcpdump/libpcap triggers annoying messages because of high order page allocation failures (when lowmem exhausted or fragmented) These allocation errors are correctly handled so could be silent. [22660.208901] tcpdump: page allocation failure. order:5, mode:0xc0d0 [22660.208921] Pid: 13866, comm: tcpdump Not tainted 2.6.30-rc2 #170 [22660.208936] Call Trace: [22660.208950] [<c04e2b46>] ? printk+0x18/0x1a [22660.208965] [<c02760f7>] __alloc_pages_internal+0x357/0x460 [22660.208980] [<c0276251>] __get_free_pages+0x21/0x40 [22660.208995] [<c04cc835>] packet_set_ring+0x105/0x3d0 [22660.209009] [<c04ccd1d>] packet_setsockopt+0x21d/0x4d0 [22660.209025] [<c0270400>] ? filemap_fault+0x0/0x450 [22660.209040] [<c0449e34>] sys_setsockopt+0x54/0xa0 [22660.209053] [<c044b97f>] sys_socketcall+0xef/0x270 [22660.209067] [<c0202e34>] sysenter_do_call+0x12/0x26 Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Revert "rose: zero length frame filtering in af_rose.c"David S. Miller2009-04-151-10/+0
| | | | | | | | | | | | This reverts commit 244f46ae6e9e18f6fc0be7d1f49febde4762c34b. Alan Cox did the research, and just like the other radio protocols zero-length frames have meaning because at the top level ROSE is X.25 PLP. So this zero-length filtering is invalid. Signed-off-by: David S. Miller <davem@davemloft.net>
* gro: Restore correct value to gso_sizeHerbert Xu2009-04-151-2/+3
| | | | | | | | | | | | | Since everybody has been focusing on baremetal GRO performance no one noticed when I added a bug that zapped gso_size for all GRO packets. This only gets picked up when you forward the skb out of an interface. Thanks to Mark Wagner for noticing this bug when testing kvm. Reported-by: Mark Wagner <mwagner@redhat.com> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6:remove useless checkYang Hongyang2009-04-141-4/+0
| | | | | | | | | | | After switch (rthdr->type) {...},the check below is completely useless.Because: if the type is 2,then hdrlen must be 2 and segments_left must be 1,clearly the check is redundant;if the type is not 2,then goto sticky_done,the check is useless too. Signed-off-by: Yang Hongyang <yanghy@cn.fujitsu.com> Reviewed-by: Shan Wei <shanwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: fix >2 iw selectionIlpo Järvinen2009-04-141-0/+3
| | | | | | | | A long-standing feature in tcp_init_metrics() is such that any of its goto reset prevents call to tcp_init_cwnd(). Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
* netsched: Allow meta match on vlan tag on receiveStephen Hemminger2009-04-141-2/+4
| | | | | | | | | When vlan acceleration is used on receive, the vlan tag is maintained outside of the skb data. The existing vlan tag match only works on TX path because it uses vlan_get_tag which tests for VLAN_HW_TX_ACCEL. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* gro: Normalise skb before bypassing GRO on netpoll VLAN pathHerbert Xu2009-04-141-1/+3
| | | | | | | | | | | | | | | | | | Hi: gro: Normalise skb before bypassing GRO on netpoll VLAN path When we detect netpoll RX on the GRO VLAN path we bail out and call the normal VLAN receive handler. However, the packet needs to be normalised by calling eth_type_trans since that's what the normal path expects (normally the GRO path does the fixup). This patch adds the necessary call to vlan_gro_frags. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Thanks, Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Fix NULL pointer dereference with time-wait socketsVlad Yasevich2009-04-112-3/+6
| | | | | | | | | | | | | | | | | | | | | | | | | Commit b2f5e7cd3dee2ed721bf0675e1a1ddebb849aee6 (ipv6: Fix conflict resolutions during ipv6 binding) introduced a regression where time-wait sockets were not treated correctly. This resulted in the following: BUG: unable to handle kernel NULL pointer dereference at 0000000000000062 IP: [<ffffffff805d7d61>] ipv4_rcv_saddr_equal+0x61/0x70 ... Call Trace: [<ffffffffa033847b>] ipv6_rcv_saddr_equal+0x1bb/0x250 [ipv6] [<ffffffffa03505a8>] inet6_csk_bind_conflict+0x88/0xd0 [ipv6] [<ffffffff805bb18e>] inet_csk_get_port+0x1ee/0x400 [<ffffffffa0319b7f>] inet6_bind+0x1cf/0x3a0 [ipv6] [<ffffffff8056d17c>] ? sockfd_lookup_light+0x3c/0xd0 [<ffffffff8056ed49>] sys_bind+0x89/0x100 [<ffffffff80613ea2>] ? trace_hardirqs_on_thunk+0x3a/0x3c [<ffffffff8020bf9b>] system_call_fastpath+0x16/0x1b Tested-by: Brian Haley <brian.haley@hp.com> Tested-by: Ed Tomlinson <edt@aei.ca> Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tr: fix leakage of device in net/802/tr.cWei Yongjun2009-04-111-0/+3
| | | | | | | | Add dev_put() after dev_get_by_index() to avoid leakage of device. Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: netif_device_attach/detach should start/stop all queuesAlexander Duyck2009-04-111-2/+2
| | | | | | | | | Currently netif_device_attach/detach are only stopping one queue. They should be starting and stopping all the queues on a given device. Signed-off-by: Alexander Duyck <alexander.h.duyck@intel.com> Signed-off-by: Jeff Kirsher <jeffrey.t.kirsher@intel.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2009-04-083-25/+9
|\ | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/kaber/nf-2.6
| * netfilter: ctnetlink: fix regression in expectation handlingPablo Neira Ayuso2009-04-061-24/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This patch fixes a regression (introduced by myself in commit 19abb7b: netfilter: ctnetlink: deliver events for conntracks changed from userspace) that results in an expectation re-insertion since __nf_ct_expect_check() may return 0 for expectation timer refreshing. This patch also removes a unnecessary refcount bump that pretended to avoid a possible race condition with event delivery and expectation timers (as said, not needed since we hold a reference to the object since until we finish the expectation setup). This also merges nf_ct_expect_related_report() and nf_ct_expect_related() which look basically the same. Reported-by: Patrick McHardy <kaber@trash.net> Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * netfilter: fix selection of "LED" target in netfilterAlex Riesen2009-04-061-1/+1
| | | | | | | | | | | | | | It's plural, not LED_TRIGGERS. Signed-off-by: Alex Riesen <fork0@users.sourceforge.net> Signed-off-by: Patrick McHardy <kaber@trash.net>
| * netfilter: ip6tables regression fixEric Dumazet2009-04-061-0/+2
| | | | | | | | | | | | | | | | | | Commit 7845447 (netfilter: iptables: lock free counters) broke ip6_tables by unconditionally returning ENOMEM in alloc_counters(), Reported-by: Graham Murray <graham@gmurray.org.uk> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: Patrick McHardy <kaber@trash.net>
* | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds2009-04-073-4/+5
|\ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: b44: Use kernel DMA addresses for the kernel DMA API forcedeth: Fix resume from hibernation regression. xfrm: fix fragmentation on inter family tunnels ibm_newemac: Fix dangerous struct assumption gigaset: documentation update gigaset: in file ops, check for device disconnect before anything else bas_gigaset: use tasklet_hi_schedule for timing critical tasklets net/802/fddi.c: add MODULE_LICENSE smsc911x: remove unused #include <linux/version.h> axnet_cs: fix phy_id detection for bogus Asix chip. bnx2: Use request_firmware() b44: Fix sizes passed to b44_sync_dma_desc_for_{device,cpu}() socket: use percpu_add() while updating sockets_in_use virtio_net: Set the mac config only when VIRITO_NET_F_MAC myri_sbus: use request_firmware e1000: fix loss of multicast packets vxge: should include tcp.h Conflict in firmware/WHENCE (SCSI vs net firmware)
| * | xfrm: fix fragmentation on inter family tunnelsSteffen Klassert2009-04-071-0/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | If an ipv4 packet (not locally generated with IP_DF flag not set) bigger than mtu size is supposed to go via a xfrm ipv6 tunnel, the packetsize check in xfrm4_tunnel_check_size() is omited and ipv6 drops the packet without sending a notice to the original sender of the ipv4 packet. Another issue is that ipv4 connection tracking does reassembling of incomming fragmented packets. If such a reassembled packet is supposed to go via a xfrm ipv6 tunnel it will be droped, even if the original sender did proper fragmentation. According to RFC 2473 (section 7) tunnel ipv6 packets resulting from the encapsulation of an original packet are considered as locally generated packets. If such a packet passed the checks in xfrm{4,6}_tunnel_check_size() fragmentation is allowed according to RFC 2473 (section 7.1/7.2). This patch sets skb->local_df in xfrm6_prepare_output() to achieve fragmentation in this case. Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | net/802/fddi.c: add MODULE_LICENSEAdrian Bunk2009-04-071-0/+2
| | | | | | | | | | | | | | | | | | | | | This patch adds the missing MODULE_LICENSE("GPL"). Signed-off-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | socket: use percpu_add() while updating sockets_in_useEric Dumazet2009-04-051-4/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | sock_alloc() currently uses following code to update sockets_in_use get_cpu_var(sockets_in_use)++; put_cpu_var(sockets_in_use); This translates to : c0436274: b8 01 00 00 00 mov $0x1,%eax c0436279: e8 42 40 df ff call c022a2c0 <add_preempt_count> c043627e: bb 20 4f 6a c0 mov $0xc06a4f20,%ebx c0436283: e8 18 ca f0 ff call c0342ca0 <debug_smp_processor_id> c0436288: 03 1c 85 60 4a 65 c0 add -0x3f9ab5a0(,%eax,4),%ebx c043628f: ff 03 incl (%ebx) c0436291: b8 01 00 00 00 mov $0x1,%eax c0436296: e8 75 3f df ff call c022a210 <sub_preempt_count> c043629b: 89 e0 mov %esp,%eax c043629d: 25 00 e0 ff ff and $0xffffe000,%eax c04362a2: f6 40 08 08 testb $0x8,0x8(%eax) c04362a6: 75 07 jne c04362af <sock_alloc+0x7f> c04362a8: 8d 46 d8 lea -0x28(%esi),%eax c04362ab: 5b pop %ebx c04362ac: 5e pop %esi c04362ad: c9 leave c04362ae: c3 ret c04362af: e8 cc 5d 09 00 call c04cc080 <preempt_schedule> c04362b4: 8d 74 26 00 lea 0x0(%esi,%eiz,1),%esi c04362b8: eb ee jmp c04362a8 <sock_alloc+0x78> While percpu_add(sockets_in_use, 1) translates to a single instruction : c0436275: 64 83 05 20 5f 6a c0 addl $0x1,%fs:0xc06a5f20 Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | Merge branch 'for-2.6.30' of git://linux-nfs.org/~bfields/linuxLinus Torvalds2009-04-063-38/+127
|\ \ \ | |_|/ |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * 'for-2.6.30' of git://linux-nfs.org/~bfields/linux: (81 commits) nfsd41: define nfsd4_set_statp as noop for !CONFIG_NFSD_V4 nfsd41: define NFSD_DRC_SIZE_SHIFT in set_max_drc nfsd41: Documentation/filesystems/nfs41-server.txt nfsd41: CREATE_EXCLUSIVE4_1 nfsd41: SUPPATTR_EXCLCREAT attribute nfsd41: support for 3-word long attribute bitmask nfsd: dynamically skip encoded fattr bitmap in _nfsd4_verify nfsd41: pass writable attrs mask to nfsd4_decode_fattr nfsd41: provide support for minor version 1 at rpc level nfsd41: control nfsv4.1 svc via /proc/fs/nfsd/versions nfsd41: add OPEN4_SHARE_ACCESS_WANT nfs4_stateid bmap nfsd41: access_valid nfsd41: clientid handling nfsd41: check encode size for sessions maxresponse cached nfsd41: stateid handling nfsd: pass nfsd4_compound_state* to nfs4_preprocess_{state,seq}id_op nfsd41: destroy_session operation nfsd41: non-page DRC for solo sequence responses nfsd41: Add a create session replay cache nfsd41: create_session operation ...
| * | nfsd: don't use the deferral service, return NFS4ERR_DELAYAndy Adamson2009-04-042-1/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | On an NFSv4.1 server cache miss that causes an upcall, NFS4ERR_DELAY will be returned. It is up to the NFSv4.1 client to resend only the operations that have not been processed. Initialize rq_usedeferral to 1 in svc_process(). It sill be turned off in nfsd4_proc_compound() only when NFSv4.1 Sessions are used. Note: this isn't an adequate solution on its own. It's acceptable as a way to get some minimal 4.1 up and working, but we're going to have to find a way to avoid returning DELAY in all common cases before 4.1 can really be considered ready. Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> [nfsd41: reverse rq_nodeferral negative logic] Signed-off-by: Benny Halevy <bhalevy@panasas.com> [sunrpc: initialize rq_usedeferral] Signed-off-by: Andy Adamson <andros@netapp.com> Signed-off-by: Benny Halevy <bhalevy@panasas.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | sunrpc/svc.c: Remove unused line 'rqstp->rq_server = serv;' in svc_processideawu2009-03-281-1/+0
| | | | | | | | | | | | | | | | | | | | | | | | There is no need to set rqstp->rq_server to serv, while serv is initialized as rqstp->rq_server at previous line. And between these two lines, there is no change to rqstp->rq_server. Signed-off-by: ideawu <ideawu@163.com> Reviewed-by: Tom Tucker <tom@opengridcomputing.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | svcrpc: take advantage of tcp autotuningOlga Kornievskaia2009-03-181-28/+7
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Allow the NFSv4 server to make use of TCP autotuning behaviour, which was previously disabled by setting the sk_userlocks variable. Set the receive buffers to be big enough to receive the whole RPC request, and set this for the listening socket, not the accept socket. Remove the code that readjusts the receive/send buffer sizes for the accepted socket. Previously this code was used to influence the TCP window management behaviour, which is no longer needed when autotuning is enabled. This can improve IO bandwidth on networks with high bandwidth-delay products, where a large tcp window is required. It also simplifies performance tuning, since getting adequate tcp buffers previously required increasing the number of nfsd threads. Signed-off-by: Olga Kornievskaia <aglo@citi.umich.edu> Cc: Jim Rees <rees@umich.edu> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | knfsd: add file to export stats about nfsd poolsGreg Banks2009-03-181-1/+99
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Add /proc/fs/nfsd/pool_stats to export to userspace various statistics about the operation of rpc server thread pools. This patch is based on a forward-ported version of knfsd-add-pool-thread-stats which has been shipping in the SGI "Enhanced NFS" product since 2006 and which was previously posted: http://article.gmane.org/gmane.linux.nfs/10375 It has also been updated thus: * moved EXPORT_SYMBOL() to near the function it exports * made the new struct struct seq_operations const * used SEQ_START_TOKEN instead of ((void *)1) * merged fix from SGI PV 990526 "sunrpc: use dprintk instead of printk in svc_pool_stats_*()" by Harshula Jayasuriya. * merged fix from SGI PV 964001 "Crash reading pool_stats before nfsds are started". Signed-off-by: Greg Banks <gnb@sgi.com> Signed-off-by: Harshula Jayasuriya <harshula@sgi.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
| * | knfsd: avoid overloading the CPU scheduler with enormous load averagesGreg Banks2009-03-181-7/+18
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Avoid overloading the CPU scheduler with enormous load averages when handling high call-rate NFS loads. When the knfsd bottom half is made aware of an incoming call by the socket layer, it tries to choose an nfsd thread and wake it up. As long as there are idle threads, one will be woken up. If there are lot of nfsd threads (a sensible configuration when the server is disk-bound or is running an HSM), there will be many more nfsd threads than CPUs to run them. Under a high call-rate low service-time workload, the result is that almost every nfsd is runnable, but only a handful are actually able to run. This situation causes two significant problems: 1. The CPU scheduler takes over 10% of each CPU, which is robbing the nfsd threads of valuable CPU time. 2. At a high enough load, the nfsd threads starve userspace threads of CPU time, to the point where daemons like portmap and rpc.mountd do not schedule for tens of seconds at a time. Clients attempting to mount an NFS filesystem timeout at the very first step (opening a TCP connection to portmap) because portmap cannot wake up from select() and call accept() in time. Disclaimer: these effects were observed on a SLES9 kernel, modern kernels' schedulers may behave more gracefully. The solution is simple: keep in each svc_pool a counter of the number of threads which have been woken but have not yet run, and do not wake any more if that count reaches an arbitrary small threshold. Testing was on a 4 CPU 4 NIC Altix using 4 IRIX clients, each with 16 synthetic client threads simulating an rsync (i.e. recursive directory listing) workload reading from an i386 RH9 install image (161480 regular files in 10841 directories) on the server. That tree is small enough to fill in the server's RAM so no disk traffic was involved. This setup gives a sustained call rate in excess of 60000 calls/sec before being CPU-bound on the server. The server was running 128 nfsds. Profiling showed schedule() taking 6.7% of every CPU, and __wake_up() taking 5.2%. This patch drops those contributions to 3.0% and 2.2%. Load average was over 120 before the patch, and 20.9 after. This patch is a forward-ported version of knfsd-avoid-nfsd-overload which has been shipping in the SGI "Enhanced NFS" product since 2006. It has been posted before: http://article.gmane.org/gmane.linux.nfs/10374 Signed-off-by: Greg Banks <gnb@sgi.com> Signed-off-by: J. Bruce Fields <bfields@citi.umich.edu>
* | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumaskLinus Torvalds2009-04-051-2/+1
|\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/rusty/linux-2.6-cpumask: (36 commits) cpumask: remove cpumask allocation from idle_balance, fix numa, cpumask: move numa_node_id default implementation to topology.h, fix cpumask: remove cpumask allocation from idle_balance x86: cpumask: x86 mmio-mod.c use cpumask_var_t for downed_cpus x86: cpumask: update 32-bit APM not to mug current->cpus_allowed x86: microcode: cleanup x86: cpumask: use work_on_cpu in arch/x86/kernel/microcode_core.c cpumask: fix CONFIG_CPUMASK_OFFSTACK=y cpu hotunplug crash numa, cpumask: move numa_node_id default implementation to topology.h cpumask: convert node_to_cpumask_map[] to cpumask_var_t cpumask: remove x86 cpumask_t uses. cpumask: use cpumask_var_t in uv_flush_tlb_others. cpumask: remove cpumask_t assignment from vector_allocation_domain() cpumask: make Xen use the new operators. cpumask: clean up summit's send_IPI functions cpumask: use new cpumask functions throughout x86 x86: unify cpu_callin_mask/cpu_callout_mask/cpu_initialized_mask/cpu_sibling_setup_mask cpumask: convert struct cpuinfo_x86's llc_shared_map to cpumask_var_t cpumask: convert node_to_cpumask_map[] to cpumask_var_t x86: unify 32 and 64-bit node_to_cpumask_map ...
| * \ \ Merge branch 'cpumask-for-linus' of ↵Rusty Russell2009-03-311-2/+1
| |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/tip/linux-2.6-tip Conflicts: arch/x86/include/asm/topology.h drivers/oprofile/buffer_sync.c (Both cases: changed in Linus' tree, removed in Ingo's).
| | * \ \ Merge branch 'linus' into cpumask-for-linusIngo Molnar2009-03-30350-8195/+31016
| | |\ \ \ | | | | | | | | | | | | | | | | | | | | | | | | Conflicts: arch/x86/kernel/cpu/common.c
| | * | | | cpumask: replace node_to_cpumask with cpumask_of_node.Rusty Russell2009-03-131-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Impact: cleanup node_to_cpumask (and the blecherous node_to_cpumask_ptr which contained a declaration) are replaced now everyone implements cpumask_of_node. Signed-off-by: Rusty Russell <rusty@rustcorp.com.au>
| | | | | |
| | | \ \ \
| | | \ \ \
| | | \ \ \
| | *---. \ \ \ Merge branches 'x86/cleanups', 'x86/kexec', 'x86/mce2' and 'linus' into x86/coreIngo Molnar2009-03-1117-129/+140
| | |\ \ \ \ \ \
* | | \ \ \ \ \ \ Merge branch 'for-linus' of ↵Linus Torvalds2009-04-047-15/+15
|\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial: (28 commits) trivial: Update my email address trivial: NULL noise: drivers/mtd/tests/mtd_*test.c trivial: NULL noise: drivers/media/dvb/frontends/drx397xD_fw.h trivial: Fix misspelling of "Celsius". trivial: remove unused variable 'path' in alloc_file() trivial: fix a pdlfush -> pdflush typo in comment trivial: jbd header comment typo fix for JBD_PARANOID_IOFAIL trivial: wusb: Storage class should be before const qualifier trivial: drivers/char/bsr.c: Storage class should be before const qualifier trivial: h8300: Storage class should be before const qualifier trivial: fix where cgroup documentation is not correctly referred to trivial: Give the right path in Documentation example trivial: MTD: remove EOL from MODULE_DESCRIPTION trivial: Fix typo in bio_split()'s documentation trivial: PWM: fix of #endif comment trivial: fix typos/grammar errors in Kconfig texts trivial: Fix misspelling of firmware trivial: cgroups: documentation typo and spelling corrections trivial: Update contact info for Jochen Hein trivial: fix typo "resgister" -> "register" ...
| * | | | | | | | | trivial: fix typos/grammar errors in Kconfig textsMatt LaPlante2009-03-307-15/+15
| | |_|_|_|_|/ / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Matt LaPlante <kernel1@cyberdogtech.com> Acked-by: Randy Dunlap <randy.dunlap@oracle.com> Signed-off-by: Jiri Kosina <jkosina@suse.cz>
* | | | | | | | | Merge branch 'for-linus' of ↵Linus Torvalds2009-04-031-1/+1
|\ \ \ \ \ \ \ \ \ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6 * 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs-2.6: Remove two unneeded exports and make two symbols static in fs/mpage.c Cleanup after commit 585d3bc06f4ca57f975a5a1f698f65a45ea66225 Trim includes of fdtable.h Don't crap into descriptor table in binfmt_som Trim includes in binfmt_elf Don't mess with descriptor table in load_elf_binary() Get rid of indirect include of fs_struct.h New helper - current_umask() check_unsafe_exec() doesn't care about signal handlers sharing New locking/refcounting for fs_struct Take fs_struct handling to new file (fs/fs_struct.c) Get rid of bumping fs_struct refcount in pivot_root(2) Kill unsharing fs_struct in __set_personality()
| * | | | | | | | | New helper - current_umask()Al Viro2009-04-011-1/+1
| | |/ / / / / / / | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | current->fs->umask is what most of fs_struct users are doing. Put that into a helper function. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk>
* | | | | | | | | Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6Linus Torvalds2009-04-0319-158/+264
|\ \ \ \ \ \ \ \ \ | | |_|_|_|_|_|_|/ | |/| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | * git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-2.6: (54 commits) glge: remove unused #include <version.h> dnet: remove unused #include <version.h> tcp: miscounts due to tcp_fragment pcount reset tcp: add helper for counter tweaking due mid-wq change hso: fix for the 'invalid frame length' messages hso: fix for crash when unplugging the device fsl_pq_mdio: Fix compile failure fsl_pq_mdio: Revive UCC MDIO support ucc_geth: Pass proper device to DMA routines, otherwise oops happens i.MX31: Fixing cs89x0 network building to i.MX31ADS tc35815: Fix build error if NAPI enabled hso: add Vendor/Product ID's for new devices ucc_geth: Remove unused header gianfar: Remove unused header kaweth: Fix locking to be SMP-safe net: allow multiple dev per napi with GRO r8169: reset IntrStatus after chip reset ixgbe: Fix potential memory leak/driver panic issue while setting up Tx & Rx ring parameters ixgbe: fix ethtool -A|a behavior ixgbe: Patch to fix driver panic while freeing up tx & rx resources ...
| * | | | | | | | tcp: miscounts due to tcp_fragment pcount resetIlpo Järvinen2009-04-031-1/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | It seems that trivial reset of pcount to one was not sufficient in tcp_retransmit_skb. Multiple counters experience a positive miscount when skb's pcount gets lowered without the necessary adjustments (depending on skb's sacked bits which exactly), at worst a packets_out miscount can crash at RTO if the write queue is empty! Triggering this requires mss change, so bidir tcp or mtu probe or like. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Reported-by: Markus Trippelsdorf <markus@trippelsdorf.de> Tested-by: Uwe Bugla <uwe.bugla@gmx.de> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | tcp: add helper for counter tweaking due mid-wq changeIlpo Järvinen2009-04-031-32/+34
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We need full-scale adjustment to fix a TCP miscount in the next patch, so just move it into a helper and call for that from the other places. Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | net: allow multiple dev per napi with GROStephen Hemminger2009-04-021-2/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | GRO assumes that there is a one-to-one relationship between NAPI structure and network device. Some devices like sky2 share multiple devices on a single interrupt so only have one NAPI handler. Rather than split GRO from NAPI, just have GRO assume if device changes that it is a different flow. Signed-off-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | netfilter: use rcu_read_bh() in ipt_do_table()Eric Dumazet2009-04-023-6/+6
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Commit 784544739a25c30637397ace5489eeb6e15d7d49 (netfilter: iptables: lock free counters) forgot to disable BH in arpt_do_table(), ipt_do_table() and ip6t_do_table() Use rcu_read_lock_bh() instead of rcu_read_lock() cures the problem. Reported-and-bisected-by: Roman Mindalev <r000n@r000n.net> Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Patrick McHardy <kaber@trash.net> Acked-by: Stephen Hemminger <shemminger@vyatta.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | RDS: Use spinlock to protect 64b value update on 32b archsAndy Grover2009-04-027-24/+100
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We have a 64bit value that needs to be set atomically. This is easy and quick on all 64bit archs, and can also be done on x86/32 with set_64bit() (uses cmpxchg8b). However other 32b archs don't have this. I actually changed this to the current state in preparation for mainline because the old way (using a spinlock on 32b) resulted in unsightly #ifdefs in the code. But obviously, being correct takes precedence. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | RDS: Rewrite connection cleanup, fixing oops on rmmodAndy Grover2009-04-028-85/+109
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This fixes a bug where a connection was unexpectedly not on *any* list while being destroyed. It also cleans up some code duplication and regularizes some function names. * Grab appropriate lock in conn_free() and explain in comment * Ensure via locking that a conn is never not on either a dev's list or the nodev list * Add rds_xx_remove_conn() to match rds_xx_add_conn() * Make rds_xx_add_conn() return void * Rename remove_{,nodev_}conns() to destroy_{,nodev_}conns() and unify their implementation in a helper function * Document lock ordering as nodev conn_lock before dev_conn_lock Reported-by: Yosef Etigin <yosefe@voltaire.com> Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | RDS: Fix m_rs_lock deadlockAndy Grover2009-04-021-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | rs_send_drop_to() is called during socket close. If it takes m_rs_lock without disabling interrupts, then rds_send_remove_from_sock() can run from the rx completion handler and thus deadlock. Signed-off-by: Andy Grover <andy.grover@oracle.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | core: remove pointless conditional before kfree()Wei Yongjun2009-04-011-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Remove pointless conditional before kfree(). Signed-off-by: Wei Yongjun <yjwei@cn.fujitsu.com> Signed-off-by: David S. Miller <davem@davemloft.net>
| * | | | | | | | ipv4: remove unused parameter from tcp_recv_urg().Rami Rosen2009-03-311-3/+2
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Signed-off-by: Rami Rosen <ramirose@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | | | | | | | | Merge branch 'devel' into for-linusTrond Myklebust2009-04-0110-383/+505
|\ \ \ \ \ \ \ \ \
| * | | | | | | | | SUNRPC: Ensure IPV6_V6ONLY is set on the socket before binding to a portTrond Myklebust2009-04-011-12/+24
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Also ensure that we use the protocol family instead of the address family when calling sock_create_kern(). Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>
| * | | | | | | | | SUNRPC: Remove CONFIG_SUNRPC_REGISTER_V4Chuck Lever2009-03-281-22/+0
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | We just augmented the kernel's RPC service registration code so that it automatically adjusts to what is supported in user space. Thus we no longer need the kernel configuration option to enable registering RPC services with v4 -- it's all done automatically. This patch is part of a series that addresses http://bugzilla.kernel.org/show_bug.cgi?id=12256 Signed-off-by: Chuck Lever <chuck.lever@oracle.com> Signed-off-by: Trond Myklebust <Trond.Myklebust@netapp.com>