summaryrefslogtreecommitdiffstats
path: root/include/net/inet_hashtables.h (follow)
Commit message (Collapse)AuthorAgeFilesLines
* tcp: Fix a connect() race with timewait socketsEric Dumazet2009-12-091-3/+5
| | | | | | | | | | | | | First patch changes __inet_hash_nolisten() and __inet6_hash() to get a timewait parameter to be able to unhash it from ehash at same time the new socket is inserted in hash. This makes sure timewait socket wont be found by a concurrent writer in __inet_check_established() Reported-by: kapil dakhane <kdakhane@gmail.com> Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: fix inet_bind_bucket_for_eachLucian Adrian Grijincu2009-11-141-2/+2
| | | | | | | | | | | | | The first "node" is supposed to be the cursor used in the for_each. The second "node" is ment literally and should not be macro expanded: it's the name of the hlist_node field from the inet_bind_bucket. This currently works because when inet_bind_bucket_for_each is called it's argument is still "node". Signed-off-by: Lucian Adrian Grijincu <lgrijincu@ixiacom.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: rename some inet_sock fieldsEric Dumazet2009-10-191-6/+6
| | | | | | | | | | | | | | | | In order to have better cache layouts of struct sock (separate zones for rx/tx paths), we need this preliminary patch. Goal is to transfert fields used at lookup time in the first read-mostly cache line (inside struct sock_common) and move sk_refcnt to a separate cache line (only written by rx path) This patch adds inet_ prefix to daddr, rcv_saddr, dport, num, saddr, sport and id fields. This allows a future patch to define these fields as macros, like sk_refcnt, without name clashes. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: replace ehash_size by ehash_maskEric Dumazet2009-10-131-2/+2
| | | | | | | | Storing the mask (size - 1) instead of the size allows fast path to be a bit faster. Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: skb->dst accessorsEric Dumazet2009-06-031-1/+1
| | | | | | | | | | | | | | | | | | Define three accessors to get/set dst attached to a skb struct dst_entry *skb_dst(const struct sk_buff *skb) void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst) void skb_dst_drop(struct sk_buff *skb) This one should replace occurrences of : dst_release(skb->dst) skb->dst = NULL; Delete skb->dst field Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: move bsockets outside of read only beginning of struct inet_hashinfoEric Dumazet2009-02-011-1/+2
| | | | | | | | And switch bsockets to atomic_t since it might be changed in parallel. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Evgeniy Polyakov <zbr@ioremap.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* Merge branch 'master' of ↵David S. Miller2009-01-301-1/+1
|\ | | | | | | | | | | | | master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6 Conflicts: drivers/net/e1000/e1000_main.c
| * net: wrong test in inet_ehash_locks_alloc()Eric Dumazet2009-01-281-1/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | In commit 9db66bdcc83749affe61c61eb8ff3cf08f42afec (net: convert TCP/DCCP ehash rwlocks to spinlocks), I forgot to change one occurrence of rwlock_t to spinlock_t I believe sizeof(raw_spinlock_t) might be > 0 on !CONFIG_SMP if CONFIG_DEBUG_SPINLOCK while sizeof(raw_rwlock_t) should be 0 in this case. Fortunatly, CONFIG_DEBUG_SPINLOCK adds fields to both spinlock_t and rwlock_t, but at this might change in the future (being able to debug spinlocks but not rwlocks for example), better to be safe. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* | inet: Allowing more than 64k connections and heavily optimize bind(0) time.Evgeniy Polyakov2009-01-211-1/+2
|/ | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | With simple extension to the binding mechanism, which allows to bind more than 64k sockets (or smaller amount, depending on sysctl parameters), we have to traverse the whole bind hash table to find out empty bucket. And while it is not a problem for example for 32k connections, bind() completion time grows exponentially (since after each successful binding we have to traverse one bucket more to find empty one) even if we start each time from random offset inside the hash table. So, when hash table is full, and we want to add another socket, we have to traverse the whole table no matter what, so effectivelly this will be the worst case performance and it will be constant. Attached picture shows bind() time depending on number of already bound sockets. Green area corresponds to the usual binding to zero port process, which turns on kernel port selection as described above. Red area is the bind process, when number of reuse-bound sockets is not limited by 64k (or sysctl parameters). The same exponential growth (hidden by the green area) before number of ports reaches sysctl limit. At this time bind hash table has exactly one reuse-enbaled socket in a bucket, but it is possible that they have different addresses. Actually kernel selects the first port to try randomly, so at the beginning bind will take roughly constant time, but with time number of port to check after random start will increase. And that will have exponential growth, but because of above random selection, not every next port selection will necessary take longer time than previous. So we have to consider the area below in the graph (if you could zoom it, you could find, that there are many different times placed there), so area can hide another. Blue area corresponds to the port selection optimization. This is rather simple design approach: hashtable now maintains (unprecise and racely updated) number of currently bound sockets, and when number of such sockets becomes greater than predefined value (I use maximum port range defined by sysctls), we stop traversing the whole bind hash table and just stop at first matching bucket after random start. Above limit roughly corresponds to the case, when bind hash table is full and we turned on mechanism of allowing to bind more reuse-enabled sockets, so it does not change behaviour of other sockets. Signed-off-by: Evgeniy Polyakov <zbr@ioremap.net> Tested-by: Denys Fedoryschenko <denys@visp.net.lb> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Convert TCP/DCCP listening hash tables to use RCUEric Dumazet2008-11-241-1/+8
| | | | | | | | | | | | | | | | | | | | | | This is the last step to be able to perform full RCU lookups in __inet_lookup() : After established/timewait tables, we add RCU lookups to listening hash table. The only trick here is that a socket of a given type (TCP ipv4, TCP ipv6, ...) can now flight between two different tables (established and listening) during a RCU grace period, so we must use different 'nulls' end-of-chain values for two tables. We define a large value : #define LISTENING_NULLS_BASE (1U << 29) So that slots in listening table are guaranteed to have different end-of-chain values than slots in established table. A reader can still detect it finished its lookup in the right chain. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: use net_eq() in INET_MATCH and INET_TW_MATCHEric Dumazet2008-11-221-4/+4
| | | | | | | | | | | | We can avoid some useless instructions if !CONFIG_NET_NS Because of RCU, we use INET_MATCH or INET_TW_MATCH twice for the found socket, so thats six instructions less per incoming TCP packet. Yet another tbench speedup :) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: convert TCP/DCCP ehash rwlocks to spinlocksEric Dumazet2008-11-211-7/+7
| | | | | | | | | | | | Now TCP & DCCP use RCU lookups, we can convert ehash rwlocks to spinlocks. /proc/net/tcp and other seq_file 'readers' can safely be converted to 'writers'. This should speedup writers, since spin_lock()/spin_unlock() only use one atomic operation instead of two for write_lock()/write_unlock() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: listening_hash get a spinlock per bucketEric Dumazet2008-11-201-30/+15
| | | | | | | | | | | | | This patch prepares RCU migration of listening_hash table for TCP/DCCP protocols. listening_hash table being small (32 slots per protocol), we add a spinlock for each slot, instead of a single rwlock for whole table. This should reduce hold time of readers, and writers concurrency. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Convert TCP & DCCP hash tables to use RCU / hlist_nullsEric Dumazet2008-11-171-2/+2
| | | | | | | | | | | | | | | | | | | | | | | RCU was added to UDP lookups, using a fast infrastructure : - sockets kmem_cache use SLAB_DESTROY_BY_RCU and dont pay the price of call_rcu() at freeing time. - hlist_nulls permits to use few memory barriers. This patch uses same infrastructure for TCP/DCCP established and timewait sockets. Thanks to SLAB_DESTROY_BY_RCU, no slowdown for applications using short lived TCP connections. A followup patch, converting rwlocks to spinlocks will even speedup this case. __inet_lookup_established() is pretty fast now we dont have to dirty a contended cache line (read_lock/read_unlock) Only established and timewait hashtable are converted to RCU (bind table and listen table are still using traditional locking) Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* net: ib_net pointer should depends on CONFIG_NET_NSEric Dumazet2008-11-121-0/+7
| | | | | | | | We can shrink size of "struct inet_bind_bucket" by 50%, using read_pnet() and write_pnet() Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: Don't lookup the socket if there's a socket attached to the skbKOVACS Krisztian2008-10-071-3/+7
| | | | | | | | Use the socket cached in the skb if it's present. Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet_hashtables: Add inet_lookup_skb helpersArnaldo Carvalho de Melo2008-10-071-0/+14
| | | | | | | | | | To be able to use the cached socket reference in the skb during input processing we add a new set of lookup functions that receive the skb on their argument list. Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: KOVACS Krisztian <hidden@sch.bme.hu> Signed-off-by: David S. Miller <davem@davemloft.net>
* netns: introduce the net_hash_mix "salt" for hashesPavel Emelyanov2008-06-171-2/+3
| | | | | | | | | | | | | | | | | There are many possible ways to add this "salt", thus I made this patch to be the last in the series to change it if required. Currently I propose to use the struct net pointer itself as this salt, but since this pointer is most often cache-line aligned, shift this right to eliminate the bits, that are most often zeroed. After this, simply add this mix to prepared hashfn-s. For CONFIG_NET_NS=n case this salt is 0 and no changes in hashfn appear. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: add struct net argument to inet_lhashfnPavel Emelyanov2008-06-171-2/+2
| | | | | | | | | Listening-on-one-port sockets in many namespaces produce long chains in the listening_hash-es, so prepare the inet_lhashfn to take struct net into account. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* inet: add struct net argument to inet_bhashfnPavel Emelyanov2008-06-171-1/+2
| | | | | | | | | Binding to some port in many namespaces may create too long chains in bhash-es, so prepare the hashfn to take struct net into account. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* [INET]: Uninline the __inet_inherit_port call.Pavel Emelyanov2008-04-181-13/+1
| | | | | | | | | | | This deblats ~200 bytes when ipv6 and dccp are 'y'. Besides, this will ease compilation issues for patches I'm working on to make inet hash tables more scalable wrt net namespaces. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* [INET]: Drop the inet_inherit_port() call.Pavel Emelyanov2008-04-181-7/+0
| | | | | | | | | | | | | | | As I can see from the code, two places (tcp_v6_syn_recv_sock and dccp_v6_request_recv_sock) that call this one already run with BHs disabled, so it's safe to call __inet_inherit_port there. Besides (in case I missed smth with code review) the calltrace tcp_v6_syn_recv_sock `- tcp_v4_syn_recv_sock `- __inet_inherit_port and the similar for DCCP are valid, but assumes BHs to be disabled. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* [NET] NETNS: Omit sock->sk_net without CONFIG_NET_NS.YOSHIFUJI Hideaki2008-03-251-4/+4
| | | | | | | | | Introduce per-sock inlines: sock_net(), sock_net_set() and per-inet_timewait_sock inlines: twsk_net(), twsk_net_set(). Without CONFIG_NET_NS, no namespace other than &init_net exists. Let's explicitly define them to help compiler optimizations. Signed-off-by: YOSHIFUJI Hideaki <yoshfuji@linux-ipv6.org>
* [SOCK]: Add udp_hash member to struct proto.Pavel Emelyanov2008-03-231-1/+1
| | | | | | | | | | | | | | | | | Inspired by the commit ab1e0a13 ([SOCK] proto: Add hashinfo member to struct proto) from Arnaldo, I made similar thing for UDP/-Lite IPv4 and -v6 protocols. The result is not that exciting, but it removes some levels of indirection in udpxxx_get_port and saves some space in code and text. The first step is to union existing hashinfo and new udp_hash on the struct proto and give a name to this union, since future initialization of tcpxxx_prot, dccp_vx_protinfo and udpxxx_protinfo will cause gcc warning about inability to initialize anonymous member this way. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* [INET]: Fix accidentally broken inet(6)_hash_connect's port offset calculations.Pavel Emelyanov2008-02-051-1/+1
| | | | | | | | | | | | | | | The port offset calculations depend on the protocol family, but, as Adrian noticed, I broke this logic with the commit 5ee31fc1ecdcbc234c8c56dcacef87c8e09909d8 [INET]: Consolidate inet(6)_hash_connect. Return this logic back, by passing the port offset directly into the consolidated function. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Noticed-by: Adrian Bunk <bunk@kernel.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* [SOCK] proto: Add hashinfo member to struct protoArnaldo Carvalho de Melo2008-02-031-42/+9
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | This way we can remove TCP and DCCP specific versions of sk->sk_prot->get_port: both v4 and v6 use inet_csk_get_port sk->sk_prot->hash: inet_hash is directly used, only v6 need a specific version to deal with mapped sockets sk->sk_prot->unhash: both v4 and v6 use inet_hash directly struct inet_connection_sock_af_ops also gets a new member, bind_conflict, so that inet_csk_get_port can find the per family routine. Now only the lookup routines receive as a parameter a struct inet_hashtable. With this we further reuse code, reducing the difference among INET transport protocols. Eventually work has to be done on UDP and SCTP to make them share this infrastructure and get as a bonus inet_diag interfaces so that iproute can be used with these protocols. net-2.6/net/ipv4/inet_hashtables.c: struct proto | +8 struct inet_connection_sock_af_ops | +8 2 structs changed __inet_hash_nolisten | +18 __inet_hash | -210 inet_put_port | +8 inet_bind_bucket_create | +1 __inet_hash_connect | -8 5 functions changed, 27 bytes added, 218 bytes removed, diff: -191 net-2.6/net/core/sock.c: proto_seq_show | +3 1 function changed, 3 bytes added, diff: +3 net-2.6/net/ipv4/inet_connection_sock.c: inet_csk_get_port | +15 1 function changed, 15 bytes added, diff: +15 net-2.6/net/ipv4/tcp.c: tcp_set_state | -7 1 function changed, 7 bytes removed, diff: -7 net-2.6/net/ipv4/tcp_ipv4.c: tcp_v4_get_port | -31 tcp_v4_hash | -48 tcp_v4_destroy_sock | -7 tcp_v4_syn_recv_sock | -2 tcp_unhash | -179 5 functions changed, 267 bytes removed, diff: -267 net-2.6/net/ipv6/inet6_hashtables.c: __inet6_hash | +8 1 function changed, 8 bytes added, diff: +8 net-2.6/net/ipv4/inet_hashtables.c: inet_unhash | +190 inet_hash | +242 2 functions changed, 432 bytes added, diff: +432 vmlinux: 16 functions changed, 485 bytes added, 492 bytes removed, diff: -7 /home/acme/git/net-2.6/net/ipv6/tcp_ipv6.c: tcp_v6_get_port | -31 tcp_v6_hash | -7 tcp_v6_syn_recv_sock | -9 3 functions changed, 47 bytes removed, diff: -47 /home/acme/git/net-2.6/net/dccp/proto.c: dccp_destroy_sock | -7 dccp_unhash | -179 dccp_hash | -49 dccp_set_state | -7 dccp_done | +1 5 functions changed, 1 bytes added, 242 bytes removed, diff: -241 /home/acme/git/net-2.6/net/dccp/ipv4.c: dccp_v4_get_port | -31 dccp_v4_request_recv_sock | -2 2 functions changed, 33 bytes removed, diff: -33 /home/acme/git/net-2.6/net/dccp/ipv6.c: dccp_v6_get_port | -31 dccp_v6_hash | -7 dccp_v6_request_recv_sock | +5 3 functions changed, 5 bytes added, 38 bytes removed, diff: -33 Signed-off-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* [NETNS]: Tcp-v4 sockets per-net lookup.Pavel Emelyanov2008-02-011-21/+27
| | | | | | | | | | | Add a net argument to inet_lookup and propagate it further into lookup calls. Plus tune the __inet_check_established. The dccp and inet_diag, which use that lookup functions pass the init_net into them. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* [NETNS]: Make bind buckets live in net namespaces.Pavel Emelyanov2008-02-011-0/+2
| | | | | | | | | | | | | This tags the inet_bind_bucket struct with net pointer, initializes it during creation and makes a filtering during lookup. A better hashfn, that takes the net into account is to be done in the future, but currently all bind buckets with similar port will be in one hash chain. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* [INET]: Consolidate inet(6)_hash_connect.Pavel Emelyanov2008-02-011-0/+5
| | | | | | | | | | | | | | | | | | | These two functions are the same except for what they call to "check_established" and "hash" for a socket. This saves half-a-kilo for ipv4 and ipv6. add/remove: 1/0 grow/shrink: 1/4 up/down: 582/-1128 (-546) function old new delta __inet_hash_connect - 577 +577 arp_ignore 108 113 +5 static.hint 8 4 -4 rt_worker_func 376 372 -4 inet6_hash_connect 584 25 -559 inet_hash_connect 586 25 -561 Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* [NET]: prot_inuse cleanups and optimizationsEric Dumazet2008-01-291-1/+1
| | | | | | | | | | | | | | | | | | | | 1) Cleanups (all functions are prefixed by sock_prot_inuse) sock_prot_inc_use(prot) -> sock_prot_inuse_add(prot,-1) sock_prot_dec_use(prot) -> sock_prot_inuse_add(prot,-1) sock_prot_inuse() -> sock_prot_inuse_get() New functions : sock_prot_inuse_init() and sock_prot_inuse_free() to abstract pcounter use. 2) if CONFIG_PROC_FS=n, we can zap 'inuse' member from "struct proto", since nobody wants to read the inuse value. This saves 1372 bytes on i386/SMP and some cpu cycles. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* [INET]: Uninline the __inet_lookup_established function.Pavel Emelyanov2008-01-281-37/+3
| | | | | | | | | | | | | | | | | This is -700 bytes from the net/ipv4/built-in.o add/remove: 1/0 grow/shrink: 1/3 up/down: 340/-1040 (-700) function old new delta __inet_lookup_established - 339 +339 tcp_sacktag_write_queue 2254 2255 +1 tcp_v4_err 1304 973 -331 tcp_v4_rcv 2089 1744 -345 tcp_v4_do_rcv 826 462 -364 Exporting is for dccp module (used via e.g. inet_lookup). Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* [INET]: Uninline the __inet_hash function.Pavel Emelyanov2008-01-281-26/+3
| | | | | | | | | | | | | | | | | | | | | | | This one is used in quite many places in the networking code and seems to big to be inline. After the patch net/ipv4/build-in.o loses ~650 bytes: add/remove: 2/0 grow/shrink: 0/5 up/down: 461/-1114 (-653) function old new delta __inet_hash_nolisten - 282 +282 __inet_hash - 179 +179 tcp_sacktag_write_queue 2255 2254 -1 __inet_lookup_listener 284 274 -10 tcp_v4_syn_recv_sock 755 493 -262 tcp_v4_hash 389 35 -354 inet_hash_connect 1086 599 -487 This version addresses the issue pointed by Eric, that while being inline this function was optimized by gcc in respect to the 'listen_possible' argument. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* [IPV4]: Fix memory leak in inet_hashtables.h when NUMA is onPavel Emelyanov2007-11-261-2/+1
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | The inet_ehash_locks_alloc() looks like this: #ifdef CONFIG_NUMA if (size > PAGE_SIZE) x = vmalloc(...); else #endif x = kmalloc(...); Unlike it, the inet_ehash_locks_alloc() looks like this: #ifdef CONFIG_NUMA if (size > PAGE_SIZE) vfree(x); else #else kfree(x); #endif The error is obvious - if the NUMA is on and the size is less than the PAGE_SIZE we leak the pointer (kfree is inside the #else branch). Compiler doesn't warn us because after the kfree(x) there's a "x = NULL" assignment, so here's another (minor?) bug: we don't set x to NULL under certain circumstances. Boring explanation, I know... Patch explains it better. Signed-off-by: Pavel Emelyanov <xemul@openvz.org> Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au>
* [INET]: Add a missing include <linux/vmalloc.h> to inet_hashtables.hEric Dumazet2007-11-111-0/+1
| | | | | Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* [INET]: Remove per bucket rwlock in tcp/dccp ehash table.Eric Dumazet2007-11-071-6/+65
| | | | | | | | | | | | | | | | | | | | | | | | As done two years ago on IP route cache table (commit 22c047ccbc68fa8f3fa57f0e8f906479a062c426) , we can avoid using one lock per hash bucket for the huge TCP/DCCP hash tables. On a typical x86_64 platform, this saves about 2MB or 4MB of ram, for litle performance differences. (we hit a different cache line for the rwlock, but then the bucket cache line have a better sharing factor among cpus, since we dirty it less often). For netstat or ss commands that want a full scan of hash table, we perform fewer memory accesses. Using a 'small' table of hashed rwlocks should be more than enough to provide correct SMP concurrency between different buckets, without using too much memory. Sizing of this table depends on num_possible_cpus() and various CONFIG settings. This patch provides some locking abstraction that may ease a future work using a different model for TCP/DCCP table. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Arnaldo Carvalho de Melo <acme@redhat.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* [UDP]: Make use of inet_iif() when doing socket lookups.Vlad Yasevich2007-10-261-6/+0
| | | | | | | | | | | | | | UDP currently uses skb->dev->ifindex which may provide the wrong information when the socket bound to a specific interface. This patch makes inet_iif() accessible to UDP and makes UDP use it. The scenario we are trying to fix is when a client is running on the same system and the server and both client and server bind to a non-loopback device. Signed-off-by: Vlad Yasevich <vladislav.yasevich@hp.com> Acked-by: David L Stevens <dlstevens@us.ibm.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* [NET]: sparse warning fixesStephen Hemminger2007-10-111-1/+1
| | | | | | | | | Fix a bunch of sparse warnings. Mostly about 0 used as NULL pointer, and shadowed variable declarations. One notable case was that hash size should have been unsigned. Signed-off-by: Stephen Hemminger <shemminger@linux-foundation.org> Signed-off-by: David S. Miller <davem@davemloft.net>
* [NET]: change layout of ehash tableEric Dumazet2007-02-081-5/+5
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | ehash table layout is currently this one : First half of this table is used by sockets not in TIME_WAIT state Second half of it is used by sockets in TIME_WAIT state. This is non optimal because of for a given hash or socket, the two chain heads are located in separate cache lines. Moreover the locks of the second half are never used. If instead of this halving, we use two list heads in inet_ehash_bucket instead of only one, we probably can avoid one cache miss, and reduce ram usage, particularly if sizeof(rwlock_t) is big (various CONFIG_DEBUG_SPINLOCK, CONFIG_DEBUG_LOCK_ALLOC settings). So we still halves the table but we keep together related chains to speedup lookups and socket state change. In this patch I did not try to align struct inet_ehash_bucket, but a future patch could try to make this structure have a convenient size (a power of two or a multiple of L1_CACHE_SIZE). I guess rwlock will just vanish as soon as RCU is plugged into ehash :) , so maybe we dont need to scratch our heads to align the bucket... Note : In case struct inet_ehash_bucket is not a power of two, we could probably change alloc_large_system_hash() (in case it use __get_free_pages()) to free the unused space. It currently allocates a big zone, but the last quarter of it could be freed. Again, this should be a temporary 'problem'. Patch tested on ipv4 tcp only, but should be OK for IPV6 and DCCP. Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* [PATCH] slab: remove kmem_cache_tChristoph Lameter2006-12-071-3/+3
| | | | | | | | | | | | | | | | | | | | | | | | | | | | Replace all uses of kmem_cache_t with struct kmem_cache. The patch was generated using the following script: #!/bin/sh # # Replace one string by another in all the kernel sources. # set -e for file in `find * -name "*.c" -o -name "*.h"|xargs grep -l $1`; do quilt add $file sed -e "1,\$s/$1/$2/g" $file >/tmp/$$ mv /tmp/$$ $file quilt refresh done The script was run like this sh replace kmem_cache_t "struct kmem_cache" Signed-off-by: Christoph Lameter <clameter@sgi.com> Signed-off-by: Andrew Morton <akpm@osdl.org> Signed-off-by: Linus Torvalds <torvalds@osdl.org>
* [IPV4]: annotate inet_lookup() and friendsAl Viro2006-09-291-10/+10
| | | | | | | | | inet_lookup() annotated along with helper functions (__inet_lookup(), __inet_lookup_established(), inet_lookup_established(), inet_lookup_listener(), __inet_lookup_listener() and inet_ehashfn()) Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* [IPV4]: INET_MATCH() annotationsAl Viro2006-09-291-11/+25
| | | | | | | | | | | | | | | INET_MATCH() and friends depend on an interesting set of kludges: * there's a pair of adjacent fields in struct inet_sock - __be16 dport followed by __u16 num. We want to search by pair, so we combine the keys into a single 32bit value and compare with 32bit value read from &...->dport. * on 64bit targets we combine comparisons with pair of adjacent __be32 fields in the same way. Make sure that we don't mix those values with anything else and that pairs we form them from have correct types. Signed-off-by: Al Viro <viro@zeniv.linux.org.uk> Signed-off-by: David S. Miller <davem@davemloft.net>
* [IPV4]: Use network-order dport for all visible inet_lookup_*Herbert Xu2006-09-221-7/+24
| | | | | | | | | | | | | | | | Right now most inet_lookup_* functions take a host-order hnum instead of a network-order dport because that's how it is represented internally. This means that users of these functions have to be careful about using the right byte-order. To add more confusion, inet_lookup takes a network-order dport unlike all other functions. So this patch changes all visible inet_lookup functions to take a dport and move all dport->hnum conversion inside them. Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* [IPV4]: Uninline inet_lookup_listenerHerbert Xu2006-09-221-31/+2
| | | | | | | | By modern standards this function is way too big to be inlined. It's even bigger than __inet_lookup_listener :) Signed-off-by: Herbert Xu <herbert@gondor.apana.org.au> Signed-off-by: David S. Miller <davem@davemloft.net>
* Don't include linux/config.h from anywhere else in include/David Woodhouse2006-04-261-1/+0
| | | | Signed-off-by: David Woodhouse <dwmw2@infradead.org>
* [INET_SOCK]: Move struct inet_sock & helper functions to net/inet_sock.hArnaldo Carvalho de Melo2006-01-031-20/+1
| | | | | | | | | | | To help in reducing the number of include dependencies, several files were touched as they were getting needed headers indirectly for stuff they use. Thanks also to Alan Menegotto for pointing out that net/dccp/proto.c had linux/dccp.h include twice. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* [INET]: Generalise tcp_v4_hash_connectArnaldo Carvalho de Melo2006-01-031-0/+3
| | | | | | | | Renaming it to inet_hash_connect, making it possible to ditch dccp_v4_hash_connect and share the same code with TCP instead. Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* [TCP/DCCP]: Randomize port selectionStephen Hemminger2005-11-061-2/+0
| | | | | | | | | This patch randomizes the port selected on bind() for connections to help with possible security attacks. It should also be faster in most cases because there is no need for a global lock. Signed-off-by: Stephen Hemminger <shemminger@osdl.org> Signed-off-by: Arnaldo Carvalho de Melo <acme@mandriva.com>
* [INET]: Shrink struct inet_ehash_bucket on 32 bits UPEric Dumazet2005-10-051-1/+1
| | | | | | | | | | | | | | | | | | No need to align struct inet_ehash_bucket on a 8 bytes boundary. On 32 bits Uniprocessor, that's a waste of 4 bytes per struct (50 %) On other platforms, the attribute is useless, natual alignement is already 8. platform | Size before | Size after patch -------------+-------------+------------------ 32 bits, UP | 8 | 4 32 bits, SMP | 8 | 8 64 bits, UP | 8 | 8 64 bits, SMP | 16 | 16 Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Signed-off-by: David S. Miller <davem@davemloft.net>
* [INET]: speedup inet (tcp/dccp) lookupsEric Dumazet2005-10-031-26/+38
| | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | | Arnaldo and I agreed it could be applied now, because I have other pending patches depending on this one (Thank you Arnaldo) (The other important patch moves skc_refcnt in a separate cache line, so that the SMP/NUMA performance doesnt suffer from cache line ping pongs) 1) First some performance data : -------------------------------- tcp_v4_rcv() wastes a *lot* of time in __inet_lookup_established() The most time critical code is : sk_for_each(sk, node, &head->chain) { if (INET_MATCH(sk, acookie, saddr, daddr, ports, dif)) goto hit; /* You sunk my battleship! */ } The sk_for_each() does use prefetch() hints but only the begining of "struct sock" is prefetched. As INET_MATCH first comparison uses inet_sk(__sk)->daddr, wich is far away from the begining of "struct sock", it has to bring into CPU cache cold cache line. Each iteration has to use at least 2 cache lines. This can be problematic if some chains are very long. 2) The goal ----------- The idea I had is to change things so that INET_MATCH() may return FALSE in 99% of cases only using the data already in the CPU cache, using one cache line per iteration. 3) Description of the patch --------------------------- Adds a new 'unsigned int skc_hash' field in 'struct sock_common', filling a 32 bits hole on 64 bits platform. struct sock_common { unsigned short skc_family; volatile unsigned char skc_state; unsigned char skc_reuse; int skc_bound_dev_if; struct hlist_node skc_node; struct hlist_node skc_bind_node; atomic_t skc_refcnt; + unsigned int skc_hash; struct proto *skc_prot; }; Store in this 32 bits field the full hash, not masked by (ehash_size - 1) Using this full hash as the first comparison done in INET_MATCH permits us immediatly skip the element without touching a second cache line in case of a miss. Suppress the sk_hashent/tw_hashent fields since skc_hash (aliased to sk_hash and tw_hash) already contains the slot number if we mask with (ehash_size - 1) File include/net/inet_hashtables.h 64 bits platforms : #define INET_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\ (((__sk)->sk_hash == (__hash)) ((*((__u64 *)&(inet_sk(__sk)->daddr)))== (__cookie)) && \ ((*((__u32 *)&(inet_sk(__sk)->dport))) == (__ports)) && \ (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) 32bits platforms: #define TCP_IPV4_MATCH(__sk, __hash, __cookie, __saddr, __daddr, __ports, __dif)\ (((__sk)->sk_hash == (__hash)) && \ (inet_sk(__sk)->daddr == (__saddr)) && \ (inet_sk(__sk)->rcv_saddr == (__daddr)) && \ (!((__sk)->sk_bound_dev_if) || ((__sk)->sk_bound_dev_if == (__dif)))) - Adds a prefetch(head->chain.first) in __inet_lookup_established()/__tcp_v4_check_established() and __inet6_lookup_established()/__tcp_v6_check_established() and __dccp_v4_check_established() to bring into cache the first element of the list, before the {read|write}_lock(&head->lock); Signed-off-by: Eric Dumazet <dada1@cosmosbay.com> Acked-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>
* [ICSK]: Generalise tcp_listen_{start,stop}Arnaldo Carvalho de Melo2005-08-301-0/+6
| | | | | | | | | This also moved inet_iif from tcp to inet_hashtables.h, as it is needed by the inet_lookup callers, perhaps this needs a bit of polishing, but for now seems fine. Signed-off-by: Arnaldo Carvalho de Melo <acme@ghostprotocols.net> Signed-off-by: David S. Miller <davem@davemloft.net>