path: root/include/net/dst.h
Commit message | Author | Age | Files | Lines
* net: fix NULL dereferences in check_peer_redir() | Eric Dumazet | 2011-08-03 | 1 | -4/+13
| Gergely Kalman reported crashes in check_peer_redir(). It appears commit f39925dbde778 (ipv4: Cache learned redirect information in inetpeer.) added a race, leading to possible NULL ptr dereference.
| Since we can now change dst neighbour, we should make sure a reader can safely use a neighbour.
| Add RCU protection to dst neighbour, and make sure check_peer_redir() can be called safely by different cpus in parallel.
| As neighbours are already freed after one RCU grace period, this patch should not add typical RCU penalty (cache cold effects).
| Many thanks to Gergely for providing a pretty report pointing to the bug.
| Reported-by: Gergely Kalman <synapse@hippy.csoma.elte.hu>
| Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add ->neigh_lookup() operation to dst_ops | David S. Miller | 2011-07-18 | 1 | -0/+5
| In the future dst entries will be neigh-less. In that environment we need to have an easy transition point for current users of dst->neighbour outside of the packet output fast path.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Abstract dst->neighbour accesses behind helpers. | David S. Miller | 2011-07-18 | 1 | -3/+15
| dst_{get,set}_neighbour()
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Embed hh_cache inside of struct neighbour. | David S. Miller | 2011-07-14 | 1 | -9/+9
| Now that there is a one-to-one correspondence between neighbour and hh_cache entries, we no longer need:
| 1) dynamic allocation
| 2) attachment to dst->hh
| 3) refcounting
| Initialization of the hh_cache entry is indicated by hh_len being non-zero, and such initialization is always done with the neighbour's lock held as a writer.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv6: Don't put artificial limit on routing table size. | David S. Miller | 2011-07-02 | 1 | -0/+1
| IPV6, unlike IPV4, doesn't have a routing cache.
| Routing table entries, as well as clones made in response to route lookup requests, all live in the same table. And all of these things are together collected in the destination cache table for ipv6.
| This means that routing table entries count against the garbage collection limits, even though such entries cannot ever be reclaimed and are added explicitly by the administrator (rather than being created in response to lookups).
| Therefore it makes no sense to count ipv6 routing table entries against the GC limits.
| Add a DST_NOCOUNT destination cache entry flag, and skip the counting if it is set. Use this flag bit in ipv6 when adding routing table entries.
| Signed-off-by: David S. Miller <davem@davemloft.net>
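The DST_NOCOUNT idea described above can be sketched in user-space C. This is only an illustration under stated assumptions: the struct layouts, the `_sketch` names and the flag's bit value are hypothetical and are not taken from the kernel source; only the rule "skip the entry counter when the flag is set" comes from the commit message.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical bit value, chosen for illustration only. */
#define DST_NOCOUNT_SKETCH 0x0020

/* Stand-in for dst_ops: only the GC entry counter is modelled. */
struct dst_ops_sketch {
    int entries;
};

/* Stand-in for dst_entry: only the fields this sketch needs. */
struct dst_sketch {
    struct dst_ops_sketch *ops;
    unsigned short flags;
};

/* Allocate an entry; count it against the GC limit unless NOCOUNT is set. */
static struct dst_sketch *dst_alloc_sketch(struct dst_ops_sketch *ops,
                                           unsigned short flags)
{
    struct dst_sketch *dst;

    assert(ops != NULL);
    dst = calloc(1, sizeof(*dst));
    if (!dst)
        return NULL;
    dst->ops = ops;
    dst->flags = flags;
    if (!(flags & DST_NOCOUNT_SKETCH))
        ops->entries++;
    return dst;
}

/* Free an entry; undo the accounting only if it was counted on allocation. */
static void dst_free_sketch(struct dst_sketch *dst)
{
    if (!(dst->flags & DST_NOCOUNT_SKETCH))
        dst->ops->entries--;
    free(dst);
}
```

In this sketch an administrator-added route would be allocated with the flag set, so only lookup-created clones move the counter that garbage collection compares against its limits.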
* dst: catch uninitialized metrics | Stephen Hemminger | 2011-05-24 | 1 | -0/+2
| Catch cases where dst_metric_set() and other functions are called but _metrics is NULL.
| Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Kill RT_CACHE_DEBUG | David S. Miller | 2011-05-19 | 1 | -7/+0
| It's way past its usefulness. And this gets rid of a bunch of stray ->rt_{dst,src} references.
| Even the comment documenting the macro was inaccurate (stated default was 1 when it's 0).
| If reintroduced, it should be done properly, with dynamic debug facilities.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Make dst_alloc() take more explicit initializations. | David S. Miller | 2011-04-29 | 1 | -1/+2
| Now the dst->dev, dst->obsolete, and dst->flags values can be specified as well.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Remove __KERNEL__ cpp checks from include/net | David S. Miller | 2011-04-24 | 1 | -3/+0
| These header files are never installed for user consumption, so any __KERNEL__ cpp checks are superfluous.
| Projects should also not copy these files into their userland utility sources and try to use them there. If they insist on doing so, the onus is on them to sanitize the headers as needed.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* dst: Clone child entry in skb_dst_pop | Steffen Klassert | 2011-03-28 | 1 | -1/+1
| We clone the child entry in skb_dst_pop before we call skb_dst_drop(). Otherwise we might kill the child right before we return it to the caller.
| Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* xfrm: Return dst directly from xfrm_lookup() | David S. Miller | 2011-03-02 | 1 | -6/+8
| Instead of on the stack.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* xfrm: Handle blackhole route creation via afinfo. | David S. Miller | 2011-03-01 | 1 | -8/+0
| That way we don't have to potentially do this in every xfrm_lookup() caller.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* xfrm: Kill XFRM_LOOKUP_WAIT flag. | David S. Miller | 2011-03-01 | 1 | -2/+1
| This can be determined from the flow flags instead.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Make flow cache paths use a const struct flowi. | David S. Miller | 2011-02-23 | 1 | -4/+6
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add initial_ref arg to dst_alloc(). | David S. Miller | 2011-02-18 | 1 | -1/+1
| This allows avoiding multiple writes to the initial __refcnt.
| The simplest cases of wanting an initial reference of "1" in ipv4 and ipv6 have been converted, the rest have been left alone and kept at the existing "0".
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Remove bogus barrier() in dst_allfrag(). | David S. Miller | 2011-02-09 | 1 | -2/+0
| I simply missed this one when modifying the other dst metric interfaces earlier.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* inetpeer: Move ICMP rate limiting state into inet_peer entries. | David S. Miller | 2011-02-05 | 1 | -2/+0
| Like metrics, the ICMP rate limiting bits are cached state about a destination. So move it into the inet_peer entries.
| If an inet_peer cannot be bound (the reason is memory allocation failure or similar), the policy is to allow.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: Attach FIB info to dst_default_metrics when possible | David S. Miller | 2011-01-28 | 1 | -0/+1
| If there are no explicit metrics attached to a route, hook fi->fib_info up to dst_default_metrics.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Implement read-only protection and COW'ing of metrics. | David S. Miller | 2011-01-27 | 1 | -37/+77
| Routing metrics are now copy-on-write.
| Initially a route entry points its metrics at a read-only location. If a routing table entry exists, it will point there. Else it will point at the all zero metric place-holder called 'dst_default_metrics'.
| The writeability state of the metrics is stored in the low bits of the metrics pointer, we have two bits left to spare if we want to store more states.
| For the initial implementation, COW is implemented simply via kmalloc. However future enhancements will change this to place the writable metrics somewhere else, in order to increase sharing. Very likely this "somewhere else" will be the inetpeer cache.
| Note also that this means that metrics updates may transiently fail if we cannot COW the metrics successfully.
| But even by itself, this patch should decrease memory usage and increase cache locality especially for routing workloads. In those cases the read-only metric copies stay in place and never get written to.
| TCP workloads where metrics get updated, and those rare cases where PMTU triggers occur, will take a very slight performance hit. But that hit will be alleviated when the long-term writable metrics move to a more sharable location.
| Since the metrics storage went from a u32 array of RTAX_MAX entries to what is essentially a pointer, some retooling of the dst_entry layout was necessary.
| Most importantly, we need to preserve the alignment of the reference count so that it doesn't share cache lines with the read-mostly state, as per Eric Dumazet's alignment assertion checks.
| The only non-trivial bit here is the move of the 'flags' member into the writeable cacheline. This is OK since we are always accessing the flags around the same moment when we made a modification to the reference count.
| Signed-off-by: David S. Miller <davem@davemloft.net>
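The copy-on-write scheme with a state bit kept in the low bits of the metrics pointer can be sketched as follows. This is a user-space illustration, not the kernel implementation: the `_sketch` names, the array size and the bit values are assumptions, and malloc() stands in for kmalloc(). It relies on u32 arrays being at least 4-byte aligned, which leaves the two lowest pointer bits free.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>
#include <string.h>

#define RTAX_MAX_SKETCH     16      /* hypothetical number of metric slots */
#define METRICS_READ_ONLY   0x1UL   /* low pointer bit: target must not be written */
#define METRICS_FLAGS_MASK  0x3UL   /* the two low bits reserved for state */

/* All-zero placeholder shared by every entry without explicit metrics. */
static const uint32_t default_metrics_sketch[RTAX_MAX_SKETCH] = { 0 };

struct dst_sketch {
    uintptr_t metrics;              /* pointer value with state in the low bits */
};

static uint32_t *metrics_ptr(const struct dst_sketch *dst)
{
    return (uint32_t *)(dst->metrics & ~METRICS_FLAGS_MASK);
}

static int metrics_read_only(const struct dst_sketch *dst)
{
    return (dst->metrics & METRICS_READ_ONLY) != 0;
}

/* Point the entry at shared metrics and mark them read-only. */
static void dst_init_metrics_sketch(struct dst_sketch *dst, const uint32_t *src)
{
    assert(((uintptr_t)src & METRICS_FLAGS_MASK) == 0);
    dst->metrics = (uintptr_t)src | METRICS_READ_ONLY;
}

static uint32_t dst_metric_sketch(const struct dst_sketch *dst, int idx)
{
    return metrics_ptr(dst)[idx];
}

/* First write copies the shared array; returns -1 if the copy cannot be made. */
static int dst_metric_set_sketch(struct dst_sketch *dst, int idx, uint32_t val)
{
    if (metrics_read_only(dst)) {
        uint32_t *copy = malloc(sizeof(default_metrics_sketch));

        if (!copy)
            return -1;              /* update transiently fails, as the message notes */
        memcpy(copy, metrics_ptr(dst), sizeof(default_metrics_sketch));
        dst->metrics = (uintptr_t)copy;   /* read-only bit is now clear */
    }
    metrics_ptr(dst)[idx] = val;
    return 0;
}

/* Release the private copy, if one was ever made. */
static void dst_destroy_metrics_sketch(struct dst_sketch *dst)
{
    if (!metrics_read_only(dst))
        free(metrics_ptr(dst));
}
```

Entries that are only ever read keep sharing one array, which is where the memory and cache-locality benefit described above would come from.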
* Merge branch 'master' of git://1984.lsi.us.es/net-next-2.6 | Patrick McHardy | 2011-01-14 | 1 | -10/+50
|\
| | Conflicts: net/ipv4/route.c
| | Signed-off-by: Patrick McHardy <kaber@trash.net>
| * net: Abstract default MTU metric calculation behind an accessor. | David S. Miller | 2010-12-14 | 1 | -7/+8
| | Like RTAX_ADVMSS, make the default calculation go through a dst_ops method rather than caching the computation in the routing cache entries.
| | Now dst metrics are pretty much left as-is when new entries are created, thus optimizing metric sharing becomes a real possibility.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Abstract default ADVMSS behind an accessor. | David S. Miller | 2010-12-13 | 1 | -1/+13
| | Make all RTAX_ADVMSS metric accesses go through a new helper function, dst_metric_advmss().
| | Leave the actual default metric as "zero" in the real metric slot, and compute the actual default value dynamically via a new dst_ops AF specific callback.
| | For stacked IPSEC routes, we use the advmss of the path which preserves existing behavior.
| | Unlike ipv4/ipv6, DecNET ties the advmss to the mtu and thus updates advmss on pmtu updates. This inconsistency in advmss handling results in more raw metric accesses than I wish we ended up with.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| * ipv4: Don't pre-seed hoplimit metric. | David S. Miller | 2010-12-13 | 1 | -6/+0
| | Always go through a new ip4_dst_hoplimit() helper, just like ipv6.
| | This allowed several simplifications:
| | 1) The interim dst_metric_hoplimit() can go as it's no longer used.
| | 2) The sysctl_ip_default_ttl entry no longer needs to use ipv4_doint_and_flush, since the sysctl is not cached in routing cache metrics any longer.
| | 3) ipv4_doint_and_flush no longer needs to be exported and therefore can be marked static.
| | When ipv4_doint_and_flush_strategy was removed some time ago, the external declaration in ip.h was mistakenly left around so kill that off too.
| | We have to move the sysctl_ip_default_ttl declaration into ipv4's route cache definition header net/route.h, because currently net/ip.h (where the declaration lives now) has a back dependency on net/route.h
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Abstract RTAX_HOPLIMIT metric accesses behind helper. | David S. Miller | 2010-12-13 | 1 | -1/+14
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| * net: Abstract away all dst_entry metrics accesses. | David S. Miller | 2010-12-09 | 1 | -3/+23
| | Use helper functions to hide all direct accesses, especially writes, to dst_entry metrics values.
| | This will allow us to:
| | 1) More easily change how the metrics are stored.
| | 2) Implement COW for metrics.
| | In particular this will help us put metrics into the inetpeer cache if that is what we end up doing. We can make the _metrics member a pointer instead of an array, initially have it point at the read-only metrics in the FIB, and then on the first set grab an inetpeer entry and point the _metrics member there.
| | Signed-off-by: David S. Miller <davem@davemloft.net>
| | Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
* | netfilter: fix Kconfig dependencies | Patrick McHardy | 2011-01-14 | 1 | -1/+1
|/
| Fix dependencies of netfilter realm match: it depends on NET_CLS_ROUTE, which itself depends on NET_SCHED; this dependency is missing from netfilter.
| Since matching on realms is also useful without having NET_SCHED enabled and the option really only controls whether the tclassid member is included in route and dst entries, rename the config option to IP_ROUTE_CLASSID and move it outside of traffic scheduling context to get rid of the NET_SCHED dependency.
| Reported-by: Vladis Kletnieks <Valdis.Kletnieks@vt.edu>
| Signed-off-by: Patrick McHardy <kaber@trash.net>
* decnet: RCU conversion and get rid of dev_base_lock | Eric Dumazet | 2010-11-08 | 1 | -4/+4
| While tracking dev_base_lock users, I found decnet used it in dnet_select_source(), but for a wrong purpose:
| Writers only hold RTNL, not dev_base_lock, so readers must use RCU if they cannot use RTNL.
| Adds an rcu_head in struct dn_ifaddr and handle proper RCU management.
| Adds __rcu annotation in dn_route as well.
| Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| Acked-by: Steven Whitehouse <swhiteho@redhat.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* ipv4: add __rcu annotations to routes.c | Eric Dumazet | 2010-10-27 | 1 | -1/+1
| Add __rcu annotations to:
|   (struct dst_entry)->rt_next
|   (struct rt_hash_bucket)->chain
| And use appropriate rcu primitives to reduce sparse warnings if CONFIG_SPARSE_RCU_POINTER=y
| Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: introduce DST_NOCACHE flag | Eric Dumazet | 2010-10-04 | 1 | -4/+5
| While doing stress tests with IP route cache disabled, and multi queue devices, I noticed a very high contention on one rwlock used in neighbour code.
| When many cpus are trying to send frames (possibly using a high performance multiqueue device) to the same neighbour, they fight for the neigh->lock rwlock in order to call neigh_hh_init(), and fight on hh->hh_refcnt (a pair of atomic_inc/atomic_dec_and_test())
| But we don't need to call neigh_hh_init() for dst that are used only once. It costs four atomic operations at least, on two contended cache lines, plus the high contention on neigh->lock rwlock.
| Introduce a new dst flag, DST_NOCACHE, that is set when dst was not inserted in route cache.
| With the stress test bench, sending 160000000 frames on one neighbour, results are:
| Before patch:
|   real 2m28.406s
|   user 0m11.781s
|   sys 36m17.964s
| After patch:
|   real 1m26.532s
|   user 0m12.185s
|   sys 20m3.903s
| Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* tunnels: prepare percpu accounting | Eric Dumazet | 2010-09-28 | 1 | -5/+19
| Tunnels are going to use percpu for their accounting.
| They are going to use a new tstats field in net_device.
| skb_tunnel_rx() is changed to be a wrapper around __skb_tunnel_rx()
| IPTUNNEL_XMIT() is changed to be a wrapper around __IPTUNNEL_XMIT()
| Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: reset skb queue mapping when rx'ing over tunnel | Tom Herbert | 2010-09-27 | 1 | -0/+1
| Reset queue mapping when an skb is reentering the stack via a tunnel. On second pass, the queue mapping from the original device is no longer valid.
| Signed-off-by: Tom Herbert <therbert@google.com>
| Acked-by: Eric Dumazet <eric.dumazet@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: check for refcount if pop a stacked dst_entry | Steffen Klassert | 2010-06-05 | 1 | -3/+3
| xfrm triggers a warning if dst_pop() drops a refcount on a noref dst. This patch changes dst_pop() to skb_dst_pop(). skb_dst_pop() drops the refcnt only on a refcounted dst. Also we don't clone the child dst_entry, so it is not refcounted and we can use skb_dst_set_noref() in xfrm_output_one().
| Signed-off-by: Steffen Klassert <steffen.klassert@secunet.com>
| Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Introduce skb_tunnel_rx() helper | Eric Dumazet | 2010-05-18 | 1 | -0/+20
| skb rxhash should be cleared when a skb is handled by a tunnel before being delivered again, so that correct packet steering can take place.
| There are other cleanups and accounting that we can factorize in a new helper, skb_tunnel_rx()
| Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: add a noref bit on skb dst | Eric Dumazet | 2010-05-18 | 1 | -3/+45
| Use low order bit of skb->_skb_dst to tell dst is not refcounted.
| Change _skb_dst to _skb_refdst to make sure all uses are caught.
| skb_dst() returns the dst, regardless of noref bit set or not, but with a lockdep check to make sure a noref dst is not given if current user is not rcu protected.
| New skb_dst_set_noref() helper to set a non-refcounted dst on a skb. (with lockdep check)
| skb_dst_drop() drops a reference only if skb dst was refcounted.
| skb_dst_force() helper is used to force a refcount on dst, when skb is queued and not anymore RCU protected.
| Use skb_dst_force() in __sk_add_backlog(), __dev_xmit_skb() if !IFF_XMIT_DST_RELEASE or skb enqueued on qdisc queue, in sock_queue_rcv_skb(), in __nf_queue().
| Use skb_dst_force() in dev_requeue_skb().
| Note: dst_use_noref() still dirties dst, we might transform it later to do one dirtying per jiffies.
| Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: sk_dst_cache RCUification | Eric Dumazet | 2010-04-13 | 1 | -15/+0
| With latest CONFIG_PROVE_RCU stuff, I felt more comfortable to make this work.
| sk->sk_dst_cache is currently protected by a rwlock (sk_dst_lock)
| This rwlock is readlocked for a very small amount of time, and dst entries are already freed after RCU grace period. This calls for RCU again :)
| This patch converts sk_dst_lock to a spinlock, and use RCU for readers.
| __sk_dst_get() is supposed to be called with rcu_read_lock() or if socket locked by user, so use appropriate rcu_dereference_check() condition (rcu_read_lock_held() || sock_owned_by_user(sk))
| This patch avoids two atomic ops per tx packet on UDP connected sockets, for example, and permits sk_dst_lock to be much less dirtied.
| Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Add rtnetlink init_rcvwnd to set the TCP initial receive window | laurent chavey | 2009-12-23 | 1 | -2/+0
| Add rtnetlink init_rcvwnd to set the TCP initial receive window size advertised by passive and active TCP connections.
| The current Linux TCP implementation limits the advertised TCP initial receive window to the one prescribed by slow start. For short lived TCP connections used for transaction type of traffic (i.e. http requests), bounding the advertised TCP initial receive window results in increased latency to complete the transaction.
| Support for setting initial congestion window is already supported using rtnetlink init_cwnd, but the feature is useless without the ability to set a larger TCP initial receive window.
| The rtnetlink init_rcvwnd allows increasing the TCP initial receive window, allowing TCP connection to advertise larger TCP receive window than the ones bounded by slow start.
| Signed-off-by: Laurent Chavey <chavey@google.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: Revert per-route SACK/DSACK/TIMESTAMP changes. | David S. Miller | 2009-12-16 | 1 | -1/+1
| It creates a regression, triggering badness for SYN_RECV sockets, for example:
|   [19148.022102] Badness at net/ipv4/inet_connection_sock.c:293
|   [19148.022570] NIP: c02a0914 LR: c02a0904 CTR: 00000000
|   [19148.023035] REGS: eeecbd30 TRAP: 0700 Not tainted (2.6.32)
|   [19148.023496] MSR: 00029032 <EE,ME,CE,IR,DR> CR: 24002442 XER: 00000000
|   [19148.024012] TASK = eee9a820[1756] 'privoxy' THREAD: eeeca000
| This is likely caused by the change in the 'estab' parameter passed to tcp_parse_options() when invoked by the functions in net/ipv4/tcp_minisocks.c
| But even if that is fixed, the ->conn_request() changes made in this patch series is fundamentally wrong. They try to use the listening socket's 'dst' to probe the route settings. The listening socket doesn't even have a route, and you can't get the right route (the child request one) until much later after we setup all of the state, and it must be done by hand.
| This stuff really isn't ready, so the best thing to do is a full revert. This reverts the following commits:
|   f55017a93f1a74d50244b1254b9a2bd7ac9bbf7d
|   022c3f7d82f0f1c68018696f2f027b87b9bb45c2
|   1aba721eba1d84a2defce45b950272cee1e6c72a
|   cda42ebd67ee5fdf09d7057b5a4584d36fe8a335
|   345cda2fd695534be5a4494f1b59da9daed33663
|   dc343475ed062e13fc260acccaab91d7d80fd5b2
|   05eaade2782fb0c90d3034fd7a7d5a16266182bb
|   6a2a2d6bf8581216e08be15fcb563cfd6c430e1e
| Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: Use defaults when no route options are available | Gilad Ben-Yossef | 2009-11-05 | 1 | -1/+1
| Trying to parse the option of a SYN packet that we have no route entry for should just use global wide defaults for route entry options.
| Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
| Tested-by: Valdis.Kletnieks@vt.edu
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: cleanup include/net | Eric Dumazet | 2009-11-04 | 1 | -2/+1
| This cleanup patch puts struct/union/enum opening braces, in first line to ease grep games.
|   struct something
|   {
| becomes:
|   struct something {
| Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* Add dst_feature to query route entry features | Gilad Ben-Yossef | 2009-10-29 | 1 | -1/+7
| Adding an accessor to existing dst_entry features field and refactor the only supported feature (allfrag) to use it.
| Signed-off-by: Gilad Ben-Yossef <gilad@codefidence.com>
| Signed-off-by: Ori Finkelman <ori@comsleep.com>
| Signed-off-by: Yony Amit <yony@comsleep.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Fix for dst_negative_advice | Krishna Kumar | 2009-10-21 | 1 | -2/+10
| dst_negative_advice() should check for changed dst and reset sk_tx_queue_mapping accordingly. Pass sock to the callers of dst_negative_advice.
| (sk_reset_txq is defined just for use by dst_negative_advice. The only way I could find to get around this is to move dst_negative_() from dst.h to dst.c, include sock.h in dst.c, etc)
| Signed-off-by: Krishna Kumar <krkumar2@in.ibm.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* netns: embed ip6_dst_ops directly | Alexey Dobriyan | 2009-09-02 | 1 | -22/+1
| struct net::ipv6.ip6_dst_ops is separately dynamically allocated, but there is no fundamental reason for it. Embed it directly into struct netns_ipv6.
| For that:
| * move struct dst_ops into separate header to fix circular dependencies. I honestly tried not to, it's pretty impossible to do other way
| * drop dynamical allocation, allocate together with netns
| For a change, remove struct dst_ops::dst_net, it's deducible by using container_of() given dst_ops pointer.
| Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
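Why an embedded dst_ops no longer needs a back pointer to its namespace can be sketched with a user-space container_of(). The struct layouts and `_sketch` names below are simplified stand-ins chosen for illustration; only the member path ipv6.ip6_dst_ops comes from the commit message.

```c
#include <assert.h>
#include <stddef.h>

/* User-space equivalent of the kernel macro, without its type checking. */
#define container_of_sketch(ptr, type, member) \
    ((type *)((char *)(ptr) - offsetof(type, member)))

struct dst_ops_sketch {
    int family;
};

/* Simplified stand-in for struct netns_ipv6 with the ops embedded by value. */
struct netns_ipv6_sketch {
    int placeholder;
    struct dst_ops_sketch ip6_dst_ops;
};

/* Simplified stand-in for struct net. */
struct net_sketch {
    int id;
    struct netns_ipv6_sketch ipv6;
};

/* Recover the owning namespace from a pointer to its embedded dst_ops. */
static struct net_sketch *dst_ops_to_net_sketch(struct dst_ops_sketch *ops)
{
    assert(ops != NULL);
    return container_of_sketch(ops, struct net_sketch, ipv6.ip6_dst_ops);
}
```

Because the ops structure sits at a fixed offset inside its namespace, subtracting that offset is enough to find the owner, so a stored dst_net pointer becomes redundant.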
* net: skb->dst accessors | Eric Dumazet | 2009-06-03 | 1 | -3/+9
| Define three accessors to get/set dst attached to a skb
|   struct dst_entry *skb_dst(const struct sk_buff *skb)
|   void skb_dst_set(struct sk_buff *skb, struct dst_entry *dst)
|   void skb_dst_drop(struct sk_buff *skb)
| This one should replace occurrences of:
|   dst_release(skb->dst)
|   skb->dst = NULL;
| Delete skb->dst field
| Signed-off-by: Eric Dumazet <eric.dumazet@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* netns xfrm: lookup in netns | Alexey Dobriyan | 2008-11-26 | 1 | -8/+8
| Pass netns to xfrm_lookup()/__xfrm_lookup(). For that pass netns to flow_cache_lookup() and resolver callback.
| Take it from socket or netdevice. Stub DECnet to init_net.
| Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: make sure struct dst_entry refcount is aligned on 64 bytes | Eric Dumazet | 2008-11-17 | 1 | -0/+19
| As found in the past (commit f1dd9c379cac7d5a76259e7dffcd5f8edc697d17 [NET]: Fix tbench regression in 2.6.25-rc1), it is really important that struct dst_entry refcount is aligned on a cache line.
| We cannot use __attribute__((aligned)), so manually pad the structure for 32 and 64 bit arches.
| for 32bit: offsetof(struct dst_entry, __refcnt) is 0x80
| for 64bit: offsetof(struct dst_entry, __refcnt) is 0xc0
| As it is not possible to guess at compile time cache line size, we use a generic value of 64 bytes, that satisfies many current arches. (Using 128 bytes alignment on 64bit arches would waste 64 bytes)
| Add a BUILD_BUG_ON to catch future updates to "struct dst_entry" that break this alignment.
| "tbench 8" is 4.4 % faster on a dual quad core (HP BL460c G1), Intel E5450 @3.00GHz (2350 MB/s instead of 2250 MB/s)
| Signed-off-by: Eric Dumazet <dada1@cosmosbay.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
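The manual-padding approach with a compile-time check can be sketched in user-space C11. The field names, the padding expression and the `_sketch` identifiers are assumptions made for illustration and do not reproduce the real struct dst_entry layout; _Static_assert plays the role the commit gives to BUILD_BUG_ON.

```c
#include <assert.h>
#include <stddef.h>

#define CACHE_LINE_SKETCH 64   /* generic value, as in the commit message */

/* Hypothetical read-mostly fields that precede the hot reference count. */
struct dst_read_mostly_sketch {
    void *dev;
    void *ops;
    unsigned long expires;
    void *path;
};

struct dst_sketch {
    struct dst_read_mostly_sketch ro;
    /* Manual padding so that refcnt starts on a 64-byte boundary. */
    char pad[CACHE_LINE_SKETCH -
             sizeof(struct dst_read_mostly_sketch) % CACHE_LINE_SKETCH];
    int refcnt;                /* written often: kept off the read-mostly line */
    int use;
    unsigned long lastuse;
};

/* Compile-time guard: a layout change that breaks alignment fails the build. */
_Static_assert(offsetof(struct dst_sketch, refcnt) % CACHE_LINE_SKETCH == 0,
               "refcnt must start on a 64-byte boundary");

/* Expose the offset so callers can inspect it at run time as well. */
static size_t dst_refcnt_offset_sketch(void)
{
    return offsetof(struct dst_sketch, refcnt);
}
```

Adding a field to the read-mostly part changes the padding expression automatically in this sketch; the kernel commit instead pads by hand per architecture and relies on the build-time check to catch drift.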
* net: remove struct dst_entry::entry_size | Alexey Dobriyan | 2008-11-12 | 1 | -1/+0
| Unused after kmem_cache_zalloc() conversion.
| Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: reduce structures when XFRM=n | Alexey Dobriyan | 2008-10-28 | 1 | -1/+2
| ifdef out
| * struct sk_buff::sp (pointer)
| * struct dst_entry::xfrm (pointer)
| * struct sock::sk_policy (2 pointers)
| Signed-off-by: Alexey Dobriyan <adobriyan@gmail.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
* net: Kill plain NET_XMIT_BYPASS. | David S. Miller | 2008-08-05 | 1 | -11/+1
| dst_input() was doing something completely absurd, looping on skb->dst->input() if NET_XMIT_BYPASS was seen, but these functions never return such an error.
| And as a result plain ole' NET_XMIT_BYPASS has no more references and can be completely killed off.
| Signed-off-by: David S. Miller <davem@davemloft.net>
* tcp: RTT metrics scaling | Stephen Hemminger | 2008-07-19 | 1 | -0/+12
| Some of the metrics (RTT, RTTVAR and RTAX_RTO_MIN) are stored in kernel units (jiffies) and this leaks out through the netlink API to user space where the units for jiffies are unknown.
| This patch changes the kernel to convert to/from milliseconds. This changes the ABI, but milliseconds seemed like the most natural unit for these parameters. Values available via syscall in /proc/net/rt_cache and netlink will be in milliseconds.
| Signed-off-by: Stephen Hemminger <shemminger@vyatta.com>
| Signed-off-by: David S. Miller <davem@davemloft.net>
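The unit conversion described above can be sketched as a pair of accessors that store milliseconds and hand jiffies back to in-kernel callers. This is a user-space illustration under stated assumptions: the tick rate of 250 Hz, the array size and every `_sketch` name are hypothetical, and the conversion helpers cover only the simple case where the tick rate divides 1000 evenly.

```c
#include <assert.h>

#define HZ_SKETCH            250u   /* assumed tick rate; must divide 1000 here */
#define MSEC_PER_SEC_SKETCH  1000u
#define RTAX_MAX_SKETCH      16     /* hypothetical number of metric slots */

/* Jiffies to milliseconds for a tick rate that divides 1000. */
static unsigned int jiffies_to_msecs_sketch(unsigned long j)
{
    return (unsigned int)(j * (MSEC_PER_SEC_SKETCH / HZ_SKETCH));
}

/* Milliseconds to jiffies, rounding up so a non-zero time never becomes 0. */
static unsigned long msecs_to_jiffies_sketch(unsigned int m)
{
    unsigned int ms_per_jiffy = MSEC_PER_SEC_SKETCH / HZ_SKETCH;

    return (m + ms_per_jiffy - 1) / ms_per_jiffy;
}

/* Metrics are kept in milliseconds, the unit exposed to user space. */
struct dst_sketch {
    unsigned int metrics_ms[RTAX_MAX_SKETCH];
};

/* Writer used by in-kernel code that measures time in jiffies. */
static void set_dst_metric_rtt_sketch(struct dst_sketch *dst, int idx,
                                      unsigned long rtt_jiffies)
{
    assert(idx >= 0 && idx < RTAX_MAX_SKETCH);
    dst->metrics_ms[idx] = jiffies_to_msecs_sketch(rtt_jiffies);
}

/* Reader that converts the stored milliseconds back to jiffies. */
static unsigned long dst_metric_rtt_sketch(const struct dst_sketch *dst, int idx)
{
    assert(idx >= 0 && idx < RTAX_MAX_SKETCH);
    return msecs_to_jiffies_sketch(dst->metrics_ms[idx]);
}
```

With the assumed 250 Hz tick, an RTT of 50 jiffies is stored as 200 ms, so a user-space reader sees the same number regardless of how the kernel was configured.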
* [NET]: uninline dst_release | Ilpo Järvinen | 2008-03-28 | 1 | -9/+1
| Codiff stats (allyesconfig, v2.6.24-mm1):
|   -16420  187 funcs, 103 +, 16523 -, diff: -16420 --- dst_release
| Without number of debug related CONFIGs (v2.6.25-rc2-mm1):
|   -7257  186 funcs, 70 +, 7327 -, diff: -7257 --- dst_release
|   dst_release | +40
| Signed-off-by: Ilpo Järvinen <ilpo.jarvinen@helsinki.fi>
| Signed-off-by: David S. Miller <davem@davemloft.net>