| Commit message (Collapse) | Author | Age | Files | Lines |
|
|
|
|
|
|
|
| |
The function name is wrong in ip6gre_tnl_addr_conflict() comment, which
use ip6_tnl_addr_conflict instead of ip6gre_tnl_addr_conflict.
Signed-off-by: Sun Lianwen <sunlw.fnst@cn.fujitsu.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Syzbot managed to send a udp gso packet without checksum offload into
the gso stack by disabling tx checksum (UDP_NO_CHECK6_TX). This
triggered the skb_warn_bad_offload.
RIP: 0010:skb_warn_bad_offload+0x2bc/0x600 net/core/dev.c:2658
skb_gso_segment include/linux/netdevice.h:4038 [inline]
validate_xmit_skb+0x54d/0xd90 net/core/dev.c:3120
__dev_queue_xmit+0xbf8/0x34c0 net/core/dev.c:3577
dev_queue_xmit+0x17/0x20 net/core/dev.c:3618
UDP_NO_CHECK6_TX sets skb->ip_summed to CHECKSUM_NONE just after the
udp gso integrity checks in udp_(v6_)send_skb. Extend those checks to
catch and fail in this case.
After the integrity checks jump directly to the CHECKSUM_PARTIAL case
to avoid reading the no_check_tx flags again (a TOCTTOU race).
Fixes: bec1f6f69736 ("udp: generate gso with UDP_SEGMENT")
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
The seg6_make_flowlabel() is used by seg6_do_srh_encap() to compute the
flowlabel from a given skb. It relies on skb_get_hash() which eventually
calls __skb_flow_dissect() to extract the flow_keys struct values from
the skb.
In case of IPv4 traffic, calling seg6_make_flowlabel() after skb_push(),
skb_reset_network_header(), and skb_mac_header_rebuild() will results in
flow_keys struct of all key values set to zero.
This patch calls seg6_make_flowlabel() before resetting the headers of skb
to get the right key values.
Extracted Key values are based on the type inner packet as follows:
1) IPv6 traffic: src_IP, dst_IP, L4 proto, and flowlabel of inner packet.
2) IPv4 traffic: src_IP, dst_IP, L4 proto, src_port, and dst_port
3) L2 traffic: depends on what kind of traffic carried into the L2
frame. IPv6 and IPv4 traffic works as discussed 1) and 2)
Here a hex_dump of struct flow_keys for IPv4 and IPv6 traffic
10.100.1.100: 47302 > 30.0.0.2: 5001
00000000: 14 00 02 00 00 00 00 00 08 00 11 00 00 00 00 00
00000010: 00 00 00 00 00 00 00 00 13 89 b8 c6 1e 00 00 02
00000020: 0a 64 01 64
fc00:a1:a > b2::2
00000000: 28 00 03 00 00 00 00 00 86 dd 11 00 99 f9 02 00
00000010: 00 00 00 00 00 00 00 00 00 00 00 00 00 b2 00 00
00000020: 00 00 00 00 00 00 00 00 00 00 00 02 fc 00 00 a1
00000030: 00 00 00 00 00 00 00 00 00 00 00 0a
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Currently the truncated bit is set only when the mirrored packet
is larger than mtu. For certain cases, the packet might already
been truncated before sending to the erspan tunnel. In this case,
the patch detect whether the IP header's total length is larger
than the actual skb->len. If true, this indicated that the
mirrored packet is truncated and set the erspan truncate bit.
I tested the patch using bpf_skb_change_tail helper function to
shrink the packet size and send to erspan tunnel.
Reported-by: Xiaoyan Jin <xiaoyanj@vmware.com>
Signed-off-by: William Tu <u9012063@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
When adding tcp mmap() implementation, I forgot that socket lock
had to be taken before current->mm->mmap_sem. syzbot eventually caught
the bug.
Since we can not lock the socket in tcp mmap() handler we have to
split the operation in two phases.
1) mmap() on a tcp socket simply reserves VMA space, and nothing else.
This operation does not involve any TCP locking.
2) getsockopt(fd, IPPROTO_TCP, TCP_ZEROCOPY_RECEIVE, ...) implements
the transfert of pages from skbs to one VMA.
This operation only uses down_read(¤t->mm->mmap_sem) after
holding TCP lock, thus solving the lockdep issue.
This new implementation was suggested by Andy Lutomirski with great details.
Benefits are :
- Better scalability, in case multiple threads reuse VMAS
(without mmap()/munmap() calls) since mmap_sem wont be write locked.
- Better error recovery.
The previous mmap() model had to provide the expected size of the
mapping. If for some reason one part could not be mapped (partial MSS),
the whole operation had to be aborted.
With the tcp_zerocopy_receive struct, kernel can report how
many bytes were successfuly mapped, and how many bytes should
be read to skip the problematic sequence.
- No more memory allocation to hold an array of page pointers.
16 MB mappings needed 32 KB for this array, potentially using vmalloc() :/
- skbs are freed while mmap_sem has been released
Following patch makes the change in tcp_mmap tool to demonstrate
one possible use of mmap() and setsockopt(... TCP_ZEROCOPY_RECEIVE ...)
Note that memcg might require additional changes.
Fixes: 93ab6cc69162 ("tcp: implement mmap() for zero copy receive")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Suggested-by: Andy Lutomirski <luto@kernel.org>
Cc: linux-mm@kvack.org
Acked-by: Soheil Hassas Yeganeh <soheil@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
| |
Allow specifying segment size in the send call.
The new control message performs the same function as socket option
UDP_SEGMENT while avoiding the extra system call.
[ Export udp_cmsg_send for ipv6. -DaveM ]
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
| |
When sending large datagrams that are later segmented, store data in
page frags to avoid copying from linear in skb_segment.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Support generic segmentation offload for udp datagrams. Callers can
concatenate and send at once the payload of multiple datagrams with
the same destination.
To set segment size, the caller sets socket option UDP_SEGMENT to the
length of each discrete payload. This value must be smaller than or
equal to the relevant MTU.
A follow-up patch adds cmsg UDP_SEGMENT to specify segment size on a
per send call basis.
Total byte length may then exceed MTU. If not an exact multiple of
segment size, the last segment will be shorter.
The implementation adds a gso_size field to the udp socket, ip(v6)
cmsg cookie and inet_cork structure to be able to set the value at
setsockopt or cmsg time and to work with both lockless and corked
paths.
Initial benchmark numbers show UDP GSO about as expensive as TCP GSO.
tcp tso
3197 MB/s 54232 msg/s 54232 calls/s
6,457,754,262 cycles
tcp gso
1765 MB/s 29939 msg/s 29939 calls/s
11,203,021,806 cycles
tcp without tso/gso *
739 MB/s 12548 msg/s 12548 calls/s
11,205,483,630 cycles
udp
876 MB/s 14873 msg/s 624666 calls/s
11,205,777,429 cycles
udp gso
2139 MB/s 36282 msg/s 36282 calls/s
11,204,374,561 cycles
[*] after reverting commit 0a6b2a1dc2a2
("tcp: switch to GSO being always on")
Measured total system cycles ('-a') for one core while pinning both
the network receive path and benchmark process to that core:
perf stat -a -C 12 -e cycles \
./udpgso_bench_tx -C 12 -4 -D "$DST" -l 4
Note the reduction in calls/s with GSO. Bytes per syscall drops
increases from 1470 to 61818.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
Implement generic segmentation offload support for udp datagrams. A
follow-up patch adds support to the protocol stack to generate such
packets.
UDP GSO is not UFO. UFO fragments a single large datagram. GSO splits
a large payload into a number of discrete UDP datagrams.
The implementation adds a GSO type SKB_UDP_GSO_L4 to differentiate it
from UFO (SKB_UDP_GSO).
IPPROTO_UDPLITE is excluded, as that protocol has no gso handler
registered.
[ Export __udp_gso_segment for ipv6. -DaveM ]
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
| |
UDP segmentation offload needs access to inet_cork in the udp layer.
Pass the struct to ip(6)_make_skb instead of allocating it on the
stack in that function itself.
This patch is a noop otherwise.
Signed-off-by: Willem de Bruijn <willemb@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
| |
The addrconf_ifdown() evaluates keep_addr_on_down state twice. There
is no need to do it.
Cc: David Ahern <dsahern@gmail.com>
Signed-off-by: Ivan Vecera <cera@cera.cz>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
|
| |
ECMP (equal-cost multipath) hashes are typically computed on the packets'
5-tuple(src IP, dst IP, src port, dst port, L4 proto).
For encapsulated packets, the L4 data is not readily available and ECMP
hashing will often revert to (src IP, dst IP). This will lead to traffic
polarization on a single ECMP path, causing congestion and waste of network
capacity.
In IPv6, the 20-bit flow label field is also used as part of the ECMP hash.
In the lack of L4 data, the hashing will be on (src IP, dst IP, flow
label). Having a non-zero flow label is thus important for proper traffic
load balancing when L4 data is unavailable (i.e., when packets are
encapsulated).
Currently, the seg6_do_srh_encap() function extracts the original packet's
flow label and set it as the outer IPv6 flow label. There are two issues
with this behaviour:
a) There is no guarantee that the inner flow label is set by the source.
b) If the original packet is not IPv6, the flow label will be set to
zero (e.g., IPv4 or L2 encap).
This patch adds a function, named seg6_make_flowlabel(), that computes a
flow label from a given skb. It supports IPv6, IPv4 and L2 payloads, and
leverages the per namespace 'seg6_flowlabel" sysctl value.
The currently support behaviours are as follows:
-1 set flowlabel to zero.
0 copy flowlabel from Inner paceket in case of Inner IPv6
(Set flowlabel to 0 in case IPv4/L2)
1 Compute the flowlabel using seg6_make_flowlabel()
This patch has been tested for IPv6, IPv4, and L2 traffic.
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
|\ |
|
| |\
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Pablo Neira Ayuso says:
====================
Netfilter/IPVS fixes for net
The following patchset contains Netfilter/IPVS fixes for your net tree,
they are:
1) Fix SIP conntrack with phones sending session descriptions for different
media types but same port numbers, from Florian Westphal.
2) Fix incorrect rtnl_lock mutex logic from IPVS sync thread, from Julian
Anastasov.
3) Skip compat array allocation in ebtables if there is no entries, also
from Florian.
4) Do not lose left/right bits when shifting marks from xt_connmark, from
Jack Ma.
5) Silence false positive memleak in conntrack extensions, from Cong Wang.
6) Fix CONFIG_NF_REJECT_IPV6=m link problems, from Arnd Bergmann.
7) Cannot kfree rule that is already in list in nf_tables, switch order
so this error handling is not required, from Florian Westphal.
8) Release set name in error path, from Florian.
9) include kmemleak.h in nf_conntrack_extend.c, from Stepheh Rothwell.
10) NAT chain and extensions depend on NF_TABLES.
11) Out of bound access when renaming chains, from Taehee Yoo.
12) Incorrect casting in xt_connmark leads to wrong bitshifting.
====================
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
Move these options inside the scope of the 'if' NF_TABLES and
NF_TABLES_IPV6 dependencies. This patch fixes:
net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_nat_do_chain':
>> net/ipv6/netfilter/nft_chain_nat_ipv6.c:37: undefined reference to `nft_do_chain'
net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_exit':
>> net/ipv6/netfilter/nft_chain_nat_ipv6.c:94: undefined reference to `nft_unregister_chain_type'
net/ipv6/netfilter/nft_chain_nat_ipv6.o: In function `nft_chain_nat_ipv6_init':
>> net/ipv6/netfilter/nft_chain_nat_ipv6.c:87: undefined reference to `nft_register_chain_type'
that happens with:
CONFIG_NF_TABLES=m
CONFIG_NFT_CHAIN_NAT_IPV6=y
Fixes: 02c7b25e5f54 ("netfilter: nf_tables: build-in filter chain type")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: Pablo Neira Ayuso <pablo@netfilter.org>
|
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | |
| | | |
KMSAN reported use of uninit-value that I tracked to lack
of proper size check on RTA_TABLE attribute.
I also believe RTA_PREFSRC lacks a similar check.
Fixes: 86872cb57925 ("[IPv6] route: FIB6 configuration using struct fib6_config")
Fixes: c3968a857a6b ("ipv6: RTA_PREFSRC support for ipv6 route source address selection")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |/
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
In case of seg6 in encap mode, seg6_do_srh_encap() calls set_tun_src()
in order to set the src addr of outer IPv6 header.
The net_device is required for set_tun_src(). However calling ip6_dst_idev()
on dst_entry in case of IPv4 traffic results on the following bug.
Using just dst->dev should fix this BUG.
[ 196.242461] BUG: unable to handle kernel NULL pointer dereference at 0000000000000000
[ 196.242975] PGD 800000010f076067 P4D 800000010f076067 PUD 10f060067 PMD 0
[ 196.243329] Oops: 0000 [#1] SMP PTI
[ 196.243468] Modules linked in: nfsd auth_rpcgss nfs_acl nfs lockd grace fscache sunrpc crct10dif_pclmul crc32_pclmul ghash_clmulni_intel pcbc aesni_intel aes_x86_64 crypto_simd cryptd input_leds glue_helper led_class pcspkr serio_raw mac_hid video autofs4 hid_generic usbhid hid e1000 i2c_piix4 ahci pata_acpi libahci
[ 196.244362] CPU: 2 PID: 1089 Comm: ping Not tainted 4.16.0+ #1
[ 196.244606] Hardware name: innotek GmbH VirtualBox/VirtualBox, BIOS VirtualBox 12/01/2006
[ 196.244968] RIP: 0010:seg6_do_srh_encap+0x1ac/0x300
[ 196.245236] RSP: 0018:ffffb2ce00b23a60 EFLAGS: 00010202
[ 196.245464] RAX: 0000000000000000 RBX: ffff8c7f53eea300 RCX: 0000000000000000
[ 196.245742] RDX: 0000f10000000000 RSI: ffff8c7f52085a6c RDI: ffff8c7f41166850
[ 196.246018] RBP: ffffb2ce00b23aa8 R08: 00000000000261e0 R09: ffff8c7f41166800
[ 196.246294] R10: ffffdce5040ac780 R11: ffff8c7f41166828 R12: ffff8c7f41166808
[ 196.246570] R13: ffff8c7f52085a44 R14: ffffffffb73211c0 R15: ffff8c7e69e44200
[ 196.246846] FS: 00007fc448789700(0000) GS:ffff8c7f59d00000(0000) knlGS:0000000000000000
[ 196.247286] CS: 0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 196.247526] CR2: 0000000000000000 CR3: 000000010f05a000 CR4: 00000000000406e0
[ 196.247804] Call Trace:
[ 196.247972] seg6_do_srh+0x15b/0x1c0
[ 196.248156] seg6_output+0x3c/0x220
[ 196.248341] ? prandom_u32+0x14/0x20
[ 196.248526] ? ip_idents_reserve+0x6c/0x80
[ 196.248723] ? __ip_select_ident+0x90/0x100
[ 196.248923] ? ip_append_data.part.50+0x6c/0xd0
[ 196.249133] lwtunnel_output+0x44/0x70
[ 196.249328] ip_send_skb+0x15/0x40
[ 196.249515] raw_sendmsg+0x8c3/0xac0
[ 196.249701] ? _copy_from_user+0x2e/0x60
[ 196.249897] ? rw_copy_check_uvector+0x53/0x110
[ 196.250106] ? _copy_from_user+0x2e/0x60
[ 196.250299] ? copy_msghdr_from_user+0xce/0x140
[ 196.250508] sock_sendmsg+0x36/0x40
[ 196.250690] ___sys_sendmsg+0x292/0x2a0
[ 196.250881] ? _cond_resched+0x15/0x30
[ 196.251074] ? copy_termios+0x1e/0x70
[ 196.251261] ? _copy_to_user+0x22/0x30
[ 196.251575] ? tty_mode_ioctl+0x1c3/0x4e0
[ 196.251782] ? _cond_resched+0x15/0x30
[ 196.251972] ? mutex_lock+0xe/0x30
[ 196.252152] ? vvar_fault+0xd2/0x110
[ 196.252337] ? __do_fault+0x1f/0xc0
[ 196.252521] ? __handle_mm_fault+0xc1f/0x12d0
[ 196.252727] ? __sys_sendmsg+0x63/0xa0
[ 196.252919] __sys_sendmsg+0x63/0xa0
[ 196.253107] do_syscall_64+0x72/0x200
[ 196.253305] entry_SYSCALL_64_after_hwframe+0x3d/0xa2
[ 196.253530] RIP: 0033:0x7fc4480b0690
[ 196.253715] RSP: 002b:00007ffde9f252f8 EFLAGS: 00000246 ORIG_RAX: 000000000000002e
[ 196.254053] RAX: ffffffffffffffda RBX: 0000000000000040 RCX: 00007fc4480b0690
[ 196.254331] RDX: 0000000000000000 RSI: 000000000060a360 RDI: 0000000000000003
[ 196.254608] RBP: 00007ffde9f253f0 R08: 00000000002d1e81 R09: 0000000000000002
[ 196.254884] R10: 00007ffde9f250c0 R11: 0000000000000246 R12: 0000000000b22070
[ 196.255205] R13: 20c49ba5e353f7cf R14: 431bde82d7b634db R15: 00007ffde9f278fe
[ 196.255484] Code: a5 0f b6 45 c0 41 88 41 28 41 0f b6 41 2c 48 c1 e0 04 49 8b 54 01 38 49 8b 44 01 30 49 89 51 20 49 89 41 18 48 8b 83 b0 00 00 00 <48> 8b 30 49 8b 86 08 0b 00 00 48 8b 40 20 48 8b 50 08 48 0b 10
[ 196.256190] RIP: seg6_do_srh_encap+0x1ac/0x300 RSP: ffffb2ce00b23a60
[ 196.256445] CR2: 0000000000000000
[ 196.256676] ---[ end trace 71af7d093603885c ]---
Fixes: 8936ef7604c11 ("ipv6: sr: fix NULL pointer dereference when setting encap source address")
Signed-off-by: Ahmed Abdelsalam <amsalam20@gmail.com>
Acked-by: David Lebrun <dlebrun@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
rt6_remove_exception_rt() is called under rcu_read_lock() only.
We lock rt6_exception_lock a bit later, so we do not hold
rt6_exception_lock yet.
Fixes: 8a14e46f1402 ("net/ipv6: Fix missing rcu dereferences on from")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Cc: David Ahern <dsahern@gmail.com>
Acked-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
kbuild test robot reported 2 uses of rt->from not properly accessed
using rcu_dereference:
1. add rcu_dereference_protected to rt6_remove_exception_rt and make
sure it is always called with rcu lock held.
2. change rt6_do_redirect to take a reference on 'from' when accessed
the first time so it can be used the sceond time outside of the lock
Fixes: a68886a69180 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: kbuild test robot <lkp@intel.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
syzbot reported a suspicious rcu_dereference_check:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
lockdep_rcu_suspicious+0x14a/0x153 kernel/locking/lockdep.c:4592
rt6_check_expired+0x38b/0x3e0 net/ipv6/route.c:410
ip6_negative_advice+0x67/0xc0 net/ipv6/route.c:2204
dst_negative_advice include/net/sock.h:1786 [inline]
sock_setsockopt+0x138f/0x1fe0 net/core/sock.c:1051
__sys_setsockopt+0x2df/0x390 net/socket.c:1899
SYSC_setsockopt net/socket.c:1914 [inline]
SyS_setsockopt+0x34/0x50 net/socket.c:1911
Add rcu locking around call to rt6_check_expired in
ip6_negative_advice.
Fixes: a68886a69180 ("net/ipv6: Make from in rt6_info rcu protected")
Reported-by: syzbot+2422c9e35796659d2273@syzkaller.appspotmail.com
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| | |
Signed-off-by: Roopa Prabhu <roopa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Dan reported an imbalance in fib6_check on use of f6i and checking
whether it is null. Since fib6_check is only called if f6i is non-null,
remove the unnecessary check.
Reported-by: Dan Carpenter <dan.carpenter@oracle.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
When a dst entry is created from a fib entry, the 'from' in rt6_info
is set to the fib entry. The 'from' reference is used most notably for
cookie checking - making sure stale dst entries are updated if the
fib entry is changed.
When a fib entry is deleted, the pcpu routes on it are walked releasing
the fib6_info reference. This is needed for the fib6_info cleanup to
happen and to make sure all device references are released in a timely
manner.
There is a race window when a FIB entry is deleted and the 'from' on the
pcpu route is dropped and the pcpu route hits a cookie check. Handle
this race using rcu on from.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| | |
Code move only; no functional change intended.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| | |
A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
A later patch protects 'from' in rt6_info and this simplifies the
locking needed by it.
With the move, the fib6_info_hold for the uncached_rt is no longer
needed since the rcu_lock is still held.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| | |
rt6_get_cookie_safe takes a fib6_info and checks the sernum of
the node. Update the name to reflect its purpose.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| | |
rt6_clean_expires and rt6_set_expires are no longer used. Removed them.
rt6_update_expires has 1 caller in route.c, so move it from the header.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
If ip6_convert_metrics() fails to allocate memory, it should not
overwrite rt->fib6_metrics or we risk a crash later as syzbot found.
BUG: KASAN: null-ptr-deref in atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
BUG: KASAN: null-ptr-deref in refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
Read of size 4 at addr 0000000000000044 by task syzkaller832429/4487
CPU: 1 PID: 4487 Comm: syzkaller832429 Not tainted 4.16.0+ #6
Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
Call Trace:
__dump_stack lib/dump_stack.c:77 [inline]
dump_stack+0x1b9/0x294 lib/dump_stack.c:113
kasan_report_error mm/kasan/report.c:352 [inline]
kasan_report.cold.7+0x6d/0x2fe mm/kasan/report.c:412
check_memory_region_inline mm/kasan/kasan.c:260 [inline]
check_memory_region+0x13e/0x1b0 mm/kasan/kasan.c:267
kasan_check_read+0x11/0x20 mm/kasan/kasan.c:272
atomic_read include/asm-generic/atomic-instrumented.h:21 [inline]
refcount_sub_and_test+0x92/0x330 lib/refcount.c:179
refcount_dec_and_test+0x1a/0x20 lib/refcount.c:212
fib6_info_destroy+0x2d0/0x3c0 net/ipv6/ip6_fib.c:206
fib6_info_release include/net/ip6_fib.h:304 [inline]
ip6_route_info_create+0x677/0x3240 net/ipv6/route.c:3020
ip6_route_add+0x23/0xb0 net/ipv6/route.c:3030
inet6_rtm_newroute+0x142/0x160 net/ipv6/route.c:4406
rtnetlink_rcv_msg+0x466/0xc10 net/core/rtnetlink.c:4648
netlink_rcv_skb+0x172/0x440 net/netlink/af_netlink.c:2448
rtnetlink_rcv+0x1c/0x20 net/core/rtnetlink.c:4666
netlink_unicast_kernel net/netlink/af_netlink.c:1310 [inline]
netlink_unicast+0x58b/0x740 net/netlink/af_netlink.c:1336
netlink_sendmsg+0x9f0/0xfa0 net/netlink/af_netlink.c:1901
sock_sendmsg_nosec net/socket.c:629 [inline]
sock_sendmsg+0xd5/0x120 net/socket.c:639
___sys_sendmsg+0x805/0x940 net/socket.c:2117
__sys_sendmsg+0x115/0x270 net/socket.c:2155
SYSC_sendmsg net/socket.c:2164 [inline]
SyS_sendmsg+0x29/0x30 net/socket.c:2162
do_syscall_64+0x29e/0x9d0 arch/x86/entry/common.c:287
entry_SYSCALL_64_after_hwframe+0x42/0xb7
Fixes: d4ead6b34b67 ("net/ipv6: move metrics from dst to rt6_info")
Signed-off-by: Eric Dumazet <edumazet@google.com>
Cc: David Ahern <dsa@cumulusnetworks.com>
Reported-by: syzbot <syzkaller@googlegroups.com>
Acked-by: David Ahern <dsa@cumulusnetworks.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Eric noticed that __ipv6_ifa_notify is called under rcu_read_lock, so
the gfp argument to addrconf_prefix_route can not be GFP_KERNEL.
While scrubbing other calls I noticed addrconf_addr_gen has one
place with GFP_ATOMIC that can be GFP_KERNEL.
Fixes: acb54e3cba404 ("net/ipv6: Add gfp_flags to route add functions")
Reported-by: syzbot+2add39b05179b31f912f@syzkaller.appspotmail.com
Reported-by: Eric Dumazet <eric.dumazet@gmail.com>
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| | |
fib6_idev can be obtained from __in6_dev_get on the nexthop device
rather than caching it in the fib6_info. Remove it.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Prior to 4832c30d5458 ("net: ipv6: put host and anycast routes on device
with address") host routes and anycast routes were installed with the
device set to loopback (or VRF device once that feature was added). In the
older code dst.dev was set to loopback (needed for packet tx) and rt6i_idev
was used to denote the actual interface.
Commit 4832c30d5458 changed the code to have dst.dev pointing to the real
device with the switch to lo or vrf device done on dst clones. As a
consequence of this change a couple of device checks during route lookups
are no longer needed. Remove them.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
aca_idev has only 1 user - inet6_fill_ifacaddr - and it only
wants the device index which can be extracted from the fib6_info
nexthop.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
addrconf_dst_alloc now returns a fib6_info. Update the name
and its users to reflect the change.
Rename only; no functional change intended.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Change the prefix for fib6_info struct elements from rt6i_ to fib6_.
rt6i_pcpu and rt6i_exception_bucket are left as is given that they
point to rt6_info entries.
Rename only; not functional change intended.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
lockdep does not know that the locks used by IPv4 defrag
and IPv6 reassembly units are of different classes.
It complains because of following chains :
1) sch_direct_xmit() (lock txq->_xmit_lock)
dev_hard_start_xmit()
xmit_one()
dev_queue_xmit_nit()
packet_rcv_fanout()
ip_check_defrag()
ip_defrag()
spin_lock() (lock frag queue spinlock)
2) ip6_input_finish()
ipv6_frag_rcv() (lock frag queue spinlock)
ip6_frag_queue()
icmpv6_param_prob() (lock txq->_xmit_lock at some point)
We could add lockdep annotations, but we also can make sure IPv6
calls icmpv6_param_prob() only after the release of the frag queue spinlock,
since this naturally makes frag queue spinlock a leaf in lock hierarchy.
Signed-off-by: Eric Dumazet <edumazet@google.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| | |
Drop unneeded elements from rt6_info struct and rearrange layout to
something more relevant for the data path.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| | |
Convert all code paths referencing a FIB entry from
rt6_info to fib6_info.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Last step before flipping the data type for FIB entries:
- use fib6_info_alloc to create FIB entries in ip6_route_info_create
and addrconf_dst_alloc
- use fib6_info_release in place of dst_release, ip6_rt_put and
rt6_release
- remove the dst_hold before calling __ip6_ins_rt or ip6_del_rt
- when purging routes, drop per-cpu routes
- replace inc and dec of rt6i_ref with fib6_info_hold and fib6_info_release
- use rt->from since it points to the FIB entry
- drop references to exception bucket, fib6_metrics and per-cpu from
dst entries (those are relevant for fib entries only)
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| | |
Add fib6_info struct and alloc, destroy, hold and release helpers.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
IPv6 FIB will only contain FIB entries with exception routes added to
the FIB entry. Once this transformation is complete, FIB lookups will
return a fib6_info with the lookup functions still returning a dst
based rt6_info. The current code uses rt6_info for both paths and
overloads the rt6_info variable usually called 'rt'.
This patch introduces a new 'f6i' variable name for the result of the FIB
lookup and keeps 'rt' as the dst based return variable. 'f6i' becomes a
fib6_info in a later patch which is why it is introduced as f6i now;
avoids the additional churn in the later patch.
In addition, remove RTF_CACHE and dst checks from fib6 add and delete
since they can not happen now and will never happen after the data
type flip.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Most FIB entries can be added using memory allocated with GFP_KERNEL.
Add gfp_flags to ip6_route_add and addrconf_dst_alloc. Code paths that
can be reached from the packet path (e.g., ndisc and autoconfig) or
atomic notifiers use GFP_ATOMIC; paths from user context (adding
addresses and routes) use GFP_KERNEL.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
The router discovery code has a FIB entry and wants to validate the
gateway has a neighbor entry. Refactor the existing dst_neigh_lookup
for IPv6 and create a new function that takes the gateway and device
and returns a neighbor entry. Use the new function in
ndisc_router_discovery to validate the gateway.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Continuing to wean FIB paths off of dst_entry, use a bool to hold
requests for certain dst settings. Add a helper to convert the
flags to DST flags when a FIB entry is converted to a dst_entry.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
ip6_pol_route_lookup is the lookup function for ip6_route_lookup and
rt6_lookup. At the moment it returns either a reference to a FIB entry
or a cached exception. To move FIB entries to a separate struct, this
lookup function needs to convert FIB entries to an rt6_info that is
returned to the caller.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
ip6_null_entry will stay a dst based return for lookups that fail to
match an entry.
Add a new fib6_null_entry which constitutes the root node and leafs
for fibs. Replace existing references to ip6_null_entry with the
new fib6_null_entry when dealing with FIBs.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Add expires to rt6_info for FIB entries, and add fib6 helpers to
manage it. Data path use of dst.expires remains.
The transition is fairly straightforward: when working with fib entries,
rt->dst.expires is just rt->expires, rt6_clean_expires is replaced with
fib6_clean_expires, rt6_set_expires becomes fib6_set_expires, and
rt6_check_expired becomes fib6_check_expired, where the fib6 versions
are added by this patch.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Similar to IPv4, add fib metrics to the fib struct, which at the moment
is rt6_info. Will be moved to fib6_info in a later patch. Copy metrics
into dst by reference using refcount.
To make the transition:
- add dst_metrics to rt6_info. Default to dst_default_metrics if no
metrics are passed during route add. No need for a separate pmtu
entry; it can reference the MTU slot in fib6_metrics
- ip6_convert_metrics allocates memory in the FIB entry and uses
ip_metrics_convert to copy from netlink attribute to metrics entry
- the convert metrics call is done in ip6_route_info_create simplifying
the route add path
+ fib6_commit_metrics and fib6_copy_metrics and the temporary
mx6_config are no longer needed
- add fib6_metric_set helper to change the value of a metric in the
fib entry since dst_metric_set can no longer be used
- cow_metrics for IPv6 can drop to dst_cow_metrics_generic
- rt6_dst_from_metrics_check is no longer needed
- rt6_fill_node needs the FIB entry and dst as separate arguments to
keep compatibility with existing output. Current dst address is
renamed to dest.
(to be consistent with IPv4 rt6_fill_node really should be split
into 2 functions similar to fib_dump_info and rt_fill_info)
- rt6_fill_node no longer needs the temporary metrics variable
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Defer setting dst input, output and error until fib entry is copied.
The reject path from ip6_route_info_create is moved to a new function
ip6_rt_init_dst_reject with a helper doing the conversion from fib6_type
to dst error.
The remainder of the new ip6_rt_init_dst is an amalgamtion of dst code
from addrconf_dst_alloc and the non-reject path of ip6_route_info_create.
The dst output function is always ip6_output and the input function is
either ip6_input (local routes), ip6_mc_input (multicast routes) or
ip6_forward (anything else).
A couple of places using dst.error are updated to look at rt6i_flags.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|
| |
| |
| |
| |
| |
| |
| |
| |
| |
| | |
Introduce fib6_nh structure and move nexthop related data from
rt6_info and rt6_info.dst to fib6_nh. References to dev, gateway or
lwtstate from a FIB lookup perspective are converted to use fib6_nh;
datapath references to dst version are left as is.
Signed-off-by: David Ahern <dsahern@gmail.com>
Signed-off-by: David S. Miller <davem@davemloft.net>
|