Merge branch 'ip_forward_pmtu'

Hannes Frederic Sowa says: ==================== path mtu hardening patches After a lot of back and forth I want to propose these changes regarding path mtu hardening and give an outline why I think this is the best way how to proceed: This set contains the following patches: * ipv4: introduce ip_dst_mtu_maybe_forward and protect forwarding path against pmtu spoofing * ipv6: introduce ip6_dst_mtu_forward and protect forwarding path with it * ipv4: introduce hardened ip_no_pmtu_disc mode The first one switches the forwarding path of IPv4 to use the interface mtu by default and ignore a possible discovered path mtu. It provides a sysctl to switch back to the original behavior (see discussion below). The second patch does the same thing unconditionally for IPv6. I don't provide a knob for IPv6 to switch to original behavior (please see below). The third patch introduces a hardened pmtu mode, where only pmtu information are accepted where the protocol is able to do more stringent checks on the icmp piggyback payload (please see the patch commit msg for further details). Why is this change necessary? First of all, RFC 1191 4. Router specification says: "When a router is unable to forward a datagram because it exceeds the MTU of the next-hop network and its Don't Fragment bit is set, the router is required to return an ICMP Destination Unreachable message to the source of the datagram, with the Code indicating "fragmentation needed and DF set". ..." For some time now fragmentation has been considered problematic, e.g.: * http://www.hpl.hp.com/techreports/Compaq-DEC/WRL-87-3.pdf * http://tools.ietf.org/search/rfc4963 Most of them seem to agree that fragmentation should be avoided because of efficiency, data corruption or security concerns. Recently it was shown possible that correctly guessing IP ids could lead to data injection on DNS packets: <https://sites.google.com/site/hayashulman/files/fragmentation-poisoning.pdf> While we can try to completly stop fragmentation on the end host (this is e.g. implemented via IP_PMTUDISC_INTERFACE), we cannot stop fragmentation completly on the forwarding path. On the end host the application has to deal with MTUs and has to choose fallback methods if fragmentation could be an attack vector. This is already the case for most DNS software, where a maximum UDP packet size can be configured. But until recently they had no control over local fragmentation and could thus emit fragmented packets. On the forwarding path we can just try to delay the fragmentation to the last hop where this is really necessary. Current kernel already does that but only because routers don't receive feedback of path mtus, these are only send back to the end host system. But it is possible to maliciously insert path mtu inforamtion via ICMP packets which have an icmp echo_reply payload, because we cannot validate those notifications against local sockets. DHCP clients which establish an any-bound RAW-socket could also start processing unwanted fragmentation-needed packets. Why does IPv4 has a knob to revert to old behavior while IPv6 doesn't? IPv4 does fragmentation on the path while IPv6 does always respond with packet-too-big errors. The interface MTU will always be greater than the path MTU information. So we would discard packets we could actually forward because of malicious information. After this change we would let the hop, which really could not forward the packet, notify the host of this problem. IPv4 allowes fragmentation mid-path. In case someone does use a software which tries to discover such paths and assumes that the kernel is handling the discovered pmtu information automatically. This should be an extremly rare case, but because I could not exclude the possibility this knob is provided. Also this software could insert non-locked mtu information into the kernel. We cannot distinguish that from path mtu information currently. Premature fragmentation could solve some problems in wrongly configured networks, thus this switch is provided. One frag-needed packet could reduce the path mtu down to 522 bytes (route/min_pmtu). Misc: IPv6 neighbor discovery could advertise mtu information for an interface. These information update the ipv6-specific interface mtu and thus get used by the forwarding path. Tunnel and xfrm output path will still honour path mtu and also respond with Packet-too-Big or fragmentation-needed errors if needed. Changelog for all patches: v2) * enabled ip_forward_use_pmtu by default * reworded v3) * disabled ip_forward_use_pmtu by default * reworded v4) * renamed ip_dst_mtu_secure to ip_dst_mtu_maybe_forward * updated changelog accordingly * removed unneeded !!(... & ...) double negations v2) * by default we honour pmtu information 3) * only honor interface mtu * rewritten and simplified * no knob to fall back to old mode any more v2) * reworded Documentation ==================== Signed-off-by: David S. Miller <davem@davemloft.net>
author: David S. Miller <davem@davemloft.net> 2014-01-13 20:23:02 +0100
committer: David S. Miller <davem@davemloft.net> 2014-01-13 20:23:02 +0100
commit: c139cd3b43eac8708849755ed14286d0e9d2e589 (patch)
tree: 4a8093c0dfa1507ce234754a54db8a95399955a4 /net
parent: HHF qdisc: fix jiffies-time conversion. (diff)
parent: ipv4: introduce hardened ip_no_pmtu_disc mode (diff)
download: linux-c139cd3b43eac8708849755ed14286d0e9d2e589.tar.xz
linux-c139cd3b43eac8708849755ed14286d0e9d2e589.zip
9 files changed, 66 insertions, 13 deletions
diff --git a/net/dccp/ipv4.c b/net/dccp/ipv4.c
index 88299c29101d..22b5d818b200 100644
--- a/net/dccp/ipv4.c
+++ b/net/dccp/ipv4.c
@@ -989,6 +989,7 @@ static const struct net_protocol dccp_v4_protocol = {
 	.err_handler	= dccp_v4_err,
 	.no_policy	= 1,
 	.netns_ok	= 1,
+	.icmp_strict_tag_validation = 1,
 };
 
 static const struct proto_ops inet_dccp_ops = {
diff --git a/net/ipv4/af_inet.c b/net/ipv4/af_inet.c
index 6268a4751e64..ecd2c3f245ce 100644
--- a/net/ipv4/af_inet.c
+++ b/net/ipv4/af_inet.c
@@ -1545,6 +1545,7 @@ static const struct net_protocol tcp_protocol = {
 	.err_handler	=	tcp_v4_err,
 	.no_policy	=	1,
 	.netns_ok	=	1,
+	.icmp_strict_tag_validation = 1,
 };
 
 static const struct net_protocol udp_protocol = {
diff --git a/net/ipv4/icmp.c b/net/ipv4/icmp.c
index fb3c5637199d..0134663fdbce 100644
--- a/net/ipv4/icmp.c
+++ b/net/ipv4/icmp.c
@@ -668,6 +668,16 @@ static void icmp_socket_deliver(struct sk_buff *skb, u32 info)
 	rcu_read_unlock();
 }
 
+static bool icmp_tag_validation(int proto)
+{
+	bool ok;
+
+	rcu_read_lock();
+	ok = rcu_dereference(inet_protos[proto])->icmp_strict_tag_validation;
+	rcu_read_unlock();
+	return ok;
+}
+
 /*
  *	Handle ICMP_DEST_UNREACH, ICMP_TIME_EXCEED, ICMP_QUENCH, and
  *	ICMP_PARAMETERPROB.
@@ -705,12 +715,22 @@ static void icmp_unreach(struct sk_buff *skb)
 		case ICMP_PORT_UNREACH:
 			break;
 		case ICMP_FRAG_NEEDED:
-			if (net->ipv4.sysctl_ip_no_pmtu_disc == 2) {
-				goto out;
-			} else if (net->ipv4.sysctl_ip_no_pmtu_disc) {
+			/* for documentation of the ip_no_pmtu_disc
+			 * values please see
+			 * Documentation/networking/ip-sysctl.txt
+			 */
+			switch (net->ipv4.sysctl_ip_no_pmtu_disc) {
+			default:
 				LIMIT_NETDEBUG(KERN_INFO pr_fmt("%pI4: fragmentation needed and DF set\n"),
 					       &iph->daddr);
-			} else {
+				break;
+			case 2:
+				goto out;
+			case 3:
+				if (!icmp_tag_validation(iph->protocol))
+					goto out;
+				/* fall through */
+			case 0:
 				info = ntohs(icmph->un.frag.mtu);
 				if (!info)
 					goto out;
diff --git a/net/ipv4/ip_forward.c b/net/ipv4/ip_forward.c
index 694de3b7aebf..e9f1217a8afd 100644
--- a/net/ipv4/ip_forward.c
+++ b/net/ipv4/ip_forward.c
@@ -54,6 +54,7 @@ static int ip_forward_finish(struct sk_buff *skb)
 
 int ip_forward(struct sk_buff *skb)
 {
+	u32 mtu;
 	struct iphdr *iph;	/* Our header */
 	struct rtable *rt;	/* Route we use */
 	struct ip_options *opt	= &(IPCB(skb)->opt);
@@ -88,11 +89,13 @@ int ip_forward(struct sk_buff *skb)
 	if (opt->is_strictroute && rt->rt_uses_gateway)
 		goto sr_failed;
 
-	if (unlikely(skb->len > dst_mtu(&rt->dst) && !skb_is_gso(skb) &&
+	IPCB(skb)->flags |= IPSKB_FORWARDED;
+	mtu = ip_dst_mtu_maybe_forward(&rt->dst, true);
+	if (unlikely(skb->len > mtu && !skb_is_gso(skb) &&
 		     (ip_hdr(skb)->frag_off & htons(IP_DF))) && !skb->local_df) {
 		IP_INC_STATS(dev_net(rt->dst.dev), IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
-			  htonl(dst_mtu(&rt->dst)));
+			  htonl(mtu));
 		goto drop;
 	}
 
diff --git a/net/ipv4/ip_output.c b/net/ipv4/ip_output.c
index df184616493f..9a78804cfe9c 100644
--- a/net/ipv4/ip_output.c
+++ b/net/ipv4/ip_output.c
@@ -449,6 +449,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	__be16 not_last_frag;
 	struct rtable *rt = skb_rtable(skb);
 	int err = 0;
+	bool forwarding = IPCB(skb)->flags & IPSKB_FORWARDED;
 
 	dev = rt->dst.dev;
 
@@ -458,12 +459,13 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 
 	iph = ip_hdr(skb);
 
+	mtu = ip_dst_mtu_maybe_forward(&rt->dst, forwarding);
 	if (unlikely(((iph->frag_off & htons(IP_DF)) && !skb->local_df) ||
 		     (IPCB(skb)->frag_max_size &&
-		      IPCB(skb)->frag_max_size > dst_mtu(&rt->dst)))) {
+		      IPCB(skb)->frag_max_size > mtu))) {
 		IP_INC_STATS(dev_net(dev), IPSTATS_MIB_FRAGFAILS);
 		icmp_send(skb, ICMP_DEST_UNREACH, ICMP_FRAG_NEEDED,
-			  htonl(ip_skb_dst_mtu(skb)));
+			  htonl(mtu));
 		kfree_skb(skb);
 		return -EMSGSIZE;
 	}
@@ -473,7 +475,7 @@ int ip_fragment(struct sk_buff *skb, int (*output)(struct sk_buff *))
 	 */
 
 	hlen = iph->ihl * 4;
-	mtu = dst_mtu(&rt->dst) - hlen;	/* Size of data space */
+	mtu = mtu - hlen;	/* Size of data space */
 #ifdef CONFIG_BRIDGE_NETFILTER
 	if (skb->nf_bridge)
 		mtu -= nf_bridge_mtu_reduction(skb);
diff --git a/net/ipv4/route.c b/net/ipv4/route.c
index f8da28278014..25071b48921c 100644
--- a/net/ipv4/route.c
+++ b/net/ipv4/route.c
@@ -112,9 +112,6 @@
 #define RT_FL_TOS(oldflp4) \
 	((oldflp4)->flowi4_tos & (IPTOS_RT_MASK | RTO_ONLINK))
 
-/* IPv4 datagram length is stored into 16bit field (tot_len) */
-#define IP_MAX_MTU	0xFFFF
-
 #define RT_GC_TIMEOUT (300*HZ)
 
 static int ip_rt_max_size;
diff --git a/net/ipv4/sysctl_net_ipv4.c b/net/ipv4/sysctl_net_ipv4.c
index 1d2480ac2bb6..44eba052b43d 100644
--- a/net/ipv4/sysctl_net_ipv4.c
+++ b/net/ipv4/sysctl_net_ipv4.c
@@ -831,6 +831,13 @@ static struct ctl_table ipv4_net_table[] = {
 		.mode		= 0644,
 		.proc_handler	= proc_dointvec
 	},
+	{
+		.procname	= "ip_forward_use_pmtu",
+		.data		= &init_net.ipv4.sysctl_ip_fwd_use_pmtu,
+		.maxlen		= sizeof(int),
+		.mode		= 0644,
+		.proc_handler	= proc_dointvec,
+	},
 	{ }
 };
 
diff --git a/net/ipv6/ip6_output.c b/net/ipv6/ip6_output.c
index d1de9560c421..ef02b26ccf81 100644
--- a/net/ipv6/ip6_output.c
+++ b/net/ipv6/ip6_output.c
@@ -321,6 +321,27 @@ static inline int ip6_forward_finish(struct sk_buff *skb)
 	return dst_output(skb);
 }
 
+static unsigned int ip6_dst_mtu_forward(const struct dst_entry *dst)
+{
+	unsigned int mtu;
+	struct inet6_dev *idev;
+
+	if (dst_metric_locked(dst, RTAX_MTU)) {
+		mtu = dst_metric_raw(dst, RTAX_MTU);
+		if (mtu)
+			return mtu;
+	}
+
+	mtu = IPV6_MIN_MTU;
+	rcu_read_lock();
+	idev = __in6_dev_get(dst->dev);
+	if (idev)
+		mtu = idev->cnf.mtu6;
+	rcu_read_unlock();
+
+	return mtu;
+}
+
 int ip6_forward(struct sk_buff *skb)
 {
 	struct dst_entry *dst = skb_dst(skb);
@@ -441,7 +462,7 @@ int ip6_forward(struct sk_buff *skb)
 		}
 	}
 
-	mtu = dst_mtu(dst);
+	mtu = ip6_dst_mtu_forward(dst);
 	if (mtu < IPV6_MIN_MTU)
 		mtu = IPV6_MIN_MTU;
 
diff --git a/net/sctp/protocol.c b/net/sctp/protocol.c
index 34b7726bcd7f..7c161084f241 100644
--- a/net/sctp/protocol.c
+++ b/net/sctp/protocol.c
@@ -1030,6 +1030,7 @@ static const struct net_protocol sctp_protocol = {
 	.err_handler = sctp_v4_err,
 	.no_policy   = 1,
 	.netns_ok    = 1,
+	.icmp_strict_tag_validation = 1,
 };
 
 /* IPv4 address related functions.  */
author	David S. Miller <davem@davemloft.net>	2014-01-13 20:23:02 +0100
committer	David S. Miller <davem@davemloft.net>	2014-01-13 20:23:02 +0100
commit	c139cd3b43eac8708849755ed14286d0e9d2e589 (patch)
tree	4a8093c0dfa1507ce234754a54db8a95399955a4 /net
parent	HHF qdisc: fix jiffies-time conversion. (diff)
parent	ipv4: introduce hardened ip_no_pmtu_disc mode (diff)
download	linux-c139cd3b43eac8708849755ed14286d0e9d2e589.tar.xz linux-c139cd3b43eac8708849755ed14286d0e9d2e589.zip