linux-stable.git/net/ipv4, branch v4.0.8

ip: report the original address of ICMP messages

2015-07-10T16:45:35+00:00

[ Upstream commit 34b99df4e6256ddafb663c6de0711dceceddfe0e ]

ICMP messages can trigger ICMP and local errors. In this case
serr->port is 0 and starting from Linux 4.0 we do not return
the original target address to the error queue readers.
Add a function to define which errors provide addr_offset.
With this fix my ping command is not silent anymore.
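
The rule for "which errors provide addr_offset" can be sketched in
userspace C. This is a simplified model, not the kernel code: the enum,
struct err_info and error_provides_addr() are hypothetical names, and the
rule itself (ICMP and local errors always report the original address,
other origins only when a port is recorded) is an assumption drawn from
the description above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical stand-ins for the error origins seen by error queue readers. */
enum err_origin { ORIGIN_NONE, ORIGIN_LOCAL, ORIGIN_ICMP, ORIGIN_OTHER };

/* Hypothetical, simplified view of one queued socket error. */
struct err_info {
    enum err_origin origin;
    uint16_t port;  /* 0 when the error was triggered by an ICMP message */
};

/* Assumed rule: ICMP and local errors always report the original target
 * address; any other origin reports it only when a port was recorded. */
static bool error_provides_addr(const struct err_info *e)
{
    return e->origin == ORIGIN_ICMP ||
           e->origin == ORIGIN_LOCAL ||
           e->port != 0;
}
```

In this model a check on the port alone would return false for an ICMP
error with port 0, which matches the silent ping described above.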

Fixes: c247f0534cc5 ("ip: fix error queue empty skb handling")
Signed-off-by: Julian Anastasov 
Acked-by: Willem de Bruijn 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: Do not call tcp_fastopen_reset_cipher from interrupt context

2015-07-10T16:45:34+00:00

[ Upstream commit dfea2aa654243f70dc53b8648d0bbdeec55a7df1 ]

tcp_fastopen_reset_cipher really cannot be called from interrupt
context. It allocates the tcp_fastopen_context with GFP_KERNEL and
calls crypto_alloc_cipher, which allocates all kinds of stuff with
GFP_KERNEL.

Thus, we might sleep when the key-generation is triggered by an
incoming TFO cookie-request which would then happen in interrupt-
context, as shown by enabling CONFIG_DEBUG_ATOMIC_SLEEP:

[   36.001813] BUG: sleeping function called from invalid context at mm/slub.c:1266
[   36.003624] in_atomic(): 1, irqs_disabled(): 0, pid: 1016, name: packetdrill
[   36.004859] CPU: 1 PID: 1016 Comm: packetdrill Not tainted 4.1.0-rc7 #14
[   36.006085] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS rel-1.7.5-0-ge51488c-20140602_164612-nilsson.home.kraxel.org 04/01/2014
[   36.008250]  00000000000004f2 ffff88007f8838a8 ffffffff8171d53a ffff880075a084a8
[   36.009630]  ffff880075a08000 ffff88007f8838c8 ffffffff810967d3 ffff88007f883928
[   36.011076]  0000000000000000 ffff88007f8838f8 ffffffff81096892 ffff88007f89be00
[   36.012494] Call Trace:
[   36.012953]    [] dump_stack+0x4f/0x6d
[   36.014085]  [] ___might_sleep+0x103/0x170
[   36.015117]  [] __might_sleep+0x52/0x90
[   36.016117]  [] kmem_cache_alloc_trace+0x47/0x190
[   36.017266]  [] ? tcp_fastopen_reset_cipher+0x42/0x130
[   36.018485]  [] tcp_fastopen_reset_cipher+0x42/0x130
[   36.019679]  [] tcp_fastopen_init_key_once+0x61/0x70
[   36.020884]  [] __tcp_fastopen_cookie_gen+0x1c/0x60
[   36.022058]  [] tcp_try_fastopen+0x58f/0x730
[   36.023118]  [] tcp_conn_request+0x3e8/0x7b0
[   36.024185]  [] ? __module_text_address+0x12/0x60
[   36.025327]  [] tcp_v4_conn_request+0x51/0x60
[   36.026410]  [] tcp_rcv_state_process+0x190/0xda0
[   36.027556]  [] ? __inet_lookup_established+0x47/0x170
[   36.028784]  [] tcp_v4_do_rcv+0x16d/0x3d0
[   36.029832]  [] ? security_sock_rcv_skb+0x16/0x20
[   36.030936]  [] tcp_v4_rcv+0x77a/0x7b0
[   36.031875]  [] ? iptable_filter_hook+0x33/0x70
[   36.032953]  [] ip_local_deliver_finish+0x92/0x1f0
[   36.034065]  [] ip_local_deliver+0x9a/0xb0
[   36.035069]  [] ? ip_rcv+0x3d0/0x3d0
[   36.035963]  [] ip_rcv_finish+0x119/0x330
[   36.036950]  [] ip_rcv+0x2e7/0x3d0
[   36.037847]  [] __netif_receive_skb_core+0x552/0x930
[   36.038994]  [] __netif_receive_skb+0x27/0x70
[   36.040033]  [] process_backlog+0xd2/0x1f0
[   36.041025]  [] net_rx_action+0x122/0x310
[   36.042007]  [] __do_softirq+0x103/0x2f0
[   36.042978]  [] do_softirq_own_stack+0x1c/0x30

This patch moves the call to tcp_fastopen_init_key_once to the places
where a listener socket creates its TFO-state, which always happens in
user-context (either from the setsockopt, or implicitly during the
listen()-call).

Cc: Eric Dumazet 
Cc: Hannes Frederic Sowa 
Fixes: 222e83d2e0ae ("tcp: switch tcp_fastopen key generation to net_get_random_once")
Signed-off-by: Christoph Paasch 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

ipv4/udp: Verify multicast group is ours in udp_v4_early_demux()

2015-06-23T00:03:23+00:00

[ Upstream commit 6e540309326188f769e03bb4c6dd8ff6752930c2 ]

421b3885bf6d56391297844f43fb7154a6396e12 "udp: ipv4: Add udp early
demux" introduced a regression that allowed sockets bound to INADDR_ANY
to receive packets from multicast groups that the socket had not joined.
For example a socket that had joined 224.168.2.9 could also receive
packets from 225.168.2.9 despite not having joined that group if
ip_early_demux is enabled.

Fix this by calling ip_check_mc_rcu() in udp_v4_early_demux() to verify
that the multicast packet is indeed ours.
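
The membership check can be illustrated with a small userspace model. It
does not use ip_check_mc_rcu() or any kernel structure: struct
mc_membership, group_is_joined() and early_demux_accepts() are
hypothetical names, and addresses are plain host-order integers.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model of the multicast groups joined on the receiving side. */
struct mc_membership {
    const uint32_t *groups;  /* joined group addresses, host byte order */
    size_t count;
};

static bool group_is_joined(const struct mc_membership *m, uint32_t daddr)
{
    for (size_t i = 0; i < m->count; i++)
        if (m->groups[i] == daddr)
            return true;
    return false;
}

/* Early demux may pick a socket for a multicast packet only when the
 * destination group was actually joined. */
static bool early_demux_accepts(const struct mc_membership *m, uint32_t daddr)
{
    bool is_multicast = (daddr & 0xF0000000u) == 0xE0000000u;

    if (!is_multicast)
        return true;  /* unicast handling is outside this sketch */
    return group_is_joined(m, daddr);
}
```

With 224.168.2.9 (0xE0A80209) as the only joined group, the model accepts
that destination and rejects 225.168.2.9 (0xE1A80209), which is the case
from the example above.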

Signed-off-by: Shawn Bohrer 
Reported-by: Yurij M. Plotnikov 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: fix child sockets to use system default congestion control if not set

2015-06-23T00:03:23+00:00

[ Upstream commit 9f950415e4e28e7cfae2e416b43e862e8101d996 ]

Linux 3.17 and earlier are explicitly engineered so that if the app
doesn't specifically request a CC module on a listener before the SYN
arrives, then the child gets the system default CC when the connection
is established. See tcp_init_congestion_control() in 3.17 or earlier,
which says "if no choice made yet assign the current value set as
default". The change ("net: tcp: assign tcp cong_ops when tcp sk is
created") altered these semantics, so that children got their parent
listener's congestion control even if the system default had changed
after the listener was created.

This commit returns to those original semantics from 3.17 and earlier,
since they are the original semantics from 2007 in 4d4d3d1e8 ("[TCP]:
Congestion control initialization."), and some Linux congestion
control workflows depend on that.

In summary, if a listener socket specifically sets TCP_CONGESTION to
"x", or the route locks the CC module to "x", then the child gets
"x". Otherwise the child gets the current system default from
net.ipv4.tcp_congestion_control. That's the behavior in 3.17 and
earlier, and this commit restores that.
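
The selection rule in this summary can be written down as a small model.
struct listener_model and child_cc() are hypothetical names, and the
order in which the explicit setting and the route lock are checked is an
arbitrary choice for the sketch; the text above only says that either
one wins over the system default.

```c
#include <stddef.h>

/* Hypothetical view of what a listener knows about congestion control. */
struct listener_model {
    const char *explicit_cc;  /* set through TCP_CONGESTION, or NULL */
    const char *route_cc;     /* CC module locked by the route, or NULL */
};

/* Pick the congestion control for a child socket. The system default is
 * read when the child is created, not when the listener was created. */
static const char *child_cc(const struct listener_model *l,
                            const char *system_default)
{
    if (l->explicit_cc)
        return l->explicit_cc;
    if (l->route_cc)
        return l->route_cc;
    return system_default;
}
```

Passing the default as an argument at child creation time is the point
of the sketch: a listener that never chose a module does not pin the
default that was in effect when it was created.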

Fixes: 55d8694fa82c ("net: tcp: assign tcp cong_ops when tcp sk is created")
Cc: Florian Westphal 
Cc: Daniel Borkmann 
Cc: Glenn Judd 
Cc: Stephen Hemminger 
Signed-off-by: Neal Cardwell 
Signed-off-by: Eric Dumazet 
Signed-off-by: Yuchung Cheng 
Acked-by: Daniel Borkmann 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

udp: fix behavior of wrong checksums

2015-06-23T00:03:23+00:00

[ Upstream commit beb39db59d14990e401e235faf66a6b9b31240b0 ]

We have two problems in the UDP stack related to bogus checksums:

1) We return -EAGAIN to the application even if the receive queue is
   not empty. This breaks applications using edge-triggered epoll().

2) Under UDP flood, we can loop forever without yielding to other
   processes, potentially hanging the host, especially on non-SMP.

This patch is an attempt to make things better.
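
One possible shape of such a receive loop, as a userspace model only:
struct dgram_model, struct rcvq_model and recv_model() are hypothetical
names, and sched_yield() stands in for whatever yielding the kernel does.
The sketch assumes that a datagram with a bad checksum is dropped and the
next one is tried, so -EAGAIN is returned only for an empty queue.

```c
#include <errno.h>
#include <sched.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical queued datagram: payload length and checksum verdict. */
struct dgram_model {
    int len;
    bool csum_ok;
};

/* Hypothetical receive queue, consumed from head. */
struct rcvq_model {
    const struct dgram_model *pkts;
    size_t count;
    size_t head;
};

/* Return the length of the next valid datagram, or -EAGAIN only when the
 * queue is really empty. Bad datagrams are dropped, and the loop yields
 * after each drop so a flood cannot monopolize the CPU. */
static int recv_model(struct rcvq_model *q)
{
    while (q->head < q->count) {
        const struct dgram_model *d = &q->pkts[q->head++];

        if (d->csum_ok)
            return d->len;
        sched_yield();  /* assumption: stand-in for yielding in the kernel */
    }
    return -EAGAIN;
}
```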

We might in the future add extra support for rt applications
wanting to better control time spent doing a recv() in a hostile
environment. For example we could validate checksums before queuing
packets in socket receive queue.

Signed-off-by: Eric Dumazet 
Cc: Willem de Bruijn 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

ipv4: Avoid crashing in ip_error

2015-06-23T00:03:21+00:00

[ Upstream commit 381c759d9916c42959515ad34a6d467e24a88e93 ]

ip_error does not check if in_dev is NULL before dereferencing it.

The following sequence of calls is possible:
CPU A                          CPU B
ip_rcv_finish
    ip_route_input_noref()
        ip_route_input_slow()
                               inetdev_destroy()
    dst_input()

With the result that a network device can be destroyed while processing
an input packet.

A crash was triggered with only unicast packets in flight, and
forwarding enabled on the only network device. The error condition
was created by the removal of the network device.

As such it is likely that the error code was -EHOSTUNREACH, and the
action taken by ip_error (if in_dev had been accessible) would have
been to not increment any counters and to have tried and likely failed
to send an icmp error as the network device is going away.

Therefore handle this weird case by just dropping the packet if
!in_dev.  It will result in dropping the packet sooner, and will not
result in an actual change of behavior.
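
The guard amounts to a NULL check before the first dereference. A
minimal model, with hypothetical names (struct in_device_model, the
verdict enum and ip_error_model() are not kernel symbols) and a heavily
simplified view of what ip_error does afterwards:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for the per-device IPv4 state. */
struct in_device_model {
    bool forwarding;
};

enum ip_error_verdict {
    IP_ERROR_DROP,            /* device is gone: just drop the packet */
    IP_ERROR_COUNT_AND_DROP,  /* not forwarding: account and drop */
    IP_ERROR_SEND_ICMP        /* forwarding: try to send an ICMP error */
};

static enum ip_error_verdict ip_error_model(const struct in_device_model *in_dev)
{
    /* The device may have been destroyed while the packet was in flight,
     * so check before touching any of its fields. */
    if (in_dev == NULL)
        return IP_ERROR_DROP;

    if (!in_dev->forwarding)
        return IP_ERROR_COUNT_AND_DROP;

    return IP_ERROR_SEND_ICMP;
}
```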

Fixes: 251da4130115b ("ipv4: Cache ip_error() routes even when not forwarding.")
Reported-by: Vittorio Gambaletta 
Tested-by: Vittorio Gambaletta 
Signed-off-by: Vittorio Gambaletta 
Signed-off-by: "Eric W. Biederman" 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp/ipv6: fix flow label setting in TIME_WAIT state

2015-06-23T00:03:20+00:00

[ Upstream commit 21858cd02dabcf290564cbf4769b101eba54d7bb ]

commit 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages
send from TIME_WAIT") added the flow label in the last TCP packets.
Unfortunately, it was not cast properly.

This patch replaces the buggy shift with be32_to_cpu/cpu_to_be32.
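
A userspace illustration of the conversion, using ntohl()/htonl() as
stand-ins for be32_to_cpu()/cpu_to_be32(). The helper names and the mask
macro are hypothetical; the only assumption about the wire format is
that the flow label is the low 20 bits of the first 32-bit word of the
IPv6 header, in network byte order.

```c
#include <arpa/inet.h>
#include <stdint.h>

/* The flow label is the low 20 bits of the first 32-bit header word. */
#define FLOWLABEL_MASK_HOST 0x000FFFFFu

/* Convert the big-endian header word to a host-order label for storage.
 * Shifting the big-endian value directly would pick the wrong bits on a
 * little-endian host. */
static uint32_t flowlabel_store(uint32_t flowinfo_be)
{
    return ntohl(flowinfo_be) & FLOWLABEL_MASK_HOST;
}

/* Convert the stored host-order label back to its wire representation. */
static uint32_t flowlabel_load(uint32_t stored)
{
    return htonl(stored);
}
```

Because both directions go through a byte order conversion, the round
trip gives the same label regardless of host endianness.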

Fixes: 1d13a96c74fc ("ipv6: tcp: fix flowlabel value in ACK messages")
Reported-by: Eric Dumazet 
Signed-off-by: Florent Fourcot 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

ipv4: Missing sk_nulls_node_init() in ping_unhash().

2015-05-13T12:14:18+00:00

[ Upstream commit a134f083e79fb4c3d0a925691e732c56911b4326 ]

If we don't do that, then the poison value is left in the ->pprev
backlink.

This can cause crashes if we do a disconnect, followed by a connect().

Tested-by: Linus Torvalds 
Reported-by: Wen Xu 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

route: Use ipv4_mtu instead of raw rt_pmtu

2015-05-13T12:14:16+00:00

[ Upstream commit cb6ccf09d6b94bec4def1ac5cf4678d12b216474 ]

The commit 3cdaa5be9e81a914e633a6be7b7d2ef75b528562 ("ipv4: Don't
increase PMTU with Datagram Too Big message") broke PMTU in cases
where the rt_pmtu value has expired but is smaller than the new
PMTU value.

This obsolete rt_pmtu then prevents the new PMTU value from being
installed.
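
The difference between the raw value and the effective MTU can be shown
with a small model. struct route_model, effective_mtu() and
update_pmtu() are hypothetical names, time is a plain counter, and the
fallback to a device MTU is a simplification of what ipv4_mtu() does.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical, simplified route entry. */
struct route_model {
    uint32_t rt_pmtu;       /* learned PMTU, 0 if none */
    uint64_t pmtu_expires;  /* time after which rt_pmtu is stale */
    uint32_t dev_mtu;       /* fallback when no valid PMTU is cached */
};

/* Effective MTU: the cached PMTU only counts while it has not expired. */
static uint32_t effective_mtu(const struct route_model *rt, uint64_t now)
{
    if (rt->rt_pmtu != 0 && now < rt->pmtu_expires)
        return rt->rt_pmtu;
    return rt->dev_mtu;
}

/* Refuse updates that would increase the effective MTU. Comparing against
 * the raw rt_pmtu instead would let an expired small value block them. */
static bool update_pmtu(struct route_model *rt, uint32_t new_mtu,
                        uint64_t now, uint64_t lifetime)
{
    if (new_mtu > effective_mtu(rt, now))
        return false;
    rt->rt_pmtu = new_mtu;
    rt->pmtu_expires = now + lifetime;
    return true;
}
```

With an expired rt_pmtu of 500 and a device MTU of 1500, a new PMTU of
1400 is accepted by this model, while a comparison against the raw 500
would have rejected it.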

Fixes: 3cdaa5be9e81 ("ipv4: Don't increase PMTU with Datagram Too Big message")
Reported-by: Gerd v. Egidy 
Signed-off-by: Herbert Xu 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: avoid looping in tcp_send_fin()

2015-05-06T20:03:34+00:00

[ Upstream commit 845704a535e9b3c76448f52af1b70e4422ea03fd ]

The presence of an unbounded loop in tcp_send_fin() had always been hard
to explain when analyzing crash dumps involving gigantic dying processes
with millions of sockets.

Let's try a different strategy:

In case of memory pressure, try to add the FIN flag to last packet
in write queue, even if packet was already sent. TCP stack will
be able to deliver this FIN after a timeout event. Note that because
this FIN is delivered by a retransmit, it also carries a Push flag
given our current implementation.

By checking sk_under_memory_pressure(), we anticipate that cooking
many FIN packets might deplete tcp memory.

In the case we could not allocate a packet, even with __GFP_WAIT
allocation, then not sending a FIN seems quite reasonable if it allows
us to get rid of this socket, free memory, and not block the process
from eventually doing other useful work.
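
The strategy can be modeled without any retry loop. Everything below is
hypothetical (struct seg, struct sock_model, the allocator stubs and
send_fin_model() are not kernel symbols). Falling back to the queue tail
when the single allocation attempt fails is an assumption of this
sketch, not something stated above.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical, simplified write-queue segment. */
struct seg {
    bool fin;
    bool already_sent;
};

/* Hypothetical socket model: only the queue tail and a pressure flag. */
struct sock_model {
    struct seg *tail;      /* last segment in the write queue, or NULL */
    bool memory_pressure;
};

enum fin_result { FIN_ON_TAIL, FIN_NEW_SEGMENT, FIN_NOT_SENT };

/* Allocator stubs for demonstration only. */
static struct seg *alloc_ok(void)   { return calloc(1, sizeof(struct seg)); }
static struct seg *alloc_fail(void) { return NULL; }

static enum fin_result send_fin_model(struct sock_model *sk,
                                      struct seg *(*alloc)(void))
{
    struct seg *s;

    /* Under memory pressure, put the FIN on the last queued segment,
     * even if that segment was already sent. */
    if (sk->tail && sk->memory_pressure) {
        sk->tail->fin = true;
        return FIN_ON_TAIL;
    }

    /* One allocation attempt, no retry loop. */
    s = alloc();
    if (!s) {
        if (sk->tail) {        /* assumed fallback, see lead-in */
            sk->tail->fin = true;
            return FIN_ON_TAIL;
        }
        return FIN_NOT_SENT;   /* give up rather than block the process */
    }

    s->fin = true;
    sk->tail = s;
    return FIN_NEW_SEGMENT;
}
```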

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman