linux-stable.git/include/net, branch v3.12.52

net: fix IP early demux races

2016-01-05T17:18:01+00:00

[ Upstream commit 5037e9ef9454917b047f9f3a19b4dd179fbf7cd4 ]

David Wilder reported crashes caused by dst reuse.


  I am seeing a crash on a distro V4.2.3 kernel caused by a double
  release of a dst_entry.  In ipv4_dst_destroy() the call to
  list_empty() finds a poisoned next pointer, indicating the dst_entry
  has already been removed from the list and freed. The crash occurs
  18 to 24 hours into a run of a network stress exerciser.


Thanks to his detailed report and analysis, we were able to understand
the core issue.

IP early demux can associate a dst to skb, after a lookup in TCP/UDP
sockets.

When socket cache is not properly set, we want to store into
sk->sk_dst_cache the dst for future IP early demux lookups,
by acquiring a stable refcount on the dst.

Problem is this acquisition is simply using an atomic_inc(),
which works well, unless the dst was queued for destruction from
dst_release() noticing dst refcount went to zero, if DST_NOCACHE
was set on dst.

We need to make sure current refcount is not zero before incrementing
it, or risk double free as David reported.

This patch, being a stable candidate, adds two new helpers, and use
them only from IP early demux problematic paths.

It might be possible to merge in net-next skb_dst_force() and
skb_dst_force_safe(), but I prefer having the smallest patch for stable
kernels : Maybe some skb_dst_force() callers do not expect skb->dst
can suddenly be cleared.

Can probably be backported back to linux-3.6 kernels

Reported-by: David J. Wilder 
Tested-by: David J. Wilder 
Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Jiri Slaby

net: add validation for the socket syscall protocol argument

2016-01-05T17:04:00+00:00

[ Upstream commit 79462ad02e861803b3840cc782248c7359451cd9 ]

郭永刚 reported that one could simply crash the kernel as root by
using a simple program:

	int socket_fd;
	struct sockaddr_in addr;
	addr.sin_port = 0;
	addr.sin_addr.s_addr = INADDR_ANY;
	addr.sin_family = 10;

	socket_fd = socket(10,3,0x40000000);
	connect(socket_fd , &addr,16);

AF_INET, AF_INET6 sockets actually only support 8-bit protocol
identifiers. inet_sock's skc_protocol field thus is sized accordingly,
thus larger protocol identifiers simply cut off the higher bits and
store a zero in the protocol fields.

This could lead to e.g. NULL function pointer because as a result of
the cut off inet_num is zero and we call down to inet_autobind, which
is NULL for raw sockets.

kernel: Call Trace:
kernel:  [] ? inet_autobind+0x2e/0x70
kernel:  [] inet_dgram_connect+0x54/0x80
kernel:  [] SYSC_connect+0xd9/0x110
kernel:  [] ? ptrace_notify+0x5b/0x80
kernel:  [] ? syscall_trace_enter_phase2+0x108/0x200
kernel:  [] SyS_connect+0xe/0x10
kernel:  [] tracesys_phase2+0x84/0x89

I found no particular commit which introduced this problem.

CVE: CVE-2015-8543
Cc: Cong Wang 
Reported-by: 郭永刚 
Signed-off-by: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Jiri Slaby

sctp: update the netstamp_needed counter when copying sockets

2016-01-05T16:57:50+00:00

[ Upstream commit 01ce63c90170283a9855d1db4fe81934dddce648 ]

Dmitry Vyukov reported that SCTP was triggering a WARN on socket destroy
related to disabling sock timestamp.

When SCTP accepts an association or peel one off, it copies sock flags
but forgot to call net_enable_timestamp() if a packet timestamping flag
was copied, leading to extra calls to net_disable_timestamp() whenever
such clones were closed.

The fix is to call net_enable_timestamp() whenever we copy a sock with
that flag on, like tcp does.

Reported-by: Dmitry Vyukov 
Signed-off-by: Marcelo Ricardo Leitner 
Acked-by: Vlad Yasevich 
Signed-off-by: David S. Miller 
Signed-off-by: Jiri Slaby

ipv6: add complete rcu protection around np->opt

2016-01-05T15:11:14+00:00

[ Upstream commit 45f6fad84cc305103b28d73482b344d7f5b76f39 ]

This patch addresses multiple problems :

UDP/RAW sendmsg() need to get a stable struct ipv6_txoptions
while socket is not locked : Other threads can change np->opt
concurrently. Dmitry posted a syzkaller
(http://github.com/google/syzkaller) program desmonstrating
use-after-free.

Starting with TCP/DCCP lockless listeners, tcp_v6_syn_recv_sock()
and dccp_v6_request_recv_sock() also need to use RCU protection
to dereference np->opt once (before calling ipv6_dup_options())

This patch adds full RCU protection to np->opt

Reported-by: Dmitry Vyukov 
Signed-off-by: Eric Dumazet 
Acked-by: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Jiri Slaby

ipv6: distinguish frag queues by device for multicast and link-local packets

2016-01-05T15:11:13+00:00

[ Upstream commit 264640fc2c5f4f913db5c73fa3eb1ead2c45e9d7 ]

If a fragmented multicast packet is received on an ethernet device which
has an active macvlan on top of it, each fragment is duplicated and
received both on the underlying device and the macvlan. If some
fragments for macvlan are processed before the whole packet for the
underlying device is reassembled, the "overlapping fragments" test in
ip6_frag_queue() discards the whole fragment queue.

To resolve this, add device ifindex to the search key and require it to
match reassembling multicast packets and packets to link-local
addresses.

Note: similar patch has been already submitted by Yoshifuji Hideaki in

  http://patchwork.ozlabs.org/patch/220979/

but got lost and forgotten for some reason.

Signed-off-by: Michal Kubecek 
Signed-off-by: David S. Miller 
Signed-off-by: Jiri Slaby

unix: avoid use-after-free in ep_remove_wait_queue

2015-12-15T19:45:42+00:00

[ Upstream commit 7d267278a9ece963d77eefec61630223fce08c6c ]

Rainer Weikusat  writes:
An AF_UNIX datagram socket being the client in an n:1 association with
some server socket is only allowed to send messages to the server if the
receive queue of this socket contains at most sk_max_ack_backlog
datagrams. This implies that prospective writers might be forced to go
to sleep despite none of the message presently enqueued on the server
receive queue were sent by them. In order to ensure that these will be
woken up once space becomes again available, the present unix_dgram_poll
routine does a second sock_poll_wait call with the peer_wait wait queue
of the server socket as queue argument (unix_dgram_recvmsg does a wake
up on this queue after a datagram was received). This is inherently
problematic because the server socket is only guaranteed to remain alive
for as long as the client still holds a reference to it. In case the
connection is dissolved via connect or by the dead peer detection logic
in unix_dgram_sendmsg, the server socket may be freed despite "the
polling mechanism" (in particular, epoll) still has a pointer to the
corresponding peer_wait queue. There's no way to forcibly deregister a
wait queue with epoll.

Based on an idea by Jason Baron, the patch below changes the code such
that a wait_queue_t belonging to the client socket is enqueued on the
peer_wait queue of the server whenever the peer receive queue full
condition is detected by either a sendmsg or a poll. A wake up on the
peer queue is then relayed to the ordinary wait queue of the client
socket via wake function. The connection to the peer wait queue is again
dissolved if either a wake up is about to be relayed or the client
socket reconnects or a dead peer is detected or the client socket is
itself closed. This enables removing the second sock_poll_wait from
unix_dgram_poll, thus avoiding the use-after-free, while still ensuring
that no blocked writer sleeps forever.

Signed-off-by: Rainer Weikusat 
Fixes: ec0d215f9420 ("af_unix: fix 'poll for write'/connected DGRAM sockets")
Reviewed-by: Jason Baron 
Signed-off-by: David S. Miller 
Signed-off-by: Jiri Slaby

net: avoid NULL deref in inet_ctl_sock_destroy()

2015-11-14T16:04:51+00:00

[ Upstream commit 8fa677d2706d325d71dab91bf6e6512c05214e37 ]

Under low memory conditions, tcp_sk_init() and icmp_sk_init()
can both iterate on all possible cpus and call inet_ctl_sock_destroy(),
with eventual NULL pointer.

Signed-off-by: Eric Dumazet 
Reported-by: Dmitry Vyukov 
Signed-off-by: David S. Miller 
Signed-off-by: Jiri Slaby

net: add pfmemalloc check in sk_add_backlog()

2015-10-28T15:38:14+00:00

[ Upstream commit c7c49b8fde26b74277188bdc6c9dca38db6fa35b ]

Greg reported crashes hitting the following check in __sk_backlog_rcv()

	BUG_ON(!sock_flag(sk, SOCK_MEMALLOC));

The pfmemalloc bit is currently checked in sk_filter().

This works correctly for TCP, because sk_filter() is ran in
tcp_v[46]_rcv() before hitting the prequeue or backlog checks.

For UDP or other protocols, this does not work, because the sk_filter()
is ran from sock_queue_rcv_skb(), which might be called _after_ backlog
queuing if socket is owned by user by the time packet is processed by
softirq handler.

Fixes: b4b9e35585089 ("netvm: set PF_MEMALLOC as appropriate during SKB processing")
Signed-off-by: Eric Dumazet 
Reported-by: Greg Thelen 
Signed-off-by: David S. Miller 
Signed-off-by: Jiri Slaby

af_unix: Convert the unix_sk macro to an inline function for type safety

2015-10-28T15:38:11+00:00

[ Upstream commit 4613012db1d911f80897f9446a49de817b2c4c47 ]

As suggested by Eric Dumazet this change replaces the
#define with a static inline function to enjoy
complaints by the compiler when misusing the API.

Signed-off-by: Aaron Conole 
Signed-off-by: David S. Miller 
Signed-off-by: Jiri Slaby

netfilter: nf_conntrack: don't release a conntrack with non-zero refcnt

2015-09-14T14:28:42+00:00

[ Upstream commit e53376bef2cd97d3e3f61fdc677fb8da7d03d0da ]

With this patch, the conntrack refcount is initially set to zero and
it is bumped once it is added to any of the list, so we fulfill
Eric's golden rule which is that all released objects always have a
refcount that equals zero.

Andrey Vagin reports that nf_conntrack_free can't be called for a
conntrack with non-zero ref-counter, because it can race with
nf_conntrack_find_get().

A conntrack slab is created with SLAB_DESTROY_BY_RCU. Non-zero
ref-counter says that this conntrack is used. So when we release
a conntrack with non-zero counter, we break this assumption.

CPU1                                    CPU2
____nf_conntrack_find()
                                        nf_ct_put()
                                         destroy_conntrack()
                                        ...
                                        init_conntrack
                                         __nf_conntrack_alloc (set use = 1)
atomic_inc_not_zero(&ct->use) (use = 2)
                                         if (!l4proto->new(ct, skb, dataoff, timeouts))
                                          nf_conntrack_free(ct); (use = 2 !!!)
                                        ...
                                        __nf_conntrack_alloc (set use = 1)
 if (!nf_ct_key_equal(h, tuple, zone))
  nf_ct_put(ct); (use = 0)
   destroy_conntrack()
                                        /* continue to work with CT */

After applying the path "[PATCH] netfilter: nf_conntrack: fix RCU
race in nf_conntrack_find_get" another bug was triggered in
destroy_conntrack():

<4>[67096.759334] ------------[ cut here ]------------
<2>[67096.759353] kernel BUG at net/netfilter/nf_conntrack_core.c:211!
...
<4>[67096.759837] Pid: 498649, comm: atdd veid: 666 Tainted: G         C ---------------    2.6.32-042stab084.18 #1 042stab084_18 /DQ45CB
<4>[67096.759932] RIP: 0010:[]  [] destroy_conntrack+0x15c/0x190 [nf_conntrack]
<4>[67096.760255] Call Trace:
<4>[67096.760255]  [] nf_conntrack_destroy+0x17/0x30
<4>[67096.760255]  [] nf_conntrack_find_get+0x85/0x130 [nf_conntrack]
<4>[67096.760255]  [] nf_conntrack_in+0x352/0xb60 [nf_conntrack]
<4>[67096.760255]  [] ipv4_conntrack_local+0x51/0x60 [nf_conntrack_ipv4]
<4>[67096.760255]  [] nf_iterate+0x69/0xb0
<4>[67096.760255]  [] ? dst_output+0x0/0x20
<4>[67096.760255]  [] nf_hook_slow+0x74/0x110
<4>[67096.760255]  [] ? dst_output+0x0/0x20
<4>[67096.760255]  [] raw_sendmsg+0x775/0x910
<4>[67096.760255]  [] ? flush_tlb_others_ipi+0x128/0x130
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] inet_sendmsg+0x4a/0xb0
<4>[67096.760255]  [] ? sock_sendmsg+0x13/0x140
<4>[67096.760255]  [] sock_sendmsg+0x117/0x140
<4>[67096.760255]  [] ? native_smp_send_reschedule+0x49/0x60
<4>[67096.760255]  [] ? _spin_unlock_bh+0x1b/0x20
<4>[67096.760255]  [] ? autoremove_wake_function+0x0/0x40
<4>[67096.760255]  [] ? do_ip_setsockopt+0x90/0xd80
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] ? apic_timer_interrupt+0xe/0x20
<4>[67096.760255]  [] sys_sendto+0x139/0x190
<4>[67096.760255]  [] ? audit_syscall_entry+0x1d7/0x200
<4>[67096.760255]  [] ? __audit_syscall_exit+0x265/0x290
<4>[67096.760255]  [] compat_sys_socketcall+0x13f/0x210
<4>[67096.760255]  [] ia32_sysret+0x0/0x5

I have reused the original title for the RFC patch that Andrey posted and
most of the original patch description.

Cc: Eric Dumazet 
Cc: Andrew Vagin 
Cc: Florian Westphal 
Reported-by: Andrew Vagin 
Signed-off-by: Pablo Neira Ayuso 
Reviewed-by: Eric Dumazet 
Acked-by: Andrew Vagin 
Signed-off-by: Jiri Slaby