linux-stable.git/include/net/sock.h, branch linux-2.6.33.y

udp: add rehash on connect()

2011-03-21T19:44:10+00:00

commit 719f835853a92f6090258114a72ffe41f09155cd upstream.

commit 30fff923 introduced in linux-2.6.33 (udp: bind() optimisation)
added a secondary hash on UDP, hashed on (local addr, local port).

Problem is that following sequence :

fd = socket(...)
connect(fd, &remote, ...)

not only selects remote end point (address and port), but also sets
local address, while UDP stack stored in secondary hash table the socket
while its local address was INADDR_ANY (or ipv6 equivalent)

Sequence is :
 - autobind() : choose a random local port, insert socket in hash tables
              [while local address is INADDR_ANY]
 - connect() : set remote address and port, change local address to IP
              given by a route lookup.

When an incoming UDP frame comes, if more than 10 sockets are found in
primary hash table, we switch to secondary table, and fail to find
socket because its local address changed.

One solution to this problem is to rehash datagram socket if needed.

We add a new rehash(struct socket *) method in "struct proto", and
implement this method for UDP v4 & v6, using a common helper.

This rehashing only takes care of secondary hash table, since primary
hash (based on local port only) is not changed.

Reported-by: Krzysztof Piotr Oledzki 
Signed-off-by: Eric Dumazet 
Tested-by: Krzysztof Piotr Oledzki 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net: fix problem in reading sock TX queue

2010-08-02T17:26:28+00:00

commit b0f77d0eae0c58a5a9691a067ada112ceeae2d00 upstream.

Fix problem in reading the tx_queue recorded in a socket.  In
dev_pick_tx, the TX queue is read by doing a check with
sk_tx_queue_recorded on the socket, followed by a sk_tx_queue_get.
The problem is that there is not mutual exclusion across these
calls in the socket so it it is possible that the queue in the
sock can be invalidated after sk_tx_queue_recorded is called so
that sk_tx_queue get returns -1, which sets 65535 in queue_index
and thus dev_pick_tx returns 65536 which is a bogus queue and
can cause crash in dev_queue_xmit.

We fix this by only calling sk_tx_queue_get which does the proper
checks.  The interface is that sk_tx_queue_get returns the TX queue
if the sock argument is non-NULL and TX queue is recorded, else it
returns -1.  sk_tx_queue_recorded is no longer used so it can be
completely removed.

Signed-off-by: Tom Herbert 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net: add __must_check to sk_add_backlog

2010-04-01T23:02:05+00:00

[ Upstream commit 4045635318538d3ddd2007720412fdc4b08f6a62 ]

Add the "__must_check" tag to sk_add_backlog() so that any failure to
check and drop packets will be warned about.

Signed-off-by: Zhu Yi 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net: backlog functions rename

2010-04-01T23:02:04+00:00

[ Upstream commit a3a858ff18a72a8d388e31ab0d98f7e944841a62 ]

sk_add_backlog -> __sk_add_backlog
sk_add_backlog_limited -> sk_add_backlog

Signed-off-by: Zhu Yi 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net: add limit for socket backlog

2010-04-01T23:02:02+00:00

[ Upstream commit 8eae939f1400326b06d0c9afe53d2a484a326871 ]

We got system OOM while running some UDP netperf testing on the loopback
device. The case is multiple senders sent stream UDP packets to a single
receiver via loopback on local host. Of course, the receiver is not able
to handle all the packets in time. But we surprisingly found that these
packets were not discarded due to the receiver's sk->sk_rcvbuf limit.
Instead, they are kept queuing to sk->sk_backlog and finally ate up all
the memory. We believe this is a secure hole that a none privileged user
can crash the system.

The root cause for this problem is, when the receiver is doing
__release_sock() (i.e. after userspace recv, kernel udp_recvmsg ->
skb_free_datagram_locked -> release_sock), it moves skbs from backlog to
sk_receive_queue with the softirq enabled. In the above case, multiple
busy senders will almost make it an endless loop. The skbs in the
backlog end up eat all the system memory.

The issue is not only for UDP. Any protocols using socket backlog is
potentially affected. The patch adds limit for socket backlog so that
the backlog size cannot be expanded endlessly.

Reported-by: Alex Shi 
Cc: David Miller 
Cc: Arnaldo Carvalho de Melo 
Cc: Alexey Kuznetsov 
Cc: "Pekka Savola (ipv6)" 
Cc: Patrick McHardy 
Cc: Vlad Yasevich 
Cc: Sridhar Samudrala 
Cc: Jon Maloy 
Cc: Allan Stephens 
Cc: Andrew Hendry 
Signed-off-by: Zhu Yi 
Signed-off-by: Eric Dumazet 
Acked-by: Arnaldo Carvalho de Melo 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

udp: secondary hash on (local port, local address)

2009-11-09T04:53:06+00:00

Extends udp_table to contain a secondary hash table.

socket anchor for this second hash is free, because UDP
doesnt use skc_bind_node : We define an union to hold
both skc_bind_node & a new hlist_nulls_node udp_portaddr_node

udp_lib_get_port() inserts sockets into second hash chain
(additional cost of one atomic op)

udp_lib_unhash() deletes socket from second hash chain
(additional cost of one atomic op)

Note : No spinlock lockdep annotation is needed, because
lock for the secondary hash chain is always get after
lock for primary hash chain.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

udp: split sk_hash into two u16 hashes

2009-11-09T04:53:05+00:00

Union sk_hash with two u16 hashes for udp (no extra memory taken)

One 16 bits hash on (local port) value (the previous udp 'hash')

One 16 bits hash on (local address, local port) values, initialized
but not yet used. This second hash is using jenkin hash for better
distribution.

Because the 'port' is xored later, a partial hash is performed
on local address + net_hash_mix(net)

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

net: Introduce sk_tx_queue_mapping

2009-10-21T01:55:45+00:00

Introduce sk_tx_queue_mapping; and functions that set, test and
get this value. Reset sk_tx_queue_mapping to -1 whenever the dst
cache is set/reset, and in socket alloc. Setting txq to -1 and
using valid txq=<0 to n-1> allows the tx path to use the value
of sk_tx_queue_mapping directly instead of subtracting 1 on every
tx.

Signed-off-by: Krishna Kumar 
Signed-off-by: David S. Miller

Merge branch 'master' of master.kernel.org:/pub/scm/linux/kernel/git/davem/net-2.6

2009-10-13T19:55:20+00:00

net: Generalize socket rx gap / receive queue overflow cmsg

2009-10-12T20:26:31+00:00

Create a new socket level option to report number of queue overflows

Recently I augmented the AF_PACKET protocol to report the number of frames lost
on the socket receive queue between any two enqueued frames.  This value was
exported via a SOL_PACKET level cmsg.  AFter I completed that work it was
requested that this feature be generalized so that any datagram oriented socket
could make use of this option.  As such I've created this patch, It creates a
new SOL_SOCKET level option called SO_RXQ_OVFL, which when enabled exports a
SOL_SOCKET level cmsg that reports the nubmer of times the sk_receive_queue
overflowed between any two given frames.  It also augments the AF_PACKET
protocol to take advantage of this new feature (as it previously did not touch
sk->sk_drops, which this patch uses to record the overflow count).  Tested
successfully by me.

Notes:

1) Unlike my previous patch, this patch simply records the sk_drops value, which
is not a number of drops between packets, but rather a total number of drops.
Deltas must be computed in user space.

2) While this patch currently works with datagram oriented protocols, it will
also be accepted by non-datagram oriented protocols. I'm not sure if thats
agreeable to everyone, but my argument in favor of doing so is that, for those
protocols which aren't applicable to this option, sk_drops will always be zero,
and reporting no drops on a receive queue that isn't used for those
non-participating protocols seems reasonable to me.  This also saves us having
to code in a per-protocol opt in mechanism.

3) This applies cleanly to net-next assuming that commit
977750076d98c7ff6cbda51858bb5a5894a9d9ab (my af packet cmsg patch) is reverted

Signed-off-by: Neil Horman 
Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller