linux-stable.git/net/ipv4, branch v3.0.35

xfrm: take net hdr len into account for esp payload size calculation

2012-06-09T15:33:03+00:00

[ Upstream commit 91657eafb64b4cb53ec3a2fbc4afc3497f735788 ]

Corrects the function that determines the esp payload size. The calculations
done in esp{4,6}_get_mtu() lead to overlength frames in transport mode for
certain mtu values and suboptimal frames for others.

According to what is done, mainly in esp{,6}_output() and tcp_mtu_to_mss(),
net_header_len must be taken into account before doing the alignment
calculation.

Signed-off-by: Benjamin Poirier 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

ipv4: fix the rcu race between free_fib_info and ip_route_output_slow

2012-06-09T15:33:02+00:00

[ Upstream commit e49cc0da7283088c5e03d475ffe2fdcb24a6d5b1 ]

We hit a kernel OOPS.

<3>[23898.789643] BUG: sleeping function called from invalid context at
/data/buildbot/workdir/ics/hardware/intel/linux-2.6/arch/x86/mm/fault.c:1103
<3>[23898.862215] in_atomic(): 0, irqs_disabled(): 0, pid: 10526, name:
Thread-6683
<4>[23898.967805] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.258526] Pid: 10526, comm: Thread-6683 Tainted: G        W
3.0.8-137685-ge7742f9 #1
<4>[23899.357404] HSU serial 0000:00:05.1: 0000:00:05.2:HSU serial prevented me
to suspend...
<4>[23899.904225] Call Trace:
<4>[23899.989209]  [] ? pgtable_bad+0x130/0x130
<4>[23900.000416]  [] __might_sleep+0x10a/0x110
<4>[23900.007357]  [] do_page_fault+0xd1/0x3c0
<4>[23900.013764]  [] ? restore_all+0xf/0xf
<4>[23900.024024]  [] ? napi_complete+0x8b/0x690
<4>[23900.029297]  [] ? pgtable_bad+0x130/0x130
<4>[23900.123739]  [] ? pgtable_bad+0x130/0x130
<4>[23900.128955]  [] error_code+0x5f/0x64
<4>[23900.133466]  [] ? pgtable_bad+0x130/0x130
<4>[23900.138450]  [] ? __ip_route_output_key+0x698/0x7c0
<4>[23900.144312]  [] ? __ip_route_output_key+0x38d/0x7c0
<4>[23900.150730]  [] ip_route_output_flow+0x1f/0x60
<4>[23900.156261]  [] ip4_datagram_connect+0x188/0x2b0
<4>[23900.161960]  [] ? _raw_spin_unlock_bh+0x1f/0x30
<4>[23900.167834]  [] inet_dgram_connect+0x36/0x80
<4>[23900.173224]  [] ? _copy_from_user+0x48/0x140
<4>[23900.178817]  [] sys_connect+0x9a/0xd0
<4>[23900.183538]  [] ? alloc_file+0xdc/0x240
<4>[23900.189111]  [] ? sub_preempt_count+0x3d/0x50

Function free_fib_info resets nexthop_nh->nh_dev to NULL before releasing
fi. Other cpu might be accessing fi. Fixing it by delaying the releasing.

With the patch, we ran MTBF testing on Android mobile for 12 hours
and didn't trigger the issue.

Thank Eric for very detailed review/checking the issue.

Signed-off-by: Yanmin Zhang 
Signed-off-by: Kun Jiang 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

ipv4: Do not use dead fib_info entries.

2012-06-09T15:33:02+00:00

[ Upstream commit dccd9ecc374462e5d6a5b8f8110415a86c2213d8 ]

Due to RCU lookups and RCU based release, fib_info objects can
be found during lookup which have fi->fib_dead set.

We must ignore these entries, otherwise we risk dereferencing
the parts of the entry which are being torn down.

Reported-by: Yevgen Pronenko 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: do_tcp_sendpages() must try to push data out on oom conditions

2012-05-21T16:40:03+00:00

commit bad115cfe5b509043b684d3a007ab54b80090aa1 upstream.

Since recent changes on TCP splicing (starting with commits 2f533844
"tcp: allow splice() to build full TSO packets" and 35f9c09f "tcp:
tcp_sendpages() should call tcp_push() once"), I started seeing
massive stalls when forwarding traffic between two sockets using
splice() when pipe buffers were larger than socket buffers.

Latest changes (net: netdev_alloc_skb() use build_skb()) made the
problem even more apparent.

The reason seems to be that if do_tcp_sendpages() fails on out of memory
condition without being able to send at least one byte, tcp_push() is not
called and the buffers cannot be flushed.

After applying the attached patch, I cannot reproduce the stalls at all
and the data rate it perfectly stable and steady under any condition
which previously caused the problem to be permanent.

The issue seems to have been there since before the kernel migrated to
git, which makes me think that the stalls I occasionally experienced
with tux during stress-tests years ago were probably related to the
same issue.

This issue was first encountered on 3.0.31 and 3.2.17, so please backport
to -stable.

Signed-off-by: Willy Tarreau 
Acked-by: Eric Dumazet 
Signed-off-by: Greg Kroah-Hartman

tcp: change tcp_adv_win_scale and tcp_rmem[2]

2012-05-21T16:40:00+00:00

[ Upstream commit b49960a05e32121d29316cfdf653894b88ac9190 ]

tcp_adv_win_scale default value is 2, meaning we expect a good citizen
skb to have skb->len / skb->truesize ratio of 75% (3/4)

In 2.6 kernels we (mis)accounted for typical MSS=1460 frame :
1536 + 64 + 256 = 1856 'estimated truesize', and 1856 * 3/4 = 1392.
So these skbs were considered as not bloated.

With recent truesize fixes, a typical MSS=1460 frame truesize is now the
more precise :
2048 + 256 = 2304. But 2304 * 3/4 = 1728.
So these skb are not good citizen anymore, because 1460 < 1728

(GRO can escape this problem because it build skbs with a too low
truesize.)

This also means tcp advertises a too optimistic window for a given
allocated rcvspace : When receiving frames, sk_rmem_alloc can hit
sk_rcvbuf limit and we call tcp_prune_queue()/tcp_collapse() too often,
especially when application is slow to drain its receive queue or in
case of losses (netperf is fast, scp is slow). This is a major latency
source.

We should adjust the len/truesize ratio to 50% instead of 75%

This patch :

1) changes tcp_adv_win_scale default to 1 instead of 2

2) increase tcp_rmem[2] limit from 4MB to 6MB to take into account
better truesize tracking and to allow autotuning tcp receive window to
reach same value than before. Note that same amount of kernel memory is
consumed compared to 2.6 kernels.

Signed-off-by: Eric Dumazet 
Cc: Neal Cardwell 
Cc: Tom Herbert 
Cc: Yuchung Cheng 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: fix tcp_grow_window() for large incoming frames

2012-04-27T16:51:21+00:00

[ Upstream commit 4d846f02392a710f9604892ac3329e628e60a230 ]

tcp_grow_window() has to grow rcv_ssthresh up to window_clamp, allowing
sender to increase its window.

tcp_grow_window() still assumes a tcp frame is under MSS, but its no
longer true with LRO/GRO.

This patch fixes one of the performance issue we noticed with GRO on.

Signed-off-by: Eric Dumazet 
Cc: Neal Cardwell 
Cc: Tom Herbert 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: fix tcp_rcv_rtt_update() use of an unscaled RTT sample

2012-04-27T16:51:19+00:00

[ Upstream commit 18a223e0b9ec8979320ba364b47c9772391d6d05 ]

Fix a code path in tcp_rcv_rtt_update() that was comparing scaled and
unscaled RTT samples.

The intent in the code was to only use the 'm' measurement if it was a
new minimum.  However, since 'm' had not yet been shifted left 3 bits
but 'new_sample' had, this comparison would nearly always succeed,
leading us to erroneously set our receive-side RTT estimate to the 'm'
sample when that sample could be nearly 8x too high to use.

The overall effect is to often cause the receive-side RTT estimate to
be significantly too large (up to 40% too large for brief periods in
my tests).

Signed-off-by: Neal Cardwell 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: allow splice() to build full TSO packets

2012-04-27T16:51:18+00:00

[ This combines upstream commit
  2f53384424251c06038ae612e56231b96ab610ee and the follow-on bug fix
  commit 35f9c09fe9c72eb8ca2b8e89a593e1c151f28fc2 ]

vmsplice()/splice(pipe, socket) call do_tcp_sendpages() one page at a
time, adding at most 4096 bytes to an skb. (assuming PAGE_SIZE=4096)

The call to tcp_push() at the end of do_tcp_sendpages() forces an
immediate xmit when pipe is not already filled, and tso_fragment() try
to split these skb to MSS multiples.

4096 bytes are usually split in a skb with 2 MSS, and a remaining
sub-mss skb (assuming MTU=1500)

This makes slow start suboptimal because many small frames are sent to
qdisc/driver layers instead of big ones (constrained by cwnd and packets
in flight of course)

In fact, applications using sendmsg() (adding an additional memory copy)
instead of vmsplice()/splice()/sendfile() are a bit faster because of
this anomaly, especially if serving small files in environments with
large initial [c]wnd.

Call tcp_push() only if MSG_MORE is not set in the flags parameter.

This bit is automatically provided by splice() internals but for the
last page, or on all pages if user specified SPLICE_F_MORE splice()
flag.

In some workloads, this can reduce number of sent logical packets by an
order of magnitude, making zero-copy TCP actually faster than
one-copy :)

Reported-by: Tom Herbert 
Cc: Nandita Dukkipati 
Cc: Neal Cardwell 
Cc: Tom Herbert 
Cc: Yuchung Cheng 
Cc: H.K. Jerry Chu 
Cc: Maciej Żenczykowski 
Cc: Mahesh Bandewar 
Cc: Ilpo Järvinen 
Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: fix syncookie regression

2012-03-23T18:20:51+00:00

[ Upstream commit dfd25ffffc132c00070eed64200e8950da5d7e9d ]

commit ea4fc0d619 (ipv4: Don't use rt->rt_{src,dst} in ip_queue_xmit())
added a serious regression on synflood handling.

Simon Kirby discovered a successful connection was delayed by 20 seconds
before being responsive.

In my tests, I discovered that xmit frames were lost, and needed ~4
retransmits and a socket dst rebuild before being really sent.

In case of syncookie initiated connection, we use a different path to
initialize the socket dst, and inet->cork.fl.u.ip4 is left cleared.

As ip_queue_xmit() now depends on inet flow being setup, fix this by
copying the temp flowi4 we use in cookie_v4_check().

Reported-by: Simon Kirby 
Bisected-by: Simon Kirby 
Signed-off-by: Eric Dumazet 
Tested-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: fix tcp_shift_skb_data() to not shift SACKed data below snd_una

2012-03-19T15:57:46+00:00

[ Upstream commit 4648dc97af9d496218a05353b0e442b3dfa6aaab ]

This commit fixes tcp_shift_skb_data() so that it does not shift
SACKed data below snd_una.

This fixes an issue whose symptoms exactly match reports showing
tp->sacked_out going negative since 3.3.0-rc4 (see "WARNING: at
net/ipv4/tcp_input.c:3418" thread on netdev).

Since 2008 (832d11c5cd076abc0aa1eaf7be96c81d1a59ce41)
tcp_shift_skb_data() had been shifting SACKed ranges that were below
snd_una. It checked that the *end* of the skb it was about to shift
from was above snd_una, but did not check that the end of the actual
shifted range was above snd_una; this commit adds that check.

Shifting SACKed ranges below snd_una is problematic because for such
ranges tcp_sacktag_one() short-circuits: it does not declare anything
as SACKed and does not increase sacked_out.

Before the fixes in commits cc9a672ee522d4805495b98680f4a3db5d0a0af9
and daef52bab1fd26e24e8e9578f8fb33ba1d0cb412, shifting SACKed ranges
below snd_una happened to work because tcp_shifted_skb() was always
(incorrectly) passing in to tcp_sacktag_one() an skb whose end_seq
tcp_shift_skb_data() had already guaranteed was beyond snd_una. Hence
tcp_sacktag_one() never short-circuited and always increased
tp->sacked_out in this case.

After those two fixes, my testing has verified that shifting SACKed
ranges below snd_una could cause tp->sacked_out to go negative with
the following sequence of events:

(1) tcp_shift_skb_data() sees an skb whose end_seq is beyond snd_una,
    then shifts a prefix of that skb that is below snd_una

(2) tcp_shifted_skb() increments the packet count of the
    already-SACKed prev sk_buff

(3) tcp_sacktag_one() sees the end of the new SACKed range is below
    snd_una, so it short-circuits and doesn't increase tp->sacked_out

(5) tcp_clean_rtx_queue() sees the SACKed skb has been ACKed,
    decrements tp->sacked_out by this "inflated" pcount that was
    missing a matching increase in tp->sacked_out, and hence
    tp->sacked_out underflows to a u32 like 0xFFFFFFFF, which casted
    to s32 is negative.

(6) this leads to the warnings seen in the recent "WARNING: at
    net/ipv4/tcp_input.c:3418" thread on the netdev list; e.g.:
    tcp_input.c:3418  WARN_ON((int)tp->sacked_out < 0);

More generally, I think this bug can be tickled in some cases where
two or more ACKs from the receiver are lost and then a DSACK arrives
that is immediately above an existing SACKed skb in the write queue.

This fix changes tcp_shift_skb_data() to abort this sequence at step
(1) in the scenario above by noticing that the bytes are below snd_una
and not shifting them.

Signed-off-by: Neal Cardwell 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman