linux-stable.git/net/core, branch v3.2.67

net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding

2015-02-20T00:49:41+00:00

[ Upstream commit 2c26d34bbcc0b3f30385d5587aa232289e2eed8e ]

When using VXLAN tunnels and a sky2 device, I have experienced
checksum failures of the following type:

[ 4297.761899] eth0: hw csum failure
[...]
[ 4297.765223] Call Trace:
[ 4297.765224]    [] dump_stack+0x46/0x58
[ 4297.765235]  [] netdev_rx_csum_fault+0x42/0x50
[ 4297.765238]  [] ? skb_push+0x40/0x40
[ 4297.765240]  [] __skb_checksum_complete+0xbc/0xd0
[ 4297.765243]  [] tcp_v4_rcv+0x2e2/0x950
[ 4297.765246]  [] ? ip_rcv_finish+0x360/0x360

	These are reliably reproduced in a network topology of:

container:eth0 == host(OVS VXLAN on VLAN) == bond0 == eth0 (sky2) -> switch

	When VXLAN encapsulated traffic is received from a similarly
configured peer, the above warning is generated in the receive
processing of the encapsulated packet.  Note that the warning is
associated with the container eth0.

        The skbs from sky2 have ip_summed set to CHECKSUM_COMPLETE, and
because the packet is an encapsulated Ethernet frame, the checksum
generated by the hardware includes the inner protocol and Ethernet
headers.

	The receive code is careful to update the skb->csum, except in
__dev_forward_skb, as called by dev_forward_skb.  __dev_forward_skb
calls eth_type_trans, which in turn calls skb_pull_inline(skb, ETH_HLEN)
to skip over the Ethernet header, but does not update skb->csum when
doing so.

	This patch resolves the problem by adding a call to
skb_postpull_rcsum to update the skb->csum after the call to
eth_type_trans.

Signed-off-by: Jay Vosburgh 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

Revert "tcp: Apply device TSO segment limit earlier"

2015-02-20T00:49:33+00:00

This reverts commit 9f871e883277cc22c6217db806376dce52401a31, which
was commit 1485348d2424e1131ea42efc033cbd9366462b01 upstream.

It can cause connections to stall when a PMTU event occurs.  This was
fixed by commit 843925f33fcc ("tcp: Do not apply TSO segment limit to
non-TSO packets") upstream, but that depends on other changes to TSO.

The original issue this fixed was a performance regression for the sfc
driver in extreme cases of TSO (skb with > 100 segments).  This is not
really very important and it seems best to revert it rather than try
to fix it up.

Signed-off-by: Ben Hutchings 
Cc: Herbert Xu 
Cc: netdev@vger.kernel.org
Cc: linux-net-drivers@solarflare.com

net: Fix stacked vlan offload features computation

2015-02-20T00:49:32+00:00

commit 796f2da81bead71ffc91ef70912cd8d1827bf756 upstream.

When vlan tags are stacked, it is very likely that the outer tag is stored
in skb->vlan_tci and skb->protocol shows the inner tag's vlan_proto.
Currently netif_skb_features() first looks at skb->protocol even if there
is the outer tag in vlan_tci, thus it incorrectly retrieves the protocol
encapsulated by the inner vlan instead of the inner vlan protocol.
This allows GSO packets to be passed to HW and they end up being
corrupted.

Fixes: 58e998c6d239 ("offloading: Force software GSO for multiple vlan tags.")
Signed-off-by: Toshiaki Makita 
Signed-off-by: David S. Miller 
[bwh: Backported to 3.2:
 - We don't support 802.1ad tag offload
 - Keep passing protocol to harmonize_features()]
Signed-off-by: Ben Hutchings

iovec: make sure the caller actually wants anything in memcpy_fromiovecend

2014-09-13T22:41:49+00:00

[ Upstream commit 06ebb06d49486676272a3c030bfeef4bd969a8e6 ]

Check for cases when the caller requests 0 bytes instead of running off
and dereferencing potentially invalid iovecs.

Signed-off-by: Sasha Levin 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

inetpeer: get rid of ip_id_count

2014-09-13T22:41:48+00:00

[ Upstream commit 73f156a6e8c1074ac6327e0abd1169e95eb66463 ]

Ideally, we would need to generate IP ID using a per destination IP
generator.

linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.

1) each inet_peer struct consumes 192 bytes

2) inetpeer cache uses a binary tree of inet_peer structs,
   with a nominal size of ~66000 elements under load.

3) lookups in this tree are hitting a lot of cache lines, as tree depth
   is about 20.

4) If server deals with many tcp flows, we have a high probability of
   not finding the inet_peer, allocating a fresh one, inserting it in
   the tree with same initial ip_id_count, (cf secure_ip_id())

5) We garbage collect inet_peer aggressively.

IP ID generation do not have to be 'perfect'

Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.

We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.

ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)

secure_ip_id() and secure_ipv6_id() no longer are needed.

Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

net: sendmsg: fix NULL pointer dereference

2014-09-13T22:41:40+00:00

commit 40eea803c6b2cfaab092f053248cbeab3f368412 upstream.

Sasha's report:
	> While fuzzing with trinity inside a KVM tools guest running the latest -next
	> kernel with the KASAN patchset, I've stumbled on the following spew:
	>
	> [ 4448.949424] ==================================================================
	> [ 4448.951737] AddressSanitizer: user-memory-access on address 0
	> [ 4448.952988] Read of size 2 by thread T19638:
	> [ 4448.954510] CPU: 28 PID: 19638 Comm: trinity-c76 Not tainted 3.16.0-rc4-next-20140711-sasha-00046-g07d3099-dirty #813
	> [ 4448.956823]  ffff88046d86ca40 0000000000000000 ffff880082f37e78 ffff880082f37a40
	> [ 4448.958233]  ffffffffb6e47068 ffff880082f37a68 ffff880082f37a58 ffffffffb242708d
	> [ 4448.959552]  0000000000000000 ffff880082f37a88 ffffffffb24255b1 0000000000000000
	> [ 4448.961266] Call Trace:
	> [ 4448.963158] dump_stack (lib/dump_stack.c:52)
	> [ 4448.964244] kasan_report_user_access (mm/kasan/report.c:184)
	> [ 4448.965507] __asan_load2 (mm/kasan/kasan.c:352)
	> [ 4448.966482] ? netlink_sendmsg (net/netlink/af_netlink.c:2339)
	> [ 4448.967541] netlink_sendmsg (net/netlink/af_netlink.c:2339)
	> [ 4448.968537] ? get_parent_ip (kernel/sched/core.c:2555)
	> [ 4448.970103] sock_sendmsg (net/socket.c:654)
	> [ 4448.971584] ? might_fault (mm/memory.c:3741)
	> [ 4448.972526] ? might_fault (./arch/x86/include/asm/current.h:14 mm/memory.c:3740)
	> [ 4448.973596] ? verify_iovec (net/core/iovec.c:64)
	> [ 4448.974522] ___sys_sendmsg (net/socket.c:2096)
	> [ 4448.975797] ? put_lock_stats.isra.13 (./arch/x86/include/asm/preempt.h:98 kernel/locking/lockdep.c:254)
	> [ 4448.977030] ? lock_release_holdtime (kernel/locking/lockdep.c:273)
	> [ 4448.978197] ? lock_release_non_nested (kernel/locking/lockdep.c:3434 (discriminator 1))
	> [ 4448.979346] ? check_chain_key (kernel/locking/lockdep.c:2188)
	> [ 4448.980535] __sys_sendmmsg (net/socket.c:2181)
	> [ 4448.981592] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
	> [ 4448.982773] ? trace_hardirqs_on (kernel/locking/lockdep.c:2607)
	> [ 4448.984458] ? syscall_trace_enter (arch/x86/kernel/ptrace.c:1500 (discriminator 2))
	> [ 4448.985621] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2600)
	> [ 4448.986754] SyS_sendmmsg (net/socket.c:2201)
	> [ 4448.987708] tracesys (arch/x86/kernel/entry_64.S:542)
	> [ 4448.988929] ==================================================================

This reports means that we've come to netlink_sendmsg() with msg->msg_name == NULL and msg->msg_namelen > 0.

After this report there was no usual "Unable to handle kernel NULL pointer dereference"
and this gave me a clue that address 0 is mapped and contains valid socket address structure in it.

This bug was introduced in f3d3342602f8bcbf37d7c46641cb9bca7618eb1c
(net: rework recvmsg handler msg_name and msg_namelen logic).
Commit message states that:
	"Set msg->msg_name = NULL if user specified a NULL in msg_name but had a
	 non-null msg_namelen in verify_iovec/verify_compat_iovec. This doesn't
	 affect sendto as it would bail out earlier while trying to copy-in the
	 address."
But in fact this affects sendto when address 0 is mapped and contains
socket address structure in it. In such case copy-in address will succeed,
verify_iovec() function will successfully exit with msg->msg_namelen > 0
and msg->msg_name == NULL.

This patch fixes it by setting msg_namelen to 0 if msg_name == NULL.

Cc: Hannes Frederic Sowa 
Cc: Eric Dumazet 
Reported-by: Sasha Levin 
Signed-off-by: Andrey Ryabinin 
Acked-by: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

rtnetlink: fix userspace API breakage for iproute2 < v3.9.0

2014-08-06T17:07:39+00:00

commit e5eca6d41f53db48edd8cf88a3f59d2c30227f8e upstream.

When running RHEL6 userspace on a current upstream kernel, "ip link"
fails to show VF information.

The reason is a kernel<->userspace API change introduced by commit
88c5b5ce5cb57 ("rtnetlink: Call nlmsg_parse() with correct header length"),
after which the kernel does not see iproute2's IFLA_EXT_MASK attribute
in the netlink request.

iproute2 adjusted for the API change in its commit 63338dca4513
("libnetlink: Use ifinfomsg instead of rtgenmsg in rtnl_wilddump_req_filter").

The problem has been noticed before:
http://marc.info/?l=linux-netdev&m=136692296022182&w=2
(Subject: Re: getting VF link info seems to be broken in 3.9-rc8)

We can do better than tell those with old userspace to upgrade. We can
recognize the old iproute2 in the kernel by checking the netlink message
length. Even when including the IFLA_EXT_MASK attribute, its netlink
message is shorter than struct ifinfomsg.

With this patch "ip link" shows VF information in both old and new
iproute2 versions.

Signed-off-by: Michal Schmidt 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

skbuff: skb_segment: orphan frags before copying

2014-07-11T12:34:00+00:00

commit 1fd819ecb90cc9b822cd84d3056ddba315d3340f upstream.

skb_segment copies frags around, so we need
to copy them carefully to avoid accessing
user memory after reporting completion to userspace
through a callback.

skb_segment doesn't normally happen on datapath:
TSO needs to be disabled - so disabling zero copy
in this case does not look like a big deal.

Signed-off-by: Michael S. Tsirkin 
Acked-by: Herbert Xu 
Signed-off-by: David S. Miller 
[bwh: Backported to 3.2.  As skb_segment() only supports page-frags *or* a
 frag list, there is no need for the additional frag_skb pointer or the
 preparatory renaming.]
Signed-off-by: Ben Hutchings

skbuff: export skb_copy_ubufs

2014-07-11T12:34:00+00:00

commit dcc0fb782b3a6e2abfeaaeb45dd88ed09596be0f upstream.

Export skb_copy_ubufs so that modules can orphan frags.

Signed-off-by: Michael S. Tsirkin 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

net-gro: reset skb->truesize in napi_reuse_skb()

2014-06-09T12:29:00+00:00

[ Upstream commit e33d0ba8047b049c9262fdb1fcafb93cb52ceceb ]

Recycling skb always had been very tough...

This time it appears GRO layer can accumulate skb->truesize
adjustments made by drivers when they attach a fragment to skb.

skb_gro_receive() can only subtract from skb->truesize the used part
of a fragment.

I spotted this problem seeing TcpExtPruneCalled and
TcpExtTCPRcvCollapsed that were unexpected with a recent kernel, where
TCP receive window should be sized properly to accept traffic coming
from a driver not overshooting skb->truesize.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings