linux-stable.git/net/ipv4, branch v3.13.2

net: gre: use icmp_hdr() to get inner ip header

2014-02-06T19:34:08+00:00

[ Upstream commit c0c0c50ff7c3e331c90bab316d21f724fb9e1994 ]

When dealing with icmp messages, the skb->data points the
ip header that triggered the sending of the icmp message.

In gre_cisco_err(), the parse_gre_header() is called, and the
iptunnel_pull_header() is called to pull the skb at the end of
the parse_gre_header(), so the skb->data doesn't point the
inner ip header.

Unfortunately, the ipgre_err still needs those ip addresses in
inner ip header to look up tunnel by ip_tunnel_lookup().

So just use icmp_hdr() to get inner ip header instead of skb->data.

Signed-off-by: Duan Jiong 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net: Fix memory leak if TPROXY used with TCP early demux

2014-02-06T19:34:08+00:00

[ Upstream commit a452ce345d63ddf92cd101e4196569f8718ad319 ]

I see a memory leak when using a transparent HTTP proxy using TPROXY
together with TCP early demux and Kernel v3.8.13.15 (Ubuntu stable):

unreferenced object 0xffff88008cba4a40 (size 1696):
  comm "softirq", pid 0, jiffies 4294944115 (age 8907.520s)
  hex dump (first 32 bytes):
    0a e0 20 6a 40 04 1b 37 92 be 32 e2 e8 b4 00 00  .. j@..7..2.....
    02 00 07 01 00 00 00 00 00 00 00 00 00 00 00 00  ................
  backtrace:
    [] kmem_cache_alloc+0xad/0xb9
    [] sk_prot_alloc+0x29/0xc5
    [] sk_clone_lock+0x14/0x283
    [] inet_csk_clone_lock+0xf/0x7b
    [] netlink_broadcast+0x14/0x16
    [] tcp_create_openreq_child+0x1b/0x4c3
    [] tcp_v4_syn_recv_sock+0x38/0x25d
    [] tcp_check_req+0x25c/0x3d0
    [] tcp_v4_do_rcv+0x287/0x40e
    [] ip_route_input_noref+0x843/0xa55
    [] tcp_v4_rcv+0x4c9/0x725
    [] ip_local_deliver_finish+0xe9/0x154
    [] __netif_receive_skb+0x4b2/0x514
    [] process_backlog+0xee/0x1c5
    [] net_rx_action+0xa7/0x200
    [] add_interrupt_randomness+0x39/0x157

But there are many more, resulting in the machine going OOM after some
days.

From looking at the TPROXY code, and with help from Florian, I see
that the memory leak is introduced in tcp_v4_early_demux():

  void tcp_v4_early_demux(struct sk_buff *skb)
  {
    /* ... */

    iph = ip_hdr(skb);
    th = tcp_hdr(skb);

    if (th->doff < sizeof(struct tcphdr) / 4)
        return;

    sk = __inet_lookup_established(dev_net(skb->dev), &tcp_hashinfo,
                       iph->saddr, th->source,
                       iph->daddr, ntohs(th->dest),
                       skb->skb_iif);
    if (sk) {
        skb->sk = sk;

where the socket is assigned unconditionally to skb->sk, also bumping
the refcnt on it.  This is problematic, because in our case the skb
has already a socket assigned in the TPROXY target.  This then results
in the leak I see.

The very same issue seems to be with IPv6, but haven't tested.

Reviewed-by: Florian Westphal 
Signed-off-by: Holger Eitzenberger 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

fib_frontend: fix possible NULL pointer dereference

2014-02-06T19:34:08+00:00

[ Upstream commit a0065f266a9b5d51575535a25c15ccbeed9a9966 ]

The two commits 0115e8e30d (net: remove delay at device dismantle) and
748e2d9396a (net: reinstate rtnl in call_netdevice_notifiers()) silently
removed a NULL pointer check for in_dev since Linux 3.7.

This patch re-introduces this check as it causes crashing the kernel when
setting small mtu values on non-ip capable netdevices.

Signed-off-by: Oliver Hartkopp 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

ip_tunnel: clear IPCB in ip_tunnel_xmit() in case dst_link_failure() is called

2014-02-06T19:34:07+00:00

[ Upstream commit 11c21a307d79ea5f6b6fc0d3dfdeda271e5e65f6 ]

commit a622260254ee48("ip_tunnel: fix kernel panic with icmp_dest_unreach")
clear IPCB in ip_tunnel_xmit()  , or else skb->cb[] may contain garbage from
GSO segmentation layer.

But commit 0e6fbc5b6c621("ip_tunnels: extend iptunnel_xmit()") refactor codes,
and it clear IPCB behind the dst_link_failure().

So clear IPCB in ip_tunnel_xmit() just like commti a622260254ee48("ip_tunnel:
fix kernel panic with icmp_dest_unreach").

Signed-off-by: Duan Jiong 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

tcp: metrics: Avoid duplicate entries with the same destination-IP

2014-01-18T02:05:34+00:00

Because the tcp-metrics is an RCU-list, it may be that two
soft-interrupts are inside __tcp_get_metrics() for the same
destination-IP at the same time. If this destination-IP is not yet part of
the tcp-metrics, both soft-interrupts will end up in tcpm_new and create
a new entry for this IP.
So, we will have two tcp-metrics with the same destination-IP in the list.

This patch checks twice __tcp_get_metrics(). First without holding the
lock, then while holding the lock. The second one is there to confirm
that the entry has not been added by another soft-irq while waiting for
the spin-lock.

Fixes: 51c5d0c4b169b (tcp: Maintain dynamic metrics in local cache.)
Signed-off-by: Christoph Paasch 
Reviewed-by: Eric Dumazet 
Signed-off-by: David S. Miller

net: avoid reference counter overflows on fib_rules in multicast forwarding

2014-01-15T01:37:25+00:00

Bob Falken reported that after 4G packets, multicast forwarding stopped
working. This was because of a rule reference counter overflow which
freed the rule as soon as the overflow happend.

This patch solves this by adding the FIB_LOOKUP_NOREF flag to
fib_rules_lookup calls. This is safe even from non-rcu locked sections
as in this case the flag only implies not taking a reference to the rule,
which we don't need at all.

Rules only hold references to the namespace, which are guaranteed to be
available during the call of the non-rcu protected function reg_vif_xmit
because of the interface reference which itself holds a reference to
the net namespace.

Fixes: f0ad0860d01e47 ("ipv4: ipmr: support multiple tables")
Fixes: d1db275dd3f6e4 ("ipv6: ip6mr: support multiple tables")
Reported-by: Bob Falken 
Cc: Patrick McHardy 
Cc: Thomas Graf 
Cc: Julian Anastasov 
Cc: Eric Dumazet 
Signed-off-by: Hannes Frederic Sowa 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller

inet_diag: fix inet_diag_dump_icsk() to use correct state for timewait sockets

2014-01-14T06:35:46+00:00

Fix inet_diag_dump_icsk() to reflect the fact that both TCP_TIME_WAIT
and TCP_FIN_WAIT2 connections are represented by inet_timewait_sock
(not just TIME_WAIT), and for such sockets the tw_substate field holds
the real state, which can be either TCP_TIME_WAIT or TCP_FIN_WAIT2.

This brings the inet_diag state-matching code in line with the field
it uses to populate idiag_state. This is also analogous to the info
exported in /proc/net/tcp, where get_tcp4_sock() exports sk->sk_state
and get_timewait4_sock() exports tw->tw_substate.

Before fixing this, (a) neither "ss -nemoi" nor "ss -nemoi state
fin-wait-2" would return a socket in TCP_FIN_WAIT2; and (b) "ss -nemoi
state time-wait" would also return sockets in state TCP_FIN_WAIT2.

This is an old bug that predates 05dbc7b ("tcp/dccp: remove twchain").

Signed-off-by: Neal Cardwell 
Cc: Eric Dumazet 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller

ipv4: fix tunneled VM traffic over hw VXLAN/GRE GSO NIC

2014-01-03T00:06:47+00:00

VM to VM GSO traffic is broken if it goes through VXLAN or GRE
tunnel and the physical NIC on the host supports hardware VXLAN/GRE
GSO offload (e.g. bnx2x and next-gen mlx4).

Two issues -
(VXLAN) VM traffic has SKB_GSO_DODGY and SKB_GSO_UDP_TUNNEL with
SKB_GSO_TCP/UDP set depending on the inner protocol. GSO header
integrity check fails in udp4_ufo_fragment if inner protocol is
TCP. Also gso_segs is calculated incorrectly using skb->len that
includes tunnel header. Fix: robust check should only be applied
to the inner packet.

(VXLAN & GRE) Once GSO header integrity check passes, NULL segs
is returned and the original skb is sent to hardware. However the
tunnel header is already pulled. Fix: tunnel header needs to be
restored so that hardware can perform GSO properly on the original
packet.

Signed-off-by: Wei-Chun Chao 
Signed-off-by: David S. Miller

ipv4: consistent reporting of pmtu data in case of corking

2013-12-22T23:52:09+00:00

We report different pmtu values back on the first write and on further
writes on an corked socket.

Also don't include the dst.header_len (respectively exthdrlen) as this
should already be dealt with by the interface mtu of the outgoing
(virtual) interface and policy of that interface should dictate if
fragmentation should happen.

Instead reduce the pmtu data by IP options as we do for IPv6. Make the
same changes for ip_append_data, where we did not care about options or
dst.header_len at all.

Signed-off-by: Hannes Frederic Sowa 
Signed-off-by: David S. Miller

net: inet_diag: zero out uninitialized idiag_{src,dst} fields

2013-12-19T19:55:52+00:00

Jakub reported while working with nlmon netlink sniffer that parts of
the inet_diag_sockid are not initialized when r->idiag_family != AF_INET6.
That is, fields of r->id.idiag_src[1 ... 3], r->id.idiag_dst[1 ... 3].

In fact, it seems that we can leak 6 * sizeof(u32) byte of kernel [slab]
memory through this. At least, in udp_dump_one(), we allocate a skb in ...

  rep = nlmsg_new(sizeof(struct inet_diag_msg) + ..., GFP_KERNEL);

... and then pass that to inet_sk_diag_fill() that puts the whole struct
inet_diag_msg into the skb, where we only fill out r->id.idiag_src[0],
r->id.idiag_dst[0] and leave the rest untouched:

  r->id.idiag_src[0] = inet->inet_rcv_saddr;
  r->id.idiag_dst[0] = inet->inet_daddr;

struct inet_diag_msg embeds struct inet_diag_sockid that is correctly /
fully filled out in IPv6 case, but for IPv4 not.

So just zero them out by using plain memset (for this little amount of
bytes it's probably not worth the extra check for idiag_family == AF_INET).

Similarly, fix also other places where we fill that out.

Reported-by: Jakub Zawadzki 
Signed-off-by: Daniel Borkmann 
Signed-off-by: David S. Miller