linux.git/net/ipv4, branch v4.3-rc4

net: Initialize flow flags in input path

2015-09-30T04:52:32+00:00

The fib_table_lookup tracepoint found 2 places where the flowi4_flags is
not initialized.

Signed-off-by: David Ahern 
Signed-off-by: David S. Miller

net: Fix panic in icmp_route_lookup

2015-09-26T04:44:02+00:00

Andrey reported a panic:

[ 7249.865507] BUG: unable to handle kernel pointer dereference at 000000b4
[ 7249.865559] IP: [] icmp_route_lookup+0xaa/0x320
[ 7249.865598] *pdpt = 0000000030f7f001 *pde = 0000000000000000
[ 7249.865637] Oops: 0000 [#1]
...
[ 7249.866811] CPU: 0 PID: 0 Comm: swapper/0 Not tainted
4.3.0-999-generic #201509220155
[ 7249.866876] Hardware name: MSI MS-7250/MS-7250, BIOS 080014  08/02/2006
[ 7249.866916] task: c1a5ab00 ti: c1a52000 task.ti: c1a52000
[ 7249.866949] EIP: 0060:[] EFLAGS: 00210246 CPU: 0
[ 7249.866981] EIP is at icmp_route_lookup+0xaa/0x320
[ 7249.867012] EAX: 00000000 EBX: f483ba48 ECX: 00000000 EDX: f2e18a00
[ 7249.867045] ESI: 000000c0 EDI: f483ba70 EBP: f483b9ec ESP: f483b974
[ 7249.867077]  DS: 007b ES: 007b FS: 00d8 GS: 00e0 SS: 0068
[ 7249.867108] CR0: 8005003b CR2: 000000b4 CR3: 36ee07c0 CR4: 000006f0
[ 7249.867141] Stack:
[ 7249.867165]  320310ee 00000000 00000042 320310ee 00000000 c1aeca00
f3920240 f0c69180
[ 7249.867268]  f483ba04 f855058b a89b66cd f483ba44 f8962f4b 00000000
e659266c f483ba54
[ 7249.867361]  8004753c f483ba5c f8962f4b f2031140 000003c1 ffbd8fa0
c16b0e00 00000064
[ 7249.867448] Call Trace:
[ 7249.867494]  [] ? e1000_xmit_frame+0x87b/0xdc0 [e1000e]
[ 7249.867534]  [] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
[ 7249.867576]  [] ? tcp_in_window+0xeb/0xb10 [nf_conntrack]
[ 7249.867615]  [] ? icmp_send+0xa0/0x380
[ 7249.867648]  [] icmp_send+0x2cf/0x380
[ 7249.867681]  [] nf_send_unreach+0xa6/0xc0 [nf_reject_ipv4]
[ 7249.867714]  [] reject_tg+0x7a/0x9f [ipt_REJECT]
[ 7249.867746]  [] ipt_do_table+0x317/0x70c [ip_tables]
[ 7249.867780]  [] ? __nf_conntrack_find_get+0x166/0x3b0
[nf_conntrack]
[ 7249.867838]  [] ? nf_conntrack_in+0x398/0x600 [nf_conntrack]
[ 7249.867889]  [] iptable_filter_hook+0x35/0x80 [iptable_filter]
[ 7249.867933]  [] nf_iterate+0x71/0x80
[ 7249.867970]  [] nf_hook_slow+0x65/0xc0
[ 7249.868002]  [] __ip_local_out_sk+0xc1/0xd0
[ 7249.868034]  [] ? ip_forward_options+0x1a0/0x1a0
[ 7249.868066]  [] ip_local_out_sk+0x16/0x30
[ 7249.868097]  [] ip_send_skb+0x14/0x80
[ 7249.868129]  [] ip_push_pending_frames+0x34/0x40
[ 7249.868163]  [] ip_send_unicast_reply+0x282/0x310
[ 7249.868196]  [] tcp_v4_send_reset+0x1b3/0x380
[ 7249.868227]  [] tcp_v4_rcv+0x323/0x990
[ 7249.868257]  [] ? nf_iterate+0x71/0x80
[ 7249.868289]  [] ip_local_deliver_finish+0x8b/0x230
[ 7249.868322]  [] ip_local_deliver+0x4c/0xa0
[ 7249.868353]  [] ? ip_rcv_finish+0x390/0x390
[ 7249.868384]  [] ip_rcv_finish+0x7c/0x390
[ 7249.868415]  [] ip_rcv+0x2e0/0x420
...

Prior to the VRF change the oif was not set in the flow struct, so the
VRF support should really have only added the vrf_master_ifindex lookup.

Fixes: 613d09b30f8b ("net: Use VRF device index for lookups on TX")
Cc: Andrey Melnikov 
Signed-off-by: David Ahern 
Signed-off-by: David S. Miller

lwtunnel: remove source and destination UDP port config option

2015-09-24T21:31:37+00:00

The UDP tunnel config is asymmetric wrt. to the ports used. The source and
destination ports from one direction of the tunnel are not related to the
ports of the other direction. We need to be able to respond to ARP requests
using the correct ports without involving routing.

As the consequence, UDP ports need to be fixed property of the tunnel
interface and cannot be set per route. Remove the ability to set ports per
route. This is still okay to do, as no kernel has been released with these
attributes yet.

Note that the ability to specify source and destination ports is preserved
for other users of the lwtunnel API which don't use routes for tunnel key
specification (like openvswitch).

If in the future we rework ARP handling to allow port specification, the
attributes can be added back.

Signed-off-by: Jiri Benc 
Acked-by: Thomas Graf 
Signed-off-by: David S. Miller

ipv4: send arp replies to the correct tunnel

2015-09-24T21:31:36+00:00

When using ip lwtunnels, the additional data for xmit (basically, the actual
tunnel to use) are carried in ip_tunnel_info either in dst->lwtstate or in
metadata dst. When replying to ARP requests, we need to send the reply to
the same tunnel the request came from. This means we need to construct
proper metadata dst for ARP replies.

We could perform another route lookup to get a dst entry with the correct
lwtstate. However, this won't always ensure that the outgoing tunnel is the
same as the incoming one, and it won't work anyway for IPv4 duplicate
address detection.

The only thing to do is to "reverse" the ip_tunnel_info.

Signed-off-by: Jiri Benc 
Acked-by: Thomas Graf 
Signed-off-by: David S. Miller

tcp: add proper TS val into RST packets

2015-09-23T21:24:07+00:00

RST packets sent on behalf of TCP connections with TS option (RFC 7323
TCP timestamps) have incorrect TS val (set to 0), but correct TS ecr.

A > B: Flags [S], seq 0, win 65535, options [mss 1000,nop,nop,TS val 100
ecr 0], length 0
B > A: Flags [S.], seq 2444755794, ack 1, win 28960, options [mss
1460,nop,nop,TS val 7264344 ecr 100], length 0
A > B: Flags [.], ack 1, win 65535, options [nop,nop,TS val 110 ecr
7264344], length 0

B > A: Flags [R.], seq 1, ack 1, win 28960, options [nop,nop,TS val 0
ecr 110], length 0

We need to call skb_mstamp_get() to get proper TS val,
derived from skb->skb_mstamp

Note that RFC 1323 was advocating to not send TS option in RST segment,
but RFC 7323 recommends the opposite :

  Once TSopt has been successfully negotiated, that is both  and
   contain TSopt, the TSopt MUST be sent in every non-
  segment for the duration of the connection, and SHOULD be sent in an
   segment (see Section 5.2 for details)

Note this RFC recommends to send TS val = 0, but we believe it is
premature : We do not know if all TCP stacks are properly
handling the receive side :

   When an  segment is
   received, it MUST NOT be subjected to the PAWS check by verifying an
   acceptable value in SEG.TSval, and information from the Timestamps
   option MUST NOT be used to update connection state information.
   SEG.TSecr MAY be used to provide stricter  acceptance checks.

In 5 years, if/when all TCP stack are RFC 7323 ready, we might consider
to decide to send TS val = 0, if it buys something.

Fixes: 7faee5c0d514 ("tcp: remove TCP_SKB_CB(skb)->when")
Signed-off-by: Eric Dumazet 
Acked-by: Yuchung Cheng 
Signed-off-by: David S. Miller

inet: fix races in reqsk_queue_hash_req()

2015-09-21T23:32:29+00:00

Before allowing lockless LISTEN processing, we need to make
sure to arm the SYN_RECV timer before the req socket is visible
in hash tables.

Also, req->rsk_hash should be written before we set rsk_refcnt
to a non zero value.

Fixes: fa76ce7328b2 ("inet: get rid of central tcp/dccp listener timer")
Signed-off-by: Eric Dumazet 
Cc: Ying Cai 
Signed-off-by: David S. Miller

tcp/dccp: fix timewait races in timer handling

2015-09-21T23:32:29+00:00

When creating a timewait socket, we need to arm the timer before
allowing other cpus to find it. The signal allowing cpus to find
the socket is setting tw_refcnt to non zero value.

As we set tw_refcnt in __inet_twsk_hashdance(), we therefore need to
call inet_twsk_schedule() first.

This also means we need to remove tw_refcnt changes from
inet_twsk_schedule() and let the caller handle it.

Note that because we use mod_timer_pinned(), we have the guarantee
the timer wont expire before we set tw_refcnt as we run in BH context.

To make things more readable I introduced inet_twsk_reschedule() helper.

When rearming the timer, we can use mod_timer_pending() to make sure
we do not rearm a canceled timer.

Note: This bug can possibly trigger if packets of a flow can hit
multiple cpus. This does not normally happen, unless flow steering
is broken somehow. This explains this bug was spotted ~5 months after
its introduction.

A similar fix is needed for SYN_RECV sockets in reqsk_queue_hash_req(),
but will be provided in a separate patch for proper tracking.

Fixes: 789f558cfb36 ("tcp/dccp: get rid of central timewait timer")
Signed-off-by: Eric Dumazet 
Reported-by: Ying Cai 
Signed-off-by: David S. Miller

iptunnel: make rx/tx bytes counters consistent

2015-09-21T05:36:22+00:00

This was already done a long time ago in
commit 64194c31a0b6 ("inet: Make tunnel RX/TX byte counters more consistent")
but tx path was broken (at least since 3.10).

Before the patch the gre header was included on tx.

After the patch:
$ ping -c1 192.168.0.121 ; ip -s l ls dev gre1
PING 192.168.0.121 (192.168.0.121) 56(84) bytes of data.
64 bytes from 192.168.0.121: icmp_req=1 ttl=64 time=2.95 ms

--- 192.168.0.121 ping statistics ---
1 packets transmitted, 1 received, 0% packet loss, time 0ms
rtt min/avg/max/mdev = 2.955/2.955/2.955/0.000 ms
7: gre1@NONE:  mtu 1468 qdisc noqueue state UNKNOWN mode DEFAULT group default
    link/gre 10.16.0.249 peer 10.16.0.121
    RX: bytes  packets  errors  dropped overrun mcast
    84         1        0       0       0       0
    TX: bytes  packets  errors  dropped carrier collsns
    84         1        0       0       0       0

Reported-by: Julien Meunier 
Signed-off-by: Nicolas Dichtel 
Signed-off-by: David S. Miller

net: Fix behaviour of unreachable, blackhole and prohibit routes

2015-09-21T04:45:08+00:00

Man page of ip-route(8) says following about route types:

  unreachable - these destinations are unreachable.  Packets are dis‐
  carded and the ICMP message host unreachable is generated.  The local
  senders get an EHOSTUNREACH error.

  blackhole - these destinations are unreachable.  Packets are dis‐
  carded silently.  The local senders get an EINVAL error.

  prohibit - these destinations are unreachable.  Packets are discarded
  and the ICMP message communication administratively prohibited is
  generated.  The local senders get an EACCES error.

In the inet6 address family, this was correct, except the local senders
got ENETUNREACH error instead of EHOSTUNREACH in case of unreachable route.
In the inet address family, all three route types generated ICMP message
net unreachable, and the local senders got ENETUNREACH error.

In both address families all three route types now behave consistently
with documentation.

Signed-off-by: Nikola Forró 
Signed-off-by: David S. Miller

tcp_cubic: do not set epoch_start in the future

2015-09-18T05:35:07+00:00

Tracking idle time in bictcp_cwnd_event() is imprecise, as epoch_start
is normally set at ACK processing time, not at send time.

Doing a proper fix would need to add an additional state variable,
and does not seem worth the trouble, given CUBIC bug has been there
forever before Jana noticed it.

Let's simply not set epoch_start in the future, otherwise
bictcp_update() could overflow and CUBIC would again
grow cwnd too fast.

This was detected thanks to a packetdrill test Neal wrote that was flaky
before applying this fix.

Fixes: 30927520dbae ("tcp_cubic: better follow cubic curve after idle period")
Signed-off-by: Eric Dumazet 
Signed-off-by: Neal Cardwell 
Signed-off-by: Yuchung Cheng 
Cc: Jana Iyengar 
Signed-off-by: David S. Miller