linux-stable.git/net/ipv4/route.c, branch linux-3.2.y

route: do not cache fib route info on local routes with oif

2017-03-16T02:18:51+00:00

[ Upstream commit d6d5e999e5df67f8ec20b6be45e2229455ee3699 ]

For local routes that require a particular output interface we do not want
to cache the result.  Caching the result causes incorrect behaviour when
there are multiple source addresses on the interface.  The end result
being that if the intended recipient is waiting on that interface for the
packet he won't receive it because it will be delivered on the loopback
interface and the IP_PKTINFO ipi_ifindex will be set to the loopback
interface as well.

This can be tested by running a program such as "dhcp_release" which
attempts to inject a packet on a particular interface so that it is
received by another program on the same board.  The receiving process
should see an IP_PKTINFO ipi_ifndex value of the source interface
(e.g., eth1) instead of the loopback interface (e.g., lo).  The packet
will still appear on the loopback interface in tcpdump but the important
aspect is that the CMSG info is correct.

Sample dhcp_release command line:

   dhcp_release eth1 192.168.204.222 02:11:33:22:44:66

Signed-off-by: Allain Legacy 
Signed off-by: Chris Friesen 
Reviewed-by: Julian Anastasov 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

ipmr, ip6mr: fix scheduling while atomic and a deadlock with ipmr_get_route

2016-11-20T01:01:42+00:00

commit 2cf750704bb6d7ed8c7d732e071dd1bc890ea5e8 upstream.

Since the commit below the ipmr/ip6mr rtnl_unicast() code uses the portid
instead of the previous dst_pid which was copied from in_skb's portid.
Since the skb is new the portid is 0 at that point so the packets are sent
to the kernel and we get scheduling while atomic or a deadlock (depending
on where it happens) by trying to acquire rtnl two times.
Also since this is RTM_GETROUTE, it can be triggered by a normal user.

Here's the sleeping while atomic trace:
[ 7858.212557] BUG: sleeping function called from invalid context at kernel/locking/mutex.c:620
[ 7858.212748] in_atomic(): 1, irqs_disabled(): 0, pid: 0, name: swapper/0
[ 7858.212881] 2 locks held by swapper/0/0:
[ 7858.213013]  #0:  (((&mrt->ipmr_expire_timer))){+.-...}, at: [] call_timer_fn+0x5/0x350
[ 7858.213422]  #1:  (mfc_unres_lock){+.....}, at: [] ipmr_expire_process+0x25/0x130
[ 7858.213807] CPU: 0 PID: 0 Comm: swapper/0 Not tainted 4.8.0-rc7+ #179
[ 7858.213934] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140531_083030-gandalf 04/01/2014
[ 7858.214108]  0000000000000000 ffff88005b403c50 ffffffff813a7804 0000000000000000
[ 7858.214412]  ffffffff81a1338e ffff88005b403c78 ffffffff810a4a72 ffffffff81a1338e
[ 7858.214716]  000000000000026c 0000000000000000 ffff88005b403ca8 ffffffff810a4b9f
[ 7858.215251] Call Trace:
[ 7858.215412]    [] dump_stack+0x85/0xc1
[ 7858.215662]  [] ___might_sleep+0x192/0x250
[ 7858.215868]  [] __might_sleep+0x6f/0x100
[ 7858.216072]  [] mutex_lock_nested+0x33/0x4d0
[ 7858.216279]  [] ? netlink_lookup+0x25f/0x460
[ 7858.216487]  [] rtnetlink_rcv+0x1b/0x40
[ 7858.216687]  [] netlink_unicast+0x19c/0x260
[ 7858.216900]  [] rtnl_unicast+0x20/0x30
[ 7858.217128]  [] ipmr_destroy_unres+0xa9/0xf0
[ 7858.217351]  [] ipmr_expire_process+0x8f/0x130
[ 7858.217581]  [] ? ipmr_net_init+0x180/0x180
[ 7858.217785]  [] ? ipmr_net_init+0x180/0x180
[ 7858.217990]  [] call_timer_fn+0xa5/0x350
[ 7858.218192]  [] ? call_timer_fn+0x5/0x350
[ 7858.218415]  [] ? ipmr_net_init+0x180/0x180
[ 7858.218656]  [] run_timer_softirq+0x260/0x640
[ 7858.218865]  [] ? __do_softirq+0xbb/0x54f
[ 7858.219068]  [] __do_softirq+0xe8/0x54f
[ 7858.219269]  [] irq_exit+0xb8/0xc0
[ 7858.219463]  [] smp_apic_timer_interrupt+0x42/0x50
[ 7858.219678]  [] apic_timer_interrupt+0x8c/0xa0
[ 7858.219897]    [] ? native_safe_halt+0x6/0x10
[ 7858.220165]  [] ? trace_hardirqs_on+0xd/0x10
[ 7858.220373]  [] default_idle+0x23/0x190
[ 7858.220574]  [] arch_cpu_idle+0xf/0x20
[ 7858.220790]  [] default_idle_call+0x4c/0x60
[ 7858.221016]  [] cpu_startup_entry+0x39b/0x4d0
[ 7858.221257]  [] rest_init+0x135/0x140
[ 7858.221469]  [] start_kernel+0x50e/0x51b
[ 7858.221670]  [] ? early_idt_handler_array+0x120/0x120
[ 7858.221894]  [] x86_64_start_reservations+0x2a/0x2c
[ 7858.222113]  [] x86_64_start_kernel+0x13b/0x14a

Fixes: 2942e9005056 ("[RTNETLINK]: Use rtnl_unicast() for rtnetlink unicasts")
Signed-off-by: Nikolay Aleksandrov 
Signed-off-by: David S. Miller 
[bwh: Backported to 3.2:
 - Use 'pid' instead of 'portid' where necessary
 - Adjust context]
Signed-off-by: Ben Hutchings

ipv4: disable bh while doing route gc

2014-11-05T20:27:47+00:00

Further tests revealed that after moving the garbage collector to a work
queue and protecting it with a spinlock may leave the system prone to
soft lockups if bottom half gets very busy.

It was reproced with a set of firewall rules that REJECTed packets. If
the NIC bottom half handler ends up running on the same CPU that is
running the garbage collector on a very large cache, the garbage
collector will not be able to do its job due to the amount of work
needed for handling the REJECTs and also won't reschedule.

The fix is to disable bottom half during the garbage collecting, as it
already was in the first place (most calls to it came from softirqs).

Signed-off-by: Marcelo Ricardo Leitner 
Acked-by: Hannes Frederic Sowa 
Acked-by: David S. Miller 
Signed-off-by: Ben Hutchings

ipv4: avoid parallel route cache gc executions

2014-11-05T20:27:47+00:00

When rt_intern_hash() has to deal with neighbour cache overflowing,
it triggers the route cache garbage collector in an attempt to free
some references on neighbour entries.

Such call cannot be done async but should also not run in parallel with
an already-running one, so that they don't collapse fighting over the
hash lock entries.

This patch thus blocks parallel executions with spinlocks:
- A call from worker and from rt_intern_hash() are not the same, and
cannot be merged, thus they will wait each other on rt_gc_lock.
- Calls to gc from rt_intern_hash() may happen in parallel but we must
wait for it to finish in order to try again. This dedup and
synchrinozation is then performed by the locking just before calling
__do_rt_garbage_collect().

Signed-off-by: Marcelo Ricardo Leitner 
Acked-by: Hannes Frederic Sowa 
Signed-off-by: Ben Hutchings

ipv4: move route garbage collector to work queue

2014-11-05T20:27:46+00:00

Currently the route garbage collector gets called by dst_alloc() if it
have more entries than the threshold. But it's an expensive call, that
don't really need to be done by then.

Another issue with current way is that it allows running the garbage
collector with the same start parameters on multiple CPUs at once, which
is not optimal. A system may even soft lockup if the cache is big enough
as the garbage collectors will be fighting over the hash lock entries.

This patch thus moves the garbage collector to run asynchronously on a
work queue, much similar to how rt_expire_check runs.

There is one condition left that allows multiple executions, which is
handled by the next patch.

Signed-off-by: Marcelo Ricardo Leitner 
Acked-by: Hannes Frederic Sowa 
Signed-off-by: Ben Hutchings

ip: make IP identifiers less predictable

2014-09-13T22:41:48+00:00

[ Upstream commit 04ca6973f7c1a0d8537f2d9906a0cf8e69886d75 ]

In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
Jedidiah describe ways exploiting linux IP identifier generation to
infer whether two machines are exchanging packets.

With commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we
changed IP id generation, but this does not really prevent this
side-channel technique.

This patch adds a random amount of perturbation so that IP identifiers
for a given destination [1] are no longer monotonically increasing after
an idle period.

Note that prandom_u32_max(1) returns 0, so if generator is used at most
once per jiffy, this patch inserts no hole in the ID suite and do not
increase collision probability.

This is jiffies based, so in the worst case (HZ=1000), the id can
rollover after ~65 seconds of idle time, which should be fine.

We also change the hash used in __ip_select_ident() to not only hash
on daddr, but also saddr and protocol, so that ICMP probes can not be
used to infer information for other protocols.

For IPv6, adds saddr into the hash as well, but not nexthdr.

If I ping the patched target, we can see ID are now hard to predict.

21:57:11.008086 IP (...)
    A > target: ICMP echo request, seq 1, length 64
21:57:11.010752 IP (... id 2081 ...)
    target > A: ICMP echo reply, seq 1, length 64

21:57:12.013133 IP (...)
    A > target: ICMP echo request, seq 2, length 64
21:57:12.015737 IP (... id 3039 ...)
    target > A: ICMP echo reply, seq 2, length 64

21:57:13.016580 IP (...)
    A > target: ICMP echo request, seq 3, length 64
21:57:13.019251 IP (... id 3437 ...)
    target > A: ICMP echo reply, seq 3, length 64

[1] TCP sessions uses a per flow ID generator not changed by this patch.

Signed-off-by: Eric Dumazet 
Reported-by: Jeffrey Knockel 
Reported-by: Jedidiah R. Crandall 
Cc: Willy Tarreau 
Cc: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

inetpeer: get rid of ip_id_count

2014-09-13T22:41:48+00:00

[ Upstream commit 73f156a6e8c1074ac6327e0abd1169e95eb66463 ]

Ideally, we would need to generate IP ID using a per destination IP
generator.

linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.

1) each inet_peer struct consumes 192 bytes

2) inetpeer cache uses a binary tree of inet_peer structs,
   with a nominal size of ~66000 elements under load.

3) lookups in this tree are hitting a lot of cache lines, as tree depth
   is about 20.

4) If server deals with many tcp flows, we have a high probability of
   not finding the inet_peer, allocating a fresh one, inserting it in
   the tree with same initial ip_id_count, (cf secure_ip_id())

5) We garbage collect inet_peer aggressively.

IP ID generation do not have to be 'perfect'

Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.

We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.

ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)

secure_ip_id() and secure_ipv6_id() no longer are needed.

Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

ipv4: initialise the itag variable in __mkroute_input

2014-06-09T12:29:00+00:00

[ Upstream commit fbdc0ad095c0a299e9abf5d8ac8f58374951149a ]

the value of itag is a random value from stack, and may not be initiated by
fib_validate_source, which called fib_combine_itag if CONFIG_IP_ROUTE_CLASSID
is not set

This will make the cached dst uncertainty

Signed-off-by: Li RongQing 
Acked-by: Alexei Starovoitov 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

ipv4: fix ineffective source address selection

2013-11-28T14:01:56+00:00

[ Upstream commit 0a7e22609067ff524fc7bbd45c6951dd08561667 ]

When sending out multicast messages, the source address in inet->mc_addr is
ignored and rewritten by an autoselected one. This is caused by a typo in
commit 813b3b5db831 ("ipv4: Use caller's on-stack flowi as-is in output
route lookups").

Signed-off-by: Jiri Benc 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

inetpeer: Invalidate the inetpeer tree along with the routing cache

2013-10-26T20:05:57+00:00

[ Upstream commit 5faa5df1fa2024bd750089ff21dcc4191798263d ]

We initialize the routing metrics with the values cached on the
inetpeer in rt_init_metrics(). So if we have the metrics cached on the
inetpeer, we ignore the user configured fib_metrics.

To fix this issue, we replace the old tree with a fresh initialized
inet_peer_base. The old tree is removed later with a delayed work queue.

Signed-off-by: Steffen Klassert 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings