linux-stable.git/net/ipv4, branch v3.2.65

Patch for 3.2.x, 3.4.x IP identifier regression

2014-12-14T16:24:01+00:00

With commits 73f156a6e8c1 ("inetpeer: get rid of ip_id_count") and
04ca6973f7c1 ("ip: make IP identifiers less predictable"), IP
identifiers are generated from a counter chosen from an array of
counters indexed by the hash of the outgoing packet header's source
address, destination address, and protocol number.  Thus, in
__ip_make_skb(), we must now call ip_select_ident() only after setting
these fields in the IP header to prevent IP identifiers from being
generated from bogus counters.

IP id sequence before fix: 18174, 5789, 5953, 59420, 59637, ...
After fix: 5967, 6185, 6374, 6600, 6795, 6892, 7051, 7288, ...

Signed-off-by: Jeffrey Knockel 
Signed-off-by: Ben Hutchings 
Cc: Eric Dumazet

tcp: be more strict before accepting ECN negociation

2014-12-14T16:24:00+00:00

commit bd14b1b2e29bd6812597f896dde06eaf7c6d2f24 upstream.

It appears some networks play bad games with the two bits reserved for
ECN. This can trigger false congestion notifications and very slow
transferts.

Since RFC 3168 (6.1.1) forbids SYN packets to carry CT bits, we can
disable TCP ECN negociation if it happens we receive mangled CT bits in
the SYN packet.

Signed-off-by: Eric Dumazet 
Cc: Perry Lorier 
Cc: Matt Mathis 
Cc: Yuchung Cheng 
Cc: Neal Cardwell 
Cc: Wilmer van der Gaast 
Cc: Ankur Jain 
Cc: Tom Herbert 
Cc: Dave Täht 
Acked-by: Neal Cardwell 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings 
Cc: Florian Westphal

ipv4: disable bh while doing route gc

2014-11-05T20:27:47+00:00

Further tests revealed that after moving the garbage collector to a work
queue and protecting it with a spinlock may leave the system prone to
soft lockups if bottom half gets very busy.

It was reproced with a set of firewall rules that REJECTed packets. If
the NIC bottom half handler ends up running on the same CPU that is
running the garbage collector on a very large cache, the garbage
collector will not be able to do its job due to the amount of work
needed for handling the REJECTs and also won't reschedule.

The fix is to disable bottom half during the garbage collecting, as it
already was in the first place (most calls to it came from softirqs).

Signed-off-by: Marcelo Ricardo Leitner 
Acked-by: Hannes Frederic Sowa 
Acked-by: David S. Miller 
Signed-off-by: Ben Hutchings

ipv4: avoid parallel route cache gc executions

2014-11-05T20:27:47+00:00

When rt_intern_hash() has to deal with neighbour cache overflowing,
it triggers the route cache garbage collector in an attempt to free
some references on neighbour entries.

Such call cannot be done async but should also not run in parallel with
an already-running one, so that they don't collapse fighting over the
hash lock entries.

This patch thus blocks parallel executions with spinlocks:
- A call from worker and from rt_intern_hash() are not the same, and
cannot be merged, thus they will wait each other on rt_gc_lock.
- Calls to gc from rt_intern_hash() may happen in parallel but we must
wait for it to finish in order to try again. This dedup and
synchrinozation is then performed by the locking just before calling
__do_rt_garbage_collect().

Signed-off-by: Marcelo Ricardo Leitner 
Acked-by: Hannes Frederic Sowa 
Signed-off-by: Ben Hutchings

ipv4: move route garbage collector to work queue

2014-11-05T20:27:46+00:00

Currently the route garbage collector gets called by dst_alloc() if it
have more entries than the threshold. But it's an expensive call, that
don't really need to be done by then.

Another issue with current way is that it allows running the garbage
collector with the same start parameters on multiple CPUs at once, which
is not optimal. A system may even soft lockup if the cache is big enough
as the garbage collectors will be fighting over the hash lock entries.

This patch thus moves the garbage collector to run asynchronously on a
work queue, much similar to how rt_expire_check runs.

There is one condition left that allows multiple executions, which is
handled by the next patch.

Signed-off-by: Marcelo Ricardo Leitner 
Acked-by: Hannes Frederic Sowa 
Signed-off-by: Ben Hutchings

tcp: Fix integer-overflow in TCP vegas

2014-09-13T22:41:48+00:00

[ Upstream commit 1f74e613ded11517db90b2bd57e9464d9e0fb161 ]

In vegas we do a multiplication of the cwnd and the rtt. This
may overflow and thus their result is stored in a u64. However, we first
need to cast the cwnd so that actually 64-bit arithmetic is done.

Then, we need to do do_div to allow this to be used on 32-bit arches.

Cc: Stephen Hemminger 
Cc: Neal Cardwell 
Cc: Eric Dumazet 
Cc: David Laight 
Cc: Doug Leith 
Fixes: 8d3a564da34e (tcp: tcp_vegas cong avoid fix)
Signed-off-by: Christoph Paasch 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

tcp: Fix integer-overflows in TCP veno

2014-09-13T22:41:48+00:00

[ Upstream commit 45a07695bc64b3ab5d6d2215f9677e5b8c05a7d0 ]

In veno we do a multiplication of the cwnd and the rtt. This
may overflow and thus their result is stored in a u64. However, we first
need to cast the cwnd so that actually 64-bit arithmetic is done.

A first attempt at fixing 76f1017757aa0 ([TCP]: TCP Veno congestion
control) was made by 159131149c2 (tcp: Overflow bug in Vegas), but it
failed to add the required cast in tcp_veno_cong_avoid().

Fixes: 76f1017757aa0 ([TCP]: TCP Veno congestion control)
Signed-off-by: Christoph Paasch 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

ip: make IP identifiers less predictable

2014-09-13T22:41:48+00:00

[ Upstream commit 04ca6973f7c1a0d8537f2d9906a0cf8e69886d75 ]

In "Counting Packets Sent Between Arbitrary Internet Hosts", Jeffrey and
Jedidiah describe ways exploiting linux IP identifier generation to
infer whether two machines are exchanging packets.

With commit 73f156a6e8c1 ("inetpeer: get rid of ip_id_count"), we
changed IP id generation, but this does not really prevent this
side-channel technique.

This patch adds a random amount of perturbation so that IP identifiers
for a given destination [1] are no longer monotonically increasing after
an idle period.

Note that prandom_u32_max(1) returns 0, so if generator is used at most
once per jiffy, this patch inserts no hole in the ID suite and do not
increase collision probability.

This is jiffies based, so in the worst case (HZ=1000), the id can
rollover after ~65 seconds of idle time, which should be fine.

We also change the hash used in __ip_select_ident() to not only hash
on daddr, but also saddr and protocol, so that ICMP probes can not be
used to infer information for other protocols.

For IPv6, adds saddr into the hash as well, but not nexthdr.

If I ping the patched target, we can see ID are now hard to predict.

21:57:11.008086 IP (...)
    A > target: ICMP echo request, seq 1, length 64
21:57:11.010752 IP (... id 2081 ...)
    target > A: ICMP echo reply, seq 1, length 64

21:57:12.013133 IP (...)
    A > target: ICMP echo request, seq 2, length 64
21:57:12.015737 IP (... id 3039 ...)
    target > A: ICMP echo reply, seq 2, length 64

21:57:13.016580 IP (...)
    A > target: ICMP echo request, seq 3, length 64
21:57:13.019251 IP (... id 3437 ...)
    target > A: ICMP echo reply, seq 3, length 64

[1] TCP sessions uses a per flow ID generator not changed by this patch.

Signed-off-by: Eric Dumazet 
Reported-by: Jeffrey Knockel 
Reported-by: Jedidiah R. Crandall 
Cc: Willy Tarreau 
Cc: Hannes Frederic Sowa 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

inetpeer: get rid of ip_id_count

2014-09-13T22:41:48+00:00

[ Upstream commit 73f156a6e8c1074ac6327e0abd1169e95eb66463 ]

Ideally, we would need to generate IP ID using a per destination IP
generator.

linux kernels used inet_peer cache for this purpose, but this had a huge
cost on servers disabling MTU discovery.

1) each inet_peer struct consumes 192 bytes

2) inetpeer cache uses a binary tree of inet_peer structs,
   with a nominal size of ~66000 elements under load.

3) lookups in this tree are hitting a lot of cache lines, as tree depth
   is about 20.

4) If server deals with many tcp flows, we have a high probability of
   not finding the inet_peer, allocating a fresh one, inserting it in
   the tree with same initial ip_id_count, (cf secure_ip_id())

5) We garbage collect inet_peer aggressively.

IP ID generation do not have to be 'perfect'

Goal is trying to avoid duplicates in a short period of time,
so that reassembly units have a chance to complete reassembly of
fragments belonging to one message before receiving other fragments
with a recycled ID.

We simply use an array of generators, and a Jenkin hash using the dst IP
as a key.

ipv6_select_ident() is put back into net/ipv6/ip6_output.c where it
belongs (it is only used from this file)

secure_ip_id() and secure_ipv6_id() no longer are needed.

Rename ip_select_ident_more() to ip_select_ident_segs() to avoid
unnecessary decrement/increment of the number of segments.

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Ben Hutchings

netlabel: fix a problem when setting bits below the previously lowest bit

2014-09-13T22:41:42+00:00

commit 41c3bd2039e0d7b3dc32313141773f20716ec524 upstream.

The NetLabel category (catmap) functions have a problem in that they
assume categories will be set in an increasing manner, e.g. the next
category set will always be larger than the last.  Unfortunately, this
is not a valid assumption and could result in problems when attempting
to set categories less than the startbit in the lowest catmap node.
In some cases kernel panics and other nasties can result.

This patch corrects the problem by checking for this and allocating a
new catmap node instance and placing it at the front of the list.

Reported-by: Christian Evans 
Signed-off-by: Paul Moore 
Tested-by: Casey Schaufler 
[bwh: Backported to 3.2: adjust filename for SMACK]
Signed-off-by: Ben Hutchings