linux-stable.git/include/linux/skbuff.h, branch v4.5.3

mld, igmp: Fix reserved tailroom calculation

2016-03-03T20:41:07+00:00

The current reserved_tailroom calculation fails to take hlen and tlen into
account.

skb:
[__hlen__|__data____________|__tlen___|__extra__]
^                                               ^
head                                            skb_end_offset

In this representation, hlen + data + tlen is the size passed to alloc_skb.
"extra" is the extra space made available in __alloc_skb because of
rounding up by kmalloc. We can reorder the representation like so:

[__hlen__|__data____________|__extra__|__tlen___]
^                                               ^
head                                            skb_end_offset

The maximum space available for ip headers and payload without
fragmentation is min(mtu, data + extra). Therefore,
reserved_tailroom
= data + extra + tlen - min(mtu, data + extra)
= skb_end_offset - hlen - min(mtu, skb_end_offset - hlen - tlen)
= skb_tailroom - min(mtu, skb_tailroom - tlen) ; after skb_reserve(hlen)

Compare the second line to the current expression:
reserved_tailroom = skb_end_offset - min(mtu, skb_end_offset)
and we can see that hlen and tlen are not taken into account.

The min() in the third line can be expanded into:
if mtu < skb_tailroom - tlen:
	reserved_tailroom = skb_tailroom - mtu
else:
	reserved_tailroom = tlen

Depending on hlen, tlen, mtu and the number of multicast address records,
the current code may output skbs that have less tailroom than
dev->needed_tailroom or it may output more skbs than needed because not all
space available is used.
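
As a rough illustration of the expanded min() above, the calculation can be
written as a small stand-alone C function. This is only a sketch: the name
reserved_tailroom_sketch and the plain integer parameters are hypothetical
stand-ins for the skb fields, and "tailroom" is assumed to be the tailroom
measured after skb_reserve(hlen).

```c
#include <stddef.h>

/*
 * Illustrative only: plain integers stand in for the skb fields.
 * tailroom is the tailroom after skb_reserve(hlen), i.e.
 * skb_end_offset - hlen in the notation used above.
 * Assumes tailroom >= tlen, so the subtraction cannot wrap.
 */
static size_t reserved_tailroom_sketch(size_t tailroom, size_t tlen, size_t mtu)
{
	if (mtu < tailroom - tlen)
		return tailroom - mtu;	/* the MTU limits the usable space */
	return tlen;			/* the buffer limits the usable space */
}
```

With tailroom = 2000, tlen = 16 and mtu = 1500 the MTU is the limit and 500
bytes are reserved; with tailroom = 1000 only tlen is reserved.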

Fixes: 4c672e4b ("ipv6: mld: fix add_grhead skb_over_panic for devs with large MTUs")
Signed-off-by: Benjamin Poirier 
Acked-by: Hannes Frederic Sowa 
Acked-by: Daniel Borkmann 
Signed-off-by: David S. Miller

net:Add sysctl_max_skb_frags

2016-02-09T09:28:06+00:00

Devices may have limits on the number of fragments in an skb they support.
The current codebase uses a constant as the maximum number of fragments one
skb can hold and use.
When scatter/gather is enabled and the traffic consists of many small
messages, the codebase uses the maximum number of fragments and may thereby
exceed the limit of certain devices.
This patch introduces a global variable as the maximum number of fragments.

Signed-off-by: Hans Westgaard Ry 
Reviewed-by: Håkon Bugge 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller

net: preserve IP control block during GSO segmentation

2016-01-15T19:35:24+00:00

skb_gso_segment() uses the skb control block during segmentation.
This patch adds 32 bytes of room for the previous control block, which
is copied into all resulting segments.

This patch fixes a kernel crash when fragmenting forwarded packets.
Fragmentation requires a valid IP CB in the skb for clearing IP options.
The patch also removes the custom save/restore in the ovs code, which is
now redundant.
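
The layout idea can be sketched in user space as follows. Everything here is
a toy model: struct toy_skb, the helper names and the 48-byte control block
size are assumptions made for illustration, and only the 32-byte figure comes
from the description above. GSO-private state is stored behind the first 32
bytes, so the earlier control block survives and can be copied into each
segment.

```c
#include <string.h>

#define TOY_CB_SIZE	48	/* assumed size of the control block */
#define TOY_PREV_CB	32	/* room kept for the previous control block */

struct toy_skb {
	unsigned char cb[TOY_CB_SIZE];
};

/* Store GSO-private state after the preserved area (memcpy avoids
 * any alignment assumptions about the byte array). */
static void toy_set_gso_mac_offset(struct toy_skb *skb, int mac_offset)
{
	memcpy(skb->cb + TOY_PREV_CB, &mac_offset, sizeof(mac_offset));
}

/* Copy the preserved control block from the original skb into a segment. */
static void toy_copy_prev_cb(struct toy_skb *seg, const struct toy_skb *orig)
{
	memcpy(seg->cb, orig->cb, TOY_PREV_CB);
}
```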

Signed-off-by: Konstantin Khlebnikov 
Link: http://lkml.kernel.org/r/CALYGNiP-0MZ-FExV2HutTvE9U-QQtkKSoE--KN=JQE5STYsjAA@mail.gmail.com
Signed-off-by: David S. Miller

bpf: add skb_postpush_rcsum and fix dev_forward_skb occasions

2016-01-10T22:54:28+00:00

Add a small helper skb_postpush_rcsum() and fix up redirect locations
that need CHECKSUM_COMPLETE fixups on ingress. dev_forward_skb() expects
a proper csum that covers also Ethernet header, f.e. since 2c26d34bbcc0
("net/core: Handle csum for CHECKSUM_COMPLETE VXLAN forwarding"), we
also do skb_postpull_rcsum() after pulling Ethernet header off via
eth_type_trans().

When using eBPF in a netns setup f.e. with vxlan in collect metadata mode,
I can trigger the following csum issue with an IPv6 setup:

  [  505.144065] dummy1: hw csum failure
  [...]
  [  505.144108] Call Trace:
  [  505.144112]    [] dump_stack+0x44/0x5c
  [  505.144134]  [] netdev_rx_csum_fault+0x3a/0x40
  [  505.144142]  [] __skb_checksum_complete+0xcf/0xe0
  [  505.144149]  [] nf_ip6_checksum+0xb2/0x120
  [  505.144161]  [] icmpv6_error+0x17e/0x328 [nf_conntrack_ipv6]
  [  505.144170]  [] ? ip6t_do_table+0x2fa/0x645 [ip6_tables]
  [  505.144177]  [] ? ipv6_get_l4proto+0x65/0xd0 [nf_conntrack_ipv6]
  [  505.144189]  [] nf_conntrack_in+0xc2/0x5a0 [nf_conntrack]
  [  505.144196]  [] ipv6_conntrack_in+0x1c/0x20 [nf_conntrack_ipv6]
  [  505.144204]  [] nf_iterate+0x5d/0x70
  [  505.144210]  [] nf_hook_slow+0x66/0xc0
  [  505.144218]  [] ipv6_rcv+0x3f2/0x4f0
  [  505.144225]  [] ? ip6_make_skb+0x1b0/0x1b0
  [  505.144232]  [] __netif_receive_skb_core+0x36b/0x9a0
  [  505.144239]  [] ? __netif_receive_skb+0x18/0x60
  [  505.144245]  [] __netif_receive_skb+0x18/0x60
  [  505.144252]  [] process_backlog+0x9f/0x140
  [  505.144259]  [] net_rx_action+0x145/0x320
  [...]

What happens is that on ingress, we push Ethernet header back in, either
from cls_bpf or right before skb_do_redirect(), but without updating csum.
The "hw csum failure" can be fixed by using the new skb_postpush_rcsum()
helper for the dev_forward_skb() case to correct the csum diff again.
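
The principle behind the fixup can be sketched in user space. The code below
is a simplified model, not the kernel implementation: toy_csum_partial is a
hypothetical byte-wise one's-complement sum that assumes even lengths and a
folded (at most 16-bit) starting value, and struct toy_rx_skb keeps only the
two fields needed here. It shows that adding the sum of the pushed header to
a CHECKSUM_COMPLETE value gives the sum over header plus payload.

```c
#include <stddef.h>
#include <stdint.h>

enum { TOY_CHECKSUM_NONE, TOY_CHECKSUM_COMPLETE };

struct toy_rx_skb {
	int ip_summed;
	uint32_t csum;	/* folded one's-complement sum of the covered data */
};

/* One's-complement sum over len bytes (len assumed even), continuing
 * from sum, which is assumed to be at most 0xffff. */
static uint32_t toy_csum_partial(const uint8_t *buf, size_t len, uint32_t sum)
{
	for (size_t i = 0; i + 1 < len; i += 2) {
		sum += ((uint32_t)buf[i] << 8) | buf[i + 1];
		sum = (sum & 0xffff) + (sum >> 16);	/* end-around carry */
	}
	return sum;
}

/* After pushing len bytes at start, extend the checksum to cover them. */
static void toy_postpush_rcsum(struct toy_rx_skb *skb,
			       const uint8_t *start, size_t len)
{
	if (skb->ip_summed == TOY_CHECKSUM_COMPLETE)
		skb->csum = toy_csum_partial(start, len, skb->csum);
}
```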

Thanks to Hannes Frederic Sowa for the csum_partial() idea!

Fixes: 3896d655f4d4 ("bpf: introduce bpf_clone_redirect() helper")
Fixes: 27b29f63058d ("bpf: add bpf_redirect() helper")
Signed-off-by: Daniel Borkmann 
Acked-by: Alexei Starovoitov 
Signed-off-by: David S. Miller

net: Elaborate on checksum offload interface description

2015-12-15T21:50:21+00:00

Add specifics and details to the description of the interface between
the stack and drivers for doing checksum offload. This description
is meant to be as specific and complete as possible.

Signed-off-by: Tom Herbert 
Signed-off-by: David S. Miller

net: Add skb_inner_transport_offset function

2015-12-15T21:49:57+00:00

Same thing as skb_transport_offset but returns the offset of the inner
transport header (when skb->encapsulation is set).
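
A minimal user-space sketch of what such an offset means, assuming the inner
transport header is recorded as an offset from the start of the buffer.
struct toy_encap_skb and toy_inner_transport_offset are hypothetical names
used only for this illustration.

```c
#include <stddef.h>

struct toy_encap_skb {
	unsigned char *head;			/* start of the buffer */
	unsigned char *data;			/* start of the current data */
	unsigned short inner_transport_header;	/* offset from head */
	unsigned int encapsulation : 1;
};

/* Distance from the current data pointer to the inner transport header. */
static ptrdiff_t toy_inner_transport_offset(const struct toy_encap_skb *skb)
{
	return (skb->head + skb->inner_transport_header) - skb->data;
}
```

For a buffer whose data starts 16 bytes after head and whose inner transport
header sits 86 bytes after head, the offset is 70.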

Signed-off-by: Tom Herbert 
Signed-off-by: David S. Miller

net: Fix typo in skb_fclone_busy

2015-12-14T21:27:00+00:00

This patch fixes a typo found in the comment of skb_fclone_busy.

Signed-off-by: Masanari Iida 
Signed-off-by: David S. Miller

core: enable more fine-grained datagram reception control

2015-12-07T04:31:54+00:00

The __skb_recv_datagram routine in core/datagram.c provides a general
skb reception facility supposed to be utilized by protocol modules
providing datagram sockets. It encompasses both the actual recvmsg code
and a surrounding 'sleep until data is available' loop. This is
inconvenient if a protocol module has to use additional locking in order
to maintain some per-socket state the generic datagram socket code is
unaware of (as the af_unix code does). The patch below moves the recvmsg
proper code into a new __skb_try_recv_datagram routine which doesn't
sleep and renames wait_for_more_packets to
__skb_wait_for_more_packets, both routines being exported interfaces. The
original __skb_recv_datagram routine is reimplemented on top of these
two functions such that its user-visible behaviour remains unchanged.
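
The resulting structure can be sketched with a toy model. Nothing below is
kernel code: struct toy_sock and the toy_* functions are hypothetical, and
the "wait" step merely moves a pending datagram into the queue instead of
sleeping. The point is only the shape of the loop, a non-sleeping attempt
followed by a wait that is retried until data arrives or waiting gives up.

```c
#include <errno.h>
#include <stddef.h>

struct toy_sock {
	int queued;	/* datagrams ready to be received */
	int in_flight;	/* datagrams that arrive while we wait */
};

/* Non-sleeping attempt: returns 1 and consumes a datagram,
 * or returns 0 with *err set to -EAGAIN. */
static int toy_try_recv_datagram(struct toy_sock *sk, int *err)
{
	if (sk->queued > 0) {
		sk->queued--;
		*err = 0;
		return 1;
	}
	*err = -EAGAIN;
	return 0;
}

/* Stand-in for sleeping: returns 0 if something arrived,
 * nonzero if the caller should give up. */
static int toy_wait_for_more_packets(struct toy_sock *sk, int *err)
{
	if (sk->in_flight == 0) {
		*err = -EAGAIN;	/* models an expired timeout */
		return 1;
	}
	sk->in_flight--;
	sk->queued++;
	return 0;
}

/* The blocking receive, rebuilt on top of the two routines above. */
static int toy_recv_datagram(struct toy_sock *sk, int *err)
{
	do {
		if (toy_try_recv_datagram(sk, err))
			return 1;
		if (*err != -EAGAIN)
			break;
	} while (!toy_wait_for_more_packets(sk, err));
	return 0;
}
```

A protocol that needs its own locking could call the two lower-level routines
directly and drop or retake its lock around the wait step.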

Signed-off-by: Rainer Weikusat 
Signed-off-by: David S. Miller

net: better skb->sender_cpu and skb->napi_id cohabitation

2015-11-18T21:17:37+00:00

skb->sender_cpu and skb->napi_id share a common storage,
and we had various bugs about this.

We had to call skb_sender_cpu_clear() in some places to
not leave a prior skb->napi_id and fool netdev_pick_tx().

As suggested by Alexei, we could split the space so that
these errors cannot happen.

0 value being reserved as the common (not initialized) value,
let's reserve [1 .. NR_CPUS] range for valid sender_cpu,
and [NR_CPUS+1 .. ~0U] for valid napi_id.
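
The split of the shared value space can be illustrated with a small
classifier. This is a sketch under assumptions: TOY_NR_CPUS is an arbitrary
stand-in for NR_CPUS, and the enum and function names are hypothetical.

```c
#define TOY_NR_CPUS 64u	/* assumption: stands in for the kernel's NR_CPUS */

enum toy_id_kind { TOY_ID_UNSET, TOY_ID_SENDER_CPU, TOY_ID_NAPI };

/* Decide what a value in the shared storage means, based on its range. */
static enum toy_id_kind toy_classify_id(unsigned int v)
{
	if (v == 0)
		return TOY_ID_UNSET;		/* common "not initialized" value */
	if (v <= TOY_NR_CPUS)
		return TOY_ID_SENDER_CPU;	/* [1 .. NR_CPUS] */
	return TOY_ID_NAPI;			/* [NR_CPUS+1 .. ~0U] */
}
```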

This will allow proper busy polling support over tunnels.

Signed-off-by: Eric Dumazet 
Suggested-by: Alexei Starovoitov 
Acked-by: Alexei Starovoitov 
Signed-off-by: David S. Miller

mm, page_alloc: distinguish between being unable to sleep, unwilling to sleep and avoiding waking kswapd

2015-11-07T01:50:42+00:00

__GFP_WAIT has been used to identify atomic context in callers that hold
spinlocks or are in interrupts.  They are expected to be high priority and
have access to one of two watermarks lower than "min" which can be referred
to as the "atomic reserve".  __GFP_HIGH users get access to the first
lower watermark and can be called the "high priority reserve".

Over time, callers had a requirement to not block when fallback options
were available.  Some have abused __GFP_WAIT leading to a situation where
an optimistic allocation with a fallback option can access atomic
reserves.

This patch uses __GFP_ATOMIC to identify callers that are truly atomic,
cannot sleep and have no alternative.  High priority users continue to use
__GFP_HIGH.  __GFP_DIRECT_RECLAIM identifies callers that can sleep and
are willing to enter direct reclaim.  __GFP_KSWAPD_RECLAIM identifies
callers that want to wake kswapd for background reclaim.  __GFP_WAIT is
redefined as a caller that is willing to enter direct reclaim and wake
kswapd for background reclaim.

This patch then converts a number of sites:

o __GFP_ATOMIC is used by callers that are high priority and have memory
  pools for those requests. GFP_ATOMIC uses this flag.

o Callers that have a limited mempool to guarantee forward progress clear
  __GFP_DIRECT_RECLAIM but keep __GFP_KSWAPD_RECLAIM. bio allocations fall
  into this category where kswapd will still be woken but atomic reserves
  are not used as there is a one-entry mempool to guarantee progress.

o Callers that are checking if they are non-blocking should use the
  helper gfpflags_allow_blocking() where possible. This is because
  checking for __GFP_WAIT as was done historically now can trigger false
  positives. Some exceptions like dm-crypt.c exist where the code intent
  is clearer if __GFP_DIRECT_RECLAIM is used instead of the helper due to
  flag manipulations.

o Callers that built their own GFP flags instead of starting with GFP_KERNEL
  and friends now also need to specify __GFP_KSWAPD_RECLAIM.

The first key hazard to watch out for is callers that removed __GFP_WAIT
and were depending on access to atomic reserves for inconspicuous reasons.
In some cases it may be appropriate for them to use __GFP_HIGH.

The second key hazard is callers that assembled their own combination of
GFP flags instead of starting with something like GFP_KERNEL.  They may
now wish to specify __GFP_KSWAPD_RECLAIM.  It's almost certainly harmless
if it's missed in most cases as other activity will wake kswapd.
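
The false-positive problem with testing for __GFP_WAIT can be shown with a
toy model of the flags. The bit values and all TOY_/toy_ names below are
placeholders invented for this sketch, not the kernel's definitions; the
blocking check is assumed to look only at the direct-reclaim bit.

```c
typedef unsigned int toy_gfp_t;

/* Placeholder bit values, chosen only for this illustration. */
#define TOY_GFP_DIRECT_RECLAIM	0x1u
#define TOY_GFP_KSWAPD_RECLAIM	0x2u

/* Models the redefined __GFP_WAIT: direct reclaim plus waking kswapd. */
#define TOY_GFP_WAIT	(TOY_GFP_DIRECT_RECLAIM | TOY_GFP_KSWAPD_RECLAIM)

/* Models the helper: may block only if direct reclaim is allowed. */
static int toy_allow_blocking(toy_gfp_t flags)
{
	return (flags & TOY_GFP_DIRECT_RECLAIM) != 0;
}

/* Models the historical check, which now matches either bit. */
static int toy_legacy_wait_check(toy_gfp_t flags)
{
	return (flags & TOY_GFP_WAIT) != 0;
}
```

A caller that only wants to wake kswapd passes the legacy check even though
it must not block, which is the false positive described above.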

Signed-off-by: Mel Gorman 
Acked-by: Vlastimil Babka 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Vitaly Wool 
Cc: Rik van Riel 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds