linux-stable.git/include/linux/skbuff.h, branch v3.18.136

net: Don't copy pfmemalloc flag in __copy_skb_header()

2018-07-28T05:43:18+00:00

[ Upstream commit 8b7008620b8452728cadead460a36f64ed78c460 ]

The pfmemalloc flag indicates that the skb was allocated from
the PFMEMALLOC reserves, and the flag is currently copied on skb
copy and clone.

However, an skb copied from an skb flagged with pfmemalloc
wasn't necessarily allocated from PFMEMALLOC reserves, and on
the other hand an skb allocated that way might be copied from an
skb that wasn't.

So we should not copy the flag on skb copy, and rather decide
whether to allow an skb to be associated with sockets unrelated
to page reclaim depending only on how it was allocated.

Move the pfmemalloc flag before headers_start[0] using an
existing 1-bit hole, so that __copy_skb_header() doesn't copy
it.

When cloning, we'll now take care of this flag explicitly,
contravening to the warning comment of __skb_clone().

While at it, restore the newline usage introduced by commit
b19372273164 ("net: reorganize sk_buff for faster
__copy_skb_header()") to visually separate bytes used in
bitfields after headers_start[0], that was gone after commit
a9e419dc7be6 ("netfilter: merge ctinfo into nfct pointer storage
area"), and describe the pfmemalloc flag in the kernel-doc
structure comment.

This doesn't change the size of sk_buff or cacheline boundaries,
but consolidates the 15 bits hole before tc_index into a 2 bytes
hole before csum, that could now be filled more easily.

Reported-by: Patrick Talbert 
Fixes: c93bdd0e03e8 ("netvm: allow skb allocation to use PFMEMALLOC reserves")
Signed-off-by: Stefano Brivio 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

skbuff: return -EMSGSIZE in skb_to_sgvec to prevent overflow

2018-04-13T17:52:18+00:00

[ Upstream commit 48a1df65334b74bd7531f932cca5928932abf769 ]

This is a defense-in-depth measure in response to bugs like
4d6fa57b4dab ("macsec: avoid heap overflow in skb_to_sgvec"). There's
not only a potential overflow of sglist items, but also a stack overflow
potential, so we fix this by limiting the amount of recursion this function
is allowed to do. Not actually providing a bounded base case is a future
disaster that we can easily avoid here.

As a small matter of house keeping, we take this opportunity to move the
documentation comment over the actual function the documentation is for.

While this could be implemented by using an explicit stack of skbuffs,
when implementing this, the function complexity increased considerably,
and I don't think such complexity and bloat is actually worth it. So,
instead I built this and tested it on x86, x86_64, ARM, ARM64, and MIPS,
and measured the stack usage there. I also reverted the recent MIPS
changes that give it a separate IRQ stack, so that I could experience
some worst-case situations. I found that limiting it to 24 layers deep
yielded a good stack usage with room for safety, as well as being much
deeper than any driver actually ever creates.

Signed-off-by: Jason A. Donenfeld 
Cc: Steffen Klassert 
Cc: Herbert Xu 
Cc: "David S. Miller" 
Cc: David Howells 
Cc: Sabrina Dubroca 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

netfilter/ipvs: clear ipvs_property flag when SKB net namespace changed

2017-11-24T07:30:04+00:00

[ Upstream commit 2b5ec1a5f9738ee7bf8f5ec0526e75e00362c48f ]

When run ipvs in two different network namespace at the same host, and one
ipvs transport network traffic to the other network namespace ipvs.
'ipvs_property' flag will make the second ipvs take no effect. So we should
clear 'ipvs_property' when SKB network namespace changed.

Fixes: 621e84d6f373 ("dev: introduce skb_scrub_packet()")
Signed-off-by: Ye Yin 
Signed-off-by: Wei Zhou 
Signed-off-by: Julian Anastasov 
Signed-off-by: Simon Horman 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

net:Add sysctl_max_skb_frags

2017-02-08T08:43:05+00:00

commit 5f74f82ea34c0da80ea0b49192bb5ea06e063593 upstream.

Devices may have limits on the number of fragments in an skb they support.
Current codebase uses a constant as maximum for number of fragments one
skb can hold and use.
When enabling scatter/gather and running traffic with many small messages
the codebase uses the maximum number of fragments and may thereby violate
the max for certain devices.
The patch introduces a global variable as max number of fragments.

Signed-off-by: Hans Westgaard Ry 
Reviewed-by: Håkon Bugge 
Acked-by: Eric Dumazet 
Signed-off-by: David S. Miller 
Signed-off-by: Greg Kroah-Hartman

skbuff: Fix skb checksum partial check.

2015-11-13T18:16:36+00:00

[ Upstream commit 31b33dfb0a144469dd805514c9e63f4993729a48 ]

Earlier patch 6ae459bda tried to detect void ckecksum partial
skb by comparing pull length to checksum offset. But it does
not work for all cases since checksum-offset depends on
updates to skb->data.

Following patch fixes it by validating checksum start offset
after skb-data pointer is updated. Negative value of checksum
offset start means there is no need to checksum.

Fixes: 6ae459bda ("skbuff: Fix skb checksum flag on skb pull")
Reported-by: Andrew Vagin 
Signed-off-by: Pravin B Shelar 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

skbuff: Fix skb checksum flag on skb pull

2015-11-13T18:16:24+00:00

[ Upstream commit 6ae459bdaaeebc632b16e54dcbabb490c6931d61 ]

VXLAN device can receive skb with checksum partial. But the checksum
offset could be in outer header which is pulled on receive. This results
in negative checksum offset for the skb. Such skb can cause the assert
failure in skb_checksum_help(). Following patch fixes the bug by setting
checksum-none while pulling outer header.

Following is the kernel panic msg from old kernel hitting the bug.

------------[ cut here ]------------
kernel BUG at net/core/dev.c:1906!
RIP: 0010:[] skb_checksum_help+0x144/0x150
Call Trace:

[] queue_userspace_packet+0x408/0x470 [openvswitch]
[] ovs_dp_upcall+0x5d/0x60 [openvswitch]
[] ovs_dp_process_packet_with_key+0xe6/0x100 [openvswitch]
[] ovs_dp_process_received_packet+0x4b/0x80 [openvswitch]
[] ovs_vport_receive+0x2a/0x30 [openvswitch]
[] vxlan_rcv+0x53/0x60 [openvswitch]
[] vxlan_udp_encap_recv+0x8b/0xf0 [openvswitch]
[] udp_queue_rcv_skb+0x2dc/0x3b0
[] __udp4_lib_rcv+0x1cf/0x6c0
[] udp_rcv+0x1a/0x20
[] ip_local_deliver_finish+0xdd/0x280
[] ip_local_deliver+0x88/0x90
[] ip_rcv_finish+0x10d/0x370
[] ip_rcv+0x235/0x300
[] __netif_receive_skb+0x55d/0x620
[] netif_receive_skb+0x80/0x90
[] virtnet_poll+0x555/0x6f0
[] net_rx_action+0x134/0x290
[] __do_softirq+0xa8/0x210
[] call_softirq+0x1c/0x30
[] do_softirq+0x65/0xa0
[] irq_exit+0x8e/0xb0
[] do_IRQ+0x63/0xe0
[] common_interrupt+0x6e/0x6e

Reported-by: Anupam Chanda 
Signed-off-by: Pravin B Shelar 
Acked-by: Tom Herbert 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

net: fix crash in build_skb()

2015-05-11T11:07:58+00:00

[ Upstream commit 2ea2f62c8bda242433809c7f4e9eae1c52c40bbe ]

When I added pfmemalloc support in build_skb(), I forgot netlink
was using build_skb() with a vmalloc() area.

In this patch I introduce __build_skb() for netlink use,
and build_skb() is a wrapper handling both skb->head_frag and
skb->pfmemalloc

This means netlink no longer has to hack skb->head_frag

[ 1567.700067] kernel BUG at arch/x86/mm/physaddr.c:26!
[ 1567.700067] invalid opcode: 0000 [#1] PREEMPT SMP KASAN
[ 1567.700067] Dumping ftrace buffer:
[ 1567.700067]    (ftrace buffer empty)
[ 1567.700067] Modules linked in:
[ 1567.700067] CPU: 9 PID: 16186 Comm: trinity-c182 Not tainted 4.0.0-next-20150424-sasha-00037-g4796e21 #2167
[ 1567.700067] task: ffff880127efb000 ti: ffff880246770000 task.ti: ffff880246770000
[ 1567.700067] RIP: __phys_addr (arch/x86/mm/physaddr.c:26 (discriminator 3))
[ 1567.700067] RSP: 0018:ffff8802467779d8  EFLAGS: 00010202
[ 1567.700067] RAX: 000041000ed8e000 RBX: ffffc9008ed8e000 RCX: 000000000000002c
[ 1567.700067] RDX: 0000000000000004 RSI: 0000000000000000 RDI: ffffffffb3fd6049
[ 1567.700067] RBP: ffff8802467779f8 R08: 0000000000000019 R09: ffff8801d0168000
[ 1567.700067] R10: ffff8801d01680c7 R11: ffffed003a02d019 R12: ffffc9000ed8e000
[ 1567.700067] R13: 0000000000000f40 R14: 0000000000001180 R15: ffffc9000ed8e000
[ 1567.700067] FS:  00007f2a7da3f700(0000) GS:ffff8801d1000000(0000) knlGS:0000000000000000
[ 1567.700067] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[ 1567.700067] CR2: 0000000000738308 CR3: 000000022e329000 CR4: 00000000000007e0
[ 1567.700067] Stack:
[ 1567.700067]  ffffc9000ed8e000 ffff8801d0168000 ffffc9000ed8e000 ffff8801d0168000
[ 1567.700067]  ffff880246777a28 ffffffffad7c0a21 0000000000001080 ffff880246777c08
[ 1567.700067]  ffff88060d302e68 ffff880246777b58 ffff880246777b88 ffffffffad9a6821
[ 1567.700067] Call Trace:
[ 1567.700067] build_skb (include/linux/mm.h:508 net/core/skbuff.c:316)
[ 1567.700067] netlink_sendmsg (net/netlink/af_netlink.c:1633 net/netlink/af_netlink.c:2329)
[ 1567.774369] ? sched_clock_cpu (kernel/sched/clock.c:311)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] ? netlink_unicast (net/netlink/af_netlink.c:2273)
[ 1567.774369] sock_sendmsg (net/socket.c:614 net/socket.c:623)
[ 1567.774369] sock_write_iter (net/socket.c:823)
[ 1567.774369] ? sock_sendmsg (net/socket.c:806)
[ 1567.774369] __vfs_write (fs/read_write.c:479 fs/read_write.c:491)
[ 1567.774369] ? get_lock_stats (kernel/locking/lockdep.c:249)
[ 1567.774369] ? default_llseek (fs/read_write.c:487)
[ 1567.774369] ? vtime_account_user (kernel/sched/cputime.c:701)
[ 1567.774369] ? rw_verify_area (fs/read_write.c:406 (discriminator 4))
[ 1567.774369] vfs_write (fs/read_write.c:539)
[ 1567.774369] SyS_write (fs/read_write.c:586 fs/read_write.c:577)
[ 1567.774369] ? SyS_read (fs/read_write.c:577)
[ 1567.774369] ? __this_cpu_preempt_check (lib/smp_processor_id.c:63)
[ 1567.774369] ? trace_hardirqs_on_caller (kernel/locking/lockdep.c:2594 kernel/locking/lockdep.c:2636)
[ 1567.774369] ? trace_hardirqs_on_thunk (arch/x86/lib/thunk_64.S:42)
[ 1567.774369] system_call_fastpath (arch/x86/kernel/entry_64.S:261)

Fixes: 79930f5892e ("net: do not deplete pfmemalloc reserve")
Signed-off-by: Eric Dumazet 
Reported-by: Sasha Levin 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

net: add skb_checksum_complete_unset

2015-05-11T11:07:55+00:00

[ Upstream commit 4e18b9adf2f910ec4d30b811a74a5b626e6c6125 ]

This function changes ip_summed to CHECKSUM_NONE if CHECKSUM_COMPLETE
is set. This is called to discard checksum-complete when packet
is being modified and checksum is not pulled for headers in a layer.

Signed-off-by: Tom Herbert 
Signed-off-by: David S. Miller 
Signed-off-by: Sasha Levin

net: skb_fclone_busy() needs to detect orphaned skb

2014-10-30T23:58:30+00:00

Some drivers are unable to perform TX completions in a bound time.
They instead call skb_orphan()

Problem is skb_fclone_busy() has to detect this case, otherwise
we block TCP retransmits and can freeze unlucky tcp sessions on
mostly idle hosts.

Signed-off-by: Eric Dumazet 
Fixes: 1f3279ae0c13 ("tcp: avoid retransmits of TCP packets hanging in host queues")
Signed-off-by: David S. Miller

skbuff.h: fix kernel-doc warning for headers_end

2014-10-28T21:02:29+00:00

Fix kernel-doc warning in  by making both headers_start
and headers_end private fields.

Warning(..//include/linux/skbuff.h:654): No description found for parameter 'headers_end[0]'

Signed-off-by: Randy Dunlap 
Signed-off-by: David S. Miller