linux-stable.git/net/core, branch v3.4.112

net: avoid to hang up on sending due to sysctl configuration overflow.

2016-03-21T01:17:56+00:00

commit cdda88912d62f9603d27433338a18be83ef23ac1 upstream.

    I found if we write a larger than 4GB value to some sysctl
variables, the sending syscall will hang up forever, because these
variables are 32 bits, such large values make them overflow to 0 or
negative.

    This patch try to fix overflow or prevent from zero value setup
of below sysctl variables:

net.core.wmem_default
net.core.rmem_default

net.core.rmem_max
net.core.wmem_max

net.ipv4.udp_rmem_min
net.ipv4.udp_wmem_min

net.ipv4.tcp_wmem
net.ipv4.tcp_rmem

Signed-off-by: Eric Dumazet 
Signed-off-by: Li Yu 
Signed-off-by: David S. Miller 
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li

net: Clone skb before setting peeked flag

2016-03-21T01:17:45+00:00

commit 738ac1ebb96d02e0d23bc320302a6ea94c612dec upstream.

Shared skbs must not be modified and this is crucial for broadcast
and/or multicast paths where we use it as an optimisation to avoid
unnecessary cloning.

The function skb_recv_datagram breaks this rule by setting peeked
without cloning the skb first.  This causes funky races which leads
to double-free.

This patch fixes this by cloning the skb and replacing the skb
in the list when setting skb->peeked.

Fixes: a59322be07c9 ("[UDP]: Only increment counter on first peek/recv")
Reported-by: Konstantin Khlebnikov 
Signed-off-by: Herbert Xu 
Signed-off-by: David S. Miller 
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li

net: call rcu_read_lock early in process_backlog

2016-03-21T01:17:43+00:00

commit 2c17d27c36dcce2b6bf689f41a46b9e909877c21 upstream.

Incoming packet should be either in backlog queue or
in RCU read-side section. Otherwise, the final sequence of
flush_backlog() and synchronize_net() may miss packets
that can run without device reference:

CPU 1                  CPU 2
                       skb->dev: no reference
                       process_backlog:__skb_dequeue
                       process_backlog:local_irq_enable

on_each_cpu for
flush_backlog =>       IPI(hardirq): flush_backlog
                       - packet not found in backlog

                       CPU delayed ...
synchronize_net
- no ongoing RCU
read-side sections

netdev_run_todo,
rcu_barrier: no
ongoing callbacks
                       __netif_receive_skb_core:rcu_read_lock
                       - too late
free dev
                       process packet for freed dev

Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman 
Cc: Stephen Hemminger 
Signed-off-by: Julian Anastasov 
Signed-off-by: David S. Miller 
[lizf: Backported to 3.4:
 - adjust context
 - no need to change "goto unlock" to "goto out"]
Signed-off-by: Zefan Li

net: do not process device backlog during unregistration

2016-03-21T01:17:43+00:00

commit e9e4dd3267d0c5234c5c0f47440456b10875dec9 upstream.

commit 381c759d9916 ("ipv4: Avoid crashing in ip_error")
fixes a problem where processed packet comes from device
with destroyed inetdev (dev->ip_ptr). This is not expected
because inetdev_destroy is called in NETDEV_UNREGISTER
phase and packets should not be processed after
dev_close_many() and synchronize_net(). Above fix is still
required because inetdev_destroy can be called for other
reasons. But it shows the real problem: backlog can keep
packets for long time and they do not hold reference to
device. Such packets are then delivered to upper levels
at the same time when device is unregistered.
Calling flush_backlog after NETDEV_UNREGISTER_FINAL still
accounts all packets from backlog but before that some packets
continue to be delivered to upper levels long after the
synchronize_net call which is supposed to wait the last
ones. Also, as Eric pointed out, processed packets, mostly
from other devices, can continue to add new packets to backlog.

Fix the problem by moving flush_backlog early, after the
device driver is stopped and before the synchronize_net() call.
Then use netif_running check to make sure we do not add more
packets to backlog. We have to do it in enqueue_to_backlog
context when the local IRQ is disabled. As result, after the
flush_backlog and synchronize_net sequence all packets
should be accounted.

Thanks to Eric W. Biederman for the test script and his
valuable feedback!

Reported-by: Vittorio Gambaletta 
Fixes: 6e583ce5242f ("net: eliminate refcounting in backlog queue")
Cc: Eric W. Biederman 
Cc: Stephen Hemminger 
Signed-off-by: Julian Anastasov 
Signed-off-by: David S. Miller 
[lizf: Backported to 3.4: adjust context]
Signed-off-by: Zefan Li

rtnetlink: verify IFLA_VF_INFO attributes before passing them to driver

2016-03-21T01:17:43+00:00

commit 4f7d2cdfdde71ffe962399b7020c674050329423 upstream.

Jason Gunthorpe reported that since commit c02db8c6290b ("rtnetlink: make
SR-IOV VF interface symmetric"), we don't verify IFLA_VF_INFO attributes
anymore with respect to their policy, that is, ifla_vfinfo_policy[].

Before, they were part of ifla_policy[], but they have been nested since
placed under IFLA_VFINFO_LIST, that contains the attribute IFLA_VF_INFO,
which is another nested attribute for the actual VF attributes such as
IFLA_VF_MAC, IFLA_VF_VLAN, etc.

Despite the policy being split out from ifla_policy[] in this commit,
it's never applied anywhere. nla_for_each_nested() only does basic nla_ok()
testing for struct nlattr, but it doesn't know about the data context and
their requirements.

Fix, on top of Jason's initial work, does 1) parsing of the attributes
with the right policy, and 2) using the resulting parsed attribute table
from 1) instead of the nla_for_each_nested() loop (just like we used to
do when still part of ifla_policy[]).

Reference: http://thread.gmane.org/gmane.linux.network/368913
Fixes: c02db8c6290b ("rtnetlink: make SR-IOV VF interface symmetric")
Reported-by: Jason Gunthorpe 
Cc: Chris Wright 
Cc: Sucheta Chakraborty 
Cc: Greg Rose 
Cc: Jeff Kirsher 
Cc: Rony Efraim 
Cc: Vlad Zolotarov 
Cc: Nicolas Dichtel 
Cc: Thomas Graf 
Signed-off-by: Jason Gunthorpe 
Signed-off-by: Daniel Borkmann 
Acked-by: Vlad Zolotarov 
Signed-off-by: David S. Miller 
[bwh: Backported to 3.2:
 - Drop unsupported attributes
 - Use ndo_set_vf_tx_rate operation, not ndo_set_vf_rate]
Signed-off-by: Ben Hutchings 
Signed-off-by: Zefan Li

pktgen: adjust spacing in proc file interface output

2015-10-22T01:20:02+00:00

commit d079abd181950a44cdf31daafd1662388a6c4d2e upstream.

Too many spaces were introduced in commit 63adc6fb8ac0 ("pktgen: cleanup
checkpatch warnings"), thus misaligning "src_min:" to other columns.

Fixes: 63adc6fb8ac0 ("pktgen: cleanup checkpatch warnings")
Signed-off-by: Jesper Dangaard Brouer 
Signed-off-by: David S. Miller 
Signed-off-by: Zefan Li

net: use for_each_netdev_safe() in rtnl_group_changelink()

2015-06-19T03:40:30+00:00

commit d079535d5e1bf5e2e7c856bae2483414ea21e137 upstream.

In case we move the whole dev group to another netns,
we should call for_each_netdev_safe(), otherwise we get
a soft lockup:

 NMI watchdog: BUG: soft lockup - CPU#0 stuck for 22s! [ip:798]
 irq event stamp: 255424
 hardirqs last  enabled at (255423): [] restore_args+0x0/0x30
 hardirqs last disabled at (255424): [] apic_timer_interrupt+0x6a/0x80
 softirqs last  enabled at (255422): [] __do_softirq+0x2c1/0x3a9
 softirqs last disabled at (255417): [] irq_exit+0x41/0x95
 CPU: 0 PID: 798 Comm: ip Not tainted 4.0.0-rc4+ #881
 Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
 task: ffff8800d1b88000 ti: ffff880119530000 task.ti: ffff880119530000
 RIP: 0010:[]  [] debug_lockdep_rcu_enabled+0x28/0x30
 RSP: 0018:ffff880119533778  EFLAGS: 00000246
 RAX: ffff8800d1b88000 RBX: 0000000000000002 RCX: 0000000000000038
 RDX: 0000000000000000 RSI: ffff8800d1b888c8 RDI: ffff8800d1b888c8
 RBP: ffff880119533778 R08: 0000000000000000 R09: 0000000000000000
 R10: 0000000000000000 R11: 000000000000b5c2 R12: 0000000000000246
 R13: ffff880119533708 R14: 00000000001d5a40 R15: ffff88011a7d5a40
 FS:  00007fc01315f740(0000) GS:ffff88011a600000(0000) knlGS:0000000000000000
 CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b
 CR2: 00007f367a120988 CR3: 000000011849c000 CR4: 00000000000007f0
 Stack:
  ffff880119533798 ffffffff811ac868 ffffffff811ac831 ffffffff811ac828
  ffff8801195337c8 ffffffff811ac8c9 ffff8801195339b0 ffff8801197633e0
  0000000000000000 ffff8801195339b0 ffff8801195337d8 ffffffff811ad2d7
 Call Trace:
  [] rcu_read_lock+0x37/0x6e
  [] ? rcu_read_unlock+0x5f/0x5f
  [] ? rcu_read_unlock+0x56/0x5f
  [] __fget+0x2a/0x7a
  [] fget+0x13/0x15
  [] proc_ns_fget+0xe/0x38
  [] get_net_ns_by_fd+0x11/0x59
  [] rtnl_link_get_net+0x33/0x3e
  [] do_setlink+0x73/0x87b
  [] ? trace_hardirqs_off+0xd/0xf
  [] ? retint_restore_args+0xe/0xe
  [] rtnl_newlink+0x40c/0x699
  [] ? rtnl_newlink+0xeb/0x699
  [] ? _raw_spin_unlock+0x28/0x33
  [] ? security_capable+0x18/0x1a
  [] ? ns_capable+0x4d/0x65
  [] rtnetlink_rcv_msg+0x181/0x194
  [] ? rtnl_lock+0x17/0x19
  [] ? rtnl_lock+0x17/0x19
  [] ? __rtnl_unlock+0x17/0x17
  [] netlink_rcv_skb+0x4d/0x93
  [] rtnetlink_rcv+0x26/0x2d
  [] netlink_unicast+0xcb/0x150
  [] netlink_sendmsg+0x501/0x523
  [] ? might_fault+0x59/0xa9
  [] ? copy_from_user+0x2a/0x2c
  [] sock_sendmsg+0x34/0x3c
  [] ___sys_sendmsg+0x1b8/0x255
  [] ? handle_pte_fault+0xbd5/0xd4a
  [] ? native_sched_clock+0x35/0x37
  [] ? sched_clock_local+0x12/0x72
  [] ? sched_clock_cpu+0x9e/0xb7
  [] ? rcu_read_lock_held+0x3b/0x3d
  [] ? __fcheck_files+0x4c/0x58
  [] ? __fget_light+0x2d/0x52
  [] __sys_sendmsg+0x42/0x60
  [] SyS_sendmsg+0x12/0x1c
  [] system_call_fastpath+0x12/0x17

Fixes: e7ed828f10bd8 ("netlink: support setting devgroup parameters")
Signed-off-by: Cong Wang 
Signed-off-by: David S. Miller 
Signed-off-by: Zefan Li

rtnetlink: ifla_vf_policy: fix misuses of NLA_BINARY

2015-06-19T03:40:14+00:00

commit 364d5716a7adb91b731a35765d369602d68d2881 upstream.

ifla_vf_policy[] is wrong in advertising its individual member types as
NLA_BINARY since .type = NLA_BINARY in combination with .len declares the
len member as *max* attribute length [0, len].

The issue is that when do_setvfinfo() is being called to set up a VF
through ndo handler, we could set corrupted data if the attribute length
is less than the size of the related structure itself.

The intent is exactly the opposite, namely to make sure to pass at least
data of minimum size of len.

Fixes: ebc08a6f47ee ("rtnetlink: Add VF config code to rtnetlink")
Cc: Mitch Williams 
Cc: Jeff Kirsher 
Signed-off-by: Daniel Borkmann 
Acked-by: Thomas Graf 
Signed-off-by: David S. Miller 
[lizf: Backported to 3.4: drop changes to IFLA_VF_RATE and IFLA_VF_LINK_STATE]
Signed-off-by: Zefan Li

net: Fix stacked vlan offload features computation

2015-04-14T09:33:48+00:00

commit 796f2da81bead71ffc91ef70912cd8d1827bf756 upstream.

When vlan tags are stacked, it is very likely that the outer tag is stored
in skb->vlan_tci and skb->protocol shows the inner tag's vlan_proto.
Currently netif_skb_features() first looks at skb->protocol even if there
is the outer tag in vlan_tci, thus it incorrectly retrieves the protocol
encapsulated by the inner vlan instead of the inner vlan protocol.
This allows GSO packets to be passed to HW and they end up being
corrupted.

Fixes: 58e998c6d239 ("offloading: Force software GSO for multiple vlan tags.")
Signed-off-by: Toshiaki Makita 
Signed-off-by: David S. Miller 
[lizf: Backported to 3.4:
 - remove ETH_P_8021AD
 - pass protocol to harmonize_features()]
Signed-off-by: Zefan Li

net: Do not enable tx-nocache-copy by default

2014-12-01T10:02:45+00:00

commit cdb3f4a31b64c3a1c6eef40bc01ebc9594c58a8c upstream.

There are many cases where this feature does not improve performance or even
reduces it.

For example, here are the results from tests that I've run using 3.12.6 on one
Intel Xeon W3565 and one i7 920 connected by ixgbe adapters. The results are
from the Xeon, but they're similar on the i7. All numbers report the
mean±stddev over 10 runs of 10s.

1) latency tests similar to what is described in "c6e1a0d net: Allow no-cache
copy from user on transmit"
There is no statistically significant difference between tx-nocache-copy
on/off.
nic irqs spread out (one queue per cpu)

200x netperf -r 1400,1
tx-nocache-copy off
        692000±1000 tps
        50/90/95/99% latency (us): 275±2/643.8±0.4/799±1/2474.4±0.3
tx-nocache-copy on
        693000±1000 tps
        50/90/95/99% latency (us): 274±1/644.1±0.7/800±2/2474.5±0.7

200x netperf -r 14000,14000
tx-nocache-copy off
        86450±80 tps
        50/90/95/99% latency (us): 334.37±0.02/838±1/2100±20/3990±40
tx-nocache-copy on
        86110±60 tps
        50/90/95/99% latency (us): 334.28±0.01/837±2/2110±20/3990±20

2) single stream throughput tests
tx-nocache-copy leads to higher service demand

                        throughput  cpu0        cpu1        demand
                        (Gb/s)      (Gcycle)    (Gcycle)    (cycle/B)

nic irqs and netperf on cpu0 (1x netperf -T0,0 -t omni -- -d send)

tx-nocache-copy off     9402±5      9.4±0.2                 0.80±0.01
tx-nocache-copy on      9403±3      9.85±0.04               0.838±0.004

nic irqs on cpu0, netperf on cpu1 (1x netperf -T1,1 -t omni -- -d send)

tx-nocache-copy off     9401±5      5.83±0.03   5.0±0.1     0.923±0.007
tx-nocache-copy on      9404±2      5.74±0.03   5.523±0.009 0.958±0.002

As a second example, here are some results from Eric Dumazet with latest
net-next.
tx-nocache-copy also leads to higher service demand

(cpu is Intel(R) Xeon(R) CPU X5660  @ 2.80GHz)

lpq83:~# ./ethtool -K eth0 tx-nocache-copy on
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9407.44   2.50     -1.00    0.522   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       4282.648396 task-clock                #    0.423 CPUs utilized
             9,348 context-switches          #    0.002 M/sec
                88 CPU-migrations            #    0.021 K/sec
               355 page-faults               #    0.083 K/sec
    11,812,797,651 cycles                    #    2.758 GHz                     [82.79%]
     9,020,522,817 stalled-cycles-frontend   #   76.36% frontend cycles idle    [82.54%]
     4,579,889,681 stalled-cycles-backend    #   38.77% backend  cycles idle    [67.33%]
     6,053,172,792 instructions              #    0.51  insns per cycle
                                             #    1.49  stalled cycles per insn [83.64%]
       597,275,583 branches                  #  139.464 M/sec                   [83.70%]
         8,960,541 branch-misses             #    1.50% of all branches         [83.65%]

      10.128990264 seconds time elapsed

lpq83:~# ./ethtool -K eth0 tx-nocache-copy off
lpq83:~# perf stat ./netperf -H lpq84 -c
MIGRATED TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to lpq84.prod.google.com () port 0 AF_INET
Recv   Send    Send                          Utilization       Service Demand
Socket Socket  Message  Elapsed              Send     Recv     Send    Recv
Size   Size    Size     Time     Throughput  local    remote   local   remote
bytes  bytes   bytes    secs.    10^6bits/s  % S      % U      us/KB   us/KB

 87380  16384  16384    10.00      9412.45   2.15     -1.00    0.449   -1.000

 Performance counter stats for './netperf -H lpq84 -c':

       2847.375441 task-clock                #    0.281 CPUs utilized
            11,632 context-switches          #    0.004 M/sec
                49 CPU-migrations            #    0.017 K/sec
               354 page-faults               #    0.124 K/sec
     7,646,889,749 cycles                    #    2.686 GHz                     [83.34%]
     6,115,050,032 stalled-cycles-frontend   #   79.97% frontend cycles idle    [83.31%]
     1,726,460,071 stalled-cycles-backend    #   22.58% backend  cycles idle    [66.55%]
     2,079,702,453 instructions              #    0.27  insns per cycle
                                             #    2.94  stalled cycles per insn [83.22%]
       363,773,213 branches                  #  127.757 M/sec                   [83.29%]
         4,242,732 branch-misses             #    1.17% of all branches         [83.51%]

      10.128449949 seconds time elapsed

CC: Tom Herbert 
Signed-off-by: Benjamin Poirier 
Signed-off-by: David S. Miller 
Signed-off-by: Zefan Li