<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/drivers/net/ipvlan, branch linux-4.9.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>ipvlan: Fix out-of-bound bugs caused by unset skb-&gt;mac_header</title>
<updated>2022-09-28T08:55:45+00:00</updated>
<author>
<name>Lu Wei</name>
<email>luwei32@huawei.com</email>
</author>
<published>2022-09-07T10:12:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e2b46cd5796f083e452fbc624f65b80328b0c1a4'/>
<id>e2b46cd5796f083e452fbc624f65b80328b0c1a4</id>
<content type='text'>
[ Upstream commit 81225b2ea161af48e093f58e8dfee6d705b16af4 ]

If an AF_PACKET socket is used to send packets through ipvlan and the
default xmit function of the AF_PACKET socket is changed from
dev_queue_xmit() to packet_direct_xmit() via setsockopt() with the option
name of PACKET_QDISC_BYPASS, the skb-&gt;mac_header may not be reset and
remains as the initial value of 65535, this may trigger slab-out-of-bounds
bugs as following:

=================================================================
UG: KASAN: slab-out-of-bounds in ipvlan_xmit_mode_l2+0xdb/0x330 [ipvlan]
PU: 2 PID: 1768 Comm: raw_send Kdump: loaded Not tainted 6.0.0-rc4+ #6
ardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33
all Trace:
print_address_description.constprop.0+0x1d/0x160
print_report.cold+0x4f/0x112
kasan_report+0xa3/0x130
ipvlan_xmit_mode_l2+0xdb/0x330 [ipvlan]
ipvlan_start_xmit+0x29/0xa0 [ipvlan]
__dev_direct_xmit+0x2e2/0x380
packet_direct_xmit+0x22/0x60
packet_snd+0x7c9/0xc40
sock_sendmsg+0x9a/0xa0
__sys_sendto+0x18a/0x230
__x64_sys_sendto+0x74/0x90
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

The root cause is:
  1. packet_snd() only reset skb-&gt;mac_header when sock-&gt;type is SOCK_RAW
     and skb-&gt;protocol is not specified as in packet_parse_headers()

  2. packet_direct_xmit() doesn't reset skb-&gt;mac_header as dev_queue_xmit()

In this case, skb-&gt;mac_header is 65535 when ipvlan_xmit_mode_l2() is
called. So when ipvlan_xmit_mode_l2() gets mac header with eth_hdr() which
use "skb-&gt;head + skb-&gt;mac_header", out-of-bound access occurs.

This patch replaces eth_hdr() with skb_eth_hdr() in ipvlan_xmit_mode_l2()
and reset mac header in multicast to solve this out-of-bound bug.

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Lu Wei &lt;luwei32@huawei.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 81225b2ea161af48e093f58e8dfee6d705b16af4 ]

If an AF_PACKET socket is used to send packets through ipvlan and the
default xmit function of the AF_PACKET socket is changed from
dev_queue_xmit() to packet_direct_xmit() via setsockopt() with the option
name of PACKET_QDISC_BYPASS, the skb-&gt;mac_header may not be reset and
remains as the initial value of 65535, this may trigger slab-out-of-bounds
bugs as following:

=================================================================
UG: KASAN: slab-out-of-bounds in ipvlan_xmit_mode_l2+0xdb/0x330 [ipvlan]
PU: 2 PID: 1768 Comm: raw_send Kdump: loaded Not tainted 6.0.0-rc4+ #6
ardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.14.0-1.fc33
all Trace:
print_address_description.constprop.0+0x1d/0x160
print_report.cold+0x4f/0x112
kasan_report+0xa3/0x130
ipvlan_xmit_mode_l2+0xdb/0x330 [ipvlan]
ipvlan_start_xmit+0x29/0xa0 [ipvlan]
__dev_direct_xmit+0x2e2/0x380
packet_direct_xmit+0x22/0x60
packet_snd+0x7c9/0xc40
sock_sendmsg+0x9a/0xa0
__sys_sendto+0x18a/0x230
__x64_sys_sendto+0x74/0x90
do_syscall_64+0x3b/0x90
entry_SYSCALL_64_after_hwframe+0x63/0xcd

The root cause is:
  1. packet_snd() only reset skb-&gt;mac_header when sock-&gt;type is SOCK_RAW
     and skb-&gt;protocol is not specified as in packet_parse_headers()

  2. packet_direct_xmit() doesn't reset skb-&gt;mac_header as dev_queue_xmit()

In this case, skb-&gt;mac_header is 65535 when ipvlan_xmit_mode_l2() is
called. So when ipvlan_xmit_mode_l2() gets mac header with eth_hdr() which
use "skb-&gt;head + skb-&gt;mac_header", out-of-bound access occurs.

This patch replaces eth_hdr() with skb_eth_hdr() in ipvlan_xmit_mode_l2()
and reset mac header in multicast to solve this out-of-bound bug.

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Lu Wei &lt;luwei32@huawei.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ipvlan: fix device features</title>
<updated>2020-09-03T09:21:15+00:00</updated>
<author>
<name>Mahesh Bandewar</name>
<email>maheshb@google.com</email>
</author>
<published>2020-08-15T05:53:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=90c417270f6f14edc65b8df37cd982f4bb942f80'/>
<id>90c417270f6f14edc65b8df37cd982f4bb942f80</id>
<content type='text'>
[ Upstream commit d0f5c7076e01fef6fcb86988d9508bf3ce258bd4 ]

Processing NETDEV_FEAT_CHANGE causes IPvlan links to lose
NETIF_F_LLTX feature because of the incorrect handling of
features in ipvlan_fix_features().

--before--
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: on [fixed]
lpaa10:~# ethtool -K ipvl0 tso off
Cannot change tcp-segmentation-offload
Actual changes:
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: off [fixed]
lpaa10:~#

--after--
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: on [fixed]
lpaa10:~# ethtool -K ipvl0 tso off
Cannot change tcp-segmentation-offload
Could not change any device features
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: on [fixed]
lpaa10:~#

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Mahesh Bandewar &lt;maheshb@google.com&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit d0f5c7076e01fef6fcb86988d9508bf3ce258bd4 ]

Processing NETDEV_FEAT_CHANGE causes IPvlan links to lose
NETIF_F_LLTX feature because of the incorrect handling of
features in ipvlan_fix_features().

--before--
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: on [fixed]
lpaa10:~# ethtool -K ipvl0 tso off
Cannot change tcp-segmentation-offload
Actual changes:
vlan-challenged: off [fixed]
tx-lockless: off [fixed]
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: off [fixed]
lpaa10:~#

--after--
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: on [fixed]
lpaa10:~# ethtool -K ipvl0 tso off
Cannot change tcp-segmentation-offload
Could not change any device features
lpaa10:~# ethtool -k ipvl0 | grep tx-lockless
tx-lockless: on [fixed]
lpaa10:~#

Fixes: 2ad7bf363841 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Mahesh Bandewar &lt;maheshb@google.com&gt;
Cc: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ipvlan: don't deref eth hdr before checking it's set</title>
<updated>2020-03-20T08:07:42+00:00</updated>
<author>
<name>Mahesh Bandewar</name>
<email>maheshb@google.com</email>
</author>
<published>2020-03-09T22:56:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=2ad8246c4dbe850ee1acdb9cec162281f49ecec3'/>
<id>2ad8246c4dbe850ee1acdb9cec162281f49ecec3</id>
<content type='text'>
[ Upstream commit ad8192767c9f9cf97da57b9ffcea70fb100febef ]

IPvlan in L3 mode discards outbound multicast packets but performs
the check before ensuring the ether-header is set or not. This is
an error that Eric found through code browsing.

Fixes: 2ad7bf363841 (“ipvlan: Initial check-in of the IPVLAN driver.”)
Signed-off-by: Mahesh Bandewar &lt;maheshb@google.com&gt;
Reported-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit ad8192767c9f9cf97da57b9ffcea70fb100febef ]

IPvlan in L3 mode discards outbound multicast packets but performs
the check before ensuring the ether-header is set or not. This is
an error that Eric found through code browsing.

Fixes: 2ad7bf363841 (“ipvlan: Initial check-in of the IPVLAN driver.”)
Signed-off-by: Mahesh Bandewar &lt;maheshb@google.com&gt;
Reported-by: Eric Dumazet &lt;edumazet@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ipvlan: do not use cond_resched_rcu() in ipvlan_process_multicast()</title>
<updated>2020-03-20T08:07:41+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2020-03-10T01:22:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=8621153cb6c8b712cfad03c0fbbbf408c1ebc6e0'/>
<id>8621153cb6c8b712cfad03c0fbbbf408c1ebc6e0</id>
<content type='text'>
[ Upstream commit afe207d80a61e4d6e7cfa0611a4af46d0ba95628 ]

Commit e18b353f102e ("ipvlan: add cond_resched_rcu() while
processing muticast backlog") added a cond_resched_rcu() in a loop
using rcu protection to iterate over slaves.

This is breaking rcu rules, so lets instead use cond_resched()
at a point we can reschedule

Fixes: e18b353f102e ("ipvlan: add cond_resched_rcu() while processing muticast backlog")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Mahesh Bandewar &lt;maheshb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit afe207d80a61e4d6e7cfa0611a4af46d0ba95628 ]

Commit e18b353f102e ("ipvlan: add cond_resched_rcu() while
processing muticast backlog") added a cond_resched_rcu() in a loop
using rcu protection to iterate over slaves.

This is breaking rcu rules, so lets instead use cond_resched()
at a point we can reschedule

Fixes: e18b353f102e ("ipvlan: add cond_resched_rcu() while processing muticast backlog")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Mahesh Bandewar &lt;maheshb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ipvlan: egress mcast packets are not exceptional</title>
<updated>2020-03-20T08:07:41+00:00</updated>
<author>
<name>Paolo Abeni</name>
<email>pabeni@redhat.com</email>
</author>
<published>2018-02-28T10:43:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=276a875f7124b24c735e67e40a52a4f64047a501'/>
<id>276a875f7124b24c735e67e40a52a4f64047a501</id>
<content type='text'>
commit cccc200fcaf04cff4342036a72e51d6adf6c98c1 upstream.

Currently, if IPv6 is enabled on top of an ipvlan device in l3
mode, the following warning message:

 Dropped {multi|broad}cast of type= [86dd]

is emitted every time that a RS is generated and dmseg is soon
filled with irrelevant messages. Replace pr_warn with pr_debug,
to preserve debuggability, without scaring the sysadmin.

Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit cccc200fcaf04cff4342036a72e51d6adf6c98c1 upstream.

Currently, if IPv6 is enabled on top of an ipvlan device in l3
mode, the following warning message:

 Dropped {multi|broad}cast of type= [86dd]

is emitted every time that a RS is generated and dmseg is soon
filled with irrelevant messages. Replace pr_warn with pr_debug,
to preserve debuggability, without scaring the sysadmin.

Signed-off-by: Paolo Abeni &lt;pabeni@redhat.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>ipvlan: do not add hardware address of master to its unicast filter list</title>
<updated>2020-03-20T08:07:41+00:00</updated>
<author>
<name>Jiri Wiesner</name>
<email>jwiesner@suse.com</email>
</author>
<published>2020-03-07T12:31:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=96ce4520f3952f899fc43408434f4fa501545e66'/>
<id>96ce4520f3952f899fc43408434f4fa501545e66</id>
<content type='text'>
[ Upstream commit 63aae7b17344d4b08a7d05cb07044de4c0f9dcc6 ]

There is a problem when ipvlan slaves are created on a master device that
is a vmxnet3 device (ipvlan in VMware guests). The vmxnet3 driver does not
support unicast address filtering. When an ipvlan device is brought up in
ipvlan_open(), the ipvlan driver calls dev_uc_add() to add the hardware
address of the vmxnet3 master device to the unicast address list of the
master device, phy_dev-&gt;uc. This inevitably leads to the vmxnet3 master
device being forced into promiscuous mode by __dev_set_rx_mode().

Promiscuous mode is switched on the master despite the fact that there is
still only one hardware address that the master device should use for
filtering in order for the ipvlan device to be able to receive packets.
The comment above struct net_device describes the uc_promisc member as a
"counter, that indicates, that promiscuous mode has been enabled due to
the need to listen to additional unicast addresses in a device that does
not implement ndo_set_rx_mode()". Moreover, the design of ipvlan
guarantees that only the hardware address of a master device,
phy_dev-&gt;dev_addr, will be used to transmit and receive all packets from
its ipvlan slaves. Thus, the unicast address list of the master device
should not be modified by ipvlan_open() and ipvlan_stop() in order to make
ipvlan a workable option on masters that do not support unicast address
filtering.

Fixes: 2ad7bf3638411 ("ipvlan: Initial check-in of the IPVLAN driver")
Reported-by: Per Sundstrom &lt;per.sundstrom@redqube.se&gt;
Signed-off-by: Jiri Wiesner &lt;jwiesner@suse.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Mahesh Bandewar &lt;maheshb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 63aae7b17344d4b08a7d05cb07044de4c0f9dcc6 ]

There is a problem when ipvlan slaves are created on a master device that
is a vmxnet3 device (ipvlan in VMware guests). The vmxnet3 driver does not
support unicast address filtering. When an ipvlan device is brought up in
ipvlan_open(), the ipvlan driver calls dev_uc_add() to add the hardware
address of the vmxnet3 master device to the unicast address list of the
master device, phy_dev-&gt;uc. This inevitably leads to the vmxnet3 master
device being forced into promiscuous mode by __dev_set_rx_mode().

Promiscuous mode is switched on the master despite the fact that there is
still only one hardware address that the master device should use for
filtering in order for the ipvlan device to be able to receive packets.
The comment above struct net_device describes the uc_promisc member as a
"counter, that indicates, that promiscuous mode has been enabled due to
the need to listen to additional unicast addresses in a device that does
not implement ndo_set_rx_mode()". Moreover, the design of ipvlan
guarantees that only the hardware address of a master device,
phy_dev-&gt;dev_addr, will be used to transmit and receive all packets from
its ipvlan slaves. Thus, the unicast address list of the master device
should not be modified by ipvlan_open() and ipvlan_stop() in order to make
ipvlan a workable option on masters that do not support unicast address
filtering.

Fixes: 2ad7bf3638411 ("ipvlan: Initial check-in of the IPVLAN driver")
Reported-by: Per Sundstrom &lt;per.sundstrom@redqube.se&gt;
Signed-off-by: Jiri Wiesner &lt;jwiesner@suse.com&gt;
Reviewed-by: Eric Dumazet &lt;edumazet@google.com&gt;
Acked-by: Mahesh Bandewar &lt;maheshb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ipvlan: add cond_resched_rcu() while processing muticast backlog</title>
<updated>2020-03-20T08:07:41+00:00</updated>
<author>
<name>Mahesh Bandewar</name>
<email>maheshb@google.com</email>
</author>
<published>2020-03-09T22:57:02+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=8079db5d729a6ca4de911124f68df61e0e09b040'/>
<id>8079db5d729a6ca4de911124f68df61e0e09b040</id>
<content type='text'>
[ Upstream commit e18b353f102e371580f3f01dd47567a25acc3c1d ]

If there are substantial number of slaves created as simulated by
Syzbot, the backlog processing could take much longer and result
into the issue found in the Syzbot report.

INFO: rcu_sched detected stalls on CPUs/tasks:
        (detected by 1, t=10502 jiffies, g=5049, c=5048, q=752)
All QSes seen, last rcu_sched kthread activity 10502 (4294965563-4294955061), jiffies_till_next_fqs=1, root -&gt;qsmask 0x0
syz-executor.1  R  running task on cpu   1  10984 11210   3866 0x30020008 179034491270
Call Trace:
 &lt;IRQ&gt;
 [&lt;ffffffff81497163&gt;] _sched_show_task kernel/sched/core.c:8063 [inline]
 [&lt;ffffffff81497163&gt;] _sched_show_task.cold+0x2fd/0x392 kernel/sched/core.c:8030
 [&lt;ffffffff8146a91b&gt;] sched_show_task+0xb/0x10 kernel/sched/core.c:8073
 [&lt;ffffffff815c931b&gt;] print_other_cpu_stall kernel/rcu/tree.c:1577 [inline]
 [&lt;ffffffff815c931b&gt;] check_cpu_stall kernel/rcu/tree.c:1695 [inline]
 [&lt;ffffffff815c931b&gt;] __rcu_pending kernel/rcu/tree.c:3478 [inline]
 [&lt;ffffffff815c931b&gt;] rcu_pending kernel/rcu/tree.c:3540 [inline]
 [&lt;ffffffff815c931b&gt;] rcu_check_callbacks.cold+0xbb4/0xc29 kernel/rcu/tree.c:2876
 [&lt;ffffffff815e3962&gt;] update_process_times+0x32/0x80 kernel/time/timer.c:1635
 [&lt;ffffffff816164f0&gt;] tick_sched_handle+0xa0/0x180 kernel/time/tick-sched.c:161
 [&lt;ffffffff81616ae4&gt;] tick_sched_timer+0x44/0x130 kernel/time/tick-sched.c:1193
 [&lt;ffffffff815e75f7&gt;] __run_hrtimer kernel/time/hrtimer.c:1393 [inline]
 [&lt;ffffffff815e75f7&gt;] __hrtimer_run_queues+0x307/0xd90 kernel/time/hrtimer.c:1455
 [&lt;ffffffff815e90ea&gt;] hrtimer_interrupt+0x2ea/0x730 kernel/time/hrtimer.c:1513
 [&lt;ffffffff844050f4&gt;] local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1031 [inline]
 [&lt;ffffffff844050f4&gt;] smp_apic_timer_interrupt+0x144/0x5e0 arch/x86/kernel/apic/apic.c:1056
 [&lt;ffffffff84401cbe&gt;] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778
RIP: 0010:do_raw_read_lock+0x22/0x80 kernel/locking/spinlock_debug.c:153
RSP: 0018:ffff8801dad07ab8 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff12
RAX: 0000000000000000 RBX: ffff8801c4135680 RCX: 0000000000000000
RDX: 1ffff10038826afe RSI: ffff88019d816bb8 RDI: ffff8801c41357f0
RBP: ffff8801dad07ac0 R08: 0000000000004b15 R09: 0000000000310273
R10: ffff88019d816bb8 R11: 0000000000000001 R12: ffff8801c41357e8
R13: 0000000000000000 R14: ffff8801cfb19850 R15: ffff8801cfb198b0
 [&lt;ffffffff8101460e&gt;] __raw_read_lock_bh include/linux/rwlock_api_smp.h:177 [inline]
 [&lt;ffffffff8101460e&gt;] _raw_read_lock_bh+0x3e/0x50 kernel/locking/spinlock.c:240
 [&lt;ffffffff840d78ca&gt;] ipv6_chk_mcast_addr+0x11a/0x6f0 net/ipv6/mcast.c:1006
 [&lt;ffffffff84023439&gt;] ip6_mc_input+0x319/0x8e0 net/ipv6/ip6_input.c:482
 [&lt;ffffffff840211c8&gt;] dst_input include/net/dst.h:449 [inline]
 [&lt;ffffffff840211c8&gt;] ip6_rcv_finish+0x408/0x610 net/ipv6/ip6_input.c:78
 [&lt;ffffffff840214de&gt;] NF_HOOK include/linux/netfilter.h:292 [inline]
 [&lt;ffffffff840214de&gt;] NF_HOOK include/linux/netfilter.h:286 [inline]
 [&lt;ffffffff840214de&gt;] ipv6_rcv+0x10e/0x420 net/ipv6/ip6_input.c:278
 [&lt;ffffffff83a29efa&gt;] __netif_receive_skb_one_core+0x12a/0x1f0 net/core/dev.c:5303
 [&lt;ffffffff83a2a15c&gt;] __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:5417
 [&lt;ffffffff83a2f536&gt;] process_backlog+0x216/0x6c0 net/core/dev.c:6243
 [&lt;ffffffff83a30d1b&gt;] napi_poll net/core/dev.c:6680 [inline]
 [&lt;ffffffff83a30d1b&gt;] net_rx_action+0x47b/0xfb0 net/core/dev.c:6748
 [&lt;ffffffff846002c8&gt;] __do_softirq+0x2c8/0x99a kernel/softirq.c:317
 [&lt;ffffffff813e656a&gt;] invoke_softirq kernel/softirq.c:399 [inline]
 [&lt;ffffffff813e656a&gt;] irq_exit+0x16a/0x1a0 kernel/softirq.c:439
 [&lt;ffffffff84405115&gt;] exiting_irq arch/x86/include/asm/apic.h:561 [inline]
 [&lt;ffffffff84405115&gt;] smp_apic_timer_interrupt+0x165/0x5e0 arch/x86/kernel/apic/apic.c:1058
 [&lt;ffffffff84401cbe&gt;] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778
 &lt;/IRQ&gt;
RIP: 0010:__sanitizer_cov_trace_pc+0x26/0x50 kernel/kcov.c:102
RSP: 0018:ffff880196033bd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
RAX: ffff88019d8161c0 RBX: 00000000ffffffff RCX: ffffc90003501000
RDX: 0000000000000002 RSI: ffffffff816236d1 RDI: 0000000000000005
RBP: ffff880196033bd8 R08: ffff88019d8161c0 R09: 0000000000000000
R10: 1ffff10032c067f0 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000000
 [&lt;ffffffff816236d1&gt;] do_futex+0x151/0x1d50 kernel/futex.c:3548
 [&lt;ffffffff816260f0&gt;] C_SYSC_futex kernel/futex_compat.c:201 [inline]
 [&lt;ffffffff816260f0&gt;] compat_SyS_futex+0x270/0x3b0 kernel/futex_compat.c:175
 [&lt;ffffffff8101da17&gt;] do_syscall_32_irqs_on arch/x86/entry/common.c:353 [inline]
 [&lt;ffffffff8101da17&gt;] do_fast_syscall_32+0x357/0xe1c arch/x86/entry/common.c:415
 [&lt;ffffffff84401a9b&gt;] entry_SYSENTER_compat+0x8b/0x9d arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7f23c69
RSP: 002b:00000000f5d1f12c EFLAGS: 00000282 ORIG_RAX: 00000000000000f0
RAX: ffffffffffffffda RBX: 000000000816af88 RCX: 0000000000000080
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000816af8c
RBP: 00000000f5d1f228 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
rcu_sched kthread starved for 10502 jiffies! g5049 c5048 f0x2 RCU_GP_WAIT_FQS(3) -&gt;state=0x0 -&gt;cpu=1
rcu_sched       R  running task on cpu   1  13048     8      2 0x90000000 179099587640
Call Trace:
 [&lt;ffffffff8147321f&gt;] context_switch+0x60f/0xa60 kernel/sched/core.c:3209
 [&lt;ffffffff8100095a&gt;] __schedule+0x5aa/0x1da0 kernel/sched/core.c:3934
 [&lt;ffffffff810021df&gt;] schedule+0x8f/0x1b0 kernel/sched/core.c:4011
 [&lt;ffffffff8101116d&gt;] schedule_timeout+0x50d/0xee0 kernel/time/timer.c:1803
 [&lt;ffffffff815c13f1&gt;] rcu_gp_kthread+0xda1/0x3b50 kernel/rcu/tree.c:2327
 [&lt;ffffffff8144b318&gt;] kthread+0x348/0x420 kernel/kthread.c:246
 [&lt;ffffffff84400266&gt;] ret_from_fork+0x56/0x70 arch/x86/entry/entry_64.S:393

Fixes: ba35f8588f47 (“ipvlan: Defer multicast / broadcast processing to a work-queue”)
Signed-off-by: Mahesh Bandewar &lt;maheshb@google.com&gt;
Reported-by: syzbot &lt;syzkaller@googlegroups.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit e18b353f102e371580f3f01dd47567a25acc3c1d ]

If there are substantial number of slaves created as simulated by
Syzbot, the backlog processing could take much longer and result
into the issue found in the Syzbot report.

INFO: rcu_sched detected stalls on CPUs/tasks:
        (detected by 1, t=10502 jiffies, g=5049, c=5048, q=752)
All QSes seen, last rcu_sched kthread activity 10502 (4294965563-4294955061), jiffies_till_next_fqs=1, root -&gt;qsmask 0x0
syz-executor.1  R  running task on cpu   1  10984 11210   3866 0x30020008 179034491270
Call Trace:
 &lt;IRQ&gt;
 [&lt;ffffffff81497163&gt;] _sched_show_task kernel/sched/core.c:8063 [inline]
 [&lt;ffffffff81497163&gt;] _sched_show_task.cold+0x2fd/0x392 kernel/sched/core.c:8030
 [&lt;ffffffff8146a91b&gt;] sched_show_task+0xb/0x10 kernel/sched/core.c:8073
 [&lt;ffffffff815c931b&gt;] print_other_cpu_stall kernel/rcu/tree.c:1577 [inline]
 [&lt;ffffffff815c931b&gt;] check_cpu_stall kernel/rcu/tree.c:1695 [inline]
 [&lt;ffffffff815c931b&gt;] __rcu_pending kernel/rcu/tree.c:3478 [inline]
 [&lt;ffffffff815c931b&gt;] rcu_pending kernel/rcu/tree.c:3540 [inline]
 [&lt;ffffffff815c931b&gt;] rcu_check_callbacks.cold+0xbb4/0xc29 kernel/rcu/tree.c:2876
 [&lt;ffffffff815e3962&gt;] update_process_times+0x32/0x80 kernel/time/timer.c:1635
 [&lt;ffffffff816164f0&gt;] tick_sched_handle+0xa0/0x180 kernel/time/tick-sched.c:161
 [&lt;ffffffff81616ae4&gt;] tick_sched_timer+0x44/0x130 kernel/time/tick-sched.c:1193
 [&lt;ffffffff815e75f7&gt;] __run_hrtimer kernel/time/hrtimer.c:1393 [inline]
 [&lt;ffffffff815e75f7&gt;] __hrtimer_run_queues+0x307/0xd90 kernel/time/hrtimer.c:1455
 [&lt;ffffffff815e90ea&gt;] hrtimer_interrupt+0x2ea/0x730 kernel/time/hrtimer.c:1513
 [&lt;ffffffff844050f4&gt;] local_apic_timer_interrupt arch/x86/kernel/apic/apic.c:1031 [inline]
 [&lt;ffffffff844050f4&gt;] smp_apic_timer_interrupt+0x144/0x5e0 arch/x86/kernel/apic/apic.c:1056
 [&lt;ffffffff84401cbe&gt;] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778
RIP: 0010:do_raw_read_lock+0x22/0x80 kernel/locking/spinlock_debug.c:153
RSP: 0018:ffff8801dad07ab8 EFLAGS: 00000a02 ORIG_RAX: ffffffffffffff12
RAX: 0000000000000000 RBX: ffff8801c4135680 RCX: 0000000000000000
RDX: 1ffff10038826afe RSI: ffff88019d816bb8 RDI: ffff8801c41357f0
RBP: ffff8801dad07ac0 R08: 0000000000004b15 R09: 0000000000310273
R10: ffff88019d816bb8 R11: 0000000000000001 R12: ffff8801c41357e8
R13: 0000000000000000 R14: ffff8801cfb19850 R15: ffff8801cfb198b0
 [&lt;ffffffff8101460e&gt;] __raw_read_lock_bh include/linux/rwlock_api_smp.h:177 [inline]
 [&lt;ffffffff8101460e&gt;] _raw_read_lock_bh+0x3e/0x50 kernel/locking/spinlock.c:240
 [&lt;ffffffff840d78ca&gt;] ipv6_chk_mcast_addr+0x11a/0x6f0 net/ipv6/mcast.c:1006
 [&lt;ffffffff84023439&gt;] ip6_mc_input+0x319/0x8e0 net/ipv6/ip6_input.c:482
 [&lt;ffffffff840211c8&gt;] dst_input include/net/dst.h:449 [inline]
 [&lt;ffffffff840211c8&gt;] ip6_rcv_finish+0x408/0x610 net/ipv6/ip6_input.c:78
 [&lt;ffffffff840214de&gt;] NF_HOOK include/linux/netfilter.h:292 [inline]
 [&lt;ffffffff840214de&gt;] NF_HOOK include/linux/netfilter.h:286 [inline]
 [&lt;ffffffff840214de&gt;] ipv6_rcv+0x10e/0x420 net/ipv6/ip6_input.c:278
 [&lt;ffffffff83a29efa&gt;] __netif_receive_skb_one_core+0x12a/0x1f0 net/core/dev.c:5303
 [&lt;ffffffff83a2a15c&gt;] __netif_receive_skb+0x2c/0x1b0 net/core/dev.c:5417
 [&lt;ffffffff83a2f536&gt;] process_backlog+0x216/0x6c0 net/core/dev.c:6243
 [&lt;ffffffff83a30d1b&gt;] napi_poll net/core/dev.c:6680 [inline]
 [&lt;ffffffff83a30d1b&gt;] net_rx_action+0x47b/0xfb0 net/core/dev.c:6748
 [&lt;ffffffff846002c8&gt;] __do_softirq+0x2c8/0x99a kernel/softirq.c:317
 [&lt;ffffffff813e656a&gt;] invoke_softirq kernel/softirq.c:399 [inline]
 [&lt;ffffffff813e656a&gt;] irq_exit+0x16a/0x1a0 kernel/softirq.c:439
 [&lt;ffffffff84405115&gt;] exiting_irq arch/x86/include/asm/apic.h:561 [inline]
 [&lt;ffffffff84405115&gt;] smp_apic_timer_interrupt+0x165/0x5e0 arch/x86/kernel/apic/apic.c:1058
 [&lt;ffffffff84401cbe&gt;] apic_timer_interrupt+0x8e/0xa0 arch/x86/entry/entry_64.S:778
 &lt;/IRQ&gt;
RIP: 0010:__sanitizer_cov_trace_pc+0x26/0x50 kernel/kcov.c:102
RSP: 0018:ffff880196033bd8 EFLAGS: 00000246 ORIG_RAX: ffffffffffffff12
RAX: ffff88019d8161c0 RBX: 00000000ffffffff RCX: ffffc90003501000
RDX: 0000000000000002 RSI: ffffffff816236d1 RDI: 0000000000000005
RBP: ffff880196033bd8 R08: ffff88019d8161c0 R09: 0000000000000000
R10: 1ffff10032c067f0 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000080 R14: 0000000000000000 R15: 0000000000000000
 [&lt;ffffffff816236d1&gt;] do_futex+0x151/0x1d50 kernel/futex.c:3548
 [&lt;ffffffff816260f0&gt;] C_SYSC_futex kernel/futex_compat.c:201 [inline]
 [&lt;ffffffff816260f0&gt;] compat_SyS_futex+0x270/0x3b0 kernel/futex_compat.c:175
 [&lt;ffffffff8101da17&gt;] do_syscall_32_irqs_on arch/x86/entry/common.c:353 [inline]
 [&lt;ffffffff8101da17&gt;] do_fast_syscall_32+0x357/0xe1c arch/x86/entry/common.c:415
 [&lt;ffffffff84401a9b&gt;] entry_SYSENTER_compat+0x8b/0x9d arch/x86/entry/entry_64_compat.S:139
RIP: 0023:0xf7f23c69
RSP: 002b:00000000f5d1f12c EFLAGS: 00000282 ORIG_RAX: 00000000000000f0
RAX: ffffffffffffffda RBX: 000000000816af88 RCX: 0000000000000080
RDX: 0000000000000000 RSI: 0000000000000000 RDI: 000000000816af8c
RBP: 00000000f5d1f228 R08: 0000000000000000 R09: 0000000000000000
R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000000
R13: 0000000000000000 R14: 0000000000000000 R15: 0000000000000000
rcu_sched kthread starved for 10502 jiffies! g5049 c5048 f0x2 RCU_GP_WAIT_FQS(3) -&gt;state=0x0 -&gt;cpu=1
rcu_sched       R  running task on cpu   1  13048     8      2 0x90000000 179099587640
Call Trace:
 [&lt;ffffffff8147321f&gt;] context_switch+0x60f/0xa60 kernel/sched/core.c:3209
 [&lt;ffffffff8100095a&gt;] __schedule+0x5aa/0x1da0 kernel/sched/core.c:3934
 [&lt;ffffffff810021df&gt;] schedule+0x8f/0x1b0 kernel/sched/core.c:4011
 [&lt;ffffffff8101116d&gt;] schedule_timeout+0x50d/0xee0 kernel/time/timer.c:1803
 [&lt;ffffffff815c13f1&gt;] rcu_gp_kthread+0xda1/0x3b50 kernel/rcu/tree.c:2327
 [&lt;ffffffff8144b318&gt;] kthread+0x348/0x420 kernel/kthread.c:246
 [&lt;ffffffff84400266&gt;] ret_from_fork+0x56/0x70 arch/x86/entry/entry_64.S:393

Fixes: ba35f8588f47 (“ipvlan: Defer multicast / broadcast processing to a work-queue”)
Signed-off-by: Mahesh Bandewar &lt;maheshb@google.com&gt;
Reported-by: syzbot &lt;syzkaller@googlegroups.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ipvlan: disallow userns cap_net_admin to change global mode/flags</title>
<updated>2019-03-19T12:14:10+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2019-02-19T23:15:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=510c625222a11ba0e4f0a4bf2d9dfe0c4862d7b0'/>
<id>510c625222a11ba0e4f0a4bf2d9dfe0c4862d7b0</id>
<content type='text'>
[ Upstream commit 7cc9f7003a969d359f608ebb701d42cafe75b84a ]

When running Docker with userns isolation e.g. --userns-remap="default"
and spawning up some containers with CAP_NET_ADMIN under this realm, I
noticed that link changes on ipvlan slave device inside that container
can affect all devices from this ipvlan group which are in other net
namespaces where the container should have no permission to make changes
to, such as the init netns, for example.

This effectively allows to undo ipvlan private mode and switch globally to
bridge mode where slaves can communicate directly without going through
hostns, or it allows to switch between global operation mode (l2/l3/l3s)
for everyone bound to the given ipvlan master device. libnetwork plugin
here is creating an ipvlan master and ipvlan slave in hostns and a slave
each that is moved into the container's netns upon creation event.

* In hostns:

  # ip -d a
  [...]
  8: cilium_host@bond0: &lt;BROADCAST,MULTICAST,NOARP,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
     link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
     ipvlan  mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
     inet 10.41.0.1/32 scope link cilium_host
       valid_lft forever preferred_lft forever
  [...]

* Spawn container &amp; change ipvlan mode setting inside of it:

  # docker run -dt --cap-add=NET_ADMIN --network cilium-net --name client -l app=test cilium/netperf
  9fff485d69dcb5ce37c9e33ca20a11ccafc236d690105aadbfb77e4f4170879c

  # docker exec -ti client ip -d a
  [...]
  10: cilium0@if4: &lt;BROADCAST,MULTICAST,NOARP,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
      inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
         valid_lft forever preferred_lft forever

  # docker exec -ti client ip link change link cilium0 name cilium0 type ipvlan mode l2

  # docker exec -ti client ip -d a
  [...]
  10: cilium0@if4: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
      inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
         valid_lft forever preferred_lft forever

* In hostns (mode switched to l2):

  # ip -d a
  [...]
  8: cilium_host@bond0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
      inet 10.41.0.1/32 scope link cilium_host
         valid_lft forever preferred_lft forever
  [...]

Same l3 -&gt; l2 switch would also happen by creating another slave inside
the container's network namespace when specifying the existing cilium0
link to derive the actual (bond0) master:

  # docker exec -ti client ip link add link cilium0 name cilium1 type ipvlan mode l2

  # docker exec -ti client ip -d a
  [...]
  2: cilium1@if4: &lt;BROADCAST,MULTICAST&gt; mtu 1500 qdisc noop state DOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
  10: cilium0@if4: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
      inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
         valid_lft forever preferred_lft forever

* In hostns:

  # ip -d a
  [...]
  8: cilium_host@bond0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
      inet 10.41.0.1/32 scope link cilium_host
         valid_lft forever preferred_lft forever
  [...]

One way to mitigate it is to check CAP_NET_ADMIN permissions of
the ipvlan master device's ns, and only then allow to change
mode or flags for all devices bound to it. Above two cases are
then disallowed after the patch.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Mahesh Bandewar &lt;maheshb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 7cc9f7003a969d359f608ebb701d42cafe75b84a ]

When running Docker with userns isolation e.g. --userns-remap="default"
and spawning up some containers with CAP_NET_ADMIN under this realm, I
noticed that link changes on ipvlan slave device inside that container
can affect all devices from this ipvlan group which are in other net
namespaces where the container should have no permission to make changes
to, such as the init netns, for example.

This effectively allows to undo ipvlan private mode and switch globally to
bridge mode where slaves can communicate directly without going through
hostns, or it allows to switch between global operation mode (l2/l3/l3s)
for everyone bound to the given ipvlan master device. libnetwork plugin
here is creating an ipvlan master and ipvlan slave in hostns and a slave
each that is moved into the container's netns upon creation event.

* In hostns:

  # ip -d a
  [...]
  8: cilium_host@bond0: &lt;BROADCAST,MULTICAST,NOARP,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
     link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
     ipvlan  mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
     inet 10.41.0.1/32 scope link cilium_host
       valid_lft forever preferred_lft forever
  [...]

* Spawn container &amp; change ipvlan mode setting inside of it:

  # docker run -dt --cap-add=NET_ADMIN --network cilium-net --name client -l app=test cilium/netperf
  9fff485d69dcb5ce37c9e33ca20a11ccafc236d690105aadbfb77e4f4170879c

  # docker exec -ti client ip -d a
  [...]
  10: cilium0@if4: &lt;BROADCAST,MULTICAST,NOARP,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l3 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
      inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
         valid_lft forever preferred_lft forever

  # docker exec -ti client ip link change link cilium0 name cilium0 type ipvlan mode l2

  # docker exec -ti client ip -d a
  [...]
  10: cilium0@if4: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
      inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
         valid_lft forever preferred_lft forever

* In hostns (mode switched to l2):

  # ip -d a
  [...]
  8: cilium_host@bond0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
      inet 10.41.0.1/32 scope link cilium_host
         valid_lft forever preferred_lft forever
  [...]

Same l3 -&gt; l2 switch would also happen by creating another slave inside
the container's network namespace when specifying the existing cilium0
link to derive the actual (bond0) master:

  # docker exec -ti client ip link add link cilium0 name cilium1 type ipvlan mode l2

  # docker exec -ti client ip -d a
  [...]
  2: cilium1@if4: &lt;BROADCAST,MULTICAST&gt; mtu 1500 qdisc noop state DOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
  10: cilium0@if4: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
      inet 10.41.197.43/32 brd 10.41.197.43 scope global cilium0
         valid_lft forever preferred_lft forever

* In hostns:

  # ip -d a
  [...]
  8: cilium_host@bond0: &lt;BROADCAST,MULTICAST,UP,LOWER_UP&gt; mtu 1500 qdisc noqueue state UNKNOWN group default qlen 1000
      link/ether 0c:c4:7a:e1:3d:cc brd ff:ff:ff:ff:ff:ff promiscuity 0 minmtu 68 maxmtu 65535
      ipvlan  mode l2 bridge numtxqueues 1 numrxqueues 1 gso_max_size 65536 gso_max_segs 65535
      inet 10.41.0.1/32 scope link cilium_host
         valid_lft forever preferred_lft forever
  [...]

One way to mitigate it is to check CAP_NET_ADMIN permissions of
the ipvlan master device's ns, and only then allow to change
mode or flags for all devices bound to it. Above two cases are
then disallowed after the patch.

Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Acked-by: Mahesh Bandewar &lt;maheshb@google.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ipvlan, l3mdev: fix broken l3s mode wrt local routes</title>
<updated>2019-02-06T16:33:27+00:00</updated>
<author>
<name>Daniel Borkmann</name>
<email>daniel@iogearbox.net</email>
</author>
<published>2019-01-30T11:49:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c574feb8a2cbb40f56a2b2fdb28fb892d97f44d5'/>
<id>c574feb8a2cbb40f56a2b2fdb28fb892d97f44d5</id>
<content type='text'>
[ Upstream commit d5256083f62e2720f75bb3c5a928a0afe47d6bc3 ]

While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
I ran into the issue that while l3 mode is working fine, l3s mode
does not have any connectivity to kube-apiserver and hence all pods
end up in Error state as well. The ipvlan master device sits on
top of a bond device and hostns traffic to kube-apiserver (also running
in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
where the latter is the address of the bond0. While in l3 mode, a
curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
works fine from hostns, neither of them do in case of l3s. In the
latter only a curl to https://127.0.0.1:37573 appeared to work where
for local addresses of bond0 I saw kernel suddenly starting to emit
ARP requests to query HW address of bond0 which remained unanswered
and neighbor entries in INCOMPLETE state. These ARP requests only
happen while in l3s.

Debugging this further, I found the issue is that l3s mode is piggy-
backing on l3 master device, and in this case local routes are using
l3mdev_master_dev_rcu(dev) instead of net-&gt;loopback_dev as per commit
f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev
if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be
a loopback"). I found that reverting them back into using the
net-&gt;loopback_dev fixed ipvlan l3s connectivity and got everything
working for the CNI.

Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the
l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
on l3 master device is to get the l3mdev_ip_rcv() receive hook for
setting the dst entry of the input route without adding its own
ipvlan specific hacks into the receive path, however, any l3 domain
semantics beyond just that are breaking l3s operation. Note that
ipvlan also has the ability to dynamically switch its internal
operation from l3 to l3s for all ports via ipvlan_set_port_mode()
at runtime. In any case, l3 vs l3s soley distinguishes itself by
'de-confusing' netfilter through switching skb-&gt;dev to ipvlan slave
device late in NF_INET_LOCAL_IN before handing the skb to L4.

Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
without any additional l3mdev semantics on top. This should also have
minimal impact since dev-&gt;priv_flags is already hot in cache. With
this set, l3s mode is working fine and I also get things like
masquerading pod traffic on the ipvlan master properly working.

  [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf

Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback")
Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode")
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Mahesh Bandewar &lt;maheshb@google.com&gt;
Cc: David Ahern &lt;dsa@cumulusnetworks.com&gt;
Cc: Florian Westphal &lt;fw@strlen.de&gt;
Cc: Martynas Pumputis &lt;m@lambda.lt&gt;
Acked-by: David Ahern &lt;dsa@cumulusnetworks.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit d5256083f62e2720f75bb3c5a928a0afe47d6bc3 ]

While implementing ipvlan l3 and l3s mode for kubernetes CNI plugin,
I ran into the issue that while l3 mode is working fine, l3s mode
does not have any connectivity to kube-apiserver and hence all pods
end up in Error state as well. The ipvlan master device sits on
top of a bond device and hostns traffic to kube-apiserver (also running
in hostns) is DNATed from 10.152.183.1:443 to 139.178.29.207:37573
where the latter is the address of the bond0. While in l3 mode, a
curl to https://10.152.183.1:443 or to https://139.178.29.207:37573
works fine from hostns, neither of them do in case of l3s. In the
latter only a curl to https://127.0.0.1:37573 appeared to work where
for local addresses of bond0 I saw kernel suddenly starting to emit
ARP requests to query HW address of bond0 which remained unanswered
and neighbor entries in INCOMPLETE state. These ARP requests only
happen while in l3s.

Debugging this further, I found the issue is that l3s mode is piggy-
backing on l3 master device, and in this case local routes are using
l3mdev_master_dev_rcu(dev) instead of net-&gt;loopback_dev as per commit
f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev
if relevant") and 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be
a loopback"). I found that reverting them back into using the
net-&gt;loopback_dev fixed ipvlan l3s connectivity and got everything
working for the CNI.

Now judging from 4fbae7d83c98 ("ipvlan: Introduce l3s mode") and the
l3mdev paper in [0] the only sole reason why ipvlan l3s is relying
on l3 master device is to get the l3mdev_ip_rcv() receive hook for
setting the dst entry of the input route without adding its own
ipvlan specific hacks into the receive path, however, any l3 domain
semantics beyond just that are breaking l3s operation. Note that
ipvlan also has the ability to dynamically switch its internal
operation from l3 to l3s for all ports via ipvlan_set_port_mode()
at runtime. In any case, l3 vs l3s soley distinguishes itself by
'de-confusing' netfilter through switching skb-&gt;dev to ipvlan slave
device late in NF_INET_LOCAL_IN before handing the skb to L4.

Minimal fix taken here is to add a IFF_L3MDEV_RX_HANDLER flag which,
if set from ipvlan setup, gets us only the wanted l3mdev_l3_rcv() hook
without any additional l3mdev semantics on top. This should also have
minimal impact since dev-&gt;priv_flags is already hot in cache. With
this set, l3s mode is working fine and I also get things like
masquerading pod traffic on the ipvlan master properly working.

  [0] https://netdevconf.org/1.2/papers/ahern-what-is-l3mdev-paper.pdf

Fixes: f5a0aab84b74 ("net: ipv4: dst for local input routes should use l3mdev if relevant")
Fixes: 5f02ce24c269 ("net: l3mdev: Allow the l3mdev to be a loopback")
Fixes: 4fbae7d83c98 ("ipvlan: Introduce l3s mode")
Signed-off-by: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Mahesh Bandewar &lt;maheshb@google.com&gt;
Cc: David Ahern &lt;dsa@cumulusnetworks.com&gt;
Cc: Florian Westphal &lt;fw@strlen.de&gt;
Cc: Martynas Pumputis &lt;m@lambda.lt&gt;
Acked-by: David Ahern &lt;dsa@cumulusnetworks.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ipvlan: call dev_change_flags when ipvlan mode is reset</title>
<updated>2018-08-24T11:12:35+00:00</updated>
<author>
<name>Hangbin Liu</name>
<email>liuhangbin@gmail.com</email>
</author>
<published>2018-07-01T08:21:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=377c72c803e802926a947567ac30053697c413d7'/>
<id>377c72c803e802926a947567ac30053697c413d7</id>
<content type='text'>
[ Upstream commit 5dc2d3996a8b221c20dd0900bdad45031a572530 ]

After we change the ipvlan mode from l3 to l2, or vice versa, we only
reset IFF_NOARP flag, but don't flush the ARP table cache, which will
cause eth-&gt;h_dest to be equal to eth-&gt;h_source in ipvlan_xmit_mode_l2().
Then the message will not come out of host.

Here is the reproducer on local host:

ip link set eth1 up
ip addr add 192.168.1.1/24 dev eth1
ip link add link eth1 ipvlan1 type ipvlan mode l3

ip netns add net1
ip link set ipvlan1 netns net1
ip netns exec net1 ip link set ipvlan1 up
ip netns exec net1 ip addr add 192.168.2.1/24 dev ipvlan1

ip route add 192.168.2.0/24 via 192.168.1.2
ping 192.168.2.2 -c 2

ip netns exec net1 ip link set ipvlan1 type ipvlan mode l2
ping 192.168.2.2 -c 2

Add the same configuration on remote host. After we set the mode to l2,
we could find that the src/dst MAC addresses are the same on eth1:

21:26:06.648565 00:b7:13:ad:d3:05 &gt; 00:b7:13:ad:d3:05, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 58356, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.2.1 &gt; 192.168.2.2: ICMP echo request, id 22686, seq 1, length 64

Fix this by calling dev_change_flags(), which will call netdevice notifier
with flag change info.

v2:
a) As pointed out by Wang Cong, check return value for dev_change_flags() when
change dev flags.
b) As suggested by Stefano and Sabrina, move flags setting before l3mdev_ops.
So we don't need to redo ipvlan_{, un}register_nf_hook() again in err path.

Reported-by: Jianlin Shi &lt;jishi@redhat.com&gt;
Reviewed-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: Sabrina Dubroca &lt;sd@queasysnail.net&gt;
Fixes: 2ad7bf3638411 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Hangbin Liu &lt;liuhangbin@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 5dc2d3996a8b221c20dd0900bdad45031a572530 ]

After we change the ipvlan mode from l3 to l2, or vice versa, we only
reset IFF_NOARP flag, but don't flush the ARP table cache, which will
cause eth-&gt;h_dest to be equal to eth-&gt;h_source in ipvlan_xmit_mode_l2().
Then the message will not come out of host.

Here is the reproducer on local host:

ip link set eth1 up
ip addr add 192.168.1.1/24 dev eth1
ip link add link eth1 ipvlan1 type ipvlan mode l3

ip netns add net1
ip link set ipvlan1 netns net1
ip netns exec net1 ip link set ipvlan1 up
ip netns exec net1 ip addr add 192.168.2.1/24 dev ipvlan1

ip route add 192.168.2.0/24 via 192.168.1.2
ping 192.168.2.2 -c 2

ip netns exec net1 ip link set ipvlan1 type ipvlan mode l2
ping 192.168.2.2 -c 2

Add the same configuration on remote host. After we set the mode to l2,
we could find that the src/dst MAC addresses are the same on eth1:

21:26:06.648565 00:b7:13:ad:d3:05 &gt; 00:b7:13:ad:d3:05, ethertype IPv4 (0x0800), length 98: (tos 0x0, ttl 64, id 58356, offset 0, flags [DF], proto ICMP (1), length 84)
    192.168.2.1 &gt; 192.168.2.2: ICMP echo request, id 22686, seq 1, length 64

Fix this by calling dev_change_flags(), which will call netdevice notifier
with flag change info.

v2:
a) As pointed out by Wang Cong, check return value for dev_change_flags() when
change dev flags.
b) As suggested by Stefano and Sabrina, move flags setting before l3mdev_ops.
So we don't need to redo ipvlan_{, un}register_nf_hook() again in err path.

Reported-by: Jianlin Shi &lt;jishi@redhat.com&gt;
Reviewed-by: Stefano Brivio &lt;sbrivio@redhat.com&gt;
Reviewed-by: Sabrina Dubroca &lt;sd@queasysnail.net&gt;
Fixes: 2ad7bf3638411 ("ipvlan: Initial check-in of the IPVLAN driver.")
Signed-off-by: Hangbin Liu &lt;liuhangbin@gmail.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@microsoft.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
