linux.git/net/core/link_watch.c, branch v6.2

net: linkwatch: only report IF_OPER_LOWERLAYERDOWN if iflink is actually down

2022-11-16T09:45:00+00:00

RFC 2863 says:

   The lowerLayerDown state is also a refinement on the down state.
   This new state indicates that this interface runs "on top of" one or
   more other interfaces (see ifStackTable) and that this interface is
   down specifically because one or more of these lower-layer interfaces
   are down.

DSA interfaces are virtual network devices, stacked on top of the DSA
master, but they have a physical MAC, with a PHY that reports a real
link status.

But since DSA (perhaps improperly) uses an iflink to describe the
relationship to its master since commit c084080151e1 ("dsa: set ->iflink
on slave interfaces to the ifindex of the parent"), default_operstate()
will misinterpret this to mean that every time the carrier of a DSA
interface is not ok, it is because of the master being not ok.

In fact, since commit c0a8a9c27493 ("net: dsa: automatically bring user
ports down when master goes down"), DSA cannot even in theory be in the
lowerLayerDown state, because it just calls dev_close_many(), thereby
going down, when the master goes down.

We could revert the commit that creates an iflink between a DSA user
port and its master, especially since now we have an alternative
IFLA_DSA_MASTER which has less side effects. But there may be tooling in
use which relies on the iflink, which has existed since 2009.

We could also probably do something local within DSA to overwrite what
rfc2863_policy() did, in a way similar to hsr_set_operstate(), but this
seems like a hack.

What seems appropriate is to follow the iflink, and check the carrier
status of that interface as well. If that's down too, yes, keep
reporting lowerLayerDown, otherwise just down.

Signed-off-by: Vladimir Oltean 
Signed-off-by: David S. Miller

net: rename reference+tracking helpers

2022-06-10T04:52:55+00:00

Netdev reference helpers have a dev_ prefix for historic
reasons. Renaming the old helpers would be too much churn
but we can rename the tracking ones which are relatively
recent and should be the default for new code.

Rename:
 dev_hold_track()    -> netdev_hold()
 dev_put_track()     -> netdev_put()
 dev_replace_track() -> netdev_ref_replace()

Link: https://lore.kernel.org/r/20220608043955.919359-1-kuba@kernel.org
Signed-off-by: Jakub Kicinski

net: extract a few internals from netdevice.h

2022-04-08T03:32:09+00:00

There's a number of functions and static variables used
under net/core/ but not from the outside. We currently
dump most of them into netdevice.h. That bad for many
reasons:
 - netdevice.h is very cluttered, hard to figure out
   what the APIs are;
 - netdevice.h is very long;
 - we have to touch netdevice.h more which causes expensive
   incremental builds.

Create a header under net/core/ and move some declarations.

The new header is also a bit of a catch-all but that's
fine, if we create more specific headers people will
likely over-think where their declaration fit best.
And end up putting them in netdevice.h, again.

More work should be done on splitting netdevice.h into more
targeted headers, but that'd be more time consuming so small
steps.

Signed-off-by: Jakub Kicinski

net: refine dev_put()/dev_hold() debugging

2022-02-05T15:22:45+00:00

We are still chasing some syzbot reports where we think a rogue dev_put()
is called with no corresponding prior dev_hold().
Unfortunately it eats a reference on dev->dev_refcnt taken by innocent
dev_hold_track(), meaning that the refcount saturation splat comes
too late to be useful.

Make sure that 'not tracked' dev_put() and dev_hold() better use
CONFIG_NET_DEV_REFCNT_TRACKER=y debug infrastructure:

Prior patch in the series allowed ref_tracker_alloc() and ref_tracker_free()
to be called with a NULL @trackerp parameter, and to use a separate refcount
only to detect too many put() even in the following case:

dev_hold_track(dev, tracker_1, GFP_ATOMIC);
 dev_hold(dev);
 dev_put(dev);
 dev_put(dev); // Should complain loudly here.
dev_put_track(dev, tracker_1); // instead of here

Add clarification about netdev_tracker_alloc() role.

v2: I replaced the dev_put() in linkwatch_do_dev()
    with __dev_put() because callers called netdev_tracker_free().

Signed-off-by: Eric Dumazet 
Signed-off-by: David S. Miller

net: linkwatch: be more careful about dev->linkwatch_dev_tracker

2021-12-15T02:45:58+00:00

Apparently a concurrent linkwatch_add_event() could
run while we are in __linkwatch_run_queue().

We need to free dev->linkwatch_dev_tracker tracker
under lweventlist_lock protection to avoid this race.

syzbot report:
[   77.935949][ T3661] reference already released.
[   77.941015][ T3661] allocated in:
[   77.944482][ T3661]  linkwatch_fire_event+0x202/0x260
[   77.950318][ T3661]  netif_carrier_on+0x9c/0x100
[   77.955120][ T3661]  __ieee80211_sta_join_ibss+0xc52/0x1590
[   77.960888][ T3661]  ieee80211_sta_create_ibss.cold+0xd2/0x11f
[   77.966908][ T3661]  ieee80211_ibss_work.cold+0x30e/0x60f
[   77.972483][ T3661]  ieee80211_iface_work+0xb70/0xd00
[   77.977715][ T3661]  process_one_work+0x9ac/0x1680
[   77.982671][ T3661]  worker_thread+0x652/0x11c0
[   77.987371][ T3661]  kthread+0x405/0x4f0
[   77.991465][ T3661]  ret_from_fork+0x1f/0x30
[   77.995895][ T3661] freed in:
[   77.999006][ T3661]  linkwatch_do_dev+0x96/0x160
[   78.004014][ T3661]  __linkwatch_run_queue+0x233/0x6a0
[   78.009496][ T3661]  linkwatch_event+0x4a/0x60
[   78.014099][ T3661]  process_one_work+0x9ac/0x1680
[   78.019034][ T3661]  worker_thread+0x652/0x11c0
[   78.023719][ T3661]  kthread+0x405/0x4f0
[   78.027810][ T3661]  ret_from_fork+0x1f/0x30
[   78.042541][ T3661] ------------[ cut here ]------------
[   78.048253][ T3661] WARNING: CPU: 0 PID: 3661 at lib/ref_tracker.c:120 ref_tracker_free.cold+0x110/0x14e
[   78.062364][ T3661] Modules linked in:
[   78.066424][ T3661] CPU: 0 PID: 3661 Comm: kworker/0:5 Not tainted 5.16.0-rc4-next-20211210-syzkaller #0
[   78.076075][ T3661] Hardware name: Google Google Compute Engine/Google Compute Engine, BIOS Google 01/01/2011
[   78.090648][ T3661] Workqueue: events linkwatch_event
[   78.095890][ T3661] RIP: 0010:ref_tracker_free.cold+0x110/0x14e
[   78.102191][ T3661] Code: ea 03 48 c1 e0 2a 0f b6 04 02 84 c0 74 04 3c 03 7e 4c 8b 7b 18 e8 6b 54 e9 fa e8 26 4d 57 f8 4c 89 ee 48 89 ef e8 fb 33 36 00 <0f> 0b 41 bd ea ff ff ff e9 bd 60 e9 fa 4c 89 f7 e8 16 45 a2 f8 e9
[   78.127211][ T3661] RSP: 0018:ffffc90002b5fb18 EFLAGS: 00010246
[   78.133684][ T3661] RAX: 0000000000000000 RBX: ffff88807467f700 RCX: 0000000000000000
[   78.141928][ T3661] RDX: 0000000000000001 RSI: 0000000000000001 RDI: 0000000000000001
[   78.150087][ T3661] RBP: ffff888057e105b8 R08: 0000000000000001 R09: ffffffff8ffa1967
[   78.158211][ T3661] R10: 0000000000000001 R11: 0000000000000000 R12: 1ffff9200056bf65
[   78.166204][ T3661] R13: 0000000000000292 R14: ffff88807467f718 R15: 00000000c0e0008c
[   78.174321][ T3661] FS:  0000000000000000(0000) GS:ffff8880b9c00000(0000) knlGS:0000000000000000
[   78.183310][ T3661] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[   78.190156][ T3661] CR2: 000000c000208800 CR3: 000000007f7b5000 CR4: 00000000003506f0
[   78.198235][ T3661] DR0: 0000000000000000 DR1: 0000000000000000 DR2: 0000000000000000
[   78.206214][ T3661] DR3: 0000000000000000 DR6: 00000000fffe0ff0 DR7: 0000000000000400
[   78.214328][ T3661] Call Trace:
[   78.217679][ T3661]  
[   78.220621][ T3661]  ? __sanitizer_cov_trace_const_cmp4+0x1c/0x70
[   78.226981][ T3661]  ? nlmsg_notify+0xbe/0x280
[   78.231607][ T3661]  ? ref_tracker_dir_exit+0x330/0x330
[   78.237654][ T3661]  ? linkwatch_do_dev+0x96/0x160
[   78.242628][ T3661]  ? __linkwatch_run_queue+0x233/0x6a0
[   78.248170][ T3661]  ? linkwatch_event+0x4a/0x60
[   78.252946][ T3661]  ? process_one_work+0x9ac/0x1680
[   78.258136][ T3661]  ? worker_thread+0x853/0x11c0
[   78.263020][ T3661]  ? kthread+0x405/0x4f0
[   78.267905][ T3661]  ? ret_from_fork+0x1f/0x30
[   78.272670][ T3661]  ? netdev_state_change+0xa1/0x130
[   78.278019][ T3661]  ? netdev_exit+0xd0/0xd0
[   78.282466][ T3661]  ? dev_activate+0x420/0xa60
[   78.287261][ T3661]  linkwatch_do_dev+0x96/0x160
[   78.292043][ T3661]  __linkwatch_run_queue+0x233/0x6a0
[   78.297505][ T3661]  ? linkwatch_do_dev+0x160/0x160
[   78.302561][ T3661]  linkwatch_event+0x4a/0x60
[   78.307225][ T3661]  process_one_work+0x9ac/0x1680
[   78.312292][ T3661]  ? pwq_dec_nr_in_flight+0x2a0/0x2a0
[   78.317757][ T3661]  ? rwlock_bug.part.0+0x90/0x90
[   78.322726][ T3661]  ? _raw_spin_lock_irq+0x41/0x50
[   78.327844][ T3661]  worker_thread+0x853/0x11c0
[   78.332543][ T3661]  ? process_one_work+0x1680/0x1680
[   78.338500][ T3661]  kthread+0x405/0x4f0
[   78.342610][ T3661]  ? set_kthread_struct+0x130/0x130

Fixes: 63f13937cbe9 ("net: linkwatch: add net device refcount tracker")
Signed-off-by: Eric Dumazet 
Reported-by: syzbot 
Link: https://lore.kernel.org/r/20211214051955.3569843-1-eric.dumazet@gmail.com
Signed-off-by: Jakub Kicinski

net: linkwatch: add net device refcount tracker

2021-12-07T00:05:44+00:00

Add a netdevice_tracker inside struct net_device, to track
the self reference when a device is in lweventlist.

Signed-off-by: Eric Dumazet 
Signed-off-by: Jakub Kicinski

net: Write lock dev_base_lock without disabling bottom halves.

2021-11-29T12:12:36+00:00

The writer acquires dev_base_lock with disabled bottom halves.
The reader can acquire dev_base_lock without disabling bottom halves
because there is no writer in softirq context.

On PREEMPT_RT the softirqs are preemptible and local_bh_disable() acts
as a lock to ensure that resources, that are protected by disabling
bottom halves, remain protected.
This leads to a circular locking dependency if the lock acquired with
disabled bottom halves (as in write_lock_bh()) and somewhere else with
enabled bottom halves (as by read_lock() in netstat_show()) followed by
disabling bottom halves (cxgb_get_stats() -> t4_wr_mbox_meat_timeout()
-> spin_lock_bh()). This is the reverse locking order.

All read_lock() invocation are from sysfs callback which are not invoked
from softirq context. Therefore there is no need to disable bottom
halves while acquiring a write lock.

Acquire the write lock of dev_base_lock without disabling bottom halves.

Reported-by: Pei Zhang 
Reported-by: Luis Claudio R. Goncalves 
Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: David S. Miller

net: linkwatch: fix failure to restore device state across suspend/resume

2021-08-11T21:43:16+00:00

After migrating my laptop from 4.19-LTS to 5.4-LTS a while ago I noticed
that my Ethernet port to which a bond and a VLAN interface are attached
appeared to remain up after resuming from suspend with the cable unplugged
(and that problem still persists with 5.10-LTS).

It happens that the following happens:

  - the network driver (e1000e here) prepares to suspend, calls e1000e_down()
    which calls netif_carrier_off() to signal that the link is going down.
  - netif_carrier_off() adds a link_watch event to the list of events for
    this device
  - the device is completely stopped.
  - the machine suspends
  - the cable is unplugged and the machine brought to another location
  - the machine is resumed
  - the queued linkwatch events are processed for the device
  - the device doesn't yet have the __LINK_STATE_PRESENT bit and its events
    are silently dropped
  - the device is resumed with its link down
  - the upper VLAN and bond interfaces are never notified that the link had
    been turned down and remain up
  - the only way to provoke a change is to physically connect the machine
    to a port and possibly unplug it.

The state after resume looks like this:
  $ ip -br li | egrep 'bond|eth'
  bond0            UP             e8:6a:64:64:64:64 
  eth0             DOWN           e8:6a:64:64:64:64 
  eth0.2@eth0      UP             e8:6a:64:64:64:64 

Placing an explicit call to netdev_state_change() either in the suspend
or the resume code in the NIC driver worked around this but the solution
is not satisfying.

The issue in fact really is in link_watch that loses events while it
ought not to. It happens that the test for the device being present was
added by commit 124eee3f6955 ("net: linkwatch: add check for netdevice
being present to linkwatch_do_dev") in 4.20 to avoid an access to
devices that are not present.

Instead of dropping events, this patch proceeds slightly differently by
postponing their handling so that they happen after the device is fully
resumed.

Fixes: 124eee3f6955 ("net: linkwatch: add check for netdevice being present to linkwatch_do_dev")
Link: https://lists.openwall.net/netdev/2018/03/15/62
Cc: Heiner Kallweit 
Cc: Geert Uytterhoeven 
Cc: Florian Fainelli 
Signed-off-by: Willy Tarreau 
Link: https://lore.kernel.org/r/20210809160628.22623-1-w@1wt.eu
Signed-off-by: Jakub Kicinski

net: Add IF_OPER_TESTING

2020-04-20T19:43:24+00:00

RFC 2863 defines the operational state testing. Add support for this
state, both as a IF_LINK_MODE_ and __LINK_STATE_.

Signed-off-by: Andrew Lunn 
Signed-off-by: David S. Miller

net: link_watch: prevent starvation when processing linkwatch wq

2019-07-02T02:02:47+00:00

When user has configured a large number of virtual netdev, such
as 4K vlans, the carrier on/off operation of the real netdev
will also cause it's virtual netdev's link state to be processed
in linkwatch. Currently, the processing is done in a work queue,
which may cause rtnl locking starvation problem and worker
starvation problem for other work queue, such as irqfd_inject wq.

This patch releases the cpu when link watch worker has processed
a fixed number of netdev' link watch event, and schedule the
work queue again when there is still link watch event remaining.

Signed-off-by: Yunsheng Lin 
Signed-off-by: David S. Miller