<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/net, branch v4.9.22</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>svcauth_gss: Close connection when dropping an incoming message</title>
<updated>2017-04-12T10:41:17+00:00</updated>
<author>
<name>Chuck Lever</name>
<email>chuck.lever@oracle.com</email>
</author>
<published>2017-04-04T19:32:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=607ca1dccbbd880b9f2bddfda23ec8aed1c5acbe'/>
<id>607ca1dccbbd880b9f2bddfda23ec8aed1c5acbe</id>
<content type='text'>
[ Upstream commit 4d712ef1db05c3aa5c3b690a50c37ebad584c53f ]

S5.3.3.1 of RFC 2203 requires that an incoming GSS-wrapped message
whose sequence number lies outside the current window is dropped.
The rationale is:

  The reason for discarding requests silently is that the server
  is unable to determine if the duplicate or out of range request
  was due to a sequencing problem in the client, network, or the
  operating system, or due to some quirk in routing, or a replay
  attack by an intruder.  Discarding the request allows the client
  to recover after timing out, if indeed the duplication was
  unintentional or well intended.

However, clients may rely on the server dropping the connection to
indicate that a retransmit is needed. Without a connection reset, a
client can wait forever without retransmitting, and the workload
just stops dead. I've reproduced this behavior by running xfstests
generic/323 on an NFSv4.0 mount with proto=rdma and sec=krb5i.

To address this issue, have the server close the connection when it
silently discards an incoming message due to a GSS sequence number
problem.

There are a few other places where the server will never reply.
Change those spots in a similar fashion.

Signed-off-by: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@redhat.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@verizon.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 4d712ef1db05c3aa5c3b690a50c37ebad584c53f ]

S5.3.3.1 of RFC 2203 requires that an incoming GSS-wrapped message
whose sequence number lies outside the current window is dropped.
The rationale is:

  The reason for discarding requests silently is that the server
  is unable to determine if the duplicate or out of range request
  was due to a sequencing problem in the client, network, or the
  operating system, or due to some quirk in routing, or a replay
  attack by an intruder.  Discarding the request allows the client
  to recover after timing out, if indeed the duplication was
  unintentional or well intended.

However, clients may rely on the server dropping the connection to
indicate that a retransmit is needed. Without a connection reset, a
client can wait forever without retransmitting, and the workload
just stops dead. I've reproduced this behavior by running xfstests
generic/323 on an NFSv4.0 mount with proto=rdma and sec=krb5i.

To address this issue, have the server close the connection when it
silently discards an incoming message due to a GSS sequence number
problem.

There are a few other places where the server will never reply.
Change those spots in a similar fashion.

Signed-off-by: Chuck Lever &lt;chuck.lever@oracle.com&gt;
Signed-off-by: J. Bruce Fields &lt;bfields@redhat.com&gt;
Signed-off-by: Sasha Levin &lt;alexander.levin@verizon.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mac80211: unconditionally start new netdev queues with iTXQ support</title>
<updated>2017-04-12T10:41:12+00:00</updated>
<author>
<name>Johannes Berg</name>
<email>johannes.berg@intel.com</email>
</author>
<published>2017-03-29T12:15:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c0321505df2eedfb468512d629cd198b1abc2d03'/>
<id>c0321505df2eedfb468512d629cd198b1abc2d03</id>
<content type='text'>
commit 7d65f82954dadbbe7b6e1aec7e07ad17bc6d958b upstream.

When internal mac80211 TXQs aren't supported, netdev queues must
always started out started even when driver queues are stopped
while the interface is added. This is necessary because with the
internal TXQ support netdev queues are never stopped and packet
scheduling/dropping is done in mac80211.

Fixes: 80a83cfc434b1 ("mac80211: skip netdev queue control with software queuing")
Reported-and-tested-by: Sven Eckelmann &lt;sven.eckelmann@openmesh.com&gt;
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 7d65f82954dadbbe7b6e1aec7e07ad17bc6d958b upstream.

When internal mac80211 TXQs aren't supported, netdev queues must
always started out started even when driver queues are stopped
while the interface is added. This is necessary because with the
internal TXQ support netdev queues are never stopped and packet
scheduling/dropping is done in mac80211.

Fixes: 80a83cfc434b1 ("mac80211: skip netdev queue control with software queuing")
Reported-and-tested-by: Sven Eckelmann &lt;sven.eckelmann@openmesh.com&gt;
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>cfg80211: check rdev resume callback only for registered wiphy</title>
<updated>2017-04-12T10:41:11+00:00</updated>
<author>
<name>Arend Van Spriel</name>
<email>arend.vanspriel@broadcom.com</email>
</author>
<published>2017-03-28T08:11:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=57e1e90dda74f87bef31bcc5eea89f775b7b3c69'/>
<id>57e1e90dda74f87bef31bcc5eea89f775b7b3c69</id>
<content type='text'>
commit b3ef5520c1eabb56064474043c7c55a1a65b8708 upstream.

We got the following use-after-free KASAN report:

 BUG: KASAN: use-after-free in wiphy_resume+0x591/0x5a0 [cfg80211]
	 at addr ffff8803fc244090
 Read of size 8 by task kworker/u16:24/2587
 CPU: 6 PID: 2587 Comm: kworker/u16:24 Tainted: G    B 4.9.13-debug+
 Hardware name: Dell Inc. XPS 15 9550/0N7TVV, BIOS 1.2.19 12/22/2016
 Workqueue: events_unbound async_run_entry_fn
  ffff880425d4f9d8 ffffffffaeedb541 ffff88042b80ef00 ffff8803fc244088
  ffff880425d4fa00 ffffffffae84d7a1 ffff880425d4fa98 ffff8803fc244080
  ffff88042b80ef00 ffff880425d4fa88 ffffffffae84da3a ffffffffc141f7d9
 Call Trace:
  [&lt;ffffffffaeedb541&gt;] dump_stack+0x85/0xc4
  [&lt;ffffffffae84d7a1&gt;] kasan_object_err+0x21/0x70
  [&lt;ffffffffae84da3a&gt;] kasan_report_error+0x1fa/0x500
  [&lt;ffffffffc141f7d9&gt;] ? cfg80211_bss_age+0x39/0xc0 [cfg80211]
  [&lt;ffffffffc141f83a&gt;] ? cfg80211_bss_age+0x9a/0xc0 [cfg80211]
  [&lt;ffffffffae48d46d&gt;] ? trace_hardirqs_on+0xd/0x10
  [&lt;ffffffffc13fb1c0&gt;] ? wiphy_suspend+0xc70/0xc70 [cfg80211]
  [&lt;ffffffffae84def1&gt;] __asan_report_load8_noabort+0x61/0x70
  [&lt;ffffffffc13fb100&gt;] ? wiphy_suspend+0xbb0/0xc70 [cfg80211]
  [&lt;ffffffffc13fb751&gt;] ? wiphy_resume+0x591/0x5a0 [cfg80211]
  [&lt;ffffffffc13fb751&gt;] wiphy_resume+0x591/0x5a0 [cfg80211]
  [&lt;ffffffffc13fb1c0&gt;] ? wiphy_suspend+0xc70/0xc70 [cfg80211]
  [&lt;ffffffffaf3b206e&gt;] dpm_run_callback+0x6e/0x4f0
  [&lt;ffffffffaf3b31b2&gt;] device_resume+0x1c2/0x670
  [&lt;ffffffffaf3b367d&gt;] async_resume+0x1d/0x50
  [&lt;ffffffffae3ee84e&gt;] async_run_entry_fn+0xfe/0x610
  [&lt;ffffffffae3d0666&gt;] process_one_work+0x716/0x1a50
  [&lt;ffffffffae3d05c9&gt;] ? process_one_work+0x679/0x1a50
  [&lt;ffffffffafdd7b6d&gt;] ? _raw_spin_unlock_irq+0x3d/0x60
  [&lt;ffffffffae3cff50&gt;] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
  [&lt;ffffffffae3d1a80&gt;] worker_thread+0xe0/0x1460
  [&lt;ffffffffae3d19a0&gt;] ? process_one_work+0x1a50/0x1a50
  [&lt;ffffffffae3e54c2&gt;] kthread+0x222/0x2e0
  [&lt;ffffffffae3e52a0&gt;] ? kthread_park+0x80/0x80
  [&lt;ffffffffae3e52a0&gt;] ? kthread_park+0x80/0x80
  [&lt;ffffffffae3e52a0&gt;] ? kthread_park+0x80/0x80
  [&lt;ffffffffafdd86aa&gt;] ret_from_fork+0x2a/0x40
 Object at ffff8803fc244088, in cache kmalloc-1024 size: 1024
 Allocated:
 PID = 71
  save_stack_trace+0x1b/0x20
  save_stack+0x46/0xd0
  kasan_kmalloc+0xad/0xe0
  kasan_slab_alloc+0x12/0x20
  __kmalloc_track_caller+0x134/0x360
  kmemdup+0x20/0x50
  brcmf_cfg80211_attach+0x10b/0x3a90 [brcmfmac]
  brcmf_bus_start+0x19a/0x9a0 [brcmfmac]
  brcmf_pcie_setup+0x1f1a/0x3680 [brcmfmac]
  brcmf_fw_request_nvram_done+0x44c/0x11b0 [brcmfmac]
  request_firmware_work_func+0x135/0x280
  process_one_work+0x716/0x1a50
  worker_thread+0xe0/0x1460
  kthread+0x222/0x2e0
  ret_from_fork+0x2a/0x40
 Freed:
 PID = 2568
  save_stack_trace+0x1b/0x20
  save_stack+0x46/0xd0
  kasan_slab_free+0x71/0xb0
  kfree+0xe8/0x2e0
  brcmf_cfg80211_detach+0x62/0xf0 [brcmfmac]
  brcmf_detach+0x14a/0x2b0 [brcmfmac]
  brcmf_pcie_remove+0x140/0x5d0 [brcmfmac]
  brcmf_pcie_pm_leave_D3+0x198/0x2e0 [brcmfmac]
  pci_pm_resume+0x186/0x220
  dpm_run_callback+0x6e/0x4f0
  device_resume+0x1c2/0x670
  async_resume+0x1d/0x50
  async_run_entry_fn+0xfe/0x610
  process_one_work+0x716/0x1a50
  worker_thread+0xe0/0x1460
  kthread+0x222/0x2e0
  ret_from_fork+0x2a/0x40
 Memory state around the buggy address:
  ffff8803fc243f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff8803fc244000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 &gt;ffff8803fc244080: fc fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                          ^
  ffff8803fc244100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8803fc244180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

What is happening is that brcmf_pcie_resume() detects a device that
is no longer responsive and it decides to unbind resulting in a
wiphy_unregister() and wiphy_free() call. Now the wiphy instance
remains allocated, because PM needs to call wiphy_resume() for it.
However, brcmfmac already does a kfree() for the struct
cfg80211_registered_device::ops field. Change the checks in
wiphy_resume() to only access the struct cfg80211_registered_device::ops
if the wiphy instance is still registered at this time.

Reported-by: Daniel J Blueman &lt;daniel@quora.org&gt;
Reviewed-by: Hante Meuleman &lt;hante.meuleman@broadcom.com&gt;
Reviewed-by: Pieter-Paul Giesberts &lt;pieter-paul.giesberts@broadcom.com&gt;
Reviewed-by: Franky Lin &lt;franky.lin@broadcom.com&gt;
Signed-off-by: Arend van Spriel &lt;arend.vanspriel@broadcom.com&gt;
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit b3ef5520c1eabb56064474043c7c55a1a65b8708 upstream.

We got the following use-after-free KASAN report:

 BUG: KASAN: use-after-free in wiphy_resume+0x591/0x5a0 [cfg80211]
	 at addr ffff8803fc244090
 Read of size 8 by task kworker/u16:24/2587
 CPU: 6 PID: 2587 Comm: kworker/u16:24 Tainted: G    B 4.9.13-debug+
 Hardware name: Dell Inc. XPS 15 9550/0N7TVV, BIOS 1.2.19 12/22/2016
 Workqueue: events_unbound async_run_entry_fn
  ffff880425d4f9d8 ffffffffaeedb541 ffff88042b80ef00 ffff8803fc244088
  ffff880425d4fa00 ffffffffae84d7a1 ffff880425d4fa98 ffff8803fc244080
  ffff88042b80ef00 ffff880425d4fa88 ffffffffae84da3a ffffffffc141f7d9
 Call Trace:
  [&lt;ffffffffaeedb541&gt;] dump_stack+0x85/0xc4
  [&lt;ffffffffae84d7a1&gt;] kasan_object_err+0x21/0x70
  [&lt;ffffffffae84da3a&gt;] kasan_report_error+0x1fa/0x500
  [&lt;ffffffffc141f7d9&gt;] ? cfg80211_bss_age+0x39/0xc0 [cfg80211]
  [&lt;ffffffffc141f83a&gt;] ? cfg80211_bss_age+0x9a/0xc0 [cfg80211]
  [&lt;ffffffffae48d46d&gt;] ? trace_hardirqs_on+0xd/0x10
  [&lt;ffffffffc13fb1c0&gt;] ? wiphy_suspend+0xc70/0xc70 [cfg80211]
  [&lt;ffffffffae84def1&gt;] __asan_report_load8_noabort+0x61/0x70
  [&lt;ffffffffc13fb100&gt;] ? wiphy_suspend+0xbb0/0xc70 [cfg80211]
  [&lt;ffffffffc13fb751&gt;] ? wiphy_resume+0x591/0x5a0 [cfg80211]
  [&lt;ffffffffc13fb751&gt;] wiphy_resume+0x591/0x5a0 [cfg80211]
  [&lt;ffffffffc13fb1c0&gt;] ? wiphy_suspend+0xc70/0xc70 [cfg80211]
  [&lt;ffffffffaf3b206e&gt;] dpm_run_callback+0x6e/0x4f0
  [&lt;ffffffffaf3b31b2&gt;] device_resume+0x1c2/0x670
  [&lt;ffffffffaf3b367d&gt;] async_resume+0x1d/0x50
  [&lt;ffffffffae3ee84e&gt;] async_run_entry_fn+0xfe/0x610
  [&lt;ffffffffae3d0666&gt;] process_one_work+0x716/0x1a50
  [&lt;ffffffffae3d05c9&gt;] ? process_one_work+0x679/0x1a50
  [&lt;ffffffffafdd7b6d&gt;] ? _raw_spin_unlock_irq+0x3d/0x60
  [&lt;ffffffffae3cff50&gt;] ? pwq_dec_nr_in_flight+0x2b0/0x2b0
  [&lt;ffffffffae3d1a80&gt;] worker_thread+0xe0/0x1460
  [&lt;ffffffffae3d19a0&gt;] ? process_one_work+0x1a50/0x1a50
  [&lt;ffffffffae3e54c2&gt;] kthread+0x222/0x2e0
  [&lt;ffffffffae3e52a0&gt;] ? kthread_park+0x80/0x80
  [&lt;ffffffffae3e52a0&gt;] ? kthread_park+0x80/0x80
  [&lt;ffffffffae3e52a0&gt;] ? kthread_park+0x80/0x80
  [&lt;ffffffffafdd86aa&gt;] ret_from_fork+0x2a/0x40
 Object at ffff8803fc244088, in cache kmalloc-1024 size: 1024
 Allocated:
 PID = 71
  save_stack_trace+0x1b/0x20
  save_stack+0x46/0xd0
  kasan_kmalloc+0xad/0xe0
  kasan_slab_alloc+0x12/0x20
  __kmalloc_track_caller+0x134/0x360
  kmemdup+0x20/0x50
  brcmf_cfg80211_attach+0x10b/0x3a90 [brcmfmac]
  brcmf_bus_start+0x19a/0x9a0 [brcmfmac]
  brcmf_pcie_setup+0x1f1a/0x3680 [brcmfmac]
  brcmf_fw_request_nvram_done+0x44c/0x11b0 [brcmfmac]
  request_firmware_work_func+0x135/0x280
  process_one_work+0x716/0x1a50
  worker_thread+0xe0/0x1460
  kthread+0x222/0x2e0
  ret_from_fork+0x2a/0x40
 Freed:
 PID = 2568
  save_stack_trace+0x1b/0x20
  save_stack+0x46/0xd0
  kasan_slab_free+0x71/0xb0
  kfree+0xe8/0x2e0
  brcmf_cfg80211_detach+0x62/0xf0 [brcmfmac]
  brcmf_detach+0x14a/0x2b0 [brcmfmac]
  brcmf_pcie_remove+0x140/0x5d0 [brcmfmac]
  brcmf_pcie_pm_leave_D3+0x198/0x2e0 [brcmfmac]
  pci_pm_resume+0x186/0x220
  dpm_run_callback+0x6e/0x4f0
  device_resume+0x1c2/0x670
  async_resume+0x1d/0x50
  async_run_entry_fn+0xfe/0x610
  process_one_work+0x716/0x1a50
  worker_thread+0xe0/0x1460
  kthread+0x222/0x2e0
  ret_from_fork+0x2a/0x40
 Memory state around the buggy address:
  ffff8803fc243f80: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
  ffff8803fc244000: fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc fc
 &gt;ffff8803fc244080: fc fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
                          ^
  ffff8803fc244100: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb
  ffff8803fc244180: fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb fb

What is happening is that brcmf_pcie_resume() detects a device that
is no longer responsive and it decides to unbind resulting in a
wiphy_unregister() and wiphy_free() call. Now the wiphy instance
remains allocated, because PM needs to call wiphy_resume() for it.
However, brcmfmac already does a kfree() for the struct
cfg80211_registered_device::ops field. Change the checks in
wiphy_resume() to only access the struct cfg80211_registered_device::ops
if the wiphy instance is still registered at this time.

Reported-by: Daniel J Blueman &lt;daniel@quora.org&gt;
Reviewed-by: Hante Meuleman &lt;hante.meuleman@broadcom.com&gt;
Reviewed-by: Pieter-Paul Giesberts &lt;pieter-paul.giesberts@broadcom.com&gt;
Reviewed-by: Franky Lin &lt;franky.lin@broadcom.com&gt;
Signed-off-by: Arend van Spriel &lt;arend.vanspriel@broadcom.com&gt;
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: force GFP_NOIO for socket allocations</title>
<updated>2017-04-08T07:30:30+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>idryomov@gmail.com</email>
</author>
<published>2017-03-21T12:44:28+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=8601537724611b724920fd397cc9fdb181e92ed3'/>
<id>8601537724611b724920fd397cc9fdb181e92ed3</id>
<content type='text'>
commit 633ee407b9d15a75ac9740ba9d3338815e1fcb95 upstream.

sock_alloc_inode() allocates socket+inode and socket_wq with
GFP_KERNEL, which is not allowed on the writeback path:

    Workqueue: ceph-msgr con_work [libceph]
    ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
    0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
    ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
    Call Trace:
    [&lt;ffffffff816dd629&gt;] schedule+0x29/0x70
    [&lt;ffffffff816e066d&gt;] schedule_timeout+0x1bd/0x200
    [&lt;ffffffff81093ffc&gt;] ? ttwu_do_wakeup+0x2c/0x120
    [&lt;ffffffff81094266&gt;] ? ttwu_do_activate.constprop.135+0x66/0x70
    [&lt;ffffffff816deb5f&gt;] wait_for_completion+0xbf/0x180
    [&lt;ffffffff81097cd0&gt;] ? try_to_wake_up+0x390/0x390
    [&lt;ffffffff81086335&gt;] flush_work+0x165/0x250
    [&lt;ffffffff81082940&gt;] ? worker_detach_from_pool+0xd0/0xd0
    [&lt;ffffffffa03b65b1&gt;] xlog_cil_force_lsn+0x81/0x200 [xfs]
    [&lt;ffffffff816d6b42&gt;] ? __slab_free+0xee/0x234
    [&lt;ffffffffa03b4b1d&gt;] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
    [&lt;ffffffff811adc1e&gt;] ? lookup_page_cgroup_used+0xe/0x30
    [&lt;ffffffffa039a723&gt;] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [&lt;ffffffffa03b4dcf&gt;] xfs_log_force_lsn+0x3f/0xf0 [xfs]
    [&lt;ffffffffa039a723&gt;] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [&lt;ffffffffa03a62c6&gt;] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
    [&lt;ffffffff810aa250&gt;] ? wake_atomic_t_function+0x40/0x40
    [&lt;ffffffffa039a723&gt;] xfs_reclaim_inode+0xa3/0x330 [xfs]
    [&lt;ffffffffa039ac07&gt;] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
    [&lt;ffffffffa039bb13&gt;] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
    [&lt;ffffffffa03ab745&gt;] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
    [&lt;ffffffff811c0c18&gt;] super_cache_scan+0x178/0x180
    [&lt;ffffffff8115912e&gt;] shrink_slab_node+0x14e/0x340
    [&lt;ffffffff811afc3b&gt;] ? mem_cgroup_iter+0x16b/0x450
    [&lt;ffffffff8115af70&gt;] shrink_slab+0x100/0x140
    [&lt;ffffffff8115e425&gt;] do_try_to_free_pages+0x335/0x490
    [&lt;ffffffff8115e7f9&gt;] try_to_free_pages+0xb9/0x1f0
    [&lt;ffffffff816d56e4&gt;] ? __alloc_pages_direct_compact+0x69/0x1be
    [&lt;ffffffff81150cba&gt;] __alloc_pages_nodemask+0x69a/0xb40
    [&lt;ffffffff8119743e&gt;] alloc_pages_current+0x9e/0x110
    [&lt;ffffffff811a0ac5&gt;] new_slab+0x2c5/0x390
    [&lt;ffffffff816d71c4&gt;] __slab_alloc+0x33b/0x459
    [&lt;ffffffff815b906d&gt;] ? sock_alloc_inode+0x2d/0xd0
    [&lt;ffffffff8164bda1&gt;] ? inet_sendmsg+0x71/0xc0
    [&lt;ffffffff815b906d&gt;] ? sock_alloc_inode+0x2d/0xd0
    [&lt;ffffffff811a21f2&gt;] kmem_cache_alloc+0x1a2/0x1b0
    [&lt;ffffffff815b906d&gt;] sock_alloc_inode+0x2d/0xd0
    [&lt;ffffffff811d8566&gt;] alloc_inode+0x26/0xa0
    [&lt;ffffffff811da04a&gt;] new_inode_pseudo+0x1a/0x70
    [&lt;ffffffff815b933e&gt;] sock_alloc+0x1e/0x80
    [&lt;ffffffff815ba855&gt;] __sock_create+0x95/0x220
    [&lt;ffffffff815baa04&gt;] sock_create_kern+0x24/0x30
    [&lt;ffffffffa04794d9&gt;] con_work+0xef9/0x2050 [libceph]
    [&lt;ffffffffa04aa9ec&gt;] ? rbd_img_request_submit+0x4c/0x60 [rbd]
    [&lt;ffffffff81084c19&gt;] process_one_work+0x159/0x4f0
    [&lt;ffffffff8108561b&gt;] worker_thread+0x11b/0x530
    [&lt;ffffffff81085500&gt;] ? create_worker+0x1d0/0x1d0
    [&lt;ffffffff8108b6f9&gt;] kthread+0xc9/0xe0
    [&lt;ffffffff8108b630&gt;] ? flush_kthread_worker+0x90/0x90
    [&lt;ffffffff816e1b98&gt;] ret_from_fork+0x58/0x90
    [&lt;ffffffff8108b630&gt;] ? flush_kthread_worker+0x90/0x90

Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.

Link: http://tracker.ceph.com/issues/19309
Reported-by: Sergey Jerusalimov &lt;wintchester@gmail.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 633ee407b9d15a75ac9740ba9d3338815e1fcb95 upstream.

sock_alloc_inode() allocates socket+inode and socket_wq with
GFP_KERNEL, which is not allowed on the writeback path:

    Workqueue: ceph-msgr con_work [libceph]
    ffff8810871cb018 0000000000000046 0000000000000000 ffff881085d40000
    0000000000012b00 ffff881025cad428 ffff8810871cbfd8 0000000000012b00
    ffff880102fc1000 ffff881085d40000 ffff8810871cb038 ffff8810871cb148
    Call Trace:
    [&lt;ffffffff816dd629&gt;] schedule+0x29/0x70
    [&lt;ffffffff816e066d&gt;] schedule_timeout+0x1bd/0x200
    [&lt;ffffffff81093ffc&gt;] ? ttwu_do_wakeup+0x2c/0x120
    [&lt;ffffffff81094266&gt;] ? ttwu_do_activate.constprop.135+0x66/0x70
    [&lt;ffffffff816deb5f&gt;] wait_for_completion+0xbf/0x180
    [&lt;ffffffff81097cd0&gt;] ? try_to_wake_up+0x390/0x390
    [&lt;ffffffff81086335&gt;] flush_work+0x165/0x250
    [&lt;ffffffff81082940&gt;] ? worker_detach_from_pool+0xd0/0xd0
    [&lt;ffffffffa03b65b1&gt;] xlog_cil_force_lsn+0x81/0x200 [xfs]
    [&lt;ffffffff816d6b42&gt;] ? __slab_free+0xee/0x234
    [&lt;ffffffffa03b4b1d&gt;] _xfs_log_force_lsn+0x4d/0x2c0 [xfs]
    [&lt;ffffffff811adc1e&gt;] ? lookup_page_cgroup_used+0xe/0x30
    [&lt;ffffffffa039a723&gt;] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [&lt;ffffffffa03b4dcf&gt;] xfs_log_force_lsn+0x3f/0xf0 [xfs]
    [&lt;ffffffffa039a723&gt;] ? xfs_reclaim_inode+0xa3/0x330 [xfs]
    [&lt;ffffffffa03a62c6&gt;] xfs_iunpin_wait+0xc6/0x1a0 [xfs]
    [&lt;ffffffff810aa250&gt;] ? wake_atomic_t_function+0x40/0x40
    [&lt;ffffffffa039a723&gt;] xfs_reclaim_inode+0xa3/0x330 [xfs]
    [&lt;ffffffffa039ac07&gt;] xfs_reclaim_inodes_ag+0x257/0x3d0 [xfs]
    [&lt;ffffffffa039bb13&gt;] xfs_reclaim_inodes_nr+0x33/0x40 [xfs]
    [&lt;ffffffffa03ab745&gt;] xfs_fs_free_cached_objects+0x15/0x20 [xfs]
    [&lt;ffffffff811c0c18&gt;] super_cache_scan+0x178/0x180
    [&lt;ffffffff8115912e&gt;] shrink_slab_node+0x14e/0x340
    [&lt;ffffffff811afc3b&gt;] ? mem_cgroup_iter+0x16b/0x450
    [&lt;ffffffff8115af70&gt;] shrink_slab+0x100/0x140
    [&lt;ffffffff8115e425&gt;] do_try_to_free_pages+0x335/0x490
    [&lt;ffffffff8115e7f9&gt;] try_to_free_pages+0xb9/0x1f0
    [&lt;ffffffff816d56e4&gt;] ? __alloc_pages_direct_compact+0x69/0x1be
    [&lt;ffffffff81150cba&gt;] __alloc_pages_nodemask+0x69a/0xb40
    [&lt;ffffffff8119743e&gt;] alloc_pages_current+0x9e/0x110
    [&lt;ffffffff811a0ac5&gt;] new_slab+0x2c5/0x390
    [&lt;ffffffff816d71c4&gt;] __slab_alloc+0x33b/0x459
    [&lt;ffffffff815b906d&gt;] ? sock_alloc_inode+0x2d/0xd0
    [&lt;ffffffff8164bda1&gt;] ? inet_sendmsg+0x71/0xc0
    [&lt;ffffffff815b906d&gt;] ? sock_alloc_inode+0x2d/0xd0
    [&lt;ffffffff811a21f2&gt;] kmem_cache_alloc+0x1a2/0x1b0
    [&lt;ffffffff815b906d&gt;] sock_alloc_inode+0x2d/0xd0
    [&lt;ffffffff811d8566&gt;] alloc_inode+0x26/0xa0
    [&lt;ffffffff811da04a&gt;] new_inode_pseudo+0x1a/0x70
    [&lt;ffffffff815b933e&gt;] sock_alloc+0x1e/0x80
    [&lt;ffffffff815ba855&gt;] __sock_create+0x95/0x220
    [&lt;ffffffff815baa04&gt;] sock_create_kern+0x24/0x30
    [&lt;ffffffffa04794d9&gt;] con_work+0xef9/0x2050 [libceph]
    [&lt;ffffffffa04aa9ec&gt;] ? rbd_img_request_submit+0x4c/0x60 [rbd]
    [&lt;ffffffff81084c19&gt;] process_one_work+0x159/0x4f0
    [&lt;ffffffff8108561b&gt;] worker_thread+0x11b/0x530
    [&lt;ffffffff81085500&gt;] ? create_worker+0x1d0/0x1d0
    [&lt;ffffffff8108b6f9&gt;] kthread+0xc9/0xe0
    [&lt;ffffffff8108b630&gt;] ? flush_kthread_worker+0x90/0x90
    [&lt;ffffffff816e1b98&gt;] ret_from_fork+0x58/0x90
    [&lt;ffffffff8108b630&gt;] ? flush_kthread_worker+0x90/0x90

Use memalloc_noio_{save,restore}() to temporarily force GFP_NOIO here.

Link: http://tracker.ceph.com/issues/19309
Reported-by: Sergey Jerusalimov &lt;wintchester@gmail.com&gt;
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Jeff Layton &lt;jlayton@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>xfrm_user: validate XFRM_MSG_NEWAE incoming ESN size harder</title>
<updated>2017-03-31T08:31:45+00:00</updated>
<author>
<name>Andy Whitcroft</name>
<email>apw@canonical.com</email>
</author>
<published>2017-03-23T07:45:44+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=79191ea36dc9be10a9c9b03d6b341ed2d2f76045'/>
<id>79191ea36dc9be10a9c9b03d6b341ed2d2f76045</id>
<content type='text'>
commit f843ee6dd019bcece3e74e76ad9df0155655d0df upstream.

Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to
wrapping issues.  To ensure we are correctly ensuring that the two ESN
structures are the same size compare both the overall size as reported
by xfrm_replay_state_esn_len() and the internal length are the same.

CVE-2017-7184
Signed-off-by: Andy Whitcroft &lt;apw@canonical.com&gt;
Acked-by: Steffen Klassert &lt;steffen.klassert@secunet.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit f843ee6dd019bcece3e74e76ad9df0155655d0df upstream.

Kees Cook has pointed out that xfrm_replay_state_esn_len() is subject to
wrapping issues.  To ensure we are correctly ensuring that the two ESN
structures are the same size compare both the overall size as reported
by xfrm_replay_state_esn_len() and the internal length are the same.

CVE-2017-7184
Signed-off-by: Andy Whitcroft &lt;apw@canonical.com&gt;
Acked-by: Steffen Klassert &lt;steffen.klassert@secunet.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>xfrm_user: validate XFRM_MSG_NEWAE XFRMA_REPLAY_ESN_VAL replay_window</title>
<updated>2017-03-31T08:31:45+00:00</updated>
<author>
<name>Andy Whitcroft</name>
<email>apw@canonical.com</email>
</author>
<published>2017-03-22T07:29:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=64a5465799ee40e3d54d9da3037934cd4b7b502f'/>
<id>64a5465799ee40e3d54d9da3037934cd4b7b502f</id>
<content type='text'>
commit 677e806da4d916052585301785d847c3b3e6186a upstream.

When a new xfrm state is created during an XFRM_MSG_NEWSA call we
validate the user supplied replay_esn to ensure that the size is valid
and to ensure that the replay_window size is within the allocated
buffer.  However later it is possible to update this replay_esn via a
XFRM_MSG_NEWAE call.  There we again validate the size of the supplied
buffer matches the existing state and if so inject the contents.  We do
not at this point check that the replay_window is within the allocated
memory.  This leads to out-of-bounds reads and writes triggered by
netlink packets.  This leads to memory corruption and the potential for
priviledge escalation.

We already attempt to validate the incoming replay information in
xfrm_new_ae() via xfrm_replay_verify_len().  This confirms that the user
is not trying to change the size of the replay state buffer which
includes the replay_esn.  It however does not check the replay_window
remains within that buffer.  Add validation of the contained
replay_window.

CVE-2017-7184
Signed-off-by: Andy Whitcroft &lt;apw@canonical.com&gt;
Acked-by: Steffen Klassert &lt;steffen.klassert@secunet.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 677e806da4d916052585301785d847c3b3e6186a upstream.

When a new xfrm state is created during an XFRM_MSG_NEWSA call we
validate the user supplied replay_esn to ensure that the size is valid
and to ensure that the replay_window size is within the allocated
buffer.  However later it is possible to update this replay_esn via a
XFRM_MSG_NEWAE call.  There we again validate the size of the supplied
buffer matches the existing state and if so inject the contents.  We do
not at this point check that the replay_window is within the allocated
memory.  This leads to out-of-bounds reads and writes triggered by
netlink packets.  This leads to memory corruption and the potential for
priviledge escalation.

We already attempt to validate the incoming replay information in
xfrm_new_ae() via xfrm_replay_verify_len().  This confirms that the user
is not trying to change the size of the replay state buffer which
includes the replay_esn.  It however does not check the replay_window
remains within that buffer.  Add validation of the contained
replay_window.

CVE-2017-7184
Signed-off-by: Andy Whitcroft &lt;apw@canonical.com&gt;
Acked-by: Steffen Klassert &lt;steffen.klassert@secunet.com&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>xfrm: policy: init locks early</title>
<updated>2017-03-31T08:31:45+00:00</updated>
<author>
<name>Florian Westphal</name>
<email>fw@strlen.de</email>
</author>
<published>2017-02-08T10:52:29+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f68a09c7944eca3bdfce2ef18b43dfd2359346b7'/>
<id>f68a09c7944eca3bdfce2ef18b43dfd2359346b7</id>
<content type='text'>
commit c282222a45cb9503cbfbebfdb60491f06ae84b49 upstream.

Dmitry reports following splat:
 INFO: trying to register non-static key.
 the code is fine but needs lockdep annotation.
 turning off the locking correctness validator.
 CPU: 0 PID: 13059 Comm: syz-executor1 Not tainted 4.10.0-rc7-next-20170207 #1
[..]
 spin_lock_bh include/linux/spinlock.h:304 [inline]
 xfrm_policy_flush+0x32/0x470 net/xfrm/xfrm_policy.c:963
 xfrm_policy_fini+0xbf/0x560 net/xfrm/xfrm_policy.c:3041
 xfrm_net_init+0x79f/0x9e0 net/xfrm/xfrm_policy.c:3091
 ops_init+0x10a/0x530 net/core/net_namespace.c:115
 setup_net+0x2ed/0x690 net/core/net_namespace.c:291
 copy_net_ns+0x26c/0x530 net/core/net_namespace.c:396
 create_new_namespaces+0x409/0x860 kernel/nsproxy.c:106
 unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:205
 SYSC_unshare kernel/fork.c:2281 [inline]

Problem is that when we get error during xfrm_net_init we will call
xfrm_policy_fini which will acquire xfrm_policy_lock before it was
initialized.  Just move it around so locks get set up first.

Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Fixes: 283bc9f35bbbcb0e9 ("xfrm: Namespacify xfrm state/policy locks")
Signed-off-by: Florian Westphal &lt;fw@strlen.de&gt;
Signed-off-by: Steffen Klassert &lt;steffen.klassert@secunet.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit c282222a45cb9503cbfbebfdb60491f06ae84b49 upstream.

Dmitry reports following splat:
 INFO: trying to register non-static key.
 the code is fine but needs lockdep annotation.
 turning off the locking correctness validator.
 CPU: 0 PID: 13059 Comm: syz-executor1 Not tainted 4.10.0-rc7-next-20170207 #1
[..]
 spin_lock_bh include/linux/spinlock.h:304 [inline]
 xfrm_policy_flush+0x32/0x470 net/xfrm/xfrm_policy.c:963
 xfrm_policy_fini+0xbf/0x560 net/xfrm/xfrm_policy.c:3041
 xfrm_net_init+0x79f/0x9e0 net/xfrm/xfrm_policy.c:3091
 ops_init+0x10a/0x530 net/core/net_namespace.c:115
 setup_net+0x2ed/0x690 net/core/net_namespace.c:291
 copy_net_ns+0x26c/0x530 net/core/net_namespace.c:396
 create_new_namespaces+0x409/0x860 kernel/nsproxy.c:106
 unshare_nsproxy_namespaces+0xae/0x1e0 kernel/nsproxy.c:205
 SYSC_unshare kernel/fork.c:2281 [inline]

Problem is that when we get error during xfrm_net_init we will call
xfrm_policy_fini which will acquire xfrm_policy_lock before it was
initialized.  Just move it around so locks get set up first.

Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Fixes: 283bc9f35bbbcb0e9 ("xfrm: Namespacify xfrm state/policy locks")
Signed-off-by: Florian Westphal &lt;fw@strlen.de&gt;
Signed-off-by: Steffen Klassert &lt;steffen.klassert@secunet.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>nl80211: fix dumpit error path RTNL deadlocks</title>
<updated>2017-03-30T07:41:28+00:00</updated>
<author>
<name>Johannes Berg</name>
<email>johannes.berg@intel.com</email>
</author>
<published>2017-03-15T13:26:04+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=56769e7a05268de991ef24e77383d44c56ad9141'/>
<id>56769e7a05268de991ef24e77383d44c56ad9141</id>
<content type='text'>
commit ea90e0dc8cecba6359b481e24d9c37160f6f524f upstream.

Sowmini pointed out Dmitry's RTNL deadlock report to me, and it turns out
to be perfectly accurate - there are various error paths that miss unlock
of the RTNL.

To fix those, change the locking a bit to not be conditional in all those
nl80211_prepare_*_dump() functions, but make those require the RTNL to
start with, and fix the buggy error paths. This also let me use sparse
(by appropriately overriding the rtnl_lock/rtnl_unlock functions) to
validate the changes.

Reported-by: Sowmini Varadhan &lt;sowmini.varadhan@oracle.com&gt;
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit ea90e0dc8cecba6359b481e24d9c37160f6f524f upstream.

Sowmini pointed out Dmitry's RTNL deadlock report to me, and it turns out
to be perfectly accurate - there are various error paths that miss unlock
of the RTNL.

To fix those, change the locking a bit to not be conditional in all those
nl80211_prepare_*_dump() functions, but make those require the RTNL to
start with, and fix the buggy error paths. This also let me use sparse
(by appropriately overriding the rtnl_lock/rtnl_unlock functions) to
validate the changes.

Reported-by: Sowmini Varadhan &lt;sowmini.varadhan@oracle.com&gt;
Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Signed-off-by: Johannes Berg &lt;johannes.berg@intel.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>libceph: don't set weight to IN when OSD is destroyed</title>
<updated>2017-03-30T07:41:27+00:00</updated>
<author>
<name>Ilya Dryomov</name>
<email>idryomov@gmail.com</email>
</author>
<published>2017-03-01T16:33:27+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=81ec3dc1de0af47a4f38a276f68f55f4e7e8e621'/>
<id>81ec3dc1de0af47a4f38a276f68f55f4e7e8e621</id>
<content type='text'>
commit b581a5854eee4b7851dedb0f8c2ceb54fb902c06 upstream.

Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs.  Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.

Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.

Fixes: 930c53286977 ("libceph: apply new_state before new_up_client on incrementals")
Link: http://tracker.ceph.com/issues/19122
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Sage Weil &lt;sage@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit b581a5854eee4b7851dedb0f8c2ceb54fb902c06 upstream.

Since ceph.git commit 4e28f9e63644 ("osd/OSDMap: clear osd_info,
osd_xinfo on osd deletion"), weight is set to IN when OSD is deleted.
This changes the result of applying an incremental for clients, not
just OSDs.  Because CRUSH computations are obviously affected,
pre-4e28f9e63644 servers disagree with post-4e28f9e63644 clients on
object placement, resulting in misdirected requests.

Mirrors ceph.git commit a6009d1039a55e2c77f431662b3d6cc5a8e8e63f.

Fixes: 930c53286977 ("libceph: apply new_state before new_up_client on incrementals")
Link: http://tracker.ceph.com/issues/19122
Signed-off-by: Ilya Dryomov &lt;idryomov@gmail.com&gt;
Reviewed-by: Sage Weil &lt;sage@redhat.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>cgroup, net_cls: iterate the fds of only the tasks which are being migrated</title>
<updated>2017-03-30T07:41:27+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2017-03-14T23:25:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=62f6341c858babe0390eea2fd497ceb6ea54bb07'/>
<id>62f6341c858babe0390eea2fd497ceb6ea54bb07</id>
<content type='text'>
commit a05d4fd9176003e0c1f9c3d083f4dac19fd346ab upstream.

The net_cls controller controls the classid field of each socket which
is associated with the cgroup.  Because the classid is per-socket
attribute, when a task migrates to another cgroup or the configured
classid of the cgroup changes, the controller needs to walk all
sockets and update the classid value, which was implemented by
3b13758f51de ("cgroups: Allow dynamically changing net_classid").

While the approach is not scalable, migrating tasks which have a lot
of fds attached to them is rare and the cost is born by the ones
initiating the operations.  However, for simplicity, both the
migration and classid config change paths call update_classid() which
scans all fds of all tasks in the target css.  This is an overkill for
the migration path which only needs to cover a much smaller subset of
tasks which are actually getting migrated in.

On cgroup v1, this can lead to unexpected scalability issues when one
tries to migrate a task or process into a net_cls cgroup which already
contains a lot of fds.  Even if the migration traget doesn't have many
to get scanned, update_classid() ends up scanning all fds in the
target cgroup which can be extremely numerous.

Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is
even worse.  Before bfc2cf6f61fc ("cgroup: call subsys-&gt;*attach() only
for subsystems which are actually affected by migration"), cgroup core
would call the -&gt;css_attach callback even for controllers which don't
see actual migration to a different css.

As net_cls is always disabled but still mounted on cgroup v2, whenever
a process is migrated on the cgroup v2 hierarchy, net_cls sees
identity migration from root to root and cgroup core used to call
-&gt;css_attach callback for those.  The net_cls -&gt;css_attach ends up
calling update_classid() on the root net_cls css to which all
processes on the system belong to as the controller isn't used.  This
makes any cgroup v2 migration O(total_number_of_fds_on_the_system)
which is horrible and easily leads to noticeable stalls triggering RCU
stall warnings and so on.

The worst symptom is already fixed in upstream by bfc2cf6f61fc
("cgroup: call subsys-&gt;*attach() only for subsystems which are
actually affected by migration"); however, backporting that commit is
too invasive and we want to avoid other cases too.

This patch updates net_cls's cgrp_attach() to iterate fds of only the
processes which are actually getting migrated.  This removes the
surprising migration cost which is dependent on the total number of
fds in the target cgroup.  As this leaves write_classid() the only
user of update_classid(), open-code the helper into write_classid().

Reported-by: David Goode &lt;dgoode@fb.com&gt;
Fixes: 3b13758f51de ("cgroups: Allow dynamically changing net_classid")
Cc: Nina Schiff &lt;ninasc@fb.com&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit a05d4fd9176003e0c1f9c3d083f4dac19fd346ab upstream.

The net_cls controller controls the classid field of each socket which
is associated with the cgroup.  Because the classid is per-socket
attribute, when a task migrates to another cgroup or the configured
classid of the cgroup changes, the controller needs to walk all
sockets and update the classid value, which was implemented by
3b13758f51de ("cgroups: Allow dynamically changing net_classid").

While the approach is not scalable, migrating tasks which have a lot
of fds attached to them is rare and the cost is born by the ones
initiating the operations.  However, for simplicity, both the
migration and classid config change paths call update_classid() which
scans all fds of all tasks in the target css.  This is an overkill for
the migration path which only needs to cover a much smaller subset of
tasks which are actually getting migrated in.

On cgroup v1, this can lead to unexpected scalability issues when one
tries to migrate a task or process into a net_cls cgroup which already
contains a lot of fds.  Even if the migration traget doesn't have many
to get scanned, update_classid() ends up scanning all fds in the
target cgroup which can be extremely numerous.

Unfortunately, on cgroup v2 which doesn't use net_cls, the problem is
even worse.  Before bfc2cf6f61fc ("cgroup: call subsys-&gt;*attach() only
for subsystems which are actually affected by migration"), cgroup core
would call the -&gt;css_attach callback even for controllers which don't
see actual migration to a different css.

As net_cls is always disabled but still mounted on cgroup v2, whenever
a process is migrated on the cgroup v2 hierarchy, net_cls sees
identity migration from root to root and cgroup core used to call
-&gt;css_attach callback for those.  The net_cls -&gt;css_attach ends up
calling update_classid() on the root net_cls css to which all
processes on the system belong to as the controller isn't used.  This
makes any cgroup v2 migration O(total_number_of_fds_on_the_system)
which is horrible and easily leads to noticeable stalls triggering RCU
stall warnings and so on.

The worst symptom is already fixed in upstream by bfc2cf6f61fc
("cgroup: call subsys-&gt;*attach() only for subsystems which are
actually affected by migration"); however, backporting that commit is
too invasive and we want to avoid other cases too.

This patch updates net_cls's cgrp_attach() to iterate fds of only the
processes which are actually getting migrated.  This removes the
surprising migration cost which is dependent on the total number of
fds in the target cgroup.  As this leaves write_classid() the only
user of update_classid(), open-code the helper into write_classid().

Reported-by: David Goode &lt;dgoode@fb.com&gt;
Fixes: 3b13758f51de ("cgroups: Allow dynamically changing net_classid")
Cc: Nina Schiff &lt;ninasc@fb.com&gt;
Cc: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
</feed>
