<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/lib/radix-tree.c, branch linux-4.15.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>mm, truncate: do not check mapping for every page being truncated</title>
<updated>2017-11-16T02:21:06+00:00</updated>
<author>
<name>Mel Gorman</name>
<email>mgorman@techsingularity.net</email>
</author>
<published>2017-11-16T01:37:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=c7df8ad2910e965a6241b6d8f52fd122e26b0315'/>
<id>c7df8ad2910e965a6241b6d8f52fd122e26b0315</id>
<content type='text'>
During truncation, the mapping has already been checked for shmem and
dax so it's known that workingset_update_node is required.

This patch avoids the checks on mapping for each page being truncated.
In all other cases, a lookup helper is used to determine if
workingset_update_node() needs to be called.  The one danger is that the
API is slightly harder to use as calling workingset_update_node directly
without checking for dax or shmem mappings could lead to surprises.
However, the API rarely needs to be used and hopefully the comment is
enough to give people the hint.

sparsetruncate (tiny)
                              4.14.0-rc4             4.14.0-rc4
                             oneirq-v1r1        pickhelper-v1r1
Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)

As you'd expect, the gain is marginal but it can be detected.  The
differences in bonnie are all within the noise which is not surprising
given the impact on the microbenchmark.

radix_tree_update_node_t is a callback for some radix operations that
optionally passes in a private field.  The only user of the callback is
workingset_update_node and as it no longer requires a mapping, the
private field is removed.

Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
During truncation, the mapping has already been checked for shmem and
dax so it's known that workingset_update_node is required.

This patch avoids the checks on mapping for each page being truncated.
In all other cases, a lookup helper is used to determine if
workingset_update_node() needs to be called.  The one danger is that the
API is slightly harder to use as calling workingset_update_node directly
without checking for dax or shmem mappings could lead to surprises.
However, the API rarely needs to be used and hopefully the comment is
enough to give people the hint.

sparsetruncate (tiny)
                              4.14.0-rc4             4.14.0-rc4
                             oneirq-v1r1        pickhelper-v1r1
Min          Time      141.00 (   0.00%)      140.00 (   0.71%)
1st-qrtle    Time      142.00 (   0.00%)      141.00 (   0.70%)
2nd-qrtle    Time      142.00 (   0.00%)      142.00 (   0.00%)
3rd-qrtle    Time      143.00 (   0.00%)      143.00 (   0.00%)
Max-90%      Time      144.00 (   0.00%)      144.00 (   0.00%)
Max-95%      Time      147.00 (   0.00%)      145.00 (   1.36%)
Max-99%      Time      195.00 (   0.00%)      191.00 (   2.05%)
Max          Time      230.00 (   0.00%)      205.00 (  10.87%)
Amean        Time      144.37 (   0.00%)      143.82 (   0.38%)
Stddev       Time       10.44 (   0.00%)        9.00 (  13.74%)
Coeff        Time        7.23 (   0.00%)        6.26 (  13.41%)
Best99%Amean Time      143.72 (   0.00%)      143.34 (   0.26%)
Best95%Amean Time      142.37 (   0.00%)      142.00 (   0.26%)
Best90%Amean Time      142.19 (   0.00%)      141.85 (   0.24%)
Best75%Amean Time      141.92 (   0.00%)      141.58 (   0.24%)
Best50%Amean Time      141.69 (   0.00%)      141.31 (   0.27%)
Best25%Amean Time      141.38 (   0.00%)      140.97 (   0.29%)

As you'd expect, the gain is marginal but it can be detected.  The
differences in bonnie are all within the noise which is not surprising
given the impact on the microbenchmark.

radix_tree_update_node_t is a callback for some radix operations that
optionally passes in a private field.  The only user of the callback is
workingset_update_node and as it no longer requires a mapping, the
private field is removed.

Link: http://lkml.kernel.org/r/20171018075952.10627-3-mgorman@techsingularity.net
Signed-off-by: Mel Gorman &lt;mgorman@techsingularity.net&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Reviewed-by: Jan Kara &lt;jack@suse.cz&gt;
Cc: Andi Kleen &lt;ak@linux.intel.com&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Dave Hansen &lt;dave.hansen@intel.com&gt;
Cc: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>radix-tree: must check __radix_tree_preload() return value</title>
<updated>2017-09-09T01:26:49+00:00</updated>
<author>
<name>Eric Dumazet</name>
<email>edumazet@google.com</email>
</author>
<published>2017-09-08T23:15:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=bc9ae2247ac92fd4d962507bafa3afffff6660ff'/>
<id>bc9ae2247ac92fd4d962507bafa3afffff6660ff</id>
<content type='text'>
__radix_tree_preload() only disables preemption if no error is returned.

So we really need to make sure callers always check the return value.

idr_preload() contract is to always disable preemption, so we need
to add a missing preempt_disable() if an error happened.

Similarly, ida_pre_get() only needs to call preempt_enable() in the
case no error happened.

Link: http://lkml.kernel.org/r/1504637190.15310.62.camel@edumazet-glaptop3.roam.corp.google.com
Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
Fixes: 7ad3d4d85c7a ("ida: Move ida_bitmap to a percpu variable")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;    [4.11+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
__radix_tree_preload() only disables preemption if no error is returned.

So we really need to make sure callers always check the return value.

idr_preload() contract is to always disable preemption, so we need
to add a missing preempt_disable() if an error happened.

Similarly, ida_pre_get() only needs to call preempt_enable() in the
case no error happened.

Link: http://lkml.kernel.org/r/1504637190.15310.62.camel@edumazet-glaptop3.roam.corp.google.com
Fixes: 0a835c4f090a ("Reimplement IDR and IDA using the radix tree")
Fixes: 7ad3d4d85c7a ("ida: Move ida_bitmap to a percpu variable")
Signed-off-by: Eric Dumazet &lt;edumazet@google.com&gt;
Cc: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
Cc: "Kirill A. Shutemov" &lt;kirill.shutemov@linux.intel.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;    [4.11+]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next</title>
<updated>2017-09-06T21:45:08+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2017-09-06T21:45:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=aae3dbb4776e7916b6cd442d00159bea27a695c1'/>
<id>aae3dbb4776e7916b6cd442d00159bea27a695c1</id>
<content type='text'>
Pull networking updates from David Miller:

 1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

 3) Allow generic XDP to work on virtual devices, from John Fastabend.

 4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

 5) Remove UFO offloads from the tree, gave us little other than bugs.

 6) Remove the IPSEC flow cache, from Florian Westphal.

 7) Support ipv6 route offload in mlxsw driver.

 8) Support VF representors in bnxt_en, from Sathya Perla.

 9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

13) Add new jump instructions to eBPF, from Daniel Borkmann.

14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

15) Support XDP in tap driver, from Jason Wang.

16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

17) Add Huawei hinic ethernet driver.

18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
  i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
  i40e: avoid NVM acquire deadlock during NVM update
  drivers: net: xgene: Remove return statement from void function
  drivers: net: xgene: Configure tx/rx delay for ACPI
  drivers: net: xgene: Read tx/rx delay for ACPI
  rocker: fix kcalloc parameter order
  rds: Fix non-atomic operation on shared flag variable
  net: sched: don't use GFP_KERNEL under spin lock
  vhost_net: correctly check tx avail during rx busy polling
  net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
  rxrpc: Make service connection lookup always check for retry
  net: stmmac: Delete dead code for MDIO registration
  gianfar: Fix Tx flow control deactivation
  cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
  cxgb4: Fix pause frame count in t4_get_port_stats
  cxgb4: fix memory leak
  tun: rename generic_xdp to skb_xdp
  tun: reserve extra headroom only when XDP is set
  net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
  net: dsa: bcm_sf2: Advertise number of egress queues
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull networking updates from David Miller:

 1) Support ipv6 checksum offload in sunvnet driver, from Shannon
    Nelson.

 2) Move to RB-tree instead of custom AVL code in inetpeer, from Eric
    Dumazet.

 3) Allow generic XDP to work on virtual devices, from John Fastabend.

 4) Add bpf device maps and XDP_REDIRECT, which can be used to build
    arbitrary switching frameworks using XDP. From John Fastabend.

 5) Remove UFO offloads from the tree, gave us little other than bugs.

 6) Remove the IPSEC flow cache, from Florian Westphal.

 7) Support ipv6 route offload in mlxsw driver.

 8) Support VF representors in bnxt_en, from Sathya Perla.

 9) Add support for forward error correction modes to ethtool, from
    Vidya Sagar Ravipati.

10) Add time filter for packet scheduler action dumping, from Jamal Hadi
    Salim.

11) Extend the zerocopy sendmsg() used by virtio and tap to regular
    sockets via MSG_ZEROCOPY. From Willem de Bruijn.

12) Significantly rework value tracking in the BPF verifier, from Edward
    Cree.

13) Add new jump instructions to eBPF, from Daniel Borkmann.

14) Rework rtnetlink plumbing so that operations can be run without
    taking the RTNL semaphore. From Florian Westphal.

15) Support XDP in tap driver, from Jason Wang.

16) Add 32-bit eBPF JIT for ARM, from Shubham Bansal.

17) Add Huawei hinic ethernet driver.

18) Allow to report MD5 keys in TCP inet_diag dumps, from Ivan
    Delalande.

* git://git.kernel.org/pub/scm/linux/kernel/git/davem/net-next: (1780 commits)
  i40e: point wb_desc at the nvm_wb_desc during i40e_read_nvm_aq
  i40e: avoid NVM acquire deadlock during NVM update
  drivers: net: xgene: Remove return statement from void function
  drivers: net: xgene: Configure tx/rx delay for ACPI
  drivers: net: xgene: Read tx/rx delay for ACPI
  rocker: fix kcalloc parameter order
  rds: Fix non-atomic operation on shared flag variable
  net: sched: don't use GFP_KERNEL under spin lock
  vhost_net: correctly check tx avail during rx busy polling
  net: mdio-mux: add mdio_mux parameter to mdio_mux_init()
  rxrpc: Make service connection lookup always check for retry
  net: stmmac: Delete dead code for MDIO registration
  gianfar: Fix Tx flow control deactivation
  cxgb4: Ignore MPS_TX_INT_CAUSE[Bubble] for T6
  cxgb4: Fix pause frame count in t4_get_port_stats
  cxgb4: fix memory leak
  tun: rename generic_xdp to skb_xdp
  tun: reserve extra headroom only when XDP is set
  net: dsa: bcm_sf2: Configure IMP port TC2QOS mapping
  net: dsa: bcm_sf2: Advertise number of egress queues
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>idr: Add new APIs to support unsigned long</title>
<updated>2017-08-30T21:36:44+00:00</updated>
<author>
<name>Chris Mi</name>
<email>chrism@mellanox.com</email>
</author>
<published>2017-08-30T06:31:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=388f79fda74fd3d8700ed5d899573ec58c2e0253'/>
<id>388f79fda74fd3d8700ed5d899573ec58c2e0253</id>
<content type='text'>
The following new APIs are added:

int idr_alloc_ext(struct idr *idr, void *ptr, unsigned long *index,
                  unsigned long start, unsigned long end, gfp_t gfp);
void *idr_remove_ext(struct idr *idr, unsigned long id);
void *idr_find_ext(const struct idr *idr, unsigned long id);
void *idr_replace_ext(struct idr *idr, void *ptr, unsigned long id);
void *idr_get_next_ext(struct idr *idr, unsigned long *nextid);

Signed-off-by: Chris Mi &lt;chrism@mellanox.com&gt;
Signed-off-by: Jiri Pirko &lt;jiri@mellanox.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The following new APIs are added:

int idr_alloc_ext(struct idr *idr, void *ptr, unsigned long *index,
                  unsigned long start, unsigned long end, gfp_t gfp);
void *idr_remove_ext(struct idr *idr, unsigned long id);
void *idr_find_ext(const struct idr *idr, unsigned long id);
void *idr_replace_ext(struct idr *idr, void *ptr, unsigned long id);
void *idr_get_next_ext(struct idr *idr, unsigned long *nextid);

Signed-off-by: Chris Mi &lt;chrism@mellanox.com&gt;
Signed-off-by: Jiri Pirko &lt;jiri@mellanox.com&gt;
Signed-off-by: David S. Miller &lt;davem@davemloft.net&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>drm/i915: Replace execbuf vma ht with an idr</title>
<updated>2017-08-18T10:59:02+00:00</updated>
<author>
<name>Chris Wilson</name>
<email>chris@chris-wilson.co.uk</email>
</author>
<published>2017-08-16T08:52:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d1b48c1e7184d9bc4ae6d7f9fe2eed9efed11ffc'/>
<id>d1b48c1e7184d9bc4ae6d7f9fe2eed9efed11ffc</id>
<content type='text'>
This was the competing idea long ago, but it was only with the rewrite
of the idr as an radixtree and using the radixtree directly ourselves,
along with the realisation that we can store the vma directly in the
radixtree and only need a list for the reverse mapping, that made the
patch performant enough to displace using a hashtable. Though the vma ht
is fast and doesn't require any extra allocation (as we can embed the node
inside the vma), it does require a thread for resizing and serialization
and will have the occasional slow lookup. That is hairy enough to
investigate alternatives and favour them if equivalent in peak performance.
One advantage of allocating an indirection entry is that we can support a
single shared bo between many clients, something that was done on a
first-come first-serve basis for shared GGTT vma previously. To offset
the extra allocations, we create yet another kmem_cache for them.

Signed-off-by: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
Reviewed-by: Tvrtko Ursulin &lt;tvrtko.ursulin@intel.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20170816085210.4199-5-chris@chris-wilson.co.uk
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
This was the competing idea long ago, but it was only with the rewrite
of the idr as an radixtree and using the radixtree directly ourselves,
along with the realisation that we can store the vma directly in the
radixtree and only need a list for the reverse mapping, that made the
patch performant enough to displace using a hashtable. Though the vma ht
is fast and doesn't require any extra allocation (as we can embed the node
inside the vma), it does require a thread for resizing and serialization
and will have the occasional slow lookup. That is hairy enough to
investigate alternatives and favour them if equivalent in peak performance.
One advantage of allocating an indirection entry is that we can support a
single shared bo between many clients, something that was done on a
first-come first-serve basis for shared GGTT vma previously. To offset
the extra allocations, we create yet another kmem_cache for them.

Signed-off-by: Chris Wilson &lt;chris@chris-wilson.co.uk&gt;
Reviewed-by: Tvrtko Ursulin &lt;tvrtko.ursulin@intel.com&gt;
Link: https://patchwork.freedesktop.org/patch/msgid/20170816085210.4199-5-chris@chris-wilson.co.uk
</pre>
</div>
</content>
</entry>
<entry>
<title>lockdep: allow to disable reclaim lockup detection</title>
<updated>2017-05-03T22:52:09+00:00</updated>
<author>
<name>Michal Hocko</name>
<email>mhocko@suse.com</email>
</author>
<published>2017-05-03T21:53:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7e7844226f1053236b6f6d5d122a06509fb14fd9'/>
<id>7e7844226f1053236b6f6d5d122a06509fb14fd9</id>
<content type='text'>
The current implementation of the reclaim lockup detection can lead to
false positives and those even happen and usually lead to tweak the code
to silence the lockdep by using GFP_NOFS even though the context can use
__GFP_FS just fine.

See

  http://lkml.kernel.org/r/20160512080321.GA18496@dastard

as an example.

  =================================
  [ INFO: inconsistent lock state ]
  4.5.0-rc2+ #4 Tainted: G           O
  ---------------------------------
  inconsistent {RECLAIM_FS-ON-R} -&gt; {IN-RECLAIM_FS-W} usage.
  kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

  (&amp;xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]

  {RECLAIM_FS-ON-R} state was registered at:
    mark_held_locks+0x79/0xa0
    lockdep_trace_alloc+0xb3/0x100
    kmem_cache_alloc+0x33/0x230
    kmem_zone_alloc+0x81/0x120 [xfs]
    xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
    __xfs_refcount_find_shared+0x75/0x580 [xfs]
    xfs_refcount_find_shared+0x84/0xb0 [xfs]
    xfs_getbmap+0x608/0x8c0 [xfs]
    xfs_vn_fiemap+0xab/0xc0 [xfs]
    do_vfs_ioctl+0x498/0x670
    SyS_ioctl+0x79/0x90
    entry_SYSCALL_64_fastpath+0x12/0x6f

         CPU0
         ----
    lock(&amp;xfs_nondir_ilock_class);
    &lt;Interrupt&gt;
      lock(&amp;xfs_nondir_ilock_class);

   *** DEADLOCK ***

  3 locks held by kswapd0/543:

  stack backtrace:
  CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
  Call Trace:
   lock_acquire+0xd8/0x1e0
   down_write_nested+0x5e/0xc0
   xfs_ilock+0x177/0x200 [xfs]
   xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
   xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
   evict+0xc5/0x190
   dispose_list+0x39/0x60
   prune_icache_sb+0x4b/0x60
   super_cache_scan+0x14f/0x1a0
   shrink_slab.part.63.constprop.79+0x1e9/0x4e0
   shrink_zone+0x15e/0x170
   kswapd+0x4f1/0xa80
   kthread+0xf2/0x110
   ret_from_fork+0x3f/0x70

To quote Dave:
 "Ignoring whether reflink should be doing anything or not, that's a
  "xfs_refcountbt_init_cursor() gets called both outside and inside
  transactions" lockdep false positive case. The problem here is lockdep
  has seen this allocation from within a transaction, hence a GFP_NOFS
  allocation, and now it's seeing it in a GFP_KERNEL context. Also note
  that we have an active reference to this inode.

  So, because the reclaim annotations overload the interrupt level
  detections and it's seen the inode ilock been taken in reclaim
  ("interrupt") context, this triggers a reclaim context warning where
  it thinks it is unsafe to do this allocation in GFP_KERNEL context
  holding the inode ilock..."

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely.  Until then it is
much better to provide a way to add "I know what I am doing flag" and
mark problematic places.  This would prevent from abusing GFP_NOFS flag
which has a runtime effect even on configurations which have lockdep
disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it also make sure that the radix tree doesn't
accidentaly override tags stored in the upper part of the gfp_mask.

Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Suggested-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Theodore Ts'o &lt;tytso@mit.edu&gt;
Cc: Chris Mason &lt;clm@fb.com&gt;
Cc: David Sterba &lt;dsterba@suse.cz&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Brian Foster &lt;bfoster@redhat.com&gt;
Cc: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Cc: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The current implementation of the reclaim lockup detection can lead to
false positives and those even happen and usually lead to tweak the code
to silence the lockdep by using GFP_NOFS even though the context can use
__GFP_FS just fine.

See

  http://lkml.kernel.org/r/20160512080321.GA18496@dastard

as an example.

  =================================
  [ INFO: inconsistent lock state ]
  4.5.0-rc2+ #4 Tainted: G           O
  ---------------------------------
  inconsistent {RECLAIM_FS-ON-R} -&gt; {IN-RECLAIM_FS-W} usage.
  kswapd0/543 [HC0[0]:SC0[0]:HE1:SE1] takes:

  (&amp;xfs_nondir_ilock_class){++++-+}, at: xfs_ilock+0x177/0x200 [xfs]

  {RECLAIM_FS-ON-R} state was registered at:
    mark_held_locks+0x79/0xa0
    lockdep_trace_alloc+0xb3/0x100
    kmem_cache_alloc+0x33/0x230
    kmem_zone_alloc+0x81/0x120 [xfs]
    xfs_refcountbt_init_cursor+0x3e/0xa0 [xfs]
    __xfs_refcount_find_shared+0x75/0x580 [xfs]
    xfs_refcount_find_shared+0x84/0xb0 [xfs]
    xfs_getbmap+0x608/0x8c0 [xfs]
    xfs_vn_fiemap+0xab/0xc0 [xfs]
    do_vfs_ioctl+0x498/0x670
    SyS_ioctl+0x79/0x90
    entry_SYSCALL_64_fastpath+0x12/0x6f

         CPU0
         ----
    lock(&amp;xfs_nondir_ilock_class);
    &lt;Interrupt&gt;
      lock(&amp;xfs_nondir_ilock_class);

   *** DEADLOCK ***

  3 locks held by kswapd0/543:

  stack backtrace:
  CPU: 0 PID: 543 Comm: kswapd0 Tainted: G           O    4.5.0-rc2+ #4
  Call Trace:
   lock_acquire+0xd8/0x1e0
   down_write_nested+0x5e/0xc0
   xfs_ilock+0x177/0x200 [xfs]
   xfs_reflink_cancel_cow_range+0x150/0x300 [xfs]
   xfs_fs_evict_inode+0xdc/0x1e0 [xfs]
   evict+0xc5/0x190
   dispose_list+0x39/0x60
   prune_icache_sb+0x4b/0x60
   super_cache_scan+0x14f/0x1a0
   shrink_slab.part.63.constprop.79+0x1e9/0x4e0
   shrink_zone+0x15e/0x170
   kswapd+0x4f1/0xa80
   kthread+0xf2/0x110
   ret_from_fork+0x3f/0x70

To quote Dave:
 "Ignoring whether reflink should be doing anything or not, that's a
  "xfs_refcountbt_init_cursor() gets called both outside and inside
  transactions" lockdep false positive case. The problem here is lockdep
  has seen this allocation from within a transaction, hence a GFP_NOFS
  allocation, and now it's seeing it in a GFP_KERNEL context. Also note
  that we have an active reference to this inode.

  So, because the reclaim annotations overload the interrupt level
  detections and it's seen the inode ilock been taken in reclaim
  ("interrupt") context, this triggers a reclaim context warning where
  it thinks it is unsafe to do this allocation in GFP_KERNEL context
  holding the inode ilock..."

This sounds like a fundamental problem of the reclaim lock detection.
It is really impossible to annotate such a special usecase IMHO unless
the reclaim lockup detection is reworked completely.  Until then it is
much better to provide a way to add "I know what I am doing flag" and
mark problematic places.  This would prevent from abusing GFP_NOFS flag
which has a runtime effect even on configurations which have lockdep
disabled.

Introduce __GFP_NOLOCKDEP flag which tells the lockdep gfp tracking to
skip the current allocation request.

While we are at it also make sure that the radix tree doesn't
accidentaly override tags stored in the upper part of the gfp_mask.

Link: http://lkml.kernel.org/r/20170306131408.9828-3-mhocko@kernel.org
Signed-off-by: Michal Hocko &lt;mhocko@suse.com&gt;
Suggested-by: Peter Zijlstra &lt;peterz@infradead.org&gt;
Acked-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Cc: Dave Chinner &lt;david@fromorbit.com&gt;
Cc: Theodore Ts'o &lt;tytso@mit.edu&gt;
Cc: Chris Mason &lt;clm@fb.com&gt;
Cc: David Sterba &lt;dsterba@suse.cz&gt;
Cc: Jan Kara &lt;jack@suse.cz&gt;
Cc: Brian Foster &lt;bfoster@redhat.com&gt;
Cc: Darrick J. Wong &lt;darrick.wong@oracle.com&gt;
Cc: Nikolay Borisov &lt;nborisov@suse.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ida: Free correct IDA bitmap</title>
<updated>2017-03-07T18:18:23+00:00</updated>
<author>
<name>Matthew Wilcox</name>
<email>mawilcox@microsoft.com</email>
</author>
<published>2017-03-03T17:16:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=4ecd9542dbc3e07f3bd3870aac12839f72b47db4'/>
<id>4ecd9542dbc3e07f3bd3870aac12839f72b47db4</id>
<content type='text'>
There's a relatively rare race where we look at the per-cpu preallocated
IDA bitmap, see it's NULL, allocate a new one, and atomically update it.
If the kmalloc() happened to sleep and we were rescheduled to a different
CPU, or an interrupt came in at the exact right time, another task
might have successfully allocated a bitmap and already deposited it.
I forgot what the semantics of cmpxchg() were and ended up freeing the
wrong bitmap leading to KASAN reporting a use-after-free.

Dmitry found the bug with syzkaller &amp; wrote the patch.  I wrote the test
case that will reproduce the bug without his patch being applied.

Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Signed-off-by: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There's a relatively rare race where we look at the per-cpu preallocated
IDA bitmap, see it's NULL, allocate a new one, and atomically update it.
If the kmalloc() happened to sleep and we were rescheduled to a different
CPU, or an interrupt came in at the exact right time, another task
might have successfully allocated a bitmap and already deposited it.
I forgot what the semantics of cmpxchg() were and ended up freeing the
wrong bitmap leading to KASAN reporting a use-after-free.

Dmitry found the bug with syzkaller &amp; wrote the patch.  I wrote the test
case that will reproduce the bug without his patch being applied.

Reported-by: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Signed-off-by: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax</title>
<updated>2017-03-01T04:29:41+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2017-03-01T04:29:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=cf393195c3ba5d4c0a8e237eb00f7ef104876ee5'/>
<id>cf393195c3ba5d4c0a8e237eb00f7ef104876ee5</id>
<content type='text'>
Pull IDR rewrite from Matthew Wilcox:
 "The most significant part of the following is the patch to rewrite the
  IDR &amp; IDA to be clients of the radix tree. But there's much more,
  including an enhancement of the IDA to be significantly more space
  efficient, an IDR &amp; IDA test suite, some improvements to the IDR API
  (and driver changes to take advantage of those improvements), several
  improvements to the radix tree test suite and RCU annotations.

  The IDR &amp; IDA rewrite had a good spin in linux-next and Andrew's tree
  for most of the last cycle. Coupled with the IDR test suite, I feel
  pretty confident that any remaining bugs are quite hard to hit. 0-day
  did a great job of watching my git tree and pointing out problems; as
  it hit them, I added new test-cases to be sure not to be caught the
  same way twice"

Willy goes on to expand a bit on the IDR rewrite rationale:
 "The radix tree and the IDR use very similar data structures.

  Merging the two codebases lets us share the memory allocation pools,
  and results in a net deletion of 500 lines of code. It also opens up
  the possibility of exposing more of the features of the radix tree to
  users of the IDR (and I have some interesting patches along those
  lines waiting for 4.12)

  It also shrinks the size of the 'struct idr' from 40 bytes to 24 which
  will shrink a fair few data structures that embed an IDR"

* 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits)
  radix tree test suite: Add config option for map shift
  idr: Add missing __rcu annotations
  radix-tree: Fix __rcu annotations
  radix-tree: Add rcu_dereference and rcu_assign_pointer calls
  radix tree test suite: Run iteration tests for longer
  radix tree test suite: Fix split/join memory leaks
  radix tree test suite: Fix leaks in regression2.c
  radix tree test suite: Fix leaky tests
  radix tree test suite: Enable address sanitizer
  radix_tree_iter_resume: Fix out of bounds error
  radix-tree: Store a pointer to the root in each node
  radix-tree: Chain preallocated nodes through -&gt;parent
  radix tree test suite: Dial down verbosity with -v
  radix tree test suite: Introduce kmalloc_verbose
  idr: Return the deleted entry from idr_remove
  radix tree test suite: Build separate binaries for some tests
  ida: Use exceptional entries for small IDAs
  ida: Move ida_bitmap to a percpu variable
  Reimplement IDR and IDA using the radix tree
  radix-tree: Add radix_tree_iter_delete
  ...
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull IDR rewrite from Matthew Wilcox:
 "The most significant part of the following is the patch to rewrite the
  IDR &amp; IDA to be clients of the radix tree. But there's much more,
  including an enhancement of the IDA to be significantly more space
  efficient, an IDR &amp; IDA test suite, some improvements to the IDR API
  (and driver changes to take advantage of those improvements), several
  improvements to the radix tree test suite and RCU annotations.

  The IDR &amp; IDA rewrite had a good spin in linux-next and Andrew's tree
  for most of the last cycle. Coupled with the IDR test suite, I feel
  pretty confident that any remaining bugs are quite hard to hit. 0-day
  did a great job of watching my git tree and pointing out problems; as
  it hit them, I added new test-cases to be sure not to be caught the
  same way twice"

Willy goes on to expand a bit on the IDR rewrite rationale:
 "The radix tree and the IDR use very similar data structures.

  Merging the two codebases lets us share the memory allocation pools,
  and results in a net deletion of 500 lines of code. It also opens up
  the possibility of exposing more of the features of the radix tree to
  users of the IDR (and I have some interesting patches along those
  lines waiting for 4.12)

  It also shrinks the size of the 'struct idr' from 40 bytes to 24 which
  will shrink a fair few data structures that embed an IDR"

* 'idr-4.11' of git://git.infradead.org/users/willy/linux-dax: (32 commits)
  radix tree test suite: Add config option for map shift
  idr: Add missing __rcu annotations
  radix-tree: Fix __rcu annotations
  radix-tree: Add rcu_dereference and rcu_assign_pointer calls
  radix tree test suite: Run iteration tests for longer
  radix tree test suite: Fix split/join memory leaks
  radix tree test suite: Fix leaks in regression2.c
  radix tree test suite: Fix leaky tests
  radix tree test suite: Enable address sanitizer
  radix_tree_iter_resume: Fix out of bounds error
  radix-tree: Store a pointer to the root in each node
  radix-tree: Chain preallocated nodes through -&gt;parent
  radix tree test suite: Dial down verbosity with -v
  radix tree test suite: Introduce kmalloc_verbose
  idr: Return the deleted entry from idr_remove
  radix tree test suite: Build separate binaries for some tests
  ida: Use exceptional entries for small IDAs
  ida: Move ida_bitmap to a percpu variable
  Reimplement IDR and IDA using the radix tree
  radix-tree: Add radix_tree_iter_delete
  ...
</pre>
</div>
</content>
</entry>
<entry>
<title>radix-tree: Fix __rcu annotations</title>
<updated>2017-02-14T02:44:09+00:00</updated>
<author>
<name>Matthew Wilcox</name>
<email>mawilcox@microsoft.com</email>
</author>
<published>2017-02-13T20:58:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d7b627277b57370223d682cede979a279284b12a'/>
<id>d7b627277b57370223d682cede979a279284b12a</id>
<content type='text'>
Many places were missing __rcu annotations.  A few places needed a few
lines of explanation about why it was safe to not use RCU accessors.
Add a custom CFLAGS setting to the Makefile to ensure that new patches
don't miss RCU annotations.

Signed-off-by: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Many places were missing __rcu annotations.  A few places needed a few
lines of explanation about why it was safe to not use RCU accessors.
Add a custom CFLAGS setting to the Makefile to ensure that new patches
don't miss RCU annotations.

Signed-off-by: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>radix-tree: Add rcu_dereference and rcu_assign_pointer calls</title>
<updated>2017-02-14T02:44:09+00:00</updated>
<author>
<name>Matthew Wilcox</name>
<email>mawilcox@microsoft.com</email>
</author>
<published>2017-02-13T20:22:48+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=12320d0ff1c9d5582f5c35e4bb8b9c70c475fd71'/>
<id>12320d0ff1c9d5582f5c35e4bb8b9c70c475fd71</id>
<content type='text'>
Some of these have been missing for many years.  Others were recently
introduced by me.  Fortunately, we have tools that help us find such
things.

Signed-off-by: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Some of these have been missing for many years.  Others were recently
introduced by me.  Fortunately, we have tools that help us find such
things.

Signed-off-by: Matthew Wilcox &lt;mawilcox@microsoft.com&gt;
</pre>
</div>
</content>
</entry>
</feed>
