<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/mm/vmalloc.c, branch v6.11</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>mm: vmalloc: optimize vmap_lazy_nr arithmetic when purging each vmap_area</title>
<updated>2024-09-02T00:59:02+00:00</updated>
<author>
<name>Adrian Huang</name>
<email>ahuang12@lenovo.com</email>
</author>
<published>2024-08-29T13:06:33+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=409faf8c97d5abb0597ea43e99c8b3dd8dbe99e3'/>
<id>409faf8c97d5abb0597ea43e99c8b3dd8dbe99e3</id>
<content type='text'>
When running the vmalloc stress on a 448-core system, observe the average
latency of purge_vmap_node() is about 2 seconds by using the eBPF/bcc
'funclatency.py' tool [1].

  # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node &amp; pid1=$! &amp;&amp; sleep 8 &amp;&amp; modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1

     usecs             : count    distribution
        0 -&gt; 1         : 0       |                                        |
        2 -&gt; 3         : 29      |                                        |
        4 -&gt; 7         : 19      |                                        |
        8 -&gt; 15        : 56      |                                        |
       16 -&gt; 31        : 483     |****                                    |
       32 -&gt; 63        : 1548    |************                            |
       64 -&gt; 127       : 2634    |*********************                   |
      128 -&gt; 255       : 2535    |*********************                   |
      256 -&gt; 511       : 1776    |**************                          |
      512 -&gt; 1023      : 1015    |********                                |
     1024 -&gt; 2047      : 573     |****                                    |
     2048 -&gt; 4095      : 488     |****                                    |
     4096 -&gt; 8191      : 1091    |*********                               |
     8192 -&gt; 16383     : 3078    |*************************               |
    16384 -&gt; 32767     : 4821    |****************************************|
    32768 -&gt; 65535     : 3318    |***************************             |
    65536 -&gt; 131071    : 1718    |**************                          |
   131072 -&gt; 262143    : 2220    |******************                      |
   262144 -&gt; 524287    : 1147    |*********                               |
   524288 -&gt; 1048575   : 1179    |*********                               |
  1048576 -&gt; 2097151   : 822     |******                                  |
  2097152 -&gt; 4194303   : 906     |*******                                 |
  4194304 -&gt; 8388607   : 2148    |*****************                       |
  8388608 -&gt; 16777215  : 4497    |*************************************   |
 16777216 -&gt; 33554431  : 289     |**                                      |

  avg = 2041714 usecs, total: 78381401772 usecs, count: 38390

  The worst case is over 16-33 seconds, so soft lockup is triggered [2].

[Root Cause]
1) Each purge_list has the long list. The following shows the number of
   vmap_area is purged.

   crash&gt; p vmap_nodes
   vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
   crash&gt; vmap_node 0xff2de5a900100000 128 | grep nr_purged
     nr_purged = 663070
     ...
     nr_purged = 821670
     nr_purged = 692214
     nr_purged = 726808
     ...

2) atomic_long_sub() employs the 'lock' prefix to ensure the atomic
   operation when purging each vmap_area. However, the iteration is over
   600000 vmap_area (See 'nr_purged' above).

   Here is objdump output:

     $ objdump -D vmlinux
     ffffffff813e8c80 &lt;purge_vmap_node&gt;:
     ...
     ffffffff813e8d70:  f0 48 29 2d 68 0c bb  lock sub %rbp,0x2bb0c68(%rip)
     ...

   Quote from "Instruction tables" pdf file [3]:
     Instructions with a LOCK prefix have a long latency that depends on
     cache organization and possibly RAM speed. If there are multiple
     processors or cores or direct memory access (DMA) devices, then all
     locked instructions will lock a cache line for exclusive access,
     which may involve RAM access. A LOCK prefix typically costs more
     than a hundred clock cycles, even on single-processor systems.

   That's why the latency of purge_vmap_node() dramatically increases
   on a many-core system: One core is busy on purging each vmap_area of
   the *long* purge_list and executing atomic_long_sub() for each
   vmap_area, while other cores free vmalloc allocations and execute
   atomic_long_add_return() in free_vmap_area_noflush().

[Solution]
Employ a local variable to record the total purged pages, and execute
atomic_long_sub() after the traversal of the purge_list is done. The
experiment result shows the latency improvement is 99%.

[Experiment Result]
1) System Configuration: Three servers (with HT-enabled) are tested.
     * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
     * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
     * 448-core server: AMD Zen 4 Processor*2

2) Kernel Config
     * CONFIG_KASAN is disabled

3) The data in column "w/o patch" and "w/ patch"
     * Unit: micro seconds (us)
     * Each data is the average of 3-time measurements

         System        w/o patch (us)   w/ patch (us)    Improvement (%)
     ---------------   --------------   -------------    -------------
     72-core server          2194              14            99.36%
     192-core server       143799            1139            99.21%
     448-core server      1992122            6883            99.65%

[1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
[2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
[3] https://www.agner.org/optimize/instruction_tables.pdf

Link: https://lkml.kernel.org/r/20240829130633.2184-1-ahuang12@lenovo.com
Signed-off-by: Adrian Huang &lt;ahuang12@lenovo.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When running the vmalloc stress on a 448-core system, observe the average
latency of purge_vmap_node() is about 2 seconds by using the eBPF/bcc
'funclatency.py' tool [1].

  # /your-git-repo/bcc/tools/funclatency.py -u purge_vmap_node &amp; pid1=$! &amp;&amp; sleep 8 &amp;&amp; modprobe test_vmalloc nr_threads=$(nproc) run_test_mask=0x7; kill -SIGINT $pid1

     usecs             : count    distribution
        0 -&gt; 1         : 0       |                                        |
        2 -&gt; 3         : 29      |                                        |
        4 -&gt; 7         : 19      |                                        |
        8 -&gt; 15        : 56      |                                        |
       16 -&gt; 31        : 483     |****                                    |
       32 -&gt; 63        : 1548    |************                            |
       64 -&gt; 127       : 2634    |*********************                   |
      128 -&gt; 255       : 2535    |*********************                   |
      256 -&gt; 511       : 1776    |**************                          |
      512 -&gt; 1023      : 1015    |********                                |
     1024 -&gt; 2047      : 573     |****                                    |
     2048 -&gt; 4095      : 488     |****                                    |
     4096 -&gt; 8191      : 1091    |*********                               |
     8192 -&gt; 16383     : 3078    |*************************               |
    16384 -&gt; 32767     : 4821    |****************************************|
    32768 -&gt; 65535     : 3318    |***************************             |
    65536 -&gt; 131071    : 1718    |**************                          |
   131072 -&gt; 262143    : 2220    |******************                      |
   262144 -&gt; 524287    : 1147    |*********                               |
   524288 -&gt; 1048575   : 1179    |*********                               |
  1048576 -&gt; 2097151   : 822     |******                                  |
  2097152 -&gt; 4194303   : 906     |*******                                 |
  4194304 -&gt; 8388607   : 2148    |*****************                       |
  8388608 -&gt; 16777215  : 4497    |*************************************   |
 16777216 -&gt; 33554431  : 289     |**                                      |

  avg = 2041714 usecs, total: 78381401772 usecs, count: 38390

  The worst case is over 16-33 seconds, so soft lockup is triggered [2].

[Root Cause]
1) Each purge_list has the long list. The following shows the number of
   vmap_area is purged.

   crash&gt; p vmap_nodes
   vmap_nodes = $27 = (struct vmap_node *) 0xff2de5a900100000
   crash&gt; vmap_node 0xff2de5a900100000 128 | grep nr_purged
     nr_purged = 663070
     ...
     nr_purged = 821670
     nr_purged = 692214
     nr_purged = 726808
     ...

2) atomic_long_sub() employs the 'lock' prefix to ensure the atomic
   operation when purging each vmap_area. However, the iteration is over
   600000 vmap_area (See 'nr_purged' above).

   Here is objdump output:

     $ objdump -D vmlinux
     ffffffff813e8c80 &lt;purge_vmap_node&gt;:
     ...
     ffffffff813e8d70:  f0 48 29 2d 68 0c bb  lock sub %rbp,0x2bb0c68(%rip)
     ...

   Quote from "Instruction tables" pdf file [3]:
     Instructions with a LOCK prefix have a long latency that depends on
     cache organization and possibly RAM speed. If there are multiple
     processors or cores or direct memory access (DMA) devices, then all
     locked instructions will lock a cache line for exclusive access,
     which may involve RAM access. A LOCK prefix typically costs more
     than a hundred clock cycles, even on single-processor systems.

   That's why the latency of purge_vmap_node() dramatically increases
   on a many-core system: One core is busy on purging each vmap_area of
   the *long* purge_list and executing atomic_long_sub() for each
   vmap_area, while other cores free vmalloc allocations and execute
   atomic_long_add_return() in free_vmap_area_noflush().

[Solution]
Employ a local variable to record the total purged pages, and execute
atomic_long_sub() after the traversal of the purge_list is done. The
experiment result shows the latency improvement is 99%.

[Experiment Result]
1) System Configuration: Three servers (with HT-enabled) are tested.
     * 72-core server: 3rd Gen Intel Xeon Scalable Processor*1
     * 192-core server: 5th Gen Intel Xeon Scalable Processor*2
     * 448-core server: AMD Zen 4 Processor*2

2) Kernel Config
     * CONFIG_KASAN is disabled

3) The data in column "w/o patch" and "w/ patch"
     * Unit: micro seconds (us)
     * Each data is the average of 3-time measurements

         System        w/o patch (us)   w/ patch (us)    Improvement (%)
     ---------------   --------------   -------------    -------------
     72-core server          2194              14            99.36%
     192-core server       143799            1139            99.21%
     448-core server      1992122            6883            99.65%

[1] https://github.com/iovisor/bcc/blob/master/tools/funclatency.py
[2] https://gist.github.com/AdrianHuang/37c15f67b45407b83c2d32f918656c12
[3] https://www.agner.org/optimize/instruction_tables.pdf

Link: https://lkml.kernel.org/r/20240829130633.2184-1-ahuang12@lenovo.com
Signed-off-by: Adrian Huang &lt;ahuang12@lenovo.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: vmalloc: ensure vmap_block is initialised before adding to queue</title>
<updated>2024-09-02T00:58:59+00:00</updated>
<author>
<name>Will Deacon</name>
<email>will@kernel.org</email>
</author>
<published>2024-08-12T17:16:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=3e3de7947c751509027d26b679ecd243bc9db255'/>
<id>3e3de7947c751509027d26b679ecd243bc9db255</id>
<content type='text'>
Commit 8c61291fd850 ("mm: fix incorrect vbq reference in
purge_fragmented_block") extended the 'vmap_block' structure to contain a
'cpu' field which is set at allocation time to the id of the initialising
CPU.

When a new 'vmap_block' is being instantiated by new_vmap_block(), the
partially initialised structure is added to the local 'vmap_block_queue'
xarray before the 'cpu' field has been initialised.  If another CPU is
concurrently walking the xarray (e.g.  via vm_unmap_aliases()), then it
may perform an out-of-bounds access to the remote queue thanks to an
uninitialised index.

This has been observed as UBSAN errors in Android:

 | Internal error: UBSAN: array index out of bounds: 00000000f2005512 [#1] PREEMPT SMP
 |
 | Call trace:
 |  purge_fragmented_block+0x204/0x21c
 |  _vm_unmap_aliases+0x170/0x378
 |  vm_unmap_aliases+0x1c/0x28
 |  change_memory_common+0x1dc/0x26c
 |  set_memory_ro+0x18/0x24
 |  module_enable_ro+0x98/0x238
 |  do_init_module+0x1b0/0x310

Move the initialisation of 'vb-&gt;cpu' in new_vmap_block() ahead of the
addition to the xarray.

Link: https://lkml.kernel.org/r/20240812171606.17486-1-will@kernel.org
Fixes: 8c61291fd850 ("mm: fix incorrect vbq reference in purge_fragmented_block")
Signed-off-by: Will Deacon &lt;will@kernel.org&gt;
Reviewed-by: Baoquan He &lt;bhe@redhat.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Zhaoyang Huang &lt;zhaoyang.huang@unisoc.com&gt;
Cc: Hailong.Liu &lt;hailong.liu@oppo.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Lorenzo Stoakes &lt;lstoakes@gmail.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Commit 8c61291fd850 ("mm: fix incorrect vbq reference in
purge_fragmented_block") extended the 'vmap_block' structure to contain a
'cpu' field which is set at allocation time to the id of the initialising
CPU.

When a new 'vmap_block' is being instantiated by new_vmap_block(), the
partially initialised structure is added to the local 'vmap_block_queue'
xarray before the 'cpu' field has been initialised.  If another CPU is
concurrently walking the xarray (e.g.  via vm_unmap_aliases()), then it
may perform an out-of-bounds access to the remote queue thanks to an
uninitialised index.

This has been observed as UBSAN errors in Android:

 | Internal error: UBSAN: array index out of bounds: 00000000f2005512 [#1] PREEMPT SMP
 |
 | Call trace:
 |  purge_fragmented_block+0x204/0x21c
 |  _vm_unmap_aliases+0x170/0x378
 |  vm_unmap_aliases+0x1c/0x28
 |  change_memory_common+0x1dc/0x26c
 |  set_memory_ro+0x18/0x24
 |  module_enable_ro+0x98/0x238
 |  do_init_module+0x1b0/0x310

Move the initialisation of 'vb-&gt;cpu' in new_vmap_block() ahead of the
addition to the xarray.

Link: https://lkml.kernel.org/r/20240812171606.17486-1-will@kernel.org
Fixes: 8c61291fd850 ("mm: fix incorrect vbq reference in purge_fragmented_block")
Signed-off-by: Will Deacon &lt;will@kernel.org&gt;
Reviewed-by: Baoquan He &lt;bhe@redhat.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Zhaoyang Huang &lt;zhaoyang.huang@unisoc.com&gt;
Cc: Hailong.Liu &lt;hailong.liu@oppo.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Lorenzo Stoakes &lt;lstoakes@gmail.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/vmalloc: fix page mapping if vm_area_alloc_pages() with high order fallback to order 0</title>
<updated>2024-08-16T05:16:14+00:00</updated>
<author>
<name>Hailong Liu</name>
<email>hailong.liu@oppo.com</email>
</author>
<published>2024-08-08T12:19:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=61ebe5a747da649057c37be1c37eb934b4af79ca'/>
<id>61ebe5a747da649057c37be1c37eb934b4af79ca</id>
<content type='text'>
The __vmap_pages_range_noflush() assumes its argument pages** contains
pages with the same page shift.  However, since commit e9c3cda4d86e ("mm,
vmalloc: fix high order __GFP_NOFAIL allocations"), if gfp_flags includes
__GFP_NOFAIL with high order in vm_area_alloc_pages() and page allocation
failed for high order, the pages** may contain two different page shifts
(high order and order-0).  This could lead __vmap_pages_range_noflush() to
perform incorrect mappings, potentially resulting in memory corruption.

Users might encounter this as follows (vmap_allow_huge = true, 2M is for
PMD_SIZE):

kvmalloc(2M, __GFP_NOFAIL|GFP_X)
    __vmalloc_node_range_noprof(vm_flags=VM_ALLOW_HUGE_VMAP)
        vm_area_alloc_pages(order=9) ---&gt; order-9 allocation failed and fallback to order-0
            vmap_pages_range()
                vmap_pages_range_noflush()
                    __vmap_pages_range_noflush(page_shift = 21) ----&gt; wrong mapping happens

We can remove the fallback code because if a high-order allocation fails,
__vmalloc_node_range_noprof() will retry with order-0.  Therefore, it is
unnecessary to fallback to order-0 here.  Therefore, fix this by removing
the fallback code.

Link: https://lkml.kernel.org/r/20240808122019.3361-1-hailong.liu@oppo.com
Fixes: e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
Signed-off-by: Hailong Liu &lt;hailong.liu@oppo.com&gt;
Reported-by: Tangquan Zheng &lt;zhengtangquan@oppo.com&gt;
Reviewed-by: Baoquan He &lt;bhe@redhat.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Acked-by: Barry Song &lt;baohua@kernel.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The __vmap_pages_range_noflush() assumes its argument pages** contains
pages with the same page shift.  However, since commit e9c3cda4d86e ("mm,
vmalloc: fix high order __GFP_NOFAIL allocations"), if gfp_flags includes
__GFP_NOFAIL with high order in vm_area_alloc_pages() and page allocation
failed for high order, the pages** may contain two different page shifts
(high order and order-0).  This could lead __vmap_pages_range_noflush() to
perform incorrect mappings, potentially resulting in memory corruption.

Users might encounter this as follows (vmap_allow_huge = true, 2M is for
PMD_SIZE):

kvmalloc(2M, __GFP_NOFAIL|GFP_X)
    __vmalloc_node_range_noprof(vm_flags=VM_ALLOW_HUGE_VMAP)
        vm_area_alloc_pages(order=9) ---&gt; order-9 allocation failed and fallback to order-0
            vmap_pages_range()
                vmap_pages_range_noflush()
                    __vmap_pages_range_noflush(page_shift = 21) ----&gt; wrong mapping happens

We can remove the fallback code because if a high-order allocation fails,
__vmalloc_node_range_noprof() will retry with order-0.  Therefore, it is
unnecessary to fallback to order-0 here.  Therefore, fix this by removing
the fallback code.

Link: https://lkml.kernel.org/r/20240808122019.3361-1-hailong.liu@oppo.com
Fixes: e9c3cda4d86e ("mm, vmalloc: fix high order __GFP_NOFAIL allocations")
Signed-off-by: Hailong Liu &lt;hailong.liu@oppo.com&gt;
Reported-by: Tangquan Zheng &lt;zhengtangquan@oppo.com&gt;
Reviewed-by: Baoquan He &lt;bhe@redhat.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Acked-by: Barry Song &lt;baohua@kernel.org&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Matthew Wilcox &lt;willy@infradead.org&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'mm-hotfixes-stable' into mm-stable to pick up "mm: fix</title>
<updated>2024-07-06T18:44:41+00:00</updated>
<author>
<name>Andrew Morton</name>
<email>akpm@linux-foundation.org</email>
</author>
<published>2024-07-06T18:44:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8ef6fd0e9ea83a792ba53882ddc6e0d38ce0d636'/>
<id>8ef6fd0e9ea83a792ba53882ddc6e0d38ce0d636</id>
<content type='text'>
crashes from deferred split racing folio migration", needed by "mm:
migrate: split folio_migrate_mapping()".
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
crashes from deferred split racing folio migration", needed by "mm:
migrate: split folio_migrate_mapping()".
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: vmalloc: check if a hash-index is in cpu_possible_mask</title>
<updated>2024-07-04T05:40:36+00:00</updated>
<author>
<name>Uladzislau Rezki (Sony)</name>
<email>urezki@gmail.com</email>
</author>
<published>2024-06-26T14:03:30+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=a34acf30b19bc4ee3ba2f1082756ea2604c19138'/>
<id>a34acf30b19bc4ee3ba2f1082756ea2604c19138</id>
<content type='text'>
The problem is that there are systems where cpu_possible_mask has gaps
between set CPUs, for example SPARC.  In this scenario addr_to_vb_xa()
hash function can return an index which accesses to not-possible and not
setup CPU area using per_cpu() macro.  This results in an oops on SPARC.

A per-cpu vmap_block_queue is also used as hash table, incorrectly
assuming the cpu_possible_mask has no gaps.  Fix it by adjusting an index
to a next possible CPU.

Link: https://lkml.kernel.org/r/20240626140330.89836-1-urezki@gmail.com
Fixes: 062eacf57ad9 ("mm: vmalloc: remove a global vmap_blocks xarray")
Reported-by: Nick Bowler &lt;nbowler@draconx.ca&gt;
Closes: https://lore.kernel.org/linux-kernel/ZntjIE6msJbF8zTa@MiWiFi-R3L-srv/T/
Signed-off-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Reviewed-by: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Hailong.Liu &lt;hailong.liu@oppo.com&gt;
Cc: Oleksiy Avramchenko &lt;oleksiy.avramchenko@sony.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
The problem is that there are systems where cpu_possible_mask has gaps
between set CPUs, for example SPARC.  In this scenario addr_to_vb_xa()
hash function can return an index which accesses to not-possible and not
setup CPU area using per_cpu() macro.  This results in an oops on SPARC.

A per-cpu vmap_block_queue is also used as hash table, incorrectly
assuming the cpu_possible_mask has no gaps.  Fix it by adjusting an index
to a next possible CPU.

Link: https://lkml.kernel.org/r/20240626140330.89836-1-urezki@gmail.com
Fixes: 062eacf57ad9 ("mm: vmalloc: remove a global vmap_blocks xarray")
Reported-by: Nick Bowler &lt;nbowler@draconx.ca&gt;
Closes: https://lore.kernel.org/linux-kernel/ZntjIE6msJbF8zTa@MiWiFi-R3L-srv/T/
Signed-off-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Reviewed-by: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Hailong.Liu &lt;hailong.liu@oppo.com&gt;
Cc: Oleksiy Avramchenko &lt;oleksiy.avramchenko@sony.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>vmalloc: modify the alloc_vmap_area() error message for better diagnostics</title>
<updated>2024-07-04T02:30:18+00:00</updated>
<author>
<name>Shubhang Kaushik OS</name>
<email>Shubhang@os.amperecomputing.com</email>
</author>
<published>2024-06-10T17:22:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=55ccad6fc1a03c814c26f0e6b35db50feda2c59e'/>
<id>55ccad6fc1a03c814c26f0e6b35db50feda2c59e</id>
<content type='text'>
'vmap allocation for size %lu failed: use vmalloc=&lt;size&gt; to increase size'
The above warning is seen in the kernel functionality for allocation of
the restricted virtual memory range till exhaustion.

This message is misleading because 'vmalloc=' is supported on arm32, x86
platforms and is not a valid kernel parameter on a number of other
platforms (in particular its not supported on arm64, alpha, loongarch,
arc, csky, hexagon, microblaze, mips, nios2, openrisc, parisc, m64k,
powerpc, riscv, sh, um, xtensa, s390, sparc).  With the update, the output
gets modified to include the function parameters along with the start and
end of the virtual memory range allowed.

The warning message after fix on kernel version 6.10.0-rc1+:

vmalloc_node_range for size 33619968 failed: Address range restricted between 0xffff800082640000 - 0xffff800084650000

Backtrace with the misleading error message:

	vmap allocation for size 33619968 failed: use vmalloc=&lt;size&gt; to increase size
	insmod: vmalloc error: size 33554432, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
	CPU: 46 PID: 1977 Comm: insmod Tainted: G            E      6.10.0-rc1+ #79
	Hardware name: INGRASYS Yushan Server iSystem TEMP-S000141176+10/Yushan Motherboard, BIOS 2.10.20230517 (SCP: xxx) yyyy/mm/dd
	Call trace:
		dump_backtrace+0xa0/0x128
		show_stack+0x20/0x38
		dump_stack_lvl+0x78/0x90
		dump_stack+0x18/0x28
		warn_alloc+0x12c/0x1b8
		__vmalloc_node_range_noprof+0x28c/0x7e0
		custom_init+0xb4/0xfff8 [test_driver]
		do_one_initcall+0x60/0x290
		do_init_module+0x68/0x250
		load_module+0x236c/0x2428
		init_module_from_file+0x8c/0xd8
		__arm64_sys_finit_module+0x1b4/0x388
		invoke_syscall+0x78/0x108
		el0_svc_common.constprop.0+0x48/0xf0
		do_el0_svc+0x24/0x38
		el0_svc+0x3c/0x130
		el0t_64_sync_handler+0x100/0x130
		el0t_64_sync+0x190/0x198

[Shubhang@os.amperecomputing.com: v5]
  Link: https://lkml.kernel.org/r/CH2PR01MB5894B0182EA0B28DF2EFB916F5C72@CH2PR01MB5894.prod.exchangelabs.com
Link: https://lkml.kernel.org/r/MN2PR01MB59025CC02D1D29516527A693F5C62@MN2PR01MB5902.prod.exchangelabs.com
Signed-off-by: Shubhang Kaushik &lt;shubhang@os.amperecomputing.com&gt;
Reviewed-by: Christoph Lameter (Ampere) &lt;cl@linux.com&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Guo Ren &lt;guoren@kernel.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Xiongwei Song &lt;xiongwei.song@windriver.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
'vmap allocation for size %lu failed: use vmalloc=&lt;size&gt; to increase size'
The above warning is seen in the kernel functionality for allocation of
the restricted virtual memory range till exhaustion.

This message is misleading because 'vmalloc=' is supported on arm32, x86
platforms and is not a valid kernel parameter on a number of other
platforms (in particular its not supported on arm64, alpha, loongarch,
arc, csky, hexagon, microblaze, mips, nios2, openrisc, parisc, m64k,
powerpc, riscv, sh, um, xtensa, s390, sparc).  With the update, the output
gets modified to include the function parameters along with the start and
end of the virtual memory range allowed.

The warning message after fix on kernel version 6.10.0-rc1+:

vmalloc_node_range for size 33619968 failed: Address range restricted between 0xffff800082640000 - 0xffff800084650000

Backtrace with the misleading error message:

	vmap allocation for size 33619968 failed: use vmalloc=&lt;size&gt; to increase size
	insmod: vmalloc error: size 33554432, vm_struct allocation failed, mode:0xcc0(GFP_KERNEL), nodemask=(null),cpuset=/,mems_allowed=0
	CPU: 46 PID: 1977 Comm: insmod Tainted: G            E      6.10.0-rc1+ #79
	Hardware name: INGRASYS Yushan Server iSystem TEMP-S000141176+10/Yushan Motherboard, BIOS 2.10.20230517 (SCP: xxx) yyyy/mm/dd
	Call trace:
		dump_backtrace+0xa0/0x128
		show_stack+0x20/0x38
		dump_stack_lvl+0x78/0x90
		dump_stack+0x18/0x28
		warn_alloc+0x12c/0x1b8
		__vmalloc_node_range_noprof+0x28c/0x7e0
		custom_init+0xb4/0xfff8 [test_driver]
		do_one_initcall+0x60/0x290
		do_init_module+0x68/0x250
		load_module+0x236c/0x2428
		init_module_from_file+0x8c/0xd8
		__arm64_sys_finit_module+0x1b4/0x388
		invoke_syscall+0x78/0x108
		el0_svc_common.constprop.0+0x48/0xf0
		do_el0_svc+0x24/0x38
		el0_svc+0x3c/0x130
		el0t_64_sync_handler+0x100/0x130
		el0t_64_sync+0x190/0x198

[Shubhang@os.amperecomputing.com: v5]
  Link: https://lkml.kernel.org/r/CH2PR01MB5894B0182EA0B28DF2EFB916F5C72@CH2PR01MB5894.prod.exchangelabs.com
Link: https://lkml.kernel.org/r/MN2PR01MB59025CC02D1D29516527A693F5C62@MN2PR01MB5902.prod.exchangelabs.com
Signed-off-by: Shubhang Kaushik &lt;shubhang@os.amperecomputing.com&gt;
Reviewed-by: Christoph Lameter (Ampere) &lt;cl@linux.com&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Cc: Guo Ren &lt;guoren@kernel.org&gt;
Cc: Jonathan Corbet &lt;corbet@lwn.net&gt;
Cc: Matthew Wilcox (Oracle) &lt;willy@infradead.org&gt;
Cc: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Xiongwei Song &lt;xiongwei.song@windriver.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/vmalloc: use __this_cpu_try_cmpxchg() in preload_this_cpu_lock()</title>
<updated>2024-07-04T02:30:02+00:00</updated>
<author>
<name>Uros Bizjak</name>
<email>ubizjak@gmail.com</email>
</author>
<published>2024-05-28T14:43:14+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f56810c94ca828ab5c223a3528cc44c823cace16'/>
<id>f56810c94ca828ab5c223a3528cc44c823cace16</id>
<content type='text'>
Use __this_cpu_try_cmpxchg() instead of __this_cpu_cmpxchg (*ptr, old,
new) == old in preload_this_cpu_lock().  x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg.

The generated code improves from:

    4bb6:	48 85 f6             	test   %rsi,%rsi
    4bb9:	0f 84 10 fa ff ff    	je     45cf &lt;...&gt;
    4bbf:	4c 89 e8             	mov    %r13,%rax
    4bc2:	65 48 0f b1 35 00 00 	cmpxchg %rsi,%gs:0x0(%rip)
    4bc9:	00 00
    4bcb:	48 85 c0             	test   %rax,%rax
    4bce:	0f 84 fb f9 ff ff    	je     45cf &lt;...&gt;

to:

    4bb6:	48 85 f6             	test   %rsi,%rsi
    4bb9:	0f 84 10 fa ff ff    	je     45cf &lt;...&gt;
    4bbf:	4c 89 e8             	mov    %r13,%rax
    4bc2:	65 48 0f b1 35 00 00 	cmpxchg %rsi,%gs:0x0(%rip)
    4bc9:	00 00
    4bcb:	0f 84 fe f9 ff ff    	je     45cf &lt;...&gt;

No functional change intended.

Link: https://lkml.kernel.org/r/20240528144345.5980-2-ubizjak@gmail.com
Signed-off-by: Uros Bizjak &lt;ubizjak@gmail.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Lorenzo Stoakes &lt;lstoakes@gmail.com&gt;
Cc: Dennis Zhou &lt;dennis@kernel.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use __this_cpu_try_cmpxchg() instead of __this_cpu_cmpxchg (*ptr, old,
new) == old in preload_this_cpu_lock().  x86 CMPXCHG instruction returns
success in ZF flag, so this change saves a compare after cmpxchg.

The generated code improves from:

    4bb6:	48 85 f6             	test   %rsi,%rsi
    4bb9:	0f 84 10 fa ff ff    	je     45cf &lt;...&gt;
    4bbf:	4c 89 e8             	mov    %r13,%rax
    4bc2:	65 48 0f b1 35 00 00 	cmpxchg %rsi,%gs:0x0(%rip)
    4bc9:	00 00
    4bcb:	48 85 c0             	test   %rax,%rax
    4bce:	0f 84 fb f9 ff ff    	je     45cf &lt;...&gt;

to:

    4bb6:	48 85 f6             	test   %rsi,%rsi
    4bb9:	0f 84 10 fa ff ff    	je     45cf &lt;...&gt;
    4bbf:	4c 89 e8             	mov    %r13,%rax
    4bc2:	65 48 0f b1 35 00 00 	cmpxchg %rsi,%gs:0x0(%rip)
    4bc9:	00 00
    4bcb:	0f 84 fe f9 ff ff    	je     45cf &lt;...&gt;

No functional change intended.

Link: https://lkml.kernel.org/r/20240528144345.5980-2-ubizjak@gmail.com
Signed-off-by: Uros Bizjak &lt;ubizjak@gmail.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Lorenzo Stoakes &lt;lstoakes@gmail.com&gt;
Cc: Dennis Zhou &lt;dennis@kernel.org&gt;
Cc: Tejun Heo &lt;tj@kernel.org&gt;
Cc: Christoph Lameter &lt;cl@linux.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm: fix incorrect vbq reference in purge_fragmented_block</title>
<updated>2024-06-25T03:52:08+00:00</updated>
<author>
<name>Zhaoyang Huang</name>
<email>zhaoyang.huang@unisoc.com</email>
</author>
<published>2024-06-07T02:31:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8c61291fd8500e3b35c7ec0c781b273d8cc96cde'/>
<id>8c61291fd8500e3b35c7ec0c781b273d8cc96cde</id>
<content type='text'>
xa_for_each() in _vm_unmap_aliases() loops through all vbs.  However,
since commit 062eacf57ad9 ("mm: vmalloc: remove a global vmap_blocks
xarray") the vb from xarray may not be on the corresponding CPU
vmap_block_queue.  Consequently, purge_fragmented_block() might use the
wrong vbq-&gt;lock to protect the free list, leading to vbq-&gt;free breakage.

Incorrect lock protection can exhaust all vmalloc space as follows:
CPU0                                            CPU1
+--------------------------------------------+
|    +--------------------+     +-----+      |
+--&gt; |                    |----&gt;|     |------+
     | CPU1:vbq free_list |     | vb1 |
+--- |                    |&lt;----|     |&lt;-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+

_vm_unmap_aliases()                             vb_alloc()
                                                new_vmap_block()
xa_for_each(&amp;vbq-&gt;vmap_blocks, idx, vb)
--&gt; vb in CPU1:vbq-&gt;freelist

purge_fragmented_block(vb)
spin_lock(&amp;vbq-&gt;lock)                           spin_lock(&amp;vbq-&gt;lock)
--&gt; use CPU0:vbq-&gt;lock                          --&gt; use CPU1:vbq-&gt;lock

list_del_rcu(&amp;vb-&gt;free_list)                    list_add_tail_rcu(&amp;vb-&gt;free_list, &amp;vbq-&gt;free)
    __list_del(vb-&gt;prev, vb-&gt;next)
        next-&gt;prev = prev
    +--------------------+
    |                    |
    | CPU1:vbq free_list |
+---|                    |&lt;--+
|   +--------------------+   |
+----------------------------+
                                                __list_add(new, head-&gt;prev, head)
+--------------------------------------------+
|    +--------------------+     +-----+      |
+--&gt; |                    |----&gt;|     |------+
     | CPU1:vbq free_list |     | vb2 |
+--- |                    |&lt;----|     |&lt;-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+

        prev-&gt;next = next
+--------------------------------------------+
|----------------------------+               |
|    +--------------------+  |  +-----+      |
+--&gt; |                    |--+  |     |------+
     | CPU1:vbq free_list |     | vb2 |
+--- |                    |&lt;----|     |&lt;-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+
Here’s a list breakdown. All vbs, which were to be added to
‘prev’, cannot be used by list_for_each_entry_rcu(vb, &amp;vbq-&gt;free,
free_list) in vb_alloc(). Thus, vmalloc space is exhausted.

This issue affects both erofs and f2fs, the stacktrace is as follows:
erofs:
[&lt;ffffffd4ffb93ad4&gt;] __switch_to+0x174
[&lt;ffffffd4ffb942f0&gt;] __schedule+0x624
[&lt;ffffffd4ffb946f4&gt;] schedule+0x7c
[&lt;ffffffd4ffb947cc&gt;] schedule_preempt_disabled+0x24
[&lt;ffffffd4ffb962ec&gt;] __mutex_lock+0x374
[&lt;ffffffd4ffb95998&gt;] __mutex_lock_slowpath+0x14
[&lt;ffffffd4ffb95954&gt;] mutex_lock+0x24
[&lt;ffffffd4fef2900c&gt;] reclaim_and_purge_vmap_areas+0x44
[&lt;ffffffd4fef25908&gt;] alloc_vmap_area+0x2e0
[&lt;ffffffd4fef24ea0&gt;] vm_map_ram+0x1b0
[&lt;ffffffd4ff1b46f4&gt;] z_erofs_lz4_decompress+0x278
[&lt;ffffffd4ff1b8ac4&gt;] z_erofs_decompress_queue+0x650
[&lt;ffffffd4ff1b8328&gt;] z_erofs_runqueue+0x7f4
[&lt;ffffffd4ff1b66a8&gt;] z_erofs_read_folio+0x104
[&lt;ffffffd4feeb6fec&gt;] filemap_read_folio+0x6c
[&lt;ffffffd4feeb68c4&gt;] filemap_fault+0x300
[&lt;ffffffd4fef0ecac&gt;] __do_fault+0xc8
[&lt;ffffffd4fef0c908&gt;] handle_mm_fault+0xb38
[&lt;ffffffd4ffb9f008&gt;] do_page_fault+0x288
[&lt;ffffffd4ffb9ed64&gt;] do_translation_fault[jt]+0x40
[&lt;ffffffd4fec39c78&gt;] do_mem_abort+0x58
[&lt;ffffffd4ffb8c3e4&gt;] el0_ia+0x70
[&lt;ffffffd4ffb8c260&gt;] el0t_64_sync_handler[jt]+0xb0
[&lt;ffffffd4fec11588&gt;] ret_to_user[jt]+0x0

f2fs:
[&lt;ffffffd4ffb93ad4&gt;] __switch_to+0x174
[&lt;ffffffd4ffb942f0&gt;] __schedule+0x624
[&lt;ffffffd4ffb946f4&gt;] schedule+0x7c
[&lt;ffffffd4ffb947cc&gt;] schedule_preempt_disabled+0x24
[&lt;ffffffd4ffb962ec&gt;] __mutex_lock+0x374
[&lt;ffffffd4ffb95998&gt;] __mutex_lock_slowpath+0x14
[&lt;ffffffd4ffb95954&gt;] mutex_lock+0x24
[&lt;ffffffd4fef2900c&gt;] reclaim_and_purge_vmap_areas+0x44
[&lt;ffffffd4fef25908&gt;] alloc_vmap_area+0x2e0
[&lt;ffffffd4fef24ea0&gt;] vm_map_ram+0x1b0
[&lt;ffffffd4ff1a3b60&gt;] f2fs_prepare_decomp_mem+0x144
[&lt;ffffffd4ff1a6c24&gt;] f2fs_alloc_dic+0x264
[&lt;ffffffd4ff175468&gt;] f2fs_read_multi_pages+0x428
[&lt;ffffffd4ff17b46c&gt;] f2fs_mpage_readpages+0x314
[&lt;ffffffd4ff1785c4&gt;] f2fs_readahead+0x50
[&lt;ffffffd4feec3384&gt;] read_pages+0x80
[&lt;ffffffd4feec32c0&gt;] page_cache_ra_unbounded+0x1a0
[&lt;ffffffd4feec39e8&gt;] page_cache_ra_order+0x274
[&lt;ffffffd4feeb6cec&gt;] do_sync_mmap_readahead+0x11c
[&lt;ffffffd4feeb6764&gt;] filemap_fault+0x1a0
[&lt;ffffffd4ff1423bc&gt;] f2fs_filemap_fault+0x28
[&lt;ffffffd4fef0ecac&gt;] __do_fault+0xc8
[&lt;ffffffd4fef0c908&gt;] handle_mm_fault+0xb38
[&lt;ffffffd4ffb9f008&gt;] do_page_fault+0x288
[&lt;ffffffd4ffb9ed64&gt;] do_translation_fault[jt]+0x40
[&lt;ffffffd4fec39c78&gt;] do_mem_abort+0x58
[&lt;ffffffd4ffb8c3e4&gt;] el0_ia+0x70
[&lt;ffffffd4ffb8c260&gt;] el0t_64_sync_handler[jt]+0xb0
[&lt;ffffffd4fec11588&gt;] ret_to_user[jt]+0x0

To fix this, introducee cpu within vmap_block to record which this vb
belongs to.

Link: https://lkml.kernel.org/r/20240614021352.1822225-1-zhaoyang.huang@unisoc.com
Link: https://lkml.kernel.org/r/20240607023116.1720640-1-zhaoyang.huang@unisoc.com
Fixes: fc1e0d980037 ("mm/vmalloc: prevent stale TLBs in fully utilized blocks")
Signed-off-by: Zhaoyang Huang &lt;zhaoyang.huang@unisoc.com&gt;
Suggested-by: Hailong.Liu &lt;hailong.liu@oppo.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Lorenzo Stoakes &lt;lstoakes@gmail.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
xa_for_each() in _vm_unmap_aliases() loops through all vbs.  However,
since commit 062eacf57ad9 ("mm: vmalloc: remove a global vmap_blocks
xarray") the vb from xarray may not be on the corresponding CPU
vmap_block_queue.  Consequently, purge_fragmented_block() might use the
wrong vbq-&gt;lock to protect the free list, leading to vbq-&gt;free breakage.

Incorrect lock protection can exhaust all vmalloc space as follows:
CPU0                                            CPU1
+--------------------------------------------+
|    +--------------------+     +-----+      |
+--&gt; |                    |----&gt;|     |------+
     | CPU1:vbq free_list |     | vb1 |
+--- |                    |&lt;----|     |&lt;-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+

_vm_unmap_aliases()                             vb_alloc()
                                                new_vmap_block()
xa_for_each(&amp;vbq-&gt;vmap_blocks, idx, vb)
--&gt; vb in CPU1:vbq-&gt;freelist

purge_fragmented_block(vb)
spin_lock(&amp;vbq-&gt;lock)                           spin_lock(&amp;vbq-&gt;lock)
--&gt; use CPU0:vbq-&gt;lock                          --&gt; use CPU1:vbq-&gt;lock

list_del_rcu(&amp;vb-&gt;free_list)                    list_add_tail_rcu(&amp;vb-&gt;free_list, &amp;vbq-&gt;free)
    __list_del(vb-&gt;prev, vb-&gt;next)
        next-&gt;prev = prev
    +--------------------+
    |                    |
    | CPU1:vbq free_list |
+---|                    |&lt;--+
|   +--------------------+   |
+----------------------------+
                                                __list_add(new, head-&gt;prev, head)
+--------------------------------------------+
|    +--------------------+     +-----+      |
+--&gt; |                    |----&gt;|     |------+
     | CPU1:vbq free_list |     | vb2 |
+--- |                    |&lt;----|     |&lt;-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+

        prev-&gt;next = next
+--------------------------------------------+
|----------------------------+               |
|    +--------------------+  |  +-----+      |
+--&gt; |                    |--+  |     |------+
     | CPU1:vbq free_list |     | vb2 |
+--- |                    |&lt;----|     |&lt;-----+
|    +--------------------+     +-----+      |
+--------------------------------------------+
Here’s a list breakdown. All vbs, which were to be added to
‘prev’, cannot be used by list_for_each_entry_rcu(vb, &amp;vbq-&gt;free,
free_list) in vb_alloc(). Thus, vmalloc space is exhausted.

This issue affects both erofs and f2fs, the stacktrace is as follows:
erofs:
[&lt;ffffffd4ffb93ad4&gt;] __switch_to+0x174
[&lt;ffffffd4ffb942f0&gt;] __schedule+0x624
[&lt;ffffffd4ffb946f4&gt;] schedule+0x7c
[&lt;ffffffd4ffb947cc&gt;] schedule_preempt_disabled+0x24
[&lt;ffffffd4ffb962ec&gt;] __mutex_lock+0x374
[&lt;ffffffd4ffb95998&gt;] __mutex_lock_slowpath+0x14
[&lt;ffffffd4ffb95954&gt;] mutex_lock+0x24
[&lt;ffffffd4fef2900c&gt;] reclaim_and_purge_vmap_areas+0x44
[&lt;ffffffd4fef25908&gt;] alloc_vmap_area+0x2e0
[&lt;ffffffd4fef24ea0&gt;] vm_map_ram+0x1b0
[&lt;ffffffd4ff1b46f4&gt;] z_erofs_lz4_decompress+0x278
[&lt;ffffffd4ff1b8ac4&gt;] z_erofs_decompress_queue+0x650
[&lt;ffffffd4ff1b8328&gt;] z_erofs_runqueue+0x7f4
[&lt;ffffffd4ff1b66a8&gt;] z_erofs_read_folio+0x104
[&lt;ffffffd4feeb6fec&gt;] filemap_read_folio+0x6c
[&lt;ffffffd4feeb68c4&gt;] filemap_fault+0x300
[&lt;ffffffd4fef0ecac&gt;] __do_fault+0xc8
[&lt;ffffffd4fef0c908&gt;] handle_mm_fault+0xb38
[&lt;ffffffd4ffb9f008&gt;] do_page_fault+0x288
[&lt;ffffffd4ffb9ed64&gt;] do_translation_fault[jt]+0x40
[&lt;ffffffd4fec39c78&gt;] do_mem_abort+0x58
[&lt;ffffffd4ffb8c3e4&gt;] el0_ia+0x70
[&lt;ffffffd4ffb8c260&gt;] el0t_64_sync_handler[jt]+0xb0
[&lt;ffffffd4fec11588&gt;] ret_to_user[jt]+0x0

f2fs:
[&lt;ffffffd4ffb93ad4&gt;] __switch_to+0x174
[&lt;ffffffd4ffb942f0&gt;] __schedule+0x624
[&lt;ffffffd4ffb946f4&gt;] schedule+0x7c
[&lt;ffffffd4ffb947cc&gt;] schedule_preempt_disabled+0x24
[&lt;ffffffd4ffb962ec&gt;] __mutex_lock+0x374
[&lt;ffffffd4ffb95998&gt;] __mutex_lock_slowpath+0x14
[&lt;ffffffd4ffb95954&gt;] mutex_lock+0x24
[&lt;ffffffd4fef2900c&gt;] reclaim_and_purge_vmap_areas+0x44
[&lt;ffffffd4fef25908&gt;] alloc_vmap_area+0x2e0
[&lt;ffffffd4fef24ea0&gt;] vm_map_ram+0x1b0
[&lt;ffffffd4ff1a3b60&gt;] f2fs_prepare_decomp_mem+0x144
[&lt;ffffffd4ff1a6c24&gt;] f2fs_alloc_dic+0x264
[&lt;ffffffd4ff175468&gt;] f2fs_read_multi_pages+0x428
[&lt;ffffffd4ff17b46c&gt;] f2fs_mpage_readpages+0x314
[&lt;ffffffd4ff1785c4&gt;] f2fs_readahead+0x50
[&lt;ffffffd4feec3384&gt;] read_pages+0x80
[&lt;ffffffd4feec32c0&gt;] page_cache_ra_unbounded+0x1a0
[&lt;ffffffd4feec39e8&gt;] page_cache_ra_order+0x274
[&lt;ffffffd4feeb6cec&gt;] do_sync_mmap_readahead+0x11c
[&lt;ffffffd4feeb6764&gt;] filemap_fault+0x1a0
[&lt;ffffffd4ff1423bc&gt;] f2fs_filemap_fault+0x28
[&lt;ffffffd4fef0ecac&gt;] __do_fault+0xc8
[&lt;ffffffd4fef0c908&gt;] handle_mm_fault+0xb38
[&lt;ffffffd4ffb9f008&gt;] do_page_fault+0x288
[&lt;ffffffd4ffb9ed64&gt;] do_translation_fault[jt]+0x40
[&lt;ffffffd4fec39c78&gt;] do_mem_abort+0x58
[&lt;ffffffd4ffb8c3e4&gt;] el0_ia+0x70
[&lt;ffffffd4ffb8c260&gt;] el0t_64_sync_handler[jt]+0xb0
[&lt;ffffffd4fec11588&gt;] ret_to_user[jt]+0x0

To fix this, introducee cpu within vmap_block to record which this vb
belongs to.

Link: https://lkml.kernel.org/r/20240614021352.1822225-1-zhaoyang.huang@unisoc.com
Link: https://lkml.kernel.org/r/20240607023116.1720640-1-zhaoyang.huang@unisoc.com
Fixes: fc1e0d980037 ("mm/vmalloc: prevent stale TLBs in fully utilized blocks")
Signed-off-by: Zhaoyang Huang &lt;zhaoyang.huang@unisoc.com&gt;
Suggested-by: Hailong.Liu &lt;hailong.liu@oppo.com&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Baoquan He &lt;bhe@redhat.com&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Lorenzo Stoakes &lt;lstoakes@gmail.com&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>vmalloc: check CONFIG_EXECMEM in is_vmalloc_or_module_addr()</title>
<updated>2024-06-06T02:19:25+00:00</updated>
<author>
<name>Cong Wang</name>
<email>cong.wang@bytedance.com</email>
</author>
<published>2024-05-28T16:08:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=0105eaabb27f31d9b8d340aca6fb6a3420cab30f'/>
<id>0105eaabb27f31d9b8d340aca6fb6a3420cab30f</id>
<content type='text'>
After commit 2c9e5d4a0082 ("bpf: remove CONFIG_BPF_JIT dependency on
CONFIG_MODULES of") CONFIG_BPF_JIT does not depend on CONFIG_MODULES any
more and bpf jit also uses the [MODULES_VADDR, MODULES_END] memory region.
But is_vmalloc_or_module_addr() still checks CONFIG_MODULES, which then
returns false for a bpf jit memory region when CONFIG_MODULES is not
defined.  It leads to the following kernel BUG:

[    1.567023] ------------[ cut here ]------------
[    1.567883] kernel BUG at mm/vmalloc.c:745!
[    1.568477] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[    1.569367] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.9.0+ #448
[    1.570247] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
[    1.570786] RIP: 0010:vmalloc_to_page+0x48/0x1ec
[    1.570786] Code: 0f 00 00 e8 eb 1a 05 00 b8 37 00 00 00 48 ba fe ff ff ff ff 1f 00 00 4c 03 25 76 49 c6 02 48 c1 e0 28 48 01 e8 48 39 d0 76 02 &lt;0f&gt; 0b 4c 89 e7 e8 bf 1a 05 00 49 8b 04 24 48 a9 9f ff ff ff 0f 84
[    1.570786] RSP: 0018:ffff888007787960 EFLAGS: 00010212
[    1.570786] RAX: 000036ffa0000000 RBX: 0000000000000640 RCX: ffffffff8147e93c
[    1.570786] RDX: 00001ffffffffffe RSI: dffffc0000000000 RDI: ffffffff840e32c8
[    1.570786] RBP: ffffffffa0000000 R08: 0000000000000000 R09: 0000000000000000
[    1.570786] R10: ffff888007787a88 R11: ffffffff8475d8e7 R12: ffffffff83e80ff8
[    1.570786] R13: 0000000000000640 R14: 0000000000000640 R15: 0000000000000640
[    1.570786] FS:  0000000000000000(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000
[    1.570786] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.570786] CR2: ffff888006a01000 CR3: 0000000003e80000 CR4: 0000000000350ef0
[    1.570786] Call Trace:
[    1.570786]  &lt;TASK&gt;
[    1.570786]  ? __die_body+0x1b/0x58
[    1.570786]  ? die+0x31/0x4b
[    1.570786]  ? do_trap+0x9d/0x138
[    1.570786]  ? vmalloc_to_page+0x48/0x1ec
[    1.570786]  ? do_error_trap+0xcd/0x102
[    1.570786]  ? vmalloc_to_page+0x48/0x1ec
[    1.570786]  ? vmalloc_to_page+0x48/0x1ec
[    1.570786]  ? handle_invalid_op+0x2f/0x38
[    1.570786]  ? vmalloc_to_page+0x48/0x1ec
[    1.570786]  ? exc_invalid_op+0x2b/0x41
[    1.570786]  ? asm_exc_invalid_op+0x16/0x20
[    1.570786]  ? vmalloc_to_page+0x26/0x1ec
[    1.570786]  ? vmalloc_to_page+0x48/0x1ec
[    1.570786]  __text_poke+0xb6/0x458
[    1.570786]  ? __pfx_text_poke_memcpy+0x10/0x10
[    1.570786]  ? __pfx___mutex_lock+0x10/0x10
[    1.570786]  ? __pfx___text_poke+0x10/0x10
[    1.570786]  ? __pfx_get_random_u32+0x10/0x10
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  text_poke_copy_locked+0x70/0x84
[    1.570786]  text_poke_copy+0x32/0x4f
[    1.570786]  bpf_arch_text_copy+0xf/0x27
[    1.570786]  bpf_jit_binary_pack_finalize+0x26/0x5a
[    1.570786]  bpf_int_jit_compile+0x576/0x8ad
[    1.570786]  ? __pfx_bpf_int_jit_compile+0x10/0x10
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  ? __kmalloc_node_track_caller+0x2b5/0x2e0
[    1.570786]  bpf_prog_select_runtime+0x7c/0x199
[    1.570786]  bpf_prepare_filter+0x1e9/0x25b
[    1.570786]  ? __pfx_bpf_prepare_filter+0x10/0x10
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  ? _find_next_bit+0x29/0x7e
[    1.570786]  bpf_prog_create+0xb8/0xe0
[    1.570786]  ptp_classifier_init+0x75/0xa1
[    1.570786]  ? __pfx_ptp_classifier_init+0x10/0x10
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  ? register_pernet_subsys+0x36/0x42
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  sock_init+0x99/0xa3
[    1.570786]  ? __pfx_sock_init+0x10/0x10
[    1.570786]  do_one_initcall+0x104/0x2c4
[    1.570786]  ? __pfx_do_one_initcall+0x10/0x10
[    1.570786]  ? parameq+0x25/0x2d
[    1.570786]  ? rcu_is_watching+0x1c/0x3c
[    1.570786]  ? trace_kmalloc+0x81/0xb2
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  ? __kmalloc+0x29c/0x2c7
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  do_initcalls+0xf9/0x123
[    1.570786]  kernel_init_freeable+0x24f/0x289
[    1.570786]  ? __pfx_kernel_init+0x10/0x10
[    1.570786]  kernel_init+0x19/0x13a
[    1.570786]  ret_from_fork+0x24/0x41
[    1.570786]  ? __pfx_kernel_init+0x10/0x10
[    1.570786]  ret_from_fork_asm+0x1a/0x30
[    1.570786]  &lt;/TASK&gt;
[    1.570819] ---[ end trace 0000000000000000 ]---
[    1.571463] RIP: 0010:vmalloc_to_page+0x48/0x1ec
[    1.572111] Code: 0f 00 00 e8 eb 1a 05 00 b8 37 00 00 00 48 ba fe ff ff ff ff 1f 00 00 4c 03 25 76 49 c6 02 48 c1 e0 28 48 01 e8 48 39 d0 76 02 &lt;0f&gt; 0b 4c 89 e7 e8 bf 1a 05 00 49 8b 04 24 48 a9 9f ff ff ff 0f 84
[    1.574632] RSP: 0018:ffff888007787960 EFLAGS: 00010212
[    1.575129] RAX: 000036ffa0000000 RBX: 0000000000000640 RCX: ffffffff8147e93c
[    1.576097] RDX: 00001ffffffffffe RSI: dffffc0000000000 RDI: ffffffff840e32c8
[    1.577084] RBP: ffffffffa0000000 R08: 0000000000000000 R09: 0000000000000000
[    1.578077] R10: ffff888007787a88 R11: ffffffff8475d8e7 R12: ffffffff83e80ff8
[    1.578810] R13: 0000000000000640 R14: 0000000000000640 R15: 0000000000000640
[    1.579823] FS:  0000000000000000(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000
[    1.580992] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.581869] CR2: ffff888006a01000 CR3: 0000000003e80000 CR4: 0000000000350ef0
[    1.582800] Kernel panic - not syncing: Fatal exception
[    1.583765] ---[ end Kernel panic - not syncing: Fatal exception ]---

Fix this by checking CONFIG_EXECMEM instead.

Link: https://lkml.kernel.org/r/20240528160838.102223-1-xiyou.wangcong@gmail.com
Fixes: 2c9e5d4a0082 ("bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of")
Signed-off-by: Cong Wang &lt;cong.wang@bytedance.com&gt;
Acked-by: Mike Rapoport (IBM) &lt;rppt@kernel.org&gt;
Cc: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Cc: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
After commit 2c9e5d4a0082 ("bpf: remove CONFIG_BPF_JIT dependency on
CONFIG_MODULES of") CONFIG_BPF_JIT does not depend on CONFIG_MODULES any
more and bpf jit also uses the [MODULES_VADDR, MODULES_END] memory region.
But is_vmalloc_or_module_addr() still checks CONFIG_MODULES, which then
returns false for a bpf jit memory region when CONFIG_MODULES is not
defined.  It leads to the following kernel BUG:

[    1.567023] ------------[ cut here ]------------
[    1.567883] kernel BUG at mm/vmalloc.c:745!
[    1.568477] Oops: invalid opcode: 0000 [#1] PREEMPT SMP KASAN NOPTI
[    1.569367] CPU: 0 PID: 1 Comm: swapper/0 Not tainted 6.9.0+ #448
[    1.570247] Hardware name: QEMU Standard PC (Q35 + ICH9, 2009), BIOS 1.15.0-1 04/01/2014
[    1.570786] RIP: 0010:vmalloc_to_page+0x48/0x1ec
[    1.570786] Code: 0f 00 00 e8 eb 1a 05 00 b8 37 00 00 00 48 ba fe ff ff ff ff 1f 00 00 4c 03 25 76 49 c6 02 48 c1 e0 28 48 01 e8 48 39 d0 76 02 &lt;0f&gt; 0b 4c 89 e7 e8 bf 1a 05 00 49 8b 04 24 48 a9 9f ff ff ff 0f 84
[    1.570786] RSP: 0018:ffff888007787960 EFLAGS: 00010212
[    1.570786] RAX: 000036ffa0000000 RBX: 0000000000000640 RCX: ffffffff8147e93c
[    1.570786] RDX: 00001ffffffffffe RSI: dffffc0000000000 RDI: ffffffff840e32c8
[    1.570786] RBP: ffffffffa0000000 R08: 0000000000000000 R09: 0000000000000000
[    1.570786] R10: ffff888007787a88 R11: ffffffff8475d8e7 R12: ffffffff83e80ff8
[    1.570786] R13: 0000000000000640 R14: 0000000000000640 R15: 0000000000000640
[    1.570786] FS:  0000000000000000(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000
[    1.570786] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.570786] CR2: ffff888006a01000 CR3: 0000000003e80000 CR4: 0000000000350ef0
[    1.570786] Call Trace:
[    1.570786]  &lt;TASK&gt;
[    1.570786]  ? __die_body+0x1b/0x58
[    1.570786]  ? die+0x31/0x4b
[    1.570786]  ? do_trap+0x9d/0x138
[    1.570786]  ? vmalloc_to_page+0x48/0x1ec
[    1.570786]  ? do_error_trap+0xcd/0x102
[    1.570786]  ? vmalloc_to_page+0x48/0x1ec
[    1.570786]  ? vmalloc_to_page+0x48/0x1ec
[    1.570786]  ? handle_invalid_op+0x2f/0x38
[    1.570786]  ? vmalloc_to_page+0x48/0x1ec
[    1.570786]  ? exc_invalid_op+0x2b/0x41
[    1.570786]  ? asm_exc_invalid_op+0x16/0x20
[    1.570786]  ? vmalloc_to_page+0x26/0x1ec
[    1.570786]  ? vmalloc_to_page+0x48/0x1ec
[    1.570786]  __text_poke+0xb6/0x458
[    1.570786]  ? __pfx_text_poke_memcpy+0x10/0x10
[    1.570786]  ? __pfx___mutex_lock+0x10/0x10
[    1.570786]  ? __pfx___text_poke+0x10/0x10
[    1.570786]  ? __pfx_get_random_u32+0x10/0x10
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  text_poke_copy_locked+0x70/0x84
[    1.570786]  text_poke_copy+0x32/0x4f
[    1.570786]  bpf_arch_text_copy+0xf/0x27
[    1.570786]  bpf_jit_binary_pack_finalize+0x26/0x5a
[    1.570786]  bpf_int_jit_compile+0x576/0x8ad
[    1.570786]  ? __pfx_bpf_int_jit_compile+0x10/0x10
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  ? __kmalloc_node_track_caller+0x2b5/0x2e0
[    1.570786]  bpf_prog_select_runtime+0x7c/0x199
[    1.570786]  bpf_prepare_filter+0x1e9/0x25b
[    1.570786]  ? __pfx_bpf_prepare_filter+0x10/0x10
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  ? _find_next_bit+0x29/0x7e
[    1.570786]  bpf_prog_create+0xb8/0xe0
[    1.570786]  ptp_classifier_init+0x75/0xa1
[    1.570786]  ? __pfx_ptp_classifier_init+0x10/0x10
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  ? register_pernet_subsys+0x36/0x42
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  sock_init+0x99/0xa3
[    1.570786]  ? __pfx_sock_init+0x10/0x10
[    1.570786]  do_one_initcall+0x104/0x2c4
[    1.570786]  ? __pfx_do_one_initcall+0x10/0x10
[    1.570786]  ? parameq+0x25/0x2d
[    1.570786]  ? rcu_is_watching+0x1c/0x3c
[    1.570786]  ? trace_kmalloc+0x81/0xb2
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  ? __kmalloc+0x29c/0x2c7
[    1.570786]  ? srso_return_thunk+0x5/0x5f
[    1.570786]  do_initcalls+0xf9/0x123
[    1.570786]  kernel_init_freeable+0x24f/0x289
[    1.570786]  ? __pfx_kernel_init+0x10/0x10
[    1.570786]  kernel_init+0x19/0x13a
[    1.570786]  ret_from_fork+0x24/0x41
[    1.570786]  ? __pfx_kernel_init+0x10/0x10
[    1.570786]  ret_from_fork_asm+0x1a/0x30
[    1.570786]  &lt;/TASK&gt;
[    1.570819] ---[ end trace 0000000000000000 ]---
[    1.571463] RIP: 0010:vmalloc_to_page+0x48/0x1ec
[    1.572111] Code: 0f 00 00 e8 eb 1a 05 00 b8 37 00 00 00 48 ba fe ff ff ff ff 1f 00 00 4c 03 25 76 49 c6 02 48 c1 e0 28 48 01 e8 48 39 d0 76 02 &lt;0f&gt; 0b 4c 89 e7 e8 bf 1a 05 00 49 8b 04 24 48 a9 9f ff ff ff 0f 84
[    1.574632] RSP: 0018:ffff888007787960 EFLAGS: 00010212
[    1.575129] RAX: 000036ffa0000000 RBX: 0000000000000640 RCX: ffffffff8147e93c
[    1.576097] RDX: 00001ffffffffffe RSI: dffffc0000000000 RDI: ffffffff840e32c8
[    1.577084] RBP: ffffffffa0000000 R08: 0000000000000000 R09: 0000000000000000
[    1.578077] R10: ffff888007787a88 R11: ffffffff8475d8e7 R12: ffffffff83e80ff8
[    1.578810] R13: 0000000000000640 R14: 0000000000000640 R15: 0000000000000640
[    1.579823] FS:  0000000000000000(0000) GS:ffff88806cc00000(0000) knlGS:0000000000000000
[    1.580992] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[    1.581869] CR2: ffff888006a01000 CR3: 0000000003e80000 CR4: 0000000000350ef0
[    1.582800] Kernel panic - not syncing: Fatal exception
[    1.583765] ---[ end Kernel panic - not syncing: Fatal exception ]---

Fix this by checking CONFIG_EXECMEM instead.

Link: https://lkml.kernel.org/r/20240528160838.102223-1-xiyou.wangcong@gmail.com
Fixes: 2c9e5d4a0082 ("bpf: remove CONFIG_BPF_JIT dependency on CONFIG_MODULES of")
Signed-off-by: Cong Wang &lt;cong.wang@bytedance.com&gt;
Acked-by: Mike Rapoport (IBM) &lt;rppt@kernel.org&gt;
Cc: Luis Chamberlain &lt;mcgrof@kernel.org&gt;
Cc: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/vmalloc: fix vmalloc which may return null if called with __GFP_NOFAIL</title>
<updated>2024-05-24T18:55:04+00:00</updated>
<author>
<name>Hailong.Liu</name>
<email>hailong.liu@oppo.com</email>
</author>
<published>2024-05-10T10:01:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=8e0545c83d672750632f46e3f9ad95c48c91a0fc'/>
<id>8e0545c83d672750632f46e3f9ad95c48c91a0fc</id>
<content type='text'>
commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
includes support for __GFP_NOFAIL, but it presents a conflict with commit
dd544141b9eb ("vmalloc: back off when the current task is OOM-killed").  A
possible scenario is as follows:

process-a
__vmalloc_node_range(GFP_KERNEL | __GFP_NOFAIL)
    __vmalloc_area_node()
        vm_area_alloc_pages()
		--&gt; oom-killer send SIGKILL to process-a
        if (fatal_signal_pending(current)) break;
--&gt; return NULL;

To fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
if __GFP_NOFAIL set.

This issue occurred during OPLUS KASAN TEST. Below is part of the log
-&gt; oom-killer sends signal to process
[65731.222840] [ T1308] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/apps/uid_10198,task=gs.intelligence,pid=32454,uid=10198

[65731.259685] [T32454] Call trace:
[65731.259698] [T32454]  dump_backtrace+0xf4/0x118
[65731.259734] [T32454]  show_stack+0x18/0x24
[65731.259756] [T32454]  dump_stack_lvl+0x60/0x7c
[65731.259781] [T32454]  dump_stack+0x18/0x38
[65731.259800] [T32454]  mrdump_common_die+0x250/0x39c [mrdump]
[65731.259936] [T32454]  ipanic_die+0x20/0x34 [mrdump]
[65731.260019] [T32454]  atomic_notifier_call_chain+0xb4/0xfc
[65731.260047] [T32454]  notify_die+0x114/0x198
[65731.260073] [T32454]  die+0xf4/0x5b4
[65731.260098] [T32454]  die_kernel_fault+0x80/0x98
[65731.260124] [T32454]  __do_kernel_fault+0x160/0x2a8
[65731.260146] [T32454]  do_bad_area+0x68/0x148
[65731.260174] [T32454]  do_mem_abort+0x151c/0x1b34
[65731.260204] [T32454]  el1_abort+0x3c/0x5c
[65731.260227] [T32454]  el1h_64_sync_handler+0x54/0x90
[65731.260248] [T32454]  el1h_64_sync+0x68/0x6c

[65731.260269] [T32454]  z_erofs_decompress_queue+0x7f0/0x2258
--&gt; be-&gt;decompressed_pages = kvcalloc(be-&gt;nr_pages, sizeof(struct page *), GFP_KERNEL | __GFP_NOFAIL);
	kernel panic by NULL pointer dereference.
	erofs assume kvmalloc with __GFP_NOFAIL never return NULL.
[65731.260293] [T32454]  z_erofs_runqueue+0xf30/0x104c
[65731.260314] [T32454]  z_erofs_readahead+0x4f0/0x968
[65731.260339] [T32454]  read_pages+0x170/0xadc
[65731.260364] [T32454]  page_cache_ra_unbounded+0x874/0xf30
[65731.260388] [T32454]  page_cache_ra_order+0x24c/0x714
[65731.260411] [T32454]  filemap_fault+0xbf0/0x1a74
[65731.260437] [T32454]  __do_fault+0xd0/0x33c
[65731.260462] [T32454]  handle_mm_fault+0xf74/0x3fe0
[65731.260486] [T32454]  do_mem_abort+0x54c/0x1b34
[65731.260509] [T32454]  el0_da+0x44/0x94
[65731.260531] [T32454]  el0t_64_sync_handler+0x98/0xb4
[65731.260553] [T32454]  el0t_64_sync+0x198/0x19c

Link: https://lkml.kernel.org/r/20240510100131.1865-1-hailong.liu@oppo.com
Fixes: 9376130c390a ("mm/vmalloc: add support for __GFP_NOFAIL")
Signed-off-by: Hailong.Liu &lt;hailong.liu@oppo.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Suggested-by: Barry Song &lt;21cnbao@gmail.com&gt;
Reported-by: Oven &lt;liyangouwen1@oppo.com&gt;
Reviewed-by: Barry Song &lt;baohua@kernel.org&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Chao Yu &lt;chao@kernel.org&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: Lorenzo Stoakes &lt;lstoakes@gmail.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit a421ef303008 ("mm: allow !GFP_KERNEL allocations for kvmalloc")
includes support for __GFP_NOFAIL, but it presents a conflict with commit
dd544141b9eb ("vmalloc: back off when the current task is OOM-killed").  A
possible scenario is as follows:

process-a
__vmalloc_node_range(GFP_KERNEL | __GFP_NOFAIL)
    __vmalloc_area_node()
        vm_area_alloc_pages()
		--&gt; oom-killer send SIGKILL to process-a
        if (fatal_signal_pending(current)) break;
--&gt; return NULL;

To fix this, do not check fatal_signal_pending() in vm_area_alloc_pages()
if __GFP_NOFAIL set.

This issue occurred during OPLUS KASAN TEST. Below is part of the log
-&gt; oom-killer sends signal to process
[65731.222840] [ T1308] oom-kill:constraint=CONSTRAINT_NONE,nodemask=(null),cpuset=/,mems_allowed=0,global_oom,task_memcg=/apps/uid_10198,task=gs.intelligence,pid=32454,uid=10198

[65731.259685] [T32454] Call trace:
[65731.259698] [T32454]  dump_backtrace+0xf4/0x118
[65731.259734] [T32454]  show_stack+0x18/0x24
[65731.259756] [T32454]  dump_stack_lvl+0x60/0x7c
[65731.259781] [T32454]  dump_stack+0x18/0x38
[65731.259800] [T32454]  mrdump_common_die+0x250/0x39c [mrdump]
[65731.259936] [T32454]  ipanic_die+0x20/0x34 [mrdump]
[65731.260019] [T32454]  atomic_notifier_call_chain+0xb4/0xfc
[65731.260047] [T32454]  notify_die+0x114/0x198
[65731.260073] [T32454]  die+0xf4/0x5b4
[65731.260098] [T32454]  die_kernel_fault+0x80/0x98
[65731.260124] [T32454]  __do_kernel_fault+0x160/0x2a8
[65731.260146] [T32454]  do_bad_area+0x68/0x148
[65731.260174] [T32454]  do_mem_abort+0x151c/0x1b34
[65731.260204] [T32454]  el1_abort+0x3c/0x5c
[65731.260227] [T32454]  el1h_64_sync_handler+0x54/0x90
[65731.260248] [T32454]  el1h_64_sync+0x68/0x6c

[65731.260269] [T32454]  z_erofs_decompress_queue+0x7f0/0x2258
--&gt; be-&gt;decompressed_pages = kvcalloc(be-&gt;nr_pages, sizeof(struct page *), GFP_KERNEL | __GFP_NOFAIL);
	kernel panic by NULL pointer dereference.
	erofs assume kvmalloc with __GFP_NOFAIL never return NULL.
[65731.260293] [T32454]  z_erofs_runqueue+0xf30/0x104c
[65731.260314] [T32454]  z_erofs_readahead+0x4f0/0x968
[65731.260339] [T32454]  read_pages+0x170/0xadc
[65731.260364] [T32454]  page_cache_ra_unbounded+0x874/0xf30
[65731.260388] [T32454]  page_cache_ra_order+0x24c/0x714
[65731.260411] [T32454]  filemap_fault+0xbf0/0x1a74
[65731.260437] [T32454]  __do_fault+0xd0/0x33c
[65731.260462] [T32454]  handle_mm_fault+0xf74/0x3fe0
[65731.260486] [T32454]  do_mem_abort+0x54c/0x1b34
[65731.260509] [T32454]  el0_da+0x44/0x94
[65731.260531] [T32454]  el0t_64_sync_handler+0x98/0xb4
[65731.260553] [T32454]  el0t_64_sync+0x198/0x19c

Link: https://lkml.kernel.org/r/20240510100131.1865-1-hailong.liu@oppo.com
Fixes: 9376130c390a ("mm/vmalloc: add support for __GFP_NOFAIL")
Signed-off-by: Hailong.Liu &lt;hailong.liu@oppo.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Suggested-by: Barry Song &lt;21cnbao@gmail.com&gt;
Reported-by: Oven &lt;liyangouwen1@oppo.com&gt;
Reviewed-by: Barry Song &lt;baohua@kernel.org&gt;
Reviewed-by: Uladzislau Rezki (Sony) &lt;urezki@gmail.com&gt;
Cc: Chao Yu &lt;chao@kernel.org&gt;
Cc: Christoph Hellwig &lt;hch@infradead.org&gt;
Cc: Gao Xiang &lt;xiang@kernel.org&gt;
Cc: Lorenzo Stoakes &lt;lstoakes@gmail.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
</feed>
