linux-stable.git/mm, branch v5.4.166

mm: bdi: initialize bdi_min_ratio when bdi is unregistered

2021-12-14T13:49:00+00:00

commit 3c376dfafbf7a8ea0dea212d095ddd83e93280bb upstream.

Initialize min_ratio if it is set during bdi unregistration.  This can
prevent problems that may occur a when bdi is removed without resetting
min_ratio.

For example.
1) insert external sdcard
2) set external sdcard's min_ratio 70
3) remove external sdcard without setting min_ratio 0
4) insert external sdcard
5) set external sdcard's min_ratio 70 << error occur(can't set)

Because when an sdcard is removed, the present bdi_min_ratio value will
remain.  Currently, the only way to reset bdi_min_ratio is to reboot.

[akpm@linux-foundation.org: tweak comment and coding style]

Link: https://lkml.kernel.org/r/20211021161942.5983-1-mj0123.lee@samsung.com
Signed-off-by: Manjong Lee 
Acked-by: Peter Zijlstra (Intel) 
Cc: Changheun Lee 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Matthew Wilcox 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

hugetlbfs: flush TLBs correctly after huge_pmd_unshare

2021-11-26T09:47:23+00:00

commit a4a118f2eead1d6c49e00765de89878288d4b890 upstream.

When __unmap_hugepage_range() calls to huge_pmd_unshare() succeed, a TLB
flush is missing.  This TLB flush must be performed before releasing the
i_mmap_rwsem, in order to prevent an unshared PMDs page from being
released and reused before the TLB flush took place.

Arguably, a comprehensive solution would use mmu_gather interface to
batch the TLB flushes and the PMDs page release, however it is not an
easy solution: (1) try_to_unmap_one() and try_to_migrate_one() also call
huge_pmd_unshare() and they cannot use the mmu_gather interface; and (2)
deferring the release of the page reference for the PMDs page until
after i_mmap_rwsem is dropeed can confuse huge_pmd_unshare() into
thinking PMDs are shared when they are not.

Fix __unmap_hugepage_range() by adding the missing TLB flush, and
forcing a flush when unshare is successful.

Fixes: 24669e58477e ("hugetlb: use mmu_gather instead of a temporary linked list for accumulating pages)" # 3.6
Signed-off-by: Nadav Amit 
Reviewed-by: Mike Kravetz 
Cc: Aneesh Kumar K.V 
Cc: KAMEZAWA Hiroyuki 
Cc: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm: kmemleak: slob: respect SLAB_NOLEAKTRACE flag

2021-11-26T09:47:21+00:00

commit 34dbc3aaf5d9e89ba6cc5e24add9458c21ab1950 upstream.

When kmemleak is enabled for SLOB, system does not boot and does not
print anything to the console.  At the very early stage in the boot
process we hit infinite recursion from kmemleak_init() and eventually
kernel crashes.

kmemleak_init() specifies SLAB_NOLEAKTRACE for KMEM_CACHE(), but
kmem_cache_create_usercopy() removes it because CACHE_CREATE_MASK is not
valid for SLOB.

Let's fix CACHE_CREATE_MASK and make kmemleak work with SLOB

Link: https://lkml.kernel.org/r/20211115020850.3154366-1-rkovhaev@gmail.com
Fixes: d8843922fba4 ("slab: Ignore internal flags in cache creation")
Signed-off-by: Rustam Kovhaev 
Acked-by: Vlastimil Babka 
Reviewed-by: Muchun Song 
Cc: Christoph Lameter 
Cc: Pekka Enberg 
Cc: David Rientjes 
Cc: Joonsoo Kim 
Cc: Catalin Marinas 
Cc: Greg Kroah-Hartman 
Cc: Glauber Costa 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm, oom: do not trigger out_of_memory from the #PF

2021-11-17T08:48:50+00:00

commit 60e2793d440a3ec95abb5d6d4fc034a4b480472d upstream.

Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory.  This can happen for 2
different reasons.  a) Memcg is out of memory and we rely on
mem_cgroup_oom_synchronize to perform the memcg OOM handling or b)
normal allocation fails.

The latter is quite problematic because allocation paths already trigger
out_of_memory and the page allocator tries really hard to not fail
allocations.  Anyway, if the OOM killer has been already invoked there
is no reason to invoke it again from the #PF path.  Especially when the
OOM condition might be gone by that time and we have no way to find out
other than allocate.

Moreover if the allocation failed and the OOM killer hasn't been invoked
then we are unlikely to do the right thing from the #PF context because
we have already lost the allocation context and restictions and
therefore might oom kill a task from a different NUMA domain.

This all suggests that there is no legitimate reason to trigger
out_of_memory from pagefault_out_of_memory so drop it.  Just to be sure
that no #PF path returns with VM_FAULT_OOM without allocation print a
warning that this is happening before we restart the #PF.

[VvS: #PF allocation can hit into limit of cgroup v1 kmem controller.
This is a local problem related to memcg, however, it causes unnecessary
global OOM kills that are repeated over and over again and escalate into a
real disaster.  This has been broken since kmem accounting has been
introduced for cgroup v1 (3.8).  There was no kmem specific reclaim for
the separate limit so the only way to handle kmem hard limit was to return
with ENOMEM.  In upstream the problem will be fixed by removing the
outdated kmem limit, however stable and LTS kernels cannot do it and are
still affected.  This patch fixes the problem and should be backported
into stable/LTS.]

Link: https://lkml.kernel.org/r/f5fd8dd8-0ad4-c524-5f65-920b01972a42@virtuozzo.com
Signed-off-by: Michal Hocko 
Signed-off-by: Vasily Averin 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
Cc: Tetsuo Handa 
Cc: Uladzislau Rezki 
Cc: Vladimir Davydov 
Cc: Vlastimil Babka 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm, oom: pagefault_out_of_memory: don't force global OOM for dying tasks

2021-11-17T08:48:50+00:00

commit 0b28179a6138a5edd9d82ad2687c05b3773c387b upstream.

Patch series "memcg: prohibit unconditional exceeding the limit of dying tasks", v3.

Memory cgroup charging allows killed or exiting tasks to exceed the hard
limit.  It can be misused and allowed to trigger global OOM from inside
a memcg-limited container.  On the other hand if memcg fails allocation,
called from inside #PF handler it triggers global OOM from inside
pagefault_out_of_memory().

To prevent these problems this patchset:
 (a) removes execution of out_of_memory() from
     pagefault_out_of_memory(), becasue nobody can explain why it is
     necessary.
 (b) allow memcg to fail allocation of dying/killed tasks.

This patch (of 3):

Any allocation failure during the #PF path will return with VM_FAULT_OOM
which in turn results in pagefault_out_of_memory which in turn executes
out_out_memory() and can kill a random task.

An allocation might fail when the current task is the oom victim and
there are no memory reserves left.  The OOM killer is already handled at
the page allocator level for the global OOM and at the charging level
for the memcg one.  Both have much more information about the scope of
allocation/charge request.  This means that either the OOM killer has
been invoked properly and didn't lead to the allocation success or it
has been skipped because it couldn't have been invoked.  In both cases
triggering it from here is pointless and even harmful.

It makes much more sense to let the killed task die rather than to wake
up an eternally hungry oom-killer and send him to choose a fatter victim
for breakfast.

Link: https://lkml.kernel.org/r/0828a149-786e-7c06-b70a-52d086818ea3@virtuozzo.com
Signed-off-by: Vasily Averin 
Suggested-by: Michal Hocko 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Cc: Mel Gorman 
Cc: Roman Gushchin 
Cc: Shakeel Butt 
Cc: Tetsuo Handa 
Cc: Uladzislau Rezki 
Cc: Vladimir Davydov 
Cc: Vlastimil Babka 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm/zsmalloc.c: close race window between zs_pool_dec_isolated() and zs_unregister_migration()

2021-11-17T08:48:47+00:00

[ Upstream commit afe8605ca45424629fdddfd85984b442c763dc47 ]

There is one possible race window between zs_pool_dec_isolated() and
zs_unregister_migration() because wait_for_isolated_drain() checks the
isolated count without holding class->lock and there is no order inside
zs_pool_dec_isolated().  Thus the below race window could be possible:

  zs_pool_dec_isolated		zs_unregister_migration
    check pool->destroying != 0
				  pool->destroying = true;
				  smp_mb();
				  wait_for_isolated_drain()
				    wait for pool->isolated_pages == 0
    atomic_long_dec(&pool->isolated_pages);
    atomic_long_read(&pool->isolated_pages) == 0

Since we observe the pool->destroying (false) before atomic_long_dec()
for pool->isolated_pages, waking pool->migration_wait up is missed.

Fix this by ensure checking pool->destroying happens after the
atomic_long_dec(&pool->isolated_pages).

Link: https://lkml.kernel.org/r/20210708115027.7557-1-linmiaohe@huawei.com
Fixes: 701d678599d0 ("mm/zsmalloc.c: fix race condition in zs_destroy_pool")
Signed-off-by: Miaohe Lin 
Cc: Minchan Kim 
Cc: Sergey Senozhatsky 
Cc: Henry Burns 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

mm, slub: fix potential memoryleak in kmem_cache_open()

2021-10-27T07:54:28+00:00

commit 9037c57681d25e4dcc442d940d6dbe24dd31f461 upstream.

In error path, the random_seq of slub cache might be leaked.  Fix this
by using __kmem_cache_release() to release all the relevant resources.

Link: https://lkml.kernel.org/r/20210916123920.48704-4-linmiaohe@huawei.com
Fixes: 210e7a43fa90 ("mm: SLUB freelist randomization")
Signed-off-by: Miaohe Lin 
Reviewed-by: Vlastimil Babka 
Cc: Andrey Konovalov 
Cc: Andrey Ryabinin 
Cc: Bharata B Rao 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Faiyaz Mohammed 
Cc: Greg Kroah-Hartman 
Cc: Joonsoo Kim 
Cc: Kees Cook 
Cc: Pekka Enberg 
Cc: Roman Gushchin 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm, slub: fix mismatch between reconstructed freelist depth and cnt

2021-10-27T07:54:28+00:00

commit 899447f669da76cc3605665e1a95ee877bc464cc upstream.

If object's reuse is delayed, it will be excluded from the reconstructed
freelist.  But we forgot to adjust the cnt accordingly.  So there will
be a mismatch between reconstructed freelist depth and cnt.  This will
lead to free_debug_processing() complaining about freelist count or a
incorrect slub inuse count.

Link: https://lkml.kernel.org/r/20210916123920.48704-3-linmiaohe@huawei.com
Fixes: c3895391df38 ("kasan, slub: fix handling of kasan_slab_free hook")
Signed-off-by: Miaohe Lin 
Reviewed-by: Vlastimil Babka 
Cc: Andrey Konovalov 
Cc: Andrey Ryabinin 
Cc: Bharata B Rao 
Cc: Christoph Lameter 
Cc: David Rientjes 
Cc: Faiyaz Mohammed 
Cc: Greg Kroah-Hartman 
Cc: Joonsoo Kim 
Cc: Kees Cook 
Cc: Pekka Enberg 
Cc: Roman Gushchin 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm/memory_hotplug: use "unsigned long" for PFN in zone_for_pfn_range()

2021-09-22T10:26:43+00:00

commit 7cf209ba8a86410939a24cb1aeb279479a7e0ca6 upstream.

Patch series "mm/memory_hotplug: preparatory patches for new online policy and memory"

These are all cleanups and one fix previously sent as part of [1]:
[PATCH v1 00/12] mm/memory_hotplug: "auto-movable" online policy and memory
groups.

These patches make sense even without the other series, therefore I pulled
them out to make the other series easier to digest.

[1] https://lkml.kernel.org/r/20210607195430.48228-1-david@redhat.com

This patch (of 4):

Checkpatch complained on a follow-up patch that we are using "unsigned"
here, which defaults to "unsigned int" and checkpatch is correct.

As we will search for a fitting zone using the wrong pfn, we might end
up onlining memory to one of the special kernel zones, such as ZONE_DMA,
which can end badly as the onlined memory does not satisfy properties of
these zones.

Use "unsigned long" instead, just as we do in other places when handling
PFNs.  This can bite us once we have physical addresses in the range of
multiple TB.

Link: https://lkml.kernel.org/r/20210712124052.26491-2-david@redhat.com
Fixes: e5e689302633 ("mm, memory_hotplug: display allowed zones in the preferred ordering")
Signed-off-by: David Hildenbrand 
Reviewed-by: Pankaj Gupta 
Reviewed-by: Muchun Song 
Reviewed-by: Oscar Salvador 
Cc: David Hildenbrand 
Cc: Vitaly Kuznetsov 
Cc: "Michael S. Tsirkin" 
Cc: Jason Wang 
Cc: Pankaj Gupta 
Cc: Wei Yang 
Cc: Michal Hocko 
Cc: Dan Williams 
Cc: Anshuman Khandual 
Cc: Dave Hansen 
Cc: Vlastimil Babka 
Cc: Mike Rapoport 
Cc: "Rafael J. Wysocki" 
Cc: Len Brown 
Cc: Pavel Tatashin 
Cc: Heiko Carstens 
Cc: Michael Ellerman 
Cc: Catalin Marinas 
Cc: virtualization@lists.linux-foundation.org
Cc: Andy Lutomirski 
Cc: "Aneesh Kumar K.V" 
Cc: Anton Blanchard 
Cc: Ard Biesheuvel 
Cc: Baoquan He 
Cc: Benjamin Herrenschmidt 
Cc: Borislav Petkov 
Cc: Christian Borntraeger 
Cc: Christophe Leroy 
Cc: Dave Jiang 
Cc: "H. Peter Anvin" 
Cc: Ingo Molnar 
Cc: Jia He 
Cc: Joe Perches 
Cc: Kefeng Wang 
Cc: Laurent Dufour 
Cc: Michel Lespinasse 
Cc: Nathan Lynch 
Cc: Nicholas Piggin 
Cc: Paul Mackerras 
Cc: Peter Zijlstra 
Cc: Pierre Morel 
Cc: "Rafael J. Wysocki" 
Cc: Rich Felker 
Cc: Scott Cheloha 
Cc: Sergei Trofimovich 
Cc: Thiago Jung Bauermann 
Cc: Thomas Gleixner 
Cc: Vasily Gorbik 
Cc: Vishal Verma 
Cc: Will Deacon 
Cc: Yoshinori Sato 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: David Hildenbrand 
Signed-off-by: Greg Kroah-Hartman

mm,vmscan: fix divide by zero in get_scan_count

2021-09-22T10:26:37+00:00

commit 32d4f4b782bb8f0ceb78c6b5dc46eb577ae25bf7 upstream.

Commit f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to
proportional memory.low reclaim") introduced a divide by zero corner
case when oomd is being used in combination with cgroup memory.low
protection.

When oomd decides to kill a cgroup, it will force the cgroup memory to
be reclaimed after killing the tasks, by writing to the memory.max file
for that cgroup, forcing the remaining page cache and reclaimable slab
to be reclaimed down to zero.

Previously, on cgroups with some memory.low protection that would result
in the memory being reclaimed down to the memory.low limit, or likely
not at all, having the page cache reclaimed asynchronously later.

With f56ce412a59d the oomd write to memory.max tries to reclaim all the
way down to zero, which may race with another reclaimer, to the point of
ending up with the divide by zero below.

This patch implements the obvious fix.

Link: https://lkml.kernel.org/r/20210826220149.058089c6@imladris.surriel.com
Fixes: f56ce412a59d ("mm: memcontrol: fix occasional OOMs due to proportional memory.low reclaim")
Signed-off-by: Rik van Riel 
Acked-by: Roman Gushchin 
Acked-by: Michal Hocko 
Acked-by: Johannes Weiner 
Acked-by: Chris Down 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman