linux-stable.git/mm/oom_kill.c, branch linux-4.3.y

mm/oom_kill.c: reverse the order of setting TIF_MEMDIE and sending SIGKILL

2016-02-19T22:28:24+00:00

commit 426fb5e72d92b868912e47a1e3ca2df6eabc3872 upstream.

It was confirmed that a local unprivileged user can consume all memory
reserves and hang the system by exploiting the time lag between the OOM
killer setting TIF_MEMDIE on an OOM victim and sending SIGKILL to that
victim.  The lag exists because printk() inside the for_each_process()
loop in oom_kill_process() can take many seconds when many thread groups
share the same memory.

Before starting oom-depleter process:

    Node 0 DMA: 3*4kB (UM) 6*8kB (U) 4*16kB (UEM) 0*32kB 0*64kB 1*128kB (M) 2*256kB (EM) 2*512kB (UE) 2*1024kB (EM) 1*2048kB (E) 1*4096kB (M) = 9980kB
    Node 0 DMA32: 31*4kB (UEM) 27*8kB (UE) 32*16kB (UE) 13*32kB (UE) 14*64kB (UM) 7*128kB (UM) 8*256kB (UM) 8*512kB (UM) 3*1024kB (U) 4*2048kB (UM) 362*4096kB (UM) = 1503220kB

As of invoking the OOM killer:

    Node 0 DMA: 11*4kB (UE) 8*8kB (UEM) 6*16kB (UE) 2*32kB (EM) 0*64kB 1*128kB (U) 3*256kB (UEM) 2*512kB (UE) 3*1024kB (UEM) 1*2048kB (U) 0*4096kB = 7308kB
    Node 0 DMA32: 1049*4kB (UEM) 507*8kB (UE) 151*16kB (UE) 53*32kB (UEM) 83*64kB (UEM) 52*128kB (EM) 25*256kB (UEM) 11*512kB (M) 6*1024kB (UM) 1*2048kB (M) 0*4096kB = 44556kB

Between the thread group leader getting TIF_MEMDIE and receiving SIGKILL:

    Node 0 DMA: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB
    Node 0 DMA32: 0*4kB 0*8kB 0*16kB 0*32kB 0*64kB 0*128kB 0*256kB 0*512kB 0*1024kB 0*2048kB 0*4096kB = 0kB

The oom-depleter's thread group leader started memset() in user space
after the OOM killer set TIF_MEMDIE on it.  Because TIF_MEMDIE grants
ALLOC_NO_WATERMARKS, the leader was free to drain the memory reserves
through that memset() until SIGKILL was delivered.  If SIGKILL is
delivered before TIF_MEMDIE is set, the oom-depleter can terminate
without touching memory reserves.

Although the possibility of hitting this time lag is very small for 3.19
and earlier kernels because TIF_MEMDIE is set immediately before sending
SIGKILL, preemption or long interrupts (an extreme example is SysRq-t) can
step in between the two operations and allow memory allocations which are
not needed for terminating the OOM victim.
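
The effect of the ordering can be illustrated with a small user-space
model.  This is only a sketch, not the kernel code: struct toy_task, its
fields, and every toy_* function are hypothetical names, and the rule
"reserves can be abused while the flag is set and no SIGKILL is pending"
is a simplification of the behaviour described above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical user-space model of an OOM victim; not a kernel structure. */
struct toy_task {
	bool memdie;          /* models TIF_MEMDIE */
	bool sigkill_pending; /* models a queued SIGKILL */
};

/*
 * Simplified rule: the victim can keep allocating from the reserves on
 * behalf of user space only while the flag is set and it has not yet been
 * told to die.
 */
static bool toy_can_abuse_reserves(const struct toy_task *t)
{
	assert(t != NULL);
	return t->memdie && !t->sigkill_pending;
}

/* Old order: set the flag, do slow work, then signal. */
static bool toy_kill_flag_first(struct toy_task *t)
{
	bool window;

	t->memdie = true;
	/* A slow printk() loop would run here in the scenario above. */
	window = toy_can_abuse_reserves(t);
	t->sigkill_pending = true;
	return window;
}

/* New order: signal first, then set the flag. */
static bool toy_kill_signal_first(struct toy_task *t)
{
	bool window;

	t->sigkill_pending = true;
	window = toy_can_abuse_reserves(t);
	t->memdie = true;
	return window || toy_can_abuse_reserves(t);
}
```

In this model toy_kill_flag_first() reports an abuse window and
toy_kill_signal_first() does not, while both leave the task with the flag
set and the signal pending.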

Fixes: 83363b917a29 ("oom: make sure that TIF_MEMDIE is set under task_lock")
Signed-off-by: Tetsuo Handa 
Acked-by: Michal Hocko 
Cc: David Rientjes 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

mm, oom: remove unnecessary variable

2015-09-08T22:35:28+00:00

The "killed" variable in out_of_memory() can be removed since the call to
oom_kill_process() where we should block to allow the process time to
exit is obvious.

Signed-off-by: David Rientjes 
Acked-by: Michal Hocko 
Cc: Sergey Senozhatsky 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom: do not panic for oom kills triggered from sysrq

2015-09-08T22:35:28+00:00

Sysrq+f is used to kill a process either for debug or when the VM is
otherwise unresponsive.

It is not intended to trigger a panic when no process may be killed.

Avoid panicking the system for sysrq+f when no processes are killed.
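
The resulting decision can be sketched as follows.  This is a
user-space illustration under stated assumptions, not the kernel
function: the enum, toy_oom_outcome(), and the from_sysrq parameter are
hypothetical names introduced here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical outcomes of one OOM killer invocation. */
enum toy_oom_result {
	TOY_KILLED,          /* a victim was found and killed */
	TOY_NOTHING_TO_KILL, /* no victim, but the system keeps running */
	TOY_PANIC            /* no victim on a genuine OOM */
};

/*
 * victim is NULL when no killable process was found.  from_sysrq is an
 * assumed flag saying the kill was requested by the operator rather than
 * triggered by a failing allocation.
 */
static enum toy_oom_result toy_oom_outcome(const void *victim, bool from_sysrq)
{
	if (victim != NULL)
		return TOY_KILLED;
	if (from_sysrq)
		return TOY_NOTHING_TO_KILL;
	return TOY_PANIC;
}
```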

Signed-off-by: David Rientjes 
Suggested-by: Michal Hocko 
Cc: Sergey Senozhatsky 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom: pass an oom order of -1 when triggered by sysrq

2015-09-08T22:35:28+00:00

The force_kill member of struct oom_control isn't needed if an order of -1
is used instead.  This is the same as order == -1 in struct
compact_control which requires full memory compaction.

This patch introduces no functional change.
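
A minimal sketch of the idea, assuming a cut-down context structure:
struct toy_oom_control, TOY_SYSRQ_ORDER, and toy_is_forced_kill() are
hypothetical names, and only the "order of -1 means forced" convention
comes from the description above.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Sentinel order standing in for "requested via sysrq" (per the text). */
#define TOY_SYSRQ_ORDER (-1)

/* Hypothetical, cut-down OOM context; the real struct has more members. */
struct toy_oom_control {
	int order; /* allocation order, or TOY_SYSRQ_ORDER */
};

/* Replaces a separate boolean: the sentinel order carries the same fact. */
static bool toy_is_forced_kill(const struct toy_oom_control *oc)
{
	assert(oc != NULL);
	return oc->order == TOY_SYSRQ_ORDER;
}
```

Because no real allocation has a negative order, the sentinel cannot
collide with a value coming from the allocator.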

Signed-off-by: David Rientjes 
Cc: Sergey Senozhatsky 
Cc: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom: organize oom context into struct

2015-09-08T22:35:28+00:00

There are essential elements to an oom context that are passed around to
multiple functions.

Organize these elements into a new struct, struct oom_control, that
specifies the context for an oom condition.

This patch introduces no functional change.
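
The shape of such a refactoring can be sketched in user space.  The
member list below is an illustrative assumption rather than a copy of the
kernel's struct oom_control, and the policy inside the helpers is an
arbitrary placeholder; the point is only that both forms compute the same
result.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical context; member names are assumptions for illustration. */
struct toy_oom_context {
	unsigned int gfp_mask; /* flags of the failing allocation */
	int order;             /* order of the failing allocation */
	bool force_kill;       /* kill was requested explicitly */
};

/* Before: each helper repeats the same argument list. */
static bool toy_should_kill_args(unsigned int gfp_mask, int order,
				 bool force_kill)
{
	(void)gfp_mask; /* unused by this placeholder policy */
	return force_kill || order <= 3;
}

/* After: one pointer carries the whole context. */
static bool toy_should_kill(const struct toy_oom_context *ctx)
{
	assert(ctx != NULL);
	return ctx->force_kill || ctx->order <= 3;
}
```

Adding a new element to the context then means touching one struct
definition instead of every function signature along the call chain.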

Signed-off-by: David Rientjes 
Acked-by: Michal Hocko 
Cc: Sergey Senozhatsky 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/oom_kill.c: print points as unsigned int

2015-06-25T00:49:44+00:00

In oom_kill_process(), the variable 'points' is unsigned int.  Print it as
such.
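
A user-space illustration of why the conversion specifier matters,
using snprintf() in place of the kernel's printk(); toy_format_score()
and the message text are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>
#include <string.h>

/*
 * Format an unsigned score with the matching %u specifier.  Passing an
 * unsigned int that does not fit in int to %d is undefined behaviour in
 * ISO C and typically shows up as a negative number.
 */
static int toy_format_score(char *buf, size_t len, unsigned int points)
{
	assert(buf != NULL);
	return snprintf(buf, len, "score %u", points);
}
```

The example assumes a platform where unsigned int is at least 32 bits
wide, so that a value such as 3000000000 is representable.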

Signed-off-by: Wang Long 
Acked-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: oom_kill: simplify OOM killer locking

2015-06-25T00:49:43+00:00

The zonelist locking and the oom_sem are two overlapping locks that are
used to serialize global OOM killing against different things.

The historical zonelist locking serializes OOM kills from allocations with
overlapping zonelists against each other to prevent killing more tasks
than necessary in the same memory domain.  Only when neither tasklists nor
zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
bound to separate nodes) are OOM kills allowed to execute in parallel.

The younger oom_sem is a read-write lock to serialize OOM killing against
the PM code trying to disable the OOM killer altogether.

However, the OOM killer is a fairly cold error path; there is really no
reason to optimize for highly performant and concurrent OOM kills.  And
the oom_sem is just flat-out redundant.

Replace both locking schemes with a single global mutex serializing OOM
kills regardless of context.
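
A user-space sketch of the single-lock scheme, using a POSIX mutex in
place of a kernel mutex.  All toy_* names and the counters are
hypothetical; the counters exist only to show that at most one "kill"
runs at a time.

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>

#define TOY_MAX_THREADS 8
#define TOY_KILLS_PER_THREAD 1000

/* One global lock serializes every OOM kill, whatever its origin. */
static pthread_mutex_t toy_oom_lock = PTHREAD_MUTEX_INITIALIZER;

static int toy_in_progress;    /* kills currently inside the lock */
static int toy_max_concurrent; /* highest value toy_in_progress reached */
static int toy_total_kills;    /* kills completed so far */

static void toy_out_of_memory(void)
{
	pthread_mutex_lock(&toy_oom_lock);
	toy_in_progress++;
	if (toy_in_progress > toy_max_concurrent)
		toy_max_concurrent = toy_in_progress;
	toy_total_kills++; /* stands in for selecting and killing a victim */
	toy_in_progress--;
	pthread_mutex_unlock(&toy_oom_lock);
}

static void *toy_allocator(void *arg)
{
	(void)arg;
	for (int i = 0; i < TOY_KILLS_PER_THREAD; i++)
		toy_out_of_memory();
	return NULL;
}

/* Run nthreads concurrent allocators; return the peak concurrency seen. */
static int toy_run(int nthreads)
{
	pthread_t tids[TOY_MAX_THREADS];

	assert(nthreads > 0 && nthreads <= TOY_MAX_THREADS);
	for (int i = 0; i < nthreads; i++)
		pthread_create(&tids[i], NULL, toy_allocator, NULL);
	for (int i = 0; i < nthreads; i++)
		pthread_join(tids[i], NULL);
	return toy_max_concurrent;
}
```

The cost of the simplification is that unrelated kills can no longer
proceed in parallel, which the text above argues is acceptable on such a
cold path.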

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Acked-by: David Rientjes 
Cc: Tetsuo Handa 
Cc: Andrea Arcangeli 
Cc: Dave Chinner 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: oom_kill: remove unnecessary locking in exit_oom_victim()

2015-06-25T00:49:43+00:00

Disabling the OOM killer needs to exclude allocators from entering, not
existing victims from exiting.

Right now the only waiter is suspend code, which achieves quiescence by
disabling the OOM killer.  But later on we want to add waits that hold
the lock instead to stop new victims from showing up.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Acked-by: David Rientjes 
Cc: Tetsuo Handa 
Cc: Andrea Arcangeli 
Cc: Dave Chinner 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: oom_kill: generalize OOM progress waitqueue

2015-06-25T00:49:43+00:00

It turns out that the mechanism to wait for exiting OOM victims is less
generic than it looks: it won't issue wakeups unless the OOM killer is
disabled.

The reason this check was added was the thought that, since only the OOM
disabling code would wait on this queue, wakeup operations could be
saved when that specific consumer is known to be absent.

However, this is quite the hand grenade.  Later attempts to reuse the
waitqueue for other purposes will lead to completely unexpected bugs, and
the failure mode will appear seemingly illogical.  Generally, providers
shouldn't make unnecessary assumptions about consumers.

This could have been replaced with waitqueue_active(), but it only saves
a few instructions in one of the coldest paths in the kernel.  Simply
remove it.
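
The resulting pattern, a wakeup issued whenever the last victim exits
and regardless of who might be waiting, can be sketched with a POSIX
condition variable standing in for the kernel waitqueue.  Every toy_*
name is hypothetical, and toy_wakeups exists only to make the behaviour
observable.

```c
#include <assert.h>
#include <pthread.h>

static pthread_mutex_t toy_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t toy_victims_wait = PTHREAD_COND_INITIALIZER;
static int toy_victims; /* victims that have not exited yet */
static int toy_wakeups; /* number of wakeups issued, for illustration */

static void toy_mark_victim(void)
{
	pthread_mutex_lock(&toy_lock);
	toy_victims++;
	pthread_mutex_unlock(&toy_lock);
}

/* Wake waiters when the last victim is gone, without asking who waits. */
static void toy_exit_victim(void)
{
	pthread_mutex_lock(&toy_lock);
	assert(toy_victims > 0);
	if (--toy_victims == 0) {
		pthread_cond_broadcast(&toy_victims_wait);
		toy_wakeups++;
	}
	pthread_mutex_unlock(&toy_lock);
}

/* Any consumer may wait here; the provider makes no assumption about it. */
static void toy_wait_for_victims(void)
{
	pthread_mutex_lock(&toy_lock);
	while (toy_victims > 0)
		pthread_cond_wait(&toy_victims_wait, &toy_lock);
	pthread_mutex_unlock(&toy_lock);
}
```

Broadcasting on a condition variable with no waiters is harmless, which
mirrors the argument that the saved instructions were not worth the
hidden coupling.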

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Acked-by: David Rientjes 
Cc: Tetsuo Handa 
Cc: Andrea Arcangeli 
Cc: Dave Chinner 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear

2015-06-25T00:49:43+00:00

exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody else
can clear it concurrently.  Use clear_thread_flag() directly.
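
The difference between the two operations can be sketched with C11
atomics in user space.  TOY_MEMDIE and both helpers are hypothetical
stand-ins, not the kernel's thread-flag API.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

#define TOY_MEMDIE (1u << 0) /* hypothetical stand-in for TIF_MEMDIE */

/* Clears the bit and reports whether it had been set. */
static bool toy_test_and_clear(atomic_uint *flags)
{
	unsigned int old;

	assert(flags != NULL);
	old = atomic_fetch_and(flags, ~TOY_MEMDIE);
	return (old & TOY_MEMDIE) != 0;
}

/* Clears the bit; the caller already knows it is set. */
static void toy_clear(atomic_uint *flags)
{
	assert(flags != NULL);
	atomic_fetch_and(flags, ~TOY_MEMDIE);
}
```

When the caller can prove the bit is set and nothing else clears it, the
return value of the first form carries no information, so the plain
clear states the intent more directly.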

Signed-off-by: Johannes Weiner 
Acked-by: David Rientjes 
Acked-by: Michal Hocko 
Cc: Tetsuo Handa 
Cc: Andrea Arcangeli 
Cc: Dave Chinner 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds