linux-stable.git/mm/oom_kill.c, branch v4.2

mm/oom_kill.c: print points as unsigned int

2015-06-25T00:49:44+00:00

In oom_kill_process(), the variable 'points' is unsigned int.  Print it as
such.
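
A minimal userspace sketch of why the specifier matters, not the kernel
code itself.  The helper name format_points() and the message text are
assumptions made for illustration.  With %u, a large unsigned score is
printed as stored rather than being reinterpreted as a negative number.

```c
#include <stdio.h>

/*
 * Hypothetical helper: formats an unsigned badness score with the
 * matching %u conversion.  Returns what snprintf() returns.
 */
static int format_points(char *buf, size_t len, unsigned int points)
{
	return snprintf(buf, len, "score %u", points);
}
```

Calling format_points() with a value above INT_MAX, such as 3000000000u,
yields the same digits back, which a %d conversion would not guarantee.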

Signed-off-by: Wang Long 
Acked-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: oom_kill: simplify OOM killer locking

2015-06-25T00:49:43+00:00

The zonelist locking and the oom_sem are two overlapping locks that are
used to serialize global OOM killing against different things.

The historical zonelist locking serializes OOM kills from allocations with
overlapping zonelists against each other to prevent killing more tasks
than necessary in the same memory domain.  Only when neither tasklists nor
zonelists from two concurrent OOM kills overlap (tasks in separate memcgs
bound to separate nodes) are OOM kills allowed to execute in parallel.

The younger oom_sem is a read-write lock to serialize OOM killing against
the PM code trying to disable the OOM killer altogether.

However, the OOM killer is a fairly cold error path; there is really no
reason to optimize for highly performant and concurrent OOM kills.  And
the oom_sem is just flat-out redundant.

Replace both locking schemes with a single global mutex serializing OOM
kills regardless of context.
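
The idea can be sketched in userspace with a pthread mutex.  This is an
illustration under stated assumptions, not the kernel implementation:
the names oom_lock, out_of_memory_sketch() and
oom_killer_disable_sketch() are hypothetical stand-ins, and the "kill"
is reduced to a counter.

```c
#include <pthread.h>
#include <stdbool.h>

/* Hypothetical stand-ins for the global lock and the disable flag. */
static pthread_mutex_t oom_lock = PTHREAD_MUTEX_INITIALIZER;
static bool oom_killer_disabled;
static int kills;

/*
 * Allocator side: every OOM kill, whatever its context, takes the
 * same lock.  Returns true if a kill was performed.
 */
static bool out_of_memory_sketch(void)
{
	bool killed = false;

	pthread_mutex_lock(&oom_lock);
	if (!oom_killer_disabled) {
		kills++;	/* stands in for selecting and killing a victim */
		killed = true;
	}
	pthread_mutex_unlock(&oom_lock);
	return killed;
}

/* PM side: disabling is serialized against kills by the same lock. */
static void oom_killer_disable_sketch(void)
{
	pthread_mutex_lock(&oom_lock);
	oom_killer_disabled = true;
	pthread_mutex_unlock(&oom_lock);
}
```

With one lock covering both paths, a kill either completes before the
disable takes effect or observes the flag and backs off.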

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Acked-by: David Rientjes 
Cc: Tetsuo Handa 
Cc: Andrea Arcangeli 
Cc: Dave Chinner 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: oom_kill: remove unnecessary locking in exit_oom_victim()

2015-06-25T00:49:43+00:00

Disabling the OOM killer needs to exclude allocators from entering, not
existing victims from exiting.

Right now the only waiter is suspend code, which achieves quiescence by
disabling the OOM killer.  But later on we want to add waits that hold
the lock instead to stop new victims from showing up.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Acked-by: David Rientjes 
Cc: Tetsuo Handa 
Cc: Andrea Arcangeli 
Cc: Dave Chinner 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: oom_kill: generalize OOM progress waitqueue

2015-06-25T00:49:43+00:00

It turns out that the mechanism to wait for exiting OOM victims is less
generic than it looks: it won't issue wakeups unless the OOM killer is
disabled.

The reason this check was added was the thought that, since only the OOM
disabling code would wait on this queue, wakeup operations could be
saved when that specific consumer is known to be absent.

However, this is quite the hand grenade.  Later attempts to reuse the
waitqueue for other purposes will lead to completely unexpected bugs, and
the failure mode will appear illogical.  Generally, providers
shouldn't make unnecessary assumptions about consumers.

This could have been replaced with waitqueue_active(), but it only saves
a few instructions in one of the coldest paths in the kernel.  Simply
remove it.
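
A userspace analogue of the resulting behaviour, using a condition
variable in place of the kernel waitqueue.  All names here
(victims_wait, mark_victim_sketch(), exit_victim_sketch(),
wait_for_victims_sketch()) are hypothetical, and the victim count is an
assumption about what the waiters are waiting for.

```c
#include <pthread.h>

static pthread_mutex_t victims_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t victims_wait = PTHREAD_COND_INITIALIZER;
static int oom_victims;

/* A new victim has been selected. */
static void mark_victim_sketch(void)
{
	pthread_mutex_lock(&victims_lock);
	oom_victims++;
	pthread_mutex_unlock(&victims_lock);
}

/*
 * A victim exits.  The wakeup is issued whenever the last victim is
 * gone, with no check on who might be waiting or why.  Returns the
 * number of victims still outstanding.
 */
static int exit_victim_sketch(void)
{
	int left;

	pthread_mutex_lock(&victims_lock);
	left = --oom_victims;
	if (left == 0)
		pthread_cond_broadcast(&victims_wait);
	pthread_mutex_unlock(&victims_lock);
	return left;
}

/* Any consumer may wait for all victims to exit. */
static void wait_for_victims_sketch(void)
{
	pthread_mutex_lock(&victims_lock);
	while (oom_victims > 0)
		pthread_cond_wait(&victims_wait, &victims_lock);
	pthread_mutex_unlock(&victims_lock);
}
```

Because the provider side makes no assumption about consumers, a second
kind of waiter can be added later without touching exit_victim_sketch().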

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Acked-by: David Rientjes 
Cc: Tetsuo Handa 
Cc: Andrea Arcangeli 
Cc: Dave Chinner 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: oom_kill: switch test-and-clear of known TIF_MEMDIE to clear

2015-06-25T00:49:43+00:00

exit_oom_victim() already knows that TIF_MEMDIE is set, and nobody else
can clear it concurrently.  Use clear_thread_flag() directly.
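
The difference between the two operations can be sketched with C11
atomics.  The helpers below are hypothetical userspace stand-ins, not
the kernel's bitops, and the bit position chosen for TIF_MEMDIE_BIT is
an assumption.  In this sketch the only difference is whether the old
value is reported back to the caller.

```c
#include <stdatomic.h>
#include <stdbool.h>

#define TIF_MEMDIE_BIT (1u << 0)	/* hypothetical bit position */

/* Clears the bit and reports whether it was set beforehand. */
static bool test_and_clear_flag(atomic_uint *flags, unsigned int bit)
{
	unsigned int old = atomic_fetch_and(flags, ~bit);

	return (old & bit) != 0;
}

/* Clears the bit; the caller already knows it was set. */
static void clear_flag(atomic_uint *flags, unsigned int bit)
{
	atomic_fetch_and(flags, ~bit);
}
```

When the caller already knows the flag is set and nothing else can clear
it, the return value of the test-and-clear variant carries no
information, so the plain clear expresses the intent more directly.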

Signed-off-by: Johannes Weiner 
Acked-by: David Rientjes 
Acked-by: Michal Hocko 
Cc: Tetsuo Handa 
Cc: Andrea Arcangeli 
Cc: Dave Chinner 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: oom_kill: clean up victim marking and exiting interfaces

2015-06-25T00:49:43+00:00

Rename unmark_oom_victim() to exit_oom_victim().  Marking and unmarking
are related in functionality, but the interface is not symmetrical at
all: one is an internal OOM killer function used during the killing, the
other is for an OOM victim to signal its own death on exit later on.
This has locking implications, see follow-up changes.

While at it, rename mark_tsk_oom_victim() to mark_oom_victim(), which
is easier on the eye.

Signed-off-by: Johannes Weiner 
Acked-by: David Rientjes 
Acked-by: Michal Hocko 
Cc: Tetsuo Handa 
Cc: Andrea Arcangeli 
Cc: Dave Chinner 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: oom_kill: remove unnecessary locking in oom_enable()

2015-06-25T00:49:43+00:00

Setting oom_killer_disabled to false is atomic, so there is no need for
further synchronization with ongoing allocations trying to OOM-kill.

Signed-off-by: Johannes Weiner 
Acked-by: Michal Hocko 
Acked-by: David Rientjes 
Cc: Tetsuo Handa 
Cc: Andrea Arcangeli 
Cc: Dave Chinner 
Cc: Vlastimil Babka 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm/oom_kill.c: fix typo in comment

2015-04-15T23:35:16+00:00

Alter 'taks' -> 'task'

Signed-off-by: Yaowei Bai 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

memcg: print cgroup information when system panics due to panic_on_oom

2015-04-14T23:49:05+00:00

If the kernel panics due to OOM caused by a cgroup reaching its limit
while 'compulsory panic_on_oom' is enabled, we only see that the OOM
happened because "compulsory panic_on_oom is enabled", which doesn't
distinguish between mempolicy and memcg.  Dumping system-wide
information in that case is plain wrong and more confusing.  This patch
prints information about the cgroup whose limit triggered the panic.

Signed-off-by: Balasubramani Vivekanandan 
Acked-by: Michal Hocko 
Cc: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: account pmd page tables to the process

2015-02-12T01:06:04+00:00

Dave noticed that an unprivileged process can allocate a significant
amount of memory -- >500 MiB on x86_64 -- and stay unnoticed by the
oom-killer and memory cgroup.  The trick is to allocate a lot of PMD page
tables.  The Linux kernel doesn't account PMD tables to the process, only
PTE tables.

The test case below uses a few tricks to allocate a lot of PMD page tables
while keeping VmRSS and VmPTE low.  oom_score for the process will be 0.

	#include <errno.h>
	#include <stdio.h>
	#include <stdlib.h>
	#include <unistd.h>
	#include <sys/mman.h>
	#include <sys/prctl.h>

	#define PUD_SIZE (1UL << 30)
	#define PMD_SIZE (1UL << 21)

	#define NR_PUD 130000

	int main(void)
	{
		char *addr = NULL;
		unsigned long i;

		/* Disable THP so each fault populates regular page tables. */
		prctl(PR_SET_THP_DISABLE, 1, 0, 0, 0);
		for (i = 0; i < NR_PUD ; i++) {
			addr = mmap(addr + PUD_SIZE, PUD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE, -1, 0);
			if (addr == MAP_FAILED) {
				perror("mmap");
				break;
			}
			*addr = 'x';
			munmap(addr, PMD_SIZE);
			/* Re-map the hole at the same address and check the result. */
			if (mmap(addr, PMD_SIZE, PROT_WRITE|PROT_READ,
					MAP_ANONYMOUS|MAP_PRIVATE|MAP_FIXED, -1, 0) == MAP_FAILED) {
				perror("re-mmap");
				exit(1);
			}
		}
		printf("PID %d consumed %lu KiB in PMD page tables\n",
				getpid(), i * 4096 >> 10);
		return pause();
	}
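
The figure printed by the test program follows from simple arithmetic,
sketched here with a hypothetical helper pmd_tables_kib().  It assumes,
as the program does, that each PMD table occupies one 4096-byte page.

```c
#define PMD_TABLE_BYTES 4096UL	/* assumed: one 4 KiB page per PMD table */

/* Converts a number of PMD tables to KiB, as the printf() above does. */
static unsigned long pmd_tables_kib(unsigned long nr_tables)
{
	/* '*' binds tighter than '>>', so this is (n * 4096) >> 10. */
	return nr_tables * PMD_TABLE_BYTES >> 10;
}
```

For NR_PUD = 130000 this gives 520000 KiB, roughly 508 MiB, which
matches the ">500 MiB" quoted above.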

The patch addresses the issue by accounting PMD tables to the process the
same way we account PTE tables.

The main places where PMD tables are accounted are __pmd_alloc() and
free_pmd_range(). But there are a few corner cases:

 - HugeTLB can share PMD page tables. The patch handles this by accounting
   the table to all processes that share it.

 - x86 PAE pre-allocates a few PMD tables on fork.

 - Architectures with FIRST_USER_ADDRESS > 0. We need to adjust sanity
   check on exit(2).

Accounting only happens on configurations where the PMD page table level
is present (PMD is not folded).  As with nr_ptes, we use a per-mm counter.
The counter value is used by the oom-killer to calculate the baseline for
the badness score.
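
A simplified model of how a per-mm PMD counter could feed the badness
baseline.  The struct mm_sketch, its field names other than nr_ptes,
and the exact set of terms being summed are assumptions made for
illustration; only the use of a per-mm counter in the baseline comes
from the description above.

```c
/* Hypothetical per-mm counters, all measured in pages. */
struct mm_sketch {
	unsigned long rss_pages;	/* resident pages */
	unsigned long swap_entries;	/* swapped-out pages */
	unsigned long nr_ptes;		/* PTE page tables */
	unsigned long nr_pmds;		/* PMD page tables, newly accounted */
};

/* Baseline for the badness score: memory attributable to the mm. */
static unsigned long badness_baseline(const struct mm_sketch *mm)
{
	return mm->rss_pages + mm->swap_entries +
	       mm->nr_ptes + mm->nr_pmds;
}
```

In this model, a process that holds 130000 PMD tables but almost no
resident memory no longer has a baseline close to zero.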

Signed-off-by: Kirill A. Shutemov 
Reported-by: Dave Hansen 
Cc: Hugh Dickins 
Reviewed-by: Cyrill Gorcunov 
Cc: Pavel Emelyanov 
Cc: David Rientjes 
Tested-by: Sedat Dilek 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds