linux-stable.git/mm/slab.c, branch linux-3.4.y

cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags

2014-12-01T10:02:38+00:00

commit 2ad654bc5e2b211e92f66da1d819e47d79a866f0 upstream.

When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This should be done using atomic bitops, but currently we don't,
which is broken.

Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.

Here's the full report:
https://lkml.org/lkml/2014/9/19/230

To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.

v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.

Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Miao Xie 
Cc: Kees Cook 
Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
Reported-by: Tetsuo Handa 
Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo 
[lizf: Backported to 3.4:
 - adjust context
 - check current->flags & PF_MEMPOLICY rather than current->mempolicy]

slab/mempolicy: always use local policy from interrupt context

2014-09-25T03:49:17+00:00

commit e7b691b085fda913830e5280ae6f724b2a63c824 upstream.

slab_node() could access current->mempolicy from interrupt context.
However there's a race condition during exit where the mempolicy
is first freed and then the pointer zeroed.

Using this from interrupts seems bogus anyways. The interrupt
will interrupt a random process and therefore get a random
mempolicy. Many times, this will be idle's, which noone can change.

Just disable this here and always use local for slab
from interrupts. I also cleaned up the callers of slab_node a bit
which always passed the same argument.

I believe the original mempolicy code did that in fact,
so it's likely a regression.

v2: send version with correct logic
v3: simplify. fix typo.
Reported-by: Arun Sharma 
Cc: penberg@kernel.org
Cc: cl@linux.com
Signed-off-by: Andi Kleen 
[tdmackey@twitter.com: Rework control flow based on feedback from
cl@linux.com, fix logic, and cleanup current task_struct reference]
Acked-by: David Rientjes 
Acked-by: Christoph Lameter 
Acked-by: KOSAKI Motohiro 
Signed-off-by: David Mackey 
Signed-off-by: Pekka Enberg 
Signed-off-by: Zefan Li

slab: fix the DEADLOCK issue on l3 alien lock

2012-10-12T20:38:37+00:00

commit 947ca1856a7e60aa6d20536785e6a42dff25aa6e upstream.

DEADLOCK will be report while running a kernel with NUMA and LOCKDEP enabled,
the process of this fake report is:

	   kmem_cache_free()	//free obj in cachep
	-> cache_free_alien()	//acquire cachep's l3 alien lock
	-> __drain_alien_cache()
	-> free_block()
	-> slab_destroy()
	-> kmem_cache_free()	//free slab in cachep->slabp_cache
	-> cache_free_alien()	//acquire cachep->slabp_cache's l3 alien lock

Since the cachep and cachep->slabp_cache's l3 alien are in the same lock class,
fake report generated.

This should not happen since we already have init_lock_keys() which will
reassign the lock class for both l3 list and l3 alien.

However, init_lock_keys() was invoked at a wrong position which is before we
invoke enable_cpucache() on each cache.

Since until set slab_state to be FULL, we won't invoke enable_cpucache()
on caches to build their l3 alien while creating them, so although we invoked
init_lock_keys(), the l3 alien lock class won't change since we don't have
them until invoked enable_cpucache() later.

This patch will invoke init_lock_keys() after we done enable_cpucache()
instead of before to avoid the fake DEADLOCK report.

Michael traced the problem back to a commit in release 3.0.0:

commit 30765b92ada267c5395fc788623cb15233276f5c
Author: Peter Zijlstra 
Date:   Thu Jul 28 23:22:56 2011 +0200

    slab, lockdep: Annotate the locks before using them

    Fernando found we hit the regular OFF_SLAB 'recursion' before we
    annotate the locks, cure this.

    The relevant portion of the stack-trace:

    > [    0.000000]  [] rt_spin_lock+0x50/0x56
    > [    0.000000]  [] __cache_free+0x43/0xc3
    > [    0.000000]  [] kmem_cache_free+0x6c/0xdc
    > [    0.000000]  [] slab_destroy+0x4f/0x53
    > [    0.000000]  [] free_block+0x94/0xc1
    > [    0.000000]  [] do_tune_cpucache+0x10b/0x2bb
    > [    0.000000]  [] enable_cpucache+0x7b/0xa7
    > [    0.000000]  [] kmem_cache_init_late+0x1f/0x61
    > [    0.000000]  [] start_kernel+0x24c/0x363
    > [    0.000000]  [] i386_start_kernel+0xa9/0xaf

    Reported-by: Fernando Lopez-Lezcano 
    Acked-by: Pekka Enberg 
    Signed-off-by: Peter Zijlstra 
    Link: http://lkml.kernel.org/r/1311888176.2617.379.camel@laptop
    Signed-off-by: Ingo Molnar 

The commit moved init_lock_keys() before we build up the alien, so we
failed to reclass it.

Acked-by: Christoph Lameter 
Tested-by: Paul E. McKenney 
Signed-off-by: Michael Wang 
Signed-off-by: Pekka Enberg 
Signed-off-by: Greg Kroah-Hartman

Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux

2012-03-28T22:04:26+00:00

Pull SLAB changes from Pekka Enberg:
 "There's the new kmalloc_array() API, minor fixes and performance
  improvements, but quite honestly, nothing terribly exciting."

* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
  mm: SLAB Out-of-memory diagnostics
  slab: introduce kmalloc_array()
  slub: per cpu partial statistics change
  slub: include include for prefetch
  slub: Do not hold slub_lock when calling sysfs_slab_add()
  slub: prefetch next freelist pointer in slab_alloc()
  slab, cleanup: remove unneeded return

cpuset: mm: reduce large amounts of memory barrier related damage v3

2012-03-22T00:54:59+00:00

Commit c0ff7453bb5c ("cpuset,mm: fix no node to alloc memory when
changing cpuset's mems") wins a super prize for the largest number of
memory barriers entered into fast paths for one commit.

[get|put]_mems_allowed is incredibly heavy with pairs of full memory
barriers inserted into a number of hot paths.  This was detected while
investigating at large page allocator slowdown introduced some time
after 2.6.32.  The largest portion of this overhead was shown by
oprofile to be at an mfence introduced by this commit into the page
allocator hot path.

For extra style points, the commit introduced the use of yield() in an
implementation of what looks like a spinning mutex.

This patch replaces the full memory barriers on both read and write
sides with a sequence counter with just read barriers on the fast path
side.  This is much cheaper on some architectures, including x86.  The
main bulk of the patch is the retry logic if the nodemask changes in a
manner that can cause a false failure.

While updating the nodemask, a check is made to see if a false failure
is a risk.  If it is, the sequence number gets bumped and parallel
allocators will briefly stall while the nodemask update takes place.

In a page fault test microbenchmark, oprofile samples from
__alloc_pages_nodemask went from 4.53% of all samples to 1.15%.  The
actual results were

                             3.3.0-rc3          3.3.0-rc3
                             rc3-vanilla        nobarrier-v2r1
    Clients   1 UserTime       0.07 (  0.00%)   0.08 (-14.19%)
    Clients   2 UserTime       0.07 (  0.00%)   0.07 (  2.72%)
    Clients   4 UserTime       0.08 (  0.00%)   0.07 (  3.29%)
    Clients   1 SysTime        0.70 (  0.00%)   0.65 (  6.65%)
    Clients   2 SysTime        0.85 (  0.00%)   0.82 (  3.65%)
    Clients   4 SysTime        1.41 (  0.00%)   1.41 (  0.32%)
    Clients   1 WallTime       0.77 (  0.00%)   0.74 (  4.19%)
    Clients   2 WallTime       0.47 (  0.00%)   0.45 (  3.73%)
    Clients   4 WallTime       0.38 (  0.00%)   0.37 (  1.58%)
    Clients   1 Flt/sec/cpu  497620.28 (  0.00%) 520294.53 (  4.56%)
    Clients   2 Flt/sec/cpu  414639.05 (  0.00%) 429882.01 (  3.68%)
    Clients   4 Flt/sec/cpu  257959.16 (  0.00%) 258761.48 (  0.31%)
    Clients   1 Flt/sec      495161.39 (  0.00%) 517292.87 (  4.47%)
    Clients   2 Flt/sec      820325.95 (  0.00%) 850289.77 (  3.65%)
    Clients   4 Flt/sec      1020068.93 (  0.00%) 1022674.06 (  0.26%)
    MMTests Statistics: duration
    Sys Time Running Test (seconds)             135.68    132.17
    User+Sys Time Running Test (seconds)         164.2    160.13
    Total Elapsed Time (seconds)                123.46    120.87

The overall improvement is small but the System CPU time is much
improved and roughly in correlation to what oprofile reported (these
performance figures are without profiling so skew is expected).  The
actual number of page faults is noticeably improved.

For benchmarks like kernel builds, the overall benefit is marginal but
the system CPU time is slightly reduced.

To test the actual bug the commit fixed I opened two terminals.  The
first ran within a cpuset and continually ran a small program that
faulted 100M of anonymous data.  In a second window, the nodemask of the
cpuset was continually randomised in a loop.

Without the commit, the program would fail every so often (usually
within 10 seconds) and obviously with the commit everything worked fine.
With this patch applied, it also worked fine so the fix should be
functionally equivalent.

Signed-off-by: Mel Gorman 
Cc: Miao Xie 
Cc: David Rientjes 
Cc: Peter Zijlstra 
Cc: Christoph Lameter 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: SLAB Out-of-memory diagnostics

2012-03-10T08:45:17+00:00

Following the example at mm/slub.c, add out-of-memory diagnostics to the
SLAB allocator to help on debugging certain OOM conditions.

An example print out looks like this:

  
  SLAB: Unable to allocate memory on node 0 (gfp=0x11200)
    cache: bio-0, object size: 192, order: 0
    node 0: slabs: 3/3, objs: 60/60, free: 0

Signed-off-by: Rafael Aquini 
Acked-by: Rik van Riel 
Acked-by: David Rientjes 
Signed-off-by: Pekka Enberg

slab, cleanup: remove unneeded return

2012-01-23T13:32:26+00:00

The procedure ends right after the if-statement, so remove ``return''.
Also move the last common statement outside.

Signed-off-by: Zhao Jin 
Acked-by: David Rientjes 
Acked-by: Christoph Lameter 
Signed-off-by: Pekka Enberg

Merge branch 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux

2012-01-12T02:52:23+00:00

* 'slab/for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/penberg/linux:
  slub: disallow changing cpu_partial from userspace for debug caches
  slub: add missed accounting
  slub: Extract get_freelist from __slab_alloc
  slub: Switch per cpu partial page support off for debugging
  slub: fix a possible memleak in __slab_alloc()
  slub: fix slub_max_order Documentation
  slub: add missed accounting
  slab: add taint flag outputting to debug paths.
  slub: add taint flag outputting to debug paths
  slab: introduce slab_max_order kernel parameter
  slab: rename slab_break_gfp_order to slab_max_order

Merge branch 'slab/urgent' into slab/for-linus

2012-01-11T19:11:29+00:00

tracing/mm: Move include of trace/events/kmem.h out of header into slab.c

2012-01-09T22:19:33+00:00

Including trace/events/*.h TRACE_EVENT() macro headers in other headers
can cause strange side effects if another trace/event/*.h header
includes that header.  Having trace/events/kmem.h inside slab_def.h
caused a compile error in sparc64 when changes were done to some header
files.  Moving the kmem.h trace header out of slab.h and into slab.c
fixes the problem.

Note, both slub.c and slob.c already include the trace/events/kmem.h
file. Only slab.c had it missing.

Link: http://lkml.kernel.org/r/20120105190405.1e3191fb5a43b2a0f1655e1f@canb.auug.org.au

Reported-by: Stephen Rothwell 
Signed-off-by: Steven Rostedt 
Signed-off-by: Linus Torvalds