linux-stable.git/fs/file.c, branch v3.2.64

fs/file.c:fdtable: avoid triggering OOMs from alloc_fdmem

2014-04-01T23:58:50+00:00

commit 96c7a2ff21501691587e1ae969b83cbec8b78e08 upstream.

Recently, due to a spike in connections per second, memcached on 3
separate boxes triggered the OOM killer from accept().  At the time the
OOM killer was triggered there was 4GB out of 36GB free in zone 1.  The
problem was that alloc_fdtable was allocating an order 3 page (32KiB) to
hold a bitmap, and there was sufficient fragmentation that the largest
page available was 8KiB.

I find the logic that PAGE_ALLOC_COSTLY_ORDER can't fail pretty dubious
but I do agree that order 3 allocations are very likely to succeed.

There are always pathologies where order > 0 allocations can fail when
there are copious amounts of free memory available.  Using the
pigeonhole principle it is easy to show that it requires 1 page more
than 50% of the pages being free to guarantee an order 1 (8KiB)
allocation will succeed, 1 page more than 75% of the pages being free to
guarantee an order 2 (16KiB) allocation will succeed and 1 page more
than 87.5% of the pages being free to guarantee an order 3 (32KiB)
allocation will succeed.
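
The thresholds above follow from a simple worst case: if one page in
every aligned block of 2^order pages is in use, no block of that order
is free.  A minimal user-space sketch of that arithmetic, assuming the
total page count is a multiple of the block size (the function names are
made up for illustration and are not kernel functions):

```c
#include <stddef.h>

/*
 * Largest number of free pages that can still leave no free aligned
 * block of 2^order pages: exactly one page per aligned block is in use.
 * Assumes total_pages is a multiple of 2^order.
 */
size_t max_free_without_block(size_t total_pages, unsigned int order)
{
	size_t block = (size_t)1 << order;

	return total_pages - total_pages / block;
}

/*
 * One more free page than the worst case above means fewer pages are in
 * use than there are aligned blocks, so at least one block is entirely
 * free.
 */
size_t min_free_to_guarantee(size_t total_pages, unsigned int order)
{
	return max_free_without_block(total_pages, order) + 1;
}
```

With 1024 pages this gives 513, 769 and 897 free pages for orders 1, 2
and 3, i.e. one page more than 50%, 75% and 87.5%.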

A server churning memory with a lot of small requests and replies, like
memcached, is a common case that, if anything, will skew the odds
against large pages being available.

Therefore let's not give external applications a practical way to kill
Linux server applications, and specify __GFP_NORETRY to the kmalloc in
alloc_fdmem.  Unless I am misreading the code, by the time the code
reaches should_alloc_retry in __alloc_pages_slowpath (where
__GFP_NORETRY becomes significant) we have already tried everything
reasonable to allocate a page and the only thing left to do is wait.  So
not waiting and falling back to vmalloc immediately seems like the
reasonable thing to do even if there wasn't a chance of triggering the
OOM killer.
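
The control flow described above can be modelled in user space as "try
the contiguous allocator exactly once, then fall back".  This is only a
sketch: alloc_fdmem_model(), struct fdmem_result and always_fail() are
hypothetical names, and the two function pointers stand in for
kmalloc() with __GFP_NORETRY and for vmalloc() respectively.

```c
#include <stddef.h>
#include <stdlib.h>

typedef void *(*alloc_fn)(size_t size);

struct fdmem_result {
	void *ptr;		/* NULL only if both allocators failed */
	int used_fallback;	/* 1 if the fallback allocator was used */
};

/*
 * Model of the pattern: a single attempt with the "contiguous"
 * allocator (no retry loop, so nothing that could end in an OOM kill),
 * then an immediate fallback.
 */
struct fdmem_result alloc_fdmem_model(size_t size, alloc_fn try_once,
				      alloc_fn fallback)
{
	struct fdmem_result res = { NULL, 0 };

	res.ptr = try_once(size);
	if (res.ptr == NULL) {
		res.ptr = fallback(size);
		res.used_fallback = 1;
	}
	return res;
}

/* Stand-in for a fragmented zone: the contiguous attempt always fails. */
void *always_fail(size_t size)
{
	(void)size;
	return NULL;
}
```

Calling alloc_fdmem_model(32768, always_fail, malloc) returns memory
from the fallback without ever looping on the first allocator.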

Signed-off-by: "Eric W. Biederman" 
Cc: Eric Dumazet 
Acked-by: David Rientjes 
Cc: Cong Wang 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings

vfs: avoid large kmalloc()s for the fdtable

2011-04-28T18:28:20+00:00

Azurit reports large increases in system time after 2.6.36 when running
Apache.  It was bisected down to a892e2d7dcdfa6c76e6 ("vfs: use kmalloc()
to allocate fdmem if possible").

That patch caused the vfs to use kmalloc() for very large allocations and
this is causing excessive work (and presumably excessive reclaim) within
the page allocator.

Fix it by falling back to vmalloc() earlier - when the allocation attempt
would have been considered "costly" by reclaim.
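
As a rough illustration of where that cut-off lies, the sketch below
computes the page order of a request and compares the size against the
"costly" limit.  It assumes 4KiB pages and a costly order of 3 (the
value of the kernel's PAGE_ALLOC_COSTLY_ORDER); the MODEL_* constants
and function names are illustrative, not kernel symbols.

```c
#include <stddef.h>
#include <stdint.h>

#define MODEL_PAGE_SIZE		4096u	/* assumption: 4KiB pages */
#define MODEL_COSTLY_ORDER	3u	/* assumption: costly above order 3 */

/* Smallest order whose block (page size << order) can hold size bytes. */
unsigned int alloc_order(size_t size)
{
	unsigned int order = 0;
	size_t span = MODEL_PAGE_SIZE;

	/* The second condition stops the shift before it could overflow. */
	while (span < size && span <= SIZE_MAX / 2) {
		span <<= 1;
		order++;
	}
	return order;
}

/* Non-zero if the request is small enough to attempt kmalloc() first. */
int should_try_kmalloc(size_t size)
{
	return size <= ((size_t)MODEL_PAGE_SIZE << MODEL_COSTLY_ORDER);
}
```

Under these assumptions a 32KiB request (order 3) is still attempted
with kmalloc(), while anything larger goes straight to vmalloc().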

Reported-by: azurIt 
Tested-by: azurIt 
Acked-by: Changli Gao 
Cc: Americo Wang 
Cc: Jiri Slaby 
Acked-by: Eric Dumazet 
Cc: Mel Gorman 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

vfs: use kmalloc() to allocate fdmem if possible

2010-08-11T15:59:02+00:00

Use kmalloc() to allocate fdmem if possible.

vmalloc() is used as a fallback solution for fdmem allocation.  A new
helper function __free_fdtable() is introduced to reduce the lines of
code.

A potential bug, calling vfree() on memory allocated by kmalloc(), is
fixed.

[akpm@linux-foundation.org: use __GFP_NOWARN, uninline alloc_fdmem() and free_fdmem()]
Signed-off-by: Changli Gao 
Cc: Alexander Viro 
Cc: Jiri Slaby 
Cc: "Paul E. McKenney" 
Cc: Alexey Dobriyan 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Cc: Avi Kivity 
Cc: Tetsuo Handa 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

fs: remove all rcu head initializations, except on_stack initializations

2010-06-14T23:37:26+00:00

Remove all rcu head inits. We don't care about the RCU head state before passing
it to call_rcu() anyway. Only leave the "on_stack" variants so debugobjects can
keep track of objects on stack.

Signed-off-by: Alexey Dobriyan 
Signed-off-by: Mathieu Desnoyers 
Signed-off-by: Paul E. McKenney 
Cc: Alexander Viro 
Cc: Andries Brouwer

fs: use rlimit helpers

2010-03-06T19:26:29+00:00

Make sure the compiler won't do weird things with limits.  E.g.  fetching
them twice may return 2 different values after writable limits are
implemented.

I.e.  either use rlimit helpers added in commit 3e10e716abf3 ("resource:
add helpers for fetching rlimits") or ACCESS_ONCE if not applicable.
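
A small user-space sketch of the hazard: the limit is read once into a
local variable, so the value that is checked is also the value that is
returned.  READ_ONCE_ULONG() is a hypothetical stand-in for the
kernel's ACCESS_ONCE(), and clamp_to_limit() is an illustrative name.

```c
#include <stddef.h>

/* Force exactly one load of x by going through a volatile lvalue. */
#define READ_ONCE_ULONG(x) (*(volatile const unsigned long *)&(x))

/*
 * Clamp requested to a limit that another thread may rewrite at any
 * time.  Writing "if (requested > *shared_limit) return *shared_limit;"
 * would load the limit twice and could return a value different from
 * the one that was compared.
 */
unsigned long clamp_to_limit(const unsigned long *shared_limit,
			     unsigned long requested)
{
	unsigned long limit = READ_ONCE_ULONG(*shared_limit);

	if (requested > limit)
		return limit;
	return requested;
}
```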

Signed-off-by: Jiri Slaby 
Cc: Alexander Viro 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

vfs: Apply lockdep-based checking to rcu_dereference() uses

2010-02-25T09:34:48+00:00

Add lockdep-ified RCU primitives to alloc_fd(), files_fdtable()
and fcheck_files().

Cc: Alexander Viro 
Signed-off-by: Paul E. McKenney 
Cc: laijs@cn.fujitsu.com
Cc: dipankar@in.ibm.com
Cc: mathieu.desnoyers@polymtl.ca
Cc: josh@joshtriplett.org
Cc: dvhltc@us.ibm.com
Cc: niv@us.ibm.com
Cc: peterz@infradead.org
Cc: rostedt@goodmis.org
Cc: Valdis.Kletnieks@vt.edu
Cc: dhowells@redhat.com
Cc: Alexander Viro 
LKML-Reference: <1266887105-1528-8-git-send-email-paulmck@linux.vnet.ibm.com>
Signed-off-by: Ingo Molnar

headers: remove sched.h from interrupt.h

2009-10-11T18:20:58+00:00

Now that m68k's task_thread_info() no longer refers to current,
it's possible to remove sched.h from interrupt.h and not break m68k!
Many thanks to Heiko Carstens for allowing this.

Signed-off-by: Alexey Dobriyan

[PATCH] merge locate_fd() and get_unused_fd()

2008-08-01T15:25:23+00:00

New primitive: alloc_fd(start, flags).  get_unused_fd() and
get_unused_fd_flags() become wrappers on top of it.
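
The shape of that interface can be sketched with a toy descriptor
table.  Everything here is a user-space model under stated assumptions:
the fixed table size, struct fd_model and the model_* names are
hypothetical, and flags is accepted but ignored.

```c
#include <errno.h>
#include <stdbool.h>

#define MODEL_MAX_FDS 64	/* hypothetical fixed table size */

struct fd_model {
	bool open[MODEL_MAX_FDS];
};

/* Claim the lowest free descriptor >= start, or return -EMFILE. */
int model_alloc_fd(struct fd_model *t, unsigned int start, int flags)
{
	(void)flags;	/* the model does not track per-fd flags */

	for (unsigned int fd = start; fd < MODEL_MAX_FDS; fd++) {
		if (!t->open[fd]) {
			t->open[fd] = true;
			return (int)fd;
		}
	}
	return -EMFILE;
}

/* Both older entry points reduce to "allocate starting at 0". */
int model_get_unused_fd_flags(struct fd_model *t, int flags)
{
	return model_alloc_fd(t, 0, flags);
}

int model_get_unused_fd(struct fd_model *t)
{
	return model_alloc_fd(t, 0, 0);
}
```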

Signed-off-by: Al Viro

[PATCH] fix RLIM_NOFILE handling

2008-07-27T00:53:45+00:00

* dup2() should return -EBADF when sysctl_nr_open is exceeded
* dup() should *not* return -EINVAL even if you have rlimit set to 0;
  it should get -EMFILE instead.

The check for orig_start exceeding rlimit is moved to sys_fcntl().
A failing expand_files() in dup{2,3}() now gets -EMFILE remapped to
-EBADF.  Consequently, the remaining checks for rlimit are moved to
expand_files().
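
The resulting behaviour can be observed from user space.  The sketch
below temporarily drops the soft RLIMIT_NOFILE to 0, records the errno
from dup() and dup2(), then restores the limit.  probe_dup_errnos() is
an illustrative helper, not part of any API; on a kernel with this
change the expected values are EMFILE for dup() and EBADF for dup2().

```c
#include <errno.h>
#include <sys/resource.h>
#include <unistd.h>

/*
 * Returns 0 if the probe ran and the limit was restored, -1 otherwise.
 * *dup_errno and *dup2_errno receive the errno of each failed call, or
 * 0 if the call unexpectedly succeeded.
 */
int probe_dup_errnos(int *dup_errno, int *dup2_errno)
{
	struct rlimit saved, lowered;
	int pipefd[2];
	int fd;
	int ret = -1;

	/* Create a descriptor that is known to be open. */
	if (pipe(pipefd) != 0)
		return -1;
	if (getrlimit(RLIMIT_NOFILE, &saved) != 0)
		goto out_close;

	lowered = saved;
	lowered.rlim_cur = 0;	/* lowering the soft limit needs no privilege */
	if (setrlimit(RLIMIT_NOFILE, &lowered) != 0)
		goto out_close;

	errno = 0;
	fd = dup(pipefd[0]);
	*dup_errno = (fd < 0) ? errno : 0;
	if (fd >= 0)
		close(fd);

	/* Any target descriptor is at or above a limit of 0. */
	errno = 0;
	fd = dup2(pipefd[0], pipefd[1] + 1);
	*dup2_errno = (fd < 0) ? errno : 0;
	if (fd >= 0)
		close(fd);

	if (setrlimit(RLIMIT_NOFILE, &saved) == 0)
		ret = 0;
out_close:
	close(pipefd[0]);
	close(pipefd[1]);
	return ret;
}
```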

Signed-off-by: Al Viro

[PATCH] avoid multiplication overflows and signedness issues for max_fds

2008-05-16T21:22:52+00:00

Limit sysctl_nr_open - we don't want ->max_fds to exceed INT_MAX and
we don't want the size calculation for ->fd[] to overflow.
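
A minimal sketch of such a cap, computed in user space.  It takes the
smaller of INT_MAX and the largest element count whose byte size still
fits in size_t for an array of pointers.  Rounding down to a multiple
of the bits in a long is an extra assumption made here (so that fd
bitmaps cover whole words); nr_open_cap() is an illustrative name.

```c
#include <limits.h>
#include <stddef.h>
#include <stdint.h>

size_t nr_open_cap(void)
{
	size_t bits_per_long = sizeof(long) * CHAR_BIT;
	/* max_fds is an int, so it must not exceed INT_MAX. */
	size_t by_int = (size_t)INT_MAX;
	/* count * sizeof(void *) must not wrap around in size_t. */
	size_t by_size = SIZE_MAX / sizeof(void *);
	size_t cap = by_int < by_size ? by_int : by_size;

	/* Assumption: keep the cap a multiple of the bits in a long. */
	return cap & ~(bits_per_long - 1);
}
```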

Signed-off-by: Al Viro