linux-stable.git/kernel/futex.c, branch linux-2.6.35.y

mm/futex: fix futex writes on archs with SW tracking of

2011-08-01T20:55:00+00:00

[ upstream commit 2efaca927f5cd7ecd0f1554b8f9b6a9a2c329c03 ]
 dirty & young

I haven't reproduced it myself but the fail scenario is that on such
machines (notably ARM and some embedded powerpc), if you manage to hit
that futex path on a writable page whose dirty bit has gone from the PTE,
you'll livelock inside the kernel from what I can tell.

It will go in a loop of trying the atomic access, failing, trying gup to
"fix it up", getting succcess from gup, go back to the atomic access,
failing again because dirty wasn't fixed etc...

So I think you essentially hang in the kernel.

The scenario is probably rare'ish because affected architecture are
embedded and tend to not swap much (if at all) so we probably rarely hit
the case where dirty is missing or young is missing, but I think Shan has
a piece of SW that can reliably reproduce it using a shared writable
mapping & fork or something like that.

On archs who use SW tracking of dirty & young, a page without dirty is
effectively mapped read-only and a page without young unaccessible in the
PTE.

Additionally, some architectures might lazily flush the TLB when relaxing
write protection (by doing only a local flush), and expect a fault to
invalidate the stale entry if it's still present on another processor.

The futex code assumes that if the "in_atomic()" access -EFAULT's, it can
"fix it up" by causing get_user_pages() which would then be equivalent to
taking the fault.

However that isn't the case.  get_user_pages() will not call
handle_mm_fault() in the case where the PTE seems to have the right
permissions, regardless of the dirty and young state.  It will eventually
update those bits ...  in the struct page, but not in the PTE.

Additionally, it will not handle the lazy TLB flushing that can be
required by some architectures in the fault case.

Basically, gup is the wrong interface for the job.  The patch provides a
more appropriate one which boils down to just calling handle_mm_fault()
since what we are trying to do is simulate a real page fault.

The futex code currently attempts to write to user memory within a
pagefault disabled section, and if that fails, tries to fix it up using
get_user_pages().

This doesn't work on archs where the dirty and young bits are maintained
by software, since they will gate access permission in the TLB, and will
not be updated by gup().

In addition, there's an expectation on some archs that a spurious write
fault triggers a local TLB flush, and that is missing from the picture as
well.

I decided that adding those "features" to gup() would be too much for this
already too complex function, and instead added a new simpler
fixup_user_fault() which is essentially a wrapper around handle_mm_fault()
which the futex code can call.

[akpm@linux-foundation.org: coding-style fixes]
[akpm@linux-foundation.org: fix some nits Darren saw, fiddle comment layout]
Signed-off-by: Benjamin Herrenschmidt 
Reported-by: Shan Hai 
Tested-by: Shan Hai 
Cc: David Laight 
Acked-by: Peter Zijlstra 
Cc: Darren Hart 
Cc: 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Andi Kleen

futex: Fix errors in nested key ref-counting

2010-11-22T18:59:45+00:00

commit 7ada876a8703f23befbb20a7465a702ee39b1704 upstream.

futex_wait() is leaking key references due to futex_wait_setup()
acquiring an additional reference via the queue_lock() routine. The
nested key ref-counting has been masking bugs and complicating code
analysis. queue_lock() is only called with a previously ref-counted
key, so remove the additional ref-counting from the queue_(un)lock()
functions.

Also futex_wait_requeue_pi() drops one key reference too many in
unqueue_me_pi(). Remove the key reference handling from
unqueue_me_pi(). This was paired with a queue_lock() in
futex_lock_pi(), so the count remains unchanged.

Document remaining nested key ref-counting sites.

Signed-off-by: Darren Hart 
Reported-and-tested-by: Matthieu Fertré
Reported-by: Louis Rilling
Cc: Peter Zijlstra 
Cc: Eric Dumazet 
Cc: John Kacur 
Cc: Rusty Russell 
LKML-Reference: <4CBB17A8.70401@linux.intel.com>
Signed-off-by: Thomas Gleixner 
Signed-off-by: Greg Kroah-Hartman

futex: futex_find_get_task remove credentails check

2010-06-30T22:43:44+00:00

futex_find_get_task is currently used (through lookup_pi_state) from two
contexts, futex_requeue and futex_lock_pi_atomic.  None of the paths
looks it needs the credentials check, though.  Different (e)uids
shouldn't matter at all because the only thing that is important for
shared futex is the accessibility of the shared memory.

The credentail check results in glibc assert failure or process hang (if
glibc is compiled without assert support) for shared robust pthread
mutex with priority inheritance if a process tries to lock already held
lock owned by a process with a different euid:

pthread_mutex_lock.c:312: __pthread_mutex_lock_full: Assertion `(-(e)) != 3 || !robust' failed.

The problem is that futex_lock_pi_atomic which is called when we try to
lock already held lock checks the current holder (tid is stored in the
futex value) to get the PI state.  It uses lookup_pi_state which in turn
gets task struct from futex_find_get_task.  ESRCH is returned either
when the task is not found or if credentials check fails.

futex_lock_pi_atomic simply returns if it gets ESRCH.  glibc code,
however, doesn't expect that robust lock returns with ESRCH because it
should get either success or owner died.

Signed-off-by: Michal Hocko 
Acked-by: Darren Hart 
Cc: Ingo Molnar 
Cc: Thomas Gleixner 
Cc: Nick Piggin 
Cc: Alexey Kuznetsov 
Cc: Peter Zijlstra 
Signed-off-by: Linus Torvalds

futex: Handle futex value corruption gracefully

2010-02-03T14:13:22+00:00

The WARN_ON in lookup_pi_state which complains about a mismatch
between pi_state->owner->pid and the pid which we retrieved from the
user space futex is completely bogus.

The code just emits the warning and then continues despite the fact
that it detected an inconsistent state of the futex. A conveniant way
for user space to spam the syslog.

Replace the WARN_ON by a consistency check. If the values do not match
return -EINVAL and let user space deal with the mess it created.

This also fixes the missing task_pid_vnr() when we compare the
pi_state->owner pid with the futex value.

Reported-by: Jermome Marchand 
Signed-off-by: Thomas Gleixner 
Acked-by: Darren Hart 
Acked-by: Peter Zijlstra 
Cc:

futex: Handle user space corruption gracefully

2010-02-03T14:13:22+00:00

If the owner of a PI futex dies we fix up the pi_state and set
pi_state->owner to NULL. When a malicious or just sloppy programmed
user space application sets the futex value to 0 e.g. by calling
pthread_mutex_init(), then the futex can be acquired again. A new
waiter manages to enqueue itself on the pi_state w/o damage, but on
unlock the kernel dereferences pi_state->owner and oopses.

Prevent this by checking pi_state->owner in the unlock path. If
pi_state->owner is not current we know that user space manipulated the
futex value. Ignore the mess and return -EINVAL.

This catches the above case and also the case where a task hijacks the
futex by setting the tid value and then tries to unlock it.

Reported-by: Jermome Marchand 
Signed-off-by: Thomas Gleixner 
Acked-by: Darren Hart 
Acked-by: Peter Zijlstra 
Cc:

futex_lock_pi() key refcnt fix

2010-02-03T14:13:22+00:00

This fixes a futex key reference count bug in futex_lock_pi(),
where a key's reference count is incremented twice but decremented
only once, causing the backing object to not be released.

If the futex is created in a temporary file in an ext3 file system,
this bug causes the file's inode to become an "undead" orphan,
which causes an oops from a BUG_ON() in ext3_put_super() when the
file system is unmounted. glibc's test suite is known to trigger this,
see .

The bug is a regression from 2.6.28-git3, namely Peter Zijlstra's
38d47c1b7075bd7ec3881141bb3629da58f88dab "[PATCH] futex: rely on
get_user_pages() for shared futexes". That commit made get_futex_key()
also increment the reference count of the futex key, and updated its
callers to decrement the key's reference count before returning.
Unfortunately the normal exit path in futex_lock_pi() wasn't corrected:
the reference count is incremented by get_futex_key() and queue_lock(),
but the normal exit path only decrements once, via unqueue_me_pi().
The fix is to put_futex_key() after unqueue_me_pi(), since 2.6.31
this is easily done by 'goto out_put_key' rather than 'goto out'.

Signed-off-by: Mikael Pettersson 
Acked-by: Peter Zijlstra 
Acked-by: Darren Hart 
Signed-off-by: Thomas Gleixner 
Cc:

futexes: Remove rw parameter from get_futex_key()

2010-01-13T08:17:36+00:00

Currently, futexes have two problem:

A) The current futex code doesn't handle private file mappings properly.

get_futex_key() uses PageAnon() to distinguish file and
anon, which can cause the following bad scenario:

  1) thread-A call futex(private-mapping, FUTEX_WAIT), it
     sleeps on file mapping object.
  2) thread-B writes a variable and it makes it cow.
  3) thread-B calls futex(private-mapping, FUTEX_WAKE), it
     wakes up blocked thread on the anonymous page. (but it's nothing)

B) Current futex code doesn't handle zero page properly.

Read mode get_user_pages() can return zero page, but current
futex code doesn't handle it at all. Then, zero page makes
infinite loop internally.

The solution is to use write mode get_user_page() always for
page lookup. It prevents the lookup of both file page of private
mappings and zero page.

Performance concerns:

Probaly very little, because glibc always initialize variables
for futex before to call futex(). It means glibc users never see
the overhead of this patch.

Compatibility concerns:

This patch has few compatibility issues. After this patch,
FUTEX_WAIT require writable access to futex variables (read-only
mappings makes EFAULT). But practically it's not a problem,
glibc always initalizes variables for futexes explicitly - nobody
uses read-only mappings.

Reported-by: Hugh Dickins 
Signed-off-by: KOSAKI Motohiro 
Acked-by: Peter Zijlstra 
Acked-by: Darren Hart 
Cc: 
Cc: Linus Torvalds 
Cc: KAMEZAWA Hiroyuki 
Cc: Nick Piggin 
Cc: Ulrich Drepper 
LKML-Reference: <20100105162633.45A2.A69D9226@jp.fujitsu.com>
Signed-off-by: Ingo Molnar

rtmutes: Convert rtmutex.lock to raw_spinlock

2009-12-14T22:55:33+00:00

Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner 
Acked-by: Peter Zijlstra 
Acked-by: Ingo Molnar

sched: Convert pi_lock to raw_spinlock

2009-12-14T22:55:33+00:00

Convert locks which cannot be sleeping locks in preempt-rt to
raw_spinlocks.

Signed-off-by: Thomas Gleixner 
Acked-by: Peter Zijlstra 
Acked-by: Ingo Molnar

plist: Make plist debugging raw_spinlock aware

2009-12-14T22:55:33+00:00

plists are used with spinlocks and raw_spinlocks. Change the plist
debugging to handle both types.

Signed-off-by: Thomas Gleixner 
Acked-by: Peter Zijlstra 
Acked-by: Ingo Molnar