linux-stable.git/include/linux/sched.h, branch v3.2.102

cpuset: PF_SPREAD_PAGE and PF_SPREAD_SLAB should be atomic flags

2017-10-12T14:27:22+00:00

commit 2ad654bc5e2b211e92f66da1d819e47d79a866f0 upstream.

When we change cpuset.memory_spread_{page,slab}, cpuset will flip
PF_SPREAD_{PAGE,SLAB} bit of tsk->flags for each task in that cpuset.
This must be done using atomic bitops, but currently it is not, so
concurrent updates to tsk->flags can be lost.

Tetsuo reported a hard-to-reproduce kernel crash on RHEL6, which happened
when one thread tried to clear PF_USED_MATH while at the same time another
thread tried to flip PF_SPREAD_PAGE/PF_SPREAD_SLAB. They both operate on
the same task.

Here's the full report:
https://lkml.org/lkml/2014/9/19/230

To fix this, we make PF_SPREAD_PAGE and PF_SPREAD_SLAB atomic flags.
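
As a rough userspace illustration of the failure mode (not the kernel code
itself), the sketch below replays one bad interleaving of two plain
read-modify-write updates on a shared word, then performs the same two
updates with C11 atomics. The DEMO_* names and bit values are hypothetical;
the kernel uses its own bitops, not <stdatomic.h>.

```c
#include <stdatomic.h>

/* Illustrative bit values only; the real PF_* constants live in sched.h. */
#define DEMO_USED_MATH   0x1UL
#define DEMO_SPREAD_PAGE 0x2UL

/*
 * Replay one bad interleaving of two non-atomic updates to the same
 * word: both sides load the old value before either one stores.
 */
static inline unsigned long demo_racy_interleaving(unsigned long flags)
{
	unsigned long a = flags;	/* thread A loads */
	unsigned long b = flags;	/* thread B loads */

	a &= ~DEMO_USED_MATH;		/* A clears "used math" */
	b |= DEMO_SPREAD_PAGE;		/* B sets "spread page" */

	flags = a;			/* A stores */
	flags = b;			/* B stores and undoes A's clear */
	return flags;
}

/* The same two updates as atomic RMW operations: neither can be lost. */
static inline unsigned long demo_atomic_updates(unsigned long initial)
{
	atomic_ulong flags;

	atomic_init(&flags, initial);
	atomic_fetch_and(&flags, ~DEMO_USED_MATH);
	atomic_fetch_or(&flags, DEMO_SPREAD_PAGE);
	return atomic_load(&flags);
}
```

Starting from a word with only DEMO_USED_MATH set, the racy replay ends with
DEMO_USED_MATH still set, while the atomic version ends with only
DEMO_SPREAD_PAGE set.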

v4:
- updated mm/slab.c. (Fengguang Wu)
- updated Documentation.

Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Cc: Miao Xie 
Cc: Kees Cook 
Fixes: 950592f7b991 ("cpusets: update tasks' page/slab spread flags in time")
Reported-by: Tetsuo Handa 
Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo 
[lizf: Backported to 3.4:
 - adjust context
 - check current->flags & PF_MEMPOLICY rather than current->mempolicy]
Signed-off-by: Ben Hutchings

sched: add macros to define bitops for task atomic flags

2017-10-12T14:27:22+00:00

commit e0e5070b20e01f0321f97db4e4e174f3f6b49e50 upstream.

This will simplify code when we add new flags.

v3:
- Kees pointed out that no_new_privs should never be cleared, so we
shouldn't define task_clear_no_new_privs(). we define 3 macros instead
of a single one.
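
A minimal userspace sketch of the "three macros" idea, assuming C11 atomics
in place of the kernel's test_bit()/set_bit()/clear_bit(). All DEMO_* and
demo_* names, the struct, and the bit numbers are hypothetical. The point is
that a flag which must never be cleared simply does not instantiate the
CLEAR generator.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Stand-in for the single task_struct field this sketch needs. */
struct demo_task {
	atomic_ulong atomic_flags;
};

/* Hypothetical bit numbers, for illustration only. */
#define DEMO_PFA_STICKY      0
#define DEMO_PFA_SPREAD_PAGE 1

#define DEMO_PFA_TEST(name, func)					\
	static inline bool demo_task_##func(struct demo_task *p)	\
	{ return atomic_load(&p->atomic_flags) & (1UL << DEMO_PFA_##name); }

#define DEMO_PFA_SET(name, func)					\
	static inline void demo_task_set_##func(struct demo_task *p)	\
	{ atomic_fetch_or(&p->atomic_flags, 1UL << DEMO_PFA_##name); }

#define DEMO_PFA_CLEAR(name, func)					\
	static inline void demo_task_clear_##func(struct demo_task *p)	\
	{ atomic_fetch_and(&p->atomic_flags, ~(1UL << DEMO_PFA_##name)); }

/* A set-only flag: no clear helper is ever generated for it. */
DEMO_PFA_TEST(STICKY, sticky)
DEMO_PFA_SET(STICKY, sticky)

/* A flag that may be tested, set and cleared uses all three generators. */
DEMO_PFA_TEST(SPREAD_PAGE, spread_page)
DEMO_PFA_SET(SPREAD_PAGE, spread_page)
DEMO_PFA_CLEAR(SPREAD_PAGE, spread_page)
```

With a single combined macro, every flag would get a clear helper whether or
not clearing it is legal; splitting the generators makes the omission
explicit at the definition site.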

v2:
- updated scripts/tags.sh, suggested by Peter

Cc: Ingo Molnar 
Cc: Miao Xie 
Cc: Tetsuo Handa 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Kees Cook 
Signed-off-by: Zefan Li 
Signed-off-by: Tejun Heo 
[lizf: Backported to 3.4:
 - adjust context
 - remove no_new_priv code
 - add atomic_flags to struct task_struct]
[bwh: Backported to 3.2:
 - Drop changes in scripts/tags.sh
 - Adjust context]
Signed-off-by: Ben Hutchings

pipe: limit the per-user amount of pages allocated in pipes

2016-02-27T14:28:49+00:00

commit 759c01142a5d0f364a462346168a56de28a80f52 upstream.

On not-so-small systems, it is possible for a single process to cause an
OOM condition by filling large pipes with data that are never read. A
typical process filling 4000 pipes with 1 MB of data will use 4 GB of
memory. On small systems it may be tricky to set the pipe max size to
prevent this from happening.

This patch makes it possible to enforce a per-user soft limit above
which new pipes will be limited to a single page, effectively limiting
them to 4 kB each, as well as a hard limit above which no new pipes may
be created for this user. This has the effect of protecting the system
against memory abuse without hurting other users, and still allowing
pipes to work correctly though with less data at once.
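
The policy described above can be sketched as a small decision function.
This is only an illustration: the struct, the function name, and the choice
of ">=" for "above the limit" are assumptions, and the 16-page default
simply restates the 64 kB default pipe size with 4 kB pages.

```c
/* Hypothetical per-user limits, in pages; 0 disables a limit. */
struct demo_pipe_limits {
	unsigned long soft_pages;
	unsigned long hard_pages;
};

#define DEMO_DEFAULT_PIPE_PAGES 16UL	/* 64 kB with 4 kB pages */

/*
 * Return how many pages a new pipe gets for a user who already holds
 * 'used' pipe pages, or 0 if creating the pipe should be refused.
 */
static inline unsigned long
demo_new_pipe_pages(const struct demo_pipe_limits *lim, unsigned long used)
{
	if (lim->hard_pages && used >= lim->hard_pages)
		return 0;			/* hard limit: no new pipe */
	if (lim->soft_pages && used >= lim->soft_pages)
		return 1;			/* soft limit: single page */
	return DEMO_DEFAULT_PIPE_PAGES;		/* below both limits */
}
```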

The limits are controlled by two new sysctls: pipe-user-pages-soft and
pipe-user-pages-hard. Both may be disabled by setting them to zero. The
default soft limit allows the default number of FDs per process (1024)
to create pipes of the default size (64kB), thus reaching a limit of 64MB
before starting to create only smaller pipes. With 256 processes limited
to 1024 FDs each, this results in 1024*64kB + (256*1024 - 1024) * 4kB =
1084 MB of memory allocated for a user. The hard limit is disabled by
default to avoid breaking existing applications that make intensive use
of pipes (eg: for splicing).
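
The 1084 MB figure can be re-derived with a few lines of C. The helper below
is hypothetical and follows the message's own counting (one pipe per FD,
full-size pipes until the soft limit is reached, single-page pipes after
that).

```c
/*
 * Worst-case pipe memory in kB for one user: 'full_pipes' pipes at
 * 'full_kb' each, and every remaining pipe at 'small_kb'.
 */
static inline unsigned long
demo_worst_case_kb(unsigned long procs, unsigned long fds_per_proc,
		   unsigned long full_pipes, unsigned long full_kb,
		   unsigned long small_kb)
{
	unsigned long total_pipes = procs * fds_per_proc;

	return full_pipes * full_kb + (total_pipes - full_pipes) * small_kb;
}
```

With 256 processes, 1024 FDs each, 1024 full-size 64 kB pipes and 4 kB for
the rest, this gives 1110016 kB, which is 1084 MB.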

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa 
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds 
Signed-off-by: Willy Tarreau 
Signed-off-by: Al Viro 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

unix: properly account for FDs passed over unix sockets

2016-02-27T14:28:49+00:00

commit 712f4aad406bb1ed67f3f98d04c044191f0ff593 upstream.

It is possible for a process to allocate and accumulate far more FDs than
the process' limit by sending them over a unix socket then closing them
to keep the process' fd count low.

This change addresses this problem by keeping track of the number of FDs
in flight per user and preventing non-privileged processes from having
more FDs in flight than their configured FD limit.
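
A minimal sketch of the check this implies, assuming a per-user in-flight
counter and a boolean standing in for the kernel's capability test. The
struct, field, and function names are hypothetical.

```c
#include <stdbool.h>

/* Hypothetical per-user accounting record. */
struct demo_user {
	unsigned long unix_inflight;	/* FDs currently in flight */
};

/*
 * Refuse to queue more FDs once a non-privileged sender's user has
 * more FDs in flight than the configured FD limit.
 */
static inline bool
demo_too_many_unix_fds(const struct demo_user *user,
		       unsigned long nofile_limit, bool privileged)
{
	if (privileged)
		return false;
	return user->unix_inflight > nofile_limit;
}
```

Because the counter is kept per user rather than per process, closing the
local descriptors after sending them no longer hides them from the limit.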

Reported-by: socketpair@gmail.com
Reported-by: Tetsuo Handa 
Mitigates: CVE-2013-4312 (Linux 2.0+)
Suggested-by: Linus Torvalds 
Acked-by: Hannes Frederic Sowa 
Signed-off-by: Willy Tarreau 
Signed-off-by: David S. Miller 
[carnil: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings

sched: declare pid_alive as inline

2015-11-17T15:54:45+00:00

commit 80e0b6e8a001361316a2d62b748fe677ec46b860 upstream.

We accidentally declared pid_alive without any extern/inline annotation.
Some platforms were fine with this, some like ia64 and mips were very angry.
If the function is inline, the prototype should be inline!

on ia64:
include/linux/sched.h:1718: warning: 'pid_alive' declared inline after
being called
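
The pattern being fixed can be sketched in userspace as follows. The struct
and the demo_* names are hypothetical; the point is only that the forward
declaration carries the same "static inline" as the later definition, so a
caller placed between the two never sees a mismatched declaration.

```c
#include <stddef.h>

/* Stand-in for the one field the check needs. */
struct demo_task {
	const void *pid;
};

/* Prototype: declared with the same 'static inline' as the definition. */
static inline int demo_pid_alive(const struct demo_task *p);

/* A caller that appears before the definition. */
static inline int demo_task_is_reportable(const struct demo_task *p)
{
	return demo_pid_alive(p);
}

/* Definition: a task counts as alive while it still has a pid attached. */
static inline int demo_pid_alive(const struct demo_task *p)
{
	return p->pid != NULL;
}
```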

Signed-off-by: Richard Guy Briggs 
Signed-off-by: Eric Paris 
Signed-off-by: Ben Hutchings 
Cc: Neal Gompa

include/linux/sched.h: don't use task->pid/tgid in same_thread_group/has_group_leader_pid

2015-08-06T23:32:18+00:00

commit e1403b8edf669ff49bbdf602cc97fefa2760cb15 upstream.

task_struct->pid/tgid should go away.

1. Change same_thread_group() to use task->signal for comparison.

2. Change has_group_leader_pid(task) to compare task_pid(task) with
   signal->leader_pid.
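
The two changes above can be sketched with stand-in types. The demo_*
structs are hypothetical simplifications of task_struct, signal_struct and
struct pid, and the 'pid' field plays the role of task_pid(task).

```c
#include <stdbool.h>

struct demo_pid {
	int nr;
};

struct demo_signal {
	struct demo_pid *leader_pid;
};

struct demo_task {
	struct demo_signal *signal;	/* shared by all threads in a group */
	struct demo_pid *pid;		/* what task_pid(task) would return */
};

/* Threads belong to the same group iff they share one signal struct. */
static inline bool demo_same_thread_group(const struct demo_task *a,
					  const struct demo_task *b)
{
	return a->signal == b->signal;
}

/* Compare struct pid pointers rather than numeric pid/tgid values. */
static inline bool demo_has_group_leader_pid(const struct demo_task *p)
{
	return p->pid == p->signal->leader_pid;
}
```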

Signed-off-by: Oleg Nesterov 
Cc: Michal Hocko 
Cc: Sergey Dyasly 
Reviewed-by: "Eric W. Biederman" 
Cc: Thomas Gleixner 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings 
Cc: Sheng Yong

pid: get pid_t ppid of task in init_pid_ns

2014-04-30T15:23:23+00:00

commit ad36d28293936b03d6b7996e9d6aadfd73c0eb08 upstream.

Added the functions task_ppid_nr_ns() and task_ppid_nr() to abstract the lookup
of the PPID (real_parent's pid_t) of a process, including rcu locking, in an
arbitrary pid namespace and in init_pid_ns respectively.
This provides an alternative to sys_getppid(), which is relative to the child
process' pid namespace.

(informed by ebiederman's 6c621b7e)
Cc: Eric W. Biederman 
Signed-off-by: Richard Guy Briggs 
Signed-off-by: Ben Hutchings

sched/rt: Avoid updating RT entry timeout twice within one tick period

2014-02-15T19:20:18+00:00

commit 57d2aa00dcec67afa52478730f2b524521af14fb upstream.

The issue below was found in 2.6.34-rt rather than mainline rt
kernel, but the issue still exists upstream as well.

So please let me describe how it was noticed on 2.6.34-rt:

On this version, each softirq has its own thread, which means there
is at least one RT FIFO task per cpu. The priority of these
tasks is set to 49 by default. If a user launches an RT FIFO task
with a priority lower than the 49 of the softirq RT tasks, two RT
FIFO tasks can be enqueued on one cpu runqueue at the same
moment. Under the current strategy for balancing RT tasks, we
really need to move RT tasks to a CPU that they can run on as
soon as possible. Even if it means a bit of cache line flushing,
we want RT tasks to be run with the least latency.

While the user RT FIFO task launched above is running, the
sched timer tick of the current cpu happens. In this
tick period, the timeout value of the user RT task will be
updated once. Subsequently, we try to wake up one softirq RT
task on its local cpu. As the priority of current user RT task
is lower than the softirq RT task, the current task will be
preempted by the higher priority softirq RT task. Before
preemption, we check to see if current can readily move to a
different cpu. If so, we will reschedule to allow the RT push logic
to try to move current somewhere else. Whenever the woken
softirq RT task runs, it first tries to migrate the user FIFO RT
task over to a cpu that is running a task of lesser priority. If
migration is done, it will send a reschedule request to the found
cpu by IPI interrupt. Once the target cpu responds to the IPI
interrupt, it will pick the migrated user RT task to preempt its
current task. When the user RT task is running on the new cpu,
the sched timer tick of the cpu fires. So it will tick the user
RT task again. This also means the RT task timeout value will be
updated again. As the migration may be done in one tick period,
it means the user RT task timeout value will be updated twice
within one tick.

If we set a limit on the amount of cpu time for the user RT task
by setrlimit(RLIMIT_RTTIME), the SIGXCPU signal should be posted
upon reaching the soft limit.

But exactly when the SIGXCPU signal should be sent depends on the
RT task timeout value. In fact the timeout mechanism of sending
the SIGXCPU signal assumes the RT task timeout is increased once
every tick.

However, currently the timeout value may be incremented twice per
tick. So it results in the SIGXCPU signal being sent earlier
than expected.

To solve this issue, we prevent the timeout value from increasing
twice within one tick by remembering the jiffies value at which
the timeout was last updated. The timeout is updated only when
the jiffies value stored for the RT task differs from the global
jiffies value.
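
A minimal sketch of this guard, assuming a stand-in struct and passing the
current jiffies value as a parameter so it can run in userspace. The demo_*
names and the field names are illustrative.

```c
/* Stand-in for the RT scheduling entity fields involved. */
struct demo_rt_entity {
	unsigned long timeout;		/* ticks charged against RLIMIT_RTTIME */
	unsigned long watchdog_stamp;	/* jiffies value at the last charge */
};

/*
 * Charge at most one tick per jiffy, no matter how many cpus tick the
 * task within that jiffy (for example after a migration).
 */
static inline void demo_watchdog_tick(struct demo_rt_entity *rt,
				      unsigned long jiffies_now)
{
	if (rt->watchdog_stamp != jiffies_now) {
		rt->timeout++;
		rt->watchdog_stamp = jiffies_now;
	}
}
```

Two ticks arriving with the same jiffies value, as in the migration scenario
above, now charge the task only once.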

Signed-off-by: Ying Xue 
Signed-off-by: Fan Du 
Reviewed-by: Yong Zhang 
Acked-by: Steven Rostedt 
Cc: 
Link: http://lkml.kernel.org/r/1342508623-2887-1-git-send-email-ying.xue@windriver.com
Signed-off-by: Ingo Molnar 
[ lizf: backported to 3.4: adjust context ]
Signed-off-by: Li Zefan 
[bwh: Backported to 3.2: adjust filename]
Signed-off-by: Ben Hutchings

exec/ptrace: fix get_dumpable() incorrect tests

2014-01-03T04:33:21+00:00

commit d049f74f2dbe71354d43d393ac3a188947811348 upstream.

The get_dumpable() return value is not boolean.  Most users of the
function actually want to be testing for non-SUID_DUMP_USER(1) rather than
SUID_DUMP_DISABLE(0).  The SUID_DUMP_ROOT(2) is also considered a
protected state.  Almost all places did this correctly, excepting the two
places fixed in this patch.

Wrong logic:
    if (dumpable == SUID_DUMP_DISABLE) { /* be protective */ }
        or
    if (dumpable == 0) { /* be protective */ }
        or
    if (!dumpable) { /* be protective */ }

Correct logic:
    if (dumpable != SUID_DUMP_USER) { /* be protective */ }
        or
    if (dumpable != 1) { /* be protective */ }

Without this patch, if the system had set the sysctl fs/suid_dumpable=2, a
user was able to ptrace attach to processes that had dropped privileges to
that user.  (This may have been partially mitigated if Yama was enabled.)
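
The difference between the two tests can be made concrete with the three
values named above. The helper names are hypothetical; the constants use the
values given in this message.

```c
#include <stdbool.h>

#define SUID_DUMP_DISABLE 0	/* no dumps */
#define SUID_DUMP_USER    1	/* dumpable as the user */
#define SUID_DUMP_ROOT    2	/* dumpable only by root: still protected */

/* Wrong: treats SUID_DUMP_ROOT as if it were unprotected. */
static inline bool demo_protect_wrong(int dumpable)
{
	return dumpable == SUID_DUMP_DISABLE;
}

/* Correct: everything except SUID_DUMP_USER is a protected state. */
static inline bool demo_protect_correct(int dumpable)
{
	return dumpable != SUID_DUMP_USER;
}
```

The two helpers agree for values 0 and 1 and differ only for
SUID_DUMP_ROOT, which is exactly the fs/suid_dumpable=2 case described
above.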

The macros have been moved into the file that declares get/set_dumpable(),
which means things like the ia64 code can see them too.

CVE-2013-2929

Reported-by: Vasily Kulikov 
Signed-off-by: Kees Cook 
Cc: "Luck, Tony" 
Cc: Oleg Nesterov 
Cc: "Eric W. Biederman" 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.2: adjust context]
Signed-off-by: Ben Hutchings

ptrace: introduce signal_wake_up_state() and ptrace_signal_wake_up()

2013-02-20T03:15:29+00:00

commit 910ffdb18a6408e14febbb6e4b6840fd2c928c82 upstream.

Cleanup and preparation for the next change.

signal_wake_up(resume => true) is overused. None of the ptrace/jctl callers
actually wants to wake up a TASK_WAKEKILL task, but they can't specify the
necessary mask.

Turn signal_wake_up() into signal_wake_up_state(state), reintroduce
signal_wake_up() as a trivial helper, and add ptrace_signal_wake_up()
which adds __TASK_TRACED.

This way ptrace_signal_wake_up() can work "inside" ptrace_request()
even if the tracee doesn't have the TASK_WAKEKILL bit set.
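
A rough sketch of how the wake-up masks differ, modelling each helper as a
function that returns the state mask it would wake. The DEMO_* bit values
and the assumption that the common helper always ORs in the interruptible
bit are illustrative, not taken from this message.

```c
#include <stdbool.h>

/* Illustrative state bits; the real ones are defined in sched.h. */
#define DEMO_TASK_INTERRUPTIBLE 0x01u
#define DEMO_TASK_TRACED_BIT    0x08u	/* stands in for __TASK_TRACED */
#define DEMO_TASK_WAKEKILL      0x80u

/* Common helper: the caller supplies the extra states to wake. */
static inline unsigned int demo_signal_wake_up_state(unsigned int state)
{
	return state | DEMO_TASK_INTERRUPTIBLE;
}

/* Trivial wrapper that keeps the old 'resume' behaviour. */
static inline unsigned int demo_signal_wake_up(bool resume)
{
	return demo_signal_wake_up_state(resume ? DEMO_TASK_WAKEKILL : 0);
}

/* ptrace variant: targets the traced bit instead of WAKEKILL. */
static inline unsigned int demo_ptrace_signal_wake_up(bool resume)
{
	return demo_signal_wake_up_state(resume ? DEMO_TASK_TRACED_BIT : 0);
}

/* A sleeping task is woken only if its state intersects the mask. */
static inline bool demo_would_wake(unsigned int task_state, unsigned int mask)
{
	return (task_state & mask) != 0;
}
```

In this model, a tracee whose state has only the traced bit set is missed by
demo_signal_wake_up(true) but is woken by demo_ptrace_signal_wake_up(true).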

Signed-off-by: Oleg Nesterov 
Signed-off-by: Linus Torvalds 
Signed-off-by: Ben Hutchings