linux-stable.git/fs/proc/array.c, branch linux-4.2.y

fs/proc, core/debug: Don't expose absolute kernel addresses via wchan

2015-12-09T19:31:16+00:00

commit b2f73922d119686323f14fbbe46587f863852328 upstream.

So the /proc/PID/stat 'wchan' field (the 30th field, which contains
the absolute kernel address of the kernel function a task is blocked in)
leaks absolute kernel addresses to unprivileged user-space:

        seq_put_decimal_ull(m, ' ', wchan);

The absolute address might also leak via /proc/PID/wchan as well, if
KALLSYMS is turned off or if the symbol lookup fails for some reason:

static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
                          struct pid *pid, struct task_struct *task)
{
        unsigned long wchan;
        char symname[KSYM_NAME_LEN];

        wchan = get_wchan(task);

        if (lookup_symbol_name(wchan, symname) < 0) {
                if (!ptrace_may_access(task, PTRACE_MODE_READ))
                        return 0;
                seq_printf(m, "%lu", wchan);
        } else {
                seq_printf(m, "%s", symname);
        }

        return 0;
}

This isn't ideal, because for example it trivially leaks the KASLR offset
to any local attacker:

  fomalhaut:~> printf "%016lx\n" $(cat /proc/$$/stat | cut -d' ' -f35)
  ffffffff8123b380

Most real-life uses of wchan are symbolic:

  ps -eo pid:10,tid:10,wchan:30,comm

and procps uses /proc/PID/wchan, not the absolute address in /proc/PID/stat:

  triton:~/tip> strace -f ps -eo pid:10,tid:10,wchan:30,comm 2>&1 | grep wchan | tail -1
  open("/proc/30833/wchan", O_RDONLY)     = 6

There's one compatibility quirk here: procps relies on whether the
absolute value is non-zero - and we can provide that functionality
by outputing "0" or "1" depending on whether the task is blocked
(whether there's a wchan address).

These days there appears to be very little legitimate reason
user-space would be interested in  the absolute address. The
absolute address is mostly historic: from the days when we
didn't have kallsyms and user-space procps had to do the
decoding itself via the System.map.

So this patch sets all numeric output to "0" or "1" and keeps only
symbolic output, in /proc/PID/wchan.

( The absolute sleep address can generally still be profiled via
  perf, by tasks with sufficient privileges. )

Reviewed-by: Thomas Gleixner 
Acked-by: Kees Cook 
Acked-by: Linus Torvalds 
Cc: Al Viro 
Cc: Alexander Potapenko 
Cc: Andrey Konovalov 
Cc: Andrey Ryabinin 
Cc: Andy Lutomirski 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Denys Vlasenko 
Cc: Dmitry Vyukov 
Cc: Kostya Serebryany 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Peter Zijlstra 
Cc: Sasha Levin 
Cc: kasan-dev 
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930135917.GA3285@gmail.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

fs, proc: introduce CONFIG_PROC_CHILDREN

2015-06-26T00:00:37+00:00

Commit 818411616baf ("fs, proc: introduce /proc//task//children
entry") introduced the children entry for checkpoint restore and the
file is only available on kernels configured with CONFIG_EXPERT and
CONFIG_CHECKPOINT_RESTORE.

This is available in most distributions (Fedora, Debian, Ubuntu, CoreOS)
because they usually enable CONFIG_EXPERT and CONFIG_CHECKPOINT_RESTORE.
But Arch does not enable CONFIG_EXPERT or CONFIG_CHECKPOINT_RESTORE.

However, the children proc file is useful outside of checkpoint restore.
I would like to use it in rkt.  The rkt process exec() another program
it does not control, and that other program will fork()+exec() a child
process.  I would like to find the pid of the child process from an
external tool without iterating in /proc over all processes to find
which one has a parent pid equal to rkt.

This commit introduces CONFIG_PROC_CHILDREN and makes
CONFIG_CHECKPOINT_RESTORE select it.  This allows enabling
/proc//task//children without needing to enable
CONFIG_CHECKPOINT_RESTORE and CONFIG_EXPERT.

Alban tested that /proc//task//children is present when the
kernel is configured with CONFIG_PROC_CHILDREN=y but without
CONFIG_CHECKPOINT_RESTORE

Signed-off-by: Iago López Galeiras 
Tested-by: Alban Crequy 
Reviewed-by: Cyrill Gorcunov 
Cc: Oleg Nesterov 
Cc: Kees Cook 
Cc: Pavel Emelyanov 
Cc: Serge Hallyn 
Cc: KAMEZAWA Hiroyuki 
Cc: Alexander Viro 
Cc: Andy Lutomirski 
Cc: Djalal Harouni 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

procfs: treat parked tasks as sleeping for task state

2015-06-25T00:49:40+00:00

Allowing watchdog threads to be parked means that we now have the
opportunity of actually seeing persistent parked threads in the output
of /proc//stat and /proc//status.  The existing code reported
such threads as "Running", which is kind-of true if you think of the
case where we park them as part of taking cpus offline.  But if we allow
parking them indefinitely, "Running" is pretty misleading, so we report
them as "Sleeping" instead.

We could simply report them with a new string, "Parked", but it feels
like it's a bit risky for userspace to see unexpected new values; the
output is already documented in Documentation/filesystems/proc.txt, and
it seems like a mistake to change that lightly.

The scheduler does report parked tasks with a "P" in debugging output
from sched_show_task() or dump_cpu_task(), but that's a different API.
Similarly, the trace_ctxwake_* routines report a "P" for parked tasks,
but again, different API.

This change seemed slightly cleaner than updating the task_state_array
to have additional rows.  TASK_DEAD should be subsumed by the exit_state
bits; TASK_WAKEKILL is just a modifier; and TASK_WAKING can very
reasonably be reported as "Running" (as it is now).  Only TASK_PARKED
shows up with unreasonable output here.

Signed-off-by: Chris Metcalf 
Cc: Don Zickus 
Cc: Ingo Molnar 
Cc: Ulrich Obergfell 
Cc: Thomas Gleixner 
Cc: Peter Zijlstra 
Cc: Frederic Weisbecker 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc: remove use of seq_printf return value

2015-04-15T23:35:25+00:00

The seq_printf return value, because it's frequently misused,
will eventually be converted to void.

See: commit 1f33c41c03da ("seq_file: Rename seq_overflow() to
     seq_has_overflowed() and make public")

Signed-off-by: Joe Perches 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

lib/string_helpers.c: change semantics of string_escape_mem

2015-04-15T23:35:24+00:00

The current semantics of string_escape_mem are inadequate for one of its
current users, vsnprintf().  If that is to honour its contract, it must
know how much space would be needed for the entire escaped buffer, and
string_escape_mem provides no way of obtaining that (short of allocating a
large enough buffer (~4 times input string) to let it play with, and
that's definitely a big no-no inside vsnprintf).

So change the semantics for string_escape_mem to be more snprintf-like:
Return the size of the output that would be generated if the destination
buffer was big enough, but of course still only write to the part of dst
it is allowed to, and (contrary to snprintf) don't do '\0'-termination.
It is then up to the caller to detect whether output was truncated and to
append a '\0' if desired.  Also, we must output partial escape sequences,
otherwise a call such as snprintf(buf, 3, "%1pE", "\123") would cause
printf to write a \0 to buf[2] but leaving buf[0] and buf[1] with whatever
they previously contained.

This also fixes a bug in the escaped_string() helper function, which used
to unconditionally pass a length of "end-buf" to string_escape_mem();
since the latter doesn't check osz for being insanely large, it would
happily write to dst.  For example, kasprintf(GFP_KERNEL, "something and
then %pE", ...); is an easy way to trigger an oops.

In test-string_helpers.c, the -ENOMEM test is replaced with testing for
getting the expected return value even if the buffer is too small.  We
also ensure that nothing is written (by relying on a NULL pointer deref)
if the output size is 0 by passing NULL - this has to work for
kasprintf("%pE") to work.

In net/sunrpc/cache.c, I think qword_add still has the same semantics.
Someone should definitely double-check this.

In fs/proc/array.c, I made the minimum possible change, but longer-term it
should stop poking around in seq_file internals.

[andriy.shevchenko@linux.intel.com: simplify qword_add]
[andriy.shevchenko@linux.intel.com: add missed curly braces]
Signed-off-by: Rasmus Villemoes 
Acked-by: Andy Shevchenko 
Signed-off-by: Andy Shevchenko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

/proc/PID/status: show all sets of pid according to ns

2015-04-15T23:35:22+00:00

If some issues occurred inside a container guest, host user could not know
which process is in trouble just by guest pid: the users of container
guest only knew the pid inside containers.  This will bring obstacle for
trouble shooting.

This patch adds four fields: NStgid, NSpid, NSpgid and NSsid:

a) In init_pid_ns, nothing changed;

b) In one pidns, will tell the pid inside containers:
  NStgid: 21776   5       1
  NSpid:  21776   5       1
  NSpgid: 21776   5       1
  NSsid:  21729   1       0
  ** Process id is 21776 in level 0, 5 in level 1, 1 in level 2.

c) If pidns is nested, it depends on which pidns are you in.
  NStgid: 5       1
  NSpid:  5       1
  NSpgid: 5       1
  NSsid:  1       0
  ** Views from level 1

[akpm@linux-foundation.org: add CONFIG_PID_NS ifdef]
Signed-off-by: Chen Hanxiao 
Acked-by: Serge Hallyn 
Acked-by: "Eric W. Biederman" 
Tested-by: Serge Hallyn 
Tested-by: Nathan Scott 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc: use %*pb[l] to print bitmaps including cpumasks and nodemasks

2015-02-14T05:21:38+00:00

printk and friends can now format bitmaps using '%*pb[l]'.  cpumask
and nodemask also provide cpumask_pr_args() and nodemask_pr_args()
respectively which can be used to generate the two printf arguments
necessary to format the specified cpu/nodemask.

Signed-off-by: Tejun Heo 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

fs/proc/array.c: convert to use string_escape_str()

2015-02-13T02:54:12+00:00

Instead of custom approach let's use string_escape_str() to escape a given
string (task_name in this case).

Signed-off-by: Andy Shevchenko 
Cc: Oleg Nesterov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc: task_state: ptrace_parent() doesn't need pid_alive() check

2014-12-11T01:41:09+00:00

p->ptrace != 0 means that release_task(p) was not called, so pid_alive()
buys nothing and we can remove this check.  Other callers already use it
directly without additional checks.

Note: with or without this patch ptrace_parent() can return the pointer to
the freed task, this will be explained/fixed later.

Signed-off-by: Oleg Nesterov 
Cc: Aaron Tomlin 
Cc: Alexey Dobriyan 
Cc: "Eric W. Biederman" ,
Cc: Sterling Alexander 
Cc: Peter Zijlstra 
Cc: Roland McGrath 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

proc: task_state: move the main seq_printf() outside of rcu_read_lock()

2014-12-11T01:41:09+00:00

task_state() does seq_printf() under rcu_read_lock(), but this is only
needed for task_tgid_nr_ns() and task_numa_group_id().  We can calculate
tgid/ngid and drop rcu lock.

Signed-off-by: Oleg Nesterov 
Cc: Aaron Tomlin 
Cc: Alexey Dobriyan 
Cc: "Eric W. Biederman" ,
Cc: Sterling Alexander 
Cc: Peter Zijlstra 
Cc: Roland McGrath 
Reviewed-by: Paul E. McKenney 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds