linux-stable.git/fs/proc, branch v4.1.41

proc: Fix unbalanced hard link numbers

2017-05-17T19:08:24+00:00

[ Upstream commit d66bb1607e2d8d384e53f3d93db5c18483c8c4f7 ]

proc_create_mount_point() forgot to increase the parent's nlink, and
it resulted in unbalanced hard link numbers, e.g. /proc/fs shows one
less than expected.
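
As a rough illustration of the bookkeeping involved (a user-space toy
model, not the kernel code; struct toy_dir and the toy_* helpers are
hypothetical names), each subdirectory's ".." entry counts as one link
on its parent, so directory creation has to bump the parent's count:

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for a proc directory entry; not the kernel's struct. */
struct toy_dir {
    struct toy_dir *parent;
    unsigned int nlink; /* 2 for an empty directory: "." plus its name */
};

/* Initialise a root directory that has no parent. */
void toy_root_init(struct toy_dir *root)
{
    root->parent = NULL;
    root->nlink = 2;
}

/* Create a subdirectory; its ".." entry is one more link to the parent. */
void toy_mkdir(struct toy_dir *parent, struct toy_dir *child)
{
    child->parent = parent;
    child->nlink = 2;
    parent->nlink++; /* the step the mount-point path forgot */
}

/* Remove a subdirectory and hand the parent's link back. */
void toy_rmdir(struct toy_dir *child)
{
    assert(child->parent != NULL);
    child->parent->nlink--;
    child->parent = NULL;
}
```

If toy_mkdir() skipped the increment while toy_rmdir() still
decremented, the parent would end up one link short, which is the kind
of imbalance described above.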

Fixes: eb6d38d5427b ("proc: Allow creating permanently empty directories...")
Cc: stable@vger.kernel.org
Reported-by: Tristan Ye 
Signed-off-by: Takashi Iwai 
Signed-off-by: Eric W. Biederman 
Signed-off-by: Sasha Levin

sysctl: Drop reference added by grab_header in proc_sys_readdir

2017-03-06T22:29:17+00:00

[ Upstream commit 93362fa47fe98b62e4a34ab408c4a418432e7939 ]

Fixes CVE-2016-9191: proc_sys_readdir does not drop the reference
added by grab_header when it returns from the !dir_emit_dots path.
As a result, any path that calls unregister_sysctl_table can
wait forever.

The calltrace of CVE-2016-9191:

[ 5535.960522] Call Trace:
[ 5535.963265]  [] schedule+0x3f/0xa0
[ 5535.968817]  [] schedule_timeout+0x3db/0x6f0
[ 5535.975346]  [] ? wait_for_completion+0x45/0x130
[ 5535.982256]  [] wait_for_completion+0xc3/0x130
[ 5535.988972]  [] ? wake_up_q+0x80/0x80
[ 5535.994804]  [] drop_sysctl_table+0xc4/0xe0
[ 5536.001227]  [] drop_sysctl_table+0x77/0xe0
[ 5536.007648]  [] unregister_sysctl_table+0x4d/0xa0
[ 5536.014654]  [] unregister_sysctl_table+0x7f/0xa0
[ 5536.021657]  [] unregister_sched_domain_sysctl+0x15/0x40
[ 5536.029344]  [] partition_sched_domains+0x44/0x450
[ 5536.036447]  [] ? __mutex_unlock_slowpath+0x111/0x1f0
[ 5536.043844]  [] rebuild_sched_domains_locked+0x64/0xb0
[ 5536.051336]  [] update_flag+0x11d/0x210
[ 5536.057373]  [] ? mutex_lock_nested+0x2df/0x450
[ 5536.064186]  [] ? cpuset_css_offline+0x1b/0x60
[ 5536.070899]  [] ? trace_hardirqs_on+0xd/0x10
[ 5536.077420]  [] ? mutex_lock_nested+0x2df/0x450
[ 5536.084234]  [] ? css_killed_work_fn+0x25/0x220
[ 5536.091049]  [] cpuset_css_offline+0x35/0x60
[ 5536.097571]  [] css_killed_work_fn+0x5c/0x220
[ 5536.104207]  [] process_one_work+0x1df/0x710
[ 5536.110736]  [] ? process_one_work+0x160/0x710
[ 5536.117461]  [] worker_thread+0x12b/0x4a0
[ 5536.123697]  [] ? process_one_work+0x710/0x710
[ 5536.130426]  [] kthread+0xfe/0x120
[ 5536.135991]  [] ret_from_fork+0x1f/0x40
[ 5536.142041]  [] ? kthread_create_on_node+0x230/0x230

One cgroup maintainer mentioned that "cgroup is trying to offline
a cpuset css, which takes place under cgroup_mutex.  The offlining
ends up trying to drain active usages of a sysctl table which apparently
is not happening."
The real reason is that proc_sys_readdir does not drop the reference
added by grab_header when it returns from the !dir_emit_dots path. So
this cpuset offline path will wait here forever.
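
The leak pattern can be sketched with a toy model (user-space C, not
the kernel source; struct toy_header, toy_grab, toy_finish and
toy_readdir are hypothetical stand-ins for the sysctl header and its
grab/finish helpers). Every exit taken after the grab has to pass
through the matching drop, including the early exit when emitting the
dot entries fails:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy sysctl header: unregistering waits until 'used' drops to zero. */
struct toy_header {
    int used;
};

/* Take an active reference on the header. */
bool toy_grab(struct toy_header *h)
{
    h->used++;
    return true;
}

/* Drop an active reference taken by toy_grab(). */
void toy_finish(struct toy_header *h)
{
    assert(h->used > 0);
    h->used--;
}

/* emit_dots_ok models whether emitting "." and ".." succeeded. */
int toy_readdir(struct toy_header *h, bool emit_dots_ok)
{
    if (!toy_grab(h))
        return -1;

    if (!emit_dots_ok)
        goto out; /* a bare "return 0" here would leak the reference */

    /* ... the remaining directory entries would be emitted here ... */
out:
    toy_finish(h);
    return 0;
}
```

With the early path routed through the common exit, the use count is
back at zero after every call, so a waiter on that count can make
progress.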

See here for details: http://www.openwall.com/lists/oss-security/2016/11/04/13

Fixes: f0c3b5093add ("[readdir] convert procfs")
Cc: stable@vger.kernel.org
Reported-by: CAI Qian 
Tested-by: Yang Shukui 
Signed-off-by: Zhou Chengming 
Acked-by: Al Viro 
Signed-off-by: Eric W. Biederman 
Signed-off-by: Sasha Levin

fs: Give dentry to inode_change_ok() instead of inode

2016-12-23T13:56:35+00:00

[ Upstream commit 31051c85b5e2aaaf6315f74c72a732673632a905 ]

inode_change_ok() will be responsible for clearing capabilities and IMA
extended attributes and as such will need a dentry. Give it a dentry as
an argument instead of an inode. Also rename inode_change_ok() to
setattr_prepare() to better reflect that it also makes some
modifications in addition to performing checks.
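
A minimal sketch of why the argument type matters, using toy types
rather than the kernel's (struct toy_inode, struct toy_dentry,
toy_setattr_prepare and the has_security_xattr field are all
hypothetical): a helper that receives the dentry can still reach the
inode for its checks, and additionally has the handle it needs for the
modification step.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Toy inode: owner plus a flag standing in for a security xattr. */
struct toy_inode {
    unsigned int owner_uid;
    int has_security_xattr;
};

/* Toy dentry: the name-side object that points at the inode. */
struct toy_dentry {
    struct toy_inode *d_inode;
};

/*
 * Check first, then modify. The permission rule here (root or owner)
 * is a simplification chosen for the example only.
 */
int toy_setattr_prepare(struct toy_dentry *dentry, unsigned int caller_uid)
{
    struct toy_inode *inode = dentry->d_inode;

    assert(inode != NULL);
    if (caller_uid != 0 && caller_uid != inode->owner_uid)
        return -EPERM;

    inode->has_security_xattr = 0; /* the "also modifies" part */
    return 0;
}
```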

References: CVE-2015-1350
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jan Kara 
Signed-off-by: Philipp Hahn 
Signed-off-by: Sasha Levin

fs/proc/task_mmu.c: fix mm_access() mode parameter in pagemap_read()

2016-08-12T17:27:29+00:00

Backport of caaee6234d05a58c5b4d05e7bf766131b810a657 ("ptrace: use fsuid,
fsgid, effective creds for fs access checks") to v4.1 failed to update the
mode parameter in the mm_access() call in pagemap_read() to have one of the
new PTRACE_MODE_*CREDS flags.

Attempting to read any other process' pagemap results in a WARN():

WARNING: CPU: 0 PID: 883 at kernel/ptrace.c:229 __ptrace_may_access+0x14a/0x160()
denying ptrace access check without PTRACE_MODE_*CREDS
Modules linked in: loop sg e1000 i2c_piix4 ppdev virtio_balloon virtio_pci parport_pc i2c_core virtio_ring ata_generic serio_raw pata_acpi virtio parport pcspkr floppy acpi_cpufreq ip_tables ext3 mbcache jbd sd_mod ata_piix crc32c_intel libata
CPU: 0 PID: 883 Comm: cat Tainted: G        W       4.1.12-51.el7uek.x86_64 #2
Hardware name: Bochs Bochs, BIOS Bochs 01/01/2011
  0000000000000286 00000000619f225a ffff88003b6fbc18 ffffffff81717021
  ffff88003b6fbc70 ffffffff819be870 ffff88003b6fbc58 ffffffff8108477a
  000000003b6fbc58 0000000000000001 ffff88003d287000 0000000000000001
Call Trace:
  [] dump_stack+0x63/0x81
  [] warn_slowpath_common+0x8a/0xc0
  [] warn_slowpath_fmt+0x55/0x70
  [] __ptrace_may_access+0x14a/0x160
  [] ptrace_may_access+0x32/0x50
  [] mm_access+0x6d/0xb0
  [] pagemap_read+0xe1/0x360
  [] ? lru_cache_add_active_or_unevictable+0x2b/0xa0
  [] __vfs_read+0x37/0x100
  [] ? security_file_permission+0x84/0xa0
  [] ? rw_verify_area+0x56/0xe0
  [] vfs_read+0x86/0x140
  [] SyS_read+0x55/0xd0
  [] system_call_fastpath+0x12/0x71
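
The rule behind this warning can be modelled as follows. This is a
sketch with hypothetical TOY_* constants whose values are chosen for
the example, not copied from the kernel headers: an access mode is
only accepted when exactly one of the two credential-selection flags
is set, so a bare read mode is rejected.

```c
#include <assert.h>
#include <errno.h>

/* Illustrative flag values; assumptions for this sketch only. */
#define TOY_MODE_READ      0x01u
#define TOY_MODE_FSCREDS   0x08u
#define TOY_MODE_REALCREDS 0x10u

/* Return 0 if the mode names exactly one credential set, else -EPERM. */
int toy_check_mode(unsigned int mode)
{
    int fs = (mode & TOY_MODE_FSCREDS) != 0;
    int real = (mode & TOY_MODE_REALCREDS) != 0;

    if (fs == real) /* neither flag or both flags: refuse */
        return -EPERM;
    return 0;
}
```

Under this model a caller passing only TOY_MODE_READ is denied, while
TOY_MODE_READ | TOY_MODE_FSCREDS passes the flag check.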

Fixes: ab88ce5feca4 (ptrace: use fsuid, fsgid, effective creds for fs access checks)
Signed-off-by: Kenny Keslar 
Cc: Roland McGrath 
Cc: Oleg Nesterov 
Cc: stable@vger.kernel.org
Signed-off-by: Sasha Levin

proc: prevent accessing /proc/<PID>/environ until it's ready

2016-07-11T03:07:18+00:00

[ Upstream commit 8148a73c9901a8794a50f950083c00ccf97d43b3 ]

If /proc/<PID>/environ gets read before the envp[] array is fully set up
in create_{aout,elf,elf_fdpic,flat}_tables(), we might end up trying to
read more bytes than are actually written: env_start will already be
set but env_end will still be zero, so the range calculation
underflows and allows reading beyond the end of what has been written.

Fix this as it is done for /proc/<PID>/cmdline by testing env_end for
zero.  It is, apparently, intentionally set last in create_*_tables().

This bug was found by the PaX size_overflow plugin that detected the
arithmetic underflow of 'this_len = env_end - (env_start + src)' when
env_end is still zero.
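
The arithmetic can be reproduced in user space. The sketch below is a
toy model, not the procfs code; toy_unchecked_len and toy_environ_len
are hypothetical helpers and the addresses in the note that follows
are made up. It shows the unsigned wrap-around when env_end is zero
and the effect of testing env_end first:

```c
#include <assert.h>
#include <stddef.h>

/* The raw range calculation: wraps around if env_end is still zero. */
unsigned long toy_unchecked_len(unsigned long env_start,
                                unsigned long env_end,
                                unsigned long src)
{
    return env_end - (env_start + src);
}

/* Number of bytes that may be copied for a read at offset src. */
size_t toy_environ_len(unsigned long env_start, unsigned long env_end,
                       unsigned long src, size_t want)
{
    unsigned long this_len;

    if (env_end == 0) /* environment not fully set up yet */
        return 0;
    assert(env_end >= env_start);
    if (src >= env_end - env_start) /* offset is past the end */
        return 0;

    this_len = env_end - (env_start + src);
    return this_len < want ? (size_t)this_len : want;
}
```

For example, with env_start = 0x7000 and env_end = 0 the unchecked
helper returns a value close to ULONG_MAX, whereas the checked helper
returns 0.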

The expected consequence is that userland trying to access
/proc/<PID>/environ of a not yet fully set up process may get
inconsistent data as we're in the middle of copying in the environment
variables.

Fixes: https://forums.grsecurity.net/viewtopic.php?f=3&t=4363
Fixes: https://bugzilla.kernel.org/show_bug.cgi?id=116461
Signed-off-by: Mathias Krause 
Cc: Emese Revfy 
Cc: Pax Team 
Cc: Al Viro 
Cc: Mateusz Guzik 
Cc: Alexey Dobriyan 
Cc: Cyrill Gorcunov 
Cc: Jarod Wilson 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

proc: prevent stacking filesystems on top

2016-06-18T20:47:32+00:00

[ Upstream commit e54ad7f1ee263ffa5a2de9c609d58dfa27b21cd9 ]

This prevents stacking filesystems (ecryptfs and overlayfs) from using
procfs as lower filesystem.  There is too much magic going on inside
procfs, and there is no good reason to stack stuff on top of procfs.

(For example, procfs does access checks in VFS open handlers, and
ecryptfs by design calls open handlers from a kernel thread that doesn't
drop privileges or so.)
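
The message does not spell out the mechanism. One way to achieve this,
assumed here for illustration, is for procfs to report itself as
already being at the maximum stacking depth, so that any stacking
filesystem's own depth check refuses it as a lower layer. A toy model
(TOY_MAX_STACK_DEPTH, struct toy_sb and the toy_* helpers are
hypothetical):

```c
#include <assert.h>
#include <errno.h>

#define TOY_MAX_STACK_DEPTH 2 /* illustrative limit for this sketch */

/* Toy superblock: only the stacking depth matters here. */
struct toy_sb {
    int s_stack_depth;
};

/* procfs marks itself as already at the maximum depth. */
void toy_procfs_fill_super(struct toy_sb *sb)
{
    sb->s_stack_depth = TOY_MAX_STACK_DEPTH;
}

/* A stacking filesystem checks the resulting depth before mounting. */
int toy_stack_on(struct toy_sb *upper, const struct toy_sb *lower)
{
    int depth = lower->s_stack_depth + 1;

    if (depth > TOY_MAX_STACK_DEPTH)
        return -EINVAL;
    upper->s_stack_depth = depth;
    return 0;
}
```

In this model an ordinary lower filesystem at depth 0 can be stacked
on, while the procfs superblock cannot.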

Signed-off-by: Jann Horn 
Cc: stable@vger.kernel.org
Signed-off-by: Linus Torvalds 
Signed-off-by: Sasha Levin

ptrace: use fsuid, fsgid, effective creds for fs access checks

2016-04-12T02:07:35+00:00

[ Upstream commit caaee6234d05a58c5b4d05e7bf766131b810a657 ]

By checking the effective credentials instead of the real UID / permitted
capabilities, ensure that the calling process actually intended to use its
credentials.

To ensure that all ptrace checks use the correct caller credentials (e.g.
in case out-of-tree code or newly added code omits the PTRACE_MODE_*CREDS
flag), use two new flags and require one of them to be set.

The problem was that when a privileged task had temporarily dropped its
privileges, e.g.  by calling setreuid(0, user_uid), with the intent to
perform following syscalls with the credentials of a user, it still passed
ptrace access checks that the user would not be able to pass.
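
A reduced model of the UID part of the check shows the difference.
This is a sketch only: struct toy_cred and toy_may_access are
hypothetical, and GIDs, capabilities and security modules are left
out. After setreuid(0, user_uid) the real UID is still 0 while the
effective and filesystem UIDs belong to the user, so a comparison
based on the real UID still matches a root-owned target:

```c
#include <assert.h>
#include <stdbool.h>

/* Toy credentials: only the UID fields relevant to the example. */
struct toy_cred {
    unsigned int uid;   /* real */
    unsigned int euid;  /* effective */
    unsigned int suid;  /* saved */
    unsigned int fsuid; /* filesystem; follows euid in this model */
};

/* UID-only access check; use_fscreds selects which caller UID is used. */
bool toy_may_access(const struct toy_cred *caller,
                    const struct toy_cred *target, bool use_fscreds)
{
    unsigned int caller_uid = use_fscreds ? caller->fsuid : caller->uid;

    return caller_uid == target->euid &&
           caller_uid == target->suid &&
           caller_uid == target->uid;
}
```

With a caller of { uid 0, euid 1000, suid 0, fsuid 1000 } and a target
owned entirely by UID 0, the real-UID comparison grants access and the
fsuid comparison denies it.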

While an attacker should not be able to convince the privileged task to
perform a ptrace() syscall, this is a problem because the ptrace access
check is reused for things in procfs.

In particular, the following somewhat interesting procfs entries only rely
on ptrace access checks:

 /proc/$pid/stat - uses the check for determining whether pointers
     should be visible, useful for bypassing ASLR
 /proc/$pid/maps - also useful for bypassing ASLR
 /proc/$pid/cwd - useful for gaining access to restricted
     directories that contain files with lax permissions, e.g. in
     this scenario:
     lrwxrwxrwx root root /proc/13020/cwd -> /root/foobar
     drwx------ root root /root
     drwxr-xr-x root root /root/foobar
     -rw-r--r-- root root /root/foobar/secret

Therefore, on a system where a root-owned mode 6755 binary changes its
effective credentials as described and then dumps a user-specified file,
this could be used by an attacker to reveal the memory layout of root's
processes or reveal the contents of files he is not allowed to access
(through /proc/$pid/cwd).

[akpm@linux-foundation.org: fix warning]
Signed-off-by: Jann Horn 
Acked-by: Kees Cook 
Cc: Casey Schaufler 
Cc: Oleg Nesterov 
Cc: Ingo Molnar 
Cc: James Morris 
Cc: "Serge E. Hallyn" 
Cc: Andy Shevchenko 
Cc: Andy Lutomirski 
Cc: Al Viro 
Cc: "Eric W. Biederman" 
Cc: Willy Tarreau 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 

Signed-off-by: Sasha Levin

fs/proc, core/debug: Don't expose absolute kernel addresses via wchan

2015-12-09T19:03:20+00:00

commit b2f73922d119686323f14fbbe46587f863852328 upstream.

So the /proc/PID/stat 'wchan' field (the 35th field, which contains
the absolute kernel address of the kernel function a task is blocked in)
leaks absolute kernel addresses to unprivileged user-space:

        seq_put_decimal_ull(m, ' ', wchan);

The absolute address might also leak via /proc/PID/wchan as well, if
KALLSYMS is turned off or if the symbol lookup fails for some reason:

static int proc_pid_wchan(struct seq_file *m, struct pid_namespace *ns,
                          struct pid *pid, struct task_struct *task)
{
        unsigned long wchan;
        char symname[KSYM_NAME_LEN];

        wchan = get_wchan(task);

        if (lookup_symbol_name(wchan, symname) < 0) {
                if (!ptrace_may_access(task, PTRACE_MODE_READ))
                        return 0;
                seq_printf(m, "%lu", wchan);
        } else {
                seq_printf(m, "%s", symname);
        }

        return 0;
}

This isn't ideal, because for example it trivially leaks the KASLR offset
to any local attacker:

  fomalhaut:~> printf "%016lx\n" $(cat /proc/$$/stat | cut -d' ' -f35)
  ffffffff8123b380

Most real-life uses of wchan are symbolic:

  ps -eo pid:10,tid:10,wchan:30,comm

and procps uses /proc/PID/wchan, not the absolute address in /proc/PID/stat:

  triton:~/tip> strace -f ps -eo pid:10,tid:10,wchan:30,comm 2>&1 | grep wchan | tail -1
  open("/proc/30833/wchan", O_RDONLY)     = 6

There's one compatibility quirk here: procps relies on whether the
absolute value is non-zero - and we can provide that functionality
by outputting "0" or "1" depending on whether the task is blocked
(whether there's a wchan address).

These days there appears to be very little legitimate reason
user-space would be interested in the absolute address. The
absolute address is mostly historic: from the days when we
didn't have kallsyms and user-space procps had to do the
decoding itself via the System.map.

So this patch sets all numeric output to "0" or "1" and keeps only
symbolic output, in /proc/PID/wchan.
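
The numeric side of that change can be sketched in user space as
follows (toy_format_stat_wchan is a hypothetical helper, not the
procfs function): the address itself never reaches the output, only
whether it is non-zero.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/*
 * Format the wchan field for the stat line: "1" if the task is
 * blocked (wchan is non-zero), "0" otherwise. Returns what snprintf
 * returns.
 */
int toy_format_stat_wchan(char *buf, size_t len, unsigned long wchan)
{
    assert(buf != NULL && len > 0);
    return snprintf(buf, len, "%d", wchan != 0 ? 1 : 0);
}
```

A consumer that only tests the field for non-zero keeps working, while
the address bits are no longer exposed.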

( The absolute sleep address can generally still be profiled via
  perf, by tasks with sufficient privileges. )

Reviewed-by: Thomas Gleixner 
Acked-by: Kees Cook 
Acked-by: Linus Torvalds 
Cc: Al Viro 
Cc: Alexander Potapenko 
Cc: Andrey Konovalov 
Cc: Andrey Ryabinin 
Cc: Andy Lutomirski 
Cc: Borislav Petkov 
Cc: Denys Vlasenko 
Cc: Dmitry Vyukov 
Cc: Kostya Serebryany 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Sasha Levin 
Cc: kasan-dev 
Cc: linux-kernel@vger.kernel.org
Link: http://lkml.kernel.org/r/20150930135917.GA3285@gmail.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

mnt: Refactor the logic for mounting sysfs and proc in a user namespace

2015-07-21T17:10:01+00:00

commit 1b852bceb0d111e510d1a15826ecc4a19358d512 upstream.

Fresh mounts of proc and sysfs are a very special case that works very
much like a bind mount.  Unfortunately the current structure can not
preserve the MNT_LOCK... mount flags.  Therefore refactor the logic
into a form that can be modified to preserve those lock bits.

Add a new filesystem flag FS_USERNS_VISIBLE that requires some mount
of the filesystem be fully visible in the current mount namespace,
before the filesystem may be mounted.

Move the logic for calling fs_fully_visible from proc and sysfs into
fs/namespace.c where it has greater access to mount namespace state.

Signed-off-by: "Eric W. Biederman" 
Signed-off-by: Greg Kroah-Hartman

proc: Allow creating permanently empty directories that serve as mount points

2015-07-21T17:10:00+00:00

commit eb6d38d5427b3ad42f5268da0f1dd31bb0af1264 upstream.

Add a new function proc_create_mount_point that creates a
directory that can not be added to.

Add a new function is_empty_pde to test if a proc directory entry is
such a mount point.

Update the code to use make_empty_dir_inode when reporting
a permanently empty directory to the vfs.

Update the code to not allow adding to permanently empty directories.
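
The intended behaviour can be modelled with a small sketch (toy types
only; struct toy_pde, toy_is_empty_pde and toy_add_child are
hypothetical, and the error codes are chosen for the example): a
directory flagged as permanently empty rejects every attempt to add a
child.

```c
#include <assert.h>
#include <errno.h>
#include <stdbool.h>
#include <stddef.h>

/* Toy proc directory entry. */
struct toy_pde {
    bool is_dir;
    bool permanently_empty; /* set for mount-point directories */
    int nchildren;
};

/* True if the entry is a permanently empty mount-point directory. */
bool toy_is_empty_pde(const struct toy_pde *pde)
{
    return pde->permanently_empty;
}

/* Try to register a child under parent. */
int toy_add_child(struct toy_pde *parent)
{
    assert(parent != NULL);
    if (!parent->is_dir)
        return -ENOTDIR;
    if (toy_is_empty_pde(parent))
        return -EINVAL; /* nothing may be added to a mount point */
    parent->nchildren++;
    return 0;
}
```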

Update /proc/openprom and /proc/fs/nfsd to be permanently empty directories.

Signed-off-by: "Eric W. Biederman" 
Signed-off-by: Greg Kroah-Hartman