<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/fs/proc, branch v5.4.97</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>proc: fix lookup in /proc/net subdirectories after setns(2)</title>
<updated>2021-01-12T19:16:10+00:00</updated>
<author>
<name>Alexey Dobriyan</name>
<email>adobriyan@gmail.com</email>
</author>
<published>2020-12-16T04:42:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=9ea03f6890ceee39f03655814097c8933a22725d'/>
<id>9ea03f6890ceee39f03655814097c8933a22725d</id>
<content type='text'>
[ Upstream commit c6c75deda81344c3a95d1d1f606d5cee109e5d54 ]

Commit 1fde6f21d90f ("proc: fix /proc/net/* after setns(2)") only forced
revalidation of regular files under /proc/net/

However, /proc/net/ is unusual in the sense of /proc/net/foo handlers
take netns pointer from parent directory which is old netns.

Steps to reproduce:

	(void)open("/proc/net/sctp/snmp", O_RDONLY);
	unshare(CLONE_NEWNET);

	int fd = open("/proc/net/sctp/snmp", O_RDONLY);
	read(fd, &amp;c, 1);

Read will read wrong data from original netns.

Patch forces lookup on every directory under /proc/net .

Link: https://lkml.kernel.org/r/20201205160916.GA109739@localhost.localdomain
Fixes: 1da4d377f943 ("proc: revalidate misc dentries")
Signed-off-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" &lt;tommi.t.rantala@nokia.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit c6c75deda81344c3a95d1d1f606d5cee109e5d54 ]

Commit 1fde6f21d90f ("proc: fix /proc/net/* after setns(2)") only forced
revalidation of regular files under /proc/net/

However, /proc/net/ is unusual in the sense of /proc/net/foo handlers
take netns pointer from parent directory which is old netns.

Steps to reproduce:

	(void)open("/proc/net/sctp/snmp", O_RDONLY);
	unshare(CLONE_NEWNET);

	int fd = open("/proc/net/sctp/snmp", O_RDONLY);
	read(fd, &amp;c, 1);

Read will read wrong data from original netns.

Patch forces lookup on every directory under /proc/net .

Link: https://lkml.kernel.org/r/20201205160916.GA109739@localhost.localdomain
Fixes: 1da4d377f943 ("proc: revalidate misc dentries")
Signed-off-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Reported-by: "Rantala, Tommi T. (Nokia - FI/Espoo)" &lt;tommi.t.rantala@nokia.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>proc: change -&gt;nlink under proc_subdir_lock</title>
<updated>2021-01-12T19:16:10+00:00</updated>
<author>
<name>Alexey Dobriyan</name>
<email>adobriyan@gmail.com</email>
</author>
<published>2019-12-05T00:49:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=d2942e958f26cb9d3b7ef2726783004ad1177749'/>
<id>d2942e958f26cb9d3b7ef2726783004ad1177749</id>
<content type='text'>
[ Upstream commit e06689bf57017ac022ccf0f2a5071f760821ce0f ]

Currently gluing PDE into global /proc tree is done under lock, but
changing -&gt;nlink is not.  Additionally struct proc_dir_entry::nlink is
not atomic so updates can be lost.

Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2
Signed-off-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit e06689bf57017ac022ccf0f2a5071f760821ce0f ]

Currently gluing PDE into global /proc tree is done under lock, but
changing -&gt;nlink is not.  Additionally struct proc_dir_entry::nlink is
not atomic so updates can be lost.

Link: http://lkml.kernel.org/r/20190925202436.GA17388@avx2
Signed-off-by: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>exec: Transform exec_update_mutex into a rw_semaphore</title>
<updated>2021-01-09T12:44:55+00:00</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-12-03T20:12:00+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=117433236ae296d9770442960ddf57459177e90e'/>
<id>117433236ae296d9770442960ddf57459177e90e</id>
<content type='text'>
[ Upstream commit f7cfd871ae0c5008d94b6f66834e7845caa93c15 ]

Recently syzbot reported[0] that there is a deadlock amongst the users
of exec_update_mutex.  The problematic lock ordering found by lockdep
was:

   perf_event_open  (exec_update_mutex -&gt; ovl_i_mutex)
   chown            (ovl_i_mutex       -&gt; sb_writes)
   sendfile         (sb_writes         -&gt; p-&gt;lock)
     by reading from a proc file and writing to overlayfs
   proc_pid_syscall (p-&gt;lock           -&gt; exec_update_mutex)

While looking at possible solutions it occured to me that all of the
users and possible users involved only wanted to state of the given
process to remain the same.  They are all readers.  The only writer is
exec.

There is no reason for readers to block on each other.  So fix
this deadlock by transforming exec_update_mutex into a rw_semaphore
named exec_update_lock that only exec takes for writing.

Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Vasiliy Kulikov &lt;segoon@openwall.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Bernd Edlinger &lt;bernd.edlinger@hotmail.de&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Christopher Yeoh &lt;cyeoh@au1.ibm.com&gt;
Cc: Cyrill Gorcunov &lt;gorcunov@gmail.com&gt;
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@kernel.org&gt;
Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex")
[0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit f7cfd871ae0c5008d94b6f66834e7845caa93c15 ]

Recently syzbot reported[0] that there is a deadlock amongst the users
of exec_update_mutex.  The problematic lock ordering found by lockdep
was:

   perf_event_open  (exec_update_mutex -&gt; ovl_i_mutex)
   chown            (ovl_i_mutex       -&gt; sb_writes)
   sendfile         (sb_writes         -&gt; p-&gt;lock)
     by reading from a proc file and writing to overlayfs
   proc_pid_syscall (p-&gt;lock           -&gt; exec_update_mutex)

While looking at possible solutions it occured to me that all of the
users and possible users involved only wanted to state of the given
process to remain the same.  They are all readers.  The only writer is
exec.

There is no reason for readers to block on each other.  So fix
this deadlock by transforming exec_update_mutex into a rw_semaphore
named exec_update_lock that only exec takes for writing.

Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Vasiliy Kulikov &lt;segoon@openwall.com&gt;
Cc: Al Viro &lt;viro@zeniv.linux.org.uk&gt;
Cc: Bernd Edlinger &lt;bernd.edlinger@hotmail.de&gt;
Cc: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Christopher Yeoh &lt;cyeoh@au1.ibm.com&gt;
Cc: Cyrill Gorcunov &lt;gorcunov@gmail.com&gt;
Cc: Sargun Dhillon &lt;sargun@sargun.me&gt;
Cc: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Cc: Arnd Bergmann &lt;arnd@arndb.de&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Ingo Molnar &lt;mingo@redhat.com&gt;
Cc: Arnaldo Carvalho de Melo &lt;acme@kernel.org&gt;
Fixes: eea9673250db ("exec: Add exec_update_mutex to replace cred_guard_mutex")
[0] https://lkml.kernel.org/r/00000000000063640c05ade8e3de@google.com
Reported-by: syzbot+db9cdf3dd1f64252c6ef@syzkaller.appspotmail.com
Link: https://lkml.kernel.org/r/87ft4mbqen.fsf@x220.int.ebiederm.org
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>proc: use untagged_addr() for pagemap_read addresses</title>
<updated>2020-12-16T09:56:58+00:00</updated>
<author>
<name>Miles Chen</name>
<email>miles.chen@mediatek.com</email>
</author>
<published>2020-12-11T21:36:31+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7d5fc53439a1adc8b4b57b49bbce7cc2e9bb0919'/>
<id>7d5fc53439a1adc8b4b57b49bbce7cc2e9bb0919</id>
<content type='text'>
commit 40d6366e9d86d9a67b5642040e76082fdb5bdcf9 upstream.

When we try to visit the pagemap of a tagged userspace pointer, we find
that the start_vaddr is not correct because of the tag.
To fix it, we should untag the userspace pointers in pagemap_read().

I tested with 5.10-rc4 and the issue remains.

Explanation from Catalin in [1]:

 "Arguably, that's a user-space bug since tagged file offsets were never
  supported. In this case it's not even a tag at bit 56 as per the arm64
  tagged address ABI but rather down to bit 47. You could say that the
  problem is caused by the C library (malloc()) or whoever created the
  tagged vaddr and passed it to this function. It's not a kernel
  regression as we've never supported it.

  Now, pagemap is a special case where the offset is usually not
  generated as a classic file offset but rather derived by shifting a
  user virtual address. I guess we can make a concession for pagemap
  (only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"

My test code is based on [2]:

A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8

userspace program:

  uint64 OsLayer::VirtualToPhysical(void *vaddr) {
	uint64 frame, paddr, pfnmask, pagemask;
	int pagesize = sysconf(_SC_PAGESIZE);
	off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
	int fd = open(kPagemapPath, O_RDONLY);
	...

	if (lseek64(fd, off, SEEK_SET) != off || read(fd, &amp;frame, 8) != 8) {
		int err = errno;
		string errtxt = ErrorString(err);
		if (fd &gt;= 0)
			close(fd);
		return 0;
	}
  ...
  }

kernel fs/proc/task_mmu.c:

  static ssize_t pagemap_read(struct file *file, char __user *buf,
		size_t count, loff_t *ppos)
  {
	...
	src = *ppos;
	svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54
	start_vaddr = svpfn &lt;&lt; PAGE_SHIFT; // start_vaddr == 0xb400007662f54000
	end_vaddr = mm-&gt;task_size;

	/* watch out for wraparound */
	// svpfn == 0xb400007662f54
	// (mm-&gt;task_size &gt;&gt; PAGE) == 0x8000000
	if (svpfn &gt; mm-&gt;task_size &gt;&gt; PAGE_SHIFT) // the condition is true because of the tag 0xb4
		start_vaddr = end_vaddr;

	ret = 0;
	while (count &amp;&amp; (start_vaddr &lt; end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
		int len;
		unsigned long end;
		...
	}
	...
  }

[1] https://lore.kernel.org/patchwork/patch/1343258/
[2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158

Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.com
Signed-off-by: Miles Chen &lt;miles.chen@mediatek.com&gt;
Reviewed-by: Vincenzo Frascino &lt;vincenzo.frascino@arm.com&gt;
Reviewed-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Cc: Alexander Potapenko &lt;glider@google.com&gt;
Cc: Vincenzo Frascino &lt;vincenzo.frascino@arm.com&gt;
Cc: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Marco Elver &lt;elver@google.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Cc: Song Bao Hua (Barry Song) &lt;song.bao.hua@hisilicon.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[5.4-]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit 40d6366e9d86d9a67b5642040e76082fdb5bdcf9 upstream.

When we try to visit the pagemap of a tagged userspace pointer, we find
that the start_vaddr is not correct because of the tag.
To fix it, we should untag the userspace pointers in pagemap_read().

I tested with 5.10-rc4 and the issue remains.

Explanation from Catalin in [1]:

 "Arguably, that's a user-space bug since tagged file offsets were never
  supported. In this case it's not even a tag at bit 56 as per the arm64
  tagged address ABI but rather down to bit 47. You could say that the
  problem is caused by the C library (malloc()) or whoever created the
  tagged vaddr and passed it to this function. It's not a kernel
  regression as we've never supported it.

  Now, pagemap is a special case where the offset is usually not
  generated as a classic file offset but rather derived by shifting a
  user virtual address. I guess we can make a concession for pagemap
  (only) and allow such offset with the tag at bit (56 - PAGE_SHIFT + 3)"

My test code is based on [2]:

A userspace pointer which has been tagged by 0xb4: 0xb400007662f541c8

userspace program:

  uint64 OsLayer::VirtualToPhysical(void *vaddr) {
	uint64 frame, paddr, pfnmask, pagemask;
	int pagesize = sysconf(_SC_PAGESIZE);
	off64_t off = ((uintptr_t)vaddr) / pagesize * 8; // off = 0xb400007662f541c8 / pagesize * 8 = 0x5a00003b317aa0
	int fd = open(kPagemapPath, O_RDONLY);
	...

	if (lseek64(fd, off, SEEK_SET) != off || read(fd, &amp;frame, 8) != 8) {
		int err = errno;
		string errtxt = ErrorString(err);
		if (fd &gt;= 0)
			close(fd);
		return 0;
	}
  ...
  }

kernel fs/proc/task_mmu.c:

  static ssize_t pagemap_read(struct file *file, char __user *buf,
		size_t count, loff_t *ppos)
  {
	...
	src = *ppos;
	svpfn = src / PM_ENTRY_BYTES; // svpfn == 0xb400007662f54
	start_vaddr = svpfn &lt;&lt; PAGE_SHIFT; // start_vaddr == 0xb400007662f54000
	end_vaddr = mm-&gt;task_size;

	/* watch out for wraparound */
	// svpfn == 0xb400007662f54
	// (mm-&gt;task_size &gt;&gt; PAGE) == 0x8000000
	if (svpfn &gt; mm-&gt;task_size &gt;&gt; PAGE_SHIFT) // the condition is true because of the tag 0xb4
		start_vaddr = end_vaddr;

	ret = 0;
	while (count &amp;&amp; (start_vaddr &lt; end_vaddr)) { // we cannot visit correct entry because start_vaddr is set to end_vaddr
		int len;
		unsigned long end;
		...
	}
	...
  }

[1] https://lore.kernel.org/patchwork/patch/1343258/
[2] https://github.com/stressapptest/stressapptest/blob/master/src/os.cc#L158

Link: https://lkml.kernel.org/r/20201204024347.8295-1-miles.chen@mediatek.com
Signed-off-by: Miles Chen &lt;miles.chen@mediatek.com&gt;
Reviewed-by: Vincenzo Frascino &lt;vincenzo.frascino@arm.com&gt;
Reviewed-by: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: Andrey Konovalov &lt;andreyknvl@google.com&gt;
Cc: Alexander Potapenko &lt;glider@google.com&gt;
Cc: Vincenzo Frascino &lt;vincenzo.frascino@arm.com&gt;
Cc: Andrey Ryabinin &lt;aryabinin@virtuozzo.com&gt;
Cc: Catalin Marinas &lt;catalin.marinas@arm.com&gt;
Cc: Dmitry Vyukov &lt;dvyukov@google.com&gt;
Cc: Marco Elver &lt;elver@google.com&gt;
Cc: Will Deacon &lt;will@kernel.org&gt;
Cc: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Cc: Song Bao Hua (Barry Song) &lt;song.bao.hua@hisilicon.com&gt;
Cc: &lt;stable@vger.kernel.org&gt;	[5.4-]
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>proc: don't allow async path resolution of /proc/self components</title>
<updated>2020-12-02T07:49:48+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2020-11-13T23:47:52+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=01968f9af0061b161045edefd6dcaf9fe4fdd496'/>
<id>01968f9af0061b161045edefd6dcaf9fe4fdd496</id>
<content type='text'>
[ Upstream commit 8d4c3e76e3be11a64df95ddee52e99092d42fc19 ]

If this is attempted by a kthread, then return -EOPNOTSUPP as we don't
currently support that. Once we can get task_pid_ptr() doing the right
thing, then this can go away again.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 8d4c3e76e3be11a64df95ddee52e99092d42fc19 ]

If this is attempted by a kthread, then return -EOPNOTSUPP as we don't
currently support that. Once we can get task_pid_ptr() doing the right
thing, then this can go away again.

Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>mm, oom_adj: don't loop through tasks in __set_oom_adj when not necessary</title>
<updated>2020-10-29T08:57:45+00:00</updated>
<author>
<name>Suren Baghdasaryan</name>
<email>surenb@google.com</email>
</author>
<published>2020-10-13T23:58:35+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=91e4c12a3bf4cd6723eb9b599dfab9c16ab2c281'/>
<id>91e4c12a3bf4cd6723eb9b599dfab9c16ab2c281</id>
<content type='text'>
[ Upstream commit 67197a4f28d28d0b073ab0427b03cb2ee5382578 ]

Currently __set_oom_adj loops through all processes in the system to keep
oom_score_adj and oom_score_adj_min in sync between processes sharing
their mm.  This is done for any task with more that one mm_users, which
includes processes with multiple threads (sharing mm and signals).
However for such processes the loop is unnecessary because their signal
structure is shared as well.

Android updates oom_score_adj whenever a tasks changes its role
(background/foreground/...) or binds to/unbinds from a service, making it
more/less important.  Such operation can happen frequently.  We noticed
that updates to oom_score_adj became more expensive and after further
investigation found out that the patch mentioned in "Fixes" introduced a
regression.  Using Pixel 4 with a typical Android workload, write time to
oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
linearly depends on the number of multi-threaded processes running on the
system.

Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
(CLONE_VM &amp;&amp; !CLONE_THREAD &amp;&amp; !CLONE_VFORK).  Change __set_oom_adj to use
MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
update should be synchronized between multiple processes.  To prevent
races between clone() and __set_oom_adj(), when oom_score_adj of the
process being cloned might be modified from userspace, we use
oom_adj_mutex.  Its scope is changed to global.

The combination of (CLONE_VM &amp;&amp; !CLONE_THREAD) is rarely used except for
the case of vfork().  To prevent performance regressions of vfork(), we
skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
sharing the mm exits) is left out of this patch to keep it simple and
because it is believed that this threading model is rare.  Should there
ever be a need for optimizing that case as well, it can be done by hooking
into the exit path, likely following the mm_update_next_owner pattern.

With the combination of (CLONE_VM &amp;&amp; !CLONE_THREAD &amp;&amp; !CLONE_VFORK) being
quite rare, the regression is gone after the change is applied.

[surenb@google.com: v3]
  Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com

Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
Reported-by: Tim Murray &lt;timmurray@google.com&gt;
Suggested-by: Michal Hocko &lt;mhocko@kernel.org&gt;
Signed-off-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Eugene Syromiatnikov &lt;esyr@redhat.com&gt;
Cc: Christian Kellner &lt;christian@kellner.me&gt;
Cc: Adrian Reber &lt;areber@redhat.com&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Aleksa Sarai &lt;cyphar@cyphar.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Alexey Gladkov &lt;gladkov.alexey@gmail.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Daniel Jordan &lt;daniel.m.jordan@oracle.com&gt;
Cc: Andrei Vagin &lt;avagin@gmail.com&gt;
Cc: Bernd Edlinger &lt;bernd.edlinger@hotmail.de&gt;
Cc: John Johansen &lt;john.johansen@canonical.com&gt;
Cc: Yafang Shao &lt;laoar.shao@gmail.com&gt;
Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
Debugged-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 67197a4f28d28d0b073ab0427b03cb2ee5382578 ]

Currently __set_oom_adj loops through all processes in the system to keep
oom_score_adj and oom_score_adj_min in sync between processes sharing
their mm.  This is done for any task with more that one mm_users, which
includes processes with multiple threads (sharing mm and signals).
However for such processes the loop is unnecessary because their signal
structure is shared as well.

Android updates oom_score_adj whenever a tasks changes its role
(background/foreground/...) or binds to/unbinds from a service, making it
more/less important.  Such operation can happen frequently.  We noticed
that updates to oom_score_adj became more expensive and after further
investigation found out that the patch mentioned in "Fixes" introduced a
regression.  Using Pixel 4 with a typical Android workload, write time to
oom_score_adj increased from ~3.57us to ~362us.  Moreover this regression
linearly depends on the number of multi-threaded processes running on the
system.

Mark the mm with a new MMF_MULTIPROCESS flag bit when task is created with
(CLONE_VM &amp;&amp; !CLONE_THREAD &amp;&amp; !CLONE_VFORK).  Change __set_oom_adj to use
MMF_MULTIPROCESS instead of mm_users to decide whether oom_score_adj
update should be synchronized between multiple processes.  To prevent
races between clone() and __set_oom_adj(), when oom_score_adj of the
process being cloned might be modified from userspace, we use
oom_adj_mutex.  Its scope is changed to global.

The combination of (CLONE_VM &amp;&amp; !CLONE_THREAD) is rarely used except for
the case of vfork().  To prevent performance regressions of vfork(), we
skip taking oom_adj_mutex and setting MMF_MULTIPROCESS when CLONE_VFORK is
specified.  Clearing the MMF_MULTIPROCESS flag (when the last process
sharing the mm exits) is left out of this patch to keep it simple and
because it is believed that this threading model is rare.  Should there
ever be a need for optimizing that case as well, it can be done by hooking
into the exit path, likely following the mm_update_next_owner pattern.

With the combination of (CLONE_VM &amp;&amp; !CLONE_THREAD &amp;&amp; !CLONE_VFORK) being
quite rare, the regression is gone after the change is applied.

[surenb@google.com: v3]
  Link: https://lkml.kernel.org/r/20200902012558.2335613-1-surenb@google.com

Fixes: 44a70adec910 ("mm, oom_adj: make sure processes sharing mm have same view of oom_score_adj")
Reported-by: Tim Murray &lt;timmurray@google.com&gt;
Suggested-by: Michal Hocko &lt;mhocko@kernel.org&gt;
Signed-off-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Acked-by: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Acked-by: Michal Hocko &lt;mhocko@suse.com&gt;
Acked-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Cc: Ingo Molnar &lt;mingo@kernel.org&gt;
Cc: Peter Zijlstra &lt;peterz@infradead.org&gt;
Cc: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Eugene Syromiatnikov &lt;esyr@redhat.com&gt;
Cc: Christian Kellner &lt;christian@kellner.me&gt;
Cc: Adrian Reber &lt;areber@redhat.com&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Aleksa Sarai &lt;cyphar@cyphar.com&gt;
Cc: Alexey Dobriyan &lt;adobriyan@gmail.com&gt;
Cc: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Cc: Alexey Gladkov &lt;gladkov.alexey@gmail.com&gt;
Cc: Michel Lespinasse &lt;walken@google.com&gt;
Cc: Daniel Jordan &lt;daniel.m.jordan@oracle.com&gt;
Cc: Andrei Vagin &lt;avagin@gmail.com&gt;
Cc: Bernd Edlinger &lt;bernd.edlinger@hotmail.de&gt;
Cc: John Johansen &lt;john.johansen@canonical.com&gt;
Cc: Yafang Shao &lt;laoar.shao@gmail.com&gt;
Link: https://lkml.kernel.org/r/20200824153036.3201505-1-surenb@google.com
Debugged-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>proc: io_accounting: Use new infrastructure to fix deadlocks in execve</title>
<updated>2020-10-01T11:17:48+00:00</updated>
<author>
<name>Bernd Edlinger</name>
<email>bernd.edlinger@hotmail.de</email>
</author>
<published>2020-03-20T20:27:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=ba7eb0e48e765a3325e651ed362b27980d8a3c3d'/>
<id>ba7eb0e48e765a3325e651ed362b27980d8a3c3d</id>
<content type='text'>
[ Upstream commit 76518d3798855242817e8a8ed76b2d72f4415624 ]

This changes do_io_accounting to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the trace is accessing
/proc/$pid/io for instance.

This should be safe, as the credentials are only used for reading.

Signed-off-by: Bernd Edlinger &lt;bernd.edlinger@hotmail.de&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 76518d3798855242817e8a8ed76b2d72f4415624 ]

This changes do_io_accounting to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the trace is accessing
/proc/$pid/io for instance.

This should be safe, as the credentials are only used for reading.

Signed-off-by: Bernd Edlinger &lt;bernd.edlinger@hotmail.de&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>proc: Use new infrastructure to fix deadlocks in execve</title>
<updated>2020-10-01T11:17:48+00:00</updated>
<author>
<name>Bernd Edlinger</name>
<email>bernd.edlinger@hotmail.de</email>
</author>
<published>2020-03-20T20:27:24+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=4301db49e5569fbb729769bfaf9b21b858330543'/>
<id>4301db49e5569fbb729769bfaf9b21b858330543</id>
<content type='text'>
[ Upstream commit 2db9dbf71bf98d02a0bf33e798e5bfd2a9944696 ]

This changes lock_trace to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the trace is accessing
/proc/$pid/stack for instance.

This should be safe, as the credentials are only used for reading,
and task-&gt;mm is updated on execve under the new exec_update_mutex.

Signed-off-by: Bernd Edlinger &lt;bernd.edlinger@hotmail.de&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
[ Upstream commit 2db9dbf71bf98d02a0bf33e798e5bfd2a9944696 ]

This changes lock_trace to use the new exec_update_mutex
instead of cred_guard_mutex.

This fixes possible deadlocks when the trace is accessing
/proc/$pid/stack for instance.

This should be safe, as the credentials are only used for reading,
and task-&gt;mm is updated on execve under the new exec_update_mutex.

Signed-off-by: Bernd Edlinger &lt;bernd.edlinger@hotmail.de&gt;
Signed-off-by: Eric W. Biederman &lt;ebiederm@xmission.com&gt;
Signed-off-by: Sasha Levin &lt;sashal@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>proc: Use new_inode not new_inode_pseudo</title>
<updated>2020-06-17T14:40:33+00:00</updated>
<author>
<name>Eric W. Biederman</name>
<email>ebiederm@xmission.com</email>
</author>
<published>2020-06-12T14:42:03+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=12b24ba6ce563d20431b5a652d74b42687e71915'/>
<id>12b24ba6ce563d20431b5a652d74b42687e71915</id>
<content type='text'>
commit ef1548adada51a2f32ed7faef50aa465e1b4c5da upstream.

Recently syzbot reported that unmounting proc when there is an ongoing
inotify watch on the root directory of proc could result in a use
after free when the watch is removed after the unmount of proc
when the watcher exits.

Commit 69879c01a0c3 ("proc: Remove the now unnecessary internal mount
of proc") made it easier to unmount proc and allowed syzbot to see the
problem, but looking at the code it has been around for a long time.

Looking at the code the fsnotify watch should have been removed by
fsnotify_sb_delete in generic_shutdown_super.  Unfortunately the inode
was allocated with new_inode_pseudo instead of new_inode so the inode
was not on the sb-&gt;s_inodes list.  Which prevented
fsnotify_unmount_inodes from finding the inode and removing the watch
as well as made it so the "VFS: Busy inodes after unmount" warning
could not find the inodes to warn about them.

Make all of the inodes in proc visible to generic_shutdown_super,
and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo.
The only functional difference is that new_inode places the inodes
on the sb-&gt;s_inodes list.

I wrote a small test program and I can verify that without changes it
can trigger this issue, and by replacing new_inode_pseudo with
new_inode the issues goes away.

Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com
Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com
Fixes: 0097875bd415 ("proc: Implement /proc/thread-self to point at the directory of the current thread")
Fixes: 021ada7dff22 ("procfs: switch /proc/self away from proc_dir_entry")
Fixes: 51f0885e5415 ("vfs,proc: guarantee unique inodes in /proc")
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit ef1548adada51a2f32ed7faef50aa465e1b4c5da upstream.

Recently syzbot reported that unmounting proc when there is an ongoing
inotify watch on the root directory of proc could result in a use
after free when the watch is removed after the unmount of proc
when the watcher exits.

Commit 69879c01a0c3 ("proc: Remove the now unnecessary internal mount
of proc") made it easier to unmount proc and allowed syzbot to see the
problem, but looking at the code it has been around for a long time.

Looking at the code the fsnotify watch should have been removed by
fsnotify_sb_delete in generic_shutdown_super.  Unfortunately the inode
was allocated with new_inode_pseudo instead of new_inode so the inode
was not on the sb-&gt;s_inodes list.  Which prevented
fsnotify_unmount_inodes from finding the inode and removing the watch
as well as made it so the "VFS: Busy inodes after unmount" warning
could not find the inodes to warn about them.

Make all of the inodes in proc visible to generic_shutdown_super,
and fsnotify_sb_delete by using new_inode instead of new_inode_pseudo.
The only functional difference is that new_inode places the inodes
on the sb-&gt;s_inodes list.

I wrote a small test program and I can verify that without changes it
can trigger this issue, and by replacing new_inode_pseudo with
new_inode the issues goes away.

Cc: stable@vger.kernel.org
Link: https://lkml.kernel.org/r/000000000000d788c905a7dfa3f4@google.com
Reported-by: syzbot+7d2debdcdb3cb93c1e5e@syzkaller.appspotmail.com
Fixes: 0097875bd415 ("proc: Implement /proc/thread-self to point at the directory of the current thread")
Fixes: 021ada7dff22 ("procfs: switch /proc/self away from proc_dir_entry")
Fixes: 51f0885e5415 ("vfs,proc: guarantee unique inodes in /proc")
Signed-off-by: "Eric W. Biederman" &lt;ebiederm@xmission.com&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
<entry>
<title>vmalloc: fix remap_vmalloc_range() bounds checks</title>
<updated>2020-04-29T14:33:14+00:00</updated>
<author>
<name>Jann Horn</name>
<email>jannh@google.com</email>
</author>
<published>2020-04-21T01:14:11+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=f4f235309b5cc8e6217774279ae2b263345f5f6a'/>
<id>f4f235309b5cc8e6217774279ae2b263345f5f6a</id>
<content type='text'>
commit bdebd6a2831b6fab69eb85cee74a8ba77f1a1cc2 upstream.

remap_vmalloc_range() has had various issues with the bounds checks it
promises to perform ("This function checks that addr is a valid
vmalloc'ed area, and that it is big enough to cover the vma") over time,
e.g.:

 - not detecting pgoff&lt;&lt;PAGE_SHIFT overflow

 - not detecting (pgoff&lt;&lt;PAGE_SHIFT)+usize overflow

 - not checking whether addr and addr+(pgoff&lt;&lt;PAGE_SHIFT) are the same
   vmalloc allocation

 - comparing a potentially wildly out-of-bounds pointer with the end of
   the vmalloc region

In particular, since commit fc9702273e2e ("bpf: Add mmap() support for
BPF_MAP_TYPE_ARRAY"), unprivileged users can cause kernel null pointer
dereferences by calling mmap() on a BPF map with a size that is bigger
than the distance from the start of the BPF map to the end of the
address space.

This could theoretically be used as a kernel ASLR bypass, by using
whether mmap() with a given offset oopses or returns an error code to
perform a binary search over the possible address range.

To allow remap_vmalloc_range_partial() to verify that addr and
addr+(pgoff&lt;&lt;PAGE_SHIFT) are in the same vmalloc region, pass the offset
to remap_vmalloc_range_partial() instead of adding it to the pointer in
remap_vmalloc_range().

In remap_vmalloc_range_partial(), fix the check against
get_vm_area_size() by using size comparisons instead of pointer
comparisons, and add checks for pgoff.

Fixes: 833423143c3a ("[PATCH] mm: introduce remap_vmalloc_range()")
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: stable@vger.kernel.org
Cc: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Martin KaFai Lau &lt;kafai@fb.com&gt;
Cc: Song Liu &lt;songliubraving@fb.com&gt;
Cc: Yonghong Song &lt;yhs@fb.com&gt;
Cc: Andrii Nakryiko &lt;andriin@fb.com&gt;
Cc: John Fastabend &lt;john.fastabend@gmail.com&gt;
Cc: KP Singh &lt;kpsingh@chromium.org&gt;
Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
commit bdebd6a2831b6fab69eb85cee74a8ba77f1a1cc2 upstream.

remap_vmalloc_range() has had various issues with the bounds checks it
promises to perform ("This function checks that addr is a valid
vmalloc'ed area, and that it is big enough to cover the vma") over time,
e.g.:

 - not detecting pgoff&lt;&lt;PAGE_SHIFT overflow

 - not detecting (pgoff&lt;&lt;PAGE_SHIFT)+usize overflow

 - not checking whether addr and addr+(pgoff&lt;&lt;PAGE_SHIFT) are the same
   vmalloc allocation

 - comparing a potentially wildly out-of-bounds pointer with the end of
   the vmalloc region

In particular, since commit fc9702273e2e ("bpf: Add mmap() support for
BPF_MAP_TYPE_ARRAY"), unprivileged users can cause kernel null pointer
dereferences by calling mmap() on a BPF map with a size that is bigger
than the distance from the start of the BPF map to the end of the
address space.

This could theoretically be used as a kernel ASLR bypass, by using
whether mmap() with a given offset oopses or returns an error code to
perform a binary search over the possible address range.

To allow remap_vmalloc_range_partial() to verify that addr and
addr+(pgoff&lt;&lt;PAGE_SHIFT) are in the same vmalloc region, pass the offset
to remap_vmalloc_range_partial() instead of adding it to the pointer in
remap_vmalloc_range().

In remap_vmalloc_range_partial(), fix the check against
get_vm_area_size() by using size comparisons instead of pointer
comparisons, and add checks for pgoff.

Fixes: 833423143c3a ("[PATCH] mm: introduce remap_vmalloc_range()")
Signed-off-by: Jann Horn &lt;jannh@google.com&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Cc: stable@vger.kernel.org
Cc: Alexei Starovoitov &lt;ast@kernel.org&gt;
Cc: Daniel Borkmann &lt;daniel@iogearbox.net&gt;
Cc: Martin KaFai Lau &lt;kafai@fb.com&gt;
Cc: Song Liu &lt;songliubraving@fb.com&gt;
Cc: Yonghong Song &lt;yhs@fb.com&gt;
Cc: Andrii Nakryiko &lt;andriin@fb.com&gt;
Cc: John Fastabend &lt;john.fastabend@gmail.com&gt;
Cc: KP Singh &lt;kpsingh@chromium.org&gt;
Link: http://lkml.kernel.org/r/20200415222312.236431-1-jannh@google.com
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</pre>
</div>
</content>
</entry>
</feed>
