<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/arch/xtensa, branch v5.10</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>xtensa: uaccess: Add missing __user to strncpy_from_user() prototype</title>
<updated>2020-11-17T13:09:28+00:00</updated>
<author>
<name>Laurent Pinchart</name>
<email>laurent.pinchart@ideasonboard.com</email>
</author>
<published>2020-08-31T21:09:37+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=dc293f2106903ab9c24e9cea18c276e32c394c33'/>
<id>dc293f2106903ab9c24e9cea18c276e32c394c33</id>
<content type='text'>
When adding __user annotations in commit 2adf5352a34a, the
strncpy_from_user() function declaration for the
CONFIG_GENERIC_STRNCPY_FROM_USER case was missed. Fix it.

Reported-by: kernel test robot &lt;lkp@intel.com&gt;
Signed-off-by: Laurent Pinchart &lt;laurent.pinchart@ideasonboard.com&gt;
Message-Id: &lt;20200831210937.17938-1-laurent.pinchart@ideasonboard.com&gt;
Signed-off-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When adding __user annotations in commit 2adf5352a34a, the
strncpy_from_user() function declaration for the
CONFIG_GENERIC_STRNCPY_FROM_USER case was missed. Fix it.

Reported-by: kernel test robot &lt;lkp@intel.com&gt;
Signed-off-by: Laurent Pinchart &lt;laurent.pinchart@ideasonboard.com&gt;
Message-Id: &lt;20200831210937.17938-1-laurent.pinchart@ideasonboard.com&gt;
Signed-off-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xtensa: disable preemption around cache alias management calls</title>
<updated>2020-11-16T10:13:16+00:00</updated>
<author>
<name>Max Filippov</name>
<email>jcmvbkbc@gmail.com</email>
</author>
<published>2020-11-16T09:38:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=3a860d165eb5f4d7cf0bf81ef6a5b5c5e1754422'/>
<id>3a860d165eb5f4d7cf0bf81ef6a5b5c5e1754422</id>
<content type='text'>
Although cache alias management calls set up and tear down TLB entries
and fast_second_level_miss is able to restore TLB entry should it be
evicted they absolutely cannot preempt each other because they use the
same TLBTEMP area for different purposes.
Disable preemption around all cache alias management calls to enforce
that.

Cc: stable@vger.kernel.org
Signed-off-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Although cache alias management calls set up and tear down TLB entries
and fast_second_level_miss is able to restore TLB entry should it be
evicted they absolutely cannot preempt each other because they use the
same TLBTEMP area for different purposes.
Disable preemption around all cache alias management calls to enforce
that.

Cc: stable@vger.kernel.org
Signed-off-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>xtensa: fix TLBTEMP area placement</title>
<updated>2020-11-16T10:13:15+00:00</updated>
<author>
<name>Max Filippov</name>
<email>jcmvbkbc@gmail.com</email>
</author>
<published>2020-11-16T09:25:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=481535c5b41d191b22775a6873de5ec0e1cdced1'/>
<id>481535c5b41d191b22775a6873de5ec0e1cdced1</id>
<content type='text'>
fast_second_level_miss handler for the TLBTEMP area has an assumption
that page table directory entry for the TLBTEMP address range is 0. For
it to be true the TLBTEMP area must be aligned to 4MB boundary and not
share its 4MB region with anything that may use a page table. This is
not true currently: TLBTEMP shares space with vmalloc space which
results in the following kinds of runtime errors when
fast_second_level_miss loads page table directory entry for the vmalloc
space instead of fixing up the TLBTEMP area:

 Unable to handle kernel paging request at virtual address c7ff0e00
  pc = d0009275, ra = 90009478
 Oops: sig: 9 [#1] PREEMPT
 CPU: 1 PID: 61 Comm: kworker/u9:2 Not tainted 5.10.0-rc3-next-20201110-00007-g1fe4962fa983-dirty #58
 Workqueue: xprtiod xs_stream_data_receive_workfn
 a00: 90009478 d11e1dc0 c7ff0e00 00000020 c7ff0000 00000001 7f8b8107 00000000
 a08: 900c5992 d11e1d90 d0cc88b8 5506e97c 00000000 5506e97c d06c8074 d11e1d90
 pc: d0009275, ps: 00060310, depc: 00000014, excvaddr: c7ff0e00
 lbeg: d0009275, lend: d0009287 lcount: 00000003, sar: 00000010
 Call Trace:
   xs_stream_data_receive_workfn+0x43c/0x770
   process_one_work+0x1a1/0x324
   worker_thread+0x1cc/0x3c0
   kthread+0x10d/0x124
   ret_from_kernel_thread+0xc/0x18

Cc: stable@vger.kernel.org
Signed-off-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
fast_second_level_miss handler for the TLBTEMP area has an assumption
that page table directory entry for the TLBTEMP address range is 0. For
it to be true the TLBTEMP area must be aligned to 4MB boundary and not
share its 4MB region with anything that may use a page table. This is
not true currently: TLBTEMP shares space with vmalloc space which
results in the following kinds of runtime errors when
fast_second_level_miss loads page table directory entry for the vmalloc
space instead of fixing up the TLBTEMP area:

 Unable to handle kernel paging request at virtual address c7ff0e00
  pc = d0009275, ra = 90009478
 Oops: sig: 9 [#1] PREEMPT
 CPU: 1 PID: 61 Comm: kworker/u9:2 Not tainted 5.10.0-rc3-next-20201110-00007-g1fe4962fa983-dirty #58
 Workqueue: xprtiod xs_stream_data_receive_workfn
 a00: 90009478 d11e1dc0 c7ff0e00 00000020 c7ff0000 00000001 7f8b8107 00000000
 a08: 900c5992 d11e1d90 d0cc88b8 5506e97c 00000000 5506e97c d06c8074 d11e1d90
 pc: d0009275, ps: 00060310, depc: 00000014, excvaddr: c7ff0e00
 lbeg: d0009275, lend: d0009287 lcount: 00000003, sar: 00000010
 Call Trace:
   xs_stream_data_receive_workfn+0x43c/0x770
   process_one_work+0x1a1/0x324
   worker_thread+0x1cc/0x3c0
   kthread+0x10d/0x124
   ret_from_kernel_thread+0xc/0x18

Cc: stable@vger.kernel.org
Signed-off-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>ARM, xtensa: highmem: avoid clobbering non-page aligned memory reservations</title>
<updated>2020-11-04T08:42:57+00:00</updated>
<author>
<name>Ard Biesheuvel</name>
<email>ardb@kernel.org</email>
</author>
<published>2020-10-31T09:43:45+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=b9bc36704cca500e2b41be4c5bf615c1d7ddc3ce'/>
<id>b9bc36704cca500e2b41be4c5bf615c1d7ddc3ce</id>
<content type='text'>
free_highpages() iterates over the free memblock regions in high
memory, and marks each page as available for the memory management
system.

Until commit cddb5ddf2b76 ("arm, xtensa: simplify initialization of
high memory pages") it rounded beginning of each region upwards and end of
each region downwards.

However, after that commit free_highmem() rounds the beginning and end of
each region downwards, and we may end up freeing a page that is
memblock_reserve()d, resulting in memory corruption.

Restore the original rounding of the region boundaries to avoid freeing
reserved pages.

Fixes: cddb5ddf2b76 ("arm, xtensa: simplify initialization of high memory pages")
Link: https://lore.kernel.org/r/20201029110334.4118-1-ardb@kernel.org/
Link: https://lore.kernel.org/r/20201031094345.6984-1-rppt@kernel.org
Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Co-developed-by:  Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Signed-off-by: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Acked-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
free_highpages() iterates over the free memblock regions in high
memory, and marks each page as available for the memory management
system.

Until commit cddb5ddf2b76 ("arm, xtensa: simplify initialization of
high memory pages") it rounded beginning of each region upwards and end of
each region downwards.

However, after that commit free_highmem() rounds the beginning and end of
each region downwards, and we may end up freeing a page that is
memblock_reserve()d, resulting in memory corruption.

Restore the original rounding of the region boundaries to avoid freeing
reserved pages.

Fixes: cddb5ddf2b76 ("arm, xtensa: simplify initialization of high memory pages")
Link: https://lore.kernel.org/r/20201029110334.4118-1-ardb@kernel.org/
Link: https://lore.kernel.org/r/20201031094345.6984-1-rppt@kernel.org
Signed-off-by: Ard Biesheuvel &lt;ardb@kernel.org&gt;
Co-developed-by:  Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Signed-off-by: Mike Rapoport &lt;rppt@linux.ibm.com&gt;
Acked-by: Max Filippov &lt;jcmvbkbc@gmail.com&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>treewide: Convert macro and uses of __section(foo) to __section("foo")</title>
<updated>2020-10-25T21:51:49+00:00</updated>
<author>
<name>Joe Perches</name>
<email>joe@perches.com</email>
</author>
<published>2020-10-22T02:36:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=33def8498fdde180023444b08e12b72a9efed41d'/>
<id>33def8498fdde180023444b08e12b72a9efed41d</id>
<content type='text'>
Use a more generic form for __section that requires quotes to avoid
complications with clang and gcc differences.

Remove the quote operator # from compiler_attributes.h __section macro.

Convert all unquoted __section(foo) uses to quoted __section("foo").
Also convert __attribute__((section("foo"))) uses to __section("foo")
even if the __attribute__ has multiple list entry forms.

Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

Signed-off-by: Joe Perches &lt;joe@perches.com&gt;
Reviewed-by: Nick Desaulniers &lt;ndesaulniers@gooogle.com&gt;
Reviewed-by: Miguel Ojeda &lt;ojeda@kernel.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Use a more generic form for __section that requires quotes to avoid
complications with clang and gcc differences.

Remove the quote operator # from compiler_attributes.h __section macro.

Convert all unquoted __section(foo) uses to quoted __section("foo").
Also convert __attribute__((section("foo"))) uses to __section("foo")
even if the __attribute__ has multiple list entry forms.

Conversion done using the script at:

    https://lore.kernel.org/lkml/75393e5ddc272dc7403de74d645e6c6e0f4e70eb.camel@perches.com/2-convert_section.pl

Signed-off-by: Joe Perches &lt;joe@perches.com&gt;
Reviewed-by: Nick Desaulniers &lt;ndesaulniers@gooogle.com&gt;
Reviewed-by: Miguel Ojeda &lt;ojeda@kernel.org&gt;
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block</title>
<updated>2020-10-23T17:06:38+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-10-23T17:06:38+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4a22709e21c2b1bedf90f68c823daf65d8e6b491'/>
<id>4a22709e21c2b1bedf90f68c823daf65d8e6b491</id>
<content type='text'>
Pull arch task_work cleanups from Jens Axboe:
 "Two cleanups that don't fit other categories:

   - Finally get the task_work_add() cleanup done properly, so we don't
     have random 0/1/false/true/TWA_SIGNAL confusing use cases. Updates
     all callers, and also fixes up the documentation for
     task_work_add().

   - While working on some TIF related changes for 5.11, this
     TIF_NOTIFY_RESUME cleanup fell out of that. Remove some arch
     duplication for how that is handled"

* tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block:
  task_work: cleanup notification modes
  tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull arch task_work cleanups from Jens Axboe:
 "Two cleanups that don't fit other categories:

   - Finally get the task_work_add() cleanup done properly, so we don't
     have random 0/1/false/true/TWA_SIGNAL confusing use cases. Updates
     all callers, and also fixes up the documentation for
     task_work_add().

   - While working on some TIF related changes for 5.11, this
     TIF_NOTIFY_RESUME cleanup fell out of that. Remove some arch
     duplication for how that is handled"

* tag 'arch-cleanup-2020-10-22' of git://git.kernel.dk/linux-block:
  task_work: cleanup notification modes
  tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs</title>
<updated>2020-10-22T16:59:21+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-10-22T16:59:21+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=f56e65dff6ad52395ef45738799b4fb70ff43376'/>
<id>f56e65dff6ad52395ef45738799b4fb70ff43376</id>
<content type='text'>
Pull initial set_fs() removal from Al Viro:
 "Christoph's set_fs base series + fixups"

* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Allow a NULL pos pointer to __kernel_read
  fs: Allow a NULL pos pointer to __kernel_write
  powerpc: remove address space overrides using set_fs()
  powerpc: use non-set_fs based maccess routines
  x86: remove address space overrides using set_fs()
  x86: make TASK_SIZE_MAX usable from assembly code
  x86: move PAGE_OFFSET, TASK_SIZE &amp; friends to page_{32,64}_types.h
  lkdtm: remove set_fs-based tests
  test_bitmap: remove user bitmap tests
  uaccess: add infrastructure for kernel builds with set_fs()
  fs: don't allow splice read/write without explicit ops
  fs: don't allow kernel reads and writes without iter ops
  sysctl: Convert to iter interfaces
  proc: add a read_iter method to proc proc_ops
  proc: cleanup the compat vs no compat file ops
  proc: remove a level of indentation in proc_get_inode
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull initial set_fs() removal from Al Viro:
 "Christoph's set_fs base series + fixups"

* 'work.set_fs' of git://git.kernel.org/pub/scm/linux/kernel/git/viro/vfs:
  fs: Allow a NULL pos pointer to __kernel_read
  fs: Allow a NULL pos pointer to __kernel_write
  powerpc: remove address space overrides using set_fs()
  powerpc: use non-set_fs based maccess routines
  x86: remove address space overrides using set_fs()
  x86: make TASK_SIZE_MAX usable from assembly code
  x86: move PAGE_OFFSET, TASK_SIZE &amp; friends to page_{32,64}_types.h
  lkdtm: remove set_fs-based tests
  test_bitmap: remove user bitmap tests
  uaccess: add infrastructure for kernel builds with set_fs()
  fs: don't allow splice read/write without explicit ops
  fs: don't allow kernel reads and writes without iter ops
  sysctl: Convert to iter interfaces
  proc: add a read_iter method to proc proc_ops
  proc: cleanup the compat vs no compat file ops
  proc: remove a level of indentation in proc_get_inode
</pre>
</div>
</content>
</entry>
<entry>
<title>mm/madvise: introduce process_madvise() syscall: an external memory hinting API</title>
<updated>2020-10-18T16:27:10+00:00</updated>
<author>
<name>Minchan Kim</name>
<email>minchan@kernel.org</email>
</author>
<published>2020-10-17T23:14:59+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=ecb8ac8b1f146915aa6b96449b66dd48984caacc'/>
<id>ecb8ac8b1f146915aa6b96449b66dd48984caacc</id>
<content type='text'>
There is usecase that System Management Software(SMS) want to give a
memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the
case of Android, it is the ActivityManagerService.

The information required to make the reclaim decision is not known to the
app.  Instead, it is known to the centralized userspace
daemon(ActivityManagerService), and that daemon must be able to initiate
reclaim on its own without any app involvement.

To solve the issue, this patch introduces a new syscall
process_madvise(2).  It uses pidfd of an external process to give the
hint.  It also supports vector address range because Android app has
thousands of vmas due to zygote so it's totally waste of CPU and power if
we should call the syscall one by one for each vma.(With testing 2000-vma
syscall vs 1-vector syscall, it showed 15% performance improvement.  I
think it would be bigger in real practice because the testing ran very
cache friendly environment).

Another potential use case for the vector range is to amortize the cost
ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
benefit users like TCP receive zerocopy and malloc implementations.  In
future, we could find more usecases for other advises so let's make it
happens as API since we introduce a new syscall at this moment.  With
that, existing madvise(2) user could replace it with process_madvise(2)
with their own pid if they want to have batch address ranges support
feature.

ince it could affect other process's address range, only privileged
process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same
UID) gives it the right to ptrace the process could use it successfully.
The flag argument is reserved for future use if we need to extend the API.

I think supporting all hints madvise has/will supported/support to
process_madvise is rather risky.  Because we are not sure all hints make
sense from external process and implementation for the hint may rely on
the caller being in the current context so it could be error-prone.  Thus,
I just limited hints as MADV_[COLD|PAGEOUT] in this patch.

If someone want to add other hints, we could hear the usecase and review
it for each hint.  It's safer for maintenance rather than introducing a
buggy syscall but hard to fix it later.

So finally, the API is as follows,

      ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                unsigned long vlen, int advice, unsigned int flags);

    DESCRIPTION
      The process_madvise() system call is used to give advice or directions
      to the kernel about the address ranges from external process as well as
      local process. It provides the advice to address ranges of process
      described by iovec and vlen. The goal of such advice is to improve
      system or application performance.

      The pidfd selects the process referred to by the PID file descriptor
      specified in pidfd. (See pidofd_open(2) for further information)

      The pointer iovec points to an array of iovec structures, defined in
      &lt;sys/uio.h&gt; as:

        struct iovec {
            void *iov_base;         /* starting address */
            size_t iov_len;         /* number of bytes to be advised */
        };

      The iovec describes address ranges beginning at address(iov_base)
      and with size length of bytes(iov_len).

      The vlen represents the number of elements in iovec.

      The advice is indicated in the advice argument, which is one of the
      following at this moment if the target process specified by pidfd is
      external.

        MADV_COLD
        MADV_PAGEOUT

      Permission to provide a hint to external process is governed by a
      ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

      The process_madvise supports every advice madvise(2) has if target
      process is in same thread group with calling process so user could
      use process_madvise(2) to extend existing madvise(2) to support
      vector address ranges.

    RETURN VALUE
      On success, process_madvise() returns the number of bytes advised.
      This return value may be less than the total number of requested
      bytes, if an error occurred. The caller should check return value
      to determine whether a partial advice occurred.

FAQ:

Q.1 - Why does any external entity have better knowledge?

Quote from Sandeep

"For Android, every application (including the special SystemServer)
are forked from Zygote.  The reason of course is to share as many
libraries and classes between the two as possible to benefit from the
preloading during boot.

After applications start, (almost) all of the APIs end up calling into
this SystemServer process over IPC (binder) and back to the
application.

In a fully running system, the SystemServer monitors every single
process periodically to calculate their PSS / RSS and also decides
which process is "important" to the user for interactivity.

So, because of how these processes start _and_ the fact that the
SystemServer is looping to monitor each process, it does tend to *know*
which address range of the application is not used / useful.

Besides, we can never rely on applications to clean things up
themselves.  We've had the "hey app1, the system is low on memory,
please trim your memory usage down" notifications for a long time[1].
They rely on applications honoring the broadcasts and very few do.

So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant
memory in these applications will be useful.

- ssp

Q.2 - How to guarantee the race(i.e., object validation) between when
giving a hint from an external process and get the hint from the target
process?

process_madvise operates on the target process's address space as it
exists at the instant that process_madvise is called.  If the space
target process can run between the time the process_madvise process
inspects the target process address space and the time that
process_madvise is actually called, process_madvise may operate on
memory regions that the calling process does not expect.  It's the
responsibility of the process calling process_madvise to close this
race condition.  For example, the calling process can suspend the
target process with ptrace, SIGSTOP, or the freezer cgroup so that it
doesn't have an opportunity to change its own address space before
process_madvise is called.  Another option is to operate on memory
regions that the caller knows a priori will be unchanged in the target
process.  Yet another option is to accept the race for certain
process_madvise calls after reasoning that mistargeting will do no
harm.  The suggested API itself does not provide synchronization.  It
also apply other APIs like move_pages, process_vm_write.

The race isn't really a problem though.  Why is it so wrong to require
that callers do their own synchronization in some manner?  Nobody
objects to write(2) merely because it's possible for two processes to
open the same file and clobber each other's writes --- instead, we tell
people to use flock or something.  Think about mmap.  It never
guarantees newly allocated address space is still valid when the user
tries to access it because other threads could unmap the memory right
before.  That's where we need synchronization by using other API or
design from userside.  It shouldn't be part of API itself.  If someone
needs more fine-grained synchronization rather than process level,
there were two ideas suggested - cookie[2] and anon-fd[3].  Both are
applicable via using last reserved argument of the API but I don't
think it's necessary right now since we have already ways to prevent
the race so don't want to add additional complexity with more
fine-grained optimization model.

To make the API extend, it reserved an unsigned long as last argument
so we could support it in future if someone really needs it.

Q.3 - Why doesn't ptrace work?

Injecting an madvise in the target process using ptrace would not work
for us because such injected madvise would have to be executed by the
target process, which means that process would have to be runnable and
that creates the risk of the abovementioned race and hinting a wrong
VMA.  Furthermore, we want to act the hint in caller's context, not the
callee's, because the callee is usually limited in cpuset/cgroups or
even freezed state so they can't act by themselves quick enough, which
causes more thrashing/kill.  It doesn't work if the target process are
ptraced(e.g., strace, debugger, minidump) because a process can have at
most one ptracer.

[1] https://developer.android.com/topic/performance/memory"

[2] process_getinfo for getting the cookie which is updated whenever
    vma of process address layout are changed - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

[3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

[minchan@kernel.org: fix process_madvise build break for arm64]
  Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
[minchan@kernel.org: fix build error for mips of process_madvise]
  Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
[akpm@linux-foundation.org: fix patch ordering issue]
[akpm@linux-foundation.org: fix arm64 whoops]
[minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
[akpm@linux-foundation.org: fix i386 build]
[sfr@canb.auug.org.au: fix syscall numbering]
  Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
[sfr@canb.auug.org.au: madvise.c needs compat.h]
  Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
[minchan@kernel.org: fix mips build]
  Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
[yuehaibing@huawei.com: remove duplicate header which is included twice]
  Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
[minchan@kernel.org: do not use helper functions for process_madvise]
  Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
[akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
[sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
  Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: YueHaibing &lt;yuehaibing@huawei.com&gt;
Signed-off-by: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Cc: Alexander Duyck &lt;alexander.h.duyck@linux.intel.com&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Cc: Christian Brauner &lt;christian@brauner.io&gt;
Cc: Daniel Colascione &lt;dancol@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Joel Fernandes &lt;joel@joelfernandes.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: John Dias &lt;joaodias@google.com&gt;
Cc: Kirill Tkhai &lt;ktkhai@virtuozzo.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Oleksandr Natalenko &lt;oleksandr@redhat.com&gt;
Cc: Sandeep Patil &lt;sspatil@google.com&gt;
Cc: SeongJae Park &lt;sj38.park@gmail.com&gt;
Cc: SeongJae Park &lt;sjpark@amazon.de&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Sonny Rao &lt;sonnyrao@google.com&gt;
Cc: Tim Murray &lt;timmurray@google.com&gt;
Cc: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Cc: Florian Weimer &lt;fw@deneb.enyo.de&gt;
Cc: &lt;linux-man@vger.kernel.org&gt;
Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
There is usecase that System Management Software(SMS) want to give a
memory hint like MADV_[COLD|PAGEEOUT] to other processes and in the
case of Android, it is the ActivityManagerService.

The information required to make the reclaim decision is not known to the
app.  Instead, it is known to the centralized userspace
daemon(ActivityManagerService), and that daemon must be able to initiate
reclaim on its own without any app involvement.

To solve the issue, this patch introduces a new syscall
process_madvise(2).  It uses pidfd of an external process to give the
hint.  It also supports vector address range because Android app has
thousands of vmas due to zygote so it's totally waste of CPU and power if
we should call the syscall one by one for each vma.(With testing 2000-vma
syscall vs 1-vector syscall, it showed 15% performance improvement.  I
think it would be bigger in real practice because the testing ran very
cache friendly environment).

Another potential use case for the vector range is to amortize the cost
ofTLB shootdowns for multiple ranges when using MADV_DONTNEED; this could
benefit users like TCP receive zerocopy and malloc implementations.  In
future, we could find more usecases for other advises so let's make it
happens as API since we introduce a new syscall at this moment.  With
that, existing madvise(2) user could replace it with process_madvise(2)
with their own pid if they want to have batch address ranges support
feature.

ince it could affect other process's address range, only privileged
process(PTRACE_MODE_ATTACH_FSCREDS) or something else(e.g., being the same
UID) gives it the right to ptrace the process could use it successfully.
The flag argument is reserved for future use if we need to extend the API.

I think supporting all hints madvise has/will supported/support to
process_madvise is rather risky.  Because we are not sure all hints make
sense from external process and implementation for the hint may rely on
the caller being in the current context so it could be error-prone.  Thus,
I just limited hints as MADV_[COLD|PAGEOUT] in this patch.

If someone want to add other hints, we could hear the usecase and review
it for each hint.  It's safer for maintenance rather than introducing a
buggy syscall but hard to fix it later.

So finally, the API is as follows,

      ssize_t process_madvise(int pidfd, const struct iovec *iovec,
                unsigned long vlen, int advice, unsigned int flags);

    DESCRIPTION
      The process_madvise() system call is used to give advice or directions
      to the kernel about the address ranges from external process as well as
      local process. It provides the advice to address ranges of process
      described by iovec and vlen. The goal of such advice is to improve
      system or application performance.

      The pidfd selects the process referred to by the PID file descriptor
      specified in pidfd. (See pidofd_open(2) for further information)

      The pointer iovec points to an array of iovec structures, defined in
      &lt;sys/uio.h&gt; as:

        struct iovec {
            void *iov_base;         /* starting address */
            size_t iov_len;         /* number of bytes to be advised */
        };

      The iovec describes address ranges beginning at address(iov_base)
      and with size length of bytes(iov_len).

      The vlen represents the number of elements in iovec.

      The advice is indicated in the advice argument, which is one of the
      following at this moment if the target process specified by pidfd is
      external.

        MADV_COLD
        MADV_PAGEOUT

      Permission to provide a hint to external process is governed by a
      ptrace access mode PTRACE_MODE_ATTACH_FSCREDS check; see ptrace(2).

      The process_madvise supports every advice madvise(2) has if target
      process is in same thread group with calling process so user could
      use process_madvise(2) to extend existing madvise(2) to support
      vector address ranges.

    RETURN VALUE
      On success, process_madvise() returns the number of bytes advised.
      This return value may be less than the total number of requested
      bytes, if an error occurred. The caller should check return value
      to determine whether a partial advice occurred.

FAQ:

Q.1 - Why does any external entity have better knowledge?

Quote from Sandeep

"For Android, every application (including the special SystemServer)
are forked from Zygote.  The reason of course is to share as many
libraries and classes between the two as possible to benefit from the
preloading during boot.

After applications start, (almost) all of the APIs end up calling into
this SystemServer process over IPC (binder) and back to the
application.

In a fully running system, the SystemServer monitors every single
process periodically to calculate their PSS / RSS and also decides
which process is "important" to the user for interactivity.

So, because of how these processes start _and_ the fact that the
SystemServer is looping to monitor each process, it does tend to *know*
which address range of the application is not used / useful.

Besides, we can never rely on applications to clean things up
themselves.  We've had the "hey app1, the system is low on memory,
please trim your memory usage down" notifications for a long time[1].
They rely on applications honoring the broadcasts and very few do.

So, if we want to avoid the inevitable killing of the application and
restarting it, some way to be able to tell the OS about unimportant
memory in these applications will be useful.

- ssp

Q.2 - How to guarantee the race(i.e., object validation) between when
giving a hint from an external process and get the hint from the target
process?

process_madvise operates on the target process's address space as it
exists at the instant that process_madvise is called.  If the space
target process can run between the time the process_madvise process
inspects the target process address space and the time that
process_madvise is actually called, process_madvise may operate on
memory regions that the calling process does not expect.  It's the
responsibility of the process calling process_madvise to close this
race condition.  For example, the calling process can suspend the
target process with ptrace, SIGSTOP, or the freezer cgroup so that it
doesn't have an opportunity to change its own address space before
process_madvise is called.  Another option is to operate on memory
regions that the caller knows a priori will be unchanged in the target
process.  Yet another option is to accept the race for certain
process_madvise calls after reasoning that mistargeting will do no
harm.  The suggested API itself does not provide synchronization.  It
also apply other APIs like move_pages, process_vm_write.

The race isn't really a problem though.  Why is it so wrong to require
that callers do their own synchronization in some manner?  Nobody
objects to write(2) merely because it's possible for two processes to
open the same file and clobber each other's writes --- instead, we tell
people to use flock or something.  Think about mmap.  It never
guarantees newly allocated address space is still valid when the user
tries to access it because other threads could unmap the memory right
before.  That's where we need synchronization by using other API or
design from userside.  It shouldn't be part of API itself.  If someone
needs more fine-grained synchronization rather than process level,
there were two ideas suggested - cookie[2] and anon-fd[3].  Both are
applicable via using last reserved argument of the API but I don't
think it's necessary right now since we have already ways to prevent
the race so don't want to add additional complexity with more
fine-grained optimization model.

To make the API extend, it reserved an unsigned long as last argument
so we could support it in future if someone really needs it.

Q.3 - Why doesn't ptrace work?

Injecting an madvise in the target process using ptrace would not work
for us because such injected madvise would have to be executed by the
target process, which means that process would have to be runnable and
that creates the risk of the abovementioned race and hinting a wrong
VMA.  Furthermore, we want to act the hint in caller's context, not the
callee's, because the callee is usually limited in cpuset/cgroups or
even freezed state so they can't act by themselves quick enough, which
causes more thrashing/kill.  It doesn't work if the target process are
ptraced(e.g., strace, debugger, minidump) because a process can have at
most one ptracer.

[1] https://developer.android.com/topic/performance/memory"

[2] process_getinfo for getting the cookie which is updated whenever
    vma of process address layout are changed - Daniel Colascione -
    https://lore.kernel.org/lkml/20190520035254.57579-1-minchan@kernel.org/T/#m7694416fd179b2066a2c62b5b139b14e3894e224

[3] anonymous fd which is used for the object(i.e., address range)
    validation - Michal Hocko -
    https://lore.kernel.org/lkml/20200120112722.GY18451@dhcp22.suse.cz/

[minchan@kernel.org: fix process_madvise build break for arm64]
  Link: http://lkml.kernel.org/r/20200303145756.GA219683@google.com
[minchan@kernel.org: fix build error for mips of process_madvise]
  Link: http://lkml.kernel.org/r/20200508052517.GA197378@google.com
[akpm@linux-foundation.org: fix patch ordering issue]
[akpm@linux-foundation.org: fix arm64 whoops]
[minchan@kernel.org: make process_madvise() vlen arg have type size_t, per Florian]
[akpm@linux-foundation.org: fix i386 build]
[sfr@canb.auug.org.au: fix syscall numbering]
  Link: https://lkml.kernel.org/r/20200905142639.49fc3f1a@canb.auug.org.au
[sfr@canb.auug.org.au: madvise.c needs compat.h]
  Link: https://lkml.kernel.org/r/20200908204547.285646b4@canb.auug.org.au
[minchan@kernel.org: fix mips build]
  Link: https://lkml.kernel.org/r/20200909173655.GC2435453@google.com
[yuehaibing@huawei.com: remove duplicate header which is included twice]
  Link: https://lkml.kernel.org/r/20200915121550.30584-1-yuehaibing@huawei.com
[minchan@kernel.org: do not use helper functions for process_madvise]
  Link: https://lkml.kernel.org/r/20200921175539.GB387368@google.com
[akpm@linux-foundation.org: pidfd_get_pid() gained an argument]
[sfr@canb.auug.org.au: fix up for "iov_iter: transparently handle compat iovecs in import_iovec"]
  Link: https://lkml.kernel.org/r/20200928212542.468e1fef@canb.auug.org.au

Signed-off-by: Minchan Kim &lt;minchan@kernel.org&gt;
Signed-off-by: YueHaibing &lt;yuehaibing@huawei.com&gt;
Signed-off-by: Stephen Rothwell &lt;sfr@canb.auug.org.au&gt;
Signed-off-by: Andrew Morton &lt;akpm@linux-foundation.org&gt;
Reviewed-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Reviewed-by: Vlastimil Babka &lt;vbabka@suse.cz&gt;
Acked-by: David Rientjes &lt;rientjes@google.com&gt;
Cc: Alexander Duyck &lt;alexander.h.duyck@linux.intel.com&gt;
Cc: Brian Geffon &lt;bgeffon@google.com&gt;
Cc: Christian Brauner &lt;christian@brauner.io&gt;
Cc: Daniel Colascione &lt;dancol@google.com&gt;
Cc: Jann Horn &lt;jannh@google.com&gt;
Cc: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Joel Fernandes &lt;joel@joelfernandes.org&gt;
Cc: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: John Dias &lt;joaodias@google.com&gt;
Cc: Kirill Tkhai &lt;ktkhai@virtuozzo.com&gt;
Cc: Michal Hocko &lt;mhocko@suse.com&gt;
Cc: Oleksandr Natalenko &lt;oleksandr@redhat.com&gt;
Cc: Sandeep Patil &lt;sspatil@google.com&gt;
Cc: SeongJae Park &lt;sj38.park@gmail.com&gt;
Cc: SeongJae Park &lt;sjpark@amazon.de&gt;
Cc: Shakeel Butt &lt;shakeelb@google.com&gt;
Cc: Sonny Rao &lt;sonnyrao@google.com&gt;
Cc: Tim Murray &lt;timmurray@google.com&gt;
Cc: Christian Brauner &lt;christian.brauner@ubuntu.com&gt;
Cc: Florian Weimer &lt;fw@deneb.enyo.de&gt;
Cc: &lt;linux-man@vger.kernel.org&gt;
Link: http://lkml.kernel.org/r/20200302193630.68771-3-minchan@kernel.org
Link: http://lkml.kernel.org/r/20200508183320.GA125527@google.com
Link: http://lkml.kernel.org/r/20200622192900.22757-4-minchan@kernel.org
Link: https://lkml.kernel.org/r/20200901000633.1920247-4-minchan@kernel.org
Signed-off-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>tracehook: clear TIF_NOTIFY_RESUME in tracehook_notify_resume()</title>
<updated>2020-10-17T21:04:36+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2020-10-03T16:49:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=3c532798ec96b6c2d77706f04ed1d8b566a805df'/>
<id>3c532798ec96b6c2d77706f04ed1d8b566a805df</id>
<content type='text'>
All the callers currently do this, clean it up and move the clearing
into tracehook_notify_resume() instead.

Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
All the callers currently do this, clean it up and move the clearing
into tracehook_notify_resume() instead.

Reviewed-by: Oleg Nesterov &lt;oleg@redhat.com&gt;
Reviewed-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial</title>
<updated>2020-10-15T22:11:56+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2020-10-15T22:11:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=bbf625990371782370f6eacb3155dc1fe131ddfc'/>
<id>bbf625990371782370f6eacb3155dc1fe131ddfc</id>
<content type='text'>
Pull trivial updates from Jiri Kosina:
 "The latest advances in computer science from the trivial queue"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  xtensa: fix Kconfig typo
  spelling.txt: Remove some duplicate entries
  mtd: rawnand: oxnas: cleanup/simplify code
  selftests: vm: add fragment CONFIG_GUP_BENCHMARK
  perf: Fix opt help text for --no-bpf-event
  HID: logitech-dj: Fix spelling in comment
  bootconfig: Fix kernel message mentioning CONFIG_BOOT_CONFIG
  MAINTAINERS: rectify MMP SUPPORT after moving cputype.h
  scif: Fix spelling of EACCES
  printk: fix global comment
  lib/bitmap.c: fix spello
  fs: Fix missing 'bit' in comment
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pull trivial updates from Jiri Kosina:
 "The latest advances in computer science from the trivial queue"

* 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/jikos/trivial:
  xtensa: fix Kconfig typo
  spelling.txt: Remove some duplicate entries
  mtd: rawnand: oxnas: cleanup/simplify code
  selftests: vm: add fragment CONFIG_GUP_BENCHMARK
  perf: Fix opt help text for --no-bpf-event
  HID: logitech-dj: Fix spelling in comment
  bootconfig: Fix kernel message mentioning CONFIG_BOOT_CONFIG
  MAINTAINERS: rectify MMP SUPPORT after moving cputype.h
  scif: Fix spelling of EACCES
  printk: fix global comment
  lib/bitmap.c: fix spello
  fs: Fix missing 'bit' in comment
</pre>
</div>
</content>
</entry>
</feed>
