linux-stable.git/include/linux, branch v3.16.67

coredump: fix race condition between mmget_not_zero()/get_task_mm() and core dumping

2019-05-02T20:42:04+00:00

commit 04f5866e41fb70690e28397487d8bd8eea7d712a upstream.

The core dumping code has always run without holding the mmap_sem for
writing, despite that is the only way to ensure that the entire vma
layout will not change from under it.  Only using some signal
serialization on the processes belonging to the mm is not nearly enough.
This was pointed out earlier.  For example in Hugh's post from Jul 2017:

  https://lkml.kernel.org/r/alpine.LSU.2.11.1707191716030.2055@eggly.anvils

  "Not strictly relevant here, but a related note: I was very surprised
   to discover, only quite recently, how handle_mm_fault() may be called
   without down_read(mmap_sem) - when core dumping. That seems a
   misguided optimization to me, which would also be nice to correct"

In particular because the growsdown and growsup can move the
vm_start/vm_end the various loops the core dump does around the vma will
not be consistent if page faults can happen concurrently.

Pretty much all users calling mmget_not_zero()/get_task_mm() and then
taking the mmap_sem had the potential to introduce unexpected side
effects in the core dumping code.

Adding mmap_sem for writing around the ->core_dump invocation is a
viable long term fix, but it requires removing all copy user and page
faults and to replace them with get_dump_page() for all binary formats
which is not suitable as a short term fix.

For the time being this solution manually covers the places that can
confuse the core dump either by altering the vma layout or the vma flags
while it runs.  Once ->core_dump runs under mmap_sem for writing the
function mmget_still_valid() can be dropped.

Allowing mmap_sem protected sections to run in parallel with the
coredump provides some minor parallelism advantage to the swapoff code
(which seems to be safe enough by never mangling any vma field and can
keep doing swapins in parallel to the core dumping) and to some other
corner case.

In order to facilitate the backporting I added "Fixes: 86039bd3b4e6"
however the side effect of this same race condition in /proc/pid/mem
should be reproducible since before 2.6.12-rc2 so I couldn't add any
other "Fixes:" because there's no hash beyond the git genesis commit.

Because find_extend_vma() is the only location outside of the process
context that could modify the "mm" structures under mmap_sem for
reading, by adding the mmget_still_valid() check to it, all other cases
that take the mmap_sem for reading don't need the new check after
mmget_not_zero()/get_task_mm().  The expand_stack() in page fault
context also doesn't need the new check, because all tasks under core
dumping are frozen.

Link: http://lkml.kernel.org/r/20190325224949.11068-1-aarcange@redhat.com
Fixes: 86039bd3b4e6 ("userfaultfd: add new syscall to provide memory externalization")
Signed-off-by: Andrea Arcangeli 
Reported-by: Jann Horn 
Suggested-by: Oleg Nesterov 
Acked-by: Peter Xu 
Reviewed-by: Mike Rapoport 
Reviewed-by: Oleg Nesterov 
Reviewed-by: Jann Horn 
Acked-by: Jason Gunthorpe 
Acked-by: Michal Hocko 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
[bwh: Backported to 3.16:
 - Drop changes in Infiniband and userfaultfd
 - In clear_refs_write(), use up_read() as we never upgrade to a write lock
 - Adjust filename, context]
Signed-off-by: Ben Hutchings

perf/x86: Add check_period PMU callback

2019-05-02T20:41:51+00:00

commit 81ec3f3c4c4d78f2d3b6689c9816bfbdf7417dbb upstream.

Vince (and later on Ravi) reported crashes in the BTS code during
fuzzing with the following backtrace:

  general protection fault: 0000 [#1] SMP PTI
  ...
  RIP: 0010:perf_prepare_sample+0x8f/0x510
  ...
  Call Trace:
   
   ? intel_pmu_drain_bts_buffer+0x194/0x230
   intel_pmu_drain_bts_buffer+0x160/0x230
   ? tick_nohz_irq_exit+0x31/0x40
   ? smp_call_function_single_interrupt+0x48/0xe0
   ? call_function_single_interrupt+0xf/0x20
   ? call_function_single_interrupt+0xa/0x20
   ? x86_schedule_events+0x1a0/0x2f0
   ? x86_pmu_commit_txn+0xb4/0x100
   ? find_busiest_group+0x47/0x5d0
   ? perf_event_set_state.part.42+0x12/0x50
   ? perf_mux_hrtimer_restart+0x40/0xb0
   intel_pmu_disable_event+0xae/0x100
   ? intel_pmu_disable_event+0xae/0x100
   x86_pmu_stop+0x7a/0xb0
   x86_pmu_del+0x57/0x120
   event_sched_out.isra.101+0x83/0x180
   group_sched_out.part.103+0x57/0xe0
   ctx_sched_out+0x188/0x240
   ctx_resched+0xa8/0xd0
   __perf_event_enable+0x193/0x1e0
   event_function+0x8e/0xc0
   remote_function+0x41/0x50
   flush_smp_call_function_queue+0x68/0x100
   generic_smp_call_function_single_interrupt+0x13/0x30
   smp_call_function_single_interrupt+0x3e/0xe0
   call_function_single_interrupt+0xf/0x20
   

The reason is that while event init code does several checks
for BTS events and prevents several unwanted config bits for
BTS event (like precise_ip), the PERF_EVENT_IOC_PERIOD allows
to create BTS event without those checks being done.

Following sequence will cause the crash:

If we create an 'almost' BTS event with precise_ip and callchains,
and it into a BTS event it will crash the perf_prepare_sample()
function because precise_ip events are expected to come
in with callchain data initialized, but that's not the
case for intel_pmu_drain_bts_buffer() caller.

Adding a check_period callback to be called before the period
is changed via PERF_EVENT_IOC_PERIOD. It will deny the change
if the event would become BTS. Plus adding also the limit_period
check as well.

Reported-by: Vince Weaver 
Signed-off-by: Jiri Olsa 
Acked-by: Peter Zijlstra 
Cc: Alexander Shishkin 
Cc: Arnaldo Carvalho de Melo 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Naveen N. Rao 
Cc: Ravi Bangoria 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/20190204123532.GA4794@krava
Signed-off-by: Ingo Molnar 
[bwh: Backported to 3.16:
 - Don't call limit_period operation, which doesn't exist and isn't needed here
 - Add the intel_pmu_has_bts() function, which didn't previously exist here
 - Adjust filenames, context]
Signed-off-by: Ben Hutchings

Rip out get_signal_to_deliver()

2019-05-02T20:41:48+00:00

commit 828b1f65d23cf8a68795739f6dd08fc8abd9ee64 upstream.

Now we can turn get_signal() to the main function.

Signed-off-by: Richard Weinberger 
[bwh: Backported to 3.16 as dependency of commit 35634ffa1751
 "signal: Always notice exiting tasks"]
Signed-off-by: Ben Hutchings

Clean up signal_delivered()

2019-05-02T20:41:48+00:00

commit 10b1c7ac8bfed429cf3dcb0225482c8dc1485d8e upstream.

 - Pass a ksignal struct to it
 - Remove unused regs parameter
 - Make it private as it's nowhere outside of kernel/signal.c is used

Signed-off-by: Richard Weinberger 
[bwh: Backported to 3.16 as dependency of commit 35634ffa1751
 "signal: Always notice exiting tasks"]
Signed-off-by: Ben Hutchings

tracehook_signal_handler: Remove sig, info, ka and regs

2019-05-02T20:41:48+00:00

commit df5601f9c3d831b4c478b004a1ed90a18643adbe upstream.

These parameters are nowhere used, so we can remove them.

Signed-off-by: Richard Weinberger 
[bwh: Backported to 3.16 as dependency of commit 35634ffa1751
 "signal: Always notice exiting tasks"]
Signed-off-by: Ben Hutchings

mm: migration: fix migration of huge PMD shared pages

2019-04-04T15:14:08+00:00

commit 017b1660df89f5fb4bfe66c34e35f7d2031100c7 upstream.

The page migration code employs try_to_unmap() to try and unmap the source
page.  This is accomplished by using rmap_walk to find all vmas where the
page is mapped.  This search stops when page mapcount is zero.  For shared
PMD huge pages, the page map count is always 1 no matter the number of
mappings.  Shared mappings are tracked via the reference count of the PMD
page.  Therefore, try_to_unmap stops prematurely and does not completely
unmap all mappings of the source page.

This problem can result is data corruption as writes to the original
source page can happen after contents of the page are copied to the target
page.  Hence, data is lost.

This problem was originally seen as DB corruption of shared global areas
after a huge page was soft offlined due to ECC memory errors.  DB
developers noticed they could reproduce the issue by (hotplug) offlining
memory used to back huge pages.  A simple testcase can reproduce the
problem by creating a shared PMD mapping (note that this must be at least
PUD_SIZE in size and PUD_SIZE aligned (1GB on x86)), and using
migrate_pages() to migrate process pages between nodes while continually
writing to the huge pages being migrated.

To fix, have the try_to_unmap_one routine check for huge PMD sharing by
calling huge_pmd_unshare for hugetlbfs huge pages.  If it is a shared
mapping it will be 'unshared' which removes the page table entry and drops
the reference on the PMD page.  After this, flush caches and TLB.

mmu notifiers are called before locking page tables, but we can not be
sure of PMD sharing until page tables are locked.  Therefore, check for
the possibility of PMD sharing before locking so that notifiers can
prepare for the worst possible case.

Link: http://lkml.kernel.org/r/20180823205917.16297-2-mike.kravetz@oracle.com
[mike.kravetz@oracle.com: make _range_in_vma() a static inline]
  Link: http://lkml.kernel.org/r/6063f215-a5c8-2f0c-465a-2c515ddc952d@oracle.com
Fixes: 39dde65c9940 ("shared page table for hugetlb page")
Signed-off-by: Mike Kravetz 
Acked-by: Kirill A. Shutemov 
Reviewed-by: Naoya Horiguchi 
Acked-by: Michal Hocko 
Cc: Vlastimil Babka 
Cc: Davidlohr Bueso 
Cc: Jerome Glisse 
Cc: Mike Kravetz 
Signed-off-by: Andrew Morton 
Signed-off-by: Mike Kravetz 
Acked-by: Michal Hocko 
Reviewed-by: Jérôme Glisse 
Signed-off-by: Greg Kroah-Hartman 
[bwh: Backported from 4.4 to 3.16: adjust context]
Signed-off-by: Ben Hutchings

gpiolib: Fix return value of gpio_to_desc() stub if !GPIOLIB

2019-04-04T15:13:49+00:00

commit c5510b8dafce5f3f5a039c9b262ebcae0092c462 upstream.

If CONFIG_GPOILIB is not set, the stub of gpio_to_desc() should return
the same type of error as regular version: NULL.  All the callers
compare the return value of gpio_to_desc() against NULL, so returned
ERR_PTR would be treated as non-error case leading to dereferencing of
error value.

Fixes: 79a9becda894 ("gpiolib: export descriptor-based GPIO interface")
Signed-off-by: Krzysztof Kozlowski 
Signed-off-by: Linus Walleij 
Signed-off-by: Ben Hutchings

KVM: Protect device ops->create and list_add with kvm->lock

2019-03-25T17:32:35+00:00

commit a28ebea2adc4a2bef5989a5a181ec238f59fbcad upstream.

KVM devices were manipulating list data structures without any form of
synchronization, and some implementations of the create operations also
suffered from a lack of synchronization.

Now when we've split the xics create operation into create and init, we
can hold the kvm->lock mutex while calling the create operation and when
manipulating the devices list.

The error path in the generic code gets slightly ugly because we have to
take the mutex again and delete the device from the list, but holding
the mutex during anon_inode_getfd or releasing/locking the mutex in the
common non-error path seemed wrong.

Signed-off-by: Christoffer Dall 
Reviewed-by: Paolo Bonzini 
Acked-by: Christian Borntraeger 
Signed-off-by: Radim Krčmář 
[bwh: Backported to 3.16:
 - Drop change to a failure path that doesn't exist in kvm_vgic_create() 
 - Adjust filename, context]
Signed-off-by: Ben Hutchings

KVM: PPC: Move xics_debugfs_init out of create

2019-03-25T17:32:34+00:00

commit 023e9fddc3616b005c3753fc1bb6526388cd7a30 upstream.

As we are about to hold the kvm->lock during the create operation on KVM
devices, we should move the call to xics_debugfs_init into its own
function, since holding a mutex over extended amounts of time might not
be a good idea.

Introduce an init operation on the kvm_device_ops struct which cannot
fail and call this, if configured, after the device has been created.

Signed-off-by: Christoffer Dall 
Reviewed-by: Paolo Bonzini 
Signed-off-by: Radim Krčmář 
Signed-off-by: Ben Hutchings

HID: debug: fix the ring buffer implementation

2019-03-25T17:32:34+00:00

commit 13054abbaa4f1fd4e6f3b4b63439ec033b4c8035 upstream.

Ring buffer implementation in hid_debug_event() and hid_debug_events_read()
is strange allowing lost or corrupted data. After commit 717adfdaf147
("HID: debug: check length before copy_to_user()") it is possible to enter
an infinite loop in hid_debug_events_read() by providing 0 as count, this
locks up a system. Fix this by rewriting the ring buffer implementation
with kfifo and simplify the code.

This fixes CVE-2019-3819.

v2: fix an execution logic and add a comment
v3: use __set_current_state() instead of set_current_state()

Link: https://bugzilla.redhat.com/show_bug.cgi?id=1669187
Fixes: cd667ce24796 ("HID: use debugfs for events/reports dumping")
Fixes: 717adfdaf147 ("HID: debug: check length before copy_to_user()")
Signed-off-by: Vladis Dronov 
Reviewed-by: Oleg Nesterov 
Signed-off-by: Benjamin Tissoires 
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings