linux-stable.git/kernel/sched, branch v4.0.7

sched, numa: do not hint for NUMA balancing on VM_MIXEDMAP mappings

2015-06-23T00:03:35+00:00

commit 8e76d4eecf7afeec9328e21cd5880e281838d0d6 upstream.

Jovi Zhangwei reported the following problem:

  Below kernel vm bug can be triggered by tcpdump which mmaped a lot of pages
  with GFP_COMP flag.

  [Mon May 25 05:29:33 2015] page:ffffea0015414000 count:66 mapcount:1 mapping:          (null) index:0x0
  [Mon May 25 05:29:33 2015] flags: 0x20047580004000(head)
  [Mon May 25 05:29:33 2015] page dumped because: VM_BUG_ON_PAGE(compound_order(page) && !PageTransHuge(page))
  [Mon May 25 05:29:33 2015] ------------[ cut here ]------------
  [Mon May 25 05:29:33 2015] kernel BUG at mm/migrate.c:1661!
  [Mon May 25 05:29:33 2015] invalid opcode: 0000 [#1] SMP

In this case it was triggered by running tcpdump, but it is not necessarily
reproducible on all systems.

  sudo tcpdump -i bond0.100 'tcp port 4242' -c 100000000000 -w 4242.pcap

Compound pages cannot be migrated and it was not expected that such pages
be marked for NUMA balancing.  This did not take into account that drivers
such as net/packet/af_packet.c may insert compound pages into userspace
with vm_insert_page.  This patch tells the NUMA balancing protection
scanner to skip all VM_MIXEDMAP mappings which avoids the possibility that
compound pages are marked for migration.
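
The skip described above can be sketched as a predicate over a mapping's
flags. This is a simplified userspace model, not the kernel code: struct
vma_model, numa_scan_should_skip(), numa_scan_count_hinted() and the flag
bit value are made up for illustration; only the name VM_MIXEDMAP comes
from the commit message.

```c
#include <stdbool.h>
#include <stddef.h>

/* Illustrative flag bit only; this is NOT the kernel's VM_MIXEDMAP value. */
#define MODEL_VM_MIXEDMAP 0x1UL

/* Hypothetical stand-in for a memory mapping. */
struct vma_model {
	unsigned long vm_flags;
};

/* Return true if the scanner should leave this mapping alone. */
static bool numa_scan_should_skip(const struct vma_model *vma)
{
	return (vma->vm_flags & MODEL_VM_MIXEDMAP) != 0;
}

/* Count the mappings that would still receive NUMA hinting faults. */
static size_t numa_scan_count_hinted(const struct vma_model *vmas, size_t n)
{
	size_t hinted = 0;

	for (size_t i = 0; i < n; i++) {
		if (numa_scan_should_skip(&vmas[i]))
			continue;	/* may contain compound pages */
		hinted++;
	}
	return hinted;
}
```

Skipping the whole mapping is coarser than checking each page, but it
means a compound page inserted by a driver never reaches the migration
path in the first place.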

Signed-off-by: Mel Gorman 
Reported-by: Jovi Zhangwei 
Cc: Ingo Molnar 
Cc: Peter Zijlstra 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds 
Signed-off-by: Greg Kroah-Hartman

sched: always use blk_schedule_flush_plug in io_schedule_out

2015-06-06T15:21:04+00:00

commit 10d784eae2b41e25d8fc6a88096cd27286093c84 upstream.

A block plug callback can sleep, so a 'from_schedule' parameter was
introduced that lets drivers distinguish a plug flush issued from the
scheduler from a plug finish. Unfortunately io_schedule_out
still uses blk_flush_plug(). This causes the output below (note: I added a
might_sleep() in raid1_unplug to make it trigger faster, but the problem
exists whether or not might_sleep() is added). In raid1/10, this can cause
a deadlock.

This patch makes io_schedule_out always use blk_schedule_flush_plug.
This should only impact drivers (as far as I know, raid 1/10) which are
sensitive to the 'from_schedule' parameter.
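
As a rough illustration of why the parameter matters, the sketch below
models a plug callback that hands its work to a helper when it is called
from the scheduler path and submits directly otherwise. It is a userspace
model under stated assumptions: struct unplug_model, model_unplug(),
model_io_schedule() and model_finish_plug() are hypothetical names, and
the deferral policy is only an approximation of what a driver might do.

```c
#include <stdbool.h>

/* Hypothetical bookkeeping for one plug. */
struct unplug_model {
	int pending;	/* requests queued on the plug */
	int submitted;	/* requests issued directly by the callback */
	int deferred;	/* requests handed off to a helper thread */
};

/*
 * Model of a driver callback. When from_schedule is true the caller is
 * about to sleep, so the callback must not block and defers the work.
 */
static void model_unplug(struct unplug_model *u, bool from_schedule)
{
	if (from_schedule)
		u->deferred += u->pending;
	else
		u->submitted += u->pending;
	u->pending = 0;
}

/* Flush on the way into an I/O sleep: always the scheduler variant. */
static void model_io_schedule(struct unplug_model *u)
{
	model_unplug(u, true);
}

/* Explicit plug finish: the callback is allowed to block. */
static void model_finish_plug(struct unplug_model *u)
{
	model_unplug(u, false);
}
```

In this model the bug corresponds to model_io_schedule() passing false,
which would let the callback take the blocking path while the task is
already marked as not running.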

[  370.817949] ------------[ cut here ]------------
[  370.817960] WARNING: CPU: 7 PID: 145 at ../kernel/sched/core.c:7306 __might_sleep+0x7f/0x90()
[  370.817969] do not call blocking ops when !TASK_RUNNING; state=2 set at [] prepare_to_wait+0x2f/0x90
[  370.817971] Modules linked in: raid1
[  370.817976] CPU: 7 PID: 145 Comm: kworker/u16:9 Tainted: G        W       4.0.0+ #361
[  370.817977] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS 1.7.5-20140709_153802- 04/01/2014
[  370.817983] Workqueue: writeback bdi_writeback_workfn (flush-9:1)
[  370.817985]  ffffffff81cd83be ffff8800ba8cb298 ffffffff819dd7af 0000000000000001
[  370.817988]  ffff8800ba8cb2e8 ffff8800ba8cb2d8 ffffffff81051afc ffff8800ba8cb2c8
[  370.817990]  ffffffffa00061a8 000000000000041e 0000000000000000 ffff8800ba8cba28
[  370.817993] Call Trace:
[  370.817999]  [] dump_stack+0x4f/0x7b
[  370.818002]  [] warn_slowpath_common+0x8c/0xd0
[  370.818004]  [] warn_slowpath_fmt+0x46/0x50
[  370.818006]  [] ? prepare_to_wait+0x2f/0x90
[  370.818008]  [] ? prepare_to_wait+0x2f/0x90
[  370.818010]  [] __might_sleep+0x7f/0x90
[  370.818014]  [] raid1_unplug+0xd3/0x170 [raid1]
[  370.818024]  [] blk_flush_plug_list+0x8a/0x1e0
[  370.818028]  [] ? bit_wait+0x50/0x50
[  370.818031]  [] io_schedule_timeout+0x130/0x140
[  370.818033]  [] bit_wait_io+0x36/0x50
[  370.818034]  [] __wait_on_bit+0x65/0x90
[  370.818041]  [] ? ext4_read_block_bitmap_nowait+0xbc/0x630
[  370.818043]  [] ? bit_wait+0x50/0x50
[  370.818045]  [] out_of_line_wait_on_bit+0x72/0x80
[  370.818047]  [] ? autoremove_wake_function+0x40/0x40
[  370.818050]  [] __wait_on_buffer+0x44/0x50
[  370.818053]  [] ext4_wait_block_bitmap+0xe0/0xf0
[  370.818058]  [] ext4_mb_init_cache+0x206/0x790
[  370.818062]  [] ? lru_cache_add+0x1c/0x50
[  370.818064]  [] ext4_mb_init_group+0x11e/0x200
[  370.818066]  [] ext4_mb_load_buddy+0x341/0x360
[  370.818068]  [] ext4_mb_find_by_goal+0x93/0x2f0
[  370.818070]  [] ? ext4_mb_normalize_request+0x1e4/0x5b0
[  370.818072]  [] ext4_mb_regular_allocator+0x67/0x460
[  370.818074]  [] ? ext4_mb_normalize_request+0x1e4/0x5b0
[  370.818076]  [] ext4_mb_new_blocks+0x4cb/0x620
[  370.818079]  [] ext4_ext_map_blocks+0x4c6/0x14d0
[  370.818081]  [] ? ext4_es_lookup_extent+0x4e/0x290
[  370.818085]  [] ext4_map_blocks+0x14d/0x4f0
[  370.818088]  [] ext4_writepages+0x76d/0xe50
[  370.818094]  [] do_writepages+0x21/0x50
[  370.818097]  [] __writeback_single_inode+0x60/0x490
[  370.818099]  [] writeback_sb_inodes+0x2da/0x590
[  370.818103]  [] ? trylock_super+0x1b/0x50
[  370.818105]  [] ? trylock_super+0x1b/0x50
[  370.818107]  [] __writeback_inodes_wb+0x9f/0xd0
[  370.818109]  [] wb_writeback+0x34b/0x3c0
[  370.818111]  [] bdi_writeback_workfn+0x23f/0x550
[  370.818116]  [] process_one_work+0x1c8/0x570
[  370.818117]  [] ? process_one_work+0x14b/0x570
[  370.818119]  [] worker_thread+0x11b/0x470
[  370.818121]  [] ? process_one_work+0x570/0x570
[  370.818124]  [] kthread+0xf8/0x110
[  370.818126]  [] ? kthread_create_on_node+0x210/0x210
[  370.818129]  [] ret_from_fork+0x42/0x70
[  370.818131]  [] ? kthread_create_on_node+0x210/0x210
[  370.818132] ---[ end trace 7b4deb71e68b6605 ]---

V2: don't change ->in_iowait

Cc: NeilBrown 
Signed-off-by: Shaohua Li 
Reviewed-by: Jeff Moyer 
Signed-off-by: Jens Axboe 
Cc: poma 
Signed-off-by: Greg Kroah-Hartman

sched: Handle priority boosted tasks proper in setscheduler()

2015-06-06T15:21:04+00:00

commit 0782e63bc6fe7e2d3408d250df11d388b7799c6b upstream.

Ronny reported that the following scenario is not handled correctly:

	T1 (prio = 10)
	   lock(rtmutex);

	T2 (prio = 20)
	   lock(rtmutex)
	      boost T1

	T1 (prio = 20)
	   sys_set_scheduler(prio = 30)
	   T1 prio = 30
	   ....
	   sys_set_scheduler(prio = 10)
	   T1 prio = 30

The last step is wrong as T1 should now be back at prio 20.

Commit c365c292d059 ("sched: Consider pi boosting in setscheduler()")
only handles the case where a boosted task tries to lower its
priority.

Fix it by taking the new effective priority into account for the
decision whether a change of the priority is required.
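
The scenario can be replayed with a small model in which the effective
priority is the larger of the requested priority and the boost donated by
waiters. This is a sketch, not the kernel implementation: struct
task_model and the model_* helpers are hypothetical, and it follows the
numbering used in the scenario above, where a larger number means a more
important task.

```c
/* Hypothetical task state, using the scenario's numbering (bigger wins). */
struct task_model {
	int normal_prio;	/* priority requested through setscheduler */
	int boost_prio;		/* highest priority donated by waiters, 0 if none */
	int prio;		/* effective priority actually used */
};

static int max_int(int a, int b)
{
	return a > b ? a : b;
}

/* Recompute the effective priority from the request and the boost. */
static void model_update_prio(struct task_model *t)
{
	t->prio = max_int(t->normal_prio, t->boost_prio);
}

/* A waiter with waiter_prio blocks on a lock owned by t. */
static void model_boost(struct task_model *t, int waiter_prio)
{
	t->boost_prio = max_int(t->boost_prio, waiter_prio);
	model_update_prio(t);
}

/* The task changes its own priority while possibly boosted. */
static void model_setscheduler(struct task_model *t, int new_prio)
{
	t->normal_prio = new_prio;
	model_update_prio(t);	/* always consider the new effective priority */
}
```

With T1 starting at 10 and boosted by T2 at 20, a request for 30 raises
the effective priority to 30, and a later request for 10 drops it back to
the boosted value of 20 rather than leaving it at 30.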

Reported-by: Ronny Meeus 
Tested-by: Steven Rostedt 
Signed-off-by: Thomas Gleixner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Steven Rostedt 
Cc: Borislav Petkov 
Cc: H. Peter Anvin 
Cc: Mike Galbraith 
Fixes: c365c292d059 ("sched: Consider pi boosting in setscheduler()")
Link: http://lkml.kernel.org/r/alpine.DEB.2.11.1505051806060.4225@nanos
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched/deadline: Always enqueue on previous rq when dl_task_timer() fires

2015-05-06T20:04:06+00:00

commit 4cd57f97135840f637431c92380c8da3edbe44ed upstream.

dl_task_timer() may fire on a different rq from where a task was removed
after throttling. Since the call path is:

  dl_task_timer() ->
    enqueue_task_dl() ->
      enqueue_dl_entity() ->
        replenish_dl_entity()

and replenish_dl_entity() uses dl_se's rq, we can't use current's rq
in dl_task_timer(), but we need to lock the task's previous one.
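
The choice of runqueue can be illustrated with a toy model in which the
timer handler looks up the runqueue recorded for the task instead of using
the one belonging to the CPU the timer fired on. All names here (struct
rq_model, struct dl_task_model, model_dl_timer_fire()) are hypothetical
and the locking is reduced to a flag.

```c
/* Hypothetical runqueue: an id, a lock flag and a queue length. */
struct rq_model {
	int id;
	int locked;
	int nr_queued;
};

/* Hypothetical deadline task that remembers the rq it was throttled on. */
struct dl_task_model {
	struct rq_model *rq;
};

/*
 * Timer handler model. current_rq is the runqueue of the CPU the timer
 * fired on; it is deliberately ignored in favour of the task's own rq.
 */
static struct rq_model *model_dl_timer_fire(struct dl_task_model *p,
					    struct rq_model *current_rq)
{
	struct rq_model *rq = p->rq;	/* the task's previous rq */

	(void)current_rq;
	rq->locked = 1;
	rq->nr_queued++;		/* replenish and enqueue happen here */
	rq->locked = 0;
	return rq;
}
```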

Tested-by: Wanpeng Li 
Signed-off-by: Juri Lelli 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Kirill Tkhai 
Cc: Juri Lelli 
Fixes: 3960c8c0c789 ("sched: Make dl_task_time() use task_rq_lock()")
Link: http://lkml.kernel.org/r/1427792017-7356-1-git-send-email-juri.lelli@arm.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

x86: kvm: Revert "remove sched notifier for cross-cpu migrations"

2015-05-06T20:03:36+00:00

commit 0a4e6be9ca17c54817cf814b4b5aa60478c6df27 upstream.

The following point:

    2. per-CPU pvclock time info is updated if the
       underlying CPU changes.

is not true anymore since "KVM: x86: update pvclock area conditionally,
on cpu migration".

Add task migration notification back.

Problem noticed by Andy Lutomirski.

Signed-off-by: Marcelo Tosatti 
Signed-off-by: Greg Kroah-Hartman

mm: numa: disable change protection for vma(VM_HUGETLB)

2015-04-07T23:45:33+00:00

Currently when a process accesses a hugetlb range protected with
PROT_NONE, unexpected COWs are triggered, which finally puts the hugetlb
subsystem into a broken/uncontrollable state, where for example
h->resv_huge_pages is subtracted too much and wraps around to a very
large number, and the free hugepage pool is no longer maintainable.

This patch simply stops changing protection for vma(VM_HUGETLB) to fix
the problem.  And this also allows us to avoid useless overhead of minor
faults.

Signed-off-by: Naoya Horiguchi 
Suggested-by: Mel Gorman 
Cc: Hugh Dickins 
Cc: "Kirill A. Shutemov" 
Cc: David Rientjes 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2015-03-28T18:17:32+00:00

Pull scheduler fix from Ingo Molnar:
 "A single sched/rt corner case fix for RLIMIT_RTTIME correctness"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched: Fix RLIMIT_RTTIME when PI-boosting to RT

mm: numa: slow PTE scan rate if migration failures occur

2015-03-25T23:20:31+00:00

Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.
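
The back-off can be sketched as a scan period that grows when failures
were recorded in the last window. This is a minimal model under stated
assumptions: struct scan_model and model_update_scan_period() are
hypothetical, the doubling factor is chosen for illustration, and the
normal locality-based tuning is left out.

```c
/* Hypothetical per-task scan state, in milliseconds. */
struct scan_model {
	unsigned int period_ms;		/* current delay between scans */
	unsigned int period_max_ms;	/* upper bound on the delay */
};

/*
 * Slow the scanner when migrations failed in the last window. A longer
 * period means fewer hinting faults are trapped.
 */
static void model_update_scan_period(struct scan_model *s,
				     unsigned long migrate_failures)
{
	if (migrate_failures == 0)
		return;	/* locality-based tuning is not modelled here */

	/* Double the period, clamped to the maximum without overflowing. */
	if (s->period_ms > s->period_max_ms / 2)
		s->period_ms = s->period_max_ms;
	else
		s->period_ms *= 2;
}
```

Backing off on failure keeps the scanner from generating more faults at
exactly the time those faults cannot lead to a successful migration.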

Signed-off-by: Mel Gorman 
Reported-by: Dave Chinner 
Tested-by: Dave Chinner 
Cc: Ingo Molnar 
Cc: Aneesh Kumar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

sched: Fix RLIMIT_RTTIME when PI-boosting to RT

2015-03-23T09:47:55+00:00

When non-realtime tasks get priority-inheritance boosted to a realtime
scheduling class, RLIMIT_RTTIME starts to apply to them. However, the
counter used for checking this (the same one used for SCHED_RR
timeslices) was not getting reset. This meant that tasks running with a
non-realtime scheduling class which are repeatedly boosted to a realtime
one, but never block while they are running realtime, eventually hit the
timeout without ever running for a time over the limit. This patch
resets the realtime timeslice counter when un-PI-boosting from an RT to
a non-RT scheduling class.
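
The accumulation can be reproduced with a toy counter that only advances
while the task runs in a realtime class. The names below (struct
rt_task_model, model_tick(), model_pi_boost(), model_pi_deboost()) are
hypothetical, and the reset_on_deboost flag exists only so the old and the
fixed behaviour can be compared in one sketch.

```c
#include <stdbool.h>

/* Hypothetical task state for the RLIMIT_RTTIME accounting. */
struct rt_task_model {
	bool rt_class;			/* currently in a realtime class */
	unsigned long timeout_ticks;	/* ticks counted against the limit */
};

/* One scheduler tick; returns true when the limit has been exceeded. */
static bool model_tick(struct rt_task_model *t, unsigned long limit_ticks)
{
	if (!t->rt_class)
		return false;
	t->timeout_ticks++;
	return t->timeout_ticks > limit_ticks;
}

/* Priority inheritance moves the task into a realtime class. */
static void model_pi_boost(struct rt_task_model *t)
{
	t->rt_class = true;
}

/* Dropping the boost; the fix corresponds to reset_on_deboost == true. */
static void model_pi_deboost(struct rt_task_model *t, bool reset_on_deboost)
{
	t->rt_class = false;
	if (reset_on_deboost)
		t->timeout_ticks = 0;
}
```

With a limit of 5 ticks and boosts that each last 2 ticks, the counter
without the reset reaches 6 during the third boost even though no single
boost ran longer than 2 ticks; with the reset it never exceeds 2.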

I have some test code with two threads and a shared PTHREAD_PRIO_INHERIT
mutex. It induces priority boosting and spins while boosted; it gets
killed by a SIGXCPU on non-fixed kernels but not with this patch
applied. The failure happens much faster with a CONFIG_PREEMPT_RT kernel,
and does happen eventually with PREEMPT_VOLUNTARY kernels.

Signed-off-by: Brian Silverman 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: austin@peloton-tech.com
Link: http://lkml.kernel.org/r/1424305436-6716-1-git-send-email-brian@peloton-tech.com
Signed-off-by: Ingo Molnar

cpuidle / sleep: Use broadcast timer for states that stop local timer

2015-03-05T22:13:19+00:00

Commit 381063133246 (PM / sleep: Re-implement suspend-to-idle handling)
overlooked the fact that entering some sufficiently deep idle states
by CPUs may cause their local timers to stop and in those cases it
is necessary to switch over to a broadcast timer prior to entering
the idle state.  If the cpuidle driver in use does not provide
the new ->enter_freeze callback for any of the idle states, that
problem affects suspend-to-idle too, but it is not taken into account
after the changes made by commit 381063133246.

Fix that by changing the definition of cpuidle_enter_freeze() and
rearranging the code in cpuidle_idle_call(), so that the former no
longer calls cpuidle_enter() and the fallback case is handled
by cpuidle_idle_call() directly.

Fixes: 381063133246 (PM / sleep: Re-implement suspend-to-idle handling)
Reported-and-tested-by: Lorenzo Pieralisi 
Signed-off-by: Rafael J. Wysocki 
Acked-by: Peter Zijlstra (Intel)