linux-stable.git/kernel/sched/fair.c, branch v4.0

mm: numa: disable change protection for vma(VM_HUGETLB)

2015-04-07T23:45:33+00:00

Currently, when a process accesses a hugetlb range protected with
PROT_NONE, unexpected COWs are triggered, which eventually puts the hugetlb
subsystem into a broken/uncontrollable state where, for example,
h->resv_huge_pages is decremented too far and wraps around to a very
large number, and the free hugepage pool can no longer be maintained.

This patch simply stops changing protection for vma(VM_HUGETLB) to fix
the problem.  This also avoids the useless overhead of minor faults.

Signed-off-by: Naoya Horiguchi 
Suggested-by: Mel Gorman 
Cc: Hugh Dickins 
Cc: "Kirill A. Shutemov" 
Cc: David Rientjes 
Cc: Rik van Riel 
Cc: Peter Zijlstra 
Cc: Ingo Molnar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm: numa: slow PTE scan rate if migration failures occur

2015-03-25T23:20:31+00:00

Dave Chinner reported the following on https://lkml.org/lkml/2015/3/1/226

  Across the board the 4.0-rc1 numbers are much slower, and the degradation
  is far worse when using the large memory footprint configs. Perf points
  straight at the cause - this is from 4.0-rc1 on the "-o bhash=101073" config:

   -   56.07%    56.07%  [kernel]            [k] default_send_IPI_mask_sequence_phys
      - default_send_IPI_mask_sequence_phys
         - 99.99% physflat_send_IPI_mask
            - 99.37% native_send_call_func_ipi
                 smp_call_function_many
               - native_flush_tlb_others
                  - 99.85% flush_tlb_page
                       ptep_clear_flush
                       try_to_unmap_one
                       rmap_walk
                       try_to_unmap
                       migrate_pages
                       migrate_misplaced_page
                     - handle_mm_fault
                        - 99.73% __do_page_fault
                             trace_do_page_fault
                             do_async_page_fault
                           + async_page_fault
              0.63% native_send_call_func_single_ipi
                 generic_exec_single
                 smp_call_function_single

This is showing excessive migration activity even though excessive
migrations are meant to get throttled.  Normally, the scan rate is tuned
on a per-task basis depending on the locality of faults.  However, if
migrations fail for any reason then the PTE scanner may scan faster if
the faults continue to be remote.  This means there is higher system CPU
overhead and fault trapping at exactly the time we know that migrations
cannot happen.  This patch tracks when migration failures occur and
slows the PTE scanner.
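
The idea can be sketched as follows. This is a minimal userspace model,
assuming hypothetical field names and a doubling back-off; the real
tuning in the scheduler may differ:

```c
/* Hypothetical model of per-task scan tuning; names are illustrative. */
struct scan_model {
	unsigned int scan_period_ms;      /* current delay between PTE scans */
	unsigned int scan_period_max_ms;  /* upper bound on that delay */
	unsigned int failed_migrations;   /* failures seen since last update */
};

/* Returns 1 if the scanner was slowed because migrations failed. */
int slow_scan_on_failure(struct scan_model *t)
{
	unsigned int doubled;

	if (!t->failed_migrations)
		return 0;

	/* Assumed back-off: double the period, capped at the maximum. */
	doubled = t->scan_period_ms * 2;
	t->scan_period_ms = doubled < t->scan_period_max_ms ?
			    doubled : t->scan_period_max_ms;
	t->failed_migrations = 0;
	return 1;
}
```

A longer scan period means fewer trapped faults while migrations cannot
succeed anyway.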

Signed-off-by: Mel Gorman 
Reported-by: Dave Chinner 
Tested-by: Dave Chinner 
Cc: Ingo Molnar 
Cc: Aneesh Kumar 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'sched/urgent' into sched/core

2015-01-30T18:28:36+00:00

Merge all pending fixes and refresh the tree, before applying new changes.

Signed-off-by: Ingo Molnar

sched/fair: Avoid using uninitialized variable in preferred_group_nid()

2015-01-28T12:14:12+00:00

At least some gcc versions - validly afaict - warn about potentially
using max_group uninitialized: There's no way the compiler can prove
that the body of the conditional where it and max_faults get set/
updated gets executed; in fact, without knowing all the details of
other scheduler code, I can't prove this either.

Generally the necessary change would appear to be to clear max_group
prior to entering the inner loop, and break out of the outer loop when
it ends up being all clear after the inner one. This, however, seems
inefficient, and afaict the same effect can be achieved by exiting the
outer loop when max_faults is still zero after the inner loop.

[ mingo: changed the solution to zero initialization: uninitialized_var()
  needs to die, as it's an actively dangerous construct: if in the future
  a known-proven-good piece of code is changed to have a true, buggy
  uninitialized variable, the compiler warning is then suppressed...

  The better long term solution is to clean up the code flow, so that
  even simple minded compilers (and humans!) are able to read it without
  getting a headache.  ]
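
A minimal sketch of the zero-initialization pattern, assuming a
simplified helper in which the group is a plain index rather than a node
mask (the function name and types are hypothetical):

```c
#include <stddef.h>

/*
 * Hypothetical helper: returns the index of the entry with the most
 * faults, or -1 if every entry is zero.
 */
int pick_max_group(const unsigned long *faults, size_t n)
{
	unsigned long max_faults = 0;
	int max_group = 0;	/* zero-initialized, so never read uninitialized */
	size_t i;

	for (i = 0; i < n; i++) {
		if (faults[i] > max_faults) {
			max_faults = faults[i];
			max_group = (int)i;
		}
	}

	/* Nothing was selected: bail out instead of trusting max_group. */
	if (!max_faults)
		return -1;

	return max_group;
}
```

Both variables have defined values on every path, so the compiler has
nothing to warn about and no warning needs to be silenced.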

Signed-off-by: Jan Beulich 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Rik van Riel 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/54C2139202000078000588F7@mail.emea.novell.com
Signed-off-by: Ingo Molnar

sched/core: Rework rq->clock update skips

2015-01-14T12:34:20+00:00

The original purpose of rq::skip_clock_update was to avoid 'costly' clock
updates for back to back wakeup-preempt pairs. The big problem with it
has always been that the rq variable is unaware of the context and
causes indiscriminate clock skips.

Rework the entire thing and create a sense of context by only allowing
schedule() to skip clock updates. (XXX can we measure the cost of the
added store?)

By ensuring only schedule can ever skip an update, we guarantee we're
never more than 1 tick behind on the update.
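
One way to picture the scheme is a two-stage flag: anyone may request a
skip, but only the schedule path can make it active. A minimal userspace
model, assuming hypothetical names and flag values:

```c
/* Hypothetical flag values; chosen so a left shift promotes the request. */
#define SKIP_REQUESTED	0x01u
#define SKIP_ACTIVE	0x02u

struct rq_model {
	unsigned int skip_flags;
	unsigned long long clock;
};

void request_clock_skip(struct rq_model *rq)
{
	rq->skip_flags |= SKIP_REQUESTED;
}

void update_clock(struct rq_model *rq, unsigned long long now)
{
	/* A mere request never suppresses an update. */
	if (rq->skip_flags & SKIP_ACTIVE)
		return;
	rq->clock = now;
}

void schedule_model(struct rq_model *rq, unsigned long long now)
{
	rq->skip_flags <<= 1;	/* promote a pending request to active */
	update_clock(rq, now);
	rq->skip_flags = 0;	/* the skip never outlives this call */
}
```

Outside schedule_model() an update always goes through, so in this model
the clock can only be stale across a single schedule call.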

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: umgwanakikbuti@gmail.com
Link: http://lkml.kernel.org/r/20150105103554.432381549@infradead.org
Signed-off-by: Ingo Molnar

sched/core: Validate rq_clock*() serialization

2015-01-14T12:34:19+00:00

rq->clock{,_task} are serialized by rq->lock, verify this.

One immediate fail is the usage in scale_rt_capability, so 'annotate'
that for now, there's more 'funny' there. Maybe change rq->lock into a
raw_seqlock_t?

(Only 32-bit is affected)

Signed-off-by: Peter Zijlstra (Intel) 
Link: http://lkml.kernel.org/r/20150105103554.361872747@infradead.org
Cc: Linus Torvalds 
Cc: umgwanakikbuti@gmail.com
Signed-off-by: Ingo Molnar

sched/fair: Fix sched_entity::avg::decay_count initialization

2015-01-14T12:34:16+00:00

Child has the same decay_count as parent. If it's not zero,
we add it to parent's cfs_rq->removed_load:

wake_up_new_task()->set_task_cpu()->migrate_task_rq_fair().

Child's load is just garbage after copying from the parent;
it hasn't been on a cfs_rq yet, and it must not be added to
cfs_rq::removed_load in migrate_task_rq_fair().

The patch moves the sched_entity::avg::decay_count initialization
into sched_fork(), so migrate_task_rq_fair() does not change
removed_load.

Signed-off-by: Kirill Tkhai 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ben Segall 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1418644618.6074.13.camel@tkhai
Signed-off-by: Ingo Molnar

sched/fair: Fix the dealing with decay_count in __synchronize_entity_decay()

2015-01-14T12:34:13+00:00

In __synchronize_entity_decay(), if "decays" happens to be zero,
se->avg.decay_count will not be zeroed, holding the positive value
assigned when dequeued last time.

This is problematic in the following case:
If this runnable task is CFS-balanced to other CPUs soon afterwards,
migrate_task_rq_fair() will treat it as a blocked task due to its
non-zero decay_count, thereby adding its load to cfs_rq->removed_load
wrongly.

Thus, we must zero se->avg.decay_count in this case as well.
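
A minimal userspace model of the intended ordering, assuming a
hypothetical struct and a halving stand-in for the real load decay
function:

```c
#include <stdint.h>

/* Hypothetical model of the per-entity state; not the kernel struct. */
struct entity_model {
	uint64_t decay_count;
	uint64_t load_contrib;
};

uint64_t synchronize_decay_model(struct entity_model *se,
				 uint64_t cfs_decay_counter)
{
	uint64_t decays = cfs_decay_counter - se->decay_count;

	/* Clear before the early return, so it is zeroed in every case. */
	se->decay_count = 0;
	if (!decays)
		return 0;

	/* Stand-in decay: halve once per period; guard the shift width. */
	se->load_contrib = decays >= 64 ? 0 : se->load_contrib >> decays;
	return decays;
}
```

With the assignment placed after the early return instead, a call with
zero pending decays would leave the stale count in place.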

Signed-off-by: Xunlei Pang 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ben Segall 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/1418745509-2609-1-git-send-email-pang.xunlei@linaro.org
Signed-off-by: Ingo Molnar

sched/fair: Fix RCU stall upon -ENOMEM in sched_create_group()

2015-01-09T10:19:00+00:00

When alloc_fair_sched_group() in sched_create_group() fails,
free_sched_group() is called, which in turn calls free_fair_sched_group().
Since free_fair_sched_group() calls destroy_cfs_bandwidth() even though
init_cfs_bandwidth() was never called, an RCU stall occurs at
hrtimer_cancel():

  INFO: rcu_sched self-detected stall on CPU { 1}  (t=60000 jiffies g=13074 c=13073 q=0)
  Task dump for CPU 1:
  (fprintd)       R  running task        0  6249      1 0x00000088
  ...
  Call Trace:
   [] sched_show_task+0xa8/0x110
   [] dump_cpu_task+0x3d/0x50
   [] rcu_dump_cpu_stacks+0x90/0xd0
   [] rcu_check_callbacks+0x491/0x700
   [] update_process_times+0x4b/0x80
   [] tick_sched_handle.isra.20+0x36/0x50
   [] tick_sched_timer+0x42/0x70
   [] __run_hrtimer+0x69/0x1a0
   [] ? tick_sched_handle.isra.20+0x50/0x50
   [] hrtimer_interrupt+0xef/0x230
   [] local_apic_timer_interrupt+0x3b/0x70
   [] smp_apic_timer_interrupt+0x45/0x60
   [] apic_timer_interrupt+0x6d/0x80
   [] ? lock_hrtimer_base.isra.23+0x18/0x50
   [] ? __kmalloc+0x211/0x230
   [] hrtimer_try_to_cancel+0x22/0xd0
   [] ? __kmalloc+0x211/0x230
   [] hrtimer_cancel+0x22/0x30
   [] free_fair_sched_group+0x25/0xd0
   [] free_sched_group+0x16/0x40
   [] sched_create_group+0x4b/0x80
   [] sched_autogroup_create_attach+0x43/0x1c0
   [] sys_setsid+0x7c/0x110
   [] system_call_fastpath+0x12/0x17

Check whether init_cfs_bandwidth() was called before calling
destroy_cfs_bandwidth().
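
A minimal userspace sketch of such a guard, assuming the structure is
zero-allocated so that a list pointer stays NULL until init runs (all
names here are hypothetical stand-ins):

```c
#include <stddef.h>

struct list_node {
	struct list_node *next, *prev;
};

/* Hypothetical stand-in for the bandwidth structure. */
struct bandwidth_model {
	struct list_node throttled;	/* next is NULL until init runs */
	int timers_cancelled;
};

void bandwidth_init(struct bandwidth_model *b)
{
	/* An initialized empty list points at itself, never at NULL. */
	b->throttled.next = &b->throttled;
	b->throttled.prev = &b->throttled;
	b->timers_cancelled = 0;
}

/* Returns 1 if teardown ran, 0 if init was never called. */
int bandwidth_destroy(struct bandwidth_model *b)
{
	if (!b->throttled.next)
		return 0;	/* never initialized: no timers to cancel */

	b->timers_cancelled = 1;
	return 1;
}
```

Keeping the check inside the destroy function means callers on the
error path need no extra state to decide whether teardown is safe.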

Signed-off-by: Tetsuo Handa 
[ Move the check into destroy_cfs_bandwidth() to aid compilability. ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Paul Turner 
Cc: Ben Segall 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/201412252210.GCC30204.SOMVFFOtQJFLOH@I-love.SAKURA.ne.jp
Signed-off-by: Ingo Molnar

sched: Fix odd values in effective_load() calculations

2015-01-09T10:18:54+00:00

In effective_load(), we compute (long w * unsigned long tg->shares) / long W.
When w is negative, it is converted to unsigned long, so the product is
insanely large. Fix this by casting tg->shares to long.

Reported-by: Sasha Levin 
Signed-off-by: Yuyang Du 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Dave Jones 
Cc: Andrey Ryabinin 
Cc: Linus Torvalds 
Link: http://lkml.kernel.org/r/20141219002956.GA25405@intel.com
Signed-off-by: Ingo Molnar