linux-stable.git/kernel/sched/fair.c, branch v4.4.185

sched/numa: Fix a possible divide-by-zero

2019-05-16T17:44:43+00:00

commit a860fa7b96e1a1c974556327aa1aee852d434c21 upstream.

sched_clock_cpu() may not be consistent between CPUs. If a task
migrates to another CPU, then se.exec_start is set to that CPU's
rq_clock_task() by update_stats_curr_start(). Specifically, the new
value might be before the old value due to clock skew.

So then if in numa_get_avg_runtime() the expression:

  'now - p->last_task_numa_placement'

ends up as -1, then the divider '*period + 1' in task_numa_placement()
is 0 and things go bang. Similar to update_curr(), check if time goes
backwards to avoid this.

[ peterz: Wrote new changelog. ]
[ mingo: Tweaked the code comment. ]

Signed-off-by: Xie XiuQi 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: cj.chengjian@huawei.com
Cc: 
Link: http://lkml.kernel.org/r/20190425080016.GX11158@hirez.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched/fair: Limit sched_cfs_period_timer() loop to avoid hard lockup

2019-04-27T07:34:02+00:00

[ Upstream commit 2e8e19226398db8265a8e675fcc0118b9e80c9e8 ]

With extremely short cfs_period_us setting on a parent task group with a large
number of children the for loop in sched_cfs_period_timer() can run until the
watchdog fires. There is no guarantee that the call to hrtimer_forward_now()
will ever return 0.  The large number of children can make
do_sched_cfs_period_timer() take longer than the period.

 NMI watchdog: Watchdog detected hard LOCKUP on cpu 24
 RIP: 0010:tg_nop+0x0/0x10
  
  walk_tg_tree_from+0x29/0xb0
  unthrottle_cfs_rq+0xe0/0x1a0
  distribute_cfs_runtime+0xd3/0xf0
  sched_cfs_period_timer+0xcb/0x160
  ? sched_cfs_slack_timer+0xd0/0xd0
  __hrtimer_run_queues+0xfb/0x270
  hrtimer_interrupt+0x122/0x270
  smp_apic_timer_interrupt+0x6a/0x140
  apic_timer_interrupt+0xf/0x20
  

To prevent this we add protection to the loop that detects when the loop has run
too many times and scales the period and quota up, proportionally, so that the timer
can complete before then next period expires.  This preserves the relative runtime
quota while preventing the hard lockup.

A warning is issued reporting this state and the new values.

Signed-off-by: Phil Auld 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: 
Cc: Anton Blanchard 
Cc: Ben Segall 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20190319130005.25492-1-pauld@redhat.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

sched/fair: Do not re-read ->h_load_next during hierarchical load calculation

2019-04-27T07:33:56+00:00

commit 0e9f02450da07fc7b1346c8c32c771555173e397 upstream.

A NULL pointer dereference bug was reported on a distribution kernel but
the same issue should be present on mainline kernel. It occured on s390
but should not be arch-specific.  A partial oops looks like:

  Unable to handle kernel pointer dereference in virtual kernel address space
  ...
  Call Trace:
    ...
    try_to_wake_up+0xfc/0x450
    vhost_poll_wakeup+0x3a/0x50 [vhost]
    __wake_up_common+0xbc/0x178
    __wake_up_common_lock+0x9e/0x160
    __wake_up_sync_key+0x4e/0x60
    sock_def_readable+0x5e/0x98

The bug hits any time between 1 hour to 3 days. The dereference occurs
in update_cfs_rq_h_load when accumulating h_load. The problem is that
cfq_rq->h_load_next is not protected by any locking and can be updated
by parallel calls to task_h_load. Depending on the compiler, code may be
generated that re-reads cfq_rq->h_load_next after the check for NULL and
then oops when reading se->avg.load_avg. The dissassembly showed that it
was possible to reread h_load_next after the check for NULL.

While this does not appear to be an issue for later compilers, it's still
an accident if the correct code is generated. Full locking in this path
would have high overhead so this patch uses READ_ONCE to read h_load_next
only once and check for NULL before dereferencing. It was confirmed that
there were no further oops after 10 days of testing.

As Peter pointed out, it is also necessary to use WRITE_ONCE() to avoid any
potential problems with store tearing.

Signed-off-by: Mel Gorman 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Valentin Schneider 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: 
Fixes: 685207963be9 ("sched: Move h_load calculation to task_h_load()")
Link: https://lkml.kernel.org/r/20190319123610.nsivgf3mjbjjesxb@techsingularity.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched/fair: Fix new task's load avg removed from source CPU in wake_up_new_task()

2019-04-03T04:23:20+00:00

[ Upstream commit 0905f04eb21fc1c2e690bed5d0418a061d56c225 ]

If a newly created task is selected to go to a different CPU in fork
balance when it wakes up the first time, its load averages should
not be removed from the source CPU since they are never added to
it before. The same is also applicable to a never used group entity.

Fix it in remove_entity_load_avg(): when entity's last_update_time
is 0, simply return. This should precisely identify the case in
question, because in other migrations, the last_update_time is set
to 0 after remove_entity_load_avg().

Reported-by: Steve Muckle 
Signed-off-by: Yuyang Du 
[peterz: cfs_rq_last_update_time]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Dietmar Eggemann 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Morten Rasmussen 
Cc: Patrick Bellasi 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vincent Guittot 
Link: http://lkml.kernel.org/r/20151216233427.GJ28098@intel.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

sched/fair: Fix throttle_list starvation with low CFS quota

2018-11-10T15:41:43+00:00

commit baa9be4ffb55876923dc9716abc0a448e510ba30 upstream.

With a very low cpu.cfs_quota_us setting, such as the minimum of 1000,
distribute_cfs_runtime may not empty the throttled_list before it runs
out of runtime to distribute. In that case, due to the change from
c06f04c7048 to put throttled entries at the head of the list, later entries
on the list will starve.  Essentially, the same X processes will get pulled
off the list, given CPU time and then, when expired, get put back on the
head of the list where distribute_cfs_runtime will give runtime to the same
set of processes leaving the rest.

Fix the issue by setting a bit in struct cfs_bandwidth when
distribute_cfs_runtime is running, so that the code in throttle_cfs_rq can
decide to put the throttled entry on the tail or the head of the list.  The
bit is set/cleared by the callers of distribute_cfs_runtime while they hold
cfs_bandwidth->lock.

This is easy to reproduce with a handful of CPU consumers. I use 'crash' on
the live system. In some cases you can simply look at the throttled list and
see the later entries are not changing:

  crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
    1     ffff90b56cb2d200  -976050
    2     ffff90b56cb2cc00  -484925
    3     ffff90b56cb2bc00  -658814
    4     ffff90b56cb2ba00  -275365
    5     ffff90b166a45600  -135138
    6     ffff90b56cb2da00  -282505
    7     ffff90b56cb2e000  -148065
    8     ffff90b56cb2fa00  -872591
    9     ffff90b56cb2c000  -84687
   10     ffff90b56cb2f000  -87237
   11     ffff90b166a40a00  -164582

  crash> list cfs_rq.throttled_list -H 0xffff90b54f6ade40 -s cfs_rq.runtime_remaining | paste - - | awk '{print $1"  "$4}' | pr -t -n3
    1     ffff90b56cb2d200  -994147
    2     ffff90b56cb2cc00  -306051
    3     ffff90b56cb2bc00  -961321
    4     ffff90b56cb2ba00  -24490
    5     ffff90b166a45600  -135138
    6     ffff90b56cb2da00  -282505
    7     ffff90b56cb2e000  -148065
    8     ffff90b56cb2fa00  -872591
    9     ffff90b56cb2c000  -84687
   10     ffff90b56cb2f000  -87237
   11     ffff90b166a40a00  -164582

Sometimes it is easier to see by finding a process getting starved and looking
at the sched_info:

  crash> task ffff8eb765994500 sched_info
  PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
    sched_info = {
      pcount = 8,
      run_delay = 697094208,
      last_arrival = 240260125039,
      last_queued = 240260327513
    },
  crash> task ffff8eb765994500 sched_info
  PID: 7800   TASK: ffff8eb765994500  CPU: 16  COMMAND: "cputest"
    sched_info = {
      pcount = 8,
      run_delay = 697094208,
      last_arrival = 240260125039,
      last_queued = 240260327513
    },

Signed-off-by: Phil Auld 
Reviewed-by: Ben Segall 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: stable@vger.kernel.org
Fixes: c06f04c70489 ("sched: Fix potential near-infinite distribute_cfs_runtime() loop")
Link: http://lkml.kernel.org/r/20181008143639.GA4019@pauld.bos.csb
Signed-off-by: Ingo Molnar 
Signed-off-by: Greg Kroah-Hartman

sched/cgroup: Fix cgroup entity load tracking tear-down

2018-11-10T15:41:35+00:00

[ Upstream commit 6fe1f348b3dd1f700f9630562b7d38afd6949568 ]

When a cgroup's CPU runqueue is destroyed, it should remove its
remaining load accounting from its parent cgroup.

The current site for doing so it unsuited because its far too late and
unordered against other cgroup removal (->css_free() will be, but we're also
in an RCU callback).

Put it in the ->css_offline() callback, which is the start of cgroup
destruction, right after the group has been made unavailable to
userspace. The ->css_offline() callbacks are called in hierarchical order
after the following v4.4 commit:

  aa226ff4a1ce ("cgroup: make sure a parent css isn't offlined before its children")

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Christian Borntraeger 
Cc: Johannes Weiner 
Cc: Li Zefan 
Cc: Linus Torvalds 
Cc: Oleg Nesterov 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Tejun Heo 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/20160121212416.GL6357@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin

sched/numa: Use down_read_trylock() for the mmap_sem

2018-04-13T17:50:09+00:00

[ Upstream commit 8655d5497735b288f8a9b458bd22e7d1bf95bb61 ]

A customer has reported a soft-lockup when running an intensive
memory stress test, where the trace on multiple CPU's looks like this:

 RIP: 0010:[]
  [] native_queued_spin_lock_slowpath+0x10e/0x190
...
 Call Trace:
  [] queued_spin_lock_slowpath+0x7/0xa
  [] change_protection_range+0x3b1/0x930
  [] change_prot_numa+0x18/0x30
  [] task_numa_work+0x1fe/0x310
  [] task_work_run+0x72/0x90

Further investigation showed that the lock contention here is pmd_lock().

The task_numa_work() function makes sure that only one thread is let to perform
the work in a single scan period (via cmpxchg), but if there's a thread with
mmap_sem locked for writing for several periods, multiple threads in
task_numa_work() can build up a convoy waiting for mmap_sem for read and then
all get unblocked at once.

This patch changes the down_read() to the trylock version, which prevents the
build up. For a workload experiencing mmap_sem contention, it's probably better
to postpone the NUMA balancing work anyway. This seems to have fixed the soft
lockups involving pmd_lock(), which is in line with the convoy theory.

Signed-off-by: Vlastimil Babka 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Rik van Riel 
Acked-by: Mel Gorman 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/20170515131316.21909-1-vbabka@suse.cz
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

sched/fair: Initialize throttle_count for new task-groups lazily

2017-05-25T12:30:12+00:00

commit 094f469172e00d6ab0a3130b0e01c83b3cf3a98d upstream.

Cgroup created inside throttled group must inherit current throttle_count.
Broken throttle_count allows to nominate throttled entries as a next buddy,
later this leads to null pointer dereference in pick_next_task_fair().

This patch initialize cfs_rq->throttle_count at first enqueue: laziness
allows to skip locking all rq at group creation. Lazy approach also allows
to skip full sub-tree scan at throttling hierarchy (not in this patch).

Signed-off-by: Konstantin Khlebnikov 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: bsegall@google.com
Link: http://lkml.kernel.org/r/146608182119.21870.8439834428248129633.stgit@buzz
Signed-off-by: Ingo Molnar 
Cc: Ben Pineau 
Signed-off-by: Greg Kroah-Hartman

sched/fair: Do not announce throttled next buddy in dequeue_task_fair()

2017-05-25T12:30:12+00:00

commit 754bd598be9bbc953bc709a9e8ed7f3188bfb9d7 upstream.

Hierarchy could be already throttled at this point. Throttled next
buddy could trigger a NULL pointer dereference in pick_next_task_fair().

Signed-off-by: Konstantin Khlebnikov 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ben Segall 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/146608183552.21905.15924473394414832071.stgit@buzz
Signed-off-by: Ingo Molnar 
Cc: Ben Pineau 
Signed-off-by: Greg Kroah-Hartman

sched/numa: Fix use-after-free bug in the task_numa_compare

2016-09-15T06:27:45+00:00

[ Upstream commit 1dff76b92f69051e579bdc131e01500da9fa2a91 ]

The following message can be observed on the Ubuntu v3.13.0-65 with KASan
backported:

  ==================================================================
  BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
  Read of size 8 by task qemu-system-x86/3998900
  =============================================================================
  BUG kmalloc-128 (Tainted: G    B        ): kasan: bad access detected
  -----------------------------------------------------------------------------

  INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
	__slab_alloc+0x4f8/0x560
	__kmalloc+0x1eb/0x280
	task_numa_fault+0xc1b/0xed0
	do_numa_page+0x192/0x200
	handle_mm_fault+0x808/0x1160
	__do_page_fault+0x218/0x750
	do_page_fault+0x1a/0x70
	page_fault+0x28/0x30
	SyS_poll+0x66/0x1a0
	system_call_fastpath+0x1a/0x1f
  INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
	__slab_free+0x2ab/0x3f0
	kfree+0x161/0x170
	task_numa_free+0x1d2/0x200
	finish_task_switch+0x1d2/0x210
	__schedule+0x5d4/0xc60
	schedule_preempt_disabled+0x40/0xc0
	cpu_startup_entry+0x2da/0x340
	start_secondary+0x28f/0x360
  Call Trace:
   [] dump_stack+0x45/0x56
   [] print_trailer+0xfd/0x170
   [] object_err+0x36/0x40
   [] kasan_report_error+0x1e9/0x3a0
   [] kasan_report+0x40/0x50
   [] ? task_numa_find_cpu+0x64c/0x890
   [] __asan_load8+0x69/0xa0
   [] ? find_next_bit+0xd8/0x120
   [] task_numa_find_cpu+0x64c/0x890
   [] task_numa_migrate+0x4ac/0x7b0
   [] numa_migrate_preferred+0xb3/0xc0
   [] task_numa_fault+0xb88/0xed0
   [] do_numa_page+0x192/0x200
   [] handle_mm_fault+0x808/0x1160
   [] ? sched_clock_cpu+0x10d/0x160
   [] ? native_load_tls+0x82/0xa0
   [] __do_page_fault+0x218/0x750
   [] ? hrtimer_try_to_cancel+0x76/0x160
   [] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
   [] do_page_fault+0x1a/0x70
   [] page_fault+0x28/0x30
   [] ? do_sys_poll+0x1c4/0x6d0
   [] ? enqueue_task_fair+0x4b6/0xaa0
   [] ? sched_clock+0x9/0x10
   [] ? resched_task+0x7a/0xc0
   [] ? check_preempt_curr+0xb3/0x130
   [] ? poll_select_copy_remaining+0x170/0x170
   [] ? wake_up_state+0x10/0x20
   [] ? drop_futex_key_refs.isra.14+0x1f/0x90
   [] ? futex_requeue+0x3de/0xba0
   [] ? do_futex+0xbe/0x8f0
   [] ? read_tsc+0x9/0x20
   [] ? ktime_get_ts+0x12d/0x170
   [] ? timespec_add_safe+0x59/0xe0
   [] SyS_poll+0x66/0x1a0
   [] system_call_fastpath+0x1a/0x1f

As commit 1effd9f19324 ("sched/numa: Fix unsafe get_task_struct() in
task_numa_assign()") points out, the rcu_read_lock() cannot protect the
task_struct from being freed in the finish_task_switch(). And the bug
happens in the process of calculation of imp which requires the access of
p->numa_faults being freed in the following path:

do_exit()
        current->flags |= PF_EXITING;
    release_task()
        ~~delayed_put_task_struct()~~
    schedule()
    ...
    ...
rq->curr = next;
    context_switch()
        finish_task_switch()
            put_task_struct()
                __put_task_struct()
		    task_numa_free()

The fix here to get_task_struct() early before end of dst_rq->lock to
protect the calculation process and also put_task_struct() in the
corresponding point if finally the dst_rq->curr somehow cannot be
assigned.

Additional credit to Liang Chen who helped fix the error logic and add the
put_task_struct() to the place it missed.

Signed-off-by: Gavin Guo 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Hugh Dickins 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: jay.vosburgh@canonical.com
Cc: liang.chen@canonical.com
Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
Signed-off-by: Ingo Molnar 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman