linux-stable.git/kernel/sched/fair.c, branch v4.5

sched/numa: Fix use-after-free bug in the task_numa_compare

2016-01-22T12:51:04+00:00

The following message can be observed on the Ubuntu v3.13.0-65 with KASan
backported:

  ==================================================================
  BUG: KASan: use after free in task_numa_find_cpu+0x64c/0x890 at addr ffff880dd393ecd8
  Read of size 8 by task qemu-system-x86/3998900
  =============================================================================
  BUG kmalloc-128 (Tainted: G    B        ): kasan: bad access detected
  -----------------------------------------------------------------------------

  INFO: Allocated in task_numa_fault+0xc1b/0xed0 age=41980 cpu=18 pid=3998890
	__slab_alloc+0x4f8/0x560
	__kmalloc+0x1eb/0x280
	task_numa_fault+0xc1b/0xed0
	do_numa_page+0x192/0x200
	handle_mm_fault+0x808/0x1160
	__do_page_fault+0x218/0x750
	do_page_fault+0x1a/0x70
	page_fault+0x28/0x30
	SyS_poll+0x66/0x1a0
	system_call_fastpath+0x1a/0x1f
  INFO: Freed in task_numa_free+0x1d2/0x200 age=62 cpu=18 pid=0
	__slab_free+0x2ab/0x3f0
	kfree+0x161/0x170
	task_numa_free+0x1d2/0x200
	finish_task_switch+0x1d2/0x210
	__schedule+0x5d4/0xc60
	schedule_preempt_disabled+0x40/0xc0
	cpu_startup_entry+0x2da/0x340
	start_secondary+0x28f/0x360
  Call Trace:
   [] dump_stack+0x45/0x56
   [] print_trailer+0xfd/0x170
   [] object_err+0x36/0x40
   [] kasan_report_error+0x1e9/0x3a0
   [] kasan_report+0x40/0x50
   [] ? task_numa_find_cpu+0x64c/0x890
   [] __asan_load8+0x69/0xa0
   [] ? find_next_bit+0xd8/0x120
   [] task_numa_find_cpu+0x64c/0x890
   [] task_numa_migrate+0x4ac/0x7b0
   [] numa_migrate_preferred+0xb3/0xc0
   [] task_numa_fault+0xb88/0xed0
   [] do_numa_page+0x192/0x200
   [] handle_mm_fault+0x808/0x1160
   [] ? sched_clock_cpu+0x10d/0x160
   [] ? native_load_tls+0x82/0xa0
   [] __do_page_fault+0x218/0x750
   [] ? hrtimer_try_to_cancel+0x76/0x160
   [] ? schedule_hrtimeout_range_clock.part.24+0xf7/0x1c0
   [] do_page_fault+0x1a/0x70
   [] page_fault+0x28/0x30
   [] ? do_sys_poll+0x1c4/0x6d0
   [] ? enqueue_task_fair+0x4b6/0xaa0
   [] ? sched_clock+0x9/0x10
   [] ? resched_task+0x7a/0xc0
   [] ? check_preempt_curr+0xb3/0x130
   [] ? poll_select_copy_remaining+0x170/0x170
   [] ? wake_up_state+0x10/0x20
   [] ? drop_futex_key_refs.isra.14+0x1f/0x90
   [] ? futex_requeue+0x3de/0xba0
   [] ? do_futex+0xbe/0x8f0
   [] ? read_tsc+0x9/0x20
   [] ? ktime_get_ts+0x12d/0x170
   [] ? timespec_add_safe+0x59/0xe0
   [] SyS_poll+0x66/0x1a0
   [] system_call_fastpath+0x1a/0x1f

As commit 1effd9f19324 ("sched/numa: Fix unsafe get_task_struct() in
task_numa_assign()") points out, the rcu_read_lock() cannot protect the
task_struct from being freed in the finish_task_switch(). And the bug
happens in the process of calculation of imp which requires the access of
p->numa_faults being freed in the following path:

do_exit()
        current->flags |= PF_EXITING;
    release_task()
        ~~delayed_put_task_struct()~~
    schedule()
    ...
    ...
rq->curr = next;
    context_switch()
        finish_task_switch()
            put_task_struct()
                __put_task_struct()
		    task_numa_free()

The fix here to get_task_struct() early before end of dst_rq->lock to
protect the calculation process and also put_task_struct() in the
corresponding point if finally the dst_rq->curr somehow cannot be
assigned.

Additional credit to Liang Chen who helped fix the error logic and add the
put_task_struct() to the place it missed.

Signed-off-by: Gavin Guo 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrea Arcangeli 
Cc: Andrew Morton 
Cc: Hugh Dickins 
Cc: Linus Torvalds 
Cc: Mel Gorman 
Cc: Peter Zijlstra 
Cc: Rik van Riel 
Cc: Thomas Gleixner 
Cc: jay.vosburgh@canonical.com
Cc: liang.chen@canonical.com
Link: http://lkml.kernel.org/r/1453264618-17645-1-git-send-email-gavin.guo@canonical.com
Signed-off-by: Ingo Molnar

sched/fair: Fix new task's load avg removed from source CPU in wake_up_new_task()

2016-01-06T10:06:29+00:00

If a newly created task is selected to go to a different CPU in fork
balance when it wakes up the first time, its load averages should
not be removed from the source CPU since they are never added to
it before. The same is also applicable to a never used group entity.

Fix it in remove_entity_load_avg(): when entity's last_update_time
is 0, simply return. This should precisely identify the case in
question, because in other migrations, the last_update_time is set
to 0 after remove_entity_load_avg().

Reported-by: Steve Muckle 
Signed-off-by: Yuyang Du 
[peterz: cfs_rq_last_update_time]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Dietmar Eggemann 
Cc: Juri Lelli 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Morten Rasmussen 
Cc: Patrick Bellasi 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vincent Guittot 
Link: http://lkml.kernel.org/r/20151216233427.GJ28098@intel.com
Signed-off-by: Ingo Molnar

Merge branch 'sched/urgent' into sched/core, to pick up fixes before merging new patches

2016-01-06T10:02:29+00:00

Signed-off-by: Ingo Molnar

sched/fair: Fix multiplication overflow on 32-bit systems

2016-01-06T10:01:05+00:00

Make 'r' 64-bit type to avoid overflow in 'r * LOAD_AVG_MAX'
on 32-bit systems:

	UBSAN: Undefined behaviour in kernel/sched/fair.c:2785:18
	signed integer overflow:
	87950 * 47742 cannot be represented in type 'int'

The most likely effect of this bug are bad load average numbers
resulting in weird scheduling. It's also likely that this can
persist for a longer time - until the system goes idle for
a long time so that all load avg numbers get reset.

[ This is the CFS load average metric, not the procfs output, which
  is separate. ]

Signed-off-by: Andrey Ryabinin 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Fixes: 9d89c257dfb9 ("sched/fair: Rewrite runnable load and utilization average tracking")
Link: http://lkml.kernel.org/r/1450097243-30137-1-git-send-email-aryabinin@virtuozzo.com
[ Improved the changelog. ]
Signed-off-by: Ingo Molnar

sched/fair: Disable the task group load_avg update for the root_task_group

2015-12-04T09:34:49+00:00

Currently, the update_tg_load_avg() function attempts to update the
tg's load_avg value whenever the load changes even for root_task_group
where the load_avg value will never be used. This patch will disable
the load_avg update when the given task group is the root_task_group.

Running a Java benchmark with noautogroup and a 4.3 kernel on a
16-socket IvyBridge-EX system, the amount of CPU time (as reported by
perf) consumed by task_tick_fair() which includes update_tg_load_avg()
decreased from 0.71% to 0.22%, a more than 3X reduction. The Max-jOPs
results also increased slightly from 983015 to 986449.

Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ben Segall 
Cc: Douglas Hatch 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Morten Rasmussen 
Cc: Paul Turner 
Cc: Peter Zijlstra 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Cc: Yuyang Du 
Link: http://lkml.kernel.org/r/1449081710-20185-4-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar

sched/fair: Avoid redundant idle_cpu() call in update_sg_lb_stats()

2015-12-04T09:34:47+00:00

Part of the responsibility of the update_sg_lb_stats() function is to
update the idle_cpus statistical counter in struct sg_lb_stats. This
check is done by calling idle_cpu(). The idle_cpu() function, in
turn, checks a number of fields within the run queue structure such
as rq->curr and rq->nr_running.

With the current layout of the run queue structure, rq->curr and
rq->nr_running are in separate cachelines. The rq->curr variable is
checked first followed by nr_running. As nr_running is also accessed
by update_sg_lb_stats() earlier, it makes no sense to load another
cacheline when nr_running is not 0 as idle_cpu() will always return
false in this case.

This patch eliminates this redundant cacheline load by checking the
cached nr_running before calling idle_cpu().

Signed-off-by: Waiman Long 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Douglas Hatch 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Scott J Norton 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1448478580-26467-2-git-send-email-Waiman.Long@hpe.com
Signed-off-by: Ingo Molnar

sched/fair: Make it possible to account fair load avg consistently

2015-12-04T09:34:42+00:00

The current code accounts for the time a task was absent from the fair
class (per ATTACH_AGE_LOAD). However it does not work correctly when a
task got migrated or moved to another cgroup while outside of the fair
class.

This patch tries to address that by aging on migration. We locklessly
read the 'last_update_time' stamp from both the old and new cfs_rq,
ages the load upto the old time, and sets it to the new time.

These timestamps should in general not be more than 1 tick apart from
one another, so there is a definite bound on things.

Signed-off-by: Byungchul Park 
[ Changelog, a few edits and !SMP build fix ]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1445616981-29904-2-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar

sched/fair: Modify the comment about lock assumptions in migrate_task_rq_fair()

2015-11-23T08:48:20+00:00

The comment describing migrate_task_rq_fair() says that the caller
should hold p->pi_lock. But in some cases the caller can hold
task_rq(p)->lock instead of p->pi_lock. So the comment is broken and
this patch fixes it.

Signed-off-by: Byungchul Park 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Link: http://lkml.kernel.org/r/1447806899-20303-1-git-send-email-byungchul.park@lge.com
Signed-off-by: Ingo Molnar

sched/core: Fix incorrect wait time and wait count statistics

2015-11-23T08:48:17+00:00

At present scheduler resets task's wait start timestamp when the task
migrates to another rq.  This misleads scheduler itself into reporting
less wait time than actual by omitting time spent for waiting prior to
migration and also more wait count than actual by counting migration as
wait end event which can be seen by trace or /proc//sched with
CONFIG_SCHEDSTATS=y.

Carry forward migrating task's wait time prior to migration and
don't count migration as a wait end event to fix such statistics error.

In order to determine whether task is migrating mark task->on_rq with
TASK_ON_RQ_MIGRATING while dequeuing and enqueuing due to migration.

Signed-off-by: Joonwoo Park 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: ohaugan@codeaurora.org
Link: http://lkml.kernel.org/r/20151113033854.GA4247@codeaurora.org
Signed-off-by: Ingo Molnar

treewide: Remove old email address

2015-11-23T08:44:58+00:00

There were still a number of references to my old Red Hat email
address in the kernel source. Remove these while keeping the
Red Hat copyright notices intact.

Signed-off-by: Peter Zijlstra (Intel) 
Cc: Arnaldo Carvalho de Melo 
Cc: Jiri Olsa 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Stephane Eranian 
Cc: Thomas Gleixner 
Cc: Vince Weaver 
Signed-off-by: Ingo Molnar