linux-stable.git/include/linux/sched.h, branch v4.8.2

signal: consolidate {TS,TLF}_RESTORE_SIGMASK code

2016-08-02T23:35:23+00:00

In general, there's no need for the "restore sigmask" flag to live in
ti->flags.  alpha, ia64, microblaze, powerpc, sh, sparc (64-bit only),
tile, and x86 use essentially identical alternative implementations,
placing the flag in ti->status.

Replace those optimized implementations with an equally good common
implementation that stores it in a bitfield in struct task_struct and
drop the custom implementations.

Additional architectures can opt in by removing their
TIF_RESTORE_SIGMASK defines.

Link: http://lkml.kernel.org/r/8a14321d64a28e40adfddc90e18a96c086a6d6f9.1468522723.git.luto@kernel.org
Signed-off-by: Andy Lutomirski 
Tested-by: Michael Ellerman 	[powerpc]
Cc: Richard Henderson 
Cc: Ivan Kokshaysky 
Cc: Matt Turner 
Cc: Tony Luck 
Cc: Fenghua Yu 
Cc: Michal Simek 
Cc: Benjamin Herrenschmidt 
Cc: Paul Mackerras 
Cc: Yoshinori Sato 
Cc: Rich Felker 
Cc: "David S. Miller" 
Cc: Chris Metcalf 
Cc: Peter Zijlstra 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dmitry Safonov 
Cc: Oleg Nesterov 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom_reaper: do not attempt to reap a task more than twice

2016-07-28T23:07:41+00:00

oom_reaper relies on the mmap_sem for read to do its job.  Many places
which might block readers have been converted to use down_write_killable
and that has reduced chances of the contention a lot.  Some paths where
the mmap_sem is held for write can take other locks and they might
either be not prepared to fail due to fatal signal pending or too
impractical to be changed.

This patch introduces MMF_OOM_NOT_REAPABLE flag which gets set after the
first attempt to reap a task's mm fails.  If the flag is present after
the failure then we set MMF_OOM_REAPED to hide this mm from the oom
killer completely so it can go and chose another victim.

As a result a risk of OOM deadlock when the oom victim would be blocked
indefinetly and so the oom killer cannot make any progress should be
mitigated considerably while we still try really hard to perform all
reclaim attempts and stay predictable in the behavior.

Link: http://lkml.kernel.org/r/1466426628-15074-10-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Oleg Nesterov 
Cc: Vladimir Davydov 
Cc: David Rientjes 
Cc: Tetsuo Handa 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

mm, oom: skip vforked tasks from being selected

2016-07-28T23:07:41+00:00

vforked tasks are not really sitting on any memory.  They are sharing the
mm with parent until they exec into a new code.  Until then it is just
pinning the address space.  OOM killer will kill the vforked task along
with its parent but we still can end up selecting vforked task when the
parent wouldn't be selected.  E.g.  init doing vfork to launch a task or
vforked being a child of oom unkillable task with an updated oom_score_adj
to be killable.

Add a new helper to check whether a task is in the vfork sharing memory
with its parent and use it in oom_badness to skip over these tasks.

Link: http://lkml.kernel.org/r/1466426628-15074-6-git-send-email-mhocko@kernel.org
Signed-off-by: Michal Hocko 
Acked-by: Oleg Nesterov 
Cc: Vladimir Davydov 
Cc: David Rientjes 
Cc: Tetsuo Handa 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

sched/fair: Fix PELT integrity for new tasks

2016-06-27T10:17:53+00:00

Vincent and Yuyang found another few scenarios in which entity
tracking goes wobbly.

The scenarios are basically due to the fact that new tasks are not
immediately attached and thereby differ from the normal situation -- a
task is always attached to a cfs_rq load average (such that it
includes its blocked contribution) and are explicitly
detached/attached on migration to another cfs_rq.

Scenario 1: switch to fair class

  p->sched_class = fair_class;
  if (queued)
    enqueue_task(p);
      ...
        enqueue_entity()
	  enqueue_entity_load_avg()
	    migrated = !sa->last_update_time (true)
	    if (migrated)
	      attach_entity_load_avg()
  check_class_changed()
    switched_from() (!fair)
    switched_to()   (fair)
      switched_to_fair()
        attach_entity_load_avg()

If @p is a new task that hasn't been fair before, it will have
!last_update_time and, per the above, end up in
attach_entity_load_avg() _twice_.

Scenario 2: change between cgroups

  sched_move_group(p)
    if (queued)
      dequeue_task()
    task_move_group_fair()
      detach_task_cfs_rq()
        detach_entity_load_avg()
      set_task_rq()
      attach_task_cfs_rq()
        attach_entity_load_avg()
    if (queued)
      enqueue_task();
        ...
          enqueue_entity()
	    enqueue_entity_load_avg()
	      migrated = !sa->last_update_time (true)
	      if (migrated)
	        attach_entity_load_avg()

Similar as with scenario 1, if @p is a new task, it will have
!load_update_time and we'll end up in attach_entity_load_avg()
_twice_.

Furthermore, notice how we do a detach_entity_load_avg() on something
that wasn't attached to begin with.

As stated above; the problem is that the new task isn't yet attached
to the load tracking and thereby violates the invariant assumption.

This patch remedies this by ensuring a new task is indeed properly
attached to the load tracking on creation, through
post_init_entity_util_avg().

Of course, this isn't entirely as straightforward as one might think,
since the task is hashed before we call wake_up_new_task() and thus
can be poked at. We avoid this by adding TASK_NEW and teaching
cpu_cgroup_can_attach() to refuse such tasks.

Reported-by: Yuyang Du 
Reported-by: Vincent Guittot 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Ingo Molnar

Merge branch 'sched/urgent' into sched/core, to pick up fixes

2016-06-27T09:35:02+00:00

Signed-off-by: Ingo Molnar

Clarify naming of thread info/stack allocators

2016-06-24T22:09:37+00:00

We've had the thread info allocated together with the thread stack for
most architectures for a long time (since the thread_info was split off
from the task struct), but that is about to change.

But the patches that move the thread info to be off-stack (and a part of
the task struct instead) made it clear how confused the allocator and
freeing functions are.

Because the common case was that we share an allocation with the thread
stack and the thread_info, the two pointers were identical.  That
identity then meant that we would have things like

	ti = alloc_thread_info_node(tsk, node);
	...
	tsk->stack = ti;

which certainly _worked_ (since stack and thread_info have the same
value), but is rather confusing: why are we assigning a thread_info to
the stack? And if we move the thread_info away, the "confusing" code
just gets to be entirely bogus.

So remove all this confusion, and make it clear that we are doing the
stack allocation by renaming and clarifying the function names to be
about the stack.  The fact that the thread_info then shares the
allocation is an implementation detail, and not really about the
allocation itself.

This is a pure renaming and type fix: we pass in the same pointer, it's
just that we clarify what the pointer means.

The ia64 code that actually only has one single allocation (for all of
task_struct, thread_info and kernel thread stack) now looks a bit odd,
but since "tsk->stack" is actually not even used there, that oddity
doesn't matter.  It would be a separate thing to clean that up, I
intentionally left the ia64 changes as a pure brute-force renaming and
type change.

Acked-by: Andy Lutomirski 
Signed-off-by: Linus Torvalds

sched/api: Introduce task_rcu_dereference() and try_get_task_struct()

2016-06-03T07:18:57+00:00

Generally task_struct is only protected by RCU if it was found on a
RCU protected list (say, for_each_process() or find_task_by_vpid()).

As Kirill pointed out rq->curr isn't protected by RCU, the scheduler
drops the (potentially) last reference without RCU gp, this means
that we need to fix the code which uses foreign_rq->curr under
rcu_read_lock().

Add a new helper which can be used to dereference rq->curr or any
other pointer to task_struct assuming that it should be cleared or
updated before the final put_task_struct(). It returns non-NULL
only if this task can't go away before rcu_read_unlock().

( Also add try_get_task_struct() to make it easier to use this API
  correctly. )

Suggested-by: Kirill Tkhai 
Signed-off-by: Oleg Nesterov 
[ Updated comments; added try_get_task_struct()]
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Chris Metcalf 
Cc: Christoph Lameter 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Vladimir Davydov 
Link: http://lkml.kernel.org/r/20160518170218.GY3192@twins.programming.kicks-ass.net
Signed-off-by: Ingo Molnar

mm: oom_reaper: remove some bloat

2016-05-26T22:35:44+00:00

mmput_async is currently used only from the oom_reaper which is defined
only for CONFIG_MMU.  We can save work_struct in mm_struct for
!CONFIG_MMU.

[akpm@linux-foundation.org: fix typo, per Minchan]
Link: http://lkml.kernel.org/r/20160520061658.GB19172@dhcp22.suse.cz
Reported-by: Minchan Kim 
Signed-off-by: Michal Hocko 
Acked-by: Minchan Kim 
Cc: Tetsuo Handa 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2016-05-26T00:11:43+00:00

Pull scheduler fixes from Ingo Molnar:
 "Two fixes: one for a lost wakeup, the other to fix the compiler
  optimizing out preempt operations on ARM64 (and possibly other non-x86
  architectures)"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/core: Fix remote wakeups
  sched/preempt: Fix preempt_count manipulations

sched/core: Fix remote wakeups

2016-05-25T06:35:18+00:00

Commit:

  b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")

... introduced a bug: Mike Galbraith found that it introduced a
performance regression, while Paul E. McKenney reported lost
wakeups and bisected it to this commit.

The reason is that I mis-read ttwu_queue() such that I assumed any
wakeup that got a remote queue must have had the task migrated.

Since this is not so; we need to transfer this information between
queueing the wakeup and actually doing the wakeup. Use a new
task_struct::sched_flag for this, we already write to
sched_contributes_to_load in the wakeup path so this is a hot and
modified cacheline.

Reported-by: Paul E. McKenney 
Reported-by: Mike Galbraith 
Tested-by: Mike Galbraith 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Andrew Hunter 
Cc: Andy Lutomirski 
Cc: Ben Segall 
Cc: Borislav Petkov 
Cc: Brian Gerst 
Cc: Dave Hansen 
Cc: Denys Vlasenko 
Cc: Fenghua Yu 
Cc: H. Peter Anvin 
Cc: Linus Torvalds 
Cc: Matt Fleming 
Cc: Morten Rasmussen 
Cc: Oleg Nesterov 
Cc: Paul Turner 
Cc: Pavan Kondeti 
Cc: Peter Zijlstra 
Cc: Quentin Casasnovas 
Cc: Thomas Gleixner 
Cc: byungchul.park@lge.com
Fixes: b5179ac70de8 ("sched/fair: Prepare to fix fairness problems on migration")
Link: http://lkml.kernel.org/r/20160523091907.GD15728@worktop.ger.corp.intel.com
Signed-off-by: Ingo Molnar