linux.git/include/linux/sched, branch v5.4

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2019-09-28T19:39:07+00:00

Pull scheduler fixes from Ingo Molnar:

 - Apply a number of membarrier-related fixes and cleanups, which fix
   a use-after-free race in the membarrier code

 - Introduce proper RCU protection for tasks on the runqueue - to get
   rid of the subtle task_rcu_dereference() interface that was easy to
   get wrong

 - Misc fixes, but also an EAS speedup

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Avoid redundant EAS calculation
  sched/core: Remove double update_max_interval() call on CPU startup
  sched/core: Fix preempt_schedule() interrupt return comment
  sched/fair: Fix -Wunused-but-set-variable warnings
  sched/core: Fix migration to invalid CPU in __set_cpus_allowed_ptr()
  sched/membarrier: Return -ENOMEM to userspace on memory allocation failure
  sched/membarrier: Skip IPIs when mm->mm_users == 1
  selftests, sched/membarrier: Add multi-threaded test
  sched/membarrier: Fix p->mm->membarrier_state racy load
  sched/membarrier: Call sync_core only before usermode for same mm
  sched/membarrier: Remove redundant check
  sched/membarrier: Fix private expedited registration check
  tasks, sched/core: RCUify the assignment of rq->curr
  tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code
  tasks, sched/core: Ensure tasks are available for a grace period after leaving the runqueue
  tasks: Add a count of task RCU users
  sched/core: Convert vcpu_is_preempted() from macro to an inline function
  sched/fair: Remove unused cfs_rq_clock_task() function

sched/membarrier: Fix p->mm->membarrier_state racy load

2019-09-25T15:42:30+00:00

The membarrier_state field is located within the mm_struct, which
is not guaranteed to exist when used from runqueue-lock-free iteration
on runqueues by the membarrier system call.

Copy the membarrier_state from the mm_struct into the scheduler runqueue
when the scheduler switches between mm.
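The copy-on-switch idea can be sketched in user-space C. This is only a model under stated assumptions: struct mm_model, struct rq_model, rq_switch_mm() and rq_wants_ipi() are hypothetical names for illustration, not the kernel's definitions, and C11 atomics stand in for the kernel's own accessors.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

/* Hypothetical stand-ins for mm_struct and the scheduler runqueue. */
struct mm_model {
    atomic_int membarrier_state;    /* lives in the mm, may vanish with it */
};

struct rq_model {
    atomic_int membarrier_state;    /* per-runqueue copy, safe to read remotely */
};

/* Model of the context-switch hook: refresh the runqueue copy on mm change. */
static void rq_switch_mm(struct rq_model *rq,
                         const struct mm_model *prev_mm,
                         struct mm_model *next_mm)
{
    int state;

    assert(next_mm != NULL);
    if (prev_mm == next_mm)
        return;                     /* same mm: the cached copy is still valid */

    state = atomic_load_explicit(&next_mm->membarrier_state,
                                 memory_order_relaxed);
    if (atomic_load_explicit(&rq->membarrier_state,
                             memory_order_relaxed) != state)
        atomic_store_explicit(&rq->membarrier_state, state,
                              memory_order_relaxed);
}

/* A remote scanner consults only the runqueue copy, never the mm itself. */
static int rq_wants_ipi(struct rq_model *rq, int flag)
{
    return (atomic_load_explicit(&rq->membarrier_state,
                                 memory_order_relaxed) & flag) != 0;
}
```

The point of the design is that a lock-free walk over runqueues only touches memory owned by the runqueue, so the lifetime of the mm no longer matters to the reader.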

When registering membarrier for mm, after setting the registration bit
in the mm membarrier state, issue a synchronize_rcu() to ensure the
scheduler observes the change. In order to take care of the case
where a runqueue keeps executing the target mm without switching to
another mm, iterate over each runqueue and issue an IPI to copy the
membarrier_state from the mm_struct into each runqueue that is running
the mm whose state has just been modified.

Move the mm membarrier_state field closer to pgd in mm_struct to use
a cache line already touched by the scheduler switch_mm.

The membarrier_execve() (now membarrier_exec_mmap()) hook now needs to
clear the runqueue's membarrier state in addition to clearing the mm
membarrier state, so move its implementation into the scheduler
membarrier code so it can access the runqueue structure.

Add a memory barrier in membarrier_exec_mmap() prior to clearing
the membarrier state, ensuring memory accesses executed prior to exec
are not reordered with the stores clearing the membarrier state.

As suggested by Linus, move all membarrier.c RCU read-side locks outside
of the for each cpu loops.

Suggested-by: Linus Torvalds 
Signed-off-by: Mathieu Desnoyers 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Chris Metcalf 
Cc: Christoph Lameter 
Cc: Eric W. Biederman 
Cc: Kirill Tkhai 
Cc: Mike Galbraith 
Cc: Oleg Nesterov 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Russell King - ARM Linux admin 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20190919173705.2181-5-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar

sched/membarrier: Call sync_core only before usermode for same mm

2019-09-25T15:42:30+00:00

When the prev and next task's mm change, switch_mm() provides the core
serializing guarantees before returning to usermode. The only case
where an explicit core serialization is needed is when the scheduler
keeps the same mm for prev and next.

Suggested-by: Oleg Nesterov 
Signed-off-by: Mathieu Desnoyers 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Chris Metcalf 
Cc: Christoph Lameter 
Cc: Eric W. Biederman 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Russell King - ARM Linux admin 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20190919173705.2181-4-mathieu.desnoyers@efficios.com
Signed-off-by: Ingo Molnar

tasks, sched/core: With a grace period after finish_task_switch(), remove unnecessary code

2019-09-25T15:42:29+00:00

Remove workarounds that were written before there was a grace period
after tasks left the runqueue in finish_task_switch().

In particular, now that tasks exiting the runqueue experience
an RCU grace period, none of the work performed by task_rcu_dereference()
except the rcu_dereference() is necessary, so replace task_rcu_dereference()
with rcu_dereference().

Remove the code in rcuwait_wait_event() that checks to ensure the current
task has not exited.  It is no longer necessary as it is guaranteed
that any running task will experience an RCU grace period after it
leaves the runqueue.

Remove the comment in rcuwait_wake_up() as it is no longer relevant.

Ref: 8f95c90ceb54 ("sched/wait, RCU: Introduce rcuwait machinery")
Ref: 150593bf8693 ("sched/api: Introduce task_rcu_dereference() and try_get_task_struct()")
Signed-off-by: Eric W. Biederman 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Chris Metcalf 
Cc: Christoph Lameter 
Cc: Davidlohr Bueso 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Oleg Nesterov 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Russell King - ARM Linux admin 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/87lfurdpk9.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar

tasks: Add a count of task RCU users

2019-09-25T15:42:29+00:00

Add a count of the number of RCU users (currently 1) of the task
struct so that we can later add the scheduler case and get rid of the
very subtle task_rcu_dereference(), and just use rcu_dereference().

As suggested by Oleg, have the count overlap rcu_head so that no
additional space in task_struct is required.
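A minimal user-space sketch of the overlap trick, assuming a counter and a callback head that are never live at the same time. All names here (struct cb_head, struct task_model, put_task_rcu_user(), release_task_model()) are hypothetical models, and the starting count used in any demonstration is arbitrary; the kernel's actual types and helpers are not reproduced.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical model of a deferred-free callback head. */
struct cb_head {
    struct cb_head *next;
    void (*func)(struct cb_head *head);
};

struct task_model {
    int freed;                      /* set by the release callback */
    union {
        atomic_int rcu_users;       /* live while users still hold the task */
        struct cb_head rcu;         /* live only after the count hits zero */
    };
};

/* Both members must start at the same offset for the overlap to work. */
static_assert(offsetof(struct task_model, rcu_users) ==
              offsetof(struct task_model, rcu),
              "counter and callback head must share storage");

/* Release callback: recover the enclosing task from the embedded head. */
static void release_task_model(struct cb_head *head)
{
    struct task_model *t = (struct task_model *)
        ((char *)head - offsetof(struct task_model, rcu));

    t->freed = 1;
}

/*
 * Drop one user. Returns true when the last user is gone, at which point
 * the counter is dead and its storage has been reused as the callback
 * head that the caller should queue for a grace period.
 */
static bool put_task_rcu_user(struct task_model *t)
{
    if (atomic_fetch_sub(&t->rcu_users, 1) != 1)
        return false;

    t->rcu.next = NULL;
    t->rcu.func = release_task_model;
    return true;
}
```

Because the counter is only needed until it reaches zero and the callback head is only needed afterwards, the union costs no extra space in the enclosing structure.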

Inspired-by: Linus Torvalds 
Inspired-by: Oleg Nesterov 
Signed-off-by: Eric W. Biederman 
Signed-off-by: Peter Zijlstra (Intel) 
Cc: Chris Metcalf 
Cc: Christoph Lameter 
Cc: Davidlohr Bueso 
Cc: Kirill Tkhai 
Cc: Linus Torvalds 
Cc: Mike Galbraith 
Cc: Paul E. McKenney 
Cc: Peter Zijlstra 
Cc: Russell King - ARM Linux admin 
Cc: Thomas Gleixner 
Link: https://lkml.kernel.org/r/87woebdplt.fsf_-_@x220.int.ebiederm.org
Signed-off-by: Ingo Molnar

Merge branch 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2019-09-17T19:35:15+00:00

Pull core timer updates from Thomas Gleixner:
 "Timers and timekeeping updates:

   - A large overhaul of the posix CPU timer code which is a preparation
     for moving the CPU timer expiry out into task work so it can be
     properly accounted on the task/process.

     An update to the bogus permission checks will come later during the
     merge window as feedback was not complete before heading off for
     travel.

   - Switch the timerqueue code to use cached rbtrees and get rid of the
     homebrewed caching of the leftmost node.

   - Consolidate hrtimer_init() + hrtimer_init_sleeper() calls into a
     single function

   - Implement the separation of hrtimers to be forced to expire in hard
     interrupt context even when PREEMPT_RT is enabled and mark the
     affected timers accordingly.

   - Implement a mechanism for hrtimers and the timer wheel to protect
     RT against priority inversion and live lock issues when a (hr)timer
     which should be canceled is currently executing the callback.
     Instead of infinitely spinning, the task which tries to cancel the
     timer blocks on a per cpu base expiry lock which is held and
     released by the (hr)timer expiry code.

   - Enable the Hyper-V TSC page based sched_clock for Hyper-V guests
     resulting in faster access to timekeeping functions.

   - Updates to various clocksource/clockevent drivers and their device
     tree bindings.

   - The usual small improvements all over the place"

* 'timers-core-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (101 commits)
  posix-cpu-timers: Fix permission check regression
  posix-cpu-timers: Always clear head pointer on dequeue
  hrtimer: Add a missing bracket and hide `migration_base' on !SMP
  posix-cpu-timers: Make expiry_active check actually work correctly
  posix-timers: Unbreak CONFIG_POSIX_TIMERS=n build
  tick: Mark sched_timer to expire in hard interrupt context
  hrtimer: Add kernel doc annotation for HRTIMER_MODE_HARD
  x86/hyperv: Hide pv_ops access for CONFIG_PARAVIRT=n
  posix-cpu-timers: Utilize timerqueue for storage
  posix-cpu-timers: Move state tracking to struct posix_cputimers
  posix-cpu-timers: Deduplicate rlimit handling
  posix-cpu-timers: Remove pointless comparisons
  posix-cpu-timers: Get rid of 64bit divisions
  posix-cpu-timers: Consolidate timer expiry further
  posix-cpu-timers: Get rid of zero checks
  rlimit: Rewrite non-sensical RLIMIT_CPU comment
  posix-cpu-timers: Respect INFINITY for hard RTTIME limit
  posix-cpu-timers: Switch thread group sampling to array
  posix-cpu-timers: Restructure expiry array
  posix-cpu-timers: Remove cputime_expires
  ...

posix-cpu-timers: Move state tracking to struct posix_cputimers

2019-08-28T09:50:42+00:00

Put it where it belongs and clean up the ifdeffery in fork completely.

Signed-off-by: Thomas Gleixner 
Link: https://lkml.kernel.org/r/20190821192922.743229404@linutronix.de

posix-cpu-timers: Switch thread group sampling to array

2019-08-28T09:50:39+00:00

That allows more simplifications in various places.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Frederic Weisbecker 
Link: https://lkml.kernel.org/r/20190821192921.988426956@linutronix.de

posix-cpu-timers: Provide array based access to expiry cache

2019-08-28T09:50:35+00:00

Using struct task_cputime for the expiry cache is a pretty odd choice and
comes with magic defines to rename the fields for usage in the expiry
cache.

struct task_cputime is basically a u64 array with 3 members, but it has
distinct members.

The expiry cache content is different than the content of task_cputime
because

  expiry[PROF]  = task_cputime.stime + task_cputime.utime
  expiry[VIRT]  = task_cputime.utime
  expiry[SCHED] = task_cputime.sum_exec_runtime

So there is no direct mapping between task_cputime and the expiry cache and
the #define based remapping is just a horrible hack.
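The mapping above can be written out as a small C sketch. The enum constants and struct cputime_model are hypothetical names chosen for illustration; only the three equations come from the text.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical clock indices, in the PROF/VIRT/SCHED order used above. */
enum { CLK_PROF, CLK_VIRT, CLK_SCHED, CLK_MAX };

static_assert(CLK_MAX == 3, "the expiry cache has exactly three slots");

/* Model of the accounting structure with its three distinct members. */
struct cputime_model {
    uint64_t stime;
    uint64_t utime;
    uint64_t sum_exec_runtime;
};

/* Derive the values the expiry cache is compared against. */
static void fill_samples(const struct cputime_model *t,
                         uint64_t samples[CLK_MAX])
{
    samples[CLK_PROF]  = t->stime + t->utime;    /* PROF is a sum, not a field */
    samples[CLK_VIRT]  = t->utime;
    samples[CLK_SCHED] = t->sum_exec_runtime;
}
```

The PROF slot is the reason a field-renaming scheme cannot be a faithful mapping: it corresponds to no single member of the accounting structure.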

Having the expiry cache array based allows further simplification of the
expiry code.

To avoid an all-in-one cleanup which is hard to review, add a temporary
anonymous union into struct task_cputime which allows array based access to
it. That requires reordering the members. Add a build time sanity check to
validate that the members are at the same place.

The union and the build time checks will be removed after conversion.
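A sketch of the transitional layout in C11, assuming the named members are ordered to match the array slots. The struct name, the member order, the CHECK_SLOT() macro and expiry_slot() are illustrative assumptions, not the kernel's exact definitions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical clock indices, in PROF/VIRT/SCHED order. */
enum { CLK_PROF, CLK_VIRT, CLK_SCHED, CLK_MAX };

struct cputime_model {
    union {
        struct {                        /* old-style named access */
            uint64_t stime;             /* assumed to alias slot CLK_PROF */
            uint64_t utime;             /* assumed to alias slot CLK_VIRT */
            uint64_t sum_exec_runtime;  /* assumed to alias slot CLK_SCHED */
        };
        uint64_t expiries[CLK_MAX];     /* new array-based access */
    };
};

/* Build-time check that a named member sits exactly on its array slot. */
#define CHECK_SLOT(member, idx)                                          \
    static_assert(offsetof(struct cputime_model, member) ==              \
                  offsetof(struct cputime_model, expiries) +             \
                  (idx) * sizeof(uint64_t),                              \
                  "named member does not alias its array slot")

CHECK_SLOT(stime, CLK_PROF);
CHECK_SLOT(utime, CLK_VIRT);
CHECK_SLOT(sum_exec_runtime, CLK_SCHED);

/* Converted code reads a slot by index instead of by member name. */
static uint64_t expiry_slot(const struct cputime_model *c, int clk)
{
    return c->expiries[clk];
}
```

With the checks in place, callers can be converted from member names to indices one at a time; once none of the named members are used, both the union and the checks can go.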

Signed-off-by: Thomas Gleixner 
Reviewed-by: Frederic Weisbecker 
Link: https://lkml.kernel.org/r/20190821192921.105793824@linutronix.de

posix-cpu-timers: Move expiry cache into struct posix_cputimers

2019-08-28T09:50:35+00:00

The expiry cache belongs in the posix_cputimers container, where the other
CPU timer information is.

Signed-off-by: Thomas Gleixner 
Reviewed-by: Frederic Weisbecker 
Link: https://lkml.kernel.org/r/20190821192921.014444012@linutronix.de