linux.git/kernel/sched/psi.c, branch v5.10

sched,psi: Convert to sched_set_fifo_low()

2020-06-15T12:10:25+00:00

Because SCHED_FIFO is a broken scheduler model (see previous patches)
take away the priority field, the kernel can't possibly make an
informed decision.

Effectively no change.

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Ingo Molnar 
Acked-by: Johannes Weiner

psi: eliminate kthread_worker from psi trigger scheduling mechanism

2020-06-15T12:10:03+00:00

Each psi group requires a dedicated kthread_delayed_work and
kthread_worker. Since no other work can be performed using psi_group's
kthread_worker, the same result can be obtained using a task_struct and
a timer directly. This makes psi triggering simpler by removing lists
and locks involved with kthread_worker usage and eliminates the need for
poll_scheduled atomic use in the hot path.

Signed-off-by: Suren Baghdasaryan 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200528195442.190116-1-surenb@google.com

psi: Move PF_MEMSTALL out of task->flags

2020-03-20T12:06:19+00:00

The task->flags is a 32-bits flag, in which 31 bits have already been
consumed. So it is hardly to introduce other new per process flag.
Currently there're still enough spaces in the bit-field section of
task_struct, so we can define the memstall state as a single bit in
task_struct instead.
This patch also removes an out-of-date comment pointed by Matthew.

Suggested-by: Johannes Weiner 
Signed-off-by: Yafang Shao 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Johannes Weiner 
Link: https://lkml.kernel.org/r/1584408485-1921-1-git-send-email-laoar.shao@gmail.com

psi: Optimize switching tasks inside shared cgroups

2020-03-20T12:06:19+00:00

When switching tasks running on a CPU, the psi state of a cgroup
containing both of these tasks does not change. Right now, we don't
exploit that, and can perform many unnecessary state changes in nested
hierarchies, especially when most activity comes from one leaf cgroup.

This patch implements an optimization where we only update cgroups
whose state actually changes during a task switch. These are all
cgroups that contain one task but not the other, up to the first
shared ancestor. When both tasks are in the same group, we don't need
to update anything at all.

We can identify the first shared ancestor by walking the groups of the
incoming task until we see TSK_ONCPU set on the local CPU; that's the
first group that also contains the outgoing task.

The new psi_task_switch() is similar to psi_task_change(). To allow
code reuse, move the task flag maintenance code into a new function
and the poll/avg worker wakeups into the shared psi_group_change().

Suggested-by: Peter Zijlstra 
Signed-off-by: Johannes Weiner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200316191333.115523-3-hannes@cmpxchg.org

psi: Fix cpu.pressure for cpu.max and competing cgroups

2020-03-20T12:06:18+00:00

For simplicity, cpu pressure is defined as having more than one
runnable task on a given CPU. This works on the system-level, but it
has limitations in a cgrouped reality: When cpu.max is in use, it
doesn't capture the time in which a task is not executing on the CPU
due to throttling. Likewise, it doesn't capture the time in which a
competing cgroup is occupying the CPU - meaning it only reflects
cgroup-internal competitive pressure, not outside pressure.

Enable tracking of currently executing tasks, and then change the
definition of cpu pressure in a cgroup from

	NR_RUNNING > 1

to

	NR_RUNNING > ON_CPU

which will capture the effects of cpu.max as well as competition from
outside the cgroup.

After this patch, a cgroup running `stress -c 1` with a cpu.max
setting of 5000 10000 shows ~50% continuous CPU pressure.

Signed-off-by: Johannes Weiner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20200316191333.115523-2-hannes@cmpxchg.org

Merge branch 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2020-02-15T20:51:22+00:00

Pull scheduler fixes from Ingo Molnar:
 "Misc fixes all over the place:

   - Fix NUMA over-balancing between lightly loaded nodes. This is
     fallout of the big load-balancer rewrite.

   - Fix the NOHZ remote loadavg update logic, which fixes anomalies
     like reported 150 loadavg on mostly idle CPUs.

   - Fix XFS performance/scalability

   - Fix throttled groups unbound task-execution bug

   - Fix PSI procfs boundary condition

   - Fix the cpu.uclamp.{min,max} cgroup configuration write checks

   - Fix DocBook annotations

   - Fix RCU annotations

   - Fix overly CPU-intensive housekeeper CPU logic loop on large CPU
     counts"

* 'sched-urgent-for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  sched/fair: Fix kernel-doc warning in attach_entity_load_avg()
  sched/core: Annotate curr pointer in rq with __rcu
  sched/psi: Fix OOB write when writing 0 bytes to PSI files
  sched/fair: Allow a per-CPU kthread waking a task to stack on the same CPU, to fix XFS performance regression
  sched/fair: Prevent unlimited runtime on throttled group
  sched/nohz: Optimize get_nohz_timer_target()
  sched/uclamp: Reject negative values in cpu_uclamp_write()
  sched/fair: Allow a small load imbalance between low utilisation SD_NUMA domains
  timers/nohz: Update NOHZ load in remote tick
  sched/core: Don't skip remote tick for idle CPUs

sched/psi: Fix OOB write when writing 0 bytes to PSI files

2020-02-11T12:00:02+00:00

Issuing write() with count parameter set to 0 on any file under
/proc/pressure/ will cause an OOB write because of the access to
buf[buf_size-1] when NUL-termination is performed. Fix this by checking
for buf_size to be non-zero.

Signed-off-by: Suren Baghdasaryan 
Signed-off-by: Peter Zijlstra (Intel) 
Signed-off-by: Ingo Molnar 
Acked-by: Johannes Weiner 
Link: https://lkml.kernel.org/r/20200203212216.7076-1-surenb@google.com

proc: convert everything to "struct proc_ops"

2020-02-04T03:05:26+00:00

The most notable change is DEFINE_SHOW_ATTRIBUTE macro split in
seq_file.h.

Conversion rule is:

	llseek		=> proc_lseek
	unlocked_ioctl	=> proc_ioctl

	xxx		=> proc_xxx

	delete ".owner = THIS_MODULE" line

[akpm@linux-foundation.org: fix drivers/isdn/capi/kcapi_proc.c]
[sfr@canb.auug.org.au: fix kernel/sched/psi.c]
  Link: http://lkml.kernel.org/r/20200122180545.36222f50@canb.auug.org.au
Link: http://lkml.kernel.org/r/20191225172546.GB13378@avx2
Signed-off-by: Alexey Dobriyan 
Signed-off-by: Stephen Rothwell 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

sched/psi: create /proc/pressure and /proc/pressure/{io|memory|cpu} only when psi enabled

2020-01-17T09:19:22+00:00

when CONFIG_PSI_DEFAULT_DISABLED set to N or the command line set psi=0,
I think we should not create /proc/pressure and
/proc/pressure/{io|memory|cpu}.

In the future, user maybe determine whether the psi feature is enabled by
checking the existence of the /proc/pressure dir or
/proc/pressure/{io|memory|cpu} files.

Signed-off-by: Wang Long 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Johannes Weiner 
Link: https://lkml.kernel.org/r/1576672698-32504-1-git-send-email-w@laoqinren.net

psi: Fix a division error in psi poll()

2019-12-17T12:32:48+00:00

The psi window size is a u64 an can be up to 10 seconds right now,
which exceeds the lower 32 bits of the variable. We currently use
div_u64 for it, which is meant only for 32-bit divisors. The result is
garbage pressure sampling values and even potential div0 crashes.

Use div64_u64.

Signed-off-by: Johannes Weiner 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Suren Baghdasaryan 
Cc: Jingfeng Xie 
Link: https://lkml.kernel.org/r/20191203183524.41378-3-hannes@cmpxchg.org