linux-stable.git/kernel/sched/psi.c, branch linux-5.17.y

sched/psi: report zeroes for CPU full at the system level

2022-06-09T08:25:41+00:00

[ Upstream commit 890d550d7dbac7a31ecaa78732aa22be282bb6b8 ]

Martin find it confusing when look at the /proc/pressure/cpu output,
and found no hint about that CPU "full" line in psi Documentation.

% cat /proc/pressure/cpu
some avg10=0.92 avg60=0.91 avg300=0.73 total=933490489
full avg10=0.22 avg60=0.23 avg300=0.16 total=358783277

The PSI_CPU_FULL state is introduced by commit e7fcd7622823
("psi: Add PSI_CPU_FULL state"), which mainly for cgroup level,
but also counted at the system level as a side effect.

Naturally, the FULL state doesn't exist for the CPU resource at
the system level. These "full" numbers can come from CPU idle
schedule latency. For example, t1 is the time when task wakeup
on an idle CPU, t2 is the time when CPU pick and switch to it.
The delta of (t2 - t1) will be in CPU_FULL state.

Another case all processes can be stalled is when all cgroups
have been throttled at the same time, which unlikely to happen.

Anyway, CPU_FULL metric is meaningless and confusing at the
system level. So this patch will report zeroes for CPU full
at the system level, and update psi Documentation accordingly.

Fixes: e7fcd7622823 ("psi: Add PSI_CPU_FULL state")
Reported-by: Martin Steigerwald 
Suggested-by: Johannes Weiner 
Signed-off-by: Chengming Zhou 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Johannes Weiner 
Link: https://lore.kernel.org/r/20220408121914.82855-1-zhouchengming@bytedance.com
Signed-off-by: Sasha Levin

psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n

2022-01-30T07:56:58+00:00

When CONFIG_PROC_FS is disabled psi code generates the following
warnings:

  kernel/sched/psi.c:1364:30: warning: 'psi_cpu_proc_ops' defined but not used [-Wunused-const-variable=]
      1364 | static const struct proc_ops psi_cpu_proc_ops = {
           |                              ^~~~~~~~~~~~~~~~
  kernel/sched/psi.c:1355:30: warning: 'psi_memory_proc_ops' defined but not used [-Wunused-const-variable=]
      1355 | static const struct proc_ops psi_memory_proc_ops = {
           |                              ^~~~~~~~~~~~~~~~~~~
  kernel/sched/psi.c:1346:30: warning: 'psi_io_proc_ops' defined but not used [-Wunused-const-variable=]
      1346 | static const struct proc_ops psi_io_proc_ops = {
           |                              ^~~~~~~~~~~~~~~

Make definitions of these structures and related functions conditional
on CONFIG_PROC_FS config.

Link: https://lkml.kernel.org/r/20220119223940.787748-3-surenb@google.com
Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Signed-off-by: Suren Baghdasaryan 
Reported-by: kernel test robot 
Acked-by: Johannes Weiner 
Signed-off-by: Andrew Morton 
Signed-off-by: Linus Torvalds

psi: Fix uaf issue when psi trigger is destroyed while being polled

2022-01-18T11:09:57+00:00

With write operation on psi files replacing old trigger with a new one,
the lifetime of its waitqueue is totally arbitrary. Overwriting an
existing trigger causes its waitqueue to be freed and pending poll()
will stumble on trigger->event_wait which was destroyed.
Fix this by disallowing to redefine an existing psi trigger. If a write
operation is used on a file descriptor with an already existing psi
trigger, the operation will fail with EBUSY error.
Also bypass a check for psi_disabled in the psi_trigger_destroy as the
flag can be flipped after the trigger is created, leading to a memory
leak.

Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com
Suggested-by: Linus Torvalds 
Analyzed-by: Eric Biggers 
Signed-off-by: Suren Baghdasaryan 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Eric Biggers 
Acked-by: Johannes Weiner 
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com

psi: Fix PSI_MEM_FULL state when tasks are in memstall and doing reclaim

2021-11-17T13:49:00+00:00

We've noticed cases where tasks in a cgroup are stalled on memory but
there is little memory FULL pressure since tasks stay on the runqueue
in reclaim.

A simple example involves a single threaded program that keeps leaking
and touching large amounts of memory. It runs in a cgroup with swap
enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though
there is significant CPU pressure and memory SOME, there is barely any
memory FULL since the task enters reclaim and stays on the runqueue.
However, this memory-bound task is effectively stalled on memory and
we expect memory FULL to match memory SOME in this scenario.

The code is confused about memstall && running, thinking there is a
stalled task and a productive task when there's only one task: a
reclaimer that's counted as both. To fix this, we redefine the
condition for PSI_MEM_FULL to check that all running tasks are in an
active memstall instead of checking that there are no running tasks.

        case PSI_MEM_FULL:
-               return unlikely(tasks[NR_MEMSTALL] && !tasks[NR_RUNNING]);
+               return unlikely(tasks[NR_MEMSTALL] &&
+                       tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);

This will capture reclaimers. It will also capture tasks that called
psi_memstall_enter() and are about to sleep, but this should be
negligible noise.

Signed-off-by: Brian Chen 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Johannes Weiner 
Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com

psi: Add a missing SPDX license header

2021-11-17T13:48:59+00:00

Add the missing SPDX license header to
include/linux/psi.h
include/linux/psi_types.h
kernel/sched/psi.c

Signed-off-by: Liu Xinpeng 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Johannes Weiner 
Link: https://lore.kernel.org/r/1635133586-84611-2-git-send-email-liuxp11@chinatelecom.cn

psi: Remove repeated verbose comment

2021-11-17T13:48:59+00:00

Comment in function psi_task_switch,there are two same lines.
...
* runtime state, the cgroup that contains both tasks
* runtime state, the cgroup that contains both tasks
...

Signed-off-by: Liu Xinpeng 
Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Johannes Weiner 
Link: https://lore.kernel.org/r/1635133586-84611-1-git-send-email-liuxp11@chinatelecom.cn

Merge branch 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup

2021-07-02T00:22:14+00:00

Pull cgroup updates from Tejun Heo:

 - cgroup.kill is added which implements atomic killing of the whole
   subtree.

   Down the line, this should be able to replace the multiple userland
   implementations of "keep killing till empty".

 - PSI can now be turned off at boot time to avoid overhead for
   configurations which don't care about PSI.

* 'for-5.14' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/cgroup:
  cgroup: make per-cgroup pressure stall tracking configurable
  cgroup: Fix kernel-doc
  cgroup: inline cgroup_task_freeze()
  tests/cgroup: test cgroup.kill
  tests/cgroup: move cg_wait_for(), cg_prepare_for_wait()
  tests/cgroup: use cgroup.kill in cg_killall()
  docs/cgroup: add entry for cgroup.kill
  cgroup: introduce cgroup.kill

psi: Fix race between psi_trigger_create/destroy

2021-06-24T07:07:50+00:00

Race detected between psi_trigger_destroy/create as shown below, which
cause panic by accessing invalid psi_system->poll_wait->wait_queue_entry
and psi_system->poll_timer->entry->next. Under this modification, the
race window is removed by initialising poll_wait and poll_timer in
group_init which are executed only once at beginning.

  psi_trigger_destroy()                   psi_trigger_create()

  mutex_lock(trigger_lock);
  rcu_assign_pointer(poll_task, NULL);
  mutex_unlock(trigger_lock);
					  mutex_lock(trigger_lock);
					  if (!rcu_access_pointer(group->poll_task)) {
					    timer_setup(poll_timer, poll_timer_fn, 0);
					    rcu_assign_pointer(poll_task, task);
					  }
					  mutex_unlock(trigger_lock);

  synchronize_rcu();
  del_timer_sync(poll_timer); <-- poll_timer has been reinitialized by
                                  psi_trigger_create()

So, trigger_lock/RCU correctly protects destruction of
group->poll_task but misses this race affecting poll_timer and
poll_wait.

Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Co-developed-by: ziwei.dai 
Signed-off-by: ziwei.dai 
Co-developed-by: ke.wang 
Signed-off-by: ke.wang 
Signed-off-by: Zhaoyang Huang 
Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Suren Baghdasaryan 
Acked-by: Johannes Weiner 
Link: https://lkml.kernel.org/r/1623371374-15664-1-git-send-email-huangzhaoyang@gmail.com

cgroup: make per-cgroup pressure stall tracking configurable

2021-06-08T18:59:02+00:00

PSI accounts stalls for each cgroup separately and aggregates it at each
level of the hierarchy. This causes additional overhead with psi_avgs_work
being called for each cgroup in the hierarchy. psi_avgs_work has been
highly optimized, however on systems with large number of cgroups the
overhead becomes noticeable.
Systems which use PSI only at the system level could avoid this overhead
if PSI can be configured to skip per-cgroup stall accounting.
Add "cgroup_disable=pressure" kernel command-line option to allow
requesting system-wide only pressure stall accounting. When set, it
keeps system-wide accounting under /proc/pressure/ but skips accounting
for individual cgroups and does not expose PSI nodes in cgroup hierarchy.

Signed-off-by: Suren Baghdasaryan 
Acked-by: Peter Zijlstra (Intel) 
Acked-by: Johannes Weiner 
Signed-off-by: Tejun Heo

psi: Fix psi state corruption when schedule() races with cgroup move

2021-05-06T13:33:26+00:00

4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
introduced a race condition that corrupts internal psi state. This
manifests as kernel warnings, sometimes followed by bogusly high IO
pressure:

  psi: task underflow! cpu=1 t=2 tasks=[0 0 0 0] clear=c set=0
  (schedule() decreasing RUNNING and ONCPU, both of which are 0)

  psi: incosistent task state! task=2412744:systemd cpu=17 psi_flags=e clear=3 set=0
  (cgroup_move_task() clearing MEMSTALL and IOWAIT, but task is MEMSTALL | RUNNING | ONCPU)

What the offending commit does is batch the two psi callbacks in
schedule() to reduce the number of cgroup tree updates. When prev is
deactivated and removed from the runqueue, nothing is done in psi at
first; when the task switch completes, TSK_RUNNING and TSK_IOWAIT are
updated along with TSK_ONCPU.

However, the deactivation and the task switch inside schedule() aren't
atomic: pick_next_task() may drop the rq lock for load balancing. When
this happens, cgroup_move_task() can run after the task has been
physically dequeued, but the psi updates are still pending. Since it
looks at the task's scheduler state, it doesn't move everything to the
new cgroup that the task switch that follows is about to clear from
it. cgroup_move_task() will leak the TSK_RUNNING count in the old
cgroup, and psi_sched_switch() will underflow it in the new cgroup.

A similar thing can happen for iowait. TSK_IOWAIT is usually set when
a p->in_iowait task is dequeued, but again this update is deferred to
the switch. cgroup_move_task() can see an unqueued p->in_iowait task
and move a non-existent TSK_IOWAIT. This results in the inconsistent
task state warning, as well as a counter underflow that will result in
permanent IO ghost pressure being reported.

Fix this bug by making cgroup_move_task() use task->psi_flags instead
of looking at the potentially mismatching scheduler state.

[ We used the scheduler state historically in order to not rely on
  task->psi_flags for anything but debugging. But that ship has sailed
  anyway, and this is simpler and more robust.

  We previously already batched TSK_ONCPU clearing with the
  TSK_RUNNING update inside the deactivation call from schedule(). But
  that ordering was safe and didn't result in TSK_ONCPU corruption:
  unlike most places in the scheduler, cgroup_move_task() only checked
  task_current() and handled TSK_ONCPU if the task was still queued. ]

Fixes: 4117cebf1a9f ("psi: Optimize task switch inside shared cgroups")
Signed-off-by: Johannes Weiner 
Signed-off-by: Peter Zijlstra (Intel) 
Link: https://lkml.kernel.org/r/20210503174917.38579-1-hannes@cmpxchg.org