<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/include/linux/psi_types.h, branch v6.2.5</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<entry>
<title>sched/psi: Stop relying on timer_pending() for poll_work rescheduling</title>
<updated>2022-10-30T09:12:14+00:00</updated>
<author>
<name>Suren Baghdasaryan</name>
<email>surenb@google.com</email>
</author>
<published>2022-10-28T19:45:41+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=710ffe671e014d5ccbcff225130a178b088ef090'/>
<id>710ffe671e014d5ccbcff225130a178b088ef090</id>
<content type='text'>
Psi polling mechanism is trying to minimize the number of wakeups to
run psi_poll_work and is currently relying on timer_pending() to detect
when this work is already scheduled. This provides a window of opportunity
for psi_group_change to schedule an immediate psi_poll_work after
poll_timer_fn got called but before psi_poll_work could reschedule itself.
Below is the depiction of this entire window:

poll_timer_fn
  wake_up_interruptible(&amp;group-&gt;poll_wait);

psi_poll_worker
  wait_event_interruptible(group-&gt;poll_wait, ...)
  psi_poll_work
    psi_schedule_poll_work
      if (timer_pending(&amp;group-&gt;poll_timer)) return;
      ...
      mod_timer(&amp;group-&gt;poll_timer, jiffies + delay);

Prior to 461daba06bdc we used to rely on poll_scheduled atomic which was
reset and set back inside psi_poll_work and therefore this race window
was much smaller.
The larger window causes increased number of wakeups and our partners
report visible power regression of ~10mA after applying 461daba06bdc.
Bring back the poll_scheduled atomic and make this race window even
narrower by resetting poll_scheduled only when we reach polling expiration
time. This does not completely eliminate the possibility of extra wakeups
caused by a race with psi_group_change however it will limit it to the
worst case scenario of one extra wakeup per every tracking window (0.5s
in the worst case).
This patch also ensures correct ordering between clearing poll_scheduled
flag and obtaining changed_states using memory barrier. Correct ordering
between updating changed_states and setting poll_scheduled is ensured by
atomic_xchg operation.
By tracing the number of immediate rescheduling attempts performed by
psi_group_change and the number of these attempts being blocked due to
psi monitor being already active, we can assess the effects of this change:

Before the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           684365   1385156    1261240
Immediate reschedules blocked:             682846   1381654    1258682
Immediate reschedules (delta):             1519     3502       2558
Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%

After the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           882244   770298    426218
Immediate reschedules blocked:             881996   769796    426074
Immediate reschedules (delta):             248      502       144
Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%

The number of non-blocked immediate reschedules dropped from 0.22-0.25%
to 0.03-0.07%. The drop is attributed to the decrease in the race window
size and the fact that we allow this race only when psi monitors reach
polling window expiration time.

Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Reported-by: Kathleen Chang &lt;yt.chang@mediatek.com&gt;
Reported-by: Wenju Xu &lt;wenju.xu@mediatek.com&gt;
Reported-by: Jonathan Chen &lt;jonathan.jmchen@mediatek.com&gt;
Signed-off-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Tested-by: SH Chen &lt;show-hong.chen@mediatek.com&gt;
Link: https://lore.kernel.org/r/20221028194541.813985-1-surenb@google.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Psi polling mechanism is trying to minimize the number of wakeups to
run psi_poll_work and is currently relying on timer_pending() to detect
when this work is already scheduled. This provides a window of opportunity
for psi_group_change to schedule an immediate psi_poll_work after
poll_timer_fn got called but before psi_poll_work could reschedule itself.
Below is the depiction of this entire window:

poll_timer_fn
  wake_up_interruptible(&amp;group-&gt;poll_wait);

psi_poll_worker
  wait_event_interruptible(group-&gt;poll_wait, ...)
  psi_poll_work
    psi_schedule_poll_work
      if (timer_pending(&amp;group-&gt;poll_timer)) return;
      ...
      mod_timer(&amp;group-&gt;poll_timer, jiffies + delay);

Prior to 461daba06bdc we used to rely on poll_scheduled atomic which was
reset and set back inside psi_poll_work and therefore this race window
was much smaller.
The larger window causes increased number of wakeups and our partners
report visible power regression of ~10mA after applying 461daba06bdc.
Bring back the poll_scheduled atomic and make this race window even
narrower by resetting poll_scheduled only when we reach polling expiration
time. This does not completely eliminate the possibility of extra wakeups
caused by a race with psi_group_change however it will limit it to the
worst case scenario of one extra wakeup per every tracking window (0.5s
in the worst case).
This patch also ensures correct ordering between clearing poll_scheduled
flag and obtaining changed_states using memory barrier. Correct ordering
between updating changed_states and setting poll_scheduled is ensured by
atomic_xchg operation.
By tracing the number of immediate rescheduling attempts performed by
psi_group_change and the number of these attempts being blocked due to
psi monitor being already active, we can assess the effects of this change:

Before the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           684365   1385156    1261240
Immediate reschedules blocked:             682846   1381654    1258682
Immediate reschedules (delta):             1519     3502       2558
Immediate reschedules (% of attempted):    0.22%    0.25%      0.20%

After the patch:
                                           Run#1    Run#2      Run#3
Immediate reschedules attempted:           882244   770298    426218
Immediate reschedules blocked:             881996   769796    426074
Immediate reschedules (delta):             248      502       144
Immediate reschedules (% of attempted):    0.03%    0.07%     0.03%

The number of non-blocked immediate reschedules dropped from 0.22-0.25%
to 0.03-0.07%. The drop is attributed to the decrease in the race window
size and the fact that we allow this race only when psi monitors reach
polling window expiration time.

Fixes: 461daba06bdc ("psi: eliminate kthread_worker from psi trigger scheduling mechanism")
Reported-by: Kathleen Chang &lt;yt.chang@mediatek.com&gt;
Reported-by: Wenju Xu &lt;wenju.xu@mediatek.com&gt;
Reported-by: Jonathan Chen &lt;jonathan.jmchen@mediatek.com&gt;
Signed-off-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Tested-by: SH Chen &lt;show-hong.chen@mediatek.com&gt;
Link: https://lore.kernel.org/r/20221028194541.813985-1-surenb@google.com
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/psi: Fix avgs_work re-arm in psi_avgs_work()</title>
<updated>2022-10-30T09:12:14+00:00</updated>
<author>
<name>Chengming Zhou</name>
<email>zhouchengming@bytedance.com</email>
</author>
<published>2022-10-14T11:05:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=2fcd7bbae90a6d844da8660a9d27079281dfbba2'/>
<id>2fcd7bbae90a6d844da8660a9d27079281dfbba2</id>
<content type='text'>
Pavan reported a problem that PSI avgs_work idle shutoff is not
working at all. Because PSI_NONIDLE condition would be observed in
psi_avgs_work()-&gt;collect_percpu_times()-&gt;get_recent_times() even if
only the kworker running avgs_work on the CPU.

Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
avoided the ping-pong wake problem when the worker sleep, psi_avgs_work()
still will always re-arm the avgs_work, so shutoff is not working.

This patch changes to use PSI_STATE_RESCHEDULE to flag whether to
re-arm avgs_work in get_recent_times(). For the current CPU, we re-arm
avgs_work only when (NR_RUNNING &gt; 1 || NR_IOWAIT &gt; 0 || NR_MEMSTALL &gt; 0),
for other CPUs we can just check PSI_NONIDLE delta. The new flag
is only used in psi_avgs_work(), so we check in get_recent_times()
that current_work() is avgs_work.

One potential problem is that the brief period of non-idle time
incurred between the aggregation run and the kworker's dequeue will
be stranded in the per-cpu buckets until avgs_work run next time.
The buckets can hold 4s worth of time, and future activity will wake
the avgs_work with a 2s delay, giving us 2s worth of data we can leave
behind when shut off the avgs_work. If the kworker run other works after
avgs_work shut off and doesn't have any scheduler activities for 2s,
this maybe a problem.

Reported-by: Pavan Kondeti &lt;quic_pkondeti@quicinc.com&gt;
Signed-off-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Tested-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Link: https://lore.kernel.org/r/20221014110551.22695-1-zhouchengming@bytedance.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Pavan reported a problem that PSI avgs_work idle shutoff is not
working at all. Because PSI_NONIDLE condition would be observed in
psi_avgs_work()-&gt;collect_percpu_times()-&gt;get_recent_times() even if
only the kworker running avgs_work on the CPU.

Although commit 1b69ac6b40eb ("psi: fix aggregation idle shut-off")
avoided the ping-pong wake problem when the worker sleep, psi_avgs_work()
still will always re-arm the avgs_work, so shutoff is not working.

This patch changes to use PSI_STATE_RESCHEDULE to flag whether to
re-arm avgs_work in get_recent_times(). For the current CPU, we re-arm
avgs_work only when (NR_RUNNING &gt; 1 || NR_IOWAIT &gt; 0 || NR_MEMSTALL &gt; 0),
for other CPUs we can just check PSI_NONIDLE delta. The new flag
is only used in psi_avgs_work(), so we check in get_recent_times()
that current_work() is avgs_work.

One potential problem is that the brief period of non-idle time
incurred between the aggregation run and the kworker's dequeue will
be stranded in the per-cpu buckets until avgs_work run next time.
The buckets can hold 4s worth of time, and future activity will wake
the avgs_work with a 2s delay, giving us 2s worth of data we can leave
behind when shut off the avgs_work. If the kworker run other works after
avgs_work shut off and doesn't have any scheduler activities for 2s,
this maybe a problem.

Reported-by: Pavan Kondeti &lt;quic_pkondeti@quicinc.com&gt;
Signed-off-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Tested-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Link: https://lore.kernel.org/r/20221014110551.22695-1-zhouchengming@bytedance.com
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/psi: Per-cgroup PSI accounting disable/re-enable interface</title>
<updated>2022-09-09T09:08:33+00:00</updated>
<author>
<name>Chengming Zhou</name>
<email>zhouchengming@bytedance.com</email>
</author>
<published>2022-09-07T09:03:32+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=34f26a15611afb03c33df6819359d36f5b382589'/>
<id>34f26a15611afb03c33df6819359d36f5b382589</id>
<content type='text'>
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.

commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.

But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.

So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.

Implementation details:

It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.

But it's hard or complex to stop/restart groupc-&gt;tasks[] updates,
which is not implemented in this patch. So we always update
groupc-&gt;tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.

Suggested-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
PSI accounts stalls for each cgroup separately and aggregates it
at each level of the hierarchy. This may cause non-negligible overhead
for some workloads when under deep level of the hierarchy.

commit 3958e2d0c34e ("cgroup: make per-cgroup pressure stall tracking configurable")
make PSI to skip per-cgroup stall accounting, only account system-wide
to avoid this each level overhead.

But for our use case, we also want leaf cgroup PSI stats accounted for
userspace adjustment on that cgroup, apart from only system-wide adjustment.

So this patch introduce a per-cgroup PSI accounting disable/re-enable
interface "cgroup.pressure", which is a read-write single value file that
allowed values are "0" and "1", the defaults is "1" so per-cgroup
PSI stats is enabled by default.

Implementation details:

It should be relatively straight-forward to disable and re-enable
state aggregation, time tracking, averaging on a per-cgroup level,
if we can live with losing history from while it was disabled.
I.e. the avgs will restart from 0, total= will have gaps.

But it's hard or complex to stop/restart groupc-&gt;tasks[] updates,
which is not implemented in this patch. So we always update
groupc-&gt;tasks[] and PSI_ONCPU bit in psi_group_change() even when
the cgroup PSI stats is disabled.

Suggested-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lkml.kernel.org/r/20220907090332.2078-1-zhouchengming@bytedance.com
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/psi: Cache parent psi_group to speed up group iteration</title>
<updated>2022-09-09T09:08:33+00:00</updated>
<author>
<name>Chengming Zhou</name>
<email>zhouchengming@bytedance.com</email>
</author>
<published>2022-08-25T16:41:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=dc86aba751e2867244411adda1562f6664747019'/>
<id>dc86aba751e2867244411adda1562f6664747019</id>
<content type='text'>
We use iterate_groups() to iterate each level psi_group to update
PSI stats, which is a very hot path.

In current code, iterate_groups() have to use multiple branches and
cgroup_parent() to get parent psi_group for each level, which is not
very efficient.

This patch cache parent psi_group in struct psi_group, only need to get
psi_group of task itself first, then just use group-&gt;parent to iterate.

Signed-off-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lore.kernel.org/r/20220825164111.29534-10-zhouchengming@bytedance.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We use iterate_groups() to iterate each level psi_group to update
PSI stats, which is a very hot path.

In current code, iterate_groups() have to use multiple branches and
cgroup_parent() to get parent psi_group for each level, which is not
very efficient.

This patch cache parent psi_group in struct psi_group, only need to get
psi_group of task itself first, then just use group-&gt;parent to iterate.

Signed-off-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lore.kernel.org/r/20220825164111.29534-10-zhouchengming@bytedance.com
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/psi: Add PSI_IRQ to track IRQ/SOFTIRQ pressure</title>
<updated>2022-09-09T09:08:32+00:00</updated>
<author>
<name>Chengming Zhou</name>
<email>zhouchengming@bytedance.com</email>
</author>
<published>2022-08-25T16:41:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=52b1364ba0b105122d6de0e719b36db705011ac1'/>
<id>52b1364ba0b105122d6de0e719b36db705011ac1</id>
<content type='text'>
Now PSI already tracked workload pressure stall information for
CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
obvious impact on some workload productivity, such as web service
workload.

When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
from update_rq_clock_task(), in which we can record that delta
to CPU curr task's cgroups as PSI_IRQ_FULL status.

Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in
the current task on the CPU, make nothing productive could run
even if it were runnable, so we only use PSI_IRQ_FULL.

Signed-off-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
Now PSI already tracked workload pressure stall information for
CPU, memory and IO. Apart from these, IRQ/SOFTIRQ could have
obvious impact on some workload productivity, such as web service
workload.

When CONFIG_IRQ_TIME_ACCOUNTING, we can get IRQ/SOFTIRQ delta time
from update_rq_clock_task(), in which we can record that delta
to CPU curr task's cgroups as PSI_IRQ_FULL status.

Note we don't use PSI_IRQ_SOME since IRQ/SOFTIRQ always happen in
the current task on the CPU, make nothing productive could run
even if it were runnable, so we only use PSI_IRQ_FULL.

Signed-off-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lore.kernel.org/r/20220825164111.29534-8-zhouchengming@bytedance.com
</pre>
</div>
</content>
</entry>
<entry>
<title>sched/psi: Remove NR_ONCPU task accounting</title>
<updated>2022-09-09T09:08:32+00:00</updated>
<author>
<name>Johannes Weiner</name>
<email>hannes@cmpxchg.org</email>
</author>
<published>2022-08-25T16:41:07+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=71dbdde7914d32e86f01ac1f6e54e964c9dfdbd9'/>
<id>71dbdde7914d32e86f01ac1f6e54e964c9dfdbd9</id>
<content type='text'>
We put all fields updated by the scheduler in the first cacheline of
struct psi_group_cpu for performance.

Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
we need to reclaim space first. This patch remove NR_ONCPU task accounting
in struct psi_group_cpu, use one bit in state_mask to track instead.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Tested-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Link: https://lore.kernel.org/r/20220825164111.29534-7-zhouchengming@bytedance.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We put all fields updated by the scheduler in the first cacheline of
struct psi_group_cpu for performance.

Since we want add another PSI_IRQ_FULL to track IRQ/SOFTIRQ pressure,
we need to reclaim space first. This patch remove NR_ONCPU task accounting
in struct psi_group_cpu, use one bit in state_mask to track instead.

Signed-off-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Signed-off-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Tested-by: Chengming Zhou &lt;zhouchengming@bytedance.com&gt;
Link: https://lore.kernel.org/r/20220825164111.29534-7-zhouchengming@bytedance.com
</pre>
</div>
</content>
</entry>
<entry>
<title>Merge tag 'v5.17-rc5' into sched/core, to resolve conflicts</title>
<updated>2022-02-21T10:53:51+00:00</updated>
<author>
<name>Ingo Molnar</name>
<email>mingo@kernel.org</email>
</author>
<published>2022-02-21T10:53:51+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=6255b48aebfd4dff375e97fc8b075a235848db0b'/>
<id>6255b48aebfd4dff375e97fc8b075a235848db0b</id>
<content type='text'>
New conflicts in sched/core due to the following upstream fixes:

  44585f7bc0cb ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n")
  a06247c6804f ("psi: Fix uaf issue when psi trigger is destroyed while being polled")

Conflicts:
	include/linux/psi_types.h
	kernel/sched/psi.c

Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
New conflicts in sched/core due to the following upstream fixes:

  44585f7bc0cb ("psi: fix "defined but not used" warnings when CONFIG_PROC_FS=n")
  a06247c6804f ("psi: Fix uaf issue when psi trigger is destroyed while being polled")

Conflicts:
	include/linux/psi_types.h
	kernel/sched/psi.c

Signed-off-by: Ingo Molnar &lt;mingo@kernel.org&gt;
</pre>
</div>
</content>
</entry>
<entry>
<title>psi: fix possible trigger missing in the window</title>
<updated>2022-02-16T14:57:54+00:00</updated>
<author>
<name>Zhaoyang Huang</name>
<email>zhaoyang.huang@unisoc.com</email>
</author>
<published>2022-01-25T06:56:58+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=e6df4ead85d9da1b07dd40bd4c6d2182f3e210c4'/>
<id>e6df4ead85d9da1b07dd40bd4c6d2182f3e210c4</id>
<content type='text'>
When a new threshold breaching stall happens after a psi event was
generated and within the window duration, the new event is not
generated because the events are rate-limited to one per window. If
after that no new stall is recorded then the event will not be
generated even after rate-limiting duration has passed. This is
happening because with no new stall, window_update will not be called
even though threshold was previously breached. To fix this, record
threshold breaching occurrence and generate the event once window
duration is passed.

Suggested-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Signed-off-by: Zhaoyang Huang &lt;zhaoyang.huang@unisoc.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Link: https://lore.kernel.org/r/1643093818-19835-1-git-send-email-huangzhaoyang@gmail.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
When a new threshold breaching stall happens after a psi event was
generated and within the window duration, the new event is not
generated because the events are rate-limited to one per window. If
after that no new stall is recorded then the event will not be
generated even after rate-limiting duration has passed. This is
happening because with no new stall, window_update will not be called
even though threshold was previously breached. To fix this, record
threshold breaching occurrence and generate the event once window
duration is passed.

Suggested-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Signed-off-by: Zhaoyang Huang &lt;zhaoyang.huang@unisoc.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Acked-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Link: https://lore.kernel.org/r/1643093818-19835-1-git-send-email-huangzhaoyang@gmail.com
</pre>
</div>
</content>
</entry>
<entry>
<title>psi: Fix uaf issue when psi trigger is destroyed while being polled</title>
<updated>2022-01-18T11:09:57+00:00</updated>
<author>
<name>Suren Baghdasaryan</name>
<email>surenb@google.com</email>
</author>
<published>2022-01-11T23:23:09+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=a06247c6804f1a7c86a2e5398a4c1f1db1471848'/>
<id>a06247c6804f1a7c86a2e5398a4c1f1db1471848</id>
<content type='text'>
With write operation on psi files replacing old trigger with a new one,
the lifetime of its waitqueue is totally arbitrary. Overwriting an
existing trigger causes its waitqueue to be freed and pending poll()
will stumble on trigger-&gt;event_wait which was destroyed.
Fix this by disallowing to redefine an existing psi trigger. If a write
operation is used on a file descriptor with an already existing psi
trigger, the operation will fail with EBUSY error.
Also bypass a check for psi_disabled in the psi_trigger_destroy as the
flag can be flipped after the trigger is created, leading to a memory
leak.

Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Analyzed-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
Signed-off-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Eric Biggers &lt;ebiggers@google.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
With write operation on psi files replacing old trigger with a new one,
the lifetime of its waitqueue is totally arbitrary. Overwriting an
existing trigger causes its waitqueue to be freed and pending poll()
will stumble on trigger-&gt;event_wait which was destroyed.
Fix this by disallowing to redefine an existing psi trigger. If a write
operation is used on a file descriptor with an already existing psi
trigger, the operation will fail with EBUSY error.
Also bypass a check for psi_disabled in the psi_trigger_destroy as the
flag can be flipped after the trigger is created, leading to a memory
leak.

Fixes: 0e94682b73bf ("psi: introduce psi monitor")
Reported-by: syzbot+cdb5dd11c97cc532efad@syzkaller.appspotmail.com
Suggested-by: Linus Torvalds &lt;torvalds@linux-foundation.org&gt;
Analyzed-by: Eric Biggers &lt;ebiggers@kernel.org&gt;
Signed-off-by: Suren Baghdasaryan &lt;surenb@google.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Reviewed-by: Eric Biggers &lt;ebiggers@google.com&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Cc: stable@vger.kernel.org
Link: https://lore.kernel.org/r/20220111232309.1786347-1-surenb@google.com
</pre>
</div>
</content>
</entry>
<entry>
<title>psi: Fix PSI_MEM_FULL state when tasks are in memstall and doing reclaim</title>
<updated>2021-11-17T13:49:00+00:00</updated>
<author>
<name>Brian Chen</name>
<email>brianchen118@gmail.com</email>
</author>
<published>2021-11-10T21:33:12+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf'/>
<id>cb0e52b7748737b2cf6481fdd9b920ce7e1ebbdf</id>
<content type='text'>
We've noticed cases where tasks in a cgroup are stalled on memory but
there is little memory FULL pressure since tasks stay on the runqueue
in reclaim.

A simple example involves a single threaded program that keeps leaking
and touching large amounts of memory. It runs in a cgroup with swap
enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though
there is significant CPU pressure and memory SOME, there is barely any
memory FULL since the task enters reclaim and stays on the runqueue.
However, this memory-bound task is effectively stalled on memory and
we expect memory FULL to match memory SOME in this scenario.

The code is confused about memstall &amp;&amp; running, thinking there is a
stalled task and a productive task when there's only one task: a
reclaimer that's counted as both. To fix this, we redefine the
condition for PSI_MEM_FULL to check that all running tasks are in an
active memstall instead of checking that there are no running tasks.

        case PSI_MEM_FULL:
-               return unlikely(tasks[NR_MEMSTALL] &amp;&amp; !tasks[NR_RUNNING]);
+               return unlikely(tasks[NR_MEMSTALL] &amp;&amp;
+                       tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);

This will capture reclaimers. It will also capture tasks that called
psi_memstall_enter() and are about to sleep, but this should be
negligible noise.

Signed-off-by: Brian Chen &lt;brianchen118@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com
</content>
<content type='xhtml'>
<div xmlns='http://www.w3.org/1999/xhtml'>
<pre>
We've noticed cases where tasks in a cgroup are stalled on memory but
there is little memory FULL pressure since tasks stay on the runqueue
in reclaim.

A simple example involves a single threaded program that keeps leaking
and touching large amounts of memory. It runs in a cgroup with swap
enabled, memory.high set at 10M and cpu.max ratio set at 5%. Though
there is significant CPU pressure and memory SOME, there is barely any
memory FULL since the task enters reclaim and stays on the runqueue.
However, this memory-bound task is effectively stalled on memory and
we expect memory FULL to match memory SOME in this scenario.

The code is confused about memstall &amp;&amp; running, thinking there is a
stalled task and a productive task when there's only one task: a
reclaimer that's counted as both. To fix this, we redefine the
condition for PSI_MEM_FULL to check that all running tasks are in an
active memstall instead of checking that there are no running tasks.

        case PSI_MEM_FULL:
-               return unlikely(tasks[NR_MEMSTALL] &amp;&amp; !tasks[NR_RUNNING]);
+               return unlikely(tasks[NR_MEMSTALL] &amp;&amp;
+                       tasks[NR_RUNNING] == tasks[NR_MEMSTALL_RUNNING]);

This will capture reclaimers. It will also capture tasks that called
psi_memstall_enter() and are about to sleep, but this should be
negligible noise.

Signed-off-by: Brian Chen &lt;brianchen118@gmail.com&gt;
Signed-off-by: Peter Zijlstra (Intel) &lt;peterz@infradead.org&gt;
Acked-by: Johannes Weiner &lt;hannes@cmpxchg.org&gt;
Link: https://lore.kernel.org/r/20211110213312.310243-1-brianchen118@gmail.com
</pre>
</div>
</content>
</entry>
</feed>
