linux-stable.git/kernel/softirq.c, branch v6.5.2

Revert "softirq: Let ksoftirqd do its job"

2023-05-09T19:50:27+00:00

This reverts the following commits:

  4cd13c21b207 ("softirq: Let ksoftirqd do its job")
  3c53776e29f8 ("Mark HI and TASKLET softirq synchronous")
  1342d8080f61 ("softirq: Don't skip softirq execution when softirq thread is parking")

in a single change to avoid known bad intermediate states introduced by a
patch series reverting them individually.

Due to the mentioned commit, when the ksoftirqd threads take charge of
softirq processing, the system can experience high latencies.

In the past a few workarounds have been implemented for specific
side-effects of the initial ksoftirqd enforcement commit:

commit 1ff688209e2e ("watchdog: core: make sure the watchdog_worker is not deferred")
commit 8d5755b3f77b ("watchdog: softdog: fire watchdog even if softirqs do not get to run")
commit 217f69743681 ("net: busy-poll: allow preemption in sk_busy_loop()")
commit 3c53776e29f8 ("Mark HI and TASKLET softirq synchronous")

But the latency problem still exists in real-life workloads, see the link
below.

The reverted commit intended to solve a live-lock scenario that can now be
addressed with the NAPI threaded mode, introduced with commit 29863d41bb6e
("net: implement threaded-able napi poll loop support"), which is nowadays
in a pretty stable status.

While a complete solution to put softirq processing under nice resource
control would be preferable, that has proven to be a very hard task. In
the short term, remove the main pain point, and also simplify a bit the
current softirq implementation.

Signed-off-by: Paolo Abeni 
Signed-off-by: Thomas Gleixner 
Tested-by: Jason Xing 
Reviewed-by: Jakub Kicinski 
Reviewed-by: Eric Dumazet 
Reviewed-by: Sebastian Andrzej Siewior 
Cc: "Paul E. McKenney" 
Cc: Peter Zijlstra 
Cc: netdev@vger.kernel.org
Link: https://lore.kernel.org/netdev/305d7742212cbe98621b16be782b0562f1012cb6.camel@redhat.com
Link: https://lore.kernel.org/r/57e66b364f1b6f09c9bc0316742c3b14f4ce83bd.1683526542.git.pabeni@redhat.com

softirq: Add trace points for tasklet entry/exit

2023-04-15T08:17:16+00:00

Tasklets are supposed to finish their work quickly and should not block the
current running process, but it is not guaranteed that they do so.

Currently softirq_entry/exit can be used to analyse the total tasklets
execution time, but that's not helpful to track individual tasklets
execution time. That makes it hard to identify tasklet functions, which
take more time than expected.

Add tasklet_entry/exit trace point support to track individual tasklet
execution.

Trivial usage example:
   # echo 1 > /sys/kernel/debug/tracing/events/irq/tasklet_entry/enable
   # echo 1 > /sys/kernel/debug/tracing/events/irq/tasklet_exit/enable
   # cat /sys/kernel/debug/tracing/trace
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 4/4   #P:4
 #
 #                                _-----=> irqs-off/BH-disabled
 #                               / _----=> need-resched
 #                              | / _---=> hardirq/softirq
 #                              || / _--=> preempt-depth
 #                              ||| / _-=> migrate-disable
 #                              |||| /     delay
 #           TASK-PID     CPU#  |||||  TIMESTAMP  FUNCTION
 #              | |         |   |||||     |         |
           -0       [003] ..s1.   314.011428: tasklet_entry: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func
           -0       [003] ..s1.   314.011432: tasklet_exit: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func
           -0       [003] ..s1.   314.017369: tasklet_entry: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func
           -0       [003] ..s1.   314.017371: tasklet_exit: tasklet=0xffffa01ef8db2740 function=tcp_tasklet_func

Signed-off-by: Lingutla Chandrasekhar 
Signed-off-by: J. Avila 
Signed-off-by: John Stultz 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Steven Rostedt (Google) 
Link: https://lore.kernel.org/r/20230407230526.1685443-1-jstultz@google.com

[elavila: Port to android-mainline]
[jstultz: Rebased to upstream, cut unused trace points, added
 comments for the tracepoints, reworded commit]

context_tracking: Take IRQ eqs entrypoints over RCU

2022-07-05T20:32:59+00:00

The RCU dynticks counter is going to be merged into the context tracking
subsystem. Prepare with moving the IRQ extended quiescent states
entrypoints to context tracking. For now those are dumb redirection to
existing RCU calls.

[ paulmck: Apply Stephen Rothwell feedback from -next. ]
[ paulmck: Apply Nathan Chancellor feedback. ]

Acked-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Cc: Peter Zijlstra 
Cc: Thomas Gleixner 
Cc: Neeraj Upadhyay 
Cc: Uladzislau Rezki 
Cc: Joel Fernandes 
Cc: Boqun Feng 
Cc: Nicolas Saenz Julienne 
Cc: Marcelo Tosatti 
Cc: Xiongfeng Wang 
Cc: Yu Liao 
Cc: Phil Auld 
Cc: Paul Gortmaker
Cc: Alex Belits 
Signed-off-by: Paul E. McKenney 
Reviewed-by: Nicolas Saenz Julienne 
Tested-by: Nicolas Saenz Julienne

smp: Make softirq handling RT safe in flush_smp_call_function_queue()

2022-05-01T08:03:43+00:00

flush_smp_call_function_queue() invokes do_softirq() which is not available
on PREEMPT_RT. flush_smp_call_function_queue() is invoked from the idle
task and the migration task with preemption or interrupts disabled.

So RT kernels cannot process soft interrupts in that context as that has to
acquire 'sleeping spinlocks' which is not possible with preemption or
interrupts disabled and forbidden from the idle task anyway.

The currently known SMP function call which raises a soft interrupt is in
the block layer, but this functionality is not enabled on RT kernels due to
latency and performance reasons.

RT could wake up ksoftirqd unconditionally, but this wants to be avoided if
there were soft interrupts pending already when this is invoked in the
context of the migration task. The migration task might have preempted a
threaded interrupt handler which raised a soft interrupt, but did not reach
the local_bh_enable() to process it. The "running" ksoftirqd might prevent
the handling in the interrupt thread context which is causing latency
issues.

Add a new function which handles this case explicitely for RT and falls
back to do_softirq() on !RT kernels. In the RT case this warns when one of
the flushed SMP function calls raised a soft interrupt so this can be
investigated.

[ tglx: Moved the RT part out of SMP code ]

Signed-off-by: Sebastian Andrzej Siewior 
Signed-off-by: Thomas Gleixner 
Acked-by: Peter Zijlstra (Intel) 
Link: https://lore.kernel.org/r/YgKgL6aPj8aBES6G@linutronix.de
Link: https://lore.kernel.org/r/20220413133024.356509586@linutronix.de

genirq, softirq: Use in_hardirq() instead of in_irq()

2022-02-02T20:34:19+00:00

Replace the obsolete and ambiguos macro in_irq() with the new macro
in_hardirq().

Signed-off-by: Changbin Du 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20220128110727.5110-1-changbin.du@gmail.com

timers/nohz: Last resort update jiffies on nohz_full IRQ entry

2021-12-02T14:07:22+00:00

When at least one CPU runs in nohz_full mode, a dedicated timekeeper CPU
is guaranteed to stay online and to never stop its tick.

Meanwhile on some rare case, the dedicated timekeeper may be running
with interrupts disabled for a while, such as in stop_machine.

If jiffies stop being updated, a nohz_full CPU may end up endlessly
programming the next tick in the past, taking the last jiffies update
monotonic timestamp as a stale base, resulting in an tick storm.

Here is a scenario where it matters:

0) CPU 0 is the timekeeper and CPU 1 a nohz_full CPU.

1) A stop machine callback is queued to execute somewhere.

2) CPU 0 reaches MULTI_STOP_DISABLE_IRQ while CPU 1 is still in
   MULTI_STOP_PREPARE. Hence CPU 0 can't do its timekeeping duty. CPU 1
   can still take IRQs.

3) CPU 1 receives an IRQ which queues a timer callback one jiffy forward.

4) On IRQ exit, CPU 1 schedules the tick one jiffy forward, taking
   last_jiffies_update as a base. But last_jiffies_update hasn't been
   updated for 2 jiffies since the timekeeper has interrupts disabled.

5) clockevents_program_event(), which relies on ktime_get(), observes
   that the expiration is in the past and therefore programs the min
   delta event on the clock.

6) The tick fires immediately, goto 3)

7) Tick storm, the nohz_full CPU is drown and takes ages to reach
   MULTI_STOP_DISABLE_IRQ, which is the only way out of this situation.

Solve this with unconditionally updating jiffies if the value is stale
on nohz_full IRQ entry. IRQs and other disturbances are expected to be
rare enough on nohz_full for the unconditional call to ktime_get() to
actually matter.

Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Tested-by: Paul E. McKenney 
Link: https://lore.kernel.org/r/20211026141055.57358-2-frederic@kernel.org

genirq: Change force_irqthreads to a static key

2021-08-10T20:50:07+00:00

With CONFIG_IRQ_FORCED_THREADING=y, testing the boolean force_irqthreads
could incur a cache line miss in invoke_softirq() and other places.

Replace the test with a static key to avoid the potential cache miss.

[ tglx: Dropped the IDE part, removed the export and updated blk-mq ]

Suggested-by: Eric Dumazet 
Signed-off-by: Tanner Love 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Eric Dumazet 
Reviewed-by: Kees Cook 
Link: https://lore.kernel.org/r/20210602180338.3324213-1-tannerlove.kernel@gmail.com

sched: Introduce task_is_running()

2021-06-18T09:43:07+00:00

Replace a bunch of 'p->state == TASK_RUNNING' with a new helper:
task_is_running(p).

Signed-off-by: Peter Zijlstra (Intel) 
Acked-by: Davidlohr Bueso 
Acked-by: Geert Uytterhoeven 
Acked-by: Will Deacon 
Link: https://lore.kernel.org/r/20210611082838.222401495@infradead.org

sched: Unbreak wakeups

2021-06-18T09:43:06+00:00

Remove broken task->state references and let wake_up_process() DTRT.

The anti-pattern in these patches breaks the ordering of ->state vs
COND as described in the comment near set_current_state() and can lead
to missed wakeups:

	(OoO load, observes RUNNING)<-.
	for (;;) {                    |
	  t->state = UNINTERRUPTIBLE; |
	  smp_mb();          ,----->  | (observes !COND)
                             |        /
	  if (COND) ---------'       |	COND = 1;
		break;		     `- if (t->state != RUNNING)
					  wake_up_process(t); // not done
	  schedule(); // forever waiting
	}
	t->state = TASK_RUNNING;

Signed-off-by: Peter Zijlstra (Intel) 
Reviewed-by: Davidlohr Bueso 
Acked-by: Greg Kroah-Hartman 
Acked-by: Will Deacon 
Link: https://lore.kernel.org/r/20210611082838.160855222@infradead.org

Merge tag 'core-rcu-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2021-04-28T19:00:13+00:00

Pull RCU updates from Ingo Molnar:

 - Support for "N" as alias for last bit in bitmap parsing library (eg
   using syntax like "nohz_full=2-N")

 - kvfree_rcu updates

 - mm_dump_obj() updates. (One of these is to mm, but was suggested by
   Andrew Morton.)

 - RCU callback offloading update

 - Polling RCU grace-period interfaces

 - Realtime-related RCU updates

 - Tasks-RCU updates

 - Torture-test updates

 - Torture-test scripting updates

 - Miscellaneous fixes

* tag 'core-rcu-2021-04-28' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (77 commits)
  rcutorture: Test start_poll_synchronize_rcu() and poll_state_synchronize_rcu()
  rcu: Provide polling interfaces for Tiny RCU grace periods
  torture: Fix kvm.sh --datestamp regex check
  torture: Consolidate qemu-cmd duration editing into kvm-transform.sh
  torture: Print proper vmlinux path for kvm-again.sh runs
  torture: Make TORTURE_TRUST_MAKE available in kvm-again.sh environment
  torture: Make kvm-transform.sh update jitter commands
  torture: Add --duration argument to kvm-again.sh
  torture: Add kvm-again.sh to rerun a previous torture-test
  torture: Create a "batches" file for build reuse
  torture: De-capitalize TORTURE_SUITE
  torture: Make upper-case-only no-dot no-slash scenario names official
  torture: Rename SRCU-t and SRCU-u to avoid lowercase characters
  torture: Remove no-mpstat error message
  torture: Record kvm-test-1-run.sh and kvm-test-1-run-qemu.sh PIDs
  torture: Record jitter start/stop commands
  torture: Extract kvm-test-1-run-qemu.sh from kvm-test-1-run.sh
  torture: Record TORTURE_KCONFIG_GDB_ARG in qemu-cmd
  torture: Abstract jitter.sh start/stop into scripts
  rcu: Provide polling interfaces for Tree RCU grace periods
  ...