linux.git - Linux kernel source tree

Age	Commit message (Collapse)	Author
12 days	Merge tag 'wq-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq	Linus Torvalds
	Pull workqueue updates from Tejun Heo: - Continued progress toward making alloc_workqueue() unbound by default: more callers converted to WQ_PERCPU / system_percpu_wq / system_dfl_wq, and new warnings for queues that use neither WQ_PERCPU nor WQ_UNBOUND or the legacy system_wq / system_unbound_wq. - Misc: drop the now-trivial apply_wqattrs_lock()/unlock() wrappers, forbid the TEST_WORKQUEUE benchmark from being built-in, and fix a spurious pointer level in the worker debug-dump path. * tag 'wq-for-7.2' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: drm/bridge: anx7625: Add WQ_PERCPU add to alloc_workqueue wifi: ath6kl: fix invalid workqueue flags in ath6kl_usb_create() btrfs: Drop WQ_PERCPU from ordered_flags in btrfs_init_workqueues() workqueue: Add warnings and ensure one among WQ_PERCPU or WQ_UNBOUND is present workqueue: Add warnings and fallback if system_{unbound}_wq is used workqueue: drop spurious '*' from print_worker_info() fn declaration workqueue: forbid TEST_WORKQUEUE from being built-in workqueue: drop apply_wqattrs_lock()/unlock() wrappers umh: replace use of system_unbound_wq with system_dfl_wq rapidio: rio: add WQ_PERCPU to alloc_workqueue users media: ddbridge: add WQ_PERCPU to alloc_workqueue users platform: cznic: turris-omnia-mcu: replace use of system_wq with system_percpu_wq media: synopsys: hdmirx: replace use of system_unbound_wq with system_dfl_wq virt: acrn: Add WQ_PERCPU to alloc_workqueue users
2026-06-15	Merge tag 'sched-core-2026-06-14' of ↵	Linus Torvalds
	gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip Pull scheduler updates from Ingo Molnar: "SMP load-balancing updates: - A large series to introduce infrastructure for cache-aware load balancing, with the goal of co-locating tasks that share data within the same Last Level Cache (LLC) domain. By improving cache locality, the scheduler can reduce cache bouncing and cache misses, ultimately improving data access efficiency. Implemented by Chen Yu and Tim Chen, based on early prototype work by Peter Zijlstra, with fixes by Jianyong Wu, Peter Zijlstra and Shrikanth Hegde. - A series to simplify CONFIG_SCHED_SMT ifdef usage (Shrikanth Hegde) Fair scheduler updates: - A series to improve SD_ASYM_CPUCAPACITY scheduling by introducing SMT awareness (Andrea Righi, K Prateek Nayak) - A series to optimize cfs_rq and sched_entity allocation for better data locality (Zecheng Li) - A preparatory series to change fair/cgroup scheduling to a single runqueue, without the final change (Peter Zijlstra) - Auto-manage ext/fair dl_server bandwidth (Andrea Righi) - Fix cpu_util runnable_avg arithmetic (Hongyan Xia) - Optimize update_tg_load_avg()'s rate-limiting code (Rik van Riel) - Allow account_cfs_rq_runtime() to throttle current hierarchy (K Prateek Nayak) - Update util_est after updating util_avg during dequeue, to fix the util signal update logic, which reduces signal noise (Vincent Guittot) Scheduler topology updates: - Allow multiple domains to claim sched_domain_shared (K Prateek Nayak) - Add parameter to split LLC (Peter Zijlstra) Core scheduler updates: - Use trace_call__<tp>() to save a static branch (Gabriele Monaco) Scheduler statistics updates: - Drop now-stale mul_u64_u64_div_u64() cputime over-approximation guard (Nicolas Pitre) Deadline scheduler updates: - Reject debugfs dl_server writes for offline CPUs (Andrea Righi) - Fix replenishment logic for non-deferred servers (Yuri Andriaccio) RT scheduling updates: - Turn RT_PUSH_IPI default off for non PREEMPT_RT (Steven Rostedt) - Update default bandwidth for real-time tasks to 1.0 (Yuri Andriaccio) Proxy scheduling updates: - A series to implement Optimized Donor Migration for Proxy Execution (John Stultz, Peter Zijlstra) - Various proxy scheduling cleanups and fixes (Peter Zijlstra, K Prateek Nayak) Misc fixes, improvements and cleanups by Aaron Lu, Andrea Righi, Zenghui Yu, Chen Yu, Guanyou.Chen, John Stultz, Shrikanth Hegde, Peter Zijlstra, Liang Luo and Yiyang Chen" * tag 'sched-core-2026-06-14' of gitolite.kernel.org:pub/scm/linux/kernel/git/tip/tip: (91 commits) sched/fair: Fix newidle vs core-sched sched/deadline: Use task_on_rq_migrating() helper sched/core: Combine separate 'else' and 'if' statements sched/fair: Fix cpu_util runnable_avg arithmetic sched/fair: Unify cfs_rq throttling via account_cfs_rq_runtime() sched/fair: Move the throttled tasks to a local list in tg_unthrottle_up() sched/fair: Call update_curr() before unthrottling the hierarchy sched/fair: Use throttled_csd_list for local unthrottle sched/fair: Convert cfs bandwidth throttling to use guards sched/fair: Allocate cfs_tg_state with percpu allocator sched/fair: Remove task_group->se pointer array sched/fair: Co-locate cfs_rq and sched_entity in cfs_tg_state sched: restore timer_slack_ns when resetting RT policy on fork MAINTAINERS: Fix spelling mistake in Peter's name sched: Simplify ttwu_runnable() sched/proxy: Remove superfluous clear_task_blocked_in() sched/proxy: Remove PROXY_WAKING sched/proxy: Switch proxy to use p->is_blocked sched/proxy: Only return migrate when needed sched: Be more strict about p->is_blocked ...
2026-05-29	workqueue: Add warnings and ensure one among WQ_PERCPU or WQ_UNBOUND is present	Marco Crivellari
	Currently there are no checks in order to enforce the use of one between WQ_PERCPU or WQ_UNBOUND. So act as following: - if neither of them is present, set WQ_PERCPU - if both are present, remove WQ_PERCPU Along with this change, WARN_ONCE(), so that the code still uses both or neither of them, can be changed. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-29	workqueue: Add warnings and fallback if system_{unbound}_wq is used	Marco Crivellari
	Currently many users transitioned already to the new introduced workqueue (system_percpu_wq, system_dfl_wq), but there are new users who still use the older system_wq and system_unbound_wq. This change try to push this transition forward, by warning whether the old workqueues are used. Link: https://lore.kernel.org/all/20250221112003.1dSuoGyc@linutronix.de/ Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-27	workqueue: drop spurious '*' from print_worker_info() fn declaration	Breno Leitao
	print_worker_info() declares its local 'fn' as work_func_t * but worker->current_func has type work_func_t (a function pointer). The extra level of indirection is wrong and only happens to be harmless today because every supported Linux architecture has sizeof(work_func_t) == sizeof(work_func_t ): copy_from_kernel_nofault() reads the correct number of bytes by accident, and %ps still resolves the printed address because the stored value is the function address regardless of declared type. On any future ABI where sizeof(void ()()) differs from sizeof(void *), the nofault copy would transfer the wrong number of bytes and the subsequent %ps would print an incorrect address. Match the field type so the intent is explicit and the code does not silently rely on equal pointer sizes. Fixes: 3d1cb2059d93 ("workqueue: include workqueue info when printing debug dump of a worker task") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-19	sched: Simplify ifdeffery around cpu_smt_mask	Shrikanth Hegde
	Now, that cpu_smt_mask is defined as cpumask_of(cpu) for CONFIG_SCHED_SMT=n, it is possible to get rid of the ifdeffery. Effectively, - This makes sched_smt_present is defined always - cpumask_weight(cpumask_of(cpu)) == 1. So sched_smt_present_inc/dec will never enable the sched_smt_present. Which is expected. - Paths that were compile-time eliminated become runtime guarded using static keys. - Defines set_idle_cores, test_idle_cores, etc which could likely benefit the CONFIG_SCHED_SMT=n systems to use the same optimizations within the LLC at wakeups. - This will expose sched_smt_present symbol for CONFIG_SCHED_SMT=n. Likely not a concern. - There is a bloat of code CONFIG_SCHED_SMT=n. (NR_CPUS=2048) add/remove: 24/18 grow/shrink: 26/28 up/down: 6396/-3188 (3208) Total: Before=30629880, After=30633088, chg +0.01% - No code bloat for CONFIG_SCHED_SMT=y, which is expected. - Add comments around stop_core_cpuslocked on why ifdefs are not removed. - This leaves the remaining uses of CONFIG_SCHED_SMT mainly for topology building bits which has a policy based decision. Signed-off-by: Shrikanth Hegde <sshegde@linux.ibm.com> Signed-off-by: Peter Zijlstra (Intel) <peterz@infradead.org> Reviewed-by: Phil Auld <pauld@redhat.com> Reviewed-by: Valentin Schneider <vschneid@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Tested-by: K Prateek Nayak <kprateek.nayak@amd.com> Link: https://patch.msgid.link/20260515172456.542799-3-sshegde@linux.ibm.com
2026-05-11	workqueue: drop apply_wqattrs_lock()/unlock() wrappers	Breno Leitao
	The apply_wqattrs_lock()/unlock() helpers were introduced by commit a0111cf6710b ("workqueue: separate out and refactor the locking of applying attrs") to encapsulate the get_online_cpus() (later cpus_read_lock()) + mutex_lock(&wq_pool_mutex) acquire pair that was duplicated across the apply-attrs paths. Since commit 19af45757383 ("workqueue: Remove cpus_read_lock() from apply_wqattrs_lock()") removed the cpus_read_lock() (pwq creation and installation now operate on wq_online_cpumask, so CPU hotplug no longer needs to be excluded), the wrappers have been one-line forwarders to mutex_lock(&wq_pool_mutex)/mutex_unlock(&wq_pool_mutex). They no longer encode any non-trivial locking rule and obscure the fact that callers just take the existing wq_pool_mutex. This align with the "unnecessary" helpers that got discussed in [1] Inline the eight call sites and remove the wrappers. No functional change. Link: https://lore.kernel.org/all/afs_44-6ToJJVZTn@gmail.com/ [1] Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-08	workqueue: Fix wq->cpu_pwq leak in alloc_and_link_pwqs() WQ_UNBOUND path	Breno Leitao
	For WQ_UNBOUND workqueues, alloc_and_link_pwqs() allocates wq->cpu_pwq via alloc_percpu() and then calls apply_workqueue_attrs_locked(). On failure it returns the error directly, bypassing the enomem: label which holds the only free_percpu(wq->cpu_pwq) in this function. The caller's error path kfree()s wq without touching wq->cpu_pwq, leaking one percpu pointer table (nr_cpu_ids * sizeof(void *) bytes) per failed call. If kmemleak is enabled, we can see: unreferenced object (percpu) 0xc0fffa5b121048 (size 8): comm "insmod", pid 776, jiffies 4294682844 backtrace (crc 0): pcpu_alloc_noprof+0x665/0xac0 __alloc_workqueue+0x33f/0xa20 alloc_workqueue_noprof+0x60/0x100 Route the error through the existing enomem: cleanup and any error before this one. Cc: stable@kernel.org Fixes: 636b927eba5b ("workqueue: Make unbound workqueues to use per-cpu pool_workqueues") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-05-08	workqueue: Release PENDING in __queue_work() drain/destroy reject path	Breno Leitao
	The caller of __queue_work() owns WORK_STRUCT_PENDING, won via test_and_set_bit() in queue_work_on()/__queue_delayed_work(). The state machine documented above __queue_work() requires that owner to either hand the token to a pwq (insert_work() -> set_work_pwq()), hand it to a timer, or release it via set_work_pool_and_clear_pending(). try_to_grab_pending() relies on this: when it observes "PENDING && off-queue" it busy-loops, trusting the current owner to make progress. The (__WQ_DESTROYING \| __WQ_DRAINING) early-return path violates that contract. It WARN_ONCE()s and bare-returns, leaving work->data with PENDING set, WORK_STRUCT_PWQ clear, and work->entry empty. The path is reachable without explicit API abuse: queue_delayed_work() arms a timer with PENDING set; if drain_workqueue() runs while the timer is still pending, delayed_work_timer_fn() -> __queue_work() in softirq context hits the WARN, current is not a wq worker so is_chained_work() is false, and the work is silently dropped with PENDING leaked. Mirror what clear_pending_if_disabled() already does on its analogous reject path: unpack the off-queue data and call set_work_pool_and_clear_pending() to release the token before returning. I was able to reproduce this by queueing several slow works on a max_active=1 wq, arm a delayed_work whose timer fires while drain_workqueue() is blocked, then call cancel_delayed_work_sync(). Without this patch the cancel livelocks at 100% CPU; with it the cancel returns immediately. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-29	workqueue: Annotate alloc_workqueue_va() with __printf(1, 0)	Tejun Heo
	alloc_workqueue_va() forwards its va_list to __alloc_workqueue() which ultimately feeds vsnprintf(). __alloc_workqueue() already carries __printf(1, 0); the new wrapper needs the same annotation so format string checking propagates through the forwarding. Fixes: 0de4cb473aed ("workqueue: fix devm_alloc_workqueue() va_list misuse") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202604300347.2LgXyteh-lkp@intel.com/ Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-28	workqueue: fix devm_alloc_workqueue() va_list misuse	Breno Leitao
	devm_alloc_workqueue() built a va_list and passed it as a single positional argument to the variadic alloc_workqueue() macro: va_start(args, max_active); wq = alloc_workqueue(fmt, flags, max_active, args); va_end(args); C does not allow forwarding a va_list through a ... parameter. alloc_workqueue() expands to alloc_workqueue_noprof(), which runs its own va_start() over its ... params, so the inner vsnprintf(wq->name, sizeof(wq->name), fmt, args) in __alloc_workqueue() received the outer va_list object as the first variadic slot rather than the caller's actual format arguments. Add a new static helper alloc_workqueue_va() that wraps __alloc_workqueue() and runs wq_init_lockdep() on success, and fold both alloc_workqueue_noprof() and devm_alloc_workqueue_noprof() onto it as suggested by Tejun. The wq_init_lockdep() step is required on the devm path too, otherwise __flush_workqueue()'s on-stack COMPLETION_INITIALIZER_ONSTACK_MAP would NULL-deref wq->lockdep_map. No caller changes are required. devm_alloc_ordered_workqueue() is a macro forwarding to devm_alloc_workqueue() and inherits the fix. Two in-tree callers actively trigger the broken path on every probe: drivers/power/supply/mt6370-charger.c:889 drivers/power/supply/max77705_charger.c:649 both of which use devm_alloc_ordered_workqueue(dev, "%s", 0, dev_name(dev)). A standalone reproducer module is available at[1]. Link: https://github.com/leitao/debug/blob/main/workqueue/valist/wq_va_test.c [1] Fixes: 1dfc9d60a69e ("workqueue: devres: Add device-managed allocate workqueue") Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-15	Merge tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq	Linus Torvalds
	Pull workqueue updates from Tejun Heo: - New default WQ_AFFN_CACHE_SHARD affinity scope subdivides LLCs into smaller shards to improve scalability on machines with many CPUs per LLC - Misc: - system_dfl_long_wq for long unbound works - devm_alloc_workqueue() for device-managed allocation - sysfs exposure for ordered workqueues and the EFI workqueue - removal of HK_TYPE_WQ from wq_unbound_cpumask - various small fixes * tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (21 commits) workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id() workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value workqueue: avoid unguarded 64-bit division docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope workqueue: add test_workqueue benchmark module tools/workqueue: add CACHE_SHARD support to wq_dump.py workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope workqueue: add WQ_AFFN_CACHE_SHARD affinity scope workqueue: fix typo in WQ_AFFN_SMT comment workqueue: Remove HK_TYPE_WQ from affecting wq_unbound_cpumask workqueue: unlink pwqs from wq->pwqs list in alloc_and_link_pwqs() error path workqueue: Remove NULL wq WARN in __queue_delayed_work() workqueue: fix parse_affn_scope() prefix matching bug workqueue: devres: Add device-managed allocate workqueue workqueue: Add system_dfl_long_wq for long unbound works tools/workqueue/wq_dump.py: add NODE prefix to all node columns tools/workqueue/wq_dump.py: fix column alignment in node_nr/max_active section tools/workqueue/wq_dump.py: remove backslash separator from node_nr/max_active header efi: Allow to expose the workqueue via sysfs workqueue: Allow to expose ordered workqueues via sysfs ...
2026-04-13	workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()	Breno Leitao
	On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so cpu_shard_id[] is a single-element array (int[1]). In llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an unsigned int that the compiler cannot prove is always 0, triggering a -Warray-bounds warning when the result is used to index cpu_shard_id[]: kernel/workqueue.c:8321:55: warning: array subscript 1 is above array bounds of 'int[1]' [-Warray-bounds] 8321 \| cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)]; \| ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~ This is a false positive: sibling_cpus can never be empty here because 'c' itself is always set in it, so cpumask_first() will always return a valid CPU. However, the compiler cannot prove this statically, and the warning only manifests on UP configs where the array size is 1. Add a bounds check with WARN_ON_ONCE to silence the warning, and store the result in a local variable to make the code clearer and avoid calling cpumask_first() twice. Fixes: 5920d046f7ae ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope") Reported-by: kernel test robot <lkp@intel.com> Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/ Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-07	workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value	Maninder Singh
	use NR_STD_WORKER_POOLS for irq_work_fns[] array definition. NR_STD_WORKER_POOLS is also 2, but better to use MACRO. Initialization loop for_each_bh_worker_pool() also uses same MACRO. Signed-off-by: Maninder Singh <maninder1.s@samsung.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-01	workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope	Breno Leitao
	Set WQ_AFFN_CACHE_SHARD as the default affinity scope for unbound workqueues. On systems where many CPUs share one LLC, the previous default (WQ_AFFN_CACHE) collapses all CPUs to a single worker pool, causing heavy spinlock contention on pool->lock. WQ_AFFN_CACHE_SHARD subdivides each LLC into smaller groups, providing a better balance between locality and contention. Users can revert to the previous behavior with workqueue.default_affinity_scope=cache. On systems with 8 or fewer cores per LLC, CACHE_SHARD produces a single shard covering the entire LLC, making it functionally identical to the previous CACHE default. The sharding only activates when an LLC has more than 8 cores. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-01	workqueue: add WQ_AFFN_CACHE_SHARD affinity scope	Breno Leitao
	On systems where many CPUs share one LLC, unbound workqueues using WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock contention on pool->lock. For example, Chuck Lever measured 39% of cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3 NFS-over-RDMA system. The existing affinity hierarchy (cpu, smt, cache, numa, system) offers no intermediate option between per-LLC and per-SMT-core granularity. Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at most wq_cache_shard_size cores (default 8, tunable via boot parameter). Shards are always split on core (SMT group) boundaries so that Hyper-Threading siblings are never placed in different pods. Cores are distributed across shards as evenly as possible -- for example, 36 cores in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7 cores. The implementation follows the same comparator pattern as other affinity scopes: precompute_cache_shard_ids() pre-fills the cpu_shard_id[] array from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology, and cpus_share_cache_shard() is passed to init_pod_type(). Benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread), show cache_shard delivers ~5x the throughput and ~6.5x lower p50 latency compared to cache scope on this 72-core single-LLC system. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-04-01	workqueue: Add pool_workqueue to pending_pwqs list when unplugging multiple ↵	Matthew Brost
	inactive works In unplug_oldest_pwq(), the first inactive work item on the pool_workqueue is activated correctly. However, if multiple inactive works exist on the same pool_workqueue, subsequent works fail to activate because wq_node_nr_active.pending_pwqs is empty — the list insertion is skipped when the pool_workqueue is plugged. Fix this by checking for additional inactive works in unplug_oldest_pwq() and updating wq_node_nr_active.pending_pwqs accordingly. Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues") Cc: stable@vger.kernel.org Cc: Carlos Santa <carlos.santa@intel.com> Cc: Ryan Neph <ryanneph@google.com> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Waiman Long <longman@redhat.com> Cc: linux-kernel@vger.kernel.org Signed-off-by: Matthew Brost <matthew.brost@intel.com> Signed-off-by: Tejun Heo <tj@kernel.org> Acked-by: Waiman Long <longman@redhat.com>
2026-03-31	workqueue: Remove HK_TYPE_WQ from affecting wq_unbound_cpumask	Waiman Long
	For historical reason, wq_unbound_cpumask is initially set as intersection of HK_TYPE_DOMAIN, HK_TYPE_WQ and workqueue.unbound_cpus boot command line option. At run time, users can update the unbound cpumask via the /sys/devices/virtual/workqueue/cpumask sysfs file. Creation and modification of cpuset isolated partitions will also update wq_unbound_cpumask based on the latest HK_TYPE_DOMAIN cpumask. The HK_TYPE_WQ cpumask is out of the picture with these runtime updates. Complete the transition by taking HK_TYPE_WQ out from the workqueue code and make it depends on HK_TYPE_DOMAIN only from the housekeeping side. The final goal is to eliminate HK_TYPE_WQ as a housekeeping cpumask type. Signed-off-by: Waiman Long <longman@redhat.com> Acked-by: Frederic Weisbecker <frederic@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-25	workqueue: unlink pwqs from wq->pwqs list in alloc_and_link_pwqs() error path	Breno Leitao
	When alloc_and_link_pwqs() fails partway through the per-cpu allocation loop, some pool_workqueues may have already been linked into wq->pwqs via link_pwq(). The error path frees these pwqs with kmem_cache_free() but never removes them from the wq->pwqs list, leaving dangling pointers in the list. Currently this is not exploitable because the workqueue was never added to the global workqueues list and the caller frees the wq immediately after. However, this makes sure that alloc_and_link_pwqs() doesn't leave any half-baked structure, which may have side effects if not properly cleaned up. Fix this by unlinking each pwq from wq->pwqs before freeing it. No locking is needed as the workqueue has not been published yet, thus no concurrency is possible. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-25	workqueue: Better describe stall check	Petr Mladek
	Try to be more explicit why the workqueue watchdog does not take pool->lock by default. Spin locks are full memory barriers which delay anything. Obviously, they would primary delay operations on the related worker pools. Explain why it is enough to prevent the false positive by re-checking the timestamp under the pool->lock. Finally, make it clear what would be the alternative solution in __queue_work() which is a hotter path. Signed-off-by: Petr Mladek <pmladek@suse.com> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-21	workqueue: Fix false positive stall reports	Song Liu
	On weakly ordered architectures (e.g., arm64), the lockless check in wq_watchdog_timer_fn() can observe a reordering between the worklist insertion and the last_progress_ts update. Specifically, the watchdog can see a non-empty worklist (from a list_add) while reading a stale last_progress_ts value, causing a false positive stall report. This was confirmed by reading pool->last_progress_ts again after holding pool->lock in wq_watchdog_timer_fn(): workqueue watchdog: pool 7 false positive detected! lockless_ts=4784580465 locked_ts=4785033728 diff=453263ms worklist_empty=0 To avoid slowing down the hot path (queue_work, etc.), recheck last_progress_ts with pool->lock held. This will eliminate the false positive with minimal overhead. Remove two extra empty lines in wq_watchdog_timer_fn() as we are on it. Fixes: 82607adcf9cd ("workqueue: implement lockup detector") Cc: stable@vger.kernel.org # v4.5+ Assisted-by: claude-code:claude-opus-4-6 Signed-off-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-17	workqueue: Remove NULL wq WARN in __queue_delayed_work()	Tejun Heo
	Remove the WARN_ON_ONCE(!wq) which doesn't serve any useful purpose. Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-13	workqueue: fix parse_affn_scope() prefix matching bug	Breno Leitao
	parse_affn_scope() uses strncasecmp() with the length of the candidate name, which means it only checks if the input starts with a known scope name. Given that the upcoming diff will create "cache_shard" affinity scope, writing "cache_shard" to a workqueue's affinity_scope sysfs attribute always matches "cache" first, making it impossible to select "cache_shard" via sysfs, so, this fix enable it to distinguish "cache" and "cache_shard" Fix by replacing the hand-rolled prefix matching loop with sysfs_match_string(), which uses sysfs_streq() for exact matching (modulo trailing newlines). Also add the missing const qualifier to the wq_affn_names[] array declaration. Note that sysfs_streq() is case-sensitive, unlike the previous strncasecmp() approach. This is intentional and consistent with how other sysfs attributes handle string matching in the kernel. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-10	Merge branch 'for-7.1-devm-alloc-wq' into for-7.1	Tejun Heo

2026-03-10	workqueue: devres: Add device-managed allocate workqueue	Krzysztof Kozlowski
	Add a Resource-managed version of alloc_workqueue() to fix common problem of drivers mixing devm() calls with destroy_workqueue. Such naive and discouraged driver approach leads to difficult to debug bugs when the driver: 1. Allocates workqueue in standard way and destroys it in driver remove() callback, 2. Sets work struct with devm_work_autocancel(), 3. Registers interrupt handler with devm_request_threaded_irq(). Which leads to following unbind/removal path: 1. destroy_workqueue() via driver remove(), Any interrupt coming now would still execute the interrupt handler, which queues work on destroyed workqueue. 2. devm_irq_release(), 3. devm_work_drop() -> cancel_work_sync() on destroyed workqueue. devm_alloc_workqueue() has two benefits: 1. Solves above problem of mix-and-match devres and non-devres code in driver, 2. Simplify any sane drivers which were correctly using alloc_workqueue() + devm_add_action_or_reset(). Signed-off-by: Krzysztof Kozlowski <krzysztof.kozlowski@oss.qualcomm.com> Acked-by: Tejun Heo <tj@kernel.org> Reviewed-by: Andy Shevchenko <andriy.shevchenko@linux.intel.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-09	workqueue: Add system_dfl_long_wq for long unbound works	Marco Crivellari
	Currently there are users of queue_delayed_work() who specify system_long_wq, the per-cpu workqueue. This workqueue should be used for long per-cpu works, but queue_delayed_work() queue the work using: queue_delayed_work_on(WORK_CPU_UNBOUND, ...); This would end up calling __queue_delayed_work() that does: if (housekeeping_enabled(HK_TYPE_TIMER)) { // [....] } else { if (likely(cpu == WORK_CPU_UNBOUND)) add_timer_global(timer); else add_timer_on(timer, cpu); } So when cpu == WORK_CPU_UNBOUND the timer is global and is not using a specific CPU. Later, when __queue_work() is called: if (req_cpu == WORK_CPU_UNBOUND) { if (wq->flags & WQ_UNBOUND) cpu = wq_select_unbound_cpu(raw_smp_processor_id()); else cpu = raw_smp_processor_id(); } Because the wq is not unbound, it takes the CPU where the timer fired and enqueue the work on that CPU. The consequence of all of this is that the work can run anywhere, depending on where the timer fired. Introduce system_dfl_long_wq in order to change, in a future step, users that are still calling: queue_delayed_work(system_long_wq, ...); with the new system_dfl_long_wq instead, so that the work may benefit from scheduler task placement. Signed-off-by: Marco Crivellari <marco.crivellari@suse.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-06	workqueue: Rename show_cpu_pool{s,}_hog{s,}() to reflect broadened scope	Breno Leitao
	show_cpu_pool_hog() and show_cpu_pools_hogs() no longer only dump CPU hogs — since commit 8823eaef45da ("workqueue: Show all busy workers in stall diagnostics"), they dump every in-flight worker in the pool's busy_hash. Rename them to show_cpu_pool_busy_workers() and show_cpu_pools_busy_workers() to accurately describe what they do. Also fix the pr_info() message to say "stalled worker pools" instead of "stalled CPU-bound worker pools", since sleeping/blocked workers are now included. No functional change. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05	workqueue: Show all busy workers in stall diagnostics	Breno Leitao
	show_cpu_pool_hog() only prints workers whose task is currently running on the CPU (task_is_running()). This misses workers that are busy processing a work item but are sleeping or blocked — for example, a worker that clears PF_WQ_WORKER and enters wait_event_idle(). Such a worker still occupies a pool slot and prevents progress, yet produces an empty backtrace section in the watchdog output. This is happening on real arm64 systems, where toggle_allocation_gate() IPIs every single CPU in the machine (which lacks NMI), causing workqueue stalls that show empty backtraces because toggle_allocation_gate() is sleeping in wait_event_idle(). Remove the task_is_running() filter so every in-flight worker in the pool's busy_hash is dumped. The busy_hash is protected by pool->lock, which is already held. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05	workqueue: Show in-flight work item duration in stall diagnostics	Breno Leitao
	When diagnosing workqueue stalls, knowing how long each in-flight work item has been executing is valuable. Add a current_start timestamp (jiffies) to struct worker, set it when a work item begins execution in process_one_work(), and print the elapsed wall-clock time in show_pwq(). Unlike current_at (which tracks CPU runtime and resets on wakeup for CPU-intensive detection), current_start is never reset because the diagnostic cares about total wall-clock time including sleeps. Before: in-flight: 165:stall_work_fn [wq_stall] After: in-flight: 165:stall_work_fn [wq_stall] for 100s Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05	workqueue: Rename pool->watchdog_ts to pool->last_progress_ts	Breno Leitao
	The watchdog_ts name doesn't convey what the timestamp actually tracks. This field tracks the last time a workqueue got progress. Rename it to last_progress_ts to make it clear that it records when the pool last made forward progress (started processing new work items). No functional change. Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-03-05	workqueue: Use POOL_BH instead of WQ_BH when checking pool flags	Breno Leitao
	pr_cont_worker_id() checks pool->flags against WQ_BH, which is a workqueue-level flag (defined in workqueue.h). Pool flags use a separate namespace with POOL_* constants (defined in workqueue.c). The correct constant is POOL_BH. Both WQ_BH and POOL_BH are defined as (1 << 0) so this has no behavioral impact, but it is semantically wrong and inconsistent with every other pool-level BH check in the file. Fixes: 4cb1ef64609f ("workqueue: Implement BH workqueues to eventually replace tasklets") Signed-off-by: Breno Leitao <leitao@debian.org> Acked-by: Song Liu <song@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-27	workqueue: Allow to expose ordered workqueues via sysfs	Sebastian Andrzej Siewior
	Ordered workqueues are not exposed via sysfs because the 'max_active' attribute changes the number actives worker. More than one active worker can break ordering guarantees. This can be avoided by forbidding writes the file for ordered workqueues. Exposing it via sysfs allows to alter other attributes such as the cpumask on which CPU the worker can run. The 'max_active' value shouldn't be changed for BH worker because the core never spawns additional worker and the worker itself can not be preempted. So this make no sense. Allow to expose ordered workqueues via sysfs if requested and forbid changing 'max_active' value for ordered and BH worker. Signed-off-by: Sebastian Andrzej Siewior <bigeasy@linutronix.de> Acked-by: Tejun Heo <tj@kernel.org> Acked-by: Ard Biesheuvel <ardb@kernel.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-21	Convert 'alloc_flex' family to use the new default GFP_KERNEL argument	Linus Torvalds
	This is the exact same thing as the 'alloc_obj()' version, only much smaller because there are a lot fewer users of the alloc_flex() interface. As with alloc_obj() version, this was done entirely with mindless brute force, using the same script, except using 'flex' in the pattern rather than 'objs'. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21	Convert 'alloc_obj' family to use the new default GFP_KERNEL argument	Linus Torvalds
	This was done entirely with mindless brute force, using git grep -l '\<k[vmz]alloc_objs(., GFP_KERNEL)' \| xargs sed -i 's/\(alloc_objs(.*\), GFP_KERNEL)/\1)/' to convert the new alloc_obj() users that had a simple GFP_KERNEL argument to just drop that argument. Note that due to the extreme simplicity of the scripting, any slightly more complex cases spread over multiple lines would not be triggered: they definitely exist, but this covers the vast bulk of the cases, and the resulting diff is also then easier to check automatically. For the same reason the 'flex' versions will be done as a separate conversion. Signed-off-by: Linus Torvalds <torvalds@linux-foundation.org>
2026-02-21	treewide: Replace kmalloc with kmalloc_obj for non-scalar types	Kees Cook
	This is the result of running the Coccinelle script from scripts/coccinelle/api/kmalloc_objs.cocci. The script is designed to avoid scalar types (which need careful case-by-case checking), and instead replace kmalloc-family calls that allocate struct or union object instances: Single allocations: kmalloc(sizeof(TYPE), ...) are replaced with: kmalloc_obj(TYPE, ...) Array allocations: kmalloc_array(COUNT, sizeof(TYPE), ...) are replaced with: kmalloc_objs(TYPE, COUNT, ...) Flex array allocations: kmalloc(struct_size(PTR, FAM, COUNT), ...) are replaced with: kmalloc_flex(PTR, FAM, COUNT, ...) (where TYPE may also be VAR) The resulting allocations no longer return "void ", instead returning "TYPE ". Signed-off-by: Kees Cook <kees@kernel.org>
2026-02-11	Merge tag 'wq-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq	Linus Torvalds
	Pull workqueue updates from Tejun Heo: - Rework the rescuer to process work items one-by-one instead of slurping all pending work items in a single pass. As there is only one rescuer per workqueue, a single long-blocking work item could cause high latency for all tasks queued behind it, even after memory pressure is relieved and regular kworkers become available to service them. - Add CONFIG_BOOTPARAM_WQ_STALL_PANIC build-time option and workqueue.panic_on_stall_time parameter for time-based stall panic, giving systems more control over workqueue stall handling. - Replace BUG_ON() with panic() in the stall panic path for clearer intent and more informative output. * tag 'wq-for-6.20' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: workqueue: replace BUG_ON with panic in panic_on_wq_watchdog workqueue: add time-based panic for stalls workqueue: add CONFIG_BOOTPARAM_WQ_STALL_PANIC option workqueue: Process extra works in rescuer on memory pressure workqueue: Process rescuer work items one-by-one using a cursor workqueue: Make send_mayday() take a PWQ argument directly
2026-02-07	workqueue: replace BUG_ON with panic in panic_on_wq_watchdog	Breno Leitao
	Replace BUG_ON() with panic() in panic_on_wq_watchdog(). This is not a bug condition but a deliberate forced panic requested by the user via module parameters to crash the system for debugging purposes. Using panic() instead of BUG_ON() makes this intent clearer and provides more informative output about which threshold was exceeded and the actual values, making it easier to diagnose the stall condition from crash dumps. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-07	workqueue: add time-based panic for stalls	Breno Leitao
	Add a new module parameter 'panic_on_stall_time' that triggers a panic when a workqueue stall persists for longer than the specified duration in seconds. Unlike 'panic_on_stall' which counts accumulated stall events, this parameter triggers based on the duration of a single continuous stall. This is useful for catching truly stuck workqueues rather than accumulating transient stalls. Usage: workqueue.panic_on_stall_time=120 This would panic if any workqueue pool has been stalled for 120 seconds or more. The stall duration is measured from the workqueue last progress (poll_ts) which accounts for legitimate system stalls. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-03	workqueue: add CONFIG_BOOTPARAM_WQ_STALL_PANIC option	Breno Leitao
	Add a kernel config option to set the default value of workqueue.panic_on_stall, similar to CONFIG_BOOTPARAM_SOFTLOCKUP_PANIC, CONFIG_BOOTPARAM_HARDLOCKUP_PANIC and CONFIG_BOOTPARAM_HUNG_TASK_PANIC. This allows setting the number of workqueue stalls before triggering a kernel panic at build time, which is useful for high-availability systems that need consistent panic-on-stall, in other words, those servers which run with CONFIG_BOOTPARAM_*_PANIC=y already. The default remains 0 (disabled). Setting it to 1 will panic on the first stall, and higher values will panic after that many stall warnings. The value can still be overridden at runtime via the workqueue.panic_on_stall boot parameter or sysfs. Signed-off-by: Breno Leitao <leitao@debian.org> Signed-off-by: Tejun Heo <tj@kernel.org>
2026-02-03	cpuset: Propagate cpuset isolation update to workqueue through housekeeping	Frederic Weisbecker
	Until now, cpuset would propagate isolated partition changes to workqueues so that unbound workers get properly reaffined. Since housekeeping now centralizes, synchronize and propagates isolation cpumask changes, perform the work from that subsystem for consolidation and consistency purposes. For simplification purpose, the target function is adapted to take the new housekeeping mask instead of the isolated mask. Suggested-by: Tejun Heo <tj@kernel.org> Signed-off-by: Frederic Weisbecker <frederic@kernel.org> Reviewed-by: Waiman Long <longman@redhat.com> Acked-by: Tejun Heo <tj@kernel.org> Cc: "Michal Koutný" <mkoutny@suse.com> Cc: Ingo Molnar <mingo@redhat.com> Cc: Johannes Weiner <hannes@cmpxchg.org> Cc: Lai Jiangshan <jiangshanlai@gmail.com> Cc: Marco Crivellari <marco.crivellari@suse.com> Cc: Michal Hocko <mhocko@suse.com> Cc: Peter Zijlstra <peterz@infradead.org> Cc: Tejun Heo <tj@kernel.org> Cc: Thomas Gleixner <tglx@linutronix.de> Cc: Vlastimil Babka <vbabka@suse.cz> Cc: Waiman Long <longman@redhat.com> Cc: cgroups@vger.kernel.org
2025-12-08	workqueue: Process extra works in rescuer on memory pressure	Lai Jiangshan
	Make the rescuer process more work on the last pwq when there are no more to rescue for the whole workqueue to help the regular workers in case it is a temporary memory pressure relief and to reduce relapse. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08	workqueue: Process rescuer work items one-by-one using a cursor	Lai Jiangshan
	Previously, the rescuer scanned for all matching work items at once and processed them within a single rescuer thread, which could cause one blocking work item to stall all others. Make the rescuer process work items one-by-one instead of slurping all matches in a single pass. Break the rescuer loop after finding and processing the first matching work item, then restart the search to pick up the next. This gives normal worker threads a chance to process other items which gives them the opportunity to be processed instead of waiting on the rescuer's queue and prevents a blocking work item from stalling the rest once memory pressure is relieved. Introduce a dummy cursor work item to avoid potentially O(N^2) rescans of the work list. The marker records the resume position for the next scan, eliminating redundant traversals. Also introduce RESCUER_BATCH to control the maximum number of work items the rescuer processes in each turn, and move on to other PWQs when the limit is reached. Cc: ying chen <yc1082463@gmail.com> Reported-by: ying chen <yc1082463@gmail.com> Fixes: e22bee782b3b ("workqueue: implement concurrency managed dynamic worker pool") Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-12-08	workqueue: Make send_mayday() take a PWQ argument directly	Lai Jiangshan
	Make send_mayday() operate on a PWQ directly instead of taking a work item, so that rescuer_thread() now calls send_mayday(pwq) instead of open-coding the mayday list manipulation. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-21	workqueue: Don't rely on wq->rescuer to stop rescuer	Lai Jiangshan
	The commit1 def98c84b6cd ("workqueue: Fix spurious sanity check failures in destroy_workqueue()") tries to fix spurious sanity check failures by stopping send_mayday() via setting wq->rescuer to NULL. But it fails to stop the pwq->mayday_node requeuing in the rescuer, and the commit2 e66b39af00f4 ("workqueue: Fix pwq ref leak in rescuer_thread()") fixes it by checking wq->rescuer which is the result of commit1. Both commits together really fix spurious sanity check failures caused by the rescuer, but they both use a convoluted method by relying on wq->rescuer state rather than the real count of work items. Actually __WQ_DESTROYING and drain_workqueue() together already stop send_mayday() by draining all the work items and ensuring no new work item requeuing. And the more proper fix to stop the pwq->mayday_node requeuing in the rescuer is from commit3 4f3f4cf388f8 ("workqueue: avoid unneeded requeuing the pwq in rescuer thread") and renders the checking of wq->rescuer in commit2 unnecessary. So __WQ_DESTROYING, drain_workqueue() and commit3 together fix spurious sanity check failures introduced by the rescuer. Just remove the convoluted code of using wq->rescuer. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-21	workqueue: Only assign rescuer work when really needed	Lai Jiangshan
	If the pwq does not need rescue (normal workers have been created or become available), the rescuer can immediately move on to other stalled pwqs. Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-21	workqueue: Factor out assign_rescuer_work()	Lai Jiangshan
	Move the code to assign work to rescuer and assign_rescuer_work(). Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20	workqueue: Init rescuer's affinities as wq_unbound_cpumask	Lai Jiangshan
	The affinity to set to the rescuers should be consistent in all paths when a rescuer is in detached state. The affinity could be either wq_unbound_cpumask or unbound_effective_cpumask(wq). Related paths: rescuer's worker_detach_from_pool() update wq_unbound_cpumask update wq's cpumask init_rescuer() Both affinities are Ok as long as they are consistent in all paths. In the commit 449b31ad2937 ("workqueue: Init rescuer's affinities as the wq's effective cpumask") makes init_rescuer use unbound_effective_cpumask(wq) which is consistent with then apply_wqattrs_commit(). But using unbound_effective_cpumask(wq) requres much more code to maintain the consistency, and it doesn't make much sense since the affinity is only effective when the rescuer is not processing works. wq_unbound_cpumask is more favorable. So apply_wqattrs_commit() and the path of "updating wq's cpumask" had been changed to not update the rescuer's affinity, and both the paths of "updating wq_unbound_cpumask" and "rescuer's worker_detach_from_pool()" had been changed to use wq_unbound_cpumask. Now, make init_rescuer() use wq_unbound_cpumask for rescuer's affinity and make all the paths consistent. Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20	workqueue: Let DISASSOCIATED workers follow unbound wq cpumask changes	Lai Jiangshan
	When workqueue cpumask changes are committed, the DISASSOCIATED workers affinity is not touched and this might be a problem down the line for isolated setups when the DISASSOCIATED pools still have works to run after the cpu is offline. Make sure the workers' affinity is updated every time a workqueue cpumask changes, so these workers can't break isolation. Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-20	workqueue: Update the rescuer's affinity only when it is detached	Lai Jiangshan
	When a rescuer is attached to a pool, its affinity should be only managed by the pool. But updating the detached rescuer's affinity is still meaningful so that it will not disrupt isolated CPUs when it is to be waken up. But the commit d64f2fa064f8 ("kernel/workqueue: Let rescuers follow unbound wq cpumask changes") updates the affinity unconditionally, and causes some issues 1) it also changes the affinity when the rescuer is already attached to a pool, which violates the affinity management. 2) the said commit tries to update the affinity of the rescuers, but it misses the rescuers of the PERCPU workqueues, and isolated CPUs can be possibly disrupted by these rescuers when they are summoned. 3) The affinity to set to the rescuers should be consistent in all paths when a rescuer is in detached state. The affinity could be either wq_unbound_cpumask or unbound_effective_cpumask(wq). Related paths: rescuer's worker_detach_from_pool() update wq_unbound_cpumask update wq's cpumask init_rescuer() Both affinities are Ok as long as they are consistent in all paths. But using unbound_effective_cpumask(wq) requres much more code to maintain the consistency, and it doesn't make much sense since the affinity is only effective when the rescuer is not processing works. wq_unbound_cpumask is more favorable. Fix the 1) issue by testing rescuer->pool before updating with wq_pool_attach_mutex held. Fix the 2) issue by moving the rescuer's affinity updating code to the place updating wq_unbound_cpumask and make it also update for PERCPU workqueues. Partially cleanup the 3) consistency issue by using wq_unbound_cpumask. So that the path of "updating wq's cpumask" doesn't need to maintain it. and both the paths of "updating wq_unbound_cpumask" and "rescuer's worker_detach_from_pool()" use wq_unbound_cpumask. Cleanup for init_rescuer()'s consistency for affinity can be done in future. Cc: Juri Lelli <juri.lelli@redhat.com> Cc: Waiman Long <longman@redhat.com> Signed-off-by: Lai Jiangshan <jiangshan.ljs@antgroup.com> Signed-off-by: Tejun Heo <tj@kernel.org>
2025-11-10	workqueue: Remove unused assert_rcu_or_wq_mutex_or_pool_mutex	zhang jiao
	assert_rcu_or_wq_mutex_or_pool_mutex is never referenced in the code. Just remove it. Signed-off-by: zhang jiao <zhangjiao2@cmss.chinamobile.com> Reviewed-by: Lai Jiangshan <jiangshanlai@gmail.com> Signed-off-by: Tejun Heo <tj@kernel.org>