<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux.git/kernel/workqueue.c, branch v7.1-rc3</title>
<subtitle>Linux kernel source tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/'/>
<entry>
<title>workqueue: Annotate alloc_workqueue_va() with __printf(1, 0)</title>
<updated>2026-04-29T19:44:16+00:00</updated>
<author>
<name>Tejun Heo</name>
<email>tj@kernel.org</email>
</author>
<published>2026-04-29T19:44:16+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=20e81c64c905bd765e69ef07920d2b1130dc79b6'/>
<id>20e81c64c905bd765e69ef07920d2b1130dc79b6</id>
<content type='text'>
alloc_workqueue_va() forwards its va_list to __alloc_workqueue() which
ultimately feeds vsnprintf(). __alloc_workqueue() already carries
__printf(1, 0); the new wrapper needs the same annotation so format
string checking propagates through the forwarding.
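
Python cannot express the compile-time attribute, but the propagation idea
can be sketched at runtime; every name below is illustrative, not a kernel
or compiler API:

```python
# NOTE: purely illustrative. __printf(1, 0) is a compile-time GCC check;
# this decorator only mimics how checking must ride along when a variadic
# wrapper forwards its argument pack to a v-variant.
def checks_format(fn):
    def wrapper(fmt, args):
        fmt % tuple(args)    # raises TypeError on a fmt/args mismatch
        return fn(fmt, args)
    return wrapper

@checks_format
def log_va(fmt, args):
    # analogue of a __printf(1, 0) v-variant: fmt plus a prebuilt pack
    return fmt % tuple(args)

def log_fmt(fmt, *args):
    # variadic wrapper forwarding its pack, like alloc_workqueue_va()
    return log_va(fmt, args)

assert log_fmt("%s=%d", "len", 4) == "len=4"
```

Without the check on the v-variant, a bad format string in a caller of
log_fmt() would go unnoticed, which is the gap the annotation closes.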

Fixes: 0de4cb473aed ("workqueue: fix devm_alloc_workqueue() va_list misuse")
Reported-by: kernel test robot &lt;lkp@intel.com&gt;
Closes: https://lore.kernel.org/oe-kbuild-all/202604300347.2LgXyteh-lkp@intel.com/
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>workqueue: fix devm_alloc_workqueue() va_list misuse</title>
<updated>2026-04-28T16:13:40+00:00</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2026-04-28T15:10:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=0de4cb473aed57ee4ba7e0551ad27bddc19fc519'/>
<id>0de4cb473aed57ee4ba7e0551ad27bddc19fc519</id>
<content type='text'>
devm_alloc_workqueue() built a va_list and passed it as a single
positional argument to the variadic alloc_workqueue() macro:

	va_start(args, max_active);
	wq = alloc_workqueue(fmt, flags, max_active, args);
	va_end(args);

C does not allow forwarding a va_list through a ... parameter.
alloc_workqueue() expands to alloc_workqueue_noprof(), which runs
its own va_start() over its ... params, so the inner
vsnprintf(wq-&gt;name, sizeof(wq-&gt;name), fmt, args) in
__alloc_workqueue() received the outer va_list object as the first
variadic slot rather than the caller's actual format arguments.
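
The misuse has a close Python analogue, with all names below being
illustrative stand-ins rather than the kernel API: passing the argument
pack as one positional value mirrors handing a va_list to a "..."
parameter:

```python
def alloc_name(fmt, *args):
    # stands in for alloc_workqueue(): it packs its own variadic args
    return fmt % args

def devm_alloc_name_buggy(fmt, *args):
    # the bug: the whole pack lands in the first variadic slot
    return alloc_name(fmt, args)

def devm_alloc_name_fixed(fmt, *args):
    # the fix: forward the individual arguments (C needs an explicit
    # v-variant for this, hence this patch's alloc_workqueue_va())
    return alloc_name(fmt, *args)

assert devm_alloc_name_fixed("%s-%d", "wq", 3) == "wq-3"
try:
    devm_alloc_name_buggy("%s-%d", "wq", 3)
    raise AssertionError("expected a format/argument mismatch")
except TypeError:
    pass   # one tuple cannot satisfy both %s and %d
```

In C the fixed variant corresponds to forwarding through an explicit
va_list parameter, which is what alloc_workqueue_va() provides.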

Add a new static helper alloc_workqueue_va() that wraps
__alloc_workqueue() and runs wq_init_lockdep() on success, and
fold both alloc_workqueue_noprof() and devm_alloc_workqueue_noprof()
onto it as suggested by Tejun.

The wq_init_lockdep() step is required on the devm path
too, otherwise __flush_workqueue()'s on-stack
COMPLETION_INITIALIZER_ONSTACK_MAP would NULL-deref wq-&gt;lockdep_map.

No caller changes are required. devm_alloc_ordered_workqueue() is
a macro forwarding to devm_alloc_workqueue() and inherits the fix.
Two in-tree callers actively trigger the broken path on every probe:

  drivers/power/supply/mt6370-charger.c:889
  drivers/power/supply/max77705_charger.c:649

both of which use devm_alloc_ordered_workqueue(dev, "%s", 0,
dev_name(dev)).

A standalone reproducer module is available at [1].

Link: https://github.com/leitao/debug/blob/main/workqueue/valist/wq_va_test.c [1]
Fixes: 1dfc9d60a69e ("workqueue: devres: Add device-managed allocate workqueue")
Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>Merge tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq</title>
<updated>2026-04-15T17:32:08+00:00</updated>
<author>
<name>Linus Torvalds</name>
<email>torvalds@linux-foundation.org</email>
</author>
<published>2026-04-15T17:32:08+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=7de6b4a246330fe29fa2fd144b4724ca35d60d6c'/>
<id>7de6b4a246330fe29fa2fd144b4724ca35d60d6c</id>
<content type='text'>
Pull workqueue updates from Tejun Heo:

 - New default WQ_AFFN_CACHE_SHARD affinity scope subdivides LLCs into
   smaller shards to improve scalability on machines with many CPUs per
   LLC

 - Misc:
    - system_dfl_long_wq for long unbound works
    - devm_alloc_workqueue() for device-managed allocation
    - sysfs exposure for ordered workqueues and the EFI workqueue
    - removal of HK_TYPE_WQ from wq_unbound_cpumask
    - various small fixes

* tag 'wq-for-7.1' of git://git.kernel.org/pub/scm/linux/kernel/git/tj/wq: (21 commits)
  workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()
  workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value
  workqueue: avoid unguarded 64-bit division
  docs: workqueue: document WQ_AFFN_CACHE_SHARD affinity scope
  workqueue: add test_workqueue benchmark module
  tools/workqueue: add CACHE_SHARD support to wq_dump.py
  workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope
  workqueue: add WQ_AFFN_CACHE_SHARD affinity scope
  workqueue: fix typo in WQ_AFFN_SMT comment
  workqueue: Remove HK_TYPE_WQ from affecting wq_unbound_cpumask
  workqueue: unlink pwqs from wq-&gt;pwqs list in alloc_and_link_pwqs() error path
  workqueue: Remove NULL wq WARN in __queue_delayed_work()
  workqueue: fix parse_affn_scope() prefix matching bug
  workqueue: devres: Add device-managed allocate workqueue
  workqueue: Add system_dfl_long_wq for long unbound works
  tools/workqueue/wq_dump.py: add NODE prefix to all node columns
  tools/workqueue/wq_dump.py: fix column alignment in node_nr/max_active section
  tools/workqueue/wq_dump.py: remove backslash separator from node_nr/max_active header
  efi: Allow to expose the workqueue via sysfs
  workqueue: Allow to expose ordered workqueues via sysfs
  ...
</content>
</entry>
<entry>
<title>workqueue: validate cpumask_first() result in llc_populate_cpu_shard_id()</title>
<updated>2026-04-13T16:15:26+00:00</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2026-04-13T14:26:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=76af54648899abbd6b449c035583e47fd407078a'/>
<id>76af54648899abbd6b449c035583e47fd407078a</id>
<content type='text'>
On uniprocessor (UP) configs such as nios2, NR_CPUS is 1, so
cpu_shard_id[] is a single-element array (int[1]). In
llc_populate_cpu_shard_id(), cpumask_first(sibling_cpus) returns an
unsigned int that the compiler cannot prove is always 0, triggering
a -Warray-bounds warning when the result is used to index
cpu_shard_id[]:

  kernel/workqueue.c:8321:55: warning: array subscript 1 is above
  array bounds of 'int[1]' [-Warray-bounds]
   8321 |  cpu_shard_id[c] = cpu_shard_id[cpumask_first(sibling_cpus)];
        |                    ~~~~~~~~~~~~^~~~~~~~~~~~~~~~~~~~~~~~~~~~~

This is a false positive: sibling_cpus can never be empty here because
'c' itself is always set in it, so cpumask_first() will always return a
valid CPU. However, the compiler cannot prove this statically, and the
warning only manifests on UP configs where the array size is 1.

Add a bounds check with WARN_ON_ONCE to silence the warning, and store
the result in a local variable to make the code clearer and avoid calling
cpumask_first() twice.
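
The pattern can be sketched in userspace Python (stand-in names, not the
kernel helpers):

```python
# Capture the lookup result once, bounds-check it, then index the array.
NR_CPUS = 1                      # UP config, as on nios2
cpu_shard_id = [0] * NR_CPUS

def warn_on_once(cond):
    return cond                  # the kernel helper also logs once; elided

def shard_of_first_sibling(first_cpu):
    # local variable plus bounds check, mirroring the fix
    if warn_on_once(first_cpu not in range(NR_CPUS)):
        return None              # never reached in practice, per above
    return cpu_shard_id[first_cpu]

cpu_shard_id[0] = 3
assert shard_of_first_sibling(0) == 3
assert shard_of_first_sibling(1) is None   # caught instead of indexing OOB
```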

Fixes: 5920d046f7ae ("workqueue: add WQ_AFFN_CACHE_SHARD affinity scope")
Reported-by: kernel test robot &lt;lkp@intel.com&gt;
Closes: https://lore.kernel.org/oe-kbuild-all/202604022343.GQtkF2vO-lkp@intel.com/
Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>workqueue: use NR_STD_WORKER_POOLS instead of hardcoded value</title>
<updated>2026-04-07T18:13:19+00:00</updated>
<author>
<name>Maninder Singh</name>
<email>maninder1.s@samsung.com</email>
</author>
<published>2026-04-07T03:42:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=034db4dd4449c556705e6b32bc07bd31df3889ba'/>
<id>034db4dd4449c556705e6b32bc07bd31df3889ba</id>
<content type='text'>
Use NR_STD_WORKER_POOLS for the irq_work_fns[] array definition instead
of the hardcoded value 2. The macro also evaluates to 2, but it keeps the
array size in sync with the initialization loop, which already iterates
with for_each_bh_worker_pool().
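
A minimal Python sketch of the same sizing rule (illustrative names only):

```python
NR_STD_WORKER_POOLS = 2          # mirrors the kernel constant

# size the table from the constant so it cannot drift from the loop bound
irq_work_fns = [None] * NR_STD_WORKER_POOLS

# analogue of the for_each_bh_worker_pool() initialization loop
for pool in range(NR_STD_WORKER_POOLS):
    irq_work_fns[pool] = "pool-%d" % pool

assert len(irq_work_fns) == NR_STD_WORKER_POOLS
```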

Signed-off-by: Maninder Singh &lt;maninder1.s@samsung.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>workqueue: set WQ_AFFN_CACHE_SHARD as the default affinity scope</title>
<updated>2026-04-01T20:24:18+00:00</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2026-04-01T13:03:54+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=4cdc8a7389d5025051f6c4a60fb5b7cb9b7960bb'/>
<id>4cdc8a7389d5025051f6c4a60fb5b7cb9b7960bb</id>
<content type='text'>
Set WQ_AFFN_CACHE_SHARD as the default affinity scope for unbound
workqueues. On systems where many CPUs share one LLC, the previous
default (WQ_AFFN_CACHE) collapses all CPUs to a single worker pool,
causing heavy spinlock contention on pool-&gt;lock.

WQ_AFFN_CACHE_SHARD subdivides each LLC into smaller groups, providing
a better balance between locality and contention. Users can revert to
the previous behavior with workqueue.default_affinity_scope=cache.

On systems with 8 or fewer cores per LLC, CACHE_SHARD produces a single
shard covering the entire LLC, making it functionally identical to the
previous CACHE default. The sharding only activates when an LLC has more
than 8 cores.

Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>workqueue: add WQ_AFFN_CACHE_SHARD affinity scope</title>
<updated>2026-04-01T20:24:18+00:00</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2026-04-01T13:03:53+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=5920d046f7ae3bf9cf51b9d915c1fff13d299d84'/>
<id>5920d046f7ae3bf9cf51b9d915c1fff13d299d84</id>
<content type='text'>
On systems where many CPUs share one LLC, unbound workqueues using
WQ_AFFN_CACHE collapse to a single worker pool, causing heavy spinlock
contention on pool-&gt;lock. For example, Chuck Lever measured 39% of
cycles lost to native_queued_spin_lock_slowpath on a 12-core shared-L3
NFS-over-RDMA system.

The existing affinity hierarchy (cpu, smt, cache, numa, system) offers
no intermediate option between per-LLC and per-SMT-core granularity.

Add WQ_AFFN_CACHE_SHARD, which subdivides each LLC into groups of at
most wq_cache_shard_size cores (default 8, tunable via boot parameter).
Shards are always split on core (SMT group) boundaries so that
Hyper-Threading siblings are never placed in different pods. Cores are
distributed across shards as evenly as possible -- for example, 36 cores
in a single LLC with max shard size 8 produces 5 shards of 8+7+7+7+7
cores.
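
The distribution rule can be sketched as follows; shard_sizes() is a
hypothetical helper, not the kernel function, which assigns per-CPU ids
in precompute_cache_shard_ids() rather than returning sizes:

```python
def shard_sizes(cores, max_size):
    nr_shards = -(-cores // max_size)        # ceiling division
    base, extra = divmod(cores, nr_shards)   # first 'extra' shards get one more
    return [base + 1] * extra + [base] * (nr_shards - extra)

# 36 cores with max shard size 8: five shards of 8+7+7+7+7 cores
assert shard_sizes(36, 8) == [8, 7, 7, 7, 7]
# an LLC of 8 or fewer cores collapses to a single shard
assert shard_sizes(8, 8) == [8]
```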

The implementation follows the same comparator pattern as other affinity
scopes: precompute_cache_shard_ids() pre-fills the cpu_shard_id[] array
from the already-initialized WQ_AFFN_CACHE and WQ_AFFN_SMT topology,
and cpus_share_cache_shard() is passed to init_pod_type().

A benchmark on NVIDIA Grace (72 CPUs, single LLC, 50k items/thread) shows
cache_shard delivering ~5x the throughput and ~6.5x lower p50 latency
than the cache scope.

Suggested-by: Tejun Heo &lt;tj@kernel.org&gt;
Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>workqueue: Add pool_workqueue to pending_pwqs list when unplugging multiple inactive works</title>
<updated>2026-04-01T20:18:22+00:00</updated>
<author>
<name>Matthew Brost</name>
<email>matthew.brost@intel.com</email>
</author>
<published>2026-04-01T01:07:39+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=703ccb63ae9f7444d6ff876d024e17f628103c69'/>
<id>703ccb63ae9f7444d6ff876d024e17f628103c69</id>
<content type='text'>
In unplug_oldest_pwq(), the first inactive work item on the
pool_workqueue is activated correctly. However, if multiple inactive
works exist on the same pool_workqueue, subsequent works fail to
activate because wq_node_nr_active.pending_pwqs is empty — the list
insertion is skipped when the pool_workqueue is plugged.

Fix this by checking for additional inactive works in
unplug_oldest_pwq() and updating wq_node_nr_active.pending_pwqs
accordingly.

Fixes: 4c065dbce1e8 ("workqueue: Enable unbound cpumask update on ordered workqueues")
Cc: stable@vger.kernel.org
Cc: Carlos Santa &lt;carlos.santa@intel.com&gt;
Cc: Ryan Neph &lt;ryanneph@google.com&gt;
Cc: Lai Jiangshan &lt;jiangshanlai@gmail.com&gt;
Cc: Waiman Long &lt;longman@redhat.com&gt;
Cc: linux-kernel@vger.kernel.org
Signed-off-by: Matthew Brost &lt;matthew.brost@intel.com&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
Acked-by: Waiman Long &lt;longman@redhat.com&gt;
</content>
</entry>
<entry>
<title>workqueue: Remove HK_TYPE_WQ from affecting wq_unbound_cpumask</title>
<updated>2026-03-31T21:43:25+00:00</updated>
<author>
<name>Waiman Long</name>
<email>longman@redhat.com</email>
</author>
<published>2026-03-31T18:35:22+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=2ab739383113107a1335ce7bcbc93110afb69267'/>
<id>2ab739383113107a1335ce7bcbc93110afb69267</id>
<content type='text'>
For historical reasons, wq_unbound_cpumask is initially set to the
intersection of HK_TYPE_DOMAIN, HK_TYPE_WQ and the workqueue.unbound_cpus
boot command line option.

At run time, users can update the unbound cpumask via the
/sys/devices/virtual/workqueue/cpumask sysfs file. Creation
and modification of cpuset isolated partitions will also update
wq_unbound_cpumask based on the latest HK_TYPE_DOMAIN cpumask.
The HK_TYPE_WQ cpumask is out of the picture with these runtime updates.

Complete the transition by taking HK_TYPE_WQ out of the workqueue code
and making it depend on HK_TYPE_DOMAIN only from the housekeeping side.
The final goal is to eliminate HK_TYPE_WQ as a housekeeping cpumask type.

Signed-off-by: Waiman Long &lt;longman@redhat.com&gt;
Acked-by: Frederic Weisbecker &lt;frederic@kernel.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
<entry>
<title>workqueue: unlink pwqs from wq-&gt;pwqs list in alloc_and_link_pwqs() error path</title>
<updated>2026-03-25T15:52:01+00:00</updated>
<author>
<name>Breno Leitao</name>
<email>leitao@debian.org</email>
</author>
<published>2026-03-23T10:18:36+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux.git/commit/?id=afeaa9f2532d1d8d04803d09ac2d4f7107854f29'/>
<id>afeaa9f2532d1d8d04803d09ac2d4f7107854f29</id>
<content type='text'>
When alloc_and_link_pwqs() fails partway through the per-cpu allocation
loop, some pool_workqueues may have already been linked into wq-&gt;pwqs
via link_pwq(). The error path frees these pwqs with kmem_cache_free()
but never removes them from the wq-&gt;pwqs list, leaving dangling pointers
in the list.

Currently this is not exploitable because the workqueue was never added
to the global workqueues list and the caller frees the wq immediately
afterwards. Still, the unlink ensures that alloc_and_link_pwqs() never
leaves a half-torn-down structure behind, which could have side effects
if it is not properly cleaned up.

Fix this by unlinking each pwq from wq-&gt;pwqs before freeing it. No
locking is needed as the workqueue has not been published yet, thus
no concurrency is possible.
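
The unlink-before-free rule can be sketched with a toy Python model (all
names illustrative, not the kernel API):

```python
class Workqueue:
    def __init__(self):
        self.pwqs = []           # stands in for the wq pwqs list

def free_pwqs_on_error(wq):
    while wq.pwqs:
        pwq = wq.pwqs.pop()      # list_del() analogue: unlink first...
        del pwq                  # ...then free (kmem_cache_free() analogue)

wq = Workqueue()
wq.pwqs = ["pwq0", "pwq1"]
free_pwqs_on_error(wq)
assert wq.pwqs == []             # nothing stale remains reachable from the wq
```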

Signed-off-by: Breno Leitao &lt;leitao@debian.org&gt;
Signed-off-by: Tejun Heo &lt;tj@kernel.org&gt;
</content>
</entry>
</feed>
