linux-stable.git/kernel/time, branch linux-6.17.y

timers/migration: Fix imbalanced NUMA trees

2025-12-18T12:59:31+00:00

[ Upstream commit 5eb579dfd46b4949117ecb0f1ba2f12d3dc9a6f2 ]

When a CPU from a new node boots, the old root may happen to be
connected to the new root even if their node mismatch, as depicted in
the following scenario:

1) CPU 0 boots and creates the first group for node 0.

   [GRP0:0]
    node 0
      |
    CPU 0

2) CPU 1 from node 1 boots and creates a new top that corresponds to
   node 1, but it also connects the old root from node 0 to the new root
   from node 1 by mistake.

             [GRP1:0]
              node 1
            /        \
           /          \
   [GRP0:0]             [GRP0:1]
    node 0               node 1
      |                    |
    CPU 0                CPU 1

3) This eventually leads to an imbalanced tree where some node 0 CPUs
   migrate node 1 timers (and vice versa) way before reaching the
   crossnode groups, resulting in more frequent remote memory accesses
   than expected.

                      [GRP2:0]
                      NUMA_NO_NODE
                     /             \
             [GRP1:0]              [GRP1:1]
              node 1               node 0
            /        \                |
           /          \             [...]
   [GRP0:0]             [GRP0:1]
    node 0               node 1
      |                    |
    CPU 0...              CPU 1...

A balanced tree should only contain groups having children that belong
to the same node:

                      [GRP2:0]
                      NUMA_NO_NODE
                     /             \
             [GRP1:0]              [GRP1:0]
              node 0               node 1
            /        \             /      \
           /          \           /        \
   [GRP0:0]          [...]      [...]    [GRP0:1]
    node 0                                node 1
      |                                     |
    CPU 0...                              CPU 1...

In order to fix this, the hierarchy must be unfolded up to the crossnode
level as soon as a node mismatch is detected. For example the stage 2
above should lead to this layout:

                      [GRP2:0]
                      NUMA_NO_NODE
                     /             \
             [GRP1:0]              [GRP1:1]
              node 0               node 1
              /                         \
             /                           \
        [GRP0:0]                        [GRP0:1]
        node 0                           node 1
          |                                |
       CPU 0                             CPU 1

This means that not only GRP1:0 must be created but also GRP1:1 and
GRP2:0 in order to prepare a balanced tree for next CPUs to boot.

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Link: https://patch.msgid.link/20251024132536.39841-4-frederic@kernel.org
Signed-off-by: Sasha Levin

timers/migration: Remove locking on group connection

2025-12-18T12:59:31+00:00

[ Upstream commit fa9620355d4192200f15cb3d97c6eb9c02442249 ]

Initializing the tmc's group, the group's number of children and the
group's parent can all be done without locking because:

1) Reading the group's parent and its group mask is done locklessly.

2) The connections prepared for a given CPU hierarchy are visible to the
target CPU once online, thanks to the CPU hotplug enforced memory
ordering.

3) In case of a newly created upper level, the new root and its
connections and initialization are made visible by the CPU which made
the connections. When that CPUs goes idle in the future, the new link
is published by tmigr_inactive_up() through the atomic RmW on
->migr_state.

4) If CPUs were still walking up the active hierarchy, they could observe
the new root earlier. In this case the ordering is enforced by an
early initialization of the group mask and by barriers that maintain
address dependency as explained in:

b729cc1ec21a ("timers/migration: Fix another race between hotplug and idle entry/exit")
de3ced72a792 ("timers/migration: Enforce group initialization visibility to tree walkers")

5) Timers are propagated by a chain of group locking from the bottom to
the top. And while doing so, the tree also propagates groups links
and initialization. Therefore remote expiration, which also relies
on group locking, will observe those links and initialization while
holding the root lock before walking the tree remotely and update
remote timers. This is especially important for migrators in the
active hierarchy that may observe the new root early.

Therefore the locking is unnecessary at initialization. If anything, it
just brings confusion. Remove it.

Signed-off-by: Frederic Weisbecker
Signed-off-by: Thomas Gleixner
Link: https://patch.msgid.link/20251024132536.39841-3-frederic@kernel.org
Stable-dep-of: 5eb579dfd46b ("timers/migration: Fix imbalanced NUMA trees")
Signed-off-by: Sasha Levin

timers/migration: Convert "while" loops to use "for"

2025-12-18T12:59:31+00:00

[ Upstream commit 6c181b5667eea3e6564d334443536a5974190e15 ]

Both the "do while" and "while" loops in tmigr_setup_groups() eventually
mimic the behaviour of "for" loops.

Simplify accordingly.

Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Link: https://patch.msgid.link/20251024132536.39841-2-frederic@kernel.org
Stable-dep-of: 5eb579dfd46b ("timers/migration: Fix imbalanced NUMA trees")
Signed-off-by: Sasha Levin

timekeeping: Fix error code in tk_aux_sysfs_init()

2025-12-06T21:27:34+00:00

commit c7418164b463056bf4327b6a2abe638b78250f13 upstream.

If kobject_create_and_add() fails on the first iteration, then the error
code is set to -ENOMEM which is correct. But if it fails in subsequent
iterations then "ret" is zero, which means success, but it should be
-ENOMEM.

Set the error code to -ENOMEM correctly.

Fixes: 7b5ab04f035f ("timekeeping: Fix resource leak in tk_aux_sysfs_init() error paths")
Signed-off-by: Dan Carpenter 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Malaya Kumar Rout 
Link: https://patch.msgid.link/aSW1R8q5zoY_DgQE@stanley.mountain
Signed-off-by: Greg Kroah-Hartman

timekeeping: Fix resource leak in tk_aux_sysfs_init() error paths

2025-12-01T10:46:00+00:00

[ Upstream commit 7b5ab04f035f829ed6008e4685501ec00b3e73c9 ]

tk_aux_sysfs_init() returns immediately on error during the auxiliary clock
initialization loop without cleaning up previously allocated kobjects and
sysfs groups.

If kobject_create_and_add() or sysfs_create_group() fails during loop
iteration, the parent kobjects (tko and auxo) and any previously created
child kobjects are leaked.

Fix this by adding proper error handling with goto labels to ensure all
allocated resources are cleaned up on failure. kobject_put() on the
parent kobjects will handle cleanup of their children.

Fixes: 7b95663a3d96 ("timekeeping: Provide interface to control auxiliary clocks")
Signed-off-by: Malaya Kumar Rout 
Signed-off-by: Thomas Gleixner 
Link: https://patch.msgid.link/20251120150213.246777-1-mrout@redhat.com
Signed-off-by: Sasha Levin

tick/sched: Fix bogus condition in report_idle_softirq()

2025-12-01T10:45:59+00:00

[ Upstream commit 807e0d187da4c0b22036b5e34000f7a8c52f6e50 ]

In commit 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle") the
new function report_idle_softirq() was created by breaking code out of the
existing can_stop_idle_tick() for kernels v5.18 and newer.

In doing so, the code essentially went from this form:

	if (A) {
		static int ratelimit;
		if (ratelimit < 10 && !C && A&D) {
                       pr_warn("NOHZ tick-stop error: ...");
		       ratelimit++;
		}
		return false;
	}

to a new function:

static bool report_idle_softirq(void)
{
       static int ratelimit;

       if (likely(!A))
               return false;

       if (ratelimit < 10)
               return false;
...
       pr_warn("NOHZ tick-stop error: local softirq work is pending, handler #%02x!!!\n",
               pending);
       ratelimit++;

       return true;
}

commit a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition") realized
ratelimit was essentially set to zero instead of ten, and hence *no*
softirq pending messages would ever be issued, but "fixed" it as:

-       if (ratelimit < 10)
+       if (ratelimit >= 10)
                return false;

However, this fix introduced another issue:

When ratelimit is greater than or equal 10, even if A is true, it will
directly return false. While ratelimit in the original code was only used
to control printing and will not affect the return value.

Restore the original logic and restrict ratelimit to control the printk and
not the return value.

Fixes: 0345691b24c0 ("tick/rcu: Stop allowing RCU_SOFTIRQ in idle")
Fixes: a7e282c77785 ("tick/rcu: Fix bogus ratelimit condition")
Signed-off-by: Wen Yang 
Signed-off-by: Thomas Gleixner 
Link: https://patch.msgid.link/20251119174525.29470-1-wen.yang@linux.dev
Signed-off-by: Sasha Levin

timers: Fix NULL function pointer race in timer_shutdown_sync()

2025-12-01T10:45:33+00:00

commit 20739af07383e6eb1ec59dcd70b72ebfa9ac362c upstream.

There is a race condition between timer_shutdown_sync() and timer
expiration that can lead to hitting a WARN_ON in expire_timers().

The issue occurs when timer_shutdown_sync() clears the timer function
to NULL while the timer is still running on another CPU. The race
scenario looks like this:

CPU0					CPU1
					
					lock_timer_base()
					expire_timers()
					base->running_timer = timer;
					unlock_timer_base()
					[call_timer_fn enter]
					mod_timer()
					...
timer_shutdown_sync()
lock_timer_base()
// For now, will not detach the timer but only clear its function to NULL
if (base->running_timer != timer)
	ret = detach_if_pending(timer, base, true);
if (shutdown)
	timer->function = NULL;
unlock_timer_base()
					[call_timer_fn exit]
					lock_timer_base()
					base->running_timer = NULL;
					unlock_timer_base()
					...
					// Now timer is pending while its function set to NULL.
					// next timer trigger
					
					expire_timers()
					WARN_ON_ONCE(!fn) // hit
					...
lock_timer_base()
// Now timer will detach
if (base->running_timer != timer)
	ret = detach_if_pending(timer, base, true);
if (shutdown)
	timer->function = NULL;
unlock_timer_base()

The problem is that timer_shutdown_sync() clears the timer function
regardless of whether the timer is currently running. This can leave a
pending timer with a NULL function pointer, which triggers the
WARN_ON_ONCE(!fn) check in expire_timers().

Fix this by only clearing the timer function when actually detaching the
timer. If the timer is running, leave the function pointer intact, which is
safe because the timer will be properly detached when it finishes running.

Fixes: 0cc04e80458a ("timers: Add shutdown mechanism to the internal functions")
Signed-off-by: Yipeng Zou 
Signed-off-by: Thomas Gleixner 
Cc: stable@vger.kernel.org
Link: https://patch.msgid.link/20251122093942.301559-1-zouyipeng@huawei.com
Signed-off-by: Greg Kroah-Hartman

posix-timers: Plug potential memory leak in do_timer_create()

2025-11-24T09:37:36+00:00

[ Upstream commit e0fd4d42e27f761e9cc82801b3f183e658dc749d ]

When posix timer creation is set to allocate a given timer ID and the
access to the user space value faults, the function terminates without
freeing the already allocated posix timer structure.

Move the allocation after the user space access to cure that.

[ tglx: Massaged change log ]

Fixes: ec2d0c04624b3 ("posix-timers: Provide a mechanism to allocate a given timer ID")
Reported-by: syzbot+9c47ad18f978d4394986@syzkaller.appspotmail.com
Suggested-by: Cyrill Gorcunov 
Signed-off-by: Eslam Khafagy 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Frederic Weisbecker 
Link: https://patch.msgid.link/20251114122739.994326-1-eslam.medhat1993@gmail.com
Closes: https://lore.kernel.org/all/69155df4.a70a0220.3124cb.0017.GAE@google.com/T/
Signed-off-by: Sasha Levin

timekeeping: Fix aux clocks sysfs initialization loop bound

2025-11-02T13:18:02+00:00

[ Upstream commit 39a9ed0fb6dac58547afdf9b6cb032d326a3698f ]

The loop in tk_aux_sysfs_init() uses `i <= MAX_AUX_CLOCKS` as the
termination condition, which results in 9 iterations (i=0 to 8) when
MAX_AUX_CLOCKS is defined as 8. However, the kernel is designed to support
only up to 8 auxiliary clocks.

This off-by-one error causes the creation of a 9th sysfs entry that exceeds
the intended auxiliary clock range.

Fix the loop bound to use `i < MAX_AUX_CLOCKS` to ensure exactly 8
auxiliary clock entries are created, matching the design specification.

Fixes: 7b95663a3d96 ("timekeeping: Provide interface to control auxiliary clocks")
Signed-off-by: Haofeng Li 
Signed-off-by: Thomas Gleixner 
Link: https://patch.msgid.link/tencent_2376993D9FC06A3616A4F981B3DE1C599607@qq.com
Signed-off-by: Sasha Levin

tick: Do not set device to detached state in tick_shutdown()

2025-10-15T10:03:26+00:00

[ Upstream commit fe2a449a45b13df1562419e0104b4777b6ea5248 ]

tick_shutdown() sets the state of the clockevent device to detached
first and the invokes clockevents_exchange_device(), which in turn
invokes clockevents_switch_state().

But clockevents_switch_state() returns without invoking the device shutdown
callback as the device is already in detached state. As a consequence the
timer device is not shutdown when a CPU goes offline.

tick_shutdown() does this because it was originally invoked on a online CPU
and not on the outgoing CPU. It therefore could not access the clockevent
device of the already offlined CPU and just set the state.

Since commit 3b1596a21fbf tick_shutdown() is called on the outgoing CPU, so
the hardware device can be accessed.

Remove the state set before calling clockevents_exchange_device(), so that
the subsequent clockevents_switch_state() handles the state transition and
invokes the shutdown callback of the clockevent device.

[ tglx: Massaged change log ]

Fixes: 3b1596a21fbf ("clockevents: Shutdown and unregister current clockevents at CPUHP_AP_TICK_DYING")
Signed-off-by: Bibo Mao 
Signed-off-by: Thomas Gleixner 
Reviewed-by: Frederic Weisbecker 
Link: https://lore.kernel.org/all/20250906064952.3749122-2-maobibo@loongson.cn
Signed-off-by: Sasha Levin