linux.git/kernel/time, branch v6.9-rc2

Fix memory leak in posix_clock_open()

2024-03-27T16:03:22+00:00

If the clk ops.open() function returns an error, we don't release the
pccontext we allocated for this clock.

Re-organize the code slightly to make it all more obvious.

Reported-by: Rohit Keshri 
Acked-by: Oleg Nesterov 
Fixes: 60c6946675fc ("posix-clock: introduce posix_clock_context concept")
Cc: Jakub Kicinski 
Cc: David S. Miller 
Cc: Thomas Gleixner 
Signed-off-by: Linus Torvalds

Merge tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2024-03-23T21:49:25+00:00

Pull timer fixes from Thomas Gleixner:
 "Two regression fixes for the timer and timer migration code:

   - Prevent endless timer requeuing which is caused by two CPUs racing
     out of idle. This happens when the last CPU goes idle and therefore
     has to ensure to expire the pending global timers and some other
     CPU come out of idle at the same time and the other CPU wins the
     race and expires the global queue. This causes the last CPU to
     chase ghost timers forever and reprogramming it's clockevent device
     endlessly.

     Cure this by re-evaluating the wakeup time unconditionally.

   - The split into local (pinned) and global timers in the timer wheel
     caused a regression for NOHZ full as it broke the idle tracking of
     global timers. On NOHZ full this prevents an self IPI being sent
     which in turn causes the timer to be not programmed and not being
     expired on time.

     Restore the idle tracking for the global timer base so that the
     self IPI condition for NOHZ full is working correctly again"

* tag 'timers-urgent-2024-03-23' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timers: Fix removed self-IPI on global timer's enqueue in nohz_full
  timers/migration: Fix endless timer requeue after idle interrupts

Merge tag 'rtc-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux

2024-03-22T00:16:46+00:00

Pull RTC updates from Alexandre Belloni:
 "Subsytem:
   - rtc_class is now const

  Drivers:
   - ds1511: cleanup, set date and time range and alarm offset limit
   - max31335: fix interrupt handler
   - pcf8523: improve suspend support"

* tag 'rtc-6.9' of git://git.kernel.org/pub/scm/linux/kernel/git/abelloni/linux: (28 commits)
  MAINTAINER: Include linux-arm-msm for Qualcomm RTC patches
  dt-bindings: rtc: zynqmp: Add support for Versal/Versal NET SoCs
  rtc: class: make rtc_class constant
  dt-bindings: rtc: abx80x: Improve checks on trickle charger constraints
  MAINTAINERS: adjust file entry in ARM/Mediatek RTC DRIVER
  rtc: nct3018y: fix possible NULL dereference
  rtc: max31335: fix interrupt status reg
  rtc: mt6397: select IRQ_DOMAIN instead of depending on it
  dt-bindings: rtc: abx80x: convert to yaml
  rtc: m41t80: Use the unified property API get the wakeup-source property
  dt-bindings: at91rm9260-rtt: add sam9x7 compatible
  dt-bindings: rtc: convert MT7622 RTC to the json-schema
  dt-bindings: rtc: convert MT2717 RTC to the json-schema
  rtc: pcf8523: add suspend handlers for alarm IRQ
  rtc: ds1511: set alarm offset limit
  rtc: ds1511: set range
  rtc: ds1511: drop inline/noinline hints
  rtc: ds1511: rename pdata
  rtc: ds1511: implement ds1511_rtc_read_alarm properly
  rtc: ds1511: remove partial alarm support
  ...

timers: Fix removed self-IPI on global timer's enqueue in nohz_full

2024-03-19T09:14:55+00:00

While running in nohz_full mode, a task may enqueue a timer while the
tick is stopped. However the only places where the timer wheel,
alongside the timer migration machinery's decision, may reprogram the
next event accordingly with that new timer's expiry are the idle loop or
any IRQ tail.

However neither the idle task nor an interrupt may run on the CPU if it
resumes busy work in userspace for a long while in full dynticks mode.

To solve this, the timer enqueue path raises a self-IPI that will
re-evaluate the timer wheel on its IRQ tail. This asynchronous solution
avoids potential locking inversion.

This is supposed to happen both for local and global timers but commit:

	b2cf7507e186 ("timers: Always queue timers on the local CPU")

broke the global timers case with removing the ->is_idle field handling
for the global base. As a result, global timers enqueue may go unnoticed
in nohz_full.

Fix this with restoring the idle tracking of the global timer's base,
allowing self-IPIs again on enqueue time.

Fixes: b2cf7507e186 ("timers: Always queue timers on the local CPU")
Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20240318230729.15497-3-frederic@kernel.org

timers/migration: Fix endless timer requeue after idle interrupts

2024-03-19T09:14:55+00:00

When a CPU is an idle migrator, but another CPU wakes up before it,
becomes an active migrator and handles the queue, the initial idle
migrator may end up endlessly reprogramming its clockevent, chasing ghost
timers forever such as in the following scenario:

               [GRP0:0]
             migrator = 0
             active   = 0
             nextevt  = T1
              /         \
             0           1
          active        idle (T1)

0) CPU 1 is idle and has a timer queued (T1), CPU 0 is active and is
the active migrator.

               [GRP0:0]
             migrator = NONE
             active   = NONE
             nextevt  = T1
              /         \
             0           1
          idle        idle (T1)
          wakeup = T1

1) CPU 0 is now idle and is therefore the idle migrator. It has
programmed its next timer interrupt to handle T1.

                [GRP0:0]
             migrator = 1
             active   = 1
             nextevt  = KTIME_MAX
              /         \
             0           1
          idle        active
          wakeup = T1

2) CPU 1 has woken up, it is now active and it has just handled its own
timer T1.

3) CPU 0 gets a timer interrupt to handle T1 but tmigr_handle_remote()
realize it is not the migrator anymore. So it early returns without
observing that T1 has been expired already and therefore without
updating its ->wakeup value.

4) CPU 0 goes into tmigr_cpu_new_timer() which also early returns
because it doesn't queue a timer of its own. So ->wakeup is left
unchanged and the next timer is programmed to fire now.

5) goto 3) forever

This results in timer interrupt storms in idle and also in nohz_full (as
observed in rcutorture's TREE07 scenario).

Fix this with forcing a re-evaluation of tmc->wakeup while trying
remote timer handling when the CPU isn't the migrator anymmore. The
check is inherently racy but in the worst case the CPU just races setting
the KTIME_MAX value that a remote expiry also tries to set.

Fixes: 7ee988770326 ("timers: Implement the hierarchical pull model")
Reported-by: Paul E. McKenney 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Link: https://lore.kernel.org/r/20240318230729.15497-2-frederic@kernel.org

Merge tag 'timers-urgent-2024-03-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2024-03-17T19:19:02+00:00

Pull timer fix from Ingo Molnar:
 "Fix timer migration bug that can result in long bootup delays and
  other oddities"

* tag 'timers-urgent-2024-03-17' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  timer/migration: Remove buggy early return on deactivation

timer/migration: Remove buggy early return on deactivation

2024-03-16T18:55:46+00:00

When a CPU enters into idle and deactivates itself from the timer
migration hierarchy without any global timer of its own to propagate,
the group event of that CPU is set to "ignore" and tmigr_update_events()
accordingly performs an early return without considering timers queued
by other CPUs.

If the hierarchy has a single level, and the CPU is the last one to
enter idle, it will ignore others' global timers, as in the following
layout:

           [GRP0:0]
         migrator = 0
         active   = 0
         nextevt  = T0i
          /         \
         0           1
      active (T0i)  idle (T1)

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.

           [GRP0:0]
         migrator = NONE
         active   = NONE
         nextevt  = T0i
          /         \
         0           1
      idle (T0i)  idle (T1)

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1 and CPU 0 goes to idle with T1
unhandled.

This isn't proper to single level hierarchy though. A similar issue,
although slightly different, may arise on multi-level:

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = 0              migrator = NONE
           active   = 0              active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
      active         idle            idle       idle

0) CPU 0 is active thus its event is ignored (the letter 'i') and so are
upper levels' events. CPU 1 is idle and has the timer T1 enqueued.
CPU 2 also has a timer. The expiry order is T0 (ignored) < T1 < T2

                            [GRP1:0]
                         migrator = GRP0:0
                         active   = GRP0:0
                         nextevt  = T0:0i, T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

1) CPU 0 goes idle without global event queued. Therefore KTIME_MAX is
pushed as its next expiry and its own event kept as "ignore". As a result
tmigr_update_events() ignores T1. The change only propagated up to 1st
level so far.

                            [GRP1:0]
                         migrator = NONE
                         active   = NONE
                         nextevt  = T0:1
                         /              \
              [GRP0:0]                  [GRP0:1]
           migrator = NONE           migrator = NONE
           active   = NONE           active   = NONE
           nextevt  = T0i            nextevt  = T2
           /         \                /         \
          0 (T0i)     1 (T1)         2 (T2)      3
        idle         idle            idle         idle

2) The change now propagates up to the top. tmigr_update_events() finds
that the child event is ignored and thus removes it. The top level next
event is now T2 which is returned to CPU 0 as its next effective expiry
to take account for as the global idle migrator. However T1 has been
ignored along the way, leaving it unhandled.

Fix those issues with removing the buggy related early return. Ignored
child events must not prevent from evaluating the other events within
the same group.

Reported-by: Boqun Feng 
Reported-by: Florian Fainelli 
Reported-by: Thomas Gleixner 
Signed-off-by: Frederic Weisbecker 
Signed-off-by: Thomas Gleixner 
Tested-by: Florian Fainelli 
Link: https://lore.kernel.org/r/ZfOhB9ZByTZcBy4u@lothringen

Merge tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2024-03-12T02:53:15+00:00

Pull core x86 updates from Ingo Molnar:

 - The biggest change is the rework of the percpu code, to support the
   'Named Address Spaces' GCC feature, by Uros Bizjak:

      - This allows C code to access GS and FS segment relative memory
        via variables declared with such attributes, which allows the
        compiler to better optimize those accesses than the previous
        inline assembly code.

      - The series also includes a number of micro-optimizations for
        various percpu access methods, plus a number of cleanups of %gs
        accesses in assembly code.

      - These changes have been exposed to linux-next testing for the
        last ~5 months, with no known regressions in this area.

 - Fix/clean up __switch_to()'s broken but accidentally working handling
   of FPU switching - which also generates better code

 - Propagate more RIP-relative addressing in assembly code, to generate
   slightly better code

 - Rework the CPU mitigations Kconfig space to be less idiosyncratic, to
   make it easier for distros to follow & maintain these options

 - Rework the x86 idle code to cure RCU violations and to clean up the
   logic

 - Clean up the vDSO Makefile logic

 - Misc cleanups and fixes

* tag 'x86-core-2024-03-11' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (52 commits)
  x86/idle: Select idle routine only once
  x86/idle: Let prefer_mwait_c1_over_halt() return bool
  x86/idle: Cleanup idle_setup()
  x86/idle: Clean up idle selection
  x86/idle: Sanitize X86_BUG_AMD_E400 handling
  sched/idle: Conditionally handle tick broadcast in default_idle_call()
  x86: Increase brk randomness entropy for 64-bit systems
  x86/vdso: Move vDSO to mmap region
  x86/vdso/kbuild: Group non-standard build attributes and primary object file rules together
  x86/vdso: Fix rethunk patching for vdso-image-{32,64}.o
  x86/retpoline: Ensure default return thunk isn't used at runtime
  x86/vdso: Use CONFIG_COMPAT_32 to specify vdso32
  x86/vdso: Use $(addprefix ) instead of $(foreach )
  x86/vdso: Simplify obj-y addition
  x86/vdso: Consolidate targets and clean-files
  x86/bugs: Rename CONFIG_RETHUNK              => CONFIG_MITIGATION_RETHUNK
  x86/bugs: Rename CONFIG_CPU_SRSO             => CONFIG_MITIGATION_SRSO
  x86/bugs: Rename CONFIG_CPU_IBRS_ENTRY       => CONFIG_MITIGATION_IBRS_ENTRY
  x86/bugs: Rename CONFIG_CPU_UNRET_ENTRY      => CONFIG_MITIGATION_UNRET_ENTRY
  x86/bugs: Rename CONFIG_SLS                  => CONFIG_MITIGATION_SLS
  ...

Merge tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2024-03-11T21:38:26+00:00

Pull timer updates from Thomas Gleixner:
"A large set of updates and features for timers and timekeeping:

- The hierarchical timer pull model

When timer wheel timers are armed they are placed into the timer
wheel of a CPU which is likely to be busy at the time of expiry.
This is done to avoid wakeups on potentially idle CPUs.

This is wrong in several aspects:

1) The heuristics to select the target CPU are wrong by
definition as the chance to get the prediction right is
close to zero.

2) Due to #1 it is possible that timers are accumulated on
a single target CPU

3) The required computation in the enqueue path is just overhead
for dubious value especially under the consideration that the
vast majority of timer wheel timers are either canceled or
rearmed before they expire.

The timer pull model avoids the above by removing the target
computation on enqueue and queueing timers always on the CPU on
which they get armed.

This is achieved by having separate wheels for CPU pinned timers
and global timers which do not care about where they expire.

As long as a CPU is busy it handles both the pinned and the global
timers which are queued on the CPU local timer wheels.

When a CPU goes idle it evaluates its own timer wheels:

- If the first expiring timer is a pinned timer, then the global
timers can be ignored as the CPU will wake up before they
expire.

- If the first expiring timer is a global timer, then the expiry
time is propagated into the timer pull hierarchy and the CPU
makes sure to wake up for the first pinned timer.

The timer pull hierarchy organizes CPUs in groups of eight at the
lowest level and at the next levels groups of eight groups up to
the point where no further aggregation of groups is required, i.e.
the number of levels is log8(NR_CPUS). The magic number of eight
has been established by experimention, but can be adjusted if
needed.

In each group one busy CPU acts as the migrator. It's only one CPU
to avoid lock contention on remote timer wheels.

The migrator CPU checks in its own timer wheel handling whether
there are other CPUs in the group which have gone idle and have
global timers to expire. If there are global timers to expire, the
migrator locks the remote CPU timer wheel and handles the expiry.

Depending on the group level in the hierarchy this handling can
require to walk the hierarchy downwards to the CPU level.

Special care is taken when the last CPU goes idle. At this point
the CPU is the systemwide migrator at the top of the hierarchy and
it therefore cannot delegate to the hierarchy. It needs to arm its
own timer device to expire either at the first expiring timer in
the hierarchy or at the first CPU local timer, which ever expires
first.

This completely removes the overhead from the enqueue path, which
is e.g. for networking a true hotpath and trades it for a slightly
more complex idle path.

This has been in development for a couple of years and the final
series has been extensively tested by various teams from silicon
vendors and ran through extensive CI.

There have been slight performance improvements observed on network
centric workloads and an Intel team confirmed that this allows them
to power down a die completely on a mult-die socket for the first
time in a mostly idle scenario.

There is only one outstanding ~1.5% regression on a specific
overloaded netperf test which is currently investigated, but the
rest is either positive or neutral performance wise and positive on
the power management side.

- Fixes for the timekeeping interpolation code for cross-timestamps:

cross-timestamps are used for PTP to get snapshots from hardware
timers and interpolated them back to clock MONOTONIC. The changes
address a few corner cases in the interpolation code which got the
math and logic wrong.

- Simplifcation of the clocksource watchdog retry logic to
automatically adjust to handle larger systems correctly instead of
having more incomprehensible command line parameters.

- Treewide consolidation of the VDSO data structures.

- The usual small improvements and cleanups all over the place"

* tag 'timers-core-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip: (62 commits)
timer/migration: Fix quick check reporting late expiry
tick/sched: Fix build failure for CONFIG_NO_HZ_COMMON=n
vdso/datapage: Quick fix - use asm/page-def.h for ARM64
timers: Assert no next dyntick timer look-up while CPU is offline
tick: Assume timekeeping is correctly handed over upon last offline idle call
tick: Shut down low-res tick from dying CPU
tick: Split nohz and highres features from nohz_mode
tick: Move individual bit features to debuggable mask accesses
tick: Move got_idle_tick away from common flags
tick: Assume the tick can't be stopped in NOHZ_MODE_INACTIVE mode
tick: Move broadcast cancellation up to CPUHP_AP_TICK_DYING
tick: Move tick cancellation up to CPUHP_AP_TICK_DYING
tick: Start centralizing tick related CPU hotplug operations
tick/sched: Don't clear ts::next_tick again in can_stop_idle_tick()
tick/sched: Rename tick_nohz_stop_sched_tick() to tick_nohz_full_stop_tick()
tick: Use IS_ENABLED() whenever possible
tick/sched: Remove useless oneshot ifdeffery
tick/nohz: Remove duplicate between lowres and highres handlers
tick/nohz: Remove duplicate between tick_nohz_switch_to_nohz() and tick_setup_sched_timer()
hrtimer: Select housekeeping CPU during migration
...

Merge tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip

2024-03-11T21:25:18+00:00

Pull clocksource updates from Thomas Gleixner:
 "Updates for timekeeping and PTP core.

  The cross-timestamp mechanism which allows to correlate hardware
  clocks uses clocksource pointers for describing the correlation.

  That's suboptimal as drivers need to obtain the pointer, which
  requires needless exports and exposing internals. This can all be
  completely avoided by assigning clocksource IDs and using them for
  describing the correlated clock source.

  So this adds clocksource IDs to all clocksources in the tree which can
  be exposed to this mechanism and removes the pointer and now needless
  exports.

  A related improvement for the core and the correlation handling has
  not made it this time, but is expected to get ready for the next
  round"

* tag 'timers-ptp-2024-03-10' of git://git.kernel.org/pub/scm/linux/kernel/git/tip/tip:
  kvmclock: Unexport kvmclock clocksource
  treewide: Remove system_counterval_t.cs, which is never read
  timekeeping: Evaluate system_counterval_t.cs_id instead of .cs
  ptp/kvm, arm_arch_timer: Set system_counterval_t.cs_id to constant
  x86/kvm, ptp/kvm: Add clocksource ID, set system_counterval_t.cs_id
  x86/tsc: Add clocksource ID, set system_counterval_t.cs_id
  timekeeping: Add clocksource ID to struct system_counterval_t
  x86/tsc: Correct kernel-doc notation