linux.git - Linux kernel source tree

diff options

author	Jeff Layton <jlayton@kernel.org>	2026-05-11 07:58:29 -0400
committer	Christian Brauner <brauner@kernel.org>	2026-06-04 10:16:51 +0200
commit	e1bf79628453e6afac81ffa57f4f40f28e5512ff (patch)
tree	99327f17371a81a2f5feb1c8e194978bfd3bccb0 /drivers/platform/wmi/tests/git@git.tavy.me:linux.git
parent	88d6f128d06d492b6d178c8e8c53db8c82305ae1 (diff)

mm: kick writeback flusher for IOCB_DONTCACHE with targeted dirty tracking

The IOCB_DONTCACHE writeback path in generic_write_sync() calls filemap_flush_range() on every write, submitting writeback inline in the writer's context. Perf lock contention profiling shows the performance problem is not lock contention but the writeback submission work itself — walking the page tree and submitting I/O blocks the writer for milliseconds, inflating p99.9 latency from 23ms (buffered) to 93ms (dontcache). Replace the inline filemap_flush_range() call with a flusher kick that drains dirty pages in the background. This moves writeback submission completely off the writer's hot path. To avoid flushing unrelated buffered dirty data, add a dedicated WB_start_dontcache bit and wb_check_start_dontcache() handler that uses the per-wb WB_DONTCACHE_DIRTY counter to determine how many pages to write back. The flusher writes back that many pages from the oldest dirty inodes (not restricted to dontcache-specific inodes). This helps preserve I/O batching while limiting the scope of expedited writeback. Like WB_start_all, the WB_start_dontcache bit coalesces multiple DONTCACHE writes into a single flusher wakeup without per-write allocations. Use test_and_clear_bit to atomically consume the kick request before reading the dirty counter and starting writeback, so that concurrent DONTCACHE writes during writeback can re-set the bit and schedule a follow-up flusher run. Read the dirty counter with wb_stat_sum() (aggregating per-CPU batches) rather than wb_stat() (which reads only the global counter) to ensure small writes below the percpu batch threshold are visible to the flusher. In filemap_dontcache_kick_writeback(), set the WB_start_dontcache bit inside the unlocked_inode_to_wb_begin/end section for correct cgroup writeback domain targeting, but defer the wb_wakeup() call until after the section ends, since wb_wakeup() uses spin_unlock_irq() which would unconditionally re-enable interrupts while the i_pages xa_lock may still be held under irqsave during a cgroup writeback switch. Pin the wb with wb_get() inside the RCU critical section before calling wb_wakeup() outside it, since cgroup bdi_writeback structures are RCU-freed and the wb pointer could become invalid after unlocked_inode_to_wb_end() drops the RCU read lock. Also add WB_REASON_DONTCACHE as a new writeback reason for tracing visibility. dontcache-bench results (same host, T6F_SKL_1920GBF, 251 GiB RAM, xfs on NVMe, fio io_uring): Buffered and direct I/O paths are unaffected by this patchset. All improvements are confined to the dontcache path: Single-stream throughput (MB/s): Before After Change seq-write/dontcache 298 897 +201% rand-write/dontcache 131 236 +80% Tail latency improvements (seq-write/dontcache): p99: 135,266 us -> 23,986 us (-82%) p99.9: 8,925,479 us -> 28,443 us (-99.7%) Multi-writer (4 jobs, sequential write): Before After Change dontcache aggregate (MB/s) 2,529 4,532 +79% dontcache p99 (us) 8,553 1,002 -88% dontcache p99.9 (us) 109,314 1,057 -99% Dontcache multi-writer throughput now matches buffered (4,532 vs 4,616 MB/s). 32-file write (Axboe test): Before After Change dontcache aggregate (MB/s) 1,548 3,499 +126% dontcache p99 (us) 10,170 602 -94% Peak dirty pages (MB) 1,837 213 -88% Dontcache now reaches 81% of buffered throughput (was 35%). Competing writers (dontcache vs buffered, separate files): Before After buffered writer 868 433 MB/s dontcache writer 415 433 MB/s Aggregate 1,284 866 MB/s Previously the buffered writer starved the dontcache writer 2:1. With per-bdi_writeback tracking, both writers now receive equal bandwidth. The aggregate matches the buffered-vs-buffered baseline (863 MB/s), indicating fair sharing regardless of I/O mode. The dontcache writer's p99.9 latency collapsed from 119 ms to 33 ms (-73%), eliminating the severe periodic stalls seen in the baseline. Both writers now share identical latency profiles, matching the buffered-vs-buffered pattern. The per-bdi_writeback dirty tracking dramatically reduces peak dirty pages in dontcache workloads, with the 32-file test dropping from 1.8 GB to 213 MB. Dontcache sequential write throughput triples and multi-writer throughput reaches parity with buffered I/O, with tail latencies collapsing by 1-2 orders of magnitude. Assisted-by: Claude:claude-opus-4-6 Signed-off-by: Jeff Layton <jlayton@kernel.org> Link: https://patch.msgid.link/20260511-dontcache-v7-3-2848ddce8090@kernel.org Reviewed-by: Jan Kara <jack@suse.cz> Reviewed-by: Ritesh Harjani (IBM) <ritesh.list@gmail.com> Signed-off-by: Christian Brauner (Amutable) <brauner@kernel.org>

Diffstat (limited to 'drivers/platform/wmi/tests/git@git.tavy.me:linux.git')

0 files changed, 0 insertions, 0 deletions


context:
space:
mode: