summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
authorChristian Brauner <brauner@kernel.org>2026-05-22 12:06:42 +0200
committerChristian Brauner <brauner@kernel.org>2026-05-22 12:06:42 +0200
commitde5fefadeff3cba9d4df1b0d6fe0518281bbccec (patch)
treeeaf2bf1530d119356ec790fddb0175ac6f601842
parent254f49634ee16a731174d2ae34bc50bd5f45e731 (diff)
parent31c1d19ead2c26a63859a2757d8b786765ba9cdd (diff)
Merge patch series "writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs()"
Baokun Li <libaokun@linux.alibaba.com> says: When a container exits, a race between cgroup_writeback_umount() and inode_switch_wbs() / cleanup_offline_cgwb() can trigger "VFS: Busy inodes after unmount" followed by a use-after-free on percpu counters. There is a window between inode_prepare_wbs_switch() returning true (having passed the SB_ACTIVE check and grabbed the inode) and the subsequent wb_queue_isw() call. If cgroup_writeback_umount() observes the global isw_nr_in_flight counter as non-zero but flush_workqueue() finds nothing queued, it returns early -- leaving a held inode reference that blocks evict_inodes() and a later iput() that hits freed percpu counters. Patch 1 closes the race by extending the RCU read-side critical section to cover the window from inode_prepare_wbs_switch() through wb_queue_isw(), and adding synchronize_rcu() in the umount path so that all in-flight switchers complete queueing before flush_workqueue() runs. rcu_barrier() is intentionally retained so the same hunk applies cleanly to stable trees that still queue switches via queue_rcu_work(). Patch 2 removes the now-dead rcu_barrier() that was left over from the queue_rcu_work() era (replaced by plain queue_work() in commit e1b849cfa6b6 "writeback: Avoid contention on wb->list_lock when switching inodes"). This is mainline-only. Patch 3 replaces the global synchronize_rcu()/flush_workqueue() pair with a per-sb counter (s_isw_nr_in_flight) plus three small helpers (cgroup_writeback_pin / cgroup_writeback_unpin / cgroup_writeback_drain), eliminating the global serialization penalty. This also reverts the RCU extension from patch 1 since the per-sb counter makes it unnecessary. Performance ----------- Measured on a 16 vCPU QEMU guest, all kernels share the same .config. Background load: 4 ext4 superblocks each running while :; do mkdir /sys/fs/cgroup/<tag>-tmp$N ( echo $BASHPID > <tag>-tmp$N/cgroup.procs dd if=/dev/zero of=$mp/burner bs=4k count=256 conv=notrunc \ oflag=sync) rmdir /sys/fs/cgroup/<tag>-tmp$N done This drives both inode_switch_wbs() (different cgroups writing the same inode) and cleanup_offline_cgwb() (dying memcgs), keeping the global isw_nr_in_flight non-zero throughout the run. Latencies are wall-clock around umount(8) on a separate target sb; only the target sb's umount is measured. Four kernels are compared at each step of the series: base pre-fix mainline +race base + patch 1 (race fix, keeps rcu_barrier) +rmbarrier +race + patch 2 (drop rcu_barrier) +persb +rmbarrier + patch 3 (per-sb counter) Target sb runs its own cgwb churn: p50 p95 p99 max base 99.7 ms 112.9 ms 112.9 ms 127.2 ms +race 110.2 ms 153.8 ms 153.8 ms 160.4 ms +rmbarrier 67.6 ms 88.3 ms 88.3 ms 96.8 ms +persb 7.9 ms 10.0 ms 10.0 ms 10.1 ms Idle target umount under cross-sb cgwb-switch pressure: p50 p95 p99 max base 92.0 ms 123.5 ms 136.5 ms 141.3 ms +race 118.8 ms 154.6 ms 164.7 ms 165.3 ms +rmbarrier 62.7 ms 95.4 ms 108.1 ms 108.6 ms +persb 5.3 ms 6.9 ms 7.4 ms 7.4 ms 8 concurrent umounts of idle sbs under the same pressure: p50 p95 p99 max base 137.5 ms 166.9 ms 166.9 ms 171.3 ms +race 162.2 ms 183.9 ms 183.9 ms 217.0 ms +rmbarrier 61.3 ms 99.5 ms 99.5 ms 113.7 ms +persb 8.1 ms 9.1 ms 9.1 ms 9.5 ms A no-pressure baseline run (no background load) measures ~5 ms p50 across all four kernels, validating that the methodology has no systematic bias. In-kernel cgroup_writeback_umount() cumulative cost across the same run (bpftrace, ~340 calls covering all four scenarios): cgroup_writeback_umount() time base 21240 ms total (~62 ms / call) +race (rcu_barrier+sync) 24966 ms total (~73 ms / call) +rmbarrier (synchronize_rcu) 12371 ms total (~36 ms / call) +persb (per-sb counter) 1.37 ms total ( ~4 us / call) Under +persb the wait_var_event() condition is true on entry whenever the target sb has nothing in flight, so synchronize_rcu() and flush_workqueue() are never called on this path. Notes: - Patch 1 adds ~10-27 ms p50 over base by introducing synchronize_rcu(). This is the cost of closing the race correctly and is paid by stable backports as well. - Patch 2 ("drop rcu_barrier()") was expected to be a pure cleanup on mainline, but actually removes a real wait: rcu_barrier() drains call_rcu() callbacks from *all* subsystems, and the cgroup teardown path keeps that pipeline busy under this workload. Removing it cuts ~43-101 ms p50 on top of patch 1. - Patch 3 (per-sb counter) replaces the global wait entirely; the target sb no longer waits for activity on unrelated sbs, recovering near-baseline latency in all three scenarios. * patches from https://patch.msgid.link/20260521095016.2791354-1-libaokun@linux.alibaba.com: writeback: use a per-sb counter to drain inode wb switches at umount writeback: drop now-unnecessary rcu_barrier() in cgroup_writeback_umount() writeback: fix race between cgroup_writeback_umount() and inode_switch_wbs() Link: https://patch.msgid.link/20260521095016.2791354-1-libaokun@linux.alibaba.com Signed-off-by: Christian Brauner <brauner@kernel.org>
-rw-r--r--fs/fs-writeback.c71
-rw-r--r--include/linux/fs/super_types.h8
2 files changed, 53 insertions, 26 deletions
diff --git a/fs/fs-writeback.c b/fs/fs-writeback.c
index a65694cbfe68..900ad7818bd4 100644
--- a/fs/fs-writeback.c
+++ b/fs/fs-writeback.c
@@ -497,6 +497,23 @@ skip_switch:
return switched;
}
+static inline void cgroup_writeback_pin(struct super_block *sb)
+{
+ atomic_inc(&sb->s_isw_nr_in_flight);
+}
+
+static inline void cgroup_writeback_unpin(struct super_block *sb)
+{
+ if (atomic_dec_and_test(&sb->s_isw_nr_in_flight))
+ wake_up_var(&sb->s_isw_nr_in_flight);
+}
+
+static inline void cgroup_writeback_drain(struct super_block *sb)
+{
+ wait_var_event(&sb->s_isw_nr_in_flight,
+ !atomic_read(&sb->s_isw_nr_in_flight));
+}
+
static void process_inode_switch_wbs(struct bdi_writeback *new_wb,
struct inode_switch_wbs_context *isw)
{
@@ -554,8 +571,12 @@ relock:
wb_put_many(old_wb, nr_switched);
}
- for (inodep = isw->inodes; *inodep; inodep++)
+ for (inodep = isw->inodes; *inodep; inodep++) {
+ struct super_block *sb = (*inodep)->i_sb;
+
iput(*inodep);
+ cgroup_writeback_unpin(sb);
+ }
wb_put(new_wb);
kfree(isw);
atomic_dec(&isw_nr_in_flight);
@@ -598,16 +619,19 @@ void inode_switch_wbs_work_fn(struct work_struct *work)
static bool inode_prepare_wbs_switch(struct inode *inode,
struct bdi_writeback *new_wb)
{
+ /* Avoid the atomic_inc/smp_mb dance once SB_ACTIVE is gone. */
+ if (!(inode->i_sb->s_flags & SB_ACTIVE))
+ return false;
+
/*
- * Paired with smp_mb() in cgroup_writeback_umount().
- * isw_nr_in_flight must be increased before checking SB_ACTIVE and
- * grabbing an inode, otherwise isw_nr_in_flight can be observed as 0
- * in cgroup_writeback_umount() and the isw_wq will be not flushed.
+ * Pairs with smp_mb() in cgroup_writeback_umount(): the umounter either
+ * sees a non-zero counter and waits, or we see SB_ACTIVE clear below.
*/
+ cgroup_writeback_pin(inode->i_sb);
smp_mb();
if (IS_DAX(inode))
- return false;
+ goto out_unpin;
/* while holding I_WB_SWITCH, no one else can update the association */
spin_lock(&inode->i_lock);
@@ -615,13 +639,17 @@ static bool inode_prepare_wbs_switch(struct inode *inode,
inode_state_read(inode) & (I_WB_SWITCH | I_FREEING | I_WILL_FREE) ||
inode_to_wb(inode) == new_wb) {
spin_unlock(&inode->i_lock);
- return false;
+ goto out_unpin;
}
inode_state_set(inode, I_WB_SWITCH);
__iget(inode);
spin_unlock(&inode->i_lock);
return true;
+
+out_unpin:
+ cgroup_writeback_unpin(inode->i_sb);
+ return false;
}
static void wb_queue_isw(struct bdi_writeback *wb,
@@ -1198,36 +1226,27 @@ out_bdi_put:
}
/**
- * cgroup_writeback_umount - flush inode wb switches for umount
+ * cgroup_writeback_umount - wait for in-flight inode wb switches on @sb
* @sb: target super_block
*
- * This function is called when a super_block is about to be destroyed and
- * flushes in-flight inode wb switches. An inode wb switch goes through
- * RCU and then workqueue, so the two need to be flushed in order to ensure
- * that all previously scheduled switches are finished. As wb switches are
- * rare occurrences and synchronize_rcu() can take a while, perform
- * flushing iff wb switches are in flight.
+ * Wait until every inode wb switch that already passed the SB_ACTIVE
+ * check on this superblock has been completed by the worker. Since
+ * SB_ACTIVE is cleared before this is called, no new switches can start
+ * for @sb, so s_isw_nr_in_flight will monotonically drop to zero.
*/
void cgroup_writeback_umount(struct super_block *sb)
{
-
if (!(sb->s_bdi->capabilities & BDI_CAP_WRITEBACK))
return;
/*
- * SB_ACTIVE should be reliably cleared before checking
- * isw_nr_in_flight, see generic_shutdown_super().
+ * Pairs with smp_mb() in inode_prepare_wbs_switch(): we either observe
+ * a non-zero counter and wait, or the switcher sees SB_ACTIVE clear
+ * (cleared by generic_shutdown_super()) and bails before grabbing the
+ * inode.
*/
smp_mb();
-
- if (atomic_read(&isw_nr_in_flight)) {
- /*
- * Use rcu_barrier() to wait for all pending callbacks to
- * ensure that all in-flight wb switches are in the workqueue.
- */
- rcu_barrier();
- flush_workqueue(isw_wq);
- }
+ cgroup_writeback_drain(sb);
}
static int __init cgroup_writeback_init(void)
diff --git a/include/linux/fs/super_types.h b/include/linux/fs/super_types.h
index 383050e7fdf5..1ab4e2265129 100644
--- a/include/linux/fs/super_types.h
+++ b/include/linux/fs/super_types.h
@@ -274,6 +274,14 @@ struct super_block {
/* number of fserrors that are being sent to fsnotify/filesystems */
refcount_t s_pending_errors;
+
+#ifdef CONFIG_CGROUP_WRITEBACK
+ /*
+ * Number of in-flight inode wb switches for this sb. Drained by
+ * cgroup_writeback_umount() before tear-down.
+ */
+ atomic_t s_isw_nr_in_flight;
+#endif
} __randomize_layout;
/*