linux.git/fs/aio.c, branch v3.19

aio: annotate aio_read_event_ring for sleep patterns

2015-02-04T00:29:05+00:00

Under CONFIG_DEBUG_ATOMIC_SLEEP=y, aio_read_event_ring() will throw
warnings like the following due to being called from wait_event
context:

 WARNING: CPU: 0 PID: 16006 at kernel/sched/core.c:7300 __might_sleep+0x7f/0x90()
 do not call blocking ops when !TASK_RUNNING; state=1 set at [] prepare_to_wait_event+0x63/0x110
 Modules linked in:
 CPU: 0 PID: 16006 Comm: aio-dio-fcntl-r Not tainted 3.19.0-rc6-dgc+ #705
 Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
  ffffffff821c0372 ffff88003c117cd8 ffffffff81daf2bd 000000000000d8d8
  ffff88003c117d28 ffff88003c117d18 ffffffff8109beda ffff88003c117cf8
  ffffffff821c115e 0000000000000061 0000000000000000 00007ffffe4aa300
 Call Trace:
  [] dump_stack+0x4c/0x65
  [] warn_slowpath_common+0x8a/0xc0
  [] warn_slowpath_fmt+0x46/0x50
  [] ? prepare_to_wait_event+0x63/0x110
  [] ? prepare_to_wait_event+0x63/0x110
  [] __might_sleep+0x7f/0x90
  [] mutex_lock+0x24/0x45
  [] aio_read_events+0x4c/0x290
  [] read_events+0x1ec/0x220
  [] ? prepare_to_wait_event+0x110/0x110
  [] ? hrtimer_get_res+0x50/0x50
  [] SyS_io_getevents+0x4d/0xb0
  [] system_call_fastpath+0x12/0x17
 ---[ end trace bde69eaf655a4fea ]---

There is not actually a bug here, so annotate the code to tell the
debug logic that everything is just fine and not to fire a false
positive.

Signed-off-by: Dave Chinner 
Signed-off-by: Benjamin LaHaise

aio: Skip timer for io_getevents if timeout=0

2014-12-13T22:50:20+00:00

In this case, the call is basically polling. Let's not involve the timer
at all, because that would hurt performance for application event loops.

In an arbitrary test I've done, the elapsed time of the io_getevents
syscall drops from 50000+ nanoseconds to a few hundred.
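
For illustration, a user-space caller that polls this way could look like
the sketch below. It uses the raw syscalls (glibc does not wrap them); the
helper names aio_setup(), aio_destroy() and aio_poll_events() are
hypothetical and not part of this patch. Passing a zeroed timespec is what
selects the timeout=0 case described above.

```c
#define _GNU_SOURCE
#include <linux/aio_abi.h>   /* aio_context_t, struct io_event */
#include <sys/syscall.h>     /* SYS_io_setup, SYS_io_getevents, SYS_io_destroy */
#include <time.h>            /* struct timespec */
#include <unistd.h>          /* syscall() */

/* Hypothetical thin wrappers around the raw AIO syscalls. */
static long aio_setup(unsigned nr_events, aio_context_t *ctx)
{
    /* *ctx must be zero on entry, otherwise the kernel rejects the call. */
    return syscall(SYS_io_setup, nr_events, ctx);
}

static long aio_destroy(aio_context_t ctx)
{
    return syscall(SYS_io_destroy, ctx);
}

/*
 * Hypothetical helper: reap up to max_events completions without blocking.
 * min_nr is 1, so only the zeroed timeout keeps the call from waiting.
 * Returns the number of events copied into events[], or -1 with errno set.
 */
static long aio_poll_events(aio_context_t ctx, struct io_event *events,
                            long max_events)
{
    struct timespec zero = { 0, 0 };

    return syscall(SYS_io_getevents, ctx, 1L, max_events, events, &zero);
}
```

An event loop would call aio_poll_events() once per iteration and go on to
other work when it returns 0.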

Signed-off-by: Fam Zheng 
Signed-off-by: Benjamin LaHaise

aio: Make it possible to remap aio ring

2014-12-13T22:49:50+00:00

There are actually two issues this patch addresses. Let me start with
the one I tried to solve in the beginning.

So, in the checkpoint-restore project (criu) we try to dump a task's
state and restore it exactly as it was. One piece of that state is the
set of rings set up with io_setup() calls. There are (almost) no problems
dumping them; the problem is restoring them. If I dump a task
with an aio ring originally mapped at address A, I want to restore it
at exactly the same address A. Unfortunately, io_setup() does
not allow for that: it mmaps the ring at whatever place mm finds
appropriate (it calls do_mmap_pgoff() with a zero address and without
the MAP_FIXED flag).

To make restore possible I'm going to mremap() the freshly created ring
to address A (where it was seen before the dump). The problem is
that the ring's virtual address is passed back to user-space as the
context ID, and this ID is then used as the search key by all the other
io_foo() calls. Reworking this ID to be just some integer doesn't seem to
work, as libaio already uses this value as a pointer through which it
accesses memory for aio meta-data.

So, to make restore work we need to make sure that

a) ring is mapped at desired virtual address
b) kioctx->user_id matches this value

Having said that, the patch makes mremap() on aio region update the
kioctx's user_id and mmap_base values.

This brings up the 2nd issue I mentioned at the beginning of this mail.
If (regardless of the C/R dances I do) someone creates an io context
with io_setup(), then mremap()-s the ring and then destroys the context,
the kill_ioctx() routine will call munmap() on the wrong (old) address.
This will result in a) the aio ring remaining in memory and b) some other
vma getting unexpectedly unmapped.
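
Both issues can be pictured with a toy user-space model. This is only a
sketch: struct toy_kioctx and the toy_* functions are hypothetical
stand-ins, not the kernel code, and they keep only the user_id and
mmap_base values discussed above.

```c
/* Toy stand-in for the kernel's kioctx; only the fields discussed here. */
struct toy_kioctx {
    unsigned long user_id;    /* context ID handed back by io_setup() */
    unsigned long mmap_base;  /* where the ring is currently mapped */
    unsigned long mmap_size;  /* length of the ring mapping */
};

/*
 * Models what mremap() on the aio region should do: if the mapping that
 * moved is this context's ring, update both values together.
 * Returns 1 if the context was updated, 0 if the mapping was not its ring.
 */
static int toy_aio_ring_remap(struct toy_kioctx *ctx,
                              unsigned long old_addr, unsigned long new_addr)
{
    if (ctx->mmap_base != old_addr)
        return 0;

    ctx->user_id = new_addr;   /* lookups by context ID keep working */
    ctx->mmap_base = new_addr; /* teardown unmaps the right range */
    return 1;
}

/* Start address that context teardown would hand to munmap(). */
static unsigned long toy_teardown_unmap_addr(const struct toy_kioctx *ctx)
{
    return ctx->mmap_base;
}
```

Without the update in toy_aio_ring_remap(), toy_teardown_unmap_addr() would
still return the old address after the ring had moved, which is the second
issue.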

What do you think?

Signed-off-by: Pavel Emelyanov 
Acked-by: Dmitry Monakhov 
Signed-off-by: Benjamin LaHaise

Merge git://git.kvack.org/~bcrl/aio-fixes

2014-11-26T02:55:44+00:00

Pull aio fix from Ben LaHaise:
 "Dirty page accounting fix for aio"

* git://git.kvack.org/~bcrl/aio-fixes:
  aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer

aio: fix uncorrent dirty pages accouting when truncating AIO ring buffer

2014-11-06T19:27:19+00:00

https://bugzilla.kernel.org/show_bug.cgi?id=86831

Markus reported that shutting down mysqld (with AIO support, on an
ext3-formatted hard drive) leads to a negative number of dirty pages
(the counter underruns). The negative number drastically reduces
write performance: the page cache is not used because the kernel
thinks there are still 2 ^ 32 dirty pages outstanding.

Adding a WARN trace to __dec_zone_state catches this easily:

static inline void __dec_zone_state(struct zone *zone,
				    enum zone_stat_item item)
{
     atomic_long_dec(&zone->vm_stat[item]);
+    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
+		 atomic_long_read(&zone->vm_stat[item]) < 0);
     atomic_long_dec(&vm_stat[item]);
}

[   21.341632] ------------[ cut here ]------------
[   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242 cancel_dirty_page+0x164/0x224()
[   21.355296] Modules linked in: wutbox_cp sata_mv
[   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
[   21.366793] Workqueue: events free_ioctx
[   21.370760] [] (unwind_backtrace) from [] (show_stack+0x20/0x24)
[   21.378562] [] (show_stack) from [] (dump_stack+0x24/0x28)
[   21.385840] [] (dump_stack) from [] (warn_slowpath_common+0x84/0x9c)
[   21.393976] [] (warn_slowpath_common) from [] (warn_slowpath_null+0x2c/0x34)
[   21.402800] [] (warn_slowpath_null) from [] (cancel_dirty_page+0x164/0x224)
[   21.411524] [] (cancel_dirty_page) from [] (truncate_inode_page+0x8c/0x158)
[   21.420272] [] (truncate_inode_page) from [] (truncate_inode_pages_range+0x11c/0x53c)
[   21.429890] [] (truncate_inode_pages_range) from [] (truncate_pagecache+0x88/0xac)
[   21.439252] [] (truncate_pagecache) from [] (truncate_setsize+0x5c/0x74)
[   21.447731] [] (truncate_setsize) from [] (put_aio_ring_file.isra.14+0x34/0x90)
[   21.456826] [] (put_aio_ring_file.isra.14) from [] (aio_free_ring+0x20/0xcc)
[   21.465660] [] (aio_free_ring) from [] (free_ioctx+0x24/0x44)
[   21.473190] [] (free_ioctx) from [] (process_one_work+0x134/0x47c)
[   21.481132] [] (process_one_work) from [] (worker_thread+0x130/0x414)
[   21.489350] [] (worker_thread) from [] (kthread+0xd4/0xec)
[   21.496621] [] (kthread) from [] (ret_from_fork+0x14/0x20)
[   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---

The cause is that we mark the aio ring file pages as *DIRTY* via SetPageDirty
at init time (which bypasses the VFS dirty page increment), while aio fs uses
*default_backing_dev_info* as its backing dev, which does not disable
the dirty page accounting capability.
So truncating the aio ring file goes through dirty page accounting (the VFS
dirty page decrement) without a matching increment, and the counter underruns.
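
The imbalance can be reproduced in a toy user-space model. This is a sketch
only: the toy_* types and functions are hypothetical and merely mimic the
flag-versus-counter relationship described above, not the real VFS code.

```c
#include <stdint.h>

struct toy_page { int dirty; };           /* stands in for the page dirty flag */
struct toy_zone { long nr_file_dirty; };  /* stands in for the dirty counter */

/* Like a bare SetPageDirty(): sets the flag, leaves the counter alone. */
static void toy_set_flag_only(struct toy_page *p)
{
    p->dirty = 1;
}

/* Accounted path: flag and counter move together. */
static void toy_set_dirty_accounted(struct toy_zone *z, struct toy_page *p)
{
    if (!p->dirty) {
        p->dirty = 1;
        z->nr_file_dirty++;
    }
}

/* Truncate-time cancel: a dirty page always costs one decrement. */
static void toy_cancel_dirty(struct toy_zone *z, struct toy_page *p)
{
    if (p->dirty) {
        p->dirty = 0;
        z->nr_file_dirty--;
    }
}

/* How a negative count looks when read as an unsigned 32-bit value. */
static uint32_t toy_as_u32(long v)
{
    return (uint32_t)v;
}
```

Pairing toy_set_flag_only() with toy_cancel_dirty() drives the counter to
-1, which reads as a value just below 2 ^ 32 when treated as unsigned;
pairing toy_set_dirty_accounted() with toy_cancel_dirty() leaves it at 0.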

The original goal was to keep these pages in memory (so they cannot be
reclaimed or swapped) for their whole lifetime by marking them dirty. On
reflection, we have already pinned the pages by elevating their refcount,
which already achieves that goal, so the SetPageDirty seems unnecessary.

To fix the issue, use __set_page_dirty_no_writeback instead
of the nop .set_page_dirty, and drop the SetPageDirty (don't manually
set the dirty flags, don't disable set_page_dirty(), rely on default behaviour).

With the above change, dirty page accounting works correctly. But as we
know, aio fs is an anonymous one that should never cause any real write-back,
so we can skip dirty page (write-back) accounting entirely by disabling that
capability. So we introduce an aio-private backing dev info (with the
ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities disabled) to replace the default one.

Reported-by: Markus Königshaus 
Signed-off-by: Gu Zheng 
Cc: stable 
Acked-by: Andrew Morton 
Signed-off-by: Benjamin LaHaise

percpu_ref: add PERCPU_REF_INIT_* flags

2014-09-24T17:31:50+00:00

With the recent addition of percpu_ref_reinit(), percpu_ref now can be
used as a persistent switch which can be turned on and off repeatedly
where turning off maps to killing the ref and waiting for it to drain;
however, there currently isn't a way to initialize a percpu_ref in its
off (killed and drained) state, which can be inconvenient for certain
persistent switch use cases.

Similarly, percpu_ref_switch_to_atomic/percpu() allow dynamic
selection of operation mode; however, currently a newly initialized
percpu_ref is always in percpu mode making it impossible to avoid the
latency overhead of switching to atomic mode.

This patch adds @flags to percpu_ref_init() and implements the
following flags.

* PERCPU_REF_INIT_ATOMIC	: start ref in atomic mode
* PERCPU_REF_INIT_DEAD		: start ref killed and drained

These flags should be able to serve the above two use cases.
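
As a rough picture of the two start-up states, here is a toy user-space
model. It is a sketch under assumptions: the TOY_* flag values and struct
toy_ref are hypothetical, and treating a dead ref as also being in atomic
mode is an assumption of this model rather than something stated above.

```c
/* Hypothetical flag values, for this toy model only. */
enum toy_ref_flags {
    TOY_REF_INIT_ATOMIC = 1 << 0,  /* start ref in atomic mode */
    TOY_REF_INIT_DEAD   = 1 << 1   /* start ref killed and drained */
};

struct toy_ref {
    int percpu_mode;  /* 1: per-CPU counting, 0: single atomic counter */
    int dead;         /* 1: killed and drained, i.e. the switch is off */
};

static void toy_ref_init(struct toy_ref *ref, unsigned int flags)
{
    /*
     * Assumption of this model: a ref that starts dead has no use for
     * per-CPU counting, so DEAD selects atomic mode as well.
     */
    ref->percpu_mode = !(flags & (TOY_REF_INIT_ATOMIC | TOY_REF_INIT_DEAD));
    ref->dead = (flags & TOY_REF_INIT_DEAD) != 0;
}
```

A caller that wants a switch that starts off passes TOY_REF_INIT_DEAD; a
caller that wants to avoid a later mode switch passes TOY_REF_INIT_ATOMIC.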

v2: target_core_tpg.c conversion was missing.  Fixed.

Signed-off-by: Tejun Heo 
Reviewed-by: Kent Overstreet 
Cc: Jens Axboe 
Cc: Christoph Hellwig 
Cc: Johannes Weiner

Merge branch 'for-linus' of git://git.kernel.org/pub/scm/linux/kernel/git/axboe/linux-block into for-3.18

2014-09-24T17:00:21+00:00

This is to receive 0a30288da1ae ("blk-mq, percpu_ref: implement a
kludge for SCSI blk-mq stall during probe") which implements
__percpu_ref_kill_expedited() to work around SCSI blk-mq stall.  The
commit will be reverted and patches implementing a proper fix will be added.

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
Cc: Jens Axboe 
Cc: Christoph Hellwig

percpu-refcount: add @gfp to percpu_ref_init()

2014-09-08T00:51:30+00:00

The percpu allocator now supports allocation masks.  Add @gfp to
percpu_ref_init() so that !GFP_KERNEL allocation masks can be used
with percpu_refs too.

This patch doesn't make any functional difference.

v2: blk-mq conversion was missing.  Updated.

Signed-off-by: Tejun Heo 
Cc: Kent Overstreet 
Cc: Benjamin LaHaise 
Cc: Li Zefan 
Cc: Nicholas A. Bellinger 
Cc: Jens Axboe

aio: block exit_aio() until all context requests are completed

2014-09-04T20:54:47+00:00

It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but the current implementation misses the wait step, so fix
it in the same way as we did in io_destroy.

Signed-off-by: Gu Zheng 
Signed-off-by: Benjamin LaHaise 
Cc: stable@vger.kernel.org

aio: add missing smp_rmb() in read_events_ring

2014-09-02T19:20:03+00:00

We ran into a case on ppc64 running mariadb where io_getevents would
return zeroed out I/O events.  After adding instrumentation, it became
clear that there was some missing synchronization between reading the
tail pointer and the events themselves.  This small patch fixes the
problem in testing.
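
The ordering requirement can be sketched in portable C11 as a
single-producer, single-consumer ring. This is a user-space analogue with
hypothetical names (struct toy_ring, toy_ring_put(), toy_ring_get()), not
the kernel code; the acquire fence in the consumer plays the role of the
read barrier between loading the tail and loading the events.

```c
#include <stdatomic.h>

#define TOY_RING_SIZE 8u

struct toy_event { long data; long res; };

struct toy_ring {
    _Atomic unsigned head;  /* advanced by the consumer */
    _Atomic unsigned tail;  /* advanced by the producer */
    struct toy_event events[TOY_RING_SIZE];
};

static void toy_ring_init(struct toy_ring *r)
{
    atomic_init(&r->head, 0u);
    atomic_init(&r->tail, 0u);
}

/* Producer: fill the slot first, then publish the new tail. */
static int toy_ring_put(struct toy_ring *r, struct toy_event ev)
{
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
    unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);

    if (tail - head == TOY_RING_SIZE)
        return 0;  /* ring full */

    r->events[tail % TOY_RING_SIZE] = ev;
    atomic_thread_fence(memory_order_release);  /* event before tail */
    atomic_store_explicit(&r->tail, tail + 1, memory_order_relaxed);
    return 1;
}

/* Consumer: read the tail, fence, and only then read the event. */
static int toy_ring_get(struct toy_ring *r, struct toy_event *out)
{
    unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
    unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);

    if (head == tail)
        return 0;  /* nothing to reap */

    /* Without this fence the event load may be satisfied before the
     * tail load, so a stale (for example zeroed) slot could be seen. */
    atomic_thread_fence(memory_order_acquire);
    *out = r->events[head % TOY_RING_SIZE];
    atomic_store_explicit(&r->head, head + 1, memory_order_release);
    return 1;
}
```

On strongly ordered machines the missing fence tends to go unnoticed, which
fits the problem showing up on ppc64.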

Thanks to Zach for helping to look into this, and suggesting the fix.

Signed-off-by: Jeff Moyer 
Signed-off-by: Benjamin LaHaise 
Cc: stable@vger.kernel.org