linux-stable.git/fs/aio.c, branch linux-3.12.y

AIO: properly check iovec sizes

2016-02-24T09:23:18+00:00

In Linus's tree, the iovec code has been reworked massively, but in
older kernels the AIO layer should be checking this before passing the
request on to other layers.
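
To illustrate the kind of check meant here, the following is a minimal user-space sketch, not the code of this patch. The names sketch_clamp_len(), sketch_total_len() and the SKETCH_MAX_RW cap are hypothetical and exist only for this example; the cap value (INT_MAX rounded down to a 4 KiB boundary) is an assumption made for the sketch.

```c
#include <assert.h>
#include <limits.h>
#include <stddef.h>
#include <sys/uio.h>

/* Hypothetical cap for this sketch: INT_MAX rounded down to 4 KiB. */
#define SKETCH_MAX_RW ((size_t)(INT_MAX & ~4095))

/* Clamp one length so later layers never see an oversized request. */
static size_t sketch_clamp_len(size_t len)
{
	return len > SKETCH_MAX_RW ? SKETCH_MAX_RW : len;
}

/*
 * Sum the segment lengths. Returns -1 if the total would exceed the cap.
 * Each addition is checked before it happens, so the sum cannot wrap.
 */
static long sketch_total_len(const struct iovec *iov, size_t nr_segs)
{
	size_t total = 0;
	size_t i;

	assert(iov != NULL || nr_segs == 0);
	for (i = 0; i < nr_segs; i++) {
		if (iov[i].iov_len > SKETCH_MAX_RW - total)
			return -1;
		total += iov[i].iov_len;
	}
	return (long)total;
}
```

The point of the sketch is only that the size is validated once, at the entry point, before the request is handed to any other layer.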

Many thanks to Ben Hawkes of Google Project Zero for pointing out the
issue.

Reported-by: Ben Hawkes 
Acked-by: Benjamin LaHaise 
Tested-by: Willy Tarreau 
Signed-off-by: Jiri Slaby

aio: fix reqs_available handling

2015-09-02T16:20:16+00:00

commit d856f32a86b2b015ab180ab7a55e455ed8d3ccc5 upstream.

As reported by Dan Aloni, commit f8567a3845ac ("aio: fix aio request
leak when events are reaped by userspace") introduces a regression when
user code attempts to perform io_submit() with more events than are
available in the ring buffer.  Reverting that commit would reintroduce a
regression when user space event reaping is used.

Fixing this bug is a bit more involved than the previous attempts to fix
this regression.  Since we do not have a single point at which we can
count events as being reaped by user space and io_getevents(), we have
to track event completion by looking at the number of events left in the
event ring.  So long as there are as many events in the ring buffer as
there have been completion events generated, we cannot call
put_reqs_available().  The code to check for this is now placed in
refill_reqs_available().
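
The accounting described above can be modelled in user space roughly as follows. This is a simplified sketch under stated assumptions, not the kernel function: struct ring_model and model_refill() are hypothetical names, the tail is assumed to be already within the ring, and the head is reduced modulo the ring size on the assumption that user space may have written an arbitrary value to it.

```c
#include <assert.h>

/* Hypothetical, simplified stand-in for the per-context state. */
struct ring_model {
	unsigned nr_events;        /* ring capacity */
	unsigned completed_events; /* completions not yet returned to the pool */
	unsigned reqs_available;   /* slots usable by new submissions */
};

/* Number of events still sitting in the ring between head and tail. */
static unsigned model_events_in_ring(const struct ring_model *r,
				     unsigned head, unsigned tail)
{
	assert(r->nr_events > 0);
	head %= r->nr_events;	/* assumption: head may be user-modified */
	return head <= tail ? tail - head : r->nr_events - (head - tail);
}

/*
 * Return slots to the pool only for completions that are no longer in the
 * ring, i.e. completed_events in excess of what is still unreaped.
 * Returns the number of slots given back.
 */
static unsigned model_refill(struct ring_model *r, unsigned head, unsigned tail)
{
	unsigned in_ring = model_events_in_ring(r, head, tail);
	unsigned reaped = r->completed_events > in_ring ?
			  r->completed_events - in_ring : 0;

	r->completed_events -= reaped;
	r->reqs_available += reaped;
	return reaped;
}
```

For example, with 5 completions recorded and 2 events still in the ring, the model hands back 3 slots; with as many events in the ring as completions, it hands back none.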

A test program from Dan, modified by me, for verifying this bug is available
at http://www.kvack.org/~bcrl/20140824-aio_bug.c .

Reported-by: Dan Aloni 
Signed-off-by: Benjamin LaHaise 
Acked-by: Dan Aloni 
Cc: Kent Overstreet 
Cc: Mateusz Guzik 
Cc: Petr Matousek 
Signed-off-by: Linus Torvalds 
Signed-off-by: Jiri Slaby

aio: fix serial draining in exit_aio()

2015-05-26T12:33:45+00:00

commit dc48e56d761610da4ea1088d1bea0a030b8e3e43 upstream.

exit_aio() currently serializes killing io contexts. Each context
killing ends up having to do percpu_ref_kill(), which in turn has
to wait for an RCU grace period. This can take a long time, depending
on the number of contexts. And there's no point in doing them serially,
when we could be waiting for all of them in one fell swoop.

This patch makes my fio thread offload test case exit in 0.2s instead
of almost 6s.

Reviewed-by: Jeff Moyer 
Signed-off-by: Jens Axboe 
Signed-off-by: Jiri Slaby

aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()

2015-05-26T12:33:44+00:00

commit 4b70ac5fd9b58bfaa5f25b4ea48f528aefbf3308 upstream.

On 04/30, Benjamin LaHaise wrote:
>
> > -		ctx->mmap_size = 0;
> > -
> > -		kill_ioctx(mm, ctx, NULL);
> > +		if (ctx) {
> > +			ctx->mmap_size = 0;
> > +			kill_ioctx(mm, ctx, NULL);
> > +		}
>
> Rather than indenting and moving the two lines changing mmap_size and the
> kill_ioctx() call, why not just do "if (!ctx) ... continue;"?  That reduces
> the number of lines changed and avoids excessive indentation.

OK. To me the code looks better/simpler with "if (ctx)", but this is subjective
of course, I won't argue.

The patch still removes the empty line between mmap_size = 0 and kill_ioctx(),
since we reset mmap_size only for kill_ioctx(). But feel free to remove this change.

-------------------------------------------------------------------------------
Subject: [PATCH v3 1/2] aio: change exit_aio() to load mm->ioctx_table once and avoid rcu_read_lock()

1. We can read ->ioctx_table only once and we do not need rcu_read_lock()
   or even rcu_dereference().

   This mm has no users, nobody else can play with ->ioctx_table. Otherwise
   the code is buggy anyway, if we need rcu_read_lock() in a loop because
   ->ioctx_table can be updated then kfree(table) is obviously wrong.

2. Update the comment. "exit_mmap(mm) is coming" is a good reason to avoid
   munmap(), but another reason is that we simply can't do vm_munmap() unless
   current->mm == mm, and this is not true in general: the caller is mmput().

3. We do not really need to nullify mm->ioctx_table before return, probably
   the current code does this to catch the potential problems. But in this
   case RCU_INIT_POINTER(NULL) looks better.

Signed-off-by: Oleg Nesterov 
Signed-off-by: Benjamin LaHaise 
Signed-off-by: Jiri Slaby

ioctx_alloc(): fix vma (and file) leak on failure

2015-04-22T06:58:45+00:00

commit deeb8525f9bcea60f5e86521880c1161de7a5829 upstream.

If we fail past the aio_setup_ring(), we need to destroy the
mapping.  We don't need to care about anybody having found ctx,
or added requests to it, since the last failure exit is exactly
the failure to make ctx visible to lookups.

Reproducer (based on one by Joe Mario):

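/* Needs the libaio headers; link with -laio. */
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <libaio.h>

void count(char *p)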
{
	char s[80];
	printf("%s: ", p);
	fflush(stdout);
	sprintf(s, "/bin/cat /proc/%d/maps|/bin/fgrep -c '/[aio] (deleted)'", getpid());
	system(s);
}

int main()
{
	io_context_t *ctx;
	int created, limit, i, destroyed;
	FILE *f;

	count("before");
	if ((f = fopen("/proc/sys/fs/aio-max-nr", "r")) == NULL)
		perror("opening aio-max-nr");
	else if (fscanf(f, "%d", &limit) != 1)
		fprintf(stderr, "can't parse aio-max-nr\n");
	else if ((ctx = calloc(limit, sizeof(io_context_t))) == NULL)
		perror("allocating aio_context_t array");
	else {
		for (i = 0, created = 0; i < limit; i++) {
			if (io_setup(1000, ctx + created) == 0)
				created++;
		}
		for (i = 0, destroyed = 0; i < created; i++)
			if (io_destroy(ctx[i]) == 0)
				destroyed++;
		printf("created %d, failed %d, destroyed %d\n",
			created, limit - created, destroyed);
		count("after");
	}
}

Found-by: Joe Mario 
Signed-off-by: Al Viro 
Signed-off-by: Jiri Slaby

aio: fix incorrect dirty pages accounting when truncating AIO ring buffer

2014-12-06T14:18:19+00:00

commit 835f252c6debd204fcd607c79975089b1ecd3472 upstream.

https://bugzilla.kernel.org/show_bug.cgi?id=86831

Markus reported that shutting down mysqld (with AIO support, on an
ext3-formatted hard drive) leads to a negative number of dirty pages
(a counter underrun). The negative number drastically reduces write
performance: the page cache is not used because the kernel thinks there
are still 2 ^ 32 dirty pages outstanding.
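
As a rough illustration of where a figure like 2 ^ 32 can come from, the sketch below decrements a signed 32-bit counter past zero and reads it back as an unsigned value. This is only an arithmetic demonstration under the assumption of a 32-bit counter; sketch_dec_and_read_unsigned() is a hypothetical name and the kernel's vmstat code is not modelled here.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Decrement a signed counter and return the value a reader would see
 * if it interpreted the same bits as an unsigned 32-bit number.
 */
static uint32_t sketch_dec_and_read_unsigned(int32_t *counter)
{
	assert(counter);
	*counter -= 1;
	return (uint32_t)*counter;
}
```

Starting from zero, one extra decrement leaves the counter at -1, which reads as 4294967295 (2 ^ 32 - 1) when treated as unsigned.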

Adding a WARN trace to __dec_zone_state catches this easily:

static inline void __dec_zone_state(struct zone *zone,
				    enum zone_stat_item item)
{
     atomic_long_dec(&zone->vm_stat[item]);
+    WARN_ON_ONCE(item == NR_FILE_DIRTY &&
+		 atomic_long_read(&zone->vm_stat[item]) < 0);
     atomic_long_dec(&vm_stat[item]);
}

[   21.341632] ------------[ cut here ]------------
[   21.346294] WARNING: CPU: 0 PID: 309 at include/linux/vmstat.h:242
cancel_dirty_page+0x164/0x224()
[   21.355296] Modules linked in: wutbox_cp sata_mv
[   21.359968] CPU: 0 PID: 309 Comm: kworker/0:1 Not tainted 3.14.21-WuT #80
[   21.366793] Workqueue: events free_ioctx
[   21.370760] [] (unwind_backtrace) from []
(show_stack+0x20/0x24)
[   21.378562] [] (show_stack) from []
(dump_stack+0x24/0x28)
[   21.385840] [] (dump_stack) from []
(warn_slowpath_common+0x84/0x9c)
[   21.393976] [] (warn_slowpath_common) from []
(warn_slowpath_null+0x2c/0x34)
[   21.402800] [] (warn_slowpath_null) from []
(cancel_dirty_page+0x164/0x224)
[   21.411524] [] (cancel_dirty_page) from []
(truncate_inode_page+0x8c/0x158)
[   21.420272] [] (truncate_inode_page) from []
(truncate_inode_pages_range+0x11c/0x53c)
[   21.429890] [] (truncate_inode_pages_range) from
[] (truncate_pagecache+0x88/0xac)
[   21.439252] [] (truncate_pagecache) from []
(truncate_setsize+0x5c/0x74)
[   21.447731] [] (truncate_setsize) from []
(put_aio_ring_file.isra.14+0x34/0x90)
[   21.456826] [] (put_aio_ring_file.isra.14) from
[] (aio_free_ring+0x20/0xcc)
[   21.465660] [] (aio_free_ring) from []
(free_ioctx+0x24/0x44)
[   21.473190] [] (free_ioctx) from []
(process_one_work+0x134/0x47c)
[   21.481132] [] (process_one_work) from []
(worker_thread+0x130/0x414)
[   21.489350] [] (worker_thread) from []
(kthread+0xd4/0xec)
[   21.496621] [] (kthread) from []
(ret_from_fork+0x14/0x20)
[   21.503884] ---[ end trace 79c4bf42c038c9a1 ]---

The cause is that at init time we mark the aio ring file pages as *DIRTY*
via SetPageDirty (which bypasses the VFS dirty pages increment), while the
aio fs uses *default_backing_dev_info* as its backing dev, which does not
disable the dirty pages accounting capability.
So truncating the aio ring file does go through dirty pages accounting (the
VFS dirty pages decrement), and the counter underflows.

The original goal was to keep these pages in memory (so that they cannot be
reclaimed or swapped) for their whole lifetime by marking them dirty. But on
reflection, we have already pinned the pages by elevating their refcount,
which already achieves the goal, so the SetPageDirty seems unnecessary.

To fix the issue, use __set_page_dirty_no_writeback instead of the nop
.set_page_dirty, and drop the SetPageDirty (don't manually set the dirty
flags, don't disable set_page_dirty(), rely on default behaviour).

With the above change, dirty pages accounting works correctly. But as we
know, the aio fs is an anonymous one, which should never cause any real
write-back, so we can ignore the dirty pages (write-back) accounting by
disabling the dirty pages (write-back) accounting capability. So we
introduce an aio-private backing dev info (with the
ACCT_DIRTY/WRITEBACK/ACCT_WB capabilities disabled) to replace the default
one.

Reported-by: Markus Königshaus 
Signed-off-by: Gu Zheng 
Acked-by: Andrew Morton 
Signed-off-by: Benjamin LaHaise 
Signed-off-by: Jiri Slaby

aio: block exit_aio() until all context requests are completed

2014-10-13T13:41:41+00:00

commit 6098b45b32e6baeacc04790773ced9340601d511 upstream.

It seems that exit_aio() also needs to wait for all iocbs to complete (like
io_destroy), but we missed the wait step in the current implementation, so fix
it in the same way as we did in io_destroy.

Signed-off-by: Gu Zheng 
Signed-off-by: Benjamin LaHaise 
[bwh: Backported to 3.16: adjust context]
Signed-off-by: Ben Hutchings

aio: add missing smp_rmb() in read_events_ring

2014-09-26T09:23:43+00:00

commit 2ff396be602f10b5eab8e73b24f20348fa2de159 upstream.

We ran into a case on ppc64 running mariadb where io_getevents would
return zeroed out I/O events.  After adding instrumentation, it became
clear that there was some missing synchronization between reading the
tail pointer and the events themselves.  This small patch fixes the
problem in testing.
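
The ordering problem described above can be sketched in user space with C11 atomics. This is an analogue, not the kernel code: it uses acquire/release operations in place of the kernel's smp_rmb(), assumes a single producer and a single consumer, and every name in it (struct sketch_ring, sketch_publish(), sketch_consume(), SKETCH_RING_SIZE) is hypothetical.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

#define SKETCH_RING_SIZE 8u

struct sketch_ring {
	int events[SKETCH_RING_SIZE];
	atomic_uint head;	/* advanced by the consumer */
	atomic_uint tail;	/* advanced by the producer */
};

/* Producer: write the event body first, then publish the new tail. */
static bool sketch_publish(struct sketch_ring *r, int ev)
{
	unsigned tail = atomic_load_explicit(&r->tail, memory_order_relaxed);
	unsigned head = atomic_load_explicit(&r->head, memory_order_acquire);
	unsigned next = (tail + 1) % SKETCH_RING_SIZE;

	if (next == head)
		return false;	/* ring full */
	r->events[tail] = ev;
	atomic_store_explicit(&r->tail, next, memory_order_release);
	return true;
}

/*
 * Consumer: the acquire load of tail pairs with the producer's release
 * store, so the event read below cannot be satisfied before the tail
 * load and therefore cannot observe a not-yet-written slot.
 */
static bool sketch_consume(struct sketch_ring *r, int *out)
{
	unsigned head = atomic_load_explicit(&r->head, memory_order_relaxed);
	unsigned tail = atomic_load_explicit(&r->tail, memory_order_acquire);

	assert(out);
	if (head == tail)
		return false;	/* ring empty */
	*out = r->events[head];
	atomic_store_explicit(&r->head, (head + 1) % SKETCH_RING_SIZE,
			      memory_order_release);
	return true;
}
```

Without the ordering between the tail load and the event read, a weakly ordered CPU is free to read the slot early and return stale (for example zeroed) contents, which matches the symptom reported here.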

Thanks to Zach for helping to look into this, and suggesting the fix.

Signed-off-by: Jeff Moyer 
Signed-off-by: Benjamin LaHaise 
Signed-off-by: Jiri Slaby

aio: protect reqs_available updates from changes in interrupt handlers

2014-07-29T15:01:48+00:00

commit 263782c1c95bbddbb022dc092fd89a36bb8d5577 upstream.

As of commit f8567a3845ac05bb28f3c1b478ef752762bd39ef it is now possible to
have put_reqs_available() called from irq context.  While put_reqs_available()
is per cpu, it did not protect itself from interrupts on the same CPU.  This
led to aio_complete() corrupting the available io requests count when run
under heavy O_DIRECT workloads as reported by Robert Elliott.  Fix this by
disabling irqs around the per cpu batch updates of reqs_available.

Many thanks to Robert and folks for testing and tracking this down.

Reported-by: Robert Elliot 
Tested-by: Robert Elliot 
Signed-off-by: Benjamin LaHaise 
Cc: Jens Axboe , Christoph Hellwig 
Signed-off-by: Jiri Slaby

Revert "aio: fix kernel memory disclosure in io_getevents() introduced in v3.10"

2014-07-14T13:21:39+00:00

This reverts commit 0e2e24e5dc6eb6f0698e9dc97e652f132b885624, which
was applied twice mistakenly. The first one is
bee3f7b8188d4b2a5dfaeb2eb4a68d99f67daecf.

Reported-by: Gu Zheng 
Signed-off-by: Jiri Slaby 
Cc: Benjamin LaHaise 
Cc: Mateusz Guzik 
Cc: Petr Matousek 
Cc: Kent Overstreet 
Cc: Jeff Moyer