linux-stable.git/block, branch v3.18.46

block: Do a full clone when splitting discard bios

2016-10-07T14:37:27+00:00

This fixes a data corruption bug when using discard on top of MD linear,
raid0 and raid10 personalities.

Commit 20d0189b1012 "block: Introduce new bio_split()" permits sharing
the bio_vec between the two resulting bios. That is fine for read/write
requests where the bio_vec is immutable. For discards, however, we need
to be able to attach a payload and update the bio_vec so the page can
get mapped to a scatterlist entry. Therefore the bio_vec cannot be
shared when splitting discards and we must do a full clone.
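
The difference between sharing and cloning can be sketched in userspace.
This is only an illustration under stated assumptions: struct model_bio,
struct model_vec, split_shared(), split_cloned() and attach_payload() are
hypothetical stand-ins, not the kernel's struct bio or bio_split().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical userspace stand-ins; not the kernel's bio structures. */
struct model_vec {
	unsigned int len;
	unsigned int offset;
};

struct model_bio {
	struct model_vec *vecs;
	unsigned int vcnt;
};

/* Fast split: the new bio points at the parent's vector array. */
static struct model_bio split_shared(const struct model_bio *parent)
{
	struct model_bio b = { parent->vecs, parent->vcnt };
	return b;
}

/* Full clone: the new bio gets a private copy of the vector array.
 * On allocation failure the result has vecs == NULL and vcnt == 0. */
static struct model_bio split_cloned(const struct model_bio *parent)
{
	struct model_bio b = { NULL, 0 };

	b.vecs = malloc(parent->vcnt * sizeof(*b.vecs));
	if (b.vecs) {
		memcpy(b.vecs, parent->vecs, parent->vcnt * sizeof(*b.vecs));
		b.vcnt = parent->vcnt;
	}
	return b;
}

/* Models a discard attaching a payload: it writes to the first vector. */
static void attach_payload(struct model_bio *bio, unsigned int len)
{
	assert(bio->vecs != NULL && bio->vcnt > 0);
	bio->vecs[0].len = len;
}
```

With split_shared(), a call to attach_payload() on the split bio also
changes what the parent sees. With split_cloned() the parent's vectors
stay untouched, which is the property discards need.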

Signed-off-by: Martin K. Petersen 
Reported-by: Seunguk Shin 
Tested-by: Seunguk Shin 
Cc: Seunguk Shin 
Cc: Jens Axboe 
Cc: Kent Overstreet 
Cc:  # v3.14+
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe

block: fix use-after-free in seq file

2016-08-22T16:23:29+00:00

[ Upstream commit 77da160530dd1dc94f6ae15a981f24e5f0021e84 ]

I got a KASAN report of use-after-free:

    ==================================================================
    BUG: KASAN: use-after-free in klist_iter_exit+0x61/0x70 at addr ffff8800b6581508
    Read of size 8 by task trinity-c1/315
    =============================================================================
    BUG kmalloc-32 (Not tainted): kasan: bad access detected
    -----------------------------------------------------------------------------

    Disabling lock debugging due to kernel taint
    INFO: Allocated in disk_seqf_start+0x66/0x110 age=144 cpu=1 pid=315
            ___slab_alloc+0x4f1/0x520
            __slab_alloc.isra.58+0x56/0x80
            kmem_cache_alloc_trace+0x260/0x2a0
            disk_seqf_start+0x66/0x110
            traverse+0x176/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a
    INFO: Freed in disk_seqf_stop+0x42/0x50 age=160 cpu=1 pid=315
            __slab_free+0x17a/0x2c0
            kfree+0x20a/0x220
            disk_seqf_stop+0x42/0x50
            traverse+0x3b5/0x860
            seq_read+0x7e3/0x11a0
            proc_reg_read+0xbc/0x180
            do_loop_readv_writev+0x134/0x210
            do_readv_writev+0x565/0x660
            vfs_readv+0x67/0xa0
            do_preadv+0x126/0x170
            SyS_preadv+0xc/0x10
            do_syscall_64+0x1a1/0x460
            return_from_SYSCALL_64+0x0/0x6a

    CPU: 1 PID: 315 Comm: trinity-c1 Tainted: G    B           4.7.0+ #62
    Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Ubuntu-1.8.2-1ubuntu1 04/01/2014
     ffffea0002d96000 ffff880119b9f918 ffffffff81d6ce81 ffff88011a804480
     ffff8800b6581500 ffff880119b9f948 ffffffff8146c7bd ffff88011a804480
     ffffea0002d96000 ffff8800b6581500 fffffffffffffff4 ffff880119b9f970
    Call Trace:
     [] dump_stack+0x65/0x84
     [] print_trailer+0x10d/0x1a0
     [] object_err+0x2f/0x40
     [] kasan_report_error+0x221/0x520
     [] __asan_report_load8_noabort+0x3e/0x40
     [] klist_iter_exit+0x61/0x70
     [] class_dev_iter_exit+0x9/0x10
     [] disk_seqf_stop+0x3a/0x50
     [] seq_read+0x4b2/0x11a0
     [] proc_reg_read+0xbc/0x180
     [] do_loop_readv_writev+0x134/0x210
     [] do_readv_writev+0x565/0x660
     [] vfs_readv+0x67/0xa0
     [] do_preadv+0x126/0x170
     [] SyS_preadv+0xc/0x10

This problem can occur in the following situation:

open()
 - pread()
    - .seq_start()
       - iter = kmalloc() // succeeds
       - seqf->private = iter
    - .seq_stop()
       - kfree(seqf->private)
 - pread()
    - .seq_start()
       - iter = kmalloc() // fails
    - .seq_stop()
       - class_dev_iter_exit(seqf->private) // boom! old pointer

As the comment in disk_seqf_stop() says, stop is called even if start
failed, so we need to reinitialise the private pointer to NULL when seq
iteration stops.

An alternative would be to set the private pointer to NULL when the
kmalloc() in disk_seqf_start() fails.
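
The sequence above can be modelled in userspace. This is a minimal
sketch, not the kernel code: struct model_seq, model_start(),
model_stop(), the fail_next_alloc knob and the exit_calls counter are
all hypothetical names introduced for illustration.

```c
#include <assert.h>
#include <stdlib.h>

/* Hypothetical stand-in for the seq_file private pointer. */
struct model_seq {
	void *private;
};

static int fail_next_alloc;	/* test knob: force the next allocation to fail */
static int exit_calls;		/* counts calls standing in for class_dev_iter_exit() */

static void *model_alloc(size_t n)
{
	if (fail_next_alloc) {
		fail_next_alloc = 0;
		return NULL;
	}
	return malloc(n);
}

/* Models .seq_start(): on allocation failure, private is left untouched. */
static int model_start(struct model_seq *seqf)
{
	void *iter = model_alloc(32);

	if (!iter)
		return -1;
	seqf->private = iter;
	return 0;
}

/* Models .seq_stop(): it runs even when start failed, so it must not
 * trust a stale pointer. Resetting private to NULL is the fix. */
static void model_stop(struct model_seq *seqf)
{
	void *iter = seqf->private;

	if (iter) {
		exit_calls++;
		free(iter);
		seqf->private = NULL;
	}
}
```

Without the assignment to NULL in model_stop(), a second model_stop()
after a failed model_start() would touch and free the old pointer again.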

Cc: stable@vger.kernel.org
Signed-off-by: Vegard Nossum 
Acked-by: Tejun Heo 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: fix use-after-free in sys_ioprio_get()

2016-07-19T22:20:21+00:00

[ Upstream commit 8ba8682107ee2ca3347354e018865d8e1967c5f4 ]

get_task_ioprio() accesses the task->io_context without holding the task
lock and thus can race with exit_io_context(), leading to a
use-after-free. The reproducer below hits this within a few seconds on
my 4-core QEMU VM:

#define _GNU_SOURCE
#include <assert.h>
#include <sys/syscall.h>
#include <sys/wait.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	pid_t pid, child;
	long nproc, i;

	/* ioprio_set(IOPRIO_WHO_PROCESS, 0, IOPRIO_PRIO_VALUE(IOPRIO_CLASS_IDLE, 0)); */
	syscall(SYS_ioprio_set, 1, 0, 0x6000);

	nproc = sysconf(_SC_NPROCESSORS_ONLN);

	for (i = 0; i < nproc; i++) {
		pid = fork();
		assert(pid != -1);
		if (pid == 0) {
			for (;;) {
				pid = fork();
				assert(pid != -1);
				if (pid == 0) {
					_exit(0);
				} else {
					child = wait(NULL);
					assert(child == pid);
				}
			}
		}

		pid = fork();
		assert(pid != -1);
		if (pid == 0) {
			for (;;) {
				/* ioprio_get(IOPRIO_WHO_PGRP, 0); */
				syscall(SYS_ioprio_get, 2, 0);
			}
		}
	}

	for (;;) {
		/* ioprio_get(IOPRIO_WHO_PGRP, 0); */
		syscall(SYS_ioprio_get, 2, 0);
	}

	return 0;
}

This gets us KASAN dumps like this:

[   35.526914] ==================================================================
[   35.530009] BUG: KASAN: out-of-bounds in get_task_ioprio+0x7b/0x90 at addr ffff880066f34e6c
[   35.530009] Read of size 2 by task ioprio-gpf/363
[   35.530009] =============================================================================
[   35.530009] BUG blkdev_ioc (Not tainted): kasan: bad access detected
[   35.530009] -----------------------------------------------------------------------------

[   35.530009] Disabling lock debugging due to kernel taint
[   35.530009] INFO: Allocated in create_task_io_context+0x2b/0x370 age=0 cpu=0 pid=360
[   35.530009] 	___slab_alloc+0x55d/0x5a0
[   35.530009] 	__slab_alloc.isra.20+0x2b/0x40
[   35.530009] 	kmem_cache_alloc_node+0x84/0x200
[   35.530009] 	create_task_io_context+0x2b/0x370
[   35.530009] 	get_task_io_context+0x92/0xb0
[   35.530009] 	copy_process.part.8+0x5029/0x5660
[   35.530009] 	_do_fork+0x155/0x7e0
[   35.530009] 	SyS_clone+0x19/0x20
[   35.530009] 	do_syscall_64+0x195/0x3a0
[   35.530009] 	return_from_SYSCALL_64+0x0/0x6a
[   35.530009] INFO: Freed in put_io_context+0xe7/0x120 age=0 cpu=0 pid=1060
[   35.530009] 	__slab_free+0x27b/0x3d0
[   35.530009] 	kmem_cache_free+0x1fb/0x220
[   35.530009] 	put_io_context+0xe7/0x120
[   35.530009] 	put_io_context_active+0x238/0x380
[   35.530009] 	exit_io_context+0x66/0x80
[   35.530009] 	do_exit+0x158e/0x2b90
[   35.530009] 	do_group_exit+0xe5/0x2b0
[   35.530009] 	SyS_exit_group+0x1d/0x20
[   35.530009] 	entry_SYSCALL_64_fastpath+0x1a/0xa4
[   35.530009] INFO: Slab 0xffffea00019bcd00 objects=20 used=4 fp=0xffff880066f34ff0 flags=0x1fffe0000004080
[   35.530009] INFO: Object 0xffff880066f34e58 @offset=3672 fp=0x0000000000000001
[   35.530009] ==================================================================

Fix it by grabbing the task lock while we poke at the io_context.
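
The locking pattern can be sketched in userspace, with a pthread mutex
standing in for the task lock. All names here (struct model_task,
model_get_ioprio(), model_exit_io_context(), MODEL_IOPRIO_NONE) are
hypothetical and only illustrate the idea; this is not the kernel code.

```c
#include <assert.h>
#include <pthread.h>
#include <stdlib.h>

#define MODEL_IOPRIO_NONE 0	/* hypothetical default when there is no context */

struct model_io_context {
	unsigned short ioprio;
};

struct model_task {
	pthread_mutex_t lock;			/* stands in for the task lock */
	struct model_io_context *io_context;
};

/* Reader: only dereference io_context while holding the lock. */
static int model_get_ioprio(struct model_task *t)
{
	int ret = MODEL_IOPRIO_NONE;

	assert(t != NULL);
	pthread_mutex_lock(&t->lock);
	if (t->io_context)
		ret = t->io_context->ioprio;
	pthread_mutex_unlock(&t->lock);
	return ret;
}

/* Exit path: detach the context under the lock, then free it. */
static void model_exit_io_context(struct model_task *t)
{
	struct model_io_context *ioc;

	pthread_mutex_lock(&t->lock);
	ioc = t->io_context;
	t->io_context = NULL;
	pthread_mutex_unlock(&t->lock);
	free(ioc);
}
```

Because the reader and the exit path take the same lock, the reader
either sees a valid context or NULL, never a pointer that is being freed.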

Cc: stable@vger.kernel.org
Reported-by: Dmitry Vyukov 
Signed-off-by: Omar Sandoval 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq: fix buffer overflow when reading sysfs file of 'pending'

2015-10-07T14:03:25+00:00

[ Upstream commit 596f5aad2a704b72934e5abec1b1b4114c16f45b ]

There may be so many pending requests that a buffer of PAGE_SIZE bytes
cannot hold them all.

One typical example is scsi-mq: the queue depth (.can_queue) of the
scsi_host and blk-mq is quite large, but the scsi_device's queue_depth
(.cmd_per_lun) is fairly small, so it is easy to accumulate many
pending requests in the hw queue.

This patch fixes the following warning and the related memory
corruption.

[  359.025101] fill_read_buffer: blk_mq_hw_sysfs_show+0x0/0x7d returned bad count
[  359.055595] irq event stamp: 15537
[  359.055606] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC
[  359.055614] Dumping ftrace buffer:
[  359.055660]    (ftrace buffer empty)
[  359.055672] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw
[  359.055678] CPU: 4 PID: 21631 Comm: stress-ng-sysfs Not tainted 4.2.0-rc5-next-20150805 #434
[  359.055679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011
[  359.055682] task: ffff8802161cc000 ti: ffff88021b4a8000 task.ti: ffff88021b4a8000
[  359.055693] RIP: 0010:[]  [] __kmalloc+0xe8/0x152
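
One way to keep such a show routine inside its buffer is to check the
remaining space before every entry. The sketch below is a userspace
illustration under that assumption, not the actual patch; show_pending()
and its parameters are hypothetical.

```c
#include <assert.h>
#include <stddef.h>
#include <stdio.h>

/*
 * Hypothetical helper: format up to n tags into buf (bufsz bytes), one
 * per line, and stop as soon as the next line would not fit. Returns the
 * number of bytes written, not counting the terminating NUL.
 */
static size_t show_pending(char *buf, size_t bufsz,
			   const unsigned int *tags, size_t n)
{
	size_t used = 0;
	size_t i;

	assert(buf != NULL && bufsz > 0);
	buf[0] = '\0';

	for (i = 0; i < n; i++) {
		int len = snprintf(buf + used, bufsz - used, "\t%u\n", tags[i]);

		/* snprintf() reports the length it wanted; if that does not
		 * fit, drop the partial line and stop. */
		if (len < 0 || (size_t)len >= bufsz - used) {
			buf[used] = '\0';
			break;
		}
		used += (size_t)len;
	}
	return used;
}
```

The return value is always smaller than bufsz, so a caller that reports
it as the byte count can never claim more than the buffer holds.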

Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

sd: Fix maximum I/O size for BLOCK_PC requests

2015-09-17T05:30:44+00:00

[ Upstream commit 4f258a46346c03fa0bbb6199ffaf4e1f9f599660 ]

Commit bcdb247c6b6a ("sd: Limit transfer length") clamped the maximum
size of an I/O request to the MAXIMUM TRANSFER LENGTH field in the BLOCK
LIMITS VPD. This had the unfortunate effect of also limiting the maximum
size of non-filesystem requests sent to the device through sg/bsg.

Avoid using blk_queue_max_hw_sectors() and set the max_sectors queue
limit directly.

Also update the comment in blk_limits_max_hw_sectors() to clarify that
max_hw_sectors defines the limit for the I/O controller only.

Signed-off-by: Martin K. Petersen 
Reported-by: Brian King 
Tested-by: Brian King 
Cc: stable@vger.kernel.org # 3.17+
Signed-off-by: James Bottomley 
Signed-off-by: Sasha Levin

blkcg: fix gendisk reference leak in blkg_conf_prep()

2015-08-27T17:25:46+00:00

[ Upstream commit 5f6c2d2b7dbb541c1e922538c49fa04c494ae3d7 ]

When a blkcg configuration targets a partition rather than a whole
device, blkg_conf_prep() fails with -EINVAL; unfortunately, it
forgets to put the gendisk ref in that case. Fix it.
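
The shape of the leak and its fix can be sketched with a plain counter.
This is a hypothetical userspace model: struct model_disk,
model_get_disk(), model_put_disk() and model_conf_prep() are
illustrative names, not the blkcg API.

```c
#include <assert.h>
#include <errno.h>

/* Hypothetical stand-in for a refcounted gendisk. */
struct model_disk {
	int refcount;
};

static struct model_disk *model_get_disk(struct model_disk *d)
{
	d->refcount++;
	return d;
}

static void model_put_disk(struct model_disk *d)
{
	assert(d->refcount > 0);
	d->refcount--;
}

/*
 * Takes a reference on the disk. A non-zero partno models a
 * configuration aimed at a partition, which is rejected; the reference
 * must be dropped again before returning the error.
 */
static int model_conf_prep(struct model_disk *d, int partno)
{
	struct model_disk *disk = model_get_disk(d);

	if (partno != 0) {
		model_put_disk(disk);	/* the put that was missing */
		return -EINVAL;
	}

	/* Success: the caller keeps the reference and puts it later. */
	return 0;
}
```

After a rejected call the count is back where it started; only a
successful call leaves a reference for the caller to release.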

Signed-off-by: Tejun Heo 
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq: fix CPU hotplug handling

2015-08-06T18:49:50+00:00

[ Upstream commit 2a34c0872adf252f23a6fef2d051a169ac796cef ]

In blk_mq_map_swqueue(), hctx->tags has to be set to NULL when the hw
queue is about to be unmapped, regardless of whether
set->tags[hctx->queue_num] is NULL, because shared tags may already
have been freed via another request queue.

The same situation has to be considered when handling CPU online. An
unmapped hw queue can be remapped after the CPU topology changes, so we
need to allocate tags for the hw queue in blk_mq_map_swqueue(). The tags
allocation can then be removed from the hctx CPU online notifier, and it
is reasonable to do it after the mapping is updated.

Reported-by: Dongsu Park 
Tested-by: Dongsu Park 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: fix ext_dev_lock lockdep report

2015-07-03T16:34:37+00:00

[ Upstream commit 4d66e5e9b6d720d8463e11d027bd4ad91c8b1318 ]

 =================================
 [ INFO: inconsistent lock state ]
 4.1.0-rc7+ #217 Tainted: G           O
 ---------------------------------
 inconsistent {SOFTIRQ-ON-W} -> {IN-SOFTIRQ-W} usage.
 swapper/6/0 [HC0[0]:SC1[1]:HE1:SE0] takes:
  (ext_devt_lock){+.?...}, at: [] blk_free_devt+0x3c/0x70
 {SOFTIRQ-ON-W} state was registered at:
   [] __lock_acquire+0x461/0x1e70
   [] lock_acquire+0xb7/0x290
   [] _raw_spin_lock+0x38/0x50
   [] blk_alloc_devt+0x6d/0xd0  <-- take the lock in process context
[..]
  [] __lock_acquire+0x3fe/0x1e70
  [] ? __lock_acquire+0xe5d/0x1e70
  [] lock_acquire+0xb7/0x290
  [] ? blk_free_devt+0x3c/0x70
  [] _raw_spin_lock+0x38/0x50
  [] ? blk_free_devt+0x3c/0x70
  [] blk_free_devt+0x3c/0x70    <-- take the lock in softirq
  [] part_release+0x1c/0x50
  [] device_release+0x36/0xb0
  [] kobject_cleanup+0x7b/0x1a0
  [] kobject_put+0x30/0x70
  [] put_device+0x17/0x20
  [] delete_partition_rcu_cb+0x16c/0x180
  [] ? read_dev_sector+0xa0/0xa0
  [] rcu_process_callbacks+0x2ff/0xa90
  [] ? rcu_process_callbacks+0x2bf/0xa90
  [] __do_softirq+0xde/0x600

Neil sees this in his tests and it also triggers on pmem driver unbind
for the libnvdimm tests.  This fix is on top of an initial fix by Keith
for incorrect usage of mutex_lock() in this path: 2da78092dda1 "block:
Fix dev_t minor allocation lifetime".  Both this and 2da78092dda1 are
candidates for -stable.

Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
Cc: Keith Busch 
Reported-by: NeilBrown 
Signed-off-by: Dan Williams 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

Fix bug in blk_rq_merge_ok

2015-04-23T03:32:31+00:00

[ Upstream commit 7ee8e4f3983c4ff700958a6099c8fd212ea67b94 ]

Use the right array index to reference the last
element of rq->biotail->bi_io_vec[].

Signed-off-by: Wenbo Wang 
Reviewed-by: Chong Yuan 
Fixes: 66cb45aa41315 ("block: add support for limiting gaps in SG lists")
Cc: stable@kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq: fix use of incorrect goto label in blk_mq_init_queue error path

2015-04-23T03:32:10+00:00

[ Upstream commit 9a30b096b543932de218dd3501b5562e00a8792d ]

If percpu_ref_init() fails the allocated q and hctxs must get cleaned
up; using 'err_map' doesn't allow that to happen.
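
The error-path idea can be sketched with stacked labels, where each
label undoes one more allocation before falling through to the next.
This is a hypothetical userspace model; the allocation order, the label
names other than err_map, and all helper names are assumptions made for
illustration.

```c
#include <assert.h>
#include <errno.h>
#include <stdlib.h>

static int live_allocs;	/* number of model allocations not yet freed */

static void *model_alloc(void)
{
	void *p = malloc(1);

	if (p)
		live_allocs++;
	return p;
}

static void model_free(void *p)
{
	assert(p != NULL && live_allocs > 0);
	live_allocs--;
	free(p);
}

/* ref_init_fails models a failing reference-counter initialisation. */
static int model_init_queue(int ref_init_fails)
{
	void *map, *hctxs, *q;

	map = model_alloc();
	if (!map)
		return -ENOMEM;
	hctxs = model_alloc();
	if (!hctxs)
		goto err_map;
	q = model_alloc();
	if (!q)
		goto err_hctxs;

	/* Jumping to err_map here would leak q and hctxs. */
	if (ref_init_fails)
		goto err_q;

	/* Success: torn down immediately only to keep the model leak-free. */
	model_free(q);
	model_free(hctxs);
	model_free(map);
	return 0;

err_q:
	model_free(q);
err_hctxs:
	model_free(hctxs);
err_map:
	model_free(map);
	return -ENOMEM;
}
```

The label chosen at each failure point must match the last resource that
was successfully set up, so everything acquired so far gets released.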

Signed-off-by: Mike Snitzer 
Reviewed-by: Ming Lei 
Cc: stable@kernel.org
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin