linux-stable.git/block, branch linux-4.2.y

blk-mq: fix use-after-free in blk_mq_free_tag_set()

2015-11-09T22:37:39+00:00

commit f42d79ab67322e51b92dd7aa965e310c71352a64 upstream.

tags is freed in blk_mq_free_rq_map() and should not be used after that.
The problem doesn't manifest if CONFIG_CPUMASK_OFFSTACK is false because
free_cpumask_var() is nop.

tags->cpumask is allocated in blk_mq_init_tags() so it's natural to
free cpumask in its counter part, blk_mq_free_tags().

Fixes: f26cdc8536ad ("blk-mq: Shared tag enhancements")
Signed-off-by: Jun'ichi Nomura 
Cc: Keith Busch 
Reviewed-by: Jeff Moyer 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: don't release bdi while request_queue has live references

2015-11-09T22:37:36+00:00

commit b02176f30cd30acccd3b633ab7d9aed8b5da52ff upstream.

bdi's are initialized in two steps, bdi_init() and bdi_register(), but
destroyed in a single step by bdi_destroy() which, for a bdi embedded
in a request_queue, is called during blk_cleanup_queue() which makes
the queue invisible and starts the draining of remaining usages.

A request_queue's user can access the congestion state of the embedded
bdi as long as it holds a reference to the queue.  As such, it may
access the congested state of a queue which finished
blk_cleanup_queue() but hasn't reached blk_release_queue() yet.
Because the congested state was embedded in backing_dev_info which in
turn is embedded in request_queue, accessing the congested state after
bdi_destroy() was called was fine.  The bdi was destroyed but the
memory region for the congested state remained accessible till the
queue got released.

a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in
bdi_writeback") changed the situation.  Now, the root congested state
which is expected to be pinned while request_queue remains accessible
is separately reference counted and the base ref is put during
bdi_destroy().  This means that the root congested state may go away
prematurely while the queue is between bdi_dstroy() and
blk_cleanup_queue(), which was detected by Andrey's KASAN tests.

The root cause of this problem is that bdi doesn't distinguish the two
steps of destruction, unregistration and release, and now the root
congested state actually requires a separate release step.  To fix the
issue, this patch separates out bdi_unregister() and bdi_exit() from
bdi_destroy().  bdi_unregister() is called from blk_cleanup_queue()
and bdi_exit() from blk_release_queue().  bdi_destroy() is now just a
simple wrapper calling the two steps back-to-back.

While at it, the prototype of bdi_destroy() is moved right below
bdi_setup_and_register() so that the counterpart operations are
located together.

Signed-off-by: Tejun Heo 
Fixes: a13f35e87140 ("writeback: don't embed root bdi_writeback_congested in bdi_writeback")
Reported-and-tested-by: Andrey Konovalov 
Link: http://lkml.kernel.org/g/CAAeHK+zUJ74Zn17=rOyxacHU18SgCfC6bsYW=6kCY5GXJBwGfQ@mail.gmail.com
Reviewed-by: Jan Kara 
Reviewed-by: Jeff Moyer 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: avoid setting hctx->tags->cpumask before allocation

2015-10-22T21:49:35+00:00

commit 1356aae08338f1c19ce1c67bf8c543a267688fc3 upstream.

When unmapped hw queue is remapped after CPU topology is changed,
hctx->tags->cpumask has to be set after hctx->tags is setup in
blk_mq_map_swqueue(), otherwise it causes null pointer dereference.

Fixes: f26cdc8536 ("blk-mq: Shared tag enhancements")
Signed-off-by: Akinobu Mita 
Cc: Keith Busch 
Cc: Ming Lei 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: blkg_destroy_all() should clear q->root_blkg and ->root_rl.blkg

2015-10-22T21:49:17+00:00

commit 6fe810bda0bd9a5d7674fc671fac27b8aa8ec243 upstream.

While making the root blkg unconditional, ec13b1d6f0a0 ("blkcg: always
create the blkcg_gq for the root blkcg") removed the part which clears
q->root_blkg and ->root_rl.blkg during q exit.  This leaves the two
pointers dangling after blkg_destroy_all().  blk-throttle exit path
performs blkg traversals and dereferences ->root_blkg and can lead to
the following oops.

 BUG: unable to handle kernel NULL pointer dereference at 0000000000000558
 IP: [] __blkg_lookup+0x26/0x70
 ...
 task: ffff88001b4e2580 ti: ffff88001ac0c000 task.ti: ffff88001ac0c000
 RIP: 0010:[]  [] __blkg_lookup+0x26/0x70
 ...
 Call Trace:
  [] blk_throtl_drain+0x5a/0x110
  [] blkcg_drain_queue+0x18/0x20
  [] __blk_drain_queue+0xc0/0x170
  [] blk_queue_bypass_start+0x61/0x80
  [] blkcg_deactivate_policy+0x39/0x100
  [] blk_throtl_exit+0x38/0x50
  [] blkcg_exit_queue+0x3e/0x50
  [] blk_release_queue+0x1e/0xc0
 ...

While the bug is a straigh-forward use-after-free bug, it is tricky to
reproduce because blkg release is RCU protected and the rest of exit
path usually finishes before RCU grace period.

This patch fixes the bug by updating blkg_destro_all() to clear
q->root_blkg and ->root_rl.blkg.

Signed-off-by: Tejun Heo 
Reported-by: "Richard W.M. Jones" 
Reported-by: Josh Boyer 
Link: http://lkml.kernel.org/g/CA+5PVA5rzQ0s4723n5rHBcxQa9t0cW8BPPBekr_9aMRoWt2aYg@mail.gmail.com
Fixes: ec13b1d6f0a0 ("blkcg: always create the blkcg_gq for the root blkcg")
Tested-by: Richard W.M. Jones 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: fix race between timeout and freeing request

2015-09-29T17:33:15+00:00

commit 0048b4837affd153897ed1222283492070027aa9 upstream.

Inside timeout handler, blk_mq_tag_to_rq() is called
to retrieve the request from one tag. This way is obviously
wrong because the request can be freed any time and some
fiedds of the request can't be trusted, then kernel oops
might be triggered[1].

Currently wrt. blk_mq_tag_to_rq(), the only special case is
that the flush request can share same tag with the request
cloned from, and the two requests can't be active at the same
time, so this patch fixes the above issue by updating tags->rqs[tag]
with the active request(either flush rq or the request cloned
from) of the tag.

Also blk_mq_tag_to_rq() gets much simplified with this patch.

Given blk_mq_tag_to_rq() is mainly for drivers and the caller must
make sure the request can't be freed, so in bt_for_each() this
helper is replaced with tags->rqs[tag].

[1] kernel oops log
[  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
[  439.697162] IP: [] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
[  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
[  439.700653] Dumping ftrace buffer:^M
[  439.700653]    (ftrace buffer empty)^M
[  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
[  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
[  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
[  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
[  439.730500] RIP: 0010:[]  [] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.730500] RSP: 0018:ffff880819203da0  EFLAGS: 00010283^M
[  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
[  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
[  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
[  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
[  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
[  439.730500] FS:  00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
[  439.730500] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
[  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
[  439.730500] Stack:^M
[  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
[  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
[  439.755663] Call Trace:^M
[  439.755663]   ^M
[  439.755663]  [] bt_for_each+0x6e/0xc8^M
[  439.755663]  [] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[  439.755663]  [] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[  439.755663]  [] blk_mq_tag_busy_iter+0x55/0x5e^M
[  439.755663]  [] ? blk_mq_bio_to_request+0x38/0x38^M
[  439.755663]  [] blk_mq_rq_timer+0x5d/0xd4^M
[  439.755663]  [] call_timer_fn+0xf7/0x284^M
[  439.755663]  [] ? call_timer_fn+0x5/0x284^M
[  439.755663]  [] ? blk_mq_bio_to_request+0x38/0x38^M
[  439.755663]  [] run_timer_softirq+0x1ce/0x1f8^M
[  439.755663]  [] __do_softirq+0x181/0x3a4^M
[  439.755663]  [] irq_exit+0x40/0x94^M
[  439.755663]  [] smp_apic_timer_interrupt+0x33/0x3e^M
[  439.755663]  [] apic_timer_interrupt+0x84/0x90^M
[  439.755663]   ^M
[  439.755663]  [] ? _raw_spin_unlock_irq+0x32/0x4a^M
[  439.755663]  [] finish_task_switch+0xe0/0x163^M
[  439.755663]  [] ? finish_task_switch+0xa2/0x163^M
[  439.755663]  [] __schedule+0x469/0x6cd^M
[  439.755663]  [] schedule+0x82/0x9a^M
[  439.789267]  [] signalfd_read+0x186/0x49a^M
[  439.790911]  [] ? wake_up_q+0x47/0x47^M
[  439.790911]  [] __vfs_read+0x28/0x9f^M
[  439.790911]  [] ? __fget_light+0x4d/0x74^M
[  439.790911]  [] vfs_read+0x7a/0xc6^M
[  439.790911]  [] SyS_read+0x49/0x7f^M
[  439.790911]  [] entry_SYSCALL_64_fastpath+0x12/0x6f^M
[  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
^M
[  439.790911] RIP  [] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.790911]  RSP ^M
[  439.790911] CR2: 0000000000000158^M
[  439.790911] ---[ end trace d40af58949325661 ]---^M

Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: fix buffer overflow when reading sysfs file of 'pending'

2015-09-29T17:33:15+00:00

commit 596f5aad2a704b72934e5abec1b1b4114c16f45b upstream.

There may be lots of pending requests so that the buffer of PAGE_SIZE
can't hold them at all.

One typical example is scsi-mq, the queue depth(.can_queue) of
scsi_host and blk-mq is quite big but scsi_device's queue_depth
is a bit small(.cmd_per_lun), then it is quite easy to have lots
of pending requests in hw queue.

This patch fixes the following warning and the related memory
destruction.

[  359.025101] fill_read_buffer: blk_mq_hw_sysfs_show+0x0/0x7d returned bad count^M
[  359.055595] irq event stamp: 15537^M
[  359.055606] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
[  359.055614] Dumping ftrace buffer:^M
[  359.055660]    (ftrace buffer empty)^M
[  359.055672] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
[  359.055678] CPU: 4 PID: 21631 Comm: stress-ng-sysfs Not tainted 4.2.0-rc5-next-20150805 #434^M
[  359.055679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
[  359.055682] task: ffff8802161cc000 ti: ffff88021b4a8000 task.ti: ffff88021b4a8000^M
[  359.055693] RIP: 0010:[]  [] __kmalloc+0xe8/0x152^M

Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

2015-08-15T20:54:53+00:00

Pull SCSI fixes from James Bottomley:
 "This has two libfc fixes for bugs causing rare crashes, one iscsi fix
  for a potential hang on shutdown, and a fix for an I/O blocksize issue
  which caused a regression"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  sd: Fix maximum I/O size for BLOCK_PC requests
  libfc: Fix fc_fcp_cleanup_each_cmd()
  libfc: Fix fc_exch_recv_req() error path
  libiscsi: Fix host busy blocking during connection teardown

sd: Fix maximum I/O size for BLOCK_PC requests

2015-08-12T18:54:37+00:00

Commit bcdb247c6b6a ("sd: Limit transfer length") clamped the maximum
size of an I/O request to the MAXIMUM TRANSFER LENGTH field in the BLOCK
LIMITS VPD. This had the unfortunate effect of also limiting the maximum
size of non-filesystem requests sent to the device through sg/bsg.

Avoid using blk_queue_max_hw_sectors() and set the max_sectors queue
limit directly.

Also update the comment in blk_limits_max_hw_sectors() to clarify that
max_hw_sectors defines the limit for the I/O controller only.

Signed-off-by: Martin K. Petersen 
Reported-by: Brian King 
Tested-by: Brian King 
Cc: stable@vger.kernel.org # 3.17+
Signed-off-by: James Bottomley

block: Do a full clone when splitting discard bios

2015-07-23T22:21:34+00:00

This fixes a data corruption bug when using discard on top of MD linear,
raid0 and raid10 personalities.

Commit 20d0189b1012 "block: Introduce new bio_split()" permits sharing
the bio_vec between the two resulting bios. That is fine for read/write
requests where the bio_vec is immutable. For discards, however, we need
to be able to attach a payload and update the bio_vec so the page can
get mapped to a scatterlist entry. Therefore the bio_vec can not be
shared when splitting discards and we must do a full clone.

Signed-off-by: Martin K. Petersen 
Reported-by: Seunguk Shin 
Tested-by: Seunguk Shin 
Cc: Seunguk Shin 
Cc: Jens Axboe 
Cc: Kent Overstreet 
Cc:  # v3.14+
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe

block: export bio_associate_*() and wbc_account_io()

2015-07-23T19:36:44+00:00

bio_associate_blkcg(), bio_associate_current() and wbc_account_io()
are used to implement cgroup writeback support for filesystems and
thus need to be exported.  Export them.

Signed-off-by: Tejun Heo 
Reported-by: Stephen Rothwell 
Signed-off-by: Jens Axboe