linux-stable.git/block, branch v4.2.3

blk-mq: fix race between timeout and freeing request

2015-09-29T17:33:15+00:00

commit 0048b4837affd153897ed1222283492070027aa9 upstream.

Inside timeout handler, blk_mq_tag_to_rq() is called
to retrieve the request from one tag. This way is obviously
wrong because the request can be freed any time and some
fiedds of the request can't be trusted, then kernel oops
might be triggered[1].

Currently wrt. blk_mq_tag_to_rq(), the only special case is
that the flush request can share same tag with the request
cloned from, and the two requests can't be active at the same
time, so this patch fixes the above issue by updating tags->rqs[tag]
with the active request(either flush rq or the request cloned
from) of the tag.

Also blk_mq_tag_to_rq() gets much simplified with this patch.

Given blk_mq_tag_to_rq() is mainly for drivers and the caller must
make sure the request can't be freed, so in bt_for_each() this
helper is replaced with tags->rqs[tag].

[1] kernel oops log
[  439.696220] BUG: unable to handle kernel NULL pointer dereference at 0000000000000158^M
[  439.697162] IP: [] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.700653] PGD 7ef765067 PUD 7ef764067 PMD 0 ^M
[  439.700653] Oops: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
[  439.700653] Dumping ftrace buffer:^M
[  439.700653]    (ftrace buffer empty)^M
[  439.700653] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
[  439.700653] CPU: 6 PID: 2779 Comm: stress-ng-sigfd Not tainted 4.2.0-rc5-next-20150805+ #265^M
[  439.730500] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
[  439.730500] task: ffff880605308000 ti: ffff88060530c000 task.ti: ffff88060530c000^M
[  439.730500] RIP: 0010:[]  [] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.730500] RSP: 0018:ffff880819203da0  EFLAGS: 00010283^M
[  439.730500] RAX: ffff880811b0e000 RBX: ffff8800bb465f00 RCX: 0000000000000002^M
[  439.730500] RDX: 0000000000000000 RSI: 0000000000000202 RDI: 0000000000000000^M
[  439.730500] RBP: ffff880819203db0 R08: 0000000000000002 R09: 0000000000000000^M
[  439.730500] R10: 0000000000000000 R11: 0000000000000000 R12: 0000000000000202^M
[  439.730500] R13: ffff880814104800 R14: 0000000000000002 R15: ffff880811a2ea00^M
[  439.730500] FS:  00007f165b3f5740(0000) GS:ffff880819200000(0000) knlGS:0000000000000000^M
[  439.730500] CS:  0010 DS: 0000 ES: 0000 CR0: 000000008005003b^M
[  439.730500] CR2: 0000000000000158 CR3: 00000007ef766000 CR4: 00000000000006e0^M
[  439.730500] Stack:^M
[  439.730500]  0000000000000008 ffff8808114eed90 ffff880819203e00 ffffffff812dc104^M
[  439.755663]  ffff880819203e40 ffffffff812d9f5e 0000020000000000 ffff8808114eed80^M
[  439.755663] Call Trace:^M
[  439.755663]   ^M
[  439.755663]  [] bt_for_each+0x6e/0xc8^M
[  439.755663]  [] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[  439.755663]  [] ? blk_mq_rq_timed_out+0x6a/0x6a^M
[  439.755663]  [] blk_mq_tag_busy_iter+0x55/0x5e^M
[  439.755663]  [] ? blk_mq_bio_to_request+0x38/0x38^M
[  439.755663]  [] blk_mq_rq_timer+0x5d/0xd4^M
[  439.755663]  [] call_timer_fn+0xf7/0x284^M
[  439.755663]  [] ? call_timer_fn+0x5/0x284^M
[  439.755663]  [] ? blk_mq_bio_to_request+0x38/0x38^M
[  439.755663]  [] run_timer_softirq+0x1ce/0x1f8^M
[  439.755663]  [] __do_softirq+0x181/0x3a4^M
[  439.755663]  [] irq_exit+0x40/0x94^M
[  439.755663]  [] smp_apic_timer_interrupt+0x33/0x3e^M
[  439.755663]  [] apic_timer_interrupt+0x84/0x90^M
[  439.755663]   ^M
[  439.755663]  [] ? _raw_spin_unlock_irq+0x32/0x4a^M
[  439.755663]  [] finish_task_switch+0xe0/0x163^M
[  439.755663]  [] ? finish_task_switch+0xa2/0x163^M
[  439.755663]  [] __schedule+0x469/0x6cd^M
[  439.755663]  [] schedule+0x82/0x9a^M
[  439.789267]  [] signalfd_read+0x186/0x49a^M
[  439.790911]  [] ? wake_up_q+0x47/0x47^M
[  439.790911]  [] __vfs_read+0x28/0x9f^M
[  439.790911]  [] ? __fget_light+0x4d/0x74^M
[  439.790911]  [] vfs_read+0x7a/0xc6^M
[  439.790911]  [] SyS_read+0x49/0x7f^M
[  439.790911]  [] entry_SYSCALL_64_fastpath+0x12/0x6f^M
[  439.790911] Code: 48 89 e5 e8 a9 b8 e7 ff 5d c3 0f 1f 44 00 00 55 89
f2 48 89 e5 41 54 41 89 f4 53 48 8b 47 60 48 8b 1c d0 48 8b 7b 30 48 8b
53 38 <48> 8b 87 58 01 00 00 48 85 c0 75 09 48 8b 97 88 0c 00 00 eb 10
^M
[  439.790911] RIP  [] blk_mq_tag_to_rq+0x21/0x6e^M
[  439.790911]  RSP ^M
[  439.790911] CR2: 0000000000000158^M
[  439.790911] ---[ end trace d40af58949325661 ]---^M

Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: fix buffer overflow when reading sysfs file of 'pending'

2015-09-29T17:33:15+00:00

commit 596f5aad2a704b72934e5abec1b1b4114c16f45b upstream.

There may be lots of pending requests so that the buffer of PAGE_SIZE
can't hold them at all.

One typical example is scsi-mq, the queue depth(.can_queue) of
scsi_host and blk-mq is quite big but scsi_device's queue_depth
is a bit small(.cmd_per_lun), then it is quite easy to have lots
of pending requests in hw queue.

This patch fixes the following warning and the related memory
destruction.

[  359.025101] fill_read_buffer: blk_mq_hw_sysfs_show+0x0/0x7d returned bad count^M
[  359.055595] irq event stamp: 15537^M
[  359.055606] general protection fault: 0000 [#1] PREEMPT SMP DEBUG_PAGEALLOC ^M
[  359.055614] Dumping ftrace buffer:^M
[  359.055660]    (ftrace buffer empty)^M
[  359.055672] Modules linked in: nbd ipv6 kvm_intel kvm serio_raw^M
[  359.055678] CPU: 4 PID: 21631 Comm: stress-ng-sysfs Not tainted 4.2.0-rc5-next-20150805 #434^M
[  359.055679] Hardware name: QEMU Standard PC (i440FX + PIIX, 1996), BIOS Bochs 01/01/2011^M
[  359.055682] task: ffff8802161cc000 ti: ffff88021b4a8000 task.ti: ffff88021b4a8000^M
[  359.055693] RIP: 0010:[]  [] __kmalloc+0xe8/0x152^M

Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

Merge tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi

2015-08-15T20:54:53+00:00

Pull SCSI fixes from James Bottomley:
 "This has two libfc fixes for bugs causing rare crashes, one iscsi fix
  for a potential hang on shutdown, and a fix for an I/O blocksize issue
  which caused a regression"

* tag 'scsi-fixes' of git://git.kernel.org/pub/scm/linux/kernel/git/jejb/scsi:
  sd: Fix maximum I/O size for BLOCK_PC requests
  libfc: Fix fc_fcp_cleanup_each_cmd()
  libfc: Fix fc_exch_recv_req() error path
  libiscsi: Fix host busy blocking during connection teardown

sd: Fix maximum I/O size for BLOCK_PC requests

2015-08-12T18:54:37+00:00

Commit bcdb247c6b6a ("sd: Limit transfer length") clamped the maximum
size of an I/O request to the MAXIMUM TRANSFER LENGTH field in the BLOCK
LIMITS VPD. This had the unfortunate effect of also limiting the maximum
size of non-filesystem requests sent to the device through sg/bsg.

Avoid using blk_queue_max_hw_sectors() and set the max_sectors queue
limit directly.

Also update the comment in blk_limits_max_hw_sectors() to clarify that
max_hw_sectors defines the limit for the I/O controller only.

Signed-off-by: Martin K. Petersen 
Reported-by: Brian King 
Tested-by: Brian King 
Cc: stable@vger.kernel.org # 3.17+
Signed-off-by: James Bottomley

block: Do a full clone when splitting discard bios

2015-07-23T22:21:34+00:00

This fixes a data corruption bug when using discard on top of MD linear,
raid0 and raid10 personalities.

Commit 20d0189b1012 "block: Introduce new bio_split()" permits sharing
the bio_vec between the two resulting bios. That is fine for read/write
requests where the bio_vec is immutable. For discards, however, we need
to be able to attach a payload and update the bio_vec so the page can
get mapped to a scatterlist entry. Therefore the bio_vec can not be
shared when splitting discards and we must do a full clone.

Signed-off-by: Martin K. Petersen 
Reported-by: Seunguk Shin 
Tested-by: Seunguk Shin 
Cc: Seunguk Shin 
Cc: Jens Axboe 
Cc: Kent Overstreet 
Cc:  # v3.14+
Reviewed-by: Christoph Hellwig 
Signed-off-by: Jens Axboe

block: export bio_associate_*() and wbc_account_io()

2015-07-23T19:36:44+00:00

bio_associate_blkcg(), bio_associate_current() and wbc_account_io()
are used to implement cgroup writeback support for filesystems and
thus need to be exported.  Export them.

Signed-off-by: Tejun Heo 
Reported-by: Stephen Rothwell 
Signed-off-by: Jens Axboe

blkcg: fix gendisk reference leak in blkg_conf_prep()

2015-07-22T22:06:53+00:00

When a blkcg configuration is targeted to a partition rather than a
whole device, blkg_conf_prep fails with -EINVAL; unfortunately, it
forgets to put the gendisk ref in that case.  Fix it.

Signed-off-by: Tejun Heo 
Cc: stable@vger.kernel.org
Signed-off-by: Jens Axboe

blk-mq: set default timeout as 30 seconds

2015-07-16T14:39:11+00:00

It is reasonable to set default timeout of request as 30 seconds instead of
30000 ticks, which may be 300 seconds if HZ is 100, for example, some arm64
based systems may choose 100 HZ.

Signed-off-by: Ming Lei 
Fixes: c76cbbcf4044 ("blk-mq: put blk_queue_rq_timeout together in blk_mq_init_queue()"
Signed-off-by: Jens Axboe

blkcg: fix blkcg_policy_data allocation bug

2015-07-09T20:41:11+00:00

e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg
data") updated per-blkcg policy data to be dynamically allocated.
When a policy is registered, its policy data aren't created.  Instead,
when the policy is activated on a queue, the policy data are allocated
if there are blkg's (blkcg_gq's) which are attached to a given blkcg.
This is buggy.  Consider the following scenario.

1. A blkcg is created.  No blkg's attached yet.

2. The policy is registered.  No policy data is allocated.

3. The policy is activated on a queue.  As the above blkcg doesn't
   have any blkg's, it won't allocate the matching blkcg_policy_data.

4. An IO is issued from the blkcg and blkg is created and the blkcg
   still doesn't have the matching policy data allocated.

With cfq-iosched, this leads to an oops.

It also doesn't free policy data on policy unregistration assuming
that freeing of all policy data on blkcg destruction should take care
of it; however, this also is incorrect.

1. A blkcg has policy data.

2. The policy gets unregistered but the policy data remains.

3. Another policy gets registered on the same slot.

4. Later, the new policy tries to allocate policy data on the previous
   blkcg but the slot is already occupied and gets skipped.  The
   policy ends up operating on the policy data of the previous policy.

There's no reason to manage blkcg_policy_data lazily.  The reason we
do lazy allocation of blkg's is that the number of all possible blkg's
is the product of cgroups and block devices which can reach a
surprising level.  blkcg_policy_data is contrained by the number of
cgroups and shouldn't be a problem.

This patch makes blkcg_policy_data to be allocated for all existing
blkcg's on policy registration and freed on unregistration and removes
blkcg_policy_data handling from policy [de]activation paths.  This
makes that blkcg_policy_data are created and removed with the policy
they belong to and fixes the above described problems.

Signed-off-by: Tejun Heo 
Fixes: e48453c386f3 ("block, cgroup: implement policy-specific per-blkcg data")
Cc: Vivek Goyal 
Cc: Arianna Avanzini 
Signed-off-by: Jens Axboe

blkcg: implement all_blkcgs list

2015-07-09T20:41:09+00:00

Add all_blkcgs list goes through blkcg->all_blkcgs_node and is
protected by blkcg_pol_mutex.  This will be used to fix
blkcg_policy_data allocation bug.

Signed-off-by: Tejun Heo 
Cc: Vivek Goyal 
Cc: Arianna Avanzini 
Signed-off-by: Jens Axboe