linux-stable.git/block, branch linux-4.15.y

blk-mq: don't keep offline CPUs mapped to hctx 0

2018-04-19T06:55:10+00:00

commit bffa9909a6b48d8ca3398dec601bc9162a4020c4 upstream.

From commit 4b855ad37194 ("blk-mq: Create hctx for each present CPU),
blk-mq doesn't remap queue after CPU topo is changed, that said when
some of these offline CPUs become online, they are still mapped to
hctx 0, then hctx 0 may become the bottleneck of IO dispatch and
completion.

This patch sets up the mapping from the beginning, and aligns to
queue mapping for PCI device (blk_mq_pci_map_queues()).

Cc: Stefan Haberland 
Cc: Keith Busch 
Cc: stable@vger.kernel.org
Fixes: 4b855ad37194 ("blk-mq: Create hctx for each present CPU)
Tested-by: Christian Borntraeger 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Sagi Grimberg 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: order getting budget and driver tag

2018-04-19T06:55:10+00:00

commit 0bca799b92807ee9be0890690f5dde7d8c6a8e25 upstream.

This patch orders getting budget and driver tag by making sure to acquire
driver tag after budget is got, this way can help to avoid the following
race:

1) before dispatch request from scheduler queue, get one budget first, then
dequeue a request, call it request A.

2) in another IO path for dispatching request B which is from hctx->dispatch,
driver tag is got, then try to get budget in blk_mq_dispatch_rq_list(),
unfortunately the budget is held by request A.

3) meantime blk_mq_dispatch_rq_list() is called for dispatching request
A, and try to get driver tag first, unfortunately no driver tag is
available because the driver tag is held by request B

4) both two IO pathes can't move on, and IO stall is caused.

This issue can be observed when running dbench on USB storage.

This patch fixes this issue by always getting budget before getting
driver tag.

Cc: stable@vger.kernel.org
Fixes: de1482974080ec9e ("blk-mq: introduce .get_budget and .put_budget in blk_mq_ops")
Cc: Christoph Hellwig 
Cc: Bart Van Assche 
Cc: Omar Sandoval 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block: Change a rcu_read_{lock,unlock}_sched() pair into rcu_read_{lock,unlock}()

2018-04-19T06:55:10+00:00

commit 818e0fa293ca836eba515615c64680ea916fd7cd upstream.

scsi_device_quiesce() uses synchronize_rcu() to guarantee that the
effect of blk_set_preempt_only() will be visible for percpu_ref_tryget()
calls that occur after the queue unfreeze by using the approach
explained in https://lwn.net/Articles/573497/. The rcu read lock and
unlock calls in blk_queue_enter() form a pair with the synchronize_rcu()
call in scsi_device_quiesce(). Both scsi_device_quiesce() and
blk_queue_enter() must either use regular RCU or RCU-sched.
Since neither the RCU-protected code in blk_queue_enter() nor
blk_queue_usage_counter_release() sleeps, regular RCU protection
is sufficient. Note: scsi_device_quiesce() does not have to be
modified since it already uses synchronize_rcu().

Reported-by: Tejun Heo 
Fixes: 3a0a529971ec ("block, scsi: Make SCSI quiesce and resume work reliably")
Signed-off-by: Bart Van Assche 
Acked-by: Tejun Heo 
Cc: Tejun Heo 
Cc: Hannes Reinecke 
Cc: Ming Lei 
Cc: Christoph Hellwig 
Cc: Johannes Thumshirn 
Cc: Oleksandr Natalenko 
Cc: Martin Steigerwald 
Cc: stable@vger.kernel.org # v4.15
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

block, bfq: put async queues for root bfq groups too

2018-04-12T10:31:11+00:00

[ Upstream commit 52257ffbfcaf58d247b13fb148e27ed17c33e526 ]

For each pair [device for which bfq is selected as I/O scheduler,
group in blkio/io], bfq maintains a corresponding bfq group. Each such
bfq group contains a set of async queues, with each async queue
created on demand, i.e., when some I/O request arrives for it.  On
creation, an async queue gets an extra reference, to make sure that
the queue is not freed as long as its bfq group exists.  Accordingly,
to allow the queue to be freed after the group exited, this extra
reference must released on group exit.

The above holds also for a bfq root group, i.e., for the bfq group
corresponding to the root blkio/io root for a given device. Yet, by
mistake, the references to the existing async queues of a root group
are not released when the latter exits. This causes a memory leak when
the instance of bfq for a given device exits. In a similar vein,
bfqg_stats_xfer_dead is not executed for a root group.

This commit fixes bfq_pd_offline so that the latter executes the above
missing operations for a root group too.

Reported-by: Holger Hoffstätte 
Reported-by: Guoqing Jiang 
Tested-by: Holger Hoffstätte 
Signed-off-by: Davide Ferrari 
Signed-off-by: Paolo Valente 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

blk-mq: fix kernel oops in blk_mq_tag_idle()

2018-04-12T10:31:10+00:00

[ Upstream commit 8ab0b7dc73e1b3e2987d42554b2bff503f692772 ]

HW queues may be unmapped in some cases, such as blk_mq_update_nr_hw_queues(),
then we need to check it before calling blk_mq_tag_idle(), otherwise
the following kernel oops can be triggered, so fix it by checking if
the hw queue is unmapped since it doesn't make sense to idle the tags
any more after hw queues are unmapped.

[  440.771298] Workqueue: nvme-wq nvme_rdma_del_ctrl_work [nvme_rdma]
[  440.779104] task: ffff894bae755ee0 ti: ffff893bf9bc8000 task.ti: ffff893bf9bc8000
[  440.788359] RIP: 0010:[]  [] __blk_mq_tag_idle+0x24/0x40
[  440.798697] RSP: 0018:ffff893bf9bcbd10  EFLAGS: 00010286
[  440.805538] RAX: 0000000000000000 RBX: ffff895bb131dc00 RCX: 000000000000011f
[  440.814426] RDX: 00000000ffffffff RSI: 0000000000000120 RDI: ffff895bb131dc00
[  440.823301] RBP: ffff893bf9bcbd10 R08: 000000000001b860 R09: 4a51d361c00c0000
[  440.832193] R10: b5907f32b4cc7003 R11: ffffd6cabfb57000 R12: ffff894bafd1e008
[  440.841091] R13: 0000000000000001 R14: ffff895baf770000 R15: 0000000000000080
[  440.849988] FS:  0000000000000000(0000) GS:ffff894bbdcc0000(0000) knlGS:0000000000000000
[  440.859955] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  440.867274] CR2: 0000000000000008 CR3: 000000103d098000 CR4: 00000000001407e0
[  440.876169] Call Trace:
[  440.879818]  [] blk_mq_exit_hctx+0xd8/0xe0
[  440.887051]  [] blk_mq_free_queue+0xf0/0x160
[  440.894465]  [] blk_cleanup_queue+0xd9/0x150
[  440.901881]  [] nvme_ns_remove+0x5b/0xb0 [nvme_core]
[  440.910068]  [] nvme_remove_namespaces+0x3b/0x60 [nvme_core]
[  440.919026]  [] __nvme_rdma_remove_ctrl+0x2b/0xb0 [nvme_rdma]
[  440.928079]  [] nvme_rdma_del_ctrl_work+0x17/0x20 [nvme_rdma]
[  440.937126]  [] process_one_work+0x17a/0x440
[  440.944517]  [] worker_thread+0x278/0x3c0
[  440.951607]  [] ? manage_workers.isra.24+0x2a0/0x2a0
[  440.959760]  [] kthread+0xcf/0xe0
[  440.966055]  [] ? insert_kthread_work+0x40/0x40
[  440.973715]  [] ret_from_fork+0x58/0x90
[  440.980586]  [] ? insert_kthread_work+0x40/0x40
[  440.988229] Code: 5b 41 5c 5d c3 66 90 0f 1f 44 00 00 48 8b 87 20 01 00 00 f0 0f ba 77 40 01 19 d2 85 d2 75 08 c3 0f 1f 80 00 00 00 00 55 48 89 e5  ff 48 08 48 8d 78 10 e8 7f 0f 05 00 5d c3 0f 1f 00 66 2e 0f
[  441.011620] RIP  [] __blk_mq_tag_idle+0x24/0x40
[  441.019301]  RSP 
[  441.024052] CR2: 0000000000000008

Reported-by: Zhang Yi 
Tested-by: Zhang Yi 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

blk-mq: fix race between updating nr_hw_queues and switching io sched

2018-04-12T10:31:07+00:00

[ Upstream commit fb350e0ad99359768e1e80b4784692031ec340e4 ]

In both elevator_switch_mq() and blk_mq_update_nr_hw_queues(), sched tags
can be allocated, and q->nr_hw_queue is used, and race is inevitable, for
example: blk_mq_init_sched() may trigger use-after-free on hctx, which is
freed in blk_mq_realloc_hw_ctxs() when nr_hw_queues is decreased.

This patch fixes the race be holding q->sysfs_lock.

Reviewed-by: Christoph Hellwig 
Reported-by: Yi Zhang 
Tested-by: Yi Zhang 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

blk-mq: avoid to map CPU into stale hw queue

2018-04-12T10:31:07+00:00

[ Upstream commit 7d4901a90d02500c8011472a060f9b2e60e6e605 ]

blk_mq_pci_map_queues() may not map one CPU into any hw queue, but its
previous map isn't cleared yet, and may point to one stale hw queue
index.

This patch fixes the following issue by clearing the mapping table before
setting it up in blk_mq_pci_map_queues().

This patches fixes this following issue reported by Zhang Yi:

[  101.202734] BUG: unable to handle kernel NULL pointer dereference at 0000000094d3013f
[  101.211487] IP: blk_mq_map_swqueue+0xbc/0x200
[  101.216346] PGD 0 P4D 0
[  101.219171] Oops: 0000 [#1] SMP
[  101.222674] Modules linked in: sunrpc ipmi_ssif vfat fat intel_rapl sb_edac x86_pkg_temp_thermal intel_powerclamp coretemp kvm_intel kvm irqbypass crct10dif_pclmul crc32_pclmul ghash_clmulni_intel intel_cstate intel_uncore mxm_wmi intel_rapl_perf iTCO_wdt ipmi_si ipmi_devintf pcspkr iTCO_vendor_support sg dcdbas ipmi_msghandler wmi mei_me lpc_ich shpchp mei acpi_power_meter dm_multipath ip_tables xfs libcrc32c sd_mod mgag200 i2c_algo_bit drm_kms_helper syscopyarea sysfillrect sysimgblt fb_sys_fops ttm drm ahci libahci crc32c_intel libata tg3 nvme nvme_core megaraid_sas ptp i2c_core pps_core dm_mirror dm_region_hash dm_log dm_mod
[  101.284881] CPU: 0 PID: 504 Comm: kworker/u25:5 Not tainted 4.15.0-rc2 #1
[  101.292455] Hardware name: Dell Inc. PowerEdge R730xd/072T6D, BIOS 2.5.5 08/16/2017
[  101.301001] Workqueue: nvme-wq nvme_reset_work [nvme]
[  101.306636] task: 00000000f2c53190 task.stack: 000000002da874f9
[  101.313241] RIP: 0010:blk_mq_map_swqueue+0xbc/0x200
[  101.318681] RSP: 0018:ffffc9000234fd70 EFLAGS: 00010282
[  101.324511] RAX: ffff88047ffc9480 RBX: ffff88047e130850 RCX: 0000000000000000
[  101.332471] RDX: ffffe8ffffd40580 RSI: ffff88047e509b40 RDI: ffff88046f37a008
[  101.340432] RBP: 000000000000000b R08: ffff88046f37a008 R09: 0000000011f94280
[  101.348392] R10: ffff88047ffd4d00 R11: 0000000000000000 R12: ffff88046f37a008
[  101.356353] R13: ffff88047e130f38 R14: 000000000000000b R15: ffff88046f37a558
[  101.364314] FS:  0000000000000000(0000) GS:ffff880277c00000(0000) knlGS:0000000000000000
[  101.373342] CS:  0010 DS: 0000 ES: 0000 CR0: 0000000080050033
[  101.379753] CR2: 0000000000000098 CR3: 000000047f409004 CR4: 00000000001606f0
[  101.387714] Call Trace:
[  101.390445]  blk_mq_update_nr_hw_queues+0xbf/0x130
[  101.395791]  nvme_reset_work+0x6f4/0xc06 [nvme]
[  101.400848]  ? pick_next_task_fair+0x290/0x5f0
[  101.405807]  ? __switch_to+0x1f5/0x430
[  101.409988]  ? put_prev_entity+0x2f/0xd0
[  101.414365]  process_one_work+0x141/0x340
[  101.418836]  worker_thread+0x47/0x3e0
[  101.422921]  kthread+0xf5/0x130
[  101.426424]  ? rescuer_thread+0x380/0x380
[  101.430896]  ? kthread_associate_blkcg+0x90/0x90
[  101.436048]  ret_from_fork+0x1f/0x30
[  101.440034] Code: 48 83 3c ca 00 0f 84 2b 01 00 00 48 63 cd 48 8b 93 10 01 00 00 8b 0c 88 48 8b 83 20 01 00 00 4a 03 14 f5 60 04 af 81 48 8b 0c c8 <48> 8b 81 98 00 00 00 f0 4c 0f ab 30 8b 81 f8 00 00 00 89 42 44
[  101.461116] RIP: blk_mq_map_swqueue+0xbc/0x200 RSP: ffffc9000234fd70
[  101.468205] CR2: 0000000000000098
[  101.471907] ---[ end trace 5fe710f98228a3ca ]---
[  101.482489] Kernel panic - not syncing: Fatal exception
[  101.488505] Kernel Offset: disabled
[  101.497752] ---[ end Kernel panic - not syncing: Fatal exception

Reviewed-by: Christoph Hellwig 
Suggested-by: Christoph Hellwig 
Reported-by: Yi Zhang 
Tested-by: Yi Zhang 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin 
Signed-off-by: Greg Kroah-Hartman

Fix slab name "biovec-(1<<(21-12))"

2018-04-08T12:27:40+00:00

commit bd5c4facf59648581d2f1692dad7b107bf429954 upstream.

I'm getting a slab named "biovec-(1<<(21-12))". It is caused by unintended
expansion of the macro BIO_MAX_PAGES. This patch renames it to biovec-max.

Signed-off-by: Mikulas Patocka 
Cc: stable@vger.kernel.org	# v4.14+
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

partitions/msdos: Unable to mount UFS 44bsd partitions

2018-04-08T12:27:34+00:00

commit 5f15684bd5e5ef39d4337988864fec8012471dda upstream.

UFS partitions from newer versions of FreeBSD 10 and 11 use relative
addressing for their subpartitions. But older versions of FreeBSD still
use absolute addressing just like OpenBSD and NetBSD.

Instead of simply testing for a FreeBSD partition, the code needs to
also test if the starting offset of the C subpartition is zero.

https://bugzilla.kernel.org/show_bug.cgi?id=197733

Signed-off-by: Richard Narron 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

blk-mq: don't call io sched's .requeue_request when requeueing rq to ->dispatch

2018-03-09T06:47:44+00:00

commit 105976f517791aed3b11f8f53b308a2069d42055 upstream.

__blk_mq_requeue_request() covers two cases:

- one is that the requeued request is added to hctx->dispatch, such as
blk_mq_dispatch_rq_list()

- another case is that the request is requeued to io scheduler, such as
blk_mq_requeue_request().

We should call io sched's .requeue_request callback only for the 2nd
case.

Cc: Paolo Valente 
Cc: Omar Sandoval 
Fixes: bd166ef183c2 ("blk-mq-sched: add framework for MQ capable IO schedulers")
Cc: stable@vger.kernel.org
Reviewed-by: Bart Van Assche 
Acked-by: Paolo Valente 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman