linux-stable.git/block, branch linux-5.0.y

block: pass page to xen_biovec_phys_mergeable

2019-05-31T13:45:13+00:00

[ Upstream commit 0383ad4374f7ad7edd925a2ee4753035c3f5508a ]

xen_biovec_phys_mergeable() only needs .bv_page of the 2nd bio bvec
for checking if the two bvecs can be merged, so pass page to
xen_biovec_phys_mergeable() directly.

No function change.

Cc: ris Ostrovsky 
Cc: Juergen Gross 
Cc: xen-devel@lists.xenproject.org
Cc: Omar Sandoval 
Cc: Christoph Hellwig 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Boris Ostrovsky 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: sed-opal: fix IOC_OPAL_ENABLE_DISABLE_MBR

2019-05-31T13:45:09+00:00

[ Upstream commit 78bf47353b0041865564deeed257a54f047c2fdc ]

The implementation of IOC_OPAL_ENABLE_DISABLE_MBR handled the value
opal_mbr_data.enable_disable incorrectly: enable_disable is expected
to be one of OPAL_MBR_ENABLE(0) or OPAL_MBR_DISABLE(1). enable_disable
was passed directly to set_mbr_done and set_mbr_enable_disable where
is was interpreted as either OPAL_TRUE(1) or OPAL_FALSE(0). The end
result was that calling IOC_OPAL_ENABLE_DISABLE_MBR with OPAL_MBR_ENABLE
actually disabled the shadow MBR and vice versa.

This patch adds correct conversion from OPAL_MBR_DISABLE/ENABLE to
OPAL_FALSE/TRUE. The change affects existing programs using
IOC_OPAL_ENABLE_DISABLE_MBR but this is typically used only once when
setting up an Opal drive.

Acked-by: Jon Derrick 
Reviewed-by: Christoph Hellwig 
Reviewed-by: Scott Bauer 
Signed-off-by: David Kozub 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

block: fix use-after-free on gendisk

2019-05-31T13:45:02+00:00

[ Upstream commit 2c88e3c7ec32d7a40cc7c9b4a487cf90e4671bdd ]

commit 2da78092dda "block: Fix dev_t minor allocation lifetime"
specifically moved blk_free_devt(dev->devt) call to part_release()
to avoid reallocating device number before the device is fully
shutdown.

However, it can cause use-after-free on gendisk in get_gendisk().
We use md device as example to show the race scenes:

Process1		Worker			Process2
md_free
						blkdev_open
del_gendisk
  add delete_partition_work_fn() to wq
  						__blkdev_get
						get_gendisk
put_disk
  disk_release
    kfree(disk)
    						find part from ext_devt_idr
						get_disk_and_module(disk)
    					  	cause use after free

    			delete_partition_work_fn
			put_device(part)
    		  	part_release
		    	remove part from ext_devt_idr

Before  is removed from ext_devt_idr by
delete_partition_work_fn(), we can find the devt and then access
gendisk by hd_struct pointer. But, if we access the gendisk after
it have been freed, it can cause in use-after-freeon gendisk in
get_gendisk().

We fix this by adding a new helper blk_invalidate_devt() in
delete_partition() and del_gendisk(). It replaces hd_struct
pointer in idr with value 'NULL', and deletes the entry from
idr in part_release() as we do now.

Thanks to Jan Kara for providing the solution and more clear comments
for the code.

Fixes: 2da78092dda1 ("block: Fix dev_t minor allocation lifetime")
Cc: Al Viro 
Reviewed-by: Bart Van Assche 
Reviewed-by: Keith Busch 
Reviewed-by: Jan Kara 
Suggested-by: Jan Kara 
Signed-off-by: Yufen Yu 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq: grab .q_usage_counter when queuing request from plug code path

2019-05-31T13:44:53+00:00

[ Upstream commit e87eb301bee183d82bb3d04bd71b6660889a2588 ]

Just like aio/io_uring, we need to grab 2 refcount for queuing one
request, one is for submission, another is for completion.

If the request isn't queued from plug code path, the refcount grabbed
in generic_make_request() serves for submission. In theroy, this
refcount should have been released after the sumission(async run queue)
is done. blk_freeze_queue() works with blk_sync_queue() together
for avoiding race between cleanup queue and IO submission, given async
run queue activities are canceled because hctx->run_work is scheduled with
the refcount held, so it is fine to not hold the refcount when
running the run queue work function for dispatch IO.

However, if request is staggered into plug list, and finally queued
from plug code path, the refcount in submission side is actually missed.
And we may start to run queue after queue is removed because the queue's
kobject refcount isn't guaranteed to be grabbed in flushing plug list
context, then kernel oops is triggered, see the following race:

blk_mq_flush_plug_list():
        blk_mq_sched_insert_requests()
                insert requests to sw queue or scheduler queue
                blk_mq_run_hw_queue

Because of concurrent run queue, all requests inserted above may be
completed before calling the above blk_mq_run_hw_queue. Then queue can
be freed during the above blk_mq_run_hw_queue().

Fixes the issue by grab .q_usage_counter before calling
blk_mq_sched_insert_requests() in blk_mq_flush_plug_list(). This way is
safe because the queue is absolutely alive before inserting request.

Cc: Dongli Zhang 
Cc: James Smart 
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen ,
Cc: Christoph Hellwig ,
Cc: James E . J . Bottomley ,
Reviewed-by: Bart Van Assche 
Tested-by: James Smart 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq: split blk_mq_alloc_and_init_hctx into two parts

2019-05-31T13:44:53+00:00

[ Upstream commit 7c6c5b7c9186e3fb5b10afb8e5f710ae661144c6 ]

Split blk_mq_alloc_and_init_hctx into two parts, and one is
blk_mq_alloc_hctx() for allocating all hctx resources, another
is blk_mq_init_hctx() for initializing hctx, which serves as
counter-part of blk_mq_exit_hctx().

Cc: Dongli Zhang 
Cc: James Smart 
Cc: Bart Van Assche 
Cc: linux-scsi@vger.kernel.org
Cc: Martin K . Petersen 
Cc: Christoph Hellwig 
Cc: James E . J . Bottomley 
Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Tested-by: James Smart 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq: free hw queue's resource in hctx's release handler

2019-05-25T16:21:57+00:00

commit c7e2d94b3d1634988a95ac4d77a72dc7487ece06 upstream.

Once blk_cleanup_queue() returns, tags shouldn't be used any more,
because blk_mq_free_tag_set() may be called. Commit 45a9c9d909b2
("blk-mq: Fix a use-after-free") fixes this issue exactly.

However, that commit introduces another issue. Before 45a9c9d909b2,
we are allowed to run queue during cleaning up queue if the queue's
kobj refcount is held. After that commit, queue can't be run during
queue cleaning up, otherwise oops can be triggered easily because
some fields of hctx are freed by blk_mq_free_queue() in blk_cleanup_queue().

We have invented ways for addressing this kind of issue before, such as:

	8dc765d438f1 ("SCSI: fix queue cleanup race before queue initialization is done")
	c2856ae2f315 ("blk-mq: quiesce queue before freeing queue")

But still can't cover all cases, recently James reports another such
kind of issue:

	https://marc.info/?l=linux-scsi&m=155389088124782&w=2

This issue can be quite hard to address by previous way, given
scsi_run_queue() may run requeues for other LUNs.

Fixes the above issue by freeing hctx's resources in its release handler, and this
way is safe becasue tags isn't needed for freeing such hctx resource.

This approach follows typical design pattern wrt. kobject's release handler.

Cc: Dongli Zhang 
Cc: James Smart 
Cc: Bart Van Assche 
Cc: linux-scsi@vger.kernel.org,
Cc: Martin K . Petersen ,
Cc: Christoph Hellwig ,
Cc: James E . J . Bottomley ,
Reported-by: James Smart 
Fixes: 45a9c9d909b2 ("blk-mq: Fix a use-after-free")
Cc: stable@vger.kernel.org
Reviewed-by: Hannes Reinecke 
Reviewed-by: Christoph Hellwig 
Tested-by: James Smart 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Greg Kroah-Hartman

bfq: update internal depth state when queue depth changes

2019-05-16T17:40:10+00:00

[ Upstream commit 77f1e0a52d26242b6c2dba019f6ebebfb9ff701e ]

A previous commit moved the shallow depth and BFQ depth map calculations
to be done at init time, moving it outside of the hotter IO path. This
potentially causes hangs if the users changes the depth of the scheduler
map, by writing to the 'nr_requests' sysfs file for that device.

Add a blk-mq-sched hook that allows blk-mq to inform the scheduler if
the depth changes, so that the scheduler can update its internal state.

Tested-by: Kai Krakow 
Reported-by: Paolo Valente 
Fixes: f0635b8a416e ("bfq: calculate shallow depths at init time")
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq: introduce blk_mq_complete_request_sync()

2019-05-10T16:36:11+00:00

[ Upstream commit 1b8f21b74c3c9c82fce5a751d7aefb7cc0b8d33d ]

In NVMe's error handler, follows the typical steps of tearing down
hardware for recovering controller:

1) stop blk_mq hw queues
2) stop the real hw queues
3) cancel in-flight requests via
	blk_mq_tagset_busy_iter(tags, cancel_request, ...)
cancel_request():
	mark the request as abort
	blk_mq_complete_request(req);
4) destroy real hw queues

However, there may be race between #3 and #4, because blk_mq_complete_request()
may run q->mq_ops->complete(rq) remotelly and asynchronously, and
->complete(rq) may be run after #4.

This patch introduces blk_mq_complete_request_sync() for fixing the
above race.

Cc: Sagi Grimberg 
Cc: Bart Van Assche 
Cc: James Smart 
Cc: linux-nvme@lists.infradead.org
Reviewed-by: Keith Busch 
Reviewed-by: Christoph Hellwig 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin

blk-mq: do not reset plug->rq_count before the list is sorted

2019-05-08T05:22:53+00:00

[ Upstream commit bcc816dfe51ab86ca94663c7b225f2d6eb0fddb9 ]

We would never be able to sort the list if we first reset plug->rq_count
which is used in conditional check later.

Fixes: ce5b009cff19 ("block: improve logic around when to sort a plug list")
Reviewed-by: Ming Lei 
Signed-off-by: Dongli Zhang 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin (Microsoft)

block: use blk_free_flush_queue() to free hctx->fq in blk_mq_init_hctx

2019-05-08T05:22:52+00:00

[ Upstream commit b9a1ff504b9492ad6beb7d5606e0e3365d4d8499 ]

kfree() can leak the hctx->fq->flush_rq field.

Reviewed-by: Ming Lei 
Signed-off-by: Shenghui Wang 
Signed-off-by: Jens Axboe 
Signed-off-by: Sasha Levin (Microsoft)