linux.git/block, branch v5.4-rc2

block: sed-opal: fix sparse warning: convert __be64 data

2019-10-03T20:21:32+00:00

sparse warns about incorrect type when using __be64 data.
It is not being converted to CPU-endian but it should be.

Fixes these sparse warnings:

../block/sed-opal.c:375:20: warning: incorrect type in assignment (different base types)
../block/sed-opal.c:375:20:    expected unsigned long long [usertype] align
../block/sed-opal.c:375:20:    got restricted __be64 const [usertype] alignment_granularity
../block/sed-opal.c:376:25: warning: incorrect type in assignment (different base types)
../block/sed-opal.c:376:25:    expected unsigned long long [usertype] lowest_lba
../block/sed-opal.c:376:25:    got restricted __be64 const [usertype] lowest_aligned_lba

Fixes: 455a7b238cd6 ("block: Add Sed-opal library")
Cc: Scott Bauer 
Cc: Rafael Antognolli 
Cc: linux-block@vger.kernel.org
Reviewed-by: Jon Derrick 
Signed-off-by: Randy Dunlap 
Signed-off-by: Jens Axboe

block: sed-opal: fix sparse warning: obsolete array init.

2019-10-03T20:21:30+00:00

Fix sparse warning: (missing '=')
../block/sed-opal.c:133:17: warning: obsolete array initializer, use C99 syntax

Fixes: ff91064ea37c ("block: sed-opal: check size of shadow mbr")
Cc: linux-block@vger.kernel.org
Cc: Jonas Rabenstein 
Cc: David Kozub 
Reviewed-by: Scott Bauer 
Reviewed-by:  Revanth Rajashekar 
Signed-off-by: Randy Dunlap 
Signed-off-by: Jens Axboe

blk-mq: apply normal plugging for HDD

2019-09-27T17:40:21+00:00

Some HDD drive may expose multiple hardware queues, such as MegraRaid.
Let's apply the normal plugging for such devices because sequential IO
may benefit a lot from plug merging.

Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Dave Chinner 
Reviewed-by: Damien Le Moal 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe

blk-mq: honor IO scheduler for multiqueue devices

2019-09-27T17:38:28+00:00

If a device is using multiple queues, the IO scheduler may be bypassed.
This may hurt performance for some slow MQ devices, and it also breaks
zoned devices which depend on mq-deadline for respecting the write order
in one zone.

Don't bypass io scheduler if we have one setup.

This patch can double sequential write performance basically on MQ
scsi_debug when mq-deadline is applied.

Cc: Bart Van Assche 
Cc: Hannes Reinecke 
Cc: Dave Chinner 
Reviewed-by: Javier González 
Reviewed-by: Damien Le Moal 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe

block: fix null pointer dereference in blk_mq_rq_timed_out()

2019-09-27T13:01:25+00:00

We got a null pointer deference BUG_ON in blk_mq_rq_timed_out()
as following:

[  108.825472] BUG: kernel NULL pointer dereference, address: 0000000000000040
[  108.827059] PGD 0 P4D 0
[  108.827313] Oops: 0000 [#1] SMP PTI
[  108.827657] CPU: 6 PID: 198 Comm: kworker/6:1H Not tainted 5.3.0-rc8+ #431
[  108.829503] Workqueue: kblockd blk_mq_timeout_work
[  108.829913] RIP: 0010:blk_mq_check_expired+0x258/0x330
[  108.838191] Call Trace:
[  108.838406]  bt_iter+0x74/0x80
[  108.838665]  blk_mq_queue_tag_busy_iter+0x204/0x450
[  108.839074]  ? __switch_to_asm+0x34/0x70
[  108.839405]  ? blk_mq_stop_hw_queue+0x40/0x40
[  108.839823]  ? blk_mq_stop_hw_queue+0x40/0x40
[  108.840273]  ? syscall_return_via_sysret+0xf/0x7f
[  108.840732]  blk_mq_timeout_work+0x74/0x200
[  108.841151]  process_one_work+0x297/0x680
[  108.841550]  worker_thread+0x29c/0x6f0
[  108.841926]  ? rescuer_thread+0x580/0x580
[  108.842344]  kthread+0x16a/0x1a0
[  108.842666]  ? kthread_flush_work+0x170/0x170
[  108.843100]  ret_from_fork+0x35/0x40

The bug is caused by the race between timeout handle and completion for
flush request.

When timeout handle function blk_mq_rq_timed_out() try to read
'req->q->mq_ops', the 'req' have completed and reinitiated by next
flush request, which would call blk_rq_init() to clear 'req' as 0.

After commit 12f5b93145 ("blk-mq: Remove generation seqeunce"),
normal requests lifetime are protected by refcount. Until 'rq->ref'
drop to zero, the request can really be free. Thus, these requests
cannot been reused before timeout handle finish.

However, flush request has defined .end_io and rq->end_io() is still
called even if 'rq->ref' doesn't drop to zero. After that, the 'flush_rq'
can be reused by the next flush request handle, resulting in null
pointer deference BUG ON.

We fix this problem by covering flush request with 'rq->ref'.
If the refcount is not zero, flush_end_io() return and wait the
last holder recall it. To record the request status, we add a new
entry 'rq_status', which will be used in flush_end_io().

Cc: Christoph Hellwig 
Cc: Keith Busch 
Cc: Bart Van Assche 
Cc: stable@vger.kernel.org # v4.18+
Reviewed-by: Ming Lei 
Reviewed-by: Bob Liu 
Signed-off-by: Yufen Yu 

-------
v2:
 - move rq_status from struct request to struct blk_flush_queue
v3:
 - remove unnecessary '{}' pair.
v4:
 - let spinlock to protect 'fq->rq_status'
v5:
 - move rq_status after flush_running_idx member of struct blk_flush_queue
Signed-off-by: Jens Axboe

rq-qos: get rid of redundant wbt_update_limits()

2019-09-27T07:13:10+00:00

We have updated limits after calling wbt_set_min_lat(). No need to
update again.

Reviewed-by: Bob Liu 
Signed-off-by: Yufen Yu 
Signed-off-by: Jens Axboe

iocost: bump up default latency targets for hard disks

2019-09-26T07:12:01+00:00

The default hard disk param sets latency targets at 50ms.  As the
default target percentiles are zero, these don't directly regulate
vrate; however, they're still used to calculate the period length -
100ms in this case.

This is excessively low.  A SATA drive with QD32 saturated with random
IOs can easily reach avg completion latency of several hundred msecs.
A period duration which is substantially lower than avg completion
latency can lead to wildly fluctuating vrate.

Let's bump up the default latency targets to 250ms so that the period
duration is sufficiently long.

Signed-off-by: Tejun Heo 
Signed-off-by: Jens Axboe

iocost: improve nr_lagging handling

2019-09-26T07:12:00+00:00

Some IOs may span multiple periods.  As latencies are collected on
completion, the inbetween periods won't register them and may
incorrectly decide to increase vrate.  nr_lagging tracks these IOs to
avoid those situations.  Currently, whenever there are IOs which are
spanning from the previous period, busy_level is reset to 0 if
negative thus suppressing vrate increase.

This has the following two problems.

* When latency target percentiles aren't set, vrate adjustment should
  only be governed by queue depth depletion; however, the current code
  keeps nr_lagging active which pulls in latency results and can keep
  down vrate unexpectedly.

* When lagging condition is detected, it resets the entire negative
  busy_level.  This turned out to be way too aggressive on some
  devices which sometimes experience extended latencies on a small
  subset of commands.  In addition, a lagging IO will be accounted as
  latency target miss on completion anyway and resetting busy_level
  amplifies its impact unnecessarily.

This patch fixes the above two problems by disabling nr_lagging
counting when latency target percentiles aren't set and blocking vrate
increases when there are lagging IOs while leaving busy_level as-is.

Signed-off-by: Tejun Heo 
Signed-off-by: Jens Axboe

iocost: better trace vrate changes

2019-09-26T07:11:58+00:00

vrate_adj tracepoint traces vrate changes; however, it does so only
when busy_level is non-zero.  busy_level turning to zero can sometimes
be as interesting an event.  This patch also enables vrate_adj
tracepoint on other vrate related events - busy_level changes and
non-zero nr_lagging.

Signed-off-by: Tejun Heo 
Signed-off-by: Jens Axboe

block: don't release queue's sysfs lock during switching elevator

2019-09-26T06:45:51+00:00

cecf5d87ff20 ("block: split .sysfs_lock into two locks") starts to
release & acquire sysfs_lock before registering/un-registering elevator
queue during switching elevator for avoiding potential deadlock from
showing & storing 'queue/iosched' attributes and removing elevator's
kobject.

Turns out there isn't such deadlock because 'q->sysfs_lock' isn't
required in .show & .store of queue/iosched's attributes, and just
elevator's sysfs lock is acquired in elv_iosched_store() and
elv_iosched_show(). So it is safe to hold queue's sysfs lock when
registering/un-registering elevator queue.

The biggest issue is that commit cecf5d87ff20 assumes that concurrent
write on 'queue/scheduler' can't happen. However, this assumption isn't
true, because kernfs_fop_write() only guarantees that concurrent write
aren't called on the same open file, but the write could be from
different open on the file. So we can't release & re-acquire queue's
sysfs lock during switching elevator, otherwise use-after-free on
elevator could be triggered.

Fixes the issue by not releasing queue's sysfs lock during switching
elevator.

Fixes: cecf5d87ff20 ("block: split .sysfs_lock into two locks")
Cc: Christoph Hellwig 
Cc: Hannes Reinecke 
Cc: Greg KH 
Cc: Mike Snitzer 
Reviewed-by: Bart Van Assche 
Signed-off-by: Ming Lei 
Signed-off-by: Jens Axboe