<feed xmlns='http://www.w3.org/2005/Atom'>
<title>linux-stable.git/block, branch linux-4.12.y</title>
<subtitle>Linux kernel stable tree</subtitle>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/'/>
<id>https://git.tavy.me/linux-stable.git/</id>
<updated>2017-08-25T00:15:03+00:00</updated>
<entry>
<title>blk-mq-pci: add a fallback when pci_irq_get_affinity returns NULL</title>
<updated>2017-08-25T00:15:03+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2017-08-17T10:24:47+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=a77b5b81ad592f4afe4418241cdd521a265f90f0'/>
<id>a77b5b81ad592f4afe4418241cdd521a265f90f0</id>
<content type='text'>
commit c005390374957baacbc38eef96ea360559510aa7 upstream.

While pci_irq_get_affinity should never fail for SMP kernels that
implement the affinity mapping, it will always return NULL in the
UP case, so provide a fallback mapping of all queues to CPU 0 in
that case.
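
A sketch of the resulting mapping logic (simplified; declarations and
error handling trimmed):

  for (queue = 0; queue &lt; set-&gt;nr_hw_queues; queue++) {
          mask = pci_irq_get_affinity(pdev, queue);
          if (!mask)
                  goto fallback;
          for_each_cpu(cpu, mask)
                  set-&gt;mq_map[cpu] = queue;
  }
  return 0;

fallback:
  /* UP, or no affinity info: map every CPU to queue 0 */
  for_each_possible_cpu(cpu)
          set-&gt;mq_map[cpu] = 0;
  return 0;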

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Omar Sandoval &lt;osandov@fb.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>block: Make blk_mq_delay_kick_requeue_list() rerun the queue at a quiet time</title>
<updated>2017-08-16T20:46:57+00:00</updated>
<author>
<name>Bart Van Assche</name>
<email>bart.vanassche@wdc.com</email>
</author>
<published>2017-08-09T18:28:06+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=7926348676f575024cb7d7e302b6584331208f01'/>
<id>7926348676f575024cb7d7e302b6584331208f01</id>
<content type='text'>
commit d4acf3650c7c968f46ad932b9a25d1cc24cf4998 upstream.

The blk_mq_delay_kick_requeue_list() function is used by the device
mapper, and only by the device mapper, to rerun the queue and requeue
list after a delay. It is called once per request that gets requeued.
Modify this function such that the queue is run once per path-change
event instead of once per requeued request.
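
The change amounts to rearming one delayed work instead of queueing a
run per call; a minimal sketch:

  void blk_mq_delay_kick_requeue_list(struct request_queue *q,
                                      unsigned long msecs)
  {
          /* mod_delayed_work coalesces back-to-back calls into one run */
          kblockd_mod_delayed_work_on(WORK_CPU_UNBOUND, &amp;q-&gt;requeue_work,
                                      msecs_to_jiffies(msecs));
  }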

Fixes: commit 2849450ad39d ("blk-mq: introduce blk_mq_delay_kick_requeue_list()")
Signed-off-by: Bart Van Assche &lt;bart.vanassche@wdc.com&gt;
Cc: Mike Snitzer &lt;snitzer@redhat.com&gt;
Cc: Laurence Oberman &lt;loberman@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>block: disable runtime-pm for blk-mq</title>
<updated>2017-08-11T15:33:55+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2017-07-21T11:46:10+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=3c6a05484d0daca74f880d9559e28236330525cf'/>
<id>3c6a05484d0daca74f880d9559e28236330525cf</id>
<content type='text'>
commit 765e40b675a9566459ddcb8358ad16f3b8344bbe upstream.

The blk-mq code lacks support for looking at the rpm_status field, tracking
active requests and the RQF_PM flag.

With the default switch to blk-mq for SCSI, people have started to run
into suspend / resume issues because of this, so make sure we disable
the runtime PM functionality until it is properly implemented.
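
A sketch of the gate at runtime-PM init time (simplified):

  void blk_pm_runtime_init(struct request_queue *q, struct device *dev)
  {
          /* blk-mq doesn't track rpm_status / RQF_PM yet: keep PM off */
          if (q-&gt;mq_ops) {
                  pm_runtime_disable(dev);
                  return;
          }
          ...
  }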

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Ming Lei &lt;ming.lei@redhat.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Oleksandr Natalenko &lt;oleksandr@natalenko.name&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>blk-mq: Create hctx for each present CPU</title>
<updated>2017-08-11T15:33:55+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2017-06-26T10:20:57+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=1b5a7455d345b223d3a4658a9e5fce985b7998c1'/>
<id>1b5a7455d345b223d3a4658a9e5fce985b7998c1</id>
<content type='text'>
commit 4b855ad37194f7bdbb200ce7a1c7051fecb56a08 upstream.

Currently we only create an hctx for each online CPU, which can lead
to a lot of churn due to frequent soft offline / online operations.
Instead allocate one for each present CPU to avoid this and to
dramatically simplify the code.
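
As a sketch (alloc_hctx is a placeholder; the real patch reworks the
hctx allocation and hotplug paths), allocation keys off the present
mask instead of the online mask:

  for_each_present_cpu(cpu) {
          unsigned int idx = set-&gt;mq_map[cpu];    /* cpu -&gt; hw queue */
          if (!q-&gt;queue_hw_ctx[idx])
                  q-&gt;queue_hw_ctx[idx] = alloc_hctx(set, idx);
  }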

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Keith Busch &lt;keith.busch@intel.com&gt;
Cc: linux-block@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Link: http://lkml.kernel.org/r/20170626102058.10200-3-hch@lst.de
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Oleksandr Natalenko &lt;oleksandr@natalenko.name&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>blk-mq: Include all present CPUs in the default queue mapping</title>
<updated>2017-08-11T15:33:54+00:00</updated>
<author>
<name>Christoph Hellwig</name>
<email>hch@lst.de</email>
</author>
<published>2017-06-26T10:20:56+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=19da5f4ffee6358b396fc7f609eae0e02e1ded97'/>
<id>19da5f4ffee6358b396fc7f609eae0e02e1ded97</id>
<content type='text'>
commit 5f042e7cbd9ebd3580077dcdc21f35e68c2adf5f upstream.

This way we get a nice distribution that is independent of the
current CPU online / offline state.
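
In sketch form (map and nr_queues are placeholder names), the default
mapping becomes a round-robin over the present mask rather than the
online mask:

  queue = 0;
  for_each_present_cpu(cpu)
          map[cpu] = queue++ % nr_queues;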

Signed-off-by: Christoph Hellwig &lt;hch@lst.de&gt;
Reviewed-by: Jens Axboe &lt;axboe@kernel.dk&gt;
Cc: Keith Busch &lt;keith.busch@intel.com&gt;
Cc: linux-block@vger.kernel.org
Cc: linux-nvme@lists.infradead.org
Link: http://lkml.kernel.org/r/20170626102058.10200-2-hch@lst.de
Signed-off-by: Thomas Gleixner &lt;tglx@linutronix.de&gt;
Cc: Oleksandr Natalenko &lt;oleksandr@natalenko.name&gt;
Cc: Mike Galbraith &lt;efault@gmx.de&gt;
Signed-off-by: Greg Kroah-Hartman &lt;gregkh@linuxfoundation.org&gt;

</content>
</entry>
<entry>
<title>block: provide bio_uninit() for freeing integrity/task associations</title>
<updated>2017-06-28T21:30:13+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2017-06-28T21:30:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=9ae3b3f52c62ddd5eb12c57f195f4f38121faa01'/>
<id>9ae3b3f52c62ddd5eb12c57f195f4f38121faa01</id>
<content type='text'>
Wen reports significant memory leaks with DIF and O_DIRECT:

"With an nvme device + T10 enabled, on a system with 256GB we started
logging /proc/meminfo &amp; /proc/slabinfo every minute, and in an hour
it increased by 15968128 kB or ~15+GB, approximately 256 MB / minute
leaking.

/proc/meminfo | grep SUnreclaim...

SUnreclaim:      6752128 kB
SUnreclaim:      6874880 kB
SUnreclaim:      7238080 kB
....
SUnreclaim:     22307264 kB
SUnreclaim:     22485888 kB
SUnreclaim:     22720256 kB

When testcases with T10 enabled call into __blkdev_direct_IO_simple,
the code doesn't free the memory allocated by bio_integrity_alloc. The
patch fixes the issue. HTX has been run for 60+ hours without failure."

Since __blkdev_direct_IO_simple() allocates the bio on the stack, it
doesn't go through the regular bio free. This means that any ancillary
data allocated with the bio through the stack is not freed. Hence, we
can leak the integrity data associated with the bio, if the device is
using DIF/DIX.

Fix this by providing a bio_uninit() and export it, so that we can use
it to free this data. Note that this is a minimal fix for this issue.
Any current user of bios that are allocated outside of
bio_alloc_bioset() suffers from this issue, most notably some drivers.
We will fix those in a more comprehensive patch for 4.13. This also
means that the commit marked as being fixed by this isn't the real
culprit; it's just the most obvious one out there.
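
Usage for an on-stack bio then looks roughly like this:

  struct bio bio;

  bio_init(&amp;bio, vecs, nr_pages);
  /* ... set up, submit and wait for the bio ... */
  bio_uninit(&amp;bio);  /* frees ancillary data such as integrity payloads */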

Fixes: 542ff7bf18c6 ("block: new direct I/O implementation")
Reported-by: Wen Xiong &lt;wenxiong@linux.vnet.ibm.com&gt;
Reviewed-by: Christoph Hellwig &lt;hch@lst.de&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>blk-mq: fix performance regression with shared tags</title>
<updated>2017-06-21T16:17:49+00:00</updated>
<author>
<name>Jens Axboe</name>
<email>axboe@kernel.dk</email>
</author>
<published>2017-06-20T23:56:13+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=8e8320c9315c47a6a090188720ccff32a6a6ba18'/>
<id>8e8320c9315c47a6a090188720ccff32a6a6ba18</id>
<content type='text'>
If we have shared tags enabled, then every IO completion will trigger
a full loop of every queue belonging to a tag set, and every hardware
queue for each of those queues, even if nothing needs to be done.
This causes a massive performance regression if you have a lot of
shared devices.

Instead of doing this huge full scan on every IO, add an atomic
counter to the main queue that tracks how many hardware queues have
been marked as needing a restart. With that, we can avoid looking for
restartable queues when we don't have to.
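
In sketch form (simplified; details differ in the actual patch):

  /* when marking a hctx as needing a restart */
  if (!test_and_set_bit(BLK_MQ_S_SCHED_RESTART, &amp;hctx-&gt;state))
          atomic_inc(&amp;hctx-&gt;queue-&gt;shared_hctx_restart);

  /* on IO completion: skip the tag-set-wide scan when nothing is marked */
  if (!atomic_read(&amp;q-&gt;shared_hctx_restart))
          return;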

Max reports that this restores performance. Before this patch, 4K I/O
was limited to 22-23K IOPS. With the patch, we are running at
950-970K IOPS.

Fixes: 6d8c6c0f97ad ("blk-mq: Restart a single queue if tag sets are shared")
Reported-by: Max Gurtovoy &lt;maxg@mellanox.com&gt;
Tested-by: Max Gurtovoy &lt;maxg@mellanox.com&gt;
Reviewed-by: Bart Van Assche &lt;bart.vanassche@sandisk.com&gt;
Tested-by: Bart Van Assche &lt;bart.vanassche@wdc.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@kernel.dk&gt;
</content>
</entry>
<entry>
<title>block: Fix a blk_exit_rl() regression</title>
<updated>2017-06-14T19:27:50+00:00</updated>
<author>
<name>Bart Van Assche</name>
<email>bart.vanassche@sandisk.com</email>
</author>
<published>2017-06-14T19:27:50+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=dc9edc44de6cd7cc8cc7f5b36c1adb221eda3207'/>
<id>dc9edc44de6cd7cc8cc7f5b36c1adb221eda3207</id>
<content type='text'>
Avoid that the following complaint is reported:

 BUG: sleeping function called from invalid context at kernel/workqueue.c:2790
 in_atomic(): 1, irqs_disabled(): 0, pid: 41, name: rcuop/3
 1 lock held by rcuop/3/41:
  #0:  (rcu_callback){......}, at: [&lt;ffffffff8111f9a2&gt;] rcu_nocb_kthread+0x282/0x500
 Call Trace:
  dump_stack+0x86/0xcf
  ___might_sleep+0x174/0x260
  __might_sleep+0x4a/0x80
  flush_work+0x7e/0x2e0
  __cancel_work_timer+0x143/0x1c0
  cancel_work_sync+0x10/0x20
  blk_throtl_exit+0x25/0x60
  blkcg_exit_queue+0x35/0x40
  blk_release_queue+0x42/0x130
  kobject_put+0xa9/0x190

This happens because the queue release handler, which can be invoked
from atomic (RCU callback) context, calls back into code that needs
to block. Fix this by pushing the final release out to a workqueue.
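
A sketch of the shape of the fix (helper name illustrative):

  static void blk_release_queue(struct kobject *kobj)
  {
          struct request_queue *q =
                  container_of(kobj, struct request_queue, kobj);

          /* may run from atomic context: defer the blocking teardown */
          INIT_WORK(&amp;q-&gt;release_work, __blk_release_queue);
          schedule_work(&amp;q-&gt;release_work);
  }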

Reported-by: Ross Zwisler &lt;zwisler@gmail.com&gt;
Fixes: commit b425e5049258 ("block: Avoid that blk_exit_rl() triggers a use-after-free")
Signed-off-by: Bart Van Assche &lt;bart.vanassche@sandisk.com&gt;
Tested-by: Ross Zwisler &lt;ross.zwisler@linux.intel.com&gt;

Updated changelog
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>block, bfq: access and cache blkg data only when safe</title>
<updated>2017-06-08T15:51:10+00:00</updated>
<author>
<name>Paolo Valente</name>
<email>paolo.valente@linaro.org</email>
</author>
<published>2017-06-05T08:11:15+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=8f9bebc33dd718283183582fc4a762e178552fb8'/>
<id>8f9bebc33dd718283183582fc4a762e178552fb8</id>
<content type='text'>
In blk-cgroup, operations on blkg objects are protected with the
request_queue lock. This is no longer the lock that protects
I/O-scheduler operations in blk-mq. In fact, the latter are now
protected with a finer-grained per-scheduler-instance lock. As a
consequence, although blkg lookups are also rcu-protected, blk-mq I/O
schedulers may see inconsistent data when they access blkg and
blkg-related objects. BFQ does access these objects, and does incur
this problem, in the following case.

The blkg_lookup performed in bfq_get_queue, being protected (only)
through rcu, may happen to return the address of a copy of the
original blkg. If this is the case, then the blkg_get performed in
bfq_get_queue, to pin down the blkg, is useless: it does not prevent
blk-cgroup code from destroying both the original blkg and all objects
directly or indirectly referred to by the copy of the blkg. BFQ
accesses these objects, which typically causes a crash due to a
NULL-pointer dereference or a memory-protection violation.

Some additional protection mechanism should be added to blk-cgroup to
address this issue. In the meantime, this commit provides a quick
temporary fix for BFQ: cache (when safe) blkg data that might
disappear right after a blkg_lookup.

In particular, this commit exploits the following facts to achieve its
goal without introducing further locks.  Destroy operations on a blkg
invoke, as a first step, hooks of the scheduler associated with the
blkg. And these hooks are executed with bfqd-&gt;lock held for BFQ. As a
consequence, for any blkg associated with the request queue an
instance of BFQ is attached to, we are guaranteed that such a blkg is
not destroyed, and that all the pointers it contains are consistent,
while that instance is holding its bfqd-&gt;lock. A blkg_lookup performed
with bfqd-&gt;lock held then returns a fully consistent blkg, which
remains consistent as long as this lock is held. In more detail, this holds
even if the returned blkg is a copy of the original one.
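
A sketch of the resulting lookup pattern (simplified):

  spin_lock_irq(&amp;bfqd-&gt;lock);
  blkg = blkg_lookup(blkcg, bfqd-&gt;queue); /* consistent while lock held */
  if (blkg)
          bfqg = blkg_to_bfqg(blkg);      /* cache what is needed now */
  spin_unlock_irq(&amp;bfqd-&gt;lock);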

Finally, the object describing a group inside BFQ also needs to be
protected from destruction on the blkg_free of the original blkg
(which invokes bfq_pd_free). This commit adds private refcounting for
this object, to let it disappear only after no bfq_queue refers to it
any longer.

This commit also removes or updates some stale comments on locking
issues related to blk-cgroup operations.

Reported-by: Tomas Konir &lt;tomas.konir@gmail.com&gt;
Reported-by: Lee Tibbert &lt;lee.tibbert@gmail.com&gt;
Reported-by: Marco Piazza &lt;mpiazza@gmail.com&gt;
Signed-off-by: Paolo Valente &lt;paolo.valente@linaro.org&gt;
Tested-by: Tomas Konir &lt;tomas.konir@gmail.com&gt;
Tested-by: Lee Tibbert &lt;lee.tibbert@gmail.com&gt;
Tested-by: Marco Piazza &lt;mpiazza@gmail.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
<entry>
<title>blk-throttle: set default latency baseline for harddisk</title>
<updated>2017-06-07T15:09:32+00:00</updated>
<author>
<name>Shaohua Li</name>
<email>shli@fb.com</email>
</author>
<published>2017-06-06T19:40:43+00:00</published>
<link rel='alternate' type='text/html' href='https://git.tavy.me/linux-stable.git/commit/?id=6679a90c4b0dc2563383df1fe0eb170736952a2e'/>
<id>6679a90c4b0dc2563383df1fe0eb170736952a2e</id>
<content type='text'>
Hard disk IO latency varies a lot depending on spindle movement. The
latency range could be from several microseconds to several
milliseconds, so it's pretty hard to get the baseline latency used by
io.low.

We will use a different strategy here. The idea is to use only IO
with spindle movement to determine if a cgroup's IO is in a good
state. For HD, if IO latency is small (&lt; 1ms), we ignore the IO. Such
IO is likely sequential, and does not help determine if a cgroup's IO
is impacted by other cgroups. With this, we only account IO with big
latency. Then we can choose a hardcoded baseline latency for HD (4ms,
which is typical IO latency with a seek). With all these settings,
the io.low latency works for both HD and SSD.
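
The accounting rule, as a sketch (thresholds from this description;
constant names hypothetical):

  #define HD_IGNORE_LAT_US   1000  /* &lt; 1ms: likely sequential, skip */
  #define HD_BASELINE_LAT_US 4000  /* typical seek-bound IO latency */

  if (!blk_queue_nonrot(q) &amp;&amp; lat_us &lt; HD_IGNORE_LAT_US)
          return;  /* don't let fast sequential IO skew accounting */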

Signed-off-by: Shaohua Li &lt;shli@fb.com&gt;
Signed-off-by: Jens Axboe &lt;axboe@fb.com&gt;
</content>
</entry>
</feed>
